Week 5 Study Notes

Week 6 Study Notes

Note

1. Multimodal 1, 2
2. Generative Models
3. 3D Understanding
4. 3D Human

1. Multimodal 1, 2

There are broadly three approaches to multimodal learning.

Pasted image 20250107180902.png

1.1 Multimodal Challenge

  1. Each modality represents information in a different way!
  2. Each modality carries a different amount of information!
  3. Of two modalities, the model usually becomes biased toward one!

1.2 Multi-modal alignment

⇒ The process of connecting information across different modalities so that it can be interpreted coherently

1.2.1 Matching

⇒ The task of finding the relevance between different modalities (data types)

1.2.2 Translating

⇒ The process of converting a representation in one modality into another language or modality

1.2.3 Referencing

⇒ When several modalities are present, it is important to make clear how the different data types connect to one another and to identify how they relate

1.2 Example Models

1.2.1 Matching: CLIP

⇒ A multimodal model that processes text and images together and learns the association between them

CLIPimages.png
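
A minimal sketch of the matching idea, assuming image and text embeddings are already available. The vectors below are hand-made toy values, not outputs of CLIP, and the function names are hypothetical. Each image is matched to the caption whose embedding has the highest cosine similarity.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(image_embs, text_embs):
    """Return sims[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def match_images_to_texts(image_embs, text_embs):
    """For each image, return the index of the most similar text."""
    sims = similarity_matrix(image_embs, text_embs)
    return [max(range(len(row)), key=row.__getitem__) for row in sims]


# Toy embeddings (assumed values for illustration only).
image_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[0.1, 0.1, 1.0], [1.0, 0.0, 0.1]]

matches = match_images_to_texts(image_embs, text_embs)
print(matches)  # image 0 pairs with text 1, image 1 pairs with text 0
```

In a real setting the embeddings would come from trained image and text encoders; only the similarity comparison is shown here.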

1.2.2 Translating:

1.2.3-1 Referencing: Show, Attend and Tell

⇒ A deep-learning model that proposed generating natural-language descriptions (captions) by attending to the important parts of an image

image.png
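
The attention step can be sketched roughly as follows. This is a simplified illustration, not the model's actual implementation: it scores each image region with a plain dot product against a query vector (a stand-in for the caption decoder's state), whereas the real model learns its scoring function. All values and names are assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(region_feats, query):
    """Weight each image region by its relevance to the query.

    Returns the attention weights and the context vector, which is the
    weighted sum of the region features.
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in region_feats]
    weights = softmax(scores)
    dim = len(region_feats[0])
    context = [
        sum(w * feat[d] for w, feat in zip(weights, region_feats))
        for d in range(dim)
    ]
    return weights, context


# Three toy image regions with 2-dimensional features (assumed values).
region_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [2.0, 0.0]  # hypothetical decoder state

weights, context = soft_attention(region_feats, query)
print(weights)  # the first region receives the largest weight
print(context)
```

The context vector is what a caption decoder would consume when producing the next word, so the weights indicate which regions the model is "looking at" for that word.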

1.2.3-2 Referencing: Flamingo

⇒ A powerful model that processes **vision (visual data)** and **language (text)** together, handling tasks such as generating natural-language descriptions of images or videos and understanding images related to a given text

image.png

1.3 LLaVA (Large Language and Vision Assistant)

⇒ A visual reasoning model that combines vision and language to understand text and images together. It can interpret complex visual data and describe it in natural language

1.3.1 Feature alignment (Projection)

⇒ For feature alignment, a linear layer is used to combine the visual input with the language model. In other words, it provides the connection between the visual modality and the language modality.

image.png
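
A minimal sketch of projection-based feature alignment, assuming a single linear layer maps each visual feature vector into the same dimension as the language model's token embeddings. The weights, dimensions, and names here are toy assumptions, not LLaVA's actual parameters.

```python
def linear_project(features, weight, bias):
    """Apply y = W x + b to every feature vector.

    features: list of vectors of size d_in
    weight:   d_out rows, each of size d_in
    bias:     vector of size d_out
    """
    return [
        [sum(w * x for w, x in zip(row, feat)) + b for row, b in zip(weight, bias)]
        for feat in features
    ]


# Two visual patch features of dimension 3 (assumed values).
visual_feats = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]

# Hypothetical projection into a 2-dimensional "language" embedding space.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
b = [0.5, -0.5]

visual_tokens = linear_project(visual_feats, W, b)

# Toy text token embeddings that already live in the 2-dimensional space.
text_tokens = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# The projected visual tokens are placed alongside the text tokens,
# forming one sequence that a language model could consume.
llm_input = visual_tokens + text_tokens
print(visual_tokens)
print(len(llm_input))
```

Because the projection is only a linear layer, it is cheap to train while the vision encoder and language model stay fixed, which is one plausible reason for choosing it as the alignment step.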

1.3.2 Visual instruction tuning

1.3. InstructBLIP

image.png

1.3.1 InstructBLIP Overview

1.3.2 Feature alignment (Q-Former)

image.png

1.3.3 InstructBLIP Variants

2. Generative Models

⇒ Producing fake data that resembles real data → the model uses the training data to generate new samples, measures how they differ from the real data, and is updated to bring the two closer.
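
The "generate, measure the difference, reduce it" loop can be illustrated with a deliberately tiny stand-in, assuming a two-parameter generator `mu + sigma * z` and a difference measure based only on mean and standard deviation (moment matching). This is not any of the specific models listed below. The data values, learning rate, and step count are all assumptions chosen for the example.

```python
import random
import statistics

# "Real" training data (toy values).
real_data = [3.6, 5.1, 6.3, 4.2, 5.8, 5.0, 4.4, 5.6]
real_mean = statistics.mean(real_data)
real_std = statistics.pstdev(real_data)

# Fixed noise inputs for the generator, seeded for reproducibility.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(200)]
noise_mean = statistics.mean(noise)
noise_std = statistics.pstdev(noise)

# Generator parameters: a sample is mu + sigma * z.
mu, sigma = 0.0, 1.0
lr = 0.1

for _ in range(500):
    # Statistics of the current fake samples (closed form for this generator).
    fake_mean = mu + sigma * noise_mean
    fake_std = sigma * noise_std  # valid while sigma > 0

    # Difference between fake and real data.
    mean_gap = fake_mean - real_mean
    std_gap = fake_std - real_std

    # Gradient descent on loss = mean_gap**2 + std_gap**2.
    mu -= lr * 2 * mean_gap
    sigma -= lr * 2 * (mean_gap * noise_mean + std_gap * noise_std)

fake_samples = [mu + sigma * z for z in noise]
print(round(statistics.mean(fake_samples), 3), round(real_mean, 3))
print(round(statistics.pstdev(fake_samples), 3), round(real_std, 3))
```

Real generative models replace both pieces: the generator becomes a neural network, and the difference measure becomes a likelihood, a reconstruction loss, or a denoising objective.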

2.1 Autoregressive Model

2.3 VAE, Variational Autoencoder

image.png

2.4 DDPM, Denoising Diffusion Probabilistic Models

image.png

image.png

2.5 Latent Diffusion Models, Stable Diffusion

image.png

image.png

2.6 Condition in the Diffusion Models

image.png

2.7 Image Editing

image.png

2.2 Depth Generation

3. 3D Understanding

3.1 Why is 3D important?

3.2 The Way We Observe 3D

image.png

3.3 3D Data Representation

4. 3D Human

4.1 Why are Human Models Important?

4.2 Purpose of Virtual Humans

4.3 Challenges in Human Model Creation

image.png

4.4 What is a Body Model?

image.png

4.4 Linear Blend Skinning (LBS)

image.png

4.5 SMPL (Skinned Multi-Person Linear Model)

image.png

4.6 SMPLify

image.png

image.png

4.7 SPIN (SMPL oPtimization IN the loop)

image.png

4.8 MultiPly

image.png