Attention Paper Review

1. Abstract

2. Introduction

3. Background

3.1 Self-Attention

![As shown above, Self-Attention reveals which word "it" refers to in the sentence](/img/user/images/ViT images/image.png)

์œ„ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด ๋ฌธ์žฅ ๋‚ด์—์„œ it ๊ฐ€๋ฅดํ‚ค๋Š” ๋‹จ์–ด๊ฐ€ ๋ฌด์—‡์ธ์ง€ Self Attention์„ ์ด์šฉํ•ด ์•Œ ์ˆ˜ ์žˆ์Œ

3.2 End-to-End Memory Networks

4. Model Architecture

![image.png](/img/user/images/ViT images/image 1.png)

![image.png](/img/user/images/R-CNN images/image 2.png)

4.1 Encoder and Decoder Stacks

4.2 Attention

4.2.1 Scaled Dot-Product Attention

![image.png](/img/user/images/R-CNN images/image 3.png)

Query : ์˜ํ–ฅ์„ ๋ฐ›์„ ๋‹จ์–ด (๋ฒกํ„ฐ)
Key : ์˜ํ–ฅ์„ ์ฃผ๋Š” ๋‹จ์–ด (๋ฒกํ„ฐ)
Value : ์˜ํ–ฅ์— ๋Œ€ํ•œ ๊ฐ€์ค‘์น˜ (๋ฒกํ„ฐ)

![์ด ์‹์„ ์ด์šฉํ•ด ๊ณ„์‚ฐ](/img/user/images/R-CNN images/image 4.png)

์ด ์‹์„ ์ด์šฉํ•ด ๊ณ„์‚ฐ

< Summary of the computation >

(1) Multiply the word embeddings by the weight matrices (WQ, WK, WV) to compute the Query, Key, and Value vectors.

(2) Query * Key = attention score ⇒ the higher the value, the stronger the relation between the two words; the lower the value, the weaker the relation.

(3) Divide by the square root of the key dimension (√d_k) and apply softmax ⇒ each softmax value indicates how strongly the word corresponding to that key is related to the current word.

(4) The representation of the input word within the sentence = multiply each softmax weight by its value vector and sum them all.
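
A minimal NumPy sketch of steps (1)–(4) above, assuming the WQ/WK/WV projections have already been applied so that Q, K, and V are passed in directly:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -- toy NumPy version."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (2) query-key scores, (3) scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # (3) softmax over the keys
    return weights @ V, weights            # (4) weighted sum of the value vectors

# toy example: 3 words, d_k = d_v = 4 (self-attention, so the same input is used for Q, K, V)
x = np.random.randn(3, 4)
out, attn = scaled_dot_product_attention(x, x, x)
print(attn.sum(axis=-1))  # each row of attention weights sums to 1
```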

4.2.2 Multi-Head Attention

![image.png](/img/user/images/UNETR images/image 5.png)

![image.png](/img/user/images/UNETR images/image 6.png)

![image.png](/img/user/images/Attention images/image 7.png)
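
A minimal NumPy sketch of the idea shown in the figures above: project the input, split the model dimension into h heads, attend within each head, then concatenate and project back. The weight names below are placeholders, not the paper's notation:

```python
import numpy as np

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (n, d_model); each W_*: (d_model, d_model). Toy NumPy sketch."""
    n, d_model = X.shape
    d_head = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # split the model dimension into independent heads: (num_heads, n, d_head)
    def split(M):
        return M.reshape(n, num_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # scaled dot-product per head
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                   # softmax over the keys
    heads = w @ Vh                                           # (num_heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)    # concatenate the heads
    return concat @ W_o                                      # final linear projection

# toy usage: 3 tokens, d_model = 8, 2 heads
X = np.random.randn(3, 8)
W_q, W_k, W_v, W_o = (np.random.randn(8, 8) for _ in range(4))
print(multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (3, 8)
```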

4.2.3 Applications of Attention in our Model

The Transformer uses Multi-Head Attention in three different ways.

![image.png](/img/user/images/Attention images/image 8.png)

  1. "self-attention in encoder": encoder์—์„œ ์‚ฌ์šฉ๋˜๋Š” self-attention์œผ๋กœ queries, keys, values ๋ชจ๋‘ encoder๋กœ๋ถ€ํ„ฐ ๊ฐ€์ ธ์˜จ๋‹ค. encoder์˜ ๊ฐ position์€ ๊ทธ ์ „ layer์˜ ๋ชจ๋“  positions๋“ค์„ ์ฐธ์กฐํ•˜๊ณ ,ย ์ด๋Š” ํ•ด๋‹น position๊ณผ ๋ชจ๋“  position๊ฐ„์˜ correlation information์„ ๋”ํ•ด์ฃผ๊ฒŒ ๋œ๋‹ค. ๊ฐ„๋‹จํ•˜๊ฒŒ ์„ค๋ช…ํ•ด์„œ ์–ด๋–ค ํ•œ ๋‹จ์–ด์ด ๋ชจ๋“  ๋‹จ์–ด๋“ค ์ค‘ ์–ด๋–ค ๋‹จ์–ด๋“ค๊ณผ correlation์ด ๋†’๊ณ , ๋˜ ์–ด๋–ค ๋‹จ์–ด์™€๋Š” ๋‚ฎ์€์ง€๋ฅผ ๋ฐฐ์šฐ๊ฒŒ ๋œ๋‹ค.
  2. "self-attention in decoder": ์ „์ฒด์ ์ธ ๊ณผ์ •๊ณผ ๋ชฉํ‘œ๋Š” encoder์˜ self-attention๊ณผ ๊ฐ™๋‹ค. ํ•˜์ง€๋งŒ decoder์˜ ๊ฒฝ์šฐ, sequence model์˜ย auto-regressive property๋ฅผ ๋ณด์กดํ•ด์•ผํ•˜๊ธฐ ๋•Œ๋ฌธ์— masking vector๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ํ•ด๋‹น position ์ด์ „์˜ ๋ฒกํ„ฐ๋“ค๋งŒ์„ ์ฐธ์กฐํ•œ๋‹ค(์ดํ›„์— ๋‚˜์˜ฌ ๋‹จ์–ด๋“ค์„ ์ฐธ์กฐํ•˜์—ฌ ์˜ˆ์ธกํ•˜๋Š” ๊ฒƒ์€ ์ผ์ข…์˜ ์น˜ํŒ…).
  3. "encoder-decoder attention": decoder์—์„œ self-attention ๋‹ค์Œ์œผ๋กœ ์‚ฌ์šฉ๋˜๋Š” layer์ด๋‹ค. queries๋Š” ์ด์ „ decoder layer์—์„œ ๊ฐ€์ ธ์˜ค๊ณ , keys์™€ values๋Š” encoder์˜ output์—์„œ ๊ฐ€์ ธ์˜จ๋‹ค. ์ด๋Š” decoder์˜ ๋ชจ๋“  position์˜ vector๋“ค๋กœ encoder์˜ ๋ชจ๋“  position ๊ฐ’๋“ค์„ ์ฐธ์กฐํ•จ์œผ๋กœ์จย decoder์˜ sequence vector๋“ค์ด encoder์˜ sequence vector๋“ค๊ณผ ์–ด๋– ํ•œ correlation์„ ๊ฐ€์ง€๋Š”์ง€๋ฅผ ํ•™์Šตํ•œ๋‹ค.

4.3 Position-wise Feed-Forward Networks

![image.png](/img/user/images/Attention images/image 9.png)

![image.png](/img/user/images/Attention images/image 10.png)
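
Written out, the position-wise feed-forward network shown in the figures above is applied to each position separately and identically (in the paper, d_model = 512 and the inner dimension d_ff = 2048):

$$\mathrm{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$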

4.4 Embeddings and Softmax

![image.png](/img/user/images/Attention images/image 11.png)

4.5 Positional Encoding

Sequence์˜ ์ˆœ์„œ๋ฅผ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด ํ† ํฐ์˜ ์ƒ๋Œ€์  ๋˜๋Š” ์ ˆ๋Œ€์ ์ธ ์œ„์น˜์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ์ œ๊ณต

![pos is the position, i is the dimension](/img/user/images/Attention images/image 12.png)

pos is the position, i is the dimension
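
Written out, the formulas in the figure are (pos is the position, i indexes the dimension pair):

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

A small NumPy sketch that builds the encoding table:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding table of shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]              # position index
    i = np.arange(d_model // 2)[None, :]           # dimension-pair index
    angle = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions
    return pe

print(positional_encoding(4, 8).shape)  # (4, 8)
```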

5. Why Self-Attention

Why is Self-Attention better than RNNs and CNNs?

  1. Lower computational complexity per layer
    The table below shows that Self-Attention's computational complexity is lower than that of the alternatives.

In most cases the sequence length n is smaller than the vector dimension d. k is the convolution kernel size, and r is the neighborhood size used when self-attention is restricted to a local window instead of attending over all words.


  2. Amount of computation that can be parallelized
    A self-attention layer connects all positions of the input and can process them at once, so the number of sequential operations is O(1) ⇒ this makes it well suited to parallel hardware.

  3. Easier learning of long-range dependencies

    Path length has a large effect on how well long-range dependencies are learned.
    A path length is the distance between a corresponding input sequence token and output sequence token.
    The maximum path length is the longest of these paths ⇒ e.g. from the first input token to the last output token (input sequence length + output sequence length in a recurrent model).

    Self-attention lets every token attend to every other token (even between encoder and decoder) and adds the resulting correlation information, so the maximum path length can be regarded as O(1). This makes long-range dependencies much easier to learn.

6. Overall Training

Encoder์—์„œ๋Š” ๋ฌธ์žฅ ๋‚ด ๋‹จ์–ด์™€ ๋ฌธ๋งฅ์„ ์ดํ•ดํ•˜๊ณ , Decoder์—์„œ๋Š” ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฒˆ์—ญ๋œ ๋ฌธ์žฅ์„ ๋‚ด๋†“๋Š”๋‹ค.

(1) Encoding ๊ณผ์ •:

์˜ˆ๋ฅผ ๋“ค์–ด, "The cat sits"๋ผ๋Š” ๋ฌธ์žฅ์ด ์žˆ๋‹ค๊ณ  ๊ฐ€์ •ํ•ด ๋ด…์‹œ๋‹ค. ์ด ๋ฌธ์žฅ์ด Transformer ๋ชจ๋ธ์˜ ์ž…๋ ฅ์œผ๋กœ ๋“ค์–ด๊ฐ€๋ฉด, ๊ฐ ๋‹จ์–ด๋Š” ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜๋ฉ๋‹ˆ๋‹ค( โ†’ ์ž„๋ฒ ๋”ฉ). ๋™์‹œ์—, ๊ฐ ๋‹จ์–ด์˜ ์œ„์น˜ ์ •๋ณด๋„ ๋ฒกํ„ฐ๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค( โ†’ํฌ์ง€์…”๋„ ์ž„๋ฒ ๋”ฉ)

์ด๋ ‡๊ฒŒ ์ž„๋ฒ ๋”ฉ๋œ ๋‹จ์–ด ์‹œํ€€์Šค๋Š” Self-Attention๊ณผ Feed-Forward Network๋ฅผ ํฌํ•จํ•˜๋Š” Encoder ๋ธ”๋ก์„ ํ†ต๊ณผํ•ฉ๋‹ˆ๋‹ค.

์ด ๊ณผ์ •์„ ํ†ตํ•ด ๋ฌธ์žฅ์˜ ๊ฐ ๋‹จ์–ด๋Š” ์ž์‹ ์ด ์–ด๋–ค ์˜๋ฏธ๋ฅผ ๊ฐ€์ง€๋ฉฐ, ๋‹ค๋ฅธ ๋‹จ์–ด๋“ค๊ณผ ์–ด๋–ค ๊ด€๊ณ„๋ฅผ ๋งบ๊ณ  ์žˆ๋Š”์ง€๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ๋ฒกํ„ฐ๋กœ ์ธ์ฝ”๋”ฉ๋ฉ๋‹ˆ๋‹ค.

์˜ˆ๋ฅผ ๋“ค์–ด, "cat"์ด๋ผ๋Š” ๋‹จ์–ด๊ฐ€ "sits"์™€ ๋ฐ€์ ‘ํ•œ ๊ด€๊ณ„๊ฐ€ ์žˆ๋‹ค๋ฉด, "cat" ๋ฒกํ„ฐ๋Š” ์ด ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

(2) Decoding:

Now the model translates "The cat sits" into another language, say Korean.

The decoder generates the translated sentence, producing the words one at a time in order. For example, the start token is fed in first, and the word "๊ณ ์–‘์ด" ("cat") comes out with the highest probability.

Next, "๊ณ ์–‘์ด" is fed back into the decoder as input, and the word "๋Š”" (a topic particle) is produced.

Finally, "๊ณ ์–‘์ด" and "๋Š”" are fed in together, and "์•‰์•„์žˆ๋‹ค" ("sits") is produced.

During this process, the self-attention layer attends to all words generated before the current position so that the correct translation can be produced. Words after the current position are masked with a very small value (typically −∞) so that they are excluded by the softmax.

This way the decoder is not influenced by words that have not been generated yet, and continues the translation based only on the words already generated.

๋˜ํ•œ ๋””์ฝ”๋”๋Š” Encoder-Decoder Attention ๋ ˆ์ด์–ด๋ฅผ ํ†ตํ•ด, ์ธ์ฝ”๋”์—์„œ ๋‚˜์˜จ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค. ์˜ˆ๋ฅผ ๋“ค์–ด, ๋””์ฝ”๋”์˜ ๊ฐ ๋‹จ์–ด๋Š” ์ธ์ฝ”๋”์—์„œ ์ƒ์„ฑ๋œ "The cat sits"์˜ ๋ฒกํ„ฐ๋“ค๊ณผ ์–ดํ…์…˜์„ ๊ณ„์‚ฐํ•˜์—ฌ, ๋ฒˆ์—ญ ๊ณผ์ •์—์„œ ์ž…๋ ฅ ๋ฌธ์žฅ์˜ ๋ชจ๋“  ์œ„์น˜๋ฅผ ์ฐธ์กฐํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

(Open question) During training, are the weights updated as soon as an output token is wrong?