
[Deep Learning] Recurrent Neural Network (4) - Transformer

Han Jang 2023. 5. 20. 13:01
๐Ÿง‘๐Ÿปโ€๐Ÿ’ป์šฉ์–ด ์ •๋ฆฌ
Neural Networks
Recurrent Neural Network
LSTM
Attention

 

https://arxiv.org/abs/1706.03762

 

Attention Is All You Need


 

 

Transformer

 

 

Then a paper appeared proposing to handle sequences with attention alone, without any recurrent structure.

Once attention itself gives us a way to find the relationships between elements, there is no longer any reason to carry a recurrent structure around.

This is where self-attention and the Transformer came from.

And these later developed into BERT and GPT.

 

 

์œ„ ์ˆ˜์‹์„ ๋จผ์ € ๋ด…์‹œ๋‹ค.

 

์ด๊ฒƒ์€ ์ด์ „์— attention์—์„œ ๋ณธ, ์œ ์‚ฌ๋„, weight, context๋กœ ํ‘œํ˜„๋ฉ๋‹ˆ๋‹ค.

 

์ด context ์ž์ฒด๊ฐ€ Attention์ด๋ผ๊ณ  ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์ด ์ˆ˜์‹์—์„œ๋„ ์œ ์‚ฌ๋„๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

softmax ๊ด„ํ˜ธ ์•ˆ์— ์žˆ๋Š” ๊ฐ’์€ Q์™€ K์˜ ์œ ์‚ฌ๋„๋ฅผ ๊ณ ๋ คํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์œ„ Attention ๊ตฌ์กฐ๋งŒ ์ผ์„ ๋•Œ๋Š” additive ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•˜์˜€์ง€๋งŒ, ์ด ์ˆ˜์‹์€ dot product ์œ ์‚ฌ๋„๋ฅผ ๊ณ„์‚ฐํ•ฉ๋‹ˆ๋‹ค.

 

๋‚ด์ ์ด์ฃ .

 

๊ทธ๋ฆฌ๊ณ  d k์˜ root ๊ฐ’์œผ๋กœ ๋‚˜๋ˆ ์„œ, scale์— ๋”ฐ๋ผ ๊ฐ’์ด ๋„ˆ๋ฌด ์ปค์ง€๋Š” ๊ฒฝ์šฐ๋ฅผ ๋Œ€๋น„ํ•˜์—ฌ ๋ชจ๋ธ์ด stableํ•˜๊ฒŒ ๋•์Šต๋‹ˆ๋‹ค.

 

์ด d k๋Š” dimension์œผ๋กœ ๋‚˜๋ˆ„์–ด unit vectorํ™” ์‹œํ‚ต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ด ๋ถ€๋ถ„์„ scaled dot product๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

 

So the softmax part above naturally becomes the weights.

Multiplying these weights by yet another matrix, V, is what goes out as the attention.

In the end it can be seen as a weighted V, where the weights are based on the similarity between Q and K.

 

 

์ด Q์ธ Query๋Š” Decoder์—์„œ์˜ target์ด ๋˜๋Š” hidden state๋ผ๊ณ  ๋ฐ”๊ฟ” ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ Key์ธ K๋Š” encoder์—์„œ ๋ชจ๋“  state๊ฐ€ key ํ›„๋ณด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ์œ ์‚ฌ๋„๊ฐ€ ๋‹ค ๊ณฑํ•ด์ ธ์„œ ๋‚˜๊ฐ€๋‹ˆ, Key์™€ Value๋Š” ๊ฐ™์€ ๊ฐ’์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

 

์ด๊ฒƒ์„ ์ด์ œ ์ „์ฒด state์— ๋Œ€ํ•ด ์ƒ๊ฐํ•ด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

 

์›๋ž˜๋Š” Encoder์˜ Attention ๊ฐ’์„ ๊ณ„์‚ฐํ•˜์—ฌ Decoder์˜ ์˜ˆ์ธก์— ๋„์›€์„ ์คฌ๋‹ค๋ฉด,

 

์ด๋ฒˆ์—๋Š” Encoder์™€ Encoder ์‚ฌ์ด์˜ Attention์„ ๊ตฌํ•ฉ๋‹ˆ๋‹ค.

 

Encoder๋„ sequence๋ฅผ vectorํ™”๋งŒ ํ•ด์„œ ๋„ฃ์ง€ ๋ง๊ณ  ์ž…๋ ฅ๋“ค ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„๋ฅผ matrix๋กœ ๋งŒ๋“ค์–ด์„œ ์ž…๋ ฅ์œผ๋กœ ๋„ฃ์œผ๋ฉด ์ข‹์ง€ ์•Š์„๊นŒํ•˜๋Š” ์ƒ๊ฐ์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

i am a student ์—์„œ,

 

์ด ๋‹จ์–ด๋“ค์ด ๋…๋ฆฝ์ ์ธ ์ž…๋ ฅ์ด ๋˜๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์ˆœํ™˜์œผ๋กœ ๋ฌถ๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์–ด๋–ค ๋‹จ์–ด์™€ ์–ด๋–ค ๋‹จ์–ด๊ฐ€ ๊ด€๋ จ ์žˆ๋‹ค๋Š” ๊ฒƒ์„ matrixํ™” ํ•ด์„œ ํ‘œํ˜„ํ•˜๋ฉด ๋” ์ข‹๊ฒ ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

Or we look at the similarity between the decoder and the decoder.

Through that, we examine the similarity among the predicted values.

And, as before, there is the similarity between the encoder and the decoder.

So there are these three possible cases.

 

์ด๋Ÿฌํ•œ ๊ฒƒ๋“ค์„ ์กฐ๊ธˆ ๋” ๊ณ ๋ คํ•˜์—ฌ, ์ผ๋ฐ˜ํ™”ํ•ด์„œ ์ƒ๊ฐํ•ด ๋ณธ๋‹ค๋ฉด,

 

Query ์ž์ฒด๋„ ๊ฐ€๋Šฅํ•œ hidden state๊ฐ€ ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  Key์™€ Value ๋˜ํ•œ ๊ทธ๋ ‡์Šต๋‹ˆ๋‹ค.

 

So when we perform scaled dot-product attention,

the MatMul of Q and K followed by the scaling corresponds to the expression inside the softmax parentheses.

And the mask is optional.

The softmax is applied next.

Then V comes in for the final MatMul.
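Those steps map one-to-one onto code. Here is a minimal sketch with the optional mask included (the additive mask trick and the names are my own; -1e9 stands in for minus infinity):

```python
import numpy as np

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # 1) MatMul + Scale
    if mask is not None:                       # 2) Mask (optional)
        scores = np.where(mask, scores, -1e9)  #    blocked positions get ~ -inf
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)      # 3) Softmax -> the weights
    return w @ V                               # 4) MatMul with V -> weighted V
```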

 

 

์ด๊ฒƒ์„ ํ†ตํ•ด,

 

Encoder-Decoder ๋ฟ๋งŒ ์•„๋‹ˆ๋ผ, Encoder-Encoder, Decoder-Decoder ๋ผ๋ฆฌ๋„ ์ƒ๊ฐํ•ด๋ณผ๋งŒ ํ•˜๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

 

Rather than doing this scaled dot-product just once,

multi-head attention is used to do it several times in parallel.

That is, several times in parallel, not several times in sequence.

 

์ผ๋‹จ ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์œ„์™€ ๊ฐ™์ด input word๋“ค์„ vector๋กœ ํ‘œํ˜„ํ•˜๋Š” ๋ฐ, 512์ฐจ์›์œผ๋กœ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

 

์˜ˆ๋ฅผ ๋“ค์–ด, ์–ด๋–ค sequence๊ฐ€ 512 ์ฐจ์›์ด ์žˆ๋‹ค๊ณ  ํ•ฉ์‹œ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ์ด ๊ฒƒ์„, 64์ฐจ์›์”ฉ 8๊ฐœ๋กœ Attention์„ ์ง„ํ–‰ํ•˜๋Š” ๊ฒƒ์ด, ๋ณ‘๋ ฌ์ ์œผ๋กœ Attention์„ ์‹œ์ผœ์ฃผ๋Š” ๊ฒƒ์ด ํ›จ์”ฌ ์œ ๋ฆฌํ•˜๋‹ค๊ณ  ์•Œ๋ ค์กŒ์Šต๋‹ˆ๋‹ค.

 

์ด๋ ‡๊ฒŒ ํฐ 512 ์ฐจ์›์˜ word embedding์— ๋Œ€ํ•ด ํ•œ ๋ฒˆ์— Attentionํ•˜์ง€ ๋ง๊ณ , linear mapping์„ ํ†ตํ•ด์„œ ์ฐจ์›์„ ์ค„์ธ ๋‹ค์Œ์— ๊ทธ๊ฒƒ์„ ๋ณ‘๋ ฌ์ ์œผ๋กœ Attentionํ•˜๋Š” ๊ฒƒ์ด ์œ ๋ฆฌํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๊ฒƒ์ด ์ด Multi-head Attention์ž…๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋ž˜์„œ ์œ„ ์ˆ˜์‹๊ณผ ๊ฐ™์ด 512๋ฅผ 8๊ฐœ๋กœ projectionํ•ฉ๋‹ˆ๋‹ค.

 

์ด projection์€ ์ƒˆ๋กœ์šด trainable parameter๋ฅผ ํ†ตํ•ด์„œ ํ•ฉ๋‹ˆ๋‹ค.

 

 

 

Here, Q, K, and V are all 512-dimensional.

Feeding them in as they are would be too large, so we shrink them.

We reduce the 512 dimensions to 64.

This is the linear mapping.

 

 

512 ์ฐจ์›์„ 64 ์ฐจ์›์œผ๋กœ ์ค„์˜€์œผ๋‹ˆ, ์ •๋ณด๋Ÿ‰๋„ 1/8 ์ •๋„ ์ค„์—ˆ์Šต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ ๋ณด์ •ํ•ด์ฃผ๊ธฐ ์œ„ํ•ด์„œ h ๊ฐ€ 8๋กœ,

 

8๊ฐœ์˜ head๋ฅผ ๋‘์–ด parellelํ•˜๊ฒŒ 8๋ฒˆ์˜ Attention์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์œ„ ์ˆ˜์‹์—์„œ๋Š” Attention์ด 8๋ฒˆ ์ ์šฉ๋˜์–ด ๊ฐ๊ฐ์˜ ๊ฒฐ๊ณผ head_1 ~ head_h๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

 

์ด๊ฒƒ ๊ฐ๊ฐ๋„ 64์ฐจ์›์œผ๋กœ h์ธ 8๊ฐœ ๋งŒํผ ๋‚˜์˜ต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ concatenate๋ฅผ ํ†ตํ•ด ์ญ‰ ๋ถ™์—ฌ์ค๋‹ˆ๋‹ค.

 

 

And at the end, a linear transformation produces the output.
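Putting the whole multi-head computation together, with the paper's sizes d_model = 512 and h = 8 (so d_k = 64): a minimal NumPy sketch, where the random projection matrices stand in for the trainable parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

d_model, h = 512, 8
d_k = d_model // h                       # 64 dimensions per head
seq_len = 10
X = np.random.randn(seq_len, d_model)    # word embeddings (self-attention: Q, K, V all come from X)

# trainable projections: one (W_q, W_k, W_v) triple per head, plus the final W_o
W_q = np.random.randn(h, d_model, d_k) * 0.01
W_k = np.random.randn(h, d_model, d_k) * 0.01
W_v = np.random.randn(h, d_model, d_k) * 0.01
W_o = np.random.randn(h * d_k, d_model) * 0.01

heads = []
for i in range(h):                       # the h attentions are independent, so they can run in parallel
    Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]      # linear mapping: 512 -> 64
    w = softmax(Q @ K.T / np.sqrt(d_k))               # scaled dot-product weights
    heads.append(w @ V)                               # head_i: (seq_len, 64)

out = np.concatenate(heads, axis=-1) @ W_o            # concat -> (seq_len, 512), then linear
print(out.shape)                                      # (10, 512): same size in, same size out
```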

 

 

 

 

 

 

์—ฌ๊ธฐ ์ด๋Ÿฌํ•œ ๋‹จ์–ด๊ฐ€ ์žˆ๋‹ค๊ณ  ํ•ฉ์‹œ๋‹ค.

 

๊ฐ ๋‹จ์–ด๋ฅผ 512 ์ฐจ์›์œผ๋กœ mappingํ•˜๋ฉด ์œ„์™€ ๊ฐ™์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.

 

 ์ด๊ฒƒ์„ 8๊ฐœ๋กœ ๋‚˜๋ˆ„์–ด parellelํ•˜๊ฒŒ mappingํ•ฉ๋‹ˆ๋‹ค.

 

์ด 64 ์ฐจ์›์œผ๋กœ mapping๋œ ๊ฒƒ์ด Q, K, V ์ค‘ ๋‹ค ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๊ทธ๋ž˜์„œ ์ด ๊ฐ’๋“ค ์ค‘ Q_1 ~ Q_h, K_1 ~ K_h, V_1 ~ V_h๊ฐ€ ๋‹ค ๋‚˜์˜ต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ๊ฐ๊ฐ์„ scaled dot product๋ฅผ ํ†ตํ•ด์„œ attention๊นŒ์ง€ ๊ฐ€๋ฉด,

 

Attention_1 ~ Attention_h๊นŒ์ง€์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

์ด Attention์€ ๊ฒฐ๊ตญ softmax() x V์ž…๋‹ˆ๋‹ค.

 

softmax๋Š” scalar ๊ฐ’๋“ค์ด vectorํ™” ๋˜์–ด ๋“ค์–ด๊ฐ„ ๊ฒƒ์ด๊ตฌ์š”.

 

๊ทธ๋ž˜์„œ ๊ฒฐ๊ณผ๊ฐ€ V์˜ dimension๊ณผ ๋˜‘๊ฐ™์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.

 

 

Then the attention results are concatenated.

As shown above, Attention_1 through Attention_h are joined in order from left to right.

And finally, a linear transformation through W_O spits out an output of exactly the same size.

This is the result of multi-head attention.

 

 

์ด๊ฒƒ์€ encoder์ผ ์ˆ˜๋„ ์žˆ๊ณ  decoder์ผ ์ˆ˜๋„ ์žˆ๊ณ  ์„œ๋กœ ๊ต์ฐจ๋  ์ˆ˜๋„ ์žˆ์ง€๋งŒ,

 

word๋“ค์ด ์žˆ์œผ๋ฉด, ์ด word๋“ค์˜ ๊ฐ๊ฐ์˜ ๊ด€๊ณ„์„ฑ์œผ๋กœ ํ‘œํ˜„๋œ matrix๊ฐ€ ์ถœ๋ ฅ๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

๊ทธ๋ž˜์„œ ์ด transformer ๊ตฌ์กฐ๋ฅผ ๋ด…์‹œ๋‹ค.

 

๋ชจ๋“  ๊ฒŒ ๋‹ค parallelized calculation์ž…๋‹ˆ๋‹ค.

 

RNN์€ ์ˆœํ™˜๋˜๋ฉฐ sequentialํ•œ ๊ตฌ์กฐ๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค. RNN์€ ์ด์ „ ์‹œ์ ์ด ๋“ค์–ด๊ฐ€์•ผ ๋‹ค์Œ ์‹œ์ ์ด ๋“ค์–ด๊ฐˆ ์ˆ˜ ์žˆ๋Š” ๊ฒƒ์ด์ง€์š”.

 

๊ทธ๋Ÿฐ๋ฐ Transformer์˜ ๊ตฌ์กฐ๋Š” ๊ฐ๊ฐ์˜ ์‹œ์ ์ด ๊ทธ๋Œ€๋กœ ์˜ฌ๋ผ๊ฐ€์„œ ์œ„์—์„œ Attention ๋˜๊ณ ,

 

๊ฐ๊ฐ์ด Attention์ด ๋˜์–ด ๋ชจ๋‘ ๋“ค์–ด๊ฐ€์„œ ์ถœ๋ ฅ์ด ๊ทธ๋Œ€๋กœ ๋‚˜์˜ค๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

 

In other words, there is no recurrence between hidden states.

Here, the sentence at the input and the sentence at the output have no need to wait on each other.

They can simply be processed in parallel, all at once.

That is the single biggest efficiency of the Transformer architecture.

In the figure, the left side is the encoder and the right side is the decoder.

 

 

๊ทธ๋ฆฌ๊ณ  ๊ทธ๋ฆผ์— ์ด positional encoding์ด ์žˆ์Šต๋‹ˆ๋‹ค.

 

์šฐ๋ฆฌ๊ฐ€ ์ˆœ์„œ๋ผ๋Š” ๊ฒŒ ์—†์ด parallelํ•˜๊ฒŒ ์ฒ˜๋ฆฌ๋˜๋‹ค ๋ณด๋‹ˆ, ๋‹จ์–ด๊ฐ€ ์–ด๋– ํ•œ ์ˆœ์„œ์˜€๋Š”์ง€๊ฐ€ ๋ฌด์‹œ๋˜๋Š” ๋Š๋‚Œ์ด ๋“ญ๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ positional encoding์„ ํ†ตํ•ด์„œ ์ด ๋‹จ์–ด๊ฐ€ ์–ด๋””์— ์žˆ์—ˆ๋Š”์ง€ highlight๋ฅผ ํ•ด์ค๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์œ„์™€ ๊ฐ™์ด ๋‹จ์–ด์— ๋Œ€ํ•ด embedding๋œ vector์™€ ๊ทธ ๋‹จ์–ด์— ๋Œ€ํ•œ positional encoding ๊ฐ’์„ ๊ฐ๊ฐ ๊ตฌํ•ด ๋”ํ•˜์—ฌ ์ด ๋‹จ์–ด๊ฐ€ ์–ด๋””์— ์žˆ์—ˆ๋Š”์ง€ ์กฐ๊ธˆ ๋” ๊ฐ•์กฐํ•ด์ค€๋‹ค๊ณ  ๋ณด์‹œ๋ฉด ๋˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์€ sin ๊ทธ๋ž˜ํ”„๋ฅผ ํ†ตํ•ด์„œ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

 

๊ทธ ๋‹ค์Œ multi-head attention์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

์œ„ ๊ทธ๋ฆผ์—๋Š” 3๊ฐ€์ง€ ์ด์ง€๋งŒ ์‹ค์ œ๋กœ๋Š” ์—„์ฒญ๋‚˜๊ฒŒ ๋งŽ๊ณ , ์šฐ๋ฆฌ๊ฐ€ ์˜ˆ์‹œ๋กœ ๋“ค์€ 512 ์ฐจ์›์— ๋Œ€ํ•ด 64 ์ฐจ์›์”ฉ 8๊ฐœ๋กœ ๋ณธ ๊ฒฝ์šฐ์— ๋Œ€ํ•ด์„œ๋Š” 8๊ฐœ๋กœ ๋‚˜๋‰˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ๋Œ์•„ ๋“ค์–ด๊ฐ€๋Š” ADD & Norm ํ•ญ๋ชฉ์€ redisual network๋ฅผ ๊ฐ€๋ฅดํ‚ต๋‹ˆ๋‹ค.

 

์ด ๋˜ํ•œ ์ด attention์ด ๊ธธ์–ด์ง€๋‹ค ๋ณด๋‹ˆ ์•ž์—์žˆ๋Š” ์ •๋ณด๋ฅผ ๊นŒ๋จน์„ ์ˆ˜ ์žˆ์–ด,

 

skip connection ์„ ์‚ฌ์šฉํ•˜์—ฌ residual ๋ถ€๋ถ„์„ ์ตํžˆ๋ผ๋Š” ๊ฒƒ์ด๊ฒ ์ฃ .

 

 

So Add is the residual network and Norm is layer normalization.

Normalizing across the layer keeps the attention outputs from drifting too far apart from one another.

After that, a feed-forward layer runs one more round of computation,

keeping the input and output dimensions the same.

And there is another skip connection.

This part is the encoder.

The encoder takes its inputs in parallel and performs self-attention among them.

 

 

๊ทธ๋ฆฌ๊ณ  Decoder๋„ ๋ด…์‹œ๋‹ค.

 

Decoder๋„ ๋˜‘๊ฐ™์ด output์ด ๋“ค์–ด๊ฐ€์„œ positional encoding ํ•ด์ค€ ๋‹ค์Œ,

 

Masked multi-head attention ํ•ฉ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ๋„ decoder-decoder ์‚ฌ์ด์˜ Attentionํ•˜๊ณ ,

 

์œ„ Encoder์—์„œ๋Š” encoder-encoder ์‚ฌ์ด์˜ Attention์ž…๋‹ˆ๋‹ค.

 

 

๊ทธ๋Ÿฐ๋ฐ, ์šฐ๋ฆฌ๊ฐ€ ๋ฒˆ์—ญ์„ ํ•œ๋‹ค๊ณ  ํ–ˆ์„ ๋•Œ, ๊ทธ ํ›„์— ๋‚˜์˜ค๋Š” ์ •๋ณด๋“ค์„ ๋ฏธ๋ฆฌ ๊ฐ–์ถœ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์•„๋ž˜ ๊ทธ๋ฆผ์ฒ˜๋Ÿผ Encoder์—์„œ๋Š” ๋‹จ์–ด์™€ sentence๋ฅผ ์ „๋ถ€ ๋‹ค ๋ฐ›์•„๋“ค์ธ ๋‹ค์Œ ๋ฒˆ์—ญ์„ ์‹œ์ž‘ํ•˜๋‹ˆ, ๊ทธ๋ฆผ์ฒ˜๋Ÿผ ๋ชจ๋“  ๊ตฌ์กฐ๊ฐ€ ๋‹ค ์„œ๋กœ๋ฅผ attentionํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๊ทธ๋Ÿฐ๋ฐ, decoder์—์„œ๋Š” ์ˆœ์ฐจ์ ์œผ๋กœ ๋ฒˆ์—ญ์ด ๋˜์–ด์„œ ๋‚˜์˜ค๋‹ˆ๊นŒ ์ž๊ธฐ ์ „์— ๋‚˜์˜จ ๊ฒƒ๋งŒ ์ฐธ๊ณ ๋ฅผ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ด Masked multi-head attention์ด๋ผ๋Š” ๊ฒƒ์€ ์ž๊ธฐ ์ „ ์‹œ์ ์˜ ๊ฒƒ๋งŒ attention์„ ๊ณ„์‚ฐํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  Decoder์•ˆ์—์„œ๋งŒ attentionํ•˜๋‹ˆ ๊ฒฐ๊ตญ Masked multi-head Decoder Self-Attention์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๊ทธ๋ฆฌ๊ณ  ๊ทธ ์œ—๋ถ€๋ถ„์„ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

Encoder์˜ ์ •๋ณด๋ฅผ ๋ฐ›์•„์„œ Attention์„ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์œ„ ๊ทธ๋ฆผ์˜ ๋งˆ์ง€๋ง‰ ๋ถ€๋ถ„์— Encoder-Decoder Attention ๊ตฌ์กฐ๋กœ, Encoder์™€ Decoder์˜ ๊ด€๊ณ„์— ๋Œ€ํ•ด์„œ Attentionํ•ด์ฃผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ด ๋ถ€๋ถ„์€ ์šฐ๋ฆฌ๊ฐ€ ์•ž์—์„œ ๋‹ค๋ฃฌ ๊ฒƒ๊ณผ ๊ฐ™์€ concept์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ๋˜‘๊ฐ™์ด feed forwardํ•˜๊ณ  residual ํ•˜๊ณ , ๊ทธ๋ฆฌ๊ณ  linear transformation, softmax์—์„œ output์„ ๊ณ„์‚ฐํ•˜๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

์ด๊ฒŒ transformer ๊ตฌ์กฐ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๊ทธ๋ž˜์„œ ์œ„์™€ ๊ฐ™์€ ๋‚ด๋ถ€ ๊ตฌ์กฐ๋ฅผ encoder, decoder ๋ชจ๋‘ N๋ฒˆ์„ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ๋งŒ์•ฝ, ์˜์–ด ๋ฌธ์žฅ์ด ๋“ค์–ด์™”๋‹ค๋ฉด,

 

์—ฌ๊ธฐ์— encoder transformer๊ฐ€ N๋ฒˆ์ด ๋ถ™์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ํ•œ๊ตญ์–ด๊ฐ€ ๋“ค์–ด์™“๋‹ค๋ฉด decoder transformer๋„ N๋ฒˆ์ด ๋ถ™์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ์ตœ์ข… ์ถœ๋ ฅ์„ ํ•˜๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.
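In code form the stacking is just repeated application; a schematic sketch (encoder_block and decoder_block stand for the blocks sketched earlier, and N = 6 is the value the paper uses):

```python
N = 6  # the paper stacks 6 identical layers on each side

def encode(x, encoder_block):
    for _ in range(N):
        x = encoder_block(x)             # e.g. the Add & Norm block sketched above
    return x

def decode(y, memory, decoder_block):
    # memory is the final encoder output; every decoder layer attends to it
    for _ in range(N):
        y = decoder_block(y, memory)
    return y
```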

 

 

 

์ด ํ•œ๊ตญ์–ด ๋ฌธ์žฅ์ด๋ผ๋Š” ๊ฒƒ์€ ์ „ ๋‹จ๊ณ„์—์„œ ์˜ˆ์ธก๋œ ํ•˜๋‚˜๋งŒ ๋“ค์–ด์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ „ ๋‹จ๊ณ„ ์˜ˆ์ธกํ•œ ๊ฒƒ์„ ํ•˜๋‚˜ํ•˜๋‚˜ ๊ธฐ๋‹ค๋ฆด ์ˆ˜ ์—†์œผ๋‹ˆ mask๋ฅผ ํ•ด๋†“๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ด masked๋ฅผ ํ†ตํ•ด ์‹œ๊ฐ„์ ์ธ ๊ฐœ๋…์„ ์–ด๋Š์ •๋„ ๋ฐ˜์˜ํ–ˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. Decoder-Decoder self-attention์—์„œ ๋ง์ด์ฃ .

 

 

์ด๋Ÿฌํ•œ ๊ฒƒ๋“ค์„ parallelํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฐ์„œ ์˜ค๋Š” ์žฅ์ ์€ ๊ณ„์‚ฐ์˜ ํšจ์œจ์„ฑ์ž…๋‹ˆ๋‹ค.

 

์›๋ž˜ BPTT๋ฅผ ํ•ด์•ผํ•˜์ง€๋งŒ, ๋ณ‘๋ ฌ์ ์ด๋‹ˆ True time ๊ฐœ๋…์ด ์—†์–ด์ง‘๋‹ˆ๋‹ค.

 

๊ทธ๋ƒฅ Backpropagation๋งŒ ๊ฐ€์ง€๊ณ  ํ•ด๊ฒฐ์ด ๋ฉ๋‹ˆ๋‹ค.

 

 

 

 

๊ทธ๋ž˜์„œ ์•„๋ž˜์™€ ๊ฐ™์€ transformer๋กœ attention๋œ ์˜ˆ์‹œ๋“ค์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

์ด๊ฒƒ์€ encoder์™€ decoder ์‚ฌ์ด์˜ attention์ด๋ฉฐ,

 

์ด๊ฒƒ์€ multi-head์— ๋Œ€ํ•ด์„œ ๊ฐ head๊ฐ€ parallelํ•˜๊ฒŒ ์ •๋ณด๋ฅผ ์ฒ˜๋ฆฌํ•˜์—ฌ attentionํ•œ ๊ฒƒ์„ ํ‘œํ˜„ํ•ฉ๋‹ˆ๋‹ค.

 

multi-head๋กœ ํ•œ ์ด์œ ๋Š” ํ•˜๋‚˜ ํ•˜๋‚˜์˜ word embedding์„ ๊ทธ ์ฐจ์›์„ ์ง์ ‘ ๋ณด์ง€ ๋ง๊ณ , projection ์‹œ์ผœ์„œ ์ •๋ณด๋ฅผ ์ถ•์†Œํ•œ ๋‹ค์Œ์— parallelํ•˜๊ฒŒ multi๋กœ ๋ณด๋Š” ๊ฒƒ์ด ์˜คํžˆ๋ ค ์œ ๋ฆฌํ•˜๋‹ˆ ๊ทธ๋Ÿฌํ•œ ๊ฒƒ์ด ๋ฐ˜์˜๋œ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.