Artificial Intelligence/Natural Language Processing

[NLP] Transformer

Han Jang 2023. 5. 21. 20:40
Terminology

Neural Networks
RNN
LSTM
Attention
Transformer
Generator
discriminator
self-attention
layer normalization
multi-head attention
positional encoding

 

https://arxiv.org/abs/1706.03762

 

Attention Is All You Need


 

 

 

A generator is a model that performs a generation task.

 

์ด๊ฒƒ์€ input์— ๋Œ€ํ•ด์„œ ์–ด๋– ํ•œ output์„ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

A discriminator is a classifier.

 

These terms come up often in AI, so keep them in mind.

 

 

The models we have covered so far, RNN, LSTM, and GRU, process a sequence step by step from the front.

 

So the computation time grows with the sequence length.

 

This raises the question: instead of this sequential structure, could we use a parallel one?

 

A parallel structure means the entire input sequence is processed at once.

 

 

 

Let's revisit attention.

 

We computed attention between the decoder's hidden state and the encoder's hidden states to obtain a new context vector.

 

 

์ง€๊ธˆ์€ h_1 ~ h_n์„ encoder RNN์œผ๋กœ ๊ตฌํ•œ ๋‹ค์Œ, decoder์—์„œ RNN์„ ๊ฐ€์ง€๊ณ  Dot product๋ฅผ ์ง„ํ–‰ํ–ˆ์Šต๋‹ˆ๋‹ค.

 

So there was a separate, sequential step just to obtain h_1 ~ h_n.

 

The Transformer, however, drops the RNN entirely and does not process the input sequentially.

 

You could model each position independently like this, but then there is no sequence information shared between tokens, so performance is poor.

 

The fix that was devised is to compute attention among x_1 ~ x_n within the sequence itself.

 

Now, on to the Transformer.

 

 

Transformer

 

Until now, as shown above, we fed x_1 ~ x_n in order and received the outputs in order.

 

๊ทธ๋Ÿฐ๋ฐ, ์œ„์™€ ๊ฐ™์ด x_1 ~ x_n์„ ํ•œ ๋ฒˆ์— ์ ์šฉํ•ด๋ณผ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

 

The attention mechanism is applied to all of them in a single step.

 

๋” ์ž์„ธํžˆ ์‚ดํŽด๋ด…์‹œ๋‹ค.

 

What happens if we feed the whole sequence into the encoder at once?

 

As shown above, we use x_1 ~ x_n to produce o_1.

 

 

Then we use x_1 ~ x_n together with o_1 to produce o_2.

 

x_1 ~ x_n are processed all at once, and the outputs are then produced one by one by the decoder.

 

Producing the outputs this way is the same as the inference procedure we used with RNNs.

 

We compute o_1, then o_2, and so on.

 

So the computation looks like the figure above.

 

The Transformer was introduced in 2017 by the Google Brain and Google Research teams in the paper "Attention Is All You Need"; the link is at the top of this post.

 

 

Now let's look at the Transformer architecture.

 

 

The left side is the encoder, which models the input sequence.

 

The right side is the decoder: when the start token goes in, the first word comes out; when the first word goes in, the second word comes out, and so on.

 

 

 

When a sequence enters the encoder, the encoder learns a representation of it.

 

The decoder then produces the output tokens one at a time.

 

์ด์ œ ์ฃผ์˜ ๊นŠ๊ฒŒ ๋ด์•ผํ•  ๋ถ€๋ถ„, ๊ฐ€์žฅ ํฐ ์•„์ด๋””์–ด๋Š” multi-head attention์ž…๋‹ˆ๋‹ค.

 

Positional encoding is also important.

 

The ×N means that this Transformer block is repeated N times.

 

Let's go through the pieces one by one.

 

 

 

Encoder

Source: https://kikaben.com/transformers-encoder-decoder/

 

 

Here a new concept called self-attention appears.

 

์œ„์—์„œ ์–ธ๊ธ‰ํ–ˆ๋“ฏ์ด, parallelํ•œ ๊ตฌ์กฐ๋ฅผ ์œ„ํ•ด์„œ self-attention์ด ์‚ฌ์šฉ์ด ๋ฉ๋‹ˆ๋‹ค.

 

 

 

When we compute the representation of the word "thinking", we compute attention between "thinking" and every word in the sequence, including "thinking" itself.

 

 

In one pass through this Transformer layer, we want to obtain the attention between "thinking" and the other words of the sequence, itself included.

 

As we learned before, you can treat this word as the RNN decoder's hidden state and the rest of the words (itself included) as the encoder's hidden states.

 

Then, after the dot products, we get attention scores, take the weighted sum, and use that summed vector as this word's vector in the next layer.

 

 

๊ทธ๋ž˜์„œ ์ด๋ ‡๊ฒŒ self-attention์ด ์ด๋ฃจ์–ด์ง‘๋‹ˆ๋‹ค.

 

Let's pause for a moment and look at Q, K, V.

 

 

 

In the RNN seq2seq architecture, we computed attention by taking the dot product between the decoder hidden state h_{t-1} and the encoder states for x_1 ~ x_n, passed those values through a softmax function to get the attention scores, and then took the weighted sum over x_1 ~ x_n with those scores to obtain the context vector.
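As a quick code recap, here is a minimal NumPy sketch of that seq2seq attention step (the shapes and variable names are illustrative assumptions, not from the original post):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

n, d = 5, 8                      # example: 5 encoder steps, hidden size 8
enc_h = np.random.randn(n, d)    # encoder hidden states h_1 ~ h_n
dec_h = np.random.randn(d)       # decoder hidden state h_{t-1}

scores = enc_h @ dec_h           # dot products -> (n,) raw attention values
weights = softmax(scores)        # softmax -> attention scores
context = weights @ enc_h        # weighted sum of h_1 ~ h_n -> context vector
```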

 

 

 

Here the concepts Q, K, and V are introduced.

 

Q stands for Query, K for Key, and V for Value.

 

In the figure above, the hidden state h_{t-1} handed over from the decoder is the query, and the encoder hidden states for x_1 ~ x_n are the keys. We take the dot product between the query and the keys, and that gives the attention scores.

 

 

The keys and the values are computed from the same x_1 ~ x_n, but the vectors used to compute the attention scores are called the keys, while the vectors that are finally weighted and summed are called the values.

 

That is simply how the terminology is divided.

 

Written as a formula, attention can be computed as shown.

 

 

 

In the paper, this is written as Attention(Q, K, V) = softmax(QKᵀ / √d_k) V.

 

Here d_k is the dimension of the keys, and the scores are divided by √d_k.

 

When taking dot products between high-dimensional vectors, the resulting values can become very large, so dividing by √d_k acts as a normalization and keeps the scores from blowing up.
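A minimal NumPy sketch of this scaled dot-product attention (the function name and shapes are my own; only the formula comes from the paper):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k) scaled scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                 # softmax over the keys
    return w @ V                                          # weighted sum of the values
```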

 

 

 

So self-attention works as follows.

 

Take x_1 as the query and x_1 ~ x_n as the keys and values: dot-product the query with the keys, apply a softmax, and take the weighted sum of the values with those attention scores; the result is the context vector for x_1.

 

Repeat this with each of x_1 ~ x_n as the query to get a context vector for every position.

 

The equation above expresses exactly that: given a query x_i, compute attention scores against the keys of the input sequence and use them in a weighted sum over the values.

 

์ด๊ฒƒ์„ x_1 ~ x_n์— ๋Œ€ํ•ด ๋ฐ˜๋ณตํ•˜๋ฉด ์•„๋ž˜์™€ ๊ฐ™๊ฒ ์ฃ .

 

 

 

In the encoder, we are measuring how related the tokens of the input sequence are to one another.

 

In other words, Q, K, and V are all computed from the input sequence alone.
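A minimal sketch of self-attention in that sense: Q, K, and V are all projections of the same input X (the projection matrices are random stand-ins for learned parameters):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention, as in the sketch above
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

n, d_model = 6, 16                              # example: 6 tokens, 16-dim model
X = np.random.randn(n, d_model)                 # input sequence x_1 ~ x_n

# learned projections in a real model; random here, just for illustration
W_q, W_k, W_v = [np.random.randn(d_model, d_model) for _ in range(3)]

Q, K, V = X @ W_q, X @ W_k, X @ W_v             # all three come from the same X
Z = attention(Q, K, V)                          # (n, d_model): one context vector per token
```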


Self-attention is computed for every word.

 

Running self-attention for "it" produces the attention scores shown on the left. The weighted sum taken with those scores is the newly computed representation, i.e. the self-attention output for "it".

 

After one pass through the self-attention layer, each word becomes a new word embedding that reflects its context, i.e. the related words in the input sequence.

 

Vectors that started out unrelated to one another become related.

 

 

์„œ๋กœ์˜ ์ •๋ณด๊ฐ€ ๋ฐ˜์˜์ด ๋˜์–ด ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

 

Advantages:

  • Long-range dependencies, which RNNs struggle to learn when two tokens are far apart in a sequential structure, are handled well: self-attention computes attention with Q, K, and V in exactly the same way regardless of the distance between words.
  • With an RNN, a sequence of length 100 takes 100 sequential steps; here, one self-attention pass computes everything in parallel, because Q, K, and V are formed as matrices and multiplied in one shot. This makes the computation very efficient.

 

 

Multi-head Attention

 

The representation is split into as many pieces as there are heads, and attention is computed in each piece.

 

Instead of using the full dimensionality of the input as-is, we split it up.

 

The pieces produced by the heads are then concatenated back together.

 

์ด๊ฒƒ์„ ์™œ ํ• ๊นŒ์š”?

 

Advantages:

  • Instead of handling one large dimension at once, splitting it up lets the model look at the sequence from several perspectives.
  • Self-attention can be performed from different viewpoints.
  • The attention scores differ depending on the viewpoint.
  • The amount of computation stays the same, but the model gets a more diverse view.

 

 

 

์‹์€ ์œ„์™€ ๊ฐ™์œผ๋ฉฐ,

 

์ด๋ฅผํ…Œ๋ฉด input sequence 256์ฐจ์›์— multi-head๊ฐœ์ˆ˜๋ฅผ 4๋ผ๊ณ  ํ•˜๋ฉด, ์ด 256์ฐจ์›์„ 64์ฐจ์›์งœ๋ฆฌ๋กœ ๋ฐ”๊ฟ”์•ผํ•˜๋Š” weighted matrix๊ฐ€ W์ž…๋‹ˆ๋‹ค.

 

Multiplying by this W is what reduces the dimension.

 

In other words: first map 256 dimensions down to 64, then compute attention.

 

What used to be done in one shot is split across the heads. Instead of getting only one set of attention scores, multi-head attention gives several, one per viewpoint.

 

So, as shown below, the attention scores can differ from head to head.


As in the figure below, the 256-dimensional representation is split across 4 heads of 64 dimensions each, the attention result is computed 4 times, and the 4 results are concatenated.

 

The 256-dimensional Q is reduced to 64 dimensions, self-attention is computed per head to give Z_1 ~ Z_4, and these are concatenated.
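A minimal sketch of that multi-head split, using the same example numbers (256 dimensions, 4 heads of 64); the per-head projection matrices and W_o stand in for the learned parameters mentioned below:

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention, as defined earlier
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

n, d_model, n_heads = 6, 256, 4
d_head = d_model // n_heads                        # 64 dimensions per head
X = np.random.randn(n, d_model)

# one (256 -> 64) projection per head for each of Q, K, V, plus an output projection
W_q = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_k = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_v = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
W_o = np.random.randn(n_heads * d_head, d_model)

heads = []
for h in range(n_heads):
    Q, K, V = X @ W_q[h], X @ W_k[h], X @ W_v[h]   # each head works in 64 dimensions
    heads.append(attention(Q, K, V))               # Z_1 ~ Z_4

Z = np.concatenate(heads, axis=-1) @ W_o           # concatenate, then project back to 256
```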

 

 

 

Strictly speaking, nothing is literally cut: there are 4 weight matrices W, each of which projects the 256-dimensional vector down to 64 dimensions.

 

Each head produces its own 64-dimensional vectors.

 

๊ทธ๊ฒƒ๋“ค์„ ๊ฐ€์ง€๊ณ  attention์„ ํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

So there is one set of weights per head.

 

These W matrices are learned parameters.

 

 

 

FeedForward

 

 

 

We have now computed multi-head attention for each word, one result per head.

 

์ด๊ฒƒ๋“ค์€ multi-head attention๋งŒ ํ•˜์—ฌ ๋ถ™์ด๊ธฐ๋งŒ ํ•˜์˜€์œผ๋ฏ€๋กœ ์„œ๋กœ๊ฐ€ ๊ด€๋ จ๋˜์–ด ์žˆ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

So one linear transform is needed.

 

๊ทธ๊ฒƒ์„ ์œ„ํ•ด์„œ Fully connected layer๋ฅผ ํ•˜๋‚˜ ์ถ”๊ฐ€ํ–ˆ๋‹ค๊ณ  ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์€ linear transformํ•œ ๋’ค ReLU์˜ ๊ณผ์ •์„ ๊ฐ€์ง€๋Š”๋ฐ, ์ด๊ฒƒ์„ Feed Forwardํ•œ๋‹ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

 

 

 

 

Positional Encoding

 

 

 

 

 

Suppose we have the following sentence.

 

The animal didn't cross the street because it was too tired

 

In that sentence, "the" appears twice.

 

Do these two "the"s each get a different embedding?

 

No.

 

๋‘˜์€ ๊ฐ™์€ Embedding์„ ๊ฐ€์ง€์ฃ .

 

 

There is no notion of order: attention simply computes each word against the other words, so it cannot tell the first "the" apart from the later "the".

 

 

๊ทธ๋Ÿฐ๋ฐ, Self-attention๋งŒ์œผ๋กœ๋Š” ์ด sentence์˜ sequence ์ •๋ณด๋ฅผ ๋‹ด์„ ์ˆ˜ ์—†์Šต๋‹ˆ๋‹ค.

 

So positional encoding is used to give each word an encoding of its position.

 

 

๊ทธ๋Ÿฐ๋ฐ ์œ„ positional encoding ์ด์ „์— Input embedding์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

The input embedding maps each word's one-hot vector to a dense vector.

 

For example, given the vocabulary and, say, a 256-dimensional embedding, an RNN fetches the embedding of the first token, then the next, sequentially.

 

That is how the embedding for each word is learned.

 

๊ทธ๊ฒƒ์ด input embedding์ด์ฃ .

 

 

So, as in the figure above, we look up the embedding for "thinking".

 

 

But if the same word always gets the same embedding, how does the Transformer tell the two occurrences apart?

 

Without sequence information, attention receives identical embeddings for identical words.

 

 

So we inject the sequence information at the very start.

 

The method for doing that is positional encoding.

 

It tells the model where each word sits before anything else happens.

 

์ด๊ฒƒ์€ ๊ตฌ์กฐ ์ƒ ๋ณดํ†ต ๋งจ ์•„๋ž˜์— ์žˆ์Šต๋‹ˆ๋‹ค.

 

As in the figure below, the positional embedding is added to the word embedding before anything else. So how do we encode the position?

 

Taking those numerical considerations into account, the idea is to compute the position information with the formula above.

 

Since we add a positional embedding to the input embedding, all we need is for that positional embedding to be an encoding value that represents the position.

 

์œ„์™€ ๊ฐ™์ด sin, cos ๊ฐ’์œผ๋กœ ํ™€์ˆ˜, ์ง์ˆ˜ ๋ฒˆ์งธ์— ๋Œ€ํ•ด์„œ ๋‹ค๋ฅธ ๊ฐ’์„ ๋„ฃ์–ด์ค๋‹ˆ๋‹ค.

 

์œ„ ๊ฐ’์ด ์–ด๋”” ์œ„์น˜์— ๋”ฐ๋ผ ๊ฐ’์ด ์ •ํ•ด์ง„๋‹ค๋Š” ๊ฒƒ์„ ์•Œ๊ณ  ์žˆ์œผ๋ฉด ๋˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

So every position ends up with its own specific value.

 

Here d is the dimensionality we want to use, and k is the position.

 

์ด๊ฒƒ์„ ๊ทธ๋ฆผ ๊ทธ๋ฆฌ๋ฉด ์•„๋ž˜์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

๊ทธ๋ž˜์„œ uniqueํ•œ vector ๊ฐ’์„ ์ฃผ๊ธฐ ์œ„ํ•ด์„œ Sin, Cos ํ•จ์ˆ˜๋กœ ์ด๋ฃจ์–ด์ง„ ๊ฒƒ์„ ์‚ฌ์šฉํ•˜์—ฌ ์ ˆ๋Œ€ ๊ฒน์น˜์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.
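A minimal sketch of the sinusoidal positional encoding from the paper: sin on the even dimensions and cos on the odd ones (max_len and d_model below are just example values):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices 0, 2, 4, ...
    angles = pos / np.power(10000.0, i / d_model)     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions: cos
    return pe

pe = positional_encoding(max_len=50, d_model=256)
# word_embeddings + pe[:seq_len] is what actually enters the first encoder layer
```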

 

 

To summarize: for a given sequence, we start from the sum of the word embeddings and the positional encodings.

 

์ด๊ฒƒ์„ ๊ฐ€์ง€๊ณ  ์ฒซ ๋ฒˆ์งธ multi-head attention์ด ์‹คํ–‰๋˜๊ณ  self-attentionํ•˜๋ฉฐ feed forwardํ•ฉ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์ด Encoder ํ•œ ๋ฒˆ ํ†ต๊ณผํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

This is repeated N times.

 

Residual connections (as in ResNet) keep earlier information from being forgotten.

 

They also keep training information from being lost and make training faster.

 

์•„๋ž˜์™€ ๊ฐ™์ด Residual connection ์ถ”๊ฐ€ํ•˜๊ณ , layer norm. ๊นŒ์ง€ ์ถ”๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

 

 

 

This is how the Transformer architecture comes together.

 

 

 

We then use the vectors that come out after passing through all N encoder blocks.

 

More heads in the Transformer encoder is not automatically better; typically around 8 are used.

 

The word-embedding size is chosen based on the vocabulary size and the amount of data, so 256, 512, 1024, and so on are all used.

 

 

๊ทธ๋ž˜์„œ ์ด๋ ‡๊ฒŒ Transformer encoder๋ฅผ ํ†ตํ•ด์„œ layer๋ฅผ N๋ฒˆ ํ†ต๊ณผํ•œ ๊ฐ๊ฐ์˜ word embedding์„ ๋ณด์•˜์Šต๋‹ˆ๋‹ค.

 

Decoder

 

decoder๋„ ๋˜‘๊ฐ™์ด attention์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.

 

The decoder generates the output one word at a time, just like RNN inference.

 

The first word goes into the decoder and becomes the query.

 

Take the example source sentence "나는 학교에 간다" ("I go to school").

 

output์ด 'i'๊ฐ€ ๋“ค์–ด๊ฐ”๋‹ค๊ณ  ํ•ฉ์‹œ๋‹ค.

 

Then Q, K, and V are formed and computed through attention between 'I' and "나는 학교에 간다".

 

In the decoder, attention happens twice.

 

 

First, self-attention is computed among the decoder's own tokens; then encoder-decoder attention captures how they relate to the encoder.

 

What was just described is the encoder-decoder attention; before that, the decoder first performs self-attention among its own tokens.

 

 

So attention is performed twice.

 

The rest is the same: feed forward, and so on.

 

Finally, the resulting vector goes through a linear layer and a softmax to predict which word comes next.
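A minimal sketch of that final step (the projection matrix and sizes are illustrative stand-ins for the learned output layer):

```python
import numpy as np

d_model, vocab_size = 256, 10000
W_out = np.random.randn(d_model, vocab_size)   # learned linear projection to the vocabulary

dec_vec = np.random.randn(d_model)             # decoder output vector for the current step
logits = dec_vec @ W_out
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()                    # softmax over the vocabulary
next_token = int(probs.argmax())               # predicted next word (greedy choice)
```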

 

In the encoder-decoder attention, the keys and values come from the encoder, as shown above, while the query comes from the decoder.
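A minimal sketch of that encoder-decoder attention: queries come from the decoder side, keys and values from the encoder outputs (all shapes and matrices here are illustrative):

```python
import numpy as np

def attention(Q, K, V):
    # scaled dot-product attention, as defined earlier
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

n_enc, n_dec, d_model = 5, 3, 256
enc_out = np.random.randn(n_enc, d_model)   # final encoder outputs
dec_x = np.random.randn(n_dec, d_model)     # decoder states after their own self-attention

W_q, W_k, W_v = [np.random.randn(d_model, d_model) for _ in range(3)]

Q = dec_x @ W_q                             # queries from the decoder
K = enc_out @ W_k                           # keys from the encoder
V = enc_out @ W_v                           # values from the encoder
Z = attention(Q, K, V)                      # (n_dec, d_model)
```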

 

 

As a result, the Transformer improved performance enormously.

 

 

์•„๋ž˜์™€ ๊ฐ™์ด ์ˆ˜์น˜๋กœ ๋ณผ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

