[Transformer] Part 5


Transformer

  • The transformer processes an entire sequence as input and output using only attention modules, with no RNN or CNN.
  • RNN models suffer from the long-term dependency issue.
    • The information passed forward in the hidden state h_t must be accumulated without being corrupted.
    • If the loss arises at a distant position, the gradient signal has to be carried back across many time steps; as the gradient passes through those steps its information degrades, so training may not proceed properly.
  • Therefore, when the transformer encodes sequence data with its attention module, it fundamentally resolves the long-term dependency issue that arises when time steps are far apart.
  • Every position can be accessed directly, regardless of distance, so information can be accumulated without loss.
  • In the seq2seq with attention model, the decoder hidden state vector acting as the query (the thing we want to look up) was kept separate from the encoder hidden state vectors acting as the material we retrieve information from.
  • If we treat this query vector as the same kind of entity as the material vectors, the sequence itself can be encoded through the attention module.
  • The attention module used to encode sequence data in this way is called self-attention.
  • Each element of the sequence is encoded as a similarity-based weighted sum over the entire sequence (a rough sketch follows below).
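As a rough illustration of the bullets above, here is a minimal NumPy sketch (toy sizes, no learned projections or scaling yet; those appear in the next sections) in which every vector of the sequence serves at once as query, key, and value, and each position is encoded as a similarity-weighted sum over the whole sequence:

```python
import numpy as np

def naive_self_attention(X):
    """Encode each row of X as a similarity-weighted sum over the whole sequence.

    X: (seq_len, d) matrix; every row acts as query, key, and value at once.
    Illustrative only: no learned projections and no scaling yet.
    """
    scores = X @ X.T                                   # pairwise dot-product similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ X                                 # weighted sum of the value vectors

X = np.random.randn(3, 4)             # a toy "sentence": 3 tokens, 4-dim embeddings
print(naive_self_attention(X).shape)  # (3, 4): one encoded vector per position
```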

 

Source: https://wikidocs.net/162099

 

 

  • ๋ณ€ํ™˜๋œ vector๋ฅผ query vector๋กœ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค.
  • vector ์—ฐ์‚ฐ์— ์žˆ์–ด์„œ, query vector, key vector, value vector๋กœ ๋‚˜๋‰ฉ๋‹ˆ๋‹ค.
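A minimal sketch of that split (the sizes and the random matrices below are illustrative assumptions; in a real model Wq, Wk, Wv are learned parameters):

```python
import numpy as np

d_model, d_k, d_v = 8, 4, 4                  # toy dimensions, chosen for illustration
rng = np.random.default_rng(0)

W_q = rng.standard_normal((d_model, d_k))    # learned in practice; random here
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_v))

X = rng.standard_normal((5, d_model))        # a sequence of 5 input vectors

Q, K, V = X @ W_q, X @ W_k, X @ W_v          # each input vector takes on its
print(Q.shape, K.shape, V.shape)             # query, key, and value roles
```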

 

Scaled Dot-product Attention

  • When the dimension d_k is large, the variance of the query-key dot products grows, so the softmax output concentrates into a single peak; the gradients through the softmax then become very small and learning does not proceed well.
  • Because the factor d_k unintentionally affects the variance of the logit vector that is fed into the softmax producing the attention weights, an extra mechanism removes this influence: each query-key similarity score is divided by the square root of the key/query dimension, so that the logits keep a constant variance regardless of d_k, as sketched below.
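Putting the two points above together, scaled dot-product attention computes softmax(QK^T / sqrt(d_k)) V; a minimal NumPy sketch could look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Dividing the logits by sqrt(d_k) keeps their variance roughly constant,
    so the softmax does not collapse into a single peak as d_k grows.
    """
    d_k = K.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                    # scaled similarity scores
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # attention weights per query
    return weights @ V                                 # weighted sum of value vectors
```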

 

Multi-head Attention

  • Multi-head attention extends the scaled dot-product self-attention module into a more varied form.
  • With a single attention module, every input vector of the given sequence would be encoded using one fixed set of transformations Wq, Wk, Wv.
  • When encoding a word, however, we may want to encode mainly what action a person took, or instead extract other information such as time and place, so several different views of the sequence are needed.
  • To do this, we keep multiple sets of the three linear transformation matrices that decide which criterion is attended to; each set produces its own query, key, and value vectors, each set encodes the words of the sequence through the attention module, and the attention outputs from all sets are concatenated afterwards to combine the information (see the sketch after this list).
  • In this way, the module is extended to encode the sequence based on multiple criteria and multiple kinds of information.
  • Because attention is performed with several sets of linear transformations, this is called multi-head attention.
  • An additional linear transformation then maps the concatenated output to the desired dimension.
  • It can resolve the long-term dependency problem.
  • A drawback is its high memory complexity, since attention keeps pairwise scores over the whole sequence.
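A rough multi-head sketch, reusing the scaled_dot_product_attention function from the previous sketch (the head count, dimensions, and random weight matrices are illustrative assumptions): each head has its own (Wq, Wk, Wv) set, the head outputs are concatenated, and a final linear transformation maps the result to the desired dimension.

```python
import numpy as np

def multi_head_attention(X, heads, W_o):
    """Run scaled dot-product attention once per head and concatenate the outputs.

    heads: list of (W_q, W_k, W_v) tuples, one set of projections per head.
    W_o:   final linear transformation applied to the concatenated head outputs.
    """
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        outputs.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(outputs, axis=-1) @ W_o

# Toy example with 2 heads (sizes chosen only for illustration)
rng = np.random.default_rng(0)
d_model, d_head, n_heads, seq_len = 8, 4, 2, 5
heads = [tuple(rng.standard_normal((d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.standard_normal((n_heads * d_head, d_model))
X = rng.standard_normal((seq_len, d_model))
print(multi_head_attention(X, heads, W_o).shape)   # (5, 8): same shape as the input
```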

 

Source: https://wikidocs.net/162098

 

  • The reason the output vector keeps the same dimension as the input vector is to allow the skip connection introduced in residual networks.
  • Layer normalization then computes the mean and variance of the resulting vector, normalizes it so that the mean and variance become 0 and 1, and applies an affine transformation of the form y = ax + b to the normalized values (a small sketch follows after this list).
  • Positional encoding marks each word's position in the sentence, compensating for the fact that the self-attention model by itself cannot distinguish order.
  • As a result, the same word produces a different vector depending on its position.
  • To prevent cheating, masking is used so that queries cannot see the words that come after them: the logits for those later positions are set to minus infinity. This is the masked self-attention step (see the positional-encoding and masking sketch below).
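As a small sketch of the normalization step described above (the names gamma and beta stand in for the learnable a and b of the affine transformation y = ax + b; they are assumptions for illustration):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x to mean 0 and variance 1, then apply y = a*x + b."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)   # mean 0, variance 1
    return gamma * x_hat + beta               # learnable affine transformation
```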
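And a minimal sketch of positional encoding and the causal mask used in masked self-attention (the sinusoidal form of the original Transformer paper is assumed here; the mask values would be added to the logits before the softmax):

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Position-dependent vectors added to the word embeddings, so the same
    word at different positions ends up with a different representation."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def causal_mask(seq_len):
    """Masked self-attention: logits of future positions are set to -inf so
    each query cannot attend to the words that come after it."""
    upper = np.triu(np.ones((seq_len, seq_len)), k=1)   # 1s above the diagonal
    return np.where(upper == 1, -np.inf, 0.0)           # added to the logits
```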

 

Source: https://www.quora.com/How-does-the-relative-positional-encoding-in-a-transformer-work-and-how-can-it-be-implemented-in-Python

 

 

 

  • The transformer was first used for the machine translation task, where it outperformed RNN-based and convolution-based sequence encoding models.
  • It achieved this strong performance by fundamentally resolving the long-term dependency problem.
  • Beyond natural language processing, it has been extended to other domains and shows strong performance there as well.
  • Moreover, by stacking more layers while keeping the design of the transformer block itself unchanged, gradually increasing the model size, and combining it with self-supervised learning, task performance keeps improving.
  • It still carries the limitation of an autoregressive model that predicts one time step at a time.
