[Self-Supervised Learning and Large-Scale Pre-Trained Models] Part 6

2023. 1. 25. 17:50
๐Ÿง‘๐Ÿป‍๐Ÿ’ป์šฉ์–ด ์ •๋ฆฌ
 
BERT

 

Self-Supervised Learning

  • ์‚ฌ๋žŒ์ด ์ง์ ‘ ์ผ์ผ์ด ํ•ด์ค˜์•ผ ํ•˜๋Š” ๊ทธ๋Ÿฐ labeling ๊ณผ์ •์ด ์—†์ด๋„ ์›์‹œ data๋งŒ์œผ๋กœ ์–ด๋–ค ๋จธ์‹ ๋Ÿฌ๋‹ model์„ ํ•™์Šต์‹œํ‚ฌ ์ˆ˜ ์—†์„์ง€์— ๋Œ€ํ•œ ์•„์ด๋””์–ด model์ž…๋‹ˆ๋‹ค.
  • ์ž…๋ ฅ data๋งŒ์œผ๋กœ ์ž…๋ ฅ data์˜ ์ผ๋ถ€๋ฅผ ๊ฐ€๋ ค๋†“๊ณ , ๊ฐ€๋ ค์ง„ ์ž…๋ ฅ data๋ฅผ ์ฃผ์—ˆ์„ ๋•Œ, ๊ฐ€๋ ค์ง„ ๋ถ€๋ถ„์„ ์ž˜ ๋ณต์› ํ˜น์€ ์˜ˆ์ธกํ•˜๋„๋ก ํ•˜๋Š”, ๊ทธ๋ž˜์„œ ์ฃผ์–ด์ง„ ์ž…๋ ฅ data์˜ ์ผ๋ถ€๋ฅผ ์ถœ๋ ฅ ํ˜น์€ ์˜ˆ์ธก์˜ ๋Œ€์ƒ์œผ๋กœ ์‚ผ์•„ Model์„ ํ•™์Šตํ•˜๋Š” task๊ฐ€ ๋˜๊ฒ ์Šต๋‹ˆ๋‹ค.
  • ์ด๋Š” Computer vision ์ƒ์—์„œ inpainting task๋ฅผ ์˜ˆ๋กœ ๋“ค ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์ด model์€ ํŠน์ • ๋ฌผ์ฒด์˜ ํŠน์ง•๋“ค์„ ์ž˜ ์•Œ๊ณ  ์žˆ์–ด์•ผ๋งŒ ์ด task๋ฅผ ์ž˜ ์ˆ˜ํ–‰ํ•  ์ˆ˜๊ฐ€ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ์ด๋Ÿฌํ•œ ๋Œ€๊ทœ๋ชจ Data๋กœ ์ž๊ฐ€ ํ•™์Šต๋œ Model์€ ์šฐ๋ฆฌ๊ฐ€ ์›ํ•˜๋Š” ํŠน์ • Task๋ฅผ ํ’€๊ธฐ ์œ„ํ•œ Transfer learning ํ˜•ํƒœ๋กœ ํ™œ์šฉ๋  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ์•ž ์ชฝ์—์„œ๋Š” ํ•™์Šต์‹œํ‚ค๊ณ ์ž ํ•˜๋Š” data๋“ค์„ ์ถ”์ถœํ•˜๋„๋ก ํ•™์Šต ๋˜์—ˆ์„ ๊ฒƒ์ด๊ณ , ๋’ค๋กœ ๊ฐˆ ์ˆ˜๋ก ์ง์ ‘์ ์œผ๋กœ ์ฃผ์–ด์ง„ ํŠน์ • task๋“ค, inpainting์ด๋‚˜ ์ง์†Œ ํผ์ฆ task์— ์ง์ ‘์ ์œผ๋กœ ๊ด€๋ จ์ด ๋˜๋Š” ๊ทธ๋Ÿฐ ์ •๋ณด๋“ค์„ ์œ„์ฃผ๋กœ ํ•™์Šต์ด ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ž์—ฐ์–ด์ฒ˜๋ฆฌ ์ชฝ์—์„œ ๋งŽ์ด ์„ฑ๊ณต์„ ๊ฑฐ๋‘์—ˆ์Šต๋‹ˆ๋‹ค.

 

BERT (Pre-training of Deep Bidirectional Transformers for Language Understanding)

  • BERT is based on the Transformer. "Bidirectional" refers to the masked language modeling variant of the language modeling task; together with an additional self-supervised task, next sentence prediction, BERT performs self-supervised learning with these two tasks.
  • You can think of it as the encoder of the Transformer.
  • To train this BERT model in a self-supervised fashion, large-scale text data is used as the training data.
  • Input sentences are fed to the BERT model as the input sequence, but following the basic idea of self-supervised learning, hiding part of the input and predicting it, some words in the input sentence are replaced with a special [MASK] token.
  • The model is trained on the task of guessing which word originally belonged in each masked position.
  • Also, so that multi-sentence inputs can be exploited at fine-tuning time, the self-supervised pre-training stage includes next sentence prediction: given two sentences, the model predicts whether they are closely related in meaning or not.
  • A special separator token [SEP] is added between the two sentences and at the end of the input, so the model can tell where sentences are separated and where the input ends; a classification token [CLS] is prepended at the very first time step, and this sequence is given to the BERT model as input (see the sketch after this list).
  • The model produces, for each word, a hidden state vector that encodes the needed information from the whole sequence; there is a vector encoding [CLS], and each masked word likewise gets its own encoded vector.
  • For the purpose of binary classification of whether the two consecutive sentences plausibly belong together, the vector encoding the [CLS] token is fed into the next-sentence-prediction output layer, which performs the binary classification.
  • The hidden state vectors encoding the masked (or replaced) words are in turn fed into another output layer trained to recover the original words.
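
To make the input format concrete, here is a minimal sketch assuming the Hugging Face transformers library (the original notes do not name a library), showing where [CLS], [SEP], and [MASK] land in the sequence:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair; one word is replaced by the [MASK] special token.
enc = tokenizer("The man went to the [MASK].", "He bought a gallon of milk.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'the', 'man', 'went', 'to', 'the', '[MASK]', '.',
#  '[SEP]', 'he', 'bought', 'a', 'gallon', 'of', 'milk', '.', '[SEP]']
print(enc["token_type_ids"])   # 0s for sentence A, 1s for sentence B
```

The `token_type_ids` are what drive the segment embedding described later in this post.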

 

The two pre-training tasks of BERT are as follows.

 

 

Masked Language Model (MLM)

  • 15% of the input tokens are selected as prediction targets.
  • Of those selected words, only 80% are actually replaced with the [MASK] token in the input sequence.
  • As a result, when the model must predict the word at a position from its encoded hidden state vector, it learns to encode useful information not only for the [MASK]-replaced words but for the other words as well.
  • Another 10% or so are left unchanged, and the model is still asked to predict the original word at those positions.
  • This design prevents the deep learning model from learning the spurious pattern "whatever word is actually shown cannot be the answer": even a word kept unchanged should be predicted as the correct original word.
  • The 15% ratio of actual prediction-target words:
    • If the ratio is too low, training becomes inefficient (too few prediction targets per pass); if it is too high, learning proceeds quickly but too many words are hidden and not enough real context remains, so it was set to 15%.
  • In summary: 80% become the [MASK] token, 10% become a random word, and 10% are left unchanged (see the sketch after this list).
    • The two input sentences are given separated by the [SEP] token.
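
A minimal sketch of this 15% / 80-10-10 corruption scheme, in plain Python over a token list; the toy vocabulary and function name are placeholder assumptions:

```python
import random

VOCAB = ["the", "man", "went", "to", "store", "bought", "milk"]  # toy vocabulary

def mask_tokens(tokens, select_prob=0.15):
    """BERT-style corruption: of the ~15% of tokens selected as targets,
    80% -> [MASK], 10% -> a random word, 10% -> left unchanged."""
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < select_prob:            # selected as a target
            labels.append(tok)                       # the model must recover this
            r = random.random()
            if r < 0.8:
                inputs.append("[MASK]")              # 80%: mask token
            elif r < 0.9:
                inputs.append(random.choice(VOCAB))  # 10%: random word
            else:
                inputs.append(tok)                   # 10%: keep original word
        else:
            inputs.append(tok)
            labels.append(None)                      # no loss at this position
    return inputs, labels
```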

 

Next Sentence Prediction (NSP)

  • When the [CLS] token has been encoded into its hidden state vector, this vector is passed through the output layer to decide whether the two given sentences actually appeared consecutively in the original document. If they did, the two sentences are in a genuine next-sentence relationship; if each was instead drawn from a different document, their contexts will generally not fit together.
  • Training proceeds so that, based on the context and relationship between the two sentences, the binary classification computed from the hidden state vector of the [CLS] token correctly predicts whether the pair is a next-sentence pair or not (a pair-construction sketch follows this list).
  • In this way, [CLS] learns to gather the information it needs from the given input sequence through the self-attention modules.
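
A minimal sketch of how NSP training pairs are commonly constructed: 50% true consecutive sentences, 50% a random sentence from the corpus. The corpus layout and label strings are illustrative assumptions.

```python
import random

def make_nsp_pair(documents):
    """Build one NSP example from a corpus where each document is a
    list of sentences: 50% true next sentence, 50% random sentence."""
    doc = random.choice(documents)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        sent_b, label = doc[i + 1], "IsNext"      # the actual next sentence
    else:
        other = random.choice(documents)          # a careful version would
        sent_b = random.choice(other)             # avoid picking the same doc
        label = "NotNext"
    return sent_a, sent_b, label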

 

Details of BERT

  • Model Architecture
    • BERT BASE      // L=12 layers, H=768, A=12 attention heads
    • BERT LARGE     // L=24 layers, H=1024, A=16 attention heads; the deeper model
  • Input representation
    • WordPiece embeddings
    • Learned positional embedding
    • [CLS] token
    • Packed sentence embedding // a sentence pair is packed into one input sequence
    • Segment embedding // on top of the positional information, gives each token a vector marking which sentence it belongs to
  • Pre-training Tasks
    • Masked LM
    • Next Sentence Prediction
  • For word-level classification categories, such as predicting each word's part of speech, the BERT model can be used in a fine-tuning setup as a word-level classification task.
  • For sentence-level classification, e.g. positive vs. negative sentiment, the encoded [CLS] token is fed into one additional fully-connected layer, and fine-tuning trains the model to predict positive or negative.
  • There are also target tasks that make predictions from multiple sentences.
    • Natural language inference task
      • MultiNLI
    • CoLA
  • Machine reading comprehension
    • A question is given together with a paragraph that contains the answer.
    • The answer must be extracted as specific words (a span) within the passage.
      • Per-token scalar values are used as logits fed into a softmax, and a classification is performed over the words of the given passage to locate the answer; see the sketch below.
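
A minimal sketch of this span-extraction head in PyTorch, in the style of the BERT paper's SQuAD setup: one linear layer scores every token as a candidate start and end of the answer. The tensor shapes and random inputs are assumptions for illustration.

```python
import torch
import torch.nn as nn

hidden_size, seq_len = 768, 128
hidden_states = torch.randn(1, seq_len, hidden_size)  # encoder output (batch of 1)

# One linear layer scores every token twice: once as a candidate start
# of the answer span and once as a candidate end.
span_head = nn.Linear(hidden_size, 2)
start_logits, end_logits = span_head(hidden_states).split(1, dim=-1)

start_probs = torch.softmax(start_logits.squeeze(-1), dim=-1)  # over passage tokens
end_probs = torch.softmax(end_logits.squeeze(-1), dim=-1)
start, end = start_probs.argmax(-1), end_probs.argmax(-1)      # predicted span
```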

 

 

GPT (Generative Pre-Training Transformer)

  • The GPT model also uses the Transformer.
  • It uses the decoder of the Transformer.
    • The defining feature of the decoder is masked self-attention (see the sketch after this list).
    • It is fundamentally auto-regressive: given the input sequence so far, at the current time step it predicts the word that appears at the next time step.
  • Learning this word-level language modeling task is the core idea of the GPT model.
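
A minimal sketch of masked (causal) self-attention in PyTorch, using toy score values; the sequence length is an arbitrary assumption:

```python
import torch

seq_len = 5
scores = torch.randn(seq_len, seq_len)   # raw attention scores (toy values)

# Lower-triangular mask: position i may attend only to positions <= i,
# so each token is predicted from its past -- the auto-regressive setup.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, float("-inf"))
attn = torch.softmax(scores, dim=-1)     # each row sums to 1 over visible tokens
```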

 

  • GPT-2 : Language Models are Unsupervised Multitask Learners
    • It stacks the Transformer's self-attention blocks much deeper, greatly increasing the model size.
    • A very large amount of text data was used for training.
    • On the reasoning that next-word prediction becomes more meaningful with higher-quality data, training data was collected as follows: when an answer to a question on a community site received at least 3 likes and contained a link, the link was followed and the linked document was collected as actual training data; the language modeling task was then trained on this data.
    • It showed the ability to be applied directly to down-stream tasks in a zero-shot setting.
      • Seeing "TL;DR" (Too Long; Didn't Read), the model judges that it should produce a summary.
      • Because it performed the summarization task without using any task-specific training data, this came to be called zero-shot.
  • GPT-3
    • It inherits the decoder architecture as-is, with far more training data and vastly more layers than before: a Transformer decoder model with about 175 billion parameters.
    • This model demonstrates few-shot learning ability in addition to zero-shot.
    • Instructions like "translate the next word into French" work directly; this is zero-shot learning.
    • GPT-2 could do this too, but it is limited without examples; GPT-3 shows that by providing part of the text (a few examples) in the prompt, the language modeling task itself can be applied directly to the target task.
      • Compared to zero-shot, giving one training example is one-shot learning, and giving a few more is few-shot learning (see the prompt sketch after this list).
      • Few-shot performance is reported to be quite good.
      • Finding which few examples to give so that the fixed GPT-3 model performs best is called prompt tuning.
  • Leveraging the GPT model's language modeling ability,
    • it can be used to continue and complete the rest of a piece of writing.
    • Copilot
      • Copilot takes the GPT-3 model and fine-tunes it on program (source code) data, implementing code autocompletion well.
    • There have been many attempts with Korean-language data as well.
      • HyperCLOVA model
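
A minimal sketch of the zero- vs. few-shot prompt format; the translation pairs follow the style of the demonstrations in the GPT-3 paper:

```python
# Zero-shot: a task description only, no examples.
zero_shot = "Translate English to French:\ncheese =>"

# Few-shot: the same task plus a few in-context examples; the frozen
# model simply continues the pattern -- no gradient updates happen.
few_shot = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "peppermint => menthe poivrée\n"
    "cheese =>"
)
```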

 

Model sizes keep growing, and the amount of data required keeps growing as well.

 

GPU usage keeps growing as well.

 

As more models become directly usable for multiple purposes, the technology is developing into a useful form of general-purpose AI, where a single model can perform many tasks well.
