[Deep Learning] Recurrent Neural Network (1)

2023. 5. 16. 17:14
๐Ÿง‘๐Ÿป‍๐Ÿ’ป์šฉ์–ด ์ •๋ฆฌ

Neural Networks
Feed-forward
Backpropagation
Convolutional Neural Network
Recurrent Neural Network
Propagation
unfolding
fold
unfolding computational graph

 

So far, we have studied various techniques using the MLP as our baseline.

 

Building on that, we went from deep MLPs to the CNN, a highly successful architecture.

 

Next, we will study another successful architecture: the RNN.

 

 

Recurrent Neural Network

 

 

์ด RNN model๋„ 1986๋…„์— ์ด๋ฏธ ์ œ์•ˆ์ด ๋˜์–ด์˜จ model์ž…๋‹ˆ๋‹ค.

 

๊ฒฐ๊ตญ ์ด RNN๋„ Neural Network์˜ specialized form์ž…๋‹ˆ๋‹ค.

 

์ด RNN์€ data๊ฐ€ sequential ํ•  ๋•Œ, ๋” ์ž˜ ์ž‘๋™ํ•˜๋Š” form์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

์ด๋Ÿฌํ•œ ๊ฐ’๋“ค์ด x 1๋ถ€ํ„ฐ x t๊ทธ๋ฆฌ๊ณ  x ํƒ€์šฐ ์‹œ์ ๊นŒ์ง€ ์ˆœ์ฐจ์ ์ธ sequence๋ฅผ ์ด๋ฃจ๋Š” ๊ตฌ์กฐ๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์•„๋ž˜ ๊ทธ๋ฆผ์—์„œ input, hidden, output node์˜ ๋ชจ์Šต์ด ์ด์–ด์ง„ ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์™ผ์ชฝ์— ์žˆ๋Š” ๊ฒƒ์„ ์˜ค๋ฅธ์ชฝ์— ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ํŽผ์น˜๋ฉด unfold ์ ‘์œผ๋ฉด fold ํ˜•ํƒœ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

 

 

hidden์—์„œ ์ˆœํ™˜ํ•˜๋Š” ํ˜•ํƒœ๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ , ์ฃผ์˜๊นŠ๊ฒŒ ๋ณผ ๋ถ€๋ถ„์€, input, state, output์€ t ์‹œ์ ์ด ์กด์žฌํ•˜์ง€๋งŒ, W, V, U๋ผ๋Š” parameter๋Š” ํ•ญ์ƒ ๋˜‘๊ฐ™์€ ๊ฐ’์ด ๋“ค์–ด์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

์•ž์—์„œ ๋ฐฐ์šด parameter sharing์˜ ๊ฐœ๋…์ด ์ ์šฉ๋œ ํ˜•ํƒœ์ž…๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋ž˜์„œ Unfolding์˜ ๊ฐœ๋…์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

parameter sharing์„ ํ•˜๋ฉฐ, ๋ฐ˜๋ณต์ ์œผ๋กœ operation์ด ๋˜๋Š” structure ํ˜น์€ chain of events์ž…๋‹ˆ๋‹ค.

 

์œ„ ์ˆ˜์‹๊ณผ ๊ฐ™์ด,

 

t ์‹œ์ ์˜ state๋ฅผ s t-1์˜ ํ˜•ํƒœ์— ๋Œ€ํ•ด์„œ ์„ธํƒ€๋ผ๋Š” parameter๋ฅผ ๊ฐ€์ง€๊ณ  ๋ณ€ํ™˜ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ด ์„ธํƒ€๋Š” ์„ธํƒ€ t๊ฐ€ ์•„๋‹Œ ๋งค ์‹œ์ ๋งˆ๋‹ค ๊ฐ™์€ parameter๊ฐ’์ด ์ ์šฉ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

Unfolding Computational Graph

 

 

๊ทธ๋ž˜์„œ ์ด ์„ธํƒ€ ๊ฐ’์— ์˜ํ•ด state๊ฐ€ ๊ณ„์† ๋ณ€ํ™”ํ•ด ๋‚˜๊ฐ‘๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ์ ์šฉ๋˜๋Š” function์€ ์„ธํƒ€์— ๋Œ€ํ•œ function์ž…๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ Directed acyclic computational graph๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

๋ฐฉํ–ฅ์„ ๊ฐ€์ง€๋Š” ์ˆœํ™˜๋˜์ง€ ์•Š๋Š” NN์˜ computational graph์ž…๋‹ˆ๋‹ค.

 

 

์ด๊ฒƒ์€ ๋‹ค์Œ์˜ ํŠน์ง•์„ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

 

  • Traditional dynamical system
  • Without any recurrence

์ด๋Ÿฌํ•œ ๊ฒƒ๋“ค์„ state transition model์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

state๊ฐ€ ๊ณ„์† ๋ณ€ํ™”ํ•˜๋Š” model์ž…๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์€ ์•„์ง๊นŒ์ง„ ์ˆœํ™˜์ ์ธ ํŠน์„ฑ์€ ๊ฐ€์ง€๊ณ  ์žˆ์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ƒฅ state๊ฐ€ transition๋  ๋ฟ์ด์ฃ .

 

 

 

์—ฌ๊ธฐ์— ์ž…๋ ฅ์„ ๋งค ์‹œ์ ๋งˆ๋‹ค ๋„ฃ์–ด์ค€๋‹ค๋ฉด,

 

์ž…๋ ฅ์— ๋”ฐ๋ผ state๊ฐ€ ์ˆœํ™˜๋˜๋Š” ์œ„์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

 

 

t ์‹œ์ ์—์„œ input์— ๋Œ€ํ•ด์„œ hidden state๋ฅผ ๋งŒ๋“œ๋Š”๋ฐ, ๋˜‘๊ฐ™์€ U๋ผ๋Š” parameter๊ฐ€ ๋“ค์–ด๊ฐ‘๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ ์—ฌ๊ธฐ์„œ ์ด์ „ ์‹œ์ ์˜ hidden ์ •๋ณด๊ฐ€ ๋‹ค์Œ ์‹œ์ ์œผ๋กœ transition๋˜๋ฉฐ W parameter๊นŒ์ง€ ํ•จ๊ป˜ ๊ณ ๋ ค๋ฉ๋‹ˆ๋‹ค.

 

ํ˜น์€ ๋‹ค๋ฅธ state transition ๊ด€์ ์—์„œ ์ƒ๊ฐํ•ด๋ณด๋ฉด,

 

state๊ฐ€ transitionํ•˜๋Š” model์ธ๋ฐ, ๊ฐ ์‹œ์ ์— input signal์ด ๋“ค์–ด์˜ค๋Š” ๊ฒƒ์œผ๋กœ ๋ณผ ์ˆ˜๋„ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

In this case, the equation producing the hidden state is influenced by both the previous hidden state and the current input,

 

and the parameter θ here is the set {U, W}.
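A minimal sketch of this recurrence with inputs, h_t = tanh(W h_{t-1} + U x_t), where θ = {U, W} is shared across all steps. The sizes and names here are toy assumptions for illustration.

```python
# Sketch of h_t = tanh(W @ h_{t-1} + U @ x_t) with shared parameters.
import numpy as np

rng = np.random.default_rng(0)
n_hid, n_in = 3, 2
W = rng.normal(size=(n_hid, n_hid))  # hidden-to-hidden, shared over time
U = rng.normal(size=(n_hid, n_in))   # input-to-hidden, shared over time

def step(h_prev, x_t):
    return np.tanh(W @ h_prev + U @ x_t)

h = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):  # a length-5 input sequence
    h = step(h, x_t)                    # same U and W at every step
print(h.shape)  # final hidden state, shape (3,)
```

Only one `W` and one `U` exist no matter how long the sequence is; the time index lives on the states and inputs, not on the parameters.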

 

์ด๋ ‡๊ฒŒ ์ˆœํ™˜๋˜๋Š” ๊ตฌ์กฐ์—์„œ๋Š”,

 

์–ด๋–ค ์‹œ์ ์˜ hidden state๋Š” ๊ทธ ์ „ ์‹œ์ ์˜ ๋ชจ๋“  ์ž…๋ ฅ signal๋“ค์˜ ์ด ํ•ฉ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์œ„์™€ ๊ฐ™์ด t ์‹œ์ ์˜ hidden state๋Š” lossy summary๋กœ ์ •๋ณด๋ฅผ ๋‹ค์†Œ ์žƒ์–ด๋ฒ„๋ฆฌ๋Š” summary๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

ํ•˜์ง€๋งŒ ์ฒซ input๋ถ€ํ„ฐ ๋ชจ๋“  t๊นŒ์ง€์˜ ๋ชจ๋“  input๋“ค์„ summaryํ•œ๋‹ค๊ณ  ๋ด…๋‹ˆ๋‹ค.

 

๋˜, t ์‹œ์ ์˜ hidden state๋Š” ๋‹ค์†Œ ์ •๋ณด๋“ค์„ ์žƒ์–ด๋ฒ„๋ฆด์ง€๋ผ๋„ ์ตœ๊ทผ ์ •๋ณด๋“ค์„ ์œ„์ฃผ๋กœ ์ •๋ณด๋ฅผ ์š”์•ฝํ•ด ๋†“์Šต๋‹ˆ๋‹ค.

 

์œ„ chain์ด ๊ธธ์–ด์ง€๋ฉด ๊ณผ๊ฑฐ์˜ ์ •๋ณด๋“ค์„ ๊ฝค๋‚˜ ์†Œ์‹คํ•˜๊ฒŒ ๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

parameter๊ฐ€ ๊ณฑํ•ด์ง€๋ฉฐ ์˜ค๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ ์ตœ๊ทผ ๋ช‡ ์‹œ์ ์˜ ์‹ ํ˜ธ๋“ค์€ ๋‚จ์•„์žˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Šต๋‹ˆ๋‹ค.

 

์ด๋ก  ์ ์œผ๋กœ ๋งํ•˜์ž๋ฉด, t์‹œ์  ์ด์ „์˜ ๋ชจ๋“  input signal๋“ค์˜ ๊ฐ’์ด t ์‹œ์ ์˜ hidden state์— ์ €์žฅ์ด ๋œ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

๊ทธ๊ฒƒ์„ ๊ทธ๋ฆผ์œผ๋กœ ํ‘œํ˜„ํ•˜๋ฉด ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

ํ•ด๋‹น ์˜ํ–ฅ์„ ์ฃผ๋Š” function์„ g(t)๋ผ๊ณ  ํ•œ๋‹ค๋ฉด,

 

x1 ~ x t๊นŒ์ง€์˜ ๊ฐ’๋“ค์ด ๋ชจ๋‘ t์‹œ์ ์˜ hidden state์— ์œ„์™€๊ฐ™์ด ์˜ํ–ฅ์„ ์ค๋‹ˆ๋‹ค.

 

๊ณผ๊ฑฐ์˜ ๋ชจ๋“  signal์˜ ๊ฐ’์„ ํ˜„์‹œ์ ์˜ state์— ๊ณ„์†ํ•ด์„œ ๋ˆ„์ ํ•ด ๋‚˜๊ฐ€๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 



RNN์˜ ์—ฌ๋Ÿฌ ๊ฐ€์ง€ ๊ตฌ์กฐ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

1๋ฒˆ๊ณผ ๊ฐ™์ด hidden์—์„œ ๋งค๋ฒˆ recurrent connections์ด ๋ฐœ์ƒํ•˜๋ฉฐ ๋งค ์‹œ๊ฐ„ output์ด ๋‚˜์˜ต๋‹ˆ๋‹ค.

 

์ด 1๋ฒˆ์ด ๋Œ€ํ‘œ์ ์ธ ๊ตฌ์กฐ์ด์ง€๋งŒ,

 

์š”์ƒˆ๋Š” Transformer์™€ ๊ฐ™์€ ๊ตฌ์กฐ๋“ค์ด ๋งŽ์ด ๋‚˜์™€์„œ ๋Œ€ํ‘œ๊นŒ์ง„ ๋ณด๊ธฐ ์–ด๋ ต์ง€๋งŒ default๋กœ ์ƒ๊ฐ๋ฉ๋‹ˆ๋‹ค.

 

 

In the second, the recurrence runs from the output at one time step to the hidden state at the next.

 

In the third, a single output exists for the entire sequence.

 

 

pattern 1

 

์ฒซ ๋ฒˆ์งธ ํŒจํ„ด์„ ๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

๋งค ์‹œ์ ์— input์ด hidden state๋กœ ๊ฐ€๊ณ , hidden state๊ฐ€ output์œผ๋กœ ๊ฐ€๊ณ , output์— ๋Œ€ํ•ด loss๊ฐ€ ๊ณ„์‚ฐ๋˜๋Š”๋ฐ, loss ๊ณ„์‚ฐ์„ ์œ„ํ•ด์„  y๊ฐ’์ด ํ•„์š”ํ•˜์—ฌ y๊ฐ’๋„ ๋“ค์–ด์˜ค๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

์œ„์™€ ๊ฐ™์ด hidden state ์‚ฌ์ด์—์„œ ์ˆœํ™˜์ด ๋˜๋Š” ๊ตฌ์กฐ๋ฅผ ๊ฐ€์žฅ ๋Œ€ํ‘œ์ ์ธ RNN ๊ตฌ์กฐ๋กœ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

์ขŒ์ธก์˜ fold๋œ ์‚ฌ์ง„์—์„œ ํ™”์‚ดํ‘œ๋Š” Weight๊ฐ€ ๊ณฑํ•ด์ ธ์„œ ์ˆœํ™˜์ด ๋˜๋Š” ๊ตฌ์กฐ์ž„์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  Weight set์ด ์‹œ์ ์— ์ƒ๊ด€ ์—†์ด parameter sharing์„ ํ†ตํ•ด ์ ์šฉ๋˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

pattern 2

 

๋‘ ๋ฒˆ์งธ ํŒจํ„ด์€,

 

์ „ ์‹œ์ ์˜ output์—์„œ ๋‹ค์Œ ์‹œ์ ์˜ hidden state๋กœ ์ˆœํ™˜์ด ๋˜๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

hidden state ์ž์ฒด๋ฅผ ๊ณ„์†ํ•ด์„œ ๋ˆ„์ ํ•˜๋Š” ๊ฑฐ๋ž‘, ํ•œ ๋ฒˆ ์ •๋ณด๋ฅผ ์™œ๊ณก์„ ์‹œํ‚ค๊ณ  (์ฆ‰, softmax๋ฅผ ํ†ตํ•œ ํ™•๋ฅ ๊ฐ’ ๋ณ€ํ™˜) , ์ด๋Ÿฌํ•œ ํ™•๋ฅ ๊ฐ’์„ ์ฃผ๋Š” ๊ฒƒ์ด๋ž‘ ์ •๋ณด๊ฐ€ ๋งŽ์ด ๋‹ฌ๋ผ์งˆ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ์ด์œ ๋กœ ์ด pattern์€ ์ž˜ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

 

 

pattern 3

 

์œ„์™€ ๊ฐ™์ด,

 

input์ด ๋งค ์‹œ์ ์— ์กด์žฌํ•˜๊ณ  ์ˆœํ™˜์€ hidden state์—์„œ ๋˜์ง€๋งŒ output์€ ๋์— ํ•˜๋‚˜๋งŒ ๋‚˜์˜ค๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ์œ„์™€ ๊ฐ™์ด loss๋„ ๊ณ„์‚ฐํ•˜๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

์ด ๊ตฌ์กฐ๋Š” ๋งŽ์ด ์‚ฌ์šฉ๋˜๋ฉฐ,

 

encoder, decoder ๊ตฌ์กฐ์—์„œ ์‘์šฉ๋˜์–ด ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

 

 

Forward Propagation

 

This forward propagation will be covered for pattern 1.

 

In fact, the process is not much different from the MLP.

 

Think of the value a above as a summation: the net value we covered earlier.

 

Following the equations above, we can compute h.

 

In an RNN, tanh is the default activation function.

 

b and c are the bias values.
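The pattern-1 forward pass described above can be sketched as follows, using the standard per-step equations a_t = b + W h_{t-1} + U x_t, h_t = tanh(a_t), o_t = c + V h_t, ŷ_t = softmax(o_t). The sizes are toy assumptions.

```python
# Sketch of the pattern-1 RNN forward pass.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 3, 2
U = rng.normal(size=(n_hid, n_in))   # input-to-hidden
W = rng.normal(size=(n_hid, n_hid))  # hidden-to-hidden (the recurrence)
V = rng.normal(size=(n_out, n_hid))  # hidden-to-output
b = np.zeros(n_hid)                  # hidden bias
c = np.zeros(n_out)                  # output bias

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def forward(xs):
    h = np.zeros(n_hid)
    outputs = []
    for x in xs:                     # same U, W, V, b, c at every step
        a = b + W @ h + U @ x        # the summation ("net") value
        h = np.tanh(a)               # default RNN activation
        o = c + V @ h
        outputs.append(softmax(o))   # probability over output classes
    return outputs

ys = forward(rng.normal(size=(5, n_in)))
print(len(ys), ys[0].sum())  # one output per time step; each sums to 1
```

Apart from the `W @ h` term carrying the previous hidden state forward, each step is exactly an MLP layer, which is why the process is "not much different from the MLP."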

 

Backpropagation

 

 

Backpropagation ์„ ํ†ตํ•ด ์˜ค๋ฅ˜๊ฐ€ ๋ชจ๋“  ์‹œ์  ์ „๋‹ฌ ๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ์ปดํ“จํ„ฐ ์ž…์žฅ์—์„  ๊ณ„์‚ฐ์ด ๋งค์šฐ ๋ณต์žกํ•ฉ๋‹ˆ๋‹ค.

 

 

๊ทธ๋ฆฌ๊ณ  ์œ„์—์„œ ์˜ค๋Š”, output์œผ๋กœ๋ถ€ํ„ฐ์˜ weight๋“ค๋„ ๊ณ ๋ คํ•ด์ฃผ์–ด์•ผ ํ•œ๋‹ค๋Š” ์ ์„ ๊ณ ๋ คํ•ด์ฃผ์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

 

Backpropagation through time

 

RNN์—์„œ์˜ Backpropagation์„ Backpropagation through time๋ผ๊ณ  ๋ถ€๋ฆ…๋‹ˆ๋‹ค.

 

 

ํŠน๋ณ„ํ•œ ๊ธฐ์ˆ ์ด ํ•„์š”ํ•œ ๊ฒƒ์€ ์•„๋‹ˆ๊ณ ,

 

์ „ ์‹œ์ ์˜ hidden state์—์„œ backward๋กœ ํ•œ ๋ฒˆ ๋” ๊ฐ€์ ธ์™€์•ผํ•œ๋‹ค๋Š” ์ฐจ์ด๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์œ„์™€ ๊ฐ™์ด ์šฐ๋ฆฌ๊ฐ€ update ์‹œ์ผœ์ค˜์•ผํ•  weight๋“ค์ด ์œ„์™€ ๊ฐ™์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

And

 

a loss exists at every time step from 1 through τ.

 

So we should perform MLE (maximum-likelihood estimation) over the whole sequence.

 

Since that is not easy, we compute the loss at each time step separately and regard that as the loss over the whole sequence.

 

Strictly, we would have to compute the joint probability over the entire sequence; instead, we compute each time step independently and take the sum of those losses as our network's loss.

 

So the equation above is MLE with a minus sign attached: negative MLE, i.e. the negative log-likelihood.

 

 

 

 

Let's look at the structure above again.

 

First, by the equation above, the derivative of the total loss L with respect to each per-step loss L^(t) is 1.

 

 

์œ„ ์ˆ˜์‹์—์„œ ๋ณด๋ฉด t ์‹œ์ ์˜ Loss๋ฅผ y hat t๋กœ ๋ฏธ๋ถ„ํ•ฉ๋‹ˆ๋‹ค.

 

์ด ๊ฒฝ์šฐ, cross entropy๋กœ ๋ดค์„ ๋•Œ ๋ฏธ๋ถ„ํ•œ๋‹ค๊ณ  ๋ณด๋ฉด, ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  y hat t๋ฅผ o t๋กœ ๋ฏธ๋ถ„ํ•˜๋ฉด softmax ๋ฏธ๋ถ„์ด๋ฏ€๋กœ, ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ ์ด ๊ฐ’์€ y t๊ฐ€ 1์ธ ๊ฒฝ์šฐ์™€ y t๊ฐ€ 0์ธ ๊ฒฝ์šฐ๋กœ ๋‚˜๋‰˜๋ฉฐ,

 

0์ธ ๊ฒฝ์šฐ๋Š” 0์ด ๋˜๋ฏ€๋กœ ๊ณ ๋ คํ•˜์ง€ ์•Š๊ณ , y t๊ฐ€ 1์ธ ๊ฒฝ์šฐ์—๋งŒ ๊ณ ๋ คํ•˜์—ฌ ์œ„์™€ ๊ฐ™์€ ์ˆ˜์‹์„ ์ž‘์„ฑํ•ฉ๋‹ˆ๋‹ค.
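Chaining the cross-entropy and softmax derivatives above gives the well-known result ∂L^(t)/∂o_t = ŷ_t − y_t (with y_t one-hot). As a sanity check, the sketch below compares that analytic gradient against a finite-difference approximation on toy values.

```python
# Sketch: verify dL/do = yhat - y for softmax + cross-entropy
# against a central finite-difference approximation (toy values).
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())
    return e / e.sum()

def loss(o, y):
    return -np.log(softmax(o)[y])   # cross-entropy, true class index y

o = np.array([0.5, -1.0, 2.0])
y = 2                                # true class

analytic = softmax(o).copy()
analytic[y] -= 1.0                   # yhat - y (y as one-hot)

eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(3)[i], y) - loss(o - eps * np.eye(3)[i], y)) / (2 * eps)
    for i in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

This is the quantity that then flows back through V into the hidden states.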

 

 

And differentiating o_t with respect to h_t simply leaves V.

 

์ž, ๊ทธ๋Ÿฐ๋ฐ,

 

์šฐ์ธก ๊ทธ๋ฆผ์˜ ํŒŒ๋ž€์ƒ‰ ํ™”์‚ดํ‘œ ๋ถ€๋ถ„๋„ ๊ตฌํ•ด์•ผ์ค˜์•ผํ•ฉ๋‹ˆ๋‹ค.

 

์•„๋ž˜ ์ˆ˜์‹์„ ๋ด…์‹œ๋‹ค.

 

 

First, the part from L down to h is the same as what we have done so far.

 

The intermediate steps can also be seen as above.

 

And the expression for h_{t+1} has to be differentiated with respect to h_t, as above.

 

That is done through the process shown above.

 

The diag(·) is attached to turn the elementwise (scalar) derivative into a matrix so it fits the matrix operations.

 

 

์™ผ์ชฝ์‹์€ t + 1 ์‹œ์ , ์˜ค๋ฅธ์ชฝ ์‹์€ t ์‹œ์ ์ž…๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๊ฒƒ ๋•Œ๋ฌธ์— ์ปดํ“จํ„ฐ ์ž…์žฅ์—์„œ๋Š” ๊ณ„์‚ฐ์ด 2๋ฐฐ ์ด์ƒ์€ ๋“ค์–ด๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ์•„๋ž˜ ์ˆ˜์‹์€ weight ๋‹จ์œ„๋กœ ํ•œ ๋‹จ๊ณ„์”ฉ ๋” ๋ฏธ๋ถ„๊ฐ’์ด gradient๋กœ ์ „ํŒŒ๊ฐ€ ๋˜์–ด์„œ ์‹ค์ œ weight update ๊ฐ’์„ ๊ตฌํ–ˆ๋‹ค๊ณ  ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

 

 

 

RNN Structure

 

RNN์—๋Š” ๊ต‰์žฅํžˆ ๋งŽ์€ structure๊ฐ€ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

ํ•˜๋‚˜์”ฉ ๋ด…์‹œ๋‹ค.

 

Bidirectional

This structure is bidirectional; these days the bidirectional form is used more often than the plain RNN.

 

Here we look at a structure in which the hidden states influence not only the forward direction but the backward direction as well.

 

The state is affected not only by past data but by future data too.

 

The hidden states h and g are both hidden states and make equal contributions.
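The two-direction idea can be sketched as two separate recurrences: one pass left-to-right producing h, one right-to-left producing g, combined per step. Sizes and names are toy assumptions.

```python
# Sketch of a bidirectional pass: a forward recurrence (h) and a
# backward recurrence (g) over the same inputs, combined per step.
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 2, 3
Wf, Uf = rng.normal(size=(n_hid, n_hid)), rng.normal(size=(n_hid, n_in))
Wb, Ub = rng.normal(size=(n_hid, n_hid)), rng.normal(size=(n_hid, n_in))

def run(xs, W, U):
    h, out = np.zeros(n_hid), []
    for x in xs:
        h = np.tanh(W @ h + U @ x)
        out.append(h)
    return out

xs = rng.normal(size=(4, n_in))
hs = run(xs, Wf, Uf)              # forward states h_1..h_4 (past context)
gs = run(xs[::-1], Wb, Ub)[::-1]  # backward states g_1..g_4 (future context)
combined = [np.concatenate([h, g]) for h, g in zip(hs, gs)]
print(combined[0].shape)  # both directions contribute equally per step
```

Each step's representation now depends on both past data (through h) and future data (through g).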

 

 

Encoder-Decoder

 

Let's look at the encoder-decoder structure.

 

Since it takes in a sequence and produces a sequence,

 

it is also called a Seq2Seq structure.

 

It is widely used in machine translation and question answering.

 

 

์œ„์™€ ๊ฐ™์ด ์ž…๋ ฅ์œผ๋กœ sequence ํ•˜๋‚˜๊ฐ€ ํ†ต์ฑ„๋กœ ๋“ค์–ด์˜ต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์ด hidden์—์„œ ๊ณ„์† ์Œ“์—ฌ์„œ ํ•˜๋‚˜์˜ context๋ฅผ ์ด๋ฃน๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์ด ์ญ‰ ํ’€๋ฆฌ๋ฉฐ ์ถœ๋ ฅ sequence๋ฅผ ๋งŒ๋“ค์–ด๋‚ด๋Š” ๊ตฌ์กฐ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณผ ์œ„์™€ ๊ฐ™์ด ์ž…๋ ฅ์€ n x, ์ถœ๋ ฅ์€ n y๋กœ ๋‘ ๊ธธ์ด๋Š” ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

encoder์—์„œ๋Š” sequence๋ฅผ ๋ชจ์œผ๊ธฐ๋งŒํ•˜๊ณ ,

 

์ด๊ฒƒ์ด context๊ฐ€ ๋˜์–ด์„œ ์ถœ๋ ฅ์—์„œ๋Š” ๋ฑ‰์–ด๋‚ด๊ธฐ๋งŒ ํ•ฉ๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋Ÿฐ๋ฐ, ์ด ๊ธธ์ด๋Š” ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ ๊ฐ’์„ ๋ชจ์œผ๋Š” ๊ณผ์ •์„ Encoding, ๊ฐ’์„ ๋ฑ‰์–ด๋‚ด๋Š” ๊ณผ์ •์„ Decoding์ด๋ผ๊ณ  ํ•ฉ๋‹ˆ๋‹ค.

 

 

Deep RNN

 

RNN์—์„œ๋„ Deepํ•œ ๊ตฌ์กฐ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์•„๋ž˜ ๊ทธ๋ฆผ์„ ๋ด…์‹œ๋‹ค.

 

 

๊ทธ๋ฆผ์—์„œ ๋ณด๋ฉด,

 

์ฒซ ๋ฒˆ์งธ๋Š” input -> output ๊ฐ€๋Š” ์ˆœํ™˜ ๊ณผ์ •์„ Deepํ•˜๊ฒŒ ๋งŒ๋“  ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

layer๋ฅผ ๋Š˜๋ ค์„œ ๋ง์ด์ฃ .

 

๊ทธ๋ฆฌ๋„ ๋‘ ๋ฒˆ์งธ๋Š”,

 

recurrence์— depth๋ฅผ ์ฃผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์„ธ ๋ฒˆ์งธ๋Š” skip connection์œผ๋กœ ์ˆœํ™˜์„ ์—ฌ๊ธฐ ์ €๊ธฐ ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

skip connection์€ depth๊ฐ€ ๊ธธ์–ด์ง€๋‹ˆ ๊ณ„์†ํ•ด์„œ ์ดˆ๊ธฐ ์ •๋ณด๋ฅผ ๊ธฐ์–ต์‹œ์ผœ์ฃผ๊ธฐ ์œ„ํ•จ์ž…๋‹ˆ๋‹ค.

 

๊ธด chain์˜ ์ •๋ณด ์†์‹ค์„ ๋ฐฉ์ง€ํ•ด์ฃผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆผ์€ ์œ„์™€ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

 

Single character example

 

 

The words are one-hot encoded.

 

These days, word embeddings are used more often.

 

As above, the word itself is turned into a vector and fed in as input.

 

And, as above, the model produces a probability value for the output.

 

Also note how the weights are shared across time through parameter sharing, as shown above.

 

 

RNN์˜ ์—ฌ๋Ÿฌ ๊ตฌ์กฐ๋ฅผ ๋‹ค์‹œ ์‚ดํŽด๋ด…์‹œ๋‹ค.

 

 

 

  • one to one
    • MLP
  • one to many
    • Image captioning
  • many to one
    • Text classification
    • Sentiment analysis
  • many to many
    • Machine Translator

 

 

Sentiment analysis

-> ๊ธ์ • or ๋ถ€์ •

 

 

Machine Translator

 

์œ„์™€ ๊ฐ™์ด encoder์—์„œ ์ •๋ณด๋ฅผ ๋ชจ๋‘ ๋ชจ์•„ decoder์—์„œ ํ’€์–ด๋ƒ…๋‹ˆ๋‹ค.

 

decoder์—์„œ input signal์— ๋Œ€ํ•ด ๋‚˜์˜จ Output์„ ๋‹ค์‹œ ์ž…๋ ฅ์œผ๋กœ ๋„ฃ์–ด์ฃผ๋Š” ํ˜•ํƒœ๋ฅผ ๋ฐ˜๋ณตํ•ฉ๋‹ˆ๋‹ค.

 

 

 

Image captioning

 

input์€ image์œผ๋กœ,

 

representation learning์œผ๋กœ classification ํ›„ ๋งˆ์ง€๋ง‰์— class์— ๋Œ€ํ•œ softmax ํ™•๋ฅ ๊ฐ’์ด ๋“ค์–ด์˜ต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ ์ž…๋ ฅ์œผ๋กœ ๋„ฃ์–ด์„œ translator ๋Œ๋ฆฌ๋“ฏ์ด ๋Œ๋ฆฝ๋‹ˆ๋‹ค.

 

๊ทธ๊ฒƒ์„ ๋‹จ์–ด๋กœ ํ’€์–ด์„œ ๋‚ด๋ฑ‰๋Š” ์ˆœํ™˜ ์‹ ๊ฒฝ๋ง์„ ๋ถ™์ด๋Š” ๊ตฌ์กฐ๋„ ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.
