[Deep Learning] Deep Neural Network (4)

2023. 4. 19. 15:44
🧑🏻‍💻 Key Terms

Neural Networks
Feed-forward
Backpropagation
Deep Neural Network
Optimization

 

 

 

 

 

Optimization

 

์ž, ์„ธ ๋ฒˆ์งธ ๋ฌธ์ œ์— ๋Œ€ํ•ด ์–˜๊ธฐํ•ด ๋ด…์‹œ๋‹ค.

 

Let's first look at optimization problems in general.

 

As the figure below shows, increasing efficiency lowers accuracy. In other words, the two are in a trade-off relationship.

 

Ultimately, efficiency here is something akin to speed.

 

์ •ํ™•๋„๋ฅผ ์˜ฌ๋ฆฌ๊ณ ์‹ถ์œผ๋ฉด ์†๋„๊ฐ€ ์กฐ๊ธˆ ๋” ๋‚ฎ์•„์ง€๊ฒŒ ๋˜์ฃ .

 

Now think of algorithms such as approximation or heuristic methods.

 

These are methods designed to increase speed, that is, to find a solution more efficiently.

 

Naturally, their accuracy is bound to be lower than that of approaches like full search.

 

๊ทธ๋Ÿฐ๋ฐ, Deep Learning ๋ถ„์•ผ์—์„œ์˜ Optimization์€ 

 

ํšจ์œจ์„ฑ์„ ์ฆ๊ฐ€์‹œํ‚ค๋Š” ๊ฒƒ ์ž์ฒด๊ฐ€, ์šฐ๋ฆฌ๊ฐ€ training์„ ๋” ๋งŽ์ด ํ•ด๋ณผ ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ด ๋˜๊ณ , ๋”ฐ๋ผ์„œ ์ •ํ™•๋„๊ฐ€ ์ฆ๊ฐ€ํ•œ๋‹ค๋Š” ์˜๋ฏธ๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.



1 epoch์€ ์ „์ฒด training data๋ฅผ ๋‹ค ๋ณธ ์ƒํƒœ์ž…๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ ์—ฌ๋Ÿฌ ๋ฒˆ iterationํ•˜๋ฉฐ ๋ฐ์ดํ„ฐ๋ฅผ ์—ฌ๋Ÿฌ ๋ฒˆ ๋ด์•ผ ์ •ํ™•๋„๊ฐ€ ์˜ฌ๋ผ๊ฐ‘๋‹ˆ๋‹ค.

 

 

Ultimately, Deep Learning is about running as many epochs over the training data as efficiently as possible.

 

Only then can we obtain a higher-performing model.

 

 

So let's examine the two graphs above.

As the number of epochs increases, the error rate appears to converge.

 

And the gradient norm here is the L2 norm, which is built from the squared components of the gradient.
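For reference, the L2 norm of a gradient vector $g$ is:

$$\|g\|_2 = \sqrt{\sum_i g_i^2}$$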

 

And so we can see the gradient growing large, as shown above.

 

Source: https://ai.stackexchange.com/questions/28365/is-y-i-hat-y-ix-i-part-of-the-formula-for-updating-weights-for-perceptro

 

gradient์˜ ํฌ๊ธฐ๊ฐ€ ํฌ๋‹ค๋Š” ์–˜๊ธฐ๋Š”, ๋งค step ๋งˆ๋‹ค weight๊ฐ€ ๋งŽ์ด update๋˜๊ณ  ์žˆ๋‹ค๋Š” ์˜๋ฏธ์ž…๋‹ˆ๋‹ค.

 

So then:

 

error rate์€ ์ˆ˜๋ ดํ•˜๋Š” ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด์ง€๋งŒ, weight๋Š” ์•„์ง๋„ ๋งŽ์ด update ๋˜๊ณ  ์žˆ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด, ์ตœ์ ํ™” ์ž์ฒด๋Š” ์ˆ˜๋ ด์ด ์•ˆ ๋์ง€๋งŒ, error rate๊ฐ€ ์ˆ˜๋ ด์ด ๋œ ๊ฒƒ์ฒ˜๋Ÿผ ๋ณด์ด๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

For now, the takeaway is this: use an efficient algorithm to run as many epochs as possible over as much data as possible.

 

 

Stochastic Gradient Descent

 

The Gradient Descent we have discussed so far comes in two versions:

 

  • Online learning
  • Batch learning

 

 Batch learning์€ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ ๋ฒˆ ๋‹ค ๋ณด๊ณ  weight๋ฅผ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๊ฒƒ์ด๋ผ๋ฉด, online learning์€ ๋ฐ์ดํ„ฐ๋ฅผ ํ•œ point๋งŒ ๋ณด๊ณ  updateํ•ฉ๋‹ˆ๋‹ค.

 

 

๊ทธ๋Ÿฐ๋ฐ, ์šฐ๋ฆฌ๊ฐ€ ์ฃผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์€, ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ small size์˜ subset data๋ฅผ ๋ฝ‘์•„๋‚ธ ๋‹ค์Œ, ๊ทธ๊ฒƒ์„ batch ์‚ผ์•„ ํ•™์Šต์„ ํ•˜๋Š” ๊ฒƒ์ด mini batch ๋ฐฉ์‹์ž…๋‹ˆ๋‹ค.

 

์ด๋Ÿฌํ•œ ๋ฐฉ์‹์ด Stochastic Gradient Descent๋ผ๊ณ  ๋ณด์‹œ๋ฉด ๋ฉ๋‹ˆ๋‹ค.

 

So, as written in the algorithm above, we draw a sample and use it as the training data for that update.
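Here is a minimal, self-contained sketch of mini-batch SGD on a toy linear regression problem (the data, model, and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                  # toy inputs
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)    # toy targets

w = np.zeros(3)     # weights to learn
lr = 0.1            # learning rate (epsilon)
batch_size = 32

for epoch in range(20):
    idx = rng.permutation(len(X))               # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]       # sample a mini-batch
        err = X[b] @ w - y[b]
        grad = X[b].T @ err / len(b)            # mini-batch gradient of the MSE loss
        w -= lr * grad                          # SGD update

print(w)  # approaches true_w as the epochs accumulate
```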

 

๊ทธ๋ ‡๋‹ค๋ฉด ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์งˆ ๊ฑฐ๋ผ๊ณ  ์ƒ๊ฐํ•  ํ…Œ์ง€๋งŒ, iteration์„ ๋งŽ์ด ํ•  ์ˆ˜๋ก upper bound๊ฐ€ ์ค„์–ด๋“ญ๋‹ˆ๋‹ค.

 

 

Recent Deep Learning algorithms take SGD as a given baseline.

 

It is far more efficient, yet the accuracy gap is not large, provided you run enough epochs.

 

 

Momentum

 

Weight update = current gradient + past gradient

 

์ด๊ฒƒ์€ NN์— ์›๋ž˜ ์žˆ๋Š” ๊ฐœ๋…์ž…๋‹ˆ๋‹ค.

 

์•„๋ž˜ ๊ทธ๋ฆผ๊ณผ ๊ฐ™์ด weight๊ฐ€ ์—ฌ๊ธฐ ์ €๊ธฐ๋กœ update ๋˜์ง€๋งŒ, ๋นจ๊ฐ„์ƒ‰์ฒ˜๋Ÿผ update๊ฐ€ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ฆ‰, ์›๋ž˜ update ๋˜์–ด์•ผ ํ•˜๋Š” ๊ฒƒ์— ํ˜„์žฌ momentum์˜ ํž˜์„ ๋”ํ•ด์ฃผ๋Š” ๊ฒƒ์ด์ฃ .

 

 

Momentum์ด ์žˆ๋‹ค๋ฉด ์ตœ์ข… ์ง€์ ๊นŒ์ง€ ๊ฐ€๋Š” ์†๋„๊ฐ€ ํ›จ์”ฌ ๋น ๋ฅด๊ฒ ์ฃ ?

 

Looking at it simply, the term inside the parentheses in the formula is the gradient.

 

It simply denotes the mini-batch gradient.

 

At step 1, v would have been 0,

so v = −ε · (gradient).
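Written out (in the standard momentum formulation, which matches the symbols used here: velocity $v$, momentum coefficient $\alpha$, learning rate $\varepsilon$):

$$v \leftarrow \alpha v - \varepsilon \, \nabla_\theta L, \qquad \theta \leftarrow \theta + v$$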

 

๊ทธ๋Ÿฐ๋ฐ,

step 2 ๋ถ€ํ„ฐ๋Š” v๊ฐ€ ์œ„์—์„œ ์ •ํ•ด์งˆ ๊ฒƒ์ด๋ฏ€๋กœ,

v๊ฐ€ ๊ณ„์†ํ•ด์„œ ๋ˆ„์ ์ด ๋˜์–ด์„œ ์„ธํƒ€์— ๋ฐ˜์˜๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋‹ˆ, ์ด์ „ ๊ณผ๊ฑฐ์˜ ์‹œ์ ์˜ gradient์ผ์ˆ˜๋ก ์•ŒํŒŒ๊ฐ€ ๋งŽ์ด ๊ณฑํ•ด์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ ์•ŒํŒŒ๋Š” 0 ~ 1 ์‚ฌ์ด์˜ ์ˆ˜์ด๋ฏ€๋กœ, ๊ทธ๋Ÿฌ๋ฏ€๋กœ, ์ด์ „ ๊ณผ๊ฑฐ ์‹œ์ ์ผ์ˆ˜๋ก ์•ŒํŒŒ๊ฐ€ ๊ณ„์† ๊ณฑํ•ด์ ธ ๊ณ„์† ์ž‘์•„์ง‘๋‹ˆ๋‹ค.

 

 ๊ทธ๊ฒƒ์€ "๊ณผ๊ฑฐ ์‹œ์ ์˜ gradient๋Š” momentum์— ์ข€ ์ ๊ฒŒ ๋ฐ˜์˜ํ•˜์ž."๋ผ๋Š” ์˜๋ฏธ๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

 

This technique is called exponential smoothing.
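Unrolling the recursion makes the smoothing explicit: each past gradient $g_k$ picks up one more factor of $\alpha$ the older it is:

$$v_t = -\varepsilon \sum_{k=1}^{t} \alpha^{\,t-k} g_k$$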

 

 

๊ทธ๋ฆฌ๊ณ  ๋˜ ๋‹ค๋ฅธ ํ•ด๋ฒ•์„ ์•Œ์•„๋ด…์‹œ๋‹ค.

 

Adaptive Learning Rate

 

์šฐ๋ฆฌ์˜ Hyperparameter ์—ํƒ€(๋˜๋Š” ์•ฑ์‹ค๋ก )๋ฅผ ์ดˆ๊ธฐ ๊ฐ’์—์„œ ๊ณ„์†ํ•ด์„œ ์ค„์—ฌ์คŒ์œผ๋กœ์จ,

 

๊ฐ’์„ ์žฌ์„ค์ •ํ•ฉ๋‹ˆ๋‹ค.

 

์ฆ‰, ์ฒ˜์Œ์—๋Š” 0.1๋กœ ์„ค์ •ํ–ˆ๋‹ค๊ฐ€, ๋ชฉํ‘œ ์—ํƒ€๋ฅผ ์„ค์ •ํ•˜์—ฌ 0.01๊นŒ์ง€ ์ค„์ผ ์ˆ˜ ์žˆ๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

 

์•ž ์˜ˆ์‹œ์—์„œ ๋ดค ๋“ฏ, epoch์ด ๋Š˜์–ด๋‚˜๋„ gradient๊ฐ€ ์ฆ๊ฐ€ํ•˜์—ฌ ์ˆ˜๋ ดํ•˜์ง€ ์•Š๋Š” ๊ฒฝ์šฐ์— ๋Œ€ํ•ด, ์—ํƒ€์˜ ๊ฐ’์„ ๋งค์šฐ ์ž‘๊ฒŒ ํ•˜์—ฌ ์–ด์ฐŒ๋๋“  ์ˆ˜๋ ดํ•˜๊ฒŒ ๋งŒ๋“œ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

ํ•ญ์ƒ ์ž‘๊ฒŒํ•œ๋‹ค๋Š” ๊ฒƒ์€ ๋™์ผํ•ฉ๋‹ˆ๋‹ค.

 

However, when the gradient is large, we help convergence by making the learning rate small,

 

and when the gradient is already small, the model is already training well, so we help convergence by making the learning rate "relatively" large.

 

With this in mind, we can examine the AdaGrad algorithm.

In the algorithm above, everything up to the red box is plain Stochastic Gradient Descent.

 

So it first computes the mini-batch gradient.

 

๊ทธ๋ฆฌ๊ณ , r์ด ๋“ฑ์žฅํ•˜๊ณ  r ์€ r + g (element-wise) g์ž…๋‹ˆ๋‹ค. // ๋ชจ๋“  element๋ฅผ ๋‹ค ๊ณฑํ•˜๋‹ˆ ์–‘์ˆ˜๊ฐ€ ๋ฉ๋‹ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‹ˆ r์€ ๋ฌด์กฐ๊ฑด ์–‘์ˆ˜๋กœ ์ ์  ์ปค์ง€๋Š” ์ˆซ์ž์ž…๋‹ˆ๋‹ค.

 

In a sense, you can view this as weighting the gradient by its norm, that is, by its magnitude.

 

θ is then updated as shown above.

 

Up to multiplying ε by the gradient it is the same, but now the denominator contains δ + √r.
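Putting the two steps together, in the standard AdaGrad form that matches these symbols:

$$r \leftarrow r + g \odot g, \qquad \theta \leftarrow \theta - \frac{\varepsilon}{\delta + \sqrt{r}} \odot g$$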

 

๊ทธ๋Ÿฌ๋ฏ€๋กœ, r์€ ๊ณ„์†ํ•ด์„œ ์ปค์ง€๋ฏ€๋กœ, ์ด ์„ธํƒ€๋Š” ํ•ญ์ƒ ์ž‘์•„์งˆ ์ˆ˜๋ฐ–์— ์—†์Šต๋‹ˆ๋‹ค.

 

Here, if the gradient is a large number, r grows larger too.

 

gradient๊ฐ€ ์ž‘์€ ์ˆ˜๋ผ๋ฉด, r์€ ์ƒ๋Œ€์ ์œผ๋กœ ์ข€ ๋œ ์ปค์ง‘๋‹ˆ๋‹ค. 

 

๊ทธ๋ฆฌ๊ณ  ๋ธํƒ€๋Š” ์ƒ์ˆ˜์ž…๋‹ˆ๋‹ค. numerical stability๋ฅผ ์œ„ํ•ด์„œ ์กด์žฌํ•˜๋ฉฐ, 0์— ๊ฐ€๊นŒ์šด ์ˆ˜๋กœ ๋‚˜๋ˆ ์ง€๋ฉด ์•ˆ ๋˜๋‹ˆ ์–ด๋Š ์ •๋„ ์ž‘์€ ์ƒ์ˆ˜๋ฅผ ๋‘๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค. ์ˆ˜ํ•™์  ๊ณ„์‚ฐ์˜ Exception์„ ์—†์• ์ฃผ๊ธฐ ์œ„ํ•จ์ด์ฃ .

 

๊ฒฐ๊ตญ r์€ gradient์˜ ํฌ๊ธฐ ๋ˆ„์ ์ด๋ฏ€๋กœ,

 

gradient์˜ ํฌ๊ธฐ ๋ˆ„์ ์œผ๋กœ ๋‚˜๋ˆˆ๋‹ค๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ฒฐ๊ตญ, ์˜๋ฏธ ์žˆ๋Š” ๋ถ€๋ถ„์€ ๊นŠ๊ฒŒ ์ฒœ์ฒœํžˆ, ์˜๋ฏธ ์—†๋Š” ๋ถ€๋ถ„์€ ๋นจ๋ฆฌ ๋นจ๋ฆฌ ์ง€๋‚˜๊ฐ€๋„๋ก ํ•˜๋Š” ๊ฒƒ์ด Adaptive learning rate์ž…๋‹ˆ๋‹ค.

 

 

Adam Optimizer

 

The final boss combining this adaptive learning rate and momentum is Adam optimization.

 

์ด๊ฒƒ์€ adative learning๊ณผ momentum๋ฅผ ๋‘˜ ๋‹ค ์‚ฌ์šฉํ–ˆ๋”๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 



 

 

algorithm์„ ๋ด…์‹œ๋‹ค.

 

The three lines starting from the Sample part are Stochastic Gradient Descent.

 

Then s and r appear.

 

And the ρ (rho) values appear; they are attached because of exponential smoothing.

 

That is, the direction is to reflect recent gradients strongly and forget older ones.

 

์ด๊ฒƒ์€ s์™€ r์— ๊ณตํ†ต์ ์œผ๋กœ ์ ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

 

Then there is a part called bias correction.

 

In the early steps, the initial gradients get reflected at a very small scale because of the ρ values.

 

So, to prevent the gradients from being reflected too weakly at the start,

 

a bias correction term is introduced (we divide by 1 − ρᵗ, where t is the step number).

 

Letting us approach the solution faster: that is what bias correction does.
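Concretely, in the standard Adam notation (with smoothing rates $\rho_1, \rho_2$ and step count $t$):

$$\hat{s} = \frac{s}{1 - \rho_1^{\,t}}, \qquad \hat{r} = \frac{r}{1 - \rho_2^{\,t}}$$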

 

So, to recap:

 

1. SGD

2. Momentum

3. Adaptive

4. Adam

 

In other words, using 2 and 3 at the same time is Adam.

 

gradient๊ฐ€ ํฌ๋ฉด ํด์ˆ˜๋ก ๋” ์กฐ๊ทธ๋งฃ๊ฒŒ ๋ณด๊ณ , ์ž‘์œผ๋ฉด ์ƒ๋Œ€์ ์œผ๋กœ ๋œ ํฌ๊ฒŒ ๋ณด๊ณ ,

 

๊ทธ๋ฆฌ๊ณ  Momentum์„ ํ†ตํ•ด์„œ ํ˜„์žฌ์˜ gradient๋งŒ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ , ๊ณผ๊ฑฐ์˜ gradient๋ฅผ ๋‹ค exponential smoothing์„ ํ•œ ๊ทธ ํ•ฉ์„ ํ˜„์žฌ์˜ gradient์ฒ˜๋Ÿผ ์น˜ํ™˜ํ•ด์„œ ์”€์œผ๋กœ์จ ๊ณผ๊ฑฐ๋กœ๋ถ€ํ„ฐ์˜ ๊ฐ’๋“ค์„ ๋ชจ๋‘ ๋ฐ˜์˜์‹œํ‚ต๋‹ˆ๋‹ค.

 

In the end, the Adam Optimizer uses adaptive learning rate and momentum at the same time.

 

SGD is the baseline it builds on, and bias correction was added to gain speed.
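Putting it all together, here is a minimal Adam sketch in the standard formulation (the toy problem and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -1.0, 2.0])    # toy targets

theta = np.zeros(3)
eps, rho1, rho2, delta = 0.05, 0.9, 0.999, 1e-8
s = np.zeros(3)   # momentum side: smoothed gradient (1st moment)
r = np.zeros(3)   # adaptive side: smoothed squared gradient (2nd moment)

for t in range(1, 301):
    b = rng.choice(len(X), size=32)
    g = X[b].T @ (X[b] @ theta - y[b]) / 32     # mini-batch gradient (SGD part)
    s = rho1 * s + (1 - rho1) * g               # exponential smoothing
    r = rho2 * r + (1 - rho2) * g * g
    s_hat = s / (1 - rho1 ** t)                 # bias correction
    r_hat = r / (1 - rho2 ** t)
    theta -= eps * s_hat / (np.sqrt(r_hat) + delta)

print(theta)  # approaches [1, -1, 2]
```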

 

Batch Normalization

 

input์„ ๋„ฃ์–ด์ค„ ๋•Œ Neural Network์—์„œ Normalization์„ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋‚˜, NN์€ k-nearest neighbor์™€ ๊ฐ™์€ distance ๊ธฐ๋ฐ˜ ๋ฐฉ์‹์€ normalization์ด ํ•„์ˆ˜์ ์ด์ง„ ์•Š์Šต๋‹ˆ๋‹ค.

 

์‚ฌ์‹ค NN์€ Normalization ์•ˆ ํ•ด๋„ ์ž‘๋™ํ•˜๊ฒŒ ๋˜์–ด์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿผ์—๋„ ํ•˜๋Š” ๊ฒŒ ๋” ์ข‹๋‹ค๊ณ  ์ตœ๊ทผ์— ๋ณด๊ณ ๊ฐ€ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

Also, since each variable gets scaled, normalization is known to reduce the number of iterations needed for weight updates.

 

It is also advised to make the data zero-centered when possible.

 

๊ทธ๊ฒƒ์„ ํ†ตํ•ด signal์ด ์ ๋ ค ReLU์™€ ๊ฐ™์€ Activation func.์— ์˜ํ•ด ๊ฐ’์ด sparseํ•˜๊ฒŒ ์ง„์งœ ์ค‘์š”ํ•œ ๋ถ€๋ถ„๋งŒ activation ํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

Now, if we normalize only the input, the hidden layers end up with no normalization effect at all.

 

๊ทธ๋ž˜์„œ ์ด hidden๋„ normalizatiion์„ ํ•ด์ค˜์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด batch normalization์ž…๋‹ˆ๋‹ค.

 

 

So we normalize per mini-batch.

 

Since we update the weights per mini-batch anyway, this is only natural.

 

๊ทธ๋Ÿฌ๋‹ˆ ๊ฐ๊ฐ์˜ hidden node์˜ output ๊ฐ’๋“ค์„ normalizationํ•ด์ฃผ์ž๋Š” ๊ฒƒ์ด์ฃ .

 

 

Next, let's look at the concept of internal covariate shift.

 

The point is that in practice, weight updates keep piling up as training proceeds.

 

That is, the amount of each weight update is computed statically, but the updates themselves accumulate dynamically.

 

 

๊ทธ๋Ÿฐ๋ฐ normalization์„ ํ•œ๋‹ค๋ฉด, ๋‹ค์‹œ zero centered๋กœ ๋งž์ถฐ์ฃผ๊ธฐ ๋•Œ๋ฌธ์— ๋‹ค์‹œ ๋Œ๊ณ ์˜ค๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ฒฐ๊ตญ, ์ˆ˜์‹๊ณผ ๊ฐ™์ด ํ‰๊ท ๊ณผ ๋ถ„์‚ฐ์„ ๊ตฌํ•˜๊ณ  z-transformํ•œ ๊ฒƒ์„ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

The normalization is applied to the pre-activation values.
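Written out for a mini-batch $\{x_1, \dots, x_m\}$ of pre-activation values ($\delta$ is a small stability constant, in the same role as in AdaGrad; the learnable scale/shift pair $\gamma, \beta$ belongs to the standard batch normalization formulation and is included for completeness):

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}(x_i - \mu_B)^2, \qquad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \delta}}, \qquad y_i = \gamma \hat{x}_i + \beta$$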

 

 

So far, we have covered Deep Learning, that is, the Deep Neural Network.

 

 


