[Training Neural Networks] part 2

2023. 1. 24. 00:21
🧑🏻‍💻 Key Terms

Gradient Descent
MNIST handwritten digit classification
Back propagation
Computational graph
sigmoid
tanh
ReLU
Batch Normalization

Gradient Descent

 

  • Gradient descent follows the gradient of the loss function to find the parameter values that minimize it.
  • However, using plain (vanilla) gradient descent as-is can be inefficient.
  • For that reason, many variants of the gradient descent algorithm exist (a minimal sketch of the basic update is shown below).
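
As a minimal sketch of the basic update (the toy loss, starting point, and learning rate below are illustrative assumptions, not from the original post):

```python
# Gradient descent on a toy 1-D loss L(w) = (w - 3)^2.
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw for the toy loss above
    return 2.0 * (w - 3.0)

w = 0.0     # starting parameter value
lr = 0.1    # learning rate
for step in range(50):
    w -= lr * grad(w)   # step against the gradient

print(w, loss(w))       # w approaches 3, the minimizer of the loss
```

Variants such as SGD with momentum, RMSProp, and Adam keep this same basic step but change how the gradient is accumulated and scaled.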

 

 

 

MNIST handwritten digit classification

-> example of applying an MSE loss

 

Goal: learn, through training, the parameters that live in each layer.

  • Perform forward propagation on the training data.
  • The parameters start from a random initialization.
  • At first, the predictions therefore differ from the ground-truth values of the input data.
  • We define a loss function that measures this difference and train in the direction that keeps reducing it.
  • From the loss function we compute partial derivatives, and those partial-derivative values are used to update the parameters.

 

In the end, the partial derivatives of the loss function are computed through a process called back propagation, which performs sequential layer-by-layer computations in the reverse direction of forward propagation in the neural network; this gives us the gradients of the parameters, i.e., the partial derivatives of the loss function with respect to them.
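
A compact sketch of one such forward/backward pass for a tiny two-layer network with an MSE loss (the layer sizes, random toy data, and sigmoid hidden activation are illustrative assumptions, not the exact MNIST setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for MNIST: 4 samples, 784 input features, 10 classes.
X = rng.normal(size=(4, 784))
Y = np.eye(10)[[0, 1, 2, 3]]          # one-hot ground-truth labels

# Random initialization of the parameters in each layer.
W1 = rng.normal(scale=0.01, size=(784, 32))
b1 = np.zeros(32)
W2 = rng.normal(scale=0.01, size=(32, 10))
b2 = np.zeros(10)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

lr = 0.5
for step in range(200):
    # Forward propagation, layer by layer.
    h = sigmoid(X @ W1 + b1)
    y_hat = h @ W2 + b2               # predictions: at first far from Y

    loss = np.mean((y_hat - Y) ** 2)  # MSE loss

    # Back propagation: the same layers, visited in reverse order.
    d_yhat = 2.0 * (y_hat - Y) / Y.size
    dW2 = h.T @ d_yhat
    db2 = d_yhat.sum(axis=0)
    dh = d_yhat @ W2.T
    dz1 = dh * h * (1.0 - h)          # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dW1 = X.T @ dz1
    db1 = dz1.sum(axis=0)

    # Update every parameter with its gradient (gradient descent).
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(loss)                           # the loss decreases over the iterations
```

The backward block visits the same layers in reverse order, which is exactly the layer-by-layer chain-rule computation described above.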

 

 

The sequential computation performed by a neural network can be represented as a single computational graph.

 

 

Source: https://math.stackexchange.com/questions/3101031/library-for-visualizing-computation-graph

์ฒ˜์Œ์˜ Parameter๊นŒ์ง€ ํŽธ๋ฏธ๋ถ„ ๊ณผ์ •์„ ๊ฑฐ์ณ ์˜ฌ๋ผ๊ฐ€๊ฒŒ ๋œ๋‹ค๋ฉด,

 

์ฒ˜์Œ์˜ parameter ๊ฐ’์—, learning rate ๊ฐ’๊ณผ Gradient ๊ฐ’์„ ๋”ํ•œ ๊ฒƒ์„ ๋นผ์„œ parameter ๊ฐ’์„ updateํ•ฉ๋‹ˆ๋‹ค.

 

Some nodes in the graph have gradient values that are computed as intermediate results, but those values are not used directly in the gradient-descent update, because such nodes are not parameters.

 

When a group of operations can be expressed as the output of a single function, then during the gradient computation we can differentiate that one function analytically, instead of splitting it into its elementary operations and back-propagating through each one separately.
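
A small sketch of node-by-node back propagation through a computational graph for f = sigmoid(w·x + b), treating the sigmoid as a single node with a known derivative (the input and parameter values are illustrative):

```python
import math

# Forward pass through the graph, node by node: u = w*x, v = u + b, f = sigmoid(v)
x, w, b = 2.0, -1.0, 0.5
u = w * x
v = u + b
f = 1.0 / (1.0 + math.exp(-v))   # sigmoid treated as a single node

# Backward pass: visit the same nodes in reverse order, applying the chain rule.
df_dv = f * (1.0 - f)            # local derivative of the sigmoid node: f(1 - f)
df_du = df_dv * 1.0              # the '+' node just passes the gradient through
df_db = df_dv * 1.0
df_dw = df_du * x                # '*' node: du/dw = x
df_dx = df_du * w                # computed along the way, but x is not a parameter

print(df_dw, df_db)              # gradients used to update the parameters w and b
```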

 

 

 

Sigmoid Activation

Source: https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6

 

 

The sigmoid is a smooth approximation of the earlier scheme in which a hard threshold was applied to the linear combination of the input signals to produce the final output.

 

  • It maps the result of the linear combination, which can range from minus infinity to plus infinity, into a value between 0 and 1.
  • In logistic regression, this value is often interpreted as the predicted probability of the positive class.
  • During back propagation, the derivative of the sigmoid lies between 0 and 1/4; this factor multiplies the gradient flowing to the input, and since it is much smaller than 1, the gradient shrinks every time back propagation passes through a sigmoid node (see the numeric sketch after this list).
  • When back propagation passes through sigmoid nodes in layer after layer, the gradient keeps getting whittled down.
  • As a result, the gradient approaches 0, which causes problems.
    • Concretely, the gradients that reach the parameters in the early layers become extremely small, and once the learning rate is applied, those parameters are barely updated at all.
    • Consequently, the overall training of the neural network slows down.
    • This is called gradient vanishing (the vanishing gradient problem).
  • To address this problem, the following activation functions were proposed.
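
A quick numeric sketch of why this happens: the sigmoid derivative never exceeds 1/4, so stacking layers multiplies many small factors together (the depth of 10 below is an arbitrary illustrative choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-10.0, 10.0, 1001)
deriv = sigmoid(z) * (1.0 - sigmoid(z))   # derivative of the sigmoid
print(deriv.max())                        # ~0.25: the derivative never exceeds 1/4

# Even in the best case, back-propagating through 10 stacked sigmoid nodes
# multiplies the gradient by at most 0.25 ** 10.
print(0.25 ** 10)                         # ~9.5e-07
```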

 

Tanh Activation

Source: https://arshad-kazi.com/activation-functions/

 

Using the tanh function in place of the sigmoid activation function:

 

  • ๋งˆ์ด๋„ˆ์Šค ๋ฌดํ•œ๋Œ€๋ถ€ํ„ฐ ๋ฌดํ•œ๋Œ€๊นŒ์ง€์˜ ๋ฒ”์œ„๋ฅผ ๋งˆ์ด๋„ˆ์Šค 1๋ถ€ํ„ฐ 1์‚ฌ์ด์˜ ๋ฒ”์œ„๋กœ Mapping ์‹œ์ผœ์ฃผ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ํ•™์Šต์— ์žˆ์–ด ์†๋„๊ฐ€ ๋” ๋น ๋ฆ…๋‹ˆ๋‹ค.
  • ์ด tanh activation function์„ ํŽธ๋ฏธ๋ถ„ํ•˜๋ฉด 0์—์„œ 1/2 ์‚ฌ์ด์˜ ๊ฐ’์ด ๋‚˜์˜ค๋ฏ€๋กœ Back propagationํ•  ๋•Œ๋งˆ๋‹ค gradient ๊ฐ’์ด ํ›จ์”ฌ ๊นŽ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ๋ž˜์„œ Gradient vanishing ๋ฌธ์ œ๋ฅผ ์—ฌ์ „ํžˆ ๊ฐ€์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • layer๋“ค์ด ๋งŽ์ด ์Œ“์˜€์„ ๋•Œ, gradient vanishing ๋ฌธ์ œ๋ฅผ ๊ทผ๋ณธ์ ์œผ๋กœ ์ œํ•œํ•  ์ˆ˜ ์žˆ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ Activation function์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.
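
A short check of that claim (the input points are chosen arbitrarily for illustration): the tanh derivative peaks at 1 rather than 1/4, but it still decays toward 0 for saturated inputs, so deep stacks of tanh layers can still lose gradient:

```python
import numpy as np

z = np.array([0.0, 1.0, 2.0, 4.0])        # a few illustrative pre-activation values

tanh_grad = 1.0 - np.tanh(z) ** 2         # derivative of tanh
sigm = 1.0 / (1.0 + np.exp(-z))
sigm_grad = sigm * (1.0 - sigm)           # derivative of sigmoid

print(tanh_grad.round(4))  # [1.     0.42   0.0707 0.0013] -> 1 at the center, ~0 when saturated
print(sigm_grad.round(4))  # [0.25   0.1966 0.105  0.0177] -> never above 1/4
```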

 

ReLU (Rectified Linear Unit)

 

Source: https://ai.googleblog.com/2022/04/reproducibility-in-deep-learning-and.html

 

 

  • While the result of the linear combination can take any value from minus infinity to plus infinity, ReLU outputs 0 when that value is below 0 and passes the value through unchanged when it is above 0.
  • The ReLU function can be computed faster than the sigmoid or tanh functions.
  • A drawback is that when the result of the linear combination is negative, the gradient becomes 0.
  • However, for positive inputs the local derivative is exactly 1, so it has the advantage of resolving the gradient vanishing problem (see the sketch after this list).
  • Therefore, when many layers are stacked, it can make the training of the neural network much faster.
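
A minimal ReLU sketch (input values are illustrative): for positive inputs the local derivative is exactly 1, so back propagation through those units does not shrink the gradient, while negative inputs give 0:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Local derivative: exactly 1 for positive inputs, 0 for negative inputs.
    return (z > 0).astype(float)

z = np.array([-3.0, -0.5, 0.7, 5.0])   # illustrative pre-activation values
print(relu(z))        # [0.  0.  0.7 5. ]
print(relu_grad(z))   # [0. 0. 1. 1.]
```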

 

Batch Normalization

  • ํ•™์Šต์˜ ์ตœ์ ํ™”๋ฅผ ๋•๋Š” layer
  • Tanh, sigmoid ๋“ฑ ์ค‘์•™ ์ •๋„์˜ ๊ฐ’์ด ์•„๋‹Œ Satureted ๋œ ๊ฐ’์— ์žˆ์–ด์„œ gradient ๊ฐ’์ด 0์— ์ˆ˜๋ ดํ•˜๊ฒŒ ๋˜๋Š” ๋ฌธ์ œ๊ฐ€ ๋ฐœ์ƒํ•ฉ๋‹ˆ๋‹ค.
  • ๊ทธ๋ ‡๋‹ค๋ฉด Back Propagation ๊ณผ์ •์—์„œ ํ•™์Šต ์‹œ๊ทธ๋„์„ ์ค„ ์ˆ˜ ์—†๋Š” ์ƒํ™ฉ์ด ๋  ์ˆ˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ๊ทน๋‹จ์ ์ธ ๊ฒฝ์šฐ ์„ ํ˜•๊ฒฐํ•ฉ์˜ ๊ฒฐ๊ณผ๊ฐ€ ๋ชจ๋‘ ์Œ์˜ ๊ฐ’์ด ๋‚˜์˜ฌ ๊ฒฝ์šฐ ReLU function์—์„œ๋Š” gradient๊ฐ€ ๋ฌด์กฐ๊ฑด 0์ด ๋˜๊ธฐ ๋•Œ๋ฌธ์— ํ•ด๋‹น node๊ฐ€ ํ•™์Šต์ด ์ „ํ˜€ ์ง„ํ–‰๋˜์ง€ ์•Š๊ฒŒ ๋˜๋Š” ๋ฌธ์ œ์ ์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.
  • forward propagation ๊ณผ์ •์—์„œ ์„ ํ˜•๊ฒฐํ•ฉ์˜ ๊ฐ’์ด 0์ฃผ๋ณ€์˜ ๊ฐ’์œผ๋กœ ์ด๋ฃจ์–ด์ง„๋‹ค๋ฉด ํ•™์Šต์— ์šฉ์ดํ•  ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ์ด๊ฒƒ์ด batch normalization์˜ ๊ธฐ๋ณธ์ ์ธ ์•„์ด๋””์–ด๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.
  • ์–ด๋–ค Mini batch๊ฐ€ ์ฃผ์–ด์ ธ data๊ฐ€, mini batch size๊ฐ€ 10์ด๋ผ๊ณ  ํ•˜๋ฉด, ํ•ด๋‹น ๊ฐ’๋“ค์ด tanh์˜ ์ž…๋ ฅ์œผ๋กœ ์ฃผ์–ด์ง€๊ธฐ ์ „์—,
    ํ‰๊ท ์ด 0, ๋ถ„์‚ฐ์ด 1์ธ ๊ฐ’์„ Normalizationํ•œ๋‹ค๋ฉด tanh์˜ ์ž…๋ ฅ์œผ๋กœ ์ฃผ์–ด์ง€๋Š” ๊ฐ’์˜ ๋Œ€๋žต์ ์ธ ๋ฒ”์œ„๋ฅผ 0์„ ์ค‘์‹ฌ์œผ๋กœ ํ•˜๋Š” ๊ทธ๋Ÿฌํ•œ ๋ถ„ํฌ๋กœ ๋งŒ๋“ค ์ˆ˜ ์žˆ์„ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ ๊ฐ’๋“ค์˜ ๋ณ€ํ™” ํญ์ด ๋„ˆ๋ฌด ์ž‘์œผ๋ฉด tanh์˜ Output ๋˜ํ•œ ๋ณ€ํ™” ์ž์ฒด๊ฐ€ ๋„ˆ๋ฌด ์ž‘์•„์„œ data item๋“ค ๊ฐ„์˜ ์ฐจ์ด๋ฅผ Neural network๊ฐ€ ๊ตฌ๋ถ„ํ•˜๊ธฐ ์–ด๋ ค์šธ ๊ฒƒ์ž…๋‹ˆ๋‹ค.
  • ๊ทธ ๊ฐ’๋“ค์˜ ๋ณ€ํ™” ํญ์ด ๋„ˆ๋ฌด ํฌ๋ฉด ์ ์ •ํ•œ ์˜์—ญ์˜ ๊ฐ’์œผ๋กœ ์ปจํŠธ๋กค์— ์–ด๋ ค์›€์ด ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ์ด ๊ณผ์ •์€ fully-connected layer ํ˜น์€ ์„ ํ˜• ๊ฒฐํ•ฉ์„ ์ˆ˜ํ–‰ํ•œ ์ดํ›„์— ํ™œ์„ฑ ํ•จ์ˆ˜๋กœ ๋“ค์–ด๊ฐ€๊ธฐ ์ง์ „์— ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.
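
A minimal sketch of this first (normalization) step, applied to the pre-activation values of one mini-batch right before they enter tanh (the batch size of 10 follows the example above; eps and the random data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activation values for one mini-batch of size 10 with 32 hidden units,
# i.e. the output of the linear combination right before the tanh activation.
z = rng.normal(loc=5.0, scale=3.0, size=(10, 32))

eps = 1e-5
mean = z.mean(axis=0)                    # per-unit mean over the mini-batch
var = z.var(axis=0)                      # per-unit variance over the mini-batch
z_hat = (z - mean) / np.sqrt(var + eps)  # roughly mean 0, variance 1 per unit

out = np.tanh(z_hat)                     # tanh now sees values centered around 0
print(z_hat.mean(axis=0)[:3].round(6), z_hat.std(axis=0)[:3].round(3))
```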

 

Source: https://towardsdatascience.com/implementing-batch-normalization-in-python-a044b0369567

 

However, when the meaning of the data (its original scale and center) is important, the BN (Batch Normalization) step above can be seen as a process that throws part of that information away.

 

The second stage of BN is the step that makes it possible to restore this lost information into whatever form is wanted: a learnable scale and shift applied to the normalized values.

 

So, after the BN process, the network is also given the ability to recover the data's own mean and variance whenever that is useful.
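
A minimal sketch of that second step (gamma and beta are BN's learnable scale and shift; the normalized values below are a made-up stand-in for the output of the first step):

```python
import numpy as np

# Stand-in for values already normalized by the first BN step (mean 0, variance 1).
z_hat = np.array([[-1.2,  0.3],
                  [ 0.5, -0.8],
                  [ 0.7,  0.5]])

# Learnable per-unit scale and shift, trained together with the other parameters.
gamma = np.array([2.0, 0.5])   # can restore any desired spread per unit
beta  = np.array([1.0, -3.0])  # can restore any desired center per unit

y = gamma * z_hat + beta       # second BN step: scale, then shift
print(y)
```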

 

On top of that, it serves as an effective mechanism for mitigating the gradient vanishing problem.

 

 
