[Deep Learning] Backpropagation in Special Functions

2023. 4. 3. 18:54
🧑🏻‍💻 Key terms

Neural Networks
Feed-forward
Backpropagation

 

 

So far we have mostly looked at the more general forms,

 

so this time let's look at the derivatives of the functions that are actually used most often in practice.

 

Backpropagation

 

 

 

Previously we just wrote the activation function's derivative as f'(net) and moved on.

 

Now let's look at a few more specific forms.

 

 

Cross Entropy

 

Source: https://www.researchgate.net/figure/Plot-of-cross-entropy-loss-with-increase-in-the-number-of-epochs_fig2_326349021

 

 

 

The original expression uses the natural log (ln),

 

but since our goal is optimization, a constant that multiplies every term equally changes nothing about where the minimum lies (switching log bases only introduces such a constant factor).

 

That is why the computation can be carried out as shown above.
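To make this concrete, here is a minimal numpy sketch (function and variable names are my own, not from the post) of the cross-entropy loss and its derivative with respect to the predicted probabilities; switching from ln to another log base only multiplies both by a constant, so the optimum does not move.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy loss L = -sum_i y_i * ln(y_hat_i) for one example."""
    y_hat = np.clip(y_hat, eps, 1.0)        # avoid log(0)
    return -np.sum(y * np.log(y_hat))

def cross_entropy_grad(y, y_hat, eps=1e-12):
    """dL/dy_hat_i = -y_i / y_hat_i (the starting error term for backprop)."""
    y_hat = np.clip(y_hat, eps, 1.0)
    return -y / y_hat

y     = np.array([0.0, 1.0, 0.0])           # one-hot target
y_hat = np.array([0.2, 0.7, 0.1])           # network output (e.g. from softmax)

print(cross_entropy(y, y_hat))              # ~0.357
# Using log base 2 instead of ln only multiplies loss and gradient by 1/ln(2),
# a constant that does not change where the minimum is.
```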

 

 

Softmax

Source: https://www.researchgate.net/figure/Graphic-representation-of-the-softmax-activation-function_fig5_348703101

 

 

Interestingly, the derivative of the softmax function is made up of combinations of the softmax values themselves.

 

 

From this we can see that,

 

since each softmax output is already stored during the feed-forward pass,

 

during backpropagation the derivative above can be assembled entirely from those stored values, so no extra computation is needed for softmax at that stage.
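A minimal sketch of this idea, with assumed numpy function names: the Jacobian entries d s_i / d net_j = s_i * (delta_ij - s_j) are assembled purely from the softmax outputs s that were already stored during feed-forward, with no new exponentials.

```python
import numpy as np

def softmax(net):
    """Feed-forward: compute and store s = softmax(net)."""
    e = np.exp(net - net.max())    # shift for numerical stability
    return e / e.sum()

def softmax_jacobian(s):
    """Backprop: d s_i / d net_j = s_i * (delta_ij - s_j),
    built only from the stored outputs s."""
    return np.diag(s) - np.outer(s, s)

net = np.array([2.0, 1.0, 0.1])
s = softmax(net)                   # stored during the forward pass
J = softmax_jacobian(s)            # reused during backpropagation
print(J.shape)                     # (3, 3), built from the stored s only
```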

 

 

Sigmoid

 

Source: https://en.wikipedia.org/wiki/Logistic_function

 

 

 

The computation for the logistic sigmoid works out as shown above.

 

 

Its derivative has the same form as the softmax derivative.

 

In fact, softmax is essentially the logistic function scaled up to multiple classes, so this is to be expected.
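As a quick numeric check of that claim (a sketch with made-up names and numbers): a two-class softmax over the logits [z, 0] gives exactly the logistic sigmoid of z.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(net):
    e = np.exp(net - net.max())
    return e / e.sum()

z = 1.3
# softmax over [z, 0]: first component = e^z / (e^z + 1) = sigmoid(z)
print(softmax(np.array([z, 0.0]))[0])   # 0.7858...
print(sigmoid(z))                        # 0.7858...
```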

 

And, just like with softmax, during the feed-forward pass we store the post-activation values for every node, whether it is a hidden node or an output node.

 

That means the derivative can be obtained immediately from the stored values.
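A minimal sketch of how the stored value is reused (assumed names): the forward pass stores a = sigmoid(net), and the backward pass computes the derivative as a * (1 - a) without ever re-evaluating the exponential.

```python
import numpy as np

def sigmoid_forward(net):
    """Feed-forward: compute and store the post-activation value a."""
    return 1.0 / (1.0 + np.exp(-net))

def sigmoid_backward(a):
    """Backprop: d sigma / d net = a * (1 - a), using only the stored a."""
    return a * (1.0 - a)

net = np.array([-1.0, 0.0, 2.0])
a = sigmoid_forward(net)       # stored at every hidden/output node
print(sigmoid_backward(a))     # derivative obtained immediately
```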

 

 

ReLU

 

This is the most widely used activation function for hidden nodes these days.

 

 

Source: https://www.researchgate.net/figure/ReLU-function-graph_fig2_346250677

 

 

 

 

The ReLU function is defined as shown above.

 

Here the net input plays the role of x, so the function can be written in terms of y as in the formula above.

 

Let's express it once more.

 

As above, we can say the derivative is 1 whenever x is greater than 0.

 

However, since the net input is a function of w, differentiating with respect to w instead gives x, as shown above.

 

Both expressions are correct.

 

It depends on which variable you treat as the argument.

 

Whether the derivative is 1 or x, x is a signal that already arrived from the previous layer, so no additional computation is needed.
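Here is a minimal sketch of both views (assumed names and numbers): with respect to the net input the derivative is 1 for positive net and 0 otherwise, while with respect to a weight in net = w * x it picks up the incoming signal x; either way, everything needed is already available from the forward pass.

```python
import numpy as np

def relu(net):
    return np.maximum(0.0, net)

def relu_grad_wrt_net(net):
    """d relu(net) / d net : 1 where net > 0, else 0."""
    return (net > 0).astype(float)

# For a single connection net = w * x, the chain rule w.r.t. the weight is
#   d relu(w*x) / d w = relu'(net) * x,
# so the extra factor is just the incoming signal x, which is already known.
x, w = 1.5, 0.4
net = np.array([w * x])
print(relu(net), relu_grad_wrt_net(net) * x)   # [0.6] [1.5]
```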

 

 

 

 

As an aside, for regression problems the MSE (mean squared error) loss is the usual choice, and its derivative is just as simple.

 

 

Is it a coincidence that all of these derivatives are so simple?

 

No; the functions we use here are chosen precisely because their derivatives are simple.

 

If a derivative were complicated, the amount of computation would explode, so neural networks stick to functions whose derivatives require only simple operations.

 

 

 

 

 

 

And if we walk through a small concrete example, the computation proceeds as shown above.

 

 

Examples like this show that, once the chain rule is applied, all that is left is basic arithmetic on values we already know.

 

 

For example, if this product ends up being 0.1,

 

we simply update the weight to (current weight - 0.1).
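For instance, a tiny sketch with made-up numbers: the chain-rule product of values already stored from the forward pass comes out to 0.1, and the weight is reduced by exactly that amount (learning rate left at 1 here).

```python
# All three factors are values we already have from the forward pass
# (made-up numbers for illustration).
dL_dout   = 0.5            # derivative of the loss w.r.t. the node output
dout_dnet = 0.25           # activation derivative, e.g. a * (1 - a)
dnet_dw   = 0.8            # incoming signal x

grad = dL_dout * dout_dnet * dnet_dw   # chain rule: 0.5 * 0.25 * 0.8 = 0.1
w = 1.3
w = w - grad                            # update: current weight - 0.1
print(grad, w)                          # 0.1  1.2
```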

 

 

 

1. Randomize the initial weight values.

 

2. Run a feed-forward pass through the network to produce a prediction (this is how y^ is obtained).

 

3. The teacher output simply means the true output (the y value, since this is supervised learning).

 

4. Compute y^ - y for each of the output units (roughly: compute the error).

 

5. hidden -> output weight update

 

6. input -> hidden weight update (keep propagating backwards in the same way)

 

7. Update all of the weights.

 

8. Repeat the above until all examples are handled correctly or a stopping criterion is met (a minimal sketch of this loop follows below).
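Below is a minimal end-to-end sketch of steps 1-8, under my own assumptions (a single hidden layer, sigmoid activations, squared-error-style deltas, plain batch gradient descent on a toy XOR task); it illustrates the loop itself, not the exact network from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: XOR, with a constant 1 appended to each input as a bias term
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# 1. Randomize the initial weights
W1 = rng.normal(scale=0.5, size=(3, 4))   # input (+bias) -> hidden
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden (+bias) -> output
lr = 0.5

for epoch in range(20000):                 # 8. repeat until a stopping criterion
    # 2. Feed-forward to obtain the prediction y_hat
    H = sigmoid(X @ W1)                            # hidden activations (stored)
    Hb = np.hstack([H, np.ones((len(X), 1))])      # append a bias unit
    Y_hat = sigmoid(Hb @ W2)                       # output activations (stored)

    # 3.-4. Teacher output Y vs. prediction: error = y_hat - y
    err = Y_hat - Y

    # 5. hidden -> output weight gradient (reuses the stored activations)
    delta_out = err * Y_hat * (1 - Y_hat)
    grad_W2 = Hb.T @ delta_out

    # 6. input -> hidden weight gradient (backpropagate one layer further)
    delta_hid = (delta_out @ W2[:-1].T) * H * (1 - H)
    grad_W1 = X.T @ delta_hid

    # 7. Update all the weights
    W2 -= lr * grad_W2
    W1 -= lr * grad_W1

    if np.mean(err ** 2) < 1e-3:           # 8. stopping criterion
        break

print(np.round(Y_hat, 2))                  # should be close to [[0],[1],[1],[0]]
```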

 

 

 

Learning rate

 

-> A hyperparameter that decides how much of the gradient we computed is actually applied.

 

=> apply it as-is, scaled down, or scaled up

 

 

  • Fixed Learning rate
  • Adaptive Learning rate

 

  • If the learning rate is far too large, it may overshoot the solution.
    • If it is moderately large, convergence is fast.
    • There is a risk of divergence.
  • If it is far too small, training takes an extremely long time.
    • There is no risk of divergence.
    • It simply takes a lot of time.

 

=> We have no way of knowing the right size in advance, which is exactly why the learning rate matters so much.

This is why an adaptive learning rate becomes necessary.
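One simple illustration of the adaptive idea (my own toy schedule, not a method from the post): shrink the step size over time, so early steps are large and later steps are cautious.

```python
# A minimal sketch of a decaying learning rate (an assumed schedule):
# start large, shrink as training goes on.
base_lr, decay = 0.5, 0.01

def lr_at(epoch):
    return base_lr / (1.0 + decay * epoch)

for epoch in [0, 10, 100, 1000]:
    print(epoch, round(lr_at(epoch), 4))   # 0.5, 0.4545, 0.25, 0.0455
```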

 

 

Local Optima Problem

 

  • Neural networks fundamentally rely on gradient descent, and gradient descent is, by nature, a procedure for finding a local optimum.
  • We cannot even prove whether the optimum found is global or not, because the loss function is far too complex.
    • Even reaching a local optimum is fortunate (it takes some luck).
    • It may not get there at all.
  • With an NN we can never guarantee reaching the best solution.
  • There is always a possibility that the answer is poor.
  • It may stop at a bad solution.
  • We should at least make it stop somewhere reasonably good.
    • How?

 

 

 

  • Having good initial weight values is important.
    • Realistically, though, there is no reliable recipe for choosing them.
  • Stopping criterion
    • Train just enough and stop at a sensible point to avoid overfitting.
    • Getting stuck in a local optimum and overfitting may not look like the same thing, but being stuck in a local optimum can also be seen as overfitting to a poor solution.
  • training-validation error
    • We can also draw learning curves, track the training error and validation error separately, and stop based on them (early stopping); a sketch follows below.
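A minimal sketch of that stopping rule, with assumed names and toy validation-error values: keep training while the validation error improves, and stop once it has failed to improve for a set number of checks.

```python
# Early stopping based on the validation error (toy values for illustration).
val_errors = [0.90, 0.60, 0.45, 0.40, 0.41, 0.43, 0.46]   # one value per epoch

best, best_epoch, patience = float("inf"), -1, 2
for epoch, err in enumerate(val_errors):
    if err < best:
        best, best_epoch = err, epoch      # validation error still improving
    elif epoch - best_epoch >= patience:   # no improvement for `patience` epochs
        print("stop at epoch", epoch, "- keep weights from epoch", best_epoch)
        break
```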

 

 

Summary

 

  • Pros
    • Great performance (high accuracy)
    • (Theoretically) can solve any problem (high capacity)
    • A generalized model (a model defined through functions; once you understand NNs, other models become easier to understand)
  • Cons
    • Needs high computing power ((input - hidden - output computation) × # of training data × # of epochs)
    • Overfitting (high capacity; NNs are prone to overfitting, and DNNs even more so)
    • Hyperparameter setting (# of hidden nodes, # of hidden layers, learning rate, regularization)

 

 

 

 

 

 

 

 

 

 
