[Deep Learning] Deep Neural Network (2)

2023. 4. 14. 21:07
๐Ÿง‘๐Ÿป‍๐Ÿ’ป์šฉ์–ด ์ •๋ฆฌ

Neural Networks
Feed-forward
Backpropagation
Deep Neural Network

 

 

 

์šฐ๋ฆฌ๊ฐ€ ์ง€๊ธˆ๊นŒ์ง€, 1957๋…„ perceptron, 69๋…„ MLP, 89๋…„ Backpropagation ์˜ ๋“ฑ์žฅ์„ ๋ดค์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ, 95๋…„์— ๋‚˜์˜จ SVM ์ด ์ด๋ก ์ƒ ์™„๋ฒฝํ•˜๊ฒŒ Neural Network ๋ณด๋‹ค ์šฐ์œ„์— ์žˆ์—ˆ๊ธฐ์—, 95๋…„๋ถ€ํ„ฐ 2000๋…„๊นŒ์ง€ NN์ด ์ฃฝ์–ด์žˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋‹ค๊ฐ€, 2006๋…„์— ๋‹ค์‹œ Deep Learning์ด ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋‚ด๊ณ , 2011๋…„์— ๋‹ค์‹œ ์‚ด์•„๋‚˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ ์ด์ƒํ•œ ๊ฒŒ ์žˆ์Šต๋‹ˆ๋‹ค.

 

์šฐ๋ฆฌ๊ฐ€ 2006๋…„ ์ „๊นŒ์ง€ Deep Learning์ด ์ข‹์€ ์„ฑ๊ณผ๋ฅผ ๋‚ด๊ธฐ ์ „๊นŒ์ง€..

 

์‚ฌ๋žŒ๋“ค์ด 3-MLP๋ฅผ ์ฃผ๋กœ ์‚ฌ์šฉํ–ˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ?

 

์‚ฌ๋žŒ๋“ค์ด 10-MLP๋ฅผ ์จ๋ณด์ž์™€ ๊ฐ™์€ ์ƒ๊ฐ์„ ๊ณผ์—ฐ ์•ˆ ํ–ˆ์„๊นŒ์š”?

 

์—ฌ๊ธฐ์„œ 3-MLP์—์„œ layers๋ฅผ ๋Š˜๋ฆฌ๋ฉด ์ข‹์„ ๊ฑฐ ๊ฐ™๋‹ค๋Š” ์ƒ๊ฐ์€ ๋ˆ„๊ตฌ๋‚˜ ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

Deep Learning

 

 

๊ทธ๊ฒƒ์—๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ด์œ ๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

 

  • Computational power increase
    • SVM์—๊ฒŒ ์ฃฝ์ž„์„ ๋‹นํ•  ๋‹น์‹œ computational power๊ฐ€ ์ง€๊ธˆ๊ณผ ๊ฐ™์ง€ ์•Š์•˜๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.
    • GPU makes a breakthrough.
  • The amount of data increases
    • generalization gap์ด ์ค„์–ด๋“ญ๋‹ˆ๋‹ค. ์ด๊ฒƒ์ด ๊ถ๊ทน์ ์ธ machine learning์˜ ๋ชฉ์ ์ž…๋‹ˆ๋‹ค.
    • ๊ทธ๋Ÿฌ๋ฏ€๋กœ, overfitting์˜ ์œ„ํ—˜์—์„œ ๋ฒ—์–ด๋‚ฉ๋‹ˆ๋‹ค.
    • modeling์ด๋‚˜ preprocessing skills์„ ์“ฐ์ง€ ์•Š์•„๋„ Deep Learning์œผ๋กœ ์ง์ ‘ ๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • Acceptable performance : 5,000 labeled training data per category
    • Exceed human performance : 10 M labeled training data per category

 

 

๊ทธ๋ž˜์„œ ์œ„์˜ค ๊ฐ™์€ ์ด์œ ๋กœ ๋ฐ์ดํ„ฐ๊ฐ€ ์—†๊ณ , ์ปดํ“จํ„ฐ ์„ฑ๋Šฅ์ด ์ข‹์ง€ ์•Š์œผ๋‹ˆ, ํ•™์Šต์„ ํ•  ์ˆ˜ ์žˆ๋Š” ํ™˜๊ฒฝ์ด ์•„๋‹ˆ์—ˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด,

 

์ด ๋‹จ์ˆœํžˆ 2๊ฐ€์ง€ ์ด์œ  ๋•Œ๋ฌธ์— SVM์— ๊ตด์š•์„ ๋‹นํ–ˆ๋‹ค? ์—ฌ๋Ÿฌ๋ถ„์€ ์–ด๋–ป๊ฒŒ ์ƒ๊ฐํ•˜์‹œ๋‚˜์š”?

 

์ด 2๊ฐ€์ง€ ์ด์œ ๋Š” ๊ต‰์žฅํžˆ ์ค‘์š”ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ธฐ์–ตํ•˜์…”์•ผ ํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿผ์—๋„ ๋ฌด์–ธ๊ฐ€ ๋‹ค๋ฅธ ์ด์œ ๊ฐ€ ์žˆ์„ ๊ฒƒ ๊ฐ™๋‹ค๋Š” ์˜์‹ฌ์„ ํ’ˆ๊ณ  ๋„˜์–ด๊ฐ€๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์•ž์—์„  H/W ์  issue๋ฅผ ๋‹ค๋ค˜๋‹ค๋ฉด ์ด์ œ ์ •๋ง Deep Learning์œผ๋กœ ๋“ค์–ด๊ฐ€๋ณด๋„๋ก ํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

 

 

 

Vanishing Gradient

 

 

 

๋จผ์ €, ํ•™์Šต์— ์žˆ์–ด weight๋ฅผ ์ •ํ•ฉ๋‹ˆ๋‹ค.

 

weight์˜ initial ๊ฐ’์€ random์œผ๋กœ ๋˜๋ฉฐ, ๊ทธ ๋‹ค์Œ๋ถ€ํ„ฐ weight update๋ฅผ ํ•ฉ๋‹ˆ๋‹ค.

 

 

์ด update๋Š” ์œ„์™€ ๊ฐ™์ด learning rate๋ผ๋Š” ์ƒ์ˆ˜์— gradient๋ฅผ ๊ณฑํ•˜๊ณ , ์ด gradient ์—ญ๋ฐฉํ–ฅ์œผ๋กœ ๊ฐ€๋Š” ๊ฒƒ์ด Backpropagation์ž…๋‹ˆ๋‹ค.

 

์ด Backpropagation์„ ์œ„ํ•ด์„œ ์šฐ๋ฆฌ๊ฐ€ Neural Network๋ฅผ ํ•™์Šต ์‹œํ‚ต๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์ด 3-layer MLP์˜ ๊ฒฝ์šฐ์—๋Š” ๊ทธ๋ ‡๊ฒŒ ๋ณต์žกํ•˜์ง„ ์•Š์•˜์Šต๋‹ˆ๋‹ค.

 

weight update rule ์ž์ฒด๋Š” ๋˜‘๊ฐ™์ง€๋งŒ ๋ง์ด์ฃ .

 

 

 

์—ฌ๊ธฐ์„œ, Error๊ฐ€ loss๋กœ ๊ณ„์‚ฐ๋ฉ๋‹ˆ๋‹ค.

 

์ด loss๋กœ๋ถ€ํ„ฐ Backpropagation์€ ๋’ค์ชฝ์œผ๋กœ ์—ญ์ „ํŒŒ๋ฅผ ํ•ด๋‚˜๊ฐ‘๋‹ˆ๋‹ค.

 

์ œ์ผ high layer๋ถ€ํ„ฐ ์ œ์ผ low layer๊นŒ์ง€ gradient๋ฅผ ํƒ€๊ณ  ํƒ€๊ณ  ๊ฐˆ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ ‡๊ฒŒ ์—ญ์ˆœ์œผ๋กœ ํƒ€๊ณ  ๊ฐ€๋Š” ๊ฒŒ Backpropagation์ž…๋‹ˆ๋‹ค.

 

 

๊ทธ๋Ÿฐ๋ฐ, Vanishing gradient๋ผ๋Š” ๊ฒƒ์€ ์ ์  ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ž‘์•„์ง€๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•˜๊ณ , exploding gradient๋Š” ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ ์  ๋ฐœ์‚ฐํ•˜๋Š” ๊ฒƒ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

์ผ๋ฐ˜์ ์œผ๋กœ๋Š” vanishing gradient๊ฐ€ ๋” ํฐ ์—ญํ• ์„ ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฌ๋ฏ€๋กœ, ๊ธฐ์šธ๊ธฐ์˜ ํฌ๊ธฐ๋ฅผ Backpropagation ๊ณผ์ •์—์„œ ์–ด๋Š ์ •๋„ ์ผ์ •ํ•˜๊ฒŒ ์œ ์ง€ํ•ด์ค˜์•ผ ํ•˜๋Š”๋ฐ,

 

layer๊ฐ€ ๋งŽ์•„์ง€๋ฉด ๋งŽ์•„์งˆ ์ˆ˜๋ก, ์šฐ๋ฆฌ์˜ model์ด deepํ•˜๋ฉด deepํ•  ์ˆ˜๋ก Backpropagationํ•  ๋•Œ, ๊ธฐ์šธ๊ธฐ๊ฐ€ ์ ์  ์ž‘์•„์ง€๊ฒŒ ๋˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

ํ˜น์€ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์ด ๋˜์–ด update๊ฐ€ ์•ˆ ๋˜๋Š” ๊ฒƒ๊นŒ์ง€ ์˜ต๋‹ˆ๋‹ค.

 

gradient ํฌ๊ธฐ ๋งŒํผ update๊ฐ€ ๋˜๋Š” ๊ฒƒ์ธ๋ฐ, gradient๊ฐ€ ์—„์ฒญ ์ž‘๊ฑฐ๋‚˜ 0์ด ๋˜๋ฉด ์ดˆ๊ธฐ ๋‹จ๊ณ„ layer๋“ค์˜ weight๊ฐ€ update๊ฐ€ ์•ˆ ๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

 

 

activation function์„ logistic sigmoid๋กœ ๊ฐ€์ •ํ•˜๋ฉด,

 

์œ„์™€ ๊ฐ™์ด Backpropagation ๊ณผ์ •์—์„œ ์ง€๊ธˆ ๋‘ layer๋งŒ ๋‚ด๋ ค๊ฐ€๋„ weight update ์ˆ˜์น˜๊ฐ€ ์—„์ฒญ๋‚˜๊ฒŒ ๋‚ฎ์•„์ง€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

3-layer MLP๋งŒ ํ•ด๋„ ์ด์ •๋„์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ  ์ด logistic sigmoid์—์„œ ๊ธฐ์šธ๊ธฐ๊ฐ€ 0์— ๊ฐ€๊นŒ์šด ๊ฐ’๋“ค์ด ์žˆ๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

์ด 0์— ๊ฐ€๊นŒ์šด ์• ๋“ค์„ ์—ฌ๋Ÿฌ ๋ฒˆ ์ค‘์ฒฉํ•ด์„œ ๊ณฑํ•œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ Backpropagation ์ค‘์—์„œ vanishing gradient ๋ฌธ์ œ์— ์ง๋ฉดํ•˜๋Š” ๊ฒƒ์€ activation function์˜ ๋ฌธ์ œ์ž…๋‹ˆ๋‹ค.

 

 

 

Remedy 1 : Pre-training

 

๊ทธ๋ ‡๋‹ค๋ฉด, ๊ณ„์†ํ•ด์„œ weight ๊ฐ€ ์†Œ์‹ค์ด ๋ฉ๋‹ˆ๋‹ค.

 

์—ฌ๊ธฐ์„œ initial weight๊ฐ€ randomํ•˜๋ฉด deepํ•  ์ˆ˜๋ก ๊ณ„์† weight๊ฐ€ ์†Œ์‹ค๋˜๊ฒŒ ๋˜์ฃ .

 

๊ทธ๋ ‡๋‹ค๋ฉด, ์–ด๋Š์ •๋„ ๋˜‘๋˜‘ํ•œ weight๋ฅผ ์ฒ˜์Œ์— initializationํ•ด๋†“์œผ๋ฉด ๋˜์ง€ ์•Š๋‚˜?

 

๋ณธ๊ฒฉ์ ์œผ๋กœ Neural Networks๊ฐ€ ํ•™์Šตํ•˜๊ธฐ ์ „์—, ์‚ฌ์ „์— ๋ฌด์–ธ๊ฐ€๋ฅผ ํ•ด๋†“์ž.

 

๊ทธ๋Ÿฐ ์•„์ด๋””์–ด๊ฐ€ ๋ฐ”๋กœ "Pre-training"์ž…๋‹ˆ๋‹ค.

 

 

Deep Belief Networks (DBN)

 

๊ทธ๋ ‡๋‹ค๋ฉด, Restricted Bolzmann Machine (RBM)์œผ๋กœ Pre-training์„ ํ•ด๋†“๊ณ ,

 

๊ทธ๋ฆฌ๊ณ  ๋‚˜์ค‘์— Backpropagation ์‹œ์—๋Š” fine-tunning์„ ํ•˜๋Š” ๋ฐฉ์‹์ด ์ด DBN์ž…๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋ ‡๋‹ค๋ฉด, ์ด ๊ทธ๋ฆผ์˜ ์šฐ์ธก์— layer ๋ผ๋ฆฌ์˜ RBM์„ ๋Œ๋ ค๋†“๊ณ , ์ตœ๋Œ€ํ•œ weight๋ฅผ ์–ด๋Š์ •๋„๋Š” ํ•™์Šต์„ ์‹œ์ผœ ๋†“์ž๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์ฆ‰, RBM์„ ์‚ฌ์šฉํ•ด์„œ, layer-wise๋กœ pre-training์„ ํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด ์šฐ๋ฆฌ๊ฐ€ initial weight๋กœ ์ผ๋˜ ์ด W0๋ผ๋Š” ๊ฐ’์€, random์ด ์•„๋‹Œ, ์–ด๋Š ์ •๋„ ๊ฐ layer ์‚ฌ์ด์˜ ๊ด€๊ณ„๋ฅผ ์„ค๋ช…ํ•  ์ˆ˜ ์žˆ๋Š” weight๊ฐ€ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ฆฌ๊ณ , Backpropagation ์‹œ fine-tunning๋งŒ ์ง„ํ–‰ํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ขŒ์ธก๊ณผ ๊ฐ™์€ ๊ทธ๋ฆผ์˜ ์ˆœ์„œ๊ฐ€ ๋‚˜์˜ฌ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ด ๋ฐฉ๋ฒ•์„ Unsupervised Pre-train์ด๋ผ๊ณ  ํ•˜๊ธฐ๋„ ํ•ฉ๋‹ˆ๋‹ค.

 

 

RBM ๋‹จ๊ณ„์—์„œ๋Š” y ๊ฐ’์ด ์กด์žฌํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

๊ทธ์ € layer ๋ผ๋ฆฌ RBM ํ•ฉ๋‹ˆ๋‹ค.

 

y ๊ฐ’์€ fine-tunning ์‹œ ๋“ค์–ด์˜ต๋‹ˆ๋‹ค.

 

 

์ •๋ฆฌํ•˜์ž๋ฉด, Supervised ๋ฌธ์ œ๋ฅผ ํ‘ธ๋Š” network์ด์ง€๋งŒ, Unsupervised ๊ธฐ๋ฒ• ์ค‘ ํ•˜๋‚˜์ธ ์ด RBM์œผ๋กœ Pre-training์„ ๋ฏธ๋ฆฌ ์‹œ์ผœ๋†“์ž ์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ์ด๊ฒƒ์€ Unsupervised learning ๋ฌธ์ œ๋ฅผ ํ•˜๋Š” ๊ฒƒ์€ ์•„๋‹ˆ๋ผ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

unsupervised ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜๋Š” ๊ฒƒ์ด์ง€, ์‹ค์ œ๋กœ Supervised learning์˜ ๋ฌธ์ œ๋ฅผ ํ‘ธ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

์˜คํ•ด ์—†์œผ์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿฐ๋ฐ, ์ด DBN์€ ํ˜„์žฌ ์ž˜ ์‚ฌ์šฉํ•˜์ง€๋Š” ์•Š์Šต๋‹ˆ๋‹ค.

 

์ด Pre-training์˜ ๊ณผ์ •์ด ๋„ˆ๋ฌด๋‚˜ ํฌ๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

 

 

Drawback: the Pre-training process is too heavy.

 

 

์ด RBM์€ ๋ถ„ํฌ ์ถ”์ •์„ ํ•˜๊ธฐ ๋•Œ๋ฌธ์— ๊ต‰์žฅํžˆ ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค. ๊ทธ๋ž˜์„œ Pre-training์—๋Š” ํฐ cost๊ฐ€ ๋“ญ๋‹ˆ๋‹ค.

 

Backpropagation์€ ๋ถ„ํฌ ๊ณผ์ •์ด ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

Backpropagation์€ ๋‚˜์ค‘์— ๋ถ„ํฌ ๊ณผ์ •์ด ํ•„์š”ํ•œ maximum likelihood์™€ ๋น„์Šทํ•œ ์—ญํ• ์„ ํ•œ๋‹ค๊ณ  ํ•˜๋Š”๋ฐ,

 

์–ด์งธํŠผ, Backpropagation์€ ๋ถ„ํฌ ๊ณผ์ •์ด ํ•„์š”ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

RBM์€ ๊ทธ์— ๋น„ํ•ด ์กฐ๊ธˆ ๋” ํ™•๋ฅ  function ๊ธฐ๋ฐ˜์ž…๋‹ˆ๋‹ค.

 

๊ทธ๋ž˜์„œ ํ•˜๋‚˜ํ•˜๋‚˜ ํ•™์Šตํ•˜๋Š” ์‹œ๊ฐ„์ด Backpropagation๊ณผ ๋น„๊ตํ•˜์—ฌ ์—„์ฒญ ์˜ค๋ž˜ ๊ฑธ๋ฆฝ๋‹ˆ๋‹ค.

 

 

์ง€๊ธˆ์€ Pre-trainํ•˜์ง€ ์•Š์•„๋„ Vainishing Gradient๋ฅผ ํ•ด๊ฒฐํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•์ด ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.

 

๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ  ์ด DBN์€ ์ •๋ง ์ค‘์š”ํ•œ ๊ฐœ๋…์ž…๋‹ˆ๋‹ค.

 

 

Deep Learning์ด ์ฒ˜์Œ์œผ๋กœ SVM์„ ์ด๊ธธ ์ˆ˜ ์žˆ๋‹ค๊ณ  ๋‚˜์˜จ Model์ž…๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋Ÿผ ์ง€๊ธˆ๊นŒ์ง€ DBN์œผ๋กœ๋ถ€ํ„ฐ ์šฐ๋ฆฌ๋Š” ํฌ๋ง์„ ๋ดค์Šต๋‹ˆ๋‹ค.

 

Deepํ•˜๊ฒŒ layers๋ฅผ ์Œ“๋Š” ๊ฒƒ์€, Vanishing Gradient Problem์œผ๋กœ ๋ชป ํ•  ๊ฑฐ ๊ฐ™๋‹ค๊ณ  ์ƒ๊ฐํ–ˆ์ง€๋งŒ, ํฌ๋ง์ด ์ƒ๊ฒผ์Šต๋‹ˆ๋‹ค.

 

Remedy 2 : ReLU Activation (Rectified Linear Unit)

 

 

 

์—ฌ๊ธฐ์„œ ์‚ดํŽด๋ณด๋Š” activation function์€,

 

logistic ๊ณ„์—ด๊ณผ ReLU์˜ ๋ณ€ํ˜•๋“ค๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ์Šต๋‹ˆ๋‹ค.

 

ReLU Activation (Rectified Linear Unit)์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

 

  • Easy calculation
  • The gradient stays large

 

๊ทธ๋Ÿฐ๋ฐ, ์ด ํ•จ์ˆ˜๋Š” (0,0)์—์„œ ๊ฐ์ด ์žˆ๊ธฐ ๋•Œ๋ฌธ์— ๋ฏธ๋ถ„์ด ๋ถˆ๊ฐ€ํ•ฉ๋‹ˆ๋‹ค.

 

ํ•™๊ณ„์—์„œ๋Š” ์‹ค์ œ๋กœ ํ•™์Šตํ•  ๋•Œ, ์‹ ๊ฒฝ์“ธ ๋ถ€๋ถ„์ด ์•„๋‹ˆ๋ผ๊ณ  ๋งํ•ฉ๋‹ˆ๋‹ค.

 

๊ทธ๊ฒƒ์€ 0.000000...0์˜ ๊ฐ’์ด ์‹ค์ œ๋กœ ๋‚˜ํƒ€๋‚  ์ผ์ด ๊ฑฐ์˜ ์—†๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

Drawback

-> Every node can become deactivated

 

sigmoid์™€ ๊ฐ™์€ logistic ์ข…๋ฅ˜์˜ activation function์€ ์•„๋ฌด๋ฆฌ ์•ˆ ์ข‹์€ ๊ฐ’์ด ๋‚˜๊ฐ€๋„ 0.01์ด ๋‚˜๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ทธ๋ ‡๋‹ค๋ฉด, ์–ด๋–ป๊ฒŒ ํ•ด์„œ๋“  ํ•จ์ˆซ๊ฐ’์ด feed forward process์—์„œ ์ „๋‹ฌ์ด ๋ฉ๋‹ˆ๋‹ค.

 

 

 

๊ทธ๋Ÿฐ๋ฐ, ReLU์˜ ๊ฒฝ์šฐ์—๋Š” signal์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘์•„์งˆ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์•„์ง‘๋‹ˆ๋‹ค.

 

๊ฒŒ๋‹ค๊ฐ€ Weighted Summationํ•˜๋‹ˆ๊นŒ ๋ง์ž…๋‹ˆ๋‹ค.

 

์•ž์ชฝ์—์„œ ๊ฐ’์ด ๋งŽ์ด activation์ด ๋งŽ์ด ์•ˆ ๋˜์–ด ์žˆ๋‹ค๋ฉด, ๋’ค์ชฝ์—์„œ๋Š” ๊ฐ’์ด deactivate ๋˜๋Š”, ๊ทธ๋ž˜์„œ ๊ฐ’์ด g(z) == 0์ด ๋˜๋Š”, 

 

๊ฑฐ์˜ ๋ชจ๋“  Node๊ฐ€ deactivate๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 



๊ทธ๋ž˜์„œ ReLU function์ด ์ฒ˜์Œ ๋‚˜์™”์„ ๋•Œ, sparse activation์ด๋ผ๊ณ  ๋ถˆ๋ €์Šต๋‹ˆ๋‹ค.

 

๋“ฌ์„ฑ ๋“ฌ์„ฑ activation์ด ๋œ๋‹ค๋Š” ์†Œ๋ฆฌ์ฃ .

 

๊ทธ sparseํ•จ์ด ๋„ˆ๋ฌด sparse ํ•ด์ง€๋ฉด ์•„์˜ˆ ๋ถˆ์„ ๋‹ค ๊บผ๋ฒ„๋ฆด ์ˆ˜๋„ ์žˆ๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

๊ทธ๋ž˜์„œ ReLU๋ฅผ ์“ธ ๋•Œ๋Š”, ์ดˆ๊ธฐ์— ๊ฑฐ์˜ ๋‹ค Activation์„ ์‹œ์ผœ๋†“๊ณ  ์‹œ์ž‘ํ•ฉ๋‹ˆ๋‹ค.

 

deactivation ๋˜๋ฉด, gradient๋„ ๋‹น์—ฐํžˆ 0์ด๋‹ˆ, ๋’ค์ชฝ node ๋“ค๋„ update๋„ ์•„์˜ˆ ์•ˆ ๋˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

๊ฐ€๋Šฅํ•˜๋ฉด ์ดˆ๊ธฐ์—๋Š” ๋งŽ์ด activation ์‹œ์ผœ๋†“๊ณ , ์–ต์ง€๋กœ ์‹œ์ž‘์„ํ•ฉ๋‹ˆ๋‹ค.

 

 

 

 

์œ„ sparseํ•œ ์ ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ,

 

Leaky ReLU๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

๋งˆ์ด๋„ˆ์Šค ๊ฐ’์— ๋Œ€ํ•ด์„œ ๋ญ”๊ฐ€ signal์„ ์ „๋‹ฌ ๋ฐ›๊ณ  weighted summation์—์„œ์˜ ์˜ํ–ฅ๋ ฅ์ด ์—†์ง€ ์•Š๋„๋ก ํ•ฉ๋‹ˆ๋‹ค.

 

์ฆ‰, Zero-gradient๋ฅผ ํ”ผํ•˜๊ธฐ ์œ„ํ•ฉ์ž…๋‹ˆ๋‹ค.

 

 

maxout๋„ ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์€ NLP ๋ถ„์•ผ์—์„œ ๊ฝค ์‚ฌ์šฉ๋ฉ๋‹ˆ๋‹ค.

 

 

linear nodes์—์„œ๋Š” activation ๋˜์ง€ ์•Š๊ณ , H1์—์„œ activation๋ฉ๋‹ˆ๋‹ค.

 

์ด H1์—์„œ linear nodes ๋“ค ์ค‘ ๊ฐ€์žฅ ํฐ ๊ฐ’๋งŒ ๋‚ด๋ณด๋ƒ…๋‹ˆ๋‹ค.

 

Weighted Summation์„ ํ•˜์ง€ ์•Š์Šต๋‹ˆ๋‹ค.

 

์‚ฌ์ „์˜ node๋“ค์„ Linearํ•˜๊ฒŒ ์กฐํ•ฉํ•˜๋˜, ๊ทธ ์ค‘์—์„œ ๊ฐ€์žฅ ํฌ๊ฒŒ activation๋œ ์• ๋งŒ linearํ•˜๊ฒŒ ํ†ต๊ณผ์‹œ์ผœ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์€ deactivation ๋˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

์•„๋ž˜ ์‚ฌ์ง„๊ณผ ๊ฐ™์ด ๋น„๊ตํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

 

 

 

๊ฒฐ๊ตญ ReLU๋กœ ์‹œ์ž‘ํ•ด์„œ ์ด๋Ÿฌํ•œ linear function์„ ์‚ฌ์šฉํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

linear function ์ข…๋ฅ˜๋ฅผ ์“ฐ๋ฉด์„œ ๊ฐ€์žฅ ํฐ ๊ฐ’์„ ๋‚ด๋ณด๋‚ธ๋‹ค ๋“ฑ์˜ non-linear transformation์„ ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

ํ•˜์ง€๋งŒ out์ด ๋‚˜๊ฐ€๋Š” ํ˜•ํƒœ ์ž์ฒด๋Š” Linearํ•˜๊ฒŒ ๋‚˜๊ฐ€๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.

 

์ด๊ฒƒ์„ ํ†ตํ•ด ๊ณ„์‚ฐ์„ ์‰ฝ๊ฒŒ ํ•˜๊ณ , gradient์˜ ๊ฐ’๋„ ํฌ๊ฒŒ ์œ ์ง€ํ•˜๋ฉด์„œ, ์—„์ฒญ๋‚œ Vanishing gradient problem์„ ์กฐ๊ธˆ์”ฉ ๊ฐœ์„ ํ•˜์—ฌ ํ”ผํ•  ์ˆ˜ ์žˆ๊ฒŒ ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 


