[Gradient Descent] part 3 - 1

2023. 1. 16. 19:13

🎯 Keyword 🎯

- Gradient Descent
- Batch Gradient Descent
- Stochastic Gradient Descent

Gradient Descent

 

The Gradient Descent algorithm described in the previous post is the "Batch Gradient Descent" algorithm.

 

Batch Gradient Descent Algorithm

- In a linear regression model, you can see θ0 and θ1 being updated by plugging in the partial derivative term of the objective function J.

- ๋น„๋ก local optimum์ด ์ทจ์•ฝํ•˜์ง€๋งŒ, ์–ด๋Š ์ •๋„ ์ˆ˜๋ ด์ด ๋˜์–ด ๊ฐ€๋Š” ์žฅ๋ฉด์„ ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

- Disadvantage: updating θ0 and θ1 requires taking every sample into account.
            We compute the error over the entire sample set, and only after accumulating it across all samples can the θ values be updated once.

            In other words, the more data samples there are, the larger the complexity becomes, as the sketch below illustrates.
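
To make this concrete, here is a minimal sketch of batch gradient descent on a toy one-variable linear regression; the synthetic data, learning rate, and iteration count are illustrative assumptions, not values from this post.

```python
import numpy as np

# Toy data for h(x) = theta0 + theta1 * x; the true line is y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=100)

theta0, theta1 = 0.0, 0.0
alpha = 0.02  # learning rate

for epoch in range(5000):
    err = (theta0 + theta1 * x) - y  # h(x) - y over ALL m samples
    # A single parameter update needs the error accumulated over the
    # whole dataset, so the cost of one step grows with m.
    grad0 = err.mean()         # dJ/d(theta0)
    grad1 = (err * x).mean()   # dJ/d(theta1)
    theta0 -= alpha * grad0    # simultaneous update of both parameters
    theta1 -= alpha * grad1

print(theta0, theta1)  # should land near 1.0 and 2.0
```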

 

 

To address this problem, the algorithm created by shrinking the number of samples m down to the extreme of 1 is called

Stochastic Gradient Descent (SGD).

 

- Advantage: it can iterate much faster than batch gradient descent.

- Disadvantage: because the parameters are updated from one sample at a time, it is easily affected by noise.
            You can see a lot of oscillation occurring during convergence, as in the sketch below.
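
For comparison, here is the same toy regression problem with a stochastic (m = 1) update; again, the data and hyperparameters are illustrative assumptions.

```python
import numpy as np

# Same toy regression task: the true line is y = 2x + 1.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=100)

theta0, theta1 = 0.0, 0.0
alpha = 0.01

for epoch in range(50):
    for i in rng.permutation(len(x)):  # visit samples in random order
        err = (theta0 + theta1 * x[i]) - y[i]
        # The update uses a SINGLE sample, so each step is cheap but
        # noisy; this per-sample noise is what causes the oscillation.
        theta0 -= alpha * err
        theta1 -= alpha * err * x[i]

print(theta0, theta1)  # noisy, but close to 1.0 and 2.0
```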

 

  • Depending on the point where Gradient Descent starts, it ends up reaching either a local optimum or the global optimum.
  • Points such as saddle points, where the gradient becomes 0 while converging along one direction, are known to exist in large numbers and can trap the algorithm in a local optimum.

 

What should we do to resolve these suboptimality problems?

 

=> ๊ธฐ์กด์˜ gradient descent algorithm์—์„œ ๋‹ค์–‘ํ•œ ๋ณ€ํ˜• algorithm๋“ค์ด ๊ฐœ๋ฐœ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

 

  • Momentum
Momentum reflects, to some extent, the direction and speed with which past gradients have been updating, providing the drive to keep training going even when the gradient at the current point becomes 0.

 

  • Method of momentum
    • Exponentially weighted moving average

 

-> an operation equivalent to low-pass filtering

-> allows the parameters to converge stably in spite of saddle points or small noisy gradient values at the current point

 

💡 Even if the gradient value happens to become 0, training can continue.
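
Here is a minimal sketch of the momentum update on a toy quadratic loss; the quadratic and the values of β and α are illustrative assumptions. Note that the velocity v is exactly the exponentially weighted moving average (low-pass filter) described above.

```python
import numpy as np

def grad(theta):
    # Gradient of a toy quadratic bowl f(theta) = theta^T A theta / 2,
    # standing in for any loss gradient.
    A = np.array([[1.0, 0.0], [0.0, 10.0]])
    return A @ theta

theta = np.array([5.0, 5.0])
v = np.zeros(2)          # velocity: an exponentially weighted moving
                         # average of past gradients (low-pass filtering)
beta, alpha = 0.9, 0.05  # beta controls how much history is kept

for _ in range(200):
    v = beta * v + grad(theta)  # carry over past direction and speed
    theta = theta - alpha * v   # even if grad(theta) is ~0, v keeps us moving

print(theta)  # approaches the minimum at (0, 0)
```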

 

 

 

-> ๋” Advanced๋œ ๋ฐฉ์‹์ธ nesterov momentum ๋ฐฉ์‹์ด ์กด์žฌํ•ฉ๋‹ˆ๋‹ค.

 

  • Nesterov momentum
    • Unlike the original method, it evaluates the gradient first, at the lookahead point, and then performs the update.
    • This is why the method is described as using a "look-ahead gradient step".

 

 

Momentum vs. Nesterov momentum

 

The momentum method adds the momentum step and the gradient step to compute the actual step.
The Nesterov momentum method first takes the momentum step, then computes the "lookahead" gradient step from that position, and the vector sum of the two gives the actual step.
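
The difference is easiest to see side by side; below is a sketch on the same toy quadratic as the momentum example above, with illustrative hyperparameters.

```python
import numpy as np

def grad(theta):
    A = np.array([[1.0, 0.0], [0.0, 10.0]])  # toy quadratic loss
    return A @ theta

alpha, beta = 0.05, 0.9
theta_m, v_m = np.array([5.0, 5.0]), np.zeros(2)  # classical momentum
theta_n, v_n = np.array([5.0, 5.0]), np.zeros(2)  # Nesterov momentum

for _ in range(200):
    # Classical: gradient evaluated at the CURRENT point, plus momentum
    v_m = beta * v_m - alpha * grad(theta_m)
    theta_m = theta_m + v_m

    # Nesterov: take the momentum step first, then evaluate a
    # "lookahead" gradient at that position
    lookahead = theta_n + beta * v_n
    v_n = beta * v_n - alpha * grad(lookahead)
    theta_n = theta_n + v_n

print(theta_m, theta_n)  # both near (0, 0); Nesterov tends to oscillate less
```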

 

 

 

  • AdaGrad
AdaGrad adaptively adjusts the learning rate along each direction to improve training efficiency.

In other words, the accumulated gradient values are used to adjust the learning rate.

์–ด๋Š ํ•œ ์ชฝ์œผ๋กœ gradient ๊ฐ’์ด ํฌ๋‹ค๋ฉด ์ด๋ฏธ ๊ทธ ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์ด ๋งŽ์ด ์ง„ํ–‰๋˜์—ˆ์Œ์„ ์˜๋ฏธํ•ฉ๋‹ˆ๋‹ค.

 

So, as gradient values pile up, Δθ becomes smaller, and the convergence speed in that direction is reduced accordingly.

 

๋‹จ์  : gradient ๊ฐ’์ด ๊ณ„์†ํ•ด์„œ ๋ˆ„์  ๋จ์— ๋”ฐ๋ผ, learning rate ๊ฐ’์ด ๊ต‰์žฅํžˆ ์ž‘์•„์ง€๊ฒŒ ๋œ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

[Figure: AdaGrad]

What happens if the learning rate gets too small?

-> Once the learning rate becomes tiny, learning effectively stops at that point.
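
A sketch of the AdaGrad update on the same toy quadratic as above; the accumulator r and the hyperparameter values are illustrative.

```python
import numpy as np

def grad(theta):
    A = np.array([[1.0, 0.0], [0.0, 10.0]])  # toy quadratic loss
    return A @ theta

theta = np.array([5.0, 5.0])
r = np.zeros(2)          # per-direction accumulation of squared gradients
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = grad(theta)
    r = r + g * g        # r only ever grows ...
    # ... so the effective learning rate alpha / sqrt(r) only ever shrinks,
    # and directions with large accumulated gradients get small steps.
    theta = theta - alpha * g / (np.sqrt(r) + eps)

print(theta)
```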

 

=> RMSProp is the algorithm that modifies this scheme.

 

  • RMSProp algorithm
The past accumulation r is multiplied by a factor ρ (rho), decaying it so that the accumulation stays under control.

Instead of the update Δθ collapsing as gradients accumulate without bound, as before, the learning speed now decreases in a dampened form.

[Figure: RMSProp]
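
The change relative to AdaGrad is essentially one line: the past accumulation is decayed by ρ instead of growing forever. Hyperparameter values below are illustrative.

```python
import numpy as np

def grad(theta):
    A = np.array([[1.0, 0.0], [0.0, 10.0]])  # toy quadratic loss
    return A @ theta

theta = np.array([5.0, 5.0])
r = np.zeros(2)
alpha, rho, eps = 0.05, 0.9, 1e-8  # rho decays the past accumulation

for _ in range(500):
    g = grad(theta)
    r = rho * r + (1.0 - rho) * g * g  # old r is scaled by rho, not kept forever
    theta = theta - alpha * g / (np.sqrt(r) + eps)

print(theta)  # the step size shrinks in a dampened way, so progress continues
```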

 

  • Adam
Adaptive moment estimation

A combination of RMSProp + Momentum.
  1. Compute the first moment (the momentum term).
  2. Compute the second moment in the RMSProp style.
  3. Apply bias correction.
  4. Update the parameters.

[Figure: Adam, steps 1-4]
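
A sketch of the four steps above on the same toy quadratic; the values of β1, β2, and α follow commonly used defaults and are illustrative here.

```python
import numpy as np

def grad(theta):
    A = np.array([[1.0, 0.0], [0.0, 10.0]])  # toy quadratic loss
    return A @ theta

theta = np.array([5.0, 5.0])
m = np.zeros(2)  # first moment: momentum-style average of gradients
r = np.zeros(2)  # second moment: RMSProp-style average of squared gradients
alpha, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g       # 1. first moment
    r = beta2 * r + (1 - beta2) * g * g   # 2. second moment (RMSProp style)
    m_hat = m / (1 - beta1 ** t)          # 3. bias correction: both moments
    r_hat = r / (1 - beta2 ** t)          #    start at 0 and would be biased
    theta = theta - alpha * m_hat / (np.sqrt(r_hat) + eps)  # 4. update

print(theta)  # converges toward (0, 0)
```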

 

 

 
