[NLP] Word Embedding - GloVe

2023. 3. 31. 17:38
๐Ÿง‘๐Ÿป‍๐Ÿ’ป ์ฃผ์š” ์ •๋ฆฌ
 
NLP
Word Embedding
GloVe

 

์ด๋ฒˆ์—๋Š” GloVe์— ๋Œ€ํ•ด ์•Œ์•„๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

GloVe is easiest to understand if you approach it as a statistics-based take on Word2Vec.

 

Word2Vec uses softmax regression to preserve similarity in word meaning, so that words with similar semantics end up with similar vectors.
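
For reference, the skip-gram flavor of this softmax can be written as follows, where v_c is the center-word vector, u_o a context-word vector, and V the vocabulary (standard notation, added here for clarity rather than taken from the original post):

    P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}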

 

Word2Vec ์€ context words distribution ์ด ๋น„์Šทํ•œ ๋‘ ๋‹จ์–ด๊ฐ€ ๋น„์Šทํ•œ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋ฅผ ์ง€๋‹ˆ๋„๋ก ํ•™์Šตํ•จ๊ณผ ๋™์‹œ์—, co-occurrence ๊ฐ€ ๋†’์€ ๋‹จ์–ด๋“ค์ด ๋น„์Šทํ•œ ์ž„๋ฒ ๋”ฉ ๋ฒกํ„ฐ๋ฅผ ์ง€๋‹ˆ๋„๋ก ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.

 

์œ„ ๋‚ด์šฉ์„ ์ฐธ๊ณ ๋กœ, GloVe์™€ Word2Vec์˜ ์ฐจ์ด๋ฅผ ์ž˜ ์‚ดํŽด๋ณด๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.

Introduction to GloVe

Word2Vec ์€ ํ•˜๋‚˜์˜ ๊ธฐ์ค€ ๋‹จ์–ด์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋กœ ๋ฌธ๋งฅ ๋‹จ์–ด์˜ ๋ฒกํ„ฐ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค.

 

 

GloVe ์˜ ๋‹จ์–ด ๋ฒกํ„ฐ ํ•™์Šต ๋ฐฉ์‹์€ ์ด์™€ ๋น„์Šทํ•˜๋ฉด์„œ๋„ ๋‹ค๋ฆ…๋‹ˆ๋‹ค. Co-occurrence ๊ฐ€ ์žˆ๋Š” ๋‘ ๋‹จ์–ด์˜ ๋‹จ์–ด ๋ฒกํ„ฐ๋ฅผ ์ด์šฉํ•˜์—ฌ co-occurrence ๊ฐ’์„ ์˜ˆ์ธกํ•˜๋Š” regression ๋ฌธ์ œ๋ฅผ ํ’‰๋‹ˆ๋‹ค.

 

  • Latent Semantic Analysis
    • Pro : efficiently leverage statistical information
    • Con : relatively poor on the word analogy task
  • Word2Vec
    • Pro : do better on the analogy task
    • Con : poorly utilize the statistics of the corpus
      • They focus on local context windows instead of on global co-occurrence counts.

 

์ด ์‚ฌ์‹ค์„ ๋ฐ”ํƒ•์œผ๋กœ,

 

๊ธฐ์กด์˜ ๊ฒƒ๋“ค์ธ Word2Vec๋Š” ๋นˆ๋„์ˆ˜๋ฅผ ๋ฐ˜์˜ํ•˜์ง€ ์•Š์ง€๋งŒ, GloVe์—์„œ๋Š” ๋นˆ๋„์ˆ˜๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ emvedding์„ ๊ตฌํ˜„ํ•˜๊ฒ ๋‹ค๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

In other words, GloVe is frequency-based learning.

 
Source: https://lovit.github.io/nlp/representation/2018/09/05/glove/

 

 

The idea is as follows.

 

💡 Basic idea : The inner product of the embeddings of two words needs to be close to (the log of) their co-occurrence frequency.
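
In symbols (standard GloVe notation), the relation being fitted is

    w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j \approx \log X_{ij}

where w_i is the word vector, \tilde{w}_j the context vector, b_i and \tilde{b}_j their biases, and X_{ij} the co-occurrence count of words i and j.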

 

 

Source: https://becominghuman.ai/mathematical-introduction-to-glove-word-embedding-60f24154e54c?gi=c57f02bbc23a

 

์œ„์™€ ๊ฐ™์€ formula๋ฅผ ๊ฐ€์ง‘๋‹ˆ๋‹ค.

 

 

The weights w and biases b are trained so that the dot product of the two words' vectors, plus their bias terms, comes close to the log of their co-occurrence count.

 

์ฆ‰, ๊ฐ๊ฐ์ด ๊ฐ™์ด ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๊ตฌํ•˜์—ฌ dataset์—์„œ ํ•ด๋‹น ํ™•๋ฅ ๋“ค์„ ๊ตฌํ•˜๋Š” ๊ฒƒ์ž…๋‹ˆ๋‹ค.

 

 

Ultimately, we need to find the parameter values that make this embedding error small.

 

Each word enters the formula as its embedded vector.

 

 

์œ„ ์ˆ˜์‹์—์„œ weight๋ฅผ ๋ณด๋ฉด,

 

๊ณผ์ •์€ ์ด๋Ÿฌํ•ฉ๋‹ˆ๋‹ค.

 

 

์ด๋ฅผ ํ†ตํ•ด ์šฐ๋ฆฌ๋Š” ๋‹ค์Œ์˜ ์‹์„ ์–ป์Šต๋‹ˆ๋‹ค.

 

 

 

 

Finally, putting everything together, the objective looks like this.
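
For reference, the final objective from the GloVe paper (Pennington et al., 2014), in standard notation, is

    J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2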

 
Here, the weighting term f is included to damp the updates coming from very high-frequency words.
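
Concretely, the weighting function used in the paper is

    f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases}

with x_max = 100 and alpha = 3/4 in the paper. It caps the influence of very frequent pairs, and since f(0) = 0, pairs that never co-occur contribute nothing.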

 

์ˆ˜์‹์œผ๋กœ๋ถ€ํ„ฐ ์šฐ๋ฆฌ๊ฐ€ ์•Œ์•„์•ผ ํ•  ๊ฒƒ์€, ๋™์‹œ์— ๋ฐœ์ƒํ•˜๋Š” frequency๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ ํ•™์Šตํ•œ๋‹ค๋Š” ๊ฐœ๋…์„ ์ดํ•ดํ•˜๋Š” ๊ฒƒ์ด ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

 

You can see what embeddings trained with GloVe look like by querying them for similar words.
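
One easy way to inspect such results is to query pretrained GloVe vectors; here is a usage sketch with gensim's downloader ("glove-wiki-gigaword-50" is one of gensim's hosted pretrained sets, and the call downloads it on first use):

    import gensim.downloader as api

    glove = api.load("glove-wiki-gigaword-50")  # 50-dim pretrained GloVe vectors
    print(glove.most_similar("king", topn=3))   # nearest neighbors by cosine
    # the classic analogy: king - man + woman ≈ queen
    print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))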
