[NLP] Word Embedding - CBOW and Skip-Gram

2023. 3. 27. 21:34

 

๐Ÿง‘๐Ÿป‍๐Ÿ’ป ์ฃผ์š” ์ •๋ฆฌ
 
NLP
Word Embedding
CBOW

 

 

Representing the meaning of a word

 

  • Two basic neural network models:
    • Continuous Bag of Words (CBOW): uses a window of context words to predict the middle word.
    • Skip-gram (SG): uses a word to predict the surrounding ones in the window (see the sketch below).
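
As a rough illustration (a made-up five-word sentence and window size 2), the sketch below shows which words each model sees and which it predicts:

# Sketch: context/target pairs for a hypothetical sentence with window size 2.
sentence = "we are about to study".split()
center = 2                                   # the word "about"
window = sentence[center-2:center] + sentence[center+1:center+3]

# CBOW: the context window predicts the middle word
print(window, '->', sentence[center])        # ['we', 'are', 'to', 'study'] -> about

# Skip-gram: the middle word predicts each word in the window
for w in window:
    print(sentence[center], '->', w)         # about -> we, about -> are, ...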

 

 

 

์œ„์™€ ๊ฐ™์€ ์ฐจ์ด๋ฅผ ๋ณด์ž…๋‹ˆ๋‹ค.

 

 

Let's look at them one at a time.

 

# see http://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

 

First, we import the required libraries as above.

 

 

 

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right (4 context words in total)
text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split() # the raw data, split into word tokens

split_ind = int(len(text) * 0.8)  # 80/20 split index (not used further below)

# By deriving a set from `text`, we deduplicate the array
'''
First task: build a vocabulary. One-hot encoding represents a word as a vector of zeros
with a 1 only at that word's index, e.g.:

we    = (1, 0, 0, 0, 0, 0, 0)
are   = (0, 1, 0, 0, 0, 0, 0)
about = (0, 0, 1, 0, 0, 0, 0)
to    = (0, 0, 0, 1, 0, 0, 0)

Storing words this way is impractical.
'''
# splitting the text alone keeps every duplicate word
vocab = set(text) # use a set to deduplicate and build the vocabulary

vocab_size = len(vocab)
print('vocab_size:', vocab_size)


# assign an index to each word, one by one
w2i = {w: i for i, w in enumerate(vocab)}
i2w = {i: w for i, w in enumerate(vocab)}

# building such a vocabulary is the very first step in NLP
print(w2i)
print(i2w)

 

์œ„ ์‹์„ ์กฐ๊ธˆ์”ฉ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

First, CONTEXT_SIZE = 2 means we look at two words on each side of the current position.

 

That is, we look at the two words before and the two words after, and train the model to predict the word in the middle.

 

๊ทธ๋ฆฌ๊ณ , ๋ฐ์ดํ„ฐ๋ฅผ text์— ๋‹ด์•„์„œ split์œผ๋กœ ๋ถ„๋ฆฌํ•˜์—ฌ ๋‹จ์–ด ํ•˜๋‚˜ํ•˜๋‚˜์”ฉ์„ ๋‹ด์•„๋‘ก๋‹ˆ๋‹ค.

 

  • The first task is to build a vocabulary.

๊ทธ๋Ÿฐ๋ฐ,

 

if we use one-hot encoding, each word is represented by setting only that word's index to 1 in a vector such as (0, 0, 0, 0, 0, 0, 0):

we = (1, 0, 0, 0, 0, 0, 0)
are = (0, 1, 0, 0, 0, 0, 0)
about = (0, 0, 1, 0, 0, 0, 0)
to = (0, 0, 0, 1, 0, 0, 0)

 

Storing words this way is impractical.

 

์ˆ˜๋งŽ์€ ๋‹จ์–ด๋ฅผ One-hot encoding์„ ํ•˜๋Š” ๊ฒƒ์€ ์ƒ๋‹นํžˆ ๋งŽ์€ vectors๊ฐ€ ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค.

 

That is its limitation.

 

So instead, each word is simply matched to an integer index, as the small sketch below shows.
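
As a small sketch of the difference (reusing w2i and vocab_size from the code above), a word's one-hot vector is as long as the entire vocabulary, while its index is a single integer:

# Sketch: one-hot vector vs. integer index for one word (uses w2i and vocab_size from above).
idx = w2i['we']                   # a single integer
one_hot = torch.zeros(vocab_size)
one_hot[idx] = 1                  # a length-49 vector with a single 1 at that index
print(idx, one_hot)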

 

Then,

 

the text is turned into a set and stored in vocab, which gives us the vocabulary.

 

 

This produces the output below.

 

vocab_size: 49
{'computers.': 0, 'other': 1, 'directed': 2, 'that': 3, 'data.': 4, 'processes.': 5, 'evolution': 6, 'pattern': 7, 'called': 8, 'programs': 9, 'effect,': 10, 'beings': 11, 'process': 12, 'We': 13, 'a': 14, 'our': 15, 'manipulate': 16, 'inhabit': 17, 'rules': 18, 'the': 19, 'The': 20, 'is': 21, 'program.': 22, 'direct': 23, 'In': 24, 'study': 25, 'computer': 26, 'with': 27, 'As': 28, 'spirits': 29, 'they': 30, 'People': 31, 'conjure': 32, 'spells.': 33, 'abstract': 34, 'are': 35, 'to': 36, 'of': 37, 'about': 38, 'things': 39, 'we': 40, 'create': 41, 'Computational': 42, 'evolve,': 43, 'by': 44, 'computational': 45, 'process.': 46, 'processes': 47, 'idea': 48}
{0: 'computers.', 1: 'other', 2: 'directed', 3: 'that', 4: 'data.', 5: 'processes.', 6: 'evolution', 7: 'pattern', 8: 'called', 9: 'programs', 10: 'effect,', 11: 'beings', 12: 'process', 13: 'We', 14: 'a', 15: 'our', 16: 'manipulate', 17: 'inhabit', 18: 'rules', 19: 'the', 20: 'The', 21: 'is', 22: 'program.', 23: 'direct', 24: 'In', 25: 'study', 26: 'computer', 27: 'with', 28: 'As', 29: 'spirits', 30: 'they', 31: 'People', 32: 'conjure', 33: 'spells.', 34: 'abstract', 35: 'are', 36: 'to', 37: 'of', 38: 'about', 39: 'things', 40: 'we', 41: 'create', 42: 'Computational', 43: 'evolve,', 44: 'by', 45: 'computational', 46: 'process.', 47: 'processes', 48: 'idea'}

 

 

์ž ์ง€๊ธˆ๊นŒ์ง€ vocabulary๋ฅผ ๋งŒ๋“ค์—ˆ์Šต๋‹ˆ๋‹ค.

 

With it, we can now build the following datasets.

 

CBOW takes t-2, t-1, t+1, and t+2 as input and produces t as output.

 

As shown below, the create_cbow_dataset function takes the text as input.

 

Indices 0 and 1 are skipped, because they do not have two words in front of them.

 

For the remaining positions, the surrounding words become the context and the word at index t becomes the target, so the dataset consists of (context, target) pairs.

 

# context window size is two


# input : t-2, t-1, t+1, t+2
# Output : t
# for each target word we need the two context words on each side

def create_cbow_dataset(text):
    data = []
    for i in range(2, len(text) - 2): # indices 0 and 1 have no two preceding words
        context = [text[i - 2], text[i - 1],
                   text[i + 1], text[i + 2]]
        target = text[i] # the surrounding words are the context; the current word t is the target
        data.append((context, target))
    return data

'''
For skip-gram, the input is the current word and the outputs are its 4 context words.

Rather than predicting all 4 outputs at once, we form one pair per context word:
t -> t-2
t -> t-1
t -> t+1
t -> t+2
and learn the context of each input word.
'''
# input : t
# Output : t-2, t-1, t+1, t+2

def create_skipgram_dataset(text):
    import random
    data = []
    for i in range(2, len(text) - 2):
        data.append((text[i], text[i-2], 1))
        data.append((text[i], text[i-1], 1))
        data.append((text[i], text[i+1], 1))
        data.append((text[i], text[i+2], 1))
        # negative sampling: pair the word with random words outside the window, labelled 0
        for _ in range(4):
            if random.random() < 0.5 or i >= len(text) - 3:
                rand_id = random.randint(0, i-1)
            else:
                rand_id = random.randint(i+3, len(text)-1)
            data.append((text[i], text[rand_id], 0))
    return data

cbow_train = create_cbow_dataset(text)
skipgram_train = create_skipgram_dataset(text)
print('cbow sample', cbow_train[0])
print('skipgram sample', skipgram_train[0])

# dataset construction is done

 

The create_skipgram_dataset function above likewise takes the text as input.

 

In its for loop, the first and second indices again do not have two words on both sides, so the loop is set up the same way, starting at index 2.

 

For the i-th word of the input text, the words at t-2, t-1, t+1, and t+2 become the outputs, stored as positive pairs with label 1.

 

Negative sampling then pairs the word with randomly chosen words outside the window, using rand_id as the index, labelled 0.

 

 

The resulting training data are stored in cbow_train and skipgram_train.

 

That completes the dataset construction.

 

 

cbow sample (['We', 'are', 'to', 'study'], 'about')
skipgram sample ('about', 'We', 1)

์œ„ ์ฝ”๋“œ์˜ ์ถœ๋ ฅ์ž…๋‹ˆ๋‹ค.

 

 

 

Now let's look at the CBOW and SkipGram models below.

 

class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size) # the 4 context embeddings are concatenated, then passed through a hidden layer
        self.linear2 = nn.Linear(hidden_size, vocab_size) # one score per vocabulary word, e.g. (0, 0, 0, 0, 0, 1, 0, 0) marks the word at that index
        
    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out)
        return log_probs

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embd_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
    
    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1))
        embed_ctx = self.embeddings(context).view((1, -1))
        score = torch.mm(embed_focus, torch.t(embed_ctx))
        log_probs = F.logsigmoid(score)
    
        return log_probs

 

The code above shows how the inputs, the word vectors, and the embedding layer fit together.

 

ํ•ด๋‹น ํ•จ์ˆ˜๋“ค์— ๋Œ€ํ•œ input ๊ฐ’์„ ์ž˜ ์‚ดํŽด๋ณด์‹œ๊ธธ ๋ฐ”๋ž๋‹ˆ๋‹ค.

 

 

The training code for these models is as follows.

 

embd_size = 100
learning_rate = 0.001
n_epoch = 30

def train_cbow():
    hidden_size = 64
    losses = []
    loss_fn = nn.NLLLoss()
    model = CBOW(vocab_size, embd_size, CONTEXT_SIZE, hidden_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)

    for epoch in range(n_epoch):
        total_loss = .0
        for context, target in cbow_train:
            ctx_idxs = [w2i[w] for w in context]
            ctx_var = Variable(torch.LongTensor(ctx_idxs))

            model.zero_grad()
            log_probs = model(ctx_var)

            loss = loss_fn(log_probs, Variable(torch.LongTensor([w2i[target]])))

            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        losses.append(total_loss)
    return model, losses

def train_skipgram():
    losses = []
    loss_fn = nn.MSELoss()
    model = SkipGram(vocab_size, embd_size)
    print(model)
    optimizer = optim.SGD(model.parameters(), lr=learning_rate)
    
    for epoch in range(n_epoch):
        total_loss = .0
        for in_w, out_w, target in skipgram_train:
            in_w_var = Variable(torch.LongTensor([w2i[in_w]]))
            out_w_var = Variable(torch.LongTensor([w2i[out_w]]))
            
            model.zero_grad()
            log_probs = model(in_w_var, out_w_var)
            loss = loss_fn(log_probs[0], Variable(torch.Tensor([target])))
            
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        losses.append(total_loss)
    return model, losses
    
cbow_model, cbow_losses = train_cbow()
sg_model, sg_losses = train_skipgram()

 

The training loop consists of the usual ingredients: computing the loss, calling the model, taking an optimizer step, and repeating over epochs.

 

์ž์„ธํ•œ ๋‚ด์šฉ์€ ์ƒ๋žตํ•˜๊ฒ ์Šต๋‹ˆ๋‹ค.

 

์ถœ๋ ฅ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์Šต๋‹ˆ๋‹ค.

 

CBOW(
  (embeddings): Embedding(49, 100)
  (linear1): Linear(in_features=400, out_features=64, bias=True)
  (linear2): Linear(in_features=64, out_features=49, bias=True)
)
<ipython-input-5-49cae8ab3769>:12: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_probs = F.log_softmax(out)
SkipGram(
  (embeddings): Embedding(49, 100)
)

 

 

 

 

 

Next comes the test part.

 

# test
# You have to use other dataset for test, but in this case I use training data because this dataset is too small
def test_cbow(test_data, model):
    print('====Test CBOW===')
    correct_ct = 0
    for ctx, target in test_data:
        ctx_idxs = [w2i[w] for w in ctx]
        ctx_var = Variable(torch.LongTensor(ctx_idxs))

        model.zero_grad()
        log_probs = model(ctx_var)
        _, predicted = torch.max(log_probs.data, 1)
        predicted_word = i2w[predicted.item()] # predicted is a tensor; .item() converts it to a plain Python value
        print('predicted:', predicted_word)
        print('label    :', target)
        if predicted_word == target:
            correct_ct += 1
            
    print('Accuracy: {:.1f}% ({:d}/{:d})'.format(correct_ct/len(test_data)*100, correct_ct, len(test_data)))

def test_skipgram(test_data, model):
    print('====Test SkipGram===')
    correct_ct = 0
    for in_w, out_w, target in test_data:
        in_w_var = Variable(torch.LongTensor([w2i[in_w]]))
        out_w_var = Variable(torch.LongTensor([w2i[out_w]]))

        model.zero_grad()
        log_probs = model(in_w_var, out_w_var)
        _, predicted = torch.max(log_probs.data, 1)
        predicted = predicted[0]
        if predicted == target:
            correct_ct += 1

    print('Accuracy: {:.1f}% ({:d}/{:d})'.format(correct_ct/len(test_data)*100, correct_ct, len(test_data)))

test_cbow(cbow_train, cbow_model)
print('------')
test_skipgram(skipgram_train, sg_model)

 

The test above uses the same dataset as the training.

 

๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๊ฐ€ ์ž‘๊ธฐ ๋•Œ๋ฌธ์— ๊ฒฐ๊ณผ๊ฐ€ ์•„๋‹Œ ์ฝ”๋“œ ๊ทธ ์ž์ฒด๋ฅผ ๋ณด๊ณ  ๊ณต๋ถ€ํ•ด์ฃผ์„ธ์š”.

 

Only part of the output is attached.

 

====Test CBOW===
predicted: about
label    : about
predicted: to
label    : to
predicted: study
label    : study
predicted: the
label    : the
predicted: idea
label    : idea
predicted: of
label    : of
predicted: a
label    : a
:
:
:
predicted: with
label    : with
Accuracy: 100.0% (58/58)
------
====Test SkipGram===
Accuracy: 50.0% (232/464)
<ipython-input-5-49cae8ab3769>:12: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
  log_probs = F.log_softmax(out)

 

 

์ž, ๊ฒฐ๊ณผ๋ฅผ ์‚ดํŽด๋ณด๊ฒ ์Šต๋‹ˆ๋‹ค.

 

In the results above, CBOW reached 100% accuracy.

 

However, SkipGram's accuracy is only 50%.

 

In both cases the training set and the test set are identical, so why the difference?

 

SkipGram์€ ์ž˜ ๋ชป ๋งž์ถ”๋Š” ์ด์œ .

data์— ๋Œ€ํ•ด cbow๋Š” ์ž…๋ ฅ 4๊ฐœ์— ๋Œ€ํ•ด output 1๊ฐœ, skipgram์€ ์ •ํ™•๋„๊ฐ€ ๋” ๋‚ฎ์„๊นŒ?

-> ๋‘ ๊ฐœ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๋งŒ๋“  ํ˜•์‹์ด, cbow๋Š” ์ž…๋ ฅ 4๊ฐœ์— ๋Œ€ํ•ด 1๊ฐœ๋ฅผ ๋งž์ถ”๋Š” ๊ฒƒ.
-> skipgram์€ ํŠน์ • ๋‹จ์— ๋Œ€ํ•œ 4๊ฐœ์˜ ๋‹จ์–ด๋ฅผ ํ•™์Šตํ–ˆ์Œ.

t๊ฐ€ ๋“ค์–ด์™€๋„ t -2, t-1, t+1, t+2๊ฐ€ ๋Œ€์ƒ์ด๋ฏ€๋กœ ์ •ํ™•๋„๊ฐ€ ๋–จ์–ด์ง„๋‹ค.
์ •ํ™•๋„๊ฐ€ ๋‚ฎ๋‹ค๊ณ  ์•ˆ ์ข‹์€ ๋ชจ๋ธ์ด ์•„๋‹ˆ๋‹ค. ๋‹จ์–ด๋ฅผ ๋งž์ถ”๋Š” ๊ฒŒ ๋ชฉ์ ์ด ์•„๋‹ˆ๋‹ค. Weight๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๊ฒŒ ๋ชฉ์ ์ด๋‹ค. ์ผ๋ฐ˜์ ์œผ๋กœ skipgram์ด์ด ๋” ์„ฑ๋Šฅ์ด ์ข‹๋‹ค. cbow๋Š” 4๊ฐœ๋ฅผ ๋ฐ›๊ณ  ํ•˜๋‚˜์— ๋Œ€ํ•ด ํ•™์Šตํ•˜๋ฉฐ, ๊ต‰์žฅํžˆ specificํ•œ ๊ฒฝ์šฐ์— ๋Œ€ํ•ด์„œ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด๋‹ค.

 

For that reason, skip-gram is generally considered the stronger model; a short sketch of using its weights follows below.
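
Since those weights are the real product, here is a minimal sketch of pulling word vectors out of the trained skip-gram model and comparing two of them; the cosine helper and the chosen words are only for illustration:

# Sketch: the embedding matrix is what we actually keep and reuse.
emb = sg_model.embeddings.weight.data     # shape (vocab_size, embd_size) = (49, 100)

def cosine(u, v):
    return torch.dot(u, v) / (u.norm() * v.norm())

vec_a = emb[w2i['process']]
vec_b = emb[w2i['processes']]
print(cosine(vec_a, vec_b))               # similarity between two related words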

 

 

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

def showPlot(points, title):
    fig, ax = plt.subplots()
    ax.plot(points)
    ax.set_title(title)

showPlot(cbow_losses, 'CBOW Losses')
showPlot(sg_losses, 'SkipGram Losses') # the difference in training loss can be seen

 

์œ„์™€ ๊ฐ™์ด loss๋ฅผ ํ†ตํ•ด ํ•™์Šต ์ฐจ์ด๋ฅผ ์‚ดํŽด๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

 

[Plot: CBOW losses]

[Plot: SkipGram losses]
