
[NATURAL LANGUAGE PROCESSING IN TENSORFLOW] Word embeddings

by 호빵님 2020. 11. 11.

This is a code example from the Week 2 "Word Embeddings" chapter of the Natural Language Processing in TensorFlow course, part of Coursera's DeepLearning.AI TensorFlow Developer Professional Certificate program.

 

1) Load the imdb_reviews dataset

2) Split it into training and testing data

3) Tokenize the sentences

4) Build a deep neural network model -> use an Embedding layer as the first layer

5) Optionally, generate vecs.tsv and meta.tsv files that can be used to visualize the learned word embeddings.

import tensorflow as tf
print(tf.__version__)
 
# !pip install -q tensorflow-datasets
 
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
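# imdb_reviews contains 25,000 labeled movie reviews each for train and test;
# as_supervised=True yields (text, label) pairs, with label 1 = positive, 0 = negative.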
 
import numpy as np
 
train_data, test_data = imdb['train'], imdb['test']
 
training_sentences = []
training_labels = []
 
testing_sentences = []
testing_labels = []
 
# s.numpy() returns a bytes object in Python 3, so decode it to a UTF-8 string
for s,l in train_data:
  training_sentences.append(s.numpy().decode('utf8'))
  training_labels.append(l.numpy())
  
for s,l in test_data:
  testing_sentences.append(s.numpy().decode('utf8'))
  testing_labels.append(l.numpy())
  
training_labels_final = np.array(training_labels)
testing_labels_final = np.array(testing_labels)
 
vocab_size = 10000   # keep only the 10,000 most frequent words
embedding_dim = 16   # dimensionality of the embedding vectors
max_length = 120     # pad/truncate every review to 120 tokens
trunc_type='post'    # truncate from the end of the sequence
oov_tok = "<OOV>"    # placeholder token for out-of-vocabulary words
 
 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
 
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)
padded = pad_sequences(sequences,maxlen=max_length, truncating=trunc_type)
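# pad_sequences zero-pads every sequence to max_length (default padding='pre');
# truncating='post' drops tokens from the end of over-long reviews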
 
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences,maxlen=max_length)
 
# Map token indices back to words so padded sequences can be inspected
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
 
def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])
 
print(decode_review(padded[3]))
print(training_sentences[3])
 
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Flatten(),  # (batch, 120, 16) -> (batch, 1920)
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')  # binary sentiment output
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
model.summary()
 
num_epochs = 10
model.fit(padded, training_labels_final, epochs=num_epochs, validation_data=(testing_padded, testing_labels_final))
 
e = model.layers[0]  # the Embedding layer
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)
 
import io
 
out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
 
try:
  from google.colab import files
except ImportError:
  pass
else:
  files.download('vecs.tsv')
  files.download('meta.tsv')
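 
# The resulting vecs.tsv and meta.tsv can be uploaded to the TensorFlow
# Embedding Projector (https://projector.tensorflow.org) via "Load"
# to visualize the learned word embeddings.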
 
sentence = "I really think this is amazing. honest."
sequence = tokenizer.texts_to_sequences([sentence])
print(sequence)
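
The last lines above only tokenize the sample sentence; to actually classify it, the sequence has to be padded to the same max_length used during training. A minimal follow-up sketch, assuming the tokenizer, model, and constants defined above:

padded_seq = pad_sequences(sequence, maxlen=max_length, truncating=trunc_type)
print(model.predict(padded_seq))  # close to 1 = positive, close to 0 = negative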

 

Instead of a Flatten layer, you can also use GlobalAveragePooling1D, as in the variant below.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
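
GlobalAveragePooling1D averages the embedding vectors across the sequence dimension, so the Dense layer that follows sees only embedding_dim inputs instead of max_length * embedding_dim, which greatly reduces the parameter count. A minimal standalone sketch of the shape difference, using random input and assuming the same max_length=120 and embedding_dim=16 as above:

import tensorflow as tf

x = tf.random.uniform((1, 120, 16))  # (batch, max_length, embedding_dim)
print(tf.keras.layers.Flatten()(x).shape)                 # (1, 1920)
print(tf.keras.layers.GlobalAveragePooling1D()(x).shape)  # (1, 16)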

 
