This is the code example from the Week 2 "Word Embeddings" chapter of the Natural Language Processing in TensorFlow course, part of Coursera's DeepLearning.AI TensorFlow Developer Professional Certificate.
1) Load the imdb_reviews dataset
2) Split it into training and testing data
3) Load the pre-tokenized subword vocabulary and use it as the tokenizer
4) Build a deep neural network model -> the first layer is an Embedding layer
5) Write vecs.tsv and meta.tsv files, which can be loaded into a word-visualization tool to inspect how word vectors cluster by meaning (see the note after the code)
import tensorflow as tf
print(tf.__version__)  # double-check that TF 2.x is installed

# If the tensorflow_datasets import below fails, run this first:
# !pip install -q tensorflow-datasets
import tensorflow_datasets as tfds
# Load the IMDB reviews dataset with a pre-built ~8k subword vocabulary
imdb, info = tfds.load("imdb_reviews/subwords8k", with_info=True, as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']

# The dataset ships with its own SubwordTextEncoder
tokenizer = info.features['text'].encoder
print(tokenizer.subwords)  # the learned subword vocabulary
sample_string = 'TensorFlow, from basics to mastery'

tokenized_string = tokenizer.encode(sample_string)
print('Tokenized string is {}'.format(tokenized_string))

original_string = tokenizer.decode(tokenized_string)
print('The original string: {}'.format(original_string))

# Show which subword each token id maps to
for ts in tokenized_string:
    print('{} ----> {}'.format(ts, tokenizer.decode([ts])))
BUFFER_SIZE = 10000
BATCH_SIZE = 64

# Shuffle, then pad each batch to the length of its longest sequence
train_dataset = train_data.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(train_dataset))
test_dataset = test_data.padded_batch(BATCH_SIZE, tf.compat.v1.data.get_output_shapes(test_data))
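# Note: the TF 1.x compat shim above is used only to fetch the dataset's
# output shapes. On TF 2.2 and later, padded_shapes is optional and every
# dimension is padded to the batch maximum by default, so the equivalent is:
#   train_dataset = train_dataset.padded_batch(BATCH_SIZE)
#   test_dataset = test_data.padded_batch(BATCH_SIZE)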
embedding_dim = 64

model = tf.keras.Sequential([
    # Maps each subword id to a trainable 64-dimensional vector
    tf.keras.layers.Embedding(tokenizer.vocab_size, embedding_dim),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(6, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
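# GlobalAveragePooling1D averages the embedding vectors across the sequence
# dimension, producing one fixed-length vector per review. This is what lets
# the model accept the variable-length padded batches above; Flatten would
# require a fixed sequence length.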
model.summary()
num_epochs = 10
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(train_dataset, epochs=num_epochs, validation_data=test_dataset)
import matplotlib.pyplot as plt

def plot_graphs(history, string):
    plt.plot(history.history[string])
    plt.plot(history.history['val_' + string])
    plt.xlabel("Epochs")
    plt.ylabel(string)
    plt.legend([string, 'val_' + string])
    plt.show()

plot_graphs(history, "accuracy")
plot_graphs(history, "loss")
# Pull the learned weights out of the Embedding layer
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape)  # shape: (vocab_size, embedding_dim)
import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')

# Index 0 is reserved for padding, so start from 1
for word_num in range(1, tokenizer.vocab_size):
    word = tokenizer.decode([word_num])
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
out_v.close()
out_m.close()
# Download the files when running in Colab; skip silently elsewhere
try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download('vecs.tsv')
    files.download('meta.tsv')
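To visualize the embeddings, load the two files into the TensorFlow Embedding Projector (https://projector.tensorflow.org): click "Load", upload vecs.tsv as the vectors file and meta.tsv as the metadata file, and the subwords are plotted so you can see how semantically related words cluster together.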