[NATURAL LANGUAGE PROCESSING IN TENSORFLOW] Text to sequence and padding

These are the code examples from Week 1 (the introductory "Sentiment in text" chapter) of the Natural Language Processing in TensorFlow course, part of Coursera's DeepLearning.AI TensorFlow Developer Professional Certificate.

1) Tokenize the sentences.

2) Convert each sentence into a sequence of integer token indices.

3) Pad the sequences so that every sentence has the same length.
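To make the three steps concrete, here is a minimal plain-Python sketch of what they do, without TensorFlow. It is an approximation, not the Keras implementation: the helper names (`tokenize`, `build_word_index`, `to_sequences`, `pad_front`) are made up for illustration, and the punctuation handling is simplified.

```python
import string
from collections import Counter

# Map every ASCII punctuation character to a space (simplified filtering).
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def tokenize(sentence):
    """Step 1: lowercase, drop punctuation, split on whitespace."""
    return sentence.lower().translate(_PUNCT_TO_SPACE).split()


def build_word_index(sentences, oov_token="<OOV>"):
    """Assign indices by descending frequency; index 1 is reserved for unseen words."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence))
    # sorted() is stable, so ties keep first-seen order.
    ordered = sorted(counts, key=lambda word: -counts[word])
    word_index = {oov_token: 1}
    for i, word in enumerate(ordered, start=2):
        word_index[word] = i
    return word_index


def to_sequences(sentences, word_index, oov_token="<OOV>"):
    """Step 2: replace each word with its index (unseen words get the OOV index)."""
    oov = word_index[oov_token]
    return [[word_index.get(w, oov) for w in tokenize(s)] for s in sentences]


def pad_front(sequences, maxlen, value=0):
    """Step 3: keep the last `maxlen` items and left-pad shorter rows with `value`."""
    padded = []
    for seq in sequences:
        seq = seq[max(0, len(seq) - maxlen):]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded


sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

word_index = build_word_index(sentences)
sequences = to_sequences(sentences, word_index)
padded = pad_front(sequences, maxlen=5)

print(word_index)
print(sequences)
print(padded)
```

Index 0 is never assigned to a word, which is what lets it serve as the padding value.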


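import tensorflow as tf
from tensorflow import keras

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

# num_words caps how many of the most frequent words are used when converting
# text to sequences; words not seen during fitting map to "<OOV>"
tokenizer = Tokenizer(num_words=100, oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)  # build the word -> index vocabulary
word_index = tokenizer.word_index

# Replace each word with its integer index
sequences = tokenizer.texts_to_sequences(sentences)

# Pad or truncate every sequence to length 5 (zeros go at the front by default)
padded = pad_sequences(sequences, maxlen=5)
print("\nWord Index = ", word_index)
print("\nSequences = ", sequences)
print("\nPadded Sequences:")
print(padded)

# Try with words that the tokenizer wasn't fit to
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

# Words missing from the vocabulary are encoded with the "<OOV>" index
test_seq = tokenizer.texts_to_sequences(test_data)
print("\nTest Sequence = ", test_seq)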
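The `test_data` sentences contain words ("really", "loves", "manatee") that never appeared in the training sentences. The sketch below, again in plain Python rather than Keras, shows the effect of the `oov_token` setting: with it, unseen words are replaced by the reserved index 1; without it, they are silently dropped and the sentence gets shorter. The helpers `words` and `encode` and the `use_oov` flag are hypothetical names for illustration only.

```python
from collections import Counter

train = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]
test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]


def words(text):
    # Simplified tokenization: lowercase and treat non-alphanumerics as spaces.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()


# Vocabulary ordered by descending frequency; index 1 is reserved for "<OOV>".
counts = Counter(w for sentence in train for w in words(sentence))
vocab = sorted(counts, key=lambda w: -counts[w])
word_index = {"<OOV>": 1}
word_index.update({w: i for i, w in enumerate(vocab, start=2)})


def encode(text, use_oov=True):
    ids = []
    for w in words(text):
        if w in word_index:
            ids.append(word_index[w])
        elif use_oov:
            ids.append(word_index["<OOV>"])
        # Otherwise the unknown word is skipped entirely.
    return ids


with_oov = [encode(s) for s in test_data]
# Same numbering is kept here so the two results are easy to compare.
without_oov = [encode(s, use_oov=False) for s in test_data]

print(with_oov)
print(without_oov)
```

Keeping a placeholder for unknown words preserves sentence length and word positions, which matters once the sequences are fed to a model.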

Original source:

colab.research.google.com/github/lmoroney/dlaicourse/blob/master/TensorFlow%20In%20Practice/Course%203%20-%20NLP/Course%203%20-%20Week%201%20-%20Lesson%202.ipynb


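As an example on real data, the following code loads the sarcasm dataset (article headlines, each labeled as sarcastic or not), tokenizes the headlines, and pads them so they are ready for training a classifier.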
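The field names used below (`headline`, `is_sarcastic`, `article_link`) come from the code in this post; the two records themselves are invented placeholders, not rows from the real file. This sketch only shows how a JSON array of such records splits into parallel lists, so it runs without downloading anything.

```python
import json

# Hypothetical sample in the same layout as the dataset: a JSON array of records.
raw = """
[
  {"article_link": "https://example.com/a",
   "headline": "local man wins award",
   "is_sarcastic": 0},
  {"article_link": "https://example.com/b",
   "headline": "area cat declares itself mayor",
   "is_sarcastic": 1}
]
"""

datastore = json.loads(raw)

# One list per field; position i in each list refers to the same record.
sentences = [item["headline"] for item in datastore]
labels = [item["is_sarcastic"] for item in datastore]
urls = [item["article_link"] for item in datastore]

print(sentences)
print(labels)
```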


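# Notebook shell command (Colab/Jupyter): download the dataset to /tmp
!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sarcasm.json \
    -O /tmp/sarcasm.json

import json

# The file is a JSON array of records
with open("/tmp/sarcasm.json", 'r') as f:
    datastore = json.load(f)

# Split each record into headline text, sarcasm label, and article URL
sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# No num_words limit here; unseen words map to "<OOV>"
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(len(word_index))  # vocabulary size, including the OOV token
print(word_index)
sequences = tokenizer.texts_to_sequences(sentences)
# Zero-pad at the end of each sequence, up to the longest headline
padded = pad_sequences(sequences, padding='post')
print(padded[0])
print(padded.shape)  # (number of headlines, length of the longest sequence)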
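With `padding='post'` the zeros go after the tokens instead of before them, and because no `maxlen` is given, every row becomes as long as the longest sequence. The plain-Python function below is a rough sketch of that behavior. `pad_sequences_sketch` is a made-up name, it returns nested lists rather than a NumPy array, and it is not the Keras implementation.

```python
def pad_sequences_sketch(sequences, maxlen=None, padding="pre",
                         truncating="pre", value=0):
    """Pad or truncate each sequence to a common length."""
    if maxlen is None:
        # Default: use the length of the longest sequence.
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > maxlen:
            # "pre" drops items from the front, "post" drops them from the end.
            seq = seq[len(seq) - maxlen:] if truncating == "pre" else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + seq if padding == "pre" else seq + fill)
    return result


sequences = [[5, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

post = pad_sequences_sketch(sequences, padding="post")
pre = pad_sequences_sketch(sequences)
cut = pad_sequences_sketch(sequences, maxlen=5, padding="post", truncating="post")

print(post)                      # zeros at the end
print(pre)                       # zeros at the front
print(cut)                       # long row cut from the end
print((len(post), len(post[0]))) # rows x longest length
```

Whether to pad at the front or the back is a modeling choice; the only requirement for batching is that all rows end up with the same length.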

License: Attribution, NonCommercial, NoDerivatives

세상탐험대 Blog
