Sentence Classification with Bi-LSTM

2019-03-29 01:09:42 編程小寶

In this article I mainly discuss the sentence classification task using deep learning models, specifically the Bi-LSTM.

Introduction

For sentence classification there are two main approaches:

Bag-of-words (BOW) models
Deep neural network models

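A BOW model treats every word independently and encodes each one on its own. With BOW we can apply TF-IDF weighting, but it does not preserve the context of a word within its sentence.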
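
To make the loss of context concrete, here is a minimal sketch using only the standard library. The helper name `bag_of_words` and the two sentences are my own illustration, not part of the original pipeline, and the encoding is plain word counts rather than TF-IDF.

```python
from collections import Counter


def bag_of_words(sentence):
    """Count each lowercase, whitespace-separated word; word order is discarded."""
    return Counter(sentence.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences contain exactly the same words, so their encodings are
# identical even though the meaning is reversed.
same_encoding = (a == b)
print(same_encoding)
```

Because the two encodings are equal, any classifier built on them cannot tell the sentences apart, which is the limitation the sequence model is meant to address.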

So, to get better performance on tasks such as named entity extraction and sentiment analysis, we use deep neural networks.

Python implementation

Dataset:

In this article I use a Reddit machine learning dataset built around four emotion categories: rage, happy, gore, and creepy.

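For a deep neural model we need to embed the text. An embedding represents each word as a point in a high-dimensional space: a vector that is learned from the contexts in which the word appears. We can use pretrained embeddings such as GloVe or fastText, which were trained on billions of documents, or we can use the gensim package to train our own embeddings on our own corpus.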
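
As a rough illustration of what "vector representation" buys us, the sketch below compares words by cosine similarity. The three-dimensional vectors are invented for this example (real GloVe vectors have 25 to 200 dimensions), and `cosine_similarity` is a helper I define here.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hypothetical toy vectors, made up for illustration only.
toy_vectors = {
    "happy": np.array([0.9, 0.1, 0.0]),
    "glad": np.array([0.8, 0.2, 0.1]),
    "gore": np.array([0.0, 0.1, 0.9]),
}

sim_close = cosine_similarity(toy_vectors["happy"], toy_vectors["glad"])
sim_far = cosine_similarity(toy_vectors["happy"], toy_vectors["gore"])
print(sim_close > sim_far)
```

With trained embeddings, words used in similar contexts tend to end up close together in this sense, which is the property the classifier relies on.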

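In this article I use the pretrained glove-twitter embeddings, which suit the social-network nature of our data. I picked the 100-dimensional version: it performs well and does not take too long to train.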

import os
import numpy as np

embedding_path = os.path.expanduser("~/glove.twitter.27B.100d.txt")  # change to your local path

# build the word -> vector dictionary from the GloVe text file
def get_word2vec(file_path):
    word2vec = dict()
    try:
        with open(file_path, "r", encoding="utf-8") as file:
            for line in file:
                parts = line.rstrip().split(' ')
                # the first token is the word, the rest are the vector components
                word2vec[parts[0]] = np.array([float(val) for val in parts[1:]])
    except OSError:
        print("invalid file path")
    return word2vec

w2v = get_word2vec(embedding_path)
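
The GloVe text format is simply one word per line followed by its vector components, separated by spaces. The sketch below parses two made-up lines from an in-memory buffer so the format can be checked without downloading the real file; `parse_glove` and the sample values are assumptions for illustration.

```python
import io

import numpy as np

# Two invented lines in the GloVe text layout: a word, then its components.
sample = io.StringIO("hello 0.1 0.2 0.3\nworld -0.4 0.5 0.6\n")


def parse_glove(lines):
    """Map each word to a NumPy vector, reading one line at a time."""
    vectors = {}
    for line in lines:
        word, *values = line.rstrip().split(" ")
        vectors[word] = np.array([float(v) for v in values])
    return vectors


toy_w2v = parse_glove(sample)
print(toy_w2v["world"])
```

Reading line by line, rather than loading the whole file into memory first, matters for the real file, which is large.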

Preprocessing the text:

The data comes in four files, one per emotion, so we need to merge them to set up the multi-class classification task.

import os
import numpy as np
import pandas as pd

# dir_path must point to the directory that holds the four CSV files
df_rage = pd.read_csv(os.path.join(dir_path, 'processed_rage.csv'))
df_happy = pd.read_csv(os.path.join(dir_path, 'processed_happy.csv'))
df_gore = pd.read_csv(os.path.join(dir_path, 'processed_gore.csv'))
df_creepy = pd.read_csv(os.path.join(dir_path, 'processed_creepy.csv'))
# build a balanced dataset by keeping the same number of rows (the first `length`) from every category
length = np.min([len(df_rage), len(df_happy), len(df_creepy), len(df_gore)])
df_final = pd.concat([df_rage[:length], df_happy[:length], df_gore[:length], df_creepy[:length]], ignore_index=True)
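
The listing above balances the classes by keeping the first `length` rows of each file. If the rows are ordered in some meaningful way, a random sample may be preferable. Here is a small sketch of that variant on toy frames; the frame contents and the `random_state` value are placeholders of mine, not taken from the dataset.

```python
import pandas as pd

# Toy frames standing in for the per-category CSV files (made-up rows).
frames = {
    "rage": pd.DataFrame({"title": ["a", "b", "c"], "subreddit": "rage"}),
    "happy": pd.DataFrame({"title": ["d", "e"], "subreddit": "happy"}),
}

# The smallest category decides how many rows every category contributes.
length = min(len(df) for df in frames.values())

balanced = pd.concat(
    [df.sample(n=length, random_state=0) for df in frames.values()],
    ignore_index=True,
)
counts = balanced["subreddit"].value_counts().to_dict()
print(counts)
```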

Tokenization:

To break sentences into simpler tokens or words, we tokenize the text. Here we use the nltk Tweet tokenizer, because it handles social-network text well.

import nltk
# download the corpora before they are used
nltk.download('wordnet')
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
from nltk.tokenize import TweetTokenizer

stop_words = set(stopwords.words('english'))
tknzr = TweetTokenizer()

def get_lemma(word):
    # fall back to the word itself when WordNet has no base form for it
    lemma = wn.morphy(word)
    return word if lemma is None else lemma

def get_tokens(sentence):
    # tokens = nltk.word_tokenize(sentence)  # replaced by the tweet tokenizer
    tokens = tknzr.tokenize(sentence)
    # drop stop words and single-character tokens
    tokens = [token for token in tokens if (token not in stop_words and len(token) > 1)]
    tokens = [get_lemma(token) for token in tokens]
    return tokens

token_list = df_final['title'].apply(get_tokens)

Preparing the input variables

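# t is the Keras Tokenizer fitted on token_list (see the vocabulary-size step further down)
max_len = 60  # fixed sequence length; I use 60
# integer-encode the documents
encoded_docs = t.texts_to_sequences(token_list)
# pad every document to max_len tokens (zeros are appended at the end)
max_length = max_len
X = pad_sequences(encoded_docs, maxlen=max_length, padding='post')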
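
To show what post-padding does to the integer sequences, here is a simplified pure-Python sketch. `pad_post` is a hypothetical stand-in, not the Keras function: it appends zeros at the end and, like Keras's default `truncating='pre'`, drops items from the front of sequences that are too long.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad each sequence at the end to exactly maxlen items (maxlen >= 1)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


padded = pad_post([[5, 2], [7, 1, 4, 9, 3]], maxlen=4)
print(padded)  # [[5, 2, 0, 0], [1, 4, 9, 3]]
```

Every row now has the same length, which is what lets the batch be fed to the network as a single rectangular array.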

Output variable:

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
Y_new = df_final['subreddit'] 

Y_new = le.fit_transform(Y_new)
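
For reference, this is roughly how `LabelEncoder` maps the four category names to integers. The label list is made up for the example; the encoder assigns indices in sorted order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

labels = ["rage", "happy", "gore", "creepy", "happy"]  # illustrative labels

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so the index of each name is its encoded value.
mapping = {str(name): idx for idx, name in enumerate(encoder.classes_)}
decoded = [str(name) for name in encoder.inverse_transform(encoded)]
print(mapping)
```

Keeping the fitted encoder around is useful later, since `inverse_transform` turns predicted integers back into category names.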

Splitting the data into training and test sets

# split into training and test data (80% / 20%)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y_new, test_size=0.20, random_state=4)

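Baseline model

Before scoring the LSTM model, I collected some metrics from a baseline model:

For the baseline, we can simply take the mean of the word2vec embeddings of the words in each sentence.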

import numpy as np

# wraps a word2vec dictionary (word -> vector array) and
# turns each sentence into the mean of its word vectors
class MeanVect(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(next(iter(word2vec.values())))

    # X is an iterable of token lists, one list per sentence
    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
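
The averaging step can be checked on a toy vocabulary. The two-dimensional vectors and the function `mean_vector` below are my own simplified illustration of the same idea, including the zero-vector fallback for sentences with no known words.

```python
import numpy as np

# Invented 2-dimensional embeddings, for illustration only.
toy_w2v = {"good": np.array([1.0, 0.0]), "day": np.array([0.0, 1.0])}


def mean_vector(words, word2vec, dim):
    """Average the vectors of the known words; all zeros if none are known."""
    known = [word2vec[w] for w in words if w in word2vec]
    return np.mean(known, axis=0) if known else np.zeros(dim)


sentences = [["good", "day"], ["unknown"]]
features = np.array([mean_vector(s, toy_w2v, 2) for s in sentences])
print(features)
```

Each sentence becomes one fixed-length row, which is the kind of input a classical classifier such as an SVM expects.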

SVM

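from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# grid-search an SVM over linear and RBF kernels
def svm_wrapper(X_train, Y_train):
    param_grid = [
        {'C': [1, 10], 'kernel': ['linear']},
        {'C': [1, 10], 'gamma': [0.1, 0.01], 'kernel': ['rbf']},
    ]
    svm = GridSearchCV(SVC(), param_grid)
    svm.fit(X_train, Y_train)
    return svm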

Metrics

from sklearn.metrics import accuracy_score

# svm
svm = svm_wrapper(X_train, Y_train)
Y_pred = svm.predict(X_test)
score = accuracy_score(Y_test, Y_pred)
print("accuracy :", score)

0.70

For the baseline you can go on to try other classifiers (random forest, for example), but I got the best F1 score with the SVM.
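
The snippet above prints accuracy, while the comparison is made on F1. One plausible way to compute an F1 score for a multi-class problem is macro averaging with scikit-learn, sketched here on made-up labels. Whether macro, micro, or weighted averaging was used for the article's numbers is not stated, so treat the choice as an assumption.

```python
from sklearn.metrics import f1_score

# Made-up true and predicted labels for three classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# Macro averaging computes F1 per class and takes the unweighted mean.
macro_f1 = f1_score(y_true, y_pred, average="macro")
print(round(macro_f1, 4))
```

Here the per-class scores are 1.0, 2/3, and 0.8, so the macro average is about 0.82.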

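For neural models in a language setting, the most popular choice is the LSTM (long short-term memory), a type of RNN (recurrent neural network) that preserves long-range dependencies in the text.

Bidirectional LSTM:

For the bidirectional LSTM we have an embedding layer. Instead of initializing it with random weights, we load the weights from our GloVe embeddings.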

# build the embedding matrix from the GloVe vectors
# (t and vocab_size come from the tokenizer step that follows)
from numpy import zeros
embedding_matrix = zeros((vocab_size, 100))
for word, i in t.word_index.items():
    embedding_vector = w2v.get(word)
    if embedding_vector is not None:
        # words without a GloVe vector keep an all-zero row
        embedding_matrix[i] = embedding_vector
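
The same construction on a toy vocabulary shows which rows end up filled. The `word_index` and vectors are invented; the `coverage` figure is an extra I added as a quick check of how many vocabulary words the pretrained embeddings actually cover.

```python
import numpy as np

# Made-up tokenizer index (index 0 is reserved for padding) and embeddings.
word_index = {"happy": 1, "gore": 2, "zzzunknown": 3}
toy_w2v = {"happy": np.array([0.1, 0.2]), "gore": np.array([0.3, 0.4])}
dim = 2

matrix = np.zeros((len(word_index) + 1, dim))
for word, i in word_index.items():
    vec = toy_w2v.get(word)
    if vec is not None:
        matrix[i] = vec  # row i holds the pretrained vector of word i

# Fraction of vocabulary words that have a pretrained vector.
coverage = sum(w in toy_w2v for w in word_index) / len(word_index)
print(matrix)
print(coverage)
```

Row 0 (padding) and the row of the unknown word stay at zero; a low coverage value would suggest the tokenization does not match the embedding vocabulary well.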

Compute the vocabulary size for the neural model.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
# prepare tokenizer
t = Tokenizer()
t.fit_on_texts(token_list)
vocab_size = len(t.word_index) + 1  # +1 because index 0 is reserved for padding
# integer-encode the documents
encoded_docs = t.texts_to_sequences(token_list)
# pad every document to max_len tokens (zeros are appended at the end)
max_len = 60
max_length = max_len
X = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
y = Y_new

Final model

from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense, Flatten

# main model
inputs = Input(shape=(max_len,))
model = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_len)(inputs)
model = Bidirectional(LSTM(100, return_sequences=True, dropout=0.50), merge_mode='concat')(model)
model = TimeDistributed(Dense(100, activation='relu'))(model)
model = Flatten()(model)
model = Dense(100, activation='relu')(model)
# one output unit per emotion class (rage, happy, gore, creepy)
output = Dense(4, activation='softmax')(model)
model = Model(inputs, output)
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

For our neural model, max_len above must be fixed. It can be the number of words in the longest sentence, or a static value. I set it to 60.

Model summary
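
The per-layer output shapes and most parameter counts of the model above can be worked out by hand. The sketch below does that arithmetic, assuming max_len = 60, 100-dimensional embeddings, and four output units (one per emotion category); the embedding layer's count is left out because it depends on the vocabulary size, which is not given. The helper names are mine.

```python
max_len, emb_dim, lstm_units, dense_units, n_classes = 60, 100, 100, 100, 4


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """A weight matrix plus one bias per unit."""
    return input_dim * units + units


# Output shape of each layer, excluding the batch dimension.
shapes = [
    ("Embedding", (max_len, emb_dim)),
    ("Bidirectional(LSTM), concat", (max_len, 2 * lstm_units)),
    ("TimeDistributed(Dense)", (max_len, dense_units)),
    ("Flatten", (max_len * dense_units,)),
    ("Dense", (dense_units,)),
    ("Dense (softmax)", (n_classes,)),
]

params = {
    "bilstm": 2 * lstm_params(emb_dim, lstm_units),  # forward + backward
    "time_distributed": dense_params(2 * lstm_units, dense_units),
    "dense": dense_params(max_len * dense_units, dense_units),
    "output": dense_params(dense_units, n_classes),
}

for name, shape in shapes:
    print(name, shape)
print(params)
```

Most of the trainable weights outside the embedding sit in the dense layer after Flatten, since it sees all 6000 flattened values at once.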

Fitting the training data to the model:

# `epochs` replaces the deprecated `nb_epoch` argument
model.fit(X_train, Y_train, validation_split=0.25, epochs=10, verbose=2)

Results

Evaluating the model

# evaluate the model
loss, accuracy = model.evaluate(X_test, Y_test, verbose=2)
print('Accuracy: %f' % (accuracy*100))

Accuracy: 74.593496

Classification report

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
Y_pred = model.predict(X_test)
# pick the class with the highest predicted probability for each sample
y_pred = np.argmax(Y_pred, axis=1)
print('Classification Report:\n', classification_report(Y_test, y_pred), '\n')
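
The listing imports `confusion_matrix` but never calls it. As a small sketch of how it could be used alongside the report, the example below turns made-up softmax outputs into class predictions and tabulates them; the probabilities and labels are invented.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Invented softmax outputs for four samples and four classes.
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.6, 0.2, 0.1, 0.1],
])
y_true = np.array([0, 1, 3, 2])

# The predicted class is the index of the largest probability in each row.
y_pred = np.argmax(probs, axis=1)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=[0, 1, 2, 3])
print(cm)
```

Off-diagonal cells show which categories get confused with each other, which the per-class precision and recall in the report only hint at.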