Word Embeddings with Python


In this article, we will see how to generate word embeddings with Python and plot the corresponding words on a chart.

Python dependencies

  • NLTK
  • Sklearn
  • Gensim
  • Plotly
  • Pandas

First, we need a paragraph or other piece of text whose words we want to embed.

<code>
paragraph = '''Jupiter is the fifth planet from the Sun and the largest in the Solar System.
It is a gas giant with a mass one-thousandth that of the Sun,
but two-and-a-half times that of all the other planets in the Solar System combined.
Jupiter is one of the brightest objects visible to the naked eye in the night sky,
and has been known to ancient civilizations since before recorded history.
It is named after the Roman god Jupiter. When viewed from Earth,
Jupiter can be bright enough for its reflected light to cast shadows,
and is on average the third-brightest natural object in the night sky after the Moon and Venus.'''
</code>

Next, we need to tokenize the text, which we do with the nltk library:

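<code>
import nltk
# download the Punkt tokenizer data once
# (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('punkt')
# tokenizing the paragraph: first into sentences, then each sentence into words
sent_text = nltk.sent_tokenize(paragraph)
word_text = [nltk.word_tokenize(sent) for sent in sent_text]
print(word_text)
</code>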

After tokenization, we get a 2D list (one list of tokens per sentence), as shown below:

[Image: tokenized output]
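To make the shape of the tokenized result concrete, here is a minimal sketch that uses only the standard library. The two regular expressions are naive stand-ins chosen for illustration and are an assumption; they are not the rules NLTK applies. For simple sentences like these, though, the nested structure is the same kind of list of token lists that Word2Vec expects.

```python
import re

text = "It is named after the Roman god Jupiter. Jupiter is a gas giant."

# Naive stand-ins for sent_tokenize / word_tokenize (assumption: NOT NLTK's rules).
# Split after sentence-ending punctuation, then pull out words and punctuation marks.
sentences = re.split(r"(?<=[.!?])\s+", text.strip())
tokens = [re.findall(r"\w+(?:-\w+)*|[^\w\s]", s) for s in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
# ['It', 'is', 'named', 'after', 'the', 'Roman', 'god', 'Jupiter', '.']
# ['Jupiter', 'is', 'a', 'gas', 'giant', '.']
```

Each inner list is one sentence, and each element is one token, with punctuation kept as its own token.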

Now we use the gensim package to obtain word embeddings from a Word2Vec model.

<code>
from gensim.models import Word2Vec

# train the model to get the embeddings
# (min_count=1 keeps every word, even those that appear only once)
model = Word2Vec(word_text, min_count=1)
</code>
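As background on what the model learns from, Word2Vec looks at each word together with the words that surround it inside a sliding window. The sketch below is a simplified illustration of that windowing step only; `context_windows` is a hypothetical helper written for this article and is not part of gensim, and the window size of 2 is chosen just to keep the output short.

```python
def context_windows(sentence, window=2):
    """Yield (center_word, context_words) for each position in a token list."""
    for i, center in enumerate(sentence):
        left = sentence[max(0, i - window):i]        # up to `window` words before
        right = sentence[i + 1:i + 1 + window]       # up to `window` words after
        yield center, left + right

tokens = ["Jupiter", "is", "a", "gas", "giant"]
for center, context in context_windows(tokens):
    print(center, context)
# Jupiter ['is', 'a']
# is ['Jupiter', 'a', 'gas']
# a ['Jupiter', 'is', 'gas', 'giant']
# gas ['is', 'a', 'giant']
# giant ['a', 'gas']
```

Words that keep showing up in similar contexts end up with similar vectors, which is why a paragraph this small gives only a rough picture.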

To plot the word embeddings, we first need to reduce the multi-dimensional embedding vectors to two dimensions. We use PCA for this:

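<code>
from sklearn.decomposition import PCA

# getting the words in the vocabulary
# (gensim 4.x; older 3.x releases used list(model.wv.vocab))
words = list(model.wv.index_to_key)
# getting the embedding vectors, one row per word
X = model.wv[words]
# dimensionality reduction using PCA
pca = PCA(n_components=2)
# running the transformation
result = pca.fit_transform(X)
</code>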
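For readers curious about what the PCA step does to the vectors, here is a minimal NumPy sketch of the same idea: center the data, find the principal directions with an SVD, and keep the first two. The function `pca_2d` and the random toy vectors are assumptions made for illustration; this is not scikit-learn's implementation, and the sign of each axis may differ from what `PCA` returns.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors onto their first two principal components."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)                 # PCA works on mean-centered data
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T                    # coordinates along the top two directions

rng = np.random.default_rng(0)
toy_vectors = rng.normal(size=(12, 100))          # 12 made-up "words", 100 dimensions each
coords = pca_2d(toy_vectors)
print(coords.shape)  # (12, 2)
```

Each row of `coords` is the 2D position of one word, which is the form needed for a scatter plot.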

We need a little processing to convert the PCA result into a DataFrame, as follows:

<code>
import pandas as pd

# creating a dataframe from the results
df = pd.DataFrame(result, columns=list('XY'))
# adding a column for the corresponding words
df['Words'] = words
# converting the lower case text to title case
df['Words'] = df['Words'].str.title()
</code>

Once we have the required DataFrame, we can plot it with plotly:

<code>
import plotly.express as px

# plotting a scatter plot
# (linear axes: PCA coordinates can be negative, so a log-scaled x axis would hide points)
fig = px.scatter(df, x="X", y="Y", text="Words", size_max=60)
# adjusting the text position
fig.update_traces(textposition='top center')
# setting up the height and title
fig.update_layout(
    height=600,
    title_text='Word embedding chart'
)
# displaying the figure
fig.show()
</code>

The word embedding chart now looks like this:

[Figure: Word embedding chart]

The complete Python code is as follows:

[Image: complete code listing]

