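Compared with Rwordseg, the jiebaR package is another efficient choice for text mining. jiebaR is easy to install, offers several segmentation engines and a large set of functions, and is updated quickly, so it is increasingly popular among data analysts. Unlike Rwordseg, its installation does not depend on a Java environment; a single command is enough to get started.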
> install.packages("jiebaR")
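Next, let's take a closer look at how to use the jiebaR package for text mining.
The worker() segmentation function
The worker() function sets options such as the segmentation type, the user dictionary, and the stop words. Its usage is as follows: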
worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05,
output = NULL, bylines = F, user_weight = "max")
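where
type is the segmentation engine type. The package provides mix, mp, hmm, full, query, tag, simhash, and keywords, which stand for the mixed model, maximum probability, hidden Markov model, full mode, query (index) mode, part-of-speech tagging, Simhash text similarity, and keyword extraction, respectively;
dict is the path to the dictionary; the default is DICTPATH;
hmm is the path to the hidden Markov model; the default is HMMPATH, and other engines such as mix also use it;
user is the path to the user-defined dictionary;
idf is the path to the inverse document frequency (IDF) file; the default is IDFPATH, and it is used by the simhash and keywords engines;
stop_word is the path to the stop-word file;
qmax is the maximum query length of a word; the default is 20, and it is used by the query engine;
topn is the number of keywords; the default is 5, and it is used by the simhash and keywords engines;
symbol controls whether symbols are kept in the output; the default is F;
lines is the maximum number of lines read from a file at one time; the default is 1e+05;
output is the output file; by default the file is named after the system time;
bylines controls whether the result for an input file is returned line by line; the default is F;
user_weight is the weight given to words from the user dictionary; the options are "min", "max", and "median".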
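To make the arguments above concrete, here is a minimal sketch that creates a few engines with different settings. It assumes jiebaR is installed and relies on its bundled dictionaries; the variable names and the sample sentence are illustrative only.

```r
library(jiebaR)

# Default engine: mixed model (type = "mix") with the bundled dictionaries.
mix_engine <- worker()

# Query engine; qmax limits the maximum query length of a word.
query_engine <- worker(type = "query", qmax = 10)

# Keyword engine that returns the top 3 keywords instead of the default 5.
keyword_engine <- worker(type = "keywords", topn = 3)

# symbol = TRUE keeps punctuation in the segmentation result.
symbol_engine <- worker(symbol = TRUE)
result <- segment("今天天氣很好,我們去公園。", symbol_engine)
print(result)
```

Each engine is an independent object, so several engines with different settings can be kept side by side in one session.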
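The segment() function
The usage of segment() is as follows:
segment(code, jiebar, mod = NULL)
where
code is the object to be segmented;
jiebar is the segmentation engine, that is, the object returned by worker();
mod overrides the engine's default segmentation type; it can be mix, hmm, query, full, level, or mp.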
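As a small sketch of how the mod argument and the operator form fit together, the snippet below segments one sentence three ways. It assumes jiebaR is installed; the sentence and variable names are made up for illustration.

```r
library(jiebaR)

engine <- worker()  # default type is "mix"
sentence <- "我們正在學習文本挖掘"

# Uses the engine's own type (mixed model).
mix_result <- segment(sentence, engine)

# Switches to maximum probability for this call only.
mp_result <- segment(sentence, engine, mod = "mp")

# The `<=` operator is a shorthand for segment().
op_result <- engine <= sentence

print(mix_result)
print(mp_result)
```

The operator form is the one used later in this article for tagging and keyword extraction.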
Here is a simple example that combines the two functions:
> library(jiebaR)
> engine <- worker()
> words <- "..."  # the passage to segment; the string itself is missing from the source
> segment(words,engine)
[1] "大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"
[15] "最" "完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向" "堅定" "志向"
[29] "堅定" "才" "能夠" "鎮靜" "不躁" "鎮靜" "不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮"
[43] "周祥" "思慮" "周祥" "才" "能夠" "有所" "收穫" "每樣" "東西" "都" "有" "根本" "有枝末" "每件"
[57] "事情" "都" "有" "開始" "有" "終結" "明白" "了" "這" "本末" "始終" "的" "道理" "就"
[71] "接近" "事物" "發展" "的" "規律" "了" "古代" "那些" "要" "想" "在" "天下" "弘揚" "光明正大"
[85] "品德" "的" "人" "先要" "治理" "好" "自己" "的" "國家" "要" "想" "治理" "好" "自己"
[99] "的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和" "家" "族" "要" "想" "管理"
[113] "好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的" "品性" "要" "想" "修養"
[127] "自身" "的" "品性" "先" "要端正" "自己" "的" "心思" "要" "想" "端正" "自己" "的" "心思"
[141] "先" "要" "使" "自己" "的" "意念" "真誠" "要" "想" "使" "自己" "的" "意念" "真誠"
[155] "先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物" "通過"
[169] "對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念" "才能"
[183] "真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性" "品性"
[197] "修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族" "後"
[211] "才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自" "國家元首"
[225] "下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本" "被"
[239] "擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的" "不分"
[253] "輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不" "可能" "的"
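Adding user-defined words or dictionaries
There are two ways to add user-defined words or a dictionary: call the new_user_word function, or pass a dictionary through the user argument of worker.
In the example above, phrases such as “志向堅定” and “鎮靜不躁” are each split into two words, although we would like to treat them as single words, so we add them as user-defined words. The following code shows how: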
> engine_new_word <- worker()
> new_user_word(engine_new_word,c("最完善","志向堅定","鎮靜不躁","思慮周祥","每樣東西","有","根本","枝末","本末始終","要想","家族","先要","端正","不可能"))
[1] TRUE
> segment(words,engine_new_word)
[1] "大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"
[15] "最完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向堅定" "志向堅定" "才" "能夠"
[29] "鎮靜不躁" "鎮靜不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮周祥" "思慮" "周祥" "才" "能夠" "有所"
[43] "收穫" "每樣東西" "都" "有" "根本" "有" "枝末" "每件" "事情" "都" "有" "開始" "有" "終結"
[57] "明白" "了" "這" "本末始終" "的" "道理" "就" "接近" "事物" "發展" "的" "規律" "了" "古代"
[71] "那些" "要想" "在" "天下" "弘揚" "光明正大" "品德" "的" "人" "先要" "治理" "好" "自己" "的"
[85] "國家" "要想" "治理" "好" "自己" "的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和"
[99] "家" "族" "要想" "管理" "好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的"
[113] "品性" "要想" "修養" "自身" "的" "品性" "先要" "端正" "自己" "的" "心思" "要想" "端正" "自己"
[127] "的" "心思" "先" "要" "使" "自己" "的" "意念" "真誠" "要想" "使" "自己" "的" "意念"
[141] "真誠" "先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物"
[155] "通過" "對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念"
[169] "才能" "真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性"
[183] "品性" "修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族"
[197] "後" "才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自"
[211] "國家元首" "下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本"
[225] "被" "擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的"
[239] "不分" "輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不可能" "的"
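We can also load a whole dictionary through the user argument of worker and read the words directly from it. This approach is recommended when there are many user-defined words. For example, we first create a file named "dictionary.txt" in the local working directory, write our user-defined words into it, and then run: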
> engine_user <- worker(user = "dictionary.txt")
> segment(words,engine_user)
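A few points to note:
The first line of the dictionary must be left blank; otherwise the first word silently fails to take effect;
If the dictionary is written with Notepad, the encoding is sometimes not UTF-8, which leads to various errors. It is better to edit the file in Notepad++, set the encoding to UTF-8, and save it as a txt file;
To add a Sogou cell dictionary, install the cidian package, which converts Sogou cell dictionaries into dictionaries that jiebaR can use.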
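The encoding issue in the notes can also be avoided by writing the dictionary from within R. The sketch below writes a small dictionary as UTF-8 with a blank first line, following the notes above. The temporary file location and the two words are assumptions chosen for illustration, and only base R functions are used.

```r
# Words to put in the user dictionary (illustrative).
user_words <- c("志向堅定", "鎮靜不躁")

# Write to a temporary location; use "dictionary.txt" in the working
# directory instead to match the file name used in the text.
dict_path <- file.path(tempdir(), "dictionary.txt")

# enc2utf8() plus useBytes = TRUE writes the strings as UTF-8 bytes
# regardless of the platform's native encoding; the leading "" leaves
# the first line blank.
writeLines(enc2utf8(c("", user_words)), dict_path, useBytes = TRUE)

# Read the file back to confirm its contents.
dict_lines <- readLines(dict_path, encoding = "UTF-8")
print(dict_lines)

# The file could then be passed to jiebaR, for example:
# engine_user <- jiebaR::worker(user = dict_path)
```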
A dictionary file can also be loaded with the new_user_word function. Example code:
> new_user_word(engine_new_word, scan("dictionary.txt",what="",sep="\n"))
Read 2 items
[1] TRUE
> segment(words,engine_new_word)
Removing stop words
In the example above, tokens such as “卻” and “想” are stop words and should be removed. To do so, we use the stop_word argument of worker.
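> engine_s <- worker(stop_word = "stopwords.txt")  # path to a stop-word file; the file name here is an assumption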
> segment(words,engine_s)
[1] "大學" "宗旨" "弘揚" "光明正大" "品德" "棄舊圖新" "完善" "境界" "知道" "境界" "志向" "堅定" "志向" "堅定"
[15] "鎮靜" "不躁" "鎮靜" "不躁" "心安理得" "心安理得" "思慮" "周祥" "思慮" "周祥" "有所" "收穫" "根本" "有枝末"
[29] "事情" "開始" "終結" "明白" "本末" "始終" "道理" "就" "接近" "事物" "發展" "規律" "天下" "弘揚"
[43] "光明正大" "品德" "治理" "國家" "治理" "國家" "管理" "家庭" "家" "族" "管理" "家庭" "家族" "修養"
[57] "自身" "品性" "修養" "自身" "品性" "要端正" "心思" "端正" "心思" "意念" "真誠" "意念" "真誠" "獲得"
[71] "知識" "獲得" "知識" "途徑" "認識" "研究" "萬事萬物" "通過" "對" "萬事萬物" "認識" "研究" "才能" "獲得"
[85] "知識" "獲得" "知識" "意念" "才能" "真誠" "意念" "真誠" "心思" "才能" "端正" "心思" "端正" "才能"
[99] "修養" "品性" "品性" "修養" "才能" "管理" "家庭" "家族" "管理" "家庭" "家族" "才能" "治理" "國家"
[113] "治理" "國家" "天下" "才能" "太平" "國家元首" "平民百姓" "修養" "品性" "為" "根本" "根本" "擾亂" "家庭"
[127] "家族" "國家" "天下" "治理" "不可" "輕重緩急" "本末倒置" "做好" "事情"
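Stop words can also be dropped after segmentation. As a minimal base-R sketch (not the stop_word mechanism of jiebaR itself), the following removes every token that appears in a stop-word vector. Both vectors are small hand-made examples.

```r
# A few tokens as they might come out of segment().
tokens <- c("卻", "想", "做好", "事情", "這", "也", "同樣")

# Hand-made stop-word list for illustration.
stop_words <- c("卻", "想", "這", "也")

# Keep only tokens that are not stop words; order and duplicates are preserved.
kept <- tokens[!tokens %in% stop_words]
print(kept)
```

This is convenient for trying out a stop-word list before saving it to a file for worker.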
Counting word frequencies
The freq function computes word frequencies automatically.
> freq(segment(words,engine_s))
char freq
1 做好 1
2 本末倒置 1
3 輕重緩急 1
4 擾亂 1
5 為 1
6 平民百姓 1
7 太平 1
8 才能 7
9 對 1
10 通過 1
11 萬事萬物 2
12 認識 2
13 途徑 1
14 知識 4
15 真誠 4
16 意念 4
17 就 1
18 有枝末 1
19 完善 1
20 周祥 2
21 思慮 2
22 收穫 1
23 心安理得 2
24 要端正 1
25 不躁 2
26 心思 4
27 獲得 4
28 鎮靜 2
29 堅定 2
30 根本 3
31 終結 1
32 品性 5
33 弘揚 2
34 大學 1
35 事物 1
36 棄舊圖新 1
37 管理 4
38 家庭 5
39 光明正大 2
40 天下 3
41 品德 2
42 事情 2
43 知道 1
44 志向 2
45 明白 1
46 國家 5
47 開始 1
48 本末 1
49 境界 2
50 有所 1
51 始終 1
52 道理 1
53 接近 1
54 研究 2
55 發展 1
56 國家元首 1
57 治理 5
58 家 1
59 家族 4
60 宗旨 1
61 族 1
62 不可 1
63 規律 1
64 修養 5
65 自身 2
66 端正 3
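The table above is not ordered by count. Here is a minimal sketch of sorting such a table by descending frequency, using base R only. The counts are built with table() on a small hand-made token vector, and the columns are named char and freq to mirror the output above.

```r
# Small hand-made token vector for illustration.
tokens <- c("國家", "治理", "國家", "家庭", "治理", "國家")

# Count the tokens and mirror the two-column layout shown above.
counts <- as.data.frame(table(tokens), stringsAsFactors = FALSE)
names(counts) <- c("char", "freq")

# Sort by descending frequency.
counts <- counts[order(counts$freq, decreasing = TRUE), ]
print(counts)
```

The same order() call can be applied to a data frame returned by freq(), since it has the same two columns.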
Part-of-speech tagging
Part-of-speech tagging uses the type argument of worker. type defaults to mix; simply set it to tag.
> tagger <- worker(type = "tag")
> tagger<=words
n uj n v nr i uj n v x i v x v
"大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"
d v uj n v v v uj n d v n a n
"最" "完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向" "堅定" "志向"
a d v a x a x d v i i d v v
"堅定" "才" "能夠" "鎮靜" "不躁" "鎮靜" "不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮"
x v x d v n v r ns d v a x d
"周祥" "思慮" "周祥" "才" "能夠" "有所" "收穫" "每樣" "東西" "都" "有" "根本" "有枝末" "每件"
n d v v v v nr ul r t d uj n d
"事情" "都" "有" "開始" "有" "終結" "明白" "了" "這" "本末" "始終" "的" "道理" "就"
v n vn uj n ul t r v v p s nr i
"接近" "事物" "發展" "的" "規律" "了" "古代" "那些" "要" "想" "在" "天下" "弘揚" "光明正大"
n uj n b v a r uj n v v v a r
"品德" "的" "人" "先要" "治理" "好" "自己" "的" "國家" "要" "想" "治理" "好" "自己"
uj n b vn a r uj n c q ng v v vn
"的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和" "家" "族" "要" "想" "管理"
a r uj n c nz b v r uj n v v v
"好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的" "品性" "要" "想" "修養"
r uj n d v r uj n v v nz r uj n
"自身" "的" "品性" "先" "要端正" "自己" "的" "心思" "要" "想" "端正" "自己" "的" "心思"
d v v r uj n a v v v r uj n a
"先" "要" "使" "自己" "的" "意念" "真誠" "要" "想" "使" "自己" "的" "意念" "真誠"
b v r v v v v uj n v v vn l p
"先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物" "通過"
p l uj v vn f v v v v v f n v
"對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念" "才能"
a n a f n v nz n nz f v v n n
"真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性" "品性"
v f v vn a n c nz vn a n c nz f
"修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族" "後"
v v a n v a n f s v ns f p n
"才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自" "國家元首"
v n n d v p v n p a c r a p
"下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本" "被"
v ul n nz n s v v a v v v uj d
"擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的" "不分"
z i d v v n r d d v d v uj
"輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不" "可能" "的"
Extracting keywords
Set the type argument of worker to keywords or simhash, and use the topn argument to set the number of keywords to extract; the default is 5.
> #type=keywords
> keys <- worker(type = "keywords", topn = 2)
> keys<=words
51.7266 43.9134
"品性" "修養"
> #type=simhash
> keys2 <- worker(type = "simhash", topn = 2)
> keys2<=words
$simhash
[1] "7879176380556209572"
$keyword
51.7266 43.9134
"品性" "修養"