Data Mining in Practice with R: An Introduction to the jiebaR Text Mining Package

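Compared with Rwordseg, the jiebaR package is another efficient option for text mining. jiebaR is easy to install, offers many segmentation engines and a large number of functions, and is updated frequently, so it is increasingly popular with data analysts. Unlike Rwordseg, its installation does not depend on a Java environment; a single command is enough to get started.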

> install.packages("jiebaR")

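Next, we look in detail at how the jiebaR package is used for text mining.

The worker() segmentation function

The worker() function sets options such as the segmentation type, the user dictionary, and the stop words. Its usage is: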

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH, user = USERPATH,
       idf = IDFPATH, stop_word = STOPPATH, write = T, qmax = 20, topn = 5,
       encoding = "UTF-8", detect = T, symbol = F, lines = 1e+05,
       output = NULL, bylines = F, user_weight = "max")

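where:

  • type is the segmentation engine type. The package provides mix, mp, hmm, full, query, tag, simhash, and keywords, which stand for the mixed model, maximum probability, hidden Markov model, full mode, index (query) model, part-of-speech tagging, Simhash text-similarity comparison, and keyword extraction, respectively;

  • dict is the path to the main dictionary; it defaults to DICTPATH;

  • hmm is the path to the hidden Markov model; it defaults to HMMPATH, and the model is also used by engines other than hmm, such as mix;

  • user is the path to the user-defined dictionary;

  • idf is the path to the inverse document frequency (IDF) file; it defaults to IDFPATH and is used by the simhash and keywords engines;

  • stop_word is the path to the stop-word file;

  • qmax is the maximum query length of a word; it defaults to 20 and applies to the query engine;

  • topn is the number of keywords to return; it defaults to 5 and applies to the simhash and keywords engines;

  • symbol controls whether symbols are kept in the output; it defaults to F;

  • lines is the maximum number of lines read at one time from an input file; it defaults to 1e+05;

  • output is the path of the output file; by default the file name is generated from the system time;

  • bylines controls whether the result is returned line by line, following the lines of the input;

  • user_weight is the weight given to words from the user dictionary; the options are "min", "max", or "median".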
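
As an illustration of how these arguments are passed, the sketch below builds two engines that differ only in the symbol argument and compares their output. The sample sentence and the variable names are assumptions made for this illustration, and segment() is described in the next section.

```r
library(jiebaR)

# Sample sentence chosen for illustration only (not from the original example).
text <- "大學的宗旨,在於弘揚光明正大的品德。"

# Mixed-model engine with the default symbol = FALSE: symbols are dropped.
engine_plain <- worker(type = "mix")

# Same engine type, but symbols are kept in the output.
engine_symbol <- worker(type = "mix", symbol = TRUE)

without_symbols <- segment(text, engine_plain)
with_symbols <- segment(text, engine_symbol)

print(without_symbols)
print(with_symbols)
```

Comparing the two printed vectors shows whether the punctuation marks survive segmentation.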

The segment() function

The segment() function is used as follows:

segment(code, engine, mod)

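where:

  • code is the object to be segmented;

  • engine is the segmentation engine, that is, the object returned by worker();

  • mod overrides the engine's default segmentation type; its values include mix, hmm, query, full, level, and mp.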

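Combining these two functions in a simple example, where words holds the Chinese passage to be segmented:

> library(jiebaR)

> engine <- worker()

> words <- "…"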

> segment(words,engine)

[1] "大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"

[15] "最" "完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向" "堅定" "志向"

[29] "堅定" "才" "能夠" "鎮靜" "不躁" "鎮靜" "不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮"

[43] "周祥" "思慮" "周祥" "才" "能夠" "有所" "收穫" "每樣" "東西" "都" "有" "根本" "有枝末" "每件"

[57] "事情" "都" "有" "開始" "有" "終結" "明白" "了" "這" "本末" "始終" "的" "道理" "就"

[71] "接近" "事物" "發展" "的" "規律" "了" "古代" "那些" "要" "想" "在" "天下" "弘揚" "光明正大"

[85] "品德" "的" "人" "先要" "治理" "好" "自己" "的" "國家" "要" "想" "治理" "好" "自己"

[99] "的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和" "家" "族" "要" "想" "管理"

[113] "好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的" "品性" "要" "想" "修養"

[127] "自身" "的" "品性" "先" "要端正" "自己" "的" "心思" "要" "想" "端正" "自己" "的" "心思"

[141] "先" "要" "使" "自己" "的" "意念" "真誠" "要" "想" "使" "自己" "的" "意念" "真誠"

[155] "先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物" "通過"

[169] "對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念" "才能"

[183] "真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性" "品性"

[197] "修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族" "後"

[211] "才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自" "國家元首"

[225] "下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本" "被"

[239] "擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的" "不分"

[253] "輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不" "可能" "的"
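
The same call can be written in operator form, which is the form used later in this article for tagging and keyword extraction. A minimal sketch, assuming a short illustrative input string rather than the full passage:

```r
library(jiebaR)

engine <- worker()

# Short illustrative input, not the full passage used above.
text <- "知道應達到的境界才能夠志向堅定"

# Function form: segment(code, engine).
res_function <- segment(text, engine)

# Operator form: the engine on the left, the text on the right.
res_operator <- engine <= text

print(res_function)
print(res_operator)
```

Both forms are expected to return a character vector of tokens for the same engine and input.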

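Adding user-defined words or dictionaries

There are two ways to add user-defined words or a dictionary: use the new_user_word() function, or pass a dictionary through the user argument of worker().

In the example above, phrases such as "最完善" and "不可能" were each split into two tokens ("最" "完善" and "不" "可能"), so they need to be added as user-defined words. The following code adds a set of custom words:

> engine_new_word <- worker()

> new_user_word(engine_new_word,c("最完善","志向堅定","鎮靜不躁","思慮周祥","每樣東西","有","根本","枝末","本末始終","要想","家族","先要","端正","不可能"))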

[1] TRUE

> segment(words,engine_new_word)

[1] "大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"

[15] "最完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向堅定" "志向堅定" "才" "能夠"

[29] "鎮靜不躁" "鎮靜不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮周祥" "思慮" "周祥" "才" "能夠" "有所"

[43] "收穫" "每樣東西" "都" "有" "根本" "有" "枝末" "每件" "事情" "都" "有" "開始" "有" "終結"

[57] "明白" "了" "這" "本末始終" "的" "道理" "就" "接近" "事物" "發展" "的" "規律" "了" "古代"

[71] "那些" "要想" "在" "天下" "弘揚" "光明正大" "品德" "的" "人" "先要" "治理" "好" "自己" "的"

[85] "國家" "要想" "治理" "好" "自己" "的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和"

[99] "家" "族" "要想" "管理" "好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的"

[113] "品性" "要想" "修養" "自身" "的" "品性" "先要" "端正" "自己" "的" "心思" "要想" "端正" "自己"

[127] "的" "心思" "先" "要" "使" "自己" "的" "意念" "真誠" "要想" "使" "自己" "的" "意念"

[141] "真誠" "先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物"

[155] "通過" "對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念"

[169] "才能" "真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性"

[183] "品性" "修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族"

[197] "後" "才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自"

[211] "國家元首" "下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本"

[225] "被" "擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的"

[239] "不分" "輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不可能" "的"

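We can also supply a whole dictionary through the user argument of worker() and have the words read directly from it. This approach is recommended when the number of custom words is large. For example, create a file named "dictionary.txt" in the local working directory, write the custom words into it, and then run:

> engine_user <- worker(user = "dictionary.txt")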

> segment(words,engine_user)

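A few points need attention here:

  • The first line of the dictionary file must be left blank; otherwise the first word is silently ignored;

  • If the dictionary is written with Notepad, the encoding is sometimes not UTF-8, which causes various errors. It is therefore better to edit the file in Notepad++, set the encoding to UTF-8, and save it as a txt file;

  • To add a Sogou cell dictionary, install the cidian package, which converts Sogou cell dictionaries into a format that jiebaR can use.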
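
The first two points can also be handled from R itself. The sketch below writes a small UTF-8 dictionary file whose first line is blank, using base R only. The file is written to a temporary directory rather than the working directory, and the three words are taken from the new_user_word() example above; both choices are assumptions made for this illustration.

```r
# Words to put in the user dictionary (taken from the earlier example).
user_words <- c("志向堅定", "鎮靜不躁", "本末始終")

# Hypothetical location: a temporary directory instead of the working directory.
dict_path <- file.path(tempdir(), "dictionary.txt")

# Write the bytes as UTF-8 regardless of the session locale.
# The leading "" leaves the first line of the file blank.
con <- file(dict_path, open = "wb")
writeLines(enc2utf8(c("", user_words)), con, useBytes = TRUE)
close(con)

# Read the file back to check its layout.
dict_lines <- readLines(dict_path, encoding = "UTF-8")
print(dict_lines)

# The file could then be passed to jiebaR, for example:
# engine_user <- worker(user = dict_path)
```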

A dictionary file can also be loaded with the new_user_word() function. Example code:

> new_user_word(engine_new_word, scan("dictionary.txt",what="",sep="\n"))

Read 2 items

[1] TRUE

> segment(words,engine_new_word)

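Removing stop words

In the example above, tokens such as "卻" and "想" are stop words and should be removed. This is done by passing the path of a stop-word file to the stop_word argument of worker() (the file name below is a placeholder):

> engine_s <- worker(stop_word = "stopwords.txt")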

> segment(words,engine_s)

[1] "大學" "宗旨" "弘揚" "光明正大" "品德" "棄舊圖新" "完善" "境界" "知道" "境界" "志向" "堅定" "志向" "堅定"

[15] "鎮靜" "不躁" "鎮靜" "不躁" "心安理得" "心安理得" "思慮" "周祥" "思慮" "周祥" "有所" "收穫" "根本" "有枝末"

[29] "事情" "開始" "終結" "明白" "本末" "始終" "道理" "就" "接近" "事物" "發展" "規律" "天下" "弘揚"

[43] "光明正大" "品德" "治理" "國家" "治理" "國家" "管理" "家庭" "家" "族" "管理" "家庭" "家族" "修養"

[57] "自身" "品性" "修養" "自身" "品性" "要端正" "心思" "端正" "心思" "意念" "真誠" "意念" "真誠" "獲得"

[71] "知識" "獲得" "知識" "途徑" "認識" "研究" "萬事萬物" "通過" "對" "萬事萬物" "認識" "研究" "才能" "獲得"

[85] "知識" "獲得" "知識" "意念" "才能" "真誠" "意念" "真誠" "心思" "才能" "端正" "心思" "端正" "才能"

[99] "修養" "品性" "品性" "修養" "才能" "管理" "家庭" "家族" "管理" "家庭" "家族" "才能" "治理" "國家"

[113] "治理" "國家" "天下" "才能" "太平" "國家元首" "平民百姓" "修養" "品性" "為" "根本" "根本" "擾亂" "家庭"

[127] "家族" "國家" "天下" "治理" "不可" "輕重緩急" "本末倒置" "做好" "事情"
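
To make the effect of stop-word removal concrete, the following sketch filters a token vector in base R after segmentation. This is an alternative illustration, not the stop_word mechanism of worker() itself; the tokens are copied from the end of the unfiltered output shown earlier, and the stop-word list is an assumption chosen for the example.

```r
# Last tokens of the unfiltered segmentation result shown earlier.
tokens <- c("本末倒置", "卻", "想", "做好", "事情",
            "這", "也", "同樣", "是", "不", "可能", "的")

# Illustrative stop-word list (assumed, not the file used by the author).
stop_words <- c("卻", "想", "這", "也", "同樣", "是", "不", "可能", "的")

# Keep only the tokens that are not in the stop-word list.
kept <- tokens[!tokens %in% stop_words]

print(kept)
```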

Counting word frequencies

The freq() function counts word frequencies automatically:

> freq(segment(words,engine_s))

char freq

1 做好 1

2 本末倒置 1

3 輕重緩急 1

4 擾亂 1

5 為 1

6 平民百姓 1

7 太平 1

8 才能 7

9 對 1

10 通過 1

11 萬事萬物 2

12 認識 2

13 途徑 1

14 知識 4

15 真誠 4

16 意念 4

17 就 1

18 有枝末 1

19 完善 1

20 周祥 2

21 思慮 2

22 收穫 1

23 心安理得 2

24 要端正 1

25 不躁 2

26 心思 4

27 獲得 4

28 鎮靜 2

29 堅定 2

30 根本 3

31 終結 1

32 品性 5

33 弘揚 2

34 大學 1

35 事物 1

36 棄舊圖新 1

37 管理 4

38 家庭 5

39 光明正大 2

40 天下 3

41 品德 2

42 事情 2

43 知道 1

44 志向 2

45 明白 1

46 國家 5

47 開始 1

48 本末 1

49 境界 2

50 有所 1

51 始終 1

52 道理 1

53 接近 1

54 研究 2

55 發展 1

56 國家元首 1

57 治理 5

58 家 1

59 家族 4

60 宗旨 1

61 族 1

62 不可 1

63 規律 1

64 修養 5

65 自身 2

66 端正 3
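
The table above is not sorted, so the most frequent words are hard to spot. The sketch below sorts a frequency table by count. So that it runs without jiebaR, it builds a data frame of the same shape (columns char and freq, as in the output above) with base R's table(); the short token vector is an assumption made for this illustration.

```r
# Small illustrative token vector (not the full segmentation result).
tokens <- c("修養", "品性", "治理", "國家", "修養", "品性", "修養")

# Count the tokens and store them in the same layout as the freq() output.
counts <- table(tokens)
freq_df <- data.frame(char = names(counts),
                      freq = as.integer(counts),
                      stringsAsFactors = FALSE)

# Sort by descending frequency.
freq_sorted <- freq_df[order(-freq_df$freq), ]

print(head(freq_sorted, 3))
```

The same order() call can be applied to the data frame returned by freq().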

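Part-of-speech tagging

Part-of-speech tagging is selected through the type argument of worker(). The default type is mix; setting it to tag is all that is needed.

> tagger <- worker(type = "tag")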

> tagger<=words

n uj n v nr i uj n v x i v x v

"大學" "的" "宗旨" "在於" "弘揚" "光明正大" "的" "品德" "在於" "使人" "棄舊圖新" "在於" "使人" "達到"

d v uj n v v v uj n d v n a n

"最" "完善" "的" "境界" "知道" "應" "達到" "的" "境界" "才" "能夠" "志向" "堅定" "志向"

a d v a x a x d v i i d v v

"堅定" "才" "能夠" "鎮靜" "不躁" "鎮靜" "不躁" "才" "能夠" "心安理得" "心安理得" "才" "能夠" "思慮"

x v x d v n v r ns d v a x d

"周祥" "思慮" "周祥" "才" "能夠" "有所" "收穫" "每樣" "東西" "都" "有" "根本" "有枝末" "每件"

n d v v v v nr ul r t d uj n d

"事情" "都" "有" "開始" "有" "終結" "明白" "了" "這" "本末" "始終" "的" "道理" "就"

v n vn uj n ul t r v v p s nr i

"接近" "事物" "發展" "的" "規律" "了" "古代" "那些" "要" "想" "在" "天下" "弘揚" "光明正大"

n uj n b v a r uj n v v v a r

"品德" "的" "人" "先要" "治理" "好" "自己" "的" "國家" "要" "想" "治理" "好" "自己"

uj n b vn a r uj n c q ng v v vn

"的" "國家" "先要" "管理" "好" "自己" "的" "家庭" "和" "家" "族" "要" "想" "管理"

a r uj n c nz b v r uj n v v v

"好" "自己" "的" "家庭" "和" "家族" "先要" "修養" "自身" "的" "品性" "要" "想" "修養"

r uj n d v r uj n v v nz r uj n

"自身" "的" "品性" "先" "要端正" "自己" "的" "心思" "要" "想" "端正" "自己" "的" "心思"

d v v r uj n a v v v r uj n a

"先" "要" "使" "自己" "的" "意念" "真誠" "要" "想" "使" "自己" "的" "意念" "真誠"

b v r v v v v uj n v v vn l p

"先要" "使" "自己" "獲得" "知識" "獲得" "知識" "的" "途徑" "在於" "認識" "研究" "萬事萬物" "通過"

p l uj v vn f v v v v v f n v

"對" "萬事萬物" "的" "認識" "研究" "後" "才能" "獲得" "知識" "獲得" "知識" "後" "意念" "才能"

a n a f n v nz n nz f v v n n

"真誠" "意念" "真誠" "後" "心思" "才能" "端正" "心思" "端正" "後" "才能" "修養" "品性" "品性"

v f v vn a n c nz vn a n c nz f

"修養" "後" "才能" "管理" "好" "家庭" "和" "家族" "管理" "好" "家庭" "和" "家族" "後"

v v a n v a n f s v ns f p n

"才能" "治理" "好" "國家" "治理" "好" "國家" "後" "天下" "才能" "太平" "上" "自" "國家元首"

v n n d v p v n p a c r a p

"下至" "平民百姓" "人人" "都" "要" "以" "修養" "品性" "為" "根本" "若" "這個" "根本" "被"

v ul n nz n s v v a v v v uj d

"擾亂" "了" "家庭" "家族" "國家" "天下" "要" "治理" "好" "是" "不可" "能" "的" "不分"

z i d v v n r d d v d v uj

"輕重緩急" "本末倒置" "卻" "想" "做好" "事情" "這" "也" "同樣" "是" "不" "可能" "的"

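The printed result has the shape of a named character vector: each word is an element and its tag is the element's name. Under that assumption, words can be selected by tag with ordinary base R indexing. The sketch below recreates the first eight tagged words from the output above by hand so that it runs without jiebaR.

```r
# First eight tagged words from the output above, recreated by hand.
tagged <- c(n = "大學", uj = "的", n = "宗旨", v = "在於",
            nr = "弘揚", i = "光明正大", uj = "的", n = "品德")

# Keep only the words tagged as nouns ("n").
nouns <- unname(tagged[names(tagged) == "n"])

# Count how often each tag occurs.
tag_counts <- table(names(tagged))

print(nouns)
print(tag_counts)
```
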
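Keyword extraction

Set the type argument of worker() to keywords or simhash, and use the topn argument to set how many keywords to extract (5 by default).

> #type=keywords

> keys <- worker(type = "keywords", topn = 2)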

> keys<=words

51.7266 43.9134

"品性" "修養"

> #type=simhash

> keys2 <- worker(type = "simhash", topn = 2)

> keys2<=words

$simhash

[1] "7879176380556209572"

$keyword

51.7266 43.9134

"品性" "修養"

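In the output above, each keyword is printed under a number, which suggests a named character vector whose names carry the keyword weights. Assuming that shape, the result can be turned into a data frame for further processing. The sketch below recreates the two keywords by hand so that it runs without jiebaR; the column names word and weight are chosen for this illustration.

```r
# Keyword result from the output above, recreated by hand.
keys_result <- c("51.7266" = "品性", "43.9134" = "修養")

# Move the weights out of the names into a numeric column.
keyword_df <- data.frame(word = unname(keys_result),
                         weight = as.numeric(names(keys_result)),
                         stringsAsFactors = FALSE)

print(keyword_df)
```
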
