What is the power tool for data cleaning in big data?

2. Flashtext

Flashtext is an algorithm based on the Trie data structure and Aho–Corasick. It works as follows: first, it takes all the relevant keywords as input, then uses these keywords to build a trie dictionary, as shown in Figure 3:


start and eot are two special characters that mark word boundaries, playing the same role as the boundaries in the regular expressions discussed above. This trie dictionary is the data structure we will use for searching and replacing.
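To make the shape of this data structure concrete, here is a hand-built sketch of such a trie dictionary in plain Python. The helper name build_trie is ours, not part of the Flashtext API; the eot marker uses the same '_keyword_' key as the listing in section 2.3.1.

```python
# Illustrative sketch: build a trie dictionary as nested dicts, storing the
# standardized name under an end-of-term (eot) key. build_trie is our own
# helper, not part of the Flashtext API.
def build_trie(keywords):
    trie = {}  # the empty start node
    for word, clean_name in keywords.items():
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})  # descend, creating nodes as needed
        node['_keyword_'] = clean_name      # eot marker: end of a keyword
    return trie

# 'ny' and 'nyc' share the 'n' -> 'y' branch
trie = build_trie({'ny': 'New York', 'nyc': 'New York'})
```

Because both keywords share a prefix, they occupy a single branch of the trie; the eot key is what distinguishes "a keyword ends here" from "this is only a prefix".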

2.1 Searching with Flashtext

Given an input string (the document), we traverse it character by character. When a character sequence in the document matches a word in the dictionary (start and eot being the start and end tags of the sequence, respectively), we treat it as a complete match and output the standardized keyword corresponding to the matched sequence, as follows:
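The walk just described can be sketched in a few lines of Python. This is a deliberately simplified version, not the library's implementation (which appears in section 2.3.2): word boundaries are reduced to splitting on spaces, and the names search and eot are ours.

```python
# Simplified sketch of the search walk: descend the trie character by
# character; a keyword is found only when the walk ends exactly on an eot
# marker at a word boundary.
def search(trie, sentence, eot='_keyword_'):
    found = []
    for word in sentence.lower().split():
        node = trie
        for ch in word:
            if ch not in node:
                node = None  # mismatch: this word is not in the dictionary
                break
            node = node[ch]
        if node is not None and eot in node:
            found.append(node[eot])  # complete match: emit the standardized keyword
    return found

# a one-keyword trie: 'nyc' standardized to 'New York'
trie = {'n': {'y': {'c': {'_keyword_': 'New York'}}}}
```

Note that a word such as 'nycc' would not match here, because the walk must end exactly on an eot marker at a word boundary.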

2.2 Replacing with Flashtext

Given an input string (the document), we traverse it character by character while building up a return string, which starts empty. When a character sequence has no match in the trie dictionary, we simply copy the original characters into the return string. When we do find a match, we copy the standardized characters of the matched sequence into the return string instead. The return string is therefore a copy of the input string, the only difference being that the matched character sequences have been replaced, as follows:
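This copy-or-replace walk can also be sketched compactly. As before, this is a simplification, not the library's implementation (which appears in section 2.3.3): spaces stand in for the full word-boundary handling, and replace and eot are our own names.

```python
# Simplified sketch of the replacement walk: each word is copied to the
# return string unchanged unless the trie walk reaches an eot marker, in
# which case the standardized name is copied instead.
def replace(trie, sentence, eot='_keyword_'):
    out = []
    for word in sentence.split():
        node = trie
        for ch in word.lower():
            node = node.get(ch)
            if node is None:
                break
        if node is not None and eot in node:
            out.append(node[eot])  # matched: copy the standardized name
        else:
            out.append(word)       # no match: copy the original word
    return ' '.join(out)

trie = {'n': {'y': {'c': {'_keyword_': 'New York'}}}}
```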

2.3 The Flashtext algorithm

The Flashtext algorithm consists of three main parts, which we analyze separately below:

  1. building the trie dictionary;

  2. keyword search;

  3. keyword replacement.

2.3.1 Building the trie dictionary

To build the trie dictionary, we start with an empty node pointing to an empty dictionary; this node serves as the starting point for every word. To insert a word, we take its next character as a key in the current dictionary, with the value pointing to another, initially empty, dictionary. This process repeats until we reach the last character of the word, at which point we insert a special character (eot) to mark the end of the word.

Input

Keyword w = c1c2c3...cn, where ci is an input character. The standardized word is denoted s.

Code: initializing Flashtext and adding keywords to the dictionary

class FlashText(object):

    def __init__(self, case_sensitive=False):
        self._keyword = '_keyword_'  # end of term (eot) and key to store standardized name
        self._white_space_chars = set(['.', '\t', '\n', '\a', ' ', ','])
        self.keyword_trie_dict = dict()
        self.case_sensitive = case_sensitive

    def add_keyword(self, keyword, clean_name=None):
        if not clean_name and keyword:
            clean_name = keyword
        if keyword and clean_name:
            # if both keyword and clean_name are not empty
            if not self.case_sensitive:
                # if not case_sensitive then lowercase the keyword
                keyword = keyword.lower()
            current_dict = self.keyword_trie_dict
            for letter in keyword:
                current_dict = current_dict.setdefault(letter, {})
            current_dict[self._keyword] = clean_name

Output

The code above creates a dictionary like the one shown in Figure 3.
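As a quick sanity check of the listing above, the snippet below restates the same class compactly (so that it runs on its own) and inspects the trie after adding two keywords that share a prefix:

```python
# Minimal restatement of the FlashText class from the listing above, so this
# snippet is self-contained.
class FlashText(object):
    def __init__(self, case_sensitive=False):
        self._keyword = '_keyword_'  # eot marker / key for the standardized name
        self.keyword_trie_dict = dict()
        self.case_sensitive = case_sensitive

    def add_keyword(self, keyword, clean_name=None):
        if not clean_name and keyword:
            clean_name = keyword
        if keyword and clean_name:
            if not self.case_sensitive:
                keyword = keyword.lower()
            current_dict = self.keyword_trie_dict
            for letter in keyword:
                current_dict = current_dict.setdefault(letter, {})
            current_dict[self._keyword] = clean_name

ft = FlashText()
ft.add_keyword('Big Apple', 'New York')  # standardized to 'New York'
ft.add_keyword('Big Ben')                # clean_name defaults to the keyword itself
# both keywords descend through the shared 'b' -> 'i' -> 'g' -> ' ' branch
shared = ft.keyword_trie_dict['b']['i']['g'][' ']
```

Because keywords are lowercased by default, 'Big Apple' and 'Big Ben' share a single 'big ' branch, with the eot keys holding their respective standardized names.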

2.3.2 Keyword search

Once all the words have been added to the trie dictionary, we can find keywords in an input string.

Input

String x = a1a2...an, where ai is the i-th character of the input string x.

Code: Python code to extract the dictionary's keywords from the input string.

def extract_keywords(self, sentence):
    keywords_extracted = []
    if not self.case_sensitive:
        # if not case_sensitive then lowercase the sentence
        sentence = sentence.lower()
    current_dict = self.keyword_trie_dict
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        # when we reach a character that might denote word end
        if char not in self.non_word_boundaries:
            # if eot is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    keywords_extracted.append(longest_sequence_found)
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
        elif char in current_dict:
            # char is present in current dictionary position
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
        # if we are at end of sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                keywords_extracted.append(sequence_found)
        idx += 1
    return keywords_extracted

Output

Returns a list of all the standardized words found in the input string x, as shown in Figure 4.

2.3.3 Keyword replacement

We use the same dictionary to replace keywords in the input string with their standardized words.

Input

Input string x = a1a2...an, where ai is the i-th character.

Code: Python code to replace keywords in the input string with standardized words.

def replace_keywords(self, sentence):
    new_sentence = ''
    orig_sentence = sentence
    if not self.case_sensitive:
        sentence = sentence.lower()
    current_word = ''
    current_dict = self.keyword_trie_dict
    current_white_space = ''
    sequence_end_pos = 0
    idx = 0
    sentence_len = len(sentence)
    while idx < sentence_len:
        char = sentence[idx]
        current_word += orig_sentence[idx]
        # when we reach whitespace
        if char not in self.non_word_boundaries:
            current_white_space = char
            # if eot is present in current_dict
            if self._keyword in current_dict or char in current_dict:
                # update longest sequence found
                sequence_found = None
                longest_sequence_found = None
                is_longer_seq_found = False
                if self._keyword in current_dict:
                    sequence_found = current_dict[self._keyword]
                    longest_sequence_found = current_dict[self._keyword]
                    sequence_end_pos = idx
                # re look for longest_sequence from this position
                if char in current_dict:
                    current_dict_continued = current_dict[char]
                    current_word_continued = current_word
                    idy = idx + 1
                    while idy < sentence_len:
                        inner_char = sentence[idy]
                        current_word_continued += orig_sentence[idy]
                        if inner_char not in self.non_word_boundaries and self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = inner_char
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                        if inner_char in current_dict_continued:
                            current_dict_continued = current_dict_continued[inner_char]
                        else:
                            break
                        idy += 1
                    else:
                        # end of sentence reached
                        if self._keyword in current_dict_continued:
                            # update longest sequence found
                            current_white_space = ''
                            longest_sequence_found = current_dict_continued[self._keyword]
                            sequence_end_pos = idy
                            is_longer_seq_found = True
                    if is_longer_seq_found:
                        idx = sequence_end_pos
                        current_word = current_word_continued
                current_dict = self.keyword_trie_dict
                if longest_sequence_found:
                    new_sentence += longest_sequence_found + current_white_space
                    current_word = ''
                    current_white_space = ''
                else:
                    new_sentence += current_word
                    current_word = ''
                    current_white_space = ''
            else:
                # we reset current_dict
                current_dict = self.keyword_trie_dict
                new_sentence += current_word
                current_word = ''
                current_white_space = ''
        elif char in current_dict:
            # we can continue from this char
            current_dict = current_dict[char]
        else:
            # we reset current_dict
            current_dict = self.keyword_trie_dict
            # skip to end of word
            idy = idx + 1
            while idy < sentence_len:
                char = sentence[idy]
                current_word += orig_sentence[idy]
                if char not in self.non_word_boundaries:
                    break
                idy += 1
            idx = idy
            new_sentence += current_word
            current_word = ''
            current_white_space = ''
        # if we are at end of sentence and have a sequence discovered
        if idx + 1 >= sentence_len:
            if self._keyword in current_dict:
                sequence_found = current_dict[self._keyword]
                new_sentence += sequence_found
        idx += 1
    return new_sentence

Output

Finds the words in string x that need replacing, replaces them with the standardized words, and outputs the result, as shown in Figure 5.

3. Benchmarking Flashtext against regular expressions

As Figures 1 and 2 show, Flashtext is much faster than regular expressions. Let's run some benchmarks to illustrate this further.

3.1 Keyword search

We use Python to run the keyword-search benchmark. First, we randomly create a corpus of 100K words. Then we select 1K words from this list to create a new document.

We select k keywords from the corpus, where k ∈ {0, 1000, 2000, ..., 20000}, search the document for these keywords using both regular expressions and Flashtext, and then compare the results. The Python code is as follows:

from flashtext.keyword import KeywordProcessor
import random
import re
import string
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  | FlashText | Regex ')
print('------------------------------')
for keywords_length in [0, 1000, 5000, 10000, 15000]:
    # choose 1000 terms and create a string to search in
    all_words_chosen = random.sample(all_words, 1000)
    story = ' '.join(all_words_chosen)

    # get unique keywords from the list of words generated
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))

    # compile regex
    compiled_re = re.compile('|'.join([r'\b' + keyword + r'\b' for keyword in unique_keywords_sublist]))

    # add keywords to Flashtext
    keyword_processor = KeywordProcessor()
    keyword_processor.add_keywords_from_list(unique_keywords_sublist)

    # time the modules
    start = time.time()
    _ = keyword_processor.extract_keywords(story)
    mid = time.time()
    _ = compiled_re.findall(story)
    end = time.time()

    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|')

# output: Data for Figure 1

3.2 Keyword replacement

The following Python code runs the keyword-replacement benchmark.

from flashtext.keyword import KeywordProcessor
import random
import string
import re
import time

def get_word_of_length(str_length):
    # generate a random word of given length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(str_length))

# generate a list of 100K words of randomly chosen size
all_words = [get_word_of_length(random.choice([3, 4, 5, 6, 7, 8])) for i in range(100000)]

print('Count  | FlashText | Regex ')
print('-------------------------------')
for keywords_length in range(1, 20002, 1000):
    # choose 5000 terms and create a string to search in
    all_words_chosen = random.sample(all_words, 5000)
    story = ' '.join(all_words_chosen)

    # get unique keywords from the list of words generated
    unique_keywords_sublist = list(set(random.sample(all_words, keywords_length)))

    # compile regex
    # source: https://stackoverflow.com/questions/6116978/python-replace-multiple-strings
    rep = dict([(key, '_keyword_') for key in unique_keywords_sublist])
    compiled_re = re.compile("|".join(rep.keys()))

    # add keywords to Flashtext
    keyword_processor = KeywordProcessor()
    for keyword in unique_keywords_sublist:
        keyword_processor.add_keyword(keyword, '_keyword_')

    # time the modules
    start = time.time()
    _ = keyword_processor.replace_keywords(story)
    mid = time.time()
    _ = compiled_re.sub(lambda m: rep[re.escape(m.group(0))], story)
    end = time.time()

    # print output
    print(str(keywords_length).ljust(6), '|',
          "{0:.5f}".format(mid - start).ljust(9), '|',
          "{0:.5f}".format(end - mid).ljust(9), '|')

# output: Data for Figure 2

3.3 Conclusion

As the comparisons above show, Flashtext is much faster than regular expressions at both keyword search and replacement, and its advantage grows even more pronounced as the data gets larger.

4. Flashtext documentation

4.1 Installation

pip install flashtext

4.2 Usage examples

4.2.1 Keyword extraction

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> # keyword_processor.add_keyword(<unclean name>, <standardised name>)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love Big Apple and Bay Area.')
>>> keywords_found
['New York', 'Bay Area']

4.2.2 Keyword replacement

>>> keyword_processor.add_keyword('New Delhi', 'NCR region')
>>> new_sentence = keyword_processor.replace_keywords('I love Big Apple and new delhi.')
>>> new_sentence
'I love New York and NCR region.'

4.2.3 Case sensitivity

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor(case_sensitive=True)
>>> keyword_processor.add_keyword('Big Apple', 'New York')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
['Bay Area']

4.2.4 Keywords without a clean name

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> keyword_processor.add_keyword('Bay Area')
>>> keywords_found = keyword_processor.extract_keywords('I love big Apple and Bay Area.')
>>> keywords_found
['Big Apple', 'Bay Area']

4.2.5 Adding multiple keywords at once

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> # {'clean_name': ['list of unclean names']}
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> # Or add keywords from a list:
>>> keyword_processor.add_keywords_from_list(["java", "python"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
['product management', 'java']

4.2.6 Removing keywords

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_dict = {
...     "java": ["java_2e", "java programing"],
...     "product management": ["PM", "product manager"]
... }
>>> keyword_processor.add_keywords_from_dict(keyword_dict)
>>> print(keyword_processor.extract_keywords('I am a product manager for a java_2e platform'))
['product management', 'java']
>>> keyword_processor.remove_keyword('java_2e')
>>> # you can also remove keywords from a list / dictionary
>>> keyword_processor.remove_keywords_from_dict({"product management": ["PM"]})
>>> keyword_processor.remove_keywords_from_list(["java programing"])
>>> keyword_processor.extract_keywords('I am a product manager for a java_2e platform')
['product management']

Sometimes special symbols, such as spaces or /, are treated as word boundaries. To redefine the word boundaries, we need to tell the algorithm which additional characters count as part of a word.

>>> from flashtext import KeywordProcessor
>>> keyword_processor = KeywordProcessor()
>>> keyword_processor.add_keyword('Big Apple')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
['Big Apple']
>>> keyword_processor.add_non_word_boundary('/')
>>> print(keyword_processor.extract_keywords('I love Big Apple/Bay Area.'))
[]

4.3 API documentation

For the detailed API documentation, you can click here.


References:

Stack Overflow

Flashtext: GitHub

Flashtext: paper

Flashtext: Medium

Flashtext: API doc

