[NLP] HMM hidden Markov + Viterbi word segmentation
2022-04-23 14:45:00 【myaijarvis】
[Reference: NLP-HMM Hidden Markov + Viterbi word segmentation, code + data + explanation (bilibili)] The PPT is simple and very good.
[Reference: shouxieai/nlp-hmm-word-cut: nlp-hmm-word-cut]
How to explain the Viterbi algorithm? - Lu Sheng's answer - Zhihu
How to explain the Viterbi algorithm? - JustCoder's answer - Zhihu
PPT
Code
import pickle
from tqdm import tqdm
import numpy as np
import os


def make_label(text_str):
    # Convert a word into its label string, e.g. a two-character word -> "BE",
    # a four-character word -> "BMME", a single-character word -> "S"
    text_len = len(text_str)
    if text_len == 1:
        return "S"
    return "B" + "M" * (text_len - 2) + "E"  # Starts with B, ends with E, everything in between is M


# Convert the original corpus into the corresponding state file, e.g. a line segmented as
# "<2-character word> <1-character word> <2-character word>" becomes "BE S BE"
def text_to_state(file="all_train_text.txt"):
    # all_train_text.txt is already segmented into words separated by spaces
    if os.path.exists("all_train_state.txt"):  # If the state file already exists, do nothing
        return
    all_data = open(file, "r", encoding="utf-8").read().split("\n")  # Read the file and split it into a list of lines
    with open("all_train_state.txt", "w", encoding="utf-8") as f:  # Open the output file
        # Go line by line; tqdm shows a progress bar; data is one article (one line) and may be empty
        for d_index, data in tqdm(enumerate(all_data)):
            if data:  # Skip empty lines
                state_ = ""
                for w in data.split(" "):  # Split the current line on spaces; w is one word
                    if w:  # Skip empty strings
                        state_ = state_ + make_label(w) + " "  # Label the word
                if d_index != len(all_data) - 1:  # Every line except the last one gets a trailing "\n"
                    state_ = state_.strip() + "\n"  # Strip the trailing space of the line
                f.write(state_)  # state_ is a string


# Define the HMM class; the key parts are the three matrices
class HMM:
    def __init__(self, file_text="all_train_text.txt", file_state="all_train_state.txt"):
        # State sequences and texts, one per line (only the first 200 lines are used)
        self.all_states = open(file_state, "r", encoding="utf-8").read().split("\n")[:200]
        self.all_texts = open(file_text, "r", encoding="utf-8").read().split("\n")[:200]
        self.states_to_index = {"B": 0, "M": 1, "S": 2, "E": 3}  # Index of each state, looked up by state name
        self.index_to_states = ["B", "M", "S", "E"]  # State name, looked up by index
        self.len_states = len(self.states_to_index)  # Number of states: 4 here

        # The three matrices below are the core of the model
        self.init_matrix = np.zeros((self.len_states))  # Initial-state vector: 1 * 4, in the order B, M, S, E
        self.transfer_matrix = np.zeros((self.len_states, self.len_states))  # State transition matrix: 4 * 4
        # Emission matrix, stored as a two-level nested dictionary.
        # Each state also gets a "total" key that counts how often the state occurs; it is used for normalization later
        self.emit_matrix = {"B": {"total": 0}, "M": {"total": 0}, "S": {"total": 0}, "E": {"total": 0}}

    # Update the initial-state counts
    def cal_init_matrix(self, state):
        # state is the label string of the first word in a line; its first character is the line's initial state
        self.init_matrix[self.states_to_index[state[0]]] += 1

    # Update the transition counts
    def cal_transfer_matrix(self, states):
        # A transition goes from the current state to the next one,
        # i.e. from each element of sta1 to the element at the same position in sta2
        sta_join = "".join(states)
        sta1 = sta_join[:-1]
        sta2 = sta_join[1:]
        for s1, s2 in zip(sta1, sta2):  # Iterate over s1 and s2 in parallel
            self.transfer_matrix[self.states_to_index[s1], self.states_to_index[s2]] += 1

    # Update the emission counts
    def cal_emit_matrix(self, words, states):
        # Join words and states first, then iterate character by character, because both lists were split on spaces
        for word, state in zip("".join(words), "".join(states)):
            self.emit_matrix[state][word] = self.emit_matrix[state].get(word, 0) + 1
            self.emit_matrix[state]["total"] += 1  # The extra "total" key counts the occurrences of the state

    # Normalize the matrices
    def normalize(self):
        self.init_matrix = self.init_matrix / np.sum(self.init_matrix)
        self.transfer_matrix = self.transfer_matrix / np.sum(self.transfer_matrix, axis=1, keepdims=True)  # Normalize each row
        # The factor 1000 keeps the emission values from becoming too small; it can also be omitted
        self.emit_matrix = {
            state: {word: t / word_times["total"] * 1000 for word, t in word_times.items() if word != "total"}
            for state, word_times in self.emit_matrix.items()}

    # Training is simply the process of computing the three matrices
    def train(self):
        if os.path.exists("three_matrix.pkl"):  # If saved parameters exist, load them instead of training again
            with open("three_matrix.pkl", "rb") as f:
                self.init_matrix, self.transfer_matrix, self.emit_matrix = pickle.load(f)
            return
        for words, states in tqdm(zip(self.all_texts, self.all_states)):  # Go line by line and update the three matrices
            if not words or not states:  # Skip empty lines, which would otherwise raise an IndexError below
                continue
            words = words.split(" ")  # Both files are split on spaces
            states = states.split(" ")
            self.cal_init_matrix(states[0])
            self.cal_transfer_matrix(states)
            self.cal_emit_matrix(words, states)
        self.normalize()  # Normalize once all counts are collected
        with open("three_matrix.pkl", "wb") as f:  # Save the parameters
            pickle.dump([self.init_matrix, self.transfer_matrix, self.emit_matrix], f)


# This implementation is a bit harder to follow
def viterbi_t(text, hmm):
    states = hmm.index_to_states
    emit_p = hmm.emit_matrix
    trans_p = hmm.transfer_matrix
    start_p = hmm.init_matrix
    V = [{}]  # V[t][y]: probability of the best path that ends in state y at position t
    path = {}  # path[y]: the best state sequence that ends in state y
    for y in states:
        V[0][y] = start_p[hmm.states_to_index[y]] * emit_p[y].get(text[0], 0)
        path[y] = [y]
    for t in range(1, len(text)):
        V.append({})
        newpath = {}
        # Check whether this character appears anywhere in the trained emission matrix
        neverSeen = text[t] not in emit_p['S'].keys() and \
                    text[t] not in emit_p['M'].keys() and \
                    text[t] not in emit_p['E'].keys() and \
                    text[t] not in emit_p['B'].keys()
        for y in states:
            # An unseen character gets emission 1.0 in every state (intended to let an unknown character form a word by itself)
            emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0
            # Among the four possible previous states, keep the one with the highest probability
            temp = []
            for y0 in states:
                if V[t - 1][y0] >= 0:  # Correction: this must be >=; with > the list can be empty when all previous probabilities are 0
                    temp.append((V[t - 1][y0] * trans_p[hmm.states_to_index[y0], hmm.states_to_index[y]] * emitP, y0))
            (prob, state) = max(temp)
            # Equivalent one-liner:
            # (prob, state) = max([(V[t - 1][y0] * trans_p[hmm.states_to_index[y0],hmm.states_to_index[y]] * emitP, y0) for y0 in states if V[t - 1][y0] >= 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])  # Pick the path with the maximum probability
    result = ""  # Build the output string
    for t, s in zip(text, path[state]):
        result += t
        if s == "S" or s == "E":  # A word ends at S or E, so add a space after it
            result += " "
    return result


if __name__ == "__main__":
    text_to_state()
    # text is the unsegmented sentence to cut; the model works character by character
    # text = "Although the procession was silent all the way"
    text = "No matter what major a person studies, you have to know something about literature, a little artistic quality, this is for enriching your thoughts and life, it's good to improve your aesthetic ability"
    # text = "Peking University, Beijing"  # Specialized vocabulary is handled poorly because the training text contains no related corpus
    # Stepping through in a debugger while comparing with the PPT makes the process easier to understand
    hmm = HMM()
    hmm.train()
    result = viterbi_t(text, hmm)
    print(result)
One people nothing On learning what major , It has to be understand some literature knowledge , Yes One o'clock art Accomplishment , this about Enrich own Of thought and life , Improve own Of taste Ability very Yes benefits
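The `* 1000` factor in `normalize` is there to keep the products in `viterbi_t` from becoming too small. A common alternative is to run the same recursion in log space, where products become sums. The following is only a minimal sketch of that idea on a toy model, not code from the referenced repository: `viterbi_log`, `cut`, the `floor` parameter, the toy probabilities and the characters "a", "b", "c" are all made up for illustration.

```python
import numpy as np

STATES = ["B", "M", "S", "E"]


def viterbi_log(text, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for text, computed in log space.

    start_p has shape (4,), trans_p has shape (4, 4), emit_p maps state -> {char: prob}.
    floor replaces zero probabilities so that log() stays finite; this is an
    assumption of the sketch, not something the original code does.
    """
    log_start = np.log(np.maximum(start_p, floor))
    log_trans = np.log(np.maximum(trans_p, floor))

    def log_emit(ch):
        return np.log(np.array([max(emit_p[s].get(ch, 0.0), floor) for s in STATES]))

    score = log_start + log_emit(text[0])   # best log-probability ending in each state
    back = []                               # back[t-1][j]: best previous state for state j at position t
    for ch in text[1:]:
        cand = score[:, None] + log_trans   # cand[i, j]: come from state i, move to state j
        back.append(np.argmax(cand, axis=0))
        score = np.max(cand, axis=0) + log_emit(ch)

    # Follow the back-pointers from the best final state
    best = int(np.argmax(score))
    seq = [best]
    for ptr in reversed(back):
        best = int(ptr[best])
        seq.append(best)
    return [STATES[i] for i in reversed(seq)]


def cut(text, labels):
    """Insert a space after every character labelled S or E."""
    out = ""
    for ch, s in zip(text, labels):
        out += ch
        if s in ("S", "E"):
            out += " "
    return out.strip()


# Toy parameters (hypothetical), rows and columns in the order B, M, S, E
start_p = np.array([0.6, 0.0, 0.4, 0.0])
trans_p = np.array([
    [0.0, 0.3, 0.0, 0.7],   # B -> M or E
    [0.0, 0.4, 0.0, 0.6],   # M -> M or E
    [0.5, 0.0, 0.5, 0.0],   # S -> B or S
    [0.5, 0.0, 0.5, 0.0],   # E -> B or S
])
emit_p = {
    "B": {"a": 0.9, "c": 0.1},
    "M": {"b": 1.0},
    "S": {"c": 0.8, "a": 0.2},
    "E": {"b": 0.9, "c": 0.1},
}

labels = viterbi_log("abc", start_p, trans_p, emit_p)
print(labels, cut("abc", labels))   # ['B', 'E', 'S'] ab c
```

Because only sums of logarithms are compared, no scaling factor is needed, even for long sentences.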
Copyright notice
This article was written by [myaijarvis]. Please include a link to the original when reposting. Thanks.
https://yzsam.com/2022/04/202204231426205752.html