[NLP] HMM hidden Markov + Viterbi word segmentation
2022-04-23 14:45:00 【myaijarvis】
[Reference: NLP-HMM Hidden Markov + Viterbi word segmentation, code + data + explanation - Bilibili] The PPT is simple and very good
[Reference: shouxieai/nlp-hmm-word-cut: nlp-hmm-word-cut]
How to explain the Viterbi algorithm? - Lu Sheng's answer - Zhihu
How to explain the Viterbi algorithm? - JustCoder's answer - Zhihu
PPT
Code
import os
import pickle

import numpy as np
from tqdm import tqdm


# Convert a word into its BMES label, e.g. "今天" -> BE, "麻辣肥牛" -> BMME, "的" -> S
def make_label(text_str):
    text_len = len(text_str)
    if text_len == 1:
        return "S"
    # Starts with B, ends with E, everything in between is M
    return "B" + "M" * (text_len - 2) + "E"
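As a quick sanity check on the labeling rule, a standalone sketch (same logic as make_label above, with placeholder words):

```python
def label(word):
    # Single-character word -> "S"; otherwise B, middles M, final E
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"

print(label("a"))     # S
print(label("ab"))    # BE
print(label("abcd"))  # BMME
```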
# Convert the raw corpus into the corresponding state file,
# e.g. the line "今天 我 要 上学" -> "BE S S BE"
def text_to_state(file="all_train_text.txt"):
    # all_train_text.txt is already segmented into words separated by spaces
    if os.path.exists("all_train_state.txt"):  # If the state file already exists, skip
        return
    # Read the file and split it into lines; all_data is a list of lines
    all_data = open(file, "r", encoding="utf-8").read().split("\n")
    with open("all_train_state.txt", "w", encoding="utf-8") as f:
        # Traverse line by line; tqdm shows a progress bar; data is one article (one line) and may be empty
        for d_index, data in tqdm(enumerate(all_data)):
            if data:  # Skip empty lines
                state_ = ""
                for w in data.split(" "):  # Split the current line on spaces; w is one word
                    if w:
                        state_ = state_ + make_label(w) + " "  # Label each word
                if d_index != len(all_data) - 1:  # Append "\n" to every line except the last
                    state_ = state_.strip() + "\n"  # Strip the trailing space of each line
                f.write(state_)
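For a single pre-segmented line, the conversion amounts to labeling each word and joining the labels with spaces; a minimal sketch (the sample line is illustrative):

```python
def label(word):
    # BMES labeling, as in make_label above
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"

line = "今天 我 要 上学"  # one space-segmented training line ("today I will go to school")
states = " ".join(label(w) for w in line.split(" ") if w)
print(states)  # BE S S BE
```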
# Define the HMM class; the heart of it is the three matrices
class HMM:
    def __init__(self, file_text="all_train_text.txt", file_state="all_train_state.txt"):
        self.all_states = open(file_state, "r", encoding="utf-8").read().split("\n")[:200]  # All state lines
        self.all_texts = open(file_text, "r", encoding="utf-8").read().split("\n")[:200]  # All text lines
        # Define an index for each state so we can look indices up from states, and vice versa
        self.states_to_index = {"B": 0, "M": 1, "S": 2, "E": 3}
        self.index_to_states = ["B", "M", "S", "E"]
        self.len_states = len(self.states_to_index)  # Number of states: 4
        # The most important pieces are the three matrices below
        self.init_matrix = np.zeros((self.len_states))  # Initial matrix: 1 x 4, one entry per state B/M/S/E
        self.transfer_matrix = np.zeros((self.len_states, self.len_states))  # Transition matrix: 4 x 4
        # Emission matrix, implemented as a two-level nested dict.
        # Note the extra "total" key initialized here: it stores the total number of
        # emissions from the current state, used later for normalization.
        self.emit_matrix = {"B": {"total": 0}, "M": {"total": 0}, "S": {"total": 0}, "E": {"total": 0}}
    # Compute the initial matrix
    def cal_init_matrix(self, state):
        # Count the state of the first character of each line: +1 for whichever of B/M/S/E appears
        self.init_matrix[self.states_to_index[state[0]]] += 1

    # Compute the transition matrix
    def cal_transfer_matrix(self, states):
        # Count transitions from each state to the next one,
        # i.e. from each element of sta1 to the corresponding element of sta2
        sta_join = "".join(states)
        sta1 = sta_join[:-1]
        sta2 = sta_join[1:]
        for s1, s2 in zip(sta1, sta2):  # Iterate over s1 and s2 in lockstep
            self.transfer_matrix[self.states_to_index[s1], self.states_to_index[s2]] += 1

    # Compute the emission matrix
    def cal_emit_matrix(self, words, states):
        # Join words and states first, then iterate pairwise (joining removes the spaces in between)
        for word, state in zip("".join(words), "".join(states)):
            self.emit_matrix[state][word] = self.emit_matrix[state].get(word, 0) + 1
            # The extra "total" key stores how many emissions this state has, for normalization later
            self.emit_matrix[state]["total"] += 1
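The pairwise zip in cal_transfer_matrix amounts to counting bigrams of states; a standalone sketch with a toy state string:

```python
import numpy as np

states_to_index = {"B": 0, "M": 1, "S": 2, "E": 3}
transfer = np.zeros((4, 4))

sta_join = "BESBE"  # toy joined state sequence (from the state line "BE S BE")
for s1, s2 in zip(sta_join[:-1], sta_join[1:]):  # pairs: BE, ES, SB, BE
    transfer[states_to_index[s1], states_to_index[s2]] += 1

print(transfer[0, 3])  # 2.0 -> B followed by E occurred twice
print(transfer[3, 2])  # 1.0 -> E followed by S occurred once
```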
    # Normalize the three matrices
    def normalize(self):
        self.init_matrix = self.init_matrix / np.sum(self.init_matrix)
        # Normalize each row of the transition matrix
        self.transfer_matrix = self.transfer_matrix / np.sum(self.transfer_matrix, axis=1, keepdims=True)
        # The * 1000 keeps the probabilities from becoming too small; it can also be omitted
        self.emit_matrix = {
            state: {word: t / word_times["total"] * 1000
                    for word, t in word_times.items() if word != "total"}
            for state, word_times in self.emit_matrix.items()}
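The keepdims=True in the transition normalization makes each row divide by its own sum, so every row becomes a probability distribution; a small check with toy counts:

```python
import numpy as np

counts = np.array([[2.0, 0.0, 0.0, 2.0],
                   [1.0, 1.0, 0.0, 2.0]])
# axis=1 sums each row; keepdims=True keeps shape (2, 1) so broadcasting divides row-wise
probs = counts / np.sum(counts, axis=1, keepdims=True)
print(probs[0])               # [0.5 0.  0.  0.5]
print(np.sum(probs, axis=1))  # [1. 1.] -- each row now sums to 1
```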
    # Training is really just the process of solving the three matrices
    def train(self):
        if os.path.exists("three_matrix.pkl"):  # If the parameters already exist, skip training
            self.init_matrix, self.transfer_matrix, self.emit_matrix = pickle.load(open("three_matrix.pkl", "rb"))
            return
        # Walk the two files line by line and call the three matrix-counting functions
        for words, states in tqdm(zip(self.all_texts, self.all_states)):
            words = words.split(" ")  # Both files are space-separated
            states = states.split(" ")
            self.cal_init_matrix(states[0])  # Update the three matrices
            self.cal_transfer_matrix(states)
            self.cal_emit_matrix(words, states)
        self.normalize()  # Normalize once the counts are in
        pickle.dump([self.init_matrix, self.transfer_matrix, self.emit_matrix], open("three_matrix.pkl", "wb"))  # Save the parameters
# This part is a little harder to implement
def viterbi_t(text, hmm):
    states = hmm.index_to_states
    emit_p = hmm.emit_matrix
    trans_p = hmm.transfer_matrix
    start_p = hmm.init_matrix
    V = [{}]
    path = {}
    for y in states:
        V[0][y] = start_p[hmm.states_to_index[y]] * emit_p[y].get(text[0], 0)
        path[y] = [y]
    for t in range(1, len(text)):
        V.append({})
        newpath = {}
        # Check whether the trained emission matrix has ever seen this character
        neverSeen = text[t] not in emit_p['S'].keys() and \
                    text[t] not in emit_p['M'].keys() and \
                    text[t] not in emit_p['E'].keys() and \
                    text[t] not in emit_p['B'].keys()
        for y in states:
            emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0  # Treat unknown characters as single-character words
            # Of the four possible predecessors, pick the one with the highest probability
            temp = []
            for y0 in states:
                if V[t - 1][y0] >= 0:  # Correction: this must be >= (not >)
                    temp.append((V[t - 1][y0] * trans_p[hmm.states_to_index[y0], hmm.states_to_index[y]] * emitP, y0))
            (prob, state) = max(temp)
            # Equivalent one-liner:
            # (prob, state) = max([(V[t - 1][y0] * trans_p[hmm.states_to_index[y0], hmm.states_to_index[y]] * emitP, y0) for y0 in states if V[t - 1][y0] >= 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])  # Find the most probable final path
    result = ""  # Stitch the result together
    for t, s in zip(text, path[state]):
        result += t
        if s == "S" or s == "E":  # Append a space after an S or E state (a word boundary)
            result += " "
    return result
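The same dynamic program can be checked on a tiny hand-built HMM. The numbers below are hypothetical toy parameters (not from the trained model), chosen so that the two-character input "AB" should decode as one word, i.e. states B then E:

```python
import numpy as np

states = ["B", "M", "S", "E"]
s2i = {s: i for i, s in enumerate(states)}
start = np.array([0.6, 0.0, 0.4, 0.0])   # P(first state)
trans = np.array([[0.0, 0.3, 0.0, 0.7],  # from B
                  [0.0, 0.5, 0.0, 0.5],  # from M
                  [0.5, 0.0, 0.5, 0.0],  # from S
                  [0.6, 0.0, 0.4, 0.0]]) # from E
emit = {"B": {"A": 0.8}, "M": {}, "S": {"A": 0.2}, "E": {"B": 0.9}}

text = "AB"
# Initialization: start probability times emission of the first character
V = [{y: start[s2i[y]] * emit[y].get(text[0], 0) for y in states}]
path = {y: [y] for y in states}
for t in range(1, len(text)):
    V.append({})
    newpath = {}
    for y in states:
        # Best predecessor: maximize previous score * transition * emission
        prob, prev = max((V[t - 1][y0] * trans[s2i[y0], s2i[y]] * emit[y].get(text[t], 0), y0)
                         for y0 in states)
        V[t][y] = prob
        newpath[y] = path[prev] + [y]
    path = newpath
best = max(states, key=lambda y: V[-1][y])
print(path[best])  # ['B', 'E']
```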
if __name__ == "__main__":
    text_to_state()
    # text = " Although the procession was silent all the way "  (a shorter Chinese test sentence)
    text = "一个人无论学什么专业，总得懂些文学知识，有一点艺术修养，这对于丰富自己的思想和生活，提高自己的审美能力很有好处"
    # text = " Peking University, Beijing "  # Specialized vocabulary works poorly, because the training corpus lacks it
    # Stepping through with a debugger alongside the PPT makes the process easier to understand
    hmm = HMM()
    hmm.train()
    result = viterbi_t(text, hmm)
    print(result)
Output: the input sentence printed with a space inserted after each predicted word boundary (i.e. after every character whose state is S or E).
Copyright notice
This article was created by [myaijarvis]. Please include a link to the original when reprinting. Thanks.
https://yzsam.com/2022/04/202204231426205752.html