[NLP] HMM hidden Markov + Viterbi word segmentation
2022-04-23 14:45:00 【myaijarvis】
[Reference: NLP-HMM Hidden Markov + Viterbi word segmentation, code + data + explanation - Bilibili] The PPT is simple and very good
[Reference: shouxieai/nlp-hmm-word-cut: nlp-hmm-word-cut]
How to explain the Viterbi algorithm? - Lu Sheng's answer - Zhihu
How to explain the Viterbi algorithm? - JustCoder's answer - Zhihu
PPT
Code
import os
import pickle

import numpy as np
from tqdm import tqdm


# Convert a word into its BMES label, e.g. "今天" -> BE, "麻辣肥牛" -> BMME, "的" -> S
def make_label(text_str):
    text_len = len(text_str)
    if text_len == 1:
        return "S"
    # Starts with B, ends with E, everything in between is M
    return "B" + "M" * (text_len - 2) + "E"
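As a quick sanity check on the labeling rule, a standalone sketch (same logic as make_label above, with placeholder words):

```python
def label(word):
    # Single-character word -> "S"; otherwise B, middles M, final E
    if len(word) == 1:
        return "S"
    return "B" + "M" * (len(word) - 2) + "E"

print(label("a"))     # S
print(label("ab"))    # BE
print(label("abcd"))  # BMME
```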
# Convert the raw corpus into the corresponding state file,
# e.g. the line "今天 我 要 上学" -> "BE S S BE"
def text_to_state(file="all_train_text.txt"):
    # all_train_text.txt is already segmented into words separated by spaces
    if os.path.exists("all_train_state.txt"):  # If the state file already exists, skip
        return
    # Read the file and split it into lines; all_data is a list of lines
    all_data = open(file, "r", encoding="utf-8").read().split("\n")
    with open("all_train_state.txt", "w", encoding="utf-8") as f:
        # Traverse line by line; tqdm shows a progress bar; data is one article (one line) and may be empty
        for d_index, data in tqdm(enumerate(all_data)):
            if data:  # Skip empty lines
                state_ = ""
                for w in data.split(" "):  # Split the current line on spaces; w is one word
                    if w:
                        state_ = state_ + make_label(w) + " "  # Label each word
                if d_index != len(all_data) - 1:  # Append "\n" to every line except the last
                    state_ = state_.strip() + "\n"  # Strip the trailing space of each line
                f.write(state_)
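For a single pre-segmented line, the conversion amounts to labeling each word and joining the labels with spaces; a minimal sketch (the sample line is illustrative):

```python
def label(word):
    # BMES labeling, as in make_label above
    return "S" if len(word) == 1 else "B" + "M" * (len(word) - 2) + "E"

line = "今天 我 要 上学"  # one space-segmented training line ("today I will go to school")
states = " ".join(label(w) for w in line.split(" ") if w)
print(states)  # BE S S BE
```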
# Define the HMM class; the heart of it is the three matrices
class HMM:
    def __init__(self, file_text="all_train_text.txt", file_state="all_train_state.txt"):
        self.all_states = open(file_state, "r", encoding="utf-8").read().split("\n")[:200]  # All state lines
        self.all_texts = open(file_text, "r", encoding="utf-8").read().split("\n")[:200]  # All text lines
        # Define an index for each state so we can look indices up from states, and vice versa
        self.states_to_index = {"B": 0, "M": 1, "S": 2, "E": 3}
        self.index_to_states = ["B", "M", "S", "E"]
        self.len_states = len(self.states_to_index)  # Number of states: 4
        # The most important pieces are the three matrices below
        self.init_matrix = np.zeros((self.len_states))  # Initial matrix: 1 x 4, one entry per state B/M/S/E
        self.transfer_matrix = np.zeros((self.len_states, self.len_states))  # Transition matrix: 4 x 4
        # Emission matrix, implemented as a two-level nested dict.
        # Note the extra "total" key initialized here: it stores the total number of
        # emissions from the current state, used later for normalization.
        self.emit_matrix = {"B": {"total": 0}, "M": {"total": 0}, "S": {"total": 0}, "E": {"total": 0}}
    # Compute the initial matrix
    def cal_init_matrix(self, state):
        # Count the state of the first character of each line: +1 for whichever of B/M/S/E appears
        self.init_matrix[self.states_to_index[state[0]]] += 1

    # Compute the transition matrix
    def cal_transfer_matrix(self, states):
        # Count transitions from each state to the next one,
        # i.e. from each element of sta1 to the corresponding element of sta2
        sta_join = "".join(states)
        sta1 = sta_join[:-1]
        sta2 = sta_join[1:]
        for s1, s2 in zip(sta1, sta2):  # Iterate over s1 and s2 in lockstep
            self.transfer_matrix[self.states_to_index[s1], self.states_to_index[s2]] += 1

    # Compute the emission matrix
    def cal_emit_matrix(self, words, states):
        # Join words and states first, then iterate pairwise (joining removes the spaces in between)
        for word, state in zip("".join(words), "".join(states)):
            self.emit_matrix[state][word] = self.emit_matrix[state].get(word, 0) + 1
            # The extra "total" key stores how many emissions this state has, for normalization later
            self.emit_matrix[state]["total"] += 1
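The pairwise zip in cal_transfer_matrix amounts to counting bigrams of states; a standalone sketch with a toy state string:

```python
import numpy as np

states_to_index = {"B": 0, "M": 1, "S": 2, "E": 3}
transfer = np.zeros((4, 4))

sta_join = "BESBE"  # toy joined state sequence (from the state line "BE S BE")
for s1, s2 in zip(sta_join[:-1], sta_join[1:]):  # pairs: BE, ES, SB, BE
    transfer[states_to_index[s1], states_to_index[s2]] += 1

print(transfer[0, 3])  # 2.0 -> B followed by E occurred twice
print(transfer[3, 2])  # 1.0 -> E followed by S occurred once
```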
    # Normalize the three matrices
    def normalize(self):
        self.init_matrix = self.init_matrix / np.sum(self.init_matrix)
        # Normalize each row of the transition matrix
        self.transfer_matrix = self.transfer_matrix / np.sum(self.transfer_matrix, axis=1, keepdims=True)
        # The * 1000 keeps the probabilities from becoming too small; it can also be omitted
        self.emit_matrix = {
            state: {word: t / word_times["total"] * 1000
                    for word, t in word_times.items() if word != "total"}
            for state, word_times in self.emit_matrix.items()}
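The keepdims=True in the transition normalization makes each row divide by its own sum, so every row becomes a probability distribution; a small check with toy counts:

```python
import numpy as np

counts = np.array([[2.0, 0.0, 0.0, 2.0],
                   [1.0, 1.0, 0.0, 2.0]])
# axis=1 sums each row; keepdims=True keeps shape (2, 1) so broadcasting divides row-wise
probs = counts / np.sum(counts, axis=1, keepdims=True)
print(probs[0])               # [0.5 0.  0.  0.5]
print(np.sum(probs, axis=1))  # [1. 1.] -- each row now sums to 1
```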
    # Training is really just the process of solving the three matrices
    def train(self):
        if os.path.exists("three_matrix.pkl"):  # If the parameters already exist, skip training
            self.init_matrix, self.transfer_matrix, self.emit_matrix = pickle.load(open("three_matrix.pkl", "rb"))
            return
        # Walk the two files line by line and call the three matrix-counting functions
        for words, states in tqdm(zip(self.all_texts, self.all_states)):
            words = words.split(" ")  # Both files are space-separated
            states = states.split(" ")
            self.cal_init_matrix(states[0])  # Update the three matrices
            self.cal_transfer_matrix(states)
            self.cal_emit_matrix(words, states)
        self.normalize()  # Normalize once the counts are in
        pickle.dump([self.init_matrix, self.transfer_matrix, self.emit_matrix], open("three_matrix.pkl", "wb"))  # Save the parameters
# This part is a little harder to implement
def viterbi_t(text, hmm):
    states = hmm.index_to_states
    emit_p = hmm.emit_matrix
    trans_p = hmm.transfer_matrix
    start_p = hmm.init_matrix
    V = [{}]
    path = {}
    for y in states:
        V[0][y] = start_p[hmm.states_to_index[y]] * emit_p[y].get(text[0], 0)
        path[y] = [y]
    for t in range(1, len(text)):
        V.append({})
        newpath = {}
        # Check whether the trained emission matrix has ever seen this character
        neverSeen = text[t] not in emit_p['S'].keys() and \
                    text[t] not in emit_p['M'].keys() and \
                    text[t] not in emit_p['E'].keys() and \
                    text[t] not in emit_p['B'].keys()
        for y in states:
            emitP = emit_p[y].get(text[t], 0) if not neverSeen else 1.0  # Treat unknown characters as single-character words
            # Of the four possible predecessors, pick the one with the highest probability
            temp = []
            for y0 in states:
                if V[t - 1][y0] >= 0:  # Correction: this must be >= (not >)
                    temp.append((V[t - 1][y0] * trans_p[hmm.states_to_index[y0], hmm.states_to_index[y]] * emitP, y0))
            (prob, state) = max(temp)
            # Equivalent one-liner:
            # (prob, state) = max([(V[t - 1][y0] * trans_p[hmm.states_to_index[y0], hmm.states_to_index[y]] * emitP, y0) for y0 in states if V[t - 1][y0] >= 0])
            V[t][y] = prob
            newpath[y] = path[state] + [y]
        path = newpath
    (prob, state) = max([(V[len(text) - 1][y], y) for y in states])  # Find the most probable final path
    result = ""  # Stitch the result together
    for t, s in zip(text, path[state]):
        result += t
        if s == "S" or s == "E":  # Append a space after an S or E state (a word boundary)
            result += " "
    return result
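The same dynamic program can be checked on a tiny hand-built HMM. The numbers below are hypothetical toy parameters (not from the trained model), chosen so that the two-character input "AB" should decode as one word, i.e. states B then E:

```python
import numpy as np

states = ["B", "M", "S", "E"]
s2i = {s: i for i, s in enumerate(states)}
start = np.array([0.6, 0.0, 0.4, 0.0])   # P(first state)
trans = np.array([[0.0, 0.3, 0.0, 0.7],  # from B
                  [0.0, 0.5, 0.0, 0.5],  # from M
                  [0.5, 0.0, 0.5, 0.0],  # from S
                  [0.6, 0.0, 0.4, 0.0]]) # from E
emit = {"B": {"A": 0.8}, "M": {}, "S": {"A": 0.2}, "E": {"B": 0.9}}

text = "AB"
# Initialization: start probability times emission of the first character
V = [{y: start[s2i[y]] * emit[y].get(text[0], 0) for y in states}]
path = {y: [y] for y in states}
for t in range(1, len(text)):
    V.append({})
    newpath = {}
    for y in states:
        # Best predecessor: maximize previous score * transition * emission
        prob, prev = max((V[t - 1][y0] * trans[s2i[y0], s2i[y]] * emit[y].get(text[t], 0), y0)
                         for y0 in states)
        V[t][y] = prob
        newpath[y] = path[prev] + [y]
    path = newpath
best = max(states, key=lambda y: V[-1][y])
print(path[best])  # ['B', 'E']
```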
if __name__ == "__main__":
    text_to_state()
    # text = " Although the procession was silent all the way "  (a shorter Chinese test sentence)
    text = "一个人无论学什么专业，总得懂些文学知识，有一点艺术修养，这对于丰富自己的思想和生活，提高自己的审美能力很有好处"
    # text = " Peking University, Beijing "  # Specialized vocabulary works poorly, because the training corpus lacks it
    # Stepping through with a debugger alongside the PPT makes the process easier to understand
    hmm = HMM()
    hmm.train()
    result = viterbi_t(text, hmm)
    print(result)
Output: the input sentence printed with a space inserted after each predicted word boundary (i.e. after every character whose state is S or E).
Copyright notice
This article was created by [myaijarvis]. Please include a link to the original when reprinting. Thanks.
https://yzsam.com/2022/04/202204231426205752.html