[NLP Notes] A Preliminary Study of CRF Principles
2022-04-23 07:54:00 【Apple Laboratory of Central South University】
A note up front: I have only been doing NLP for three months, and I hope that writing these notes helps me consolidate and deepen my understanding.
When I was learning about sequence tagging, I first came across two classic statistical learning methods: HMM (the hidden Markov model) and CRF (the conditional random field). The articles about CRF that I found fall into two broad categories: one kind is hard-core derivation, starting from the formulas; the other is conceptual, starting from the intuition. Many of them are well written, but in my view, to really understand CRF you have to pay attention to both the math and the concepts. I hope this modest article can clear away some doubts for beginner NLPers like me.
1. Sequence tagging
Before trying to understand CRF, let's start with sequence tagging.
Sequence tagging is a fundamental problem in NLP. Simply put, it means assigning a tag or label to every element of a sequence. Many classic NLP tasks, such as part-of-speech tagging, word segmentation, named entity recognition, and pinyin input methods, are essentially tag prediction for the elements of a sentence. Sequence tagging is also a precursor to many higher-level NLP tasks. Since I recently worked on a word segmentation task, let me briefly describe the common tagging schemes (a small code sketch follows the list), to make the later sections easier to read:
- BIO tagging: B-Beginning, the start of a chunk; I-Inside, inside a chunk; O-Outside, not part of any chunk. (Common in named entity recognition, e.g. B-Name and so on.)
- BMES tagging: B-Begin, the first character of a word; M-Middle, a middle character of a word; E-End, the last character of a word; S-Single, a single-character word. (Often used in Chinese word segmentation.)
- BIOES tagging: the letters have the same meanings as in the two schemes above.
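To make the BMES scheme concrete, here is a minimal sketch (my own illustration, not part of the original post) that converts an already-segmented sentence into per-character BMES tags:

```python
def to_bmes(words):
    """Map a list of segmented words to per-character BMES tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append((word, "S"))        # single-character word
        else:
            tags.append((word[0], "B"))     # first character
            for ch in word[1:-1]:
                tags.append((ch, "M"))      # middle characters
            tags.append((word[-1], "E"))    # last character
    return tags

print(to_bmes(["为", "本", "单位", "服务"]))
# [('为', 'S'), ('本', 'S'), ('单', 'B'), ('位', 'E'), ('服', 'B'), ('务', 'E')]
```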
Next, let's look at CRF!
2. Why CRF?
Let me start with a personal conclusion: CRF can learn more information from the training corpus and represent richer features.
I have seen a very interesting example (the original is here, and a translation is here). The rough idea is: you have a set of photos showing, in order, one person's day, and you need to label each photo with what the person is doing. What would you do?
Generally, we would look at them one by one, and sometimes use the photos before and after a given one for reasoning, because the photos follow a temporal and spatial order: if the previous photo shows eating, the next one cannot show cooking.
This is also the core idea of CRF: by learning the features between a time step and its adjacent time steps, it obtains better predictions. Readers who already know neural networks have probably thought of LSTM by this point. LSTM models are also well suited to sequence problems, which is why models such as Bi-LSTM are often followed by a CRF layer, to capture richer feature information between the tags.
To show CRF's stronger ability to learn from the training data in word segmentation, here is an example (compared against HMM):
A sentence that appears in the training data (BMES tags): [为/S 本/S 单/B 位/E 服/B 务/E 的/S 地/B 震/E 监/B 测/E 台/B 网/E]
The corresponding segmentation should be: ['为', '本', '单位', '服务', '的', '地震', '监测', '台网']
HMM prediction: (screenshot in the original post)
CRF prediction: (screenshot in the original post)
As can be seen, when CRF encounters a sequence that appeared in the training data, it can apply the information it has learned and tag the sequence accordingly; it has a stronger ability to represent the latent information contained in the training data.
Next, let's take a closer look at what CRF is and how it differs from related models.
3. Discriminative (Discriminative) models and generative (Generative) models
These two concepts come up all the time and are, in my view, basic knowledge that must not be confused.
First, an intuitive picture (the figure appears in the original post):
- What is a discriminative model?
As the name suggests, a discriminative model judges Y from X: for an input X, it predicts the label Y, i.e. it models the conditional probability P(Y|X). In my understanding, what a discriminative model learns during training is a decision criterion, a boundary or separating surface; a new input can be classified directly against that criterion. As in the left half of the figure above, what it actually captures is the difference between the two classes, which is why discriminative models perform very well on classification and prediction tasks.
- What is a generative model?
As the name suggests, a generative model builds a "model" of each class from the training data, rather than learning the boundary between classes as a discriminative model does. It therefore computes a joint probability P(X,Y): for an input X, it computes the joint probability under each class and takes the largest as the most likely case. As shown on the right of the figure above, a generative model learns the full shape of each class, rather than focusing only on the relationship between classes.
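As a compact way to state the contrast (my own addition, using standard notation), the two prediction rules can be written as:

```latex
% Generative: model the joint distribution, predict through Bayes' rule
\hat{y} = \arg\max_{y} P(X, Y=y) = \arg\max_{y} P(Y=y)\, P(X \mid Y=y)

% Discriminative: model the conditional distribution directly
\hat{y} = \arg\max_{y} P(Y=y \mid X)
```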
- An example
Suppose my training data contains Coca-Cola and Pepsi, and I now feed a picture containing a can into the trained machine.
With a discriminative model: features are first extracted from the picture; using a few salient distinguishing features (color, logo, and so on), the model can directly give the probability that it is Coca-Cola and the probability that it is Pepsi.
With a generative model: features are first extracted from the picture; during training, a model has already been built for Coca-Cola and another for Pepsi, so the incoming features are scored against each one in turn, and the class with the highest probability is the prediction.
- What does this have to do with CRF?
HMM is a generative model, and CRF is a discriminative model. Why? Because HMM's training process builds a statistical probability model over all samples, realized through its transition matrix and emission matrix. CRF is different: it computes a conditional probability, directly modeling the classification rules learned from the training data, such as the relationship between the current position and the data before and after it. CRF focuses more on learning the features of the sequence, and the constraints within it, through feature functions. That is why CRF does not run into the problem HMM showed in Section 2: HMM only estimates, for one word, the probability of what its next word is, and does not really attend to the features across the whole sentence.
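For concreteness (my own addition, following the usual textbook notation), the two models can be written side by side; the HMM builds a joint distribution from transitions and emissions, while the CRF builds a conditional distribution from weighted feature functions:

```latex
% HMM (generative): joint probability from transitions and emissions
P(x, y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1})\, P(x_i \mid y_i)

% CRF (discriminative): conditional probability from feature functions
P(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x, i) \Big)
```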
In fact, there is a certain transformation relationship between discriminative and generative models. Intuitively, once a generative model has modeled each class, it has also implicitly obtained the boundary between the classes, which provides the prerequisite for converting it into a discriminative model. Readers who know HMM can also think about a question: is there a similar transformation relationship between HMM and CRF?
4. From Markov to CRF
- Random field
In a sample space, each position is assigned a value according to some distribution.
- Markov random field
A random field with the Markov property: the value at a position in the field depends only on the values at its adjacent positions and is independent of the values at non-adjacent positions.
- Conditional random field
A special Markov random field in which Y satisfies the Markov property. In addition, each position in the field carries an observation X. Essentially, it is a random field conditioned on the observations X: the field contains two sets of random variables, X and Y, and Y satisfies the Markov property.
- Linear-chain conditional random field (Linear-CRF)
The most common form of CRF. Its characteristic is that X and Y have the same structure and satisfy the Markov property: the value at a position in the field depends only on the values at its adjacent positions, not on non-adjacent positions.
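Written out explicitly (my own addition, using standard notation), the Markov property of a linear-chain CRF says that each label depends only on its neighbors and the observations:

```latex
P(Y_i \mid X, Y_1, \dots, Y_{i-1}, Y_{i+1}, \dots, Y_n) = P(Y_i \mid X, Y_{i-1}, Y_{i+1}),
\quad i = 1, 2, \dots, n
```

(at i = 1 and i = n only the single existing neighbor is used).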
A Linear-CRF schematic (the figure appears in the original post).
5. The maximum entropy model and CRF
- Maximum entropy model (MaxEnt)
The maximum entropy model is not only used for sequence tagging. Its greatest contribution is the introduction of feature functions and their corresponding weights for learning the overall characteristics of the input. I will not go through its math here; on the whole, its final form is very similar to CRF's.
My personal understanding is that feature functions are introduced to guide the model toward features we believe are helpful for the task. Building the conditional probability calculation on top of them makes the model discriminative. A plain generative model contains no feature functions and directly learns the distribution of the whole data.
- Maximum entropy Markov model (MEMM)
MEMM improves on HMM by using MaxEnt to compute the conditional probability. Its limitation is that MEMM computes at each local node and then combines the results. The problem is that the conditional probability produced by the maximum entropy model at each step is based only on the information at the points adjacent to that step, and normalization happens only locally, so a global view is missing.
MEMM's progress is that it introduces the discriminative approach while still operating locally on an HMM-like structure, which also makes it very fast.
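To make the local-versus-global normalization difference concrete (my own addition, not from the original post):

```latex
% MEMM: one normalizer per position, conditioned on the previous label
P(y \mid x) = \prod_{i} \frac{\exp\big(\sum_k \lambda_k f_k(y_{i-1}, y_i, x, i)\big)}{Z(y_{i-1}, x, i)}

% CRF: one global normalizer over all candidate label sequences
P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k f_k(y_{i-1}, y_i, x, i)\Big)
```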
- Feature functions in CRF
CRF is more like the crystallization of the methods above.
An indispensable concept in CRF is the feature function. When I first looked at CRF, two feature functions suddenly popped up and confused me; later I realized they are the product of predecessors' long research and trial and error. Respect. Feature functions are in fact defined by hand. For example, in a word segmentation task, if I do not want an adjective or another verb to follow a verb, I can express this explicitly through a feature function and give the model a direction to adjust toward.
CRF has two kinds of feature functions:
① The state feature function on a node:
$$s_l(y_i, x, i), \quad l = 1, 2, \dots, L$$
It describes the feature between the observation sequence and the corresponding state at that node.
② The transition feature function between nodes:
$$t_k(y_{i-1}, y_i, x, i), \quad k = 1, 2, \dots, K$$
It describes the feature relationship between nodes (here, the previous node and the current node).
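As a rough illustration (my own sketch for the BMES segmentation task, with made-up features; nothing here comes from the original post), hand-written feature functions of the two kinds might look like this:

```python
# Indicator-style feature functions: return 1 when the feature fires, 0 otherwise.

def s_de_is_single(y_i, x, i):
    """State feature: the character '的' is usually a single-character word (tag S)."""
    return 1 if x[i] == "的" and y_i == "S" else 0

def t_b_then_m_or_e(y_prev, y_i, x, i):
    """Transition feature: a 'B' tag should be followed by 'M' or 'E', never 'S' or 'B'."""
    return 1 if y_prev == "B" and y_i in ("M", "E") else 0

x = list("为本单位服务的")
print(t_b_then_m_or_e("B", "E", x, 3))  # -> 1
print(s_de_is_single("S", x, 6))        # -> 1
```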
6. Analyzing the CRF formula
Now for the formulas. Before that, let's add some background that makes them easier to understand:
- Probabilistic undirected graphs
In fact, if a joint probability distribution satisfies the pairwise, local, or global Markov property, it is called a probabilistic undirected graphical model. This definition is close to the definition of a Markov random field above. In other words, a Markov random field can be viewed as a probabilistic undirected graph: the nodes are random variables and the edges are relationships between the variables. A conditional random field is a special Markov random field, so it too can be represented as a probabilistic undirected graph.
In probabilistic undirected graphs there is another very important concept: the maximal clique.
Thanks to this article; its set of figures is very clear. The idea is simple: a clique is a group of nodes that are pairwise fully connected; a maximal clique is a clique that cannot take in any further node without breaking that full connectivity; and the maximum clique is the maximal clique with the largest number of nodes.
Why do we need maximal cliques? Because, by a rather mathematical theorem (the Hammersley-Clifford theorem), the joint probability distribution P(Y) of a probabilistic undirected graphical model can be obtained from its maximal cliques:
$$P(Y) = \frac{1}{Z}\prod_{c}\psi_c(Y_c)$$
$$Z = \sum_{Y}\prod_{c}\psi_c(Y_c)$$
$$\psi_c(Y_c) = e^{-E(Y_c)}$$
$$E(Y_c) = \sum_{u,v \in c,\, u \neq v}\alpha_{u,v}\, t_{u,v}(u, v) + \sum_{v \in c}\beta_v\, s_v(v)$$
In the first formula, c ranges over the maximal cliques of the undirected graph, $Y_c$ denotes the random variables on the nodes of clique c, $\psi_c$ is a strictly positive function, and Z is the normalization factor.
The second formula computes the normalization factor: it is the same product over all maximal cliques, now summed over all possible assignments of Y.
The third and fourth formulas give a further specification of the $\psi$ function, that is, of the potential function. In fact, there is no rule saying that $\psi$ must take exactly this form.
Because this definition is related to the Boltzmann distribution in physics, the potential function is usually set up this way. Here $\alpha$ and $\beta$ are parameters, and t() and s() are feature functions.
One more detail: the feature function t() here concerns the relationship between two nodes within a maximal clique, while s() concerns a single node on its own. This is very close to the definition of the feature functions in CRF. It does not matter if the formulas are not yet fully clear; has the reader noticed that the fourth formula looks very familiar? Let's go on to look at the CRF formula.
- The CRF formula
For ease of exposition, take the linear-chain conditional random field as the example.
The parametric form of CRF is defined as follows:
$$P(y \mid x) = \frac{1}{Z(x)}\exp\Big(\sum_{i,k}\lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l}\mu_l s_l(y_i, x, i)\Big)$$
$$Z(x) = \sum_{y}\exp\Big(\sum_{i,k}\lambda_k t_k(y_{i-1}, y_i, x, i) + \sum_{i,l}\mu_l s_l(y_i, x, i)\Big)$$
Looks familiar, doesn't it? It is essentially the formulas from the probabilistic undirected graph put together: λ and μ are learnable parameters, and the feature functions are the ones we defined earlier. In fact, the maximum-entropy formula used by MEMM is also very similar to this result, because the definition of information entropy shares some of its design with the potential function here. The main difference from CRF is in how the normalization factor is computed: MEMM obtains its normalization by multiplying the per-node normalization factors (a personal observation, not necessarily exact), whereas CRF computes a single global normalization factor, which gives it a stronger global view.
Having read this far, can you see where the CRF formula comes from?
For convenience of expression, the formula is usually simplified as follows:
$$P(y \mid x) = \frac{1}{Z(x)}\exp\Big(\sum_{k}\omega_k f_k(y, x)\Big)$$
$$Z(x) = \sum_{y}\exp\Big(\sum_{k}\omega_k f_k(y, x)\Big)$$
Here ω stands for both kinds of parameters and f() for both kinds of feature functions. With this form we can actually see part of how CRF works: the more feature functions a label sequence satisfies, the larger its conditional probability (leaving ω aside; if ω is taken into account, add the premise: among the features we want the model to learn).
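Here is a toy sketch (my own illustration, with made-up features and weights) of how the simplified formula assembles a score from feature functions and turns it into P(y|x) with a single global normalizer; it only shows the mechanics and is not an efficient implementation:

```python
import math
from itertools import product

TAGS = ["B", "M", "E", "S"]

# Global features f_k(y, x): each sums an indicator over all positions i.
def f_b_then_e(y, x):
    return sum(1 for i in range(1, len(y)) if y[i - 1] == "B" and y[i] == "E")

def f_de_is_single(y, x):
    return sum(1 for i in range(len(y)) if x[i] == "的" and y[i] == "S")

features = [f_b_then_e, f_de_is_single]
weights = [1.5, 2.0]                      # made-up "learned" weights omega_k

def score(y, x):
    return sum(w * f(y, x) for w, f in zip(weights, features))

def prob(y, x):
    """P(y|x) with a brute-force global normalizer Z(x); fine for toy lengths."""
    z = sum(math.exp(score(list(cand), x)) for cand in product(TAGS, repeat=len(x)))
    return math.exp(score(y, x)) / z

x = list("单位的")
print(prob(["B", "E", "S"], x))           # sequences satisfying more features score higher
```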
From here, algorithms such as gradient descent or L-BFGS can be used to learn the parameters.
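In practice one would reach for a library rather than hand-rolled code. The sketch below uses sklearn-crfsuite (one possible choice, not mentioned in the original post; `pip install sklearn-crfsuite`) with L-BFGS training and deliberately minimal character features:

```python
import sklearn_crfsuite

def char_features(sent, i):
    # Each character is described by itself and its immediate neighbors.
    feats = {"char": sent[i]}
    if i > 0:
        feats["prev_char"] = sent[i - 1]
    if i < len(sent) - 1:
        feats["next_char"] = sent[i + 1]
    return feats

def sent2features(sent):
    return [char_features(sent, i) for i in range(len(sent))]

# Toy training data: one sentence and its BMES tags (from the example above).
X_train = [sent2features(list("为本单位服务"))]
y_train = [["S", "S", "B", "E", "B", "E"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)
print(crf.predict([sent2features(list("为本单位服务"))]))
```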
7. Summary
In my personal view, the core of CRF is that it introduces globality. The discussion above is all about Linear-CRF; in fact, a CRF can be much more complex, and that complexity is entirely determined by your feature functions. This design lets CRF mine more constraints and information between tags, but the drawback is also obvious: training becomes slow.
Finally, thanks to brother Junyi and the rest of the KNLP group!!
(The original text is on my personal blog: A Preliminary Study of CRF Principles.)