1 The Long-Range Dependency Problem

   Why do RNNs struggle with long-range dependencies in practice?

  An example of vanishing gradients

  Three ways to deal with vanishing gradients


  Recommended: the ReLU function and its derivative

 2 Long Short-Term Memory networks (LSTM)


  The LSTM repeating module

   The core idea of LSTM


  LSTM step by step: the forget gate

  LSTM step by step: the input gate

  LSTM step by step: updating the cell state

  LSTM step by step: the output gate

(1) LSTM training algorithm framework

 3 Gated Recurrent Unit (GRU)

4 Deep recurrent neural networks

  Stacked Recurrent Neural Network (SRNN)

  Bidirectional Recurrent Neural Network

Optimizing the weight matrices W and U requires tracing back through the earlier history of the sequence (this is the key point of the BPTT algorithm).

1 The Long-Range Dependency Problem

One advantage of an RNN is that it can use earlier information for the current task, and this works well when the gap between the relevant information and the word to be predicted is small.
An example is predicting the last word of the sentence “the clouds are in the sky”.

 However, as that gap grows, an RNN loses the ability to learn to connect information that far apart.

An example is predicting the last word of “I grew up in France… I speak fluent French”.

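   Why do RNNs struggle with long-range dependencies in practice?

From the RNN derivation in the previous section, the formula for propagating the error term backward through time is:

 Applying the following inequality gives an upper bound on its norm (the norm can be viewed as a measure of how large the values in the error term are):

 where $\beta_f$ and $\beta_W$ are upper bounds on the norms of the diagonal matrix and of the matrix $W$, respectively.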
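
The two formulas referred to above did not survive in this text. A possible reconstruction is given below, assuming the usual BPTT notation in which $\delta_t$ is the error term at time $t$, $f$ is the activation function and $net_i$ is the weighted input at time $i$. These symbol names, and the order of the factors inside the product, are assumptions rather than something this section states.

```latex
\delta_k^{T} = \delta_t^{T} \prod_{i=k}^{t-1} W \, \mathrm{diag}\left[ f'(net_i) \right]

\left\lVert \delta_k^{T} \right\rVert
  \le \left\lVert \delta_t^{T} \right\rVert
      \prod_{i=k}^{t-1} \lVert W \rVert \, \left\lVert \mathrm{diag}\left[ f'(net_i) \right] \right\rVert
  \le \left\lVert \delta_t^{T} \right\rVert \left( \beta_W \beta_f \right)^{t-k}
```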

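As this shows, when the error term is propagated from time $t$ back to time $k$, the upper bound on its norm is an exponential function of $\beta_f \beta_W$ with exponent $t-k$.
When $t-k$ is large (that is, when the error is propagated across many time steps), the value becomes extremely small if $\beta_f \beta_W < 1$ or extremely large if $\beta_f \beta_W > 1$; the former is the vanishing gradient, the latter the exploding gradient.
A vanishing or exploding gradient ends up as 0 or NaN, so the parameters can no longer be updated and training stalls. This is the long-range dependency problem of RNNs.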
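
To get a feel for how quickly the exponential factor moves toward 0 or toward very large values, the short sketch below evaluates $(\beta_f \beta_W)^{t-k}$ for two illustrative products. The values 0.9 and 1.1 and the helper name `bound_factor` are assumptions chosen for the example, not values from the text.

```python
def bound_factor(beta_product: float, steps: int) -> float:
    """Return (beta_f * beta_W) ** (t - k), the factor in the upper bound."""
    return beta_product ** steps


# Hypothetical products just below and just above 1.
for beta_product in (0.9, 1.1):
    for steps in (1, 10, 50, 100):
        factor = bound_factor(beta_product, steps)
        print(f"beta_f*beta_W = {beta_product}, t-k = {steps:3d}: {factor:.3e}")
```

Even a product only slightly different from 1 changes the factor by several orders of magnitude after 100 time steps.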

  An example of vanishing gradients

In an RNN, the final gradient of the weight matrix W is the sum of the gradients at each time step, that is:


 From time t-3 onward, the gradient has already shrunk to almost 0. Going further back in time, the gradients obtained (almost zero) contribute nothing to the final gradient value. This is why a vanilla RNN cannot handle long-range dependencies.
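
The numbers of the original example are not reproduced here. As a rough illustration of the same effect, the sketch below pushes an error term backward through a small tanh RNN and prints its norm at each earlier time step. The layer sizes, the random weights and the rescaling of `W` to spectral norm 0.5 are all assumptions made for the demonstration. Since the gradient contribution of a time step is built from its error term, a vanishing error term means a vanishing contribution to the summed gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 4, 8                      # illustrative sizes (assumptions)

# Recurrent weights rescaled so that their spectral norm is 0.5 (< 1).
W = rng.standard_normal((hidden, hidden))
W *= 0.5 / np.linalg.norm(W, 2)

# Forward pass of a bias-free tanh RNN without external input.
h = rng.standard_normal(hidden)
states = []
for _ in range(steps):
    h = np.tanh(W @ h)
    states.append(h)

# Backward pass: delta_{k} = diag(tanh'(net_k)) W^T delta_{k+1},
# using tanh'(net_k) = 1 - h_k ** 2.
delta = np.ones(hidden)                   # error term at the last time step
norms = [np.linalg.norm(delta)]
for k in range(steps - 2, -1, -1):
    delta = (1.0 - states[k] ** 2) * (W.T @ delta)
    norms.append(np.linalg.norm(delta))

for back, norm in enumerate(norms):
    print(f"t-{back}: |delta| = {norm:.2e}")
```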

  Three ways to deal with vanishing gradients

Generally speaking, exploding gradients are the easier problem to handle. When the gradient explodes, the program reports NaN errors. One can also set a gradient threshold and clip the gradient directly whenever it exceeds that threshold.
Vanishing gradients are harder to detect and also harder to deal with. Overall, there are three ways to deal with them:
1. Initialize the weights sensibly, so that each neuron avoids, as far as possible, taking extremely large or extremely small values.
2. Use ReLU instead of sigmoid and tanh as the activation function.
3. Use RNNs with a different structure, such as the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU).
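
The gradient threshold mentioned above for exploding gradients can be sketched as follows. The text does not say whether the truncation is applied per element or to the whole gradient, so this example assumes norm-based clipping; the function name `clip_by_norm` and the threshold value are hypothetical.

```python
import numpy as np


def clip_by_norm(grad: np.ndarray, threshold: float) -> np.ndarray:
    """Rescale grad so that its L2 norm does not exceed threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        return grad * (threshold / norm)
    return grad


g = np.array([30.0, 40.0])                # L2 norm is 50
clipped = clip_by_norm(g, threshold=5.0)  # same direction, norm reduced to 5
print(clipped, np.linalg.norm(clipped))
```

Rescaling the whole vector keeps the update direction unchanged, which is one reason norm-based clipping is a common choice.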


The derivative of the sigmoid function takes values in (0, 0.25], so backpropagation through it causes the gradient to vanish.
The derivative of the tanh function takes values in (0, 1], a wider range, but it can still cause the gradient to vanish.
The sigmoid function is not symmetric about the origin, and its outputs are all greater than 0.
The tanh function is symmetric about the origin, which helps the network converge better.


 Although the tanh function is similar in shape to the sigmoid function, its derivative (range 0 to 1) is larger than that of the sigmoid function (range 0 to 1/4). Plots of the tanh function and its derivative:
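
The derivative ranges quoted above can be checked numerically. This is a small sketch on a hypothetical grid of points; the helper names are introduced only for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2


x = np.linspace(-6.0, 6.0, 1201)          # grid that passes through (about) 0

print("largest sigmoid derivative:", d_sigmoid(x).max())
print("largest tanh derivative   :", d_tanh(x).max())
print("smallest sigmoid output   :", sigmoid(x).min())
```

Both derivatives peak at x = 0 and decay toward 0 on either side, which is where saturated units stop passing gradient.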

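  Recommended: the ReLU function and its derivative

The derivative of the ReLU function is 0 on the left side and constantly 1 on the right side, which avoids the repeated multiplication of small fractions, although the weights are still multiplied together during backpropagation. The ReLU function therefore mitigates the vanishing gradient phenomenon.

It closely mirrors the one-sided inhibition and one-sided activation of biological neurons. Plots of the ReLU function and its derivative:

 Drawback: the derivative on the left side is 0, so neurons can easily die and stop learning; in practice, variants of the ReLU function are therefore generally used.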
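
A minimal sketch of ReLU and of one possible variant follows. The text does not name a specific variant; Leaky ReLU and the slope `alpha=0.01` are assumptions picked for illustration.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def d_relu(x):
    # Derivative: 0 on the left side, 1 on the right side.
    return (x > 0).astype(float)


def leaky_relu(x, alpha=0.01):
    # Variant that keeps a small slope alpha for negative inputs.
    return np.where(x > 0, x, alpha * x)


def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)


x = np.array([-2.0, -0.5, 0.5, 2.0])
print("relu         :", relu(x), d_relu(x))
print("leaky relu   :", leaky_relu(x), d_leaky_relu(x))
```

With the leaky variant the derivative for negative inputs is small but not zero, so a unit that falls on the left side can still receive updates.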

 2 Long Short-Term Memory networks (LSTM)

Long Short-Term Memory networks (LSTMs below) are a special kind of RNN designed to solve the long-range dependency problem.
They add a state c, called the cell state, which stores the long-term state.

 LSTMs first of all inherit the properties of the RNN model, so they have a short-term memory function. In addition, their special memory cell gives them a long-term memory function as well.


 An LSTM uses so-called switches to control the cell state c and the output of the hidden layer. The three switches that act on the cell state are described next.

The key to an LSTM is how to control the long-term state c. An LSTM uses three control switches:
        ① The first switch controls how much of the long-term state c continues to be kept;
        ② The second switch controls how the immediate state is written into the long-term state c;
        ③ The third switch controls whether the long-term state c is used as the output of the current LSTM step;

  The LSTM repeating module

Each module represents a different time step.

The repeating module of a standard RNN is shown below. Inside a single time step there is just one tanh activation, so the internal structure is relatively simple.

The repeating module of an LSTM is shown below. Besides h, the cell state c also flows through time, and the cell state c represents the long-term memory.

  • Yellow box: a learned neural network layer;
  • Pink circle: a pointwise arithmetic operation;
  • Single arrow: the direction in which a vector is passed;
  • Two arrows merging: vector concatenation;
  • An arrow forking: a vector being copied;

   The core idea of LSTM

A major difference from the RNN: the key to the LSTM is the cell state C, the horizontal line running across the top of the figure above.
Passing the cell state along resembles a conveyor belt: it runs straight down the entire chain with only a few minor linear interactions, so information is easily preserved.


Through carefully designed structures called gates, an LSTM removes information from, or adds information to, the cell state. A gate is a way of letting information through selectively.

 A gate consists of a sigmoid neural network layer and a pointwise multiplication operation.
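
As a minimal sketch of such a gate, the code below applies a sigmoid layer to a control vector and multiplies the result pointwise with the vector being gated. The shapes, the random weights and the names `W_gate`, `b_gate`, `control` and `signal` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


rng = np.random.default_rng(0)
signal = rng.standard_normal(3)           # vector the gate acts on
control = rng.standard_normal(5)          # vector the gate looks at (assumed)
W_gate = rng.standard_normal((3, 5))      # hypothetical gate weights
b_gate = np.zeros(3)                      # hypothetical gate bias

gate = sigmoid(W_gate @ control + b_gate)  # sigmoid layer: entries between 0 and 1
passed = gate * signal                     # pointwise multiplication

print("gate  :", gate)
print("passed:", passed)
```

An entry of `gate` near 0 blocks the corresponding component of `signal`, and an entry near 1 lets it through almost unchanged.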

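Note: an LSTM uses two gates to control the contents of the cell state c:
        * the forget gate, which determines how much of the previous cell state $c_{t-1}$ is kept in the current cell state $c_t$ (how much is remembered);
        * the input gate, which determines how much of the current network input $x_t$ is saved into the cell state $c_t$.
                • An LSTM uses the output gate to control how much of the cell state $c_t$ is passed to the current LSTM output $h_t$.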

  LSTM step by step: the forget gate


Take a person as an example. Our minds hold a great deal of knowledge, but when it is not in use we do not think of it.

Suppose you are now given a question to test you. You will recall the knowledge related to it, and during this recall you will, without noticing, forget some things and remember others. Why are some things remembered while others are forgotten?

It is the new stimulus $x_t$ that makes the selection, or acts as a prompt, deciding what needs to be forgotten and what needs to be remembered. How much to remember and how much to forget is a matter of proportion, so a sigmoid (S-shaped) function is used to produce a value between 0 and 1 that weighs this choice.
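
The formula for this step is not shown in this section. In the commonly used formulation, taken here as an assumption, the forget gate is computed from the previous output $h_{t-1}$ and the current input $x_t$, where $W_f$ and $b_f$ are the forget gate's weight matrix and bias term and $\sigma$ is the sigmoid function. It is numbered to match formula 1 in the training framework list later in this document.

```latex
f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right) \tag{1}
```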

  LSTM step by step: the input gate

The sigmoid layer, called the input gate, decides which values to update;
the tanh layer creates a vector of new candidate values that will be added to the state;
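
Under the same commonly used formulation (an assumption, since the formulas are not shown here), the input gate $i_t$ and the candidate vector $\tilde{c}_t$ can be written as below. $W_i$, $b_i$, $W_c$ and $b_c$ denote the weight matrices and bias terms of the input gate and of the candidate computation; the numbering matches formulas 2 and 3 in the training framework list later in this document.

```latex
i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right) \tag{2}

\tilde{c}_t = \tanh\left( W_c \cdot [h_{t-1}, x_t] + b_c \right) \tag{3}
```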

  LSTM step by step: updating the cell state
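
No text survives under this heading. In the commonly used formulation (an assumption here), the new cell state combines the old cell state, scaled by the forget gate $f_t$, with the candidate vector $\tilde{c}_t$, scaled by the input gate $i_t$; $\circ$ denotes element-wise multiplication. The numbering matches formula 4 in the training framework list later in this document.

```latex
c_t = f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \tag{4}
```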


  LSTM step by step: the output gate

The output gate controls how much the long-term memory affects the current output; the output is determined jointly by the output gate and the cell state.
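
In the commonly used formulation (an assumption, since the formulas are not shown here), the output gate $o_t$ and the output $h_t$ are computed as below, with $W_o$ and $b_o$ the weight matrix and bias term of the output gate and $\circ$ the element-wise product. The numbering matches formulas 5 and 6 in the training framework list later in this document.

```latex
o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right) \tag{5}

h_t = o_t \circ \tanh(c_t) \tag{6}
```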

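(1) LSTM training algorithm framework

  • Forget gate: formula 1
  • Input gate: formulas 2 and 3, plus formula 4 (the update of the current cell state $c_t$)
  • Output gate: formulas 5 and 6


        ① Forward pass: compute the output value of every neuron. For an LSTM these are the values of the five vectors $f_t$, $i_t$, $c_t$, $o_t$ and $h_t$. How to compute them was described above.
        ② Backward pass: compute the error term of every neuron. As in a plain recurrent neural network, the LSTM error term is propagated backward in two directions: backward through time, computing the error term at each time step starting from the current time t; and down to the previous layer.
        ③ Compute the gradient of each weight from the corresponding error terms.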
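
The forward pass in step ① can be sketched as a single function that evaluates the gates, the cell state and the output for one time step. This is a minimal sketch assuming the common formulation in which every gate acts on the concatenation $[h_{t-1}, x_t]$; the helper names `lstm_step` and `init_params`, the sizes and the random initialization are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(input_size, hidden_size, rng):
    """One (W, b) pair per block: f (forget), i (input), c (candidate), o (output)."""
    shape = (hidden_size, hidden_size + input_size)
    return {name: (0.1 * rng.standard_normal(shape), np.zeros(hidden_size))
            for name in "fico"}


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of an LSTM cell; returns the new output and cell state."""
    v = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]
    f_t = sigmoid(W_f @ v + b_f)               # forget gate
    i_t = sigmoid(W_i @ v + b_i)               # input gate
    c_tilde = np.tanh(W_c @ v + b_c)           # candidate values
    c_t = f_t * c_prev + i_t * c_tilde         # cell state update
    o_t = sigmoid(W_o @ v + b_o)               # output gate
    h_t = o_t * np.tanh(c_t)                   # output
    return h_t, c_t


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
params = init_params(input_size, hidden_size, rng)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, params)
print("h_t:", h)
print("c_t:", c)
```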


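Let the activation function of the gates be the sigmoid function and the activation function of the output be the tanh function. Their derivatives are, respectively:

 The derivatives of both sigmoid and tanh are functions of the original function's value. So once the function value has been computed, it can be used to compute the derivative.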
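
Concretely, with $y = \sigma(z)$ the derivative is $y(1-y)$, and with $y = \tanh(z)$ it is $1 - y^2$. The sketch below computes both derivatives from the already computed function values and compares them with a finite-difference estimate; the grid of test points and the step size `eps` are arbitrary choices.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


z = np.linspace(-4.0, 4.0, 9)

# Function values from the forward pass.
y_sig = sigmoid(z)
y_tanh = np.tanh(z)

# Derivatives expressed through the function values only.
d_sig = y_sig * (1.0 - y_sig)
d_tanh = 1.0 - y_tanh ** 2

# Central finite differences as an independent check.
eps = 1e-5
num_sig = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
num_tanh = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)

print("max sigmoid mismatch:", np.max(np.abs(d_sig - num_sig)))
print("max tanh mismatch   :", np.max(np.abs(d_tanh - num_tanh)))
```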

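The parameters an LSTM has to learn are:
  • the weight matrix $W_f$ and bias term $b_f$ of the forget gate,
  • the weight matrix $W_i$ and bias term $b_i$ of the input gate,
  • the weight matrix $W_o$ and bias term $b_o$ of the output gate,
  • the weight matrix $W_c$ and bias term $b_c$ used to compute the cell state.
Because the two parts of each weight matrix use different formulas during backpropagation, each weight matrix is written as two separate matrices in the derivation that follows: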
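
The split can be illustrated numerically: a matrix acting on the concatenation $[h_{t-1}, x_t]$ gives the same result as its two column blocks acting on $h_{t-1}$ and $x_t$ separately. The block names `W_fh` and `W_fx` and the sizes are assumptions used for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, inputs = 4, 3

W_f = rng.standard_normal((hidden, hidden + inputs))
W_fh = W_f[:, :hidden]          # block multiplying h_{t-1}
W_fx = W_f[:, hidden:]          # block multiplying x_t

h_prev = rng.standard_normal(hidden)
x_t = rng.standard_normal(inputs)

joint = W_f @ np.concatenate([h_prev, x_t])
split = W_fh @ h_prev + W_fx @ x_t

print(np.allclose(joint, split))
```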


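When the operator $\circ$ is applied to two matrices, the elements at corresponding positions are multiplied. In some cases this element-wise product can simplify matrix and vector arithmetic. For example, when a diagonal matrix is multiplied by a matrix on its right, the result equals the element-wise product of that matrix with the vector formed from the diagonal of the diagonal matrix:

 When a row vector is multiplied on its right by a diagonal matrix, the result equals the element-wise product of the row vector with the vector formed from the diagonal of that matrix:

 Both of these facts are used often in the derivation that follows.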
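
Both identities are easy to confirm with NumPy. In the sketch below the vectors `a`, `b` and the matrix `X` are arbitrary examples; for the first identity the vector is broadcast along the rows, so that row i of `X` is scaled by `a[i]`.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
X = np.arange(12.0).reshape(3, 4)

# diag(a) @ X scales row i of X by a[i].
left_product = np.diag(a) @ X
elementwise_rows = a[:, None] * X

# a^T @ diag(b) equals the element-wise product of a and b.
row_product = a @ np.diag(b)
elementwise_vec = a * b

print(np.allclose(left_product, elementwise_rows))
print(np.allclose(row_product, elementwise_vec))
```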


The formula above propagates the error backward through time by one time step. With it, we can write down the formula that propagates the error term back to any time $k$:



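For the gradients of the weights applied to the previous output $h_{t-1}$, recall that the gradient is the sum of the gradients at each time step; we first compute the gradient at time t and then obtain the final gradient.
For the gradients of the bias terms $b_f$, $b_i$, $b_c$, $b_o$, the gradients at each time step are likewise added together. The bias gradients at each time step are given below:

For the gradients of the weights applied to the input $x_t$, it is enough to compute them directly from the corresponding error terms.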
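
The accumulation described above can be sketched for the output gate. The sketch assumes that the error terms $\delta_{o,t}$ have already been computed by the backward pass (random stand-ins are used here), and that the gradient with respect to a weight matrix at one time step is the outer product of the error term with the vector the matrix multiplies. The names `W_oh`, `W_ox` and all array names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
steps, hidden, inputs = 5, 3, 2

deltas = rng.standard_normal((steps, hidden))   # stand-ins for delta_{o,t}
h_prev = rng.standard_normal((steps, hidden))   # h_{t-1} for every time step
x_t = rng.standard_normal(inputs)               # input at the current time step

# Weights on h_{t-1}: per-step outer products, summed over all time steps.
grad_W_oh = sum(np.outer(deltas[t], h_prev[t]) for t in range(steps))

# Bias: per-step gradient is the error term itself, summed over time.
grad_b_o = deltas.sum(axis=0)

# Weights on x_t: computed directly from the error term at the current step.
grad_W_ox = np.outer(deltas[-1], x_t)

print(grad_W_oh.shape, grad_b_o.shape, grad_W_ox.shape)
```

The same pattern would apply to the forget gate, the input gate and the cell-state computation with their own error terms.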


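 3 Gated Recurrent Unit (GRU)

The GRU (Gated Recurrent Unit) is a kind of recurrent neural network (RNN). Like the LSTM, it was proposed to address problems such as long-term memory and the gradients in backpropagation.


The LSTM introduces three gate functions, the input gate, the forget gate and the output gate, to control the input value, the memory value and the output value. The GRU model has only two gates: the update gate and the reset gate. In addition, the GRU merges the cell state and the output into a single state h.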
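
A single GRU step with its two gates can be sketched as follows. The equations are not given in this section, so the code assumes one common convention, in which the update gate `z_t` weights the candidate state and `1 - z_t` weights the previous state; other sources swap these roles. The function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_step(x_t, h_prev, params):
    """One forward step of a GRU cell; returns the new state h_t."""
    W_z, b_z = params["z"]                     # update gate
    W_r, b_r = params["r"]                     # reset gate
    W_h, b_h = params["h"]                     # candidate state
    v = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ v + b_z)
    r_t = sigmoid(W_r @ v + b_r)
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return (1.0 - z_t) * h_prev + z_t * h_tilde


rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4
shape = (hidden_size, hidden_size + input_size)
params = {name: (0.1 * rng.standard_normal(shape), np.zeros(hidden_size))
          for name in "zrh"}

h = np.zeros(hidden_size)
for x_t in rng.standard_normal((5, input_size)):   # a sequence of 5 inputs
    h = gru_step(x_t, h, params)
print("h_t:", h)
```

There is no separate cell state: the single vector `h` plays the role of both memory and output.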



A GRU has fewer parameters, so it tends to train faster or need less data to generalize.
If you have enough data, the greater expressive power of the LSTM may produce better results.
Greff, et al. (2016) ran comparison experiments on popular LSTM variants and found that their performance was almost the same. Jozefowicz, et al. (2015) tested more than ten thousand RNN architectures and found that on some tasks certain variants worked better than the LSTM.
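
The difference in parameter count can be made concrete with a small calculation. It assumes that every gate or candidate block has one weight matrix acting on $[h_{t-1}, x_t]$ and one bias vector, which gives four blocks for an LSTM and three for a GRU; library implementations may add extra bias terms, so their counts can differ slightly. The sizes 128 and 256 are arbitrary.

```python
def block_params(input_size: int, hidden_size: int) -> int:
    """Parameters of one block: a weight matrix on [h, x] plus a bias vector."""
    return hidden_size * (hidden_size + input_size) + hidden_size


def lstm_params(input_size: int, hidden_size: int) -> int:
    # Forget gate, input gate, candidate state, output gate.
    return 4 * block_params(input_size, hidden_size)


def gru_params(input_size: int, hidden_size: int) -> int:
    # Update gate, reset gate, candidate state.
    return 3 * block_params(input_size, hidden_size)


n_lstm = lstm_params(128, 256)
n_gru = gru_params(128, 256)
print("LSTM:", n_lstm, "GRU:", n_gru, "ratio:", n_gru / n_lstm)
```

Under this assumption a GRU layer has three quarters of the parameters of an LSTM layer of the same size.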

4 Deep recurrent neural networks

A recurrent neural network can be regarded as both deep and shallow:
        * Deep: when the network is unrolled through time, the path between states that are far apart in time is very long;
        * Shallow: within a single time step, the path from the input to the output, $x_t \to y_t$, is very shallow.
Why increase the depth of a recurrent neural network:
        * to increase the capacity of the recurrent neural network;
        * to lengthen the path from input to output within a single time step, $x_t \to y_t$, for example by deepening the path from the hidden state to the output, $h_t \to y_t$, and the path from the input to the hidden state, $x_t \to h_t$.

  Stacked Recurrent Neural Network (SRNN)
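
No description survives under this heading. As a rough sketch, assuming the usual meaning of stacking in which the hidden state sequence of one recurrent layer serves as the input sequence of the next, two tanh RNN layers can be chained as follows. The helper `rnn_layer`, the sizes and the random weights are hypothetical.

```python
import numpy as np


def rnn_layer(inputs, W_x, W_h, b):
    """Run a tanh RNN over inputs of shape (steps, in_dim); return all hidden states."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return np.stack(outputs)


rng = np.random.default_rng(0)
steps, in_dim, hidden, layers = 6, 3, 4, 2
sequence = rng.standard_normal((steps, in_dim))

layer_input = sequence
for _ in range(layers):
    W_x = 0.1 * rng.standard_normal((hidden, layer_input.shape[1]))
    W_h = 0.1 * rng.standard_normal((hidden, hidden))
    b = np.zeros(hidden)
    # The states of this layer become the inputs of the next layer.
    layer_input = rnn_layer(layer_input, W_x, W_h, b)

top_states = layer_input
print(top_states.shape)
```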

  Bidirectional Recurrent Neural Network

A bidirectional recurrent neural network consists of two recurrent layers that receive the same input and differ only in the direction in which information is passed.
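
The two directions can be sketched by running one tanh RNN layer over the sequence as given and a second one over the reversed sequence. How the two state sequences are combined is not specified above; concatenating them per time step is an assumption, as are the helper `rnn_layer`, the sizes and the random weights.

```python
import numpy as np


def rnn_layer(inputs, W_x, W_h, b):
    """Run a tanh RNN over inputs of shape (steps, in_dim); return all hidden states."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        outputs.append(h)
    return np.stack(outputs)


def random_params(in_dim, hidden, rng):
    return (0.1 * rng.standard_normal((hidden, in_dim)),
            0.1 * rng.standard_normal((hidden, hidden)),
            np.zeros(hidden))


rng = np.random.default_rng(0)
steps, in_dim, hidden = 6, 3, 4
sequence = rng.standard_normal((steps, in_dim))

fwd_params = random_params(in_dim, hidden, rng)
bwd_params = random_params(in_dim, hidden, rng)

forward = rnn_layer(sequence, *fwd_params)               # reads t = 1 .. T
backward = rnn_layer(sequence[::-1], *bwd_params)[::-1]  # reads t = T .. 1, re-aligned
combined = np.concatenate([forward, backward], axis=1)

print(combined.shape)
```

After re-aligning the backward states, row t of `combined` holds information from both the past and the future of position t.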


