

Memory cells and gate units: basic ideas. LSTM's basic unit is called a memory cell. Within each memory cell, there is a linear unit with a fixed-weight self-connection (compare Mozer's time constants [12]). This enforces constant, non-exploding, non-vanishing error flow within the memory cell. A multiplicative input gate unit learns to protect the constant error flow within the memory cell from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit learns to protect other units from perturbation by currently irrelevant memory contents stored in the memory cell. The gates learn to open and close access to constant error flow. Why is constant error flow important? For instance, with conventional ``backprop through time'' (BPTT, e.g., [20]) or RTRL (e.g., [15]), error signals ``flowing backwards in time'' tend to vanish: the temporal evolution of the backpropagated error depends exponentially on the size of the weights. For the first theoretical error flow analysis see [7]. See [3] for a more recent, independent, essentially identical analysis.
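The exponential dependence can be illustrated with a minimal sketch: for a single self-recurrent unit with recurrent weight $w$ and (assumed constant) activation derivative $f'$, an error signal backpropagated through $T$ steps is scaled by $(|w| f')^T$. The function name below is illustrative, not from the source:

```python
import math

def backprop_error_magnitude(w, f_prime, T):
    """Magnitude scaling of an error signal backpropagated through T
    time steps of a single self-recurrent unit with weight w, assuming
    a constant activation derivative f_prime (a simplification for
    illustration)."""
    return (abs(w) * f_prime) ** T

# The logistic sigmoid's derivative is at most 0.25, so |w * f'| < 1
# unless |w| >= 4 -- the error either vanishes or explodes:
print(backprop_error_magnitude(3.0, 0.25, 50))  # vanishes: ~ 5.7e-7
print(backprop_error_magnitude(5.0, 0.25, 50))  # explodes: ~ 7.0e4
print(backprop_error_magnitude(4.0, 0.25, 50))  # constant: exactly 1.0
```

The last line is exactly the regime LSTM's linear unit with self-connection weight 1.0 and identity activation enforces by construction.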

LSTM details. In what follows, $w_{uv}$ denotes the weight on the connection from unit $v$ to unit $u$. $net_u(t)$ and $y^u(t)$ are the net input and activation of unit $u$ (with activation function $f_u$) at time $t$. For all non-input units that are not memory cells (e.g., output units), we have $y^{u}(t)=f_{u}(net_{u}(t))$, where $net_{u}(t)=\sum_v w_{uv} y^v(t-1)$. The $j$-th memory cell is denoted $c_j$. Each memory cell is built around a central linear unit with a fixed self-connection (weight 1.0) and the identity function as activation function (see the definition of $s_{c_j}$ below). In addition to $net_{c_j}(t)=\sum_u w_{c_ju} y^u(t-1)$, $c_j$ also gets input from a special unit $out_j$ (the ``output gate'') and from another special unit $in_j$ (the ``input gate''). $in_j$'s activation at time $t$ is denoted by $y^{in_{j}}(t)$; $out_j$'s activation at time $t$ is denoted by $y^{out_{j}}(t)$. $in_j$ and $out_j$ are viewed as ordinary hidden units. We have $y^{out_{j}}(t)=f_{out_j}(net_{out_j}(t))$ and $y^{in_{j}}(t) = f_{in_j}(net_{in_j}(t))$, where $net_{out_j}(t)=\sum_u w_{out_ju} y^u(t-1)$ and $net_{in_j}(t)=\sum_u w_{in_ju} y^u(t-1)$. The summation index $u$ may stand for input units, gate units, memory cells, or even conventional hidden units if there are any (see also the paragraph on ``network topology'' below). All these different types of units may convey useful information about the current state of the net. For instance, an input gate (output gate) may use inputs from other memory cells to decide whether to store (access) certain information in its memory cell. There may even be recurrent self-connections like $w_{c_jc_j}$. It is up to the user to define the network topology. At time $t$, $c_j$'s output $y^{c_{j}}(t)$ is computed in a sigma-pi-like fashion: $y^{c_j}(t) = y^{out_{j}}(t) h(s_{c_{j}}(t))$, where

$$ s_{c_{j}}(0) = 0, \qquad s_{c_{j}}(t) = s_{c_{j}}(t-1) + y^{in_{j}}(t) \, g\left( net_{c_{j}}(t) \right) \mbox{ for } t > 0. $$

The differentiable function $g$ scales $net_{c_{j}}$. The differentiable function $h$ scales memory cell outputs computed from the internal state $s_{c_{j}}$.
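The forward dynamics above can be sketched for a single memory cell. This is an illustrative sketch, not the reference implementation: the function and variable names are assumptions, the gates use the logistic sigmoid, and $g$ and $h$ are taken to be tanh here (one common choice; the original formulation used scaled sigmoids):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def cell_forward(inputs, s_prev, w_in, w_out, w_c):
    """One forward step of a single memory cell c_j.
    inputs: activations y^u(t-1) of all units feeding the cell.
    s_prev: previous internal state s_{c_j}(t-1).
    w_in, w_out, w_c: weight vectors of the input gate, the output
    gate, and the cell's own net input (hypothetical names).
    Returns (y^{c_j}(t), s_{c_j}(t))."""
    net_in  = sum(w * y for w, y in zip(w_in,  inputs))
    net_out = sum(w * y for w, y in zip(w_out, inputs))
    net_c   = sum(w * y for w, y in zip(w_c,   inputs))
    y_in  = sigmoid(net_in)    # y^{in_j}(t)  = f_{in_j}(net_{in_j}(t))
    y_out = sigmoid(net_out)   # y^{out_j}(t) = f_{out_j}(net_{out_j}(t))
    # Fixed weight-1.0 self-connection: the state is simply added to.
    s = s_prev + y_in * math.tanh(net_c)   # s_{c_j}(t)
    y_c = y_out * math.tanh(s)             # y^{c_j}(t) = y^{out_j}(t) h(s_{c_j}(t))
    return y_c, s
```

Note that when the input gate is closed ($y^{in_j}(t) \approx 0$), the internal state passes through unchanged, which is how the cell preserves information over long time lags.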

Why gate units? $in_{j}$ controls the error flow to memory cell $c_j$'s input connections $w_{c_{j}u}$. $out_{j}$ controls the error flow from $c_j$'s output connections. Error signals trapped within a memory cell cannot change, but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its memory cell, by appropriately scaling them. Likewise, the input gate will have to learn when to release errors. Gates open and close access to constant error flow.

Network topology. There is one input, one hidden, and one output layer. The fully self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as hidden units located in the hidden layer). The hidden layer may also contain ``conventional'' hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in higher layers.

Memory cell blocks. $S$ memory cells sharing one input gate and one output gate form a ``memory cell block of size $S$''. They can facilitate information storage.
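A block's forward step differs from a single cell's only in that one input-gate activation and one output-gate activation are shared by all $S$ cells. A minimal sketch, under the same illustrative assumptions as before (logistic gates, tanh for $g$ and $h$, hypothetical names):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def block_forward(inputs, s_prev, w_in, w_out, w_cells):
    """One forward step of a memory cell block of size S = len(w_cells).
    All S cells share the single input-gate activation y_in and the
    single output-gate activation y_out.
    s_prev: list of the S previous internal states.
    w_cells: one net-input weight vector per cell in the block."""
    net = lambda w: sum(wi * yi for wi, yi in zip(w, inputs))
    y_in, y_out = sigmoid(net(w_in)), sigmoid(net(w_out))   # shared gates
    s = [sp + y_in * math.tanh(net(w_c))                    # per-cell states
         for sp, w_c in zip(s_prev, w_cells)]
    y = [y_out * math.tanh(sj) for sj in s]                 # per-cell outputs
    return y, s
```

With $S > 1$, a single pair of gates guards several state variables at once, so the block can store a distributed (multi-dimensional) piece of information that would be hard to squeeze into one cell.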

Learning with excellent computational complexity -- see details in the appendix of [8]. We use a variant of RTRL which properly takes into account the altered (sigma-pi-like) dynamics caused by input and output gates. However, to ensure constant error backprop, as with truncated BPTT [20], errors arriving at ``memory cell net inputs'' (for cell $c_j$, this includes $net_{c_j}$, $net_{in_j}$, $net_{out_j}$) are not propagated back further in time (although they do serve to change the incoming weights). Only within memory cells are errors propagated back through previous internal states $s_{c_j}$. This enforces constant error flow within memory cells. Thus only the derivatives $\frac{\partial s_{c_{j}}}{\partial w_{il}}$ need to be stored and updated. Hence the algorithm is very efficient, and LSTM's update complexity per time step is excellent in comparison to other approaches such as RTRL: given $n$ units and a fixed number of output units, LSTM's update complexity per time step is at most $O(n^2)$, just like BPTT's.
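The stored derivatives can be accumulated online. For the cell's own input weights $w_{c_ju}$, differentiating the state update $s_{c_j}(t) = s_{c_j}(t-1) + y^{in_j}(t)\,g(net_{c_j}(t))$ (treating the gate activation as externally given, in the spirit of the truncation) gives $\frac{\partial s_{c_j}(t)}{\partial w_{c_ju}} = \frac{\partial s_{c_j}(t-1)}{\partial w_{c_ju}} + y^{in_j}(t)\, g'(net_{c_j}(t))\, y^u(t-1)$. A hedged sketch of that accumulation, again assuming $g = \tanh$ and hypothetical names:

```python
import math

def update_cell_partials(ds_dw, inputs, y_in, net_c):
    """Truncated-RTRL accumulation of the partials ds_{c_j}/dw_{c_j u}
    for a cell's input weights (illustrative sketch; g = tanh, so
    g' = 1 - tanh^2).
    ds_dw: list of partials from the previous time step, one per input u.
    inputs: the activations y^u(t-1).
    Because errors are not propagated further back in time, only these
    O(n) quantities per cell need to be stored and updated."""
    g_prime = 1.0 - math.tanh(net_c) ** 2
    return [d + y_in * g_prime * y_u for d, y_u in zip(ds_dw, inputs)]

# When an error e arrives at the internal state at time t, the weight
# change is local in time: delta w_{c_j u} = alpha * e * ds_dw[u].
```

Since the recursion only ever adds to the previous partial, no history of past activations must be kept, which is what makes the per-step cost so low.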

Juergen Schmidhuber 2003-02-25
