Memory cells and gate units: basic ideas. LSTM's basic unit is called a memory cell. Within each memory cell, there is a linear unit with a fixed-weight self-connection (compare Mozer's time constants ). This enforces constant, non-exploding, non-vanishing error flow within the memory cell. A multiplicative input gate unit learns to protect the constant error flow within the memory cell from perturbation by irrelevant inputs. Likewise, a multiplicative output gate unit learns to protect other units from perturbation by currently irrelevant memory contents stored in the memory cell. The gates learn to open and close access to constant error flow. Why is constant error flow important? For instance, with conventional ``backprop through time'' (BPTT, e.g., ) or RTRL (e.g., ), error signals ``flowing backwards in time'' tend to vanish: the temporal evolution of the backpropagated error exponentially depends on the size of the weights. For the first theoretical error flow analysis see . See  for a more recent, independent, essentially identical analysis.
In what follows,
denotes the weight on the connection from unit to unit .
are net input and activation
of unit (with activation function )
at time .
For all non-input units that aren't memory cells (e.g. output units),
The -th memory cell is denoted .
Each memory cell is built around a central linear unit with
a fixed self-connection
(weight 1.0) and identity function as activation function
(see definition of below).
In addition to
also gets input from a special unit
(the ``output gate''),
and from another special unit
(the ``input gate'').
's activation at time is denoted by
's activation at time is denoted by
, are viewed as ordinary hidden units.
The summation indices may stand for
memory cells, or even
conventional hidden units if there are any
(see also paragraph on ``network topology'' below).
All these different types of
units may convey useful information about
the current state of the net.
For instance, an input gate (output gate)
may use inputs from other memory cells to
decide whether to store (access)
certain information in its memory cell.
There even may be recurrent self-connections like .
It is up to the user to define the network topology.
At time , 's output is computed
in a sigma-pi-like fashion:
Why gate units? controls the error flow to memory cell 's input connections . controls the error flow from unit 's output connections. Error signals trapped within a memory cell cannot change - but different error signals flowing into the cell (at different times) via its output gate may get superimposed. The output gate will have to learn which errors to trap in its memory cell, by appropriately scaling them. Likewise, the input gate will have to learn when to release errors. Gates open and close access to constant error flow.
Network topology. There is one input, one hidden, and one output layer. The fully self-connected hidden layer contains memory cells and corresponding gate units (for convenience, we refer to both memory cells and gate units as hidden units located in the hidden layer). The hidden layer may also contain ``conventional'' hidden units providing inputs to gate units and memory cells. All units (except for gate units) in all layers have directed connections (serve as inputs) to all units in higher layers.
Memory cell blocks. memory cells sharing one input gate and one output gate form a ``memory cell block of size ''. They can facilitate information storage.
Learning with excellent computational complexity -- see details in appendix of . We use a variant of RTRL which properly takes into account the altered (sigma-pi-like) dynamics caused by input and output gates. However, to ensure constant error backprop, like with truncated BPTT , errors arriving at ``memory cell net inputs'' (for cell , this includes , , ) do not get propagated back further in time (although they do serve to change the incoming weights). Only within memory cells, errors are propagated back through previous internal states . This enforces constant error flow within memory cells. Thus only the derivatives need to be stored and updated. Hence, the algorithm is very efficient, and LSTM's update complexity per time step is excellent in comparison to other approaches such as RTRL: given units and a fixed number of output units, LSTM's update complexity per time step is at most , just like BPTT's.