

I will keep the architecture and the objective function from section 1 but I will modify the system dynamics. Recall that unquantized variables are assumed to take on their maximal range. For our single training sequence with $n_{time}$ discrete time steps, the system dynamics (explanation follows below) are defined by

\begin{displaymath}
x_k(t) \leftarrow environment,~~
y_k(t) = f_{y_k}(net_{y_k}(t)),~~
net_{y_k}(t+1) = \sum_l w_{y_k l}(t) \, l(t),
\end{displaymath} (3)

\begin{displaymath}
w_{ij}(1) \leftarrow initialization,~~
w_{ij}(t+1) =
\sigma_{ij} \left[ w_{ij}(t) + g(j(t)) \, h(i(t+1)) \right],
\end{displaymath} (4)

where $\sigma_{ij}$ is a differentiable function (e.g. for limiting the weight $w_{ij}$ to a given interval), and $g$ and $h$ are differentiable monotonic functions (the `threshold approximators', to be explained below).
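One time step of equations (3) and (4) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this paper: the logistic choice for $f$, the scaled $\tanh$ used for $\sigma_{ij}$, the odd-power form of $g$ and $h$, and all names (`step`, `W`, `w_max`, `power`) are hypothetical. The sketch reads $l(t)$ as the activation of unit $l$ at time $t$ and allows weights only on connections leading to non-input units.

```python
import math


def f(net):
    """Assumed activation function: logistic, keeps every y_k(t) in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))


def g(a, power=9):
    """Hypothetical threshold approximator: -1 at a=0, +1 at a=1, near 0 in between."""
    return (2.0 * a - 1.0) ** power


h = g  # assumption: the post-synaptic function has the same shape as g


def sigma(w, w_max=5.0):
    """Assumed differentiable squashing that keeps a weight inside (-w_max, w_max)."""
    return w_max * math.tanh(w / w_max)


def step(x_t, y_t, W):
    """Apply equations (3) and (4) once.

    x_t     : input activations x_k(t)
    y_t     : non-input activations y_k(t)
    W[k][l] : weight w_{y_k l}(t) from unit l (inputs first, then non-inputs)
              to non-input unit y_k
    Returns (y(t+1), W(t+1)).
    """
    units_t = list(x_t) + list(y_t)
    # Equation (3): net_{y_k}(t+1) = sum_l w_{y_k l}(t) l(t), then y = f(net).
    net_next = [sum(w * a for w, a in zip(row, units_t)) for row in W]
    y_next = [f(n) for n in net_next]
    # Equation (4): w_ij(t+1) = sigma[w_ij(t) + g(j(t)) h(i(t+1))].
    W_next = [
        [sigma(w + g(a_pre) * h(a_post)) for w, a_pre in zip(row, units_t)]
        for row, a_post in zip(W, y_next)
    ]
    return y_next, W_next


# One input unit, one non-input unit, a single time step.
y1, W1 = step([1.0], [0.5], [[4.0, 0.0]])
print(y1, W1)
```

Note that with this particular $\sigma_{ij}$ every weight is squashed at every step, even when $g(j(t))h(i(t+1))$ is zero; a different differentiable limiter could be substituted without changing the structure of the update.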

Equation (3) is just the conventional recurrent net update rule (1). Unlike with conventional recurrent nets, however, the weights do not remain constant during sequence processing: equation (4) says that connections between units active at successive time steps are immediately strengthened or weakened essentially in proportion to pre-synaptic and post-synaptic activity. These intra-sequence weight changes are modulated by the non-linear functions $g$ and $h$ and may be negative (anti-Hebb-like) or zero as well as positive.

Let us assume that all input vectors and all $f_i$ are such that all units can take on only activations between 0 and 1. $g$ and $h$ are meant to specify the upper and lower thresholds that determine how strongly units have to be excited or inhibited to contribute to intra-sequence weight changes. A reasonable choice is one where $g$ and $h$ are strongly negative only if their argument is close to 0 and strongly positive only if their argument is close to 1. Both $g$ and $h$ should return values close to 0 for arguments from most of the interval between 0 and 1. This implies hardly any intra-sequence weight changes for connections between units that have non-extreme activations during successive time steps.
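To make the role of the thresholds concrete, the following sketch tabulates the raw weight increment $g(j(t))h(i(t+1))$ for a few activation pairs. The odd-power form $(2a-1)^9$ is only one hypothetical function with the properties described above (differentiable, monotonic, near zero on most of the interval, strongly negative near 0 and strongly positive near 1); the text does not prescribe it, and the names `threshold_approximator` and `weight_increment` are illustrative.

```python
def threshold_approximator(a, power=9):
    """Hypothetical g (and h): odd power of (2a - 1), so it is monotonic on [0, 1]."""
    return (2.0 * a - 1.0) ** power


def weight_increment(pre, post):
    """Raw increment g(j(t)) * h(i(t+1)) from equation (4), before sigma is applied."""
    return threshold_approximator(pre) * threshold_approximator(post)


# pre = activation of unit j at time t, post = activation of unit i at time t+1
pairs = [(1.0, 1.0), (1.0, 0.0), (0.0, 0.0), (0.9, 0.9), (0.6, 0.9), (0.5, 1.0)]
for pre, post in pairs:
    print(f"pre = {pre:.1f}  post = {post:.1f}  increment = {weight_increment(pre, post):+.6f}")
```

With this choice, only pairs in which both activations are close to 0 or 1 yield a noticeable increment, and the sign is negative when one unit is strongly excited while the other is strongly inhibited.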

The overall effect is that only connections between units that are exceptionally active or exceptionally inactive during successive time steps can be significantly modified. Intra-sequence weight changes essentially occur only if the network `pays a lot of attention' to certain units by strongly exciting them or strongly inhibiting them. Weights to units that are not `illuminated by adaptive internal spotlights of attention' essentially remain invariant and participate only in `automatic processing' as opposed to `active intra-sequence learning'. The remainder of this paper derives an exact gradient-based algorithm designed to adjust the system (via inter-sequence weight changes) such that it creates appropriate intra-sequence weight changes at appropriate time steps.

Juergen Schmidhuber 2003-02-21
