
PREDICTABILITY MINIMIZATION AND TIME

Let us now consider the case of input sequences. This section describes an entirely local method designed to find unambiguous, non-redundant, reduced sequence descriptions.

The initial state vector $y^p(0)$ is the same for all sequences $p$. The input at time $t>0$ of sequence $p$ is the concatenation $x^p(t) \circ y^p(t-1)$ of the input $x^p(t)$ and the last internal state $y^p(t-1)$. The output is $y^p(t)$ itself.
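For concreteness, the following Python sketch shows one such time step. It assumes a single fully connected sigmoid layer; the weight matrix W and bias b are illustrative and not prescribed by the text.

\begin{verbatim}
import numpy as np

def module_step(x_t, y_prev, W, b):
    """One time step of a recurrent representation module (illustrative only).

    The module receives the concatenation x^p(t) o y^p(t-1) of the current
    external input and the previous internal state, and emits the new
    internal state y^p(t).  W and b are hypothetical weights of a single
    fully connected sigmoid layer.
    """
    z = np.concatenate([x_t, y_prev])            # x^p(t) o y^p(t-1)
    y_t = 1.0 / (1.0 + np.exp(-(W @ z + b)))     # new internal state y^p(t)
    return y_t
\end{verbatim}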

We minimize and maximize essentially the same objective functions as described above. That is, for the $i$-th module, which now needs recurrent connections to itself and to the other modules, there is again an adaptive predictor $P_i$, which need not be recurrent. $P_i$'s input at time $t$ is the concatenation of the outputs $y^p_k(t)$ of all units $k \neq i$. $P_i$'s one-dimensional output $P^p_i(t)$ is trained to equal the expectation of the output $y_i$, given the outputs of the other units, $E(y_i(t) \mid \{y_k(t), k \neq i \})$, by defining $P_i$'s error function as

\begin{displaymath}
\frac{1}{2}
\sum_p \sum_t (P^p_i(t) - y^p_i(t))^2.
\end{displaymath}
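The sketch below merely spells out this sum in Python; the arrays P and Y, holding the predictor outputs $P^p_i(t)$ and the unit outputs $y^p_i(t)$, are illustrative names.

\begin{verbatim}
import numpy as np

def predictor_error(P, Y):
    """0.5 * sum over sequences p and steps t of (P_i^p(t) - y_i^p(t))^2.

    P[p, t] holds the predictor output P_i^p(t) and Y[p, t] the unit's
    actual output y_i^p(t); both are hypothetical arrays of shape
    (num_sequences, num_steps).  Minimizing this trains P_i towards
    E(y_i(t) | {y_k(t), k != i}).
    """
    P = np.asarray(P, dtype=float)
    Y = np.asarray(Y, dtype=float)
    return 0.5 * np.sum((P - Y) ** 2)
\end{verbatim}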

In addition, all units are trained to take on values that maximize

\begin{displaymath}
\bar{E} = \sum_t T(t),
\end{displaymath}

where $T(t)$ is defined analogously to the respective stationary cases.
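Since $T(t)$ is specified only by analogy, the sketch below assumes one of the stationary objectives, namely $T(t) = \sum_i (P^p_i(t) - y^p_i(t))^2$; the function and argument names are illustrative.

\begin{verbatim}
import numpy as np

def T(P_t, y_t):
    """Assumed instantiation of T(t): sum over modules i of
    (P_i(t) - y_i(t))^2, i.e. the summed squared mismatch between each
    predictor and the corresponding unit at time t."""
    return float(np.sum((np.asarray(P_t) - np.asarray(y_t)) ** 2))

def E_bar(P_seq, y_seq):
    """E-bar = sum over t of T(t) for one sequence.  The representation
    modules are trained to maximize this while each predictor minimizes
    its own error."""
    return sum(T(P_t, y_t) for P_t, y_t in zip(P_seq, y_seq))
\end{verbatim}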

The only way a unit can protect itself from being predictable from the other units is to store properties of the input sequences that are independent of aspects stored by the other units. In other words, this method will tend to throw away redundant temporal information, much as the systems in (Schmidhuber, 1992a) and (Schmidhuber, 1992b). For computing weight changes, each module looks back only to the last time step. In the on-line case, this implies an entirely local learning algorithm (a sketch of such an update follows the induction argument below). Still, even when there are long time lags, the algorithm may theoretically learn unique representations of extended sequences, as can be seen by induction over the length of the longest training sequence:

1. $y$ can learn unique representations of the beginnings of all sequences.

2. Suppose all sequences and sub-sequences with length $<k$ are uniquely represented in $y$. Then, by looking back only one time step at a time, $y$ can learn unique representations of all sub-sequences with length $k$.

The argument neglects all on-line effects and possible cross-talk.
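The following Python sketch illustrates the one-step lookback in an on-line update. Several details are assumptions not fixed by the text: each module is reduced to a single sigmoid unit with weight row W[i] over the concatenated input, each predictor $P_i$ is linear with weight row V[i], and only the direct term of the gradient of $T(t)$ with respect to $y_i(t)$ is used.

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def local_online_step(x_t, y_prev, W, V, lr_unit=0.05, lr_pred=0.1):
    """One entirely local on-line update with a one-step lookback (sketch).

    Assumptions not in the text: module i is a single sigmoid unit with
    weights W[i] over concat(x_t, y_prev); predictor P_i is linear with
    weights V[i] over the other units' outputs; only the direct term of
    the gradient of T(t) w.r.t. y_i(t) is used.  y_prev is treated as a
    constant, so no gradient flows further back in time.
    """
    n = W.shape[0]
    z = np.concatenate([x_t, y_prev])        # x^p(t) o y^p(t-1)
    y = sigmoid(W @ z)                       # outputs y_i(t) of all modules

    for i in range(n):
        others = np.delete(y, i)             # y_k(t), k != i
        p_i = float(V[i] @ others)           # predictor output P_i(t)
        err = p_i - y[i]
        # Predictor i: gradient descent on 0.5 * err^2 (target y_i fixed).
        V[i] -= lr_pred * err * others
        # Module i: gradient ascent on (P_i(t) - y_i(t))^2, predictor fixed.
        W[i] += lr_unit * 2.0 * (-err) * y[i] * (1.0 - y[i]) * z

    return y                                 # serves as y^p(t-1) at t+1
\end{verbatim}

The state returned at time $t$ is simply fed back as the previous state at time $t+1$, where it is treated as a constant; in this sketch that feedback is the only temporal credit assignment performed.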

On-line variants of the system described above were implemented by Daniel Prelinger. Preliminary experiments with the resulting recurrent systems demonstrated that there are entirely local sequence-learning methods that can learn unique representations of all subsequences of non-trivial sequences (such as a sequence of 8 consecutive presentations of the same input pattern, represented by the activation of a single input unit). The best results were obtained with additional modifications (such as error functions other than mean squared error for the representational modules). A future paper will elaborate on sequence learning by predictability minimization.

