PREDICTABILITY MINIMIZATION

In its simplest form, predictability minimization is based on a feedforward network with $m$ input units and $n$ output units (or code units). The $i$-th code unit produces an output value $y^p_i \in [0, 1]$ in response to the current external input vector $x^p$. The central idea is: for each code unit there is an adaptive predictor network that tries to predict the code unit from the remaining $n-1$ code units, but each code unit in turn tries to become as unpredictable as possible. The only way it can do so is by representing environmental properties that are statistically independent of the environmental properties represented by the remaining code units. The principle of predictability minimization was first described in [22].
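
To make the architecture concrete, here is a minimal PyTorch sketch (not the original implementation of [22]; the use of PyTorch, the hidden layer sizes, and the particular values of $m$ and $n$ are illustrative assumptions). It sets up a feedforward code network with $m$ input units and $n$ sigmoid code units, plus one predictor network $P_i$ per code unit that sees only the other $n-1$ code activations.

\begin{verbatim}
import torch
import torch.nn as nn

m, n = 16, 4  # illustrative numbers of input units and code units

# Feedforward code network: maps an input pattern x^p to code
# activations y^p_1, ..., y^p_n in [0, 1].
code_net = nn.Sequential(
    nn.Linear(m, 32), nn.Sigmoid(),   # assumed hidden layer
    nn.Linear(32, n), nn.Sigmoid(),   # sigmoid code units
)

# One predictor P_i per code unit: input = the other n-1 code
# activations, output = a prediction of y^p_i.
predictors = nn.ModuleList([
    nn.Sequential(nn.Linear(n - 1, 16), nn.Sigmoid(),
                  nn.Linear(16, 1), nn.Sigmoid())
    for _ in range(n)
])
\end{verbatim}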

The predictor network for code unit $i$ is called $P_i$. Its output in response to the set $\{ y_k^p ,~~ k \neq i \}$ is called $P^p_i$. $P_i$ is trained to minimize

\begin{displaymath}
\sum_p (P_i^p - y_i^p)^2,
\end{displaymath} (2)

thus learning to predict the conditional expectation $E(y_i \mid \{y_k, k \neq i \})$ of $y_i$, given the set $\{ y_k,~~ k \neq i \}$. But the code units try to maximize the same (!) objective function the predictors try to minimize:
\begin{displaymath}
V_C = \sum_{i,p}(P_i^p - y_i^p)^2.
\end{displaymath} (3)

Predictors and code units co-evolve by fighting each other; see [22] and especially [24] for details.
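
Continuing the sketch above, the co-evolution can be approximated by alternating gradient steps: the predictors take a step to minimize (2), then the code network takes a step to maximize the same quantity (3), treating each prediction $P^p_i$ as a constant. The data source \verb|data_loader|, the plain SGD optimizers, and the strictly alternating schedule are assumptions of this sketch, not the training procedure of [22,24].

\begin{verbatim}
opt_code = torch.optim.SGD(code_net.parameters(), lr=0.1)
opt_pred = torch.optim.SGD(predictors.parameters(), lr=0.1)

def others(y, i):
    # All code activations except that of unit i.
    return torch.cat([y[:, :i], y[:, i + 1:]], dim=1)

for x in data_loader:                 # x: a batch of input patterns x^p
    # Predictor phase: minimize (2), code held fixed.
    y = code_net(x).detach()
    pred_loss = sum(((predictors[i](others(y, i)) - y[:, i:i+1]) ** 2).sum()
                    for i in range(n))
    opt_pred.zero_grad()
    pred_loss.backward()
    opt_pred.step()

    # Code phase: maximize V_C of (3), predictions treated as constants.
    y = code_net(x)
    v_c = sum(((predictors[i](others(y, i)).detach() - y[:, i:i+1]) ** 2).sum()
              for i in range(n))
    opt_code.zero_grad()
    (-v_c).backward()                 # gradient ascent on V_C
    opt_code.step()
\end{verbatim}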

Let us assume that the $P_i$ do not get trapped in local minima and perfectly learn the conditional expectations. It then turns out that the objective function $V_C$ (first given in [22]) is essentially equivalent to the following one (also given in [22]):

\begin{displaymath}
\sum_i VAR(y_i) -\sum_{i,p} (P^p_i - \bar{y_i})^2,
\end{displaymath} (4)

where $\bar{y_i}$ denotes the mean activation of unit $i$, and $VAR$ denotes the variance operator. The equivalence of (3) and (4) was observed by Peter Dayan, Richard Zemel and Alex Pouget (Salk Institute, 1992). See [24] for details. Equation (4) gives some intuition about what is going on while (3) is being maximized: the first term tends to enforce binary units, while the second term tends to make the conditional expectations equal to the unconditional expectations, thus encouraging statistical independence.
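
One way to see the equivalence (a sketch, assuming the sums over patterns $p$ approximate expectations $E$, and that $P_i = E(y_i \mid \{y_k, k \neq i\})$ exactly): write $y_i - P_i = (y_i - \bar{y_i}) - (P_i - \bar{y_i})$ and note that the cross term satisfies $E[(y_i - \bar{y_i})(P_i - \bar{y_i})] = E[(P_i - \bar{y_i})^2]$, because $P_i$ is the conditional expectation of $y_i$ given the other code units. Expanding the square then yields

\begin{displaymath}
E[(P_i - y_i)^2] = E[(y_i - \bar{y_i})^2] - E[(P_i - \bar{y_i})^2]
= VAR(y_i) - E[(P_i - \bar{y_i})^2].
\end{displaymath}

Summing over $i$ (and replacing expectations by sums over $p$) turns (3) into (4).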

Note that unlike many previous approaches to unsupervised learning, the system is not limited to local ``winner-take-all'' representations. Instead, there may be distributed code representations based on many code units that are active simultaneously, as long as the ``winners'' stand for independent abstract features extracted from the input data. And unlike previous approaches, the method allows for discovering nonlinear predictability and for nonlinear pattern transformations to obtain codes with statistically independent components. Note that statistical independence implies decorrelation, but decorrelation does not imply statistical independence.
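
As a toy numerical illustration of the last point (not from the paper; it assumes NumPy), a variable and its square can be perfectly decorrelated yet completely dependent:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
y1 = rng.choice([-1.0, 0.0, 1.0], size=100_000)  # zero-mean, symmetric
y2 = y1 ** 2                                     # fully determined by y1

# The correlation is (approximately) zero, yet y2 is a deterministic
# function of y1, so the two are far from statistically independent.
print(np.corrcoef(y1, y2)[0, 1])
\end{verbatim}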

