   Next: APPLICATION: IMAGE PROCESSING Up: SEMILINEAR PREDICTABILITY MINIMIZATION PRODUCES Previous: INTRODUCTION

# PREDICTABILITY MINIMIZATION: DETAILS

In its simplest form, PM is based on a feedforward network with $n$ sigmoid output units (or code units). See Figure 1. The $i$-th code unit produces a real-valued output $y_i^p \in [0,1]$ (the unit interval) in response to the $p$-th external input vector $x^p$ (later we will see that training tends to make the output values near-binary). There are $n$ additional feedforward nets called predictors, each having one output unit and $n-1$ input units. The predictor for code unit $i$ is called $P_i$. Its real-valued output in response to the remaining code unit activations $y_k^p$ ($k \neq i$) is called $P_i^p$. $P_i$ is trained (in our experiments by conventional online backprop) to minimize

$$E_i = \sum_p (P_i^p - y_i^p)^2 , \qquad (1)$$

thus learning to approximate the conditional expectation of $y_i^p$, given the activations of the remaining code units. Of course, this conditional expectation will typically be very different from the actual activations of the code unit. For instance, assume that a certain code unit will be switched on in one third of all cases within a given context (defined by the activations of the remaining code units), while it will be switched off in the other two thirds of such cases. Then, given this context, the predictor will predict a value of 0.3333.
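This point can be checked numerically: the constant prediction that minimizes the summed squared error over a set of 0/1 activations is simply their sample mean. The sample size and random generator below are illustrative, not from the paper:

```python
import numpy as np

# Illustrative check: within a fixed context, a code unit is "on" (1) in
# one third of all cases and "off" (0) in the remaining two thirds.
rng = np.random.default_rng(0)
y = (rng.random(30000) < 1.0 / 3.0).astype(float)

# The constant prediction minimizing sum_p (p - y^p)^2 is the sample mean,
# i.e. the conditional expectation of the unit's activation in this context.
best_prediction = y.mean()
print(best_prediction)  # close to 1/3
```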

The crucial point is this: the code units are trained (in our experiments by online backprop) to maximize essentially the same objective function [Schmidhuber, 1992] that the predictors try to minimize:

$$V_C = \sum_i \sum_p (P_i^p - y_i^p)^2 . \qquad (2)$$

Predictors and code units co-evolve by fighting each other.
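This co-evolution can be sketched as an alternating gradient loop. The sketch below is a simplified, hypothetical implementation (names, sizes, and learning rates are illustrative): linear predictors, sigmoid code units, and each code unit ascending only its own predictor's error rather than the full objective (2):

```python
import numpy as np

# Minimal PM sketch: semilinear code units with sigmoid outputs, one linear
# predictor per code unit, trained by alternating gradient steps.
# Simplification: each code unit maximizes only its own predictor's error,
# ignoring its influence on the other units' predictors.
rng = np.random.default_rng(1)
X = rng.random((200, 4))                    # external input vectors x^p
n = 3                                       # number of code units
Wc = rng.normal(0.0, 0.1, (4, n))           # code-unit weights
Wp = rng.normal(0.0, 0.1, (n - 1, n))       # column i: weights of predictor P_i

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for step in range(500):
    Y = sigmoid(X @ Wc)                     # code outputs y_i^p
    for i in range(n):
        others = np.delete(Y, i, axis=1)    # activations of remaining units
        err = others @ Wp[:, i] - Y[:, i]   # P_i^p - y_i^p
        # predictor: gradient descent on sum_p (P_i^p - y_i^p)^2
        Wp[:, i] -= 0.1 * (others.T @ err) / len(X)
        # code unit i: gradient ascent on the same squared error
        dpre = -err * Y[:, i] * (1.0 - Y[:, i])  # ascent gradient wrt pre-activation
        Wc[:, i] += 0.1 * (X.T @ dpre) / len(X)
```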

Justification. Let us assume that the predictors never get trapped in local minima and always perfectly learn the conditional expectations. It then turns out that objective function (2) is essentially equivalent to the following one (also given in Schmidhuber, 1992):

$$\sum_i \left( \mathrm{VAR}(y_i) - E\left[ (P_i - \bar{y}_i)^2 \right] \right) , \qquad (3)$$

where $\bar{y}_i$ denotes the mean activation of unit $i$, and VAR denotes the variance operator. The equivalence of (2) and (3) was observed by Peter Dayan, Richard Zemel and Alex Pouget (personal communication, SALK Institute, 1992 -- see [Schmidhuber, 1993] for details). (3) gives some intuition about what is going on while (2) is maximized. Maximizing the first term of (3) tends to enforce binary units, and also local maximization of information throughput (given the binary constraint). Maximizing the second (negative) term (or minimizing the corresponding unsigned term) tends to make the conditional expectations equal to the unconditional expectations, thus encouraging mutual statistical independence (zero mutual information) and global maximization of information throughput.
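The step from (2) to (3) is the law of total variance: when the predictor equals the conditional expectation, the expected squared prediction error equals the unit's variance minus the variance of that conditional expectation. A small numerical sanity check (the binary context variable and the probabilities 0.8/0.2 below are made up for illustration):

```python
import numpy as np

# With a perfect predictor P = E[y | context], the squared error satisfies
#   E[(P - y)^2] = VAR(y) - E[(P - mean(y))^2]   (law of total variance).
rng = np.random.default_rng(2)
context = rng.integers(0, 2, 100000)        # binary context variable
p_on = np.where(context == 1, 0.8, 0.2)     # P(y = 1 | context)
y = (rng.random(100000) < p_on).astype(float)

P = p_on                                    # perfect conditional expectation
lhs = np.mean((P - y) ** 2)                 # expected squared prediction error
rhs = np.var(y) - np.mean((P - y.mean()) ** 2)
print(abs(lhs - rhs))  # close to 0 (sampling noise only)
```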
Juergen Schmidhuber 2003-02-17
