
RELATION TO PREVIOUS WORK

Becker and Hinton (1989) solve symmetric problems (like the one in example 2, see section 1) by maximizing the mutual information between the outputs of $T_1$ and $T_2$ (IMAX). This corresponds to the notion of finding input transformations that are mutually predictable yet informative.
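
For reference, the quantity maximized by IMAX is the standard mutual information between the two output variables (writing $y^1$ and $y^2$ for the outputs of $T_1$ and $T_2$ and suppressing the pattern index):
$I(y^1; y^2) = H(y^1) + H(y^2) - H(y^1, y^2),$
where $H(\cdot)$ denotes (joint) entropy. The variations described next differ in how these entropies are modeled and estimated.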

One variation of the IMAX approach assumes that $T_1$ and $T_2$ each have a single binary probabilistic output unit. In another variation, $T_1$ and $T_2$ each have a single real-valued output unit. The latter case, however, requires certain (not always realistic) Gaussian assumptions about the input and output signals (see also section 2.3 on Infomax).

In the case of vector-valued output representations, Zemel and Hinton (1991) again make simplifying Gaussian assumptions and maximize functions of the determinant $D$ of the $q \times q$ covariance matrix of the output activations ($DET$MAX) [Shannon, 1948] (see again section 2.3). $DET$MAX can remove only linear redundancy among the output units. (It should be mentioned, however, that with Zemel and Hinton's approach the outputs may be non-linear functions of the inputs.)
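
The connection between determinants and information rests on the well-known entropy formula for a Gaussian: under the Gaussian assumption, a $q$-dimensional output vector with covariance matrix $\Sigma$ has (differential) entropy
$H = \frac{1}{2} \ln \left( (2 \pi e)^q \det \Sigma \right),$
so maximizing functions of $\det \Sigma$ amounts to maximizing the assumed output entropy. Because $\Sigma$ captures only second-order statistics, only linear dependencies among the output units are penalized; higher-order (non-linear) redundancy is invisible to this measure.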

A nice property of IMAX is that it expresses the goal of finding mutually predictable yet informative input transformations in a principled way (in terms of a single objective function). In contrast, our approach involves two separate objective functions that have to be combined using a relative weight factor. An interesting feature of our approach is that it conceptually separates two issues: (A) the desire for discriminating mappings from input to representation, and (B) the desire for mutually predictable representations. There are many different approaches (each with its own advantages and disadvantages) for satisfying (A). In the context of a given problem, the most appropriate alternative can be `plugged into' our basic architecture.
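
Schematically (the symbols below are illustrative and not taken from earlier sections), the combined objective has the form
$E = E_{(A)} + \lambda E_{(B)},$
where $E_{(A)}$ is whatever error term enforces (A), $E_{(B)}$ enforces (B), and $\lambda$ is the relative weight factor mentioned above. IMAX, by contrast, folds both desiderata into a single mutual information term.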

Another difference between IMAX and our approach is that ours enforces not only mutual predictability but also equality of $y^{p,1}$ and $y^{p,2}$. This does not at all affect the generality of the approach. Note that one could introduce additional `predictor networks' - one for learning to predict $y^{p,2}$ from $y^{p,1}$ and another one for learning to predict $y^{p,1}$ from $y^{p,2}$. Then one could design error functions enforcing mutual predictability (instead of using the essentially equivalent error function $M$ used in this paper). However, this would not increase the power of the approach but would only introduce unnecessary additional complexity. In fact, one advantage of our simple approach is that it makes it trivial to decide whether the outputs of both networks essentially represent the same thing.
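
To make the comparison concrete (the expressions below are only schematic squared-error versions; the actual definition of $M$ appears earlier in the paper), enforcing equality directly corresponds to a term of the form
$\sum_{p} \parallel y^{p,1} - y^{p,2} \parallel^2,$
whereas the two-predictor alternative would train additional networks $P_{12}$ and $P_{21}$ (hypothetical names) to minimize
$\sum_{p} \parallel P_{12}(y^{p,1}) - y^{p,2} \parallel^2 + \sum_{p} \parallel P_{21}(y^{p,2}) - y^{p,1} \parallel^2.$
The former makes it immediately obvious whether both networks represent the same thing; the latter achieves mutual predictability only indirectly, through the extra networks.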

The following section includes an experiment that compares IMAX to our approach.

