In all our experiments we associate input data with itself, using an FMS-trained 3-layer autoassociator (AA). Unless stated otherwise we use 700,000 training exemplars, sigmoid hidden units (HUs) with activation function (AF) , sigmoid output units with AF , noninput units with an additional bias input, normal weights initialized in , bias hidden weights with -1.0, with 0.5. The HU AFs make sparseness easier to recognize, but the output AFs are fairly arbitrary -- linear AFs or those of the HUs will do as well. Targets are scaled to , except for Task 2.2. Target scaling (1) prevents tiny first-order derivatives of output units (which may cause floating-point overflows), and (2) allows for proving that the FMS algorithm makes the Hessian entries of output units decrease where the weight precisions or increase (Hochreiter and Schmidhuber 1997a).
Parameters and other details.
Comparison. In sections 4.3 and 4.4 we compare LOCOCODE to simple variants of ``independent component analysis'' (ICA, e.g., Jutten and Herault 1991, Cardoso and Souloumiac 1993, Molgedey and Schuster 1994, Comon 1994, Bell and Sejnowski 1995, Amari et al. 1996, Nadal and Parga 1997) and ``principal component analysis'' (PCA, e.g., Oja 1989). ICA is implemented with Cardoso's (1993) JADE (Joint Approximate Diagonalization of Eigen-matrices) algorithm (we used the Matlab JADE version obtained via FTP from sig.enst.fr). JADE first whitens the data and then jointly diagonalizes 4th-order cumulant matrices. For PCA and ICA, 1,000 (3,000) training exemplars are used in the case of () input fields.
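The whitening step that precedes JADE's joint diagonalization can be illustrated with a minimal NumPy sketch. This is not the Matlab JADE code used in the experiments, and it does not perform the cumulant diagonalization; the function name `whiten`, the mixing matrix `A`, and the synthetic sources `S` are hypothetical and chosen only for illustration.

```python
import numpy as np

def whiten(X):
    """Center X (rows = exemplars) and rescale it to identity covariance.

    Returns the whitened data Z and the whitening matrix W, with Z = Xc @ W.
    """
    Xc = X - X.mean(axis=0)                  # remove the mean of each component
    cov = Xc.T @ Xc / Xc.shape[0]            # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov = E diag(eigvals) E^T
    W = eigvecs / np.sqrt(eigvals)           # W = E diag(eigvals)^(-1/2)
    return Xc @ W, W

# Illustrative data: three uniform sources mixed linearly (assumed, not from the paper).
rng = np.random.default_rng(0)
S = rng.uniform(-1.0, 1.0, size=(1000, 3))
A = np.array([[1.0, 0.5, 0.2],
              [0.3, 1.0, 0.4],
              [0.1, 0.2, 1.0]])
X = S @ A.T

Z, W = whiten(X)
cov_Z = Z.T @ Z / Z.shape[0]                 # should be close to the identity
```

After whitening, the components are uncorrelated with unit variance, so the remaining ICA step only has to find a rotation; JADE determines that rotation from the 4th-order cumulants.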
Information content. To measure the information conveyed by the various codes obtained in sections 4.3 and 4.4 we train a standard backprop net on the training set used for code generation. Its inputs are the code components; its task is to reconstruct the original input (for all tasks except for ``noisy bars'' the original input is scaled such that all input components are in ). The net has as many biased sigmoid hidden units with activation function (AF) as there are biased sigmoid output units with AF . We train it for 5,000 epochs without regard to overfitting. The training set consists of 500 fixed exemplars in the case of input fields (bars) and of 5,000 in the case of input fields (real world images). The test set consists of 500 off-training-set exemplars (in the case of real world images we use a separate test image). The reconstruction error is the average MSE on the test set.
Coding efficiency -- discrete codes. Coding efficiency is measured by the average number of bits needed to code a test set input pixel. The code components are scaled to the interval partitioned into 100 discrete intervals -- this results in 100 possible discrete values. Assuming independence of the code components we estimate the probability of each discrete code value by Monte Carlo sampling on the training set. To obtain the bits per pixel (Shannon's optimal value) on the test set we divide the sum of the negative logarithms of all discrete code component probabilities (averaged over the test set) by the number of input components.
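The bits-per-pixel measure described above can be sketched in plain Python as follows. Several details are assumptions made for illustration and are not taken from the text: each component is scaled using its minimum and maximum on the training set, logarithms are base 2, and add-one smoothing is applied so that a bin never seen in training does not produce an infinite code length. The function names `discretize` and `bits_per_pixel` are hypothetical.

```python
import math
import random
from collections import Counter

def discretize(value, lo, hi, n_bins=100):
    """Map a value from [lo, hi] to a bin index in 0..n_bins-1 (clipped)."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(max(idx, 0), n_bins - 1)

def bits_per_pixel(train_codes, test_codes, n_inputs, n_bins=100):
    """Average number of bits per input component needed to code the test set."""
    n_comp = len(train_codes[0])
    n_train = len(train_codes)
    # Assumption: per-component scaling range taken from the training set.
    los = [min(c[j] for c in train_codes) for j in range(n_comp)]
    his = [max(c[j] for c in train_codes) for j in range(n_comp)]
    # Relative frequencies of discrete values, estimated per component
    # (components are treated as independent).
    counts = [Counter(discretize(c[j], los[j], his[j], n_bins) for c in train_codes)
              for j in range(n_comp)]
    total_bits = 0.0
    for code in test_codes:
        for j in range(n_comp):
            b = discretize(code[j], los[j], his[j], n_bins)
            # Assumption: add-one smoothing to avoid log(0) for unseen bins.
            p = (counts[j][b] + 1) / (n_train + n_bins)
            total_bits -= math.log2(p)
    bits_per_code = total_bits / len(test_codes)   # average over the test set
    return bits_per_code / n_inputs                # normalize by input size

# Illustrative usage with random 4-component codes for 25-component inputs.
random.seed(0)
train = [[random.random() for _ in range(4)] for _ in range(500)]
test = [[random.random() for _ in range(4)] for _ in range(500)]
example_bpp = bits_per_pixel(train, test, n_inputs=25)
```

Sparse codes concentrate most components on a few discrete values, which raises the estimated probabilities of those values and therefore lowers the bits-per-pixel figure.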