Experimental Conditions

In all our experiments we associate input data with itself, using a 3-layer autoassociator (AA) trained by flat minimum search (FMS). Unless stated otherwise we use 700,000 training exemplars, sigmoid hidden units (HUs), sigmoid output units, noninput units with an additional bias input, and normally distributed initial weights (bias hidden weights are initialized with -1.0). The HU activation functions (AFs) make sparseness better recognizable, but the output AFs are fairly arbitrary -- linear AFs, or AFs like those of the HUs, will do as well. Targets are scaled to a fixed interval, except for those of Task 2.2. Target scaling (1) prevents tiny first-order derivatives of output units (which may cause floating point overflows), and (2) allows for proving that the FMS algorithm makes the Hessian entries of output units decrease where the weight precisions increase (Hochreiter and Schmidhuber 1997a).

**Parameters and other details.**

- Learning rate: the conventional learning rate for the error term (just like backprop's).
- $\lambda$: a positive ``regularizer'' (hyperparameter) scaling the influence of the flatness term $B$. $\lambda$ is computed heuristically as described by Hochreiter and Schmidhuber (1997a).
- $\Delta\lambda$: a value used for updating $\lambda$ during learning. It represents the absolute change of $\lambda$ after each epoch.
- $E_{tol}$: the tolerable mean squared error (MSE) on the training set. It is used for dynamically computing $\lambda$, and for deciding when to switch phases in 2-phase learning.
- 2-phase learning speeds up the algorithm: phase 1 is conventional backprop; phase 2 is FMS. We start with phase 1 and switch to phase 2 once $E_a \leq E_{tol}$, where $E_a$ denotes the average epoch error. We switch back to phase 1 once $E_a > E_{tol}$. We finish in phase 2. The experimental sections indicate 2-phase learning by mentioning values of $E_{tol}$.
- Pruning of weights and units: a weight is judged pruned if, for each input, its required precision (see Hochreiter and Schmidhuber 1997a) is 100 times lower (corresponding to 2 decimal digits) than the highest precision among the other weights for the same input. A unit is considered pruned if all its incoming weights except the bias weight are pruned, or if all its outgoing weights are pruned.
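
The 2-phase schedule above can be sketched as follows (a minimal sketch, assuming hypothetical `backprop_epoch` and `fms_epoch` routines that each run one training epoch and return the average epoch error $E_a$; they stand in for the actual training code):

```python
def two_phase_training(backprop_epoch, fms_epoch, e_tol, max_epochs):
    """Alternate between phase 1 (plain backprop) and phase 2 (FMS).

    Phase 2 is entered once the average epoch error E_a drops to E_tol
    or below; we fall back to phase 1 whenever E_a rises above E_tol,
    and always finish in phase 2.
    """
    phase = 1
    e_a = float("inf")
    for _ in range(max_epochs):
        # decide the phase from the previous epoch's average error
        if phase == 1 and e_a <= e_tol:
            phase = 2
        elif phase == 2 and e_a > e_tol:
            phase = 1
        e_a = backprop_epoch() if phase == 1 else fms_epoch()
    if phase == 1:  # "we finish in phase 2"
        phase = 2
        e_a = fms_epoch()
    return phase, e_a
```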

**Comparison.**
In sections 4.3 and 4.4 we compare LOCOCODE to
simple variants of
``independent component analysis'' (ICA, e.g.,
Jutten and Herault 1991,
Cardoso and Souloumiac 1993,
Molgedey and Schuster 1994,
Comon 1994,
Bell and Sejnowski 1995, Amari et al. 1996,
Nadal and Parga 1997)
and ``principal component analysis'' (PCA, e.g., Oja 1989).
ICA is realized by Cardoso's (1993) JADE (Joint Approximate
Diagonalization of Eigen-matrices) algorithm
(we used the Matlab JADE version obtained via FTP from `sig.enst.fr`).
JADE is based on whitening and subsequent
joint diagonalization of 4th-order cumulant matrices.
For PCA and ICA, 1,000 training exemplars are used in the case of
the smaller input fields and 3,000 in the case of the larger ones.
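
For orientation, here is a minimal numpy sketch (not the JADE code obtained from `sig.enst.fr`) of the PCA baseline and of the whitening step that JADE applies before jointly diagonalizing the 4th-order cumulant matrices:

```python
import numpy as np

def pca(x, n_components):
    """x: (n_samples, n_inputs). Returns data projected on leading PCs."""
    xc = x - x.mean(axis=0)                    # center the data
    cov = xc.T @ xc / len(xc)
    eigval, eigvec = np.linalg.eigh(cov)       # eigenvalues in ascending order
    top = eigvec[:, ::-1][:, :n_components]    # leading principal directions
    return xc @ top

def whiten(x, n_components):
    """Project onto leading components and rescale to unit variance."""
    xc = x - x.mean(axis=0)
    cov = xc.T @ xc / len(xc)
    eigval, eigvec = np.linalg.eigh(cov)
    idx = np.argsort(eigval)[::-1][:n_components]
    w = eigvec[:, idx] / np.sqrt(eigval[idx])  # whitening matrix
    return xc @ w                              # whitened data: identity covariance
```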

**Information content.**
To measure the information conveyed by the various codes obtained
in sections 4.3 and 4.4 we train a
standard backprop net on the training set used for code generation.
Its inputs are the code components; its task is
to reconstruct the original input
(for all tasks except for ``noisy bars'' the original
input is scaled such that all input
components lie in a fixed interval).
The net has as many biased sigmoid hidden units
as it has biased sigmoid output units.
We train it for 5,000 epochs without caring for overfitting.
The training set consists
of 500 fixed exemplars in the case of the bars tasks and
of 5,000 in the case of the real-world
images. The test set consists of 500
off-training set exemplars (in the case of real world images we use
a separate test image). The average MSE on the test set
is used to determine the reconstruction error.
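
This measurement can be sketched as follows (plain batch backprop on a one-hidden-layer net with as many hidden as output units, as described above; the learning rate, initialization, and epoch counts here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_reconstructor(code, target, epochs=5000, lr=0.1, seed=0):
    """code: (n, n_code); target: (n, n_out). Returns trained weights."""
    rng = np.random.default_rng(seed)
    n_out = target.shape[1]                       # hidden size == output size
    w1 = rng.normal(0, 0.1, (code.shape[1], n_out)); b1 = np.zeros(n_out)
    w2 = rng.normal(0, 0.1, (n_out, n_out));         b2 = np.zeros(n_out)
    for _ in range(epochs):                       # plain batch backprop on MSE
        h = sigmoid(code @ w1 + b1)
        y = sigmoid(h @ w2 + b2)
        dy = (y - target) * y * (1 - y)           # output delta (sigmoid deriv.)
        dh = (dy @ w2.T) * h * (1 - h)            # hidden delta
        w2 -= lr * h.T @ dy / len(code);    b2 -= lr * dy.mean(axis=0)
        w1 -= lr * code.T @ dh / len(code); b1 -= lr * dh.mean(axis=0)
    return w1, b1, w2, b2

def reconstruction_mse(params, code, target):
    """Average MSE of the reconstruction, e.g. on the test set."""
    w1, b1, w2, b2 = params
    y = sigmoid(sigmoid(code @ w1 + b1) @ w2 + b2)
    return float(np.mean((y - target) ** 2))
```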

**Coding efficiency -- discrete codes.**
Coding efficiency
is measured by the average number of bits needed to code a test
set input pixel. The code components are scaled to a fixed interval
partitioned into 100 discrete subintervals -- this
results in 100 possible discrete values.
Assuming independence of the code components
we estimate the probability of
each discrete code value by Monte Carlo sampling on the training set.
To obtain the bits per pixel (Shannon's optimal value)
on the test set we divide the sum of the
negative logarithms of all discrete code component probabilities
(averaged over the test set)
by the number of input components.
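
Under the stated independence assumption, the estimate can be sketched as (a minimal sketch; the epsilon floor for unseen bins is an implementation assumption):

```python
import numpy as np

def bits_per_pixel(train_code, test_code, n_inputs, n_bins=100):
    """Estimate bits per input pixel from quantized code components."""
    lo = train_code.min(axis=0)
    hi = train_code.max(axis=0)
    scale = np.where(hi > lo, hi - lo, 1.0)

    def quantize(c):
        # scale each component to [0, 1] and map it to one of n_bins bins
        q = ((c - lo) / scale * n_bins).astype(int)
        return np.clip(q, 0, n_bins - 1)

    qtr, qte = quantize(train_code), quantize(test_code)
    total_bits = 0.0
    for j in range(train_code.shape[1]):          # independence assumption
        counts = np.bincount(qtr[:, j], minlength=n_bins)
        p = counts / counts.sum()                 # Monte Carlo estimate
        p = np.maximum(p, 1e-12)                  # avoid log(0) for unseen bins
        # negative log-probabilities of the test codes, averaged over the set
        total_bits += -np.log2(p[qte[:, j]]).sum() / len(test_code)
    return total_bits / n_inputs
```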
