Task 2.1 (bars). Adapted from Dayan and Zemel (1995); see also Földiák (1990), Zemel (1993), and Saund (1995); our variant is more difficult (compare M. Baumgartner's 1996 diploma thesis). The input is a pixel grid with horizontal and vertical bars at random, independent positions. See Figure 1 for an example.
[Figure 1: example input pattern with random horizontal and vertical bars (bars0.eps)]

``Although it might seem like a toy problem, the bar task with only 10 hidden units turns out to be quite hard for all the algorithms we discuss. The coding cost of making an error in one bar goes up linearly with the size of the grid, so at least one aspect of the problem gets easier with large grids'' (Dayan and Zemel 1995). We will see that even difficult variants of this task are not hard for LOCOCODE.
Training and testing. Each of the 10 possible bars appears with probability 1/5. In contrast to Dayan and Zemel's (1995) setup, we allow for bar type mixing. This makes the task harder (Dayan and Zemel 1995, p. 570). To test LOCOCODE's ability to reduce redundancy, we use many more hidden units (HUs), namely 25, than the required minimum of 10. Dayan and Zemel report that an AA trained without FMS (and with more than 10 HUs) ``consistently failed''. This result has been confirmed by Baumgartner (1996).
For each of the 25 pixels there is an input unit. Input units that see a pixel of a bar take on activation 0.5; the others take on a constant background activation. See Figure 1 for an example. Following Dayan and Zemel (1995), the net is trained on 500 randomly generated patterns (pattern repetitions are possible). Learning is stopped after 5,000 epochs. We say that a pattern is processed correctly if the absolute error of every output unit is below 0.3.
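The generation process just described can be sketched in a few lines of numpy. The bar activation 0.5 comes from the text; the background activation of -0.5 is an assumption (its exact value is not given here), and the per-bar probability 1/5 is the value consistent with the two expected bars per input (10 bars times 1/5).

```python
import numpy as np

rng = np.random.default_rng(0)

GRID = 5            # 5x5 pixel grid -> 25 input units
P_BAR = 1.0 / 5.0   # each of the 10 bars appears independently (10 * 1/5 = 2 expected bars)
BAR = 0.5           # activation of a pixel covered by a bar
BACKGROUND = -0.5   # assumed background activation

def random_pattern(rng):
    """One input pattern: each of the 5 horizontal and 5 vertical bars
    is switched on independently with probability P_BAR; horizontal and
    vertical bars may mix in the same pattern."""
    grid = np.full((GRID, GRID), BACKGROUND)
    for r in range(GRID):
        if rng.random() < P_BAR:
            grid[r, :] = BAR          # horizontal bar in row r
    for c in range(GRID):
        if rng.random() < P_BAR:
            grid[:, c] = BAR          # vertical bar in column c (may overlap)
    return grid.ravel()               # 25-dimensional input vector

# 500 randomly generated training patterns (repetitions possible)
train = np.stack([random_pattern(rng) for _ in range(500)])

def processed_correctly(output, target, tol=0.3):
    """Correctness criterion from the text: every output unit's
    absolute error must stay below 0.3."""
    return bool(np.all(np.abs(output - target) < tol))
```
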
Details. Parameters: learning rate 1.0. Architecture: 25-25-25 (input units, HUs, output units).
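For concreteness, here is a minimal numpy sketch of the 25-25-25 auto-associator with sigmoid HUs. The linear output layer and plain batch gradient descent are assumptions, and the FMS regularizer that distinguishes LOCOCODE from standard backprop is deliberately omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class Autoencoder:
    """Minimal 25-25-25 auto-associator: 25 inputs, 25 sigmoid HUs,
    and an (assumed) linear output layer trained to reproduce the input.
    Plain batch gradient descent stands in for the FMS-regularized
    training, which is not reproduced here."""

    def __init__(self, n=25):
        self.W1 = rng.normal(0.0, 0.1, (n, n)); self.b1 = np.zeros(n)
        self.W2 = rng.normal(0.0, 0.1, (n, n)); self.b2 = np.zeros(n)

    def forward(self, x):
        h = sigmoid(x @ self.W1 + self.b1)   # hidden code in (0, 1)
        y = h @ self.W2 + self.b2            # reconstruction
        return h, y

    def train_step(self, x, lr=1.0):
        """One batch gradient step; returns the pre-update MSE.
        The default lr matches the text's learning rate 1.0, though
        plain gradient descent may need smaller values to converge."""
        h, y = self.forward(x)
        err = y - x
        dh = (err @ self.W2.T) * h * (1.0 - h)   # backprop through sigmoid
        self.W2 -= lr * (h.T @ err) / len(x)
        self.b2 -= lr * err.mean(axis=0)
        self.W1 -= lr * (x.T @ dh) / len(x)
        self.b1 -= lr * dh.mean(axis=0)
        return float((err ** 2).mean())
```
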
[Figure 2: typical weights on connections from the inputs to the HUs (left) and from the HUs to the outputs (right) after training (barsb.eps)]

Results: factorial (and sparse) codes. Training MSE is 0.11 (average over 10 trials). The net generalizes well: only one of the test patterns is not processed correctly. 15 of the 25 HUs are indeed automatically pruned. All remaining HUs are binary: LOCOCODE finds an optimal factorial code that exactly mirrors the pattern generation process. Since the expected number of bars per input is 2, the code is also sparse.
For each of the 25 HUs, Figure 2 (left) shows a square depicting 25 typical post-training weights on connections from the 25 inputs (right: to the 25 outputs). White (black) circles on gray (white) background are positive (negative) weights. The circle radius is proportional to the weight's absolute value. Figure 2 (left) also shows the bias weights (on top of the squares' upper left corners). The circle representing an HU's maximal absolute weight has the maximal possible radius (circles representing other weights are scaled accordingly).
Backprop fails. For comparison we run this task with conventional BP with 25, 15 and 10 HUs. With 25 (15, 10) HUs the reconstruction error is 0.19 (0.24, 0.31). Backprop does not prune any units; the resulting weight patterns are highly unstructured, and the underlying input statistics are not discovered.
PCA and ICA. We tried both 10 and 15 components. Figure 3 shows results. PCA produces an unstructured, dense code; ICA-10 produces an almost sparse code in which some sources are recognizable but not separated; ICA-15 finds a dense code and no sources. ICA/PCA codes with 10 components convey the same information as 10-component lococodes. The higher reconstruction errors for PCA-15 and ICA-15 are due to overfitting (the backprop net overspecializes on the training set).
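The PCA baseline is easy to reproduce in outline. The sketch below fits principal components to bar patterns via SVD and measures training-set reconstruction error; the bar-pattern generator and its background value are assumptions, and it does not reproduce the exact PCA/ICA runs reported here.

```python
import numpy as np

rng = np.random.default_rng(2)

def bar_patterns(n, grid=5, p=0.2, bar=0.5, bg=-0.5):
    """Sample n bar patterns as in Task 2.1 (background value assumed)."""
    X = np.full((n, grid, grid), bg)
    for i in range(n):
        for r in range(grid):
            if rng.random() < p:
                X[i, r, :] = bar
        for c in range(grid):
            if rng.random() < p:
                X[i, :, c] = bar
    return X.reshape(n, grid * grid)

X = bar_patterns(500)
mean = X.mean(axis=0)
Xc = X - mean

# PCA via SVD of the centered data; rows of Vt are the principal directions
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

errors = {}
for k in (10, 15, 25):
    Xk = (Xc @ Vt[:k].T) @ Vt[:k] + mean   # rank-k reconstruction
    errors[k] = float(((Xk - X) ** 2).mean())
# 10 linear components cannot capture the nonlinearly mixed bars exactly,
# so the training error shrinks as components are added.
```

Note that these are training-set errors, which can only shrink with additional components; the higher PCA-15/ICA-15 errors mentioned above are generalization errors.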
LOCOCODE can exploit the advantages of sigmoid output functions and is applicable to nonlinear signal mixtures. PCA and ICA, however, are limited to linear source superpositions. Since we allow for mixing of vertical and horizontal bars, the bars do not add linearly: a pixel covered by two crossing bars has the same intensity as a pixel covered by one. This exemplifies a major characteristic of real visual inputs and contributes to making the task hard for PCA and ICA.
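The nonlinearity of the mixture is easy to see at a crossing pixel. In this sketch the background activation -0.5 is an assumption:

```python
import numpy as np

BG, BAR = -0.5, 0.5                       # assumed background / bar activations

h = np.full((5, 5), BG); h[2, :] = BAR    # a single horizontal bar
v = np.full((5, 5), BG); v[:, 3] = BAR    # a single vertical bar

mixed = np.maximum(h, v)                  # actual input: a pixel is "on" if any bar covers it
linear = h + v - BG                       # what a purely linear superposition would give

# The two agree everywhere except at the crossing pixel (2, 3):
print(mixed[2, 3], linear[2, 3])          # 0.5 1.5
```
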
[Figure 3: PCA and ICA results for the bars task (barsa.eps)]

Task 2.2 (noisy bars). Like Task 2.1, except for additional noise: bar intensities vary within a fixed interval, and input units that see a pixel of a bar are activated correspondingly (recall the constant intensity 0.5 in Task 2.1); the others adopt the background activation. We also add zero-mean Gaussian noise with variance 0.05 to each pixel. Figure 4 shows some training exemplars generated in this way. The task is adapted from Hinton et al. (1995) and Hinton and Ghahramani (1997) but is more difficult because vertical and horizontal bars may be mixed in the same input.
[Figure 4: training exemplars for the noisy bars task (nbars0.eps)]

Details. Training, testing, coding, and learning are as in Task 2.1. The error tolerance is set to 2 times the expected minimal squared error: 2 times 25 pixels times the noise variance 0.05, i.e., 2.5. To achieve consistency with Task 2.1, the target pixel value is 1.4 times the input pixel value. All other learning parameters are as in Task 2.1.
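A sketch of the noisy generation process and the tolerance arithmetic. The bar-intensity interval and background value are assumptions (only the noise variance 0.05 and the target scaling factor 1.4 are given here); the tolerance works out to 2 x 25 x 0.05 = 2.5, matching the training MSE reported below.

```python
import numpy as np

rng = np.random.default_rng(3)

GRID = 5
P_BAR = 1.0 / 5.0
BG = -0.5                # assumed background activation
NOISE_VAR = 0.05         # per-pixel Gaussian noise variance (from the text)
LO, HI = 0.1, 0.5        # assumed bar-intensity interval (exact range not given here)

def noisy_pattern(rng):
    """One noisy-bars input: each bar appears with probability 1/5 and
    carries its own random intensity; Gaussian noise is added per pixel."""
    grid = np.full((GRID, GRID), BG)
    for r in range(GRID):
        if rng.random() < P_BAR:
            grid[r, :] = rng.uniform(LO, HI)   # horizontal bar, random intensity
    for c in range(GRID):
        if rng.random() < P_BAR:
            grid[:, c] = rng.uniform(LO, HI)   # vertical bar, random intensity
    grid += rng.normal(0.0, np.sqrt(NOISE_VAR), (GRID, GRID))
    return grid.ravel()

X = np.stack([noisy_pattern(rng) for _ in range(500)])

# Error tolerance: 2 x expected minimal squared error over all 25 pixels
E_TOL = 2 * GRID * GRID * NOISE_VAR    # 2 * 25 * 0.05 = 2.5

# Targets are scaled inputs, as in the text
targets = 1.4 * X
```
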
[Figure 5: typical weights to and from HUs after training on the noisy bars task (nbarsb.eps)]

Results. Training MSE is 2.5 (averaged over 10 trials); the net generalizes well. 15 of the 25 HUs are pruned away. Again LOCOCODE extracts an optimal (factorial) code which exactly mirrors the pattern generation process. Due to the bar intensity variations the remaining HUs are not binary as in Task 2.1. Figure 5 depicts typical weights to and from HUs.
PCA and ICA. Figure 6 shows results comparable to those of Task 2.1. PCA codes and ICA-15 codes are unstructured and dense. ICA-10 codes, however, are almost sparse: some sources are recognizable, though not separated. Again, PCA/ICA codes with 10 components convey as much information as 10-component lococodes. The lower reconstruction error for PCA-15 and ICA-15 is due to information about the current noise conveyed by the additional code components (we reconstruct noisy inputs).
[Figure 6: PCA and ICA results for the noisy bars task (nbarsa.eps)]

Conclusion. LOCOCODE solves a hard variant of the standard ``bars'' problem. It discovers the underlying statistics and extracts the essential, statistically independent features, even in the presence of noise. Standard BP AAs accomplish none of these feats (Dayan and Zemel, 1995); this has been confirmed by additional experiments of our own. ICA and PCA also fail to extract the true input causes and the optimal features.
LOCOCODE achieves success solely by reducing information-theoretic (de)coding costs. Unlike previous approaches, it does not depend on explicit terms enforcing independence (e.g., Schmidhuber 1992), zero mutual information among code components (e.g., Linsker 1988, Deco and Parra 1994), or sparseness (e.g., Field 1994, Zemel and Hinton 1994, Olshausen and Field 1996, Zemel 1993, Hinton and Ghahramani 1997).
LOCOCODE vs. ICA. Like recent simple methods for ``independent component analysis'' (ICA; e.g., Cardoso and Souloumiac 1993, Bell and Sejnowski 1995, Amari et al. 1996), LOCOCODE untangles mixtures of independent data sources. Unlike these methods, however, it does not need to know the number of sources in advance: like ``predictability minimization'' (a nonlinear ICA approach; Schmidhuber 1992), it simply prunes away superfluous code components.
In many visual coding applications few sources determine the value of a given output (input) component, and the sources are easily computable from the input. Here LOCOCODE outperforms simple ICA because it minimizes the number of low-complexity sources responsible for each output component. It may be less useful for discovering input causes that can be represented only by high-complexity input transformations, or for discovering many features (causes) that collectively determine single input components (as, e.g., in acoustic signal separation). In such cases ICA is not hampered when each source influences every input component and none is computable by a low-complexity function.