

EXPERIMENT 2: Independent Bars

Task 2.1 -- adapted from Dayan and Zemel (1995), see also Földiák (1990), Zemel (1993), Saund (1995), but more difficult (compare M. Baumgartner's 1996 diploma thesis). The input is a $5 \times 5$ pixel grid with horizontal and vertical bars at random, independent positions. See Figure 1 for an example.

Figure 1: Task 2.1: example of partly overlapping bars. The 2nd and the 4th vertical bar and the 2nd horizontal bar are switched on simultaneously. Left: the corresponding input values.

The task is to extract the independent features (the bars). According to Dayan and Zemel (1995), even a simpler variant (where vertical and horizontal bars may not be mixed in the same input) is not trivial:
``Although it might seem like a toy problem, the $5 \times 5$ bar task with only 10 hidden units turns out to be quite hard for all the algorithms we discuss. The coding cost of making an error in one bar goes up linearly with the size of the grid, so at least one aspect of the problem gets easier with large grids.''
We will see that even difficult variants of this task are not hard for LOCOCODE.

Training and testing. Each of the 10 possible bars appears with probability $\frac{1}{5}$. In contrast to the set-up of Dayan and Zemel (1995), we allow for mixing of bar types, which makes the task harder (Dayan and Zemel 1995, p. 570). To test LOCOCODE's ability to reduce redundancy, we use many more HUs (namely 25) than the required minimum of 10. Dayan and Zemel report that an AA trained without FMS (and with more than 10 HUs) ``consistently failed''. This result has been confirmed by Baumgartner (1996).

For each of the 25 pixels there is an input unit. Input units that see a pixel of a bar take on activation $0.5$, others $-0.5$. See Figure 1 for an example. Following Dayan and Zemel (1995), the net is trained on 500 randomly generated patterns (there may be pattern repetitions). Learning is stopped after 5,000 epochs. We say that a pattern is processed correctly if the absolute error of all output units is below 0.3.
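
For concreteness, the following sketch (not part of the original paper; NumPy is assumed and all function names are illustrative) generates such training patterns and implements the above correctness test:

    import numpy as np

    def make_bars_patterns(n_patterns=500, grid=5, p_bar=0.2, seed=0):
        # Each of the 5 horizontal and 5 vertical bars is switched on
        # independently with probability 1/5. Pixels on a bar take value 0.5,
        # all others -0.5; overlapping bars still yield 0.5, so mixed bars
        # do not superimpose linearly.
        rng = np.random.default_rng(seed)
        patterns = np.full((n_patterns, grid, grid), -0.5)
        for p in patterns:
            p[rng.random(grid) < p_bar, :] = 0.5   # horizontal bars
            p[:, rng.random(grid) < p_bar] = 0.5   # vertical bars
        return patterns.reshape(n_patterns, grid * grid)

    def processed_correctly(outputs, targets, tol=0.3):
        # A pattern is processed correctly if the absolute error of
        # every output unit is below 0.3.
        return np.all(np.abs(outputs - targets) < tol, axis=-1)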

Details. Parameters: learning rate 1.0, $E_{tol} = 0.16$, $\Delta \lambda = 0.001$. Architecture: (25-25-25).
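
The sketch below shows a plain 25-25-25 backprop autoencoder of this size (NumPy and tanh units are assumptions made here for illustration). The flat-minimum-search (FMS) regularizer that turns such an AA into LOCOCODE is omitted, so this corresponds at best to the conventional baseline discussed further down, not to LOCOCODE itself.

    import numpy as np

    class PlainAutoencoder:
        # 25-25-25 autoencoder trained by plain backprop on the squared
        # reconstruction error; the FMS regularizer of LOCOCODE is NOT included.
        def __init__(self, n_in=25, n_hidden=25, lr=1.0, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.1, (n_hidden, n_in))
            self.b2 = np.zeros(n_in)
            self.lr = lr

        def forward(self, x):
            h = np.tanh(x @ self.W1 + self.b1)   # hidden code
            y = np.tanh(h @ self.W2 + self.b2)   # reconstruction
            return h, y

        def train_step(self, x, target):
            h, y = self.forward(x)
            err = y - target
            dy = err * (1.0 - y ** 2)              # backprop through output tanh
            dh = (dy @ self.W2.T) * (1.0 - h ** 2) # backprop through hidden tanh
            self.W2 -= self.lr * np.outer(h, dy)
            self.b2 -= self.lr * dy
            self.W1 -= self.lr * np.outer(x, dh)
            self.b1 -= self.lr * dh
            return 0.5 * float(np.sum(err ** 2))

Training it autoassociatively (target = input) on the 500 patterns for 5,000 epochs then follows the schedule described above.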

Figure 2: Task 2.1 (independent bars). Left: LOCOCODE's input-to-hidden weights. Right: hidden-to-output weights. See text for visualization details.

Results: factorial (but sparse) codes. Training MSE is 0.11 (average over 10 trials). The net generalizes well: only one of the test patterns is not processed correctly. 15 of the 25 HUs are indeed automatically pruned. All remaining HUs are binary: LOCOCODE finds an optimal factorial code which exactly mirrors the pattern generation process. Since the expected number of bars per input is 2, the code is also sparse.
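
A rough way to check these two properties (continuing the illustrative sketches above; the thresholds are arbitrary, and a plain-backprop AA will not exhibit them) is to inspect the matrix of hidden activations of the trained net over all training patterns:

    # H: (500, 25) matrix of hidden unit activations of the trained net
    # on all training patterns (however the net was trained).
    pruned = np.where(H.std(axis=0) < 0.05)[0]           # near-constant units carry no information
    active = np.setdiff1d(np.arange(H.shape[1]), pruned)
    # "binary" remaining units: activations cluster around only two values
    n_levels = [len(np.unique(np.round(H[:, u], 1))) for u in active]
    print(len(pruned), "pruned units; activation levels per active unit:", n_levels)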

For each of the 25 HUs, Figure 2 (left) shows a $5 \times 5$ square depicting 25 typical post-training weights on connections from the 25 inputs (right: to the 25 outputs). White (black) circles on gray (white) background are positive (negative) weights. The circle radius is proportional to the weight's absolute value. Figure 2 (left) also shows the bias weights (on top of the squares' upper left corners). The circle representing a HU's maximal absolute weight has the maximal possible radius; circles representing that unit's other weights are scaled accordingly.
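
The following matplotlib sketch (an illustration, not the paper's original plotting code; the per-sign background shading and the bias-weight markers of Figure 2 are simplified away) renders weights in this style:

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_weight_squares(W, grid=5):
        # W: (25, n_hu) input-to-hidden weights; one 5x5 square per hidden unit.
        # Circle radius is proportional to |weight|, normalized per unit;
        # white circles are positive weights, black circles negative.
        n_hu = W.shape[1]
        fig, axes = plt.subplots(5, 5, figsize=(8, 8))
        for u, ax in zip(range(n_hu), axes.ravel()):
            w = W[:, u].reshape(grid, grid)
            scale = np.abs(w).max() + 1e-12
            ax.set_facecolor("lightgray")
            for i in range(grid):
                for j in range(grid):
                    r = 0.45 * abs(w[i, j]) / scale
                    ax.add_patch(plt.Circle((j, grid - 1 - i), r,
                                            color="white" if w[i, j] > 0 else "black"))
            ax.set_xlim(-0.5, grid - 0.5)
            ax.set_ylim(-0.5, grid - 0.5)
            ax.set_xticks([]); ax.set_yticks([])
            ax.set_aspect("equal")
        plt.show()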

Backprop fails. For comparison we run this task with conventional BP with 25, 15 and 10 HUs. With 25 (15, 10) HUs the reconstruction error is 0.19 (0.24, 0.31). Backprop does not prune any units; the resulting weight patterns are highly unstructured, and the underlying input statistics are not discovered.

PCA and ICA. We tried both 10 and 15 components. Figure 3 shows results. PCA produces an unstructured and dense code, ICA-10 an almost sparse code where some sources are recognizable but not separated. ICA-15 finds a dense code and no sources. ICA/PCA codes with 10 components convey the same information as 10-component lococodes. The higher reconstruction errors for PCA-15 and ICA-15 are due to overfitting (the backprop net over-specializes on the training set).

LOCOCODE can exploit the advantages of sigmoid output functions and is applicable to nonlinear signal mixtures. PCA and ICA, however, are limited to linear source superpositions. Since we allow for mixing of vertical and horizontal bars, the bars do not add linearly, thus exemplifying a major characteristic of real visual inputs. This contributes to making the task hard for PCA and ICA.
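
For reference, a PCA/ICA comparison of this kind can be reproduced along the following lines. Scikit-learn's FastICA is used here as a stand-in; the paper's ICA results were obtained with the algorithms cited below, so this is only an approximate re-run building on the earlier sketches:

    from sklearn.decomposition import PCA, FastICA

    X = make_bars_patterns(500)                  # data from the sketch above
    for n_comp in (10, 15):
        pca = PCA(n_components=n_comp).fit(X)
        ica = FastICA(n_components=n_comp, max_iter=1000).fit(X)
        # Reshape each component to 5x5 and plot it like the weight squares
        # above to see whether individual bars become visible.
        plot_weight_squares(pca.components_.T, grid=5)
        plot_weight_squares(ica.components_.T, grid=5)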

Figure 3: Task 2.1 (independent bars). PCA and ICA: weights to code components (ICA with 10 and 15 components). ICA-10 does make some sources recognizable, but does not achieve lococode quality.

Task 2.2 (noisy bars). Like Task 2.1 except for additional noise: bar intensities vary in $[0.1,0.5]$; input units that see a pixel of a bar are activated correspondingly (recall the constant intensity 0.5 in Task 2.1), others adopt activation $-0.5$. We also add Gaussian noise with variance 0.05 and mean 0 to each pixel. Figure 4 shows some training exemplars generated in this way. The task is adapted from Hinton et al. (1995) and Hinton and Ghahramani (1997) but more difficult because vertical and horizontal bars may be mixed in the same input.

Figure 4: Task 2.2 -- noisy bars examples: 25 $5 \times 5$ training inputs, depicted similarly to the weights in previous figures.

Details. Training, testing, coding, and learning are as in Task 2.1, except that $E_{tol} = 2.5$ and $\Delta \lambda = 0.01$. $E_{tol}$ is set to 2 times the expected minimal squared error: $E_{tol} = 2 \cdot (\mbox{number of inputs}) \cdot \sigma^2 = 2 \cdot 25 \cdot 0.05 = 2.5$. To achieve consistency with Task 2.1, the target pixel value is 1.4 times the input pixel value (compare Task 2.1: $0.7 = 1.4 \cdot 0.5$). All other learning parameters are as in Task 2.1.
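
A sketch of the corresponding data generation (again illustrative and not from the paper; whether overlapping bars share one intensity and whether the targets include the pixel noise are assumptions made here) looks as follows:

    import numpy as np

    def make_noisy_bars(n_patterns=500, grid=5, p_bar=0.2, noise_var=0.05, seed=0):
        # Bars as in Task 2.1, but each switched-on bar gets its own random
        # intensity drawn from [0.1, 0.5]; off-pixels stay at -0.5, and
        # zero-mean Gaussian noise with variance 0.05 is added to every pixel.
        rng = np.random.default_rng(seed)
        X = np.full((n_patterns, grid, grid), -0.5)
        for p in X:
            for r in np.where(rng.random(grid) < p_bar)[0]:
                p[r, :] = rng.uniform(0.1, 0.5)      # horizontal bar intensity
            for c in np.where(rng.random(grid) < p_bar)[0]:
                p[:, c] = rng.uniform(0.1, 0.5)      # vertical bar intensity
        X = X.reshape(n_patterns, grid * grid)
        X = X + rng.normal(0.0, np.sqrt(noise_var), X.shape)
        targets = 1.4 * X                            # target = 1.4 * input pixel value
        E_tol = 2 * grid * grid * noise_var          # 2 * 25 * 0.05 = 2.5
        return X, targets, E_tol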

Figure 5: Task 2.2 (independent noisy bars). Left: LOCOCODE's input-to-hidden weights. Right: hidden-to-output weights.

Results. Training MSE is 2.5 (averaged over 10 trials); the net generalizes well. 15 of the 25 HUs are pruned away. Again LOCOCODE extracts an optimal (factorial) code which exactly mirrors the pattern generation process. Due to the bar intensity variations the remaining HUs are not binary as in Task 2.1. Figure 5 depicts typical weights to and from HUs.

PCA and ICA. Figure 6 shows results comparable to those of Task 2.1. PCA codes and ICA-15 codes are unstructured and dense. ICA-10 codes, however, are almost sparse -- some sources are recognizable. They are not separated though. We observe that PCA/ICA codes with 10 components convey as much information as 10-component lococodes. The lower reconstruction error for PCA-15 and ICA-15 is due to information about the current noise conveyed by the additional code components (we reconstruct noisy inputs).

Figure 6: Task 2.2 (independent noisy bars). PCA and ICA: weights to code components (ICA with 10 and 15 components). Only ICA-10 codes extract a few sources, but they do not achieve the quality of lococodes.

Conclusion. LOCOCODE solves a hard variant of the standard ``bars'' problem. It discovers the underlying statistics and extracts the essential, statistically independent features, even in the presence of noise. Standard BP AAs accomplish none of these feats (Dayan and Zemel, 1995); this has been confirmed by our own additional experiments. ICA and PCA also fail to extract the true input causes and the optimal features.

LOCOCODE achieves success solely by reducing information-theoretic (de)coding costs. Unlike previous approaches, it does not depend on explicit terms enforcing independence (e.g., Schmidhuber 1992), zero mutual information among code components (e.g., Linsker 1988, Deco and Parra 1994), or sparseness (e.g., Field 1994, Zemel and Hinton 1994, Olshausen and Field 1996, Zemel 1993, Hinton and Ghahramani 1997).

LOCOCODE vs. ICA. Like recent simple methods for ``independent component analysis'' (ICA, e.g., Cardoso and Souloumiac 1993, Bell and Sejnowski 1995, Amari et al. 1996) LOCOCODE untangles mixtures of independent data sources. Unlike these methods, however, it does not need to know in advance the number of such sources -- like ``predictability minimization'' (a nonlinear ICA approach -- Schmidhuber 1992), it simply prunes away superfluous code components.

In many visual coding applications few sources determine the value of a given output (input) component, and the sources are easily computable from the input. Here LOCOCODE outperforms simple ICA because it minimizes the number of low-complexity sources responsible for each output component. It may be less useful for discovering input causes that can only be represented by high-complexity input transformations, or for discovering many features (causes) collectively determining single input components (as, e.g., in acoustic signal separation). In such cases ICA does not suffer from the fact that each source influences each input component and none is computable by a low-complexity function.

