The following five experiments demonstrate the effects of various input representations, data distributions, and architectures, as summarized in Table 1. The data always consists of 8 input vectors. Code units are initialized with a negative bias of -2.0.
Constant parameters (2-phase learning).
Experiment 1.1: We use uniformly distributed inputs and 500,000 training examples. Parameters: learning rate 0.1; the ``tolerable error'' ; architecture 8-5-8 (8 input units, 5 hidden units (HUs), 8 output units).
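The set-up above can be sketched as a plain backprop auto-associator. This is a minimal illustration only: the FMS regularizer itself is omitted, and the epoch count and weight initialization scale are assumptions not taken from the text.

```python
# Minimal sketch of the 8-5-8 auto-associator of Experiment 1.1, trained
# with plain backprop (no FMS regularizer; epoch count is an assumption).
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.eye(8)                      # 8 one-hot input patterns

W1 = rng.normal(0.0, 0.5, (8, 5))  # input -> code weights
b1 = np.full(5, -2.0)              # code units start with negative bias -2.0
W2 = rng.normal(0.0, 0.5, (5, 8))  # code -> output weights
b2 = np.zeros(8)

def reconstruct(X):
    return sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)

mse0 = np.mean((reconstruct(X) - X) ** 2)   # error before training

lr = 0.1
for _ in range(20000):
    h = sigmoid(X @ W1 + b1)           # code layer
    y = sigmoid(h @ W2 + b2)           # reconstruction
    dy = (y - X) * y * (1 - y)         # output delta (squared error, sigmoid)
    dh = (dy @ W2.T) * h * (1 - h)     # hidden delta
    W2 -= lr * h.T @ dy
    b2 -= lr * dy.sum(axis=0)
    W1 -= lr * X.T @ dh
    b1 -= lr * dh.sum(axis=0)

mse = np.mean((reconstruct(X) - X) ** 2)    # error after training
print(mse < mse0)
```

With FMS added, the low-complexity pressure on the weights is what drives the pruning and code structure reported below.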
Results: factorial codes. In 7 out of 10 trials, FMS effectively pruned 2 HUs and produced a factorial binary code with statistically independent code components. In 2 trials, FMS pruned 2 HUs and produced an almost binary code, with one trinary unit taking on the values 0.0, 0.5, and 1.0. In one trial, FMS produced a binary code with only one HU pruned away. Obviously, under certain constraints on the input data, FMS has a strong tendency towards the compact, nonredundant codes advocated by numerous researchers.
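A code is factorial when its components are statistically independent, i.e., the joint probability of each code word equals the product of the component marginals. The check below uses a hypothetical 3-bit code for 8 equiprobable inputs; the actual codes found by FMS are not listed in the text.

```python
# Independence check for a hypothetical factorial code: each of the 8
# equiprobable inputs gets one of the 8 binary triples as its code word.
import itertools
import numpy as np

codes = np.array(list(itertools.product([0, 1], repeat=3)), dtype=float)
p_input = np.full(8, 1.0 / 8.0)       # uniform input distribution

marginals = p_input @ codes           # P(code component j = 1)

factorial = True
for code, p in zip(codes, p_input):
    # probability of this code word if the components were independent
    q = np.prod(np.where(code == 1.0, marginals, 1.0 - marginals))
    factorial = factorial and bool(np.isclose(p, q))

print(factorial)
```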
Experiment 1.2: See Table 1 for the differences from Experiment 1.1. We use 200,000 training examples and more HUs, to show that in this case fewer units are pruned.
Results: local codes. 10 trials were conducted. FMS always produced a binary code. In 7 trials only 1 HU was pruned; in the remaining 3 trials, 2 HUs were pruned. Unlike with standard BP, almost all inputs were almost always coded in an entirely local manner, i.e., only one HU was switched on and the others were switched off. Recall that local codes were also advocated by many researchers -- but they are precisely ``the opposite'' of the factorial codes from the previous experiment. How can LOCOCODE justify such different codes? How can this apparent discrepancy be explained?
Explanation. The reason is that with this input representation, the additional HUs do not necessarily add much complexity to the coding and decoding mappings. The zero-valued input components allow for low weight precision (low coding complexity) for connections leading to HUs (and similarly for connections leading to output units). In contrast to Experiment 1.1, the i-th possible input can be described by the feature ``the i-th input component does not equal zero'', which can be implemented by a low-complexity component function. This contrasts with Experiment 1.1, where there are only 5 hidden units and no zero input components: there it is better to code with as few code components as possible, which yields a factorial code.
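The low-complexity argument can be made concrete: with one-hot inputs, the feature ``the i-th input component is non-zero'' needs only a single strong weight per code unit, while all other weights can stay at exactly zero (and thus at minimal precision). The specific weight and bias values below are illustrative assumptions.

```python
# A local code implemented by low-complexity component functions:
# one strong weight per unit, everything else exactly zero.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.eye(8)              # 8 one-hot input patterns
W = 8.0 * np.eye(8)        # one strong weight per unit; all others zero
b = -4.0                   # negative bias keeps each unit off by default

code = sigmoid(X @ W + b)  # unit i switches on only for pattern i
print(np.allclose(np.round(code), np.eye(8)))
```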
Experiment 1.3: like Experiment 1.2, but with one-dimensional input. Parameters: learning rate 0.1.
Results: feature detectors. 10 trials were conducted. FMS always produced the following code: one binary HU distinguishing input values below 0.5 from input values above 0.5, and 2 HUs with continuous values, one of which is zero (or one) whenever the binary unit is on, while the other is zero (or one) otherwise. All remaining HUs adopt constant values of either 1.0 or 0.0 and are thus essentially pruned away. The binary unit serves as a binary feature detector, grouping the inputs into 2 classes.
Lococode recognizes the causes. The data of Experiment 1.3 may be viewed as being generated as follows: (1) first choose, with uniform probability, a value from one set; (2) then choose a value from a second set; (3) add the two values. The first cause of the data is recognized perfectly, but the second is divided among two code components, due to the non-linearity of the output unit: adding the second value to 0 is different from adding it to 0.75 (consider the first-order derivatives).
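The two-stage generative process can be sketched as follows. The value sets A and B below are hypothetical placeholders (the actual sets are not given in the text; 0.75 is used because it appears in the derivative argument above).

```python
# Sketch of the assumed generative process behind Experiment 1.3's data:
# draw one value per cause, then emit the sum as the 1-D input.
import random

A = [0.0, 0.75]        # placeholder set for the first cause
B = [0.0, 0.1, 0.2]    # placeholder set for the second cause

random.seed(0)

def sample_input():
    a = random.choice(A)   # first cause: captured by the binary feature detector
    b = random.choice(B)   # second cause: split across two continuous code units
    return a + b

xs = [sample_input() for _ in range(1000)]
print(len(set(xs)))        # at most len(A) * len(B) distinct inputs
```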
Experiment 1.4: like Experiment 1.1, but with nonuniformly distributed inputs. Parameters: learning rate 0.005.
Results: sparse codes. In 4 out of 10 trials, FMS found a binary code with no HUs pruned; in 3 trials, a binary code with one HU pruned; in one trial, a code with one HU removed and one trinary unit adopting the values 0.0, 0.5, and 1.0; and in 2 trials, a code with one pruned HU and 2 trinary HUs. Obviously, with this set-up, FMS prefers codes known as sparse distributed representations. Inputs with higher probability are coded by fewer active code components than inputs with lower probability: typically, the most probable inputs lead to one active code component, less probable inputs to two, and the remaining ones to three.
Explanation. Why does the result differ from Experiment 1.1's? To achieve equal error contributions for all inputs, the weights for coding/decoding highly probable inputs have to be given with higher precision than the weights for coding/decoding inputs with low probability: the input distribution of Experiment 1.1 would result in a more complex network. The next experiment makes this effect even more pronounced.
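The precision argument rests on a simple fact: coarser weight precision (fewer bits) means larger representation error, so weights that must keep the error on frequent inputs small need more bits. A pure illustration, assuming a simple uniform quantizer (not part of FMS):

```python
# Quantizing a weight to fewer bits increases its representation error.
def quantize(w, bits):
    step = 2.0 ** (1 - bits)          # coarser step for fewer bits
    return round(w / step) * step

w = 0.3                               # some illustrative weight value
errs = [abs(quantize(w, b) - w) for b in (2, 4, 8)]
print(errs[0] > errs[1] > errs[2])    # error shrinks as precision grows
```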
Experiment 1.5: like Experiment 1.4, but with architecture (8-8-8).
Results: sparse codes. In 10 trials, FMS always produced binary codes. In 2 trials only 1 HU was pruned, in 7 trials 2 units, and in 1 trial 3 units. Unlike with standard BP, almost all inputs were almost always coded in a sparse, distributed manner: typically, 2 HUs were switched on and the others switched off, and most HUs responded to exactly 2 different input patterns. The mean probability of a unit being switched on was 0.28, and the probabilities of different HUs being switched on tended to be equal.
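The statistics reported above (on-probability per unit, active units per input) can be computed directly from a code table. The 8x8 binary code below is a hypothetical sparse code with exactly 2 active units per input, over equiprobable inputs for simplicity; it is not the code actually found by FMS (whose mean on-probability was 0.28, and whose inputs were nonuniform).

```python
# Sparseness statistics of a hypothetical 8-unit binary code.
import numpy as np

# each row: code word for one input; exactly 2 of 8 units are on
codes = np.array([[1 if j in (i, (i + 1) % 8) else 0 for j in range(8)]
                  for i in range(8)])
p_input = np.full(8, 1.0 / 8.0)       # equiprobable inputs (simplification)

p_on = p_input @ codes                # probability of each unit being on
active_per_input = codes.sum(axis=1)  # active code components per input

print(p_on.tolist())
print(active_per_input.tolist())
```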
Table 1 provides an overview of Experiments 1.1 -- 1.5.
Conclusion. FMS always finds codes quite different from standard BP's rather unstructured ones, and it tends to discover and represent the underlying causes of the data. Usually the resulting lococode is sparse and based on informative feature detectors; depending on properties of the data, it may become factorial or local. This suggests that LOCOCODE may represent a general principle of unsupervised learning subsuming previous, COCOF-based approaches.
Feature-based lococodes automatically take into account input/output properties (binary?, local?, input probabilities?, noise?, number of zero input components?).