A linear (perceptron-like) network with 100 input units,
one output unit, and 100 weights,
is fed with 100-dimensional binary input vectors.
$x^p$ denotes the $p$-th input vector,
and $x^p_i$ denotes the $i$-th component
of $x^p$, where $i$ ranges from 0 to 99.
Each input vector has exactly three bits
set to one; all other bits are set to zero.
Obviously, there are $\binom{100}{3} = 161700$ possible input vectors.
The network's output in response to $x^p$ is $o(x^p) = \sum_{i=0}^{99} w_i x^p_i$, where $w_i$ denotes the network's $i$-th weight.
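The network and its target behavior can be sketched in a few lines of plain Python (the function name `output` is ours, not the paper's):

```python
# Sketch of the linear (perceptron-like) network described above:
# 100 weights, one output unit, output = dot product of weights and input.

def output(w, x):
    """Linear network: one output unit, no bias."""
    assert len(w) == len(x) == 100
    return sum(wi * xi for wi, xi in zip(w, x))

# An input vector with exactly three on-bits, e.g. at positions 5, 17, 86:
x = [0] * 100
for i in (5, 17, 86):
    x[i] = 1

# With all weights equal to 1, the output is the number of on-bits:
w = [1] * 100
print(output(w, x))  # -> 3
```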
The solution. The only solution to the problem is: make all weights $w_i$ equal to 1. The Kolmogorov complexity of this solution is small, since there is a short program that computes it. Its Levin complexity is small, too, since its ``logical depth'' (the runtime of its shortest program (Bennett, 1988)) is less than 400 time steps.
The training data. To illustrate the generalization capability of search for solution candidates with low Levin complexity, only 3 training examples are used. They were randomly chosen from the $\binom{100}{3}$ possible inputs. The first training example, $x^1$, is the binary vector with on-bits at positions 5, 17, and 86 (and off-bits everywhere else). The second one, $x^2$, has on-bits at positions 13, 55, and 58. The third one, $x^3$, has on-bits at positions 40, 87, and 94. In all three cases, the desired output (target) is 3.
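A minimal sketch of this training set, using the on-bit positions listed above (the helper name `make_input` is ours):

```python
def make_input(on_bits, n=100):
    """Binary vector of length n with ones at the given positions."""
    x = [0] * n
    for i in on_bits:
        x[i] = 1
    return x

# The three training examples and their common target 3.
training_set = [
    (make_input((5, 17, 86)), 3),   # first example
    (make_input((13, 55, 58)), 3),  # second example
    (make_input((40, 87, 94)), 3),  # third example
]

# The all-ones weight vector fits every training example exactly.
w = [1] * 100
assert all(sum(wi * xi for wi, xi in zip(w, x)) == t for x, t in training_set)
```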
Why conventional neural net algorithms fail to solve this problem. Since the training set is very small, conventional perceptron algorithms will not solve this apparently simple problem: they will not achieve good generalization on unseen test data. One reason is that connections from units that are always off won't be changed at all by gradient descent algorithms. Note, however, that scaling the inputs differently is not going to improve matters. Nor is weight decay: weight decay encourages weight matrices with many zero entries, which is a bad strategy for the current task, where every weight must be nonzero.
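This failure mode can be demonstrated with a plain delta-rule sketch (learning rate and epoch count are arbitrary choices of ours, not from the paper):

```python
def net(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def make_input(on_bits):
    x = [0] * 100
    for i in on_bits:
        x[i] = 1
    return x

training_set = [(make_input(b), 3.0)
                for b in ((5, 17, 86), (13, 55, 58), (40, 87, 94))]

# Delta rule (LMS) from all-zero weights.
w = [0.0] * 100
for _ in range(200):                    # arbitrary number of epochs
    for x, target in training_set:
        err = target - net(w, x)
        for i in range(100):
            w[i] += 0.05 * err * x[i]   # arbitrary learning rate

# The training set is fitted almost perfectly ...
assert all(abs(net(w, x) - 3.0) < 1e-6 for x, _ in training_set)
# ... but the 91 weights of units that were never on are still exactly 0,
# so an unseen test input yields output 0 instead of 3.
x_test = make_input((0, 1, 2))
print(net(w, x_test))  # -> 0.0
```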
Probabilistic search. The search procedure is as follows: the probabilistic search algorithm (as described in section 3) lists and executes programs computing solution candidates (weight vectors). The primitive ``WriteWeight'' (replacing ``output'', see section 3) is used for writing network weights. It has one argument and uses the variable WeightPointer, which takes on values from the set $\{0, 1, \ldots, 99\}$ of weight indices. At the beginning of a run, WeightPointer and all weights are initialized to 0. The instruction number and the semantics of ``WriteWeight'' are given with the list of primitives in section 3.
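The exact semantics of ``WriteWeight'' are defined in section 3; the following sketch merely assumes that it writes its argument into the weight addressed by WeightPointer and then advances the pointer (class and method names are ours):

```python
# ASSUMED semantics of ``WriteWeight'', for illustration only: store the
# argument at the current WeightPointer position, then advance the pointer.

class CandidateBuilder:
    def __init__(self, n_weights=100):
        # At the beginning of a run, WeightPointer and all weights are 0.
        self.weights = [0.0] * n_weights
        self.pointer = 0

    def write_weight(self, value):
        """WriteWeight(value): store value, advance WeightPointer."""
        if self.pointer < len(self.weights):
            self.weights[self.pointer] = value
            self.pointer += 1

# A short loop of WriteWeight(1) calls builds the all-ones solution,
# illustrating why its Kolmogorov complexity is small:
b = CandidateBuilder()
for _ in range(100):
    b.write_weight(1.0)
print(b.weights == [1.0] * 100)  # -> True
```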
Only if the solution candidate fits the training data exactly is the solution tested on the test data. Note that this is like a ``reward-only-at-goal'' task: the measure of success is binary: either the network fits all the training data, or it doesn't. There is no teacher providing a more informative error signal (such as the distance to the desired outputs).
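The binary acceptance criterion can be sketched as follows (function names are ours):

```python
def make_input(on_bits):
    x = [0] * 100
    for i in on_bits:
        x[i] = 1
    return x

training_set = [(make_input(b), 3)
                for b in ((5, 17, 86), (13, 55, 58), (40, 87, 94))]

def fits_exactly(w, training_set):
    """``Reward-only-at-goal'': success is binary, no graded error signal."""
    return all(
        sum(wi * xi for wi, xi in zip(w, x)) == target
        for x, target in training_set
    )

print(fits_exactly([1] * 100, training_set))  # -> True  (proceed to test set)
print(fits_exactly([0] * 100, training_set))  # -> False (candidate rejected)
```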
RESULTS. Programs generating networks fitting the 3 training exemplars were found in 20 out of 100000 runs. 18 of these 20 led to perfect generalization on the unseen test examples. Typical programs used (conditional) jumps to generate correct solutions. See (Schmidhuber, 1994a) for details.
Comment. With the task above, probabilistic search among self-sizing programs leads to excellent generalization performance. At least in theory, however, an appropriate variant of Nowlan and Hinton's approach (1992) might achieve good generalization performance on this task, too. Nowlan and Hinton encourage groups of weights with equal values, which is a good strategy in the case above. For this reason, the following task requires that no two weights have equal values. The Kolmogorov complexity of the solution, however, will again be low.