A linear (perceptron-like) network with 100 input units,
one output unit, and 100 weights,
is fed with 100-dimensional binary input vectors.
denotes the -th input vector.
denotes the th component
of , where ranges from 0 to 99.
Each input vector has exactly three bits
set to one, all the other bits are set to zero.
Obviously, there are
The network's output in response to is
The solution. The only solution to the problem is: make all equal to 1. The Kolmogorov complexity of this solution is small, since there is a short program that computes it. Its Levin complexity is small, too, since its ``logical depth'' (the runtime of its shortest program [Bennett, 1988]) is less than 400 time steps.
The training data. To illustrate the generalization capability of search for solution candidates with low Levin complexity, only 3 training examples are used. They were randomly chosen from the possible inputs. The first training example is the binary vector with on-bits at the positions 5, 17, and 86 (and off-bits everywhere else). The second one, , has on-bits at the positions 13, 55, and 58. The third one, , has on-bits at the positions 40, 87, and 94. In all three cases, the desired output (target) is 3. Generalization results (to be described below) obtained with this particular training set are very similar to those obtained with different sets of 3 training examples (created by randomly permuting the input units that are never on).
Why conventional neural net algorithms fail to solve this problem. Since the training set is very small, conventional perceptron algorithms will not solve this apparently simple problem. They will not achieve good generalization on unseen test data. One reason is that connections from units that are always off won't be changed at all by conventional gradient descent algorithms, e.g., [Werbos, 1974,LeCun, 1985,Parker, 1985,Rumelhart et al., 1986]. Note, however, that scaling the inputs differently is not going to improve matters. Nor is weight decay. Weight decay encourages weight matrices with many zero entries. For the current task, this is a bad strategy.
Probabilistic search. The search procedure is as follows: the probabilistic search algorithm (as described in section 3) lists and executes programs computing solution candidates (weight vectors). The primitive ``WriteWeight'' (replacing ``output'', see section 3) is used for writing network weights. It has one argument and uses the variable WeightPointer taking on values from the set . In the beginning of a run, WeightPointer and all weights are initialized to 0. The instruction number and the semantics of ``WriteWeight'' are as follows (compare the list of primitives given in section 3):
Only if the solution candidate fits the training data exactly is the solution tested on the test data. Note that this is like a ``reward-only-at-goal'' task: The measure of success is binary - either the network fits all the training data, or it doesn't. There is no teacher providing a more informative error signal (such as the distance to the desired outputs).
RESULTS. Programs fitting the 3 training exemplars were found in 20 out of 100000 runs. 18 of these 20 led to perfect generalization on the unseen test examples.
The first weight vector fitting the training data was found after 904 runs. The corresponding program was a ``wild'' one, allocating a lot of space and executing many useless instructions, but still leading to perfect generalization on all the unseen test data. Before halting, the program used 702 out of 1024 allocated time steps. Its time probability was (recall that the unit time is only 16 time steps). Its space probability was .
Another weight vector fitting the training data was computed during the 6038th run. The corresponding program is given in table 1. Here is a more readable interpretation (each program instruction is preceded by its address):
(0) Write the contents of address 1 (which is 1) onto the weight pointed to by
WeightPointer and increment weight pointer. Halt if WeightPointer is out of range.
(2) If the contents of address 1 is less or equal to the contents of address 1,
goto address 0.
Since the condition tested in the second instruction is always true, this little program will write down a correct solution, given enough time. It requires 201 time steps. In the case above it got more than enough time: the randomly chosen time limit was .
After 351,168 runs, the system came up with a faster program. See table 2. The program does this:
(0) Write the contents of address 0 (which happens to be 1, due to the code of
``write'' being 1) onto the weight pointed to by WeightPointer and increment
WeightPointer. Halt if WeightPointer is out of range.
(2) Write the contents of address 0 onto the weight pointed to by WeightPointer
and increment WeightPointer. Halt if WeightPointer is out of range.
(4) If the contents of address 5 (which happens to be 5) is less than or equal
to the contents of address 5 (this is always true), goto address 0.
This program writes two times before jumping, thus reducing runtime from 200 to 150 time steps (recall that the execution of each instruction, including jumps, takes one time step). Its space probability is . Other successful programs with exactly the same runtime were found in 7 out of runs. No faster programs were found.
Comment. With the example above, probabilistic search among self-sizing programs leads to excellent generalization performance. At least in theory, however, it might be possible that an appropriate variant of Nowlan's and Hinton's approach (1992) might achieve good generalization performance on this task, too. Recall from section 3 that Nowlan and Hinton encourage groups of weights with equal values, which is a good strategy in the case above. For this reason, the following task requires that no two weights have equal values. The Kolmogorov complexity of the solution, however, will again be low.