next up previous


The following pattern association task may seem trivial but will be made difficult (for traditional approaches) by providing only very few training examples.

The task. A linear (perceptron-like) network with 100 input units, one output unit, and 100 weights, is fed with 100-dimensional binary input vectors. $x^p$ denotes the $p$-th input vector. $x^p_i$ denotes the $i$th component of $x^p$, where $i$ ranges from 0 to 99. Each input vector has exactly three bits set to one, all the other bits are set to zero. Obviously, there are ${100 \choose 3} = 161,700$ possible inputs. The network's output in response to $x^p$ is

y^p = \sum w_i x^p_i,

where $w_i$ is the $i$-th weight. Each weight may take on integer values between -10000 and 10000. The task is to find weights such that $y^p$ equals the number of on-bits in $x^p$, for all 161,700 possible $x^p$. The number of solution candidates in the search space of possible weight vectors is huge: $20001^{100}$. This is too much for exhaustive search.

The solution. The only solution to the problem is: make all $w_i$ equal to 1. The Kolmogorov complexity of this solution is small, since there is a short program that computes it. Its Levin complexity is small, too, since its ``logical depth'' (the runtime of its shortest program (Bennett, 1988)) is less than 400 time steps.

The training data. To illustrate the generalization capability of search for solution candidates with low Levin complexity, only 3 training examples are used. They were randomly chosen from the $161,700$ possible inputs. The first training example is the binary vector $x^1$ with on-bits at the positions 5, 17, and 86 (and off-bits everywhere else). The second one, $x^2$, has on-bits at the positions 13, 55, and 58. The third one, $x^3$, has on-bits at the positions 40, 87, and 94. In all three cases, the desired output (target) is 3.

Why conventional neural net algorithms fail to solve this problem. Since the training set is very small, conventional perceptron algorithms will not solve this apparently simple problem. They will not achieve good generalization on unseen test data. One reason is that connections from units that are always off won't be changed at all by gradient descent algorithms. Note, however, that scaling the inputs differently is not going to improve matters. Nor is weight decay. Weight decay encourages weight matrices with many zero entries. For the current task, this is a bad strategy.

Probabilistic search. The search procedure is as follows: the probabilistic search algorithm (as described in section 3) lists and executes programs computing solution candidates (weight vectors). The primitive ``WriteWeight'' (replacing ``output'', see section 3) is used for writing network weights. It has one argument and uses the variable WeightPointer taking on values from the set $\{0, 1, ..., 99\}$. In the beginning of a run, WeightPointer and all weights are initialized to 0. The instruction number and the semantics of ``WriteWeight'' are as follows (compare the list of primitives given in section 3):

WriteWeight(address). $w_{WeightPointer}$ is set equal to the contents of address2. The variable WeightPointer is incremented. Halt if WeightPointer out of range.

Only if the solution candidate fits the training data exactly is the solution tested on the test data. Note that this is like a ``reward-only-at-goal'' task: the measure of success is binary - either the network fits all the training data, or it doesn't. There is no teacher providing a more informative error signal (such as the distance to the desired outputs).

RESULTS. Programs generating networks fitting the 3 training exemplars were found in 20 out of 100000 runs. 18 of these 20 led to perfect generalization on the ${100 \choose 3} - 3 = 161,697$ unseen test examples. Typical programs used (conditional) jumps to generate correct solutions. See (Schmidhuber, 1994a) for details.

Comment. With the task above, probabilistic search among self-sizing programs leads to excellent generalization performance. At least in theory, however, it might be possible that an appropriate variant of Nowlan's and Hinton's approach (1992) might achieve good generalization performance on this task, too. Nowlan and Hinton encourage groups of weights with equal values, which is a good strategy in the case above. For this reason, the following task requires that no two weights have equal values. The Kolmogorov complexity of the solution, however, will again be low.

next up previous
Juergen Schmidhuber 2003-02-25

Back to Optimal Universal Search page
Back to Program Evolution page
Back to Algorithmic Information page
Back to Speed Prior page