But an IQ test requires you to answer ``10'' instead of ``34.'' Why not ``34''? The reasons are: (1) ``Simple'' solutions are preferred over ``complex'' ones. This idea is often referred to as ``Occam's razor'' (e.g., Blumer et al., 1987). (2) It is assumed that the ``simpler'' the rules, the better the generalization on test data. (3) The makers of the IQ test assume that everybody agrees on what ``simple'' means.
Similarly, many neural net and machine learning researchers agree1that learning algorithms ought to extract ``simple'' rules to explain training data. But what exactly does ``simple'' mean? One theory providing a convincing objective criterion for ``simplicity'' is the theory of Kolmogorov complexity (or algorithmic complexity). Contrary to a popular myth, the incomputability of Kolmogorov complexity (due to the halting problem) does not prevent machine learning applications, because there are tractable yet very general extensions of Kolmogorov complexity. Few neural net and machine learning researchers, however, make use of the powerful tools provided by the theory.
Purpose of paper. This work and the experiments to be presented herein are intended (1) to demonstrate that basic concepts from the theory of Kolmogorov complexity are indeed of interest for the study of neural network generalization and for machine learning purposes in general, (2) to encourage neural net researchers to study this theory, and (3) to point to some limitations of the current state of the art and to important open problems (see also [Schmidhuber, 1994]).
Outline. Section 2 briefly reviews the following basic concepts of algorithmic complexity theory relevant to machine learning: (1) Kolmogorov complexity, (2) The universal prior (or Solomonoff-Levin distribution), under which the probability of a computable object (like the solution to a problem) is essentially equal to the probability of guessing its shortest program on a universal computer, (3) Solomonoff's theory of inductive inference, and how it justifies Occam's razor, (4) The principle of minimal description length (MDL) in its general form, (5) Levin complexity (a generalization of Kolmogorov complexity) and Levin's universal optimal search algorithm. For a given computable solution to a problem, consider the negative logarithm of the probability of guessing a program that computes it, plus the logarithm of its runtime. The Levin complexity of the solution is the minimal possible value of this. Levin's universal search algorithm essentially generates and tests solution candidates (from a set of possible computable candidates) in order of their Levin complexity, until a solution is found. For a broad class of problems, universal search can be shown to be optimal with respect to total expected search time, leaving aside a constant factor which does not depend on the problem. To the author's knowledge, section 3 presents the first general implementation of (a probabilistic variant of) universal search on a conventional digital machine. It is based on efficient ``self-sizing'' programs which influence their own runtime and storage size. Simulations in section 4 focus on the task of finding ``simple'' neural nets with high generalization capability. Experiments with toy problems demonstrate that the method, at least in certain cases where it is computationally feasible, can lead to generalization results that are impossible to obtain by more traditional neural net algorithms (also briefly reviewed in section 4). Section 5 discusses the fact that universal search is not directly applicable to incremental learning situations, and mentions new ideas on how to deal with such situations. Section 6 concludes with general remarks on problem solving and Occam's razor.