
## PREVIOUS ALGORITHMS FOR MAKING NETS ``SIMPLE''

• Weight decay. An additional error term (besides the standard term enforcing matches between desired and actual outputs) encourages weights close to zero. The idea is that a near-zero weight costs few bits to specify and is therefore ``simple'' [Hinton and van Camp, 1993]. Pearlmutter and Hinton were probably the first to propose weight decay, while Rumelhart was perhaps the first to suggest its use for reducing overfitting. Variants of weight decay were successfully applied by Weigend et al. (1990), Krogh and Hertz (1992), and others.
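As a minimal sketch, the weight-decay idea amounts to adding a penalty proportional to the squared weights to the data-fitting error; the names below (a single linear unit, squared-error loss, decay strength lam) are illustrative assumptions, not the setup of any particular cited paper:

```python
import numpy as np

def loss_with_weight_decay(w, x, y, lam=1e-2):
    """Squared-error loss plus an L2 weight-decay term.
    Toy setting: a single linear unit; lam is the decay strength."""
    pred = x @ w
    data_term = 0.5 * np.sum((pred - y) ** 2)
    decay_term = 0.5 * lam * np.sum(w ** 2)   # pushes weights toward zero
    return data_term + decay_term

def grad_with_weight_decay(w, x, y, lam=1e-2):
    """Gradient: the decay term simply adds lam * w to the data gradient."""
    return x.T @ (x @ w - y) + lam * w
```

During gradient descent the extra term `lam * w` shrinks every weight a little on each step, which is where the name "decay" comes from.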

• Soft weight sharing. Nowlan and Hinton (1992) introduce an additional objective function encouraging groups of weights with nearly equal values. The weights are taken to be generated by a mixture of Gaussians. The fewer Gaussians there are and the closer a weight lies to the center of some Gaussian, the higher its probability and the fewer bits are needed to encode it (according to classical information theory [Shannon, 1948]).

• Bayesian strategies for backprop nets. MacKay evaluates hyper-parameters (such as weight-decay rates) with respect to their probabilities of generating the observed data [MacKay, 1992]. Estimates of the probabilities are computed on the basis of Gaussian assumptions.

• Optimal Brain Surgeon. Hassibi and Stork (1993) use second-order information to obtain ``simple'' nets by pruning weights whose influence on the error is minimal, and changing other weights to compensate. See [LeCun et al., 1991] for a related approach and (Vapnik, 1992; Guyon et al., 1992) for some theoretical analysis.
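A single pruning step can be sketched as follows, assuming the inverse Hessian of the error with respect to the weights is already available (computing it efficiently is the main technical contribution of the cited work, which this sketch omits):

```python
import numpy as np

def obs_prune_step(w, H_inv):
    """One Optimal Brain Surgeon step (sketch).
    w: weight vector; H_inv: inverse Hessian of the error w.r.t. w,
    assumed given and well-conditioned.
    Deletes the weight with the smallest saliency w_q^2 / (2 [H^-1]_qq)
    and adjusts all remaining weights to compensate."""
    diag = np.diag(H_inv)
    saliency = w ** 2 / (2.0 * diag)
    q = int(np.argmin(saliency))
    # second-order compensation applied to the whole weight vector
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    w_new = w + delta
    w_new[q] = 0.0   # the pruned weight is exactly zero
    return w_new, q
```

With a diagonal inverse Hessian this reduces to magnitude-based pruning; the off-diagonal terms are what let the surviving weights absorb the error caused by the deletion.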

• ``Non-algorithmic'' MDL methods based on Gaussian priors. To minimize the sum of the description lengths of a neural net and its errors, Hinton and van Camp (1993) assume Gaussian weight priors and Gaussian error distributions, and minimize the asymmetric divergence (or Kullback-Leibler distance) between prior and posterior after training.
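For the one-dimensional Gaussian case the divergence being minimized has a closed form, which also makes its asymmetry explicit (this is the standard textbook formula, not code from the cited paper):

```python
import math

def kl_gaussians(mu1, s1, mu2, s2):
    """KL divergence D( N(mu1, s1^2) || N(mu2, s2^2) ) in nats --
    the asymmetric quantity between a Gaussian posterior (mu1, s1)
    and a Gaussian prior (mu2, s2)."""
    return math.log(s2 / s1) \
        + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5
```

Note that swapping the two Gaussians changes the value in general, which is why the text calls it an asymmetric divergence rather than a distance.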

• Methods for finding ``flat'' minima. Hochreiter and Schmidhuber (1994, 1996) use efficient second order methods to search for large connected regions of ``acceptable'' error minima. This corresponds to ``simple'' networks with low-precision weight vectors, low description length and low expected overfitting.

• ``Mutual information networks.'' Deco et al. (1993) measure network complexity with respect to given data by measuring the mutual information between inputs and internal representations extracted by the hidden units.

• Methods for removing redundant information from input data. If the input data can be compressed, the networks processing the data can, in general, be made smaller (and simpler). From the standpoint of classical information theory, an optimal compression algorithm is one that builds a factorial code of the input data -- a code with statistically independent components [Barlow, 1989] (see Linsker (1988) for related work on the special, linear case). Various ``neural'' methods for compressing input data are known. [Schmidhuber, 1992b] describes a ``neural'' method designed to generate factorial codes. [Atick et al., 1992] focuses on visual inputs. [Schmidhuber, 1992a] focuses on loss-free sequence compression. [Becker, 1991] provides numerous additional references.

There are also numerous heuristic constructive methods, where network size grows in case of underfitting of the training data. MDL approaches in other areas of machine learning include [Quinlan and Rivest, 1989; Gao and Li, 1989; Milosavljevic and Jurka, 1993; Pednault, 1989]. None of the implemented methods, however, whether neural or not, is general in the sense of Solomonoff, Kolmogorov, and Levin. All the previous implementations use measures of ``simplicity'' that lack the universality and elegance of those based on Kolmogorov complexity and algorithmic information theory, and many are based on ad-hoc (usually Gaussian) priors.

The remainder of this paper is mostly devoted to simulations of the more general method based on the universal prior, self-sizing programs, and the probabilistic search algorithm favoring candidates with low Levin complexity over candidates with high Levin complexity. Using certain seemingly trivial but actually non-trivial toy problems, it will be demonstrated that the approach can lead to generalization results that more traditional neural net algorithms cannot match. It should be mentioned, however, that this by itself does not say much about the applicability of the method to real-world tasks.
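The candidate ordering that favors low Levin complexity can be sketched in the style of Levin's universal search: in phase i, each candidate program p receives a time budget proportional to 2^(i - l(p)), so short programs are tried first and get exponentially more runtime. The interfaces below (`programs` mapping candidate names to bit lengths, `run(name, steps)` returning a solution or None) are assumptions for this illustration, not the paper's actual implementation:

```python
def levin_search(programs, run, phase_limit=20):
    """Sketch of Levin-style universal search.
    programs: dict mapping a candidate name to its description length in bits.
    run(name, steps): executes the candidate for at most `steps` steps and
    returns a solution or None (both interfaces assumed for illustration)."""
    for i in range(1, phase_limit + 1):
        for name, length in programs.items():
            budget = 2 ** (i - length)      # short programs get more time
            if budget < 1:
                continue                     # candidate too long for this phase
            result = run(name, budget)
            if result is not None:
                return name, i, result       # first success wins
    return None
```

A candidate succeeding after t steps with description length l is found in phase roughly l + log2(t), mirroring the Levin complexity l(p) + log t(p) that the search implicitly minimizes.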
Juergen Schmidhuber 2003-02-12
