

INTRODUCTION

Previous algorithms for finding low-complexity networks with high generalization capability rely on significant prior assumptions. They can be broadly classified as follows: (1) Assumptions about the prior weight distribution. Hinton and van Camp [3] and Williams [13] assume that pushing the posterior distribution (after learning) close to the prior leads to ``good'' generalization. Weight decay can be derived, e.g., from Gaussian priors. Nowlan and Hinton [8] assume that networks with many similar weights generated by Gaussian mixtures are ``better'' a priori. MacKay's priors [6] are implicit in additional penalty terms, which embody the assumptions made. (2) Prior assumptions about how theoretical results on early stopping and network complexity carry over to practical applications. Examples are methods based on validation sets (see []), Vapnik's ``structural risk minimization'' [1, 11], and the methods of Holden [5] and Wang et al. [12]. Our approach requires fewer prior assumptions than most other approaches (see appendix A.1).

Basic idea of flat minima search. Our algorithm finds a large region in weight space with the property that each weight vector from that region has similarly small error. Such regions are called ``flat minima''. To get an intuitive feeling for why ``flat'' minima are interesting, consider this (see also Wolpert [14]): a ``sharp'' minimum corresponds to weights that have to be specified with high precision, whereas a ``flat'' minimum corresponds to weights many of which can be given with low precision. In the terminology of minimum description length (MDL) theory, fewer bits of information are required to pick a ``flat'' minimum (corresponding to a ``simple'', low-complexity network). The MDL principle suggests that low network complexity corresponds to high generalization performance (see e.g. [4,10]). Unlike Hinton and van Camp's method [3] (see appendix A.3), our approach does not depend on explicitly choosing a ``good'' prior.
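
To make the precision argument concrete, the following toy sketch (not taken from the paper; the one-dimensional error functions, the tolerance eps, and the sampling grid are all hypothetical) measures how wide an interval of weight values keeps the error within a small tolerance of its minimum. The ``flat'' minimum admits a much wider interval, so its weight can be stored with correspondingly fewer bits.

import jax.numpy as jnp

sharp = lambda w: 100.0 * w ** 2   # hypothetical "sharp" minimum at w = 0
flat = lambda w: 0.01 * w ** 2     # hypothetical "flat" minimum at w = 0
eps = 1e-3                         # tolerated increase of the error above its minimum

ws = jnp.linspace(-1.0, 1.0, 20001)                     # candidate weight values in [-1, 1]
width = lambda f: 2.0 * float(jnp.mean(f(ws) <= eps))   # width of the low-error interval

print("sharp minimum: low-error interval of width", width(sharp))  # roughly 0.006
print("flat minimum:  low-error interval of width", width(flat))   # roughly 0.63

Specifying a weight to within the wider interval takes about log2(0.63 / 0.006), i.e. roughly 7, fewer bits than specifying it to within the narrow one; this is the sense in which a flat minimum is ``cheaper'' to describe.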

Our algorithm finds ``flat'' minima by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using Pearlmutter's and Møller's efficient second-order method [,7], we obtain the same order of complexity as with conventional backprop. The method automatically reduces the number of units, weights, and input lines, as well as the sensitivity of outputs with respect to the remaining weights and units. Excellent experimental generalization results will be reported in section 4.
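
To illustrate why no explicit Hessian has to be stored, the sketch below (not the authors' code; the toy quadratic loss, the parameter vector w, and the direction v are made-up placeholders) computes an exact Hessian-vector product by forward-mode differentiation of the gradient, in the spirit of Pearlmutter's R-operator. Each such product costs a small constant multiple of one backprop pass, which is consistent with the backprop-order complexity claimed above.

import jax
import jax.numpy as jnp

def loss(w):
    # hypothetical training error of a tiny linear model on fixed toy data
    x = jnp.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = jnp.array([1.0, 1.0, 0.0])
    return jnp.mean((x @ w - y) ** 2)

def hvp(f, w, v):
    # R-operator style trick: differentiate the gradient of f at w in the direction v
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

w = jnp.array([0.3, -0.2])
v = jnp.array([1.0, 0.0])
print(hvp(loss, w, v))   # the product H(w) v, without ever forming H itself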



