Basic idea of flat minima search. Our algorithm finds a large region in weight space with the property that each weight vector from that region leads to similar small error. Such regions are called ``flat minima''. To get an intuitive feeling for why ``flat'' minima are interesting, consider the following (see also Wolpert [14]): a ``sharp'' minimum corresponds to weights which have to be specified with high precision, whereas a ``flat'' minimum corresponds to weights many of which can be given with low precision. In the terminology of the theory of minimum description length (MDL), fewer bits of information are required to pick a ``flat'' minimum (corresponding to a ``simple'' or low-complexity network). The MDL principle suggests that low network complexity corresponds to high generalization performance (see e.g. [4,10]). Unlike Hinton and van Camp's method [3] (see appendix A.3), our approach does not depend on explicitly choosing a ``good'' prior.
Our algorithm finds ``flat'' minima by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using Pearlmutter's and Møller's efficient second order method [,7], we obtain the same order of complexity as with conventional backprop. The method automatically and effectively reduces the number of units, weights, and input lines, as well as the sensitivity of outputs with respect to the remaining weights and units. Excellent experimental generalization results are reported in section 4.
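To illustrate why the required second-order information does not raise the complexity above that of conventional backprop, the following sketch shows Pearlmutter's trick for computing a Hessian-vector product Hv without ever forming the Hessian, at the cost of roughly one extra forward/backward pass. This is not the paper's implementation; the toy quadratic loss and all variable names are our own, and we use JAX's forward-over-reverse differentiation as a modern stand-in for the method cited above.

```python
# Hypothetical sketch: Hessian-vector product via Pearlmutter's trick.
# The loss below is a toy stand-in for a network's training error E(w).
import jax
import jax.numpy as jnp

def loss(w):
    # toy quadratic error; Hessian is diag(2 * x_i^2)
    x = jnp.array([1.0, 2.0, 3.0])
    return jnp.sum((w * x - 1.0) ** 2)

def hvp(f, w, v):
    # forward-over-reverse: differentiate grad(f) along direction v;
    # cost is a small constant factor times one gradient evaluation
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

w = jnp.array([0.5, -0.2, 0.1])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(loss, w, v))  # equals H @ v = [2, 0, 0] for this quadratic loss
```

Because only products Hv are ever needed (e.g. to estimate the sensitivity of the error to weight perturbations), the flatness search stays within a constant factor of the cost of ordinary gradient descent.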