Our algorithm tries to find a large region in weight space with the property that each weight vector from that region leads to similar small error. Such a region is called a ``flat minimum'' (Hochreiter & Schmidhuber, 1995). To get an intuitive feeling for why a flat minimum is interesting, consider this: a ``sharp'' minimum (see figure 2) corresponds to weights which have to be specified with high precision. A flat minimum (see figure 1) corresponds to weights many of which can be given with low precision. In the terminology of the theory of minimum description (message) length (MML, Wallace, 1968; MDL, Rissanen, 1978), fewer bits of information are required to describe a flat minimum (corresponding to a ``simple'' or low complexitynetwork). The MDL principle suggests that low network complexity corresponds to high generalization performance. Similarly, the standard Bayesian view favors ``fat'' maxima of the posterior weight distribution (maxima with a lot of probability mass  see, e.g., Buntine & Weigend, 1991). We will see: flat minima are fat maxima.
figure=f.ps,angle=0,width=1.1

figure=p.ps,angle=0,width=1.1

Unlike, e.g., Hinton and van Camp's method (1993), our algorithm does not depend on the choice of a ``good'' weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second order method (Pearlmutter, 1994; Mller, 1993), we obtain conventional backprop's order of computational complexity. Automatically, the method effectively reduces numbers of units, weights, and input lines, as well as output sensitivity with respect to remaining weights and units. Unlike, e.g., simple weight decay, our method automatically treats/prunes units and weights in different layers in different reasonable ways.
Outline.
Using a variant of the Gibbs algorithm, appendix A.1 defines generalization, underfitting and overfitting error in a novel way. By defining an appropriate prior over inputoutput functions, we postulate that the most probable network is a ``flat'' one.
Appendix A.2 formally justifies the error function minimized by our algorithm.
Appendix A.3 describes an efficient implementation of the algorithm.
Appendix A.4 finally presents pseudo code of the algorithm.