
BASIC IDEAS / OUTLINE

Our algorithm tries to find a large region in weight space with the property that each weight vector from that region leads to similar small error. Such a region is called a ``flat minimum'' (Hochreiter & Schmidhuber, 1995). To get an intuitive feeling for why a flat minimum is interesting, consider this: a ``sharp'' minimum (see Figure 2) corresponds to weights that have to be specified with high precision, whereas a flat minimum (see Figure 1) corresponds to weights many of which can be given with low precision. In the terminology of the theory of minimum description (message) length (MML, Wallace, 1968; MDL, Rissanen, 1978), fewer bits of information are required to describe a flat minimum, which corresponds to a ``simple'' or low-complexity network. The MDL principle suggests that low network complexity corresponds to high generalization performance. Similarly, the standard Bayesian view favors ``fat'' maxima of the posterior weight distribution (maxima with a lot of probability mass; see, e.g., Buntine & Weigend, 1991). We will see: flat minima are fat maxima.

Figure 1: Example of a ``flat'' minimum.

Figure 2: Example of a ``sharp'' minimum.
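To make the contrast between the two figures concrete, here is a minimal toy sketch (not the paper's algorithm): near a flat minimum the error barely changes when the weights are stored with low precision, while near a sharp minimum the same imprecision is costly. The two 1-D loss functions below are hypothetical and chosen only to make the difference visible.

import jax
import jax.numpy as jnp

# Hypothetical 1-D losses, both minimized at w = 0.
flat_loss  = lambda w: 0.01 * w ** 2   # low curvature: a "flat" minimum
sharp_loss = lambda w: 10.0 * w ** 2   # high curvature: a "sharp" minimum

key = jax.random.PRNGKey(0)
# Perturbations standing in for weights specified with low precision.
noise = 0.1 * jax.random.normal(key, (1000,))

print("flat  minimum, mean extra error:", jnp.mean(flat_loss(noise)))
print("sharp minimum, mean extra error:", jnp.mean(sharp_loss(noise)))

The flat minimum tolerates the imprecise weights at almost no cost in error, which is why it can be described with fewer bits.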

Unlike, e.g., Hinton and van Camp's method (1993), our algorithm does not depend on the choice of a ``good'' weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second-order method (Pearlmutter, 1994; Møller, 1993), we obtain conventional backprop's order of computational complexity. The method automatically and effectively reduces the number of units, weights, and input lines, as well as output sensitivity with respect to the remaining weights and units. Unlike, e.g., simple weight decay, our method automatically treats/prunes units and weights in different layers in different, reasonable ways.
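The key to the complexity claim is that second-order information can be obtained as Hessian-vector products without ever forming the Hessian, in the spirit of Pearlmutter's (1994) R-operator. The following is a minimal sketch, not the paper's implementation; the tiny network, data, and probing direction are hypothetical placeholders.

import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Toy one-layer network with squared error (placeholder model).
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

def hessian_vector_product(w, v, x, y):
    # Forward-over-reverse differentiation: differentiate the gradient
    # in direction v. The cost is a small constant factor times one
    # backprop pass, so second-order information stays at backprop's
    # order of computational complexity.
    grad_fn = lambda w_: jax.grad(loss)(w_, x, y)
    _, hvp = jax.jvp(grad_fn, (w,), (v,))
    return hvp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 5))      # hypothetical inputs
y = jnp.sin(x.sum(axis=1))               # hypothetical targets
w = 0.1 * jax.random.normal(key, (5,))   # weight vector
v = jnp.ones_like(w)                     # direction along which curvature is probed

print(hessian_vector_product(w, v, x, y))

Probing curvature along chosen directions in this way is what lets an algorithm penalize sharp directions (high required weight precision) without the memory and time cost of the full Hessian.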



Outline.


Juergen Schmidhuber 2003-02-13

