
BASIC IDEAS / OUTLINE

Our algorithm tries to find a large region in weight space with the property that each weight vector from that region leads to similar small error. Such a region is called a ``flat minimum'' (Hochreiter & Schmidhuber, 1995). To get an intuitive feeling for why a flat minimum is interesting, consider this: a ``sharp'' minimum (see Figure 2) corresponds to weights that have to be specified with high precision, whereas a flat minimum (see Figure 1) corresponds to weights many of which can be given with low precision. In the terminology of the theory of minimum description (message) length (MML, Wallace, 1968; MDL, Rissanen, 1978), fewer bits of information are required to describe a flat minimum, which corresponds to a ``simple'' or low-complexity network. The MDL principle suggests that low network complexity corresponds to high generalization performance. Similarly, the standard Bayesian view favors ``fat'' maxima of the posterior weight distribution (maxima with a lot of probability mass; see, e.g., Buntine & Weigend, 1991). We will see: flat minima are fat maxima.

Figure 1: Example of a ``flat'' minimum.

Figure 2: Example of a ``sharp'' minimum.
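To make the contrast between the two figures concrete, here is a minimal toy sketch (not the paper's algorithm): near a flat minimum the error barely changes when the weights are stored with low precision, while near a sharp minimum the same imprecision is costly. The two 1-D loss functions below are hypothetical and chosen only to make the difference visible.

import jax
import jax.numpy as jnp

# Hypothetical 1-D losses, both minimized at w = 0.
flat_loss  = lambda w: 0.01 * w ** 2   # low curvature: a "flat" minimum
sharp_loss = lambda w: 10.0 * w ** 2   # high curvature: a "sharp" minimum

key = jax.random.PRNGKey(0)
# Perturbations standing in for weights specified with low precision.
noise = 0.1 * jax.random.normal(key, (1000,))

print("flat  minimum, mean extra error:", jnp.mean(flat_loss(noise)))
print("sharp minimum, mean extra error:", jnp.mean(sharp_loss(noise)))

The flat minimum tolerates the imprecise weights at almost no cost in error, which is why it can be described with fewer bits.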

Unlike, e.g., Hinton and van Camp's method (1993), our algorithm does not depend on the choice of a ``good'' weight prior. It finds a flat minimum by searching for weights that minimize both training error and weight precision. This requires the computation of the Hessian. However, by using an efficient second-order method (Pearlmutter, 1994; Møller, 1993), we obtain conventional backprop's order of computational complexity. The method automatically and effectively reduces the number of units, weights, and input lines, as well as output sensitivity with respect to the remaining weights and units. Unlike, e.g., simple weight decay, our method automatically treats/prunes units and weights in different layers in different, reasonable ways.
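The key to the complexity claim is that second-order information can be obtained as Hessian-vector products without ever forming the Hessian, in the spirit of Pearlmutter's (1994) R-operator. The following is a minimal sketch, not the paper's implementation; the tiny network, data, and probing direction are hypothetical placeholders.

import jax
import jax.numpy as jnp

def loss(w, x, y):
    # Toy one-layer network with squared error (placeholder model).
    pred = jnp.tanh(x @ w)
    return jnp.mean((pred - y) ** 2)

def hessian_vector_product(w, v, x, y):
    # Forward-over-reverse differentiation: differentiate the gradient
    # in direction v. The cost is a small constant factor times one
    # backprop pass, so second-order information stays at backprop's
    # order of computational complexity.
    grad_fn = lambda w_: jax.grad(loss)(w_, x, y)
    _, hvp = jax.jvp(grad_fn, (w,), (v,))
    return hvp

key = jax.random.PRNGKey(0)
x = jax.random.normal(key, (32, 5))      # hypothetical inputs
y = jnp.sin(x.sum(axis=1))               # hypothetical targets
w = 0.1 * jax.random.normal(key, (5,))   # weight vector
v = jnp.ones_like(w)                     # direction along which curvature is probed

print(hessian_vector_product(w, v, x, y))

Probing curvature along chosen directions in this way is what lets an algorithm penalize sharp directions (high required weight precision) without the memory and time cost of the full Hessian.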



Outline.


Juergen Schmidhuber 2003-02-13

