Next: MDL-JUSTIFICATION OF FLATNESS CONDITION Up: FLAT MINIMA NEURAL COMPUTATION Previous: THE ALGORITHM

# DERIVATION OF THE ALGORITHM

Outline. We are interested in weights representing nets with tolerable error but flat outputs (see section 2 and appendix A.1). To find nets with flat outputs, two conditions will be defined to specify for and, as a consequence, (see section 3). The first condition ensures flatness. The second condition enforces equal flatness'' in all weight space directions, to obtain low variance of the net functions induced by weights within a box. The second condition will be justified using an MDL-based argument. In both cases, linear approximations will be made (to be justified in A.2).

Formal details. We are interested in weights causing tolerable error (see acceptable minima'' in section 2) that can be perturbed without causing significant output changes, thus indicating the presence of many neighboring weights leading to the same net function. By searching for the boxes from section 2, we are actually searching for low-error weights whose perturbation does not significantly change the net function.

In what follows we treat the input as fixed: for convenience, we suppress , i.e. we abbreviate by . Perturbing the weights by (with components ), we obtain , where expresses 's dependence on (in what follows, however, often will be suppressed for convenience, i.e. we abbreviate by ). Linear approximation (justified in A.2) gives us Flatness Condition 1'':

 (3)

where defines tolerable output changes within a box and is small enough to allow for linear approximation (it does not appear in 's and 's gradient, see section 3). is 's linear approximation, and is . Flatness condition 1 is a robustness condition'' (or fault tolerance condition'', or perturbation tolerance condition'' - see, e.g., Minai & Williams, 1994; Murray & Edwards, 1993; Neti et al., 1992; Matsuoka, 1992; Bishop, 1993; Kerlirzin & Vallet, 1993; Carter et al., 1990).

Many boxes satisfy flatness condition 1. To select a particular, very flat , the following Flatness Condition 2'' uses up degrees of freedom left by inequality (3):

 (4)

Flatness condition 2 enforces equal directed errors''
, where has the obvious meaning, and is the -th component of . Linear approximation is justified by the choice of in inequality (3). As will be seen in the MDL-justification to be presented below, flatness condition 2 favors the box which minimizes the mean perturbation error within the box. This corresponds to minimizing the variance of the net functions induced by weights within the box (recall that is quadratic).

How to derive the algorithm from flatness conditions 1 and 2. We first solve equation (4) for (fixing for all ). Then we insert the (with fixed ) into inequality (3) (replacing the second '' in (3) by '', since we search for the box with maximal volume). This gives us an equation for the (which depend on , but this is notationally suppressed):

 (5)

The ( is replaced by ) approximate the from section 2. The box is approximated by , the box with center and edge lengths . 's volume is approximated by 's box volume . Thus, (see section 3) can be approximated by . This immediately leads to the algorithm given by equation (1).

How can the above approximations be justified? The learning process itself enforces their validity (see A.2). Initially, the conditions above are valid only in a very small environment of an initial'' acceptable minimum. But during search for new acceptable minima with more associated box volume, the corresponding environments are enlarged. Appendix A.2 will prove this for feedforward nets (experiments indicate that this appears to be true for recurrent nets as well).

Comments. Flatness condition 2 influences the algorithm as follows: (1) The algorithm prefers to increase the 's of weights whose current contributions are not important to compute the target output. (2) The algorithm enforces equal sensitivity of all output units with respect to weights of connections to hidden units. Hence, output units tend to share hidden units, i.e., different hidden units tend to contribute equally to the computation of the target. The contributions of a particular hidden unit to different output unit activations tend to be equal, too.

Flatness condition 2 is essential: flatness condition 1 by itself corresponds to nothing more but first order derivative reduction (ordinary sensitivity reduction). However, as mentioned above, what we really want is to minimize the variance of the net functions induced by weights near the actual weight vector.

Automatically, the algorithm treats units and weights in different layers differently, and takes the nature of the activation functions into account.

Subsections

Next: MDL-JUSTIFICATION OF FLATNESS CONDITION Up: FLAT MINIMA NEURAL COMPUTATION Previous: THE ALGORITHM
Juergen Schmidhuber 2003-02-13

Back to Financial Forecasting page