
## A.1. FLAT NETS: THE MOST PROBABLE HYPOTHESES

Short Guide Through Appendix A.1. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.

More Detailed Guide Through Appendix A.1.

• After listing relevant definitions we will introduce a somewhat unconventional variant of the Gibbs algorithm, designed to take into account that FMS uses only the training data to determine , a distribution over the set of hypotheses expressing our prior belief in hypotheses (here we do not care where the data came from -- this will be treated later).

• This variant of the Gibbs algorithm will help us to introduce the concept of the ``expected extended generalization error'', which can be split into an ``overfitting error'' (relevant for measuring whether the learning algorithm focuses too much on the training set) and an ``underfitting error'' (relevant for measuring whether the algorithm sufficiently approximates the training set). To obtain these errors, we measure the Kullback-Leibler distance between the posterior after training on the training set and the posterior after (hypothetical) training on all data (in both cases, the prior belief derived from the training set is used). The overfitting error measures the information conveyed by the former posterior, but not by the latter. The underfitting error measures the information conveyed by the latter, but not by the former.

• We then introduce the ``tolerable error level'' and the ``set of acceptable minima''. The latter contains hypotheses with low underfitting error, assuming that the training set indeed conveys information about the test set (every training-set-error reducing algorithm makes this assumption). In the remainder of the appendix, we will focus only on hypotheses within the set of acceptable minima.

• We introduce the ``relative overfitting error'', which is the relative contribution of a hypothesis to the mean overfitting error on the set of acceptable minima. The relative overfitting error measures the overfitting error of hypotheses with low underfitting error. The goal is to find a hypothesis with low overfitting error and, consequently, with low generalization error.

• The relative overfitting error is approximated based on the trade-off between low training set error and large prior belief. The prior distribution is restricted to the set of acceptable minima, to obtain a normalized distribution over that set.

• We then assume the data is obtained from a target chosen according to a given prior distribution. Using previously introduced distributions, we derive the expected test set error and the expected relative overfitting error. We want to reduce the latter by an appropriate choice of prior belief.

• The special case of noise free data is considered.

• To be able to minimize the expected relative overfitting error, we need to adopt a certain prior belief. The only unknown distributions required to determine it describe how (noisy) data is obtained from the target. We have to make the following assumptions: the choice of prior belief is ``appropriate'', the noise on data drawn from the target has mean 0, and small noise is more probable than large noise (the noise assumptions ensure that reducing the training error -- by choosing a hypothesis from the set of acceptable minima -- reduces the expected underfitting error). We don't need Gaussian assumptions, though.

• We show that FMS approximates our special variant of the Gibbs algorithm: the prior is approximated locally in weight space, and flat are approximated by flat with near in weight space.

Definitions. Let be the set of all possible input/output pairs (pairs of vectors). Let be the set of functions that can be implemented by the network. For every net function we have . Elements of are parameterized with a parameter vector from the set of possible parameters . is a function which maps a parameter vector onto a net function (this map is surjective). Let be the set of target functions , where . Let be the set of hypothesis functions , where . For simplicity, take all sets to be finite, and let all functions map each to some . Values of functions with argument are denoted by . We have .

Let be the data, where . is divided into a training set and a test set . For the moment, we are not interested in how was obtained.

We use the squared error, where the norm is the Euclidean norm.
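As a toy illustration of this error measure, the squared error of a hypothesis on a finite set of input/output pairs can be sketched as follows (the names `sq_error` and `h` are ours, not the paper's):

```python
import numpy as np

# Sketch: squared error of a hypothesis on a set of input/output
# pairs, using the Euclidean norm as in the text.
def sq_error(hypothesis, data):
    """Sum of squared Euclidean distances between h(x) and y."""
    return sum(float(np.sum((hypothesis(x) - y) ** 2)) for x, y in data)

# hypothetical example: a linear hypothesis on two exemplars
h = lambda x: 2.0 * x
data = [(np.array([1.0]), np.array([2.0])),   # exact fit, error 0
        (np.array([1.0]), np.array([3.0]))]   # off by 1, error 1
err = sq_error(h, data)
```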

Learning. We use a variant of the Gibbs formalism (see Opper & Haussler, 1991, or Levin et al., 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low , according to some conditional distribution over . is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on ), we may take a look at the training set before selecting . For instance, one training set may suggest linear functions as being more probable than others, another one splines, etc. The unconventional Gibbs variant is appropriate because FMS uses only (the set of first components of 's elements, see section 3) to compute the flatness of . The trade-off between the desire for low and the a priori belief in a hypothesis according to is governed by a positive constant (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm).

We obtain , the learning algorithm applied to data :

 (6)

where
 (7)

is the ``error momentum generating function'', the ``weighted accessible volume in configuration space'', or the ``partition function'' (from statistical mechanics).
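As a concrete (toy) sketch of this Gibbs variant: over a finite hypothesis set, the posterior is the prior times exp(-beta * training set error), normalized by the partition function. Function and variable names below are ours, not the paper's:

```python
import numpy as np

# Sketch of the Gibbs posterior over a finite hypothesis set:
# p(h) proportional to prior(h) * exp(-beta * E(h, training set)),
# with the normalizer Z playing the role of the partition function.
def gibbs_posterior(prior, train_errors, beta):
    unnorm = prior * np.exp(-beta * train_errors)
    Z = unnorm.sum()              # the "partition function"
    return unnorm / Z

prior = np.array([0.25, 0.25, 0.25, 0.25])   # prior belief in hypotheses
errors = np.array([0.1, 0.5, 1.0, 2.0])      # training set errors
p_low_beta = gibbs_posterior(prior, errors, beta=0.1)    # near-uniform
p_high_beta = gibbs_posterior(prior, errors, beta=50.0)  # peaked at h0
```

Large beta (low temperature, little stochasticity) concentrates the posterior on low-error hypotheses; beta near 0 recovers the prior, matching the trade-off governed by the inverse temperature described above.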

For theoretical purposes, assume we know and may use it for learning. To learn, we use the same distribution as above (prior belief in some hypotheses is based exclusively on the training set). There is a reason why we do not use instead: does not allow for making a distinction between a better prior belief in hypotheses and a better approximation of the test set data. However, we are interested in how performs on the test set data . We obtain

 (8)

where
 (9)

The subscript indicates that the prior belief is chosen based on the training set only.

Expected extended generalization error. We define the expected extended generalization error on the unseen test exemplars :

 (10)

Here is the mean error on after learning with , minus the mean error on after learning with . The second (negative) term is a lower bound (due to non-zero temperature) for the error on after learning the training set . Note: for the zero temperature limit we get (summation convention explained at the end of this paragraph)
, where . In this case, the generalization error depends on , restricted to those hypotheses compatible with (). For (full stochasticity), we get .
Summation convention: in general, denotes summation over those satisfying and . In what follows, we will keep an analogous convention: the first symbol is the running index, for which additional expressions specify conditions.

Overfitting and underfitting error. Let us separate the generalization error into an overfitting error and an underfitting error (in analogy to Wang et al., 1994; and Guyon et al., 1992). We will see that overfitting and underfitting error correspond to the two different error terms in our algorithm: decreasing one term is equivalent to decreasing , decreasing the other is equivalent to decreasing . Using the Kullback-Leibler distance (Kullback, 1959), we measure the information conveyed by , but not by (see figure 4). We may view this as information about : since there are more which are compatible with than there are which are compatible with , 's influence on is stronger than its influence on . To get the non-stochastic bias (see definition of ), we divide this information by and obtain the overfitting error:

 (11)

[Figure 4: the overfitting error (file o.ps).]
[Figure 5: the underfitting error (file u.ps).]

Analogously, we measure the information conveyed by , but not by (see figure 5). This information is about . To get the non-stochastic bias (see definition of ), we divide this information by and obtain the underfitting error:

 (12)

Peaks in which do not match peaks of produced by lead to overfitting error. Peaks of produced by which do not match peaks of lead to underfitting error. Overfitting and underfitting error tell us something about the shape of with respect to , i.e., to what degree is the prior belief in compatible with .

Why are they called ``overfitting'' and ``underfitting'' error? Positive contributions to the overfitting error are obtained where peaks of do not match (or are higher than) peaks of : there some will have large probability after training on but will have lower probability after training on all data . This is either because has been approximated too closely, or because of sharp peaks in -- the learning algorithm specializes either on or on (``overfitting''). The specialization on will become even worse if is corrupted by noise -- the case of noisy will be treated later. Positive contributions to the underfitting error are obtained where peaks of do not match (or are higher than) peaks of : there some will have large probability after training on all data , but will have lower probability after training on . This is either due to a poor approximation (note that is almost fully determined by ), or to insufficient information about conveyed by (``underfitting''). Either the algorithm ``did not learn enough'' of , or does not tell us anything about . In the latter case, there is nothing we can do -- we have to focus on the case where we did not learn enough about .
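The KL-based split can be sketched numerically. Following the verbal description above, mass placed by the training-set posterior where the all-data posterior has little mass contributes to overfitting, and vice versa; all names and the toy numbers below are ours:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance: sum_h p(h) log(p(h)/q(h))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

beta = 2.0
p_train = np.array([0.7, 0.2, 0.1])  # posterior after training set only
p_all   = np.array([0.4, 0.4, 0.2])  # posterior after all data

# dividing by beta removes the non-stochastic bias, as in the text
overfit  = kl(p_train, p_all) / beta  # info in p_train not in p_all
underfit = kl(p_all, p_train) / beta  # info in p_all not in p_train
```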

Analysis of overfitting and underfitting error. Note: for the zero temperature limit, there is no underfitting error. For full stochasticity, recall that the quantity in question is not the conventional but the extended expected generalization error.

Since , holds. In what follows, averages after learning on are denoted by , and averages after learning on are denoted by .
Since , we have
.
Analogously, we have .
Thus, , and
With large , after learning on , measures the difference between average test set error and a minimal test set error. With large , after learning on , measures the difference between average test set error and a maximal test set error. So assume we do have a large (large enough to exceed the minimum of ). We have to assume that indeed conveys information about the test set: preferring hypotheses with small by using a larger leads to smaller test set error (without this assumption no error decreasing algorithm would make sense). can be decreased by enforcing less stochasticity (by further increasing ), but this will increase . Likewise, decreasing (enforcing more stochasticity) will decrease but increase . Increasing decreases the maximal test set error after learning more than it decreases the average test set error, thus decreasing , and vice versa. Decreasing increases the minimal test set error after learning more than it increases the average test set error, thus decreasing , and vice versa. This is the above-mentioned trade-off between stochasticity and fitting the training set, governed by .

Tolerable error level / Set of acceptable minima. Let us implicitly define a tolerable error level which, with confidence , is the upper bound of the training set error after learning.

 (13)

With ()-confidence, we have after learning. decreases with increasing . Now we define , which is the set of acceptable minima -- see section 2. The set of acceptable minima is a set of hypotheses with low underfitting error. With probability , the learning algorithm selects a hypothesis from . Note: for the zero temperature limit we have
and . By fixing a small (or a large ), will be forced to be low.
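As a sketch (names ours), the set of acceptable minima is simply the subset of hypotheses whose training set error does not exceed the tolerable error level:

```python
import numpy as np

# Sketch: the set of acceptable minima contains every hypothesis whose
# training set error lies at or below the tolerable error level e_tol.
def acceptable_minima(train_errors, e_tol):
    """Indices of hypotheses whose training set error is <= e_tol."""
    train_errors = np.asarray(train_errors, float)
    return np.flatnonzero(train_errors <= e_tol)

errors = np.array([0.05, 0.30, 0.08, 0.90])   # toy training set errors
M = acceptable_minima(errors, e_tol=0.10)     # hypotheses 0 and 2
```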

We would like to have an algorithm decreasing (1) training set error (this corresponds to decreasing underfitting error), and (2) an additional error term, which should be designed to ensure low overfitting error, given a fixed small . The remainder of this section will lead to an answer for the question: how to design this additional error term? Since low underfitting is obtained by selecting a hypothesis from , in what follows we will focus on only. Using an appropriate choice of prior belief, at the end of this section, we will finally see that the overfitting error can be reduced by an error term expressing preference for flat nets.

Relative overfitting error. Let us formally define the relative overfitting error , which is the relative contribution of some to the mean overfitting error of hypotheses set :

 (14)

where
 (15)

for , and zero otherwise.

For hypotheses in the set of acceptable minima, we approximate the overfitting error as follows. We assume that the posterior is large where the prior belief is large (the trade-off between low training set error and prior belief), and that the training set error is small within the set of acceptable minima. The relative overfitting error can now be approximated by

 (16)

To obtain a distribution over the set of acceptable minima, we introduce the corresponding normalized distribution, i.e., the prior belief restricted to the set of acceptable minima. For approximation (16) we have
 (17)
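Restricting a distribution to the set of acceptable minima and renormalizing can be sketched as follows (names ours):

```python
import numpy as np

# Sketch: zero out probability mass outside the set of acceptable
# minima and renormalize, yielding a distribution over that set only.
def restrict(p, acceptable):
    q = np.zeros_like(p, dtype=float)
    q[acceptable] = p[acceptable]
    return q / q.sum()

p = np.array([0.5, 0.2, 0.2, 0.1])
q = restrict(p, np.array([0, 2]))   # mass only on hypotheses 0 and 2
```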

Prior belief in and . Assume was obtained from a target function . Let be the prior on targets and the probability of obtaining with a given . We have

 (18)

where .

The data is drawn from a target function with added noise (the noise-free case is treated below). We don't make any assumptions about the nature of the noise -- it does not have to be Gaussian (like, e.g., in MacKay's work, 1992b).

We want to select a which makes small, i.e., those with small should have high probabilities .

We don't know during learning. is assumed to be drawn from a target . We compute the expectation of , given . The probability of the test set , given , is

 (19)

where we assume (we don't remember which exemplars were already drawn). The expected test set error for some , given , is
 (20)

The expected relative overfitting error is obtained by inserting equation (20) into equation (17):

 (21)

Minimizing expected relative overfitting error. We define a such that has its largest value near small expected test set error (see (17) and (20)). This definition leads to a low expectation of (see equation (21)). Define

 (22)

where is the Dirac delta function, which we will use with loose formalism -- the context will make clear how the delta function is used.

Using equation (20) we get

 (23)

determines the hypothesis from that leads to lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error.

helps us to define :

 (24)

where for , and where is a small constant ensuring positive probability for all hypotheses .

To appreciate the importance of the prior in the definition of (see also equation (29)), in what follows, we will focus on the noise-free case.

The special case of noise-free data. Let be equal to (up to a normalizing constant):

 (25)

Assume . Let be the number of elements in . We expand from equation (20):
 (26)

Here , , and . The factor results from considering the mean test set error (where the test set is drawn from ), whereas is the maximal test set error (obtained by using a maximal test set). From (20) and (26), we obtain the expected test set error for some , given :
 (27)

From (27) and (17), we obtain the expected :

 (28)

For we obtain in this noise-free case

 (29)

This yields the lowest expected test set error; see equation (27).

Noisy data and noise-free data: conclusion. For both the noise-free and the noisy case, equation (18) shows that given and , the expected test set error depends on prior target probability .

Choice of prior belief. Now we select some , our prior belief in target . We introduce a formalism similar to Wolpert's (Wolpert, 1994a). is defined as the probability of obtaining by choosing a randomly according to .

Let us first have a look at Wolpert's formalism: . By restricting to , he obtains an injective function , which is restricted to . is surjective (because is surjective):

 (30)

where is the absolute Jacobian determinant of , evaluated at . If there is a locally flat (flat around ), then is high.

However, we prefer to follow another path. Our algorithm (flat minimum search) tends to prune a weight if is very flat in 's direction. It prefers regions where (where many weights lead to the same net function). Unlike Wolpert's approach, ours distinguishes the probabilities of targets with . The advantage is: we do not only search for which are flat in one direction but for which are flat in many directions (this corresponds to a higher probability of the corresponding targets). Define

 (31)

and
 (32)

We have
 (33)

partitions into equivalence classes. To obtain , we compute the probability of being in the equivalence class , if randomly chosen according to . An equivalence class corresponds to a net function, i.e., maps all of an equivalence class to the same net function.
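A toy, finite version of this construction (all names ours): the map from parameter vectors to net functions partitions parameter space into equivalence classes, and the induced prior on a net function is the probability mass of its class.

```python
from collections import defaultdict

# Sketch: p(function) = sum of p_w(w) over all parameter vectors w
# mapped to that function, i.e., over one equivalence class.
def induced_prior(params, net_of, p_w):
    classes = defaultdict(float)
    for w in params:
        classes[net_of(w)] += p_w[w]
    return dict(classes)

# hypothetical example: two weight vectors implement the same function
params = ["w1", "w2", "w3"]
net_of = {"w1": "f1", "w2": "f1", "w3": "f2"}.get
p_w = {"w1": 0.4, "w2": 0.3, "w3": 0.3}
prior = induced_prior(params, net_of, p_w)   # f1 gets the mass of both w1, w2
```

Functions implemented by many parameter vectors (large equivalence classes, e.g., flat regions) receive higher induced prior probability, which is the point of the construction above.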

Relation to FMS algorithm. FMS (from section 3) works locally in weight space . Let be the actual weight vector found by FMS (with ). Recall the definition of (see (22) and (23)): we want to find a hypothesis which best approximates those with large (the test data has high probability of being drawn from such targets). We will see that those with flat locally have high probability . Furthermore we will see that a close to with flat has flat , too. To approximate such targets , the only thing we can do is find a close to many with and large . To justify this approximation (see definition of while recalling that ), we assume (1) that the noise has mean 0, and (2) that small noise is more likely than large noise (e.g., Gaussian, Laplace, Cauchy distributions).

To restrict to a local range in , we define regions of equal net functions
Note: if is flat along long distances in many directions , then has many elements. Locally in weight space, at with , for we define: if the minimum exists, then , where is a constant. If this minimum does not exist, then . This locally approximates . During search for (corresponding to a hypothesis ), to locally decrease the expected test set error (see equation (20)), we want to enter areas where many large are near in weight space. We wish to decrease the test set error, which is caused by drawing data from highly probable targets (those with large ). We do not know, however, which 's are mapped to targets by . Therefore, we focus on ( near in weight space), instead of . Assume is small enough to allow for a Taylor expansion, and that is flat in direction :
where is the Hessian of evaluated at , , and (analogously for higher order derivatives). We see: in a small environment of , there is flatness in direction , too. Likewise, if is not flat in any direction, this property also holds within a small environment of . Only near with flat may there exist with large . Therefore, it is reasonable to search for a with , where is flat within a large region. This means searching for the determined by of equation (22). Since , holds: we search for a living within a large connected region, where for all within this region , where is defined in section 2. To conclude: we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see definition of flat minima in section 2).
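Flatness of the error surface around a weight vector can be probed numerically via the Hessian: directions with near-zero eigenvalues are the flat ones. A minimal finite-difference sketch (the function names and the toy error surface are ours, not FMS itself):

```python
import numpy as np

# Sketch: finite-difference Hessian of a scalar error surface f at w.
# Small Hessian eigenvalues indicate directions in which f is flat.
def hessian_fd(f, w, eps=1e-4):
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            wpp = w.copy(); wpp[i] += eps; wpp[j] += eps
            wpm = w.copy(); wpm[i] += eps; wpm[j] -= eps
            wmp = w.copy(); wmp[i] -= eps; wmp[j] += eps
            wmm = w.copy(); wmm[i] -= eps; wmm[j] -= eps
            H[i, j] = (f(wpp) - f(wpm) - f(wmp) + f(wmm)) / (4 * eps ** 2)
    return H

# toy error surface, nearly flat in the second coordinate near the origin
E = lambda w: w[0] ** 2 + 0.001 * w[1] ** 2
H = hessian_fd(E, np.zeros(2))
eigs = np.linalg.eigvalsh(H)   # ascending; small eigenvalue = flat direction
```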

Practical realization of the Gibbs variant.
(1) Select and , thus implicitly choosing .
(2) Compute the set .
(3) Assume we know how data is obtained from target , i.e., we know , , and the prior . Then we can compute and .
(4) Start with and increase until equation (13) holds. Now we know the from the implicit choice above.
(5) Since we know all we need to compute , select some according to this distribution.
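The steps above can be sketched as a toy simulation (all names, the toy numbers, and the simple beta schedule are ours): raise the inverse temperature until, with the required confidence, the Gibbs posterior puts its mass on hypotheses below the tolerable error level, then sample one hypothesis.

```python
import numpy as np

# Hedged sketch of steps (1)-(5): increase beta until the posterior
# places mass >= 1 - gamma on training errors <= e_tol (cf. eq. (13)),
# then sample a hypothesis from the resulting distribution (step (5)).
def run_gibbs_variant(prior, errors, e_tol, gamma, beta_step=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    beta = 0.0
    while True:
        unnorm = prior * np.exp(-beta * errors)
        p = unnorm / unnorm.sum()
        if p[errors <= e_tol].sum() >= 1.0 - gamma:
            break
        beta += beta_step
    h = rng.choice(len(p), p=p)   # sample one hypothesis
    return h, beta, p

prior = np.ones(4) / 4
errors = np.array([0.05, 0.30, 0.08, 0.90])
h, beta, p = run_gibbs_variant(prior, errors, e_tol=0.10, gamma=0.05)
```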

Three comments on certain FMS limitations.
1. FMS only approximates the Gibbs variant given by the definition of (see (22) and (23)).
We only locally approximate in weight space. If is locally flat around then there exist units or weights which can be given with low precision (or can be removed). If there are other weights with , then one may assume that there are also points in weight space near such where weights can be given with low precision (think of, e.g., symmetrical exchange of weights and units). We assume the local approximation of is good. The most probable targets represented by flat are approximated by a hypothesis which is also represented by a flat (where is near in weight space). To allow for approximation of by , we have to assume that the hypothesis set is dense in the target set . If is flat in many directions then there are many that share this flatness and are well-approximated by . The only reasonable thing FMS can do is to make as flat as possible in a large region around , to approximate the with large prior probability (recall that flat regions are approximated by axis-aligned boxes, as discussed in section 7, paragraph entitled ``Generalized boxes?''). This approximation is fine if is smooth enough in ``unflat'' directions (small changes in should not result in quite different net functions).

2. Concerning point (3) above:
depends on (how the training data is drawn from the target, see (18)). depends on and (how the test data is drawn from the target). Since we do not know how the data is obtained, the quality of the approximation of the Gibbs algorithm may suffer from noise which does not have mean 0, or from large noise being more probable than small noise.

Of course, if the choice of prior belief does not match the true target distribution, the quality of 's approximation will suffer as well.

3. Concerning point (5) above:
FMS outputs only a single instead of . This issue is discussed in section 7 (paragraph entitled ``multiple initializations?'').

To conclude: our FMS algorithm from section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made: the first is that an appropriate choice of prior belief has been made; the second is that the noise on the data is not ``too weird'' (mean 0, small noise more likely). The two assumptions are necessary for any algorithm based on an additional error term besides the training error. The approximations are: is approximated locally in weight space, and flat are approximated by flat with near 's. Our Gibbs variant takes into account that FMS uses only for computing flatness.

Juergen Schmidhuber 2003-02-13
