Short Guide Through Appendix A.1. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.
More Detailed Guide Through Appendix A.1.
Definitions. Let be the set of all possible input/output pairs (pairs of vectors). Let be the set of functions that can be implemented by the network. For every net function we have . Elements of are parameterized with a parameter vector from the set of possible parameters . is a function which maps a parameter vector onto a net function ( is surjective.) Let be the set of target functions , where . Let be the set of hypothesis functions , where . For simplicity, take all sets to be finite, and let all functions map each to some . Values of functions with argument are denoted by . We have .
Let be the data, where . is divided into a training set and a test set . For the moment, we are not interested in how was obtained.
We use squared error , where is the Euclidean norm. . . holds.
Learning. We use a variant of the Gibbs formalism (see Opper & Haussler, 1991, or Levin et al., 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low , according to some conditional distribution over . is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on ), we may take a look at the training set before selecting . For instance, one training set may suggest linear functions as being more probable than others, another one splines, etc. The unconventional Gibbs variant is appropriate because FMS uses only (the set of first components of 's elements, see section 3) to compute the flatness of . The trade-off between the desire for low and the a priori belief in a hypothesis according to is governed by a positive constant (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm).
We obtain , the learning algorithm
applied to data :
For theoretical purposes, assume we know and may
use it for learning.
To learn, we use the same distribution
(prior belief in some hypotheses is based exclusively
on the training set).
There is a reason why
we do not use instead:
does not allow
for making a distinction between a better prior belief
and a better approximation of the test set data.
However, we are interested
in how performs on the test set data
We define the expected extended
on the unseen test exemplars
Overfitting and underfitting error.
Let us separate the generalization error into an overfitting error and an
(in analogy to
Wang et al., 1994;
and Guyon et al., 1992).
We will see that
overfitting and underfitting error correspond to the two
different error terms in our algorithm:
decreasing one term is equivalent to decreasing ,
decreasing the other is equivalent to decreasing .
Using the Kullback-Leibler distance (Kullback, 1959),
we measure the information
conveyed by , but not by
(see figure 4).
We may view this as information about :
since there are more which are compatible with
than there are which are compatible with ,
's influence on
is stronger than its influence on
To get the non-stochastic bias (see definition of ),
we divide this information by and obtain the overfitting error:
Analogously, we measure the information
, but not by
(see figure 5).
This information is about
To get the non-stochastic bias (see definition of ),
we divide this information by
and obtain the underfitting error:
Peaks in which do not match peaks of produced by lead to overfitting error. Peaks of produced by which do not match peaks of lead to underfitting error. Overfitting and underfitting error tell us something about the shape of with respect to , i.e., to what degree is the prior belief in compatible with .
Why are they called ``overfitting'' and ``underfitting'' error? Positive contributions to the overfitting error are obtained where peaks of do not match (or are higher than) peaks of : there some will have large probability after training on but will have lower probability after training on all data . This is either because has been approximated too closely, or because of sharp peaks in -- the learning algorithm specializes either on or on (``overfitting''). The specialization on will become even worse if is corrupted by noise -- the case of noisy will be treated later. Positive contributions to the underfitting error are obtained where peaks of do not match (or are higher than) peaks of : there some will have large probability after training on all data , but will have lower probability after training on . This is either due to a poor approximation (note that is almost fully determined by ), or to insufficient information about conveyed by (``underfitting''). Either the algorithm did not learn ``enough'' of , or does not tell us anything about . In the latter case, there is nothing we can do -- we have to focus on the case where we did not learn enough about .
Analysis of overfitting and underfitting error.
Note: for zero temperature limit
. , i.e., there is no underfitting error. For (full stochasticity) we get and (recall that is not the conventional but the extended expected generalization error).
In what follows, averages after learning on are denoted by ,
and averages after learning on are denoted by .
Since , we have
Analogously, we have .
Thus, , and
.2With large , after learning on , measures the difference between average test set error and a minimal test set error. With large , after learning on , measures the difference between average test set error and a maximal test set error. So assume we do have a large (large enough to exceed the minimum of ). We have to assume that indeed conveys information about the test set: preferring hypotheses with small by using a larger leads to smaller test set error (without this assumption no error decreasing algorithm would make sense). can be decreased by enforcing less stochasticity (by further increasing ), but this will increase . Likewise, decreasing (enforcing more stochasticity) will decrease but increase . Increasing decreases the maximal test set error after learning more than it decreases the average test set error, thus decreasing , and vice versa. Decreasing increases the minimal test set error after learning more than it increases the average test set error, thus decreasing , and vice versa. This is the above-mentioned trade-off between stochasticity and fitting the training set, governed by .
Tolerable error level / Set of acceptable minima.
Let us implicitly define a tolerable error level
which, with confidence ,
is the upper bound of the
training set error after learning.
We would like to have an algorithm decreasing (1) training set error (this corresponds to decreasing underfitting error), and (2) an additional error term, which should be designed to ensure low overfitting error, given a fixed small . The remainder of this section will lead to an answer for the question: how to design this additional error term? Since low underfitting is obtained by selecting a hypothesis from , in what follows we will focus on only. Using an appropriate choice of prior belief, at the end of this section, we will finally see that the overfitting error can be reduced by an error term expressing preference for flat nets.
Relative overfitting error.
Let us formally define the relative overfitting error ,
which is the relative contribution of some to the mean
overfitting error of hypotheses set :
as follows. We
assume that is large where is large
(trade-off between low and ).
has large values (due to large ) where
is small). We get
. The relative overfitting error can now be approximated by
Prior belief in and .
Assume was obtained from a
target function .
Let be the prior on targets
probability of obtaining with a given . We have
The data is drawn from a target function with added noise (the noise-free case is treated below). We don't make any assumptions about the nature of the noise -- it does not have to be Gaussian (like, e.g., in MacKay's work, 1992b).
We want to select a which makes small, i.e., those with small should have high probabilities .
We don't know
is assumed to be drawn from a target .
We compute the expectation of , given .
The probability of the test set
, given , is
The expected relative overfitting error is obtained by inserting equation (20) into equation (17):
Minimizing expected relative overfitting error.
We define a
has its largest value near
small expected test set error (see (17)
This definition leads to a low expectation of
(see equation (21)).
Using equation (20) we get
determines the hypothesis from that leads to lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error.
helps us to define :
To appreciate the importance of the prior in the definition of (see also equation (29)), in what follows, we will focus on the noise-free case.
The special case of noise-free data.
Let be equal to
an normalizing constant):
From (27) and (17), we obtain
we obtain in this noise free case
The lowest expected test set error measured by . See equation (27).
Noisy data and noise-free data: conclusion. For both the noise-free and the noisy case, equation (18) shows that given and , the expected test set error depends on prior target probability .
Choice of prior belief. Now we select some , our prior belief in target . We introduce a formalism similar to Wolpert's (Wolpert, 1994a ). is defined as the probability of obtaining by choosing a randomly according to .
Let us first have a look at Wolpert's formalism:
By restricting to , he obtains an injective function
which is restricted to .
is surjective (because is surjective):
However, we prefer to follow another path. Our algorithm (flat minimum
search) tends to prune a weight if is very flat
in 's direction. It prefers regions where
(where many weights lead to the same net function).
Unlike Wolpert's approach, ours distinguishes the probabilities of
The advantage is:
we do not only search for which are flat in one direction
but for which are flat in many directions (this corresponds
to a higher probability of the corresponding targets).
partitions into equivalence classes. To obtain , we compute the probability of being in the equivalence class , if randomly chosen according to . An equivalence class corresponds to a net function, i.e., maps all of an equivalence class to the same net function.
Relation to FMS algorithm. FMS (from section 3) works locally in weight space . Let be the actual weight vector found by FMS (with ). Recall the definition of (see (22) and (23)): we want to find a hypothesis which best approximates those with large (the test data has high probability of being drawn from such targets). We will see that those with flat locally have high probability . Furthermore we will see that a close to with flat has flat , too. To approximate such targets , the only thing we can do is find a close to many with and large . To justify this approximation (see definition of while recalling that ), we assume (1) that the noise has mean , and (2) that small noise is more likely than large noise (e.g., Gaussian, Laplace, Cauchy distributions).
to a local range in ,
we define regions of equal net functions
Note: . If is flat along long distances in many directions , then has many elements. Locally in weight space, at with , for we define: if the minimum exists, then , where is a constant. If this minimum does not exist, then . locally approximates . During search for (corresponding to a hypothesis ), to locally decrease the expected test set error (see equation (20)), we want to enter areas where many large are near in weight space. We wish to decrease the test set error, which is caused by drawing data from highly probable targets (those with large ). We do not know, however, which 's are mapped to target's by . Therefore, we focus on ( near in weight space), instead of . Assume is small enough to allow for a Taylor expansion, and that is flat in direction :
, where is the Hessian of evaluated at , , and (analogously for higher order derivatives). We see: in a small environment of , there is flatness in direction , too. Likewise, if is not flat in any direction, this property also holds within a small environment of . Only near with flat , there may exist with large . Therefore, it is reasonable to search for a with , where is flat within a large region. This means to search for the determined by of equation (22). Since , holds: we search for a living within a large connected region, where for all within this region , where is defined in section 2. To conclude: we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see definition of flat minima in section 2).
Practical realization of the Gibbs variant.
(1) Select and , thus implicitly choosing .
(2) Compute the set .
(3) Assume we know how data is obtained from target , i. e. we know , , and the prior . Then we can compute and .
(4) Start with and increase until equation (13) holds. Now we know the from the implicit choice above.
(5) Since we know all we need to compute , select some according to this distribution.
Three comments on certain FMS limitations.
1. FMS only approximates the Gibbs variant given by the definition of (see (22) and (23)).
We only locally approximate in weight space. If is locally flat around then there exist units or weights which can be given with low precision (or can be removed). If there are other weights with , then one may assume that there are also points in weight space near such where weights can be given with low precision (think of, e.g., symmetrical exchange of weights and units). We assume the local approximation of is good. The most probable targets represented by flat are approximated by a hypothesis which is also represented by a flat (where is near in weight space). To allow for approximation of by , we have to assume that the hypothesis set is dense in the target set . If is flat in many directions then there are many that share this flatness and are well-approximated by . The only reasonable thing FMS can do is to make as flat as possible in a large region around , to approximate the with large prior probability (recall that flat regions are approximated by axis-aligned boxes, as discussed in section 7, paragraph entitled ``Generalized boxes?''). This approximation is fine if is smooth enough in ``unflat'' directions (small changes in should not result in quite different net functions).
Concerning point (3) above:
depends on (how the training data is drawn from the target, see (18)). depends on and (how the test data is drawn from the target). Since we do not know how the data is obtained, the quality of the approximation of the Gibbs algorithm may suffer from noise which has not mean , or from large noise being more probable than small noise.
Of course, if the choice of prior belief does not match the true target distribution, the quality of 's approximation will suffer as well.
Concerning point (5) above:
FMS outputs only a single instead of . This issue is discussed in section 7 (paragraph entitled ``multiple initializations?'').
To conclude: Our FMS algorithm from section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made: The first is that an appropriate choice of prior belief has been made. The second is that the noise on the data is not too ``weird'' (mean , small noise more likely). The two assumptions are necessary for any algorithm based on an additional error term besides the training error. The approximations are: is approximated locally in weight space, and flat are approximated by flat with near 's. Our Gibbs variant takes into account that FMS uses only for computing flatness.