
A.1. FLAT NETS: THE MOST PROBABLE HYPOTHESES

Short Guide Through Appendix A.1. We introduce a novel kind of generalization error that can be split into an overfitting error and an underfitting error. To find hypotheses causing low generalization error, we first select a subset of hypotheses causing low underfitting error. We are interested in those of its elements causing low overfitting error.




More Detailed Guide Through Appendix A.1.

Definitions. Let $A=\{(x,y) \mid x \in X, y \in Y\}$ be the set of all possible input/output pairs (pairs of vectors). Let $NET$ be the set of functions that can be implemented by the network. For every net function $g \in NET$ we have $g \subset A$. Elements of $NET$ are parameterized by a parameter vector $w$ from the set of possible parameters $W$. $net(w)$ is the function which maps a parameter vector $w$ onto a net function $g$ ($net$ is surjective). Let $T$ be the set of target functions $f$, where $T \subset NET$. Let $H$ be the set of hypothesis functions $h$, where $H \subset T$. For simplicity, take all sets to be finite, and let all functions map each $x \in X$ to some $y \in Y$. Values of functions with argument $x$ are denoted by $g(x), net(w)(x), f(x), h(x)$. We have $(x,g(x)) \in g$; $(x,net(w)(x)) \in net(w)$; $(x,f(x)) \in f$; $(x,h(x)) \in h$.

Let $D=\{(x_p,y_p) \mid 1 \leq p \leq m\}$ be the data, where $D \subset A$. $D$ is divided into a training set $D_0=\{(x_p,y_p) \mid 1 \leq p \leq n\}$ and a test set $D \setminus D_0 = \{(x_p,y_p) \mid n < p \leq m \}$. For the moment, we are not interested in how $D$ was obtained.

We use squared error $E(D,h):= \sum^m_{p=1} \parallel y_p - h(x_p) \parallel^2$, where $\parallel . \parallel$ is the Euclidean norm. $E(D_0,h):= \sum^n_{p=1} \parallel y_p - h(x_p) \parallel^2$. $E(D \setminus D_0,h):= \sum^m_{p=n+1} \parallel y_p - h(x_p) \parallel^2$. $E(D,h) = E(D_0,h) + E(D \setminus D_0,h)$ holds.
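To make these conventions concrete, here is a minimal Python sketch (data, hypothesis, and split are toy values, not from the paper) of the squared-error definitions and of the additivity $E(D,h) = E(D_0,h) + E(D \setminus D_0,h)$:

import numpy as np

def E(pairs, h):
    # E(.,h) = sum_p || y_p - h(x_p) ||^2, with ||.|| the Euclidean norm
    return sum(np.sum((y - h(x)) ** 2) for x, y in pairs)

h = lambda x: 0.5 * x                                # toy hypothesis function
D = [(np.array([p, 1.0]), np.array([p, 0.5])) for p in range(6)]  # toy data, m = 6
n = 4
D0, Dtest = D[:n], D[n:]                             # training set and test set
assert np.isclose(E(D, h), E(D0, h) + E(Dtest, h))   # E(D,h) = E(D_0,h) + E(D\D_0,h)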

Learning. We use a variant of the Gibbs formalism (see Opper & Haussler, 1991, or Levin et al., 1990). Consider a stochastic learning algorithm (random weight initialization, random learning rate). The learning algorithm attempts to reduce training set error by randomly selecting a hypothesis with low $E(D_0,h)$, according to some conditional distribution $G(. \mid D_0)$ over $H$. $G(. \mid D_0)$ is chosen in advance, but in contrast to traditional Gibbs (which deals with unconditional distributions on $H$), we may take a look at the training set before selecting $G$. For instance, one training set may suggest linear functions as being more probable than others, another one splines, etc. The unconventional Gibbs variant is appropriate because FMS uses only $X_0$ (the set of first components of $D_0$'s elements, see section 3) to compute the flatness of $net(w')$. The trade-off between the desire for low $E(D_0,h)$ and the a priori belief in a hypothesis according to $G(. \mid D_0)$ is governed by a positive constant $\beta $ (interpretable as the inverse temperature from statistical mechanics, or the amount of stochasticity in the training algorithm).

We obtain $p(h \mid D_0)$, the distribution over hypotheses after applying the learning algorithm to data $D_0$:

$\displaystyle p( h \mid D_0) = \frac{G(h \mid D_0) \exp (- \beta E(D_0,h) )}{Z(D_0,\beta)} ,$     (6)

where
$\displaystyle Z(D_0,\beta) = \sum_{h \in H} G(h \mid D_0) \exp (- \beta E(D_0,h))
.$     (7)

$Z(D_0,\beta)$ is the ``error moment generating function'', the ``weighted accessible volume in configuration space'', or ``the partition function'' (from statistical mechanics).
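For a finite hypothesis set, equations (6) and (7) can be evaluated directly. A minimal sketch, assuming a toy prior G0[k] = $G(h_k \mid D_0)$ and toy training errors E0[k] = $E(D_0,h_k)$ (all values illustrative):

import numpy as np

def gibbs_posterior(G0, E0, beta):
    unnorm = G0 * np.exp(-beta * E0)       # G(h|D_0) exp(-beta E(D_0,h))
    Z = unnorm.sum()                       # partition function Z(D_0,beta), eq. (7)
    return unnorm / Z, Z

G0 = np.array([0.5, 0.3, 0.2])             # prior belief, chosen after seeing D_0
E0 = np.array([0.1, 0.4, 2.0])             # training set errors E(D_0,h)
p, Z = gibbs_posterior(G0, E0, beta=5.0)   # p(h|D_0), eq. (6)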

For theoretical purposes, assume we know $D$ and may use it for learning. To learn, we use the same distribution $G(h \mid D_0)$ as above (prior belief in some hypotheses $h$ is based exclusively on the training set). There is a reason why we do not use $G(h \mid D)$ instead: $G(h \mid D)$ does not allow for making a distinction between a better prior belief in hypotheses and a better approximation of the test set data. However, we are interested in how $G(h \mid D_0)$ performs on the test set data $D \setminus D_0$. We obtain

$\displaystyle p_{D_0}( h \mid D) = \frac{G(h \mid D_0) \exp (- \beta E(D,h) )}{Z_{D_0}(D,\beta)}
,$     (8)

where
$\displaystyle Z_{D_0}(D,\beta) = \sum_{h \in H} G(h \mid D_0) \exp (- \beta E(D,h))
.$     (9)

The subscript $D_0$ indicates that the prior belief is chosen based on $D_0$ only.

Expected extended generalization error. We define the expected extended generalization error $E_G(D,D_0)$ on the unseen test exemplars $D \setminus D_0$:

$\displaystyle E_G(D,D_0):= \sum_{h \in H} p(h \mid D_0) E(D \setminus D_0,h) -
\sum_{h \in H} p_{D_0}(h \mid D) E(D \setminus D_0,h)
.$     (10)

Here $E_G(D,D_0)$ is the mean error on $D \setminus D_0$ after learning with $D_0$, minus the mean error on $D \setminus D_0$ after learning with $D$. The second (negative) term is a lower bound (due to non-zero temperature) for the error on $D \setminus D_0$ after learning the training set $D_0$. Note: for the zero temperature limit $\beta \to \infty$ we get (summation convention explained at the end of this paragraph)
$E_G(D,D_0)=\sum_{h \in H, D_0 \subset h} \frac{G(h \mid D_0)}{Z(D_0)}
E(D \setminus D_0, h)$, where $Z(D_0)=\sum_{h \in H, D_0 \subset h} G(h \mid D_0)$. In this case, the generalization error depends on $G(h \mid D_0)$, restricted to those hypotheses $h$ compatible with $D_0$ ($D_0 \subset h$). For $\beta \to 0$ (full stochasticity), we get $E_G(D,D_0)=0$.
Summation convention: in general, $\sum_{h \in H, D_0 \subset h}$ denotes summation over those $h$ satisfying $h \in H$ and $D_0 \subset h$. In what follows, we will keep an analogous convention: the first symbol is the running index, for which additional expressions specify conditions.
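In the same finite setting, the expected extended generalization error (10) can be evaluated directly; the helper gibbs() mirrors equations (6)-(9), and all numbers are illustrative:

import numpy as np

def gibbs(G0, E, beta):
    w = G0 * np.exp(-beta * E)
    return w / w.sum()

G0   = np.array([0.5, 0.3, 0.2])
E_tr = np.array([0.1, 0.4, 2.0])           # E(D_0,h)
E_te = np.array([1.0, 0.2, 0.3])           # E(D \ D_0,h)
beta = 5.0
p_D0 = gibbs(G0, E_tr, beta)               # learned on D_0, eq. (6)
p_D  = gibbs(G0, E_tr + E_te, beta)        # learned on all of D, eq. (8)
E_G  = p_D0 @ E_te - p_D @ E_te            # eq. (10)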

Overfitting and underfitting error. Let us separate the generalization error into an overfitting error $E_o$ and an underfitting error $E_u$ (in analogy to Wang et al., 1994; and Guyon et al., 1992). We will see that overfitting and underfitting error correspond to the two different error terms in our algorithm: decreasing one term is equivalent to decreasing $E_o$, decreasing the other is equivalent to decreasing $E_u$. Using the Kullback-Leibler distance (Kullback, 1959), we measure the information conveyed by $p(. \mid D_0)$, but not by $p_{D_0}(. \mid D)$ (see figure 4). We may view this as information about $G(. \mid D_0)$: since there are more $h$ which are compatible with $D_0$ than there are $h$ which are compatible with $D$, $G(. \mid D_0)$'s influence on $p(h \mid D_0)$ is stronger than its influence on $p_{D_0}(h \mid D)$. To get the non-stochastic bias (see definition of $E_G$), we divide this information by $\beta $ and obtain the overfitting error:

$\displaystyle E_o(D,D_0) := \frac{1}{\beta} \sum_{h \in H} p(h \mid D_0) \ln \frac{p(h \mid D_0)}{p_{D_0}(h \mid D)} = \sum_{h \in H} p(h \mid D_0) E(D \setminus D_0,h) + \frac{1}{\beta} \ln \frac{Z_{D_0}(D,\beta)}{Z(D_0,\beta)} .$     (11)

Figure 4: Positive contributions to the overfitting error $E_o(D,D_0)$, after learning the training set with a large $\beta $.
Figure 5: Positive contributions to the underfitting error $E_u(D,D_0)$, after learning the training set with a small $\beta $. Again, we use the $D$-posterior from figure 4, assuming it is almost fully determined by $E(D,h)$ (even if $\beta $ is smaller than in figure 4).

Analogously, we measure the information conveyed by $p_{D_0}(. \mid D)$, but not by $p(. \mid D_0)$ (see figure 5). This information is about $D \setminus D_0$. To get the non-stochastic bias (see definition of $E_G$), we divide this information by $\beta $ and obtain the underfitting error:

$\displaystyle E_u(D,D_0) := \frac{1}{\beta} \sum_{h \in H} p_{D_0}(h \mid D) \ln \frac{p_{D_0}(h \mid D)}{p(h \mid D_0)} = - \sum_{h \in H} p_{D_0}(h \mid D) E(D \setminus D_0,h) + \frac{1}{\beta} \ln \frac{Z(D_0,\beta)}{Z_{D_0}(D,\beta)} .$     (12)
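Equations (11) and (12) can be checked numerically in the same toy setting: $E_o$ is the $\frac{1}{\beta}$-scaled Kullback-Leibler distance from $p_{D_0}(. \mid D)$ to $p(. \mid D_0)$, $E_u$ the reverse, and their sum recovers $E_G$ from (10). A minimal sketch with illustrative numbers:

import numpy as np

def gibbs(G0, E, beta):
    w = G0 * np.exp(-beta * E)
    return w / w.sum()

G0, beta = np.array([0.5, 0.3, 0.2]), 5.0
E_tr = np.array([0.1, 0.4, 2.0])                 # E(D_0,h)
E_te = np.array([1.0, 0.2, 0.3])                 # E(D \ D_0,h)
p_D0 = gibbs(G0, E_tr, beta)
p_D  = gibbs(G0, E_tr + E_te, beta)
E_o = (p_D0 * np.log(p_D0 / p_D)).sum() / beta   # eq. (11)
E_u = (p_D * np.log(p_D / p_D0)).sum() / beta    # eq. (12)
E_G = p_D0 @ E_te - p_D @ E_te                   # eq. (10)
assert np.isclose(E_G, E_o + E_u)                # the decomposition holds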

Peaks in $G(. \mid D_0)$ which do not match peaks of $p_{D_0}(. \mid D)$ produced by $D \setminus D_0$ lead to overfitting error. Peaks of $p_{D_0}(. \mid D)$ produced by $D \setminus D_0$ which do not match peaks of $G(. \mid D_0)$ lead to underfitting error. Overfitting and underfitting error tell us something about the shape of $G(. \mid D_0)$ with respect to $D \setminus D_0$, i.e., to what degree the prior belief in $h$ is compatible with $D \setminus D_0$.

Why are they called ``overfitting'' and ``underfitting'' error? Positive contributions to the overfitting error are obtained where peaks of $p(. \mid D_0)$ do not match (or are higher than) peaks of $p_{D_0}(. \mid D)$: there some $h$ will have large probability after training on $D_0$ but will have lower probability after training on all data $D$. This is either because $D_0$ has been approximated too closely, or because of sharp peaks in $G(. \mid D_0)$ -- the learning algorithm specializes either on $D_0$ or on $G(. \mid D_0)$ (``overfitting''). The specialization on $D_0$ will become even worse if $D_0$ is corrupted by noise -- the case of noisy $D_0$ will be treated later. Positive contributions to the underfitting error are obtained where peaks of $p_{D_0}(. \mid D)$ do not match (or are higher than) peaks of $p(. \mid D_0)$: there some $h$ will have large probability after training on all data $D$, but will have lower probability after training on $D_0$. This is either due to a poor $D_0$ approximation (note that $p(. \mid D_0)$ is almost fully determined by $G(. \mid D_0)$), or to insufficient information about $D$ conveyed by $D_0$ (``underfitting''). Either the algorithm did not learn ``enough'' of $D_0$, or $D_0$ does not tell us anything about $D$. In the latter case, there is nothing we can do -- we have to focus on the case where we did not learn enough about $D_0$.

Analysis of overfitting and underfitting error. $E_G(D,D_0) = E_o(D,D_0) + E_u(D,D_0)$ holds. Note: for zero temperature limit $\beta \to \infty$ we obtain $Z_{D_0}(D)= \sum_{h \in H, D \subset h} G(h \mid D_0)$ and $Z(D_0)=\sum_{h \in H, D_0 \subset h} G(h \mid D_0)$.
$E_o(D,D_0)=\sum_{h \in H, D_0 \subset h} \frac{G(h \mid D_0)}{Z(D_0)}
E(D \setminus D_0, h) = E_G(D,D_0)$. $E_u(D,D_0)=0$, i.e., there is no underfitting error. For $\beta \to 0$ (full stochasticity) we get $E_u(D,D_0)=0$ and $E_o(D,D_0)=0$ (recall that $E_G$ is not the conventional but the extended expected generalization error).

Since $D_0 \subset D$, $Z_{D_0}(D,\beta) < Z(D_0,\beta)$ holds. In what follows, averages after learning on $D_0$ are denoted by $<_{D_0}.>$, and averages after learning on $D$ are denoted by $<_D . >$.
Since $Z_{D_0}(D,\beta) = \sum_{h \in H} G(h \mid D_0) \exp (- \beta E(D_0,h) )
\exp (- \beta E(D \setminus D_0,h) )$, we have
$\frac{Z_{D_0}(D,\beta)}{Z(D_0,\beta)} = \sum_{h \in H}
p(h \mid D_0) \exp (-\beta E(D \setminus D_0,h)) =  <_{D_0}
\! \exp (-\beta E(D \setminus D_0,.))\! >$.
Analogously, we have $\frac{Z(D_0,\beta)}{Z_{D_0}(D,\beta)} =  <_D \!
\exp (\beta E(D \setminus D_0,.))\! >$.
Thus, $E_o(D,D_0) =  <_{D_0} \! E(D \setminus D_0,.) \! > +
\frac{1}{\beta} \ln
<_{D_0} \! \exp (-\beta E(D \setminus D_0,.))\! >$, and
$E_u(D,D_0) =
- <_{D} \! E(D \setminus D_0,.)\! > +
\frac{1}{\beta} \ln
<_{D} \! \exp (\beta E(D \setminus D_0,.))\! >$. With large $\beta $, after learning on $D_0$, $E_o$ measures the difference between average test set error and a minimal test set error. With large $\beta $, after learning on $D$, $E_u$ measures the difference between average test set error and a maximal test set error. So assume we do have a large $\beta $ (large enough to exceed the minimum of $\frac{1}{\beta} \ln \frac{Z_{D_0}(D,\beta)}{Z(D_0,\beta)}$). We have to assume that $D_0$ indeed conveys information about the test set: preferring hypotheses $h$ with small $E(D_0,h)$ by using a larger $\beta $ leads to smaller test set error (without this assumption no error decreasing algorithm would make sense). $E_u$ can be decreased by enforcing less stochasticity (by further increasing $\beta $), but this will increase $E_o$. Likewise, decreasing $\beta $ (enforcing more stochasticity) will decrease $E_o$ but increase $E_u$. Increasing $\beta $ decreases the maximal test set error after learning $D$ more than it decreases the average test set error, thus decreasing $E_u$, and vice versa. Decreasing $\beta $ increases the minimal test set error after learning $D_0$ more than it increases the average test set error, thus decreasing $E_o$, and vice versa. This is the above-mentioned trade-off between stochasticity and fitting the training set, governed by $\beta $.

Tolerable error level / Set of acceptable minima. Let us implicitly define a tolerable error level $E_{tol}(\alpha,\beta)$ which, with confidence $1-\alpha$, is the upper bound of the training set error after learning.

$\displaystyle p(E(D_0,h) \leq E_{tol}(\alpha,\beta) ) = \sum_{h \in H,
E(D_0,h) \leq E_{tol}(\alpha,\beta)} p(h \mid D_0) = 1 - \alpha
.$     (13)

With ($1-\alpha$)-confidence, we have $E(D_0,h) \leq E_{tol}(\alpha,\beta)$ after learning. $E_{tol}(\alpha,\beta)$ decreases with increasing $\beta,\alpha$. Now we define $M(D_0) := \{h \in H \mid E(D_0,h) \leq E_{tol}
(\alpha,\beta)\}$, which is the set of acceptable minima -- see section 2. The set of acceptable minima is a set of hypotheses with low underfitting error. With probability $1-\alpha$, the learning algorithm selects a hypothesis from $M(D_0) \subset H$. Note: for the zero temperature limit $\beta \to \infty$ we have
$E_{tol}(\alpha)=0$ and $M(D_0) = \{h \in H \mid D_0 \subset h\}$. By fixing a small $E_{tol}$ (or a large $\beta $), $E_u$ will be forced to be low.
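A minimal sketch of (13) and of $M(D_0)$ for a finite hypothesis set. With finitely many hypotheses, the posterior mass can in general only reach at least $1 - \alpha$ rather than exactly $1 - \alpha$; all names and numbers are illustrative:

import numpy as np

def acceptable_minima(p_D0, E_tr, alpha):
    order = np.argsort(E_tr)                # sort hypotheses by training error
    mass = np.cumsum(p_D0[order])
    k = np.searchsorted(mass, 1.0 - alpha)  # first index whose mass reaches 1 - alpha
    E_tol = E_tr[order][k]                  # tolerable error level E_tol(alpha,beta)
    M = np.flatnonzero(E_tr <= E_tol)       # M(D_0): indices of acceptable minima
    return E_tol, M

p_D0 = np.array([0.7, 0.25, 0.05])          # p(h|D_0) from eq. (6)
E_tr = np.array([0.1, 0.4, 2.0])            # E(D_0,h)
E_tol, M = acceptable_minima(p_D0, E_tr, alpha=0.1)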

We would like to have an algorithm decreasing (1) training set error (this corresponds to decreasing underfitting error), and (2) an additional error term, which should be designed to ensure low overfitting error, given a fixed small $E_{tol}$. The remainder of this section will lead to an answer for the question: how to design this additional error term? Since low underfitting is obtained by selecting a hypothesis from $M(D_0)$, in what follows we will focus on $M(D_0)$ only. Using an appropriate choice of prior belief, at the end of this section, we will finally see that the overfitting error can be reduced by an error term expressing preference for flat nets.

Relative overfitting error. Let us formally define the relative overfitting error $E_{ro}$, the relative contribution of some $h \in M(D_0)$ to the mean overfitting error of the hypothesis set $M(D_0)$:

$\displaystyle E_{ro}(D,D_0,M(D_0),h) =
p_{M(D_0)} (h \mid D_0) E(D \setminus D_0,h)
,$     (14)

where
$\displaystyle p_{M(D_0)} (h \mid D_0) :=
\frac{p(h \mid D_0)}{\sum_{h \in M(D_0)} p(h \mid D_0)}$     (15)

for $h \in M(D_0)$, and zero otherwise.

For $h \in M(D_0)$, we approximate $p(h \mid D_0)$ as follows. We assume that $G(h \mid D_0)$ is large where $E(D_0,h)$ is large (trade-off between low $E(D_0,h)$ and $G(h \mid D_0)$). Then $p(h \mid D_0)$ has large values (due to large $G(h \mid D_0)$) where $E(D_0,h) \approx E_{tol}(\alpha,\beta)$ (assuming $E_{tol}(\alpha,\beta)$ is small). We get
$p( h \mid D_0) \approx \frac{G(h \mid D_0) \exp (-\beta E_{tol}(\alpha,\beta))}
{Z(D_0,\beta)}$. The relative overfitting error can now be approximated by

$\displaystyle E_{ro}(D,D_0,M(D_0),h) \approx
\frac{G(h \mid D_0)}{\sum_{h \in M(D_0)} G(h \mid D_0)} E(D \setminus D_0,h)
.$     (16)

To obtain a distribution over $M(D_0)$, we introduce $G_{M(D_0)}(. \mid D_0)$, the normalized distribution $G(. \mid D_0)$ restricted to $M(D_0)$. For approximation (16) we have
$\displaystyle E_{ro}(D,D_0,M(D_0),h) \approx
G_{M(D_0)}(h \mid D_0) E(D \setminus D_0,h)
.$     (17)
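A minimal sketch of approximation (17): the prior is restricted to $M(D_0)$, renormalized, and multiplied by the test set error (toy values throughout):

import numpy as np

G0   = np.array([0.5, 0.3, 0.2])       # G(h|D_0)
E_te = np.array([1.0, 0.2, 0.3])       # E(D \ D_0,h)
M    = np.array([0, 1])                # indices of the acceptable minima M(D_0)
G_M  = np.zeros_like(G0)
G_M[M] = G0[M] / G0[M].sum()           # G_{M(D_0)}(.|D_0), zero outside M(D_0)
E_ro = G_M * E_te                      # eq. (17), per hypothesis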

Prior belief in $f$ and $D$. Assume $D$ was obtained from a target function $f$. Let $p(f)$ be the prior on targets and $p(D \mid f)$ the probability of obtaining $D$ with a given $f$. We have

$\displaystyle p(f \mid D_0) = \frac{p(D_0 \mid f) p(f)}{p(D_0)} ,$     (18)

where $p(D_0) = \sum_{f \in T} p(D_0 \mid f) p(f)$.
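Equation (18) is a standard Bayes update; a minimal sketch over a finite target set with illustrative prior and likelihood values:

import numpy as np

p_f    = np.array([0.6, 0.3, 0.1])     # prior p(f) over targets
p_D0_f = np.array([0.2, 0.5, 0.1])     # likelihoods p(D_0|f)
p_D0   = (p_D0_f * p_f).sum()          # evidence p(D_0)
p_f_D0 = p_D0_f * p_f / p_D0           # posterior p(f|D_0), eq. (18)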

The data is drawn from a target function with added noise (the noise-free case is treated below). We don't make any assumptions about the nature of the noise -- it does not have to be Gaussian (like, e.g., in MacKay's work, 1992b).

We want to select a $G(. \mid D_0)$ which makes $E_{ro}$ small, i.e., those $h \in M(D_0)$ with small $E(D \setminus D_0,h)$ should have high probabilities $G(h \mid D_0)$.

We don't know $D \setminus D_0$ during learning. $D$ is assumed to be drawn from a target $f$. We compute the expectation of $E_{ro}$, given $D_0$. The probability of the test set $D \setminus D_0$, given $D_0$, is

$\displaystyle p(D \setminus D_0 \mid D_0) =
\sum_{f \in T} p(D \setminus D_0 \mid f) p (f \mid D_0)
,$     (19)

where we assume $p(D \setminus D_0 \mid f, D_0) = p(D \setminus D_0 \mid f)$ (we don't remember which exemplars were already drawn). The expected test set error $E(.,h)$ for some $h$, given $D_0$, is
$\displaystyle \sum_{D \setminus D_0} p(D \setminus D_0 \mid D_0) E(D \setminus D_0,h) = \sum_{f \in T} p(f \mid D_0) \sum_{D \setminus D_0} p(D \setminus D_0 \mid f) E(D \setminus D_0,h) .$     (20)
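A minimal sketch of (19) and (20) with two toy targets and two possible test sets; p_ts[f][t] plays the role of $p(D \setminus D_0 \mid f)$ and E_ts[t] of $E(D \setminus D_0,h)$ for a fixed $h$:

import numpy as np

p_f_D0 = np.array([0.5, 0.5])          # p(f|D_0), eq. (18)
p_ts   = np.array([[0.8, 0.2],         # p(D\D_0 | f), one row per target f
                   [0.1, 0.9]])
E_ts   = np.array([0.3, 1.2])          # E(D\D_0,h), one entry per test set
expected_test_error = p_f_D0 @ (p_ts @ E_ts)   # eq. (20)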

The expected relative overfitting error $E_{ro}(.,D_0,M(D_0),h)$ is obtained by inserting equation (20) into equation (17):


$\displaystyle E_{ro}(.,D_0,M(D_0),h) \approx G_{M(D_0)}(h \mid D_0) \sum_{f \in T} p(f \mid D_0) \sum_{D \setminus D_0} p(D \setminus D_0 \mid f) E(D \setminus D_0,h) .$     (21)

Minimizing expected relative overfitting error. We define a $G_{M(D_0)}(. \mid D_0)$ such that $G_{M(D_0)}(. \mid D_0)$ has its largest value near small expected test set error $E(.,.)$ (see (17) and (20)). This definition leads to a low expectation of $E_{ro}(.,D_0,M(D_0),.)$ (see equation (21)). Define

$\displaystyle G_{M(D_0)}(h \mid D_0) := \delta \left( argmin_{h' \in M(D_0)}
\left( E(.,h') \right)
- h \right)
\mbox{ ,}$     (22)

where $\delta$ is the Dirac delta function, which we will use with loose formalism -- the context will make clear how the delta function is used.

Using equation (20) we get

$\displaystyle G_{M(D_0)}(h \mid D_0) = \delta \left( argmin_{h' \in M(D_0)} \left( \sum_{f \in T} p(f \mid D_0) \sum_{D \setminus D_0} p(D \setminus D_0 \mid f) E(D \setminus D_0,h') \right) - h \right) \mbox{ .}$     (23)

$G_{M(D_0)}(. \mid D_0)$ determines the hypothesis $h$ from $M(D_0)$ that leads to lowest expected test set error. Consequently, we achieve the lowest expected relative overfitting error.

$G_{M(D_0)}$ helps us to define $G$:

$\displaystyle G(h \mid D_0) := \frac{\zeta + G_{M(D_0)} (h \mid D_0)}
{\sum_{h \in H} (\zeta + G_{M(D_0)}(h \mid D_0) )}
\mbox{ ,}$     (24)

where $G_{M(D_0)}(h \mid D_0) = 0$ for $h \notin M(D_0)$, and where $\zeta$ is a small constant ensuring positive probability $G(h \mid D_0)$ for all hypotheses $h$.
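A minimal sketch of (22) and (24) in the finite setting, where the Dirac delta degenerates to a point mass on the argmin, and $\zeta$ smooths the result into a strictly positive prior (all values illustrative):

import numpy as np

E_exp = np.array([0.7, 0.4, 1.5])      # expected test error E(.,h), eq. (20)
M     = np.array([0, 1])               # acceptable minima M(D_0)
G_M   = np.zeros_like(E_exp)
G_M[M[np.argmin(E_exp[M])]] = 1.0      # eq. (22): all mass on the best h in M(D_0)
zeta  = 1e-3
G     = (zeta + G_M) / (zeta + G_M).sum()   # eq. (24): strictly positive G(.|D_0)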

To appreciate the importance of the prior $p(f)$ in the definition of $G_{M(D_0)}$ (see also equation (29)), in what follows, we will focus on the noise-free case.

The special case of noise-free data. Let $p(D_0 \mid f)$ be equal to $\delta (D_0 \subset f)$ (up to a normalizing constant):

$\displaystyle p(f \mid D_0) = \frac{\delta (D_0 \subset f) p(f)}
{\sum_{f \in T, D_0 \subset f} p(f)}$     (25)

Assume $p( D \setminus D_0 \mid f) =
\frac{\delta (D \setminus D_0 \subset f)}{\sum_{D \setminus D_0} \delta
(D \setminus D_0 \subset f)}$. Let $F$ be the number of elements in $X$.
$p( D \setminus D_0 \mid f) =
\frac{\delta (D \setminus D_0 \subset f)}{2^{F-n}}$. We expand $\sum_{D \setminus D_0} p(D \setminus D_0 \mid f)
E(D \setminus D_0,h)$ from equation (20):
$\displaystyle \frac{1}{2^{F-n}}\sum_{D \setminus D_0 \subset f} E(D \setminus D_0,h) = \frac{1}{2^{F-n}}\sum_{D \setminus D_0 \subset f} \; \sum_{(x,y) \in D \setminus D_0} E((x,y),h) = \frac{1}{2^{F-n}}\sum_{(x,y) \in f \setminus D_0} E((x,y),h) \sum_{i=1}^{F-n} {F-n-1 \choose i-1} = \frac{1}{2} E(f \setminus D_0,h) .$     (26)

Here $E((x,y),h) = \parallel y - h(x) \parallel^2$, $E(f \setminus D_0,h) = \sum_{(x,y) \in f \setminus D_0}
\parallel y - h(x) \parallel^2$, and $\sum_{i=1}^{F-n} {F -n -1 \choose i-1}= 2^{F-n-1}$. The factor $\frac{1}{2}$ results from considering the mean test set error (where the test set is drawn from $f$), whereas $E (f \setminus D_0 ,h )$ is the maximal test set error (obtained by using a maximal test set). From (20) and (26), we obtain the expected test set error $E(.,h)$ for some $h$, given $D_0$:
$\displaystyle \sum_{D \setminus D_0} p(D \setminus D_0 \mid D_0)
E(D \setminus D_0 , h) =
\frac{1}{2} \sum_{f \in T} p(f \mid D_0) E (f \setminus D_0 ,h ) .$     (27)
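The counting step behind (26) and the resulting factor $\frac{1}{2}$ can be verified numerically: each pair in $f \setminus D_0$ occurs in exactly $2^{F-n-1}$ of the $2^{F-n}$ possible test sets. A one-line check (F and n are toy values):

from math import comb

F, n = 10, 4
assert sum(comb(F - n - 1, i - 1) for i in range(1, F - n + 1)) == 2 ** (F - n - 1)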

From (27) and (17), we obtain the expected $E_{ro}(.,D_0,M(D_0),h)$:

$\displaystyle E_{ro}(.,D_0,M(D_0),h) \approx
\frac{1}{2} \mbox{ } G_{M(D_0)}(h \mid D_0)
\sum_{f \in T} p(f \mid D_0) E (f \setminus D_0 ,h )
.$     (28)

For $G_{M(D_0)}(h \mid D_0)$ we obtain in this noise free case

$\displaystyle G_{M(D_0)}(h \mid D_0) = \delta \left( argmin_{h' \in M(D_0)} \left( \frac{1}{2} \sum_{f \in T} p(f \mid D_0) E(f \setminus D_0, h') \right) - h \right) \mbox{ .}$     (29)

The expected test set error is measured by $\frac{1}{2} \sum_{f \in T} p(f \mid D_0) E(f \setminus D_0, h)$; see equation (27).



Noisy data and noise-free data: conclusion. For both the noise-free and the noisy case, equation (18) shows that given $D_0$ and $h$, the expected test set error depends on prior target probability $p(f)$.



Choice of prior belief. Now we select some $p(f)$, our prior belief in target $f$. We introduce a formalism similar to Wolpert's (Wolpert, 1994a). $p(f)$ is defined as the probability of obtaining $f=net(w)$ by choosing a $w$ randomly according to $p(w)$.

Let us first have a look at Wolpert's formalism: $p(f) = \int dw p(w) \delta (net(w) -f)$. By restricting $W$ to $W_{inj}$, he obtains an injective function $net_{inj}:W_{inj} \rightarrow NET$, $net_{inj}(w) = net(w)$, which is $net$ restricted to $W_{inj}$. $net_{inj}$ is surjective (because $net$ is surjective):

$\displaystyle p(f) = \int_{W} p(w) \frac{\delta (net_{inj}(w) -f)}{\vert\mbox{det } net_{inj}'(w)\vert} \vert\mbox{det } net_{inj}'(w)\vert dw = \int_{NET} p(net_{inj}^{-1}(g)) \frac{\delta (g -f)}{\vert\mbox{det } net_{inj}'(net_{inj}^{-1}(g))\vert} dg = \frac{p(net_{inj}^{-1}(f))}{\vert\mbox{det } net_{inj}'(net_{inj}^{-1}(f))\vert} ,$     (30)

where $\vert\mbox{det } net_{inj}'(w)\vert$ is the absolute Jacobian determinant of $net_{inj}$, evaluated at $w$. If there is a locally flat $net(w)=f$ (flat around $w$), then $p(f)$ is high.

However, we prefer to follow another path. Our algorithm (flat minimum search) tends to prune a weight $w_i$ if $net(w)$ is very flat in $w_i$'s direction. It prefers regions where $\mbox{det } net'(w) = 0$ (where many weights lead to the same net function). Unlike Wolpert's approach, ours distinguishes the probabilities of targets $f=net(w)$ with $\mbox{det } net'(w) = 0$. The advantage is that we search not only for $net(w)$ which are flat in one direction but for $net(w)$ which are flat in many directions (this corresponds to a higher probability of the corresponding targets). Define

$\displaystyle net^{-1}(g) := \{ w \in W \mid net(w) = g \}$     (31)

and
$\displaystyle p(net^{-1}(g)) := \sum_{w \in net^{-1}(g)} p(w) .$     (32)

We have
$\displaystyle p(f) = \frac{\sum_{g \in NET} p(net^{-1}(g)) \delta (g - f)}{\sum_{f \in T} \sum_{g \in NET} p(net^{-1}(g)) \delta (g - f)} = \frac{p(net^{-1}(f))}{\sum_{f \in T} p(net^{-1}(f))} .$     (33)

$net$ partitions $W$ into equivalence classes. To obtain $p(f)$, we compute the probability of $w$ being in the equivalence class $\{ w \mid net(w) = f \}$, if randomly chosen according to $p(w)$. An equivalence class corresponds to a net function, i.e., $net$ maps all $w$ of an equivalence class to the same net function.
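A minimal sketch of (31)-(33) on a toy weight set: $net$ maps several $w$ to the same net function, and $p(f)$ is the $p(w)$-mass of $f$'s equivalence class, renormalized over $T$ (all names illustrative):

from collections import defaultdict

W   = ["w1", "w2", "w3", "w4"]
p_w = {"w1": 0.4, "w2": 0.3, "w3": 0.2, "w4": 0.1}            # p(w)
net = {"w1": "g1", "w2": "g1", "w3": "g2", "w4": "g3"}        # many-to-one map
T   = {"g1", "g2"}                                            # target functions

mass = defaultdict(float)
for w in W:
    mass[net[w]] += p_w[w]             # p(net^{-1}(g)), eqs. (31)-(32)
Z = sum(mass[f] for f in T)
p_f = {f: mass[f] / Z for f in T}      # p(f), eq. (33)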

Relation to FMS algorithm. FMS (from section 3) works locally in weight space $W$. Let $w'$ be the actual weight vector found by FMS (with $h=net(w')$). Recall the definition of $G_{M(D_0)}(h \mid D_0)$ (see (22) and (23)): we want to find a hypothesis $h$ which best approximates those $f$ with large $p(f)$ (the test data has high probability of being drawn from such targets). We will see that those $f=net(w)$ with locally flat $net(w)$ have high probability $p(f)$. Furthermore, we will see that a $w'$ close to a $w$ with flat $net(w)$ has flat $net(w')$, too. To approximate such targets $f$, the only thing we can do is find a $w'$ close to many $w$ with $net(w)=f$ and large $p(f)$. To justify this approximation (see the definition of $p(f \mid D_0)$, recalling that $h$ is selected according to $G_{M(D_0)}$), we assume (1) that the noise has mean $0$, and (2) that small noise is more likely than large noise (e.g., Gaussian, Laplace, Cauchy distributions).

To restrict $p(f)=p(net(w))$ to a local range in $W$, we define regions of equal net functions $F(w)=\{\bar w \mid net(w + \tau (\bar w - w)) = net(w) \mbox{ for all } \tau \in [0,1] \mbox{ with } w + \tau (\bar w - w) \in W \}$.
Note: $F(w) \subset net^{-1}(net(w))$. If $net(w)$ is flat along long distances in many directions $\bar w -w$, then $F(w)$ has many elements. Locally in weight space, at $w'$ with $h=net(w')$, for $\gamma > 0$ we define: if the minimum $w= argmin_{\bar w}
\{ \parallel \bar w -w' \parallel  \mid \
\parallel \bar w -w' \parallel < \gamma, net(\bar w) = f \}$ exists, then $p_{w',\gamma}(f) = c \, p(F(w))$, where $c$ is a constant. If this minimum does not exist, then $p_{w',\gamma}(f) = 0$. $p_{w',\gamma}(f)$ locally approximates $p(f)$. During search for $w'$ (corresponding to a hypothesis $h=net(w')$), to locally decrease the expected test set error (see equation (20)), we want to enter areas where many large $F(w)$ are near $w'$ in weight space. We wish to decrease the test set error, which is caused by drawing data from highly probable targets $f$ (those with large $p_{w',\gamma}(f)$). We do not know, however, which $w$'s are mapped to targets $f$ by $net(.)$. Therefore, we focus on $F(w)$ ($w$ near $w'$ in weight space), instead of $p_{w',\gamma}(f)$. Assume $\parallel w' -w \parallel$ is small enough to allow for a Taylor expansion, and that $net(w')$ is flat in direction $(\bar w -w')$:
$net(w)=net(w'+(w-w')) = net(w') + \nabla net(w') (w-w')
+ \frac{1}{2} (w-w') H(net(w')) (w-w') + \ldots$, where $H(net(w'))$ is the Hessian of $net(.)$ evaluated at $w'$, $\nabla net(w) (\bar w - w') = \nabla net(w') (\bar w -w') + O(w-w')$, and $(\bar w-w') H(net(w)) (\bar w -w') = (\bar w-w') H(net(w')) (\bar w - w') + O(w-w')$ (analogously for higher order derivatives). We see: in a small environment of $w'$, there is flatness in direction $(\bar w -w')$, too. Likewise, if $net(w')$ is not flat in any direction, this property also holds within a small environment of $w'$. Only near a $w'$ with flat $net(w')$ may there exist $w$ with large $F(w)$. Therefore, it is reasonable to search for a $w'$ with $h=net(w')$, where $net(w')$ is flat within a large region. This means searching for the $h$ determined by $G_{M(D_0)}(. \mid D_0)$ of equation (22). Since $h \in M(D_0)$, $E(D_0,net(w')) \leq E_{tol}$ holds: we search for a $w'$ living within a large connected region, where for all $w$ within this region $E(net(w'),net(w),X)=\sum_{x \in X}
\parallel net(w')(x) - net(w)(x) \parallel^2 \leq \epsilon$, where $\epsilon$ is defined in section 2. To conclude: we decrease the relative overfitting error and the underfitting error by searching for a flat minimum (see definition of flat minima in section 2).
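A minimal numeric illustration of the flatness argument (not the FMS algorithm itself): for a toy net containing a unit whose output is always zero, $net(w)$ is perfectly flat along the direction of that unit's outgoing weight, so $F(w)$ extends far in that direction:

import numpy as np

def net(w, x):
    # toy two-weight "net": w[1] multiplies a unit whose output is always 0
    dead_unit = max(0.0, -abs(x))      # ReLU of a non-positive input: always 0
    return w[0] * x + w[1] * dead_unit

w  = np.array([1.0, 3.0])
X  = np.linspace(-1.0, 1.0, 5)
w2 = w + np.array([0.0, 10.0])         # move far along the flat direction
assert all(net(w, x) == net(w2, x) for x in X)   # net function is unchanged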




Practical realization of the Gibbs variant.
(1) Select $\alpha$ and $E_{tol}(\alpha,\beta)$, thus implicitly choosing $\beta $.
(2) Compute the set $M(D_0)$.
(3) Assume we know how the data is obtained from target $f$, i.e., we know $p(D_0 \mid f)$, $p(D \setminus D_0 \mid f)$, and the prior $p(f)$. Then we can compute $G_{M(D_0)}(. \mid D_0)$ and $G(. \mid D_0)$.
(4) Start with $\beta = 0$ and increase $\beta $ until equation (13) holds. Now we know the $\beta $ from the implicit choice above.
(5) Since we know all we need to compute $p(h \mid D_0)$, select some $h$ according to this distribution.
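A minimal end-to-end sketch of these five steps in the finite toy setting used above; gibbs() mirrors (6)-(7), the prior is built as in (22) and (24), and all numbers are illustrative (the deliberately large $\zeta$ only makes the $\beta$ search in step (4) nontrivial):

import numpy as np

def gibbs(G, E, beta):
    w = G * np.exp(-beta * E)
    return w / w.sum()

E_tr  = np.array([0.1, 0.4, 2.0])      # training errors E(D_0,h)
E_exp = np.array([0.7, 0.4, 1.5])      # expected test errors, eq. (20)
alpha, E_tol = 0.1, 0.5                # step (1): choose alpha and E_tol
M = np.flatnonzero(E_tr <= E_tol)      # step (2): M(D_0)
G_M = np.zeros_like(E_tr)              # step (3): eqs. (22) and (24)
G_M[M[np.argmin(E_exp[M])]] = 1.0
G = (0.5 + G_M) / (0.5 + G_M).sum()    # zeta = 0.5
beta = 0.0                             # step (4): raise beta until (13) holds
while gibbs(G, E_tr, beta)[M].sum() < 1.0 - alpha:
    beta += 0.1
p = gibbs(G, E_tr, beta)               # step (5): sample h according to p(h|D_0)
h = np.random.default_rng(0).choice(len(p), p=p)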



Three comments on certain FMS limitations.
1. FMS only approximates the Gibbs variant given by the definition of $G_{M(D_0)}(h \mid D_0)$ (see (22) and (23)).
We only locally approximate $p(f)$ in weight space. If $f=net(w)$ is locally flat around $w$, then there exist units or weights which can be given with low precision (or can be removed). If there are other weight vectors $w_i$ with $net(w_i)=f$, then one may assume that there are also points in weight space near such $w_i$ where weights can be given with low precision (think of, e.g., symmetrical exchange of weights and units). We assume the local approximation of $p(f)$ is good. The most probable targets, represented by flat $net(w)$, are approximated by a hypothesis $h$ which is also represented by a flat $net(w')$ (where $w'$ is near $w$ in weight space). To allow for approximation of $net(w)$ by $net(w')$, we have to assume that the hypothesis set $H$ is dense in the target set $T$. If $net(w')$ is flat in many directions, then there are many $net(w)=f$ that share this flatness and are well-approximated by $net(w')$. The only reasonable thing FMS can do is make $net(w')$ as flat as possible in a large region around $w'$, to approximate the $net(w)$ with large prior probability (recall that flat regions are approximated by axis-aligned boxes, as discussed in section 7, paragraph entitled ``Generalized boxes?''). This approximation is fine if $net(w')$ is smooth enough in ``unflat'' directions (small changes in $w'$ should not result in quite different net functions).

2. Concerning point (3) above:
$p(f \mid D_0)$ depends on $p(D_0 \mid f)$ (how the training data is drawn from the target, see (18)). $G_{M(D_0)}(h \mid D_0)$ depends on $p(f \mid D_0)$ and $p(D \setminus D_0 \mid f)$ (how the test data is drawn from the target). Since we do not know how the data is obtained, the quality of the approximation of the Gibbs algorithm may suffer if the noise does not have mean $0$, or if large noise is more probable than small noise.

Of course, if the choice of prior belief does not match the true target distribution, the quality of $G_{M(D_0)}(h \mid D_0)$'s approximation will suffer as well.

3. Concerning point (5) above:
FMS outputs only a single $h$ instead of $p(h \mid D_0)$. This issue is discussed in section 7 (paragraph entitled ``multiple initializations?'').

To conclude: Our FMS algorithm from section 3 only approximates the Gibbs algorithm variant. Two important assumptions are made: the first is that an appropriate choice of prior belief has been made; the second is that the noise on the data is not too ``weird'' (mean $0$, small noise more likely). The two assumptions are necessary for any algorithm based on an additional error term besides the training error. The approximations are: $p(f)$ is approximated locally in weight space, and flat $net(w)$ are approximated by flat $net(w')$ with $w'$ near the corresponding $w$. Our Gibbs variant takes into account that FMS uses only $X_0$ for computing flatness.

