
A.1. OVERFITTING ERROR

In analogy to [12] and [1], we decompose the generalization error into an ``overfitting'' error and an ``underfitting'' error. There is no significant underfitting error (corresponding to Vapnik's empirical risk) if $E_{q}(w,D_0) \leq E_{tol}$. Defining the ``overfitting'' error, however, requires some thought; we do this in a novel way. Since we do not know the relation $D$, we cannot know $p(\alpha \mid D)$, the ``optimal'' posterior weight distribution we would obtain by training the net on $D$ ($\rightarrow$ ``sure thing hypothesis''). Suppose, however, for theoretical purposes that we did know $p(\alpha \mid D)$. Then we could use $p(\alpha \mid D)$ to initialize the weights before learning the training set $D_0$. Using the Kullback-Leibler distance, we measure the information (due to noise) conveyed by $D_0$ but not by $D$. Together with the initialization above, this provides the conceptual setting for defining an overfitting error measure. The initialization itself does not really matter, however, because it does not heavily influence the posterior (see [4]).

The overfitting error is the Kullback-Leibler distance between the posteriors:
$E_{o}(D,D_{0}) = \int p(\alpha \mid D_{0}) \log \left( p(\alpha \mid D_{0})/p(\alpha \mid D) \right) d\alpha$.
$E_o(D,D_0)$ is the expectation of $\log \left( p(\alpha \mid D_0)/p(\alpha \mid D) \right)$ (the expected difference between the minimal description lengths of $\alpha$ with respect to $D$ and $D_0$, after learning $D_0$). Now we measure the expected overfitting error relative to $M_{w}$ (see section 2) by computing the expectation of $\log \left( p(\alpha \mid D_{0})/p(\alpha \mid D) \right)$ in the range $M_w$:

\begin{displaymath}
E_{ro}(w)= \beta \left( \int_{M_{w}}
p_{M_w}(\alpha \mid D_{0}) \, E_q(\alpha,D) \, d \alpha -
\bar E_q(D_0,M_w) \right)
\mbox{ .}
\end{displaymath} (3)

Here $p_{M_w}(\alpha \mid D_0) := p(\alpha \mid D_0)/\int_{M_w} p(\tilde \alpha \mid D_0) \, d \tilde \alpha$ is the posterior given $D_0$, rescaled to obtain a distribution within $M_w$, and $\bar E_q(D_0,M_w) := \int_{M_w} p_{M_w}(\alpha \mid D_0) E_q(\alpha,D_0) \, d \alpha$ is the mean error in $M_w$ with respect to $D_0$.
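
For orientation, here is a brief sketch of how (3) arises from the Kullback-Leibler expression above. It assumes Gibbs-type posteriors $p(\alpha \mid D) \propto p(\alpha) \exp(-\beta E_q(\alpha,D))$ (and analogously for $D_0$); this form is assumed here only for illustration and is not a restatement of [4]. Under this assumption,

\begin{displaymath}
\log \frac{p(\alpha \mid D_0)}{p(\alpha \mid D)}
= \beta \left( E_q(\alpha,D) - E_q(\alpha,D_0) \right)
+ \log \frac{Z(D)}{Z(D_0)}
\mbox{ ,}
\end{displaymath}

where $Z(D)$ and $Z(D_0)$ are the normalizing constants. Taking the expectation with respect to $p_{M_w}(\cdot \mid D_0)$ over $M_w$ and dropping the $w$-independent normalization term yields the right-hand side of (3).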

Clearly, we would like to pick $w$ such that $E_{ro}(w)$ is minimized. To this end, we need two additional prior assumptions; both are implicit in most previous approaches, which make even stronger assumptions (see section 1). (1) ``Closeness assumption'': every minimum of $E_q(\cdot,D_0)$ is ``close'' to a maximum of $p(\alpha \mid D)$ (see the formal definition in [4]). Intuitively, closeness ensures that $D_0$ can indeed tell us something about $D$, so that training on $D_0$ may reduce the error on $D$. (2) ``Flatness assumption'': the maxima of $p(\alpha \mid D)$ are not sharp peaks. This MDL-like assumption holds if not all weights have to be known exactly to model $D$; it ensures that there are regions with low error on $D$.
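
To illustrate the flatness assumption (an illustrative special case only, not the formal definition of [4]): suppose that near a maximum $\alpha^*$ of $p(\alpha \mid D)$ the log-posterior is approximately quadratic,

\begin{displaymath}
\log p(\alpha \mid D) \approx \log p(\alpha^* \mid D)
- \frac{1}{2} (\alpha - \alpha^*)^T H (\alpha - \alpha^*)
\mbox{ ,}
\end{displaymath}

where $H$ is the Hessian of $-\log p(\cdot \mid D)$ at $\alpha^*$. Small eigenvalues of $H$ (directions along which the weights need not be known exactly) correspond to a flat peak: a box $M_w$ around $\alpha^*$ then retains both substantial posterior mass and a low error on $D$.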



