

TASK / ARCHITECTURE / BOXES

Generalization task. The task is to approximate an unknown function $f \subset X \times Y$ (viewed as a set of input/output pairs) mapping a finite set of possible inputs $X \subset R^N$ to a finite set of possible outputs $Y \subset R^K$. A data set $D$ is obtained from $f$ (see appendix A.1). All training information is given by a finite set $D_0 \subset D$, called the training set. The $p$th element of $D_0$ is an input/target pair $(x_p, y_p)$.

Architecture / net functions. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has $N$ input units, $K$ output units, $L$ weights, and differentiable activation functions. It maps input vectors $x \in R^N$ to output vectors $o(w,x) \in R^K$, where $w$ is the $L$-dimensional weight vector, and the weight on the connection from unit $j$ to unit $i$ is denoted $w_{ij}$. The net function induced by $w$ is denoted $net(w)$: for $x \in R^N$, $net(w)(x) = o(w,x) = \left(o^1(w,x), o^2(w,x), \ldots, o^{K-1}(w,x), o^K(w,x)\right)$, where $o^i(w,x)$ denotes the $i$th component of $o(w,x)$, corresponding to output unit $i$.
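To make the notation concrete, here is a minimal sketch in Python/NumPy of a net function $net(w)$ mapping $x$ to $o(w,x)$; the single hidden layer, the tanh activations, and the packing of $w$ into two weight matrices are illustrative assumptions for the example, not part of the definition above.

import numpy as np

# Assumed example sizes (not from the paper): N inputs, H hidden units, K outputs.
N, H, K = 3, 5, 2
L = H * N + K * H          # total number of weights in this example architecture

def unpack(w):
    """Split the flat L-dimensional weight vector into two weight matrices."""
    W1 = w[:H * N].reshape(H, N)        # input -> hidden
    W2 = w[H * N:].reshape(K, H)        # hidden -> output
    return W1, W2

def net(w):
    """Return the net function net(w): x -> o(w, x)."""
    W1, W2 = unpack(w)
    def o(x):
        h = np.tanh(W1 @ x)             # differentiable hidden activations
        return W2 @ h                   # K-dimensional output o(w, x)
    return o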

Training error. We use the squared error $E(net(w),D_0) := \sum_{(x_p, y_p) \in D_0} \| y_p - o(w,x_p) \|^{2}$, where $\| \cdot \|$ denotes the Euclidean norm.
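In code, the training error is just the summed squared Euclidean distance between targets and outputs over $D_0$; this sketch reuses the hypothetical net(w) from the sketch above and assumes $D_0$ is a list of (x_p, y_p) pairs of NumPy arrays.

def training_error(w, D0):
    """E(net(w), D0): summed squared error over the training set."""
    o = net(w)
    return sum(np.sum((y_p - o(x_p)) ** 2) for x_p, y_p in D0)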

Tolerable error. To define a region in weight space with the property that each weight vector from that region leads to small error and similar output, we introduce the tolerable error $E_{tol}$, a positive constant (see appendix A.1 for a formal definition of $E_{tol}$). ``Small'' error is defined as being smaller than $E_{tol}$. $E(net(w),D_0) > E_{tol}$ implies ``underfitting''.

Boxes. Each weight vector $w$ satisfying $E(net(w),D_0) \leq E_{tol}$ defines an ``acceptable minimum'' (compare $M(D_0)$ in appendix A.1). We are interested in a large region of connected acceptable minima, where each weight vector $w$ within this region leads to an almost identical net function $net(w)$. Such a region is called a flat minimum. We will see that flat minima correspond to low expected generalization error. To simplify the algorithm for finding a large connected region (see below), we do not consider maximal connected regions but focus on so-called ``boxes'' within regions: for each acceptable minimum $w$, its box $M_w$ in weight space is an $L$-dimensional hypercuboid with center $w$. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in the direction of the axis corresponding to weight $w_{ij}$ is denoted by $\Delta w_{ij}(X)$. The $\Delta w_{ij}(X)$ are the maximal (positive) values such that for all $L$-dimensional vectors $\kappa$ whose components $\kappa_{ij}$ are restricted by $\vert\kappa_{ij}\vert \leq \Delta w_{ij}(X)$, we have $E(net(w),net(w + \kappa),X) \leq \epsilon$, where $E(net(w),net(w + \kappa),X) = \sum_{x \in X} \| o(w,x) - o(w + \kappa, x) \|^2$, and $\epsilon$ is a small positive constant defining tolerable output changes (see also equation (1)). Note that $\Delta w_{ij}(X)$ depends on $\epsilon$. Since our algorithm does not use $\epsilon$, however, it is notationally suppressed. $\Delta w_{ij}(X)$ gives the precision of $w_{ij}$. $M_w$'s box volume is defined by $V(\Delta w(X)) := 2^L \prod_{i,j} \Delta w_{ij}(X)$, where $\Delta w(X)$ denotes the vector with components $\Delta w_{ij}(X)$. Our goal is to find large boxes within flat minima.
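The $\Delta w_{ij}(X)$ have no closed form in general; the following sketch only illustrates what the definition asks for, reusing the hypothetical net(w) above. It computes the output-change error $E(net(w),net(w+\kappa),X)$ and the box volume $V(\Delta w(X))$, and uses a crude Monte Carlo test of whether a candidate vector of edge lengths keeps the output change below $\epsilon$; this sampling test is an illustrative assumption, not the paper's algorithm (which follows in the next section).

def output_change_error(w, kappa, X):
    """E(net(w), net(w + kappa), X): summed squared output difference over X."""
    o_w, o_wk = net(w), net(w + kappa)
    return sum(np.sum((o_w(x) - o_wk(x)) ** 2) for x in X)

def box_volume(delta_w):
    """V(Delta w(X)) = 2^L * prod_ij Delta w_ij(X)."""
    return 2.0 ** len(delta_w) * np.prod(delta_w)

def inside_box(w, delta_w, X, eps, n_samples=1000, rng=np.random.default_rng(0)):
    """Crude Monte Carlo check that sampled kappa with |kappa_ij| <= Delta w_ij(X)
    keep the output change below eps (a necessary, not a sufficient, condition:
    the definition requires the bound for *all* kappa in the box)."""
    for _ in range(n_samples):
        kappa = rng.uniform(-delta_w, delta_w)
        if output_change_error(w, kappa, X) > eps:
            return False
    return True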

