

TASK / ARCHITECTURE / BOXES

Generalization task. The task is to approximate an unknown relation $\bar D \subset X \times Z$ between a set of inputs $X \subset R^N$ and a set of outputs $Z \subset R^K$. $\bar D$ is taken to be a function. A relation $D$ is obtained from $\bar D$ by adding noise to the outputs. All training information is given by a finite relation $D_0 \subset D$, called the training set. The $p$th element of $D_0$ is an input/target pair denoted by $(x_p, d_p)$.
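As a concrete illustration of this setup, here is a minimal sketch (not from the paper) that builds a noisy training set $D_0$ from a hypothetical target function standing in for $\bar D$; the function, the noise level, and the set size are arbitrary assumptions:

import numpy as np

# Hypothetical target function playing the role of \bar{D} (N = K = 1).
def target_fn(x):
    return np.sin(3.0 * x)

rng = np.random.default_rng(0)
P = 50                                                # |D_0|
x = rng.uniform(-1.0, 1.0, size=(P, 1))               # inputs x_p in X
d = target_fn(x) + rng.normal(0.0, 0.1, size=(P, 1))  # noisy targets d_p
D0 = list(zip(x, d))                                  # training set D_0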

Architecture. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has $N$ input units, $K$ output units, $W$ weights, and differentiable activation functions. It maps input vectors $x_p \in R^N$ to output vectors $o_p \in R^K$. The weight from unit $j$ to $i$ is denoted by $w_{ij}$. The $W$-dimensional weight vector is denoted by $w$.
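For concreteness, a minimal sketch of such a net (assuming a single hidden layer with tanh activations, which the text does not prescribe; all names are illustrative):

import numpy as np

def forward(x, W1, b1, W2, b2):
    # One hidden layer with a differentiable activation (tanh);
    # maps an input batch in R^N to output vectors o_p in R^K.
    h = np.tanh(x @ W1 + b1)
    return h @ W2 + b2

# The weight vector w of the text collects all trainable parameters
# (here the entries of W1, b1, W2, b2), W entries in total.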

Training error. Mean squared error $E_{q}(w,D_0):= \frac{1}{\vert D_0\vert} \sum_{(x_p, d_p) \in D_0} \parallel d_p - o_p \parallel^{2}$ is used, where $\parallel . \parallel$ denotes the Euclidean norm and $\vert . \vert$ denotes the cardinality of a set. To define regions in weight space with the property that each weight vector from that region has ``similar small error'', we introduce the tolerable error $E_{tol}$, a positive constant. An error is called ``small'' if it is smaller than $E_{tol}$. $E_{q}(w,D_0) > E_{tol}$ implies ``underfitting''.
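A direct transcription of $E_{q}$ (a sketch under the assumptions of the previous snippet; the value of $E_{tol}$ is an arbitrary placeholder):

import numpy as np

def E_q(outputs, targets):
    # Mean over D_0 of the squared Euclidean distance ||d_p - o_p||^2.
    return np.mean(np.sum((targets - outputs) ** 2, axis=1))

E_tol = 0.05  # tolerable error (placeholder value)
# E_q(forward(x, W1, b1, W2, b2), d) <= E_tol  ->  acceptable minimum
# E_q(...) > E_tol                             ->  underfitting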

Boxes. Each weight vector $w$ satisfying $E_{q}(w,D_0) \leq E_{tol}$ defines an ``acceptable minimum''. We are interested in large connected regions of acceptable minima. Such regions are called flat minima; they are associated with low expected generalization error (see [4]). To simplify the algorithm for finding large connected regions (see below), we do not consider maximal connected regions but focus on so-called ``boxes'' within such regions: for each acceptable minimum $w$, its box $M_w$ in weight space is a $W$-dimensional hypercuboid with center $w$. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge along the axis corresponding to weight $w_{ij}$ is denoted by $\Delta w_{ij}$: it is the maximal positive value such that arbitrary $\kappa_{ij}$ with $0 < \kappa_{ij} \leq \Delta w_{ij}$ can simultaneously (for all $i,j$) be added to or subtracted from the corresponding components of $w$ without violating $E_{q}(\cdot,D_0) \leq E_{tol}$. Thus $\Delta w_{ij}$ gives the precision of $w_{ij}$. $M_w$'s box volume is defined by $\Delta w := 2^W \prod_{i,j} \Delta w_{ij}$.
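The box volume and the defining condition of $M_w$ can be transcribed directly. The sketch below (illustrative, not the paper's algorithm) computes the volume in log space to avoid overflow, and tests the simultaneous-perturbation condition at the $2^W$ box corners, which is a necessary check (sufficient only if the error behaves well inside the box) and feasible only for very small $W$:

import numpy as np
from itertools import product

def log_box_volume(delta):
    # log of Delta_w = 2^W * prod_{ij} Delta_w_ij, computed in log
    # space because the raw product under/overflows for large W.
    delta = np.asarray(delta)
    return delta.size * np.log(2.0) + np.sum(np.log(delta))

def box_corners_acceptable(w, delta, error_fn, E_tol):
    # Tests w +/- delta for every sign pattern, i.e. all components
    # perturbed simultaneously; error_fn(w) stands for E_q(w, D_0).
    for signs in product((-1.0, 1.0), repeat=len(w)):
        if error_fn(w + np.array(signs) * delta) > E_tol:
            return False
    return True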

