
TASK / ARCHITECTURE / BOXES

Generalization task. The task is to approximate an unknown relation $\bar D \subset X \times Z$ between a set of inputs $X \subset R^N$ and a set of outputs $Z \subset R^K$. $\bar D$ is taken to be a function. A relation $D$ is obtained from $\bar D$ by adding noise to the outputs. All training information is given by a finite relation $D_0 \subset D$. $D_0$ is called the training set. The $p$th element of $D_0$ is denoted by an input/target pair $(x_p, d_p)$.
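The setup above can be made concrete with a small sketch that builds a finite training set of input/target pairs from a noise-free function. The target function `d_bar`, the Gaussian noise model, the uniform input distribution, and the helper name `make_training_set` are all assumptions chosen for illustration; the text only requires that noise is added to the outputs.

```python
import math
import random

def d_bar(x):
    """Hypothetical noise-free target: maps an input in R^N to an output in R^K (K = 1)."""
    return [math.sin(sum(x))]

def make_training_set(num_pairs, n_inputs, noise_std, rng):
    """Return D_0 as a list of input/target pairs (x_p, d_p) with noisy targets."""
    pairs = []
    for _ in range(num_pairs):
        x = [rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
        # Noise is added to the outputs only, never to the inputs.
        d = [y + rng.gauss(0.0, noise_std) for y in d_bar(x)]
        pairs.append((x, d))
    return pairs

D0 = make_training_set(num_pairs=20, n_inputs=3, noise_std=0.1, rng=random.Random(0))
x_1, d_1 = D0[0]  # first input/target pair
```

Here `len(D0)` plays the role of $\vert D_0 \vert$, and each target has the same dimension as the output of `d_bar`.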

Architecture. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has $N$ input units, $K$ output units, $W$ weights, and differentiable activation functions. It maps input vectors $x_p \in R^N$ to output vectors $o_p \in R^K$. The weight from unit $j$ to unit $i$ is denoted by $w_{ij}$. The $W$-dimensional weight vector is denoted by $w$.
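As an illustration of a net that is fully described by a flat weight vector, the sketch below implements a forward pass for one possible architecture. The single hidden layer, the absence of bias weights, the tanh hidden units, the linear output units, the weight layout inside `w`, and the index convention (row index for the receiving unit, column index for the sending unit) are assumptions made for this example, not details taken from the text.

```python
import math

def num_weights(n_in, n_hidden, n_out):
    """Number of weights for one hidden layer without biases (an assumed layout)."""
    return n_hidden * n_in + n_out * n_hidden

def forward(w, x, n_hidden, n_out):
    """Map an input vector x to an output vector o using the flat weight vector w."""
    n_in = len(x)
    assert len(w) == num_weights(n_in, n_hidden, n_out)
    # First n_hidden * n_in entries: weights from input unit j to hidden unit i.
    hidden = [
        math.tanh(sum(w[i * n_in + j] * x[j] for j in range(n_in)))
        for i in range(n_hidden)
    ]
    offset = n_hidden * n_in
    # Remaining entries: weights from hidden unit j to output unit i (linear output).
    return [
        sum(w[offset + i * n_hidden + j] * hidden[j] for j in range(n_hidden))
        for i in range(n_out)
    ]

w = [0.5, -0.5, 1.0, 1.0, 2.0, -1.0]  # 2*2 hidden weights + 1*2 output weights
o = forward(w, x=[1.0, 1.0], n_hidden=2, n_out=1)
```

Keeping all weights in one flat list makes it easy to treat the net as a single point in weight space, which is the view the rest of this section relies on.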

Training error. Mean squared error $E_{q}(w,D_0):= \frac{1}{\vert D_0\vert} \sum_{(x_p, d_p) \in D_0} \parallel d_p - o_p \parallel^{2}$ is used, where $\parallel . \parallel$ denotes the Euclidean norm, and $\vert.\vert$ denotes the cardinality of a set. To define regions in weight space with the property that each weight vector from that region has ``similar small error'', we introduce the tolerable error $E_{tol}$, a positive constant. ``Small'' error is defined as being smaller than $E_{tol}$. $E_{q}(w,D_0) > E_{tol}$ implies ``underfitting''.
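The error measure and the tolerance test can be written down directly. In the minimal sketch below, `net` is any callable mapping a weight vector and an input to an output vector; the toy `linear_net`, the two-element training set, and the function names are hypothetical and serve only to make the example runnable.

```python
def squared_error(d, o):
    """Squared Euclidean distance between a target d and an output o."""
    return sum((dk - ok) ** 2 for dk, ok in zip(d, o))

def mean_squared_error(net, w, training_set):
    """Average squared error of net(w, .) over all input/target pairs."""
    total = sum(squared_error(d, net(w, x)) for x, d in training_set)
    return total / len(training_set)

def is_acceptable(net, w, training_set, e_tol):
    """True if the training error does not exceed the tolerable error."""
    return mean_squared_error(net, w, training_set) <= e_tol

def linear_net(w, x):
    """Toy stand-in for the net: a single linear output unit."""
    return [sum(wi * xi for wi, xi in zip(w, x))]

D0 = [([1.0, 0.0], [1.0]), ([0.0, 1.0], [2.0])]
e = mean_squared_error(linear_net, [1.0, 1.0], D0)  # (0 + 1) / 2
```

With these values the first pair is fitted exactly and the second is off by 1, so the weight vector is acceptable for a tolerance of 0.5 but counts as underfitting for a tolerance of 0.4.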

Boxes. Each weight vector $w$ satisfying $E_{q}(w,D_0) \leq E_{tol}$ defines an ``acceptable minimum''. We are interested in large regions of connected acceptable minima. Such regions are called flat minima. They are associated with low expected generalization error (see [4]). To simplify the algorithm for finding large connected regions (see below), we do not consider maximal connected regions but focus on so-called ``boxes'' within regions: for each acceptable minimum $w$, its box in weight space is a $W$-dimensional hypercuboid with center $w$. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in direction of the axis corresponding to weight $w_{ij}$ is denoted by $\Delta w_{ij}$, which is the maximal (positive) value such that for all $i,j$, all positive $\kappa_{ij} \leq \Delta w_{ij}$ can be added to or subtracted from the corresponding component of $w$ simultaneously without violating $E_{q}(.,D_0) \leq E_{tol}$ ($\Delta w_{ij}$ gives the precision of $w_{ij}$). The box volume is defined by $\Delta w := 2^W \prod_{i,j} \Delta w_{ij}$.
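The box volume formula and the defining property of a box can be sketched as follows. The volume functions follow the formula directly. The acceptance check uses random sampling inside the box, which is an assumption of this example and not the procedure described in the text: passing the check is only evidence that the box is acceptable, not a proof. The error surface `toy_error` and all function names are hypothetical.

```python
import math
import random

def box_volume(delta_w):
    """2^W times the product of the half-edge lengths."""
    return 2 ** len(delta_w) * math.prod(delta_w)

def log_box_volume(delta_w):
    """Logarithm of the volume; avoids underflow when W is large."""
    return len(delta_w) * math.log(2.0) + sum(math.log(dw) for dw in delta_w)

def box_seems_acceptable(error_fn, w, delta_w, e_tol, num_samples, rng):
    """Sampling check: every sampled point in the box must have error <= e_tol."""
    for _ in range(num_samples):
        # Perturb all components simultaneously, each within its own half-edge.
        candidate = [wi + rng.uniform(-dw, dw) for wi, dw in zip(w, delta_w)]
        if error_fn(candidate) > e_tol:
            return False
    return True

def toy_error(w):
    """Hypothetical error surface: steep along w[0], flat along w[1]."""
    return 4.0 * w[0] ** 2 + 0.01 * w[1] ** 2

center = [0.0, 0.0]
narrow = [0.1, 2.0]  # small half-edge on the steep axis, large one on the flat axis
ok = box_seems_acceptable(toy_error, center, narrow, 0.1, 200, random.Random(0))
```

In this toy surface the flat axis tolerates a half-edge twenty times larger than the steep axis, which matches the reading of the half-edge length as the precision needed for the corresponding weight.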


Juergen Schmidhuber 2003-02-25
