TASK / ARCHITECTURE / BOXES

**Generalization task.**
The task is to approximate an unknown function
mapping a finite set of possible inputs
to a finite set of possible
outputs .
A data set
is obtained from
(see appendix A.1).
All training information is given
by a finite set
.
is called
the *training set*.
The th element of is denoted by an input/target
pair .

**Architecture/ Net functions.**
For simplicity, we will focus on
a standard feedforward net (but in the experiments, we will use
recurrent nets as well). The net has
input units, output units,
weights, and differentiable activation functions. It
maps input vectors to output vectors
,
where is the -dimensional weight vector, and
the weight on the connection from unit to is denoted .
The *net function* induced by is denoted :
for ,
,
where denotes the -th component of , corresponding
to output unit .

**Training error.**
We use squared error
,
where
denotes the Euclidean norm.

**Tolerable error.**
To define a region in weight space with the property
that each weight vector from that region leads to
small error and similar output,
we introduce the tolerable error ,
a positive constant (see appendix A.1 for a formal definition of ).
``Small'' error is defined as being smaller than .
implies ``underfitting''.

**Boxes.**
Each weight satisfying
defines an ``acceptable minimum''
(compare in appendix A.1).
**We are interested in a large region of connected acceptable minima,
where each weight within this region leads to almost
identical net functions .
Such a region is called a flat minimum.** We will see that
flat minima correspond to
low expected generalization error.
To simplify the algorithm for finding a large connected
region (see below), we do not consider
maximal connected regions but focus on so-called *``boxes''* within
regions: for each acceptable minimum ,
its *box* in weight space
is a -dimensional hypercuboid
with center .
For simplicity, each edge of the box is taken to be parallel
to one weight axis.
Half the length of the box edge in direction of the axis
corresponding to weight
is denoted by
.
The
are the maximal (positive) values such
that for all
-dimensional vectors
whose components
are restricted by
,
we have:

, where
,
and is a small positive constant defining
tolerable output changes (see also equation (1)).
Note that
depends on .
Since our algorithm does not use , however,
it is notationally suppressed.
gives the precision of .
's *box volume* is defined by
, where
denotes the vector with
components
.
Our goal is to find large boxes within flat minima.

Back to Financial Forecasting page