Architecture. For simplicity, we will focus on a standard feedforward net (but in the experiments, we will use recurrent nets as well). The net has $N$ input units, $M$ output units, $L$ weights, and differentiable activation functions. It maps input vectors $x \in \mathbf{R}^N$ to output vectors $o(w) \in \mathbf{R}^M$. The weight from unit $j$ to unit $i$ is denoted by $w_{ij}$. The $L$-dimensional weight vector is denoted by $w$.
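As a concrete illustration of such a net, here is a minimal sketch in plain Python. It assumes a single hidden layer with $\tanh$ activations; the particular topology, the function name \texttt{forward}, and the weight-matrix layout are illustrative choices, not fixed by the text.

```python
import math

def forward(x, W1, W2):
    """Map an input vector x to an output vector o(w) through one
    hidden layer with differentiable (tanh) activations.
    W1[i][j] is the weight from input unit j to hidden unit i;
    W2[i][j] is the weight from hidden unit j to output unit i."""
    hidden = [math.tanh(sum(w * xj for w, xj in zip(row, x)))
              for row in W1]
    return [sum(w * hj for w, hj in zip(row, hidden)) for row in W2]

# 2 inputs -> 2 hidden units -> 1 output: L = 6 weights in total.
W1 = [[0.5, -0.3], [0.1, 0.2]]
W2 = [[1.0, -1.0]]
o = forward([1.0, 2.0], W1, W2)
```

The weight vector $w$ of the text corresponds to the $L$ entries of \texttt{W1} and \texttt{W2} flattened into one vector.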
Training error. Mean squared error is used: $E(w) := \frac{1}{|D_0|} \sum_{(x_p, y_p) \in D_0} \| y_p - o_p(w) \|^2$, where $D_0$ is the training set, $\| \cdot \|$ denotes the Euclidean norm, and $|D_0|$ denotes the cardinality of the set $D_0$. To define regions in weight space with the property that each weight vector from that region has ``similar small error'', we introduce the tolerable error $E_{tol}$, a positive constant. ``Small'' error is defined as being smaller than $E_{tol}$. $E(w) > E_{tol}$ implies ``underfitting''.
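The definitions above translate directly into code. The following sketch assumes a hypothetical forward-pass function \texttt{net(x, w)} returning the output vector $o_p(w)$; the helper names are illustrative.

```python
def mean_squared_error(net, w, D0):
    """E(w): mean over the training set D0 of the squared Euclidean
    distance between target y_p and network output o_p(w) = net(x_p, w)."""
    total = 0.0
    for x_p, y_p in D0:
        o_p = net(x_p, w)
        total += sum((y - o) ** 2 for y, o in zip(y_p, o_p))
    return total / len(D0)  # len(D0) is |D0|, the cardinality of D0

def is_acceptable(net, w, D0, E_tol):
    """A weight vector w is an acceptable minimum iff E(w) <= E_tol."""
    return mean_squared_error(net, w, D0) <= E_tol
```

For example, with the trivial one-weight net \texttt{net = lambda x, w: [w[0] * x[0]]} and $D_0 = \{([1],[2]), ([2],[4])\}$, the weight vector $w = (2)$ gives $E(w) = 0$ and is acceptable for any positive $E_{tol}$.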
Boxes. Each weight vector $w$ satisfying $E(w) \leq E_{tol}$ defines an ``acceptable minimum''. We are interested in large regions of connected acceptable minima. Such regions are called flat minima. They are associated with low expected generalization error (see [4]). To simplify the algorithm for finding large connected regions (see below), we do not consider maximal connected regions but focus on so-called ``boxes'' within regions: for each acceptable minimum $w$, its box $M_w$ in weight space is an $L$-dimensional hypercuboid with center $w$. For simplicity, each edge of the box is taken to be parallel to one weight axis. Half the length of the box edge in direction of the axis corresponding to weight $w_{ij}$ is denoted by $\Delta w_{ij}$, which is the maximal (positive) value such that for all $(i,j)$, all positive $\kappa_{ij} \leq \Delta w_{ij}$ can be added to or subtracted from the corresponding components of $w$ simultaneously without violating $E(w + \kappa) \leq E_{tol}$ ($\Delta w_{ij}$ gives the precision of $w_{ij}$). $M_w$'s box volume is defined by $\Delta V(w) := 2^L \prod_{i,j} \Delta w_{ij}$.
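The box volume and the simultaneous-perturbation condition can be sketched as follows. The corner check below is only a heuristic illustration of the condition, not a certificate: for a non-convex error function $E$, testing the $2^L$ corners of the box does not bound $E$ on the box interior. All function names are illustrative.

```python
from itertools import product

def box_volume(delta_w):
    """Delta V(w) = 2^L * prod_{ij} Delta w_ij: the box edge in each of
    the L weight directions has length 2 * Delta w_ij (center +/- Delta)."""
    vol = 1.0
    for d in delta_w:
        vol *= 2.0 * d
    return vol

def corners_stay_acceptable(E, w, delta_w, E_tol):
    """Heuristic check of the box condition: add or subtract each
    Delta w_ij from the corresponding component of w simultaneously
    (all 2^L corners of the hypercuboid) and test E <= E_tol."""
    for signs in product((-1.0, 1.0), repeat=len(w)):
        corner = [wi + s * di for wi, s, di in zip(w, signs, delta_w)]
        if E(corner) > E_tol:
            return False
    return True
```

For instance, with $E(w) = w_1^2 + w_2^2$, $w = (0, 0)$, and $E_{tol} = 0.05$, half-edges $\Delta w = (0.1, 0.1)$ pass the corner check while $(0.2, 0.2)$ do not.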