Below is the algorithm in pseudocode (using fast multiplication as in appendix A.3.2). Comments are marked by ``**''. Note: the pseudocode was omitted from the version for Neural Computation. We recommend not blindly reimplementing the algorithm from the pseudocode, but making a serious effort to understand it by consulting the body of the paper as well. We believe this will greatly facilitate proper, problem-specific use of the algorithm.
Notation.
In what follows, the integer variables i, j, l stand for units; the variable k is reserved for output units only.
L is the number of layers, where the L-th layer is the output layer and the 1st layer is the input layer.
The current pattern consists of an input vector x and a target vector t (see section 2).
x_i is the component of the input vector corresponding to input unit i.
t_k is the component of the target vector corresponding to output unit k.
w_ij is the real-valued weight on the connection from unit j to unit i (see section 2).
net_i is the net input of unit i (see equation (34) and the text thereafter).
f_i is the activation function of unit i, f'_i is its first derivative, and f''_i its second derivative (see appendix A.2).
y_i = f_i(net_i) is the activation of the i-th unit (see appendix A.2).
e_k is the error of the k-th output unit.
Further variables hold the auxiliary quantities defined in equations (43), (44), and (45), respectively, as well as the gradient of the quadratic error E_q with respect to w_ij.
abs(x) denotes the absolute value of a real x.
kron(i, j) returns 1 if i = j and 0 otherwise.
sign(x) returns the sign of a real x.
Several auxiliary variables are used to compute the right-hand side of equation (40): one accumulates the sums over the output units k in (40), one the denominator in the second line of equation (40), and one the numerator in the second line of equation (40).
A counter W stands for the number of weights that make a significant contribution to the computation of the current pattern.
A variable Δw_ij approximates w_ij's precision (only an approximation, because the exact precision is unknown).
A flag marks whether or not w_ij provides a significant contribution to the computation of the current pattern.
Further variables hold the quantities defined in equations (40), (46), (47), (49), and (48), respectively.
Another variable holds the gradient of the additional error term B with respect to w_ij (see equation (50)).
E_q is the current pattern's quadratic error (see section 2 and appendix A.1).
η is the learning rate for the quadratic error.
λ is the learning rate for the additional error term (the following values are used to make λ updates according to Weigend et al., 1991; see also section 5.6).
Δλ is a parameter needed for updating λ.
A second parameter is also needed for updating λ; its typical value is 0.5.
One variable holds the most recent epoch's average error, another the current epoch's average error, and a third the exponentially weighted average epoch error.
α is the parameter for the exponentially weighted error; a typical value is 0.9 or 0.99, depending on training set size.
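The exponentially weighted average epoch error can be maintained with a one-line update. The sketch below is an illustration under the standard reading of such an average; the function and argument names are my own:

```python
def update_weighted_error(weighted_avg, epoch_error, alpha=0.9):
    # Exponentially weighted average of per-epoch errors; alpha near 1
    # gives a long memory (0.9 or 0.99 are the typical values above).
    return alpha * weighted_avg + (1.0 - alpha) * epoch_error
```

With alpha = 0.9, roughly the last ten epochs dominate the average; with 0.99, roughly the last hundred, which is why the typical value depends on training set size.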
E_tol is the tolerable error level; it depends on the task to be solved (see section 2 and appendix A.1, equation (13)).
A counter tracks the number of exemplars observed during training.
The length of an epoch is measured in the number of presented training patterns.
The current number of epochs so far is obtained by integer division of the exemplar counter by the epoch length.
Auxiliary variables are required for normalizing B's gradient to the length of E_q's gradient.
A flag is TRUE if the corresponding weight was alive for at least one pattern presented during the previous epoch, and FALSE otherwise.
A second flag is TRUE if the first flag is TRUE and has remained TRUE during all epochs since the weight was last set alive (otherwise this flag is FALSE).
':=' denotes the assignment operator.
Additional comments.
Speeding up the algorithm. It makes sense to separate the algorithm into two phases. Phase 1 is conventional backprop; phase 2 is FMS. The backprop phase consists of the forward pass, the first backward pass, and the weight update based on E_q's gradient (marked in the algorithm). Start with phase 1. Switch to phase 2 once the average epoch error falls below a threshold; switch back to phase 1 if it exceeds a second threshold (both threshold values can be changed).
Two-phase learning does not depend sensitively on these thresholds (but avoid values that are always too small). Two-phase learning is justified because weights with large precision estimates (and small gradient contributions) hardly influence E_q's gradient; it makes sense to let FMS focus on the low-precision weights, and to let backprop focus on the others.
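The two-phase switching logic can be sketched as follows. The threshold names and the hysteresis form are assumptions for illustration, since the concrete switching values are given only in the body of the paper:

```python
def select_phase(phase, avg_error, low_thresh, high_thresh):
    # Hysteresis switch between backprop (phase 1) and FMS (phase 2):
    # enter FMS once the average epoch error drops below low_thresh,
    # fall back to backprop if it rises above high_thresh.
    # low_thresh and high_thresh are hypothetical parameter names.
    if phase == 1 and avg_error < low_thresh:
        return 2
    if phase == 2 and avg_error > high_thresh:
        return 1
    return phase
```

Using two distinct thresholds avoids rapid oscillation between the phases when the error hovers near a single switching value.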
ALGORITHM.
Initialization.
Set the precision constant (the exponent is the difference between the numbers of significant digits required for maximal and minimal precision).
Set λ to an arbitrary small value.
Set the error accumulators and the exemplar counter to 0.
Initialize the weights w_ij for all connections.
Initialize the learning rate η and the parameter Δλ (typically with small values), and provide a criterion for stopping the learning procedure.
Set the epoch length, for instance, to the size of the training set or to a multiple or fraction thereof.
Set E_tol to the error value desired after learning.
Set the remaining counters and accumulators to 0.
Set the alive flag to TRUE for all weights w_ij.
Set the per-epoch alive flag to FALSE for all weights w_ij.
Set the previous epoch's average error to a large value.
Set the alive flag to FALSE for non-existing connections.
While training not stopped do
begin(while)
select a pattern pair (x, t).
** The following variables can be reset here because they were
last used in the previous loop iteration. **
set all components of the per-pattern accumulator vectors to 0
set all components of the per-pattern alive flags to FALSE
set the weight counter W to 0
** Forward pass. **
for all input units i do
begin
   y_i := x_i
end
for l := 2 to L do
begin
   for all units i in layer l do
   begin
      net_i := 0
      for all units j in layer l-1 do
      begin
         if (w_ij exists) do
         begin
            net_i := net_i + w_ij * y_j
         end
      end
      y_i := f_i(net_i)
   end
end
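As a concrete counterpart to the forward-pass pseudocode, here is a minimal sketch in Python; the data layout (one weight matrix and one activation function per layer, with None marking a non-existing connection) is an assumption for illustration:

```python
def forward_pass(x, layers):
    # layers: list of (weight_matrix, activation) pairs, one per layer
    # above the input layer.  weight_matrix[i][j] is w_ij, the weight
    # from unit j in the previous layer to unit i; None marks a
    # non-existing connection and is skipped, mirroring the existence
    # test in the pseudocode.
    y = list(x)                       # input-layer activations
    for W, f in layers:
        y = [f(sum(w * yj for w, yj in zip(row, y) if w is not None))
             for row in W]
    return y
```

Each layer maps the previous layer's activations through its weight matrix and activation function, exactly layer by layer as in the pseudocode.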
** Compute the error. **
for all output units k do
begin
   e_k := y_k - t_k
end
** Compute the error gradient. **
** First backward pass. **
for all output units do
begin
for l := L-1 down to 2 do
begin
for all units in layer do
begin
(IF THEN: for all units in layer ELSE: ) do
begin
if () do
begin
enforce abs(·) >= 1E-5 on the computed quantity
** to avoid division overflow **
end
end
end
end
end
** End of conventional backprop (phase 1). **
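The phase-1 backward pass is conventional backprop. The step below sketches how one layer's deltas follow from the next layer's, including the small-magnitude clipping that the pseudocode applies to avoid later division overflow; function and argument names are my own:

```python
import math

def hidden_deltas(deltas_next, W_next, fprime_nets, floor=1e-5):
    # One standard backprop step: propagate the next layer's deltas back
    # through W_next and multiply by f'(net_j).  W_next[i][j] is the
    # weight from this layer's unit j to the next layer's unit i.
    # Values whose magnitude falls below `floor` are clipped to
    # +/- floor, mirroring the 1E-5 guard in the pseudocode.
    deltas = []
    for j, fp in enumerate(fprime_nets):
        d = fp * sum(W_next[i][j] * deltas_next[i]
                     for i in range(len(deltas_next)))
        if abs(d) < floor:
            d = math.copysign(floor, d)
        deltas.append(d)
    return deltas
```

The gradient of the quadratic error with respect to w_ij is then the delta of unit i times the activation of unit j, as in standard backprop.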
** Compute the per-weight precision estimates Δw_ij. **
** We recommend introducing additional local variables for the inner loops. **
for all output units do
begin
for all , such that () do
begin
end
end
** Some weights are insignificant for computing the current pattern. **
for all , such that () do
begin
if (
) do
begin
end
end
for all , such that () do
begin
if (
) do
begin
TRUE
for all output units do
begin
end
end
else do
begin
TRUE
end
end
** update variables after having marked the current
pattern's insignificant weights **
for all output units do
begin
for all , such that ( AND NOT
) do
begin
end
end
for all output units do
begin
for all , such that ( AND NOT
) do
begin
for all output units do
begin
end
end
end
** Forward pass. **
for all output units do
begin
for all input units do
begin
end
for l := 2 to L do
begin
(IF THEN: for all units in layer ELSE: ) do
begin
for all units in layer do
begin
if ( AND NOT
) do
begin
end
end
end
end
end
** Second backward pass. **
for all output units do
begin
for l := L-1 down to 2 do
begin
for all units in layer do
begin
(IF THEN: for all units in layer ELSE: ) do
begin
if ( AND NOT
) do
begin
end
end
end
end
end
** Normalize B's gradient to the
length of E_q's gradient. **
for all , such that ( AND NOT
) do
begin
end
** End of B's gradient computation (phase 2). **
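The normalization step rescales the gradient of the additional term B so that its Euclidean length matches that of the quadratic-error gradient. A minimal sketch, with assumed flat vectors of gradient components:

```python
import math

def normalize_to_length(grad_b, grad_e):
    # Rescale grad_b so its Euclidean length equals grad_e's length.
    len_e = math.sqrt(sum(g * g for g in grad_e))
    len_b = math.sqrt(sum(g * g for g in grad_b))
    if len_b == 0.0:
        return list(grad_b)           # nothing to rescale
    return [(len_e / len_b) * g for g in grad_b]
```

This keeps the relative influence of the two error terms comparable regardless of the raw magnitude of B's gradient.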
** Weight update. **
for all , such that ( AND NOT
) do
begin
end
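Per weight, the update combines gradient descent on the quadratic error (learning rate η) with the λ-weighted descent on the additional term B. A one-weight sketch, with assumed scalar arguments:

```python
def update_weight(w, grad_e, grad_b, eta, lam):
    # Descend on the quadratic error E_q (learning rate eta) plus the
    # additional flatness term B (learning rate lam, i.e. lambda).
    return w - eta * grad_e - lam * grad_b
```

In phase 1 only the first gradient term is applied; in phase 2 the λ-weighted term is added for all weights still marked alive.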
** Update learning parameters. **
if (exemplar counter mod epoch length = 0) do ** ``mod'' is the modulo function **
begin
** Update λ according to Weigend et al. (1991). **
if (
OR ) do
begin
end
else do
begin
if () do
begin
if () do
begin
end
end
else do
begin
end
end
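The λ schedule follows Weigend et al. (1991); the sketch below is one possible reading of that schedule, not a transcription of the paper's exact case structure (see section 5.6). The constants and variable names here are illustrative assumptions:

```python
def update_lambda(lam, epoch_error, prev_error, weighted_avg, e_tol,
                  delta_lam=1e-4, cut=0.5):
    # One possible reading of the Weigend et al. (1991) schedule:
    # - error below tolerance, or decreasing: raise lambda by delta_lam;
    # - error rising but still below its exponentially weighted average:
    #   lower lambda by delta_lam;
    # - error rising above the weighted average as well: cut lambda
    #   (0.5 echoes the 'typical value 0.5' parameter in the notation).
    # All constants are illustrative, not taken from the paper.
    if epoch_error < e_tol or epoch_error < prev_error:
        return lam + delta_lam
    if epoch_error < weighted_avg:
        return lam - delta_lam
    return cut * lam
```

The idea is to strengthen the additional error term only while training is going well, and to back off quickly when it starts hurting the quadratic error.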
** Update the alive markers: a weight is alive if it was marked alive **
** for at least one pattern presented during the previous epoch. **
for all , such that () do
begin
if (
FALSE) do
begin
FALSE
end
FALSE
end
if (epoch counter mod 100 = 0 OR average error > 2.0 * E_tol) do
** Weights are re-animated if the average error is too large; weights are also **
** re-animated every 100th epoch, to enable faster reduction of the quadratic error **
** (due to weight changes, some previously dead weights may turn out to deserve **
** to live again); one may use values other than 100 and 2.0. **
begin
for all w_ij such that w_ij exists do
begin
   set the alive flag of w_ij to TRUE
end
end
end
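The re-animation test just described can be condensed into a small helper. A sketch under assumed names, with the weights' alive flags kept in a dictionary:

```python
def maybe_reanimate(alive, epoch, avg_error, e_tol, period=100, factor=2.0):
    # Re-mark every existing weight alive each `period`-th epoch, or
    # whenever the average error exceeds factor * e_tol.  The defaults
    # 100 and 2.0 are the values mentioned in the comments above; as
    # noted there, other values may be used.
    if epoch % period == 0 or avg_error > factor * e_tol:
        return {key: True for key in alive}
    return alive
```

Periodic re-animation gives previously pruned weights a chance to contribute again after the surviving weights have moved.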
decide whether to stop learning or not
end(while)