DETAILS / PARAMETERS

**FMS details.**
To control the influence of the flatness term B during learning, its gradient is normalized and multiplied by the length of the error function's gradient (the same is done for weight decay, see below). The tradeoff parameter λ is computed as in (Weigend et al., 1991) and initialized with 0. Absolute values of first-order derivatives that fall below a fixed small constant are replaced by that constant.

We ought to judge a weight as pruned if its change δw (see equation (5) in section 4) exceeds the length of the weight range. However, computing δw requires the unknown scaling factor (see inequality (3) and equation (5) in section 4). Therefore, we judge a weight as pruned if, for an arbitrary choice of this factor, its δw is much bigger than the corresponding δw's of the other weights. Typically, there are clearly separable classes of weights with high and low δw's, which differ from each other by a large factor.
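
The gradient-balancing scheme just described can be sketched in a few lines. This is a minimal illustration, not the authors' code; the names `grad_E`, `grad_B`, and `lam` are our own:

```python
import numpy as np

def balanced_gradient(grad_E, grad_B, lam):
    """Combine error and regularizer gradients: the regularizer
    gradient is normalized to unit length, then rescaled to the
    length of the error gradient, so the tradeoff parameter lam
    alone controls the regularizer's relative influence."""
    norm_B = np.linalg.norm(grad_B)
    norm_E = np.linalg.norm(grad_E)
    if norm_B == 0.0:  # nothing to rescale; regularizer is flat here
        return grad_E
    return grad_E + lam * (grad_B / norm_B) * norm_E
```

With this rescaling, the regularizer's contribution always has the same length as the error gradient, regardless of the raw magnitude of grad_B.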

If all weights to and from a particular unit are very close to zero, the unit is lost: due to tiny derivatives, the weights will never again increase significantly. Sometimes it is necessary to bring lost units back into the game. For this purpose, every n time steps (typically, n = 500,000), weights of small magnitude are randomly re-initialized within one fixed interval, the remaining weights are randomly initialized within another, and λ is reset to 0.
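
A minimal sketch of reviving lost units might look as follows; the threshold `thresh` and interval half-width `box` are illustrative placeholders (the source does not state the actual values), and only the small-magnitude class of weights is handled here:

```python
import numpy as np

def revive_lost_units(w, thresh=0.01, box=0.1, rng=None):
    """Hypothetical sketch: re-draw near-zero weights uniformly
    from [-box, box] so that 'lost' units receive non-negligible
    gradients again. thresh and box are placeholder values."""
    rng = np.random.default_rng() if rng is None else rng
    w = w.copy()                       # do not mutate the caller's array
    lost = np.abs(w) < thresh          # weights belonging to lost units
    w[lost] = rng.uniform(-box, box, size=lost.sum())
    return w
```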

**Weight decay details.**
We used Weigend et al.'s weight decay term D(w) = Σᵢ (wᵢ²/w₀²) / (1 + wᵢ²/w₀²). As with FMS, D's gradient was normalized and multiplied by the length of the error function's gradient, λ was adjusted as with FMS, and lost units were brought back as with FMS.
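
Weigend et al.'s decay term and its analytic gradient are straightforward to write down. The sketch below uses w₀ = 1.0 as a placeholder choice:

```python
import numpy as np

def weigend_decay(w, w0=1.0):
    """Weigend et al.'s (1991) weight decay term
    D(w) = sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2)."""
    r = (w / w0) ** 2
    return np.sum(r / (1.0 + r))

def weigend_decay_grad(w, w0=1.0):
    """Analytic gradient: dD/dw_i = 2 w_i / (w0^2 (1 + (w_i/w0)^2)^2)."""
    r = (w / w0) ** 2
    return 2.0 * w / (w0 ** 2 * (1.0 + r) ** 2)
```

Note the saturating form: for |wᵢ| ≫ w₀ the per-weight penalty approaches 1 and its gradient vanishes, so large weights are penalized much less than under plain quadratic decay.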

**Modifications of OBS.**
Typically, most weights exceed 1.0 after training. Therefore, higher-order terms of the weight change δw in the Taylor expansion of the error function do not vanish, so OBS is not fully theoretically justified. Still, we used OBS to delete high weights, assuming that higher-order derivatives are small if second-order derivatives are. To obtain reasonable performance, we modified the original OBS procedure (notation following Hassibi and Stork, 1993):

- To detect the weight w_q that deserves deletion, we use both the saliency w_q² / (2 (H⁻¹)_qq) (the original value used by Hassibi and Stork) and a second saliency measure; here H denotes the Hessian and H⁻¹ its approximate inverse. Among the candidates, we delete the weight whose tentative deletion causes minimal training set error.
- As with OBD (LeCun et al., 1990), we guard against numerical errors due to small eigenvalues of H: whenever threshold tests (based on the norm defined below) indicate a bad approximation of H⁻¹, we delete only the weight detected in the previous step and leave the other weights unchanged. Here ||A|| denotes the sum of the absolute values of all components of the matrix A.
- If OBS's adjustment of the remaining weights leads to at least one absolute weight change exceeding 5.0, the adjustment vector δw is scaled such that the maximal absolute weight change is 5.0. This leads to better performance (the excessive changes, too, are caused by small eigenvalues).
- If the training set error exceeds a tolerance threshold after a weight deletion, the net is retrained until either the error falls below the threshold again or the number of training examples presented exceeds 800,000. Practical experience indicates that the choice of this threshold hardly influences the result.
- OBS is stopped if the error remains too high after retraining; in that case, the most recent weight deletion is countermanded.
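
Putting the pieces together, a minimal NumPy sketch of the Hassibi–Stork saliency and the capped adjustment step might look as follows. Pinning the deleted weight exactly to zero after scaling is our assumption, not stated in the source:

```python
import numpy as np

def obs_saliency(w, H_inv, q):
    """Hassibi & Stork's saliency for weight q: w_q^2 / (2 (H^-1)_qq)."""
    return w[q] ** 2 / (2.0 * H_inv[q, q])

def obs_update(w, H_inv, q, max_change=5.0):
    """One OBS deletion step with the capped adjustment described
    above. H_inv is the approximate inverse Hessian, q the index
    of the weight to delete; max_change = 5.0 follows the text."""
    # Standard OBS adjustment of the remaining weights:
    # delta = -(w_q / (H^-1)_qq) * H^-1 e_q
    delta = -(w[q] / H_inv[q, q]) * H_inv[:, q]
    # Scale delta so that no single weight moves by more than max_change.
    biggest = np.max(np.abs(delta))
    if biggest > max_change:
        delta = delta * (max_change / biggest)
    w_new = w + delta
    w_new[q] = 0.0  # our assumption: the deleted weight is zeroed exactly
    return w_new
```

Without scaling, delta already drives w_q to zero (its q-th component equals -w_q); the explicit zeroing only matters when the cap shrinks the adjustment.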
