DETAILS / PARAMETERS


With the exception of the experiment in section 5.2, all units are sigmoid in the range of [...]. Weights are constrained to [...] and initialized in $[-0.1, 0.1]$. The latter ensures high first order derivatives at the beginning of the learning phase. Weight decay (WD) is set up to hardly punish weights below [...]. $E_{\mbox{{\scriptsize average}}}$ is the average error on the training set, approximated using exponential decay: $E_{\mbox{{\scriptsize average}}} \leftarrow \gamma E_{\mbox{{\scriptsize average}}} + (1-\gamma) E(net(w),D_0)$, where $\gamma = 0.85$.
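The exponential-decay update of $E_{\mbox{{\scriptsize average}}}$ can be illustrated with a minimal Python sketch. The function name and the starting value of 0.0 are assumptions made for illustration; the text only specifies the update rule and $\gamma = 0.85$.

```python
def update_average_error(e_average, e_current, gamma=0.85):
    """One exponential-decay update: E_avg <- gamma * E_avg + (1 - gamma) * E."""
    return gamma * e_average + (1.0 - gamma) * e_current


# Assumed starting value; the text does not say how E_average is initialized.
e_average = 0.0
for e_current in [1.0, 1.0, 1.0]:
    e_average = update_average_error(e_average, e_current)
```

With a constant error of 1.0, the running value approaches 1.0 geometrically, so after $k$ updates it equals $1 - \gamma^k$.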

FMS details. To control the FMS regularization term's influence during learning, its gradient is normalized and multiplied by the length of $E(net(w),D_0)$'s gradient (the same is done for weight decay, see below). $\lambda$ is computed as in (Weigend et al., 1991) and initialized with 0. Absolute values of first order derivatives are replaced by $10^{-20}$ if below this value. We ought to judge a weight $w_{ij}$ as being pruned if $\delta w_{ij}$ (see equation (5) in section 4) exceeds the length of the weight range. However, the unknown scaling factor $\epsilon$ (see inequality (3) and equation (5) in section 4) is required to compute $\delta w_{ij}$. Therefore, we judge a weight $w_{ij}$ as being pruned if, with arbitrary $\epsilon$, $\delta w_{ij}$ is much bigger than the corresponding $\delta$'s of the other weights (typically, there are clearly separable classes of weights with high and low $\delta$'s, which differ from each other by a factor ranging from [...] to [...]).
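The gradient rescaling and the derivative floor described above might look as follows in Python. This is only a sketch: the function names are hypothetical, gradients are represented as flat lists, and the handling of a zero-length regularizer gradient is an assumption the text does not address.

```python
import math

MIN_ABS_DERIVATIVE = 1e-20  # floor stated in the text


def floor_abs_derivative(d):
    """Return |d|, replaced by 1e-20 if it falls below that value."""
    return max(abs(d), MIN_ABS_DERIVATIVE)


def vector_length(v):
    """Euclidean length of a gradient stored as a flat list."""
    return math.sqrt(sum(x * x for x in v))


def rescale_regularizer_gradient(reg_grad, err_grad):
    """Normalize reg_grad, then scale it to the length of err_grad."""
    reg_len = vector_length(reg_grad)
    if reg_len == 0.0:
        # Assumption: leave an all-zero regularizer gradient unchanged.
        return list(reg_grad)
    scale = vector_length(err_grad) / reg_len
    return [scale * g for g in reg_grad]


rescaled = rescale_regularizer_gradient([3.0, 4.0], [0.0, 10.0])
```

After rescaling, the regularizer gradient keeps its direction but has the same length as the error gradient, so $\lambda$ alone determines the relative weight of the two terms.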

If all weights to and from a particular unit are very close to zero, the unit is lost: due to tiny derivatives, the weights will never again increase significantly. Sometimes, it is necessary to bring lost units back into the game. For this purpose, every $n_{init}$ time steps (typically, $n_{init} = 500{,}000$), all weights $w_{ij}$ with $0 \leq w_{ij} < 0.01$ are randomly re-initialized in [...], all weights $w_{ij}$ with $0 \geq w_{ij} > -0.01$ are randomly re-initialized in [...], and $\lambda$ is set to 0.
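The re-initialization of near-zero weights could be sketched as below. The two target intervals are not recoverable from the text, so the function takes them as required arguments; the intervals used in the example call are placeholders, not the original values. The function name, the uniform sampling, and the assignment of a weight of exactly 0 to the positive branch are likewise assumptions.

```python
import random


def reinitialize_small_weights(weights, pos_range, neg_range, rng, threshold=0.01):
    """Re-draw weights with |w| < threshold; leave all other weights unchanged.

    pos_range and neg_range are (low, high) tuples supplied by the caller.
    """
    result = []
    for w in weights:
        if 0.0 <= w < threshold:
            # Small non-negative weight (w == 0 is treated as positive here).
            result.append(rng.uniform(*pos_range))
        elif -threshold < w < 0.0:
            # Small negative weight.
            result.append(rng.uniform(*neg_range))
        else:
            result.append(w)
    return result


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
# Placeholder intervals, NOT the values from the original text.
example = reinitialize_small_weights(
    [0.5, 0.001, -0.002, -3.0], (0.2, 0.3), (-0.3, -0.2), rng
)
```

In a training loop this would be called once every $n_{init}$ time steps, together with resetting $\lambda$ to 0.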

Weight decay details. We used Weigend et al.'s weight decay term: $D(w) = \sum_{i,j} \frac{w_{ij}^2/w_0}{1 + w_{ij}^2/w_0}$. As with FMS, $D(w)$'s gradient was normalized and multiplied by the length of $E(net(w),D_0)$'s gradient. $\lambda$ was adjusted as with FMS. Lost units were brought back as with FMS.
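For reference, the weight decay term as written above and its derivative with respect to a single weight, $\frac{\partial D}{\partial w_{ij}} = \frac{2 w_{ij}/w_0}{(1 + w_{ij}^2/w_0)^2}$, can be computed as in the following sketch. The function names are hypothetical, weights are stored as a flat list, and the value of `w0` in the example is a placeholder.

```python
def weight_decay(weights, w0):
    """D(w) = sum over weights of (w^2 / w0) / (1 + w^2 / w0)."""
    return sum((w * w / w0) / (1.0 + w * w / w0) for w in weights)


def weight_decay_gradient(weights, w0):
    """dD/dw = (2 w / w0) / (1 + w^2 / w0)^2 for each weight."""
    return [(2.0 * w / w0) / (1.0 + w * w / w0) ** 2 for w in weights]


# Placeholder value for w0, chosen only for illustration.
penalty = weight_decay([1.0, 0.0], w0=1.0)
gradient = weight_decay_gradient([1.0, 0.0], w0=1.0)
```

Each summand lies in $[0, 1)$ and saturates for large weights, so the penalty roughly counts the weights that are large relative to $w_0$.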

Modifications of OBS. Typically, most weights exceed 1.0 after training. Therefore, higher order terms of $\delta w$ in the Taylor expansion of the error function do not vanish. Hence, OBS is not fully theoretically justified. Still, we used OBS to delete high weights, assuming that higher order derivatives are small if second order derivatives are. To obtain reasonable performance, we modified the original OBS procedure (notation following Hassibi and Stork, 1993):

To detect the weight that deserves deletion, we use both $L_q = \frac{w_q^2}{[H^{-1}]_{qq}}$ (the original value used by Hassibi and Stork) and $T_q := \frac{\partial E}{\partial w_q} w_q + \frac{1}{2} \frac{\partial^2 E}{\partial w_q^2} w_q^2$. Here $H$ denotes the Hessian and $H^{-1}$ its approximate inverse. We delete the weight causing minimal training set error (after tentative deletion).
As with OBD (LeCun et al., 1990), to prevent numerical errors due to small eigenvalues of $H$, we do the following: if [...] or [...] or $\parallel I - H^{-1}H \parallel > 10.0$ (bad approximation of $H^{-1}$), we only delete the weight detected in the previous step; the other weights remain the same. Here $\parallel . \parallel$ denotes the sum of the absolute values of all components of a matrix.
If OBS' adjustment of the remaining weights leads to at least one absolute weight change exceeding 5.0, then $\delta w$ is scaled such that the maximal absolute weight change is 5.0. This leads to better performance (also due to small eigenvalues).
If $E_{\mbox{{\scriptsize average}}} > E_{tol}$ after weight deletion, then the net is retrained until either $E_{\mbox{{\scriptsize average}}} < E_{tol}$ or the number of training examples exceeds 800,000. Practical experience indicates that the choice of $E_{tol}$ hardly influences the result.
OBS is stopped if $E_{\mbox{{\scriptsize average}}} > E_{tol}$ after retraining. The most recent weight deletion is countermanded.
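Some of the quantities used in the modified OBS procedure can be sketched in Python: the two saliency measures $L_q$ and $T_q$, the quality check $\parallel I - H^{-1}H \parallel > 10.0$, and the rescaling of $\delta w$ to a maximal absolute change of 5.0. All function names are hypothetical, matrices are plain lists of lists, and the Hessian, its approximate inverse, and the gradient are assumed to be computed elsewhere.

```python
def saliency_L(w, h_inv_diag, q):
    """L_q = w_q^2 / [H^{-1}]_qq, as written in the text."""
    return w[q] ** 2 / h_inv_diag[q]


def saliency_T(w, grad, hess_diag, q):
    """T_q = (dE/dw_q) w_q + 0.5 (d^2E/dw_q^2) w_q^2."""
    return grad[q] * w[q] + 0.5 * hess_diag[q] * w[q] ** 2


def inverse_residual(h_inv, h):
    """Sum of absolute values of all components of I - H^{-1} H."""
    n = len(h)
    total = 0.0
    for i in range(n):
        for j in range(n):
            product = sum(h_inv[i][k] * h[k][j] for k in range(n))
            identity = 1.0 if i == j else 0.0
            total += abs(identity - product)
    return total


def is_bad_inverse(h_inv, h, limit=10.0):
    """True if the approximate inverse fails the quality check."""
    return inverse_residual(h_inv, h) > limit


def clip_weight_change(delta_w, max_change=5.0):
    """Scale delta_w so that its largest absolute component is at most max_change."""
    if not delta_w:
        return []
    largest = max(abs(d) for d in delta_w)
    if largest <= max_change:
        return list(delta_w)
    scale = max_change / largest
    return [scale * d for d in delta_w]


clipped = clip_weight_change([10.0, -2.0])
```

Scaling the whole vector, rather than clipping each component separately, keeps the direction of OBS' weight adjustment and only shortens the step.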


Juergen Schmidhuber 2003-02-13
