

EXPERIMENTAL RESULTS (see [4] for details)

EXPERIMENT 1 - noisy classification. The first experiment is taken from Pearlmutter and Rosenfeld [9]. The task is to decide whether the $x$-coordinate of a point in 2-dimensional space exceeds zero (class 1) or does not (class 2). Noisy training examples are generated as follows: data points are drawn from a Gaussian with zero mean and stdev 1.0, bounded to the interval $[-3.0,3.0]$. Each data point is misclassified with probability $0.05$. The final input data are obtained by adding zero-mean Gaussian noise with stdev 0.15 to the data points. In a test with 2,000,000 data points, this procedure was found to produce 9.27 per cent misclassified data; due to the inherent noise, no method can misclassify less than 9.27 per cent. The training set consists of 200 fixed data points, the test set of 120,000 data points.
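A minimal sketch of this data-generation procedure, assuming numpy (out-of-range Gaussian draws are clipped to $[-3.0,3.0]$ here; the original may redraw instead -- an assumption, as are all names below):

    import numpy as np

    def generate_data(n, rng):
        # Draw x- and y-coordinates from a zero-mean, unit-variance Gaussian,
        # bounded to the interval [-3.0, 3.0] (clipping is an assumption).
        points = np.clip(rng.normal(0.0, 1.0, size=(n, 2)), -3.0, 3.0)
        # Class 1 if the x-coordinate exceeds zero, class 2 otherwise.
        labels = np.where(points[:, 0] > 0.0, 1, 2)
        # Flip each label with probability 0.05.
        flip = rng.random(n) < 0.05
        labels = np.where(flip, 3 - labels, labels)
        # Final inputs: add zero-mean Gaussian noise with stdev 0.15.
        inputs = points + rng.normal(0.0, 0.15, size=(n, 2))
        return inputs, labels

    rng = np.random.default_rng(0)
    train_x, train_y = generate_data(200, rng)       # 200 fixed training points
    test_x, test_y = generate_data(120000, rng)      # 120,000 test points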

Results. 10 conventional backprop (BP) nets were tested against 10 equally initialized networks trained with our new method (``flat minima search'', FMS). After 1,000 epochs, the weights of our nets essentially stopped changing (automatic ``early stopping''), while backprop kept changing its weights to learn the outliers in the data set, thus overfitting. In the end, our approach left a single hidden unit $h$ with a maximal weight of $30.0$ or $-30.0$ from the x-axis input. Unlike with backprop, the other hidden units were effectively pruned away (outputs near zero), and so was the y-axis input (zero weight to $h$). It can be shown that this corresponds to an ``optimal'' net with minimal numbers of units and weights (cf. the sketch following Table 1). Table 1 illustrates the superior performance of our approach.


Table 1: 10 comparisons of conventional backprop (BP) and our new method (FMS). The columns labeled ``MSE'' show the mean squared error on the test set. The columns labeled ``dto'' show the difference between the fraction (in per cent) of misclassifications and the optimal fraction (9.27). The new approach clearly outperforms backprop.
         Backprop        New approach              Backprop        New approach
         MSE    dto      MSE    dto                MSE    dto      MSE    dto
     1   0.220  1.35     0.193  0.00          6    0.219  1.24     0.187  0.04
     2   0.223  1.16     0.189  0.09          7    0.215  1.14     0.187  0.07
     3   0.222  1.37     0.186  0.13          8    0.214  1.10     0.185  0.01
     4   0.213  1.18     0.181  0.01          9    0.218  1.21     0.190  0.09
     5   0.222  1.24     0.195  0.25         10    0.214  1.21     0.188  0.07
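To see why the surviving single-unit net of experiment 1 is essentially optimal: with weight $\pm 30.0$ from the x-axis input and zero weight from the y-axis input, the hidden unit's sigmoid saturates almost immediately on either side of $x=0$, i.e. it approximates the Bayes-optimal decision boundary. A small illustrative sketch (the logistic sigmoid and the printed values are for illustration only, not taken from the original):

    import numpy as np

    def hidden_activation(x, w=30.0):
        # Single surviving hidden unit: logistic sigmoid of w * x with w = +/-30.0;
        # the y-axis input contributes nothing (zero weight).
        return 1.0 / (1.0 + np.exp(-w * x))

    for x in (-0.5, -0.1, 0.1, 0.5):
        print(x, round(hidden_activation(x), 4))
    # The activation saturates towards 0 for x < 0 and towards 1 for x > 0,
    # so the unit by itself realizes the optimal decision boundary at x = 0.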


EXPERIMENT 2 - recurrent nets. The method also works for continually running fully recurrent nets. At every time step, a recurrent net with sigmoid activations in $[0,1]$ sees an input vector randomly chosen from the set $\{(0,0),(0,1),(1,0),(1,1)\}$. The task is to switch on the first output unit whenever the input $(1,0)$ occurred two time steps ago, and to switch on the second output unit without delay whenever the current input is $(0,1)$. The task can be solved by a single hidden unit.
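A minimal sketch of the input/target stream for this task, assuming numpy (the generator and names below are illustrative, not the original setup):

    import numpy as np

    INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]

    def make_stream(length, rng):
        # Input vectors are chosen at random from {(0,0),(0,1),(1,0),(1,1)}.
        xs = [INPUTS[i] for i in rng.integers(0, 4, size=length)]
        targets = []
        for t, x in enumerate(xs):
            # First output unit: on iff the input two time steps ago was (1,0).
            first = 1 if t >= 2 and xs[t - 2] == (1, 0) else 0
            # Second output unit: on without delay whenever the input is (0,1).
            second = 1 if x == (0, 1) else 0
            targets.append((first, second))
        return xs, targets

    xs, ys = make_stream(20, np.random.default_rng(1))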

Results. With conventional recurrent net algorithms, both hidden units were used to store the input vector after training. Not so with our new approach. We trained 20 networks; all of them learned perfect solutions. As with weight decay, most weights to the output decayed to zero. But unlike with weight decay, strong inhibitory connections ($-30.0$) switched off one of the hidden units, effectively pruning it away.

EXPERIMENT 3 - stock market prediction. We predict the DAX (German stock market index) based on fundamental (experiments 3.1 and 3.2) and technical (experiment 3.3) indicators. We use strictly layered feedforward nets with sigmoid units active in $[-1,1]$ and the following performance measures:

Confidence: an output $o > \alpha$ indicates a positive tendency; $o < -\alpha$ indicates a negative tendency. Performance: the sum of the confidently but incorrectly predicted DAX changes is subtracted from the sum of the confidently and correctly predicted ones; the result is divided by the sum of all absolute DAX changes (a sketch of this measure follows the experiment descriptions below).
EXPERIMENT 3.1: Fundamental inputs: (a) German interest rate (``Umlaufsrendite''), (b) industrial production divided by money supply, (c) business sentiments (``IFO Geschäftsklimaindex''). 24 training examples, 68 test examples, quarterly prediction, confidence: $\alpha =$ 0.0/0.6/0.9, architecture: (3-8-1).
EXPERIMENT 3.2: Fundamental inputs: (a), (b), (c) as in exp. 3.1, (d) dividend rate, (e) foreign orders in manufacturing industry. 228 training examples, 100 test examples, monthly prediction, confidence: $\alpha =$ 0.0/0.6/0.8, architecture: (5-8-1).
EXPERIMENT 3.3: Technical inputs: (a) 8 most recent DAX-changes, (b) DAX, (c) change of 24-week relative strength index (``RSI''), (d) difference of ``5 week statistic'', (e) ``MACD'' (difference of exponentially weighted 6 week and 24 week DAX). 320 training examples, 100 test examples, weekly predictions, confidence: $\alpha =$ 0.0/0.2/0.4, architecture: (12-9-1).
The following methods are tested: (1) Conventional backprop (BP), (2) optimal brain surgeon (OBS [2]), (3) weight decay (WD []), (4) flat minima search (FMS).
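A minimal sketch of the confidence-based performance measure defined above, assuming numpy; the names outputs, changes, and performance are illustrative and not from the original, and the reading of "confidently predicted changes" as absolute change magnitudes is an assumption:

    import numpy as np

    def performance(outputs, changes, alpha):
        # A prediction is confident if |output| > alpha; the sign of the output
        # is the predicted tendency, the sign of the change the true tendency.
        confident = np.abs(outputs) > alpha
        correct = np.sign(outputs) == np.sign(changes)
        # Sum of confidently and correctly predicted changes minus the sum of
        # confidently but incorrectly predicted ones, divided by total movement.
        gained = np.abs(changes)[confident & correct].sum()
        lost = np.abs(changes)[confident & ~correct].sum()
        return (gained - lost) / np.abs(changes).sum()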

Results. Our method clearly outperforms the other methods. FMS is up to 63 per cent better than the best competitor (see [4] for details).



