RELATION TO PREVIOUS WORK

Most previous algorithms for finding low-complexity networks with high generalization capability are based on different prior assumptions. They can be broadly classified into two categories (see Schmidhuber, 1994a, however, for an exception):

**(1) Assumptions about the prior weight distribution.**
Hinton and van Camp (1993) and Williams (1994) assume that pushing the posterior weight distribution close to the weight prior leads to "good" generalization (see details below). Weight decay (e.g., Hanson & Pratt, 1989; Krogh & Hertz, 1992) can be derived, e.g., from Gaussian or Laplace weight priors (see the derivation after this paragraph). Nowlan and Hinton (1992) assume that a distribution of networks with many similar weights, generated by Gaussian mixtures, is "better" *a priori*. MacKay's (1992b) weight priors are implicit in additional penalty terms, which embody the assumptions made. The problem with these approaches is that there may not be a "good" weight prior for *all* possible architectures and training sets. With FMS, however, we do not have to select a "good" weight prior; instead we choose a prior over input/output functions, which automatically takes the network architecture and the training set into account.
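
To make the weight-prior view concrete, here is the standard maximum a posteriori derivation of weight decay (our rendering, not tied to any one of the papers cited above), with prior scale hyperparameters σ and b:

```latex
% MAP estimation: minimize the negative log posterior
%   -\log p(w \mid D) = -\log p(D \mid w) - \log p(w) + \mathrm{const}.
% A Gaussian prior p(w_{ij}) \propto \exp(-w_{ij}^2 / 2\sigma^2)
% yields the familiar quadratic weight-decay penalty:
\begin{equation*}
E_{\mathrm{total}}(w) \;=\; E(w) \;+\; \frac{1}{2\sigma^2} \sum_{i,j} w_{ij}^2 ,
\end{equation*}
% while a Laplace prior p(w_{ij}) \propto \exp(-|w_{ij}| / b)
% yields an absolute-value penalty instead:
\begin{equation*}
E_{\mathrm{total}}(w) \;=\; E(w) \;+\; \frac{1}{b} \sum_{i,j} |w_{ij}| .
\end{equation*}
```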

**(2) Prior assumptions about how theoretical results on early stopping and network complexity carry over to practical applications.**
Such assumptions are implicit in methods based on validation sets (Mosteller & Tukey, 1968; Stone, 1974; Eubank, 1988; Hastie & Tibshirani, 1993), e.g., "generalized cross-validation" (Craven & Wahba, 1979; Golub et al., 1979), "final prediction error" (Akaike, 1970; written out below), and "generalized prediction error" (Moody & Utans, 1994; Moody, 1992). See also Holden (1994), Wang et al. (1994), Amari and Murata (1993), and Vapnik's "structural risk minimization" (Guyon et al., 1992; Vapnik, 1992).
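
As a concrete instance of such a criterion, Akaike's final prediction error for a model with p adjustable parameters fitted to N training examples can be written as follows (one standard form; the notation is ours):

```latex
% Final prediction error (Akaike, 1970): the training residual
% error, inflated by a factor that grows with p relative to N.
\begin{equation*}
\mathrm{FPE} \;=\; \frac{N+p}{N-p} \cdot \frac{1}{N} \sum_{k=1}^{N} \bigl( y^{(k)} - \hat{y}^{(k)} \bigr)^2 .
\end{equation*}
```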

**Constructive algorithms / pruning algorithms.**
Other architecture selection methods are less flexible in the sense that they can be used only before or after weight adjustment. Examples are "sequential network construction" (Fahlman & Lebiere, 1990; Ash, 1989; Moody, 1989), input pruning (Moody, 1992; Refenes et al., 1994), unit pruning (White, 1989; Mozer & Smolensky, 1989; Levin et al., 1994), and weight pruning, e.g., "optimal brain damage" (LeCun et al., 1990; sketched below) and "optimal brain surgeon" (Hassibi & Stork, 1993).
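
For illustration, here is a minimal sketch of the "optimal brain damage" saliency computation under the usual diagonal Hessian approximation; the function names and the toy numbers are ours, not code from LeCun et al. (1990):

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Optimal-brain-damage saliency s_i = 0.5 * h_ii * w_i**2,
    where h_ii are the diagonal entries of the Hessian of the
    training error with respect to the weights."""
    return 0.5 * hessian_diag * weights**2

def prune_least_salient(weights, hessian_diag, fraction=0.1):
    """Zero out the given fraction of weights with the lowest saliency."""
    s = obd_saliencies(weights, hessian_diag)
    k = int(len(weights) * fraction)
    pruned = weights.copy()
    pruned[np.argsort(s)[:k]] = 0.0  # least salient weights first
    return pruned

# Toy usage with made-up numbers:
w = np.array([0.8, -0.05, 1.2, 0.01])
h = np.array([0.5, 0.4, 0.1, 2.0])   # e.g., a diagonal Hessian estimate
print(prune_least_salient(w, h, fraction=0.5))  # -> [0.8, 0.0, 1.2, 0.0]
```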

**Hinton and van Camp** (1993).
They minimize the sum of two terms: the first is the conventional error plus variance; the other is the distance between the posterior and the weight prior. They have to choose a "good" *weight* prior. But, as mentioned above, perhaps there is no "good" weight prior for *all* possible architectures and training sets. With FMS, however, we do not depend on a "good" weight prior; instead we have a prior over input/output functions, thus taking the network architecture and the training set into account.
Furthermore, Hinton and van Camp have to compute variances of weights and unit activations, which (in general) *cannot* be done using a linear approximation. Intuitively speaking, their weight variances are related to our Δw_ij. Our approach, however, *does* justify the linear approximation, as shown in Appendix A.2.
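
A minimal sketch of the kind of penalty term Hinton and van Camp minimize, assuming (as in their tractable case) a diagonal Gaussian posterior over the weights and a zero-mean Gaussian weight prior; the variable names are ours:

```python
import numpy as np

def kl_posterior_to_prior(mu, sigma, prior_sigma):
    """KL divergence between a diagonal Gaussian posterior N(mu, sigma^2)
    over the weights and a zero-mean Gaussian prior N(0, prior_sigma^2).
    This is the 'distance between posterior and weight prior' term;
    the full objective adds the expected training error."""
    return np.sum(
        np.log(prior_sigma / sigma)
        + (sigma**2 + mu**2) / (2.0 * prior_sigma**2)
        - 0.5
    )

# Toy usage:
mu = np.array([0.3, -1.1, 0.0])    # posterior means, one per weight
sigma = np.array([0.2, 0.5, 1.0])  # posterior standard deviations
print(kl_posterior_to_prior(mu, sigma, prior_sigma=1.0))
```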

**Wolpert** (1994a).
His (purely theoretical) analysis suggests an interesting, different additional error term (taking local flatness in all directions into account): the logarithm of the Jacobian determinant of the mapping from weight space to the space of possible nets. This term is small if the net output (based on the current weight vector) is locally flat in weight space, that is, if many neighboring weight vectors lead to the same net function in the space of possible net functions. It is not clear, however, how to derive a practical algorithm (e.g., a pruning algorithm) from this.
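
To make the quantity concrete in our notation (Wolpert's setting is more general, and this rendering is our assumption): stack the network outputs on the training inputs into a vector f(w); when the Jacobian is not square, |det J| can be read as (det JᵀJ)^{1/2}:

```latex
% Jacobian of the weight-to-function mapping at the current weights w:
\begin{equation*}
J(w) \;=\; \frac{\partial f(w)}{\partial w} ,
\qquad \text{extra term} \;\sim\; \log \bigl| \det J(w) \bigr| .
\end{equation*}
% Where many neighboring weight vectors yield (nearly) the same net
% function, J(w) is (nearly) singular and the log-determinant becomes
% strongly negative, so locally flat regions are preferred.
```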

**Murray and Edwards** (1993).
They obtain additional error terms consisting of squared weights and second-order derivatives. Unlike our approach, theirs *explicitly* prefers weights near zero. In addition, their approach appears to require much more computation time, due to the second-order derivatives in the error term.
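
Schematically, their objective has the following shape; this is a generic form showing the ingredients (squared weights, second-order derivatives), with the λ's and the exact combination being our schematic assumptions rather than their published term:

```latex
% A penalty built from squared weights and second-order derivatives
% of the error E with respect to the weights:
\begin{equation*}
E_{\mathrm{total}}(w) \;=\; E(w)
\;+\; \lambda_1 \sum_{i,j} w_{ij}^2
\;+\; \lambda_2 \sum_{i,j} w_{ij}^2 \, \frac{\partial^2 E}{\partial w_{ij}^2} .
\end{equation*}
% The squared-weight factors make the preference for near-zero weights
% explicit; the second derivatives make the term expensive to compute.
```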
