What is the goal of sensory coding? There is not yet a generally agreed-upon answer to Field's (1994) question. Several information-theoretic objective functions (OFs) have been proposed to evaluate the quality of sensory codes. Most OFs focus on statistical properties of the code components (such as mutual information) -- we refer to them as code component-oriented OFs, or COCOFs. Some COCOFs explicitly favor near-factorial, minimally redundant codes of the input data (see, e.g., Watanabe 1985, Barlow et al. 1989, Linsker 1988, Schmidhuber 1992, Schmidhuber and Prelinger 1993, Schraudolph and Sejnowski 1993, Redlich 1993, Deco and Parra 1994). Such codes can be advantageous for (1) data compression, (2) speeding up subsequent gradient descent learning (e.g., Becker 1991), and (3) simplifying subsequent Bayes classifiers (e.g., Schmidhuber et al. 1996). Other approaches favor local codes, e.g., Rumelhart and Zipser (1986), Barrow (1987), Kohonen (1988). These can help to achieve (1) minimal crosstalk, (2) subsequent gradient descent speed-ups, (3) facilitation of post-training analysis, and (4) simultaneous representation of different data items. Recently there has also been much work on COCOFs encouraging biologically plausible sparse distributed codes, e.g., Field (1987), Barlow (1983), Mozer (1991), Földiák (1990), Földiák and Young (1995), Palm (1992), Zemel and Hinton (1994), Field (1994), Saund (1994), Dayan and Zemel (1995), Li (1995), Olshausen and Field (1996), Zemel (1993), Hinton and Ghahramani (1997). Sparse codes share certain advantages of both local and dense codes.

**But what about coding costs?**
COCOFs express desirable properties of the code itself,
while neglecting the costs of constructing the code from the data.
For instance, coding input data without redundancy may
be very expensive in terms of information bits required to
describe the code-generating network, which may need
many finely tuned free parameters.
In fact, the most compact code
of the possible environmental inputs
would be the ``true'' probabilistic causal model corresponding to our
universe's most basic physical laws.
Generating this code and using it for dealing
with everyday events, however, would be extremely
inefficient.

A previous argument for ignoring coding costs
(e.g., Zemel 1993, Zemel and Hinton 1994, Hinton and Zemel 1994),
based on the principle of minimum description length
(MDL, e.g., Solomonoff 1964, Wallace and Boulton 1968,
Rissanen 1978),
focuses on hypothetical costs of transmitting the data
from some sender to a receiver --
how many bits are necessary
to enable the receiver to reconstruct the data?
It goes more or less like this:
*``Total transmission cost
is the number of bits required to describe (1) the data's code,
(2) the reconstruction error and
(3) the decoding procedure.
Since all input exemplars are encoded/decoded by the same
mapping, the coding/decoding costs are negligible because they
occur only once.''*
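The amortization claim in this argument can be made concrete with a toy calculation (all bit counts below are hypothetical, chosen only for illustration): the one-off cost of transmitting the decoding procedure vanishes per exemplar as the number of transmitted exemplars grows.

```python
def total_bits(n_exemplars, code_bits, error_bits, decoder_bits):
    # Per-exemplar costs (the code plus the reconstruction error) are
    # paid for every exemplar; the decoding procedure is sent only once.
    return n_exemplars * (code_bits + error_bits) + decoder_bits

few = total_bits(10, code_bits=8, error_bits=2, decoder_bits=10_000)
many = total_bits(100_000, code_bits=8, error_bits=2, decoder_bits=10_000)
# Per-exemplar cost drops from 1010.0 to 10.1 as the decoder is amortized.
```

With only 10 exemplars the decoder dominates the bill; with 100,000 it contributes a tenth of a bit per exemplar, which is why the argument treats coding/decoding costs as negligible.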

We doubt, however, that sensory coding's sole objective
should be to transform data into a compact code that is cheaply transmittable
across some ideal, abstract channel.
We believe that one of sensory coding's objectives
should be to reduce the cost of code *generation* through
data transformations
in *existing* channels (e.g., synapses)^{1}.
Without denying the usefulness of certain COCOFs,
we postulate that an important scarce resource
is the bits required to describe the
mappings that generate and process the codes --
after all, it is these mappings that need to be implemented,
given some limited hardware.

**Lococodes.**
For such reasons we shift the point of view and
focus on the information-theoretic costs of code-generation
(compare Pajunen (1998) for recent related work).
We will present a novel approach to unsupervised learning called
``*lo*w-*co*mplexity *co*ding and *de*coding''
(LOCOCODE -- see also Hochreiter and Schmidhuber 1997b,
1997c, 1998).
Without assuming particular goals such as data compression,
simplifying subsequent classification, etc.,
but in the MDL spirit,
LOCOCODE generates so-called *lococodes*
that (1) convey information about the input data,
(2) can be computed from the data by a low-complexity mapping (LCM), and (3) can
be decoded by an LCM. By minimizing coding/decoding costs,
LOCOCODE will yield efficient, robust, noise-tolerant
mappings for processing inputs and codes.

**Lococodes through FMS.**
To implement LOCOCODE we
apply *Flat Minimum Search*
(FMS, Hochreiter and Schmidhuber 1997a)
to an autoassociator (AA) whose hidden layer
activations represent the code.
FMS is a general, gradient-based method for finding
networks that can be described
with few bits of information.
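The underlying architecture can be sketched in a few lines. The following toy linear autoassociator is only an illustration of the setup, not of FMS itself: plain weight decay stands in for the FMS regularizer (reviewed in Section 2), and the data, dimensions, and learning rate are all hypothetical.

```python
import numpy as np

# Toy data: noisy observations generated from three hidden "causes".
rng = np.random.default_rng(0)
S = rng.normal(size=(200, 3))                  # hypothetical true causes
M = rng.normal(size=(3, 8))                    # hypothetical mixing matrix
X = S @ M + 0.1 * rng.normal(size=(200, 8))   # observed inputs

# Linear autoassociator with tied weights:
# code = X W (hidden-layer activations), reconstruction = code W^T.
W = 0.1 * rng.normal(size=(8, 3))
lr, lam = 5e-3, 1e-3                           # lam: weight decay, NOT FMS
for _ in range(2000):
    code = X @ W                               # hidden activations = the code
    err = code @ W.T - X                       # reconstruction error
    grad = (X.T @ err + err.T @ X) @ W / len(X)
    W -= lr * (grad + lam * W)                 # penalized gradient descent

mse = np.mean((X @ W @ W.T - X) ** 2)          # small: 3 code units suffice
```

FMS replaces the `lam * W` term with a second-order penalty that directly favors networks describable with few bits; the autoassociator structure (input, code layer, reconstruction) is the same.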

**Coding each input via few simple component functions (CFs).**
A CF is the function determining the activation of a code
component in response to a given input. The analysis in Section
3 will show that FMS-based
LOCOCODE tries to reproduce the
current input by using as few code components as possible,
each computed by a separate low-complexity CF
(implementable, e.g., by a subnetwork with few low-precision weights).

This reflects a basic assumption, namely, that the true input ``causes''
(e.g., Hinton et al. 1995, Dayan and Zemel 1995, Ghahramani 1995)
are indeed few and simple.
Training sets whose elements
are all describable by few features will result in
*sparse* codes, where sparseness does not necessarily
mean that there are ``few active code components'' but that
``few code components contribute to reproducing the input''.
This can make a difference in the nonlinear case,
where the *absence* of a particular hidden unit (HU) activation may imply
*presence* of a particular feature,
and where sparseness may mean that
for each input only
few HUs
are simultaneously *non*-active:
our generalized view of sparse codes allows for noninformative
activation values other than zero.
(But LOCOCODE does prune
code components that are always inactive or always active.)
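This distinction can be made concrete with a tiny numerical example (the codes below are hypothetical, and 1.0 is assumed to be the noninformative default activation): by the naive nonzero count the codes look dense, yet each input deviates from the default in only one component.

```python
import numpy as np

# Hypothetical 6-component codes for four inputs. Suppose the
# noninformative default activation is 1.0, so a component carries
# information about an input only when it deviates from 1.0.
NONINFORMATIVE = 1.0
codes = np.array([
    [1, 1, 0, 1, 1, 1],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1],
    [1, 1, 1, 0, 1, 1],
], dtype=float)

nonzero = (codes != 0).sum(axis=1)                   # naive count: 5 per input
informative = (codes != NONINFORMATIVE).sum(axis=1)  # generalized count: 1
```

In the generalized sense these codes are maximally sparse (one informative component per input), although almost all activations are nonzero.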

We will see that LOCOCODE encourages
noise-tolerant feature detectors reminiscent of those
observed in the mammalian visual cortex.
Inputs that are mixtures of a few regular features,
such as edges in images, can be described well in a sparse fashion
(only code components corresponding to present features contribute to
coding the input).
In contrast to previous approaches, however, sparseness is not viewed
as an *a priori* good thing and is not enforced explicitly; it emerges
only if the input data are indeed naturally describable by a sparse code.
Some lococodes are not only sparse but also factorial,
depending on whether the input is decomposable into factorial features.
Likewise, lococodes may deviate from sparseness towards locality if
each input exhibits a single characteristic feature.
Then the code will not be factorial (because knowledge
of the component representing the characteristic feature implies
knowledge of all others), but it will
still be natural because it represents the true cause in a fashion that
makes reconstruction (and other types of further processing) simple.

**Outline.**
A review of FMS will follow in Section 2.
Section 3 will analyze the beneficial effects of
FMS' error terms in the context of autoencoding.
The remainder of our paper will be devoted to
empirical justifications of LOCOCODE.
Experiments in Section 4.2
will show that all three ``good'' kinds of code
discussed in previous work (namely local, sparse, factorial)
can be natural lococodes.
In Section 4.3
LOCOCODE will extract optimal sparse codes
reflecting the independent features
of random horizontal and vertical (noisy) bars, while ICA and PCA will not.
In Section 4.4
LOCOCODE will
generate plausible sparse codes
(based on well-known on-center-off-surround
and other appropriate feature detectors)
of real world images.
Section 4.5
will finally use LOCOCODE
as a preprocessor for a standard, overfitting back-propagation
(BP) speech data classifier.
Surprisingly, this combination achieves excellent generalization performance.
We conclude that the speech data's lococode already
conveys the ``essential'', noise-free information in a form
useful for further processing and classification.
Section 5 will discuss our findings.
