What is the goal of sensory coding? There is no generally agreed-upon answer to Field's (1994) question yet. Several information-theoretic objective functions (OFs) have been proposed to evaluate the quality of sensory codes. Most OFs focus on statistical properties of the code components (such as mutual information) -- we refer to them as code component-oriented OFs, or COCOFs. Some COCOFs explicitly favor near-factorial, minimally redundant codes of the input data (see, e.g., Watanabe 1985, Barlow et al. 1989, Linsker 1988, Schmidhuber 1992, Schmidhuber and Prelinger 1993, Schraudolph and Sejnowski 1993, Redlich 1993, Deco and Parra 1994). Such codes can be advantageous for (1) data compression, (2) speeding up subsequent gradient descent learning (e.g., Becker 1991), and (3) simplifying subsequent Bayes classifiers (e.g., Schmidhuber et al. 1996). Other approaches favor local codes, e.g., Rumelhart and Zipser (1986), Barrow (1987), Kohonen (1988). They can help to achieve (1) minimal crosstalk, (2) subsequent gradient descent speed-ups, (3) facilitation of post-training analysis, and (4) simultaneous representation of different data items. Recently there has also been much work on COCOFs encouraging biologically plausible sparse distributed codes, e.g., Field (1987), Barlow (1983), Mozer (1991), Földiák (1990), Földiák and Young (1995), Palm (1992), Zemel and Hinton (1994), Field (1994), Saund (1994), Dayan and Zemel (1995), Li (1995), Olshausen and Field (1996), Zemel (1993), Hinton and Ghahramani (1997). Sparse codes share certain advantages of both local and dense codes.
But what about coding costs? COCOFs express desirable properties of the code itself, while neglecting the costs of constructing the code from the data. For instance, coding input data without redundancy may be very expensive in terms of information bits required to describe the code-generating network, which may need many finely tuned free parameters. In fact, the most compact code of the possible environmental inputs would be the ``true'' probabilistic causal model corresponding to our universe's most basic physical laws. Generating this code and using it for dealing with everyday events, however, would be extremely inefficient.
A previous argument for ignoring coding costs (e.g., Zemel 1993, Zemel and Hinton 1994, Hinton and Zemel 1994), based on the principle of minimum description length (MDL, e.g., Solomonoff 1964, Wallace and Boulton 1968, Rissanen 1978), focuses on hypothetical costs of transmitting the data from some sender to a receiver -- how many bits are necessary to enable the receiver to reconstruct the data? It goes more or less like this: ``Total transmission cost is the number of bits required to describe (1) the data's code, (2) the reconstruction error and (3) the decoding procedure. Since all input exemplars are encoded/decoded by the same mapping, the coding/decoding costs are negligible because they occur only once.''
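The amortization step in this argument can be made concrete with a toy bit count (all numbers below are invented for illustration, not taken from the cited works): as the number of transmitted exemplars grows, even a large one-time description of the decoding procedure becomes a vanishing fraction of the total transmission cost.

```python
# Toy MDL-style bit count for the sender/receiver argument above.
# All numbers are invented for illustration only.
n_exemplars = 1_000_000       # exemplars to transmit
bits_per_code = 32            # per-exemplar cost of the code
bits_per_residual = 8         # per-exemplar cost of the reconstruction error
bits_for_decoder = 1_000_000  # one-time cost of describing the decoding mapping

total_bits = n_exemplars * (bits_per_code + bits_per_residual) + bits_for_decoder
decoder_share = bits_for_decoder / total_bits  # fraction spent on the mapping
```

Here `decoder_share` is only about 2.4%, which is why the transmission argument treats coding/decoding costs as negligible. LOCOCODE's counterpoint is that in real (e.g., neural) hardware the mapping itself, not an idealized channel, is the scarce resource.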
We doubt, however, that sensory coding's sole objective should be to transform data into a compact code that is cheaply transmittable across some ideal, abstract channel. We believe that one of sensory coding's objectives should be to reduce the cost of code generation through data transformations in existing channels (e.g., synapses). Without denying the usefulness of certain COCOFs, we postulate that an important scarce resource is the number of bits required to describe the mappings that generate and process the codes -- after all, it is these mappings that need to be implemented, given limited hardware.
Lococodes. For such reasons we shift the point of view and focus on the information-theoretic costs of code generation (compare Pajunen (1998) for recent related work). We will present a novel approach to unsupervised learning called ``low-complexity coding and decoding'' (LOCOCODE; see also Hochreiter and Schmidhuber 1997b, 1997c, 1998). Without assuming particular goals such as data compression or simplifying subsequent classification, but in the MDL spirit, LOCOCODE generates so-called lococodes that (1) convey information about the input data, (2) can be computed from the data by a low-complexity mapping (LCM), and (3) can be decoded by an LCM. By minimizing coding/decoding costs, LOCOCODE will yield efficient, robust, noise-tolerant mappings for processing inputs and codes.
Lococodes through FMS. To implement LOCOCODE we apply Flat Minimum Search (FMS, Hochreiter and Schmidhuber 1997a) to an autoassociator (AA) whose hidden layer activations represent the code. FMS is a general, gradient-based method for finding networks that can be described with few bits of information.
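The basic setup can be sketched with a minimal autoassociator whose hidden activations form the code. The sketch below is illustrative only: it trains a tiny 4-2-4 autoencoder by plain per-example gradient descent, and uses a simple L2 weight penalty as a crude stand-in for a complexity cost -- it does not implement the actual FMS regularizer, which penalizes the number of bits needed to describe the weights. All names and sizes are our own choices, not from the paper.

```python
# Minimal autoassociator sketch: hidden layer activations = the code.
# NOTE: lam * W is an L2 stand-in for a complexity penalty, NOT the FMS term.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4-dimensional inputs mixed from 2 underlying "causes".
sources = rng.normal(size=(200, 2))
A = rng.normal(0.0, 0.5, size=(2, 4))
X = sources @ A

n_in, n_hid = 4, 2
W1 = rng.normal(0.0, 0.5, size=(n_hid, n_in))  # encoder weights
W2 = rng.normal(0.0, 0.5, size=(n_in, n_hid))  # decoder weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def encode(x):
    return sigmoid(W1 @ x)  # hidden activations = the code

def decode(h):
    return W2 @ h           # linear reconstruction

def mse():
    return float(np.mean([(decode(encode(x)) - x) ** 2 for x in X]))

mse_before = mse()
lam, lr = 1e-4, 0.02        # stand-in complexity penalty, learning rate
for _ in range(300):        # plain per-example gradient descent
    for x in X:
        h = encode(x)
        err = decode(h) - x
        dh = (W2.T @ err) * h * (1.0 - h)   # backprop through sigmoid
        W2 -= lr * (np.outer(err, h) + lam * W2)
        W1 -= lr * (np.outer(dh, x) + lam * W1)
mse_after = mse()
```

Replacing the L2 term with FMS' flatness-based penalty is what turns such an autoassociator into a LOCOCODE learner; Section 3 analyzes what that penalty does to the code.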
Coding each input via few simple component functions (CFs). A CF is the function determining the activation of a code component in response to a given input. The analysis in Section 3 will show that FMS-based LOCOCODE tries to reproduce the current input by using as few code components as possible, each computed by a separate low-complexity CF (implementable, e.g., by a subnetwork with few low-precision weights).
This reflects a basic assumption, namely, that the true input ``causes'' (e.g., Hinton et al. 1995, Dayan and Zemel 1995, Ghahramani 1995) are indeed few and simple. Training sets whose elements are all describable by few features will result in sparse codes, where sparseness does not necessarily mean that there are ``few active code components'' but that ``few code components contribute to reproducing the input''. This can make a difference in the nonlinear case, where the absence of a particular hidden unit (HU) activation may imply the presence of a particular feature, and where sparseness may mean that for each input only a few HUs are simultaneously non-active: our generalized view of sparse codes allows for noninformative activation values other than zero. (But LOCOCODE does prune code components that are always inactive or always active.)
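The distinction between ``few active components'' and ``few contributing components'' can be illustrated with a toy code matrix: with a linear decoder, a component whose activation is constant across all inputs only adds a fixed offset to every reconstruction, so it is uninformative even though its value is 1, not 0. All names and numbers below are illustrative, not from the paper.

```python
# Toy illustration: "contributing" vs. "active" code components.
# Component 2 is fully active (1.0) for every input, so it carries no
# information -- only components 0 and 1 contribute to distinguishing inputs.
import numpy as np

# Codes for 4 inputs, 3 components; column 2 is pinned at 1.0.
H = np.array([
    [0.9, 0.0, 1.0],
    [0.0, 0.8, 1.0],
    [0.7, 0.1, 1.0],
    [0.1, 0.9, 1.0],
])

def contributing_components(H, tol=1e-6):
    """Indices of components whose activation varies across inputs."""
    return np.where(H.std(axis=0) > tol)[0]

active_per_input = (H > 0.5).sum(axis=1)  # naive "active units" count
contrib = contributing_components(H)      # informative components: [0, 1]
```

A naive activity count reports two active units per input, yet only two of the three components contribute at all; the always-on third component is exactly the kind LOCOCODE prunes.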
We will see that LOCOCODE encourages noise-tolerant feature detectors reminiscent of those observed in the mammalian visual cortex. Inputs that are mixtures of a few regular features, such as edges in images, can be described well in a sparse fashion (only code components corresponding to present features contribute to coding the input). In contrast to previous approaches, however, sparseness is not viewed as an a priori good thing and is not enforced explicitly; it emerges only if the input data is indeed naturally describable by a sparse code. Some lococodes are not only sparse but also factorial, depending on whether the input is decomposable into factorial features. Likewise, lococodes may deviate from sparseness toward locality if each input exhibits a single characteristic feature. Then the code will not be factorial (because knowledge of the component representing the characteristic feature implies knowledge of all others), but it will still be natural because it represents the true cause in a fashion that makes reconstruction (and other types of further processing) simple.
Outline. A review of FMS will follow in Section 2. Section 3 will analyze the beneficial effects of FMS' error terms in the context of autoencoding. The remainder of the paper will be devoted to empirical justifications of LOCOCODE. Experiments in Section 4.2 will show that all three ``good'' kinds of code discussed in previous work (namely, local, sparse, and factorial) can be natural lococodes. In Section 4.3 LOCOCODE will extract optimal sparse codes reflecting the independent features of random horizontal and vertical (noisy) bars, whereas independent component analysis (ICA) and principal component analysis (PCA) will not. In Section 4.4 LOCOCODE will generate plausible sparse codes (based on well-known on-center-off-surround and other appropriate feature detectors) of real-world images. Finally, Section 4.5 will use LOCOCODE as a preprocessor for a standard, overfitting back-propagation (BP) speech data classifier. Surprisingly, this combination achieves excellent generalization performance. We conclude that the speech data's lococode conveys the ``essential'', noise-free information in a form already useful for further processing and classification. Section 5 will discuss our findings.