First consider a conventional, physical heat exchanger. See Figure 1 (C). There are two touching water pipes with opposite flow direction. Cold water enters the first pipe. Hot water enters the second pipe. But hot water exits the first pipe, and cold water exits the second pipe! At any given point where both pipes touch, their temperatures are the same (provided the water speed is low enough to allow for sufficient temperature exchange). Entirely local interaction can lead to a complete reversal of global, macroscopic properties such as temperature. Physical heat exchangers are common in technical applications (e.g., nuclear power plants) and in animals, e.g., rodents (Geoff Hinton, personal communication, 1994).
Basic idea. In analogy to the physical heat exchanger, I build a ``Neural Heat Exchanger''. There are two multi-layer feedforward networks with opposite flow direction. They correspond to the pipes. Both nets have the same number of layers. They are aligned such that each net's input layer is ``next'' to the other net's output layer, and each hidden layer in the first net is ``next'' to exactly one hidden layer in the other net. Input patterns enter the first net and are propagated ``up''. Desired outputs (targets) enter the ``opposite'' net and are propagated ``down''. Using the local, simple delta rule, each layer in each net tries to be similar (in information content) to the preceding layer and to the corresponding layer in the other net. The input entering the first net slowly ``heats up'' to become the target. The target entering the opposite net slowly ``cools down'' to become the input. No global control mechanism is required. See Figure 2 and details below.
Architecture. See Figure 2. The first pipe corresponds to a feedforward network $A$ with $n$ layers $A_1, A_2, \ldots, A_n$. Each unit in $A_k$ has directed connections to each unit in $A_{k+1}$, $k = 1, \ldots, n-1$. The second pipe corresponds to a feedforward network $B$ with $n$ layers $B_1, B_2, \ldots, B_n$. Each unit in $B_{k+1}$ has directed connections to each unit in $B_k$, $k = 1, \ldots, n-1$. For simplicity, let all layers have $m$ units. The $i$-th unit in $A_k$ is denoted $A_k^i$. The $i$-th unit in $B_k$ is denoted $B_k^i$. The randomly initialized weight on the connection from some unit $u$ to some unit $v$ is denoted $w_{uv}$.
Dynamics (example). See Figure 2. Input patterns enter at $A_1$. Output patterns exit at $A_n$. The goal is to make $A$'s output patterns like the targets. $B$'s flow direction is opposite to $A$'s. Targets (desired outputs) enter at $B_n$. Output patterns exit at $B_1$. The goal is to make $B$'s output patterns like $A$'s inputs. Input units are those in $A_1$ and $B_n$. At any given discrete time step, their activations are set by the environment, according to the current task. Furthermore, at any given time, each noninput unit updates its variable activation (initialized with 0.0) as follows: with probability $p$, set it to the result of the usual forward pass from the preceding layer of its own net; with probability $1-p$, set it to the activation of the corresponding unit in the other net; where $p = 0.5$, for instance.
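The activation dynamics above can be sketched as follows (a minimal NumPy sketch, assuming sigmoid units, fully connected layers of equal size, and that the second branch copies the corresponding unit of the opposite net; all names and constants are illustrative, not part of the original formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 0.5   # units per layer, layers per net, forward-pass probability

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights: WA[k] maps layer A_k to A_{k+1} (net A flows "up");
#          WB[k] maps layer B_{k+1} to B_k (net B flows "down").
WA = [rng.normal(0, 0.1, (m, m)) for _ in range(n - 1)]
WB = [rng.normal(0, 0.1, (m, m)) for _ in range(n - 1)]

A = [np.zeros(m) for _ in range(n)]   # variable activations, initialized with 0.0
B = [np.zeros(m) for _ in range(n)]

def update_activations(inp, target):
    # Input units (layers A_1 and B_n) are set by the environment.
    A[0], B[n - 1] = inp, target
    for k in range(1, n):                 # noninput layers of A, bottom-up
        if rng.random() < p:
            A[k] = sigmoid(WA[k - 1] @ A[k - 1])  # forward pass within A
        else:
            A[k] = B[k].copy()                    # copy corresponding layer of B
    for k in range(n - 2, -1, -1):        # noninput layers of B, top-down
        if rng.random() < p:
            B[k] = sigmoid(WB[k] @ B[k + 1])      # forward pass within B
        else:
            B[k] = A[k].copy()                    # copy corresponding layer of A

update_activations(rng.random(m), rng.random(m))
```

Note that no information travels farther than one layer per update; repeated updates let inputs and targets gradually influence the whole stack, mirroring the slow exchange along the pipes.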
Learning. At any given discrete time step, using the simple delta rule (no backprop), weights are adjusted such that each noninput unit $A_k^i$ reduces its current (expected) distance to the corresponding unit $B_k^i$. Symmetrically: each noninput unit $B_k^i$ reduces its current distance to the corresponding unit $A_k^i$. Why? Because each layer should be similar to (``have the same temperature as'') the corresponding layer in the net with opposite flow direction.
Furthermore, at any given time step, using the simple delta rule (no backprop), weights are adjusted such that each noninput unit $A_k^i$ reduces its distance to unit $A_{k-1}^i$. Symmetrically: each noninput unit $B_k^i$ reduces its distance to $B_{k+1}^i$. Why? Because this tends to make successive units similar -- just like neighboring parts of a physical heat exchanger have similar temperature. The target entering $B$ slowly ``cools down'' to become the input. Likewise, the input entering $A$ slowly ``heats up'' to become the target.
Clearly, each weight gets error signals from two different local minimization processes. Simply add them up to change the weights.
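One learning step can be sketched as follows (a minimal NumPy sketch; the activations are random placeholders standing in for the current network state, the layer indexing assumes net $A$ flows from layer 0 upward and net $B$ flows from layer $n-1$ downward, and all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, lr = 4, 3, 0.1   # units per layer, layers per net, learning rate

# Snapshot of current activations of both nets (random placeholders here).
A = [rng.random(m) for _ in range(n)]   # net A: input enters at A[0]
B = [rng.random(m) for _ in range(n)]   # net B: target enters at B[n-1]

# Weights: WA[k] maps A[k] -> A[k+1]; WB[k] maps B[k+1] -> B[k].
WA = [rng.normal(0, 0.1, (m, m)) for _ in range(n - 1)]
WB = [rng.normal(0, 0.1, (m, m)) for _ in range(n - 1)]

def delta(pre, post, target, lr):
    """Simple delta rule, no backprop: a local weight change that moves
    the post-layer activations toward the given target activations."""
    return lr * np.outer(target - post, pre)

# Each weight gets error signals from two local minimization processes;
# simply add them up.
for k in range(1, n):            # noninput layers of A
    WA[k - 1] += delta(A[k - 1], A[k], B[k], lr)       # match corresponding layer of B
    WA[k - 1] += delta(A[k - 1], A[k], A[k - 1], lr)   # match preceding layer of A
for k in range(n - 1):           # noninput layers of B
    WB[k] += delta(B[k + 1], B[k], A[k], lr)           # match corresponding layer of A
    WB[k] += delta(B[k + 1], B[k], B[k + 1], lr)       # match preceding layer of B
```

Both error terms use only quantities available at the unit itself and its immediate neighbors, so the update remains entirely local in space and time.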
Variants. The discussion above focused on the case where each layer has the same number of units. This makes it particularly convenient to define what it means for one layer to be similar to the preceding one: each unit's activation simply has to be similar to the one of the unit at the same position in the previous layer. Varying numbers of units per layer require us to refine our notion of layer similarity. For instance, layer similarity can be defined by measuring mutual information between successive layers. Non-probabilistic variants of the Neural Heat Exchanger may sometimes be appropriate as well.
Experiments. Three ETHZ undergrad students, Alberto Salerno, Thomas Fasciania, and Giorgio Pazmandi, recently reimplemented the Neural Heat Exchanger. They report that it was able to solve XOR more quickly than backprop. For larger-scale parity problems, however, their system did not work as well as backprop. Sepp Hochreiter (personal communication) also implemented variants of the Neural Heat Exchanger. His implementations learned simple functions such as AND with 5 or more hidden layers. He reports that the system prefers local coding in deep hidden layers. He also successfully tried variants where each layer has a different number of units, and where either local auto-association or mutual information is used to define layer similarity. Unfortunately, however, at the time of this writing, there has not yet been a detailed experimental study of the Neural Heat Exchanger. My own, very limited 1990 toy experiments also do not qualify as a systematic analysis. Much remains to be done.
Relation to recent work. According to Peter Dayan (personal communication, 1994), the Neural Heat Exchanger is essentially a supervised variant of the recent Helmholtz Machine [3,2]. Or, depending on the point of view, the Helmholtz Machine is an unsupervised variant of the Neural Heat Exchanger.
According to Peter Dayan and Geoff Hinton, a trouble with the Neural Heat Exchanger is that in non-deterministic domains, there is no reason why $B$'s output should match $A$'s input. Dayan and Hinton's algorithm overcomes this problem by using completely separate learning phases for top-down and bottom-up weights. This, however, makes their algorithm non-local in time: a global mechanism is required to separate the learning phases.
An alternative way to overcome the problem above may be to force part of $A$'s output to reconstruct a unique representation of $A$'s input, and to feed this representation also into $B$, together with the target.