In a realistic application, of course, it is implausible to assume that the errors of all predictors $P_i$ are minimal at all times. After the functions computing the internal representations have been modified, the $P_i$ must be trained for some time to ensure that they can adapt to the new situation.
Each of the predictors, the representational modules, and the potentially available auto-associator can be implemented as a feed-forward back-propagation network (e.g. Werbos, 1974). There are two alternating passes: one for minimizing prediction errors, the other for maximizing $T$. Here is an off-line version based on successive `epochs' (presentations of the whole ensemble of training patterns):
PASS 1 (minimizing prediction errors):
Repeat for a `sufficient' number of training epochs:
1. For all $p$:
1.1. For all $i$: Compute $y_i^p$.
1.2. For all $i$: Compute $P_i^p$.
2. Change each weight $w$ of each $P_i$ according to
$$\Delta w = -\eta \frac{\partial E_P}{\partial w},$$
where $\eta$ is a positive constant learning rate.
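PASS 1 can be sketched in NumPy as follows. This is a minimal illustration under assumptions not made in the text: each predictor is a single logistic layer rather than a multi-layer network, the error is taken to be $E_P = \frac{1}{2}\sum_p \sum_i (P_i^p - y_i^p)^2$, and the names `predictor_error`, `pass1`, `W`, `b` and `eta` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predictor_error(Y, W, b):
    """Return E_P = 1/2 * sum_p sum_i (P_i^p - y_i^p)^2 and the predictions P.

    Y has one row per pattern p and one column per representational unit i.
    Column i of W holds the weights of predictor i (assumed one-layer here).
    """
    mask = 1.0 - np.eye(Y.shape[1])       # predictor i never sees y_i
    P = sigmoid(Y @ (W * mask) + b)
    return 0.5 * np.sum((P - Y) ** 2), P

def pass1(Y, W, b, eta=0.5, epochs=200):
    """Off-line gradient descent on E_P; the code Y is held fixed."""
    mask = 1.0 - np.eye(Y.shape[1])
    for _ in range(epochs):
        _, P = predictor_error(Y, W, b)
        delta = (P - Y) * P * (1.0 - P)     # dE_P / d(pre-activation)
        W = W - eta * (Y.T @ delta) * mask  # Delta w = -eta * dE_P/dw
        b = b - eta * delta.sum(axis=0)
    return W, b
```

The diagonal mask is one simple way to keep each predictor from seeing the unit it predicts; with separate multi-layer predictors the same effect would be obtained by omitting that input.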
PASS 2 (minimizing predictability):
2. For all $p$:
2.1. For all $i$: Compute $y_i^p$.
2.2. For all $i$: Compute $P_i^p$.
2.3. If an auto-associator is involved, compute the reconstruction $z^p$.
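PASS 2 can be sketched in the same spirit. The list above does not spell out the weight change, so the sketch assumes plain gradient ascent on the objective with the predictors held fixed. It further assumes that the objective is $\frac{1}{2}\sum_p \sum_i (P_i^p - y_i^p)^2$, that each representational unit is a single logistic unit with weight matrix `V`, and that no auto-associator is present. All function and variable names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def objective(X, V, W, b):
    """Assumed objective: 1/2 * sum_p sum_i (P_i^p - y_i^p)^2."""
    mask = 1.0 - np.eye(V.shape[1])
    Y = sigmoid(X @ V)                     # representations y_i^p
    P = sigmoid(Y @ (W * mask) + b)        # predictions P_i^p
    return 0.5 * np.sum((P - Y) ** 2)

def grad_V(X, V, W, b):
    """Gradient of the objective w.r.t. the representation weights V."""
    mask = 1.0 - np.eye(V.shape[1])
    Wm = W * mask
    Y = sigmoid(X @ V)
    P = sigmoid(Y @ Wm + b)
    err = P - Y
    delta = err * P * (1.0 - P)
    # y_i enters both as a target (-err) and as an input to the other predictors
    dT_dY = -err + delta @ Wm.T
    return X.T @ (dT_dY * Y * (1.0 - Y))

def pass2(X, V, W, b, eta=0.1, epochs=50):
    """Off-line gradient ascent in V; predictor parameters W, b stay fixed."""
    for _ in range(epochs):
        V = V + eta * grad_V(X, V, W, b)
    return V
```

Note that each $y_i^p$ contributes to the gradient twice: once as the target of its own predictor and once as an input to all other predictors.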
The off-line version above is perhaps not as appealing as a more local procedure where computing time is distributed evenly between PASS 2 and PASS 1:
An on-line version. An extreme on-line version does not sweep through the whole training ensemble before changing weights. Instead it processes the same single input pattern $x^p$ (randomly chosen according to the input distribution) in both PASS 1 and PASS 2 and immediately changes the weights of all involved networks simultaneously, according to the contribution of $x^p$ to the respective objective functions.
Simultaneous updating of the representations and the predictors, however, introduces a potential for instabilities. Both the predictors and the representational modules perform gradient descent (or gradient ascent) in changing functions. Given a particular implementation of the basic principle, experiments are needed to find out how much on-line interaction is permissible. With the toy experiments reported below, on-line learning did not cause major problems.
It should be noted that if $T = V_C = E_P$ (section 5), then with a given input pattern we may compute the gradient of $V_C$ with respect to both the predictor weights and the weights of the representation modules in a single pass. After this we may simply perform gradient descent in the predictor weights and gradient ascent in the remaining weights (it is just a matter of flipping signs). This was actually done in the experiments.
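The single-pass update with flipped signs can be illustrated as follows. This is a sketch under assumptions, not the code used in the experiments: representational units and predictors are single logistic layers, the shared error for one pattern is $\frac{1}{2}\sum_i (P_i - y_i)^2$, and the names `error_and_grads`, `online_step`, `eta_rep` and `eta_pred` are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_and_grads(x, V, W, b):
    """One forward/backward pass for a single input pattern x.

    Returns E = 1/2 * sum_i (P_i - y_i)^2 together with its gradient with
    respect to the representation weights V and the predictor parameters W, b.
    """
    mask = 1.0 - np.eye(V.shape[1])        # predictor i never sees y_i
    Wm = W * mask
    y = sigmoid(x @ V)
    P = sigmoid(y @ Wm + b)
    err = P - y
    delta = err * P * (1.0 - P)
    gW = np.outer(y, delta) * mask
    gb = delta
    dE_dy = -err + Wm @ delta
    gV = np.outer(x, dE_dy * y * (1.0 - y))
    return 0.5 * np.sum(err ** 2), gV, gW, gb

def online_step(x, V, W, b, eta_rep=0.1, eta_pred=0.1):
    """Same gradients, opposite signs: ascent in V, descent in W and b."""
    _, gV, gW, gb = error_and_grads(x, V, W, b)
    return V + eta_rep * gV, W - eta_pred * gW, b - eta_pred * gb
```

Using separate learning rates for the two groups of weights is a design choice of this sketch; it offers one knob for limiting the on-line interaction between predictors and representational modules.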
Local maxima. Like all gradient ascent procedures, the method is subject to the problem of local maxima. A standard method for dealing with local maxima is to repeat the above algorithm with different weight initializations (using a fixed number of training epochs for each repetition) until a (near-) factorial code is found. Each repetition corresponds to a local search around the point in weight space defined by the current weight initialization.
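The restart strategy can be illustrated on a toy problem. The sketch below does not train any networks: it replaces the full objective by a hypothetical one-dimensional function `f` with one local and one global maximum, and replaces the test for a (near-) factorial code by a hypothetical threshold `good_enough`. Only the control flow (fixed training budget per repetition, fresh random initialization each time) mirrors the procedure described above.

```python
import numpy as np

def f(w):
    """Toy objective: local maximum near w = -2, global maximum near w = 2."""
    return np.exp(-(w - 2.0) ** 2) + 0.5 * np.exp(-(w + 2.0) ** 2)

def df(w):
    """Derivative of f."""
    return (-2.0 * (w - 2.0) * np.exp(-(w - 2.0) ** 2)
            - (w + 2.0) * np.exp(-(w + 2.0) ** 2))

def ascend(w, eta=0.1, steps=500):
    """Local search: a fixed number of gradient ascent steps from w."""
    for _ in range(steps):
        w = w + eta * df(w)
    return w

def search_with_restarts(good_enough=0.9, max_restarts=20, seed=0):
    """Repeat the local search from new random initializations."""
    rng = np.random.default_rng(seed)
    best_w, best_f = None, -np.inf
    for _ in range(max_restarts):
        w = ascend(rng.uniform(-4.0, 4.0))   # fresh initialization
        if f(w) > best_f:
            best_w, best_f = w, f(w)
        if best_f >= good_enough:            # stand-in for the stopping test
            break
    return best_w, best_f
```

A run that starts in the basin of the local maximum ends with a value of about 0.5 and triggers another repetition; a run that starts in the other basin reaches a value close to 1 and stops the search.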
Shared hidden units. It should be mentioned that some or all of the representational modules may share hidden units. The same holds for the predictors. Predictors sharing hidden units, however, will have to be updated sequentially: No representational unit may be used to predict its own activity.
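One possible reading of this constraint is sketched below. It assumes predictors that share a single `tanh` hidden layer (`W_in`) and differ only in their output weights (columns of `W_out`); to keep unit $i$ from predicting itself, the sketch evaluates the predictors one at a time and sets the $i$-th input to zero for predictor $i$. Zeroing the input, like all names used here, is an assumption of the sketch rather than something prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def shared_predictions(Y, W_in, W_out):
    """Predictions of all units from predictors with a shared hidden layer.

    Y:     (patterns, units) code vectors
    W_in:  (units, hidden) weights into the shared hidden layer
    W_out: (hidden, units) column i holds the output weights of predictor i
    """
    n_units = Y.shape[1]
    P = np.empty_like(Y, dtype=float)
    for i in range(n_units):                 # sequential, one predictor at a time
        Y_masked = Y.astype(float)           # astype returns a fresh copy
        Y_masked[:, i] = 0.0                 # hide y_i from its own predictor
        H = np.tanh(Y_masked @ W_in)         # shared hidden units
        P[:, i] = sigmoid(H @ W_out[:, i])
    return P
```

Because the hidden activations depend on which input is hidden, they have to be recomputed for every predictor, which is why the predictors cannot all be evaluated in one parallel sweep.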