At a given time, an agent with a non-Markovian interface to its environment cannot derive an optimal next action by considering its current input only. The algorithm described below differs from previous reinforcement algorithms in at least some of the following issues: It has a potential for on-line learning and non-Markovian environments, it is local in time and in principle it allows arbitrary time lags between actions and ulterior consequences; it does not care for something like episode-boundaries, it allows vector-valued reinforcement, it is based on two interacting fully recurrent continually running networks, and it tries to construct a full environmental model - thus providing complete `credit assignment paths' into the past.
We dedicate one or more conventional input units (called pain and pleasure units) for the purpose of reporting the actual reinforcement to a fully recurrent control network. Pain and pleasure input units have time-invariant desired values.
We employ the IID-Algorithm  for training a fully recurrent model network to model the relationships between environmental inputs, output actions of an agent, and corresponding pain or pleasure. The model network (e.g. ) in turn allows the system to compute controller gradients for `minimizing pain' and `maximizing pleasure'. Since reinforcement gradients depend on `credit assignment paths' leading `backwards through the environment', the model network should not only predict the pain and pleasure units but also the other input units.
The quantity to be minimized by the model network is , where is the activation of the th input unit at time , and is the model's prediction of the activation of the th input unit at time . The quantity to be minimized by the controller is , where is the activation of the th pain or pleasure input unit at time and is its desired activation for all times. ranges over all (discrete) time steps. Weights are changed at each time step. This relieves dependence on `episode boundaries'. Here the assumption is that the learning rates are small enough to avoid instabilities .
There are two versions of the algorithm: the sequential version and the parallel version. With the sequential version, the model network is first trained by providing it with randomly chosen examples of sequences of interactions between controller and environment. Then the model's weights are fixed to their current values, and the controller begins to learn. With the parallel version both the controller and the model learn concurrently. One advantage of the parallel version is that the model network focusses only on those parts of the environmental dynamics with which the controller typically is confronted. Another advantage is the applicability to changing environments. Some disadvantages of the parallel version are listed next.
1. Imperfect model networks. The model which is used to compute gradient information for the controller may be wrong. However, if we assume that the model network always finds a zero-point of its error function, then over time we can expect the control network to perform gradient descent according to a perfect model of the visible parts of the real world. 1.A: The assumption that the model network can always find a zero-point of its error function is not valid in the general case. One of the reasons is the old problem of local minima, for which this paper does not suggest any solutions. 1.B:  notes that a model network does not need to be perfect to allow increasing performance of the control network.
2. Instabilities. One source of instability could arise if the model network `forgets' information about the environmental dynamics because the activities of the controller push it into a new sub-domain, such that the weights responsible for the old well-modeled sub-domain become over-written.
3. Deadlock. Even if the model's predictions are perfect for all actions executed by the controller, this does not imply that the algorithm will always behave as desired. Let us assume that the controller enters a local minimum relative to the current state of an imperfect model network. This relative minimum might cause the controller to execute the same action again and again (in a certain spatio-temporal context), while the model does not get a chance to learn something about the consequences of alternative actions (this is the deadlock).
The sequential version lacks the flavor of on-line learning and is bound to fail as soon as the environment changes significantly. We will introduce `adaptive randomness' for the controller outputs to attack problems of the parallel version.