In the sequel, an on-line algorithm for reinforcement learning in non-stationary reactive environments is described first. The algorithm relies heavily on an adaptive model of the environmental dynamics. The main contribution of this paper (see the second section) is to demonstrate how the algorithm may be naturally augmented by curiosity and boredom, in order to improve the world model in an on-line manner.
Consider an `animat' whose movements are controlled by the output units of a neural network, called the control network, which also receives the animat's sensory perception by means of its input units. The animat is potentially able to produce actions that may change the environmental input (external feedback caused by the `reactive' environment). By means of recurrent connections in the network, the animat is also potentially able to internally represent past events (internal feedback).
The animat sometimes experiences different types of reinforcement by means of so-called reinforcement units or pain units that become activated in moments of reinforcement or `pain' (e.g. the experience of bumping against an obstacle with an extremity). The animat's only goal is to minimize cumulative pain and maximize cumulative reinforcement. The animat is autonomous in the sense that no intelligent external teacher is required to provide additional goals or subgoals for it.
Reinforcement units and pain units are similar to other input units in the sense that they possess conventional outgoing connections to other units. However, unlike normal input units they have desired activation values at every time. For the purpose of this paper we say that the desired activation of a pain unit is zero for all times; other reinforcement units may have positive desired values. In the sequel we assume a discrete-time environment with `time ticks'. At a given time $t$, the quantity to be minimized by the learning algorithm is $\sum_{\tau > t} \sum_i (c_i - y_i(\tau))^2$, where $y_i(\tau)$ is the activation of the $i$th pain or reinforcement unit at time $\tau$, $\tau$ ranges over all remaining time ticks still to come, and $c_i$ is the desired activation of the $i$th reinforcement or pain unit for all times.
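The objective can be made concrete with a short sketch; `cumulative_error` and its arguments are illustrative names, not part of the original formulation:

```python
# Sketch of the quantity the animat minimizes: the cumulative squared
# difference between desired and actual activations of the pain and
# reinforcement units over all future time ticks.

def cumulative_error(activations, desired):
    """activations[t][i]: activation of the i-th pain/reinforcement unit
    at future tick t; desired[i]: its desired activation (0 for pain units)."""
    return sum(
        (desired[i] - y_i) ** 2
        for y_t in activations
        for i, y_i in enumerate(y_t)
    )

# A single pain unit (desired activation 0) active over three ticks
# contributes its squared activation at each tick:
err = cumulative_error([[0.5], [0.25], [0.0]], [0.0])
```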
The reinforcement learning animat faces a very general spatio-temporal credit assignment task: no external teacher provides knowledge about e.g. desired outputs or `episode boundaries' (externally defined temporal boundaries of training intervals). In the sequel it is demonstrated how the animat may employ a combination of two recurrent self-supervised learning networks in order to achieve its goal.
Munro, Jordan, Werbos, Robinson and Fallside, and Nguyen and Widrow used `model networks' for constructing a mapping from output actions of a control network to their effects in `task space'. The general system described below (an improved version of previously described variants) also employs an adaptive model of the environmental dynamics for computing gradients of the control network's inputs with respect to the controller weights. The model network is trained to simulate the environment by making predictions about future inputs, including pain and reinforcement inputs. Training of the controller works as follows: Since we cannot propagate input errors (e.g. differences between actual pain signals and desired zero pain signals) `through the environment', we propagate them through the combination of model network and controller. Only the controller weights change during this phase; the weights of the model network remain fixed. (Actually, we do not really propagate errors through the networks. We use an algorithm which is functionally equivalent to `back-propagation through time' but relies only on computations `going forward in time', so there is no need to store past activations.)
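The idea of propagating pain errors through the model into the controller can be illustrated with a deliberately tiny scalar, linear sketch (the paper's networks are recurrent and nonlinear; all names and the toy environment here are hypothetical):

```python
# Toy illustration of model-based controller training: the model learns to
# predict the pain caused by an action, and the controller then descends
# the *predicted* pain (desired pain is zero), with the gradient flowing
# through the model's weight rather than through the unknown environment.
# Everything is scalar and linear purely for illustration.

def train_step(w_model, w_ctrl, x, env, lr_m=0.1, lr_c=0.1):
    action = w_ctrl * x            # controller output
    pain = env(action)             # true pain signal from the environment
    pred = w_model * action        # model's pain prediction
    # Model learns to predict pain:
    w_model += lr_m * (pain - pred) * action
    # Controller follows the gradient of predicted pain through the model;
    # the error is (desired - predicted) = (0 - pred):
    w_ctrl += lr_c * (0.0 - pred) * w_model * x
    return w_model, w_ctrl

def env(a):                        # hypothetical environment: pain grows with a
    return 2.0 * a

w_m, w_c = 0.0, 1.0
for _ in range(200):
    w_m, w_c = train_step(w_m, w_c, 1.0, env)
# w_c is driven toward an action causing zero pain
```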
Both the control network and the model network are fully recurrent. The algorithm differs from algorithms by other authors in at least some of the following respects: It aims at on-line learning and locality in time, it does not depend on `epoch boundaries', it requires only reinforcement information for learning, it allows different kinds of reinforcement (or pain), and it allows both internal and external feedback with theoretically arbitrary time lags. Unlike Robinson and Fallside's approach (the one most closely related to the one described here), it provides `credit assignment paths' leading from pain units back to output units and back to all input units, and so on. There are also credit assignment paths leading from input units back to the input units themselves, and from there to the output units. The latter paths are important in the common case where the environment can change even in the absence of recent output actions.
The discrete-time algorithm below concurrently adjusts the fully recurrent model network and the fully recurrent control network. An on-line version of Robinson and Fallside's Infinite-Input-Duration (IID) learning algorithm for fully recurrent networks (first implemented by Williams and Zipser) is used for training both the model network and the combination of controller and model network. The algorithm is a particular instantiation of a more general form and is based on the logistic activation function for all non-input units.
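A minimal sketch of the forward-in-time gradient recursion for a fully recurrent logistic network might look as follows (the data structures are illustrative):

```python
import math

# Sketch of the on-line gradient recursion used here: p[k][(i, j)]
# approximates the derivative of unit k's activation with respect to
# weight w[i][j] and is updated going forward in time, so no past
# activations need to be stored.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(w, y_old, p_old):
    """One synchronous update of all units and their p-values."""
    n = len(w)
    y_new = [logistic(sum(w[k][l] * y_old[l] for l in range(n)))
             for k in range(n)]
    p_new = {k: {} for k in range(n)}
    for k in range(n):
        deriv = y_new[k] * (1.0 - y_new[k])     # logistic derivative
        for i in range(n):
            for j in range(n):
                recur = sum(w[k][l] * p_old[l][(i, j)] for l in range(n))
                direct = y_old[j] if i == k else 0.0
                p_new[k][(i, j)] = deriv * (recur + direct)
    return y_new, p_new
```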
In step 1 of the main loop of the algorithm actions in the external world are computed. Due to the internal feedback, these actions are based on previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated.
In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible.
In step 3 the model network tries to predict these effects without seeing the new input. Again the relevant gradient information is computed.
In step 4 the model network is updated in order to better predict the input (including pain) for the controller. Finally, the weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and reinforcement units. Since the control network continues activation spreading based on the actual inputs instead of using the predictions of the model network, `teacher forcing' is used in the model network.
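The teacher-forcing bookkeeping can be sketched as follows, with hypothetical unit labels:

```python
# Sketch of `teacher forcing' in the model network: after the model has
# predicted the new input and the gradients have been computed, the actual
# observations replace the model's predictions as the activations carried
# into the next time tick, so the model's recurrent state tracks reality.

def teacher_force(pred_activations, observed, pred_index):
    """pred_activations: unit -> activation in the model network;
    observed: input unit -> perceived value; pred_index: maps each input
    unit to the model unit that predicts it. All names are illustrative."""
    forced = dict(pred_activations)
    for unit, value in observed.items():
        forced[pred_index[unit]] = value
    return forced
```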
The system described here incorporates various improvements over previously described variants. For instance, the partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network. Furthermore, the model sees the last input and the current output of the controller at the same time.
Notation:
$C$ is the set of all non-input units of the control network, $A$ is the set of its output units, $I$ is the set of its `normal' input units, $P$ is the set of its pain and reinforcement units, $M$ is the set of all units of the model network, $O$ is the set of its output units, $O_P \subset O$ is the set of all units that predict pain or reinforcement, $W_M$ is the set of variables for the weights of the model network, $W_C$ is the set of variables for the weights of the control network, $y_{k_{new}}$ is the variable for the updated activation of the $k$th unit from $M \cup C \cup I \cup P$, $y_{k_{old}}$ is the variable for the last value of $y_{k_{new}}$, $w_{ij}$ is the variable for the weight of the directed connection from unit $j$ to unit $i$, $p^k_{ij_{new}}$ is the variable which gives the current (approximated) value of $\frac{\partial y_{k_{new}}}{\partial w_{ij}}$, $p^k_{ij_{old}}$ is the variable which gives the last value of $p^k_{ij_{new}}$, if $k \in P$ then $c_k$ is $k$'s desired activation for all times, $\alpha_C$ is the learning rate for the control network, $\alpha_M$ is the learning rate for the model network.
$\mid O \mid = \mid I \cup P \mid$, $\mid O_P \mid = \mid P \mid$. If $k \in I \cup P$, then $k_{pred}$ is the unit from $O$ which predicts $k$. Each unit from $I \cup P \cup A$ has one forward connection to each unit from $M \cup C$. Each unit from $M$ is connected to each other unit from $M$. Each unit from $C$ is connected to each other unit from $C$. Each weight of a connection leading to a unit in $M$ is said to belong to $W_M$. Each weight of a connection leading to a unit in $C$ is said to belong to $W_C$. Each weight $w_{ij} \in W_M$ needs $p^k_{ij}$-values for all $k \in M$. Each weight $w_{ij} \in W_C$ needs $p^k_{ij}$-values for all $k \in C \cup I \cup P$.
INITIALIZATION:
For all $w_{ij} \in W_M \cup W_C$:
begin $w_{ij} \leftarrow$ random, for all possible $k$: $p^k_{ij_{old}} \leftarrow 0$, $p^k_{ij_{new}} \leftarrow 0$ end.
For all $k \in M \cup C$: $y_{k_{old}} \leftarrow 0$.
For all $k \in I \cup P$: set $y_{k_{old}}$ by environmental perception.

FOREVER REPEAT:
1. For all $i \in C$: $y_{i_{new}} \leftarrow \frac{1}{1 + e^{-\sum_j w_{ij} y_{j_{old}}}}$.
For all $w_{ij} \in W_C$, $k \in C$:
$p^k_{ij_{new}} \leftarrow y_{k_{new}}(1 - y_{k_{new}})(\sum_l w_{kl} p^l_{ij_{old}} + \delta_{ik} y_{j_{old}})$.
For all $k \in C$: $y_{k_{old}} \leftarrow y_{k_{new}}$,
for all $w_{ij} \in W_C$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$ end.
2. Execute all motoric actions based on activations of
units in $A$. Update the environment.
For all $k \in I \cup P$:
set $y_{k_{new}}$ by environmental perception.
3. For all $i \in M$: $y_{i_{new}} \leftarrow \frac{1}{1 + e^{-\sum_j w_{ij} y_{j_{old}}}}$.
For all $w_{ij} \in W_M \cup W_C$, $k \in M$:
$p^k_{ij_{new}} \leftarrow y_{k_{new}}(1 - y_{k_{new}})(\sum_l w_{kl} p^l_{ij_{old}} + \delta_{ik} y_{j_{old}})$.
For all $k \in M$: $y_{k_{old}} \leftarrow y_{k_{new}}$,
for all $w_{ij} \in W_M \cup W_C$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$ end.
4. For all $w_{ij} \in W_M$:
$w_{ij} \leftarrow w_{ij} + \alpha_M \sum_{k \in I \cup P} (y_{k_{new}} - y_{k_{pred_{new}}}) p^{k_{pred}}_{ij_{new}}$.
For all $w_{ij} \in W_C$:
$w_{ij} \leftarrow w_{ij} + \alpha_C \sum_{k \in P} (c_k - y_{k_{new}}) p^{k_{pred}}_{ij_{new}}$.
For all $k \in I \cup P$:
begin $y_{k_{old}} \leftarrow y_{k_{new}}$, $y_{k_{pred_{old}}} \leftarrow y_{k_{new}}$,
for all $w_{ij} \in W_M$: $p^{k_{pred}}_{ij_{old}} \leftarrow 0$,
for all $w_{ij} \in W_C$: $p^k_{ij_{old}} \leftarrow p^{k_{pred}}_{ij_{new}}$ end.
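The weight changes of step 4 can be sketched as follows (the dictionary-based bookkeeping and all names are illustrative):

```python
# Sketch of step 4: the model reduces its prediction errors on all input
# and pain units, while the controller follows the approximated gradient
# that reduces the distance between desired and predicted pain and
# reinforcement signals. predicts[k] is the model unit predicting unit k;
# p[u][ij] approximates the derivative of unit u's activation w.r.t.
# weight ij. All names are illustrative.

def update_weights(w, p, y, desired, predicts, inputs, pains, W_M, W_C,
                   lr_model, lr_ctrl):
    for ij in W_M:
        w[ij] += lr_model * sum(
            (y[k] - y[predicts[k]]) * p[predicts[k]][ij]
            for k in inputs | pains)
    for ij in W_C:
        w[ij] += lr_ctrl * sum(
            (desired[k] - y[k]) * p[predicts[k]][ij]
            for k in pains)
    return w
```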
By employing probabilistic output units for $A$ and by using `gradient descent through random number generators' we can introduce explicit explorative random search capabilities into the otherwise deterministic algorithm. In the context of the IID algorithm, this works as follows: A probabilistic output unit consists of a conventional unit which acts as a mean generator and a second conventional unit which acts as a variance generator. At a given time, the probabilistic output is computed by $y = \mu + \sigma z$, where $\mu$ is the activation of the mean generator, $\sigma$ is the activation of the variance generator, and $z$ is a random number drawn from a distribution with zero mean and unit variance (e.g. the standard normal distribution).
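Under the common choice of normally distributed noise, such a unit and its derivatives with respect to the mean and variance generators can be sketched as follows (names are illustrative):

```python
import random

# Sketch of a probabilistic output unit: one conventional unit supplies
# the mean, another the variance, and the emitted activation is the mean
# plus noise scaled by the variance generator's activation. Gradients pass
# `through the random number generator': for a fixed noise sample z, the
# output depends deterministically on mean and sigma, so d_out/d_mean = 1
# and d_out/d_sigma = z.

def probabilistic_output(mean, sigma, rng=random):
    z = rng.gauss(0.0, 1.0)        # noise sample, held fixed for the gradient
    out = mean + sigma * z
    d_out_d_mean = 1.0
    d_out_d_sigma = z
    return out, d_out_d_mean, d_out_d_sigma
```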
By performing more than one iteration of step 1 and step 3 at each time tick, one can adjust the algorithm to environments that change in a manner which is not predictable by semilinear operations (theoretically three additional iterations are sufficient for any environment).
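Repeated activation spreading within one tick can be sketched as follows (illustrative; the full algorithm also updates the gradient values at each iteration):

```python
import math

# Sketch: performing several synchronous activation-spreading iterations
# per time tick composes the semilinear (logistic) maps, so the networks
# can track environmental changes that no single semilinear pass could
# model.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def spread(w, y, iterations):
    # One iteration is one synchronous logistic update of all units.
    for _ in range(iterations):
        y = [logistic(sum(w_kl * y_l for w_kl, y_l in zip(row, y)))
             for row in w]
    return y
```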
The algorithm is local in time, but not in space. A justification of certain deviations from `pure gradient descent through time', and a description of how the algorithm can be used for planning action sequences, are given elsewhere, as are two quite different, entirely local methods.