In step 1 of the main loop of the algorithm, actions to be performed in the external world are computed. These actions are based on both current and previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated. In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible. In step 3 the model network sees the last input and the current output of the controller at the same time. The model network tries to predict the new input without seeing it. Again the relevant gradient information is computed. In step 4 the model network is updated in order to better predict the input (including pleasure and pain) for the controller. The weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and pleasure units. `Teacher forcing' [13] is used in the model network (although there is no teacher besides the environment). The partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network.

Notation (the reader may find it convenient to compare with
[13]):
* is the set of all non-input units of the control network,
is the set of its output units,
is the set of its `normal' input units,
is the set of its pain and pleasure units,
is the set of all units of the model network,
is the set of its output units,
is the set of all units that predict pain or pleasure,
is the set of variables for the weights of the model network,
is the set of variables for the weights of the control network,
is the variable for the
updated activation of the th unit from
,
is the variable for the
last value of
,
is the variable for
the weight of the directed connection from unit to unit .
is the Kronecker-delta, which is 1 for and 0 otherwise,
is the variable which
gives the current (approximated) value
of
,
is the variable which
gives the last value
of
.
If then is 's desired activation for all times,
if
, then is the unit from which
predicts .
is the learning rate for
the control network,
is the learning rate for
the model network.
*

*
,
.
Each unit in
has one
forward connection to each unit
in ,
each unit in is connected to each other
unit in ,
each unit in is connected
to each other unit in .
Each weight variable of a connection leading
to a unit in is said to belong to ,
each weight variable of a connection leading
to a unit in is said to belong to .
For each weight
there are -values for
all ,
for each weight
there are -values for
all
.*
The parallel version of the algorithm works as follows:

*INITIALIZATION:
*

*
:
random,
possible :
.
*

*
*

*
:
Set according to the current environment,
*

*UNTIL TERMINATION CRITERION IS REACHED :
*

*1.
.
*

*
:
*

*
:
,
:
.
*

*2. Execute all actions based on activations of
units in . Update the environment.
*

*
:
Set according to environment.
*

*3.
.
*

*
:
*

*
:
,
:
.
*

*4.
:
*

*
:
*

*
:
,
,
,
*

*
.
*

The algorithm is local in time, but not in space. The computation complexity per time step is In what follows we describe some useful extensions of the scheme.

*1. More network ticks than environmental ticks.*
For highly `non-linear' environments
the algorithm has to be modified in a trivial manner
such that the involved networks
perform more than one (but not more than three)
iterations of step 1 and step 3 at each
time step.
(4-layer-operations in principle
can produce an arbitrary approximation of any desired mapping.)

*2. Adaptive randomness*.
Explicit explorative random
search capabilities can be introduced by
probabilistic controller outputs and
`gradient descent through random number generators'
[12].
We adjust both the mean and the variance of the
controller actions.
In the context of the IID algorithm, this works as follows:
A probabilistic output unit consists
of a conventional unit which acts as a mean generator and a
conventional unit
which
acts as a variance generator.
At a given time, the probabilistic output is computed by
,
where is distributed e.g. according to the normal distribution.
The corresponding
must then be updated according to
the following rule:

A more sophisticated strategy to improve the model network is
to introduce *`adaptive curiosity and boredom'*.
The priniciple of adaptive curiosity for model-building neural controllers
[7] says: Spend additional
reinforcement whenever there
is a mismatch between the expectations of the model network and
reality.

*3. Perfect models.*
Sometimes one can gain a `perfect' model by constructing an
appropriate mathematical description of the environmental
dynamics. This saves the time needed to train the model. However,
additional external knowledge is required.
For instance, the description of the environment
might be in form of differential
or difference equations.
In the context of the algorithm above, this means introducing
new variables for each
and each
relevant state variable of the dynamical environment.
The new variables serve to accumulate the values of
.
This can be done
in exactly the same cumulative manner as with the
activations of the model network above.

*4. Augmenting the algorithm by TD-methods.*
The following ideas
are not limited to recurrent nets, but are
also relevant for
feed-forward controllers in
Markovian environments.

It is possible to augment model-building algorithms with
an `adaptive critic' method.
To simplify the discussion, let us assume that there are no pleasure
units, just pain units.
The algorithm's goal is to minimize cumulative pain.
We introduce the TD-principle [10] by changing the error
function of the units in :
At a given time , the contribution of each unit to the
model network's error is
,
where is the activation of unit at time , and
is a discount factor for avoiding
predictions of infinite sums.
Thus is
trained to predict the *sum of all*
(discounted) future pain vectors and becomes a *vector-valued
adaptive critic*.
(This affects the first
*-loop* in step 4 .)

The controller's goal is to minimize the absolute value of 's
pain predictions. Thus,
the contribution of time to the
error function of the controller now becomes
.
This affects the second *For-loop* in step 4 of the
algorithm. Note that
it is not a *state* which is evaluated
by the adaptive critic component, but a *combination of a state
and an action*. This makes the approach similar to
[3].
[7] shows how a recurrent
model/controller combination can be used for look-ahead planning
*without* using TD-methods.

Back to Reinforcement Learning POMDP page