In step 1 of the main loop of the algorithm, actions to be performed in the external world are computed. These actions are based on both current and previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated. In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible. In step 3 the model network sees the last input and the current output of the controller at the same time. The model network tries to predict the new input without seeing it. Again the relevant gradient information is computed. In step 4 the model network is updated in order to better predict the input (including pleasure and pain) for the controller. The weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and pleasure units. `Teacher forcing' [13] is used in the model network (although there is no teacher besides the environment). The partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network.
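To make the flow of one time step concrete, here is a minimal Python/NumPy sketch of the four steps, with all gradient bookkeeping omitted (it is spelled out in the algorithm below). The names, the toy dimensions, and the dummy `env` function are my own illustrative choices, not part of the original formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, chosen only for illustration.
n_in, n_act = 3, 2            # input units (incl. pain/pleasure) and action units
n_ctrl, n_model = 8, 8        # recurrent units of controller and model network

w_C = 0.1 * rng.standard_normal((n_ctrl, n_ctrl + n_in))
w_M = 0.1 * rng.standard_normal((n_model + n_in, (n_model + n_in) + n_in + n_act))

y_C = np.zeros(n_ctrl)            # controller activations (first n_act are actions)
y_M = np.zeros(n_model + n_in)    # model activations; last n_in are prediction units
x = rng.standard_normal(n_in)     # current input vector

def env(action):
    """Dummy environment: the next input depends on the executed action."""
    return np.tanh(rng.standard_normal(n_in) + 0.5 * action.sum())

for t in range(5):
    # Step 1: compute new controller activations from old activations and inputs.
    y_C = np.tanh(w_C @ np.concatenate([y_C, x]))
    action = y_C[:n_act]
    # Step 2: execute the action; its effects become visible as the next input.
    x_next = env(action)
    # Step 3: the model sees the last input and the current action and
    # predicts the new input without seeing it.
    y_M = np.tanh(w_M @ np.concatenate([y_M, x, action]))
    prediction = y_M[-n_in:]
    # Step 4 (weight and derivative updates omitted here): teacher forcing
    # replaces the model's prediction units by the observed input.
    prediction_error = x_next - prediction   # would drive the model weight update
    y_M[-n_in:] = x_next
    x = x_next
```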
Notation (the reader may find it convenient to compare with [13]):

$C$ is the set of all non-input units of the control network, $A$ is the set of its output units, $I$ is the set of its `normal' input units, $P$ is the set of its pain and pleasure units, $M$ is the set of all units of the model network, $O$ is the set of its output units, $O_P \subset O$ is the set of all units that predict pain or pleasure, $W_M$ is the set of variables for the weights of the model network, $W_C$ is the set of variables for the weights of the control network, $y_{k_{new}}$ is the variable for the updated activation of the $k$th unit from $M \cup C \cup I \cup P$, $y_{k_{old}}$ is the variable for the last value of $y_{k_{new}}$, $w_{ij}$ is the variable for the weight of the directed connection from unit $j$ to unit $i$. $\delta_{ik}$ is the Kronecker-delta, which is 1 for $i = k$ and 0 otherwise, $p^k_{ij_{new}}$ is the variable which gives the current (approximated) value of $\frac{\partial y_{k_{new}}}{\partial w_{ij}}$, $p^k_{ij_{old}}$ is the variable which gives the last value of $p^k_{ij_{new}}$. If $k \in P$, then $c_k$ is $k$'s desired activation for all times; if $k \in I \cup P$, then $k_{pred}$ is the unit from $O$ which predicts $k$. $\alpha_C$ is the learning rate for the control network, $\alpha_M$ is the learning rate for the model network. $|I \cup P| = |O|$, $|P| = |O_P|$.

Each unit in $I \cup P \cup A$ has one forward connection to each unit in $M \cup C$, each unit in $M$ is connected to each other unit in $M$, and each unit in $C$ is connected to each other unit in $C$. Each weight variable of a connection leading to a unit in $M$ is said to belong to $W_M$, each weight variable of a connection leading to a unit in $C$ is said to belong to $W_C$. For each weight $w_{ij} \in W_C$ there are $p^k_{ij}$-values for all $k \in M \cup C \cup I \cup P$; for each weight $w_{ij} \in W_M$ there are $p^k_{ij}$-values for all $k \in M$.
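One possible way to hold this bookkeeping in memory is sketched below; the array layout and the toy unit counts are my own choices and are not prescribed by the text. Note in particular that a weight in $W_C$ carries $p^k_{ij}$-values for every unit in $M \cup C \cup I \cup P$, while a weight in $W_M$ carries them only for units in $M$.

```python
import numpy as np

# Toy unit counts (illustrative only).
n_I, n_P, n_A, n_C = 4, 1, 2, 6   # normal inputs, pain/pleasure, actions, controller units (A is part of C)
n_M = 8                            # model units, of which the last n_I + n_P form O
n_O = n_I + n_P                    # one prediction unit k_pred per controller input unit k

# Weight matrices. A unit in C receives connections from C and from I and P;
# a unit in M receives connections from M, from I and P, and from the action units in A.
W_C = np.zeros((n_C, n_C + n_I + n_P))
W_M = np.zeros((n_M, n_M + n_I + n_P + n_A))

# p^k_{ij} accumulators: a weight in W_C carries one value per unit k in
# M, C, I and P; a weight in W_M carries one value per unit k in M only.
p_for_WC = np.zeros((n_M + n_C + n_I + n_P,) + W_C.shape)
p_for_WM = np.zeros((n_M,) + W_M.shape)
```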
The parallel version of the algorithm works as follows:
INITIALIZATION:
$\forall w_{ij} \in W_M \cup W_C$: $w_{ij} \leftarrow$ random, $\forall$ possible $k$: $p^k_{ij_{old}} \leftarrow 0$, $p^k_{ij_{new}} \leftarrow 0$.
$\forall k \in I \cup P$: Set $y_{k_{old}}$ according to the current environment, $\forall k \in M \cup C$: $y_{k_{old}} \leftarrow 0$.

UNTIL TERMINATION CRITERION IS REACHED:

1. $\forall i \in C$: $y_{i_{new}} \leftarrow f_i(\sum_j w_{ij} y_{j_{old}})$.
$\forall w_{ij} \in W_C$, $\forall k \in C$:
$p^k_{ij_{new}} \leftarrow f'_k(\sum_l w_{kl} y_{l_{old}}) \left( \sum_l w_{kl} p^l_{ij_{old}} + \delta_{ik} y_{j_{old}} \right)$,
$\forall k \in C$: $y_{k_{old}} \leftarrow y_{k_{new}}$,
$\forall w_{ij} \in W_C$, $\forall k \in C$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$.

2. Execute all actions based on activations of units in $A$. Update the environment.
$\forall i \in I \cup P$: Set $y_{i_{new}}$ according to the environment.

3. $\forall i \in M$: $y_{i_{new}} \leftarrow f_i(\sum_j w_{ij} y_{j_{old}})$.
$\forall w_{ij} \in W_M \cup W_C$, $\forall k \in M$:
$p^k_{ij_{new}} \leftarrow f'_k(\sum_l w_{kl} y_{l_{old}}) \left( \sum_l w_{kl} p^l_{ij_{old}} + \delta_{ik} y_{j_{old}} \right)$,
$\forall k \in M$: $y_{k_{old}} \leftarrow y_{k_{new}}$,
$\forall w_{ij} \in W_M \cup W_C$, $\forall k \in M$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$.
4. $\forall w_{ij} \in W_M$:
$w_{ij} \leftarrow w_{ij} + \alpha_M \sum_{k \in I \cup P} (y_{k_{new}} - y_{{k_{pred}}_{new}}) \, p^{k_{pred}}_{ij_{new}}$,
$\forall w_{ij} \in W_C$:
$w_{ij} \leftarrow w_{ij} + \alpha_C \sum_{k \in P} (c_k - y_{k_{new}}) \, p^{k_{pred}}_{ij_{new}}$,
$\forall k \in I \cup P$:
$y_{k_{old}} \leftarrow y_{k_{new}}$,
$y_{{k_{pred}}_{old}} \leftarrow y_{k_{new}}$ (teacher forcing),
$\forall w_{ij} \in W_C$: $p^k_{ij_{old}} \leftarrow p^{k_{pred}}_{ij_{new}}$,
$\forall w_{ij} \in W_M$: $p^{k_{pred}}_{ij_{old}} \leftarrow 0$.
The algorithm is local in time, but not in space. The computation complexity per time step is $O(|W_M \cup W_C| \cdot |M \cup C| \cdot |M \cup C \cup I \cup P|)$.
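For concreteness, here is a minimal NumPy sketch of the two ingredients the loop above relies on: the recursion of [13] for the $p^k_{ij}$-values (as used in steps 1 and 3) and a weight change driven by the accumulated derivatives (as in step 4), shown for a single fully recurrent network. The split into model and control network, teacher forcing, and the substitution of prediction derivatives are omitted; all names and sizes are my own.

```python
import numpy as np

rng = np.random.default_rng(1)

n_units, n_in = 5, 3                  # recurrent units and external input lines
n_src = n_units + n_in                # every unit sees all units plus all inputs
W = 0.1 * rng.standard_normal((n_units, n_src))
p = np.zeros((n_units, n_units, n_src))   # p[k, i, j] approximates d y_k / d w_ij
y_old = np.zeros(n_units)
x = rng.standard_normal(n_in)

def tick(y_old, x, W, p):
    """One activation update plus the corresponding p-value update."""
    z_old = np.concatenate([y_old, x])            # all source activations y_{l,old}
    y_new = np.tanh(W @ z_old)
    fprime = 1.0 - y_new ** 2                     # f'_k at the new net input (tanh)
    p_new = np.empty_like(p)
    for i in range(n_units):
        for j in range(n_src):
            # p^k_{ij,new} = f'_k * ( sum_l w_kl * p^l_{ij,old} + delta_ik * y_{j,old} )
            recur = W[:, :n_units] @ p[:, i, j]
            direct = np.zeros(n_units)
            direct[i] = z_old[j]
            p_new[:, i, j] = fprime * (recur + direct)
    return y_new, p_new

# One gradient-descent weight change toward desired activations of two
# designated 'pain/pleasure' units (indices, targets and rate are arbitrary here).
pain_idx, targets, alpha = np.array([0, 1]), np.zeros(2), 0.1
y_new, p = tick(y_old, x, W, p)
err = targets - y_new[pain_idx]                         # c_k - y_k
W += alpha * np.einsum('k,kij->ij', err, p[pain_idx])   # sum_k err_k * p^k_ij
y_old = y_new
```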
In what follows we describe some useful extensions of the scheme.
1. More network ticks than environmental ticks. For highly `non-linear' environments the algorithm has to be modified in a trivial manner: the networks involved perform more than one (but not more than three) iterations of step 1 and step 3 at each time step. (Four layers of operations are in principle sufficient to approximate any desired mapping arbitrarily well.)
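A sketch of this modification, with the per-step routines left as placeholders for steps 1-4 of the algorithm above (the tick count and the stub names are my own):

```python
INNER_TICKS = 2   # more than one, at most three, per the text; 1 recovers the original scheme

def controller_tick():      # step 1: update controller activations and p-values
    pass

def act_and_update_env():   # step 2: execute actions, observe their effects
    pass

def model_tick():           # step 3: update model activations, predictions and p-values
    pass

def update_weights():       # step 4: model and controller weight changes
    pass

for t in range(100):                  # environmental time steps
    for _ in range(INNER_TICKS):      # several network iterations per environmental tick
        controller_tick()
    act_and_update_env()
    for _ in range(INNER_TICKS):
        model_tick()
    update_weights()
```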
2. Adaptive randomness. Explicit explorative random search capabilities can be introduced by probabilistic controller outputs and `gradient descent through random number generators' [12]. We adjust both the mean and the variance of the controller actions. In the context of the IID algorithm, this works as follows: A probabilistic output unit $k$ consists of a conventional unit $k_\mu$ which acts as a mean generator and a conventional unit $k_\sigma$ which acts as a variance generator. At a given time, the probabilistic output $y_{k_{new}}$ is computed by
$$y_{k_{new}} = y_{{k_\mu}_{new}} + z \, y_{{k_\sigma}_{new}},$$
where $z$ is distributed e.g. according to the normal distribution. The corresponding $p^k_{ij_{new}}$ must then be updated according to the following rule:
$$p^k_{ij_{new}} \leftarrow p^{k_\mu}_{ij_{new}} + \frac{y_{k_{new}} - y_{{k_\mu}_{new}}}{y_{{k_\sigma}_{new}}} \, p^{k_\sigma}_{ij_{new}}.$$
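A small numerical sketch of one such probabilistic output unit and of the combined p-value under the rule above; the toy shapes and variable names are mine:

```python
import numpy as np

rng = np.random.default_rng(2)

# Activations of the mean and variance generator units, and their p-value
# arrays with respect to a toy 4x4 block of controller weights.
y_mu, y_sigma = 0.3, 0.2
p_mu = rng.standard_normal((4, 4))        # approximates d y_mu / d w_ij
p_sigma = rng.standard_normal((4, 4))     # approximates d y_sigma / d w_ij

z = rng.standard_normal()                 # noise sample, e.g. standard normal
y_out = y_mu + z * y_sigma                # probabilistic controller output

# Gradient descent through the random number generator [12]: the sampled z,
# recovered as (y_out - y_mu) / y_sigma, is treated as a constant, so
# d y_out / d w_ij = p_mu + z * p_sigma.
p_out = p_mu + ((y_out - y_mu) / y_sigma) * p_sigma
```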
A more sophisticated strategy for improving the model network is to introduce `adaptive curiosity and boredom'. The principle of adaptive curiosity for model-building neural controllers [7] says: Spend additional reinforcement whenever there is a mismatch between the expectations of the model network and reality.
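As a minimal illustration of this principle, the extra reinforcement could be tied to the model's current prediction error; the squared-error measure and the scaling factor below are my own choices:

```python
import numpy as np

def curiosity_reward(prediction, observation, scale=0.1):
    """Extra internal reinforcement proportional to the mismatch between
    the model network's expectation and the actually observed input."""
    mismatch = np.sum((np.asarray(observation) - np.asarray(prediction)) ** 2)
    return scale * float(mismatch)

# Example: a poorly predicted input earns a larger curiosity bonus.
bonus = curiosity_reward(prediction=[0.1, 0.0], observation=[0.9, -0.2])
```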
3. Perfect models.
Sometimes one can gain a `perfect' model by constructing an
appropriate mathematical description of the environmental
dynamics. This saves the time needed to train the model. However,
additional external knowledge is required.
For instance, the description of the environment might be in the form of differential or difference equations. In the context of the algorithm above, this means introducing new $p^x_{ij}$ variables for each $w_{ij} \in W_C$ and each relevant state variable $x$ of the dynamical environment. The new variables serve to accumulate the values of $\frac{\partial x}{\partial w_{ij}}$. This can be done in exactly the same cumulative manner as with the activations of the model network above.
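As an illustration, suppose the environment is known to obey the difference equation $x(t+1) = 0.9\,x(t) + a(t)$ (a toy choice of mine), where $a$ is the controller's action. The state derivatives can then be accumulated analytically instead of being estimated by a model network:

```python
import numpy as np

DECAY = 0.9   # coefficient of the assumed difference equation x(t+1) = 0.9 x(t) + a(t)

def advance(x, a, p_x, p_a):
    """Advance the known environment and the accumulated derivatives.
    p_x approximates dx/dw_ij; p_a is da/dw_ij delivered by the controller's
    own p-value recursion (shapes match the controller weight array)."""
    x_next = DECAY * x + a
    p_x_next = DECAY * p_x + p_a      # chain rule through the known dynamics
    return x_next, p_x_next

# Toy usage: one environmental step.
p_x = np.zeros((3, 3))                # dx/dw_ij starts at zero
p_a = 0.01 * np.ones((3, 3))          # da/dw_ij from the controller
x, a = 0.5, 0.2
x, p_x = advance(x, a, p_x, p_a)
```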
4. Augmenting the algorithm by TD-methods. The following ideas are not limited to recurrent nets, but are also relevant for feed-forward controllers in Markovian environments.
It is possible to augment model-building algorithms with
an `adaptive critic' method.
To simplify the discussion, let us assume that there are no pleasure
units, just pain units.
The algorithm's goal is to minimize cumulative pain.
We introduce the TD-principle [10] by changing the error function of the units in $O_P$: At a given time $t$, the contribution of each unit $k_{pred} \in O_P$ to the model network's error is
$$\left( y_k(t) + \gamma \, y_{k_{pred}}(t) - y_{k_{pred}}(t-1) \right)^2,$$
where $y_k(t)$ is the activation of unit $k$ at time $t$, and $0 < \gamma < 1$ is a discount factor for avoiding predictions of infinite sums. Thus $O_P$ is trained to predict the sum of all (discounted) future pain vectors and becomes a vector-valued adaptive critic. (This affects the first For-loop in step 4.)
The controller's goal is to minimize the absolute value of $O_P$'s pain predictions. Thus, the contribution of time $t$ to the error function of the controller now becomes
$$\sum_{k_{pred} \in O_P} \left( y_{k_{pred}}(t) \right)^2.$$
This affects the second For-loop in step 4 of the
algorithm. Note that
it is not a state which is evaluated
by the adaptive critic component, but a combination of a state
and an action. This makes the approach similar to
[3].
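A compact sketch of the two modified error contributions; the function names, the discount value, and the toy numbers are mine:

```python
import numpy as np

GAMMA = 0.9   # discount factor, 0 < gamma < 1

def model_td_error(pred_prev, pred_now, pain_now):
    """TD error for a pain-predicting unit: the previous prediction should
    match the current pain plus the discounted current prediction, so the
    unit learns the discounted sum of future pain (first loop of step 4)."""
    return pain_now + GAMMA * pred_now - pred_prev

def controller_td_loss(predictions):
    """Controller objective: the squared pain predictions of the adaptive
    critic units, to be minimized (second loop of step 4)."""
    return float(np.sum(np.asarray(predictions) ** 2))

delta = model_td_error(pred_prev=0.4, pred_now=0.3, pain_now=0.2)   # 0.07
loss = controller_td_loss([0.3, 0.1])                               # 0.10
```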
[7] shows how a recurrent
model/controller combination can be used for look-ahead planning
without using TD-methods.