Next: EXPERIMENTS Up: Reinforcement Learning in Markovian Previous: INTRODUCTION

# THE ALGORITHM

The sequential version of the algorithm can be obtained in a straight-forward manner from the description of the parallel version below. At every time step, the parallel version is performing essentially the same operations:

In step 1 of the main loop of the algorithm, actions to be performed in the external world are computed. These actions are based on both current and previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated. In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible. In step 3 the model network sees the last input and the current output of the controller at the same time. The model network tries to predict the new input without seeing it. Again the relevant gradient information is computed. In step 4 the model network is updated in order to better predict the input (including pleasure and pain) for the controller. The weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and pleasure units. Teacher forcing' [13] is used in the model network (although there is no teacher besides the environment). The partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network.

Notation (the reader may find it convenient to compare with [13]): is the set of all non-input units of the control network, is the set of its output units, is the set of its normal' input units, is the set of its pain and pleasure units, is the set of all units of the model network, is the set of its output units, is the set of all units that predict pain or pleasure, is the set of variables for the weights of the model network, is the set of variables for the weights of the control network, is the variable for the updated activation of the th unit from , is the variable for the last value of , is the variable for the weight of the directed connection from unit to unit . is the Kronecker-delta, which is 1 for and 0 otherwise, is the variable which gives the current (approximated) value of , is the variable which gives the last value of . If then is 's desired activation for all times, if , then is the unit from which predicts . is the learning rate for the control network, is the learning rate for the model network.

, . Each unit in has one forward connection to each unit in , each unit in is connected to each other unit in , each unit in is connected to each other unit in . Each weight variable of a connection leading to a unit in is said to belong to , each weight variable of a connection leading to a unit in is said to belong to . For each weight there are -values for all , for each weight there are -values for all . The parallel version of the algorithm works as follows:

INITIALIZATION:

: random, possible : .

: Set according to the current environment,

UNTIL TERMINATION CRITERION IS REACHED :

1. .

:

: , : .

2. Execute all actions based on activations of units in . Update the environment.

: Set according to environment.

3. .

:

: , : .

4. :

:

: , , ,

.

The algorithm is local in time, but not in space. The computation complexity per time step is In what follows we describe some useful extensions of the scheme.

1. More network ticks than environmental ticks. For highly non-linear' environments the algorithm has to be modified in a trivial manner such that the involved networks perform more than one (but not more than three) iterations of step 1 and step 3 at each time step. (4-layer-operations in principle can produce an arbitrary approximation of any desired mapping.)

2. Adaptive randomness. Explicit explorative random search capabilities can be introduced by probabilistic controller outputs and gradient descent through random number generators' [12]. We adjust both the mean and the variance of the controller actions. In the context of the IID algorithm, this works as follows: A probabilistic output unit consists of a conventional unit which acts as a mean generator and a conventional unit which acts as a variance generator. At a given time, the probabilistic output is computed by , where is distributed e.g. according to the normal distribution. The corresponding must then be updated according to the following rule:

A more sophisticated strategy to improve the model network is to introduce adaptive curiosity and boredom'. The priniciple of adaptive curiosity for model-building neural controllers [7] says: Spend additional reinforcement whenever there is a mismatch between the expectations of the model network and reality.

3. Perfect models. Sometimes one can gain a perfect' model by constructing an appropriate mathematical description of the environmental dynamics. This saves the time needed to train the model. However, additional external knowledge is required. For instance, the description of the environment might be in form of differential or difference equations. In the context of the algorithm above, this means introducing new variables for each and each relevant state variable of the dynamical environment. The new variables serve to accumulate the values of . This can be done in exactly the same cumulative manner as with the activations of the model network above.

4. Augmenting the algorithm by TD-methods. The following ideas are not limited to recurrent nets, but are also relevant for feed-forward controllers in Markovian environments.

It is possible to augment model-building algorithms with an `adaptive critic' method. To simplify the discussion, let us assume that there are no pleasure units, just pain units. The algorithm's goal is to minimize cumulative pain. We introduce the TD-principle [10] by changing the error function of the units in : At a given time , the contribution of each unit to the model network's error is , where is the activation of unit at time , and is a discount factor for avoiding predictions of infinite sums. Thus is trained to predict the sum of all (discounted) future pain vectors and becomes a vector-valued adaptive critic. (This affects the first -loop in step 4 .)

The controller's goal is to minimize the absolute value of 's pain predictions. Thus, the contribution of time to the error function of the controller now becomes . This affects the second For-loop in step 4 of the algorithm. Note that it is not a state which is evaluated by the adaptive critic component, but a combination of a state and an action. This makes the approach similar to [3]. [7] shows how a recurrent model/controller combination can be used for look-ahead planning without using TD-methods.

Next: EXPERIMENTS Up: Reinforcement Learning in Markovian Previous: INTRODUCTION
Juergen Schmidhuber 2003-02-25

Back to Reinforcement Learning POMDP page