In the sequel, an on-line algorithm for reinforcement
learning in non-stationary reactive environments is described first.
The algorithm relies heavily on an *adaptive model of the environmental
dynamics*. The main contribution of this paper (see the second section)
is to demonstrate how the algorithm may be naturally augmented
by *curiosity* and *boredom*,
in order to improve the world model
in an on-line manner.

Consider an `animat' whose movements are controlled by the output units of a neural network, called the control network, which also receives the animat's sensory perception by means of its input units. The animat is potentially able to produce actions that may change the environmental input (external feedback caused by the `reactive' environment). By means of recurrent connections in the network the animat is also potentially able to internally represent past events (internal feedback).

The animat sometimes experiences different types of
reinforcement
by means of so-called *reinforcement units*
or *pain units* that become activated in moments
of reinforcement or `pain'
(e.g. the experience of
bumping against an obstacle with an extremity).
The animat's only
goal is to minimize cumulative pain and maximize cumulative
reinforcement. The animat is
autonomous in the sense that no intelligent
external teacher is required to provide additional goals or
subgoals for it.

Reinforcement units and pain units are similar to other input units
in the sense that they
possess conventional outgoing connections to other units.
However, unlike normal input units they
can have *desired activation
values* at every time. For the purpose of this paper
we say that the desired activation of a pain unit is zero
for all times; other reinforcement units may have positive desired
values.
In the sequel we assume a discrete time environment with
`time ticks'.
At a given time $t$, the quantity to be minimized by the learning
algorithm is
$$\sum_{\tau > t} \sum_i \left( c_i - r_i(\tau) \right)^2,$$
where $r_i(\tau)$ is the activation of the $i$th pain or
reinforcement unit at time $\tau$,
$\tau$ ranges over all remaining
time ticks still to come, and $c_i$ is the desired activation
of the $i$th
reinforcement or pain unit for all times.
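As a plain numerical illustration (the function and variable names are assumptions, not the paper's notation), the quantity to be minimized is just a sum of squared differences between desired and actual activations over the remaining time ticks:

```python
# The quantity to be minimized: summed squared differences between desired
# and actual activations of all pain/reinforcement units over all remaining
# time ticks. `desired` and `activations` are illustrative names.

def cumulative_cost(desired, activations):
    """desired[i]: desired activation of unit i (zero for pain units).
    activations[t][i]: activation of unit i at remaining time tick t."""
    return sum((desired[i] - a_t[i]) ** 2
               for a_t in activations
               for i in range(len(desired)))

# One pain unit (desired 0) and one reinforcement unit (desired 1),
# observed over two time ticks:
cost = cumulative_cost([0.0, 1.0], [[0.2, 0.5], [0.0, 1.0]])
# -> 0.2**2 + 0.5**2 + 0.0 + 0.0 = 0.29
```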

The reinforcement learning
animat faces a very general spatio-temporal
credit assignment task: No external teacher provides
knowledge about
e.g. desired outputs or
`episode boundaries' (externally defined temporal
boundaries of training intervals). In the sequel it is demonstrated
how the animat may employ
a combination of two recurrent *self-supervised* learning
networks in order to achieve its goal.

Munro [2],
Jordan [1],
Werbos [12],
Robinson and Fallside [6],
and Nguyen and Widrow [4]
used *`model networks'* for constructing a mapping
from output actions of a control network to their
effects in `task space' [1].
The general system described below (an improved version of the
variants described in
[8] and
[10])
also employs an adaptive model of the environmental dynamics
for computing gradients of the control network's inputs with respect
to the controller weights.
The model network is trained to
simulate the environment by making predictions about future
inputs, including pain and reinforcement inputs.
Training of the controller works as follows:
Since we cannot propagate input errors (e.g. differences
between actual pain signals and desired zero pain signals)
`through the environment', we propagate them through the
combination of model network and controller. Only the controller
weights change during this phase; the weights of the model
network remain fixed.
(Actually, we do not really
propagate errors through the networks. We use an algorithm
which is functionally equivalent to `back-propagation through time'
but uses only computations `going forward in time'. So there is
no need for storing past activations.)
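The forward-in-time computation mentioned in the parenthesis can be illustrated for a single self-recurrent logistic unit; everything here (the one-unit setting, the names) is an illustrative assumption, not the paper's network:

```python
import math

# Accumulating dy/dw 'forward in time' (RTRL-style), functionally
# equivalent to back-propagation through time but needing no stored past
# activations: alongside the activation y we carry p = dy/dw.

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def run(w, x_seq):
    y, p = 0.0, 0.0                  # activation and dy/dw, both start at 0
    for x in x_seq:
        net = w * y + x
        y_new = logistic(net)
        # forward derivative update: dy_new/dw = f'(net) * (y + w * dy/dw)
        p = y_new * (1.0 - y_new) * (y + w * p)
        y = y_new
    return y, p

w = 0.7
x_seq = [0.3, -0.5, 0.9]
y, p_forward = run(w, x_seq)

# Check against a finite-difference estimate of dy/dw:
eps = 1e-6
y_plus, _ = run(w + eps, x_seq)
y_minus, _ = run(w - eps, x_seq)
p_numeric = (y_plus - y_minus) / (2 * eps)
# p_forward and p_numeric agree to several decimal places
```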

Both the
control network and the model network are fully recurrent.
The algorithm differs from algorithms by other authors
in at least some
of the following respects:
It aims at on-line learning and *locality in time*, it does not
care for `epoch-boundaries', it needs
only reinforcement information for learning, it allows
different kinds of reinforcement (or pain), and it
allows both internal and external feedback with
theoretically arbitrary time lags.
Unlike Robinson and Fallside's
approach (the one most closely related
to the approach described here),
`credit assignment paths' are provided that lead from
pain units back to output units and back to all input units, and so
on.
There are also
credit assignment paths that lead from input units back to the
input units themselves, and from there to the output units.
The latter paths
are important in the common case
where the environment can change even in the absence of
recent output actions.

The discrete time algorithm below *concurrently* adjusts
the fully recurrent model network and the fully recurrent
control network.
An on-line version
of Robinson and Fallside's Infinite-Input-Duration
learning algorithm for fully recurrent
networks [5] (first implemented by
Williams and Zipser [14])
is used for training
both the model network and the combination of controller
and model network.
The algorithm
is a particular instantiation of a more
general form and is based on the logistic activation
function for all non-input units.

In step 1 of the main loop of the algorithm actions in the external world are computed. Due to the internal feedback, these actions are based on previous inputs and outputs. For all new activations, the corresponding derivatives with respect to all controller weights are updated.

In step 2 actions are executed in the external world, and the effects of the current action and/or previous actions may become visible.

In step 3 the model network tries to predict these effects without seeing the new input. Again the relevant gradient information is computed.

In step 4 the model network is updated in order to better predict the input (including pain) for the controller. Finally, the weights of the control network are updated in order to minimize the cumulative differences between desired and actual activations of the pain and reinforcement units. Since the control network continues activation spreading based on the actual inputs instead of using the predictions of the model network, `teacher forcing' [14] is used in the model network.
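As a toy illustration of the four-step loop, here is a minimal sketch with scalar stand-ins for both networks; the environment, the scalar "networks", and all names are illustrative assumptions, not the paper's algorithm:

```python
# Scalar stand-ins for the four steps: the controller is a single adjustable
# bias emitting an action, the model is a linear predictor of the next input,
# and the controller's `pain' error is propagated through the FROZEN model
# (via its weight w_m), never through the environment itself.

def environment(action):          # unknown to the learner
    return 0.5 * action + 0.4

w_c = 0.5                         # controller "network": a bias -> action
w_m, b_m = 0.0, 0.0               # model "network": x_pred = w_m*action + b_m
lr_c, lr_m = 0.2, 0.2             # learning rates
c = 1.0                           # desired activation of the input unit
pain = []

for t in range(300):
    action = w_c                          # step 1: compute the action
    x_next = environment(action)          # step 2: execute, observe new input
    x_pred = w_m * action + b_m           # step 3: predict without seeing it
    e_m = x_next - x_pred                 # step 4: train the model ...
    w_m += lr_m * e_m * action
    b_m += lr_m * e_m
    # ... and the controller, with its error pushed through the model
    # (d x_pred / d action = w_m); the model's weights stay fixed here
    w_c += lr_c * (c - x_next) * w_m
    pain.append((c - x_next) ** 2)
# pain[-1] ends up far below pain[0]: the controller has learned, via the
# model, which action drives the input toward its desired value
```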

This system incorporates various improvements over those described in [8] and [10]. For instance, the partial derivatives of the controller's inputs with respect to the controller's weights are approximated by the partial derivatives of the corresponding predictions generated by the model network. Furthermore, the model sees the last input and the current output of the controller at the same time.

Notation (the reader may find it convenient to compare with [14]):

$A$ is the set of all non-input units of the control network,
$O \subseteq A$ is the set of its output units,
$I$ is the set of its `normal' input units,
$P$ is the set of its pain and reinforcement units,
$M$ is the set of all units of the model network,
$M_O \subseteq M$ is the set of its output units,
$M_P \subseteq M_O$ is the set of all units that predict pain or reinforcement,
$W_M$ is the set of variables for the weights of the model network,
$W_C$ is the set of variables for the weights of the control network,
$y_{k_{new}}$ is the variable for the
updated activation of the $k$th unit from
$A \cup I \cup P \cup M$,
$y_{k_{old}}$ is the variable for the
last value of
$y_{k_{new}}$,
$w_{ij}$ is the variable for
the weight of the directed connection from unit $j$ to unit $i$,
$p^k_{ij_{new}}$ is the variable which
gives the current (approximated) value
of
$\frac{\partial y_{k_{new}}}{\partial w_{ij}}$,
$p^k_{ij_{old}}$ is the variable which
gives the last value
of
$p^k_{ij_{new}}$,
if $k \in P$ then $c_k$ is $k$'s desired activation for all times,
$\alpha_C$ is the learning rate for
the control network,
$\alpha_M$ is the learning rate for
the model network.

$\mid M_O \mid = \mid I \mid + \mid P \mid$,
$\mid M_P \mid = \mid P \mid$.
If
$k \in I \cup P$, then $k_{pred}$ is the unit from $M_O$ which
predicts $k$.
Each unit from
$I \cup P \cup O$ has one
forward connection to each unit
from $A \cup M$. Each unit from $M$ is connected to each other
unit from $M$. Each unit from $A$ is connected
to each other unit from $A$.
Each weight of a connection leading
to a unit in $M$ is said to belong to $W_M$.
Each weight of a connection leading
to a unit in $A$ is said to belong to $W_C$.
Each weight
$w_{ij} \in W_M$ needs $p^k_{ij}$-values for
all $k \in M$.
Each weight
$w_{ij} \in W_C$ needs $p^k_{ij}$-values for
all
$k \in A \cup I \cup P$.

INITIALIZATION:

For all
$w_{ij} \in W_M \cup W_C$:

begin
$w_{ij} \leftarrow$ random,

for all possible $k$: $p^k_{ij_{old}} \leftarrow 0$, $p^k_{ij_{new}} \leftarrow 0$
end.

For all
$k \in A \cup M$: $y_{k_{old}} \leftarrow 0$.

For all
$k \in I \cup P$:

Set $y_{k_{old}}$ by environmental perception.

FOREVER REPEAT:
*

1. For all
$k \in A$: $y_{k_{new}} \leftarrow \frac{1}{1 + e^{-\sum_l w_{kl} y_{l_{old}}}}$.

For all
$w_{ij} \in W_C$, $k \in A$:

$p^k_{ij_{new}} \leftarrow y_{k_{new}} (1 - y_{k_{new}}) \left( \sum_l w_{kl} \, p^l_{ij_{old}} + \delta_{ik} \, y_{j_{old}} \right)$,

where $\delta_{ik}$ denotes the Kronecker delta.

For all $k \in A$:

begin
$y_{k_{old}} \leftarrow y_{k_{new}}$,

for all
$w_{ij} \in W_C$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$
end.
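The derivative recurrence of step 1 (the RTRL-style update of [5] and [14] for logistic units) can be checked numerically on a tiny fully recurrent net; the two-unit setting and all names are illustrative assumptions:

```python
import math

# Step-1-style recurrence for a tiny fully recurrent net of logistic units
# (two units, no separate input units): p[k][i][j] approximates
# d y_k / d w_ij and is checked against a finite-difference estimate.

N = 2

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

def step(y_old, p_old, w):
    y_new = [f(sum(w[k][l] * y_old[l] for l in range(N))) for k in range(N)]
    p_new = [[[0.0] * N for _ in range(N)] for _ in range(N)]
    for k in range(N):
        deriv = y_new[k] * (1.0 - y_new[k])        # logistic derivative
        for i in range(N):
            for j in range(N):
                kron = y_old[j] if i == k else 0.0  # Kronecker-delta term
                p_new[k][i][j] = deriv * (
                    sum(w[k][l] * p_old[l][i][j] for l in range(N)) + kron)
    return y_new, p_new

def run(w, steps=3):
    y = [0.5, 0.5]                                  # fixed initial state
    p = [[[0.0] * N for _ in range(N)] for _ in range(N)]
    for _ in range(steps):
        y, p = step(y, p, w)
    return y, p

w = [[0.4, -0.3], [0.2, 0.1]]
y, p = run(w)

eps = 1e-6                 # finite-difference check of d y_0 / d w_01
w[0][1] += eps
y_plus, _ = run(w)
w[0][1] -= 2 * eps
y_minus, _ = run(w)
numeric = (y_plus[0] - y_minus[0]) / (2 * eps)
# p[0][0][1] and `numeric` agree to several decimal places
```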

2. Execute all motoric actions based on activations of
units in $O$. Update the environment.

For all
$k \in I \cup P$:

Set $y_{k_{new}}$ by environmental perception.

3. For all
$k \in M$: $y_{k_{new}} \leftarrow \frac{1}{1 + e^{-\sum_l w_{kl} y_{l_{old}}}}$.

For all
$w_{ij} \in W_M \cup W_C$, $k \in M$:

$p^k_{ij_{new}} \leftarrow y_{k_{new}} (1 - y_{k_{new}}) \left( \sum_l w_{kl} \, p^l_{ij_{old}} + \delta_{ik} \, y_{j_{old}} \right)$.

For all $k \in M$:

begin
$y_{k_{old}} \leftarrow y_{k_{new}}$,

for all
$w_{ij} \in W_M \cup W_C$: $p^k_{ij_{old}} \leftarrow p^k_{ij_{new}}$
end.

4. For all
$w_{ij} \in W_M$:

$w_{ij} \leftarrow w_{ij} + \alpha_M \sum_{k \in I \cup P} \left( y_{k_{new}} - y_{{k_{pred}}_{new}} \right) p^{k_{pred}}_{ij_{new}}$.

For all
$w_{ij} \in W_C$:

$w_{ij} \leftarrow w_{ij} + \alpha_C \sum_{k \in P} \left( c_k - y_{k_{new}} \right) p^{k_{pred}}_{ij_{new}}$.

For all
$k \in I \cup P$:

begin
$y_{k_{old}} \leftarrow y_{k_{new}}$,
$y_{{k_{pred}}_{old}} \leftarrow y_{k_{new}}$ (teacher forcing),

for all
$w_{ij} \in W_C$: $p^k_{ij_{old}} \leftarrow p^{k_{pred}}_{ij_{new}}$,

for all
$w_{ij} \in W_M$: $p^{k_{pred}}_{ij_{old}} \leftarrow 0$
end.

By employing probabilistic output units for $O$ and
by using `gradient descent through random number generators'
[13] we can introduce explicit explorative random
search capabilities into the otherwise deterministic algorithm.
In the context of the IID algorithm, this works as follows:
A probabilistic output unit $k$ consists
of a conventional unit $k_\mu$ which acts as a mean generator and a
conventional unit $k_\sigma$
which
acts as a variance generator.
At a given time, the probabilistic output is computed by
$$y_{k_{new}} = y_{{k_\mu}_{new}} + z \, y_{{k_\sigma}_{new}},$$
where $z$ is distributed e.g. according to the normal distribution. The corresponding $p^k_{ij_{new}}$ have to be updated according to the following rule:
$$p^k_{ij_{new}} \leftarrow p^{k_\mu}_{ij_{new}} + \frac{y_{k_{new}} - y_{{k_\mu}_{new}}}{y_{{k_\sigma}_{new}}} \, p^{k_\sigma}_{ij_{new}}.$$
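A sketch of such a probabilistic output unit, with the derivative reusing the sampled noise; all concrete numbers and names are illustrative assumptions:

```python
import random

# A probabilistic output unit: a mean generator and a variance generator
# produce y = mu + z * sigma with z drawn from the standard normal
# distribution. The derivative w.r.t. any weight combines the generators'
# derivatives, recovering the sampled noise via z = (y - mu) / sigma.

random.seed(0)

mu, sigma = 0.3, 0.2           # outputs of the two conventional units
p_mu, p_sigma = 0.7, -0.4      # their derivatives w.r.t. some weight

z = random.gauss(0.0, 1.0)
y = mu + z * sigma             # the probabilistic output
p_y = p_mu + ((y - mu) / sigma) * p_sigma   # gradient through the RNG

# with the SAME noise z, a small weight change eps shifts mu by eps*p_mu
# and sigma by eps*p_sigma, so a finite difference reproduces p_y:
eps = 1e-6
y_eps = (mu + eps * p_mu) + z * (sigma + eps * p_sigma)
numeric = (y_eps - y) / eps
```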

By performing more than one iteration of step 1 and step 3 at each time tick, one can adjust the algorithm to environments that change in a manner which is not predictable by semilinear operations (theoretically three additional iterations are sufficient for any environment).

The algorithm is local in time, but not in space. See [8] for a justification of certain deviations from `pure gradient descent through time', and for a description of how the algorithm can be used for planning action sequences. See [7] and [9] for two quite different entirely local methods.
