If $M$ is used as a `world model', then in many applications $i_M(t) = o_A(t) \circ x(t)$ and $d_M(t) = x(t+1)$, where $o_A(t)$ is the output vector of a controller $A$ at time $t$, `$\circ$' is the concatenation operator, and $x(t)$ is the environmental input at time $t$. In general, $o_A(t)$ influences the state of the environment and may therefore influence $x(t+1)$.
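This convention can be sketched in a few lines (a minimal illustration only: the controller, the environment, and the dimensions below are stand-in assumptions, not part of the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
action_dim, input_dim = 3, 4              # assumed sizes, not from the paper

def controller_output(x):
    """Stand-in for controller A's output vector o_A(t) given x(t)."""
    return np.tanh(rng.standard_normal(action_dim))

def environment_step(o, x):
    """Stand-in for the environment's reaction; in general o_A(t)
    influences the environmental state and hence x(t+1)."""
    return rng.standard_normal(input_dim)

x_t = rng.standard_normal(input_dim)      # environmental input x(t)
o_t = controller_output(x_t)              # controller output o_A(t)
i_M = np.concatenate([o_t, x_t])          # model input  i_M(t) = o_A(t) o x(t)
d_M = environment_step(o_t, x_t)          # model target d_M(t) = x(t+1)
```

The model $M$ is then trained on the pair `(i_M, d_M)` at every time step.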

In [7] confidence modules were successfully applied to the problem of meaningful hierarchical sequence chunking. This section (which contains the main contribution of this paper) describes how they can help to make the construction of a world model more efficient.

We define curiosity as the desire to improve a predictor of the reactions of an environment (a `world model'). In [8] and [4] the following basic idea for `on-line state space exploration by implementing dynamic curiosity and boredom' was formulated: provide reinforcement to a model-building control system whenever there is a mismatch between the expectations of the adaptive world model and reality. Any sensible reinforcement learning algorithm can be used to encourage the controller to generate action sequences that provoke situations where the world model tends to make bad predictions. Since the model is adaptive, its predictions will often improve. This in turn leads to less reinforcement for the control system, so the corresponding action sequences become discouraged: the controller gets `bored' with the corresponding situations and starts to focus on still unpredictable parts of the environment.
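The loop can be illustrated numerically (a hedged sketch under strong simplifying assumptions: a deterministic linear environment and a linear adaptive model, neither of which is the architecture of [8]). The curiosity reinforcement equals the model's prediction error, and it dries up as the model adapts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy deterministic environment: the reaction is a fixed linear map of
# the input. The adaptive world model is linear too (for brevity only).
W_true = rng.standard_normal((4, 4)) * 0.5
W_model = np.zeros((4, 4))
lr = 0.1

rewards = []
for t in range(200):
    x = rng.standard_normal(4)             # environmental input x(t)
    prediction = W_model @ x               # model's expectation of x(t+1)
    reality = W_true @ x                   # actual reaction
    mismatch = reality - prediction
    # Curiosity reinforcement: spent whenever expectation and reality differ.
    rewards.append(float(np.sum(mismatch ** 2)))
    W_model += lr * np.outer(mismatch, x)  # the model adapts ...
    # ... so the reinforcement shrinks over time: the controller gets `bored'.
```

As the model improves, the reward stream for these situations decays toward zero, which is exactly the `boredom' mechanism described above.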

The particular implementation described in [8] employed a recurrent confidence network with a one-dimensional output for modelling the expected error of the model network (this error was called the `curiosity reinforcement'). The confidence network was not called that at the time: it was part of the model network (which predicted the next state of the environment plus a reinforcement vector including all kinds of reinforcement, not just `curiosity reinforcement'). The target activation of the single output unit of the confidence net was a function of the current error of the model network; in the simplest case this function was linear. The controller's goal was to activate the error-predicting unit by creating action sequences that provoke mismatches between expectations and reality. The gradient computed for the error predictor also served to change the internal representations of the whole network (whose error function simply contained an additional term). Recently, [12] described related ideas (using the term `competence network' instead of the term `confidence network' used in [7] and [5]).
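The error-predicting part can be sketched as follows. This is a hedged stand-in, not the recurrent architecture of [8]: the confidence net here is a feedforward linear unit, and the feature map and the form of the model error are illustrative assumptions. Its single output is trained toward a (here linear) function of the model network's current error:

```python
import numpy as np

rng = np.random.default_rng(1)

dim = 4
w_conf = np.zeros(dim)                    # linear `confidence net' (assumption)
lr = 0.02

def model_network_error(x):
    """Stand-in for the model network's current squared error on input x;
    here it happens to depend only on the first input component."""
    return float(x[0] ** 2)

residuals = []
for _ in range(500):
    x = rng.standard_normal(dim)
    phi = x ** 2                          # hypothetical feature map
    predicted_err = float(w_conf @ phi)   # output of the error-predicting unit
    target = model_network_error(x)       # target = function of current error
    residuals.append(abs(target - predicted_err))
    w_conf += lr * (target - predicted_err) * phi

# The controller's goal is to activate this unit: it seeks inputs x
# for which predicted_err is large.
```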

One problem with the idea above is that in non-deterministic environments the controller will focus on parts of the environmental dynamics that are inherently unpredictable. This is because the adaptive model will usually produce incorrect predictions for the uncertain parts of the environment. The control system will therefore keep receiving reinforcement even though the world model cannot be expected to improve.
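A minimal numerical illustration of this problem (the `environment' here is just a fair coin, an assumption for the sake of the demonstration): even after the adaptive model has converged to the best possible prediction, its error, and hence the error-based curiosity reinforcement, never vanishes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Inherently unpredictable part of the environment: a fair coin flip.
# The model's best possible prediction is 0.5, yet its error persists.
p_hat = 0.0                                 # model's running estimate
lr = 0.05
rewards = []
for t in range(2000):
    outcome = float(rng.integers(0, 2))     # unpredictable bit
    rewards.append((outcome - p_hat) ** 2)  # error-based curiosity reward
    p_hat += lr * (outcome - p_hat)         # model adapts toward 0.5

# The reward hovers around 0.25 forever: the controller never gets
# `bored' with pure noise, although nothing can be learned from it.
```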

A related problem is that certain parts of the environment can often be represented only by a complex mapping that is difficult to learn, while other parts are `easy to learn'. If we want a system that first tries to solve the easy tasks before focussing on the harder ones, then the system needs an (adaptive) internal representation of something like the expected difficulty of the various learning tasks.

Both problems are related in the sense that both require learning something about the effects of further learning. In what follows, an approach for coping with these problems is described. Unlike the approach of [8], which simply learns to predict errors, the new approach learns to predict cumulative error changes.
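The difference can be made concrete with a toy comparison (a hedged sketch: the two `regions', the scalar predictor, and the EMA-based error estimate below are illustrative assumptions, not the mechanism defined later in the paper). Rewarding the *decrease* of the model's estimated error, rather than the error itself, yields sustained reward only where learning actually happens:

```python
import numpy as np

rng = np.random.default_rng(3)
lr, ema_rate = 0.05, 0.05

def cumulative_progress(sample, steps=2000):
    """Run an adaptive scalar predictor on a stream and accumulate the
    *decrease* of its running (EMA) error estimate - i.e. reward error
    changes, not errors."""
    p, err_ema, total = 0.0, 1.0, 0.0
    for _ in range(steps):
        o = sample()
        err = (o - p) ** 2
        p += lr * (o - p)                  # model update
        new_ema = err_ema + ema_rate * (err - err_ema)
        total += err_ema - new_ema         # reward = error *change*
        err_ema = new_ema
    return total

gain_noise = cumulative_progress(lambda: float(rng.integers(0, 2)))  # fair coin
gain_learn = cumulative_progress(lambda: 1.0)                        # learnable

# The learnable stream yields more cumulative progress than the
# inherently unpredictable one, so a progress-seeking controller
# eventually turns away from pure noise.
```

Under this scheme the noisy region's error estimate settles at its irreducible floor and then stops changing, so the reward there dries up, while the learnable region pays off until it is mastered.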

Juergen Schmidhuber 2003-02-28
