The basic principle can be formulated as follows:
*Learn a mapping from actions (or action sequences) to the
expectation of future performance improvement of the world model.
Encourage action sequences where this expectation is high.*

One way to do this is the following (section 4 will describe alternatives):
*Model
the reliability of the predictions of the adaptive predictor
as described in section 2.
At time t, spend reinforcement for the model-building control system in
proportion to the current change of reliability
of the adaptive predictor.
The `curiosity goal' of the control system
(it might have additional `pre-wired' goals)
is to maximize the expectation of the cumulative sum of future
positive or negative changes in prediction reliability.*
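The principle above can be sketched in a few lines of code. This is a minimal illustration, not the paper's method: it assumes a scalar reliability estimate maintained as an exponential moving average of (1 − prediction error), so that the intrinsic reward is exactly the change of assumed reliability caused by each new observation. The class name, the stand-in predictor, and the smoothing rate `beta` are all hypothetical.

```python
class CuriosityReward:
    """Intrinsic reward = change of the predictor's assumed reliability.

    Hypothetical minimal setup: reliability is an exponential moving
    average of (1 - prediction error), so it rises while the world
    model's predictions improve and falls when they get worse.
    """

    def __init__(self, predictor, beta=0.1):
        self.predictor = predictor  # stand-in world model: x -> predicted next x
        self.beta = beta            # smoothing rate for the reliability estimate
        self.reliability = 0.0      # current assumed reliability

    def step(self, x, x_next):
        """Observe a transition, update reliability, return curiosity reward."""
        error = abs(x_next - self.predictor(x))
        new_reliability = (1 - self.beta) * self.reliability \
            + self.beta * (1.0 - error)
        reward = new_reliability - self.reliability  # change of reliability
        self.reliability = new_reliability
        return reward
```

Note that the reward can be negative: if the world becomes less predictable, reliability drops and the curiosity reward punishes the action sequence that led there.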

More formally:
The control system's
curiosity goal
at time t is to maximize

E [ Σ_{τ > t} γ^(τ − t) D(τ) ].

Here γ is a discount factor for avoiding infinite sums, and D(τ) is the (positive or negative) change of assumed reliability caused by the observation made at time τ, together with the previous action and the previous input.

For instance, if method 1 or method 3 from section 2 is employed, then D(t) = c̄(t) − c(t), where c̄(t) is the confidence network's response to the current observation after having adjusted the network at time t, and c(t) is its response before the adjustment.
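A before/after adjustment of this kind can be sketched with a linear confidence module. This is a hypothetical stand-in for the confidence network of section 2, with an invented gradient-step update; the point is only the structure of the computation: query the module, adjust it on the new observation, and take the change in its response as D(t).

```python
import numpy as np

class ConfidenceModule:
    """Hypothetical stand-in for a confidence network: a linear map
    from an observation to an assumed reliability, adjusted by one
    gradient step per observation."""

    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)  # weights of the linear confidence map
        self.lr = lr            # learning rate for the adjustment step

    def response(self, x):
        return float(self.w @ x)

    def reliability_change(self, x, target):
        before = self.response(x)                  # c(t): response before the adjustment
        self.w += self.lr * (target - before) * x  # adjust the module at time t
        return self.response(x) - before           # D(t): change caused by the observation
```

Here `target` would be the observed reliability signal (e.g. 1 minus the predictor's latest error), so D(t) is positive when the observation raises the module's assumed reliability.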

So far the discussion has not had to refer to any particular reinforcement learning algorithm; every sensible reinforcement learning algorithm ought to be applicable (e.g., [1][16][13][9]). For instance, [6] describes how adaptive critics [1][15] can be used to build a `curious' model-building control system based on the principle described above. The following subsection focuses on Watkins' recent `Q-learning' method.
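To make the connection concrete, here is a one-step tabular Q-learning update (Watkins) in which the immediate-reward slot is filled by the curiosity reward, i.e. the latest change of predictor reliability. All names are illustrative; the source does not prescribe this particular tabular form.

```python
def q_update(Q, s, a, curiosity_reward, s_next, alpha=0.5, gamma=0.9):
    """One tabular Q-learning step where the immediate reward is the
    curiosity reward D(t), the change of predictor reliability."""
    best_next = max(Q[s_next])  # greedy value of the successor state
    Q[s][a] += alpha * (curiosity_reward + gamma * best_next - Q[s][a])
    return Q[s][a]
```

Because D(t) can be negative, Q-values of actions leading to unpredictable or unlearnable regions are driven down, while actions whose consequences the world model is currently learning to predict are reinforced.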
