The basic principle can be formulated as follows: Learn a mapping from actions (or action sequences) to the expectation of future performance improvement of the world model. Encourage action sequences where this expectation is high.
One way to do this is the following (section 4 will describe alternatives): Model the reliability of the predictions of the adaptive predictor as described in section 2. At time $t$, spend reinforcement for the model-building control system in proportion to the current change of reliability of the adaptive predictor. The `curiosity goal' of the control system (it might have additional `pre-wired' goals) is to maximize the expectation of the cumulative sum of future positive or negative changes in prediction reliability.
More formally: The control system's `curiosity goal' at time $t$ is to maximize

$$E\left[ \sum_{\tau > t} D(\tau) \right],$$

where $E$ denotes the expectation operator and $D(\tau)$ denotes the (positive or negative) change in prediction reliability at time $\tau$.
For instance, if method 1 or method 3 from section 2 is employed, then the curiosity reinforcement is the change of the modeled reliability, $D(t) = C(t) - C(t-1)$, where $C(t)$ is $C$'s response to $x(t)$ after having adjusted $C$ at time $t$.
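To make the quantity concrete, here is a minimal numerical sketch (not the paper's own implementation): a running estimate of the predictor's reliability is maintained as an exponential moving average of its negative absolute error, and the curiosity reward at each step is the change of that estimate. The class name, the smoothing constant, and the particular reliability measure are all illustrative assumptions, standing in for the reliability models of section 2.

```python
class CuriosityReward:
    """Sketch of the curiosity reinforcement: the change, from one time
    step to the next, of an estimate of the predictor's reliability.
    'Reliability' here is an exponential moving average of the negative
    absolute prediction error (an assumed stand-in for section 2)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha       # smoothing factor of the reliability estimate
        self.reliability = 0.0   # current reliability estimate

    def update(self, prediction, observation):
        error = abs(prediction - observation)
        previous = self.reliability
        # higher reliability <=> lower recent prediction error
        self.reliability += self.alpha * (-error - self.reliability)
        # curiosity reward: positive when the predictor just became more reliable
        return self.reliability - previous


rewarder = CuriosityReward()
# A predictor whose error shrinks like 1/(k+1): early steps are punished
# (the reliability estimate drops), later steps are rewarded (it recovers).
rewards = [rewarder.update(prediction=1.0, observation=1.0 + 1.0 / (k + 1))
           for k in range(20)]
```

Note that the reward is signed: while the predictor's errors are still larger than its recent average, the change in reliability (and hence the reward) is negative, matching the "positive or negative changes" above.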
So far, the discussion has not had to refer to a particular reinforcement learning algorithm; any sensible reinforcement learning algorithm ought to be applicable (e.g. [1][16][13][9]). For instance, [6] describes how adaptive critics [1][15] can be used to build a `curious' model-building control system based on the principle described above. The following subsection focuses on Watkins' recent `Q-learning' method.
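As an illustration of how a reinforcement learner can consume this curiosity signal, the sketch below couples Watkins-style tabular Q-learning to a toy one-state predictor: one action leads to an observation the predictor already handles, the other to one it must still learn, so only the latter yields curiosity reward. The environment, learning rates, and reward shaping are invented for this example; the paper's own Q-learning construction follows in the next subsection.

```python
from collections import defaultdict

GAMMA, LR = 0.9, 0.5             # discount factor and Q-learning step size

def observe(action):
    # Toy world (invented): action 0 gives an already-predictable
    # observation, action 1 one the predictor must still learn.
    return 0.0 if action == 0 else 5.0

prediction = defaultdict(float)  # adaptive predictor: one estimate per action
q = defaultdict(float)           # tabular Q-values for the single-state world
total_reward = {0: 0.0, 1: 0.0}

def step(action):
    obs = observe(action)
    error_before = abs(prediction[action] - obs)
    prediction[action] += 0.3 * (obs - prediction[action])  # predictor learns
    error_after = abs(prediction[action] - obs)
    return error_before - error_after  # curiosity reward: reliability gain

for t in range(200):
    action = t % 2               # visit both actions alternately
    r = step(action)
    total_reward[action] += r
    # Watkins' Q-learning update (single state, so the successor state is s)
    q[action] += LR * (r + GAMMA * max(q[0], q[1]) - q[action])
```

Only the initially mispredicted action accumulates curiosity reward, so its Q-value comes to dominate; once the predictor has mastered that observation, the reward stream dries up and the preference fades, as the principle intends.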