
ADAPTIVE CONFIDENCE

Consider an adaptive discrete-time `predictor' $M$ (not necessarily a neural network) whose input at time $t$ is the real vector $i_M(t)$ and whose output at time $t$ is the real vector $o_M(t) = f_M(i_M(t), h_M(t))$, where the real vector $h_M(t)$ represents the internal state of $M$. Meaningful internal states are required if the prediction task requires memorizing past events. At time $t$ there is a target output $d_M(t)$. The predictor's goal is to make $o_M(t) = d_M(t)$ for all $t$.

Even after being trained on a number of examples, $M$ will usually still make some errors, particularly if the training environment is noisy. How can we model the reliability of $M$'s predictions?

We introduce an additional `confidence module' $C$ (not necessarily a neural network) whose input at time $t$ is the real vector $i_C(t)=i_M(t)$ and whose output at time $t$ is the real vector $o_C(t) = f_C(i_C(t), h_C(t))$, where the real vector $h_C(t)$ is the internal state of $C$. At time $t$ there is a target output $d_C(t)$ for the confidence module. $d_C(t)$ should provide information about how reliable $M$'s prediction $o_M(t)$ can be expected to be [8] [5] [7].

In what follows, $v^j$ is the $j$th component of a vector $v$, $E$ denotes the expectation operator, $dim(x)$ denotes the dimensionality of vector $x$, $\mid c \mid$ denotes the absolute value of scalar $c$, $P(A \mid B)$ denotes the conditional probability of $A$ given $B$, and $E(A \mid B)$ denotes the conditional expectation of $A$ given $B$. For simplicity, we will concentrate on the case of $h_C(t)=h_M(t)=0$ for all $t$. This means that $M$'s and $C$'s current outputs are based only on the current input. There is a variety of simple ways of representing reliability in $d_C(t)$:



1. Modelling probabilities of global prediction failures. Let $d_C(t)$ be one-dimensional. Let $d_C(t)= P(o_M(t) \neq d_M(t) \mid i_M(t))$. $d_C(t)$ can be estimated by $\frac{n_1}{n_2}$, where $n_2$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$, and $n_1$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$ and $o_M(k) \neq d_M(k)$.
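To make the counting scheme concrete, here is a minimal Python sketch of method 1, assuming discrete (hashable) inputs; all names and data structures are illustrative, not part of the original formulation.

from collections import defaultdict

n2 = defaultdict(int)   # visits of each input i_M
n1 = defaultdict(int)   # visits on which M's prediction failed

def observe(i_M, o_M, d_M):
    # Record one time step k.
    key = tuple(i_M)
    n2[key] += 1
    if tuple(o_M) != tuple(d_M):
        n1[key] += 1

def d_C(i_M):
    # Estimated failure probability n1/n2 for the current input
    # (0 if the input has never been seen).
    key = tuple(i_M)
    return n1[key] / n2[key] if n2[key] else 0.0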



2. Modelling probabilities of local prediction failures. Let $d_C(t)$ be $dim(d_M(t))$-dimensional. Let $d^j_C(t)= P(o^j_M(t) \neq d^j_M(t) \mid i_M(t))$ for all appropriate $j$. $d^j_C(t)$ can be estimated by $\frac{n_1}{n_2}$, where $n_2$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$, and $n_1$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$ and $o^j_M(k) \neq d^j_M(k)$.



Variations of method 1 and method 2 would measure not the probabilities of exact matches between predictions and reality, but the probabilities of `near-matches' within a certain (e.g. Euclidean) tolerance, as in the sketch below.
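A per-component Python sketch in the same spirit, covering method 2 together with the `near-match' variation just mentioned (a component counts as failed only if it misses its target by more than a tolerance); again, all names and the tolerance parameter are illustrative assumptions.

import numpy as np
from collections import defaultdict

n2 = defaultdict(int)    # visits of each input
n1 = {}                  # per-component failure counts per input

def observe(i_M, o_M, d_M, tol=0.0):
    # tol = 0 reproduces exact matching; tol > 0 gives near-matches.
    key = tuple(i_M)
    miss = np.abs(np.asarray(o_M) - np.asarray(d_M)) > tol
    n2[key] += 1
    n1[key] = n1.get(key, np.zeros(miss.size)) + miss

def d_C(i_M):
    # Vector of estimated per-component failure probabilities.
    key = tuple(i_M)
    return n1[key] / n2[key] if n2[key] else None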



3. Modelling global expected error. Let $d_C(t)$ be one-dimensional. Let

\begin{displaymath}d_C(t)= E \left\{ \frac{1}{2}(d_M(t) - o_M(t))^T(d_M(t)-o_M(t))
\mid i_M(t) \right\} . \end{displaymath}

If $C$ is a back-propagation net (e.g. [14]), an approximation of $d_C(t)$ can be obtained by using gradient descent (with a small learning rate) to train $C$ at time $t$ to emit $M$'s error $\frac{1}{2}(d_M(t) - o_M(t))^T(d_M(t)-o_M(t))$. This is a special case of the method described in [8] (there a fully recurrent net was employed). Of course, other error functions are possible. For instance, in the experiments described below, the confidence network predicted the absolute value of the difference between $M$'s (one-dimensional) output and the current target value.
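A minimal sketch of this gradient-descent scheme, using a plain linear model in place of a back-propagation net for brevity; the dimensionality, learning rate, and all names are illustrative assumptions.

import numpy as np

DIM_I = 4                                   # dim(i_M); illustrative
rng = np.random.default_rng(0)
w_C = rng.normal(scale=0.1, size=DIM_I)     # weights of a linear C
b_C = 0.0
LR = 0.01                                   # small learning rate

def train_C(i_M, o_M, d_M):
    # One gradient-descent step of C toward M's current error.
    global w_C, b_C
    i_M, o_M, d_M = map(np.asarray, (i_M, o_M, d_M))
    target = 0.5 * np.dot(d_M - o_M, d_M - o_M)  # M's error = C's target
    o_C = float(np.dot(w_C, i_M) + b_C)          # C's estimate of that error
    grad = o_C - target                          # gradient of 0.5*(o_C - target)^2
    w_C -= LR * grad * i_M
    b_C -= LR * grad
    return o_C

Over many visits to the same input, the squared-loss minimizer pulls $o_C$ toward the conditional expectation of $M$'s error given that input, which is exactly the quantity $d_C(t)$ defined above.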



4. Modelling local expected error. Let $d_C(t)$ be $dim(d_M(t)) $-dimensional. Let

\begin{displaymath}d^j_C(t)= E \{ (d^j_M(t) - o^j_M(t))^2 \mid i_M(t) \} \end{displaymath}

for all appropriate $j$. If $C$ is a back-propagation net, an approximation of $d_C(t)$ can be obtained by using gradient descent (with a small learning rate) to train $C$ at time $t$ to emit $M$'s local prediction errors

\begin{displaymath}\left( (d^1_M(t) - o^1_M(t))^2, \ldots,
(d^m_M(t) - o^m_M(t))^2 \right)^T, \end{displaymath}

where $m = dim(o_M(t)) $.
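The same scheme extends directly to the local case; below is a sketch with a vector-valued linear $C$ (again a stand-in for a back-propagation net, with illustrative names and dimensions) trained toward the vector of local squared errors.

import numpy as np

DIM_I, M_DIM = 4, 3                         # dim(i_M) and m = dim(o_M); illustrative
rng = np.random.default_rng(0)
W_C = rng.normal(scale=0.1, size=(M_DIM, DIM_I))

def train_C_local(i_M, o_M, d_M, lr=0.01):
    # One gradient step toward the target vector ((d^j_M - o^j_M)^2)_j.
    global W_C
    i_M, o_M, d_M = map(np.asarray, (i_M, o_M, d_M))
    target = (d_M - o_M) ** 2               # C's m-dimensional target
    o_C = W_C @ i_M                         # C's current estimate
    W_C -= lr * np.outer(o_C - target, i_M) # squared-loss gradient
    return o_C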

