
ADAPTIVE CONFIDENCE

Consider an adaptive discrete-time `predictor' $M$ (not necessarily a neural network) whose input at time $t$ is the real vector $i_M(t)$ and whose output at time $t$ is the real vector $o_M(t) = f_M(i_M(t), h_M(t))$, where the real vector $h_M(t)$ represents the internal state of $M$. Meaningful internal states are required if the prediction task requires memorizing past events. At time $t$ there is a target output $d_M(t)$. The predictor's goal is to make $o_M(t) = d_M(t)$ for all $t$.

Even after being trained on a number of examples, $M$ will usually still make some errors, particularly if the training environment is noisy. How can we model the reliability of $M$'s predictions?

We introduce an additional `confidence module' $C$ (not necessarily a neural network) whose input at time $t$ is the real vector $i_C(t)=i_M(t)$ and whose output at time $t$ is the real vector $o_C(t) = f_C(i_C(t), h_C(t))$, where the real vector $h_C(t)$ is the internal state of $C$. At time $t$ there is a target output $d_C(t)$ for the confidence module. $d_C(t)$ should provide information about how reliable $M$'s prediction $o_M(t)$ can be expected to be [8] [5] [7].

In what follows, $v^j$ is the $j$th component of a vector $v$, $E$ denotes the expectation operator, $dim(x)$ denotes the dimensionality of vector $x$, $\mid c \mid$ denotes the absolute value of scalar $c$, $P(A \mid B)$ denotes the conditional probability of $A$ given $B$, and $E(A \mid B)$ denotes the conditional expectation of $A$ given $B$. For simplicity, we will concentrate on the case of $h_C(t)=h_M(t)=0$ for all $t$. This means that $M$'s and $C$'s current outputs are based only on the current input. There is a variety of simple ways of representing reliability in $d_C(t)$:



1. Modelling probabilities of global prediction failures. Let $d_C(t)$ be one-dimensional. Let $d_C(t)= P(o_M(t) \neq d_M(t) \mid i_M(t))$. $d_C(t)$ can be estimated by $\frac{n_1}{n_2}$, where $n_2$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$, and $n_1$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$ and $o_M(k) \neq d_M(k)$.
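To make the counting scheme concrete, here is a minimal Python sketch of method 1, assuming discrete (hashable) inputs; all names and data structures are illustrative, not part of the original formulation.

from collections import defaultdict

n2 = defaultdict(int)   # visits of each input i_M
n1 = defaultdict(int)   # visits on which M's prediction failed

def observe(i_M, o_M, d_M):
    # Record one time step k.
    key = tuple(i_M)
    n2[key] += 1
    if tuple(o_M) != tuple(d_M):
        n1[key] += 1

def d_C(i_M):
    # Estimated failure probability n1/n2 for the current input
    # (0 if the input has never been seen).
    key = tuple(i_M)
    return n1[key] / n2[key] if n2[key] else 0.0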



2. Modelling probabilities of local prediction failures. Let $d_C(t)$ be $dim(d_M(t))$-dimensional. Let $d^j_C(t)= P(o^j_M(t) \neq d^j_M(t) \mid i_M(t))$ for all appropriate $j$. $d^j_C(t)$ can be estimated by $\frac{n_1}{n_2}$, where $n_2$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$, and $n_1$ is the number of those times $k \leq t$ with $i_M(k) = i_M(t)$ and $o^j_M(k) \neq d^j_M(k)$.



Variations of method 1 and method 2 would measure not the probabilities of exact matches between predictions and reality, but the probabilities of `near-matches' within a certain (e.g. Euclidean) tolerance, as in the sketch below.
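A per-component Python sketch in the same spirit, covering method 2 together with the `near-match' variation just mentioned (a component counts as failed only if it misses its target by more than a tolerance); again, all names and the tolerance parameter are illustrative assumptions.

import numpy as np
from collections import defaultdict

n2 = defaultdict(int)    # visits of each input
n1 = {}                  # per-component failure counts per input

def observe(i_M, o_M, d_M, tol=0.0):
    # tol = 0 reproduces exact matching; tol > 0 gives near-matches.
    key = tuple(i_M)
    miss = np.abs(np.asarray(o_M) - np.asarray(d_M)) > tol
    n2[key] += 1
    n1[key] = n1.get(key, np.zeros(miss.size)) + miss

def d_C(i_M):
    # Vector of estimated per-component failure probabilities.
    key = tuple(i_M)
    return n1[key] / n2[key] if n2[key] else None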



3. Modelling global expected error. Let $d_C(t)$ be one-dimensional. Let

\begin{displaymath}d_C(t)= E \left\{ \frac{1}{2}(d_M(t) - o_M(t))^T(d_M(t)-o_M(t))
\mid i_M(t) \right\} . \end{displaymath}

If $C$ is a back-propagation net (e.g. [14]), an approximation of $d_C(t)$ can be obtained by using gradient descent (with a small learning rate) to train $C$ at time $t$ to emit $M$'s error $\frac{1}{2}(d_M(t) - o_M(t))^T(d_M(t)-o_M(t))$. This is a special case of the method described in [8] (there a fully recurrent net was employed). Of course, other error functions are possible. For instance, in the experiments described below, the confidence network predicted the absolute value of the difference between $M$'s (one-dimensional) output and the current target value.
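A minimal sketch of this gradient-descent scheme, using a plain linear model in place of a back-propagation net for brevity; the dimensionality, learning rate, and all names are illustrative assumptions.

import numpy as np

DIM_I = 4                                   # dim(i_M); illustrative
rng = np.random.default_rng(0)
w_C = rng.normal(scale=0.1, size=DIM_I)     # weights of a linear C
b_C = 0.0
LR = 0.01                                   # small learning rate

def train_C(i_M, o_M, d_M):
    # One gradient-descent step of C toward M's current error.
    global w_C, b_C
    i_M, o_M, d_M = map(np.asarray, (i_M, o_M, d_M))
    target = 0.5 * np.dot(d_M - o_M, d_M - o_M)  # M's error = C's target
    o_C = float(np.dot(w_C, i_M) + b_C)          # C's estimate of that error
    grad = o_C - target                          # gradient of 0.5*(o_C - target)^2
    w_C -= LR * grad * i_M
    b_C -= LR * grad
    return o_C

Over many visits to the same input, the squared-loss minimizer pulls $o_C$ toward the conditional expectation of $M$'s error given that input, which is exactly the quantity $d_C(t)$ defined above.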



4. Modelling local expected error. Let $d_C(t)$ be $dim(d_M(t)) $-dimensional. Let

\begin{displaymath}d^j_C(t)= E \{ (d^j_M(t) - o^j_M(t))^2 \mid i_M(t) \} \end{displaymath}

for all appropriate $j$. If $C$ is a back-propagation net, an approximation of $d_C(t)$ can be obtained by using gradient descent (with a small learning rate) to train $C$ at time $t$ to emit $M$'s local prediction errors

\begin{displaymath}\left( (d^1_M(t) - o^1_M(t))^2, \ldots,
(d^m_M(t) - o^m_M(t))^2 \right)^T, \end{displaymath}

where $m = dim(o_M(t)) $.
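The same scheme extends directly to the local case; below is a sketch with a vector-valued linear $C$ (again a stand-in for a back-propagation net, with illustrative names and dimensions) trained toward the vector of local squared errors.

import numpy as np

DIM_I, M_DIM = 4, 3                         # dim(i_M) and m = dim(o_M); illustrative
rng = np.random.default_rng(0)
W_C = rng.normal(scale=0.1, size=(M_DIM, DIM_I))

def train_C_local(i_M, o_M, d_M, lr=0.01):
    # One gradient step toward the target vector ((d^j_M - o^j_M)^2)_j.
    global W_C
    i_M, o_M, d_M = map(np.asarray, (i_M, o_M, d_M))
    target = (d_M - o_M) ** 2               # C's m-dimensional target
    o_C = W_C @ i_M                         # C's current estimate
    W_C -= lr * np.outer(o_C - target, i_M) # squared-loss gradient
    return o_C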

