
2. MODEL BUILDING WITH RDIA

Our agent's task is to build a model of the transition probabilities $p_{ijk}$. The problem is studied in isolation from goal-directed reinforcement learning tasks: RDIA embodies a kind of ``unsupervised reinforcement learning''. It extends recent work on ``active exploration'' (e.g. [9,8,11]). Previous approaches (1) were limited to deterministic environments (they did not address the general problem of learning a model of the statistical properties of an NME), and (2) were based on ad hoc elements instead of building on concepts from information theory.

Collecting ML estimates. For each state/action pair (or experiment) $(S_i, a_j)$, the agent has a counter $c_{ij}$ whose value at time $t$, $c_{ij}(t)$, equals the number of the agent's previous experiences with $(S_i, a_j)$. In addition, for each state/action pair $(S_i, a_j)$, there are $n$ counters $c_{ijk}$, $k = 1, \ldots, n$. The value of $c_{ijk}$ at time $t$, $c_{ijk}(t)$, equals the number of the agent's previous experiences with $(S_i, a_j)$ for which the next state was $S_k$. Note that $c_{ij}(t) = \sum_k c_{ijk}(t)$. At time $t$, if $c_{ij}(t) > 0$, then

\begin{displaymath}
p^*_{ijk}(t) = \frac{c_{ijk}(t)}{c_{ij}(t)}
\end{displaymath}

denotes the agent's current unbiased estimate of $p_{ijk}$. If $c_{ij}(t) = 0$, then we define (somewhat arbitrarily) $p^*_{ijk}(t) = 0$. Note that, as a consequence, before the agent has conducted any experiments of the type $(S_i, a_j)$, the $p^*_{ijk}$ do not satisfy the requirements of a probability distribution. For $c_{ij}(t) > 0$, the $p^*_{ijk}(t)$ form a maximum likelihood model (MLM), consistent with the agent's previous experiences, of the probabilities of the possible next states.
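For concreteness, the counter bookkeeping and the ML estimates can be sketched as follows (a minimal Python illustration, not the original implementation; the class and method names are ours):

\begin{verbatim}
import numpy as np

class TransitionModel:
    """Maximum likelihood model (MLM) of the transition probabilities p_ijk."""

    def __init__(self, n_states, n_actions):
        # c[i, j, k] counts experiences with (S_i, a_j) whose next state was S_k.
        self.c = np.zeros((n_states, n_actions, n_states), dtype=np.int64)

    def update(self, i, j, k):
        """Record one experiment: action a_j executed in S_i led to S_k."""
        self.c[i, j, k] += 1

    def p_star(self, i, j):
        """ML estimates p*_ijk for all k (all zeros by convention if c_ij = 0)."""
        c_ij = self.c[i, j].sum()              # c_ij(t) = sum_k c_ijk(t)
        if c_ij == 0:
            return np.zeros(self.c.shape[2])   # before any (S_i, a_j) experiment
        return self.c[i, j] / c_ij
\end{verbatim}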

Measuring information gain. If the agent performs an experiment by executing action $a(t) = a_j$ in state $S(t) = S_i$, and the new state is $S(t+1) = S_k$, then in general $p^*_{ijk}(t)$ will differ from $p^*_{ijk}(t+1)$. By observing the outcome of the experiment, the agent has acquired a piece of information. To measure its progress, we compute the information-theoretic difference between what the agent knew before the experiment, at time $t$, and what it knew after the experiment, at time $t+1$. One natural way of doing this is to use the Kullback-Leibler distance (or asymmetric divergence) between the probability distributions represented by the $p^*_{ijk}(t)$ and the $p^*_{ijk}(t+1)$. We define

\begin{displaymath}
D(t) = \left| \sum_k d_k(t) \right| ,
\end{displaymath} (1)

where

\begin{displaymath}
d_k(t) = 0 \quad \mbox{if } p^*_{ijk}(t+1) = 0 \mbox{ or } p^*_{ijk}(t) = 0;
\end{displaymath}

\begin{displaymath}
d_k(t) = p^*_{ijk}(t+1) \, \ln \frac{p^*_{ijk}(t+1)}{p^*_{ijk}(t)} \quad \mbox{otherwise.}
\end{displaymath}

A related (but less informative) measure of progress is the entropy difference of the probability distributions represented by the $p^*_{ijk}(t)$ and $p^*_{ijk}(t+1)$,

\begin{displaymath}
D(t) = \left| \sum_k p^*_{ijk}(t+1) \, \ln p^*_{ijk}(t+1) - \sum_k p^*_{ijk}(t) \, \ln p^*_{ijk}(t) \right|
\end{displaymath} (2)

for $c_{ij}(t) > 0$. Again, if $c_{ij}(t) = 0$ (before the agent has conducted any experiments of type $(S_i, a_j)$), the entropy of the corresponding MLM is taken to be zero. In this case, $D(t)$ will be zero, too. Another simple, related performance measure is $D(t) = \sum_k \mid p^*_{ijk}(t+1) - p^*_{ijk}(t) \mid$ for $c_{ij}(t) > 0$, and $D(t) = 0$ for $c_{ij}(t) = 0$. Initial experiments seem to indicate that the particular choice among these definitions of $D(t)$ does not make an essential difference.
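All three candidate measures can be computed directly from the estimates before and after an experiment. The following sketch (ours, assuming vectors p_old $= (p^*_{ijk}(t))_k$ and p_new $= (p^*_{ijk}(t+1))_k$ as returned by the model above) implements equations (1) and (2) as well as the absolute-difference measure:

\begin{verbatim}
import numpy as np

def kl_gain(p_old, p_new):
    """Equation (1): |sum_k d_k(t)|, with d_k = 0 whenever either estimate is 0."""
    mask = (p_old > 0) & (p_new > 0)
    d = p_new[mask] * np.log(p_new[mask] / p_old[mask])
    return abs(d.sum())

def entropy_gain(p_old, p_new):
    """Equation (2): absolute difference of sum_k p ln p before and after."""
    def neg_entropy(p):
        q = p[p > 0]                   # 0 ln 0 is taken to be 0
        return (q * np.log(q)).sum()   # zero for an all-zero estimate (c_ij = 0)
    return abs(neg_entropy(p_new) - neg_entropy(p_old))

def abs_gain(p_old, p_new):
    """The simple related measure: sum_k |p*_ijk(t+1) - p*_ijk(t)|."""
    return np.abs(p_new - p_old).sum()
\end{verbatim}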

In all cases, best policies are found by using $D(t)$ as the reinforcement $R(t)$ for the Q-learning algorithm from section 1. Since an experiment at time $t$ affects only $n$ estimates (the $n$ values $p^*_{ijk}(t+1)$ associated with $a_j = a(t)$ and $S_i = S(t)$), and since $D(t)$ can always be computed within $O(n)$ operations, the algorithm's overall complexity per time step is bounded by $O(n)$.
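For illustration, one RDIA time step could then be organized as follows. This is a sketch reusing the TransitionModel and kl_gain code above; the one-step Q-learning update, the epsilon-greedy action choice, and the parameter values are generic assumptions rather than the exact algorithm and settings of the paper, and env_step stands for whatever interface the environment provides.

\begin{verbatim}
import numpy as np

def rdia_step(model, Q, i, env_step, rng, alpha=0.1, gamma=0.9, eps=0.1):
    """One time step: choose a_j in S_i, observe S_k, use D(t) as reward R(t).

    model    -- TransitionModel from the sketch above
    Q        -- (n_states, n_actions) array of Q-values
    env_step -- callable (i, j) -> k, the environment's state transition
    """
    n_actions = Q.shape[1]
    j = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[i]))

    p_old = model.p_star(i, j)      # estimates before the experiment
    k = env_step(i, j)              # execute a_j in S_i, observe S_k
    model.update(i, j, k)
    p_new = model.p_star(i, j)      # estimates after the experiment

    R = kl_gain(p_old, p_new)       # curiosity reward D(t), computable in O(n)
    Q[i, j] += alpha * (R + gamma * Q[k].max() - Q[i, j])
    return k
\end{verbatim}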

