Next: 3. SIMULATIONS OF RDIA
Up: REINFORCEMENT DRIVEN INFORMATION ACQUISITION
Previous: 1. INTRODUCTION
Our agent's task is to build a model of the transition
probabilities $p_{ij}(a)$: the probability that the environment
makes a transition from state $S_i$ to state $S_j$ in response to action $a$.
The problem is studied in isolation from goal-directed
reinforcement learning tasks:
RDIA embodies a kind of ``unsupervised
reinforcement learning''.
It extends recent previous work on ``active exploration''
(e.g. [9,8,11]).
Previous approaches (1) were limited
to deterministic environments (they did not address
the general problem of learning a
model of the statistical properties of an NME),
and (2) were based
on ad hoc elements instead of building
on concepts from information theory.
Collecting ML estimates.
For each state/action pair (or experiment) $(S_i, a)$, the
agent has a counter whose value at time $t$,
$c_i^a(t)$, equals the number of the agent's previous experiences
with $(S_i, a)$.
In addition, for each state/action pair $(S_i, a)$, there
are counters $c_{ij}^a$, $j \in \{1, \ldots, n\}$ (one for each
possible next state). The value
of $c_{ij}^a$ at time $t$,
$c_{ij}^a(t)$, equals the number of the agent's previous experiences
with $(S_i, a)$, where the next state was $S_j$.
Note that
$\sum_{j=1}^{n} c_{ij}^a(t) = c_i^a(t)$.
At time $t$, if $c_i^a(t) > 0$, then
$\hat p_{ij}(a, t) = c_{ij}^a(t) / c_i^a(t)$
denotes the agent's current unbiased estimate of $p_{ij}(a)$.
If $c_i^a(t) = 0$, then we define (somewhat arbitrarily)
$\hat p_{ij}(a, t) = 0$. Note that, as a consequence, before the agent
has conducted any experiments of the type $(S_i, a)$,
the $\hat p_{ij}(a, t)$ do not satisfy the requirements of
a probability distribution. For $c_i^a(t) > 0$,
the $\hat p_{ij}(a, t)$
build a maximum likelihood model (MLM, consistent with the previous
experiences of the agent)
of the probabilities of the possible next states.
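The counter scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the class and variable names (`TransitionModel`, `n_states`, `n_actions`) are illustrative choices.

```python
import numpy as np

class TransitionModel:
    """Counter-based maximum likelihood model (MLM) of transition probabilities."""

    def __init__(self, n_states, n_actions):
        # c[i, a, j]: number of experiences with (S_i, a) whose next state was S_j
        self.c = np.zeros((n_states, n_actions, n_states))

    def count(self, i, a):
        # c_i^a(t): total number of previous experiences with the pair (S_i, a)
        return self.c[i, a].sum()

    def update(self, i, a, j):
        # Record one experiment: executing action a in state S_i led to S_j.
        self.c[i, a, j] += 1

    def p_hat(self, i, a):
        # ML estimates \hat p_{ij}(a, t); defined as all zeros while c_i^a(t) = 0,
        # in which case they do not form a probability distribution.
        total = self.count(i, a)
        if total == 0:
            return np.zeros(self.c.shape[2])
        return self.c[i, a] / total
```

Note that `p_hat` returns the estimates for all possible next states at once, which matches the observation that one experiment changes all estimates associated with the pair $(S_i, a)$.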
Measuring information gain.
If the agent performs an experiment at time $t$ by executing
action $a$ in state $S_i$, and the new state
is $S_j$,
then in general the $\hat p_{ik}(a, t+1)$ will be different from
the $\hat p_{ik}(a, t)$.
By observing the outcome of the experiment, the agent has
acquired a piece of information.
To measure its progress,
we compute the
information theoretic
difference between what the agent knew before
the experiment, at time $t$, and what the agent knew after
the experiment, at time $t+1$.
One natural way of doing this is to use the
Kullback-Leibler distance (or asymmetric divergence)
between the probability distributions represented
by the $\hat p_{ik}(a, t+1)$ and
the $\hat p_{ik}(a, t)$.
We define

$$ I(i, a, t) = \sum_{k=1}^{n} \hat p_{ik}(a, t+1) \log \frac{\hat p_{ik}(a, t+1)}{\hat p_{ik}(a, t)} \qquad (1) $$

where terms with $\hat p_{ik}(a, t+1) = 0$ are taken to be zero.
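A sketch of the Kullback-Leibler information gain between the new and old estimates for a pair $(S_i, a)$ might look as follows. The function name is illustrative; the convention that zero-probability terms contribute nothing, and the skipping of terms whose old estimate is zero (to keep the gain finite before the model is well defined), are assumptions of this sketch.

```python
import numpy as np

def information_gain(p_new, p_old):
    """Kullback-Leibler distance between new and old ML estimates (equation (1))."""
    gain = 0.0
    for pn, po in zip(p_new, p_old):
        # Terms with pn == 0 are taken to be zero; terms with po == 0 are
        # skipped here so that the gain stays finite (a sketch convention).
        if pn > 0 and po > 0:
            gain += pn * np.log(pn / po)
    return gain
```

For identical distributions the gain is zero; any change in the estimates after an experiment yields a nonzero value that can serve as intrinsic reinforcement.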
A related (but less informative)
measure of progress is the entropy difference of
the probability distributions represented
by the $\hat p_{ik}(a, t+1)$ and
the $\hat p_{ik}(a, t)$,

$$ E(i, a, t) = \left| H\left(\hat p_{i}(a, t+1)\right) - H\left(\hat p_{i}(a, t)\right) \right| \qquad (2) $$

for $c_i^a(t) > 0$, where
$H(\hat p_{i}(a, t)) = -\sum_{k=1}^{n} \hat p_{ik}(a, t) \log \hat p_{ik}(a, t)$.
Again, if $c_i^a(t) = 0$ (before the agent
has conducted any experiments of type $(S_i, a)$),
the entropy of the corresponding MLM
is taken to be zero.
In this case, $E(i, a, t)$ will be zero, too.
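The entropy-difference measure can be sketched as follows; the function names and the $0 \log 0 = 0$ convention are illustrative assumptions of this sketch.

```python
import numpy as np

def entropy(p):
    """Entropy of an MLM; 0 log 0 is taken as 0, so an all-zero
    estimate (c_i^a(t) = 0) yields entropy zero, as in the text."""
    return -sum(x * np.log(x) for x in p if x > 0)

def entropy_difference(p_new, p_old):
    """Entropy difference of equation (2)."""
    return abs(entropy(p_new) - entropy(p_old))
```

This illustrates why the measure is less informative than the Kullback-Leibler distance: after the very first experiment of type $(S_i, a)$, the new estimate is deterministic, both entropies are zero, and the measure reports no progress at all.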
Another simple, related performance measure is
the absolute change of the estimates,
$\sum_{k} | \hat p_{ik}(a, t+1) - \hat p_{ik}(a, t) |$,
for $c_i^a(t) > 0$, and zero for $c_i^a(t) = 0$.
Initial experiments seem to indicate that the particular definition
of the information gain measure does not make an essential difference.
In all cases, best policies are found by
using the information gain (e.g., $I(i, a, t)$) as the reinforcement
for the Q-Learning algorithm from section 2.
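One step of this scheme might be sketched as below, assuming a standard tabular Q-Learning backup (the learning rate, discount factor, and names are illustrative; the exact variant from section 2 is not shown here).

```python
import numpy as np

def q_update(Q, i, a, j, r, alpha=0.1, gamma=0.9):
    """One tabular Q-Learning backup using the information gain r
    of the experiment (S_i, a) -> S_j as the reinforcement."""
    Q[i, a] += alpha * (r + gamma * Q[j].max() - Q[i, a])
    return Q
```

In use, the agent would first update its counters for $(S_i, a)$, compute the information gain between the new and old estimates, and then pass that gain as `r` to the backup above.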
Since an experiment at time $t$ affects only
$n$ estimates (the $\hat p_{ik}(a, t)$
associated with
$S_i$ and $a$),
and since $I(i, a, t)$ can always be computed within
$O(n)$ operations,
the algorithm's overall complexity per time step is bounded by
$O(n)$.
Juergen Schmidhuber
2003-02-28