So far we have talked about passive prediction, given the
observations. Note, however, that agents interacting with an environment
can also use predictions of the future to compute action sequences
that maximize expected future reward. Hutter's recent AIXI model
[22]
(author's SNF grant 61847)
does exactly this, by combining Solomonoff's universal prediction scheme
with an expectimax computation.
In cycle $k$, action $y_k$ results in perception $x_k$ and reward $r_k$,
where all quantities may depend on the complete history. The perception
$x_k$ and reward $r_k$ are sampled from the (reactive) environmental
probability distribution $\mu$.
Sequential decision theory shows
how to maximize the total expected reward, called value,
if $\mu$ is known. Reinforcement learning [27] is used if $\mu$ is
unknown. AIXI defines a mixture distribution $\xi$
as a weighted sum of distributions $\nu \in \mathcal{M}$, where
$\mathcal{M}$ is any class of distributions including the true
environment $\mu$.
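In symbols, and suppressing the dependence on the action sequence for readability (the positive weights $w_\nu$ and their exact normalization follow [22]; this is a schematic restatement rather than a verbatim quote):
\[
\xi(x_{1:k}) \;=\; \sum_{\nu \in \mathcal{M}} w_\nu \, \nu(x_{1:k}),
\qquad w_\nu > 0 .
\]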
It can be shown that the conditional probability of
environmental inputs to an AIXI agent, given the agent's earlier
inputs and actions,
converges, with increasing length of interaction, to the true, unknown
probability [22],
as long as the latter is recursively computable, analogously
to the passive prediction case.
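Schematically, writing $yx_{<k}$ for the history of actions and perceptions before cycle $k$ (notation only loosely following [22]), the claim is
\[
\xi(x_k \mid yx_{<k} y_k) \;\to\; \mu(x_k \mid yx_{<k} y_k)
\quad \text{as } k \to \infty ,
\]
with $\mu$-probability 1, provided $\mu$ is recursively computable.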
Recent work [24] also demonstrated
AIXI's optimality in the following sense. The
Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing
in the sense that the average value converges asymptotically for
all $\mu \in \mathcal{M}$ to the optimal value achieved by the (infeasible)
Bayes-optimal policy $p^\mu$, which knows $\mu$ in advance.
The necessary condition that $\mathcal{M}$ admits self-optimizing policies
is also sufficient. No other structural assumptions are made on
$\mathcal{M}$.
Furthermore, $p^\xi$ is Pareto-optimal in the sense that there is no other
policy yielding higher or equal value in all environments
$\nu \in \mathcal{M}$ and a strictly higher value in at least one [24].
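In schematic form, writing $V_{k\mu}^{p}$ for the expected reward (value) accumulated by policy $p$ during the first $k$ cycles in environment $\mu$ (again only loosely modeled on the notation of [22,24]), self-optimization means
\[
\frac{1}{k} V_{k\mu}^{p^\xi} \;\to\; \frac{1}{k} V_{k\mu}^{p^\mu}
\quad (k \to \infty)
\qquad \text{for all } \mu \in \mathcal{M} .
\]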
We can modify the AIXI model such that its predictions are based on the
approximable Speed Prior $S$ instead of the incomputable $\xi$.
Thus we obtain
the so-called AIS model. Using Hutter's approach [22]
we can now show that the conditional
probability of environmental inputs to an AIS agent, given the earlier
inputs and actions, converges to the true but unknown probability,
as long as the latter is dominated by $S$, such as the distribution
discussed above.
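Here domination is meant in the usual multiplicative sense: $S$ dominates a distribution $\mu$ if there is a constant $c_\mu > 0$ such that
\[
S(x) \;\ge\; c_\mu \, \mu(x)
\]
for all finite sequences $x$.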