Optimal Rational Decision Makers

So far we have talked about passive prediction, given the
observations. Note, however, that agents interacting with an environment
can also use predictions of the future to compute action sequences
that maximize expected future reward. Hutter's recent *AIXI model* [22]
(author's SNF grant 61847) does exactly this, by combining Solomonoff's
$M$-based universal prediction scheme with an *expectimax* computation.

In cycle $t$ action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. Sequential decision theory shows how to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning [27] is used if $\mu$ is unknown. AIXI defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true environment $\mu$.
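The combination of a mixture distribution with an expectimax computation can be illustrated on a toy problem. The following is a minimal sketch under strong simplifying assumptions that are not part of the AIXI model itself: the class of environments is finite, each environment is memoryless and is described only by the probability of reward 1 for each action, the perception coincides with the binary reward, and the horizon is small. The names `predict`, `update`, and `expectimax` are hypothetical and chosen for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Assumption: a toy environment maps each action to P(reward = 1).
Env = Dict[int, float]


def predict(weights: List[float], envs: List[Env], action: int) -> float:
    """Mixture probability of reward 1 for the given action."""
    return sum(w * env[action] for w, env in zip(weights, envs))


def update(weights: List[float], envs: List[Env],
           action: int, reward: int) -> List[float]:
    """Bayes posterior over the class after observing (action, reward)."""
    likes = [env[action] if reward == 1 else 1.0 - env[action] for env in envs]
    joint = [w * l for w, l in zip(weights, likes)]
    total = sum(joint)
    return [j / total for j in joint]


def expectimax(weights: List[float], envs: List[Env],
               actions: List[int], horizon: int) -> Tuple[float, Optional[int]]:
    """Return (expected total reward, best first action) under the mixture."""
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for a in actions:
        p1 = predict(weights, envs, a)
        value = 0.0
        # Expectation over the two possible rewards, maximum over actions.
        for r, p in ((1, p1), (0, 1.0 - p1)):
            if p == 0.0:
                continue  # impossible outcome under the mixture, skip it
            future, _ = expectimax(update(weights, envs, a, r),
                                   envs, actions, horizon - 1)
            value += p * (r + future)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action


if __name__ == "__main__":
    toy_class = [{0: 0.9, 1: 0.1}, {0: 0.1, 1: 0.9}]
    print(expectimax([0.5, 0.5], toy_class, [0, 1], horizon=2))
```

The recursion enumerates every action and reward sequence, so its cost grows exponentially with the horizon; it is meant only to make the structure of the computation concrete.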

It can be shown that the conditional probability of environmental inputs to an AIXI agent, given the agent's earlier inputs and actions, converges with increasing length of interaction to the true, unknown probability [22], as long as the latter is recursively computable, analogously to the passive prediction case.
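A small simulation can make this kind of convergence tangible. The sketch below is only an illustration under assumptions that are far more restrictive than the result cited above: the class is a finite grid of memoryless environments that contains the true one, the prior weights are uniform, inputs are binary, and the agent simply alternates between two actions. The function name `mixture_predictions` and all parameter values are hypothetical.

```python
import itertools
import random
from typing import List, Sequence, Tuple

# Assumption: an environment is a pair (p0, p1), where p_a is the
# probability of input 1 after action a.
Env = Tuple[float, float]


def mixture_predictions(true_env: Env, env_class: Sequence[Env],
                        steps: int, seed: int = 0) -> List[float]:
    """Interact for `steps` cycles, then return the mixture's predicted
    probability of input 1 for action 0 and for action 1."""
    rng = random.Random(seed)
    weights = [1.0 / len(env_class)] * len(env_class)  # uniform prior
    for t in range(steps):
        action = t % 2  # fixed alternating policy, for illustration only
        percept = 1 if rng.random() < true_env[action] else 0
        likes = [env[action] if percept == 1 else 1.0 - env[action]
                 for env in env_class]
        joint = [w * l for w, l in zip(weights, likes)]
        total = sum(joint)
        weights = [j / total for j in joint]  # Bayes update of the weights
    return [sum(w * env[a] for w, env in zip(weights, env_class))
            for a in (0, 1)]


if __name__ == "__main__":
    grid = [0.2, 0.5, 0.8]
    env_class = list(itertools.product(grid, repeat=2))
    print(mixture_predictions((0.8, 0.2), env_class, steps=1000))
```

Before any interaction the predictions equal the prior average over the class; as the history grows, the posterior weight concentrates on the true environment and the predictions approach its probabilities.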

Recent work [24] also demonstrated
AIXI's optimality in the following sense. The
Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing
in the sense that the average value converges asymptotically for
all $\mu \in \mathcal{M}$
to the optimal value achieved by the (infeasible)
Bayes-optimal policy $p^\mu$ which knows $\mu$ in advance.
The necessary condition that $\mathcal{M}$ admits self-optimizing policies
is also sufficient. No other structural assumptions are made on $\mathcal{M}$.
Furthermore, $p^\xi$ is Pareto-optimal in the sense that there is no other policy
yielding higher or equal value in *all* environments $\nu \in \mathcal{M}$
and a strictly higher value in at least one [24].
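In symbols, the two optimality notions can be sketched as follows. The notation is an assumption made for illustration and is not taken from [24]: $V_m(p,\nu)$ stands for the expected total reward of policy $p$ over the first $m$ cycles in environment $\nu$, $p^\xi$ for the policy based on the mixture, $p^\mu$ for the policy that knows the true environment $\mu$ in advance, and $\mathcal{M}$ for the class of environments.

```latex
% Self-optimizing: the average value of the mixture-based policy
% approaches that of the informed policy in every environment of the class.
\frac{1}{m}\Bigl[\, V_m(p^\mu,\mu) - V_m(p^\xi,\mu) \Bigr]
  \;\longrightarrow\; 0
  \quad (m \to \infty)
  \qquad \text{for all } \mu \in \mathcal{M}.

% Pareto-optimal: no policy p is at least as good everywhere
% and strictly better somewhere.
\nexists\, p : \;
  \bigl( \forall \nu \in \mathcal{M} : \; V_m(p,\nu) \ge V_m(p^\xi,\nu) \bigr)
  \;\wedge\;
  \bigl( \exists \nu \in \mathcal{M} : \; V_m(p,\nu) > V_m(p^\xi,\nu) \bigr).
```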

We can modify the AIXI model such that its predictions are based on the
$\epsilon$-approximable Speed Prior $S$ instead of the incomputable $M$.
Thus we obtain
the so-called *AIS model.* Using Hutter's approach [22]
we can now show that the conditional
probability of environmental inputs to an AIS agent, given the earlier
inputs and actions, converges to the true but unknown probability,
as long as the latter is dominated by $S$, such as the one discussed above.