Optimal Rational Decision Makers

So far we have talked about passive prediction, given the observations. Note, however, that agents interacting with an environment can also use predictions of the future to compute action sequences that maximize expected future reward. Hutter's recent AIXI model [22] (author's SNF grant 61847) does exactly this, by combining Solomonoff's $M$-based universal prediction scheme with an expectimax computation.
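
To make the expectimax part concrete, here is a minimal, self-contained sketch (not AIXI itself): finite-horizon expectimax planning against a single known, hand-made toy environment, alternating maximization over actions with expectation over percepts while summing rewards. All probabilities, rewards and names below are illustrative assumptions; AIXI instead plugs in the universal mixture over environments introduced next.

# Toy finite-horizon expectimax (illustration only, not AIXI): alternate
# maximization over actions with expectation over percepts drawn from a known
# toy environment model, summing rewards along the way.

ACTIONS = [0, 1]
PERCEPTS = [0, 1]

def env_prob(percept, history, action):
    # Made-up toy environment: the percept tends to echo the chosen action.
    return 0.8 if percept == action else 0.2

def reward(percept, action):
    # Made-up reward: +1 whenever the percept matches the action.
    return 1.0 if percept == action else 0.0

def expectimax_value(history, horizon):
    """Maximal expected total reward achievable within `horizon` cycles."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for a in ACTIONS:
        expected = 0.0
        for x in PERCEPTS:
            p = env_prob(x, history, a)
            expected += p * (reward(x, a)
                             + expectimax_value(history + [(a, x)], horizon - 1))
        best = max(best, expected)
    return best

if __name__ == "__main__":
    print("optimal expected reward over 3 cycles:", expectimax_value([], 3))

The nested max/sum recursion is the expectimax computation; what makes AIXI universal (and incomputable) is replacing the known env_prob by Solomonoff-style mixture predictions.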

In cycle $t$, action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. Sequential decision theory shows how to maximize the total expected reward, called value, if $\mu$ is known; reinforcement learning [27] is used if $\mu$ is unknown. AIXI defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \cal M$, where $\cal M$ is any class of distributions that includes the true environment $\mu$.
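
Written out in this notation (a paraphrase; the finite horizon $m$ and the exact conditioning are simplifying assumptions made here, not taken from [22]), the mixture is

$$\xi(x_{1:t} \mid y_{1:t}) \;=\; \sum_{\nu \in \cal M} w_\nu \, \nu(x_{1:t} \mid y_{1:t}), \qquad w_\nu > 0, \quad \sum_{\nu \in \cal M} w_\nu \le 1,$$

and in each cycle the AIXI agent selects the action that maximizes the $\xi$-expected reward sum up to horizon $m$:

$$y_t \;=\; \arg\max_{y_t} \sum_{x_t} \max_{y_{t+1}} \sum_{x_{t+1}} \cdots \max_{y_m} \sum_{x_m} \left( r_t + \cdots + r_m \right) \, \xi(x_{t:m} \mid x_{<t},\, y_{1:m}).$$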

It can be shown that the conditional $M$ probability of environmental inputs to an AIXI agent, given the agent's earlier inputs and actions, converges with increasing length of interaction to the true, unknown probability [22], as long as the latter is recursively computable, analogously to the passive prediction case.
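
Spelled out (with $\xi$ denoting the predictive mixture defined above; this display is a paraphrase of [22], not a quote): for histories generated by the agent's actions and the true environment $\mu$,

$$\xi(x_t \mid x_{<t},\, y_{1:t}) \;\longrightarrow\; \mu(x_t \mid x_{<t},\, y_{1:t}) \qquad (t \to \infty)$$

with $\mu$-probability 1, provided $\mu$ is recursively computable and hence contained in a suitable class $\cal M$.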

Recent work [24] also demonstrated AIXI's optimality in the following sense: the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing, meaning that its average value converges asymptotically, for all $\mu \in \cal M$, to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$, which knows $\mu$ in advance. The necessary condition that $\cal M$ admits self-optimizing policies at all is also sufficient; no other structural assumptions on $\cal M$ are needed. Furthermore, $p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments $\nu \in \cal M$ and a strictly higher value in at least one [24].
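
In value notation (roughly following [24]; the precise indices are assumptions made here), write $V^{p}_{1m}(\mu)$ for the $\mu$-expected reward sum obtained by policy $p$ during the first $m$ cycles. Self-optimization then reads

$$\frac{1}{m} V^{p^\xi}_{1m}(\mu) \;\longrightarrow\; \frac{1}{m} V^{p^\mu}_{1m}(\mu) \qquad (m \to \infty) \quad \mbox{for all } \mu \in \cal M,$$

and Pareto-optimality says there is no policy $p$ with $V^{p}_{1m}(\nu) \ge V^{p^\xi}_{1m}(\nu)$ for all $\nu \in \cal M$ and all $m$, with strict inequality for at least one $\nu$.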

We can modify the AIXI model such that its predictions are based on the $\epsilon$-approximable Speed Prior $S$ instead of the incomputable $M$. Thus we obtain the so-called AIS model. Using Hutter's approach [22] we can now show that the conditional $S$ probability of environmental inputs to an AIS agent, given the earlier inputs and actions, converges to the true but unknown probability, as long as the latter is dominated by $S$, such as the $S'$ above.
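
As a loose illustration of the resource-aware flavor of $S$ (this is neither the Speed Prior nor the AIS model, just an analogy coded up): Bayes-mixture prediction over a small hand-coded class of binary-sequence models, where each model's prior weight is discounted both by a made-up description length and by a made-up compute cost, so that simple and fast models dominate. All model names, lengths and costs are assumptions for illustration.

# Toy illustration only (not the actual Speed Prior): mixture prediction over a
# small hand-coded model class, with weights penalizing description length and
# accumulated (pretend) compute cost, so fast simple models dominate.

def model_all_ones(history):            # "the sequence is all ones"
    return 0.99

def model_alternating(history):         # "the sequence alternates 1,0,1,0,..."
    return 0.99 if len(history) % 2 == 0 else 0.01

def model_fair_coin(history):           # "unbiased noise"
    return 0.5

MODELS = [
    # (name, predictor P(next=1|history), made-up length in bits, made-up cost/step)
    ("all_ones",    model_all_ones,    2.0, 1.0),
    ("alternating", model_alternating, 3.0, 1.0),
    ("fair_coin",   model_fair_coin,   1.0, 1.0),
]

def prior_weight(length_bits, cost_per_step, steps):
    # Crude Speed-Prior-flavored weight: 2^{-length}, further discounted by the
    # pretend compute time spent on the history so far.
    return 2.0 ** (-length_bits) / (1.0 + cost_per_step * steps)

def mixture_predict(history):
    """P(next bit = 1 | history) under the weighted mixture, with each weight
    multiplied by the likelihood the model assigned to the observed history."""
    posterior = []
    for _, predict, length_bits, cost in MODELS:
        w = prior_weight(length_bits, cost, len(history))
        likelihood = 1.0
        for i, bit in enumerate(history):
            p1 = predict(history[:i])
            likelihood *= p1 if bit == 1 else (1.0 - p1)
        posterior.append((w * likelihood, predict))
    total = sum(p for p, _ in posterior)
    return sum(p * predict(history) for p, predict in posterior) / total

if __name__ == "__main__":
    history = [1, 0, 1, 0, 1, 0]        # data favoring the "alternating" model
    print("P(next bit = 1) =", round(mixture_predict(history), 3))

The actual Speed Prior instead derives its weights from program length and computation time within an enumeration of all programs, and the AIS agent feeds the resulting $S$-based predictions into the same expectimax computation described above.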

