So far we have talked about passive prediction, given the observations. Note, however, that agents interacting with an environment can also use predictions of the future to compute action sequences that maximize expected future reward. Hutter's recent AIXI model [22] (author's SNF grant 61847) does exactly this, by combining Solomonoff's $M$-based universal prediction scheme with an expectimax computation.
In cycle $t$, action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled from the (reactive) environmental probability distribution $\mu$. Sequential decision theory shows how to maximize the total expected reward, called value, if $\mu$ is known. Reinforcement learning [27] is used if $\mu$ is unknown. AIXI defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true environment $\mu$.
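The interplay of mixture and expectimax can be illustrated with a minimal Python sketch. It is a toy under strong simplifying assumptions, not the AIXI model itself: the class of environments is finite, each environment is memoryless (the reward depends only on the current action, not on the complete history), perceptions are omitted, rewards are binary, and the horizon is short. All names (`Env`, `mixture_prob`, `posterior`, `expectimax`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

# Hypothetical toy environment: maps an action to P(reward = 1 | action).
Env = Dict[int, float]


def likelihood(env: Env, action: int, reward: int) -> float:
    """nu(reward | action) for a single memoryless environment nu."""
    p1 = env[action]
    return p1 if reward == 1 else 1.0 - p1


def mixture_prob(envs: List[Env], weights: List[float],
                 action: int, reward: int) -> float:
    """xi(reward | action): weighted sum of the environments' predictions."""
    return sum(w * likelihood(env, action, reward)
               for env, w in zip(envs, weights))


def posterior(envs: List[Env], weights: List[float],
              action: int, reward: int) -> List[float]:
    """Bayes update of the mixture weights after observing (action, reward)."""
    joint = [w * likelihood(env, action, reward)
             for env, w in zip(envs, weights)]
    z = sum(joint)  # equals xi(reward | action); assumed > 0 here
    return [j / z for j in joint]


def expectimax(envs: List[Env], weights: List[float],
               horizon: int) -> Tuple[float, int]:
    """Return (expected total reward under xi, best first action).

    Alternates a max over actions with an expectation over rewards,
    updating the weights along each branch. Returns action -1 at horizon 0.
    """
    if horizon == 0:
        return 0.0, -1
    best_value, best_action = float("-inf"), -1
    for action in sorted(envs[0]):
        value = 0.0
        for reward in (0, 1):
            p = mixture_prob(envs, weights, action, reward)
            if p == 0.0:
                continue  # impossible branch under the mixture
            updated = posterior(envs, weights, action, reward)
            future, _ = expectimax(envs, updated, horizon - 1)
            value += p * (reward + future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action
```

With two mirrored environments such as `{0: 0.9, 1: 0.1}` and `{0: 0.1, 1: 0.9}` under equal weights, the first action carries no immediate advantage, but the observed reward shifts the weights and thereby improves the value of the second step. The recursion grows exponentially with the horizon, which is one reason the sketch is limited to tiny examples.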
It can be shown that the conditional $M$ probability of environmental inputs to an AIXI agent, given the agent's earlier inputs and actions, converges with increasing length of interaction to the true, unknown probability [22], as long as the latter is recursively computable, analogously to the passive prediction case.
Recent work [24] also demonstrated AIXI's optimality in the following sense. The Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is self-optimizing in the sense that the average value converges asymptotically for all $\mu \in \mathcal{M}$ to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^\mu$ which knows $\mu$ in advance. The necessary condition that $\mathcal{M}$ admits self-optimizing policies is also sufficient. No other structural assumptions are made on $\mathcal{M}$. Furthermore, $p^\xi$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in all environments $\nu \in \mathcal{M}$ and a strictly higher value in at least one [24].
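One way to write these two optimality statements in symbols is sketched below. The notation is an assumption made for illustration and is not taken from the text: $V_m^{p\nu}$ denotes the expected total reward of policy $p$ over the first $m$ cycles in environment $\nu$, $\mathcal{M}$ is the class of environments, $\mu \in \mathcal{M}$ is the true one, $p^\xi$ is the Bayes-optimal policy for the mixture $\xi$, and $p^\mu$ is the Bayes-optimal policy that knows $\mu$ in advance.

Self-optimizing:

$$\frac{1}{m} V_m^{p^\xi \mu} - \frac{1}{m} V_m^{p^\mu \mu} \;\to\; 0 \quad \text{as } m \to \infty, \qquad \text{for all } \mu \in \mathcal{M}.$$

Pareto-optimal: there is no policy $p$ such that

$$V_m^{p \nu} \ge V_m^{p^\xi \nu} \ \text{ for all } \nu \in \mathcal{M} \quad \text{and} \quad V_m^{p \nu} > V_m^{p^\xi \nu} \ \text{ for at least one } \nu \in \mathcal{M}.$$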
We can modify the AIXI model such that its predictions are based on the $\epsilon$-approximable Speed Prior $S$ instead of the incomputable $M$. Thus we obtain the so-called AIS model. Using Hutter's approach [22] we can now show that the conditional $S$ probability of environmental inputs to an AIS agent, given the earlier inputs and actions, converges to the true but unknown probability, as long as the latter is dominated by $S$, such as the above.