

More Formally

What is the optimal way of predicting the future, given the past? What is the best way to act so as to maximize one's future expected reward? What is the best way of searching for the solution to a novel problem, making optimal use of solutions to earlier problems?

Most previous work on these old and fundamental questions has focused on very limited settings, such as Markovian environments where the optimal next action, given past inputs, depends on the current input only [27].

We will concentrate on a much weaker and therefore much more general assumption, namely, that the environment's responses are sampled from a computable probability distribution. If even this weak assumption were not true, then we could not even formally specify the environment, let alone write reasonable scientific papers about it.

Let us first introduce some notation. $B^*$ denotes the set of finite sequences over the binary alphabet $B=\{0,1\}$, $B^{\infty}$ the set of infinite sequences over $B$, $\lambda$ the empty string, and $B^{\sharp} = B^* \cup B^{\infty}$. $x,y,z,z^1,z^2$ stand for strings in $B^{\sharp}$. If $x \in B^*$ then $xy$ is the concatenation of $x$ and $y$ (e.g., if $x=10000$ and $y=1111$ then $xy = 100001111$). For $x \in B^*$, $l(x)$ denotes the number of bits in $x$, where $l(x)= \infty$ for $x \in B^{\infty}$; $l(\lambda) = 0$. $x_{n}$ is the prefix of $x$ consisting of the first $n$ bits if $l(x) \geq n$, and $x$ otherwise ($x_0 := \lambda$). $\log$ denotes the logarithm with base 2; $f,g$ denote functions mapping integers to integers. We write $f(n)=O(g(n))$ if there exist positive constants $c,n_0$ such that $f(n) \leq cg(n)$ for all $n > n_0$.

For simplicity let us consider universal Turing Machines [64] (TMs) with input alphabet $B$ and trinary output alphabet including the symbols ``0'', ``1'', and `` '' (blank). For efficiency reasons, the TMs should have several work tapes to avoid the potential quadratic slowdowns associated with 1-tape TMs. The remainder of this paper assumes a fixed universal reference TM.
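
To make the notation concrete, here is a minimal Python sketch (not part of the original text) of the string operations just defined, restricted to finite bitstrings in $B^*$; the helper names l and prefix are illustrative only.

\begin{verbatim}
# Minimal sketch (not from the original text): the string operations above,
# restricted to finite bitstrings in B*.

def l(x: str) -> int:
    """Length l(x) in bits; l(lambda) = 0 for the empty string."""
    return len(x)

def prefix(x: str, n: int) -> str:
    """x_n: the first n bits of x if l(x) >= n, otherwise x itself."""
    return x[:n] if l(x) >= n else x

x, y = "10000", "1111"
assert x + y == "100001111"   # concatenation xy
assert l("") == 0             # l(lambda) = 0
assert prefix(x, 3) == "100"  # x_3, the first 3 bits
assert prefix(x, 9) == x      # l(x) < 9, so x_9 = x
\end{verbatim}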

Now suppose bitstring $x$ represents the data observed so far. What is its most likely continuation $y \in B^{\sharp}$? Bayes' theorem yields

\begin{displaymath}
P(xy \mid x) = \frac{P(x \mid xy) \; P(xy)}{P(x)} \propto P(xy)
\end{displaymath} (1)

where $P(z^2 \mid z^1)$ is the probability of $z^2$, given knowledge of $z^1$, and $P(x) = \int_{z \in B^{\sharp}} P(xz) dz$ is just a normalizing factor. The proportionality in (1) holds because knowledge of $xy$ implies knowledge of its prefix $x$, so $P(x \mid xy)=1$, while $P(x)$ does not depend on $y$. So the most likely continuation $y$ is determined by $P(xy)$, the prior probability of $xy$. But which prior measure $P$ is plausible? Occam's razor suggests that the ``simplest'' $y$ should be more probable. But what exactly is the ``correct'' definition of simplicity? Sections 3 and 4 will measure the simplicity of a description by its length; Section 5 will measure it by the time required to compute the described object.
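
As an illustration of equation (1), the following minimal Python sketch (not the paper's method) uses a toy computable prior, a Bernoulli($\theta$) measure on bitstrings rather than the universal prior introduced in the following sections, to compute $P(xy \mid x) = P(xy)/P(x)$ for a few candidate continuations; the function names prior and predict are assumptions made for this example.

\begin{verbatim}
# Minimal sketch (not the paper's method): equation (1) under a toy
# computable prior -- a Bernoulli(theta) measure on bitstrings, where
# prior(z) is the probability that the source's output starts with z.
# Then P(xy | x) = P(xy) / P(x), since P(x | xy) = 1.

def prior(z: str, theta: float = 0.7) -> float:
    """P(z) for the Bernoulli(theta) measure (assumed for illustration)."""
    ones = z.count("1")
    return theta ** ones * (1.0 - theta) ** (len(z) - ones)

def predict(x: str, candidates, theta: float = 0.7):
    """P(xy | x) for each candidate continuation y, via equation (1)."""
    return {y: prior(x + y, theta) / prior(x, theta) for y in candidates}

x = "1101"                       # observed data
print(predict(x, ["11", "00"]))  # "11" is the more likely continuation
\end{verbatim}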

