Next: Speed Prior Up: The Speed Prior: A Previous: The Speed Prior: A

Introduction

How to predict the future from the past? To get a grip on this fundamental question, let us first introduce some notation. denotes the set of finite sequences over the binary alphabet $B=\{0,1\}$ , $B^{\infty}$ the set of infinite sequences over , $\lambda$ the empty string, $B^{\sharp} = B^* \cup B^{\infty}$ . stand for strings in $B^{\sharp}$ . If $x \in B^*$ then is the concatenation of and (e.g., if and then ). For $x \in B^*$ , denotes the number of bits in , where $l(x)= \infty$ for $x \in B^{\infty}$ ; $l(\lambda) = 0$ . $x_{n}$ is the prefix of consisting of the first bits, if $l(x) \geq n$ , and otherwise ( $x_0 := \lambda$ ). $\log$ denotes the logarithm with basis 2, functions mapping integers to integers. We write if there exist positive constants such that $f(n) \leq cg(n)$ for all . For simplicity let us consider universal Turing Machines (TMs) with input alphabet and trinary output alphabet including the symbols ``0'', ``1'', and `` '' (blank). For efficiency reasons, the TMs should have several work tapes to avoid potential quadratic slowdowns associated with 1-tape TMs. The remainder of this paper refers assumes a fixed universal reference TM.

Now suppose bitstring represents the data observed so far. What is its most likely continuation $y \in B^{\sharp}$ ? Bayes' theorem yields

$\begin{displaymath} P(xy \mid x) = \frac{P(x \mid xy) P(xy)} {P(x)} \propto P(xy) \end{displaymath}$

(1)

where $P(z^2 \mid z^1)$ is the probability of

, given knowledge of

, and $P(x) = \int_{z \in B^{\sharp}} P(xz) dz$ is just a normalizing factor. So the most likely continuation

is determined by

, the prior probability of

. But which prior measure

is plausible? Occam's razor suggests that the ``simplest''

should be more probable. But which exactly is the ``correct'' definition of simplicity?

The next section will offer an alternative to the celebrated but noncomputable algorithmic simplicity measure or Solomonoff-Levin measure [24,29,25]. But let us first review Solomonoff's traditional approach.

Roughly fourty years ago Solomonoff started the theory of universal optimal induction based on the apparently harmless simplicity assumption that is computable [24]. While Equation (1) makes predictions of the entire future, given the past, Solomonoff [25] focuses just on the next bit in a sequence. Although this provokes surprisingly nontrivial problems associated with translating the bitwise approach to alphabets other than the binary one -- only recently Hutter managed to do this [8] -- it is sufficient for obtaining essential insights. Given an observed bitstring , Solomonoff assumes the data are drawn according to a recursive measure $\mu$ ; that is, there is a program for a universal Turing machine that reads $x \in B^*$ and computes $\mu(x)$ and halts. He estimates the probability of the next bit (assuming there will be one), using the remarkable, well-studied, enumerable prior [24,29,25,6,15]

$\begin{displaymath} M(x) = \sum_{program~prefix~p~computes \atop output~starting~with~x} 2^{-l(p)}. \end{displaymath}$

(2)

is universal, dominating the less general recursive measures as follows: For all $x \in B^*$ ,

$\begin{displaymath} M(x) \geq c_{\mu} \mu (x) \end{displaymath}$

(3)

where $c_{\mu}$ is a constant depending on $\mu$ but not on

. Solomonoff observed that the conditional

-probability of a particular continuation, given previous observations, converges towards the unknown conditional $\mu$ as the observation size goes to infinity [25], and that the sum over all observation sizes of the corresponding $\mu$ -expected deviations is actually bounded by a constant. Hutter eventually showed that the number of prediction errors made by universal Solomonoff prediction is essentially bounded by the number of errors made by any other predictor, including the optimal scheme based on the true $\mu$ [8]. He also derived general loss bounds and showed that the expected loss of the universal scheme does not exceed by much the loss of the optimal scheme [9]. Recent research also generalized Solomonoff's approach to the case of much less restrictive universal nonenumerable priors that are computable in the limit [22,23]. One might say that Solomonoff's restriction of recursiveness leads to a ``slightly more computable'' approach than the more general case.

However, while is enumerable, it is not recursive, and thus practically infeasible. This drawback inspired less general yet practically more feasible principles of minimum description length (MDL) [27,17] as well as priors derived from time-bounded restrictions [15] of Kolmogorov complexity [12,24,4]. No particular instance of these approaches, however, is universally accepted or has a general convincing motivation that carries beyond rather specialized application scenarios. For instance, typical efficient MDL approaches require the specification of a class of computable models of the data, say, certain types of neural networks, plus some computable loss function expressing the coding costs of the data relative to the model. This provokes numerous ad-hoc choices.

The novel approach pursued here agrees that Solomonoff's assumption of recursive priors without any time and space limitations is too weak, and that we somehow should specify additional resource constraints on the data-generating process to obtain a convincing basis for feasible inductive inference. But which constraints are plausible? Which reflect the ``natural'' concept of simplicity? Previous resource-oriented priors derived from, say, polynomial time bounds [15] have no obvious and plausible a priori justification.

Therefore we will suggest a novel, natural prior reflecting data-independent, optimally efficient use of computational resources. Based on this prior, Section 3 will derive a near-optimal computable strategy for making predictions, given past observations.

Next: Speed Prior Up: The Speed Prior: A Previous: The Speed Prior: A

Juergen Schmidhuber 2003-02-25

Back to Speed Prior page