
BASIC CONCEPTS

Each Turing machine (TM) $C$ (mapping bitstrings to bitstrings, without loss of generality) computes a partial function $ f_C: \{0,1\}^* \rightarrow \{0,1\}^*$ ($f_C$ is undefined where $C$ does not halt). The Kolmogorov complexity $K_U(s)$ of a finite string $s$ is the length of the shortest program $p$ that computes $s$ on a universal Turing machine $U$ and halts, where the set of possible halting programs forms a prefix code (no halting program can be the prefix of another one):

\begin{displaymath}
K_U(s) = \min_p\{\mid p \mid : ~f_U(p) = s \},
\end{displaymath}

where $\mid p \mid$ denotes the length of $p$. The invariance theorem (Solomonoff, 1964; Kolmogorov, 1965; Chaitin, 1969) states that $K_{U_1}(s) = K_{U_2}(s) + O(1)$ for two universal machines $U_1$ and $U_2$ and for all $s$. Therefore we may drop the index $U$ and write $K$ instead of $K_U$.
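
Although $K$ itself is incomputable, the definition is easy to illustrate on a small, non-universal machine. The following Python sketch is not part of the original text; the toy machine, its program encoding, and the function names are invented for illustration. Its programs are self-delimiting (hence prefix-free) strings of the form unary($r$) 0 unary($k$) 0 payload, decoded as ``repeat the $k$-bit payload $r$ times'', and the analogue of $K$ for this machine is computed exactly by brute-force enumeration:

\begin{verbatim}
from itertools import product

def run_toy_machine(p):
    # Decode a self-delimiting program p over {0,1}:
    #   unary(r) 0 unary(k) 0 payload(k bits)  ->  payload repeated r times.
    # Returns the output string, or None if p is not a valid program.
    i = 0
    while i < len(p) and p[i] == '1':   # unary repeat count r
        i += 1
    if i >= len(p):
        return None
    r, i = i, i + 1                     # skip the terminating 0
    j = i
    while j < len(p) and p[j] == '1':   # unary payload length k
        j += 1
    if j >= len(p):
        return None
    k, j = j - i, j + 1
    payload = p[j:j + k]
    if len(payload) != k or j + k != len(p):  # p must be consumed exactly
        return None
    return payload * r

def toy_K(s, max_len=16):
    # Brute-force instance of K(s) = min{ |p| : f(p) = s } for the toy machine.
    for n in range(1, max_len + 1):
        for bits in product('01', repeat=n):
            if run_toy_machine(''.join(bits)) == s:
                return n
    return None

print(toy_K('010101'))   # 9: the program 111 0 11 0 01 repeats '01' three times
\end{verbatim}

Only the machine is toy-like; the quantity being enumerated is exactly the minimum in the definition above, restricted to this machine.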

Machine learning and the prior problem. In machine learning applications, we are often concerned with the following problem: given training data $D$, we would like to select the most probable hypothesis $H$ generating the data. Bayes' formula yields

\begin{displaymath}
P(H \mid D) =
\frac{ P(D \mid H) P(H) }{ P(D)}.
\end{displaymath} (1)

We would like to select $H$ such that $P(H \mid D)$ is maximal. Taking the negative logarithm of this equation leads to the principle of minimum description length (Wallace, 1968; Rissanen, 1978). But where does the prior $P(H)$ come from? How does one define an a priori probability distribution on the set of possible hypotheses without introducing arbitrariness? This is often perceived as the prior problem of Bayesian approaches. The theory of algorithmic probability, however, provides a solution, as will be seen next.
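
To spell out this step: taking negative logarithms on both sides of equation (1) gives

\begin{displaymath}
-\log P(H \mid D) = -\log P(D \mid H) - \log P(H) + \log P(D) .
\end{displaymath}

Since $\log P(D)$ does not depend on $H$, maximizing $P(H \mid D)$ amounts to minimizing $-\log P(D \mid H) - \log P(H)$, that is, the number of bits needed to encode the data given the hypothesis plus the number of bits needed to encode the hypothesis itself.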

Universal prior (Levin, 1974; Solomonoff, 1964; Gacs, 1974; Chaitin, 1975). Define $P_U(s)$, the a priori probability of a bitstring $s$, as the probability of guessing a halting program that computes $s$ on universal TM $U$. Here, the way of guessing is defined by the following procedure: whenever the scanning head of $U$'s input tape (initially blank) shifts right to a field that has not been scanned before, do: With probability $\frac{1}{2}$ fill it with a 0; with probability $\frac{1}{2}$ fill it with a 1. We obtain

\begin{displaymath}
P_U(s) = \sum_{p: f_U(p) = s} (\frac{1}{2})^{\mid p \mid}.
\end{displaymath}

Under different universal priors (based on different universal machines), the probabilities of a given string differ by no more than a constant factor independent of the string, due to the invariance theorem. Therefore we may drop the index $U$ and write $P$ instead of $P_U$. This justifies the name ``universal prior'', also known as the Solomonoff-Levin distribution. Universal priors appear to be the only convincing method for assigning a priori probabilities to hypotheses (or other computable objects).
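
The guessing procedure just described can be simulated directly. The sketch below reuses the invented toy machine from the previous fragment, now reading its self-delimiting program from a stream of coin flips; the function names and the sample size are arbitrary choices for illustration. The empirical frequencies approximate the sum $\sum_{p: f(p) = s} (\frac{1}{2})^{\mid p \mid}$ for this machine; for a universal $U$, the same construction defines $P_U$.

\begin{verbatim}
import random
from collections import Counter

def decode_from_stream(next_bit, max_header=64):
    # Read one self-delimiting toy program (unary r, 0, unary k, 0, k payload
    # bits) from a stream of guessed bits; return its output string.
    def read_unary():
        n = 0
        while next_bit() == 1:
            n += 1
            if n > max_header:       # safeguard against absurdly long headers
                return None
        return n
    r = read_unary()
    k = read_unary() if r is not None else None
    if r is None or k is None:
        return None
    payload = ''.join(str(next_bit()) for _ in range(k))
    return payload * r

def estimate_prior(num_samples=200000, seed=0):
    # Guess each program bit with probability 1/2, exactly as in the
    # definition of the universal prior, and tally the resulting outputs.
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(num_samples):
        out = decode_from_stream(lambda: rng.randint(0, 1))
        if out is not None:
            counts[out] += 1
    return {s: c / num_samples for s, c in counts.items()}

est = estimate_prior()
print(est.get('0101', 0.0))   # roughly 2**-8 + 2**-11: two programs print '0101'
\end{verbatim}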

Dominance of shortest programs. It can be shown (Levin, 1974; Chaitin, 1975) that

\begin{displaymath}
P(s) = O(2^{-K(s)}).
\end{displaymath}

The probability of a string is dominated by the probabilities of its shortest programs. This justifies ``Occam's razor'': in equation (1), the a priori most probable $H$ are those with minimal Kolmogorov complexity.
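
One half of this statement is immediate from the definitions: the shortest program for $s$ alone contributes $2^{-K(s)}$ to the sum defining $P(s)$, so

\begin{displaymath}
P(s) = \sum_{p: f_U(p) = s} (\frac{1}{2})^{\mid p \mid} \geq 2^{-K(s)} .
\end{displaymath}

The cited result supplies the matching upper bound; hence $P(s)$ and $2^{-K(s)}$ agree up to a constant factor.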

Dominance of universal prior. Suppose there are infinitely many enumerable solution candidates (strings). Amazingly, $P_U$ dominates all discrete enumerable semimeasures $P$ (including probability distributions; see, e.g., Li and Vitanyi (1993) for details) in the following sense: for each $P$ there is a constant $c > 0$ such that $P_U(s) \geq c P(s)$ for all strings $s$.

Since Kolmogorov complexity is incomputable in general, so is the universal prior. A popular myth holds that this fact renders the concept of Kolmogorov complexity useless for practical machine learning. This is not so, however, as will be seen next, where we focus on a natural, computable, yet very general extension of Kolmogorov complexity.

Levin complexity. In what follows, a (not necessarily halting) program is a string on $U$'s input tape that can be scanned completely by $U$. Suppose $U$ scans all of program $q$ before it finishes printing $s$ onto the work tape, and let $t(q, s)$ denote the number of steps taken before $s$ is printed. Then

\begin{displaymath}
Kt(s) = \min_q\{\mid q \mid + \log t(q,s) \}.
\end{displaymath}

An invariance theorem similar to the one for $K$ holds for $Kt$ as well.
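
A hypothetical numerical example (the programs and step counts are invented purely for illustration, with logarithms taken to base 2): suppose $q_1$ is a 30-bit program that prints $s$ after $2^{10}$ steps, while $q_2$ is a 15-bit program that needs $2^{40}$ steps. Then

\begin{displaymath}
\mid q_1 \mid + \log t(q_1,s) = 30 + 10 = 40, ~~~~~
\mid q_2 \mid + \log t(q_2,s) = 15 + 40 = 55,
\end{displaymath}

so the longer but faster program determines $Kt(s) \leq 40$, whereas under $K$ only the program lengths would matter.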

Levin's universal optimal search algorithm (Levin, 1973; 1984). Suppose we are looking for the solution (a string) to a given problem. Levin's universal search algorithm generates and evaluates all strings (solution candidates) in order of their $Kt$ complexity, until a solution is found. This is essentially equivalent to enumerating all programs in order of decreasing ratio of a priori probability to runtime. Each program computes a string that is tested to see whether it is a solution to the given problem. If so, the search is stopped. Amazingly, for a broad class of problems, including inversion problems and time-limited optimization problems, universal search can be shown to be optimal with respect to total expected search time, leaving aside a constant factor independent of the problem size: if string $s$ can be computed within $t$ time steps by a program $p$, and the probability of guessing $p$ (as above) is $\bar{P}$, then within $O(\frac{t}{\bar{P}})$ time steps, systematic enumeration according to Levin will generate $p$, run it for $t$ time steps, and output $s$. In the experiments below, a probabilistic algorithm strongly inspired by universal search will be used.
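
The phase structure of this enumeration can be sketched as follows. The Python fragment below is illustrative only and not the implementation used in the experiments; the toy machine, its made-up step count (one step per output bit), and the example inversion problem are all invented. In phase $i$, every program $q$ with $\mid q \mid \leq i$ is run for at most $2^{i - \mid q \mid}$ steps, i.e. for a time budget proportional to its guessing probability $2^{- \mid q \mid}$.

\begin{verbatim}
from itertools import product

def run_toy_machine(program, max_steps):
    # Same toy decoder as before (unary r, 0, unary k, 0, payload -> payload*r),
    # now charging one made-up "step" per output bit and respecting max_steps.
    i = 0
    while i < len(program) and program[i] == '1':
        i += 1
    if i >= len(program):
        return None
    r, i = i, i + 1
    j = i
    while j < len(program) and program[j] == '1':
        j += 1
    if j >= len(program):
        return None
    k, j = j - i, j + 1
    payload = program[j:j + k]
    if len(payload) != k or j + k != len(program):
        return None
    out = payload * r
    return out if max(1, len(out)) <= max_steps else None

def levin_search(is_solution, max_phase=20):
    # Phase i: run every program of length <= i for at most 2**(i - length)
    # steps; stop as soon as some program outputs a solution.
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)
            for bits in product('01', repeat=length):
                q = ''.join(bits)
                out = run_toy_machine(q, budget)
                if out is not None and is_solution(out):
                    return q, out, phase
    return None

# A simple inversion problem: find a program whose output is the target string.
target = '01010101'
found = levin_search(lambda s: s == target)
print(found)   # the 10-bit program 1111 0 11 0 01 is found in phase 13
\end{verbatim}

Because the valid toy programs form a prefix code, the Kraft inequality bounds the time they consume in phase $i$ by $2^i \sum_p 2^{-\mid p \mid} \leq 2^i$; summing the geometrically growing phases up to $i \approx \mid p \mid + \log t$ recovers the $O(\frac{t}{\bar{P}})$ bound stated above.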

