
BASIC CONCEPTS RELEVANT TO LEARNING

This section briefly reviews a few basic concepts from the theories of algorithmic probability and Kolmogorov complexity (a.k.a. ``algorithmic complexity''). Selected references and a very brief and incomplete history of the subject can be found at the end of the section.

Algorithmic complexity theory provides several different but closely related measures of the complexity (or simplicity) of objects. We will focus on the one that appears to be the most useful for machine learning. Informally speaking, the complexity of a computable object is the length of the shortest program that computes it and halts, where the set of possible programs forms a prefix code.

Machine. To make this more precise, consider a Turing machine (TM) with 3 tapes: the program tape, the work tape, and the output tape. All three are finite but may grow indefinitely. For simplicity, but without loss of generality, let us focus on binary tape alphabets $\{0,1\}$. Initially, the work tape and the output tape consist of a single square filled with a zero. The program tape consists of finitely many squares, each filled with a zero or a one. Each tape has a scanning head. Initially, the scanning head of each tape points to its first square. The program tape is ``read only,'' the output tape is ``write only,'' and their scanning heads may be shifted only to the right (one square at a time). The work tape is ``read/write'' and its scanning head may move in both directions. Whenever the scanning head of the work tape or the output tape shifts beyond the current tape boundary, an additional square is appended and filled with a zero. The case of the program tape's scanning head shifting beyond the program tape boundary will be considered later. For the moment we assume that this never happens. The TM has a finite number of internal states (one of them being the initial state). Its behavior is specified by a function $F$ (implemented as a look-up table). $F$ maps the current state and the contents of the square above the scanning head of the work tape to a new state and an action. There are 8 actions: shift the work tape's head left/right, write 1/0 on the work tape, write 1/0 on the output tape and shift its scanning head right, copy the contents of the square above the program tape's scanning head onto the square above the work tape's scanning head and shift the program tape's scanning head right, and halt.
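
For concreteness, here is one possible way to simulate such a machine (an illustrative sketch, not part of the formal definition above); the transition table $F$, the action names, the initial state 0, and the step bound are assumptions chosen for the example.

\begin{verbatim}
# Illustrative sketch of the 3-tape machine described above.
# F maps (state, symbol under the work-tape head) -> (new_state, action),
# where action is one of the 8 actions listed in the text.

def run(F, program, max_steps=10000):
    work = [0]        # work tape: initially a single square filled with 0
    w = 0             # work-tape head (may move left and right)
    prog_head = 0     # program-tape head (read only, moves right only)
    out = []          # bits written so far onto the write-only output tape
    state = 0         # assume state 0 is the initial state
    for _ in range(max_steps):
        state, action = F[(state, work[w])]
        if action == 'halt':
            # p is a program for this machine iff all |p| bits were read
            return out, prog_head == len(program)
        if action == 'left':
            w = max(w - 1, 0)
        elif action == 'right':
            w += 1
            if w == len(work):
                work.append(0)            # append a square filled with 0
        elif action in ('write0', 'write1'):
            work[w] = int(action[-1])     # write 0/1 on the work tape
        elif action in ('out0', 'out1'):
            out.append(int(action[-1]))   # write 0/1, shift output head right
        elif action == 'read':
            work[w] = program[prog_head]  # copy the current program bit
            prog_head += 1                # shift the program head right
    return out, False                     # did not halt within max_steps
\end{verbatim}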

Self-delimiting programs. Let $\mid s \mid $ denote the number of bits in the bitstring $s$. Consider a non-empty bitstring $p$ written onto the program tape such that the scanning head points to the first bit of $p$. $p$ is a program for some TM $T$, iff $T$ reads all $\mid p \mid$ bits and halts. In other words, during the (eventually terminating) computation process the head of $T$'s program tape incrementally moves from its start position to the end of $p$, but not any further. One may say that $p$ carries in itself the information about when to stop, and about its own length. Obviously, no program can be the prefix of another one.
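
As a small illustration (a sketch, not part of the original formulation), the prefix property can be checked directly for any finite set of candidate programs; by the Kraft inequality, the lengths of a prefix-free set also satisfy $\sum_p 2^{-\mid p \mid} \leq 1$, a fact used for the universal prior below.

\begin{verbatim}
def is_prefix_free(programs):
    """True iff no bitstring in `programs` is a proper prefix of another."""
    return not any(a != b and b.startswith(a)
                   for a in programs for b in programs)

def kraft_sum(programs):
    """Sum of 2^(-|p|); at most 1 for a prefix-free set (Kraft inequality)."""
    return sum(2.0 ** -len(p) for p in programs)

print(is_prefix_free(['0', '10', '110']), kraft_sum(['0', '10', '110']))
# True 0.875
print(is_prefix_free(['0', '01']))   # False: '0' is a prefix of '01'
\end{verbatim}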

Compiler theorem. Each TM $C$, mapping bitstrings (written onto the program tape) to outputs (written onto the output tape), computes a partial function $ f_C: \{0,1\}^* \rightarrow \{0,1\}^*$ ($f_C$ is undefined if $C$ does not halt). It is well known that there is a universal TM $U$ with the following property: for every TM $C$ there exists a constant prefix $\mu_C$ such that $f_C(p)=f_U(\mu_C p)$ for all bitstrings $p$. The prefix $\mu_C$ is the compiler that compiles programs for $C$ into equivalent programs for $U$.

In what follows, let $p$ denote a (self-delimiting) program.

Kolmogorov complexity. Given $U$, the Kolmogorov complexity (a.k.a. ``algorithmic complexity,'' ``algorithmic information,'' or occasionally ``Kolmogorov-Chaitin complexity'') of an arbitrary bitstring $s$ is denoted as $K_U(s)$ and is defined as the length of a shortest program producing $s$ on $U$:

\begin{displaymath}
K_U(s) = \min_p\{\mid p \mid : ~f_U(p) = s \}.
\end{displaymath}

$K_U(s)$ is noncomputable; otherwise the halting problem could be solved. However, by comparing the number of possible programs with fewer than $n$ bits (less than $2^n$) and the number of possible bitstrings with at least $n$ bits (far more than $2^n$), one observes: most strings $s$ are complex (or ``random,'' or ``incompressible'') in the sense that they cannot be computed by a program much shorter than $s$.

Invariance theorem. Due to the compiler theorem, $K_{U_1}(s) = K_{U_2}(s) + O(1)$ for two universal machines $U_1$ and $U_2$. Therefore we may choose one particular universal machine $U$ and henceforth write $K(s) = K_U(s)$.

Machine learning, MDL, and the prior problem. In machine learning applications, we are often concerned with the following problem: given training data $D$, we would like to select the most probable hypothesis $H$ generating the data. Bayes' formula yields

\begin{displaymath}
P(H \mid D) =
\frac{ P(D \mid H) P(H) }{ P(D)}.
\end{displaymath} (1)

We would like to select $H$ such that $P(H \mid D)$ is maximal. This is equivalent to minimizing
\begin{displaymath}
-log~P(H \mid D) = -log~P(D \mid H) - log~P(H) + log~P(D).
\end{displaymath} (2)

Let us interpret these equations. Since $D$ is given, $P(D)$ may be viewed as a normalizing constant. $P(D\vert H)$ can usually be measured or at least approximated. According to classical information theory, $-log~P(D\vert H)$ is the ``optimal'' (minimal, most efficient) code length or description length for $D$, given $H$. Likewise, $-log~P(H)$ is the minimal code length for $H$. This leads to the minimum description length (MDL) principle: the best hypothesis for explaining the data is the one that minimizes the sum of the description length of the hypothesis and the description length of the data when encoded by the hypothesis. But where does the prior $P(H)$ come from? How does one define an a priori probability distribution on the set of possible hypotheses without introducing arbitrariness? This is often perceived as the prior problem of Bayesian approaches. The theory of algorithmic probability, however, provides a solution.
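
A minimal sketch of this selection rule follows; the candidate hypotheses (biased-coin models), their assumed priors, and the toy data are purely illustrative and not taken from the text.

\begin{verbatim}
import math

def likelihood(theta, data):
    """P(D | H) for a hypothetical Bernoulli(theta) model of coin flips."""
    ones = sum(data)
    return theta ** ones * (1.0 - theta) ** (len(data) - ones)

data = [1, 1, 0, 1, 1, 1, 0, 1]                 # illustrative training data D
priors = {0.25: 0.5, 0.5: 0.25, 0.75: 0.25}     # assumed prior P(H) per theta

def description_length(theta):
    # -log P(D|H) - log P(H): code length of the data given the hypothesis
    # plus code length of the hypothesis (eq. (2) without the constant log P(D))
    return -math.log2(likelihood(theta, data)) - math.log2(priors[theta])

best = min(priors, key=description_length)
print(best)   # the MDL (equivalently MAP) choice among the three candidates
\end{verbatim}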

Universal prior. Define $P_U(s)$, the a priori probability of a bitstring $s$, as the probability of guessing a (halting) program that computes $s$ on $U$. Here, the way of guessing is defined by the following procedure: initially, the program tape consists of a single square, filled with a 0 or a 1, each with probability $\frac{1}{2}$. Whenever the scanning head of the program tape shifts to the right, do: (1) Append a new square. (2) With probability $\frac{1}{2}$ fill it with a 0; with probability $\frac{1}{2}$ fill it with a 1. We obtain

\begin{displaymath}
P_U(s) = \sum_{p: f_U(p) = s} (\frac{1}{2})^{\mid p \mid}.
\end{displaymath}

Clearly, the sum of all probabilities of all halting programs cannot exceed 1 (no halting program can be the prefix of another one). But certain program tape contents may lead to non-halting computations.
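The guessing procedure suggests a crude Monte Carlo approximation, sketched below under stated assumptions: run_program stands for a hypothetical interpreter that requests fresh program bits on demand and returns the computed string, or None if it has not halted within max_steps steps. Because of the step bound, such an estimate is only a lower bound on $P_U(s)$.

\begin{verbatim}
import random

def coin_flips():
    """Endless fair coin flips: one fresh random bit per new program square."""
    while True:
        yield random.randint(0, 1)

def estimate_prior(s, run_program, trials=100000, max_steps=1000):
    """Crude Monte Carlo lower-bound estimate of P_U(s): count how often a
    randomly guessed (self-delimiting) program halts with output s."""
    hits = 0
    for _ in range(trials):
        if run_program(coin_flips(), max_steps) == s:
            hits += 1
    return hits / trials
\end{verbatim}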

Under different universal priors (based on different universal machines), probabilities of a given string differ by not more than a constant factor independent of the string size, due to the invariance theorem (the constant factor corresponds to the probability of guessing a compiler). Therefore we may drop the index $U$ and write $P$ instead of $P_U$. This justifies the name ``universal prior,'' also known as the Solomonoff-Levin distribution.

Algorithmic entropy. $-log P(s)$ is denoted by $H(s)$, the algorithmic entropy of $s$.

Dominance of shortest programs. It can be shown (the proof is non-trivial) that

\begin{displaymath}
K(s) = H(s) + O(1).
\end{displaymath} (3)

Since $H(s)= -log P(s)$, this implies

\begin{displaymath}
P(s) = (\frac{1}{2})^{K(s) - O(1)} =
2^{O(1)} 2^{-K(s)} =
O(2^{-K(s)}).
\end{displaymath}

The probability of guessing any of the programs computing some string and the probability of guessing its shortest program are essentially equal. The probability of a string is dominated by the probabilities of its shortest programs.

Dominance of universal prior. Suppose there are infinitely many enumerable solution candidates (strings). Remarkably, $P_U$ dominates all discrete enumerable semimeasures $P$ (including probability distributions) in the following sense: for each $P$ there is a constant $c > 0$ such that $P_U(s) \geq cP(s)$ for all strings $s$.

Inductive inference and Occam's razor. Occam's razor prefers solutions whose minimal descriptions are short over solutions whose minimal descriptions are longer. The ``modern'' prefix-based version of Solomonoff's theory of inductive inference justifies Occam's razor in the following way. Suppose the problem is to extrapolate a sequence of symbols (bits, without loss of generality). We have already observed a bitstring $s$ and would like to predict the next bit. Let $si$ denote the event ``$s$ is followed by symbol $i$'' for $i \in \{0,1\}$. Bayes tells us

\begin{displaymath}
P(s0 \mid s) = \frac{P(s \mid s0)P(s0)}{P(s)} = \frac{P(s0)}{P(s)},~~
P(s1 \mid s) = \frac{P(s1)}{P(s)}.
\end{displaymath}

We are going to predict ``the next bit will be 0'' if $P(s0) > P(s1)$, and vice versa. Since $P(si) = O((\frac{1}{2})^{K(si)})$ for $i \in \{0,1\}$, the continuation with lower Kolmogorov complexity will (in general) be the more likely one.
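
Although $K$ itself is incomputable, the prediction rule can be mimicked very crudely by letting an off-the-shelf compressor stand in for shortest-program length; the sketch below with zlib is only a rough illustration of the idea, not part of the theory above.

\begin{verbatim}
import zlib

def proxy_complexity(bits):
    """Compressed length in bits: a crude, computable stand-in for K."""
    return 8 * len(zlib.compress(bits.encode()))

def predict_next_bit(s):
    # Prefer the continuation si with the smaller complexity proxy,
    # mirroring P(si) = O((1/2)^{K(si)}).
    return min('01', key=lambda i: proxy_complexity(s + i))

print(predict_next_bit('01' * 50))  # guesses the continuation that compresses better
\end{verbatim}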

Since Kolmogorov complexity is incomputable in general, so is the universal prior. A popular myth states that this fact renders the concepts of Kolmogorov complexity useless as far as practical machine learning is concerned. But this is not so, as will be seen in what follows, where we focus on a natural, computable, yet very general extension of Kolmogorov complexity.

Levin complexity. Let us slightly extend the notion of a program. In what follows, a program is a string on the program tape which can be scanned completely by $U$. The same program may lead to different results, depending on the runtime. For a given computable string, consider the negative logarithm of the probability of guessing a program that computes it, plus the logarithm of the corresponding runtime. The Levin complexity of the string is the minimal possible value of this sum. More formally, let $U$ scan a program $q$ (a program in the extended sense) written onto the program tape before it finishes printing $s$ onto the output tape. Let $t(q, s)$ be the number of steps taken before $s$ is printed. Then

\begin{displaymath}
Kt(s) = \min_q\{\mid q \mid + log~t(q,s) \}.
\end{displaymath}

An invariance theorem similar to the one for $K$ holds for $Kt$ as well.

Universal search. Suppose we have a problem whose solution can be represented as a (bit)string. Levin's universal optimal search algorithm essentially generates and evaluates all strings (solution candidates) in order of their $Kt$ complexity, until a solution is found. This is essentially equivalent to enumerating all programs in order of decreasing probability divided by runtime. Each program computes a string that is tested to see whether it is a solution to the given problem. If so, the search is stopped. To get some intuition on universal search, let $P(s \mid p)$ denote the probability of computing a solution $s$ from a halting program $p$ on the program tape. For each $p$ there is a given runtime $t(p)$. Now consider the following gambling scenario. You bet on the success of some $p$. The bid is $t(p)$. To minimize your expected losses, you are going to bet on the $p$ with maximal $\frac{P(s \mid p)}{t(p)}$. If you fail to hit a solution, you may bet again, but not on a $p$ you already bet on. Continuing this procedure, you are going to list halting programs $p$ in order of decreasing $\frac{P(s \mid p)}{t(p)}$. Of course, neither $P(s \mid p)$ nor $t(p)$ is usually known in advance. Still, for a broad class of problems, including inversion problems and time-limited optimization problems, universal search can be shown to be optimal with respect to total expected search time, up to a constant factor independent of the problem size. If string $s$ can be computed within $t$ time steps by a program $p$, and the probability of guessing $p$ (as above) is $\bar{P}$, then within $O(\frac{t}{\bar{P}})$ time steps, systematic enumeration according to Levin will generate $p$, run it for $t$ time steps, and output $s$. In the experiments below, a probabilistic algorithm strongly inspired by universal search will be used.
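
A schematic version of the enumeration is sketched below, assuming a hypothetical interpreter run_program(p, max_steps) that returns the computed string or None if the program does not halt within the budget, and a problem-specific test is_solution. In phase $i$, each program $p$ with $\mid p \mid \leq i$ receives a time budget of $2^{i - \mid p \mid}$ steps, which realizes the ordering by (guessing probability)/(runtime) up to a constant factor.

\begin{verbatim}
from itertools import product

def levin_search(is_solution, run_program, max_phase=20):
    """Schematic Levin search: in phase i, each candidate program p of
    length <= i is run for at most 2**(i - len(p)) steps, so candidates
    are effectively tried in order of decreasing 2**(-|p|) / runtime."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)
            for bits in product('01', repeat=length):
                p = ''.join(bits)
                output = run_program(p, max_steps=budget)
                if output is not None and is_solution(output):
                    return p, output
    return None   # no solution found up to phase max_phase
\end{verbatim}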

Conditional algorithmic complexity. The complexity measures above are actually simplifications of slightly more general measures. Suppose the universal machine $U$ starts the computation process with a non-empty string $x$ already written on its work tape. Conditional Kolmogorov complexity $K(s \mid x)$ is defined as

\begin{displaymath}
K(s \mid x) = \min_p\{\mid p \mid : ~U~computes~s~from~p,~given~x~on~the~work~tape \}.
\end{displaymath}

$Kt(s \mid x)$ is defined analogously. Variants of conditional complexity are used to determine how much information some string conveys about another one. In the context of machine learning, conditional complexity measures how much additional information has to be acquired, given what is already known.


