Relations to Previous Work

Despite (or maybe because of) the ambitiousness and potential power of self-improving machines, there has been little work in this vein outside our own labs at IDSIA and TU Munich. Here we will list essential differences between the Gödel machine and previous approaches to `learning to learn,' `metalearning,' self-improvement, self-optimization, etc.

The most closely related approaches are Hutter's HSEARCH and AIXI*(t,l)* (Item 3 below). For historical reasons, however, we will first discuss Levin's *Universal Search* (Item 1) and Hutter's AIXI (Item 2).

**1. Gödel Machine vs Universal Search.** Unlike the fully self-referential Gödel machine, Levin's *Universal Search* [24,26] has a hardwired, unmodifiable meta-algorithm that cannot improve itself. It is asymptotically optimal for inversion problems whose solutions can be quickly verified in $O(n)$ time (where $n$ is the solution size), but it will always suffer from the same huge constant slowdown factors (typically $\gg 10^{1000}$) buried in the $O()$-notation. The self-improvements of a Gödel machine, however, can be more than merely $O()$-optimal, since its utility function may formalize a stronger type of optimality that does not ignore huge constants just because they are constant--compare the utility function of equation (1).

Furthermore, the Gödel machine is applicable to general lifelong reinforcement learning (RL) tasks [20] where *Universal Search* is *not* asymptotically optimal, and not even applicable, since in RL the evaluation of some behavior's value in principle consumes the learner's entire life: the naive test of whether a program is good or not would use up the whole lifetime, so we could test only a single program, and afterwards life would be over. Therefore, to achieve their objective, general RL machines must do things that *Universal Search* does not do, such as predicting *future* tasks and rewards. This partly motivates Hutter's universal RL machine AIXI, to be discussed next.
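The phase-wise time allocation behind *Universal Search* can be illustrated with a small sketch (the toy program space and all names below are our illustrative assumptions, not part of Levin's formulation): in phase $T = 2, 4, 8, \ldots$, program $i$ receives on the order of $T \cdot 2^{-(i+1)}$ steps, so any program that solves the problem eventually gets enough time, at the cost of the huge constant factors discussed above.

```python
import itertools

def universal_search(make_program, verify):
    """Toy sketch of Levin's Universal Search: in phase T = 2**p,
    (restarted) program i gets a budget of T // 2**(i+1) steps."""
    for phase in itertools.count(1):
        T = 2 ** phase
        for i in range(phase):                 # programs with nonzero budget
            budget = T // 2 ** (i + 1)
            gen = make_program(i)              # restart program i from scratch
            result = None
            try:
                for _ in range(budget):
                    result = next(gen)         # one simulated step per next()
            except StopIteration:
                pass                           # program i halted early
            if verify(result):
                return result

# Hypothetical program space: program i "computes" for i steps,
# then outputs the candidate divisor i + 2.
def make_program(i):
    def gen():
        for _ in range(i):
            yield None
        yield i + 2
    return gen()

divisor = universal_search(make_program,
                           lambda c: c is not None and 91 % c == 0)
# divisor == 7: program 5 succeeds once its budget covers its 6 steps
```

Note that solutions are cheap to *verify* (one division) even though finding them takes time, which is exactly the inversion-problem setting where the scheme is asymptotically optimal.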

**2. Gödel Machine vs AIXI.** Unlike Gödel machines, Hutter's recent *AIXI model* [16,19] generally needs *unlimited* computational resources per input update. It combines Solomonoff's universal prediction scheme [52,53] with an *expectimax* computation. In discrete cycle $k = 1, 2, 3, \ldots$, action $y(k)$ results in perception $x(k)$ and reward $r(k)$, both sampled from the unknown (reactive) environmental probability distribution $\mu$. AIXI defines a mixture distribution $\xi$ as a weighted sum of distributions $\nu \in \mathcal{M}$, where $\mathcal{M}$ is any class of distributions that includes the true environment $\mu$. For example, $\mathcal{M}$ may be a sum of all computable distributions [52,53], where the sum of the weights does not exceed 1. In cycle $k+1$, AIXI selects as next action the first in an action sequence maximizing $\xi$-predicted reward up to some given horizon. Recent work [18] demonstrated AIXI's optimal use of observations as follows. The Bayes-optimal policy $p^{\xi}$ based on the mixture $\xi$ is self-optimizing in the sense that its average utility value converges asymptotically for all $\mu \in \mathcal{M}$ to the optimal value achieved by the (infeasible) Bayes-optimal policy $p^{\mu}$ which knows $\mu$ in advance. The necessary condition that $\mathcal{M}$ admits self-optimizing policies is also sufficient. Furthermore, $p^{\xi}$ is Pareto-optimal in the sense that there is no other policy yielding higher or equal value in *all* environments $\nu \in \mathcal{M}$ and a strictly higher value in at least one [18].

While AIXI clarifies certain theoretical limits of machine learning, it is computationally intractable, especially when $\mathcal{M}$ includes all computable distributions. This drawback motivated work on the time-bounded, asymptotically optimal AIXI*(t,l)* system [16] and the related HSEARCH [17], both to be discussed next.
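The expectimax step over a mixture can be sketched for a tiny finite class $\mathcal{M}$ (the class `ENVS`, its two members, and their deterministic action-to-reward maps below are our simplifying stand-ins for the distributions $\nu \in \mathcal{M}$, not AIXI's actual construction):

```python
from itertools import product

# Toy stand-in for the class M: two deterministic "environments",
# each mapping an action directly to a reward.
ENVS = {
    "mu_a": lambda a: 1.0 if a == "left" else 0.0,
    "mu_b": lambda a: 1.0 if a == "right" else 0.2,
}

def expectimax_action(weights, actions, horizon):
    """Return the first action of the action sequence maximizing
    the mixture-(xi-)predicted cumulative reward up to `horizon`."""
    best_seq, best_val = None, float("-inf")
    for seq in product(actions, repeat=horizon):
        # xi-expected cumulative reward: weighted sum over environments
        val = sum(w * sum(ENVS[name](a) for a in seq)
                  for name, w in weights.items())
        if val > best_val:
            best_seq, best_val = seq, val
    return best_seq[0]

act = expectimax_action({"mu_a": 0.5, "mu_b": 0.5}, ["left", "right"], 2)
# act == "left": ("left", "left") has xi-expected reward 1.2,
# beating 1.1 for the mixed sequences and 1.0 for ("right", "right")
```

Even in this toy form the cost is visible: the loop is exponential in the horizon, and the real $\mathcal{M}$ of all computable distributions makes the computation intractable, as noted above.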

**3. Gödel Machine vs HSEARCH and AIXI *(t,l)*.** Now we come to the most closely related previous work, so we will go to some extra length to point out the main novelties of the Gödel machine.

Hutter's non-self-referential but still $O()$-optimal *`fastest' algorithm for all well-defined problems* HSEARCH [17] uses a *hardwired* brute force proof searcher and ignores the costs of proof search. Assume discrete input/output domains $X/Y$, a formal problem specification $f: X \rightarrow Y$ (say, a functional description of how integers are decomposed into their prime factors), and a particular $x \in X$ (say, an integer to be factorized). HSEARCH orders all proofs of an appropriate axiomatic system by size to find programs $q$ that for all $z \in X$ provably compute $f(z)$ within time bound $t_q(z)$. Simultaneously it spends most of its time on executing the $q$ with the best currently proven time bound $t_q(x)$. It turns out that HSEARCH is as fast as the *fastest* algorithm that provably computes $f(z)$ for all $z \in X$, save for a constant factor smaller than $1 + \epsilon$ (arbitrary $\epsilon > 0$) and an $f$-specific but $x$-independent additive constant [17]. This constant may be enormous though.

Hutter's AIXI*(t,l)* [16] is related. In discrete cycle $k = 1, 2, 3, \ldots$ of AIXI*(t,l)*'s lifetime, action $y(k)$ results in perception $x(k)$ and reward $r(k)$, where all quantities may depend on the complete history. Using a universal computer such as a Turing machine, AIXI*(t,l)* needs an initial offline setup phase (prior to interaction with the environment) where it uses a *hardwired* brute force proof searcher to examine all proofs of length at most $L$, filtering out those that identify programs (of maximal size $l$ and maximal runtime $t$ per cycle) which not only could interact with the environment but which for all possible interaction histories also correctly predict a lower bound of their own expected future reward. In cycle $k$, AIXI*(t,l)* then runs all programs identified in the setup phase (at most $2^l$), finds the one with highest self-rating, and executes its corresponding action. The problem-independent setup time (where almost all of the work is done) is $O(L \cdot 2^L)$. The online time per cycle is $O(t \cdot 2^l)$. Both are constant but typically huge.

**Advantages and Novelty of the Gödel Machine.** There are major differences between the Gödel machine and Hutter's HSEARCH [17] and AIXI*(t,l)* [16], including:

- The theorem provers of HSEARCH and AIXI*(t,l)* are hardwired, non-self-referential, unmodifiable meta-algorithms that cannot improve themselves. That is, they will always suffer from the same huge constant slowdowns (typically $\gg 10^{1000}$) buried in the $O()$-notation. But there is nothing in principle that prevents the truly self-referential code of a Gödel machine from proving and exploiting drastic reductions of such constants, in the best possible way that provably constitutes an improvement, if there is any.

- The demonstration of the $O()$-optimality of HSEARCH and AIXI*(t,l)* depends on a clever allocation of computation time to some of their unmodifiable meta-algorithms. Our Global Optimality Theorem (Theorem 4.1, Section 4), however, is justified through a quite different type of reasoning which indeed exploits and crucially depends on the fact that there is no unmodifiable software at all: the proof searcher itself is readable, modifiable, and can be improved. This is also why its self-improvements can be more than merely $O()$-optimal.

- HSEARCH uses a ``trick'' of proving more than is necessary, which also disappears in the sometimes quite misleading $O()$-notation: it wastes time on finding programs that provably compute $f(z)$ for *all* $z \in X$ even when the current $f(x)$ is the only object of interest. A Gödel machine, however, needs to prove only what is relevant to its goal formalized by its utility function $u$. For example, the general $u$ of eq. (1) completely ignores the limited concept of $O()$-optimality, but instead formalizes a stronger type of optimality that does not ignore huge constants just because they are constant.

- Both the Gödel machine and AIXI*(t,l)* can maximize expected reward (HSEARCH cannot). But the Gödel machine is more flexible, as we may plug in *any* type of formalizable utility function (e.g., *worst case* reward), and unlike AIXI*(t,l)* it does not require an enumerable environmental distribution.

Nevertheless, we may use AIXI*(t,l)* or HSEARCH or other less general methods to initialize the substring $e$ of $p$ which is responsible for interaction with the environment. The Gödel machine will replace $e(p)$ as soon as it finds a provably better strategy.

It is the *self-referential* aspects of the Gödel machine that relieve us of much of the burden of careful algorithm design required for AIXI*(t,l)* and HSEARCH. They make the Gödel machine both conceptually simpler *and* more general.
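HSEARCH's allocation strategy described above--spending a small fixed share of time enumerating proofs, and the rest executing the program with the best currently proven bound--can be sketched as follows (a toy model under our own assumptions: the scripted `discoveries` list stands in for real proof search, and the every-10th-step schedule is an arbitrary choice):

```python
def hsearch_toy(x, discoveries, total_steps):
    """Toy model of HSEARCH's time sharing: every 10th step advances
    the (here: scripted) proof search; all other steps execute the
    program q with the smallest proven time bound t_q found so far."""
    best = None          # (proven time bound, program)
    progress = {}        # steps already invested per program
    for step in range(total_steps):
        if step % 10 == 0 and discoveries:        # small share: proof search
            q, bound = discoveries.pop(0)
            if best is None or bound < best[0]:
                best = (bound, q)                 # switch to faster program
        elif best is not None:                    # large share: run best q
            bound, q = best
            progress[q] = progress.get(q, 0) + 1
            if progress[q] >= bound:              # q halts within its bound
                return q(x)
    return None

slow = lambda z: z * z          # proven bound: 50 steps
fast = lambda z: z * z          # proven bound: 5 steps, discovered later
result = hsearch_toy(6, [(slow, 50), (fast, 5)], 1000)
# result == 36: the search switches to `fast` at step 10 and
# finishes it long before `slow` would have completed
```

The sketch shows why the slowdown relative to the fastest provable algorithm is only a constant factor: execution of the current best program dominates, while proof search consumes a bounded fraction of the time.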

**4. Gödel Machine vs OOPS.** The Optimal Ordered Problem Solver OOPS [47,44] (used by BIOPS in Section 5.2) extends *Universal Search* (Item 1). It is a bias-optimal (see Def. 5.1) way of searching for a program that solves each problem in an ordered sequence of problems of a rather general type, continually organizing, managing, and reusing earlier acquired knowledge. Solomonoff recently also proposed related ideas for a *scientist's assistant* [54] that modifies the probability distribution of *Universal Search* [24] based on experience.

Like *Universal Search* (Item 1), OOPS is not directly applicable to RL problems. A provably optimal RL machine must somehow *prove* properties of otherwise un-testable behaviors (such as: what is the expected reward of this behavior, which one cannot naively test because there is not enough time). That is part of what the Gödel machine does: it tries to greatly cut testing time, replacing naive time-consuming tests by much faster proofs of predictable test outcomes whenever this is possible.

Proof verification itself can be performed very quickly. In particular, verifying the correctness of a found proof typically does not consume the remaining life. Hence the Gödel machine may use OOPS as a bias-optimal proof-searching submodule (Section 5.2). Since the proofs themselves may concern quite different, *arbitrary* notions of optimality (not just bias-optimality), the Gödel machine is more general than plain OOPS. But it is not just an extension of OOPS: instead of OOPS it may as well use non-bias-optimal alternative methods to initialize its proof searcher. On the other hand, OOPS is not just a precursor of the Gödel machine. It is a stand-alone, incremental, bias-optimal way of allocating runtime to programs that reuse previously successful programs, and is applicable to many traditional problems, including but not limited to proof search.
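The bias-optimal allocation idea can be sketched as follows (a minimal illustration under our own assumptions: `solves` is a hypothetical oracle reporting whether program `p` solves the current task within a given step budget; real OOPS additionally backtracks through program prefixes and freezes code that solved earlier tasks so later searches can reuse it):

```python
def oops_round(candidates, probs, solves, budget):
    """One bias-optimal round: program p gets a slice of the total
    budget proportional to its probability P(p), so a solver with
    probability P(p) is found after roughly budget / P(p) total work."""
    for p in candidates:
        if solves(p, int(budget * probs[p])):
            return p
    return None

def oops_solve(candidates, probs, solves):
    budget = 1
    while True:                       # double the total budget until solved
        p = oops_round(candidates, probs, solves, budget)
        if p is not None:
            return p, budget
        budget *= 2

# Hypothetical task: only program "b" solves it, needing 3 steps.
winner, budget = oops_solve(
    ["a", "b"], {"a": 0.75, "b": 0.25},
    lambda p, steps: p == "b" and steps >= 3)
# winner == "b"; budget == 16, since int(16 * 0.25) = 4 >= 3
```

Raising the probability of previously successful programs is how the sketch would model OOPS's reuse of earlier acquired knowledge: a reused solver with high $P(p)$ gets a large budget slice and is found quickly.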

**5. Gödel Machine vs Success-Story Algorithm and Other Metalearners.** A learner's modifiable components are called its policy. An algorithm that modifies the policy is a learning algorithm. If the learning algorithm has modifiable components represented as part of the policy, then we speak of a self-modifying policy (SMP) [48]. SMPs can modify the way they modify themselves, and so on. The Gödel machine has an SMP.

In previous practical work we used the *success-story algorithm* (SSA) to force some (stochastic) SMP to trigger better and better self-modifications [37,49,48,50]. During the learner's life-time, SSA is occasionally called at times computed according to SMP itself. SSA uses backtracking to undo those SMP-generated SMP-modifications that have not been empirically observed to trigger lifelong reward accelerations (measured up until the current SSA call--this evaluates the long-term effects of SMP-modifications setting the stage for later SMP-modifications). SMP-modifications that survive SSA represent a lifelong success history. Until the next SSA call, they build the basis for additional SMP-modifications. Solely by self-modifications our SMP/SSA-based learners solved a complex task in a partially observable environment whose state space is far bigger than most found in the literature [48].

The Gödel machine's training algorithm is theoretically much more powerful than SSA, though. SSA empirically measures the usefulness of previous self-modifications, and does not necessarily encourage provably optimal ones. Similar drawbacks hold for Lenat's human-assisted, non-autonomous, self-modifying learner [23], our Meta-Genetic Programming [34] extending Cramer's Genetic Programming [8,1], our metalearning economies [34] extending Holland's machine learning economies [15], and gradient-based metalearners for continuous program spaces of differentiable recurrent neural networks [36,13]. All these methods, however, could be used to seed $p$ with an initial policy.
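The success-story criterion of SSA can be sketched as a simple stack check (our simplified rendering: each surviving self-modification is a checkpoint storing (time, cumulative reward), with (0, 0) for the start of life at the bottom; real SSA is called at SMP-computed times and also restores the undone policy components):

```python
def ssa_call(stack, t_now, r_now):
    """Success-story check (sketch): pop (undo) the most recent
    self-modifications until reward-per-time since each surviving
    checkpoint increases from the bottom of the stack to the top."""
    def rate(t, r):                   # reward intake speed since (t, r)
        return (r_now - r) / (t_now - t)
    while len(stack) > 1 and rate(*stack[-1]) <= rate(*stack[-2]):
        stack.pop()                   # modification did not accelerate reward
    return stack

# The modification at t=20 slowed reward intake (0.2/step since then,
# versus 0.5/step since the surviving one at t=10), so it is undone:
history = ssa_call([(0, 0), (10, 2), (20, 10)], t_now=30, r_now=12)
# history == [(0, 0), (10, 2)]
```

The contrast with the Gödel machine is visible in the sketch: SSA judges a modification only *after the fact* from empirical reward counts, whereas the Gödel machine demands a proof of improvement *before* committing to a self-modification.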