
Generalization and Learning

In general, generalization is impossible. Given the history of a particular universe up to a given time, there are infinitely many possible continuations. Most of these continuations have nothing to do with the previous history. To see this, suppose we have observed a partial history $S^k_{i,j}$ (the substring between the $i$-th and the $j$-th symbol of $E_k$). Now we want to generalize from previous experience to predict $S^k_{j+1,l}$, $l > j$. To do this, we need an algorithm that computes $S^k_{j+1,l}$ from $S^k_{i,j}$ ($S^k_{i,j}$ may be stored on a separate, additional input tape for an appropriate universal Turing machine). There are $3^{l-j}$ possible futures. But for $c < l-j$, there are fewer than $(1/3)^c \cdot 3^{l-j}$ algorithms with fewer than $l-j-c$ bits computing such a future, given $S^k_{i,j}$. Hence in most cases the shortest algorithm computing the future, given the past, won't be much shorter than the shortest algorithm computing the future from nothing. Both will have about the size of the entire future. In other words, the mutual algorithmic information between history and future will be zero. As a consequence, in most universes (those computable only by long algorithms), successful generalization from previous experience is not possible. Neither is inductive transfer. This simple insight is related to results in [Wolpert:96].
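
To restate the counting step as a single bound: since each algorithm computes at most one future, given the past, the fraction of the $3^{l-j}$ possible futures computable by some algorithm of fewer than $l-j-c$ bits is less than

\begin{displaymath}
\frac{(1/3)^c \cdot 3^{l-j}}{3^{l-j}} = 3^{-c} .
\end{displaymath}

For instance, a saving of $c = 20$ bits is available for fewer than one future in $3^{20} \approx 3.5 \cdot 10^9$.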

Learning. Given the above, since learning means using previous experience to improve future performance, learning is possible only in the few regular universes (no learning without compressibility). On the other hand, regularity by itself is not sufficient to allow for learning. For instance, there is a highly compressible and regular universe represented by ``,,,,,,,...''. It is too simple to allow for processes we would be willing to call learning.

In what follows, I will assume that a regular universe is complex enough to allow for identifying certain permanent data structures of a general learner, to be described below. For convenience, I will abstract from bitstring models and instead talk about environments, rewards, stacks, etc. Of course, all these abstract concepts are representable as bitstrings.

Scenario. In general, the learner's life is limited. To it, time will be important (not to the Great Programmer though). Suppose its life in environment $\cal E$ lasts from time 0 to unknown time T. In between it repeats the following cycle over and over again ($\cal A$ denotes a set of possible actions): select and execute $a \in \cal A$ with probability $P( a \mid \cal E )$, where the modifiable policy P is a variable, conditional probability distribution on the possible actions, given current $\cal E$. Action a will consume time and may change $\cal E$ and P. Actions that modify P are called primitive learning algorithms (PLAs). P influences the way P is modified (``self-modification''). Policy modification processes (PMPs) are action subsequences that include PLAs. The $i$-th PMP in system life is denoted PMP$_i$; it starts at time $s_i > 0$, ends at time $e_i$ with $s_i < e_i < T$, and computes a sequence of P-modifications denoted $M_i$. Both $s_i$ and $e_i$ are computed dynamically by special instructions in $\cal A$ executed according to P itself: P says when to start and end PMPs.
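
To make the cycle concrete, here is a minimal Python sketch under illustrative assumptions: the policy P is represented as a table mapping action names to probabilities (in general it would be conditioned on the current state of $\cal E$), each action is a callable that may modify $\cal E$ and P and returns the time it consumed, and the special PMP-delimiting actions are assumed to set flags named begin_pmp and end_pmp on the environment. None of these names come from the original formulation.

\begin{verbatim}
import random

# Minimal sketch of the lifelong cycle.  All concrete names here
# (run_life, begin_pmp, end_pmp, the dict-based policy) are
# illustrative assumptions, not part of the original formulation.

def run_life(T, actions, policy, env):
    """actions: dict name -> callable(env, policy) returning elapsed time.
    policy:  dict name -> probability (the modifiable distribution P).
    env:     dict holding the state of the environment E."""
    t = 0.0
    pmp_start = None          # s_i of the currently open PMP, if any
    pmps = []                 # completed (s_i, e_i) pairs
    while t < T:
        # select and execute a in A with probability P(a | E)
        names = list(policy)
        a = random.choices(names, weights=[policy[n] for n in names])[0]
        t += actions[a](env, policy)   # may change E and P; PLAs change P
        # P itself decides when PMPs start and end, via special actions
        # assumed to set the flags 'begin_pmp' / 'end_pmp' on env.
        if env.pop('begin_pmp', False) and pmp_start is None:
            pmp_start = t
        if env.pop('end_pmp', False) and pmp_start is not None:
            pmps.append((pmp_start, t))
            pmp_start = None
    return pmps
\end{verbatim}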

Occasionally $\cal E$ provides real-valued reward. The cumulative reward obtained in between time 0 and time t > 0 is denoted by R(t) (where R(0) = 0). At each PMP-start $s_i$ the learner's goal is to use experience to generate P-modifications to accelerate future reward intake. Assuming that reward acceleration is possible at all, given $\cal E$ and $\cal A$, how can the learner achieve it? I will describe a rather general way of doing so.

The success-story criterion. Each PMP-start time $s_i$ will trigger an evaluation of the system's performance so far. Since $s_i$ is computed according to P, P incorporates information about when to evaluate itself. Evaluations may cause policy modifications to be undone (by restoring the previous policy -- in practical implementations, this requires storing the previous values of modified policy components on a stack). At a given PMP-start $t$ in the learner's life, let $V(t)$ denote the set of those previous $s_i$ whose corresponding $M_i$ have not been undone yet. If $V(t)$ is not empty, then let $v_i ~~(i \in \{1, 2, \ldots, \mid V(t) \mid \})$ denote the $i$-th such time, ordered according to size. The success-story criterion SSC is satisfied if either $V(t)$ is empty (trivial case) or if

\begin{displaymath}
\frac{R(t)}{t}
<
\frac{R(t) - R(v_1)}{t - v_1}
<
\frac{R(t) - R(v_2)}{t - v_2}
<
\ldots
<
\frac{R(t) - R(v_{\mid V(t) \mid})}{t - v_{\mid V(t) \mid}}.
\end{displaymath}
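
In code, the criterion can be checked directly (a minimal sketch; R is assumed to be a function returning cumulative reward up to a given time, and valid_times is the ordered list $v_1 < v_2 < \ldots < v_{\mid V(t) \mid}$):

\begin{verbatim}
def ssc_satisfied(t, R, valid_times):
    """True iff SSC holds at PMP-start time t > 0.
    R(x): cumulative reward up to time x, with R(0) == 0.
    valid_times: increasing list [v_1, ..., v_|V(t)|] of the start times
    whose policy modifications have not been undone yet."""
    if not valid_times:
        return True                          # trivial case: V(t) is empty
    ratios = [R(t) / t]                      # average reward since time 0
    ratios += [(R(t) - R(v)) / (t - v) for v in valid_times]
    return all(a < b for a, b in zip(ratios, ratios[1:]))
\end{verbatim}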

SSC essentially says that each surviving P-modification corresponds to a long-term reward acceleration. Once SSC is satisfied, the learner continues to act and learn until the next PMP-start. Since there may be arbitrary reward delays in response to certain action subsequences, it is important that $\cal A$ indeed includes actions for delaying performance evaluations -- the learner will have to learn when to trigger evaluations. Since the success of a policy modification recursively depends on the success of later modifications for which it is setting the stage, the framework provides a basis for ``learning how to learn''. Unlike in previous learning paradigms, the entire life is considered for performance evaluations. Experiments in [Schmidhuber:97bias, Schmidhuber:97ssa] show the paradigm's practical feasibility. For instance, in [Schmidhuber:97bias] $\cal A$ includes an extension of Levin search [Levin:84] for generating the PMPs.
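
A simple way to maintain SSC (a sketch under the same assumptions as the sketches above, reusing ssc_satisfied; not necessarily the exact procedure of the cited experiments) is to keep the previous values of modified policy components on a stack and, at each evaluation, keep undoing the most recent surviving modification sequence until the criterion holds again:

\begin{verbatim}
def enforce_ssc(t, R, valid_times, undo_stack, policy):
    """Undo the most recent surviving P-modifications until SSC holds.
    undo_stack[i]: the saved (component, old_value) pairs of M_i,
    aligned with valid_times; the two lists shrink together."""
    while valid_times and not ssc_satisfied(t, R, valid_times):
        valid_times.pop()                        # drop the latest v_i ...
        for component, old_value in reversed(undo_stack.pop()):
            policy[component] = old_value        # ... and restore P
\end{verbatim}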

