
INTRODUCTION

Bias towards algorithmic regularity. There is no miraculous universal learning algorithm that will perform well in arbitrary environments [Schmidhuber:94self, Wolpert:96, Schmidhuber:97nn]. Appropriate inductive bias [Utgoff:86] is essential. In unknown environments, however, we do not want too specific a bias. How unspecific may it be? The algorithmic Occam's razor bias (AORB) assumes that problem solutions are non-random but regular [Solomonoff:64, Kolmogorov:65, Chaitin:69, LiVitanyi:93], without specifying the way in which they are non-random. AORB is highly specific in the sense that it tends to be useless for solving an arbitrary problem from the set of all well-defined problems, almost all of which have irregular solutions [Solomonoff:64, Kolmogorov:65, Chaitin:69, LiVitanyi:93]. But AORB is highly unspecific in the sense that it can help to solve all kinds of "typical", "regular", interesting problems occurring in our atypical, regular universe. This paper, based on [Schmidhuber:94self], goes beyond our other recent papers exploiting algorithmic regularities for practical machine learning purposes [Schmidhuber:95kol, Schmidhuber:97nn, Wiering:96levin].

Levin search (LS). References [Schmidhuber:95kol, Schmidhuber:97nn] show how a variant of Levin search (LS) [Levin:73, Levin:84, Solomonoff:86, LiVitanyi:93] can find problem solutions with low Kolmogorov complexity and high generalization ability in non-random but otherwise quite general settings. LS is of interest because it has the optimal order of computational complexity for a broad class of search problems. For instance, suppose there is an algorithm that solves certain time-limited optimization or inversion problems in O(f(n)) steps, where f is a total recursive function and n is a positive integer representing problem size. Then universal LS will solve the same problems in at most O(f(n)) steps (although a large constant may be buried in the O() notation). Despite this strong result, until recently LS received little attention except in purely theoretical studies; see, e.g., [Watanabe:92].
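
For concreteness, consider a minimal sketch (in Python) of the time-allocation scheme behind LS; the helper names programs, prob, run, and is_solution are illustrative assumptions rather than part of the cited formulations. In phase i, each candidate program p receives a runtime budget proportional to prob(p) * 2^i, so a program that solves the problem within t steps and has prior probability P is found after total effort of order t/P.

    import itertools

    def levin_search(programs, prob, run, is_solution):
        # programs()   -- yields candidate programs in order of non-increasing prob(p)
        # prob(p)      -- prior probability assigned to program p
        # run(p, s)    -- executes p for at most s steps; returns output or None on timeout
        # is_solution  -- tests whether an output solves the problem at hand
        for i in itertools.count(1):              # phase i: total effort roughly 2**i
            for p in programs():
                budget = int(prob(p) * 2 ** i)    # budget proportional to prob(p) * 2**i
                if budget < 1:
                    break                         # remaining programs too improbable in this phase
                output = run(p, budget)
                if output is not None and is_solution(output):
                    return p, output              # first sufficiently fast solver found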

Adaptive LS. References [Schmidhuber:94kol, Schmidhuber:97nn, Wiering:96levin] note that LS is not necessarily optimal if algorithmic information (e.g., [LiVitanyi:93]) between solutions to successive problems can be exploited to reduce future search costs based on experience. Our adaptive LS extension (ALS) [Schmidhuber:94kol, Wiering:96levin] does use experience to modify LS' underlying probability distribution. ALS can dramatically reduce the search time consumed by successive LS calls in certain regular environments [Wiering:96levin].
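
The cited papers differ in how exactly experience reshapes the distribution; purely as a hypothetical illustration, the following sketch shifts probability mass towards the instructions occurring in the solution found by the previous LS call and renormalizes, so that subsequent LS calls tend to generate similar programs earlier in the search.

    def adapt_distribution(prob_table, solution_program, learning_rate=0.1):
        # prob_table       -- dict mapping instruction -> probability, used by LS as prior
        # solution_program -- sequence of instructions in the solution just found
        # learning_rate    -- strength of the reinforcement (illustrative choice)
        for instr in solution_program:
            prob_table[instr] += learning_rate * (1.0 - prob_table[instr])
        total = sum(prob_table.values())          # renormalize to keep a proper distribution
        for instr in prob_table:
            prob_table[instr] /= total
        return prob_table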

Exploiting arbitrary regularities? This paper goes one step further. Let us call a learner's modifiable components its policy. Policy-modifying algorithms are called learning algorithms. We observe that if our learner wants to exploit arbitrary algorithmic regularities then it must be able to execute arbitrary, problem-specific learning algorithms. In particular, this includes learning algorithms for creating better learning algorithms, given environmental conditions. In this sense we are interested in learning to learn.

Restricted notion of "learning to learn". In contrast to [Caruana:95] and other chapters in the book at hand, but in the spirit of the first author's earlier work [Schmidhuber:87, Schmidhuber:93selfreficann, Schmidhuber:94self], we will reserve the expressions metalearning or learning to learn to characterize learners that (1) can evaluate and compare learning methods, (2) measure the benefits of early learning on subsequent learning, and (3) use such evaluations to reason about learning strategies and to select "useful" ones while discarding others. An algorithm is not considered to have learned to learn if it improves merely by luck, if it does not measure the effects of early learning on later learning, or if it has no explicit method designed to translate such measurements into useful learning strategies.

Overview. To create a potential for learning learning algorithms, we construct a self-modifying policy (SMP) which incorporates modifiable aspects of its own learning algorithm. This is easily achieved by letting SMP influence the composition of complex (probabilistic) learning algorithms from SMP-modifying instructions called primitive learning algorithms. We then ask: how can we force SMP to trigger better and better self-modification methods? Addressing this question in a reinforcement learning (RL) context naturally leads to the success-story algorithm (SSA). SSA is a backtracking method that measures the long-term effects of learning on later learning using long-term reward/time ratios. SSA uses such measurements to decide which SMP-generated SMP-modifications to keep and which to undo.
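
As a rough sketch of the backtracking step (with illustrative names, not the concrete implementation of Section 3): SSA keeps a stack with one entry per still-valid self-modification, recording the time and the cumulative reward at the moment the modification was made, and undoes recent modifications until each surviving one has been followed by a higher reward/time ratio than its predecessor on the stack.

    def ssa_backtrack(stack, now, total_reward, undo):
        # stack      -- list of (t, R, undo_info) entries, one per still-valid
        #               self-modification: time t and cumulative reward R when it was
        #               made, plus whatever is needed to undo it; stack[0] is a
        #               permanent sentinel (0.0, 0.0, None) for the start of time
        # undo(info) -- restores the policy components changed by that modification
        def ratio(entry):
            t, r, _ = entry
            return (total_reward - r) / (now - t)     # reward per time since entry
        # Undo the most recent self-modifications until each surviving one has been
        # followed by a higher reward/time ratio than the modification preceding it.
        while len(stack) > 1 and ratio(stack[-1]) <= ratio(stack[-2]):
            undo(stack.pop()[2])
        return stack

The permanent sentinel at the bottom of the stack ensures that the oldest surviving modification must also beat the average reward per time measured since the very beginning.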

Outline. The next section will introduce basic concepts. A concrete implementation will be presented in Section 3. Section 4 will focus on an experiment involving a difficult partially observable environment (POE). POEs have received a lot of attention in recent years, e.g., [Whitehead:90, Schmidhuber:91nips, Lin:93, Ring:94, Littman:94, Cliff:94, Chrisman:92, Jaakkola:95, Kaelbling:95, McCallum:95, Wiering:96levin, Wiering:96hq]. Our POE, however, is far larger than most of those purported to demonstrate the usefulness of previous POE algorithms, most of which appear to work only for tiny state spaces [Littman:94].

