**Bias towards algorithmic regularity.**
There is no miraculous universal learning algorithm
that will perform well in arbitrary environments
[#!Schmidhuber:94self!#,#!Wolpert:96!#,#!Schmidhuber:97nn!#].
Appropriate inductive bias [#!Utgoff:86!#] is essential.
In unknown environments, however, we do not want too specific a bias.
How unspecific may it be?
The *algorithmic Occam's razor bias* (AORB)
assumes that problem solutions are non-random but regular
[#!Solomonoff:64!#,#!Kolmogorov:65!#,#!Chaitin:69!#,#!LiVitanyi:93!#],
without specifying the way in which they are non-random.
AORB is highly specific in the sense that it tends to be useless for
solving an arbitrary problem from the set of all well-defined problems,
almost all of which have irregular solutions
[#!Solomonoff:64!#,#!Kolmogorov:65!#,#!Chaitin:69!#,#!LiVitanyi:93!#].
But AORB is highly unspecific in the sense that it can help to solve
all kinds of ``typical'', ``regular'', interesting problems occurring in our
atypical, regular universe. This paper, based on [#!Schmidhuber:94self!#],
goes beyond our other recent papers
exploiting algorithmic regularities for practical
machine learning purposes
[#!Schmidhuber:95kol!#,#!Schmidhuber:97nn!#,#!Wiering:96levin!#].

**Levin search (LS).**
References
[#!Schmidhuber:95kol!#,#!Schmidhuber:97nn!#] show how a variant of
Levin search (LS) [#!Levin:73!#,#!Levin:84!#,#!Solomonoff:86!#,#!LiVitanyi:93!#]
can find problem solutions with low Kolmogorov complexity and
high generalization ability in non-random but otherwise
quite general settings. LS is of interest because
it has the optimal order of computational complexity
for a broad class of search problems.
For instance, suppose there is an algorithm that solves
certain time-limited optimization problems
or inversion problems in *O*(*f*(*n*)) steps, where *f* is a total recursive
function and *n* is a positive
integer representing problem size. Then universal LS will solve the
same problems in at most *O*(*f*(*n*)) steps (although a large constant
may be buried in the *O*() notation). Despite this strong result, until
recently LS has not received much attention except in purely theoretical
studies -- see, e.g., [#!Watanabe:92!#].
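
To make the time-allocation scheme concrete, here is a minimal Python sketch of basic LS under a uniform prior over bit-string programs: in phase *i*, every program *p* of length *l*(*p*) ≤ *i* is run for at most 2^(*i*-*l*(*p*)) steps, so each program's time slice is proportional to its prior probability 2^(-*l*(*p*)). The interpreter `run` and the test `is_solution` are user-supplied placeholders assumed for this sketch rather than part of the original formulation.

```python
from itertools import product

def levin_search(run, is_solution, alphabet=(0, 1), max_phase=20):
    """Sketch of basic Levin search under a uniform prior over bit strings.

    In phase i, every program p with l(p) <= i is run for at most
    2**(i - l(p)) steps, i.e., for a time slice proportional to its
    prior probability 2**(-l(p)).

    run(p, max_steps): executes p for at most max_steps steps and returns
        its output, or None if it did not halt in time (user-supplied).
    is_solution(output): tests whether the output solves the problem.
    """
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)          # time slice per program of this length
            for p in product(alphabet, repeat=length):
                out = run(p, budget)
                if out is not None and is_solution(out):
                    return p, phase                 # first (cheapest-to-find) solver
    return None, max_phase
```

For instance, with a toy interpreter that needs *l*(*p*) steps to decode a bit string into an integer, `levin_search(lambda p, s: int(''.join(map(str, p)), 2) if s >= len(p) else None, lambda out: out == 11)` finds the program `(1, 0, 1, 1)`.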

**Adaptive LS.**
References
[#!Schmidhuber:94kol!#,#!Schmidhuber:97nn!#,#!Wiering:96levin!#]
note that LS by itself is not necessarily optimal in cases where
mutual algorithmic information (e.g., [#!LiVitanyi:93!#])
between solutions to successive problems could be exploited
to reduce future search costs based on experience.
Our adaptive LS extension (ALS)
[#!Schmidhuber:94kol!#,#!Wiering:96levin!#]
does use experience to
modify LS's underlying probability distribution.
ALS can dramatically reduce the search time consumed by
successive LS calls in certain regular environments
[#!Wiering:96levin!#].
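
As an illustration only (the precise update rule of [#!Wiering:96levin!#] differs in its details), the Python sketch below shifts probability mass towards the primitive instructions that occurred in the most recent successful program, so that later LS calls tend to generate similar programs earlier:

```python
def adapt_instruction_probs(probs, successful_program, rate=0.1):
    """ALS-style update sketch (illustrative, not the paper's exact rule):
    after LS finds a solution, move a fraction `rate` of probability mass
    towards the empirical instruction frequencies of the successful program.
    `probs` maps each primitive instruction to its current selection
    probability (the values sum to 1)."""
    counts = {ins: successful_program.count(ins) for ins in probs}
    total = sum(counts.values()) or 1
    updated = {ins: (1 - rate) * probs[ins] + rate * counts[ins] / total
               for ins in probs}
    norm = sum(updated.values())
    return {ins: p / norm for ins, p in updated.items()}   # renormalize
```

A subsequent LS call would then draw candidate programs according to the updated probabilities rather than the uniform prior of plain LS.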

**Exploiting arbitrary regularities?**
This paper goes one step further.
Let us call a learner's modifiable components its *policy*, and call
policy-modifying algorithms *learning algorithms*.
We observe that if our learner wants to
exploit *arbitrary* algorithmic regularities
then it must be able to execute arbitrary, problem-specific
learning algorithms. In particular, this includes learning algorithms
for creating better learning algorithms, given environmental conditions.
In this sense we are interested in *learning to learn*.

**Restricted notion of ``learning to learn''.**
In contrast to [#!Caruana:95!#] and other chapters in the book at hand,
but in the spirit of the first author's earlier work
[#!Schmidhuber:87!#,#!Schmidhuber:93selfreficann!#,#!Schmidhuber:94self!#] we
will reserve the expressions *metalearning* or *learning to learn* to
characterize learners that (1) can evaluate and compare learning methods,
(2) can measure the benefits of early learning on subsequent learning,
and (3) can use such evaluations to reason about learning strategies and to
select ``useful'' ones while discarding others. An algorithm is not
considered to have learned to learn if it improves merely by luck, if it
does not measure the effects of early learning on later learning, or if
it has no explicit method designed to translate such measurements into
useful learning strategies.

**Overview.**
To create a potential for learning learning algorithms,
we construct a *self-modifying policy* (SMP)
which incorporates modifiable aspects of its own learning
algorithm. This is easily achieved by letting SMP itself influence how
complex (probabilistic) learning algorithms are composed from
SMP-modifying instructions called *primitive learning algorithms*.
We then ask: how can we force SMP to trigger better and better
self-modification methods? Addressing
this question in a reinforcement learning (RL) context
naturally leads to the *success-story algorithm* (SSA).
SSA is a backtracking method that
measures the long-term effects of learning on later learning
using long-term reward/time ratios. SSA
uses such measurements to decide which SMP-generated
SMP-modifications to keep and which to undo.
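
As a preview of the mechanism (a sketch under assumptions; the exact criterion and data structures are specified in the following sections), the Python fragment below shows the bookkeeping SSA could perform at a single checkpoint. The stack entries and the `undo` callbacks are illustrative names of this sketch, not the paper's notation.

```python
def ssa_checkpoint(stack, now, total_reward, eps=1e-9):
    """Success-story algorithm, sketched at one checkpoint.

    Each stack entry describes one still-valid SMP-modification as a tuple
    (t_mod, reward_at_t_mod, undo), where `undo` restores whatever policy
    components the modification changed.  The success-story criterion
    demands that the reward/time ratios measured since the learner's start
    and since each still-valid modification form an increasing sequence;
    while it is violated, the most recent modification is undone and popped.
    """
    def ratio(t_mod, reward_at_t_mod):
        return (total_reward - reward_at_t_mod) / max(now - t_mod, eps)

    def criterion_holds():
        ratios = [total_reward / max(now, eps)]          # ratio since the start
        ratios += [ratio(t, r) for t, r, _ in stack]     # ratios since each modification
        return all(a < b for a, b in zip(ratios, ratios[1:]))

    while stack and not criterion_holds():
        _, _, undo = stack[-1]
        undo()                                           # restore pre-modification policy
        stack.pop()
    return stack
```

Every modification that survives a checkpoint has thus been followed by a higher average reward per time step than everything that preceded it; this is the operational sense in which SSA measures whether earlier learning improved later learning.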

**Outline.**
The next section will introduce basic concepts.
A concrete implementation will be presented
in section 3. Section 4 will focus on an experiment involving a
difficult partially observable environment (POE).
POEs have received a lot of attention in recent years, e.g.,
[#!Whitehead:90!#,#!Schmidhuber:91nips!#,#!Lin:93!#,#!Ring:94!#,#!Littman:94!#,#!Cliff:94!#,#!Chrisman:92!#,#!Jaakkola:95!#,#!Kaelbling:95!#,#!McCallum:95!#,#!Wiering:96levin!#,#!Wiering:96hq!#].
Our POE, however, is far larger
than most of those purported to demonstrate the usefulness of previous
POE algorithms, most of which appear to work only for
tiny state spaces [#!Littman:94!#].
