**Levin Search (LS).**
Unbeknownst to many machine learning
researchers, there exists a search algorithm with amazing
theoretical properties: for a broad class of search problems,
LS [Levin, 1973,Levin, 1984]
has the optimal order of computational complexity.
For instance, suppose there is an algorithm that solves
a certain type of maze task in $O(n^3)$ steps, where $n$ is
a positive integer representing the problem size.
Then universal LS will solve the same task in at most
$O(n^3)$ steps. See [Li and Vitányi, 1993]
for an overview. See [Schmidhuber, 1995a]
for recent implementations/applications.
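
To make the time-allocation idea concrete before the formal description in section 2, here is a minimal Python sketch of Levin search over bit-string programs. The interpreter `run(program, max_steps)` (which returns a candidate solution or `None`) and the verifier `is_solution(candidate)` are hypothetical placeholders, not part of the original method:

```python
# Minimal sketch of Levin search over bit-string programs, assuming a
# user-supplied interpreter `run(program, max_steps)` and a verifier
# `is_solution(candidate)`; both names are placeholders.
from itertools import count, product

def levin_search(run, is_solution, max_phase=30):
    for phase in count(1):               # phase i has a total budget of order 2^i
        if phase > max_phase:
            return None                  # give up after a fixed overall budget
        for length in range(1, phase + 1):
            # A program of length l has probability 2^-l, hence receives
            # 2^phase * 2^-l interpretation steps in this phase.
            steps = 2 ** (phase - length)
            for bits in product("01", repeat=length):
                program = "".join(bits)
                candidate = run(program, max_steps=steps)
                if candidate is not None and is_solution(candidate):
                    return program, candidate
```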

**Search through program space is relevant for ``POMDPs''.**
LS is a smart way of performing exhaustive search by
``optimally'' allocating time to programs
computing solution candidates (details in section 2).
Since programs written in a general language
can use memory to disambiguate environmental
inputs, LS is of potential interest for solving
partially observable Markov decision problems (POMDPs),
which received a lot of attention during recent years, e.g.,
[Jaakkola et al., 1995,Kaelbling et al., 1995,Ring, 1994,McCallum, 1993].
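
As a toy illustration (not from the paper) of why memory matters here: two distinct maze cells may emit identical observations, so any memoryless reactive mapping from observations to actions must treat them alike, whereas a program with an internal register can tell them apart. The policy names and the step counter below are assumptions made purely for illustration:

```python
# Two distinct corridor cells emit the same observation. A memoryless
# reactive policy must pick one action for both; a program with a private
# register (here a step counter) can behave differently in each cell.

def reactive_policy(observation):
    # Identical observations always yield identical actions.
    return "forward" if observation == "corridor" else "turn"

def policy_with_memory(observation, memory):
    # `memory` is internal state the program reads and updates itself.
    if observation == "corridor":
        memory["steps"] = memory.get("steps", 0) + 1
        # After three aliased corridor observations, act differently.
        return ("turn_left" if memory["steps"] >= 3 else "forward"), memory
    memory["steps"] = 0
    return "turn", memory
```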

**Incremental extensions of LS.**
LS by itself, however, is non-incremental:
it does not use experience with
previous tasks to speed up performance on new
tasks. Therefore, it cannot immediately be used
in typical, incremental reinforcement learning scenarios,
where, in case of success, the system
is given ``reinforcement'' (a real number) and
tries to use that experience to
maximize the sum of *future* reinforcements
to be obtained during the remainder of system life.
There have been proposals of
``adaptive'' LS variants that modify LS's
underlying probability distribution
on program space
[Solomonoff, 1986, Schmidhuber, 1995a].
None of these, however, can guarantee that the
lifelong history of probability modifications
will correspond to a lifelong history of reinforcement
accelerations.
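
For intuition, an adaptive variant of this kind might, after each successful search, shift a little probability mass toward the instructions used by the solving program, in the spirit of a linear reward-inaction update. The sketch below is only illustrative (the instruction table `probs`, the sequence `solver`, and the learning rate `alpha` are assumptions); section 3 specifies the actual ALS rule:

```python
# Illustrative sketch of a linear reward-inaction style update of the
# instruction probabilities underlying the search. `probs` maps each
# instruction to its current probability, `solver` is the instruction
# sequence of the program that just solved the task.

def reward_inaction_update(probs, solver, alpha=0.1):
    for instr in set(solver):
        # Move part of the remaining mass toward instructions that
        # appeared in the successful program.
        probs[instr] += alpha * (1.0 - probs[instr])
    total = sum(probs.values())          # renormalize to keep a distribution
    return {i: p / total for i, p in probs.items()}
```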

**EIRA.**
The problem above has been addressed recently
[].
At certain times in system life called checkpoints, a novel
technique called
``environment-independent reinforcement acceleration'' (EIRA)
invalidates certain modifications
of the system's policy
(the policy can be
an arbitrary modifiable algorithm mapping environmental inputs
and internal states to outputs and new internal states)
such that all currently valid modifications
are justified in the following sense: each still valid modification
has been followed by long-term performance speed-up. To measure
speed, at each checkpoint EIRA looks at the entire time
interval that went by since the modification occurred.
To do this efficiently, EIRA performs some backtracking
(the time required for backtracking is taken into account
for measuring performance speed-ups).
EIRA is general in the sense that it can be combined
with your favorite learning or search algorithm.
Essentially, EIRA acts as a safety belt for cases in which
your favorite learning algorithm fails to
improve things such that long-term reinforcement intake
speeds up (see details in section 4).
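
The following Python sketch illustrates one way such a checkpoint could be organized, assuming each stack entry stores the time of a modification, the cumulative reinforcement at that time, and the information needed to undo it; these data structures and function names are assumptions made for illustration, and section 4 gives the actual procedure:

```python
# Hedged sketch of an EIRA-style checkpoint. Each stack entry holds
# (time of modification, cumulative reinforcement at that time, undo info);
# `undo` restores the policy state preceding a modification.

def speeds(stack, now, total_reward):
    # Reinforcement per time unit measured from the start of system life
    # and from each still-valid modification up to the current checkpoint.
    marks = [(0.0, 0.0)] + [(t, r) for t, r, _ in stack]
    return [(total_reward - r) / (now - t) for t, r in marks]

def eira_checkpoint(stack, now, total_reward, undo):
    # Backtrack (undo and pop) modifications until the measured speeds form
    # an increasing sequence, i.e. every remaining modification has been
    # followed by a long-term reinforcement acceleration.
    while stack:
        s = speeds(stack, now, total_reward)
        if all(a < b for a, b in zip(s, s[1:])):
            break
        _, _, undo_info = stack.pop()
        undo(undo_info)
    return stack
```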

**Outline of paper.**
Section 2 describes LS details.
Section 3 presents the heuristic adaptation method (ALS -- a
simple, adaptive, incremental extension of LS related to the
linear reward-inaction algorithm, e.g., [Kaelbling, 1993]).
Section 4 briefly reviews EIRA and shows how to combine it
with ALS.
Section 5 presents results:
in an illustrative application involving a maze with many more
states and obstacles than the mazes solved by
previous authors working on POMDPs, we show how LS can solve
partially observable maze tasks with huge state spaces and
non-trivial but low-complexity solutions (tasks that
Q-learning fails to solve).
Then we show that ALS
can use previous experience to significantly reduce
search time.
Finally, we show that ALS augmented by EIRA
can clearly outperform ALS by itself.
Section 6 presents conclusions.
