The arguments above emphasize DS's ability to deal with general algorithm spaces, as opposed to DPRL's limited spaces of reactive mappings. Which theoretical results apply to this case?

**Non-incremental & Deterministic.**
We know that Levin Search (LS)
[Levin 1973, Levin 1984, Li & Vitányi 1993]
is optimal in deterministic and
*non*-incremental settings,
that is, in cases where during the search process
there are no intermediate reinforcement
signals indicating the quality of suboptimal solutions.
LS generates and
tests computable solution candidates $q$ in order of their Levin complexities

$Kt(q) = \log t(q) - \log P(q),$

where program $q$ computes its solution candidate within $t(q)$ time steps, and $P(q)$ is $q$'s Solomonoff-Levin probability [Levin 1973, Levin 1984, Li & Vitányi 1993]. Now suppose some algorithm A is designed to find a $q$ solving $\varphi(q) = 1$, where $\varphi$ is a computable function mapping bitstrings to bitstrings. For instance, $q$ may represent a solution to a maze task implemented by $\varphi$: $\varphi(q) = 1$ iff $q$ leads to the goal. Let $n$ be a measure of problem size (such as the number of fields in the maze), and suppose A needs at most $O(f(n))$ steps ($f$ computable) to solve problems of size $n$. Then LS also will need at most $O(f(n))$ steps [Levin 1973, Levin 1984, Li & Vitányi 1993]. Unlike GP etc., LS has a principled way of dealing with unknown program runtimes -- time is allocated in an optimal fashion. There are systems that use a probabilistic LS variant to discover neural nets that perfectly generalize from extremely few training examples [Schmidhuber 1995, Schmidhuber 1997], and LS implementations that learn to use memory for solving maze tasks unsolvable by DPRL due to highly ambiguous inputs [Schmidhuber et al. 1997b].
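
To make the time-allocation principle concrete, here is a minimal Python sketch of an LS-style search under the usual assumption $P(q) = 2^{-l(q)}$ for a bitstring program $q$ of length $l(q)$; the helper routines `run` and `is_solution` are illustrative assumptions, not part of the cited work. In phase $i$, each program of length $l$ receives a budget of $2^{i-l}$ steps, i.e., time proportional to its probability, so candidates are effectively tried in order of increasing $Kt$.

```python
import itertools

def levin_search(run, is_solution, max_phase=20):
    """LS-style sketch: run(q, steps) must return q's output,
    or None if q does not halt within `steps` time steps."""
    for i in range(1, max_phase + 1):          # phase i: ~2**i total steps
        for l in range(1, i + 1):              # program length l
            steps = 2 ** (i - l)               # budget = P(q) * 2**i
            for bits in itertools.product('01', repeat=l):
                q = ''.join(bits)
                out = run(q, steps)
                if out is not None and is_solution(out):
                    return q                   # first (cheapest) solution
    return None                                # nothing found within budget
```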

Almost all work in traditional RL, however, focuses on *incremental*
settings where continual policy improvement can bring more and
more reward per time, and where experience with suboptimal solution
candidates helps to find better ones. Adaptive Levin Search (ALS)
[Schmidhuber et al. 1997b, Wiering & Schmidhuber 1996] extends LS in this way: whenever
a new candidate is more successful than the best previous one, the
underlying probability distribution is modified to make the new candidate
more likely, and a new LS cycle begins. This guarantees that the search
time spent on each incremental step (given a particular probability
distribution embodying the current bias) is optimal in Levin's sense.

It does not guarantee, though, that the total search time is spent
optimally, because the probability adjustments themselves may have been
suboptimal. ALS cannot exploit *arbitrary* regularities in solution
space, because the probability modification algorithm itself is fixed.
A machine learning researcher's dream would be an incremental RL algorithm
that spends overall search time optimally in a way
comparable to non-incremental LS's.
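
To make the ALS cycle described above concrete, the following Python sketch maintains a distribution over primitive instructions and shifts probability mass toward the instructions of each new best candidate. The update rule, the per-cycle sampling (a stand-in for a full LS cycle), and the names `evaluate` and `gamma` are all assumptions for illustration, not the exact method of the cited papers.

```python
import random

def adaptive_levin_search(instructions, evaluate, cycles=100, gamma=0.1):
    # Uniform initial distribution over primitive instructions.
    probs = {ins: 1.0 / len(instructions) for ins in instructions}
    best_reward, best_prog = float('-inf'), None
    for _ in range(cycles):
        # Stand-in for one LS cycle: sample a candidate program under probs.
        length = random.randint(1, 10)
        prog = random.choices(list(probs), weights=list(probs.values()), k=length)
        reward = evaluate(prog)
        if reward > best_reward:              # new best candidate found:
            best_reward, best_prog = reward, prog
            for ins in prog:                  # make its instructions more likely,
                probs[ins] += gamma
            total = sum(probs.values())       # then renormalize
            probs = {k: v / total for k, v in probs.items()}
    return best_prog, best_reward
```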

**Incremental & Deterministic.**
Which theoretical results exist for incremental DS in general program spaces?
Few, except for the following basic, almost trivial one.
Suppose the environment allows for separating the search phase into
repeatable, deterministic trials such that each trial with a given
policy yields the same reward. Now consider stochastic hill-climbing
(SHC), one of the simplest DS methods:

**1.** Initialize vector-valued policy $p$;
set variables
$p_{best} := p$,
$R_{best} := -\infty$.
**2.**
Measure reward $R$ obtained during a trial with
actions executed according to $p$.
**3.**
If
$R > R_{best}$
then
$p_{best} := p$ and $R_{best} := R$, else
$p := p_{best}$.
**4.** Modify $p$ by random mutation.
Go to **2.**
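
The procedure translates directly into runnable Python; the deterministic `trial` routine and the real-valued policy representation below are assumptions for illustration:

```python
import random

def stochastic_hill_climbing(p, trial, n_trials=1000, sigma=0.1):
    """SHC sketch; trial(p) must deterministically return the trial reward."""
    p = list(p)                                  # work on a copy
    p_best, r_best = list(p), float('-inf')      # step 1
    for _ in range(n_trials):
        r = trial(p)                             # step 2: one deterministic trial
        if r > r_best:                           # step 3: keep improvements,
            p_best, r_best = list(p), r
        else:                                    # otherwise restore the best policy
            p = list(p_best)
        i = random.randrange(len(p))             # step 4: random mutation
        p[i] += random.gauss(0.0, sigma)
    return p_best, r_best
```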

If (1) both environment and policy $p$ are deterministic, and if (2) the
environment and all other variables modified by $p$ (such as internal
memory cells whose contents may have changed due to actions executed
according to $p$) are reset after each trial, then the procedure above
will at least guarantee that performance cannot get worse over time. If
step **4** allows for arbitrary random mutations and the number of
possible policies is finite, then we can even guarantee convergence on
an optimal policy with probability 1, given infinite time.
Of course, if the random mutations of step **4** are replaced by
a systematic enumeration of all possible mutations, then we can
also guarantee an optimal policy within
finite (though exponential in problem size)
search time, as sketched below.
Similar statements can be made about alternative DS methods such as GP
or evolutionary programming.
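
To see why the enumeration variant terminates, consider the finite case concretely. A tiny Python sketch, assuming (purely for illustration) a policy space of all bit vectors of length $n$ and a deterministic `trial` function: evaluating each of the $2^n$ candidates once finds an optimal policy in finite, though exponential, time.

```python
from itertools import product

def exhaustive_search(n, trial):
    # Try all 2**n binary policies once; deterministic trials make
    # a single evaluation per policy sufficient.
    return max(product((0, 1), repeat=n), key=trial)
```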

Of course, in real-world settings *no* general method can be expected
to obtain optimal policies within reasonable time.
DS practitioners, though, are usually content with suboptimal policies.
The next section, however, will address even more fundamental problems of DS.
