The arguments above emphasize DS' ability to deal with general algorithm spaces as opposed to DPRL's limited spaces of reactive mappings. Which theoretical results apply to this case?
Non-incremental & Deterministic.
We know that Levin Search (LS) [Levin, 1973; Levin, 1984; Li & Vitányi, 1993] is optimal in deterministic, non-incremental settings, that is, in cases where during the search process there are no intermediate reinforcement signals indicating the quality of suboptimal solutions. LS generates and tests computable solution candidates in order of their Levin complexities.
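To make LS's time allocation concrete, here is a toy sketch (the bit-string "programs", the interpreter, and all names are illustrative assumptions, not taken from the text): in phase i, each candidate program p of length l(p) <= i receives a budget of 2^(i - l(p)) interpreter steps, so a solution is found after a total time governed by its Levin complexity l(p) + log t(p).

```python
from itertools import product

def levin_search(run, success, max_phase=20):
    """Toy Levin Search sketch (hypothetical interface).

    In phase i, every candidate bit string p with length l(p) <= i gets a
    budget of 2**(i - l(p)) interpreter steps, so each phase costs O(2**i)
    steps in total.
    """
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)
            for bits in product((0, 1), repeat=length):
                result = run(bits, budget)  # None if budget exhausted
                if result is not None and success(result):
                    return bits
    return None

def run(bits, budget):
    """Toy interpreter: bit 0 means 'add 1', bit 1 means 'double';
    start at 0, each operation costs one step."""
    x, steps = 0, 0
    for b in bits:
        if steps >= budget:
            return None  # out of time for this candidate
        x = x * 2 if b else x + 1
        steps += 1
    return x

# Search for the first program (in LS order) that computes 6 from 0.
solution = levin_search(run, lambda x: x == 6)
```

Under this toy semantics the search first completes the length-4 program "+1, +1, +1, double" once phase 6 grants it enough steps, illustrating how LS trades program length against runtime.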
Almost all work in traditional RL, however, focuses on incremental settings where continual policy improvement can bring more and more reward per time, and where experience with suboptimal solution candidates helps to find better ones. Adaptive Levin Search (ALS) [Schmidhuber, Zhao, & Wiering, 1997b; Wiering & Schmidhuber, 1996] extends LS in this way: whenever a new candidate is more successful than the best previous one, the underlying probability distribution is modified to make the new candidate more likely, and a new LS cycle begins. This guarantees that the search time spent on each incremental step (given a particular probability distribution embodying the current bias) is optimal in Levin's sense.
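The probability adjustment can be illustrated by a deliberately simplified, PBIL-style sketch (this is an assumption for illustration, not the paper's exact update rule): whenever a candidate beats the best-so-far, each position's Bernoulli parameter is shifted toward the winning bit string before the next LS cycle starts under the adjusted distribution.

```python
def als_style_update(probs, new_best, lr=0.3):
    """Illustrative ALS-flavoured update (simplified, hypothetical rule):
    probs[i] is the current probability of bit i being 1; shift each
    parameter a fraction lr of the way toward the winning candidate."""
    return [(1.0 - lr) * q + lr * b for q, b in zip(probs, new_best)]

# Starting from a uniform distribution, the winner (0, 0, 0, 1) becomes
# more likely in the next search cycle:
adjusted = als_style_update([0.5, 0.5, 0.5, 0.5], (0, 0, 0, 1))
```

The design point is only that the bias shifts toward past winners; how strongly (here the hypothetical `lr`) is exactly the kind of fixed choice that, as noted next, may itself be suboptimal.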
It does not guarantee, though, that the total search time is spent optimally, because the probability adjustments themselves may have been suboptimal. ALS cannot exploit arbitrary regularities in solution space, because the probability modification algorithm itself is fixed. A machine learning researcher's dream would be an incremental RL algorithm that spends overall search time optimally in a way comparable to non-incremental LS's.
Incremental & Deterministic. Which theoretical results exist for incremental DS in general program spaces? Few, except for the following basic, almost trivial one. Suppose the environment allows for separating the search phase into repeatable, deterministic trials such that each trial with a given policy yields the same reward. Now consider stochastic hill-climbing (SHC), one of the simplest DS methods:
1. Initialize vector-valued policy p; set variables r_best := -infinity, p_best := p.
2. Measure reward r obtained during a trial with actions executed according to p.
3. If r <= r_best then p := p_best, else r_best := r and p_best := p.
4. Modify p by random mutation. Go to 2.
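The steps above can be sketched as a minimal Python loop (the bit-vector policies, the OneMax-style reward, and all names are illustrative assumptions); each call to `reward` plays the role of one deterministic, resettable trial:

```python
import random

def stochastic_hill_climb(n_bits, reward, n_trials, seed=0):
    """Minimal SHC sketch, assuming deterministic, resettable trials."""
    rng = random.Random(seed)
    policy = [rng.randint(0, 1) for _ in range(n_bits)]   # step 1
    best_policy, best_reward = policy[:], float("-inf")
    history = []
    for _ in range(n_trials):
        r = reward(policy)                 # step 2: one deterministic trial
        if r <= best_reward:
            policy = best_policy[:]        # step 3: retreat to best-so-far
        else:
            best_policy, best_reward = policy[:], r
        history.append(best_reward)
        policy[rng.randrange(n_bits)] ^= 1  # step 4: random one-bit mutation
    return best_policy, best_reward, history

# OneMax-style toy task: reward is simply the number of 1-bits.
bp, br, hist = stochastic_hill_climb(8, sum, 500)
```

Because step 3 never discards the best policy found so far, the recorded best reward is non-decreasing over trials, which is exactly the guarantee discussed next.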
If (1) both environment and p are deterministic, and if (2) the environment and all other variables modified by p (such as internal memory cells whose contents may have changed due to actions executed according to p) are reset after each trial, then the procedure above will at least guarantee that performance cannot get worse over time. If step 4 allows for arbitrary random mutations and the number of possible policies is finite, then we can even guarantee convergence on an optimal policy with probability 1, given infinite time. Of course, if the random mutations of step 4 are replaced by a systematic enumeration of all possible mutations, then we can also guarantee finite (though exponential in problem size) search time. Similar statements can be made about alternative DS methods such as GP or evolutionary programming.
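The systematic-enumeration variant can be made concrete with a coarse toy sketch (names, reward, and sizes are illustrative assumptions): instead of mutating at random, simply try every one of the 2^n finite policies once, which finds an optimum in finite but exponential time.

```python
from itertools import product

def exhaustive_policy_search(n_bits, reward):
    """Systematic enumeration in place of random mutation: all 2**n_bits
    deterministic policies are tried exactly once, so an optimal policy is
    found in finite (though exponential in n_bits) time."""
    best_policy, best_reward = None, float("-inf")
    for policy in product((0, 1), repeat=n_bits):
        r = reward(policy)          # one deterministic, resettable trial
        if r > best_reward:
            best_policy, best_reward = policy, r
    return best_policy, best_reward

# Toy task: reward is the number of 1-bits, so (1, 1, 1, 1) is optimal.
best, best_r = exhaustive_policy_search(4, sum)
```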
Of course, in real-world settings no general method can be expected to obtain optimal policies within reasonable time. DS practitioners are usually content with suboptimal policies, though. The next section, however, will address even more fundamental problems of DS.