
DS Theory

The arguments above emphasize DS's ability to deal with general algorithm spaces, as opposed to DPRL's limited spaces of reactive mappings. Which theoretical results apply to this case?

Non-incremental & Deterministic. We know that Levin Search (LS) [LevinLevin1973,LevinLevin1984,Li VitányiLi Vitányi1993] is optimal in deterministic, non-incremental settings, that is, in cases where the search process yields no intermediate reinforcement signals indicating the quality of suboptimal solutions. LS generates and tests computable solution candidates $s$ in order of their Levin complexities

Kt(s) = \min_q \{ -\log D_P(q) + \log t(q,s) \},

where program $q$ computes $s$ in $t(q,s)$ time steps, and $D_P(q)$ is $q$'s Solomonoff-Levin probability [LevinLevin1973,LevinLevin1984,Li VitányiLi Vitányi1993]. Now suppose some algorithm A is designed to find the $x$ solving $\phi (x) = y$, where $\phi$ is a computable function mapping bitstrings to bitstrings. For instance, $x$ may represent a solution to a maze task, with $\phi(x)=1$ iff $x$ leads to the goal. Let $n$ be a measure of problem size (such as the number of fields in the maze), and suppose A needs at most $O(f(n))$ steps ($f$ computable) to solve problems of size $n$. Then LS will also need at most $O(f(n))$ steps [LevinLevin1973,LevinLevin1984,Li VitányiLi Vitányi1993]. Unlike GP and similar methods, LS has a principled way of dealing with unknown program runtimes -- time is allocated in an optimal fashion. There are systems that use a probabilistic LS variant to discover neural nets that perfectly generalize from extremely few training examples [SchmidhuberSchmidhuber1995,SchmidhuberSchmidhuber1997], and LS implementations that learn to use memory for solving maze tasks unsolvable by DPRL due to highly ambiguous inputs [Schmidhuber, Zhao, WieringSchmidhuber et al.1997b].
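LS's time-allocation scheme can be sketched in a few lines. The finite "program space", the `run` interface, and the priors below are illustrative assumptions standing in for the Solomonoff-Levin distribution $D_P$; they are not part of LS proper.

```python
def levin_search(programs, run, is_solution, max_phase=20):
    """Minimal sketch of Levin Search over a finite, toy program space.

    programs    : list of (program q, prior) pairs; the prior stands in
                  for the Solomonoff-Levin probability D_P(q).
    run(q, t)   : runs q for at most t steps; returns its output s,
                  or None if q has not halted within t steps.
    is_solution : predicate testing a candidate solution s.
    """
    for phase in range(1, max_phase + 1):
        budget = 2 ** phase
        for q, prior in programs:
            # Levin's allocation: give q time proportional to its prior,
            # roughly D_P(q) * 2^phase in the current phase.
            t = max(1, int(prior * budget))
            s = run(q, t)
            if s is not None and is_solution(s):
                return q, s
    return None  # no solution found within the phase limit

# Toy "programs": (steps needed to halt, output produced).
programs = [((50, "no"), 0.5), ((8, "yes"), 0.25), ((1000, "no"), 0.25)]
run = lambda q, t: q[1] if t >= q[0] else None
found = levin_search(programs, run, lambda s: s == "yes")
```

Doubling the per-phase budget means total search time is dominated by the final phase, which is how LS achieves its optimal $O(f(n))$ overhead despite not knowing program runtimes in advance.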

Almost all work in traditional RL, however, focuses on incremental settings where continual policy improvement can bring more and more reward per unit of time, and where experience with suboptimal solution candidates helps to find better ones. Adaptive Levin Search (ALS) [Schmidhuber, Zhao, WieringSchmidhuber et al.1997b,Wiering SchmidhuberWiering Schmidhuber1996] extends LS in this way: whenever a new candidate is more successful than the best previous one, the underlying probability distribution is modified to make the new candidate more likely, and a new LS cycle begins. This guarantees that the search time spent on each incremental step (given a particular probability distribution embodying the current bias) is optimal in Levin's sense.
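One way to picture ALS's adjustment step (the update rule below is a simplified, hypothetical variant, not the published one): whenever a candidate beats the previous best, shift the instruction distribution toward the instruction frequencies observed in the new best candidate.

```python
def als_update(probs, best_candidate, lr=0.1):
    """Hypothetical sketch of an ALS-style probability adjustment:
    move the instruction distribution toward the frequencies observed
    in the new best candidate, then renormalize.

    probs          : dict mapping each primitive instruction to its
                     current selection probability (sums to 1).
    best_candidate : instruction sequence of the new best program.
    lr             : learning rate controlling the size of the shift.
    """
    n = len(best_candidate)
    for instr in probs:
        target = best_candidate.count(instr) / n
        probs[instr] += lr * (target - probs[instr])
    z = sum(probs.values())          # renormalize to a distribution
    for instr in probs:
        probs[instr] /= z
    return probs

# A successful candidate using "a" twice and "b" once nudges the
# distribution toward "a", biasing the next LS cycle.
probs = als_update({"a": 0.5, "b": 0.5}, ["a", "a", "b"])
```

After the update, the next LS cycle samples candidates under the new distribution, so instructions that appeared in successful programs are tried earlier.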

It does not guarantee, though, that the total search time is spent optimally, because the probability adjustments themselves may have been suboptimal. ALS cannot exploit arbitrary regularities in solution space, because the probability modification algorithm itself is fixed. A machine learning researcher's dream would be an incremental RL algorithm that spends overall search time optimally in a way comparable to non-incremental LS's.

Incremental & Deterministic. Which theoretical results exist for incremental DS in general program spaces? Few, except for the following basic, almost trivial one. Suppose the environment allows for separating the search phase into repeatable, deterministic trials such that each trial with a given policy yields the same reward. Now consider stochastic hill-climbing (SHC), one of the simplest DS methods:

1. Initialize vector-valued policy $p$; set variables $BestPolicy := p$, $BestResult := - \infty$.
2. Measure reward $R$ obtained during a trial with actions executed according to $p$.
3. If $BestResult > R$ then $p := BestPolicy$, else $BestPolicy := p$ and $BestResult := R$.
4. Modify $p$ by random mutation. Go to 2.

If (1) both environment and $p$ are deterministic, and (2) the environment and all other variables modified by $p$ (such as internal memory cells whose contents may have changed due to actions executed according to $p$) are reset after each trial, then the procedure above at least guarantees that performance cannot get worse over time. If step 4 allows for arbitrary random mutations and the number of possible policies is finite, then we can even guarantee convergence on an optimal policy with probability 1, given infinite time. Of course, if the random mutations of step 4 are replaced by a systematic enumeration of all possible mutations, then we can also guarantee finite (though exponential in problem size) time. Similar statements can be made about alternative DS methods such as GP or evolutionary programming.
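Under assumptions (1) and (2), the four-step SHC loop above translates directly into code. The integer "policy", reward function, and mutation operator below are toy assumptions chosen only to make the sketch runnable.

```python
import random

def stochastic_hill_climb(p, run_trial, mutate, n_trials=1000):
    """Sketch of the SHC loop above.  Assumes run_trial is deterministic
    (same policy => same reward) and that all state touched by a trial
    is reset before the next one.
    """
    best_policy, best_result = p, float("-inf")   # step 1
    for _ in range(n_trials):
        r = run_trial(p)                          # step 2
        if best_result > r:
            p = best_policy                       # step 3: revert if worse
        else:
            best_policy, best_result = p, r       # ... else record new best
        p = mutate(p)                             # step 4: random mutation
    return best_policy, best_result

# Toy deterministic task (an illustrative assumption): an integer
# policy p earns reward -(p - 7)^2, maximized at p = 7.
random.seed(0)
best_p, best_r = stochastic_hill_climb(
    0, lambda p: -(p - 7) ** 2, lambda p: p + random.choice([-1, 1]))
```

Because a worse mutation is always reverted to the stored best, the sequence of $BestResult$ values is monotonically non-decreasing, which is exactly the "performance cannot get worse" guarantee stated above.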

Of course, in real-world settings no general method can be expected to obtain optimal policies within reasonable time, and DS practitioners are typically content with suboptimal policies. The next section, however, will address even more fundamental problems of DS.

Juergen Schmidhuber 2003-02-19
