Levin Search (LS). Unbeknownst to many machine learning researchers, there exists a search algorithm with amazing theoretical properties: for a broad class of search problems, LS [Levin, 1973, Levin, 1984] has the optimal order of computational complexity. For instance, suppose there is an algorithm that solves a certain type of maze task in $O(n^3)$ steps, where $n$ is a positive integer representing the problem size. Then universal LS will solve the same task in at most $O(n^3)$ steps. See [Li and Vitányi, 1993] for an overview. See [Schmidhuber, 1995a] for recent implementations/applications.
Search through program space is relevant for ``POMDPs''. LS is a smart way of performing exhaustive search by ``optimally'' allocating time to programs computing solution candidates (details in section 2). Since programs written in a general language can use memory to disambiguate environmental inputs, LS is of potential interest for solving partially observable Markov decision problems (POMDPs), which have received much attention in recent years, e.g., [Jaakkola et al., 1995, Kaelbling et al., 1995, Ring, 1994, McCallum, 1993].
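To make the ``optimal'' time allocation concrete, here is a minimal sketch (an illustration, not the paper's implementation) under simple assumptions: programs are bit strings, and in phase $k$ each program $p$ of length $l(p)$ receives a budget of $2^{k-l(p)}$ interpreter steps, i.e., a time share proportional to $2^{-l(p)}$. The interpreter `run` and the toy task below are hypothetical stand-ins.

```python
from itertools import product

def levin_search(run, max_phase=20, alphabet=(0, 1)):
    """Sketch of LS's schedule: in phase k, each program p of length
    l(p) is run for up to 2**(k - l(p)) steps. Total work per phase
    roughly doubles, so a program p needing t steps is reached after
    on the order of t * 2**l(p) total work -- the source of LS's
    optimal order of computational complexity."""
    for k in range(1, max_phase + 1):
        for length in range(1, k + 1):
            budget = 2 ** (k - length)        # time share for this program length
            for program in product(alphabet, repeat=length):
                result = run(program, budget)  # run p for at most `budget` steps
                if result is not None:
                    return program, result
    return None

# Hypothetical interpreter for illustration: the "solution" is the
# program (1, 0, 1), and verifying it takes 4 interpreter steps.
def toy_run(program, steps):
    if program == (1, 0, 1) and steps >= 4:
        return "solved"
    return None
```

Under these assumptions, `levin_search(toy_run)` first succeeds in phase $k = 5$, the first phase in which length-3 programs receive a budget of at least 4 steps.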
Incremental extensions of LS. LS by itself, however, is non-incremental: it does not use experience with previous tasks to speed up performance on new tasks. Therefore, it cannot immediately be used in typical, incremental reinforcement learning scenarios, where, in case of success, the system is given ``reinforcement'' (a real number) and tries to use that experience to maximize the sum of future reinforcements to be obtained during the remainder of system life. There have been proposals of ``adaptive'' variants of LS that modify LS' underlying probability distribution on program space [Solomonoff, 1986,Schmidhuber, 1995a]. None of these, however, can guarantee that the lifelong history of probability modifications will correspond to a lifelong history of reinforcement accelerations.
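One simple update in the spirit of such adaptive variants, which shift probability mass on program space toward components of successful solutions, can be sketched as follows (the function name, data layout, and learning rate `alpha` are illustrative assumptions, not the cited authors' notation):

```python
def reward_inaction_update(probs, used, alpha=0.1):
    """Linear reward-inaction style update on a distribution over
    instructions: after a successful trial, move each used
    instruction's probability toward 1 by a fraction alpha, then
    renormalize; after a failed trial, change nothing ('inaction')."""
    new = list(probs)
    for i in used:
        new[i] += alpha * (1.0 - new[i])   # reward the instructions used
    total = sum(new)
    return [p / total for p in new]        # renormalize to a distribution
```

Starting from the uniform distribution `[0.25, 0.25, 0.25, 0.25]`, a success that used instruction 0 yields a distribution favoring instruction 0, biasing subsequent search. Note that nothing in this scheme by itself guarantees that such probability shifts accelerate long-term reinforcement intake, which is exactly the gap EIRA addresses.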
EIRA. The problem above has been addressed recently. At certain times in system life called checkpoints, a novel technique called ``environment-independent reinforcement acceleration'' (EIRA) invalidates certain modifications of the system's policy (the policy can be an arbitrary modifiable algorithm mapping environmental inputs and internal states to outputs and new internal states), such that all currently valid modifications are justified in the following sense: each still valid modification has been followed by long-term performance speed-up. To measure speed, at each checkpoint EIRA considers the entire time interval that has passed since the modification occurred. To do this efficiently, EIRA performs some backtracking (the time required for backtracking is taken into account when measuring performance speed-ups). EIRA is general in the sense that it can be combined with your favorite learning or search algorithm. Essentially, EIRA acts as a safety belt in cases where your favorite learning algorithm fails to improve things such that long-term reinforcement intake speeds up (see details in section 4).
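The checkpoint test can be sketched as follows. This is an illustration of the idea under assumed data structures, not the exact algorithm of section 4: still-valid policy modifications live on a stack of `(time, undo)` pairs, and a modification survives only if the average reinforcement per time step since it was made exceeds the average since the previous surviving modification.

```python
def eira_checkpoint(stack, now, total_reward, cumulative):
    """Sketch of an EIRA-style checkpoint. `stack` holds (time, undo)
    pairs for still-valid policy modifications, oldest first;
    `cumulative[t]` is the reinforcement collected up to time t (with
    cumulative[0] == 0.0). Modifications whose reward rate does not
    beat the rate since the previous surviving modification are popped
    and undone (backtracking)."""
    def rate_since(t):
        return (total_reward - cumulative[t]) / (now - t)
    while stack:
        t_top = stack[-1][0]
        t_prev = stack[-2][0] if len(stack) > 1 else 0
        if rate_since(t_top) > rate_since(t_prev):
            break                 # speed-up confirmed: keep this modification
        _, undo = stack.pop()
        undo()                    # invalidate: restore the earlier policy
```

For example, with `now = 10`, `total_reward = 10.0`, and `cumulative = {0: 0.0, 4: 2.0, 8: 9.0}`, a modification made at time 8 earned only 0.5 reward per step since it was made, versus roughly 1.33 per step since the modification at time 4, so the time-8 modification is undone while the time-4 one survives.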
Outline of paper. Section 2 describes LS details. Section 3 presents the heuristic adaptation method (ALS -- a simple, adaptive, incremental extension of LS related to the linear reward-inaction algorithm, e.g., [Kaelbling, 1993]). Section 4 briefly reviews EIRA and shows how to combine it with ALS. Section 5 presents results: in an illustrative application involving a maze that has many more states and obstacles than mazes solved by previous authors working on POMDPs, we show how LS can solve partially observable maze tasks with huge state spaces and non-trivial but low-complexity solutions (Q-learning fails to solve such tasks). Then we show that ALS can use previous experience to significantly reduce search time. Finally, we show that ALS augmented by EIRA can clearly outperform ALS by itself. Section 6 presents conclusions.