DS Advantage 2: No Markovian Restrictions

Convergence proofs for DPRL also require that the learner's current input convey all information about the current state (or at least about the optimal next action). In the real world, however, the current sensory input typically tells next to nothing about the "current state of the world," if there is such a thing at all. Typically, memory of previous events is required to disambiguate inputs. For instance, as your eyes sequentially scan the visual scene dominated by this text, you continually decide which parts (or possibly compressed descriptions thereof) deserve to be represented in short-term memory. And you have presumably learned to do this, apparently by some unknown, sophisticated RL method fundamentally different from DPRL.
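
To make this concrete, here is a minimal Python sketch of a toy T-maze (my own illustration, not an example from the original text; names such as TMaze and reactive_policy are hypothetical): a cue is visible only at the first step, all later corridor observations are identical, and only the final turn is rewarded. Because the observation at the junction is the same for both cue values, no memory-free reactive policy can do better than chance.

    import random

    class TMaze:
        """Cue visible only at the first step; corridor observations identical;
        reward 1 only if the final action matches the cue seen at the start."""
        def __init__(self, corridor_length=4):
            self.corridor_length = corridor_length

        def run_episode(self, policy):
            cue = random.choice([0, 1])
            observations = [('cue', cue)] + [('corridor', None)] * self.corridor_length
            action = None
            for obs in observations:
                action = policy(obs)   # only the last action (the turn) matters
            return 1.0 if action == cue else 0.0

    def reactive_policy(obs):
        """Memory-free policy: the current observation alone determines the action.
        At the junction the observation is always ('corridor', None), so the turn
        cannot depend on the cue; expected reward stays at the 0.5 chance level."""
        kind, value = obs
        return value if kind == 'cue' else random.choice([0, 1])

    env = TMaze()
    print(sum(env.run_episode(reactive_policy) for _ in range(1000)) / 1000)  # about 0.5

Remembering the cue in some internal variable solves the task perfectly; the direct-search sketch at the end of this section does exactly that.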

Some DPRL variants such as $Q(\lambda)$ are limited to a very special kind of exponentially decaying short-term memory. Others simply ignore memory issues by focusing on suboptimal, memory-free solutions to problems whose optimal solutions do require some form of short-term memory [Jaakkola et al. 1995]. Still others can in principle find optimal solutions even in partially observable environments (POEs) [Kaelbling et al. 1995, Littman et al. 1995], but they (a) are practically limited to very small problems [Littman 1996], and (b) require knowledge of a discrete state-space model of the environment. To varying degrees, problem (b) also holds for certain hierarchical RL approaches to memory-based input disambiguation [Ring 1991, Ring 1993, Ring 1994, McCallum 1996, Wiering and Schmidhuber 1998]. Although no discrete models are necessary for DPRL systems with function approximators based on recurrent neural networks [Schmidhuber 1991c, Lin 1993], the latter suffer from a lack of theoretical foundation, perhaps even more so than the backgammon player.
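
To illustrate what this "exponentially decaying short-term memory" looks like, here is a minimal tabular sketch of Watkins' $Q(\lambda)$ (a standard textbook formulation added for illustration, not code from the cited work; the constants and table sizes are arbitrary). The eligibility trace e is the only memory the algorithm keeps, and it shrinks by the fixed factor $\gamma \lambda$ at every step, so events more than a few steps in the past have virtually no influence on credit assignment.

    import numpy as np

    n_states, n_actions = 10, 2
    alpha, gamma, lam = 0.1, 0.95, 0.8        # learning rate, discount, trace decay

    Q = np.zeros((n_states, n_actions))       # action-value table
    e = np.zeros((n_states, n_actions))       # eligibility traces: the short-term memory

    def q_lambda_update(s, a, r, s_next, a_next):
        """One Watkins' Q(lambda) update for the transition (s, a, r, s_next, a_next)."""
        a_greedy = int(np.argmax(Q[s_next]))
        delta = r + gamma * Q[s_next, a_greedy] - Q[s, a]
        e[s, a] += 1.0                        # remember that (s, a) was just visited
        Q[:] += alpha * delta * e             # credit all recently visited pairs
        if a_next == a_greedy:
            e[:] *= gamma * lam               # the memory decays exponentially at every step
        else:
            e[:] = 0.0                        # Watkins' variant cuts traces after exploratory actions

    # example transition: state 3, action 1, reward 0, next state 4, next action 0
    q_lambda_update(s=3, a=1, r=0.0, s_next=4, a_next=0)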

DS, however, does not depend on Markovian conditions or full observability of the environment at all. While DPRL is essentially limited to learning reactive policies mapping current inputs to output actions, DS can in principle be applied to search spaces whose elements are general algorithms or programs with time-varying variables that can be used for memory purposes [Williams 1992, Teller 1994, Schmidhuber 1995, Wiering and Schmidhuber 1996, Sałustowicz and Schmidhuber 1997].
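
A minimal sketch of this idea, under assumptions of my own (a compact numeric variant of the T-maze above; names like run_episode and average_return are hypothetical): random search, perhaps the simplest DS method, over the four parameters of a tiny program whose internal variable memory carries information across time steps. Each candidate is judged only by the reward its episodes collect; no value function, no Markovian state estimate, and no model of the environment are involved.

    import math
    import random

    def run_episode(w_in, w_mem, w_out, b, cue, corridor_length=4):
        """Compact T-maze: the cue (0 or 1) is observed only at the first step,
        every corridor observation is 0, and reward 1 is paid only if the final
        action equals the cue.  `memory` is the candidate program's time-varying
        variable; a memory-free reactive policy cannot exceed 0.5 on average."""
        memory = 0.0
        observations = [float(cue)] + [0.0] * corridor_length
        for obs in observations:
            memory = math.tanh(w_in * obs + w_mem * memory)   # internal state update
        action = 1 if w_out * memory + b > 0.0 else 0
        return 1.0 if action == cue else 0.0

    def average_return(params):
        return sum(run_episode(*params, cue=c) for c in (0, 1)) / 2.0

    # Direct search: draw candidate parameter vectors at random and keep the best,
    # judging each candidate purely by its average episode return.
    best_params, best_return = None, -1.0
    for _ in range(2000):
        candidate = [random.uniform(-2.0, 2.0) for _ in range(4)]
        ret = average_return(candidate)
        if ret > best_return:
            best_params, best_return = candidate, ret
    print(best_return)   # typically 1.0: some candidate stores the cue in `memory`

A more sophisticated DS method (hill climbing, evolutionary search, or program evolution as in the cited work) would replace only the sampling loop; the evaluation of each candidate by accumulated reward stays the same.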


Juergen Schmidhuber 2003-02-19

