Some DPRL variants are limited to a very special kind of exponentially decaying short-term memory. Others simply ignore memory issues by focusing on suboptimal, memory-free solutions to problems whose optimal solutions do require some form of short-term memory [Jaakkola et al. 1995]. Still others can in principle find optimal solutions even in partially observable environments (POEs) [Kaelbling et al. 1995, Littman et al. 1995], but they (a) are in practice limited to very small problems [Littman 1996], and (b) require knowledge of a discrete state space model of the environment. To varying degrees, problem (b) also holds for certain hierarchical RL approaches to memory-based input disambiguation [Ring 1991, Ring 1993, Ring 1994, McCallum 1996, Wiering and Schmidhuber 1998]. Although no discrete models are necessary for DPRL systems with function approximators based on recurrent neural networks [Schmidhuber 1991c, Lin 1993], the latter do suffer from a lack of theoretical foundation, perhaps even more so than the backgammon player.
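To make the suboptimality of memory-free policies concrete, consider a minimal illustrative task (our own example, not drawn from the cited papers): the agent first observes a signal naming the rewarding action, then must act at a "junction" whose observation is identical for both signals. Any policy mapping only the current input to an action is stuck near chance level at the junction, while one bit of short-term memory suffices for optimal behavior:

```python
import random

def run_episode(act, observe=None):
    """One episode of a hypothetical 'signal-then-junction' task.
    Step 1: the agent sees a signal (0 or 1) naming the rewarding action.
    Step 2: it sees only the aliased symbol 2 and must choose action 0 or 1."""
    signal = random.randint(0, 1)
    if observe:
        observe(signal)   # a memory-based policy may store the signal here
    action = act(2)       # the junction observation is 2 for both signals
    return 1 if action == signal else 0

# Memory-free (reactive) policy: the action depends only on the current input,
# so it must answer the same way at the aliased junction in every episode.
def reactive_act(obs):
    return 0              # the best fixed choice still succeeds only half the time

# Memory-based policy: a single internal variable disambiguates the junction.
memory = {"signal": 0}
def remember(signal):
    memory["signal"] = signal
def memory_act(obs):
    return memory["signal"]

trials = 10_000
reactive_score = sum(run_episode(reactive_act) for _ in range(trials)) / trials
memory_score = sum(run_episode(memory_act, remember) for _ in range(trials)) / trials
print(f"reactive: {reactive_score:.2f}, with memory: {memory_score:.2f}")
# reactive hovers near 0.5; the memory policy scores exactly 1.0
```

No memory-free policy can exceed 0.5 expected reward here, since the junction observation carries no information about the signal.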
DS, however, does not depend at all on Markovian conditions or full observability of the environment. While DPRL is essentially limited to learning reactive policies mapping current inputs to output actions, DS can in principle be applied to search spaces whose elements are general algorithms or programs with time-varying variables that can be used for memory purposes [Williams 1992, Teller 1994, Schmidhuber 1995, Wiering and Schmidhuber 1996, Saustowicz and Schmidhuber 1997].
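The point can be sketched in a few lines. Below, each candidate in the search space is a tiny program with one time-varying memory cell, evaluated only by the return of complete episodes on an aliased task (the agent sees a signal, then must act at a junction whose observation hides it); the program encoding, the task, and the exhaustive search scheme are illustrative assumptions, not taken from any cited paper:

```python
import itertools
import random

def run(program, episodes=500):
    """Average reward of a two-instruction program with one memory cell.
    program = (store, out): store decides whether the signal is written to
    the memory cell; out selects the junction action as a constant (0 or 1),
    the memory value (2), or its complement (3)."""
    store, out = program
    total = 0
    for _ in range(episodes):
        signal = random.randint(0, 1)
        mem = signal if store else 0                 # time-varying variable
        action = {0: 0, 1: 1, 2: mem, 3: 1 - mem}[out]
        total += action == signal
    return total / episodes

# Direct search: enumerate (or, in larger spaces, stochastically sample) the
# program space, judging each candidate purely by its episodic return -- no
# value function and no Markov assumption about the observations is needed.
best = max(itertools.product((0, 1), (0, 1, 2, 3)), key=run)
print(best, run(best))
# the program that stores the signal and outputs the memory cell scores 1.0
```

Reactive programs (those ignoring the memory cell) top out near 0.5 on this task, so the search reliably settles on a memory-using program.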