next up previous
Next: Acknowledgments Up: Sequential Decision Making Based Previous: Relation to Market Models


Direct search (DS) in policy space offers several advantages over traditional reinforcement learning (RL). For instance, DS does not need a priori information about world states and the topology of their interactions. It does not care whether the environment is fully observable. It makes hierarchical credit assignment conceptually trivial, and also allows for many alternative, non-hierarchical types of abstract credit assignment.

Existing DS methods, however, do suffer from fundamental problems in presence of environmental stochasticity and/or unknown temporal delays between actions and observable effects. In particular, they do not have a principled way of deciding when to stop policy evaluations.

Stochastic policy evaluation by the success-story algorithm (SSA) differs from traditional DS. SSA never quits evaluating any previous policy change that has not yet been undone for lack of empirical evidence that it has contributed to a lifelong reward acceleration. Each invocation of SSA retrospectively establishes a success history of surviving self-modifications: only policy changes that have empirically proven their long-term usefulness so far get another chance to justify themselves. This stabilizes the ``truly useful'' policy changes in the long run.

Unlike many traditional value function-based RL methods, SSA is not limited to fully observable worlds, and does not require discounting of future rewards. It shares these advantages with traditional DS algorithms. Unlike stochastic hill-climbing and other DS methods such as genetic algorithms, however, SSA does not heavily depend on a priori knowledge about reasonable trial lengths necessary to collect sufficient statistics for estimating long-term consequences and true values of tested policies.

On the other hand, many DS methods can be augmented by SSA in a straight-forward way: just measure the time used up by all actions, policy modifications, and policy tests, and occasionally insert checkpoints that invoke SSA. In this sense SSA's basic concepts are not algorithm-specific -- instead they reflect a novel, general way of thinking about how ``true" performance should be measured in RL systems using DS in policy space.

Since SSA automatically collects statistics about long-term effects of earlier policy changes on later ones, it is of interest for improving the credit assignment method itself [Schmidhuber, Zhao, SchraudolphSchmidhuber et al.1997a].

Although the present paper's illustrative SSA application is much less complex than our previous ones, it is the first to provide insight into SSA's fundamental advantages over traditional DS methods in case of stochastic policy evaluations and unknown temporal delays between causes and effects.

Market models are similar to traditional DPRL in the sense that the bids of their agents correspond to predictions of future reward used in DPRL algorithms such as Q-learning. On the other hand, they are similar to DS in the sense that they incorporate evolutionary pressure and do not obviously require full environmental observability and Markovian conditions but can be applied to agents whose policies are drawn from general program spaces. For example, bankrupt agents who spent all their money are usually replaced by mutations of more successful agents. This introduces Darwinian selection absent in traditional DPRL.

SSA shares certain aspects of market models. In particular, at a given time any existing policy modification's reward/time ratio measured by SSA may be viewed as an investment into the future. The policy modification will fail to survive once it fails to generate enough return (including the reward obtained by its own ``children") to exceed the corresponding investment of its ``parent", namely, the most recent previous, still existing policy modification.

Future research will hopefully show when to prefer pure market models over SSA and vice versa. For instance, it will be interesting to study whether the former can efficiently deal with long, unknown causal delays and highly stochastic policies and environments.

next up previous
Next: Acknowledgments Up: Sequential Decision Making Based Previous: Relation to Market Models
Juergen Schmidhuber 2003-02-19

Back to Reinforcement Learning and POMDP page