
CONCLUSION

It is easy to show that there can be no algorithm for general, unknown environments that is guaranteed to continually increase reinforcement intake per fixed time interval. For this reason, the reinforcement acceleration criterion (SSC) relaxes standard measures of performance improvement by allowing for the consideration of arbitrary time intervals. The success-story algorithm (SSA) is guaranteed to achieve lifelong performance improvement according to this relaxed criterion. Since SSA takes into account all recursive long-term effects of policy modifications on all later policy modifications, and since it is indifferent to the nature of the policy-modification processes themselves, it provides a sound theoretical framework for ``meta-learning'', ``meta-meta-learning'', etc. Since SSA is environment-independent, it also provides a sound theoretical framework for ``multi-agent learning'', where each agent learning by SSA is part of the environment of the other agents.
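
To make the backtracking behind this guarantee concrete, here is a minimal Python sketch (not taken from the paper) of a generic SSA step: still-valid policy modifications sit on a stack of checkpoints, and the most recent ones are popped and undone while reinforcement per time since the topmost checkpoint does not exceed reinforcement per time since its predecessor. The Checkpoint record, the undo callback, and the function name ssa_backtrack are illustrative assumptions.

```python
# Minimal sketch of an SSA backtracking step (illustrative names and types).
# Assumes the current time t is positive and strictly larger than every
# checkpoint time on the stack.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Checkpoint:
    time: float               # time at which a policy modification became active
    reward_so_far: float      # cumulative reinforcement at that moment
    undo: Callable[[], None]  # restores the policy to its pre-modification state


def ssa_backtrack(stack: List[Checkpoint], t: float, total_reward: float) -> None:
    """Pop (and undo) the most recent checkpoints while reinforcement per time
    since the topmost checkpoint does not exceed reinforcement per time since
    its predecessor; the system's birth (time 0, reward 0) serves as the
    implicit bottom entry."""
    while stack:
        top = stack[-1]
        if len(stack) >= 2:
            prev = stack[-2]
            prev_rate = (total_reward - prev.reward_so_far) / (t - prev.time)
        else:
            prev_rate = total_reward / t   # average reward rate since birth
        top_rate = (total_reward - top.reward_so_far) / (t - top.time)
        if top_rate > prev_rate:
            break                          # criterion satisfied at the top
        stack.pop().undo()                 # modification did not pay off: undo it
```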

There are many different ways of implementing SSA. Two of them are described in this paper. The first leads to a ``self-referential'' system using assembler-like primitive instructions to modify its own policy -- the system's learning mechanism is embedded within the system itself and is accessible to self-manipulation. The second implementation leads to a general reinforcement learning algorithm for recurrent nets. Alternatively, the PMP from section 1 may be designed to execute arbitrary conventional or non-conventional learning or search algorithms.
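
As a rough illustration of the first implementation's flavour (not the paper's actual instruction set), the sketch below represents the policy as a matrix of instruction probabilities and includes a primitive, here called inc_prob, that modifies that very matrix. Since such a primitive is itself selected according to the policy, the learning mechanism becomes part of what is being learned. The class name, the uniform initialisation, and the renormalisation scheme are illustrative assumptions.

```python
# Illustrative sketch of a self-modifying, probabilistic instruction policy.

import random


class SelfModifyingPolicy:
    def __init__(self, n_addresses: int, n_instructions: int):
        # policy[a][i]: probability of selecting instruction i at address a
        self.policy = [[1.0 / n_instructions] * n_instructions
                       for _ in range(n_addresses)]

    def select_instruction(self, address: int) -> int:
        """Draw the next instruction according to the current policy."""
        row = self.policy[address]
        return random.choices(range(len(row)), weights=row)[0]

    def inc_prob(self, address: int, instruction: int, factor: float = 1.1):
        """A primitive instruction that edits the policy itself:
        increase one entry, then renormalise the row."""
        row = self.policy[address]
        row[instruction] *= factor
        total = sum(row)
        self.policy[address] = [p / total for p in row]
```

In the systems described in the paper, such self-modifications would be the events whose long-term effects SSA evaluates and, if necessary, undoes.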

Other SSA applications. In recent work [50,42], we combine SSA and Levin search (LS) [18,20] to solve partially observable Markov decision problems (POMDPs), which have received a lot of attention in the reinforcement learning community. LS is theoretically optimal for a wide variety of search problems, including many POMDPs. We show that LS can solve partially observable mazes (POMs) involving many more states and obstacles than those solved by various previous authors (here, LS can also easily outperform Q-learning). We then note, however, that LS is not necessarily optimal for ``incremental'' learning problems, where experience with previous problems may help to reduce future search costs. For this reason, we introduce a heuristic, adaptive extension of LS (ALS) which uses experience to increase the probabilities of instructions occurring in successful programs found by LS. To deal with cases where ALS does not lead to long-term performance improvement, we use SSA as a safety belt. Experiments with additional POMs demonstrate that (a) ALS can dramatically reduce the search time consumed by successive calls of LS, and (b) additional significant speed-ups can be obtained by combining ALS and SSA.
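
The ALS bias shift can be pictured with the following sketch (illustrative; the function name als_update, the learning rate gamma, the additive update, and the renormalisation are assumptions, not the paper's exact rule): after LS finds a successful program, the instructions it used gain probability, and a copy of the old distribution is returned so that an SSA-style undo can restore it if the shift does not pay off in the long run.

```python
# Illustrative sketch of an adaptive-Levin-search (ALS) style probability update.

from typing import List, Tuple


def als_update(probs: List[float], successful_program: List[int],
               gamma: float = 0.05) -> Tuple[List[float], List[float]]:
    """Shift probability mass towards the instructions occurring in a program
    found by Levin search; return the new distribution together with a copy
    of the old one so the update can later be undone."""
    old_probs = list(probs)                      # checkpoint for possible undo
    new_probs = list(probs)
    for instr in successful_program:
        new_probs[instr] += gamma                # reward instructions that occurred
    total = sum(new_probs)
    new_probs = [p / total for p in new_probs]   # renormalise to a distribution
    return new_probs, old_probs
```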

In other recent work [53], we use SSA for a multi-agent system whose agents are much more complex than the ones in section 4. In fact, each agent uses incremental self-improvement as described in section 2. Experiments demonstrate the multi-agent system's effectiveness. For instance, a system consisting of three co-evolving agents chasing each other learns rather sophisticated, stochastic predator and prey strategies. Additional applications of SSA to quite challenging, complex multi-agent tasks are described in [41,40].

