The success-story algorithm collects more information than previous RL methods about long-term effects of policy changes and ``bias shifts''. In contrast to traditional RL approaches, time is not reset at trial boundaries. Instead, SSA measures the total reward received and the total time consumed by learning and policy tests during all trials following some bias shift: bias shifts are evaluated by measuring their long-term effects on later learning. Bias shifts are undone once there is empirical evidence that they have not set the stage for long-term performance improvement. No bias shift is safe forever, but in many regular environments the survival probabilities of useful bias shifts will approach unity if they can justify themselves by contributing to long-term reward accelerations.
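The bookkeeping described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes each still-valid bias shift is kept on a stack together with the total time consumed and total reward received at the moment of the shift, plus an undo action, and that the success-story criterion (SSC) demands that reward per unit time since each shift exceed reward per unit time since the preceding still-valid shift. The function name `ssa_check` and the stack layout are illustrative.

```python
def ssa_check(stack, t_now, r_now):
    """Undo bias shifts that violate the success-story criterion (SSC).

    `stack` holds (t, r, undo) triples for still-valid bias shifts:
    total time consumed and total reward received when the shift was
    made, plus a callable that reverses the shift.  stack[0] is a
    sentinel (0, 0, no-op) representing the start of the learner's life.
    """
    while len(stack) > 1:
        t1, r1, undo = stack[-1]    # most recent still-valid shift
        t0, r0, _ = stack[-2]       # the shift before it (or the sentinel)
        # SSC: reward speed since the latest shift must exceed the
        # reward speed since the previous still-valid shift.
        if (r_now - r1) / (t_now - t1) > (r_now - r0) / (t_now - t0):
            break                   # SSC holds; every remaining shift survives
        stack.pop()                 # shift failed to justify itself:
        undo()                      # restore the previous policy/bias
    return stack
```

Note that popping one shift can invalidate the shift below it, so the check loops until the whole remaining stack satisfies the SSC; a shift "survives" only as long as it keeps contributing to a reward acceleration.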

The more powerful an SMP's instruction set, the more powerful the stochastic learning algorithms it may construct, and the more general the bias shifts it can compute. The more general the bias shifts, the harder the credit assignment problem -- but SSA provides a way of forcing even SMPs with very general (e.g., Turing machine-equivalent) instruction sets to focus on bias shifts causing histories of long-term performance improvements.

Previous work. Unlike SMP/SSA, Lenat's approach [Lenat:83] requires a human to define the ``interestingness'' of the concepts to be explored. Moreover, many essential aspects of system behavior in previous work on metalearning [Lenat:83,SOAR:93] are not accessible to self-modification.

The meta-evolution scheme [Schmidhuber:87] attempts to learn learning algorithms through a potentially infinite hierarchy of Genetic Programming levels: higher-level pools contain program-modifying programs that are applied to lower-level programs and rewarded recursively based on lower-level performance. Unfortunately, there is no theoretical assurance that meta-evolution spends its overall computation time wisely.
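To make the recursive reward flow concrete, here is a toy two-level sketch of the idea (the scheme itself allows arbitrarily many levels). Everything here is illustrative and greatly simplified: level-0 "programs" are bit strings scored by counting ones, and the level-1 "program-modifying programs" are reduced to bare mutation rates that earn credit whenever the modification they produce improves the lower-level pool.

```python
import random

def meta_evolve(generations=200, seed=0):
    rng = random.Random(seed)
    # level 0: candidate programs (bit strings); fitness = number of ones
    base = [[rng.randint(0, 1) for _ in range(16)] for _ in range(8)]
    fitness = lambda p: sum(p)
    # level 1: program-modifying programs, here just fixed mutation rates
    meta = [{"rate": r, "credit": 0.0} for r in (0.01, 0.05, 0.2, 0.5)]
    for _ in range(generations):
        m = rng.choice(meta)                     # pick a modifier program
        parent = max(base, key=fitness)          # apply it to the current best
        child = [b ^ (rng.random() < m["rate"]) for b in parent]
        # replace the worst level-0 program if the child improves on it
        worst = min(range(len(base)), key=lambda i: fitness(base[i]))
        gain = fitness(child) - fitness(base[worst])
        if gain > 0:
            base[worst] = child
            m["credit"] += gain                  # reward flows up one level
    return base, meta
```

In the full scheme the meta pool would itself be modified by a yet-higher pool, rewarded by meta-level credit, and so on recursively; the point of the sketch is only the upward flow of reward, not the open problem the paragraph names, namely that nothing guarantees this hierarchy allocates its total computation time well.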

Self-modifying recurrent neural nets [Schmidhuber:93selfreficann,Schmidhuber:93selfrefieee] can run their own weight change algorithms: activations of special output units are used to address, read, and modify all of the network's own weights. Gradient descent in a standard temporal error function then corresponds to gradient descent in the space of weight change algorithms, or learning algorithms. Interestingly, this requires ``only'' $O(n^4 \log n)$ operations per time step. One problem with this wild idea is the large number of local minima.

In contrast to SMP/SSA, no previous metalearning approach evaluates learning processes by measuring their long-term effects on subsequent learning in terms of reward intake speed.

Limitations. (1) Like any machine learning algorithm, SMP/SSA suffers from the fundamental limitations mentioned in the first paragraph of this paper. (2) SSA does not make much sense in antagonistic environments in which reward constantly decreases no matter what the learner does. In such environments the SSC will be satisfiable only trivially. True success stories are possible only in ``fair'', regular environments that do allow for long-term reward speed-ups; this does include certain zero-sum games, though [Jieyu:96self]. (3) We do not gain much by applying SMP/SSA to, say, simple ``Markovian'' mazes for which efficient RL methods based on dynamic programming already exist. In certain more realistic situations where standard RL approaches tend to fail, however, our method is of interest.

Application. A particular SMP/SSA implementation based on very simple, assembler-like instructions has been used successfully to solve a complex POE. To our knowledge, such complex POEs have not been solved by previous POE algorithms, which appear to work only for tiny state spaces [Littman:94].

Outlook. There are many alternative ways of implementing SMP/SSA -- for instance, instructions may be highly complex learning algorithms. Future work will focus on plugging a whole variety of well-known algorithms into SMP/SSA, and letting it pick and combine the best, problem-specific ones.

Juergen Schmidhuber
