Reward/Goal. Occasionally provides real-valued reward. is the cumulative reward obtained between time 0 and time , where . At time the learner's goal is to accelerate long-term reward intake: it wants to let exceed the current average reward intake. To compute the ``current average reward intake'' a previous point to compute is required. How to specify in a general yet reasonable way? For instance, if life consists of many successive ``trials'' with non-deterministic outcome, how many trials must we look back in time? This question will be addressed by the success-story algorithm (SSA) below.
Initialization. At time (system birth), we initialize the learner's variable internal state , a vector of variable, binary or real-valued components. Environmental inputs may be represented by certain components of . We also initialize the vector-valued policy . 's -th variable component is denoted . There is a set of possible actions to be selected and executed according to current and . For now, there is no need to specify -- this will be done in the experimental sections (typically, will be a conditional probability distribution on the possible next actions, given current ). We introduce an initially empty stack that allows for stack entries with varying sizes, and the conventional push and pop operations.
Basic cycle. Between time 0 (system birth) and time (system death) the following basic cycle is repeated over and over again:
IF there is no ``tag'' (a pair of time and cumulative reward until then) stored somewhere in ,
THEN push the tag (, ) onto , and go to 3 (this ends the current SSA call).
ELSE denote the topmost tag in by (, ). Denote the one below by (, ) (if there is not any tag below, set variable -- recall ).
ELSE pop off all stack entries above the one for tag (, ) (these entries will be former policy components saved during earlier executions of step 3), and use them to restore as it was be before time . Then also pop off the tag (, ). Go to SSA.1.
SSA ensures life-time success stories.
At a given time in the
learner's life, define the set of currently valid times as those
previous times still stored in tags somewhere in . If this set
is not empty right before tag is pushed in step SSA.2 of
the basic cycle, then let
denote
the -th valid time, counted from the bottom of . It is easy
to show (Schmidhuber, 1994, 1996) that the current SSA call will have
enforced the following ``success-story criterion'' SSC ( is the
in the most recent step SSA.2):
(1) |
SSA's generalization assumption. Since life is one-way (time is never reset), during each SSA call the system has to generalize from a single experience concerning the usefulness of any previous policy modification: the average reward per time since then. At the end of each SSA call, until the beginning of the next one, the only temporary generalization assumption for inductive inference is: -modifications that survived all previous SSA calls will remain useful. In absence of empirical evidence to the contrary, each still valid sequence of -modifications is assumed to have successfully set the stage for later ones. What has appeared useful so far will get another chance to justify itself.
When will SSC be satisfiable in a non-trivial way? In irregular and random environments there is no way of justifying permanent policy modifications by SSC. Also, a trivial way of satisfying SSC is to never make a modification. Let us assume, however, that , , and action set (representing the system's initial bias) do indeed allow for -modifications triggering long-term reward accelerations. This is an instruction set-dependent assumption much weaker than the typical Markovian assumptions made in previous RL work, e.g., [#!Kumar:86!#,#!Sutton:88!#,#!WatkinsDayan:92!#,#!Williams:92!#]. Now, if we prevent all instruction probabilities from vanishing (see concrete implementations in sections 3/4), then the system will execute -modifications occasionally, and keep those consistent with SSC. In this sense, it cannot help getting better. Essentially, the system keeps generating and undoing policy modifications until it discovers some that indeed fit its generalization assumption.
Greediness? SSA's strategy appears greedy. It always keeps the policy that was observed to outperform all previous policies in terms of long-term reward/time ratios. To deal with unknown reward delays, however, the degree of greediness is learnable -- SSA calls may be triggered or delayed according to the modifiable policy itself.
Actions can be almost anything. For instance, an action executed in step 3 may be a neural net algorithm. Or it may be a Bayesian analysis of previous events. While this analysis is running, time is running, too. Thus, the complexity of the Bayesian approach is automatically taken into account. In section 3 we will actually plug in an adaptive Levin search extension. Similarly, actions may be calls of a Q-learning variant -- see experiments in [#!Schmidhuber:96meta!#]. Plugging Q into SSA makes sense in situations where Q by itself is questionable because the environment might not satisfy the preconditions that would make Q sound. SSA will ensure, however, that at least each policy change in the history of all still valid policy changes will represent a long-term improvement, even in non-Markovian settings.
Limitations. (1) In general environments neither SSA nor any other scheme is guaranteed to continually increase reward intake per fixed time interval, or to find the policy that will lead to maximal cumulative reward. (2) No reasonable statements can be made about improvement speed which indeed highly depends on the nature of the environment and the choice of initial, ``primitive'' actions (including learning algorithms) to be combined according to the policy. This lack of quantitative convergence results is shared by many other, less general RL schemes though (recall that Q-learning is not guaranteed to converge in finite time).
Outline of remainder. Most of our paper will be about plugging various policy-modifying algorithms into the basic cycle. Despite possible implementation-specific complexities the overall concept is very simple. Sections 3 and 4 will describe two concrete implementations. The first implementation's action set consists of a single but ``strong'' policy-modifying action (a call of a Levin search extension). The second implementation uses many different, less ``powerful'' actions. They resemble assembler-like instructions from which many different policies can be built (the system's modifiable learning strategy is able to modify itself). Experimental case studies will involve complex environments where standard RL algorithms fail. Section 5 will conclude.