Next: Implementation 1: Plugging LS Up: Shifting Inductive Bias with Previous: Introduction / Overview

Basic Set-Up and SSA

Reward/Goal. Occasionally provides real-valued reward. is the cumulative reward obtained between time 0 and time , where . At time the learner's goal is to accelerate long-term reward intake: it wants to let $\frac{R(T) - R(t)}{T - t}$ exceed the current average reward intake. To compute the ``current average reward intake'' a previous point to compute $\frac{R(t) - R(t')}{t - t'}$ is required. How to specify in a general yet reasonable way? For instance, if life consists of many successive ``trials'' with non-deterministic outcome, how many trials must we look back in time? This question will be addressed by the success-story algorithm (SSA) below.

Initialization. At time (system birth), we initialize the learner's variable internal state $\cal I$ , a vector of variable, binary or real-valued components. Environmental inputs may be represented by certain components of $\cal I$ . We also initialize the vector-valued policy . 's -th variable component is denoted . There is a set of possible actions to be selected and executed according to current and $\cal I$ . For now, there is no need to specify -- this will be done in the experimental sections (typically, will be a conditional probability distribution on the possible next actions, given current $\cal I$ ). We introduce an initially empty stack $\cal S$ that allows for stack entries with varying sizes, and the conventional push and pop operations.

Basic cycle. Between time 0 (system birth) and time (system death) the following basic cycle is repeated over and over again:

1.

Execute actions selected according to

and $\cal I$ (this may change environment and $\cal I$ ), until a certain EVALUATION CRITERION is satisfied, or until an action is selected that will modify

2.

IF the EVALUATION CRITERION is satisfied, THEN call SSA, which backtracks and undoes certain previous

-modifications if necessary (to ensure that the history of still valid modifications corresponds to a history of reward accelerations):

SSA.1.

Set variable

equal to current time.

IF there is no ``tag'' (a pair of time and cumulative reward until then) stored somewhere in $\cal S$ ,

THEN push the tag (, ) onto $\cal S$ , and go to 3 (this ends the current SSA call).

ELSE denote the topmost tag in $\cal S$ by (, ). Denote the one below by (, ) (if there is not any tag below, set variable -- recall ).

SSA.2.

$\begin{displaymath}\frac{R(t) - R(t')}{t - t'} > \frac{R(t) - R(t'')}{t - t''} \end{displaymath}$

THEN push tag (

), and go to 3. This ends the current SSA call.

ELSE pop off all stack entries above the one for tag (, ) (these entries will be former policy components saved during earlier executions of step 3), and use them to restore as it was be before time . Then also pop off the tag (, ). Go to SSA.1.

3.

IF the most recent action selected in step 1 will modify

, THEN push copies of those

to be modified onto $\cal S$ , and execute the action.

4.

IF some TERMINATION CRITERION is satisfied, THEN die. ELSE go to step 1.

SSA ensures life-time success stories. At a given time in the learner's life, define the set of currently valid times as those previous times still stored in tags somewhere in $\cal S$ . If this set is not empty right before tag is pushed in step SSA.2 of the basic cycle, then let $t_i ~~ (i \in \{1, 2, \ldots, V(t) \})$ denote the -th valid time, counted from the bottom of $\cal S$ . It is easy to show (Schmidhuber, 1994, 1996) that the current SSA call will have enforced the following ``success-story criterion'' SSC ( is the in the most recent step SSA.2):

$\begin{displaymath}\frac{R(t)}{t} < \frac{R(t) - R(t_1)}{t - t_1} < \frac{R(t) -... ...t - t_2} < \ldots < \frac{R(t) - R(t_{V(t)}) }{t - t_{V(t)}}. \end{displaymath}$

(1)

SSC demands that each still valid time marks the beginning of a long-term reward acceleration measured up to the current time

. Each

-modification that survived SSA represents a bias shift whose start marks a long-term reward speed-up. In this sense, the history of still valid bias shifts is guaranteed to be a life-time success story (in the worst case an empty one). No Markov-assumption is required.

SSA's generalization assumption. Since life is one-way (time is never reset), during each SSA call the system has to generalize from a single experience concerning the usefulness of any previous policy modification: the average reward per time since then. At the end of each SSA call, until the beginning of the next one, the only temporary generalization assumption for inductive inference is: -modifications that survived all previous SSA calls will remain useful. In absence of empirical evidence to the contrary, each still valid sequence of -modifications is assumed to have successfully set the stage for later ones. What has appeared useful so far will get another chance to justify itself.

When will SSC be satisfiable in a non-trivial way? In irregular and random environments there is no way of justifying permanent policy modifications by SSC. Also, a trivial way of satisfying SSC is to never make a modification. Let us assume, however, that , $\cal I$ , and action set $\cal A$ (representing the system's initial bias) do indeed allow for -modifications triggering long-term reward accelerations. This is an instruction set-dependent assumption much weaker than the typical Markovian assumptions made in previous RL work, e.g., [#!Kumar:86!#,#!Sutton:88!#,#!WatkinsDayan:92!#,#!Williams:92!#]. Now, if we prevent all instruction probabilities from vanishing (see concrete implementations in sections 3/4), then the system will execute -modifications occasionally, and keep those consistent with SSC. In this sense, it cannot help getting better. Essentially, the system keeps generating and undoing policy modifications until it discovers some that indeed fit its generalization assumption.

Greediness? SSA's strategy appears greedy. It always keeps the policy that was observed to outperform all previous policies in terms of long-term reward/time ratios. To deal with unknown reward delays, however, the degree of greediness is learnable -- SSA calls may be triggered or delayed according to the modifiable policy itself.

Actions can be almost anything. For instance, an action executed in step 3 may be a neural net algorithm. Or it may be a Bayesian analysis of previous events. While this analysis is running, time is running, too. Thus, the complexity of the Bayesian approach is automatically taken into account. In section 3 we will actually plug in an adaptive Levin search extension. Similarly, actions may be calls of a Q-learning variant -- see experiments in [#!Schmidhuber:96meta!#]. Plugging Q into SSA makes sense in situations where Q by itself is questionable because the environment might not satisfy the preconditions that would make Q sound. SSA will ensure, however, that at least each policy change in the history of all still valid policy changes will represent a long-term improvement, even in non-Markovian settings.

Limitations. (1) In general environments neither SSA nor any other scheme is guaranteed to continually increase reward intake per fixed time interval, or to find the policy that will lead to maximal cumulative reward. (2) No reasonable statements can be made about improvement speed which indeed highly depends on the nature of the environment and the choice of initial, ``primitive'' actions (including learning algorithms) to be combined according to the policy. This lack of quantitative convergence results is shared by many other, less general RL schemes though (recall that Q-learning is not guaranteed to converge in finite time).

Outline of remainder. Most of our paper will be about plugging various policy-modifying algorithms into the basic cycle. Despite possible implementation-specific complexities the overall concept is very simple. Sections 3 and 4 will describe two concrete implementations. The first implementation's action set consists of a single but ``strong'' policy-modifying action (a call of a Levin search extension). The second implementation uses many different, less ``powerful'' actions. They resemble assembler-like instructions from which many different policies can be built (the system's modifiable learning strategy is able to modify itself). Experimental case studies will involve complex environments where standard RL algorithms fail. Section 5 will conclude.

Next: Implementation 1: Plugging LS Up: Shifting Inductive Bias with Previous: Introduction / Overview

Juergen Schmidhuber 2003-02-25

Back to Optimal Universal Search page
Back to Reinforcement Learning page
Back to Program Evolution page