System resources. The learner lives from time 0 to unknown time T in an unknown environment E. It has an internal state S and a policy SMP. Both S and SMP are variable data structures influencing probabilities of instructions to be executed. Between time 0 and T, the learner repeats the following cycle over and over again ( denotes a set of possible instructions): select and execute with conditional probability . Instruction a will consume time [#!Russell:91!#,#!Boddy:94!#] and may change E, S, and even SMP. (Somewhat related, but more restricted limited resource scenarios were studied in [#!Berry:85!#,#!Gittins:89!#,#!Greiner:96!#] and references therein.)
Primitive learning algorithms (PLAs). Actions in that modify SMP are called primitive learning algorithms (PLAs). To ensure non-vanishing exploration potential PLAs may not generate SMP-modifications that will let certain action probabilities vanish entirely. PLAs and other actions can be combined to form more complex (probabilistic) learning algorithms.
Checkpoints. The learner's entire life-time can be partitioned into time intervals separated by special times called checkpoints. Checkpoints are computed dynamically during the learner's life by certain actions in executed according to SMP itself. The learner's k-th checkpoint is denoted sk. Checkpoints obey the following rules: (1) . (2) (3) Except for the first, checkpoints may not occur before at least one PLA executed at least one SMP-modification since the previous checkpoint.
Sequences of SMP-modifications (SSMs). SSMk denotes the sequence of SMP-modifications computed by PLAs between sk and sk+1. Since PLA execution probabilities depend on SMP, SMP can in fact modify the way in which it modifies itself.
Reward. Occasionally E provides real-valued reward. R(t) is the cumulative reward obtained between time 0 and time t > 0, where R(0) = 0.
Goal. At some checkpoint t the learner's goal is to generate SMP-modifications that accelerate long-term reward intake: it wants to let exceed the current average reward intake. To determine this speed, a previous point t' < t to compute is required. How can t' be specified in a general yet reasonable way? For instance, if life consists of many successive ``trials'' with non-deterministic outcome, how many trials must we look back in time?
We address this question by maintaining a time-varying set V of past checkpoints that have led to long-term reward accelerations. Initially V is empty.
Success-story criterion (SSC).
Let vk denote the k-th element of V in ascending order.
SSC is satisfied at time t if either V is empty (trivial case) or if
Success-Story Algorithm (SSA). At every checkpoint we invoke the success-story algorithm (SSA):
1. WHILE SSC is not satisfiedUndo all SMP modifications made since the most recent checkpoint in V.2. Add the current checkpoint to V.
Remove that checkpoint from V.
``Undoing'' a modification means restoring the preceding SMP -- this requires storing past values of SMP components on a stack prior to modification. (Components of SMP and elements of V can be stored on the same stack -- see section 3.)
Thus each SMP modification that survived SSA is part of a bias shift generated after a checkpoint marking a lifelong reward speed-up.
Implementing SSA. Using stack-based backtracking methods such as those described in section 3, one can guarantee that SSC will be satisfied after each new SMS-start, despite interference from S and E. Although inequality (1) contains |V| fractions, SSA can be implemented efficiently: only the two sequences of modifications on top of the stack need to be considered at a given time in an SSA call (see details in section 3). A single SSA call, however, may undo many sequences of self-modifications if necessary.
Timing SSA calls. Between two checkpoints SMP is temporarily protected from SSA evaluations. Since the way of setting checkpoints depends on SMP, SMP can learn when to evaluate itself. This evaluation-timing ability of SMP is important in dealing with unknown reward delays.
SSA's generalization assumption. At the end of each SSA call, until the beginning of the next one, the only temporary generalization assumption for inductive inference is: SMP-modifications that survived all previous SSA calls will remain useful. In absence of empirical evidence to the contrary, each surviving sequence of modifications SSMk is assumed to have set the stage for later successful SSMi, i > k. In unknown environments, which other generalization assumption would make sense? Since life is one-way (time is never reset), during each SSA call the system has to generalize from a single experience concerning the usefulness of policy modifications executed after any given previous point in time: the average reward per time since then.
Satisfying SSC in a non-trivial way. In antagonistic environments with constantly decreasing reward there is no way of justifying permanent SMP-modifications by SSC. Let us assume, however, that E, S and (representing the system's initial bias) do indeed allow for SMP-modifications triggering long-term reward accelerations. This is an instruction set-dependent assumption much weaker than the typical Markovian assumptions made in previous RL work, e.g., [#!Kumar:86!#,#!Sutton:88!#,#!WatkinsDayan:92!#]. Now, since we prevent all instruction probabilities from vanishing (see implementation below), SMP cannot help but trigger occasional SMP-modifications, and keep those consistent with SSC: in this sense, the learner can only improve. Note that the older some SMP-modification, the more time will have passed to gather experience with its long-term consequences, and the more stable it will be if it is indeed useful and not just there by chance.
Measuring effects of learning on later learning. Since some SMP-modification's success recursively depends on the success of later SMP-modifications for which it sets the stage, SSA (unlike other RL methods) automatically provides a basis for learning how to learn or metalearning: SSA prefers SSMs making ``good'' future SSMs more likely. In other words, SSA prefers probabilistic learning algorithms that lead to better probabilistic learning algorithms.
SMP/SSA's limitations. Of course, in general environments neither SMP/SSA nor any other scheme is guaranteed to find ``optimal'' policies that will lead to maximal cumulative reward. SSA is only guaranteed to selectively undo those policy modifications that were not empirically observed to lead to an overall speed-up of average reward intake. Still, this is more than can be said about previous RL algorithms.