**System resources.**
The learner lives from time 0 to unknown time *T*
in an unknown environment *E*.
It has an internal state *S* and a self-modifying policy SMP. Both *S* and SMP are
variable data structures influencing probabilities of
instructions to be executed. Between time 0 and *T*,
the learner repeats the following cycle
over and over again (*A* denotes the set of possible instructions):
select and execute some instruction *a* ∈ *A*
with conditional probability *P*(*a* | *S*, SMP).
Instruction *a* will consume time
[#!Russell:91!#,#!Boddy:94!#]
and may change *E*, *S*, and even SMP.
(Somewhat related but more restricted limited-resource scenarios were
studied in [#!Berry:85!#,#!Gittins:89!#,#!Greiner:96!#]
and the references therein.)
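
The following sketch shows one way this cycle could look in code; it is only an illustration, and names such as `SMP.probability` and `a.execute` are assumptions, not part of the original formulation.

```python
import random

# Minimal sketch of the lifelong cycle, assuming objects with the
# illustrative interfaces noted above; not the paper's implementation.
def lifelong_cycle(env, S, SMP, instructions, T):
    """From time 0 to unknown time T: repeatedly select an instruction a
    with conditional probability P(a | S, SMP) and execute it."""
    t = 0.0
    while t < T:
        # S and SMP jointly determine the execution probabilities
        weights = [SMP.probability(a, S) for a in instructions]
        a = random.choices(instructions, weights=weights, k=1)[0]
        # executing a consumes time and may change env, S, and even SMP
        t += a.execute(env, S, SMP)
    return S, SMP
```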

**Primitive learning algorithms (PLAs).**
Actions in *A* that modify SMP are
called *primitive learning algorithms* (PLAs).
To ensure non-vanishing exploration potential,
PLAs may not generate SMP-modifications that
would let any action probability vanish entirely.
PLAs and other actions can be combined to form
more complex (probabilistic) learning algorithms.
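
As a hedged illustration, a tabular PLA might shift probability mass toward one instruction while enforcing a small floor so that no action probability can vanish; the representation and the constant EPS below are assumptions.

```python
EPS = 0.001  # small floor (an assumption) keeping every instruction selectable

def pla_increase(smp_row, target_idx, amount):
    """Shift probability mass toward instruction target_idx in a row of
    SMP (a list of probabilities), keeping all entries strictly positive.
    Returns the previous row so the modification can later be undone."""
    old_row = list(smp_row)                    # saved for possible backtracking
    smp_row[target_idx] += amount
    clipped = [max(p, EPS) for p in smp_row]   # no probability may vanish
    total = sum(clipped)
    for i, p in enumerate(clipped):
        smp_row[i] = p / total                 # renormalize in place
    return old_row
```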

**Checkpoints.**
The learner's entire life-time can be partitioned into
time intervals separated by special times called *checkpoints*.
Checkpoints are computed dynamically during the learner's life by certain
actions in *A* executed according to SMP itself.
The learner's *k*-th checkpoint is denoted *s*_{k}.
Checkpoints obey the following rules:
(1) ∀*k*: 0 < *s*_{k} < *T*.
(2) ∀*j* < *k*: *s*_{j} < *s*_{k}.
(3) Except for the first, a checkpoint may not occur
before at least one
PLA has executed at least one SMP-modification
since the previous checkpoint.

**Sequences of SMP-modifications (SSMs).**
SSM_{k} denotes the sequence of SMP-modifications
computed by PLAs between
*s*_{k} and *s*_{k+1}.
Since PLA execution probabilities depend on
SMP, SMP can in fact modify the way in which it modifies itself.

**Reward.**
Occasionally *E* provides
real-valued reward. *R*(*t*) is the cumulative reward obtained
between time 0 and time *t* > 0, where *R*(0) = 0.

**Goal.**
At some checkpoint *t* the
learner's goal is to
generate SMP-modifications that
accelerate long-term reward intake: it wants
to let the future reward speed (*R*(*T*) − *R*(*t*))/(*T* − *t*)
exceed
the current average reward intake.
To determine the latter, a previous
point *t*' < *t* to compute (*R*(*t*) − *R*(*t*'))/(*t* − *t*')
is required.
How can *t*' be specified in a general yet reasonable way?
For instance, if life consists of many successive ``trials''
with non-deterministic outcome, how many trials must we
look back in time?

We address this question by maintaining a time-varying set *V*
of past checkpoints that have led to long-term reward accelerations.
Initially *V* is empty.
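
To illustrate with purely hypothetical numbers: suppose *R*(*t*) = 100 at *t* = 1000. Measuring from *t*' = 0 gives an average intake of 100/1000 = 0.1 per time step, while measuring from *t*' = 900 with *R*(900) = 85 gives (100 − 85)/(1000 − 900) = 0.15, a speed-up relative to the lifelong average. Different choices of *t*' can thus yield different verdicts; *V* keeps exactly those reference points whose verdicts have remained positive.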

**Success-story criterion (SSC).**
Let *v*_{k} denote the *k*-th element of *V* in ascending order.
SSC is satisfied at time *t* if either *V* is empty (trivial case) or if

(1) *R*(*t*)/*t* < (*R*(*t*) − *R*(*v*_{1}))/(*t* − *v*_{1}) < (*R*(*t*) − *R*(*v*_{2}))/(*t* − *v*_{2}) < ... < (*R*(*t*) − *R*(*v*_{|*V*|}))/(*t* − *v*_{|*V*|}).

SSC demands that each checkpoint in *V* marks the beginning of a long-term reward acceleration that has been sustained up to the current time *t*.
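
A direct check of inequality (1), as a hedged sketch; the function signature and the callable *R* are assumptions.

```python
def ssc_satisfied(V, R, t):
    """Check SSC at time t: reward speeds measured from the checkpoints
    v_1 < v_2 < ... in V must strictly increase, starting from the
    lifelong average R(t)/t.  R maps a time to cumulative reward."""
    speeds = [R(t) / t]
    speeds += [(R(t) - R(v)) / (t - v) for v in V]   # V in ascending order
    # trivially True when V is empty
    return all(a < b for a, b in zip(speeds, speeds[1:]))
```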

**Success-Story Algorithm (SSA).**
At every checkpoint
we invoke the *success-story algorithm* (SSA):

1. WHILE SSC is not satisfied: undo all SMP-modifications made since the most recent checkpoint in *V*, and remove that checkpoint from *V*.

2. Add the current checkpoint to *V*.

``Undoing'' a modification means restoring the preceding SMP --
this requires storing past values of
SMP components on a stack prior to modification.
(Components of SMP and elements of *V* can be stored on the same stack --
see section 3.)

Thus each SMP modification that survived SSA is part of a bias shift generated after a checkpoint marking a lifelong reward speed-up.
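
A hedged sketch of a single SSA call, using one stack that interleaves checkpoint markers with saved SMP components, as suggested above; `ssc_satisfied` is the check sketched earlier, and the stack-entry format is an assumption.

```python
def ssa_call(stack, V, R, t, smp):
    """One SSA call at checkpoint t.  The stack holds, in chronological
    order, entries ("checkpoint", v) and ("mod", key, old_value) pushed
    by PLAs just before they overwrite smp[key]."""
    # 1. WHILE SSC is not satisfied: undo all SMP-modifications made since
    #    the most recent checkpoint in V and remove that checkpoint from V.
    while V and not ssc_satisfied(V, R, t):
        while stack and stack[-1][0] == "mod":
            _, key, old_value = stack.pop()
            smp[key] = old_value          # restore the preceding SMP value
        stack.pop()                       # drop the checkpoint marker
        V.pop()                           # remove that checkpoint from V
    # 2. Add the current checkpoint to V.
    V.append(t)
    stack.append(("checkpoint", t))
```

Under this layout an SSA call only ever touches entries near the top of the stack, which is consistent with the efficiency remark in the next paragraph.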

**Implementing SSA.**
Using stack-based backtracking methods such as those described
in section 3, one can guarantee that SSC will be satisfied
after the start of each new SSM, despite interference from *S* and *E*.
Although inequality (1) contains |*V*| fractions,
SSA can be implemented efficiently: only
the two sequences of modifications
on top of the stack need to be considered at a given
time in an SSA call (see details in section 3).
A *single* SSA call, however, may undo
*many* sequences of self-modifications if necessary.

**Timing SSA calls.**
Between two checkpoints SMP is temporarily
protected from SSA evaluations.
Since the way of setting checkpoints
depends on SMP, SMP can learn
*when* to evaluate itself.
This evaluation-timing ability of SMP
is important in dealing with unknown reward delays.

**SSA's generalization assumption.**
At the end of each SSA call,
until the beginning of the next one,
the only temporary generalization
assumption for inductive inference is:
SMP-modifications that survived all previous SSA calls
will remain useful. In the absence of empirical evidence to the
contrary, each surviving sequence of modifications SSM_{k}
is assumed to have set the stage for
later successful SSM_{i}, *i* > *k*.
In unknown environments, which other generalization
assumption would make sense?
Since life is one-way (time is never reset),
during each SSA call the system has
to generalize from
a *single* experience concerning
the usefulness of policy modifications executed after any given
previous point in time: the average reward per
time since then.

**Satisfying SSC in a non-trivial way.**
In antagonistic environments with constantly
decreasing reward there is no way
of justifying *permanent* SMP-modifications by SSC.
Let us assume, however, that
*E*, *S*, and the instruction set *A*
(representing the system's initial bias)
do indeed allow for SMP-modifications
triggering long-term reward accelerations.
This is an
instruction set-dependent assumption much weaker
than the typical Markovian assumptions made in previous RL work,
e.g., [#!Kumar:86!#,#!Sutton:88!#,#!WatkinsDayan:92!#].
Now, since we prevent all
instruction probabilities from vanishing (see implementation below),
SMP cannot help but trigger occasional SMP-modifications,
of which only those consistent with SSC are kept:
in this sense, the learner can only improve.
Note that the older an SMP-modification, the more time
will have passed to
gather experience with its long-term consequences, and the
more stable it will be if it is indeed useful and not
just there by chance.

**Measuring effects of learning on later learning.**
Since some SMP-modification's success recursively depends on the
success of later SMP-modifications for which it sets the stage,
SSA (unlike other RL methods)
automatically provides a basis for *learning how to learn*
or *metalearning*:
SSA prefers SSMs making ``good'' future SSMs more likely.
In other words, SSA prefers probabilistic learning algorithms
that lead to better probabilistic learning algorithms.

**SMP/SSA's limitations.**
Of course, in general environments
neither SMP/SSA nor any other scheme is guaranteed to find
``optimal'' policies that will lead to maximal cumulative
reward. SSA is only guaranteed to
selectively undo those
policy modifications that were not empirically
observed to lead to an overall speed-up of average
reward intake. Still, this is more than
can be said about previous RL algorithms.
