**Basic set-up.**
At a given time, the variable matrix above represents
the system's current *policy*. Each call of the
procedure **Adapt**
(invoked by ALS)
modifies the policy.
Let us consider the complete sequence of such calls
spanning the entire
system life, which
starts at time 0 and ends at some point in the future
(time flows in one direction -- there are no resets to 0).
By definition,
the i-th call occurs at time t_i, is denoted **Adapt**_i,
and generates a policy modification denoted by M_i.
In between two calls, a certain amount of time
is consumed by **Levin search** (details about how time is
measured will follow in the section on experiments).

**Goal.** Whenever ALS as above finds a solution,
the system receives a fixed reward.
The goal is to receive as much reward as quickly as possible,
by generating policy changes that minimize the computation
time required by *future* calls of **Levin search**.
Let us denote the sum of all reinforcements between
time 0 and time t by R(t).

**Reinforcement/time ratios.**
Right before each call of **Adapt**, EIRA (see details below)
essentially
*invalidates* those policy modifications
that are not consistent with
the so-called reinforcement acceleration criterion (RAC).
To define RAC, we first introduce a measure indicating how
useful **Adapt**_i has been until the current time t --
we simply compute the reinforcement/time ratio Q(i, t):

Q(i, t) = (R(t) - R(t_i)) / (t - t_i).

At a particular time t, RAC is satisfied if for each i such
that M_i is still valid:

(a) Q(i, t) > R(t)/t, and

(b) for each k < i such that M_k is still valid: Q(i, t) > Q(k, t).

Obviously, RAC only holds if
the history of still valid policy modifications
represents a history of
long-term reinforcement accelerations -- each still valid
modification has to be followed by more average reinforcement
per time than all the previous ones.
*Note that the success of some Adapt call
depends on the success of all later Adapt calls, for
which it is ``setting the stage''!* This represents an
essential difference from previous performance criteria.
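The RAC check above can be sketched in a few lines of Python. This is a minimal illustration, assuming that for each still valid modification M_i we have recorded the pair (t_i, R(t_i)); the function and variable names are illustrative, not from the original.

```python
def ratio(R_t, t, t_i, R_t_i):
    """Reinforcement/time ratio Q(i, t) = (R(t) - R(t_i)) / (t - t_i)."""
    return (R_t - R_t_i) / (t - t_i)

def rac_holds(t, R_t, valid_mods):
    """Check RAC at time t, given cumulative reinforcement R_t = R(t).

    valid_mods: list of (t_i, R(t_i)) pairs for the still valid
    modifications, ordered oldest first.
    """
    prev = R_t / t  # baseline: average reinforcement since time 0
    for t_i, R_t_i in valid_mods:
        q = ratio(R_t, t, t_i, R_t_i)
        if q <= prev:    # each valid modification must strictly beat
            return False # the baseline and all earlier valid ones
        prev = q
    return True
```

Since the ratios must be strictly increasing and the first must exceed R(t)/t, checking the sequence pairwise against a running maximum covers both conditions (a) and (b).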

**EIRA** uses a stack to store information about policy modifications
computed by calls of **Adapt**. Right before
**Adapt** is executed,
EIRA restores (if necessary) previous policies such that
RAC holds. EIRA is based on two processes:

**(1) Pushing.**
At time t_i,
EIRA pushes the following information on the stack:
t_i, R(t_i), and the previous values of those columns
of the variable matrix (representing probability distributions)
changed by **Adapt**_i
(this information may be needed for
later restoring
the old policy, as it used to be before M_i was
generated).

**(2) Popping.**
Right before each call of **Adapt**,
while none of the following conditions (1-3) holds,
EIRA pops probability vectors
off the stack and *invalidates* the corresponding
policy modifications, by restoring the previous policies.

(1) Q(i, t) > Q(j, t), where M_i and M_j are still valid, and M_j is the most recent valid policy modification generated earlier than M_i.

(2) Q(i, t) > R(t)/t, where M_i is the only still valid policy modification.

(3) the stack is empty.
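The two processes can be combined into one small sketch. This is a hypothetical rendering, assuming the policy is represented as a dict mapping column index to probability vector (standing in for the columns of the variable matrix); names and the data layout are assumptions, not the original implementation.

```python
class EIRA:
    """Stack-based backtracking: push before each Adapt call,
    pop (restoring old policy columns) until RAC can hold."""

    def __init__(self):
        self.stack = []  # entries: (t_i, R(t_i), saved_columns)

    def push(self, t_i, R_t_i, policy, changed_cols):
        # Save the previous values of the columns Adapt_i changes,
        # together with t_i and R(t_i).
        saved = {c: policy[c][:] for c in changed_cols}
        self.stack.append((t_i, R_t_i, saved))

    def q(self, t, R_t, t_i, R_t_i):
        # Reinforcement/time ratio Q(i, t).
        return (R_t - R_t_i) / (t - t_i)

    def pop_until_rac(self, t, R_t, policy):
        # Pop while none of the three stopping conditions holds.
        while self.stack:                      # else: condition (3)
            t_i, R_i, saved = self.stack[-1]
            if len(self.stack) >= 2:           # condition (1)
                t_j, R_j, _ = self.stack[-2]
                if self.q(t, R_t, t_i, R_i) > self.q(t, R_t, t_j, R_j):
                    break
            elif self.q(t, R_t, t_i, R_i) > R_t / t:
                break                          # condition (2)
            self.stack.pop()                   # invalidate M_i:
            policy.update(saved)               # restore old columns
```

Note that each loop iteration only compares the two topmost (i.e., most recent still valid) modifications, which is exactly why the backtracking is cheap per step even though a single popping process may invalidate many modifications.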

**Theoretical soundness.**
Using induction,
it can be shown that this backtracking procedure
ensures that RAC holds after each popping
process [Schmidhuber, 1995b].

At any given time,
EIRA's straightforward *generalization assumption* is:
modifications that survived the most recent
popping process will remain useful. In general environments,
what else could be assumed?
Note that at any given time in system life,
we have only one single ``training
example'' to evaluate the current long-term
usefulness of any given previous **Adapt** call,
namely the average reinforcement per
time since it occurred.
During the next popping process, however,
EIRA will reevaluate ``usefulness so far'' of
still valid modifications.

**To conclude:** EIRA again and again implicitly evaluates each
still valid policy modification as to whether it has been
followed by long-term performance improvement (perhaps
because the modification set the stage for later useful
modifications). If there is evidence
to the contrary, EIRA invalidates policy modifications until
RAC is fulfilled again.
EIRA's stack-based backtracking
is efficient in the sense that only the two most recent
still valid modifications have to be considered at a given
time
(although a *single* popping process may invalidate
*many* modifications).
