Basic set-up. At a given time, the variable matrix $M$ above represents the system's current policy. Each call of the procedure Adapt (invoked by ALS) modifies the policy. Let us consider the complete sequence of such calls spanning the entire system life, which starts at time 0 and ends at some point in the future (time flows in one direction -- there are no resets to 0). By definition, the $i$-th call occurs at time $t^i$, is denoted Adapt$_i$, and generates a policy modification denoted $M(i)$. Between two successive calls, a certain amount of time is consumed by Levin search (details on how time is measured follow in the section on experiments).

Goal. Whenever ALS as above finds a solution, the system receives a reward of $+1.0$. The goal is to receive as much reward as quickly as possible, by generating policy changes that minimize the computation time required by future calls of Levin search. Let us denote the sum of all reinforcements between time 0 and time $t>0$ by $R(t)$.

Reinforcement/time ratios. Right before each call of Adapt, EIRA (see details below) essentially invalidates those policy modifications that are not consistent with the so-called reinforcement acceleration criterion (RAC). To define RAC, we first introduce a measure indicating how useful Adapt$_i$ has been until the current time $t$ -- we simply compute the reinforcement/time ratio $Q(i,t)$:

$$Q(i,t) = \frac{R(t) - R(t^i)}{t - t^i}$$
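The ratio can be computed directly from the recorded call time and reward. The following is a minimal sketch; the function name `q_ratio` and its argument layout are illustrative, assuming we store, for each Adapt call $i$, its start time $t^i$ and the cumulative reward $R(t^i)$ at that moment.

```python
# Hypothetical sketch: reinforcement/time ratio Q(i, t).
# R_t  = cumulative reward R(t) at the current time t
# R_ti = cumulative reward R(t^i) when Adapt_i was called at time t_i

def q_ratio(R_t, R_ti, t, t_i):
    """Average reinforcement per time unit earned since Adapt_i occurred."""
    assert t > t_i, "Q(i, t) is only defined for t > t^i"
    return (R_t - R_ti) / (t - t_i)

# Example: 3.0 units of reward accrued over 10 time units since Adapt_i
# q_ratio(5.0, 2.0, 110.0, 100.0) -> 0.3
```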

At a particular time $t$, RAC is satisfied if for each Adapt$_i$ that computed a still valid (not yet invalidated) policy modification $M(i)$, we have
(a) $Q(i,t) > \frac{R(t)}{t}$, and

(b) $\forall k < i$ such that $M(k)$ is still valid: $Q(i,t) > Q(k,t)$.
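Conditions (a) and (b) together can be checked in a single pass over the still valid modifications: since the ratios must strictly increase in call order, it suffices that the earliest ratio exceeds the lifelong average $R(t)/t$. A minimal sketch, assuming `valid` is the time-ordered list of $(t^i, R(t^i))$ pairs for the still valid Adapt calls (names are illustrative):

```python
def q(R_t, R_ti, t, t_i):
    # reinforcement/time ratio Q(i, t) as defined above
    return (R_t - R_ti) / (t - t_i)

def rac_holds(valid, R_t, t):
    """True iff RAC is satisfied at time t.
    (a) every valid modification beats the lifelong average R(t)/t;
    (b) the ratios strictly increase with the order of the Adapt calls.
    Strict increase plus (a) for the earliest entry implies (a) for all."""
    prev = R_t / t
    for (t_i, R_ti) in valid:
        r = q(R_t, R_ti, t, t_i)
        if r <= prev:
            return False
        prev = r
    return True
```

For instance, with two valid calls at times 100 and 200 and cumulative rewards 50 and 80 at those times, RAC holds at $t = 300$ if the total reward has reached 160, but not if it is only 140 (the first ratio then falls below the lifelong average).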

Obviously, RAC only holds if the history of still valid policy modifications represents a history of long-term reinforcement accelerations -- each still valid modification has to be followed by more average reinforcement per time than all the previous ones. Note that the success of a given Adapt call depends on the success of all later Adapt calls, for which it is ``setting the stage''! This represents an essential difference from previous performance criteria.

EIRA uses a stack to store information about policy modifications computed by calls of Adapt. Right before Adapt$_i$ is executed, EIRA restores (if necessary) previous policies such that RAC holds. EIRA is based on two processes:

(1) Pushing. At time $t^i$, EIRA pushes the following information on the stack: $t^i$, $R(t^i)$, and the previous values of those columns of $M$ (representing probability distributions) changed by Adapt$_i$ (this information may be needed for later restoring the old policy, as it used to be before $M(i)$ was generated).

(2) Popping. Right before each call of Adapt, EIRA repeatedly pops probability vectors off the stack and invalidates the corresponding policy modifications (by restoring the previous policies) until one of the following conditions (1-3) holds:

(1) $Q(k,t) > Q(l,t)$, where $M(k)$ and $M(l)$ are still valid, and $M(l)$ is the most recent valid policy modification generated earlier than $M(k)$.

(2) $Q(k,t) > \frac{R(t)}{t}$, where $M(k)$ is the only valid policy modification.

(3) the stack is empty.
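The two processes can be sketched as follows. This is a simplified illustration, not the original implementation: the policy matrix is modeled as a dictionary mapping column indices to probability vectors, and each stack entry stores $(t^i, R(t^i))$ together with the columns overwritten by Adapt$_i$.

```python
def q(R_t, R_ti, t, t_i):
    # reinforcement/time ratio Q(i, t)
    return (R_t - R_ti) / (t - t_i)

def push(stack, t_i, R_ti, saved_columns):
    # (1) Pushing: record the call time, the reward so far, and the
    # pre-modification probability columns of M, so M(i) can be undone.
    stack.append((t_i, R_ti, saved_columns))

def pop_until_rac(stack, M, R_t, t):
    # (2) Popping: invalidate modifications until condition (1), (2),
    # or (3) holds.
    while stack:                                   # else: condition (3)
        t_k, R_tk, cols_k = stack[-1]
        if len(stack) >= 2:
            t_l, R_tl, _ = stack[-2]
            if q(R_t, R_tk, t, t_k) > q(R_t, R_tl, t, t_l):
                return                             # condition (1)
        elif q(R_t, R_tk, t, t_k) > R_t / t:
            return                                 # condition (2)
        stack.pop()
        for col, old in cols_k.items():            # restore old columns
            M[col] = old
```

For example, if a single modification earned only 10.0 reward over the 200 time units since it was made while the lifelong average is 0.2 per time unit, `pop_until_rac` restores the saved columns and empties the stack; had the same modification been followed by above-average reward, it would survive untouched.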

Theoretical soundness. Using induction, it can be shown that this backtracking procedure ensures that RAC holds after each popping process [Schmidhuber, 1995b].

At any given time, EIRA's straightforward generalization assumption is: modifications that survived the most recent popping process will remain useful. In general environments, what else could be assumed? Note that at any given time in the system's life, we have only a single ``training example'' for evaluating the current long-term usefulness of any given previous Adapt call, namely the average reinforcement per time since it occurred. During the next popping process, however, EIRA will reevaluate the ``usefulness so far'' of still valid modifications.

To conclude: EIRA again and again implicitly evaluates each still valid policy modification as to whether it has been followed by long-term performance improvement (perhaps because the modification set the stage for later useful modifications). If there is evidence to the contrary, EIRA invalidates policy modifications until RAC is fulfilled again. EIRA's stack-based backtracking is efficient in the sense that only the two most recent still valid modifications have to be considered at a given time (although a single popping process may invalidate many modifications).

Juergen Schmidhuber 2003-02-25
