Previous work on reinforcement learning (e.g., [16,2,49,51]) requires strong assumptions about the environment. In many realistic settings, however, these assumptions do not hold. In particular, any event/action/experiment occurring early in the life of a learning system may influence events/actions/experiments at any later time. To address these important but previously mostly ignored issues, this paper describes a novel, general framework for single-life reinforcement learning with limited computational resources in rather general environments. The focus is on theoretical considerations. However, experiments in sections 3 and 4 will serve to illustrate the basic concepts.
Scenario. Consider a learning system executing a lifelong action sequence in an unknown environment. For now, it won't even be necessary to introduce a detailed formal model of the environment. Different system actions may require different amounts of execution time (like in scenarios studied in [25,4], and in references given therein). Occasionally the environment provides real-valued ``reinforcement''. The sum of all reinforcements obtained between ``system birth'' (at time 0) and time is denoted by . Throughout its lifetime, the system's goal is to maximize , the cumulative reinforcement at (initially unknown) ``death'' . There is only one life. Time flows in one direction (no resets to zero). Related, but less general scenarios were studied in papers on ``bandit problems'' (e.g., [3,9] and references therein), which also require to wisely use limited resources to perform experiments. See also  and references therein for work on finding search space elements with optimal expected utility, subject to the resource constraint of spending only a feasible amount of time on finding such an element.
Policy. The system's current policy is embodied by modifiable parameters of an arbitrary algorithm mapping environmental inputs and internal states to output actions and new internal states. For example, parameters may be bits of machine code, or weights of a neural net. The number of parameters needs not be fixed. For now, we won't need a detailed formal model of the policy.
Policy modification processes (PMPs). Policy parameters are occasionally modified by finite ``policy modification processes'' (PMPs), sometimes also called ``learning processes''. Different PMPs may take different amounts of time. The -th PMP in system life is denoted PMP, starts at time , ends at , , and computes a policy modification denoted by . Later we will require that a non-zero time-interval passes between and , the beginning of PMP. While PMP is running, the system's lifelong interaction sequence with the environment may continue, and there may be reinforcement signals, too. In fact, PMP may use environmental feedback to compute modification -- for instance, by executing a known reinforcement learning algorithm. Later I will even take advantage of the possibility that PMPs may be generated and executed according to information embedded in the policy itself -- this is of interest in the context of ``self-modifying'' policies that ``learn how to learn how to learn ...''. For the moment, however, I do not care for what exactly happens while PMP is running: in what follows, I won't need a formal model of the PMP details. The following paragraphs are only concerned with the question: how do we measure modification 's influence on the remaining system life? In particular, how do we measure whether successfully set the stage for later , ?
The problem. In general environments, events/actions/experiments occurring early in system life may influence events/actions/experiments at any later time. In particular, PMP may affect the environmental conditions for PMP. This is not addressed by existing algorithms for adaptive control and reinforcement learning (see, e.g., [16,2,49,51]), and not even by naive, inefficient, but more general and supposedly infallible exhaustive search among all possible policies, as will be seen next.
What's wrong with exhaustive search? Apart from the fact that exhaustive search is not considered practical even for moderate search spaces, it also suffers from another, more fundamental problem. Let be the number of possible policies. For the sake of the argument, suppose that is small enough to allow for systematic, sequential generate-and-test of all policies within the system life time. Suppose that after all policies have been generated (and evaluated for some time), during the remainder of system life, we keep the policy whose test brought about maximal reinforcement during the time it was tested. In the general ``on-line'' situation considered in this paper, this may be the wrong thing to do: for instance, each policy test may change the environment in a way that changes the preconditions for policies considered earlier. A policy discarded earlier may suddenly be the ``best'' policy (but will never be considered again). Similar things can be said about almost every other search algorithm or learning algorithm -- exhaustive search is just a convenient, very simple, but representative example.
How to measure performance improvements? Obviously, the performance criterion used by naive exhaustive search (and other, less general search and learning algorithms) is not appropriate for the general set-up considered in this paper. We first have to ask: what is a reasonable performance criterion for such general (but typical) situations? More precisely: if we do not make any assumptions about the environment, can we still establish a sensible criterion according to which, at a certain time, (1) the system's performance throughout its previous life always kept improving or at least never got worse, and which (2) can be guaranteed to be achieved by the system? Indeed, such a criterion can be defined, as will be seen below. First we will need additional concepts.
Reinforcement/time ratios based on single observations.
At a given time in system life,
we essentially have only one single ``training
example'' to evaluate the current long-term
usefulness of any given previous PMP,
namely the average reinforcement per
time since that learning process occurred.
Suppose PMP started
execution at time and completed itself at time
For and , the reinforcement/time ratio
is defined as
Why measure time starting from the beginning of PMP, instead of starting from its end? Later we will see that the first evaluations of PMP's performance will be delayed at least until after its end. While PMP is still running, it is in a ``grace period'' which may be used to collect delayed reinforcement to justify -- this is important when reinforcement signals occur long after policy modifications took place. Indeed, later I will make use of the fact that a ``self-modifying'' policy may determine the runtimes of its own PMPs, thus in principle being able to learn how long to wait for delayed reinforcement.
Currently valid modifications. After , PMP's effects on the policy will stay in existence until (a) being overwritten by later PMP, , or until (b) being invalidated by the ``success-story algorithm'' (SSA), the method to be described below. To define valid modifications, only (b) is relevant: after , the policy modification generated by PMP will remain valid as long as SSA (see below) does not discard (by restoring the previous policy right before PMP started).
Success-story criterion (SSC). At time , SSC is satisfied if for each PMP that computed a still valid modification of the system's policy,
(a) , and
(b) for all , where PMP computed a still valid : .
In words: SSC is satisfied if the beginning of each completed PMP that computed a still valid modification has been followed by long-term reinforcement acceleration -- measured up until the current time. Note that the success of some PMP depends on the success of all later PMPs, for which it is ``setting the stage''. This represents an essential difference to previous performance criteria.
The method to achieve SSC is called ``success-story algorithm'' (SSA). SSA uses a stack to trace and occasionally invalidate certain policy modifications. Details are next.
Success-story algorithm (SSA). SSA uses an initially empty stack to store information about currently valid policy changes computed by PMPs. Occasionally, at times called ``checkpoints'', this information is used to restore previous policies, such that SSC holds (later we will see that a ``self-modifying'' policy can in principle learn to set checkpoints at appropriate times). SSA is based on two complementary kinds of processes:
(1) Pushing. Recall that PMP starts at time and ends at . Let denote a record consisting of the values , (which will be needed to compute values at later times ), plus the information necessary to restore the old policy (before modification by PMP). is pushed onto the stack (this will consume time, of course). By definition, PMP is not finished until the entire pushing process is completed. From now on, the modification will remain valid as long as will remain on the stack.
(2) Popping. At certain times called ``checkpoints'' (see below), do:
WHILE no time is reached such that one of the following conditions (1-3) holds, DO: pop the topmost -value off the stack, and invalidate the corresponding modification , thus restoring the corresponding previous policy.
(1) , where , and and are the two most recent currently valid modifications (if existing),
(2) , where is the only currently valid modification (if existing),
(3) the stack is empty.For all PMP, checkpoints occur right before time , such that, by definition, coincides with the end of the corresponding popping process (note that checkpoints themselves may be set by a ``self-modifying'' policy that determines beginnings and runtimes of its own PMPs -- care has to be taken, however, to avoid PMPs with infinite duration; see section 2.2). The time consumed by pushing processes, popping processes, and all other computations is taken into account (for instance, time goes on as popping takes place).
Theorem 1. Whenever popping is finished, SSA will have achieved SSC.
Proof sketch. Induction over the stack contents after popping.
Note that SSA does not care for the nature of the PMPs -- for all completed PMP (based, e.g., on conventional reinforcement learning algorithms), at each ``checkpoint'', SSA will get rid of if was not followed by long-term reinforcement speed-up (note that before countermanding , SSA will already have countermanded all ). No modification is guaranteed to remain valid forever.
Generalization assumption. After popping, until the next checkpoint, SSA's straight-forward generalization assumption is: modifications that survived the most recent popping process (because they appeared to contribute to speed-up of average reinforcement intake) will remain useful. In general environments, which other generalization assumption would make sense? Recall that since life is one-way (time is never reset), at each checkpoint the system has to generalize from a single experience concerning the usefulness of any given previous learning process: the average reinforcement per time since that learning process occurred. Learning from singular experiences contrasts other, less realistic kinds of reinforcement learning, which, in one way or another, require the assumption that it is possible to collect sufficient statistics from repeatable training sequences.
To summarize: SSA again and again (at each checkpoint) implicitly evaluates each still valid policy modification as to whether it has been followed by long-term performance improvement (perhaps because the modification set the stage for later useful modifications). If there is evidence to the contrary, SSA invalidates policy modifications until SSC is fulfilled again. SSA's stack-based backtracking is efficient in the sense that only the two most recent (``topmost'') still valid modifications need to be considered at a given time (although a single popping process may invalidate many modifications). SSA is a general framework -- you can use your favorite learning algorithm to generate the PMP. This makes sense especially in situations where the applicability of is questionable because the environment does not satisfy the preconditions that would make sound. SSA can at least guarantee that those of 's policy modifications that appear to have negative long-term effects are countermanded.
Note that a trivial way of satisfying SSC is to never make a policy modification. But any system that cannot let modification probabilities entirely vanish occasionally will execute policy modifications, and keep those consistent with SSC. In this sense, it cannot help but getting better, if the environment does indeed provide a chance to improve performance, given the set of possible actions.
Due to the generality of the approach, no reasonable statements can be made about improvement speed, which indeed highly depends on the nature of the environment and the choice of the action set. This lack is shared by almost all other reinforcement learning approaches, though.
Costs of environment-independence. A price to pay for environment-independence is this: the beginning of the time interval considered to measure the current reinforcement/time ratio is marked by the beginning of the PMP that computed the most recent valid policy modification. The length of this time interval may vary unpredictably, because its beginning may wander back in time due to policy restaurations by SSA. In arbitrary environments, SSA cannot guarantee speed-up of reinforcement per fixed time interval (no algorithm can).
Benefits of environment independence. Next we will see that SSA provides a novel basis for two notoriously difficult issues in machine learning: (*) Multi-agent learning, (**) ``Learning how to learn''.
(*) Multi-agent learning. Consider the case where there are multiple, interacting, SSA-learning agents. For each agent, the others are part of the changing (possibly complex) environment. This is the main reason why previous approaches to multi-agent learning are heuristic by nature. Not so with SSA, however: since SSA is environment-independent, each agent will still be able to satisfy its SSC after each checkpoint. In cases where all agents try to speed up the same reinforcement signals, and where no agent can speed up reinforcement intake by itself, this automatically enforces ``learning to cooperate'' [35,36,40]. Section 2.3 will illustrate this with an application of a system consisting of multiple agents, where each agent is in fact just a connection in a neural net.
(**) Learning how to learn / Incremental self-improvement. With SSA, the success of PMP recursively depends on the success of later PMP, : the cumulative reinforcement collected after PMP includes the cumulative reinforcement collected after PMP, . In particular, performance improvements include those improvements that make future additional improvements more likely: policy modification can prove its long term usefulness by setting the stage for additional, useful modifications , etc. Now recall that SSA does not care for the nature of the PMPs -- they may depend on the system policy itself. If we allow a system's policy to change itself by computing and executing its own PMPs (this is easy to implement, e.g. by using instructions of an assembler-like, ``self-referential'' programming language as system actions -- see section 2.2), then SSA will keep only those self-modifications followed by reinforcement speed-ups, in particular those leading to ``better'' future self-modifications, etc., recursively: embedding the learning mechanism within the system policy immediately and naturally leads to ``incremental self-improvement'' and ``learning how to learn how to learn ...'' -- there is no circularity involved, and the approach remains sound.
Outline of remainder of paper. The main contribution of this paper is given by what has been said above. The remainder of this paper mainly serves to illustrate the theory. In section 2, I will exemplify the ideas in paragraph (**) on incremental self-improvement and describe a particular, concrete, working, ``evolutionary'' system that implements them. Section 3 will then apply this system to various tasks, including a variant of Sutton's maze task . One difference to Sutton's original task is that the policy environment continually changes because of actions generated by the system itself. Section 4 will then exemplify the ideas in paragraph (*) on multi-agent learning and also describe a particular, concrete, working, ``evolutionary'' system that implements them. With this system, each SSA-learning agent is in fact just a connection in a fully recurrent neural net. Each connection's policy is represented by its current weight. Connections do have a very limited policy space in comparison to typical, more complex agents used in standard AI -- but this is irrelevant: SSA does not care for the complexity of the agents. A by-product of this research is a general reinforcement learning algorithm for recurrent nets. Again, an application to a variant of Sutton's maze task will serve to illustrate the operation of the system. In this application, the environment of each connection's policy continually changes, because the policies of all the other connections keep changing.
Note. This paper is based on . In the meantime we have published several additional papers on SSA, e.g., [50,53,40,42].