The currently surprising module wants to repeat similar surprises with higher probability in the future. The other wants to avoid further surprises by learning not to agree on similar computation sequences (implicitly learning what the other already knows). And it wants to be ``creative'' in the sense that it wants to generate new surprises for the other module instead. In principle, both can learn by executing subsequences of instructions that include LIs. How can we ensure that each module indeed improves by accelerating its reward intake?
In this chapter I will use the IS paradigm [35,34] to deal with both modules' complex spatio-temporal credit assignment problem. This does not necessarily mean that IS is the best way of doing so. Other RL paradigms may be appropriate, too -- this chapter's basic ideas are independent of the choice of RL method. IS seems attractive, however, because: (1) It does not draw an explicit distinction between learning algorithms and other instructions, or between learning, metalearning, metametalearning, etc. (2) It properly takes into account that the success of each module modification recursively depends on the success of all later modifications for which it sets the stage. (3) Its objective takes into account the entire time consumed by lifelong learning itself. (4) It is designed for quite general non-Markovian credit assignment problems in lifelong learning situations -- see [35,34,31] for recent IS applications. The remainder of this section describes LEFT's IS-based learning process; RIGHT's is analogous.
Checkpoints. LEFT's entire lifetime can be partitioned into time intervals separated by special times called checkpoints. Checkpoints are computed dynamically during the learner's life by certain instructions executed according to the modules themselves. LEFT's $k$-th checkpoint is denoted $t_k$. Checkpoints obey the following rules: (1) $\forall k: t_k < t_{k+1}$. (2) Whether and when a checkpoint occurs is determined by LEFT itself, via the instructions it executes. (3) Except for the first, a checkpoint may not occur before at least one LI has executed at least one LEFT-modification since the previous checkpoint.
Sequences of module modifications. SLM$_k$ denotes the sequence of LEFT-modifications (SLM) computed by LIs between checkpoints $t_k$ and $t_{k+1}$. Since LI execution probabilities depend on the current LEFT and RIGHT, the modules can in fact modify the way they modify themselves -- they can devise their own probabilistic learning algorithms.
Goal. At some checkpoint, LEFT's goal is to generate LEFT-modifications that will accelerate long-term reward intake: it wants the average reward per time during the rest of its life to exceed the current average reward intake (total reward so far divided by total time so far).
Success-story criterion (SSC). LEFT maintains a time-varying set $V$ of past checkpoints that have led to long-term reward accelerations. Initially $V$ is empty. Let $v_i$ denote the $i$-th element of $V$ in ascending order. SSC is satisfied at time $t$ if either $V$ is empty (trivial case) or if
$$ \frac{R(t)}{t} < \frac{R(t)-R(v_1)}{t-v_1} < \frac{R(t)-R(v_2)}{t-v_2} < \ldots < \frac{R(t)-R(v_{|V|})}{t-v_{|V|}}, \quad (1) $$
where $R(\tau)$ denotes the cumulative reward received up to time $\tau$. That is, each checkpoint still in $V$ marks the beginning of a period of faster average reward intake than all earlier periods delimited by elements of $V$.
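As a concrete illustration, the SSC chain of inequalities takes only a few lines of code to check. The following is a minimal sketch (my own, not the paper's implementation), assuming `R` is a function returning cumulative reward up to a given time and `V` is the ascending list of currently valid checkpoints:

```python
def ssc_satisfied(R, t, V):
    """Success-story criterion: reward per time measured from each
    checkpoint in V must strictly increase along V, starting from the
    average over the whole life, R(t)/t."""
    ratios = [R(t) / t] + [(R(t) - R(v)) / (t - v) for v in V]
    return all(a < b for a, b in zip(ratios, ratios[1:]))
```

For instance, if reward intake sped up after a single checkpoint at time 10, the chain of ratios is increasing and the criterion holds; if intake slowed down instead, it fails; with empty `V` it holds trivially.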
Success-story algorithm (SSA). At every checkpoint of LEFT we invoke the success-story algorithm (SSA):
1. WHILE SSC is not satisfied: undo all LEFT-modifications made since the most recent checkpoint in $V$, and remove that checkpoint from $V$.
2. Add the current checkpoint to $V$.
``Undoing'' a modification means restoring the preceding LEFT -- this requires storing past values of LEFT-components on a stack prior to modification. (Components of LEFT and elements of $V$ can be stored on the same stack -- see the appendix.) Thus each LEFT-modification that survived SSA is part of a bias shift generated after a checkpoint marking a lifelong reward speed-up: since $v_i$ there has been more reward per time than since $v_{i-1}$, for each $v_i \in V$ (with $v_0 := 0$, the beginning of life).
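The checkpoint/undo machinery can be sketched as follows. This is a hypothetical, simplified rendition (the class name `SSALearner` and the dict-based policy are my own, not the paper's): policy components and checkpoint markers share one stack, and each SSA call pops whole blocks of modifications until the success-story criterion holds.

```python
class SSALearner:
    """Minimal stack-based SSA sketch: policy components and checkpoint
    markers live on the same stack, so undoing is plain backtracking."""

    def __init__(self):
        self.policy = {}   # modifiable components of the module
        self.stack = []    # ('mod', name, old_value) and ('cp', time) entries
        self.V = []        # valid checkpoint times, ascending

    def modify(self, name, new_value):
        # Push the old value so this modification can be undone later.
        self.stack.append(('mod', name, self.policy.get(name)))
        self.policy[name] = new_value

    def ssa_checkpoint(self, t, R):
        # SSA: while SSC fails, undo everything back to the most recent
        # checkpoint in V and remove that checkpoint from V.
        while not self._ssc(t, R):
            while self.stack and self.stack[-1][0] == 'mod':
                _, name, old = self.stack.pop()
                if old is None:
                    self.policy.pop(name, None)
                else:
                    self.policy[name] = old
            self.stack.pop()   # drop the ('cp', time) marker ...
            self.V.pop()       # ... and the matching entry of V
        self.V.append(t)       # current checkpoint joins the success story
        self.stack.append(('cp', t))

    def _ssc(self, t, R):
        # Reward per time since each valid checkpoint must strictly increase.
        ratios = [R(t) / t] + [(R(t) - R(v)) / (t - v) for v in self.V]
        return all(a < b for a, b in zip(ratios, ratios[1:]))
```

In this sketch a modification followed by slower reward intake than the life-long average is rolled back at the next SSA call, while modifications followed by genuine speed-ups survive on the stack.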
Timing SSA calls. Between two checkpoints, LEFT is temporarily protected from SSA evaluations. Since the way of setting checkpoints depends on LEFT itself, LEFT can learn to influence when it gets evaluated. This evaluation-timing ability is important in dealing with unknown reward delays.
SSA's generalization assumption. At the end of each SSA call, and until the beginning of the next one, the only temporary generalization assumption for inductive inference is: LEFT-modifications that survived all previous SSA calls will remain useful. In the absence of empirical evidence to the contrary, each surviving SLM is assumed to have set the stage for later successful SLMs. Since life is one-way (time is never reset), during each SSA call the system has to generalize from a single experience concerning the utility of LEFT-modifications executed after any given previous point in time: the average reward per time since then.
Implementing SSA. Using stack-based backtracking methods such as those described in [35,34] and the appendix, one can guarantee that SSC is satisfied right after the start of each new SLM, despite interference from the environment and from the other module. Although inequality (1) contains fractions, SSA can be implemented efficiently: only the two SLMs on top of the stack need to be considered at any given time during an SSA call (see details in the appendix). A single SSA call, however, may undo many SLMs if necessary.
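The efficiency claim can be illustrated with a local check: each iteration of the WHILE loop need only compare the reward/time ratios of the two most recent checkpoints, since the inequalities further down the chain were already verified at earlier SSA calls. A hedged sketch (the helper name and the conventions $v_0 := 0$, $R(0) = 0$ are mine):

```python
def top_pair_ok(R, t, V):
    """Check only the topmost inequality of the SSC chain.

    Compares reward per time since the newest checkpoint in V with
    reward per time since the second-newest (or since birth, time 0,
    when V holds a single element). Assumes R(0) == 0."""
    if not V:
        return True   # trivial case: empty V satisfies SSC outright
    prev = V[-2] if len(V) > 1 else 0
    newest_rate = (R(t) - R(V[-1])) / (t - V[-1])
    older_rate = (R(t) - R(prev)) / (t - prev)
    return older_rate < newest_rate
```

The WHILE loop would then pop the newest checkpoint whenever this local test fails and re-test, so no full re-evaluation of the fraction chain is needed per iteration; see the appendix for why the local check suffices.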
What has been described for LEFT analogously holds for RIGHT. The appendix describes a particular implementation (the one used for the experiments), based on an assembler-like programming language similar to those used in [34,26].