Fundamental transfer limitations. Inductive transfer of knowledge from one task solution to the next (e.g., Caruana et al. 1995; Pratt and Jennings 1996) requires the solutions to share mutual algorithmic information. Since almost all sequences of solutions to well-defined problems are incompressible and have maximal Kolmogorov complexity \cite{Solomonoff:64,Kolmogorov:65,Chaitin:69,LiVitanyi:93}, arbitrary task solutions almost never share mutual information. This implies that inductive transfer and ``generalization'' are almost always impossible -- see, e.g., Schmidhuber (1997a); for related results see Wolpert (1996). From a practical point of view, however, even the presence of mutual information is no guarantee of successful transfer, because concepts such as Kolmogorov complexity and algorithmic information ignore the time consumed by learning algorithms that compute a new task's solution from previous ones. In typical machine learning applications, it is precisely this learning time that we want to minimize.
Reward acceleration. Given the observations above, all attempts at successful transfer must be limited to task sequences of a particularly friendly kind. In the context of reinforcement learning (RL) we will focus on task sequences that allow for speeding up the learner's long-term average reward intake. Fortunately, in our own highly atypical and regular universe such task sequences abound. For instance, often we encounter situations where high reward for some problem's solution can be achieved more quickly by first learning easier but related tasks yielding less reward.
Our learner's single life lasts from time 0 to time $T$
(time is not reset in case of new learning trials).
Each modification of its policy corresponds to a
shift of inductive bias (Utgoff, 1986).
By definition, ``good'' bias shifts are those that
help to accelerate long-term average reward intake.
The learner's method for generating good bias shifts
must take into account:
(1)
Bias shifts occurring
early in the learner's life generally influence the probabilities
of later bias shifts.
(2) ``Learning'' (modifying the policy) and policy tests will
consume part of the learner's limited life-time $T$.
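To make issues (1) and (2) concrete, the following illustrative formalization may help (the notation is ours and introduced only for this sketch, not necessarily that of Section 2): write $R(t)$ for the cumulative reward collected up to time $t$. The quantity of interest for a bias shift occurring at time $t'$ is the reward per time since that shift,
\[
\frac{R(t) - R(t')}{t - t'} \qquad (t' < t \leq T),
\]
evaluated at the current time $t$. Issue (1) reflects that a shift at $t'$ influences which later shifts occur and hence all later ratios of this kind; issue (2) reflects that time spent on learning and policy tests enlarges the denominator and thus directly lowers the measured reward intake.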
Previous RL approaches. To deal with issues (1) and (2), what can we learn from traditional RL approaches? Convergence theorems for existing RL algorithms such as Q-learning \cite{WatkinsDayan:92} require infinite sampling size as well as strong (usually Markovian) assumptions about the environment, e.g., \cite{Sutton:88,WatkinsDayan:92,Williams:92}. They are of great theoretical interest but of limited relevance to our realistic, limited-life setting. For instance, there is no proof that Q-learning will converge within any given finite time, not even in Markovian environments. Moreover, previous RL approaches do not include the computation time consumed by learning and policy tests in their objective function, and they do not explicitly measure the long-term effects of early learning on later learning.
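For reference, the Q-learning update cited above \cite{WatkinsDayan:92} adjusts a table of state-action values according to
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t \Bigl[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \Bigr],
\]
where $\alpha_t$ is a learning rate and $\gamma$ a discount factor; the convergence guarantees hold only in the limit of every state-action pair being updated infinitely often, with appropriately decaying $\alpha_t$, in a Markovian environment. Neither the time spent on these updates nor the influence of early updates on later learning appears in the objective being optimized.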
Basic ideas (see details in Section 2).
To address issues (1) and (2), we treat learning algorithms
just like other time-consuming actions. Their probabilities of being executed at
a given time may depend on the learner's current internal state and
policy. Their only distinguishing feature is that they may also modify the policy. In case of policy changes or bias shifts, information
necessary to restore the old policy is pushed on a stack. At any given
time in the system's life there is only a single training example for estimating
the long-term usefulness of any previous bias shift -- namely, the
reward per time since then. This includes all the reward collected
after later bias shifts for which the earlier shift may have set the stage, thus
providing a simple measure of earlier learning's usefulness for later learning.
Occasionally the ``success-story algorithm'' (SSA) uses backtracking to
undo those policy modifications that have not been empirically
observed
to trigger long-term reward accelerations (measured up until the current SSA
call).
For instance, certain bias shifts may have been too specifically
tailored to previous tasks (``overfitting'')
and may be harmful for future inductive transfer.
Those bias shifts that survive SSA represent a lifelong success history.
Until the next SSA call, they will
form the basis for additional bias shifts and
get another chance to justify their existence.
Due to unknown reward delays, there is no a priori good
way of triggering SSA calls. In principle, however, it is possible to
build policies that
can learn to trigger SSA calls. Since learning algorithms are actions and
can be combined (according to the policy) to form more complex learning
algorithms, SSA also allows for embedding the learning strategy
within the policy itself. There is no pre-wired difference between ``learning'',
``metalearning'', ``metametalearning'' etc.
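As a concrete illustration of the stack-based bookkeeping described above, here is a minimal sketch of SSA-style backtracking (the interface below -- checkpoint records carrying an undo action, plus externally supplied current time and cumulative reward -- is our own simplification, not the implementation of Section 2):

\begin{verbatim}
class SSALearner:
    """Minimal sketch of success-story algorithm (SSA) backtracking."""

    def __init__(self):
        # Stack of checkpoints: (time of bias shift, cumulative reward
        # at that time, action restoring the pre-modification policy).
        self.stack = []

    def record_bias_shift(self, t, cumulative_reward, undo):
        """Push the information needed to restore the old policy."""
        self.stack.append((t, cumulative_reward, undo))

    def ssa_call(self, t_now, reward_now):
        """Undo policy modifications until the surviving checkpoints form
        a 'success story': the reward per time since each surviving bias
        shift exceeds the reward per time since the previous surviving
        shift (or since the start of the learner's life)."""
        while self.stack:
            t_top, r_top, undo_top = self.stack[-1]
            if len(self.stack) > 1:
                t_prev, r_prev, _ = self.stack[-2]
            else:
                t_prev, r_prev = 0.0, 0.0      # start of the single life
            speed_top = (reward_now - r_top) / max(t_now - t_top, 1e-12)
            speed_prev = (reward_now - r_prev) / max(t_now - t_prev, 1e-12)
            if speed_top > speed_prev:
                break                          # success-story criterion holds
            undo_top()                         # restore the earlier policy
            self.stack.pop()                   # this bias shift did not pay off
\end{verbatim}

In this sketch, record_bias_shift is called whenever the policy is modified, with the current time, the cumulative reward so far, and an action restoring the previous policy; at SSA call times (which, as noted, may themselves be triggered by the policy), ssa_call pops and undoes all topmost modifications that violate the criterion.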
Outline of remainder. Section 2 will describe the learner's basic cycle of operations and the details of SSA. It will explain how lifelong histories of reward accelerations can be enforced despite possible interference from parallel internal or external processes. Sections 3 and 4 will present two concrete implementations and inductive transfer experiments with complex, partially observable environments (POEs). Some of our POEs are larger and more complex than those considered in most previous POE work.