**Fundamental transfer limitations.** Inductive transfer of
knowledge from one task solution to the next (e.g., Caruana et
al. 1995; Pratt and Jennings 1996)
requires the solutions to share mutual algorithmic information.
Since almost all sequences of solutions to well-defined problems
are incompressible and have maximal Kolmogorov complexity
(Solomonoff 1964; Kolmogorov 1965; Chaitin 1969; Li and Vitányi 1993), arbitrary
task solutions almost never share mutual information. This implies that
inductive transfer and ``generalization'' are almost always impossible
-- see, e.g., Schmidhuber (1997a); for related results see Wolpert
(1996).
From a practical point of
view, however, even the presence of mutual information is no guarantee
of successful transfer. This is because concepts such as Kolmogorov
complexity and algorithmic information do not take into account the time
consumed by learning algorithms computing a new task's solution from
previous ones. In typical machine learning applications, however, it
is precisely the learning time that we want to minimize.

**Reward acceleration.** Given the observations above, all
attempts at successful transfer must be limited to task sequences of a
particularly friendly kind. In the context of reinforcement learning
(RL) we will focus on task sequences that allow for speeding up the
learner's long-term average reward intake. Fortunately, in our own highly
atypical and regular universe such task sequences abound.
For instance, we often encounter situations where the high
reward for some problem's solution can be achieved
more quickly by first learning easier but related
tasks that yield less reward.

Our learner's single life lasts from time 0 to unknown time T
(time is not reset in case of new learning trials).
Each modification of its policy corresponds to a
shift of inductive bias (Utgoff, 1986).
By definition, ``good'' bias shifts are those that
help to accelerate long-term average reward intake.
The learner's method for generating good bias shifts
must take into account:
**(1)**
Bias shifts occurring
early in the learner's life generally influence the probabilities
of later bias shifts.
**(2)** ``Learning'' (modifying the policy) and policy tests will
consume part of the learner's limited
life-time.
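As a minimal numeric illustration of the acceleration criterion (a sketch under assumptions of our own; the function name and interface are invented, not the paper's implementation), a bias shift counts as ``good'' if the average reward per time after it exceeds the average before it:

```python
def is_good_shift(reward_before: float, time_before: float,
                  reward_after: float, time_after: float) -> bool:
    """A bias shift is 'good' if it raises the average reward intake:
    reward per time after the shift exceeds reward per time before it."""
    return reward_after / time_after > reward_before / time_before

# A shift yielding 5 reward in the 20 time steps since it occurred beats
# an earlier rate of 10 reward in 100 time steps (0.25 > 0.1).
```

Note that the comparison charges the learner for elapsed time, so time spent learning and testing policies (issue (2)) automatically counts against a shift.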

**Previous RL approaches.**
To deal with issues (1) and (2),
what can we learn from traditional RL approaches?
Convergence theorems for existing RL algorithms such
as Q-learning (Watkins and Dayan 1992) require infinite sampling size as
well as strong (usually Markovian) assumptions about the environment, e.g.,
(Sutton 1988; Watkins and Dayan 1992; Williams 1992). They are
of great theoretical interest but of limited relevance to our realistic
limited-life case. For instance, there is no proof that Q-learning will
converge within any given finite time, not even in Markovian environments.
Also, previous RL approaches do not consider the computation time consumed
by learning and policy tests in their objective function. Nor do they
explicitly measure long-term effects of early learning on later learning.

**Basic ideas** (see details in section 2).
To address issues (1) and (2), we treat learning algorithms
just like other time-consuming actions. Their probabilities of being executed at
a given time may depend on the learner's current internal state and
policy. Their only distinguishing feature is that they may also *modify* the policy. In case of policy changes or bias shifts, information
necessary to restore the old policy is pushed on a stack. At any given
time in system life there is only a single training example to estimate
the long-term usefulness of any previous bias shift -- namely the
reward per time since then. This includes all the reward collected
after later bias shifts for which the earlier shift may have set the stage, thus
providing a simple measure of earlier learning's usefulness for later learning.
Occasionally the ``success-story algorithm'' (SSA) uses backtracking to
undo those policy modifications that have not been empirically
observed
to trigger long-term reward accelerations (measured up until the current SSA
call).
For instance, certain bias shifts may have been too specifically
tailored to previous tasks (``overfitting'')
and may be harmful for future inductive transfer.
Those bias shifts that survive SSA represent a lifelong success history.
Until the next SSA call, they will
build the basis for additional bias shifts and
get another chance to justify their existence.
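The backtracking step can be sketched as follows (a toy sketch under assumptions of our own: each policy modification is taken to push a checkpoint of the form (time, cumulative reward so far, undo closure) on the stack; names and data layout are illustrative, not the authors' implementation):

```python
def ssa_backtrack(stack, now, total_reward, policy):
    """Undo the most recent policy modifications until the topmost
    surviving one satisfies the success-story criterion: reward per
    time since it exceeds reward per time since its predecessor
    (or since birth at time 0 with reward 0).

    Assumes now is strictly later than every checkpoint time."""
    while stack:
        t1, r1, undo = stack[-1]                 # most recent surviving shift
        if len(stack) >= 2:
            t0, r0 = stack[-2][0], stack[-2][1]  # its predecessor
        else:
            t0, r0 = 0.0, 0.0                    # start of the learner's life
        speed_since_shift = (total_reward - r1) / (now - t1)
        speed_since_prev = (total_reward - r0) / (now - t0)
        if speed_since_shift > speed_since_prev:
            break                                # criterion holds; stop undoing
        undo(policy)                             # restore the pre-shift policy
        stack.pop()
    return stack
```

Repeated calls enforce a history of surviving checkpoints with increasing reward per time -- a ``success story''.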

Due to unknown reward delays, there is no *a priori* good
way of triggering SSA calls. In principle, however, it is possible to
build policies that
can *learn* to trigger SSA calls. Since learning algorithms are actions and
can be combined (according to the policy) to form more complex learning
algorithms, SSA also allows for embedding the learning strategy
within the policy itself. There is no pre-wired difference between ``learning'',
``metalearning'', ``metametalearning'' etc.
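To make this concrete (a toy sketch; the action names, probability table, and stubbed SSA trigger are invented for illustration), policy-modifying instructions and SSA calls can sit in the same action set as ordinary actions, selected by the very policy they may modify:

```python
import random

def run(policy, steps, rng, trigger_ssa):
    """Execute a policy whose action set mixes ordinary actions,
    a self-modifying action, and an action that triggers SSA."""
    for _ in range(steps):
        actions = list(policy)
        a = rng.choices(actions, weights=[policy[x] for x in actions])[0]
        if a == "boost_left":          # a learning action: modifies the policy
            policy["left"] += 0.05
        elif a == "call_ssa":          # learned triggering of SSA (stubbed)
            trigger_ssa()
    return policy

counter = {"ssa_calls": 0}
policy = {"left": 0.3, "right": 0.3, "boost_left": 0.2, "call_ssa": 0.2}
run(policy, 200, random.Random(0),
    lambda: counter.__setitem__("ssa_calls", counter["ssa_calls"] + 1))
```

Since ``boost_left'' and ``call_ssa'' are sampled like any other action, sequences of such actions form more complex learning algorithms without any pre-wired learning/metalearning distinction.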

**Outline of remainder.**
Section 2 will describe the learner's basic
cycle of operations and SSA details. It will explain how
lifelong histories of reward accelerations can be enforced
despite possible interference from parallel internal or external processes.
Sections 3 and 4
will present two concrete implementations and inductive transfer
experiments with complex,
partially observable environments (POEs).
Some of our POEs are larger and more complex than those considered
in most previous POE work.
