Optimal Ordered Problem Solver (OOPS)

Our recent OOPS [53,55] is a simple, general, theoretically sound, in a certain sense time-optimal way of searching for a universal behavior or program that solves each problem in a sequence of computational problems, continually organizing and managing and reusing earlier acquired knowledge. For example, the -th problem may be to compute the -th event from previous events (prediction), or to find a faster way through a maze than the one found during the search for a solution to the -th problem (optimization).

Let us first introduce the important concept of bias-optimality, which is a pragmatic definition of time-optimality, as opposed to the asymptotic optimality of both LSEARCH and HSEARCH, which may be viewed as academic exercises demonstrating that the notation can sometimes be practically irrelevant despite its wide use in theoretical computer science. Unlike asymptotic optimality, bias-optimality does not ignore huge constant slowdowns:

This definition makes intuitive sense: the most probable candidates
should get the lion's share of the total search time, in a way that
precisely reflects the initial bias.
Now we are ready to provide a general overview of the basic
ingredients of OOPS [53,55]:

**Primitives.**
We start with an initial set of user-defined primitive behaviors.
Primitives may be assembler-like instructions or
time-consuming software, such as, say, theorem provers,
or matrix operators for neural network-like parallel architectures,
or trajectory generators for robot simulations,
or state update procedures for multiagent systems,
etc. Each primitive is represented by a token.
It is essential that those primitives
whose runtimes are not known in advance
can be interrupted at any time.

**Task-specific prefix codes.**
Complex behaviors are represented by token sequences or programs.
To solve a given task represented by task-specific program inputs,
OOPS tries to sequentially compose
an appropriate complex behavior from primitive ones,
always obeying the rules of a given user-defined
initial programming language. Programs are grown
incrementally, token by token; their beginnings
or *prefixes* are immediately executed while being created;
this may modify some task-specific internal state
or memory, and may transfer control back to previously
selected tokens (e.g., loops).
To add a new token to some program
prefix, we first have to wait until the
execution of the prefix so far *explicitly
requests* such a prolongation, by setting an
appropriate signal in the internal state.
Prefixes that cease to request any further tokens
are called *self-delimiting* programs or simply
programs (programs are their own prefixes).
*Binary* self-delimiting programs were studied
by [30] and [8] in the context of
Turing machines [67] and the
theory of Kolmogorov complexity and
algorithmic probability [62,28].
OOPS, however, uses a more practical, not necessarily binary framework.

The program construction procedure above yields *task-specific
prefix codes* on program space: with any given task,
programs that halt because they have found a solution
or encountered some error cannot request any more tokens.
Given the current task-specific inputs, no program can be the prefix of
another one. On a different task, however, the same program
may continue to request additional tokens. This is important
for our novel approach--incrementally growing self-delimiting
programs are unnecessary for the asymptotic optimality
properties of LSEARCH and HSEARCH, but essential
for OOPS.

**Access to previous solutions.**
Let denote a found prefix solving the first tasks.
The search for may greatly profit from
the information conveyed by (or the knowledge embodied by)
which
are stored or *frozen* in special *non*modifiable
memory shared by all tasks,
such that they are accessible to (this is another difference to
*non*incremental LSEARCH and HSEARCH).
For example, might execute a token sequence that
calls as a subprogram, or that
copies into some internal *modifiable* task-specific memory, then modifies
the copy a bit, then applies the slightly edited copy to the current task.
In fact, since the number of frozen programs may grow to a large value,
much of the knowledge
embodied by may be about how to access
and edit and use older ().

**Bias.**
The searcher's initial bias is embodied by initial,
user-defined, task dependent probability
distributions on the finite or infinite search space of
possible program prefixes. In the simplest case we start
with a maximum entropy distribution on the tokens, and
define prefix probabilities as the products of the
probabilities of their tokens.
But prefix continuation probabilities may
also depend on previous tokens in
context sensitive fashion.

**Self-computed suffix probabilities.**
In fact, we permit that any executed prefix
assigns a task-dependent, self-computed
probability distribution to its own possible continuations.
This distribution
is encoded and manipulated in task-specific internal memory.
So unlike with ALS [60] we do not use a prewired
learning scheme to update the probability distribution.
Instead we leave such updates to prefixes
whose online execution modifies
the probabilities of their suffixes.
By, say, invoking previously frozen code
that redefines the probability distribution on future prefix
continuations, the currently tested prefix may completely reshape the
most likely paths through the search space of its own continuations,
based on experience ignored by *non*incremental LSEARCH and HSEARCH.
This may introduce significant problem class-specific knowledge
derived from solutions to earlier tasks.

**Two searches.**
Essentially,
OOPS provides equal resources for two near-*bias-optimal*
searches
(Def. 1)
that run in parallel until
is discovered and stored in non-modifiable memory.
The first is exhaustive; it systematically tests
all possible prefixes on all tasks up to .
Alternative prefixes are tested
on all current tasks in parallel while still growing;
once a task is solved, we remove it from the current set;
prefixes that fail on a single task are discarded.
The second search is much more focused;
it only searches for prefixes that start with , and
only tests them on task , which is safe,
because we already know that such prefixes solve all tasks up to .

**Bias-optimal backtracking**.
HSEARCH and LSEARCH
assume potentially infinite storage. Hence they
may largely ignore questions of storage management.
In any practical system, however, we have to efficiently reuse
limited storage. Therefore, in both searches of OOPS, alternative
prefix continuations are evaluated by a novel, practical, token-oriented
backtracking procedure
that can deal with several tasks in parallel,
given some *code bias* in the form of previously found code.
The procedure always ensures near-*bias-optimality*
(Def. 1):
no candidate behavior gets more time than it
deserves, given the probabilistic bias.
Essentially we conduct a depth-first search in program space,
where the branches of the search tree are program prefixes,
and backtracking (partial resets of partially solved task sets and
modifications of internal states and continuation probabilities)
is triggered once the sum of the runtimes of the current prefix on all current
tasks exceeds the prefix probability multiplied by the total search time so far.

In case of unknown, infinite task sequences we can typically
never know whether we already have found an optimal solver for all
tasks in the sequence. But once we unwittingly do find one,
at most half of the total future run time will be wasted on searching
for alternatives.
Given the initial bias and subsequent
bias shifts due to
no other bias-optimal
searcher
can expect to solve the -th task set
substantially faster than OOPS. A by-product of this
optimality property is that it gives us
a natural and precise measure of bias and bias shifts,
conceptually related to Solomonoff's *conceptual jump size*
of [64,65].

Since there is no fundamental difference between domain-specific problem-solving programs and programs that manipulate probability distributions and thus essentially rewrite the search procedure itself, we collapse both learning and metalearning in the same time-optimal framework.

**An example initial language.**
For an illustrative application, we wrote an interpreter for a
stack-based universal programming language inspired
by FORTH [35],
with initial primitives for defining and calling recursive
functions, iterative loops, arithmetic operations, and domain-specific
behavior.
Optimal metasearching for better search algorithms is enabled
through the inclusion of bias-shifting instructions that can modify the
conditional probabilities of future search options in currently running
program prefixes.

**Experiments.**
Using the assembler-like language mentioned above,
we first teach OOPS something about recursion, by training it to
construct samples of the simple context free language
( 1's followed by 2's),
for up to 30 (in fact, the system discovers a
universal solver for all ).
This takes roughly 0.3 days on a standard personal computer (PC).
Thereafter, within a few additional days,
OOPS demonstrates incremental knowledge transfer:
it exploits aspects of its previously
discovered universal -solver,
by rewriting its search procedure such that it
more readily discovers a universal solver
for all disk *Towers of Hanoi* problems--in
the experiments it solves all
instances up to (solution size ),
but it would also work for .
Previous, less general reinforcement learners
and *non*learning AI planners
tend to fail for much smaller instances.

**Future research**
may focus on devising particularly compact,
particularly reasonable sets of initial codes with
particularly broad practical applicability.
It may turn out that the most useful initial
languages are not traditional programming
languages similar to the FORTH-like
one, but instead based on a handful of primitive
instructions for massively parallel cellular automata
[68,70,76],
or on a few nonlinear operations on matrix-like
data structures such as those used in
recurrent neural network research
[72,44,4].
For example, we could use the principles of
OOPS to create a non-gradient-based, near-bias-optimal
variant of Hochreiter's successful recurrent network
metalearner [19].
It should also be of interest to study probabilistic
*Speed Prior*-based OOPS variants [54]
and to devise applications of OOPS-like methods as
components of universal reinforcement learners (see below).
In ongoing work, we are applying OOPS to the problem
of optimal trajectory planning for robotics
in a realistic physics simulation.
This involves the interesting trade-off
between comparatively fast program-composing primitives or
*``thinking primitives''*
and time-consuming *``action primitives''*,
such as *stretch-arm-until-touch-sensor-input*.