More Formally

The explorer's life in environment $E$ lasts from time 0 (birth) to unknown time $T$ (death). Two matrices ($Left$ and $Right$) of modifiable real values represent the learner's modules. $Left$'s $i$-th variable, vector-valued column is denoted $Left_i$; its $j$-th real-valued component $Left_{ij}$; similarly for $Right$. A variable *InstructionPointer* (*IP*) with range $\{1, \ldots, n\}$ (where $n$ is the number of columns of each matrix) always points to one of the module pair's columns. $S$ is the learner's variable internal state with $i$-th component $S_i$ for $i \in \{1, \ldots, m\}$.
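
The representation can be pictured with a few lines of code. The following is a minimal sketch, not the paper's implementation: the sizes `n_cols`, `n_instr`, `m` and the positivity of the matrix entries are illustrative assumptions (the appendix specifies the actual details).

```python
import random

# Illustrative sizes, not taken from the paper:
n_cols, n_instr, m = 64, 16, 32

# Two modifiable real-valued matrices, one column per possible IP value.
# Entries are kept positive here so that the normalized products used
# for instruction selection below are valid probabilities (an assumption).
Left  = [[random.uniform(0.01, 1.0) for _ in range(n_instr)] for _ in range(n_cols)]
Right = [[random.uniform(0.01, 1.0) for _ in range(n_instr)] for _ in range(n_cols)]

IP = 0              # InstructionPointer: index of the currently active column
S  = [0.0] * m      # internal state with components S[0], ..., S[m-1]
```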

Throughout its life the learner repeats the following basic cycle over and over: select and execute instruction $b_j \in B$ with probability $P(b_j \mid IP, Left, Right)$ (here $B$ denotes the set of possible instructions), where

$$P(b_j \mid IP = i, Left, Right) = \frac{Left_{ij} \, Right_{ij}}{\sum_k Left_{ik} \, Right_{ik}}$$

for $i \in \{1, \ldots, n\}$, $j \in \{1, \ldots, |B|\}$.
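
In code, one cycle's instruction selection might look as follows -- a minimal sketch continuing the setup above and assuming the normalized-product rule; `select_instruction` is an illustrative name, and argument selection, instruction execution, and learning are omitted.

```python
import random

def select_instruction(Left, Right, ip):
    # Both modules jointly determine the distribution over instructions:
    # instruction j is likely only if BOTH columns assign it high weight.
    weights = [l * r for l, r in zip(Left[ip], Right[ip])]
    return random.choices(range(len(weights)), weights=weights)[0]

j = select_instruction(Left, Right, IP)   # index of the instruction to execute
```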

**Reward.**
Occasionally $E$ may distribute real-valued *external reward* equally between both modules. $R^{ext}(t)$ is each module's cumulative external reward obtained between time 0 and time $t$, where $0 \le t \le T$. In addition there may be occasional *surprise rewards* triggered by the special instruction *Bet!(x,y,c,d)*.
Suppose the two modules have agreed on executing a *Bet!* instruction, given a certain *IP* value. The probabilities of its four arguments $x$, $y$, $c$, and $d$ depend on both modules: the arguments are selected according to the four probability distributions $P(x \mid \cdot)$, $P(y \mid \cdot)$, $P(c \mid \cdot)$, $P(d \mid \cdot)$ (details in the appendix). If $c = d$ then no module will be rewarded. $c > d$ means that $Left$ bets on $x = y$, while $Right$ bets on $x \ne y$. $c < d$ means that $Left$ bets on $x \ne y$, while $Right$ bets on $x = y$. Execution of the instruction will result in an immediate test: which module is wrong, which is right? The former will be punished (surprise reward $-1$), the other one will be rewarded (surprise reward $+1$).
Note that in this particular implementation the modules always
bet a fixed amount of 1 -- alternative implementations, however,
may compute real-valued bids via appropriate instruction sequences.
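
The payoff logic of a single *Bet!* can be spelled out directly. This is a sketch of the rule as stated above, not of the full instruction (which also draws its arguments from the module-dependent distributions detailed in the appendix); `resolve_bet` is a hypothetical name.

```python
def resolve_bet(x, y, c, d):
    """Return (surprise_left, surprise_right) for one Bet!(x,y,c,d).

    c == d: no directed bet, so no module is rewarded.
    c >  d: Left bets on x == y, Right bets on x != y.
    c <  d: Left bets on x != y, Right bets on x == y.
    The correct module gets +1, the wrong one -1 (fixed stake of 1)."""
    if c == d:
        return (0, 0)
    left_bets_equal = c > d
    left_is_right = (left_bets_equal == (x == y))
    return (1, -1) if left_is_right else (-1, 1)
```

Note that the two payoffs always sum to zero -- exactly the zero-sum property exploited further below.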

**Interpretation of surprise rewards.**
Instructions are likely to be executed only if both modules
collectively assign them high conditional probabilities,
given $IP$ and $S$.
In this sense both must ``agree''
on each executed instruction of the lifelong computation process.
In particular,
both collectively set the arguments $x, y, c, d$ in case they
decide to execute a *Bet!* instruction. By setting $c \ne d$
they express that their predictions of the *Bet!* outcome differ.
Hence $Left$ will be rewarded for
luring $Right$ into agreeing upon instruction subsequences
(algorithms)
that include *Bet!* instructions demonstrating that certain
calculations yield results different from
what $Right$ expects, and vice versa.
Thus each module is motivated to discover algorithms whose
outcomes surprise the other;
but each may also reduce the probability of
algorithms it does *not* expect to surprise the other.

The sheer existence of the *Bet!* instruction
motivates each module to act not only to receive external
reward but also to obtain surprise reward, by discovering
algorithmic regularities the other module still finds
surprising -- a type of curiosity.

**Trivial and random results.**
Why not provide reward if $c = d$ and $x = y$
(meaning both modules
rightly believe in $x = y$)?
Because then both would soon focus on this particular way of
making a correct and rewarding prediction, and do nothing else.
Why not provide punishment if $c = d$ and $x \ne y$
(meaning that both modules are wrong)?
Because then both modules would soon be discouraged from
making any prediction at all.
In the former case the truth of $x = y$ is considered a
well-known, ``trivial'' fact whose confirmation does not deserve reward.
In the latter case the truth of $x \ne y$ is
considered a subjectively ``random,'' irregular result.
Surprise rewards
can occur only when both modules' opinions differ.
They reflect one module's disappointed *confident* expectation,
and the other's justified one,
where, by definition, ``confidence'' translates into ``agreement on the
surprising instruction sequence'' --
no surprise without such confidence.

**Examples of learnable regularities.**
A Turing-equivalent instruction set
(one that permits construction of a
universal Turing machine) allows
exploitation of arbitrary computable regularities
[11,2,39,14]
to trigger or avoid surprises.
For instance, in partially predictable environments the following
types of regularities may help to reliably generate matches of
computational outcomes. *(1) Observation/prediction:* selected inputs
computed via the environment (observations obtained through ``active
perception'') may match the outcomes of earlier internal computations
(predictions). *(2) Observation/explanation:* memorized inputs
computed via the environment may match the outcomes of later internal
computations (explanations). *(3) Planning/acting:* outcomes
of internal computations (planning processes) may match desirable
inputs *later* computed via the environment (with the help of
environment-changing instructions). *(4) ``Internal'' regularities:*
the following computations yield matching results --
subtracting 2 from 14, adding 3, 4, and 5, multiplying 2 by 6.
Or: apparently, the computation of the truth value of
``$n$ is the sum of two primes'' yields the same result
for each even integer $n > 2$ (Goldbach's conjecture; see the sketch
after this paragraph).
*(5) Mixtures of (1-4).*
For instance, raw inputs may be too noisy to be precisely predictable.
Still, there may be internally computable, predictable, informative input
transformations [33]. For example, hearing the first
two words of the sentence ``John eats chips'' does not make the word
``chips'' predictable, but at least it is likely that the third word will
represent something edible. Examples (1-5) mainly differ in the degree
to which the environment is involved in the computation processes, and
in the temporal order of the computations.
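
Regularity type (4) can be made concrete with a tiny program: the truth value of ``$n$ is the sum of two primes'' is a computation whose outcome keeps matching (namely, *true*) across all even $n > 2$ tried so far. A sketch for illustration only; `is_prime` and `sum_of_two_primes` are names chosen here, not taken from the paper.

```python
def is_prime(n):
    if n < 2:
        return False
    return all(n % p for p in range(2, int(n ** 0.5) + 1))

def sum_of_two_primes(n):
    # Truth value of "n is the sum of two primes".
    return any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))

# The same computation yields the same result for every even n > 2 tested:
assert all(sum_of_two_primes(n) for n in range(4, 2000, 2))
```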

**Curiosity's utility?** The two-module system is supposed to solve
self-generated tasks in an unsupervised manner. But can the knowledge
collected in this way help to solve externally posed tasks? Intuition
suggests that the more one knows about the world the easier it will be
to maximize external reward. In fact, later I will present an example
where a curious system indeed outperforms a non-curious one. This does
not reflect a universal law though: in general there is no guarantee
that curiosity will not turn out to be harmful (for example, by ``killing the
cat'' [40]).

**Relative reward weights?**
Let $R_{Left}(t)$ and $R_{Right}(t)$
denote $Left$'s and $Right$'s
respective total cumulative rewards obtained
between time 0 and time $t$.
The sum of both modules' surprise rewards always remains zero:
we have $R_{Left}(t) + R_{Right}(t) = 2 R^{ext}(t)$
for all $t$.
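
In detail, with $R^{sur}_{Left}(t)$ and $R^{sur}_{Right}(t)$ denoting the modules' cumulative surprise rewards (bookkeeping symbols introduced here for illustration):

```latex
R_{Left}(t) + R_{Right}(t)
  = 2\,R^{ext}(t)
    + \underbrace{R^{sur}_{Left}(t) + R^{sur}_{Right}(t)}_{=\,0,
      \text{ since each executed \emph{Bet!} pays $+1$ and $-1$}}
  = 2\,R^{ext}(t).
```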

If we adopt the traditional hope that exploration will help to accelerate the intake of external reward, then the zero-sum nature of the surprise rewards leaves less reason to worry about their weight relative to other rewards than in previous approaches, whose surprise rewards did not add up to zero.

**Enforcing fairness.**
To avoid situations where one module consistently
outperforms the other, the instruction set
includes a special *learning instruction* (LI)
that copies the
currently superior module onto the other (see the appendix for details).
This LI (with never-vanishing probability) will
occasionally bring both modules on a par with each other.
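
The following sketch only illustrates the idea of this LI; the actual triggering conditions and probabilities are in the appendix, and `copy_superior_module` is a hypothetical name.

```python
import copy

def copy_superior_module(Left, Right, r_left, r_right):
    # Overwrite the currently inferior module with a copy of the superior
    # one (judged by cumulative reward so far), putting both on a par.
    if r_left > r_right:
        return Left, copy.deepcopy(Left)     # Right becomes a copy of Left
    if r_right > r_left:
        return copy.deepcopy(Right), Right   # Left becomes a copy of Right
    return Left, Right                       # already tied: nothing to copy
```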

In principle, each module could learn to outsmart the other by executing
subsequences of instructions that include LIs. But how can we ensure that
each module indeed improves? Note that arithmetic actions affecting $S$ and jump instructions affecting *IP* cause a highly non-Markovian
setting and prevent traditional RL algorithms based on dynamic
programming from being applicable. For such reasons I use *Incremental
Self-improvement (IS)* [35,34] to
deal with both modules' complex spatio-temporal credit assignment problem.
