

More Formally

The explorer's life in environment ${\cal E}$ lasts from time 0 (birth) to unknown time $T$ (death). Two $m \times n$ matrices (${\sc Left }$ and ${\sc Right}$) of modifiable real values represent the learner's modules. ${\sc Left }$'s $k$-th variable, vector-valued column is denoted ${\sc Left }_k$; its $l$-th real-valued component ${\sc Left }_{k,l}$; similarly for ${\sc Right}$. A variable InstructionPointer (IP) with range $\{1, \ldots, m \}$ always points to one of the module pair's columns. ${\cal S}$ is the learner's variable internal state with $i$-th component ${\cal S}_i \in \{-M, \ldots, -2, -1, 0, 1, 2, \ldots, M \}$ for $i \in \{1, \ldots, m \}$.
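For concreteness, here is a minimal Python sketch of the data structures just introduced (purely illustrative, with assumed sizes; it is not the concrete implementation described in the appendix):

\begin{verbatim}
import numpy as np

m, n, M = 10, 8, 100      # illustrative sizes, not taken from the paper

# Two m x n matrices of modifiable real values: the learner's two modules.
Left  = np.ones((m, n))
Right = np.ones((m, n))

# Instruction pointer, ranging over the module pair's columns
# (0-based here, i.e. {0, ..., m-1} instead of {1, ..., m}).
IP = 0

# Internal state S with integer components in {-M, ..., M}.
S = np.zeros(m, dtype=int)
\end{verbatim}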

Throughout its life the learner repeats the following basic cycle over and over: select and execute instruction $a_j \in {\cal A}$ with probability $Q(IP,j)$ (here ${\cal A}$ denotes a set of possible instructions), where

\begin{displaymath}
Q(i,j) =
\frac{f({\sc Right}_{i,j}, {\sc Left }_{i,j})}
{\sum_k f({\sc Right}_{i,k},{\sc Left }_{i,k})}
\end{displaymath}

for $i \in \{1, \ldots, m \}$, $j \in \{1, \ldots, n \}$. The collective decision function $f(x,y)$ maps real-valued $x,y$ to real values. Given an appropriate $f$, each module may ``veto'' instructions suggested by the other module. Only instructions that are strongly supported by both modules are highly likely to be selected (in the experiments I use $f(x,y) = xy$). Some instructions require parameters -- these are selected in an analogous fashion, using probability distributions $Q(IP+1,.)$, $Q(IP+2,.)$, ..., $Q(IP+v,.)$, where $v$ is the number of parameters. Instructions consume time and may change (1) the environment ${\cal E}$: there are instructions such as MoveAgentInCurrentDirection(StepSize) and TurnAgent(Angle); (2) the state ${\cal S}$: new inputs from the environment may affect the internal state, and there are also arithmetic instructions such as Add(x,y,z), interpreted as ${\cal S}_z \leftarrow ({\cal S}_x + {\cal S}_y)~mod~M$; (3) the instruction pointer IP, for example through conditional jump instructions; and (4) the modules themselves: special learning instructions (LIs) can modify the parameters determining the conditional probability distributions. To ensure non-vanishing exploration potential, LIs may not generate module modifications that would drive any instruction probability to zero. LIs and other instructions can be combined to form more complex (probabilistic) learning algorithms. See Figure 1 for an illustration of how instruction sequences are drawn from two probabilistic policies; see the appendix for details of a concrete implementation.
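The selection step can be sketched as follows in Python (a minimal illustration assuming the data structures from the earlier sketch, 0-based indices, $f(x,y)=xy$, and nonnegative matrix entries so that the normalization yields a valid distribution):

\begin{verbatim}
import numpy as np

def f(x, y):
    # Collective decision function used in the experiments: f(x, y) = x * y.
    return x * y

def Q(Left, Right, i):
    # Conditional probability distribution Q(i, .) over the n possible
    # instructions (or parameters) at column i, as in the displayed equation.
    support = f(Right[i], Left[i])        # componentwise, length n
    return support / support.sum()

def select_instruction(Left, Right, IP, rng=np.random.default_rng()):
    # Draw instruction index j with probability Q(IP, j).
    n = Left.shape[1]
    return rng.choice(n, p=Q(Left, Right, IP))
\end{verbatim}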

Figure 1: Snapshot of parts of two policies and some memory cells (which are viewed as part of the policy environment). Each policy is a set of variable probability distributions over $n_{ops}$ possible instructions or parameters. For simplicity, in this hypothetical example, $n_{ops} = 8$, and there is only one output instruction for manipulating the external environment, which in turn may affect memory cells (active perception). The instruction pointer IP currently points to a particular pair of distributions which collectively determine the probability of selecting a particular instruction (here: JMPLEQ, whose probability is high). JMPLEQ requires three parameters, generated according to the subsequent probability distributions.
\begin{figure}\centerline{\psfig{figure=chap23/cells2.ps,angle=0,width=12.5cm}}\end{figure}

Reward. Occasionally ${\cal E}$ may distribute real-valued external reward equally between both modules. $R(t)$ is each module's cumulative external reward obtained between time 0 and time $t > 0$, where $R(0) = 0$. In addition there may be occasional surprise rewards triggered by the special instruction Bet!(x,y,c,d) $\in {\cal A}$. Suppose the two modules have agreed on executing a Bet! instruction, given a certain IP value. The probabilities of its four arguments $x,y \in \{ 1, \ldots, m \}$ and $c,d \in \{ -1, 1 \}$ depend on both modules: the arguments are selected according to the four probability distributions $Q(IP+1,.)$, $Q(IP+2,.)$, $Q(IP+3,.)$, $Q(IP+4,.)$ (details in the appendix). If $c = d$ then no module will be rewarded. $c = 1, d = -1$ means that ${\sc Left }$ bets on ${\cal S}_x = {\cal S}_y$ while ${\sc Right}$ bets on ${\cal S}_x \neq {\cal S}_y$; $c = -1, d = 1$ means the reverse. Execution of the instruction results in an immediate test: which module is right, which is wrong? The wrong one is punished (surprise reward $-1$), the right one is rewarded ($+1$). Note that in this particular implementation the modules always bet a fixed amount of 1 -- alternative implementations, however, may compute real-valued bids via appropriate instruction sequences.
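The surprise-reward semantics of Bet! can be sketched as follows (an illustrative Python fragment under the same 0-based indexing assumptions as before, not the concrete implementation of the appendix):

\begin{verbatim}
def bet_rewards(S, x, y, c, d):
    # Return the surprise rewards (for Left, for Right) of Bet!(x, y, c, d).
    if c == d:
        return 0, 0                    # identical opinions: no surprise reward
    left_bets_equal = (c == 1)         # c=1, d=-1: Left bets S_x == S_y
    outcome_equal = (S[x] == S[y])
    if outcome_equal == left_bets_equal:
        return +1, -1                  # Left was right, Right was wrong
    else:
        return -1, +1                  # Left was wrong, Right was right
\end{verbatim}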

Interpretation of surprise rewards. Instructions are likely to be executed only if both modules collectively assign them high conditional probabilities, given ${\cal E}$ and ${\cal S}$. In this sense both must ``agree'' on each executed instruction of the lifelong computation process. In particular, both collectively set the arguments $x,y,c,d$ whenever they decide to execute a Bet! instruction; by setting $c \neq d$ they express that their predictions of the Bet! outcome differ. Hence ${\sc Left }$ will be rewarded for luring ${\sc Right}$ into agreeing upon instruction subsequences (algorithms) that include Bet! instructions demonstrating that certain calculations yield results different from what ${\sc Right}$ expects, and vice versa. Thus each module is motivated to discover algorithms whose outcomes surprise the other, but each may also reduce the probability of algorithms that it does not expect to surprise the other.

The sheer existence of the Bet! instruction motivates each module to act not only to receive external reward but also to obtain surprise reward, by discovering algorithmic regularities the other module still finds surprising -- a type of curiosity.

Trivial and random results. Why not provide reward if ${\cal S}_x = {\cal S}_y$ and $c = d = 1$ (meaning both modules rightly believe in ${\cal S}_x = {\cal S}_y$)? Because then both would soon focus on this particular way of making a correct and rewarding prediction, and do nothing else. Why not provide punishment if ${\cal S}_x = {\cal S}_y$ and $c = d = -1$ (meaning that both modules are wrong)? Because then both modules would soon be discouraged from making any prediction at all. In case $c = d = 1$ the truth of ${\cal S}_x = {\cal S}_y$ is considered a well-known, ``trivial'' fact whose confirmation does not deserve reward. In case $c = d = -1$ the truth of ${\cal S}_x = {\cal S}_y$ is considered a subjectively ``random,'' irregular result. Surprise rewards can occur only when the two modules' opinions differ. They reflect one module's disappointed confident expectation and the other's justified one, where, by definition, ``confidence'' translates into ``agreement on the surprising instruction sequence'' -- no surprise without such confidence.
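The cases above can be summarized in a small payoff table; each entry gives the surprise rewards of $({\sc Left }, {\sc Right})$ for the corresponding combination of bet arguments and test outcome:

\begin{displaymath}
\begin{array}{l|cc}
 & {\cal S}_x = {\cal S}_y & {\cal S}_x \neq {\cal S}_y \\
\hline
c = d = 1 & (0,0)~\mbox{(``trivial'')} & (0,0) \\
c = d = -1 & (0,0)~\mbox{(``random'')} & (0,0) \\
c = 1, d = -1 & (+1,-1) & (-1,+1) \\
c = -1, d = 1 & (-1,+1) & (+1,-1)
\end{array}
\end{displaymath}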

Examples of learnable regularities. A Turing-equivalent instruction set $\cal A$ (one that permits construction of a universal Turing machine) allows exploitation of arbitrary computable regularities [11,2,39,14] to trigger or avoid surprises. For instance, in partially predictable environments the following types of regularities may help to reliably generate matches of computational outcomes. (1) Observation/prediction: selected inputs computed via the environment (observations obtained through ``active perception'') may match the outcomes of earlier internal computations (predictions). (2) Observation/explanation: memorized inputs computed via the environment may match the outcomes of later internal computations (explanations). (3) Planning/acting: outcomes of internal computations (planning processes) may match desirable inputs later computed via the environment (with the help of environment-changing instructions). (4) ``Internal'' regularities: the following computations yield matching results -- subtracting 2 from 14, adding 3, 4, and 5, multiplying 2 by 6. Or: apparently, the computation of the truth value of ``$n$ is the sum of two primes'' yields the same result for each even integer $n > 2$ (Goldbach's conjecture). (5) Mixtures of (1-4). For instance, raw inputs may be too noisy to be precisely predictable. Still, there may be internally computable, predictable, informative input transformations [33]. For example, hearing the first two words of the sentence ``John eats chips'' does not make the word ``chips'' predictable, but at least it is likely that the third word will represent something edible. Examples (1-5) mainly differ in the degree to which the environment is involved in the computation processes, and in the temporal order of the computations.
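As a toy illustration of the ``internal'' regularities in (4), the following Python fragment (purely illustrative, not part of the system) verifies that the three computations above yield matching results, and checks Goldbach's conjecture for small even numbers:

\begin{verbatim}
def is_prime(n):
    return n > 1 and all(n % k for k in range(2, int(n ** 0.5) + 1))

# Three different computations with the same outcome (12):
assert 14 - 2 == 3 + 4 + 5 == 2 * 6

# "n is the sum of two primes" holds for every even n in 4..998:
for n in range(4, 1000, 2):
    assert any(is_prime(p) and is_prime(n - p) for p in range(2, n // 2 + 1))
\end{verbatim}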

Curiosity's utility? The two-module system is supposed to solve self-generated tasks in an unsupervised manner. But can the knowledge collected in this way help to solve externally posed tasks? Intuition suggests that the more one knows about the world the easier it will be to maximize external reward. In fact, later I will present an example where a curious system indeed outperforms a non-curious one. This does not reflect a universal law though: in general there is no guarantee that curiosity will not turn out to be harmful (for example, by ``killing the cat'' [40]).

Relative reward weights? Let $RL(t)$ and $RR(t)$ denote ${\sc Left }$'s and ${\sc Right}$'s respective total cumulative rewards obtained between time 0 and time $t > 0$. The sum of both modules' surprise rewards always remains zero: we have $RL(t) + RR(t) - 2R(t) = 0$ for all $t$.
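To make the bookkeeping explicit, write $SL(t)$ and $SR(t)$ (notation introduced here only for this remark) for the modules' cumulative surprise rewards up to time $t$. Each executed Bet! with $c \neq d$ adds $+1$ to one of them and $-1$ to the other, so

\begin{displaymath}
SL(t) + SR(t) = 0, \qquad RL(t) = R(t) + SL(t), \qquad RR(t) = R(t) + SR(t),
\end{displaymath}

and hence $RL(t) + RR(t) - 2R(t) = 0$ for all $t$.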

If we adopt the traditional hope that exploration will contribute to accelerating environmental reward, then the zero-sum nature of the surprise rewards leaves less reason to worry about the relative weights of surprise versus other rewards than in previous approaches, whose surprise rewards did not add up to zero.

Enforcing fairness. To avoid situations where one module consistently outperforms the other, the instruction set includes a special LI that copies the currently superior module onto the other (see the appendix for details). This LI (with never-vanishing probability) will occasionally bring both modules on a par with each other.
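A minimal sketch of such a copy instruction, assuming superiority is measured by the cumulative rewards $RL$ and $RR$ (the concrete criterion and implementation are given in the appendix):

\begin{verbatim}
def copy_superior_module(Left, Right, RL, RR):
    # Overwrite the currently inferior module with the superior one,
    # bringing both modules on a par with each other.
    if RL > RR:
        Right[:] = Left
    elif RR > RL:
        Left[:] = Right
\end{verbatim}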

In principle, each module could learn to outsmart the other by executing instruction subsequences that include LIs. But how can we ensure that each module indeed improves? Note that arithmetic instructions affecting ${\cal S}$ and jump instructions affecting IP create a highly non-Markovian setting in which traditional RL algorithms based on dynamic programming are not applicable. For such reasons I use Incremental Self-improvement (IS) [35,34] to deal with both modules' complex spatio-temporal credit assignment problem.

