Experiment 2a.

Consider Figure 2. The entrance to the small room is blocked. Whenever the point-like agent moves into the $100 \times 100$ northeast corner (GOAL1) of the $1000 \times 1000$ field it receives external reward 100.0 and is immediately teleported back to start position

; its direction is reset to 0.0.

Solution. GOAL1 can be reached by a simple stochastic policy making the agent head in the northeast direction. Once the system somehow has fixed the agent's direction in an optimal way, the shortest path requires about 100 calls of MoveAgent (due to the agent's limited stepsize). We will see that the major difficulties in learning such a policy are due to the system's extremely low initial bias towards solving this task, and the extremely rare occurrences of rewarding training instances.

Random behavior results. With its module-modifying capabilities being switched off (LIs such as IncProbLEFT(

) have no effect), the system exhibits random behavior according to its maximum entropy initialization. This behavior leads to about five visits to GOAL1 per 10 million time steps, which shows how little built-in knowledge there is about the nature of the goal. Following Geman et al. [6], the learning system's bias towards solving the task is extremely low, while variance is high. There is a confusing choice of $2 \times 576 = 1152$ module columns (many more than needed for solving the task), all being potential candidates for modifications. The two modules initially have no idea that two of the 23 instruction types (namely, MoveAgent and SetDirection) are particularly important. Furthermore, module modifications can be computed only by appropriate instruction subsequences (including LIs) generated according to the module columns themselves. Initially all instructions are extremely stochastic, however, partly due to the many arguments and addresses to jump to. Reducing this ``free parameter overkill'' by using smaller modules would be equivalent to inserting more a priori knowledge about the task.

Comparison. I compare the performance of two learning systems. One is curious/creative in the sense described in the previous sections, the other is not. The latter is like the former except that its Bet! action has no effect -- external reward is the only type of reward. Ten simulations are conducted for each system. Each simulation takes 200 million time steps. During the 10th and the 20th

time step interval I count how often the agent visits GOAL1. Since external reward initially occurs only every 2 million time steps on average, goal-relevant training examples are very rare.

Results. Table 1 lists the plain and the curious systems' goal visits per

time steps after half their life times, and near death. All 10 results are given due to the high variance. The curious system's results tend to be more than an order of magnitude better than the plain's, which tend to be more than an order of magnitude better than the random's.

Figures 6, 7, 8 show details of the particularly successful simulation 1. We observe that total reward fluctuations due to surprise rewards appear negligible, although they seem important for overall performance improvement (compare Table 1). Between 20 and 30 million steps there are lots of Bet and SSAandCopy actions. Soon afterwards the instruction types Move and SetDirection start to dominate. Around 130 million steps their frequencies suddenly rise dramatically. The final breakthrough is achieved shortly thereafter. The modules' first 100 columns after simulation 1 are shown in Figure 9. Both modules are almost identical. The probability mass of many columns is concentrated in a single value. Additional inspection revealed, however, that some of the most frequently used instruction addresses point to maximum entropy distributions.

Possible explanation. How can curiosity/creativity help? In early system life external rewards occur very rarely. During this time, however, there are many surprise reward-oriented instruction subsequences. Possibly they provide the system with useful fragments of ``probabilistic knowledge'' about consequences of its innate instructions. For instance, to repeatedly surprise itself it may construct little ``probabilistic subprograms'' involving highly probable jumps to appropriate module column addresses. Once it has figured out how to increase the likelihood of certain jumps it may find it easier to solve external tasks that also require jumps.

Details of how the system actually came up with the final matrices remain quite unclear, however. More experimental analysis is necessary to better understand how solving external tasks can benefit from pure curiosity.

Table 1: Experiment 2a: 10 simulations, each taking

time steps. The table lists goal visits achieved by plain and curious systems during the 10th (half life time) and the final

time step interval. Random behavior leads to about 5 visits per

time steps. Curiosity typically seems to help a lot. But note the second simulation's ``outlier.''

Simulation	plain: half	curious: half	plain: final	curious: final
1	4	899	9	15774
2	4	0	4	0
3	3	204	15	381
4	124	558	1036	3150
5	399	539	1000	2073
6	9	1024	82	2937
7	9	1024	123	1181
8	6	891	20	1561
9	58	89	324	446
10	62	5603	476	9963
Average	68	1083	309	3746

**Figure 6:** Experiment 2a (additional external reward): stack pointer evolutions during the first 200 million time steps of the particularly successful simulation 1, sampled at million step intervals. Compare Figures 7 and 8.
$\begin{figure}\centerline{\psfig{figure=chap23/stackptrextern.ps,height=12cm,width=11.5cm}}\end{figure}$

**Figure 7:** Experiment 2a: evolution of both modules' cumulative rewards during the first 200 million time steps of simulation 1, sampled at million step intervals. Plot resolution is too low to show that both curves are not identical -- fluctuations due to surprise rewards seem negligible. Still, surprise rewards seem essential for significant performance improvement -- compare Table 1. The final breakthrough occurs around 140 million steps. Compare Figures 6 and 8.
$\begin{figure}\centerline{\psfig{figure=chap23/rewardextern.ps,height=12cm,width=11.5cm}}\end{figure}$

**Figure 8:** Experiment 2a: execution frequencies of selected instruction types during the first 200 million time steps of simulation 1, sampled at million time step intervals. Between 20 and 30 million steps there are many *Bet!* and *SSAandCopy* instructions. Soon afterwards *Move* and *SetDirection* start to dominate. The final breakthrough is achieved 10 million steps after their frequencies increase dramatically around 130 million steps. Compare Figures 6 and 7.
$\begin{figure}\centerline{\psfig{figure=chap23/instextern.ps,height=12cm,width=11.5cm}}\end{figure}$

**Figure 9:** Experiment 2a: LEFT's (top) and RIGHT's first 100 (of 576) probability distributions after simulation 1. Grey scales indicate probability magnitudes (white close to 0, black close to 1). The probability mass of many (but not all) columns is concentrated in a single value. Both modules are almost identical due to *SSAandCopy* LIs. Their stacks are quite different though.
$\begin{figure}\centerline{\psfig{figure=chap23/policy.eps,width=12cm}}\end{figure}$

The outlier. Note the curious system's outlier in the second simulation. Inspection revealed that during the entire simulation the system never reached the goal at all -- there was not a single training example concerning external reward. This outlier reflects curiosity's potential drawback: it may focus attention on computations that do not have anything to do with external reinforcement. On average, however, surprise rewards do help.

Runtimes. To some readers the runtimes of several hundred million time steps may seem large. Note, however, that the number of training examples (trials) is much smaller because the goal is hit very rarely, especially in the beginning of the learning phase. Some initial trials take millions of time steps simply because the system does not know that there is a source of reward at all, and that some of its instructions (those for interaction with the environment) are much more important for achieving the goal than arithmetic instructions -- not one biases its search towards the goal. The following simulations will require even longer runtimes.