I keep the basic setup from Experiment 2a, but open the entrance to the small room in Figure 2 by removing the block. Whenever the point-like agent moves into the small room's northeast corner (GOAL2), it receives external reward 1000 (10 times as much as for reaching GOAL1) and is teleported back to the start position; its direction is reset to 0.0.
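The reward and teleport dynamics can be sketched as follows. This is a minimal illustration, not the actual simulation code: the coordinates of the start position, the two goals, and the capture radius are assumptions chosen for readability; only the reward magnitudes (100 for GOAL1, since GOAL2's 1000 is ten times as much) and the teleport-home rule with direction reset follow the text.

```python
# Hedged sketch of the two-goal reward/teleport dynamics.
# Positions and radius are illustrative assumptions, not values from the paper.

START = (0.0, 0.0)   # assumed start position
GOAL1 = (5.0, 5.0)   # assumed location of GOAL1
GOAL2 = (9.0, 9.0)   # assumed NE corner of the small room (GOAL2)
EPS = 0.5            # assumed goal-capture radius

def dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

class TwoGoalRoom:
    def __init__(self):
        self.pos = START
        self.direction = 0.0

    def check_goals(self):
        """Return external reward for the current position.

        Entering either goal teleports the agent home and resets its
        direction to 0.0, exactly as described in the text.
        """
        if dist(self.pos, GOAL2) < EPS:
            self.pos, self.direction = START, 0.0
            return 1000.0   # 10x the GOAL1 reward
        if dist(self.pos, GOAL1) < EPS:
            self.pos, self.direction = START, 0.0
            return 100.0
        return 0.0
```

Note that the teleport-on-GOAL1 rule is what makes the harder task nontrivial: a trajectory that brushes GOAL1 on its way northeast is sent home before it can collect the larger reward.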
The idea is: knowledge collected by solving the simpler task (reaching GOAL1) may help to solve the more difficult, but also more rewarding task (reaching GOAL2). Note that both tasks are solvable by similar stochastic policies making the agent head northeast. To receive a lot of reward, however, the agent must avoid GOAL1 on its way to GOAL2 -- otherwise it will be teleported back home.
This setup involves an obvious goal-directed exploration component: it is likely that the system will keep an exploration strategy that has helped to improve a policy for reaching GOAL1 (successful exploration strategies have an ``evolutionary advantage''). This strategy may later also help to improve a policy for reaching GOAL2 -- in principle, IS can use experience with exploration strategies to evaluate and refine them.
Random behavior results. With its module-modifying capabilities switched off (LIs such as IncProbLEFT() have no effect), the system exhibits random behavior according to its maximum entropy initialization. On average this behavior leads to fewer than one visit to GOAL2 per 10 million time steps.
Comparison. Again I compare the performance of a ``plain'' and a ``curious'' system. Again the former is like the latter except that its Bet! instructions have no effect. Ten simulations are conducted for each system. Each simulation takes 4 billion time steps. I keep track of how often the agent reaches GOAL2 per fixed interval of time steps.
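The bookkeeping behind this measurement can be sketched as binning goal visits into consecutive windows of time steps. The function below is a hypothetical helper for illustration; the window size used in the actual experiments is not specified here.

```python
# Hedged sketch: count GOAL2 visits per consecutive window of time steps.
# The window size and the visit stream are illustrative assumptions.

def visit_frequencies(visit_steps, total_steps, window):
    """Given the time steps at which GOAL2 was reached, return the
    number of visits falling into each consecutive window of size
    `window` over the first `total_steps` steps."""
    counts = [0] * (total_steps // window)
    for t in visit_steps:
        if t < total_steps:
            counts[t // window] += 1
    return counts

# Example: visits at steps 3, 7, and 12 over 20 steps, window of 10,
# yield two visits in the first window and one in the second.
print(visit_frequencies([3, 7, 12], 20, 10))  # -> [2, 1]
```

Averaging such per-window counts over the ten simulations of each system gives the goal visit frequencies plotted in Figures 10 and 11.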
Results. Figure 10 plots average goal visit frequencies during the first 500 million time steps. During this period, the curious system's results are clearly better than the plain system's, and its performance improves much faster.
As more simulation time is spent, however, initial differences tend to level out. Figure 11 plots average goal visit frequencies during the entire 4 billion steps. As goal-oriented training examples become more and more frequent, we see that the performances of plain and curious systems eventually reach comparable levels -- the former even becomes a bit better than the latter.
Advantage in the case of rare rewards? For this particular experiment, self-generated surprise rewards appear to boost initial external reward. Sufficient training time, however, cancels out the initial advantage. Could it be that curiosity is particularly useful as long as external reward is extremely rare and little has been learned so far? This may seem intuitively plausible: in the beginning, performance tends to be so bad that time spent on extensive exploration seems a good investment with little downside potential -- things cannot get much worse. Once the system's strategy is rather efficient and yields frequent external rewards, however, additional progress depends on fine-tuning it. In this stage surprise reward-oriented curiosity may distract more than it helps -- it tends to consume time without necessarily contributing much to optimizing goal-directed trajectories. Many additional experiments are necessary, however, to determine whether the above is a typical result.