It has already been shown that IS by itself can solve interesting tasks. For instance, earlier work describes two agents A and B living in a partially observable environment with obstacles. They learn to solve a complex task that could not be solved by various TD(λ) Q-learning variants. The task requires (1) agent A to find and take ``key A''; (2) agent A to go to ``door A'' and open it for agent B; (3) agent B to enter through ``door A,'' then find and take another ``key B''; (4) agent B to go to another ``door B'' and open it (to free the way to the goal); (5) one of the agents to reach the goal. Both agents share the same design. Each is equipped with limited ``active'' sight: by executing certain instructions, it can sense obstacles, its own key, the corresponding door, or the goal, within up to 50 unit lengths in front of it. The agent can also move forward (up to 30 unit lengths), change its direction, and turn relative to its key, its door, or the goal. It can use memory (embodied by its IP) to disambiguate inputs. Reward is provided only if one of the agents touches the goal. This agent's reward is 5.0; the other's is 3.0. In the beginning, the goal is found only once every 300,000 basic cycles. Through IS, however, within 130,000 trials the average trial length decreases by a factor of 60: both agents learn to cooperate to accelerate reward intake.
This section's purpose is not to elaborate on how IS can solve difficult tasks. Instead, IS serves as a particular vehicle for implementing the two-module idea in preliminary attempts at studying ``inquisitive'' explorers. Subsection 4.1 will describe empirically observed system behavior in the absence of external rewards. In Subsection 4.2 there will be additional reward for solving externally posed tasks, to see whether curiosity can indeed be useful.
Experimental details. There are instructions (see the appendix) and columns per module. Time is measured as follows. Each of these operations costs one time step: selecting an instruction head, selecting an argument, selecting one of the two values required to compute the next instruction address, and pushing or popping a module column. Other computations cost nothing. This ensures that measured time is of the order of total CPU time. For instance, selecting an instruction head plus six arguments plus the next IP address costs 1 + 6 + 2 = 9 time steps.
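The time accounting above can be illustrated with a minimal sketch. The operation names and the helper functions `time_steps` and `instruction_cost` are hypothetical and introduced only for illustration; the sketch assumes that computing the next IP address always requires selecting exactly two values, as the text suggests.

```python
# Hypothetical names for the operations that cost one time step each.
UNIT_COST_OPS = {
    "select_instruction_head",
    "select_argument",
    "select_address_value",  # one of the two values behind the next address
    "push_column",
    "pop_column",
}


def time_steps(ops):
    """Total time charged for a sequence of operation names.

    Operations outside UNIT_COST_OPS are treated as free.
    """
    return sum(1 for op in ops if op in UNIT_COST_OPS)


def instruction_cost(num_arguments):
    """Time for one instruction: head, its arguments, two address values."""
    ops = (
        ["select_instruction_head"]
        + ["select_argument"] * num_arguments
        + ["select_address_value"] * 2
    )
    return time_steps(ops)


if __name__ == "__main__":
    # Head plus six arguments plus the next IP address: 1 + 6 + 2.
    print(instruction_cost(6))
```

Under these assumptions, an instruction with six arguments is charged 1 + 6 + 2 = 9 time steps, and any bookkeeping not listed in `UNIT_COST_OPS` adds nothing to the measured time.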
Figure 2 shows a point-like agent's two-dimensional environment whose width is 1000 unit lengths. The large ``room'' in the south is a square. Its southwest corner has coordinates (0.0, 0.0), its southeast corner (1000.0, 0.0). There are infinitely many possible agent states: the agent's current position is given by a pair of real numbers. Its initial coordinates are , its initial direction is 0, and its step size is 12 unit lengths. Compare appendix A.3.2.
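As a rough illustration of this state description, the following sketch represents the agent as a real-valued position plus a direction. Several details are assumptions not given in the text: the direction is taken to be in radians with 0 pointing along the positive x axis, the names `AgentState`, `step_forward`, and `in_south_room` are hypothetical, and obstacles and walls are ignored. The start position is left to the caller.

```python
import math
from dataclasses import dataclass

ROOM_SIDE = 1000.0  # side of the square southern room (from the text)
STEP_SIZE = 12.0    # agent step size in unit lengths (from the text)


@dataclass
class AgentState:
    x: float
    y: float
    direction: float = 0.0  # assumption: radians, 0 = positive x axis


def step_forward(state, step=STEP_SIZE):
    """Return the state after one forward step (no collision handling)."""
    return AgentState(
        state.x + step * math.cos(state.direction),
        state.y + step * math.sin(state.direction),
        state.direction,
    )


def in_south_room(state):
    """True if the position lies inside the square room anchored at (0, 0)."""
    return 0.0 <= state.x <= ROOM_SIDE and 0.0 <= state.y <= ROOM_SIDE


if __name__ == "__main__":
    start = AgentState(100.0, 100.0)  # arbitrary illustrative start position
    print(step_forward(start))
```

Because positions are real-valued, the state space is continuous, which is why there are infinitely many possible agent states.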