Task. The first experiment involves the partially observable maze shown in figure 2. The system has to discover a path leading from start position S to goal G. There are four actions with obvious semantics: go west, go north, go east, go south. There are 16 possible observations: the agent can only ``see'' which of the 4 adjacent fields are blocked. Although there are 62 possible agent positions, there are only 9 distinct, highly ambiguous inputs. (Not all of the 16 possible observations actually occur in this maze. Hence the system may occasionally generate unachievable subgoals, in which case control is never transferred to the next agent.) There is no deterministic, memory-free policy for solving this task, and stochastic memory-free policies also perform poorly. For instance, input 5 stands for ``fields to the left and to the right of the agent are blocked''. The optimal action in response to input 5 depends on the subtask: at the beginning of a trial it is ``go north'', later ``go south'', and near the end ``go north'' again. Hence at least three reactive agents are necessary to solve this POMDP.
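The observation encoding can be sketched as follows. This is a minimal illustration, not the paper's implementation; the bit order (west, north, east, south) is an assumption chosen so that ``fields to the left and to the right are blocked'' yields code 5, consistent with the text:

```python
# Sketch of the 4-bit observation encoding (assumed bit order: W, N, E, S).
# The agent "sees" only which of its 4 neighbours are blocked, giving
# 2^4 = 16 possible codes; distinct maze positions may share a code.

def observe(maze, x, y):
    """Return a 4-bit observation code for position (x, y).

    maze[y][x] is True if the cell is blocked; the maze is assumed to
    have a blocked border so all neighbour lookups stay in bounds.
    """
    neighbours = [(x - 1, y), (x, y - 1), (x + 1, y), (x, y + 1)]  # W, N, E, S
    code = 0
    for bit, (nx, ny) in enumerate(neighbours):
        if maze[ny][nx]:
            code |= 1 << bit
    return code
```

With this bit order, blocked west (bit 0) and east (bit 2) neighbours give 1 + 4 = 5.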
Reward function. Once the system hits the goal it receives a reward of 100; otherwise the reward is zero. Rewards are discounted by a factor $\gamma$ per step.
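The reward function, and the resulting preference for shorter paths under discounting, can be sketched as follows (a generic illustration; the discount factor is left as a free parameter):

```python
def reward(position, goal):
    """Reward of 100 on hitting the goal, zero otherwise."""
    return 100.0 if position == goal else 0.0

def discounted_return(path_length, gamma):
    """Return of a trial reaching the goal after `path_length` steps:
    every reward is zero except the final 100, discounted once per step.
    Shorter paths therefore yield strictly higher returns."""
    return gamma ** (path_length - 1) * 100.0
```

For example, under any discount factor below 1, the optimal 28-step path discussed later has a higher return than a 30-step path.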
Parameters and experimental set-up. We compare systems with 3, 4, 6, 8, and 12 agents and noise-free actions. We also compare systems with 4, 8, and 12 agents whose actions, during both learning and testing, are replaced by random actions with probability 10%. One experiment consists of 100 simulations of a given system. Each simulation consists of 20,000 trials; the maximal trial length is 1000 steps. After every 500th trial there is a test run during which actions and subgoals with maximal table entries are selected (the probability of selecting maximal entries is set to 1.0). If the system does not find the goal during a test run, the trial's outcome is counted as 1000 steps.
After a coarse search through parameter space, we use the following parameters for all experiments: learning rates of .05 and .2, and a decay parameter of .9, for both HQ-tables and Q-tables. The probability of selecting actions and subgoals with maximal table entries is linearly increased to 1.0 during learning. All table entries are initialized with 0.
For purposes of comparison, we also ran 20,000 trials during which at most 1000 actions were picked randomly. We also tried Q($\lambda$)-learning augmented as follows: the current input consists of the current observation plus the most recent observation that differs from it. At least in theory this Q($\lambda$) variant might be able to solve the problem.
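The input construction for the augmented Q($\lambda$) baseline can be sketched as follows (the class name is hypothetical; only the pairing of current and last-differing observation comes from the text):

```python
class AugmentedInput:
    """Builds the augmented state: the pair (current observation,
    most recent observation that differed from the current one)."""

    def __init__(self):
        self.prev_obs = None        # observation at the previous step
        self.last_different = None  # last observation differing from prev_obs

    def step(self, obs):
        # A change in observation makes the previous one the memory content.
        if self.prev_obs is not None and obs != self.prev_obs:
            self.last_different = self.prev_obs
        self.prev_obs = obs
        return (obs, self.last_different)
```

For instance, the observation stream 5, 5, 3 yields the states (5, None), (5, None), (3, 5): the memory only updates when the observation changes, so repeated inputs do not overwrite it.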
Results. Augmented Q($\lambda$) failed to find stable solutions. HQ, however, worked well. Figure 3A plots average test run length against trial number. Within 20,000 trials all systems almost always find near-optimal deterministic policies.
Consider Table 1. The largest systems are always able to decompose the POMDP into a sequence of RPPs. The average number of steps is close to optimal. In approximately 1 out of 8 cases, the optimal 28-step path is found. In most cases one of the 30-step solutions is found. Since the number of 30-step solutions is much larger than the number of 28-step solutions (there are many more appropriate subgoal combinations), this result is not surprising.
Systems with more than 3 agents perform better -- here the system profits from having more free parameters. More than 6 agents do not help, though. All systems perform significantly better than the random system, which finds the goal in only 19% of all 1000-step trials.
In the case of noisy actions (each selected action is replaced by a random action with probability 10%), the systems still reach the goal in most of the simulations (see figure 3B). In the final trial of each simulation, systems with 4 (8, 12) agents find the goal with probability 86% (87%, 84%). There is no significant difference between smaller and larger systems.
We also studied how the system adds agents during the learning process. The 8-agent system found solutions using 3 (4, 5, 6, 7, 8) agents in 8 (19, 16, 17, 21, 19) simulations. Using more agents tends to make finding the goal easier: during the first few trials 3 agents were used on average, during the final trials 6. Fewer agents tend to yield shorter solutions, however. Why? Systems that fail to solve the task with few subgoals start using more subgoals until they become successful. But the more subgoals there are, the more ways there are to compose paths, and the lower the probability of finding a shortest path in this maze.
Experimental analysis. How does the system discover and stabilize subgoal combinations (SCs)? The only optimal 28-step solution uses observation 2 as the first subgoal (the 5th field in the top row) and observation 9 as the second (the southeast inner corner). There are several 30-step solutions, however -- e.g., the SCs (3, 12), (2, 12), and (10, 12).
Figure 4 shows how SCs evolve by plotting them every 10 trials (observation 16 stands for an unsolved subgoal). The first 10,000 SCs are quite random, and the second agent is often unable to achieve its subgoal at all. Later, however, the system gradually focuses on successful SCs. Although useful SCs are occasionally lost due to exploration of alternative SCs, near the end of the simulation the system converges on SC (3, 12).
The goal is hardly ever found prior to trial 5200 (not shown in the figure). Then there is a sudden jump in performance -- most later trials cost just 30 steps. From this moment on, observation 12 is used as the second subgoal in more than 95% of all cases, and the goal is found in about 85%. The first subgoal tends to vary among observations 2, 3, and 10. Finally, around trial 16,000, the first subgoal settles on observation 3, although observation 2 would work as well.
The reason for faster stabilization of the second subgoal may be its proximity to the final goal. Once the second and third agents have learned RPs leading from the second subgoal to the goal, the second subgoal's HQ-value will have increased dramatically and dominate the alternatives. Only once the second subgoal is firmly established can a similar effect help to stabilize the first. Subgoals tend to get fixed in reverse order of their online generation.
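The backward propagation of value underlying this argument can be illustrated with a toy chain. This is a generic one-step TD sketch under assumed settings, not the paper's HQ update: with reward only at the end of the chain, each sweep of backups pushes value one state further back, so states near the reward acquire stable values first.

```python
def td_sweeps(n_states, gamma, sweeps):
    """One-step value backups V[s] <- r + gamma * V[s+1] on a left-to-right
    chain; only the transition into the terminal state pays reward 100.
    Sweeping s in ascending order uses the *old* V[s+1], so value
    propagates exactly one state backwards per sweep."""
    V = [0.0] * n_states
    for _ in range(sweeps):
        for s in range(n_states - 1):
            r = 100.0 if s == n_states - 2 else 0.0
            V[s] = r + gamma * V[s + 1]
    return V
```

After one sweep only the state adjacent to the goal has nonzero value; earlier states follow in later sweeps, mirroring why the second subgoal's HQ-value stabilizes before the first subgoal's.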