Task. The second experiment involves the maze shown in figure 5. Starting at S, the system has to (1) fetch a key at position K, (2) move towards the ``door'' (the shaded area) which normally behaves like a wall and will open (disappear) only if the agent is in possession of the key, and (3) proceed to goal G. There are only 11 different, highly ambiguous inputs; the key (door) is observed as a free field (wall). The optimal path takes 83 steps.
Reward function. Once the system hits the goal, it receives a reward of 500. For all other actions there is a reward of -0.1. There is no additional, intermediate reward for taking the key or going through the door. The discount factor .
Parameters. The experimental set-up is analogous to the one in section 3.1. We use systems with 3, 4, 6 and 8 agents, and systems with 8 agents whose actions are corrupted by different amounts of noise (5%, 10%, and 25%). = .05, = .01 . is linearly increased from to . Again, = .9 for both HQ-tables and Q-tables, and all table entries are zero-initialized. One simulation consists of 20,000 trials.
Results. We first ran 20,000 thousand-step trials of a system executing random actions. It never found the goal. Then we ran the random system for 3000 10,000 step trials. The shortest path ever found took 1,174 steps. We observe: goal discovery within 1000 steps (and without ``action penalty'' through negative reinforcement signals for each executed action) is very unlikely to happen.
Figure 6A and Table 2 show HQ-learning results for noise-free actions. Within 20,000 trials good, deterministic policies are found in almost all simulations. Optimal 83 step paths are found with 3 (4, 6, 8) agents in 8% (9%, 8%, 6%) of all simulations. During the few runs that did not lead to good solutions the goal was rarely found at all. This reflects a general problem: in the POMDP case exploration issues are trickier than in the MDP case -- much remains to be done to better understand them.
If random actions are taken in 5% (10%, 25%) of all cases, the 8 agent system still finds the goal in 92% (90%, 84%) of the final trials (see table 2). In many cases long paths (300 -- 700 steps) are found. The best solutions use only 84 (91, 118) steps, though. Interestingly, a little noise (e.g. 5%) does decrease performance, but much more noise does not lead to much worse results.