Task.
The second experiment involves the maze shown in
figure 5.
Starting at S, the system has to (1) fetch a key at position K,
(2) move towards the ``door'' (the shaded area) which normally
behaves like a wall and will open (disappear)
only if the agent is in possession of the key, and (3) proceed to
goal G. There are only 11 different, highly ambiguous inputs;
the key (door) is observed as a free field (wall).
The optimal path takes 83 steps.
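The key-and-door dynamics described above can be sketched in code. The following is a minimal illustration, not the paper's implementation: the class name, coordinate convention, and action set are assumptions, and the actual maze layout of figure 5 is not reproduced (the constructor takes an arbitrary layout instead).

```python
# Illustrative sketch of the key/door dynamics: the door behaves like a
# wall until the key is picked up, and observations are deliberately
# ambiguous (the key looks like a free field, the closed door like a wall).

class KeyDoorMaze:
    def __init__(self, walls, start, key_pos, door_pos, goal):
        self.walls, self.start = set(walls), start
        self.key_pos, self.door_pos, self.goal = key_pos, door_pos, goal
        self.reset()

    def reset(self):
        self.pos, self.has_key = self.start, False
        return self.observe()

    def blocked(self, cell):
        # The door acts as a wall unless the agent holds the key.
        if cell == self.door_pos:
            return not self.has_key
        return cell in self.walls

    def step(self, action):
        dx, dy = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}[action]
        nxt = (self.pos[0] + dx, self.pos[1] + dy)
        if not self.blocked(nxt):
            self.pos = nxt
        if self.pos == self.key_pos:
            self.has_key = True                     # fetch the key
        reward = 500.0 if self.pos == self.goal else -0.1
        return self.observe(), reward, self.pos == self.goal

    def observe(self):
        # Highly ambiguous input: only the local blocked/free pattern of
        # the four neighbouring cells is visible; key and door are not
        # distinguishable from free fields and walls, respectively.
        return tuple(self.blocked((self.pos[0] + dx, self.pos[1] + dy))
                     for dx, dy in [(0, -1), (0, 1), (1, 0), (-1, 0)])
```

On a toy corridor layout, stepping onto the door cell fails until the key cell has been visited, after which the door "disappears" and the goal becomes reachable.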
Reward function.
Once the system hits the goal, it receives a
reward of 500. For all other actions there is a reward of -0.1.
There is no additional, intermediate
reward for taking the key or going through the door.
The discount factor $\gamma$ is set as in section 3.1.
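Under this reward function, the discounted return of an $n$-step path to the goal can be computed directly; shorter paths earn strictly higher return, which is what pushes the system toward the 83-step optimum. A small worked example, using $\gamma = 0.9$ purely as an illustrative value (the actual discount factor is not given in this excerpt):

```python
# Discounted return of an n-step path: -0.1 for each intermediate action,
# 500 on the action that hits the goal. gamma = 0.9 is an assumed value
# for illustration only.

def discounted_return(n_steps, gamma=0.9):
    ret = 0.0
    for t in range(n_steps - 1):                  # intermediate actions
        ret += (gamma ** t) * -0.1
    ret += (gamma ** (n_steps - 1)) * 500.0       # final, goal-hitting action
    return ret
```

For example, a one-step path would return 500 and a two-step path $-0.1 + 0.9 \cdot 500 = 449.9$; the function is strictly decreasing in the path length.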
Parameters.
The experimental set-up is analogous to the one in section 3.1.
We use systems with 3, 4, 6 and 8 agents, and systems with 8
agents whose actions are corrupted by different amounts of noise
(5%, 10%, and 25%).
The learning rates are set to $.05$ and $.01$.
The exploration parameter is linearly increased over the course of a simulation.
Again, $\lambda = .9$ for both HQ-tables and Q-tables, and
all table entries are zero-initialized.
One simulation consists of 20,000 trials.
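The tabular set-up above can be sketched as follows. The table shapes, the subgoal space of the HQ-tables, and the schedule end points `P_START`/`P_END` are assumptions for illustration (the actual schedule values do not appear in this excerpt); only the zero initialization, the 11 observations, and the 20,000 trials are taken from the text.

```python
import numpy as np

# Zero-initialized tables for an 8-agent system over the 11 ambiguous
# observations and 4 move actions, plus a linearly increasing
# exploration parameter. P_START/P_END are illustrative placeholders.

N_AGENTS, N_OBS, N_ACTIONS, N_TRIALS = 8, 11, 4, 20_000
P_START, P_END = 0.5, 1.0                 # assumed schedule end points

q_tables = np.zeros((N_AGENTS, N_OBS, N_ACTIONS))   # all entries zero
hq_tables = np.zeros((N_AGENTS, N_OBS))             # subgoal values, zeroed

def exploration_param(trial):
    """Linear schedule from P_START (trial 0) to P_END (last trial)."""
    frac = trial / (N_TRIALS - 1)
    return P_START + frac * (P_END - P_START)
```

Zero initialization makes all actions initially equivalent, so early behaviour is driven entirely by the exploration schedule.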
Results.
We first ran 20,000 trials of 1,000 random-action steps each;
the system never found the goal.
Then we ran the random system for 3,000 trials of 10,000 steps each.
The shortest path ever found took 1,174 steps.
We observe that goal discovery within 1,000 steps
(even in the absence of an ``action penalty'', i.e.,
negative reinforcement signals for each executed action)
is very unlikely.
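The random-action baseline can be reproduced with a small harness like the one below. Everything here is an illustrative sketch: the `Corridor` stand-in replaces the maze of figure 5 (which is not reproduced), and the function and parameter names are assumptions, not the paper's code.

```python
import random

class Corridor:
    """Trivial stand-in environment: reach position n by moving E/W."""
    def __init__(self, n=3):
        self.n = n
    def reset(self):
        self.pos = 0
    def step(self, action):
        self.pos = max(0, self.pos + (1 if action == "E" else -1))
        return self.pos == self.n          # done?

def random_baseline(env, actions, n_trials, max_steps, seed=0):
    """Run fixed-length trials of uniformly random actions; report how
    often the goal was reached and the shortest successful path."""
    rng = random.Random(seed)
    successes, shortest = 0, None
    for _ in range(n_trials):
        env.reset()
        for step in range(1, max_steps + 1):
            if env.step(rng.choice(actions)):
                successes += 1
                shortest = step if shortest is None else min(shortest, step)
                break
    return successes, shortest
```

Running the same harness on the actual maze with 1,000-step (or 10,000-step) trials would yield the success rates and shortest-path lengths reported above.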
Figure 6A and Table 2 show HQ-learning results for noise-free actions. Within 20,000 trials, good deterministic policies are found in almost all simulations. Optimal 83-step paths are found with 3 (4, 6, 8) agents in 8% (9%, 8%, 6%) of all simulations. During the few runs that did not lead to good solutions, the goal was rarely found at all. This reflects a general problem: in the POMDP case exploration issues are trickier than in the MDP case -- much remains to be done to better understand them.
If random actions are taken in 5% (10%, 25%) of all cases, the 8-agent system still finds the goal in 92% (90%, 84%) of the final trials (see table 2). In many cases long paths (300--700 steps) are found. The best solutions, though, use only 84 (91, 118) steps. Interestingly, a little noise (e.g. 5%) does decrease performance, but much more noise does not lead to much worse results.