
THE KEY AND THE DOOR

[Appropriate place for Figure 5]


Task. The second experiment involves the $26 \times 23$ maze shown in Figure 5. Starting at S, the system has to (1) fetch the key at position K, (2) move to the ``door'' (the shaded area), which normally behaves like a wall and opens (disappears) only if the agent is in possession of the key, and (3) proceed to the goal G. There are only 11 different, highly ambiguous inputs: the key is observed as an ordinary free field, the closed door as an ordinary wall. The optimal path takes 83 steps.
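
To make the task dynamics concrete, the following minimal Python sketch implements the key-and-door mechanics described above. The actual $26 \times 23$ layout and the positions of S, K, the door cells and G are given in Figure 5 and appear here only as constructor arguments; the neighbour-based observation encoding is our assumption, chosen merely to illustrate why the inputs are so ambiguous.

\begin{verbatim}
WALL, FREE = 1, 0

class KeyDoorMaze:
    """Sketch of the key-and-door maze; layout and positions come from Figure 5."""

    def __init__(self, grid, start, key_pos, door_cells, goal):
        self.grid = grid                    # 2-D list of WALL/FREE codes
        self.start = start                  # position S
        self.key_pos = key_pos              # position K
        self.door_cells = set(door_cells)   # the shaded "door" area
        self.goal = goal                    # position G

    def reset(self):
        self.pos, self.has_key = self.start, False
        return self.observe()

    def blocked(self, cell):
        if cell in self.door_cells:         # the door acts like a wall until the key is held
            return not self.has_key
        r, c = cell
        return self.grid[r][c] == WALL

    def step(self, move):                   # move is a (dr, dc) offset for N/E/S/W
        nxt = (self.pos[0] + move[0], self.pos[1] + move[1])
        if not self.blocked(nxt):
            self.pos = nxt
        if self.pos == self.key_pos:        # fetching the key is not reflected in the observation
            self.has_key = True
        return self.observe(), self.pos == self.goal

    def observe(self):
        # Ambiguous input: the key looks like a free field, the closed door like a wall.
        # Encoding assumed here: the blocked/free pattern of the four neighbouring cells.
        return tuple(self.blocked((self.pos[0] + dr, self.pos[1] + dc))
                     for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)))
\end{verbatim}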


Reward function. Once the system reaches the goal, it receives a reward of 500. Every other action yields a reward of -0.1. There is no additional, intermediate reward for taking the key or passing through the door. The discount factor is $\gamma = 1.0$.
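
Since the reward signal is fully specified above, it can be written down directly; the one-line sketch below assumes the goal flag returned by the step() method of the maze sketch above.

\begin{verbatim}
GAMMA = 1.0  # undiscounted return

def reward(reached_goal):
    # +500 for reaching the goal, -0.1 for every other action;
    # no intermediate reward for taking the key or passing the door.
    return 500.0 if reached_goal else -0.1
\end{verbatim}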


Parameters. The experimental set-up is analogous to the one in Section 3.1. We use systems with 3, 4, 6 and 8 agents, as well as systems with 8 agents whose actions are corrupted by different amounts of noise (5%, 10%, and 25%). We set $\alpha_Q = 0.05$, $\alpha_{HQ} = 0.01$, and $T_i = 0.2$ for all $i$. $p_{max}$ is linearly increased from $0.4$ to $0.8$. Again, $\lambda = 0.9$ for both HQ-tables and Q-tables, and all table entries are initialized to zero. One simulation consists of 20,000 trials.
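
For reference, the settings above can be collected in a small configuration sketch; the field names and the explicit linear $p_{max}$ schedule are our own shorthand, not the paper's notation.

\begin{verbatim}
from dataclasses import dataclass

@dataclass
class HQConfig:
    n_agents: int = 8          # 3, 4, 6 or 8 agents in the experiments
    alpha_Q: float = 0.05      # learning rate for the Q-tables
    alpha_HQ: float = 0.01     # learning rate for the HQ-tables
    temperature: float = 0.2   # T_i, identical for all agents i
    p_max_start: float = 0.4   # p_max grows linearly from here ...
    p_max_end: float = 0.8     # ... to here over one simulation
    lambda_: float = 0.9       # eligibility-trace decay for HQ- and Q-tables
    gamma: float = 1.0         # discount factor
    n_trials: int = 20_000     # trials per simulation; all tables start at zero
    action_noise: float = 0.0  # 0.05, 0.10 or 0.25 in the noisy-action runs

def p_max(cfg: HQConfig, trial: int) -> float:
    # Linear schedule for p_max across the trials of one simulation.
    frac = trial / max(cfg.n_trials - 1, 1)
    return cfg.p_max_start + frac * (cfg.p_max_end - cfg.p_max_start)
\end{verbatim}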

[Appropriate place for Figure 6]

[Appropriate place for Table 2]


Results. We first ran 20,000 trials of 1,000 steps each with a system executing random actions. It never found the goal. We then ran the random system for 3,000 trials of 10,000 steps each; the shortest path ever found took 1,174 steps. We conclude that goal discovery within 1,000 steps (and without the ``action penalty'' provided by a negative reinforcement signal for each executed action) is very unlikely.

Figure 6A and Table 2 show HQ-learning results for noise-free actions. Within 20,000 trials, good deterministic policies are found in almost all simulations. Optimal 83-step paths are found with 3 (4, 6, 8) agents in 8% (9%, 8%, 6%) of all simulations. In the few runs that did not lead to good solutions, the goal was rarely found at all. This reflects a general problem: exploration issues are trickier in the POMDP case than in the MDP case, and much remains to be done to understand them better.

If random actions are taken in 5% (10%, 25%) of all cases, the 8-agent system still finds the goal in 92% (90%, 84%) of the final trials (see Table 2). In many cases long paths (300 -- 700 steps) are found; the best solutions use only 84 (91, 118) steps, though. Interestingly, a little noise (e.g., 5%) does decrease performance, but much more noise does not lead to much worse results.

