Next: CONCLUSION Up: HQ-Learning Adaptive Behavior 6(2):219-246, Previous: PREVIOUS WORK


HQ's advantages.

  1. Most POMDP algorithms need a priori information about the POMDP, such as the total number of environmental states, the observation function, or the action model. HQ does not.

  2. Unlike ``history windows'', HQ-learning can in principle handle arbitrary time lags between events worth memorizing. To focus this power where it is really needed, short history windows may be included in the agent inputs to take care of the shorter time lags. This, however, is orthogonal to HQ's basic ideas.

  3. To reduce memory requirements, HQ does not explicitly store all experiences with different subgoal combinations. Instead it estimates the average reward for choosing particular subgoal/RP combinations, and stores its experiences in a single sequence of Q- and HQ-tables. These are used to make successful subgoal/RP combinations more likely. HQ's approach is advantageous if the POMDP exhibits regular structure: if the same agent repeatedly faces RPPs solvable by similar RPs, it can ``reuse'' previous RPP solutions.

  4. HQ-learning can immediately generalize from solved POMDPs to ``similar'' POMDPs containing more states but requiring identical actions in response to inputs observed during subtasks (the RPPs remain invariant).

  5. Like Q-learning, HQ-learning allows for representing RPs and subgoal evaluations by function approximators other than look-up tables.
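The single-sequence memory layout of point 3 can be sketched as follows. This is a minimal illustration, not the paper's implementation; the class and attribute names (`HQAgent`, `hq`, `q`) are hypothetical, and the greedy/epsilon-greedy selection is one common choice:

```python
import random

class HQAgent:
    """One agent in the linear sequence: an HQ-table scoring candidate
    subgoals, plus a Q-table (its reactive policy, RP) mapping
    observations to action values."""
    def __init__(self, observations, actions):
        # HQ-table: estimated value of choosing each observation as subgoal
        self.hq = {o: 0.0 for o in observations}
        # Q-table: value of each action given the current observation
        self.q = {o: {a: 0.0 for a in actions} for o in observations}

    def pick_subgoal(self, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(list(self.hq))
        return max(self.hq, key=self.hq.get)

    def pick_action(self, obs, epsilon=0.1):
        if random.random() < epsilon:
            return random.choice(list(self.q[obs]))
        return max(self.q[obs], key=self.q[obs].get)

# Storage grows linearly with the number of agents in the sequence,
# not with the number of possible subgoal combinations:
observations, actions = range(4), ["left", "right"]
agents = [HQAgent(observations, actions) for _ in range(3)]
```

Because all agents share the same table format, an RP learned by one agent for a recurring subtask can in principle be reused wherever that subtask reappears in the sequence.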

HQ's current limitations.

  1. An agent's current subgoal does not uniquely represent previous subgoal histories. This means that HQ-learning does not entirely get rid of the ``hidden state problem'' (HSP). HQ's HSP is less severe than Q-learning's, though: no Q-policy can react differently to identical observations, which may occur frequently, whereas appropriate HQ-policies do exist.

    Still, HQ's remaining HSP may prevent HQ from learning an optimal policy. To deal with this HSP one might use subgoal trees instead of sequences. All possible subgoal sequences are representable by a tree whose branches are labeled with subgoals and whose nodes contain RPs for solving RPPs. Each node stands for a particular history of subgoals and previously solved subtasks, so there is no HSP any more. Since the tree grows exponentially with the number of possible subgoals, however, it is practically infeasible for large-scale POMDPs: with 10 possible subgoals and sequences of length 5, the tree already requires on the order of 10^5 nodes, while the linear sequence needs only 5 agents. Perhaps a reasonable compromise between simple linear sequences and full-fledged trees can be found.

  2. In the case of noisy observations, transfer of control may happen at inappropriate times. A remedy may be to use more reliable inputs that combine successive observations.

  3. In the case of noisy actions, an inappropriate action may be executed right before control is passed, and the resulting new subtask may not be solvable by the next agent's RP. A remedy similar to the one above is to represent subgoals as pairs of successive observations.

  4. If there are many possible observations, each subgoal will be tested infrequently, which may delay convergence. To overcome this one might either use function approximators instead of look-up tables, or let each agent generate a set of alternative subgoals: once any subgoal in the set is reached, control is transferred to the next agent.

  5. Some parameters, such as the maximal number of agents and the maximal runtime, need to be set in advance. The former is not critical -- it may be large since storage requirements are low. The latter, however, should be as low as possible to avoid wasting time on ``cycling'' between states.
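The transfer-of-control mechanism with subgoal sets (point 4 above) can be sketched in a few lines. This is a simplified illustration under assumed names (`run_sequence`, `env_step`); each ``agent'' is reduced to a fixed reactive policy mapping observations to actions, and the environment is a toy deterministic corridor:

```python
def run_sequence(policies, subgoal_sets, env_step, start_obs, max_steps=100):
    """Each 'agent' here is just a reactive policy (obs -> action).
    An agent acts until the current observation is in its subgoal set;
    then control transfers to the next agent in the sequence."""
    obs, trace = start_obs, []
    for policy, goals in zip(policies, subgoal_sets):
        steps = 0
        while obs not in goals and steps < max_steps:
            action = policy[obs]
            trace.append((obs, action))
            obs = env_step(obs, action)
            steps += 1
    return obs, trace

# Toy corridor with states 0..5: 'right' moves one state up.
# Agent 1 may hand over at either state 2 or 3 (a subgoal set);
# agent 2 aims for state 5.
env_step = lambda obs, action: obs + 1 if action == "right" else obs - 1
policies = [{i: "right" for i in range(6)}, {i: "right" for i in range(6)}]
final_obs, trace = run_sequence(policies, [{2, 3}, {5}], env_step, 0)
```

With a set `{2, 3}` instead of a single subgoal, the first agent transfers control at whichever member state it reaches first, which is the convergence-speeding effect suggested in point 4.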

Juergen Schmidhuber 2003-02-24
