Summary. HQ-learning is a novel method for reinforcement learning in partially observable environments. ``Non-Markovian'' tasks are automatically decomposed into subtasks solvable by memoryless policies, without intermediate external reinforcement for ``good'' subgoals. This is done by an ordered sequence of agents, each discovering both a local control policy and an appropriate subgoal. At each time step, the only kind of memory is carried by the ``name'' of the currently active agent. Our experiments involve deterministic POMDPs, tackled model-free, with many more states than most POMDPs found in the literature. The results demonstrate HQ-learning's ability to quickly learn optimal or near-optimal policies.
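The control flow sketched above can be made concrete in a few lines: an ordered sequence of agents, each holding a memoryless reactive policy and a subgoal observation, where reaching the subgoal hands control to the next agent and the index of the active agent is the system's only memory. The following Python sketch illustrates this execution scheme only; the class and function names, the toy corridor environment, and the hand-set policies are illustrative assumptions, not the paper's learning procedure (which learns the policies and subgoals via Q- and HQ-tables).

```python
# Illustrative sketch of HQ-style execution (not the learning rule).
# All names and the toy environment are assumptions for exposition.

class Agent:
    """One agent in the ordered sequence: a memoryless (reactive)
    policy plus a subgoal observation that ends its subtask."""
    def __init__(self, policy, subgoal):
        self.policy = policy      # maps observation -> action
        self.subgoal = subgoal    # observation that transfers control

def run_episode(agents, step, observe, state, max_steps=100):
    """Execute the agent sequence on an environment given by a
    transition function `step` and observation function `observe`.
    The only memory across time steps is `active`, the index
    (the ``name'') of the currently active agent."""
    active = 0
    trace = []
    for _ in range(max_steps):
        obs = observe(state)
        agent = agents[active]
        if obs == agent.subgoal:
            if active + 1 == len(agents):
                return trace, state       # last subgoal = task goal
            active += 1                   # hand control to next agent
            agent = agents[active]
        action = agent.policy(obs)
        trace.append((active, obs, action))
        state = step(state, action)
    return trace, state

# Toy non-Markovian task: walk a corridor 0..6 to the right end,
# then back to the left end.  All interior cells look identical
# ('mid'), so no single memoryless policy solves the whole task,
# but two reactive agents with subgoals do.
def observe(state):
    return {0: 'left_end', 6: 'right_end'}.get(state, 'mid')

def step(state, action):
    return min(6, state + 1) if action == 'R' else max(0, state - 1)

agents = [Agent(lambda obs: 'R', 'right_end'),   # subtask 1: go right
          Agent(lambda obs: 'L', 'left_end')]    # subtask 2: go back

trace, final_state = run_episode(agents, step, observe, state=0)
```

In the trace, the first six steps are taken by agent 0 moving right; on observing `'right_end'` control passes to agent 1, which walks back until its subgoal `'left_end'` terminates the episode.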
Future work.
The current version of HQ is restricted to learning a single,
linearly ordered subgoal sequence.
For very complex POMDPs, generalized HQ-architectures based on
directed acyclic (or even recurrent) graphs may turn out to be useful.
In our view, however, the most challenging problem is exploration:
``destructive'' exploration rules will unlearn good subgoal sequences.
How to improve POMDP exploration is still an open question.