Other authors proposed hierarchical reinforcement learning techniques, e.g., Schmidhuber (1991b), Dayan and Hinton (1993), Moore (1993), Tham (1995), Sutton (1995). Their methods, however, have been designed for MDPs. Since the focus of our paper is on POMDPs, this section is limited to a brief summary of previous POMDP approaches with specific advantages and disadvantages.
Recurrent neural networks.
Schmidhuber (1991c) uses
two interacting, gradient-based recurrent networks.
The ``model network'' serves to model (predict) the
environment; the controller uses the model network to compute
gradients that maximize the reinforcement predicted by the model
(this extends ideas of Nguyen and Widrow 1989, and
Jordan and Rumelhart 1990).
To our knowledge this work presents the first successful
reinforcement learning application to simple
non-Markovian tasks (e.g., learning to be a flipflop).
Lin (1993)
also uses combinations
of controllers and recurrent nets.
He compares time-delay neural networks (TDNNs)
and recurrent neural networks.
Despite their theoretical power, standard recurrent nets run into practical problems when there are long time lags between relevant input events. Although there are recent attempts at overcoming this problem (e.g., Schmidhuber 1992; Hihi and Bengio 1995; Hochreiter and Schmidhuber 1997, and references therein), there have been no reinforcement learning applications yet.
Belief vectors.
Kaelbling, Littman and Cassandra (1995)
hierarchically build policy trees to calculate
optimal policies in stochastic, partially observable environments.
For each possible environmental state
the ``belief vector'' provides the agent's estimate of the probability
of currently being in this state. After each observation
the belief vector is updated using action/observation models
and Bayes' formula. This compresses the
history of previous events into a probability distribution.
Based on this ``belief state'' an
optimal action can be chosen (Sondik 1971).
Dynamic programming algorithms are used to compute optimal
policies based on the belief states.
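In code, one Bayes step on the belief vector might look like the following sketch; the function name and probability-table layout are our own illustration, assuming the action and observation models are given:

```python
def belief_update(belief, action, obs, trans, obs_model):
    """One Bayes step on a belief vector.

    belief: list, belief[s] = estimated P(current state is s)
    trans: trans[action][s][s2] = P(s2 | s, action)
    obs_model: obs_model[s2][obs] = P(obs | s2)
    """
    n = len(belief)
    # Prediction step: push the belief through the action (transition) model.
    predicted = [sum(belief[s] * trans[action][s][s2] for s in range(n))
                 for s2 in range(n)]
    # Correction step: weight each state by the observation likelihood.
    unnorm = [predicted[s2] * obs_model[s2][obs] for s2 in range(n)]
    z = sum(unnorm)  # normalizer: P(obs | belief, action)
    return [p / z for p in unnorm]
```

The resulting vector compresses the entire action/observation history into a single probability distribution over states, on which a policy can then be defined.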
Problems with this approach are that the nature of the underlying MDP
needs to be known, and that it is computationally very expensive.
Methods for speeding it up focus on constructing more
compact policy trees.
For instance, Littman (1996)
uses the ``witness algorithm'' to accelerate
policy tree construction.
Zhang and Liu (1996) propose a scheme which (a) speeds up
the dynamic programming updates and (b) uses an oracle
providing additional state information to decrease
uncertainty.
Boutilier and Poole (1996) use Bayesian networks to represent POMDPs, and use these more compact models to accelerate policy computation. Parr and Russell (1995) use gradient descent methods on a continuous representation of the value function. Their experiments show significant speed-ups on certain small problems.
Littman et al. (1995) compare different POMDP algorithms using belief vectors. They report that ``small POMDPs'' (with fewer than 10 states and few actions) do not pose a very big problem for most methods. Larger POMDPs (50 to 100 states), however, cause major problems. This indicates that the problems in the current paper (which involve 62 and 960 states) can hardly be solved by such methods. HQ-learning, by contrast, is neither computationally complex nor does it require knowledge of the underlying MDP. In the absence of prior knowledge this can be a significant advantage. An advantage of the other methods, however, is that they can deal with very noisy perceptions and actions.
A possible HQ extension could use belief vectors to assign selection probabilities to each agent and to weight their Q-values. In very noisy environments this may work better than simple HQ.
Hidden Markov Models.
McCallum's utile distinction memory (1993)
is an extension of Chrisman's perceptual distinctions approach (1992),
which combines Hidden Markov Models (HMMs)
and Q-learning. The system is able to solve simple
POMDPs (maze tasks with only a few fields) by splitting
``inconsistent'' HMM states whenever the agent fails to
predict their utilities (but instead
experiences quite different returns from these states).
One problem with the approach
is that it cannot solve problems in which conjunctions
of successive perceptions are useful for predicting reward
while the individual perceptions are irrelevant.
HQ-learning does not have this problem -- it deals with
perceptive conjunctions by using multiple agents if necessary.
Memory bits.
Littman (1994)
uses branch-and-bound heuristics to find suboptimal memoryless
policies extremely quickly. To handle mazes for which there is no
safe, deterministic, memoryless policy, he replaces each conventional
action by
two actions, each having the additional effect of switching
on or off a ``memory bit''.
Good results are obtained on a toy problem.
The method does not scale, however:
each additional memory bit doubles the number of
internal states and action variants, blowing up the search space.
Cliff and Ross (1994) use Wilson's (1994) zeroth-level classifier system (ZCS) for POMDPs. Memory bits can be set and reset by actions. ZCS is trained by bucket-brigade and genetic algorithms. The system is reported to work well on small problems but to become unstable when more than one memory bit is used. Also, it is usually not able to find optimal deterministic policies. Wilson (1995) recently described a more sophisticated classifier system which uses prediction accuracy to calculate fitness, and a genetic algorithm working in environmental niches. His study shows that this makes the classifiers more general and more accurate. It would be interesting to see how well this system can use memory for solving POMDPs.
One problem with memory bits is that tasks such as those in section 3 require (1) switching memory bits on or off at precisely the right moment, and (2) keeping them switched on or off for long times. During learning and exploration, however, each memory bit will be very unstable and change all the time; algorithms based on incremental solution refinement will usually have great difficulty finding out when to set or reset it. Even if the probability of changing a memory bit in response to a particular observation is low, the bit will eventually change if the observation is made frequently. HQ-learning does not have such problems. Its memory is embodied solely by the active agent's number, which is rarely incremented during a trial. This makes it much more stable.
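A minimal sketch of the action augmentation behind memory-bit approaches (names and layout are ours, not taken from the cited papers) makes the combinatorics explicit:

```python
from itertools import product

def augment_actions(base_actions, n_bits=1):
    """Replace each conventional action by variants that also write the
    memory bits. With n_bits bits the action set grows by a factor of
    2**n_bits -- the search-space explosion discussed above."""
    bit_settings = list(product([0, 1], repeat=n_bits))
    return [(a, bits) for a in base_actions for bits in bit_settings]

def internal_state(observation, memory_bits):
    """A memoryless learner then conditions on (observation, memory bits)
    instead of the raw observation alone."""
    return (observation, tuple(memory_bits))

# Example: four maze actions and one memory bit give 4 * 2 = 8 actions.
actions = augment_actions(['north', 'south', 'east', 'west'], n_bits=1)
```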
Program evolution with memory cells (MCs).
Certain techniques for automatic program synthesis
based on evolutionary principles
can be used to evolve short-term memorizing
programs that read and write MCs during runtime
(e.g., Teller, 1994).
A recent such method is
Probabilistic Incremental Program Evolution
(PIPE -- Salustowicz and Schmidhuber, 1997).
PIPE iteratively generates successive
populations of functional programs according to an adaptive
probability distribution over all possible programs.
On each iteration it uses the best program to refine the distribution.
Thus it stochastically generates better and better programs.
An MC-based PIPE variant has been successfully used to solve tasks in
partially observable mazes.
Unlike the memory bit approach mentioned in the previous
paragraph, population-based approaches
will not easily unlearn programs that make good use of memory.
On serial machines, however, their evaluation
tends to be computationally much more expensive than that of HQ-learning.
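The distribution-refinement loop underlying PIPE can be illustrated with a toy sketch; real PIPE works on probabilistic prototype trees over functional programs, whereas this simplified version (all names ours) uses flat instruction strings:

```python
import random

def pipe_style_search(fitness, n_slots, n_symbols,
                      pop=20, lr=0.2, iters=50, seed=0):
    """Toy sketch in the spirit of PIPE: keep a probability table over
    instructions per program slot, sample a population of programs from
    it, and shift the table toward the best program found so far."""
    rng = random.Random(seed)
    # probs[i][j] = probability of placing symbol j at program slot i
    probs = [[1.0 / n_symbols] * n_symbols for _ in range(n_slots)]
    best, best_f = None, float('-inf')
    for _ in range(iters):
        programs = [[rng.choices(range(n_symbols), w)[0] for w in probs]
                    for _ in range(pop)]
        elite = max(programs, key=fitness)      # best program this iteration
        if fitness(elite) > best_f:
            best, best_f = elite, fitness(elite)
        # Refine the distribution toward the elite program.
        for i, sym in enumerate(elite):
            for j in range(n_symbols):
                target = 1.0 if j == sym else 0.0
                probs[i][j] += lr * (target - probs[i][j])
    return best, best_f

# Toy task: evolve the all-ones "program" of length 8 over symbols {0, 1}.
prog, score = pipe_style_search(lambda p: sum(p), n_slots=8, n_symbols=2)
```

Because the distribution, not any single program, carries the learned information, a good memory-using program that is sampled repeatedly keeps reinforcing its own instructions; this is why such population-based approaches do not easily unlearn memory use.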
Learning control hierarchies.
Ring's system (1994)
constructs a bottom-up control hierarchy.
The lowest level nodes are primitive perceptual and control actions.
Nodes at higher levels represent sequences of lower level nodes.
To disambiguate inconsistent states,
new higher-level nodes are added to incorporate
information hidden ``deeper'' in the past, if necessary.
The system is able to quickly learn certain non-Markovian maze problems,
but often it is not able to generalize from previous experience
without additional learning, even if the optimal policies for the old
and new tasks are identical.
HQ-learning, however, can reuse the same policy and generalize well
from previous to ``similar'' problems.
McCallum's U-tree (1996) is quite similar to Ring's system. It uses prediction suffix trees (see Ron et al. 1994) in which the branches reflect decisions based on current or previous inputs/actions. Q-values are stored in the leaves, which correspond to clusters of instances collected and stored during the entire learning phase. Statistical tests are used to decide whether instances in a cluster correspond to significantly different utility estimates. If so, the cluster is split. McCallum's recent experiments demonstrate the algorithm's ability to learn reasonable policies in large state spaces.
One problem with Ring's and McCallum's approaches is that they depend on the creation of an $n$-th order Markov model, where $n$ is the size of the ``time window'' used for sampling observations. Hence for large $n$ the approach will suffer from the curse of dimensionality.
Consistent Representations.
Whitehead (1992) uses the
``Consistent Representation (CR) Method'' to deal with
inconsistent internal states which result from ``perceptual aliasing''
due to ambiguous input information.
CR uses an ``identification stage'' to
execute perceptual actions which collect the information needed
to define a consistent internal state.
Once a consistent internal state has been identified,
a single action is generated to maximize
future discounted reward. Both identifier and controller are
adaptive. One limitation of the method is that the system has
no means of remembering and using any information other than
that which is immediately perceivable.
HQ-learning, however, can profit from remembering
previous events for very long time periods.
Levin Search.
Wiering and Schmidhuber (1996)
use Levin search (LS) through program space
(Levin 1973)
to discover programs computing solutions for large POMDPs.
LS is of interest because of its amazing
theoretical properties: for a broad class of search problems, it has
the optimal order of computational complexity. For instance, suppose
there is an algorithm that solves a certain type of maze task in
$O(f(n))$ steps, where $n$ is a positive integer representing the problem size.
Then LS will solve the same task in at most $O(f(n))$ steps.
Wiering and Schmidhuber show that LS may have
substantial advantages over other reinforcement learning techniques,
provided the algorithmic complexity of the solutions is low.
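The time-allocation scheme at the heart of LS can be sketched as follows; the `run` interface and the budget bookkeeping are illustrative assumptions of ours, not the actual implementation of Wiering and Schmidhuber:

```python
from itertools import product

def levin_search(run, alphabet, max_phase=20):
    """Minimal sketch of Levin search. In phase i, every program p of
    length l gets a runtime budget of 2**(i - l) steps, i.e. a budget
    proportional to 2**i times the program's probability 2**(-l) under
    a uniform prefix code. run(program, budget) is assumed to execute
    the program for at most `budget` steps and return a solution or
    None."""
    for phase in range(1, max_phase + 1):
        for length in range(1, phase + 1):
            budget = 2 ** (phase - length)
            for program in product(alphabet, repeat=length):
                result = run(program, budget)
                if result is not None:
                    return program, result
    return None
```

Short programs are tried first and retried with exponentially growing budgets, which is what yields the optimal order of complexity when a short fast solver exists.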
Success-Story Algorithm.
Wiering and Schmidhuber (1996) also extend LS to obtain an
incremental method for generalizing from previous experience
(``adaptive LS'').
To guarantee that the lifelong history of policy changes
corresponds to a lifelong history of reinforcement accelerations,
they use the success-story algorithm
(SSA, e.g., Schmidhuber, Zhao and Schraudolph 1997a;
Zhao and Schmidhuber 1996,
Schmidhuber, Zhao and Wiering 1997b).
This can lead to further significant
speed-ups. SSA is actually not LS-specific, but a general
approach that allows for
plugging in a great variety of learning algorithms. For instance,
in additional
experiments with a ``self-referential'' system that embeds its
policy-modifying method within the policy itself, SSA is
able to solve huge POMDPs with more than $10^{13}$ states
(Schmidhuber et al. 1997a).
It may be possible to combine SSA with
HQ-learning in an advantageous way.
Multiple Q-learners.
Like HQ-learning, Humphrys'
W-learning (1996) uses
multiple Q-learning agents.
A major difference is that
his agents' skills
are prewired -- different agents focus on
different input features and receive different rewards.
``Good'' reward functions are found by genetic algorithms.
An important goal is to learn which agent to select
for which part of the input space.
Eight different learning methods implementing
cooperative and competitive strategies
are tested in a rather complex
dynamic environment, and seem to lead to reasonable results.
Digney (1996) describes a nested Q-learning technique based on multiple agents learning independent, reusable skills. To generate quite arbitrary control hierarchies, simple actions and skills can be composed to form more complex skills. The same learning rule is used for selecting skills and for selecting actions, however, which may make it hard to deal with long reinforcement delays. In experiments the system reliably learns to solve a simple maze task. It remains to be seen, however, whether the system can reliably learn to decompose solutions of complex problems into stable skills.