We will use off-line learning for updating the tables -- this means storing experiences and postponing learning until after trial end (no intra-trial parameter adaptation). In principle, however, online learning is applicable as well (see below). We will describe two HQ variants, one based on Q-learning, the other on Q()-learning -- Q() overcomes Q's inability to solve certain RPPs. The learning rules appear very similar to those of conventional Q and Q(). One major difference though is that each agent's prospects of achieving its subgoal tend to vary as various agents try various subgoals.
Learning the Q-values. We want to approximate the system's expected discounted future reward for executing action , given . In the optimal case we have
Q-value updates are generated in two different situations ( denotes the total number of executed actions during the current trial, and is the learning rate):
Learning the HQ-values: intuition. Recall the introduction's traffic light task. The first traffic light is a good subgoal. We want our system to discover this by exploring (initially random) subgoals and learning their HQ-values. The traffic light's HQ-value, for instance, should converge to the expected (discounted) future cumulative reinforcement to be obtained after it has been chosen as a subgoal. How? Once the traffic light has been reached and the first agent passes control to the next, the latter's own expectation of future reward is used to update the first's HQ-values. Where do the latter's expectations originate? They reflect its own experience with final reward (to be obtained at the station).
In the optimal case we have
We adjust only HQ-values of agents active before trial end ( denotes the number of agents active during the last trial, denotes the learning rate, and the chosen subgoal for agent ):
The first and third rules resemble traditional Q-learning rules. The second rule is necessary if agent has learned a (possibly high) value for a subgoal that is unachievable due to subgoals selected by previous agents.
HQ()-learning: motivation. Q-learning's lookahead capability is restricted to one step. It cannot solve all RPPs because it cannot properly assign credit to different actions leading to identical next states (Whitehead 1992). For instance, suppose you walk along a wall that looks the same everywhere except in the middle where there is a picture. The goal is to reach the left corner where there is reward. This RPP is solvable by an RP. Given the ``picture'' input, however, Q-learning with one step look-ahead would assign equal values to actions ``go left'' and ``go right'' because they both yield identical ``wall'' observations.
Consequently HQ-learning may suffer from Q-learning's inability to solve certain RPPs. To overcome this problem, we augment HQ by TD()-methods for evaluating and improving policies in a manner analogous to Lin's offline Q()-method (1993). TD()-methods integrate experiences from several successive steps to disambiguate identical short-term effects of different actions. Our experiments indicate that RPPs are solvable by Q()-learning with sufficiently high .
In principle, online Q() may be used as well. See, e.g., Peng and Williams (1996), or Wiering and Schmidhuber's fast Q() implementation (1997). Online Q() should not use ``action-penalty'' (Koenig 1996), however, because punishing varying actions in response to ambiguous inputs will trap the agent in cyclic behavior.
Combined dynamics. Q-table policies are reactive and learn to solve RPPs. HQ-table policies are metastrategies for composing RPP sequences. Although Q-tables and HQ-tables do not explicitly communicate they influence each other through simultaneous learning. Their cooperation results in complex dynamics quite different from those of conventional Q-learning.
Utilities of subgoals and RPs are estimated by tracking how often they are part of successful subgoal/RP combinations. Subgoals that never or rarely occur in solutions become less likely to be chosen, others become more likely. In a certain sense subtasks compete for being assigned to subagents, and the subgoal choices ``co-evolve'' with the RPs. Maximizing its own expected utility, each agent implicitly takes into account frequent decisions made by other agents. Each agent eventually settles down on a particular RPP solvable by its RP and ceases to adapt. This will be illustrated by Experiment 1 in Section 3.
Estimation of average reward for choosing a particular subgoal ignores dependencies on previous subgoals. This makes local minima possible. If several rewarding suboptimal subgoal sequences are ``close'' in subgoal space, then the optimal one may be less probable than suboptimal ones. We will show in the experiments that this actually can happen.
Exploration issues. Initial choices of subgoals and RPs may influence the final result - there may be local minimum traps. Exploration is a partial remedy: it encourages alternative competitive strategies similar to the current one. Too little exploration may prevent the system from discovering the goal at all. Too much exploration, however, prevents reliable estimates of the current policy's quality and reuse of previous successful RPs. To avoid over-exploration we use the Max-Boltzmann (Max-Uniform) distribution for Q-values (HQ-values) (for discussions of exploration issues see, e.g., Fedorov 1972; Schmidhuber 1991a; Thrun 1992; Cohn 1994; Caironi and Dorigo 1994; Storck, Hochreiter and Schmidhuber 1995; Wilson 1996a, Schmidhuber 1997c). These distributions also make it easy to reduce the relative weight of exploration (as opposed to exploitation): to obtain a deterministic policy at the end of the learning process, we increase during learning until it finally achieves a maximum value.
Selecting actions according to the traditional Boltzmann distribution causes the following problems: (1) It is hard to find good values for the temperature parameter. (2) The degree of exploration depends on the Q-values: actions with almost identical Q-values (given a certain input) will be executed equally often. For instance, suppose a sequence of 5 different states in a maze leads to observation sequence , where represents a single observation. Now suppose there are almost equal Q-values for going west or east in response to . Then the Q-updates will hardly change the differences between these Q-values. The resulting random walk behavior will cost a lot of simulation time.
For RP training we prefer the Max-Boltzmann rule instead. It focuses on the greedy policy and only explores actions competitive with the optimal actions. Subgoal exploration is less critical though. The Max-Random subgoal exploration rule may be replaced by the Max-Boltzmann rule or others.