
LEARNING RULES

We will use off-line learning for updating the tables: experiences are stored and learning is postponed until after trial end (no intra-trial parameter adaptation). In principle, however, online learning is applicable as well (see below). We will describe two HQ variants, one based on Q-learning, the other on Q($\lambda$)-learning; Q($\lambda$) overcomes Q's inability to solve certain RPPs. The learning rules appear very similar to those of conventional Q and Q($\lambda$). One major difference, however, is that each agent's prospects of achieving its subgoal tend to vary while the other agents are still trying out different subgoals.


Learning the Q-values. We want $Q_i(O_t,A_t)$ to approximate the system's expected discounted future reward for executing action $A_t$, given $O_t$. In the optimal case we have

\begin{displaymath}
Q_i(O_t,A_t) = \sum_{S_j \in S} P(S_j\vert O_t,\Theta,i) \Bigl( R(S_j, A_t) + \gamma
\sum_{S_k \in S} D(S_j, A_t, S_k) V_{Active(t+1)}(B(S_k)) \Bigr),
\end{displaymath} (1)

where $P(S_j\vert O_t,\Theta,i)$ denotes the probability that the system is in state $S_j$ at time $t$, given observation $O_t$, all architecture parameters (collectively denoted $\Theta$), and the information that $i = Active(t)$. HQ-learning does not depend on estimating this probability, although belief vectors (Littman, 1996) or a world model (e.g., Moore 1993) might help to speed up learning. $V_i(O_t)$ is the utility of observation $O_t$ according to agent $\cal C_{\em i}$, defined as the Q-value of the best action: $V_i(O_t) := \max_{A_j \in A} \{Q_i(O_t, A_j)\}.$
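
To fix notation for the code sketches below, here is a minimal tabular setup in Python. It is a sketch only: observations, actions and subgoals are assumed to be indexed by integers, and all names (num_agents, num_obs, num_actions, q_tables, hq_tables) and sizes are illustrative rather than taken from the paper.

\begin{verbatim}
import numpy as np

# Illustrative sizes; M agents, each with its own Q-table and HQ-table.
num_agents, num_obs, num_actions = 4, 16, 4

# Q_i(O, A): one table of shape (num_obs, num_actions) per agent C_i.
q_tables = np.zeros((num_agents, num_obs, num_actions))
# HQ_i(O): one table of shape (num_obs,) per agent C_i (subgoals are observations).
hq_tables = np.zeros((num_agents, num_obs))

def value(i, obs):
    """V_i(O) := max_A Q_i(O, A), the utility of observation O for agent C_i."""
    return q_tables[i, obs].max()

def hvalue(i):
    """HV_i := max_O HQ_i(O), agent C_i's estimated discounted return."""
    return hq_tables[i].max()
\end{verbatim}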

Q-value updates are generated in two different situations ($T \le T_{max}$ denotes the total number of executed actions during the current trial, and $\alpha_Q$ is the learning rate):

Q.1
Let $\cal C_{\em i}$ and $\cal C_{\em j}$ denote the agents active at times $t$ and $t+1$ -- possibly $i=j$. If $t < T$ then
$Q_i(O_t, A_t) \leftarrow (1 - \alpha_Q) Q_i(O_t, A_t) + \alpha_Q
(R(S_t, A_t) + \gamma V_j(O_{t+1}))$.
Q.2
If agent $\cal C_{\em i}$ is active at time $T$, and the final action $A_T$ has been executed, then
$Q_i(O_T, A_T) \leftarrow
(1 - \alpha_Q) Q_i(O_T, A_T) + \alpha_Q R(S_T, A_T)$.
Note that $R(S_T, A_T)$ is the final reward for reaching a goal state if $T < T_{max}$. A main difference from standard one-step Q-learning is that an agent can be trained on Q-values that are not its own (see [Q.1]).
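
The following sketch applies [Q.1] and [Q.2] off-line to a single stored trial. It assumes the trial was recorded as parallel lists of observations, actions, rewards and active-agent indices (1-based times $t=1,\ldots,T$ in the text correspond to 0-based list indices here); all argument names and hyperparameter values are illustrative.

\begin{verbatim}
import numpy as np

def update_q_tables(q_tables, obs, acts, rews, active,
                    alpha_q=0.1, gamma=0.95):
    """Apply [Q.1] for t < T and [Q.2] for the final step T.

    q_tables: array of shape (num_agents, num_obs, num_actions);
    obs, acts, rews, active: lists of length T describing the stored trial.
    """
    T = len(obs)
    for t in range(T):
        i = active[t]                      # agent C_i active at time t
        if t < T - 1:                      # [Q.1]
            j = active[t + 1]              # agent C_j active at time t+1
            target = rews[t] + gamma * q_tables[j, obs[t + 1]].max()
        else:                              # [Q.2]: final action of the trial
            target = rews[t]
        old = q_tables[i, obs[t], acts[t]]
        q_tables[i, obs[t], acts[t]] = (1 - alpha_q) * old + alpha_q * target

# Example usage on an (illustrative) three-step trial:
q = np.zeros((2, 8, 4))
update_q_tables(q, obs=[0, 3, 7], acts=[1, 2, 0],
                rews=[0.0, 0.0, 1.0], active=[0, 0, 1])
\end{verbatim}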


Learning the HQ-values: intuition. Recall the introduction's traffic light task. The first traffic light is a good subgoal. We want our system to discover this by exploring (initially random) subgoals and learning their HQ-values. The traffic light's HQ-value, for instance, should converge to the expected (discounted) future cumulative reinforcement to be obtained after it has been chosen as a subgoal. How? Once the traffic light has been reached and the first agent passes control to the next, the latter's own expectation of future reward is used to update the first's HQ-values. Where do the latter's expectations originate? They reflect its own experience with final reward (to be obtained at the station).


More formally. In the optimal case we have

\begin{displaymath}
HQ_i(O_j) = {\cal E} (R_i) + \gamma^{t_{i+1} - t_i} HV_{i+1},
\end{displaymath}

where ${\cal E}$ denotes the average over all possible trajectories. $R_i = \sum_{t=t_i}^{t_{i+1}-1} \gamma^{t-t_i} R(S_t,A_t)$ is $\cal C_{\em i}$'s discounted cumulative reinforcement during the time it is active (note that this time interval and the states encountered by $\cal C_{\em i}$ depend on $\cal C_{\em i}$'s subtask), and $HV_i := \max_{O_l \in O} \{HQ_i(O_l)\}$ is the estimated discounted cumulative reinforcement to be received by $\cal C_{\em i}$.

We adjust only the HQ-values of agents that became active during the last trial ($N$ denotes the number of agents active during that trial, $\alpha_{HQ}$ denotes the learning rate, and $\hat O_i$ the subgoal chosen by agent $\cal C_{\em i}$):

HQ.1
If $\cal C_{\em i}$ is invoked before agent $\cal C_{\em N-1}$, then we update according to
$HQ_i(\hat O_i) \leftarrow (1 - \alpha_{HQ}) HQ_i(\hat O_i) +
\alpha_{HQ} (R_i + \gamma^{t_{i+1} - t_i} HV_{i+1})$.
HQ.2
If $\cal C_{\em i} = $ $\cal C_{\em N-1}$, then $HQ_i(\hat O_i) \leftarrow
(1 - \alpha_{HQ}) HQ_i(\hat O_i) + \alpha_{HQ} (R_i + \gamma^{t_N - t_i} R_N)$.
HQ.3
If $\cal C_{\em i} = $ $\cal C_{\em N}$, and $i < M$, then $HQ_i(\hat O_i) \leftarrow (1 - \alpha_{HQ} ) HQ_i(\hat O_i) +
\alpha_{HQ} R_i$.

The first and third rules resemble traditional Q-learning rules. The second rule uses $\cal C_{\em N}$'s actual return $R_N$ instead of its estimate $HV_N$; this is necessary if agent $\cal C_{\em N}$ has learned a (possibly high) value for a subgoal that is unachievable due to the subgoals selected by previous agents.
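
The corresponding sketch for [HQ.1]-[HQ.3] is given here. Agents $\cal C_{\em 1}, \ldots, \cal C_{\em M}$ map to 0-based indices; per-agent returns $R_i$, activation times $t_i$ and chosen subgoals $\hat O_i$ are assumed to have been recorded during the trial, and all argument names and hyperparameter values are illustrative.

\begin{verbatim}
import numpy as np

def update_hq_tables(hq_tables, subgoals, returns, start_times,
                     num_active, alpha_hq=0.1, gamma=0.95):
    """hq_tables: array of shape (M, num_obs); subgoals[i] = chosen subgoal
    of agent C_i; returns[i] = R_i; start_times[i] = t_i; num_active = N."""
    M = hq_tables.shape[0]
    N = num_active
    for i in range(N):
        if i < N - 2:            # [HQ.1]: C_i invoked before C_{N-1}
            decay = gamma ** (start_times[i + 1] - start_times[i])
            target = returns[i] + decay * hq_tables[i + 1].max()
        elif i == N - 2:         # [HQ.2]: C_i = C_{N-1}, uses C_N's actual return
            decay = gamma ** (start_times[i + 1] - start_times[i])
            target = returns[i] + decay * returns[i + 1]
        elif i < M - 1:          # [HQ.3]: C_i = C_N and i < M
            target = returns[i]
        else:                    # the final agent C_M selects no subgoal
            continue
        old = hq_tables[i, subgoals[i]]
        hq_tables[i, subgoals[i]] = (1 - alpha_hq) * old + alpha_hq * target

# Example usage for a trial in which N = 3 of M = 4 agents became active:
hq = np.zeros((4, 16))
update_hq_tables(hq, subgoals=[5, 9, 2], returns=[0.0, 0.0, 1.0],
                 start_times=[0, 4, 9], num_active=3)
\end{verbatim}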


HQ($\lambda$)-learning: motivation. Q-learning's lookahead capability is restricted to one step. It cannot solve all RPPs because it cannot properly assign credit to different actions that lead to states with identical observations (Whitehead 1992). For instance, suppose you walk along a wall that looks the same everywhere except in the middle, where there is a picture. The goal is to reach the left corner, where there is reward. This RPP is solvable by an RP. Given the ``picture'' input, however, Q-learning with one-step lookahead would assign equal values to the actions ``go left'' and ``go right'', because both yield identical ``wall'' observations.

Consequently, HQ-learning may suffer from Q-learning's inability to solve certain RPPs. To overcome this problem, we augment HQ with TD($\lambda$)-methods for evaluating and improving policies, in a manner analogous to Lin's offline Q($\lambda$)-method (1993). TD($\lambda$)-methods integrate experiences from several successive steps to disambiguate identical short-term effects of different actions. Our experiments indicate that such RPPs are solvable by Q($\lambda$)-learning with sufficiently high $\lambda$.

Q($\lambda$).1
For the Q-tables we first compute desired Q-values $Q'(O_t, A_j)$ for $t=T,\ldots,1$:
$Q'(O_T, A_T) \leftarrow R(S_T,A_T)$
$Q'(O_t,A_t) \leftarrow R(S_t, A_t) +
\gamma ((1 - \lambda) V_{Active(t+1)}(O_{t+1})
+ \lambda Q'(O_{t+1},A_{t+1}))$
Q($\lambda$).2
Then we update the Q-values, beginning with $Q_N(O_T,A_T)$ and ending with $Q_1(O_1,A_1)$, according to
$Q_i(O_t, A_t) \leftarrow
(1 - \alpha_Q) Q_i(O_t, A_t) + \alpha_Q Q'(O_t,A_t)$

HQ($\lambda$).1
For the HQ-tables we also compute desired HQ-values $HQ'_i(\hat O_i)$ for $i=N,\ldots,1$:
$HQ'_N(\hat O_N) \leftarrow R_N$
$HQ'_{N-1}(\hat O_{N-1}) \leftarrow R_{N-1} + \gamma^{t_N - t_{N-1}} R_N$
$HQ'_i(\hat O_i) \leftarrow R_i + \gamma^{t_{i+1} - t_i}
((1 - \lambda) HV_{i+1} + \lambda HQ'_{i+1}(\hat O_{i+1}))$ for $i < N-1$
HQ($\lambda$).2
Then we update the HQ-values for agents $\cal C_{\em 1}, \ldots,\cal C_{\em Min(N,M-1)}$ according to
$HQ_i(\hat O_i) \leftarrow (1 - \alpha_{HQ}) HQ_i(\hat O_i) + \alpha_{HQ}
HQ'_i(\hat O_i)$
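
The sketch below computes the desired values of [Q($\lambda$).1] and [HQ($\lambda$).1] backwards and then applies the updates [Q($\lambda$).2] and [HQ($\lambda$).2], under the same storage assumptions as the one-step sketches above; names and hyperparameter values are again illustrative.

\begin{verbatim}
import numpy as np

def q_lambda_update(q_tables, obs, acts, rews, active,
                    alpha_q=0.1, gamma=0.95, lam=0.9):
    T = len(obs)
    # [Q(lambda).1]: desired Q-values, computed backwards from t = T.
    q_prime = np.zeros(T)
    q_prime[T - 1] = rews[T - 1]
    for t in range(T - 2, -1, -1):
        v_next = q_tables[active[t + 1], obs[t + 1]].max()
        q_prime[t] = rews[t] + gamma * ((1 - lam) * v_next
                                        + lam * q_prime[t + 1])
    # [Q(lambda).2]: move each Q-value toward its desired value.
    for t in range(T - 1, -1, -1):
        i = active[t]
        old = q_tables[i, obs[t], acts[t]]
        q_tables[i, obs[t], acts[t]] = (1 - alpha_q) * old + alpha_q * q_prime[t]

def hq_lambda_update(hq_tables, subgoals, returns, start_times,
                     num_active, alpha_hq=0.1, gamma=0.95, lam=0.9):
    N, M = num_active, hq_tables.shape[0]
    # [HQ(lambda).1]: desired HQ-values, computed backwards from i = N.
    hq_prime = np.zeros(N)
    hq_prime[N - 1] = returns[N - 1]
    if N > 1:
        decay = gamma ** (start_times[N - 1] - start_times[N - 2])
        hq_prime[N - 2] = returns[N - 2] + decay * returns[N - 1]
    for i in range(N - 3, -1, -1):
        decay = gamma ** (start_times[i + 1] - start_times[i])
        hv_next = hq_tables[i + 1].max()
        hq_prime[i] = returns[i] + decay * ((1 - lam) * hv_next
                                            + lam * hq_prime[i + 1])
    # [HQ(lambda).2]: update agents C_1, ..., C_min(N, M-1).
    for i in range(min(N, M - 1)):
        old = hq_tables[i, subgoals[i]]
        hq_tables[i, subgoals[i]] = (1 - alpha_hq) * old + alpha_hq * hq_prime[i]
\end{verbatim}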


In principle, online Q($\lambda$) may be used as well. See, e.g., Peng and Williams (1996), or Wiering and Schmidhuber's fast Q($\lambda$) implementation (1997). Online Q($\lambda$) should not use ``action-penalty'' (Koenig 1996), however, because punishing varying actions in response to ambiguous inputs will trap the agent in cyclic behavior.


Combined dynamics. Q-table policies are reactive and learn to solve RPPs. HQ-table policies are metastrategies for composing RPP sequences. Although Q-tables and HQ-tables do not explicitly communicate they influence each other through simultaneous learning. Their cooperation results in complex dynamics quite different from those of conventional Q-learning.

Utilities of subgoals and RPs are estimated by tracking how often they are part of successful subgoal/RP combinations. Subgoals that never or rarely occur in solutions become less likely to be chosen, others become more likely. In a certain sense subtasks compete for being assigned to subagents, and the subgoal choices ``co-evolve'' with the RPs. Maximizing its own expected utility, each agent implicitly takes into account frequent decisions made by other agents. Each agent eventually settles down on a particular RPP solvable by its RP and ceases to adapt. This will be illustrated by Experiment 1 in Section 3.

Estimating the average reward for choosing a particular subgoal ignores dependencies on previously chosen subgoals. This makes local minima possible: if several rewarding but suboptimal subgoal sequences are ``close'' in subgoal space, then the optimal sequence may end up less probable than the suboptimal ones. We will show in the experiments that this actually can happen.


Exploration issues. Initial choices of subgoals and RPs may influence the final result; the system may get trapped in local minima. Exploration is a partial remedy: it encourages alternative competitive strategies similar to the current one. Too little exploration may prevent the system from discovering the goal at all. Too much exploration, however, prevents reliable estimates of the current policy's quality and the reuse of previously successful RPs. To avoid over-exploration we use the Max-Boltzmann distribution for Q-values and the Max-Uniform distribution for HQ-values (for discussions of exploration issues see, e.g., Fedorov 1972; Schmidhuber 1991a; Thrun 1992; Cohn 1994; Caironi and Dorigo 1994; Storck, Hochreiter and Schmidhuber 1995; Wilson 1996a; Schmidhuber 1997c). These distributions also make it easy to reduce the relative weight of exploration (as opposed to exploitation): to obtain a deterministic policy at the end of the learning process, we increase $p_{max}$ (the probability of making the greedy choice) during learning until it finally reaches its maximum value.

Selecting actions according to the traditional Boltzmann distribution causes the following problems: (1) It is hard to find good values for the temperature parameter. (2) The degree of exploration depends on the Q-values: actions with almost identical Q-values (given a certain input) will be executed about equally often. For instance, suppose a sequence of 5 different states in a maze leads to the observation sequence $O_1 - O_1 - O_1 - O_1 - O_1$, where $O_1$ represents a single observation. Now suppose the Q-values for going west and for going east in response to $O_1$ are almost equal. Then the Q-updates will hardly change the differences between these Q-values, and the resulting random walk behavior will cost a lot of simulation time.

For RP training we therefore prefer the Max-Boltzmann rule: it focuses on the greedy policy and explores only actions whose values are competitive with those of the optimal actions. Subgoal exploration is less critical, though; the Max-Random subgoal exploration rule may be replaced by the Max-Boltzmann rule or others. Both rules are sketched below.
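
The sketch below shows both selection rules under the assumption that $p_{max}$ is the probability of taking the greedy choice; function names, the temperature value and the random seed are illustrative.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)  # illustrative fixed seed

def max_boltzmann_action(q_values, p_max=0.7, temperature=1.0):
    """With probability p_max take the greedy action; otherwise sample
    from the Boltzmann distribution over the Q-values."""
    if rng.random() < p_max:
        return int(np.argmax(q_values))
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                     # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))

def max_random_subgoal(hq_values, p_max=0.7):
    """Max-Random rule: greedy subgoal with probability p_max,
    a uniformly random subgoal otherwise."""
    if rng.random() < p_max:
        return int(np.argmax(hq_values))
    return int(rng.integers(len(hq_values)))

# Example: pick an action for one observation's Q-values.
action = max_boltzmann_action([0.1, 0.5, 0.4, 0.1])
\end{verbatim}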

