
HQ-LEARNING

POMDP specification. System life is separable into ``trials''. A trial consists of at most $T_{max}$ discrete time steps $t = 1,2,3,\dots,T$, where $T < T_{max}$ if the agent solves the problem in fewer than $T_{max}$ time steps. A POMDP is specified by ${\cal Z} = \langle S,S_1,O,B,A,R,\gamma,D \rangle$, where $S$ is a finite set of environmental states, $S_1 \in S$ is the initial state, $O$ is a finite set of observations, the function $B : S \rightarrow O$ is a many-to-one mapping of states to (ambiguous) observations, $A$ is a finite set of actions, $R : S \times A \rightarrow I\!\!R$ maps state-action pairs to scalar reinforcement signals, $0 \leq \gamma \leq 1$ is a discount factor which trades off immediate rewards against future rewards, and $D : S \times A \times S \rightarrow I\!\!R$ is the state transition function, where $p(S_{t+1}\vert S_t, A_t) := D(S_t, A_t \rightarrow S_{t+1}) = D(S_t,A_t,S_{t+1})$ denotes the probability of a transition to state $S_{t+1}$ given $S_t$ and $A_t$, $S_t \in S$ is the environmental state at time $t$, and $A_t \in A$ is the action executed at time $t$. The system's goal is to obtain maximal (discounted) cumulative reinforcement during the trial.
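
For concreteness, the tuple ${\cal Z}$ can be written down directly as a small data structure. The following Python sketch is our own illustration, not the paper's code; the class and field names (POMDP, observe, t_max, and so on) are assumptions introduced only for this example.

\begin{verbatim}
# A minimal Python sketch of the POMDP tuple Z = <S, S1, O, B, A, R, gamma, D>.
# All names below are illustrative, not taken from the paper.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class POMDP:
    states: List[int]                        # S: finite set of environmental states
    initial_state: int                       # S1: the initial state
    observations: List[int]                  # O: finite set of observations
    observe: Callable[[int], int]            # B: many-to-one map from states to observations
    actions: List[int]                       # A: finite set of actions
    reward: Callable[[int, int], float]      # R(s, a): scalar reinforcement
    gamma: float                             # discount factor, 0 <= gamma <= 1
    transition: Dict[Tuple[int, int, int], float]  # D(s, a, s'): transition probability
    t_max: int = 1000                        # T_max: maximal number of steps per trial
\end{verbatim}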


POMDPs as RPP sequences. The optimal policy of any deterministic POMDP with a final goal state is decomposable into a finite sequence of optimal policies for appropriate RPPs, together with subgoals determining the transitions from one RPP to the next. The trivial decomposition consists of single-state RPPs and the corresponding subgoals. In general, POMDPs whose only decomposition is trivial are hard: no efficient algorithm for solving them is known. HQ, however, is aimed at situations that require few transitions between RPPs. HQ's architecture implements such transitions by passing control from one RPP-solving subagent to the next.


Architecture. There is an ordered sequence of $M$ agents $\cal C_{\em 1}$, $\cal C_{\em 2}$, ..., $\cal C_{\em M}$, each equipped with a Q-table, an HQ-table, and a control transfer unit, except for $\cal C_{\em M}$, which has only a Q-table (see figure 1). Each agent is responsible for learning part of the system's policy. Its Q-table represents its local policy for selecting an action given an observation. It is a matrix of size $\vert O\vert \times \vert A\vert$, where $\vert O\vert$ is the number of different possible observations and $\vert A\vert$ the number of possible actions. $Q_i(O_t,A_j)$ denotes $\cal C_{\em i}$'s Q-value (utility) of action $A_j$ given observation $O_t$. The agent's current subgoal is generated with the help of its HQ-table, a vector with $\vert O\vert$ elements. For each possible observation there is an HQ-table entry representing its estimated value as a subgoal. $HQ_i(O_j)$ denotes $\cal C_{\em i}$'s HQ-value (utility) of selecting $O_j$ as its subgoal.

The system's current policy is the policy of the currently active agent. If $\cal C_{\em i}$ is active at time step $t$, then we will denote this by $Active(t) := i$. The variable $Active(t)$ represents the only kind of short-term memory in the system.
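
A minimal Python sketch of this architecture, assuming the tabular representations described above (class and attribute names are ours; for simplicity every agent gets an HQ-table here, although in the paper $\cal C_{\em M}$ has none):

\begin{verbatim}
import numpy as np

class HQAgent:
    # One subagent C_i: a |O| x |A| Q-table and an HQ-table with |O| entries.
    def __init__(self, num_observations, num_actions):
        self.Q = np.zeros((num_observations, num_actions))  # Q_i(O, A)
        self.HQ = np.zeros(num_observations)                # HQ_i(O)
        self.subgoal = None                                 # \hat O_i, set when the agent becomes active

class HQSystem:
    # Ordered sequence of M agents; the index of the active agent is the
    # system's only short-term memory (Active(t)).
    def __init__(self, M, num_observations, num_actions):
        self.agents = [HQAgent(num_observations, num_actions) for _ in range(M)]
        self.active = 0                          # index i-1 of the active agent C_i
        self.t_active = [1] + [None] * (M - 1)   # t_i: time step at which C_i became active
\end{verbatim}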


Architecture limitations. The sequential architecture restricts the types of POMDPs that HQ can solve. To see this, consider the difference between RL goals of (1) achievement and (2) maintenance. The former refer to single-state goals (e.g., finding the exit of a maze), the latter to maintaining a desirable state over time (such as keeping a pole balanced). Our current HQ-variant handles achievement goals only. In the case of maintenance goals it will eventually run out of agents, because there must be an explicit final desired state (this restriction might be overcome with different agent topologies, which are beyond the scope of this paper).

[Figure 1 about here.]

Selecting a subgoal. In the beginning $\cal C_{\em 1}$ is made active. Once $\cal C_{\em i}$ is active, its HQ-table is used to select a subgoal for $\cal C_{\em i}$. To explore different subgoal sequences we use the Max-Random rule: the subgoal with maximal $HQ_i$-value is selected with probability $p_{max}$, and a random subgoal is selected with probability $1 - p_{max}$. Conflicts between multiple subgoals with maximal $HQ_i$-values are resolved by randomly selecting one of them. $\hat O_i$ denotes the subgoal selected by agent $\cal C_{\em i}$. This subgoal is used only for the transfer of control defined below and should not be confused with the current observation $O_t$.
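
A sketch of the Max-Random rule in Python (the function name and signature are our own illustration):

\begin{verbatim}
import numpy as np

def select_subgoal(hq_values, p_max, rng):
    # Max-Random rule: with probability p_max pick a subgoal with maximal
    # HQ_i value (ties broken at random), otherwise pick a random subgoal.
    if rng.random() < p_max:
        best = np.flatnonzero(hq_values == hq_values.max())
        return int(rng.choice(best))
    return int(rng.integers(len(hq_values)))
\end{verbatim}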


Selecting an action. $\cal C_{\em i}$'s action choice depends only on the current observation $O_t$. During learning, the active agent $\cal C_{\em i}$ selects actions at time $t$ according to the Max-Boltzmann distribution: with probability $p_{max}$ it takes the action $A_j$ with maximal $Q_i(O_t,A_j)$-value; with probability $1 - p_{max}$ it selects an action according to the traditional Boltzmann or Gibbs distribution, in which the probability of selecting action $A_j$ given observation $O_t$ is:

\begin{displaymath}
p(A_j \vert O_t) = \frac{e^{Q_i(O_t,A_j)/T_i}}{\sum_{A_k \in A} e^{Q_i(O_t,A_k)/T_i}}.
\end{displaymath}

The ``temperature'' $T_i$ adjusts the degree of randomness in agent $\cal C_{\em i}$'s action selection when the Boltzmann rule is used. Conflicts between multiple actions with maximal Q-values are resolved by randomly selecting one of them.
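
The Max-Boltzmann rule could be coded along the following lines (again an illustrative sketch; subtracting the maximum before exponentiation is a standard numerical-stability trick, not something the paper prescribes):

\begin{verbatim}
import numpy as np

def select_action(q_row, p_max, temperature, rng):
    # q_row holds Q_i(O_t, .) for the current observation O_t.
    if rng.random() < p_max:
        # Greedy choice; ties between maximal Q-values are broken at random.
        best = np.flatnonzero(q_row == q_row.max())
        return int(rng.choice(best))
    # Otherwise sample from the Boltzmann/Gibbs distribution with temperature T_i.
    prefs = q_row / temperature
    prefs = prefs - prefs.max()          # for numerical stability only
    probs = np.exp(prefs)
    probs = probs / probs.sum()
    return int(rng.choice(len(q_row), p=probs))
\end{verbatim}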


Transfer of control. Control is transferred from the currently active agent to its successor as follows. Each time $\cal C_{\em i}$ has executed an action, its control transfer unit checks whether $\cal C_{\em i}$ has reached the goal. If not, it checks whether $\cal C_{\em i}$ has achieved its subgoal, to decide whether control should be passed on to $\cal C_{\em i+1}$. We let $t_i$ denote the time at which agent $\cal C_{\em i}$ is made active (at system start-up, we set $t_1 \leftarrow 1$).

IF no goal state reached AND current subgoal = $\hat O_i$
AND $Active(t) < M$ AND $B(S_t) = \hat O_i$
THEN $Active(t+1) \leftarrow Active(t) + 1$ AND $t_{i+1} \leftarrow t+1$
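
In terms of the architecture sketch above, this check could look as follows (names are ours; system refers to the HQSystem sketched earlier):

\begin{verbatim}
def maybe_transfer_control(system, observation, goal_reached, t):
    # Run after each action of the active agent C_i: pass control to C_{i+1}
    # if the goal is not reached, C_i is not the last agent, and the current
    # observation matches its subgoal \hat O_i.
    i = system.active
    if (not goal_reached
            and i < len(system.agents) - 1                 # Active(t) < M
            and observation == system.agents[i].subgoal):  # B(S_t) = \hat O_i
        system.active = i + 1                              # Active(t+1) <- Active(t) + 1
        system.t_active[i + 1] = t + 1                     # t_{i+1} <- t + 1
\end{verbatim}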


