POMDP specification. System life is separable into ``trials''. A trial consists of at most $T_{max}$ discrete time steps $t = 1, 2, \ldots, T$, where $T < T_{max}$ if the agent solves the problem in fewer than $T_{max}$ time steps. A POMDP is specified by the tuple $(S, s_1, O, \Phi, A, R, \gamma, P)$, where $S$ is a finite set of environmental states, $s_1 \in S$ is the initial state, $O$ is a finite set of observations, the function $\Phi: S \rightarrow O$ is a many-to-one mapping of states to (ambiguous) observations, $A$ is a finite set of actions, $R: S \times A \rightarrow \mathbb{R}$ maps state-action pairs to scalar reinforcement signals, $\gamma \in [0,1]$ is a discount factor which trades off immediate rewards against future rewards, and $P$ is a state transition function, where $P(s' \mid s, a)$ denotes the probability of transition to state $s'$ given $s_t = s$ and $a_t = a$, where $s_t$ is the environmental state at time $t$, and $a_t$ is the action executed at time $t$. The system's goal is to obtain maximal (discounted) cumulative reinforcement during the trial.
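The tuple above can be sketched as a small data structure. This is a minimal illustration, not the paper's implementation; all names (POMDP, phi, step, the toy two-state instance) are our own, and the example deliberately gives both states the same observation to show the ambiguity of $\Phi$.

```python
import random
from dataclasses import dataclass

# Minimal sketch of the POMDP tuple (S, s_1, O, Phi, A, R, gamma, P).
# All names are illustrative choices, not from the paper.
@dataclass
class POMDP:
    states: list          # finite state set S
    start: int            # initial state s_1
    observations: list    # finite observation set O
    phi: dict             # many-to-one map Phi: state -> observation
    actions: list         # finite action set A
    R: dict               # (state, action) -> scalar reinforcement
    P: dict               # (state, action) -> {next_state: probability}
    gamma: float = 0.9    # discount factor

    def observe(self, s):
        # The agent never sees s directly, only the ambiguous observation.
        return self.phi[s]

    def step(self, s, a):
        # Sample the successor from P(. | s, a); return it with the reward.
        dist = self.P[(s, a)]
        s_next = random.choices(list(dist), weights=list(dist.values()))[0]
        return s_next, self.R[(s, a)]

# Toy instance: two states emit the same observation (perceptual aliasing).
pomdp = POMDP(
    states=[0, 1], start=0, observations=["x"],
    phi={0: "x", 1: "x"}, actions=["go"],
    R={(0, "go"): 0.0, (1, "go"): 1.0},
    P={(0, "go"): {1: 1.0}, (1, "go"): {1: 1.0}},
)
s_next, r = pomdp.step(0, "go")
```

Because $\Phi$ is many-to-one, a memoryless policy over observations cannot distinguish the two states here; this is exactly the difficulty HQ addresses.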
POMDPs as RPP sequences.
The optimal policy of any deterministic POMDP with
final goal state is decomposable into
a finite sequence of optimal policies for appropriate RPPs,
along with subgoals determining transitions from one RPP to the next.
The trivial decomposition consists of single-state RPPs
and the corresponding subgoals.
In general, POMDPs whose only decomposition is trivial are
hard -- there is no known efficient algorithm for solving them.
HQ, however, is aimed at situations that require few transitions
between RPPs. HQ's architecture implements such transitions by passing
control from one RPP-solving subagent to the next.
Architecture.
There is an ordered sequence of agents
$C_1, C_2, \ldots, C_M$,
each equipped with
a Q-table, an HQ-table, and a
control transfer unit, except for
$C_M$,
which only has a Q-table (see figure 1).
Each agent is responsible for learning part of the system's policy.
Its Q-table represents its local policy for executing an action given
an input. It is given by
a matrix of size
$|O| \times |A|$, where $|O|$ is the
number of different possible observations
and $|A|$ the number of possible actions.
$Q_i(o, a)$ denotes
$C_i$'s Q-value (utility) of
action $a$ given observation $o$.
The agent's current subgoal is generated with the help of
its HQ-table, a vector with $|O|$ elements.
For each possible observation there
is an HQ-table entry representing its estimated value as a subgoal.
$HQ_i(o)$ denotes
$C_i$'s HQ-value (utility) of
selecting $o$ as its subgoal.
The system's current policy is the policy of the currently active agent. If $C_i$ is active at time step $t$, then we will denote this by $active(t) = i$. The variable $active(t)$ represents the only kind of short-term memory in the system.
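The architecture can be sketched as follows. The class name, zero-initialisation, and table sizes are illustrative assumptions; the paper specifies only that each agent holds a $|O| \times |A|$ Q-table plus (except for $C_M$) a $|O|$-element HQ-table.

```python
# Sketch of one HQ agent's tables, assuming n_obs observations and
# n_actions actions. Names and zero-initialisation are illustrative.
class Agent:
    def __init__(self, n_obs, n_actions, has_hq=True):
        # Q-table: |O| x |A| matrix; Q[o][a] = utility of action a given obs o.
        self.Q = [[0.0] * n_actions for _ in range(n_obs)]
        # HQ-table: |O|-vector; HQ[o] = estimated value of o as a subgoal.
        # The last agent C_M has no HQ-table (it selects no subgoal).
        self.HQ = [0.0] * n_obs if has_hq else None

# Ordered sequence C_1 .. C_M; only C_M lacks an HQ-table.
M = 3
agents = [Agent(n_obs=4, n_actions=2, has_hq=(i < M - 1)) for i in range(M)]
active = 0  # index of the active agent: the system's only short-term memory
```

Note that the single integer `active` is the system's entire internal state between time steps, which is what makes the memory requirements of HQ so modest.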
Architecture limitations. The sequential architecture restricts the types of POMDPs HQ can solve. To see this, consider the difference between RL goals of (1) achievement and (2) maintenance. The former refer to single-state goals (e.g., find the exit of a maze), the latter to maintaining a desirable state over time (such as keeping a pole balanced). Our current HQ-variant handles achievement goals only. In the case of maintenance goals it will eventually run out of agents -- there must be an explicit final desired state (this restriction may be overcome with different agent topologies beyond the scope of this paper).
Selecting a subgoal. In the beginning $C_1$ is made active. Once $C_i$ is active, its HQ-table is used to select a subgoal for $C_i$. To explore different subgoal sequences we use the Max-Random rule: the subgoal with maximal $HQ_i$-value is selected with probability $P_{max}$, a random subgoal is selected with probability $1 - P_{max}$. Conflicts between multiple subgoals with maximal $HQ_i$-values are solved by randomly selecting one. $\hat{o}_i$ denotes the subgoal selected by agent $C_i$. This subgoal is only used in transfer of control as defined below and should not be confused with an observation.
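The Max-Random rule can be sketched in a few lines. The function and parameter names (`select_subgoal`, `p_max`) are our own; the tie-breaking by uniform random choice follows the text.

```python
import random

# Max-Random subgoal selection, a sketch: with probability p_max pick the
# subgoal with maximal HQ-value (ties broken uniformly at random),
# otherwise pick a uniformly random subgoal.
def select_subgoal(hq, p_max=0.9):
    if random.random() < p_max:
        best = max(hq)
        return random.choice([o for o, v in enumerate(hq) if v == best])
    return random.randrange(len(hq))

hq_table = [0.1, 0.7, 0.7, 0.2]
goal = select_subgoal(hq_table, p_max=1.0)  # greedy branch: one of the ties
```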
Selecting an action.
$C_i$'s action choice depends only
on the current observation $o_t = \Phi(s_t)$.
During learning, at time $t$,
the active agent $C_i$
will select actions
according to the Max-Boltzmann distribution:
with probability $P_{max}$
take the action with maximal $Q_i$-value;
with probability $1 - P_{max}$
select an action according to the traditional Boltzmann or Gibbs distribution, where
the probability of selecting action $a$ given observation $o_t$ is
$$P(a \mid o_t) = \frac{e^{Q_i(o_t, a)/\tau}}{\sum_{b \in A} e^{Q_i(o_t, b)/\tau}},$$
where $\tau$ is a temperature parameter controlling the randomness of action selection.
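A sketch of Max-Boltzmann selection follows; `p_max` and `tau` are illustrative parameter names, and the softmax corresponds to the Gibbs distribution just given.

```python
import math
import random

# Max-Boltzmann action selection, a sketch: with probability p_max take the
# greedy action; otherwise sample from the Boltzmann (Gibbs) distribution
# P(a|o) = exp(Q[o][a]/tau) / sum_b exp(Q[o][b]/tau).
def select_action(q_row, p_max=0.9, tau=1.0):
    if random.random() < p_max:
        return q_row.index(max(q_row))
    weights = [math.exp(q / tau) for q in q_row]
    return random.choices(range(len(q_row)), weights=weights)[0]

q_row = [0.0, 2.0, 1.0]  # one row of the active agent's Q-table
a = select_action(q_row, p_max=1.0)  # greedy branch selects action 1
```

Mixing a greedy choice with occasional Boltzmann sampling keeps exploitation dominant while still exploring in proportion to estimated utilities.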
Transfer of control.
Control is transferred from one active agent to the next
as follows.
Each time
$C_i$
has executed an action,
its control transfer unit
checks whether
the system
has reached the goal.
If not, it checks whether
$C_i$
has solved its subgoal to decide whether control should be passed on to
$C_{i+1}$.
We let $t_i$ denote the time at which
agent
$C_i$
is made active
(at system start-up, we set
$t_1 = 1$).
IF no goal state reached AND current subgoal $\hat{o}_i = \Phi(s_t)$
AND $active(t) = i$ AND $i < M$
THEN $active(t+1) = i+1$ AND $t_{i+1} = t+1$
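The transfer rule can be sketched as a single predicate, assuming the reconstruction above: control passes from $C_i$ to $C_{i+1}$ when $C_i$'s subgoal observation is reached, the overall goal is not yet solved, and $C_i$ is not the last agent. Function and variable names are illustrative.

```python
# Sketch of the control-transfer rule. subgoals[i] is the subgoal chosen
# by agent C_i via its HQ-table; C_M (the last agent) selects none.
def transfer_control(active, observation, subgoals, goal_reached, M):
    if (not goal_reached) and observation == subgoals[active] and active < M - 1:
        return active + 1  # pass control to the next agent
    return active          # otherwise the same agent stays active

subgoals = ["door", "key", None]  # hypothetical subgoal observations
nxt = transfer_control(0, "door", subgoals, goal_reached=False, M=3)
stay = transfer_control(0, "key", subgoals, goal_reached=False, M=3)
```

Because transfers only ever move forward along the sequence $C_1, \ldots, C_M$, each agent is activated at most once per trial, which is why maintenance goals eventually exhaust the agent sequence.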