- Boutilier and Poole, 1996
-
Boutilier, C. and Poole, D. (1996).
Computing optimal policies for partially observable Markov decision
processes using compact representations.
In Proceedings of the AAAI, Portland, OR.
- Caironi and Dorigo, 1994
-
Caironi, P. V. C. and Dorigo, M. (1994).
Training Q-agents.
Technical Report IRIDIA-94-14, Université Libre de Bruxelles.
- Chrisman, 1992
-
Chrisman, L. (1992).
Reinforcement learning with perceptual aliasing: The perceptual
distinctions approach.
In Proceedings of the Tenth National Conference on
Artificial Intelligence, pages 183-188. AAAI Press, San Jose, California.
- Cliff and Ross, 1994
-
Cliff, D. and Ross, S. (1994).
Adding temporary memory to ZCS.
Adaptive Behavior, 3:101-150.
- Cohn, 1994
-
Cohn, D. A. (1994).
Neural network exploration using optimal experiment design.
In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances
in Neural Information Processing Systems 6, pages 679-686. Morgan Kaufmann.
- Dayan and Hinton, 1993
-
Dayan, P. and Hinton, G. (1993).
Feudal reinforcement learning.
In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 5, pages 271-278. Morgan
Kaufmann.
- Digney, 1996
-
Digney, B. (1996).
Emergent hierarchical control structures: Learning
reactive/hierarchical relationships in reinforcement environments.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson,
S. W., editors, From Animals to Animats 4: Proceedings of the Fourth
International Conference on Simulation of Adaptive Behavior, Cambridge, MA,
pages 363-372. MIT Press, Bradford Books.
- Fedorov, 1972
-
Fedorov, V. V. (1972).
Theory of optimal experiments.
Academic Press.
- Hihi and Bengio, 1996
-
Hihi, S. E. and Bengio, Y. (1996).
Hierarchical recurrent neural networks for long-term dependencies.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 493-499. MIT
Press.
- Humphrys, 1996
-
Humphrys, M. (1996).
Action selection methods using reinforcement learning.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson,
S. W., editors, From Animals to Animats 4: Proceedings of the Fourth
International Conference on Simulation of Adaptive Behavior, Cambridge, MA,
pages 135-144. MIT Press, Bradford Books.
- Jaakkola et al., 1995
-
Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995).
Reinforcement learning algorithm for partially observable Markov
decision problems.
In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 345-352. MIT
Press.
- Jordan and Rumelhart, 1990
-
Jordan, M. I. and Rumelhart, D. E. (1990).
Supervised learning with a distal teacher.
Technical Report Occasional Paper #40, Center for Cog. Sci.,
Massachusetts Institute of Technology.
- Kaelbling et al., 1995
-
Kaelbling, L., Littman, M., and Cassandra, A. (1995).
Planning and acting in partially observable stochastic domains.
Technical report, Brown University, Providence RI.
- Koenig and Simmons, 1996
-
Koenig, S. and Simmons, R. G. (1996).
The effect of representation and knowledge on goal-directed
exploration with reinforcement learning algorithms.
Machine Learning, 22:228-250.
- Levin, 1973
-
Levin, L. A. (1973).
Universal sequential search problems.
Problems of Information Transmission, 9(3):265-266.
- Lin, 1993
-
Lin, L. (1993).
Reinforcement Learning for Robots Using Neural Networks.
PhD thesis, Carnegie Mellon University, Pittsburgh.
- Littman, 1994
-
Littman, M. (1994).
Memoryless policies: Theoretical limitations and practical results.
In Cliff, D., Husbands, P., Meyer, J.-A., and Wilson, S. W., editors, Proc. of the International Conference on Simulation of Adaptive Behavior:
From Animals to Animats 3, pages 297-305. MIT Press/Bradford Books.
- Littman, 1996
-
Littman, M. (1996).
Algorithms for Sequential Decision Making.
PhD thesis, Brown University.
- Littman et al., 1995
-
Littman, M., Cassandra, A., and Kaelbling, L. (1995).
Learning policies for partially observable environments: Scaling up.
In Prieditis, A. and Russell, S., editors, Machine Learning:
Proceedings of the Twelfth International Conference, pages 362-370. Morgan
Kaufmann Publishers, San Francisco, CA.
- McCallum, 1993
-
McCallum, R. A. (1993).
Overcoming incomplete perception with utile distinction memory.
In Machine Learning: Proceedings of the Tenth International
Conference. Morgan Kaufmann, Amherst, MA.
- McCallum, 1996
-
McCallum, R. A. (1996).
Learning to use selective attention and short-term memory in
sequential tasks.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson,
S. W., editors, From Animals to Animats 4: Proceedings of the Fourth
International Conference on Simulation of Adaptive Behavior, Cambridge, MA,
pages 315-324. MIT Press, Bradford Books.
- Moore and Atkeson, 1993
-
Moore, A. and Atkeson, C. G. (1993).
Prioritized sweeping: Reinforcement learning with less data and less
time.
Machine Learning, 13:103-130.
- N. L. Zhang, 1996
-
Zhang, N. L. and Liu, W. (1996).
Planning in stochastic domains: Problem characteristics and
approximation.
Technical Report HKUST-CS96-31, The Hong Kong University of
Science and Technology.
- Nguyen and Widrow, 1989
-
Nguyen, D. and Widrow, B. (1989).
The truck backer-upper: An example of self-learning in neural
networks.
In Proceedings of the International Joint Conference on Neural
Networks, pages 357-363. IEEE Press.
- Parr and Russell, 1995
-
Parr, R. and Russell, S. (1995).
Approximating optimal policies for partially observable stochastic
domains.
In Proceedings of the International Joint Conference on
Artificial Intelligence (IJCAI-95), pages 1088-1094. Morgan Kaufmann.
- Peng and Williams, 1996
-
Peng, J. and Williams, R. (1996).
Incremental multi-step Q-learning.
Machine Learning, 22:283-290.
- Ring, 1994
-
Ring, M. B. (1994).
Continual Learning in Reinforcement Environments.
PhD thesis, University of Texas at Austin, Austin, Texas 78712.
- Ron et al., 1994
-
Ron, D., Singer, Y., and Tishby, N. (1994).
Learning probabilistic automata with variable memory length.
In Aleksander, I. and Taylor, J., editors, Proceedings
Computational Learning Theory. ACM Press.
- Schmidhuber, 1991a
-
Schmidhuber, J. (1991a).
Curious model-building control systems.
In Proceedings of the International Joint Conference on Neural
Networks, Singapore, volume 2, pages 1458-1463. IEEE press.
- Schmidhuber, 1991b
-
Schmidhuber, J. (1991b).
Learning to generate sub-goals for action sequences.
In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J.,
editors, Artificial Neural Networks, pages 967-972. Elsevier Science
Publishers B.V., North-Holland.
- Schmidhuber, 1991c
-
Schmidhuber, J. (1991c).
Reinforcement learning in Markovian and non-Markovian
environments.
In Lippman, D. S., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 500-506. Morgan
Kaufmann.
- Schmidhuber, 1992
-
Schmidhuber, J. (1992).
Learning complex, extended sequences using the principle of history
compression.
Neural Computation, 4(2):234-242.
- Schmidhuber, 1997
-
Schmidhuber, J. (1997).
What's interesting?
Technical Report IDSIA-35-97, IDSIA.
ftp://ftp.idsia.ch/pub/juergen/interest.ps.gz; extended abstract in
Proc. Snowbird'98, Utah, 1998.
- Schmidhuber et al., 1997a
-
Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997a).
Reinforcement learning with self-modifying policies.
In Thrun, S. and Pratt, L., editors, Learning to learn, pages
293-309. Kluwer.
- Schmidhuber et al., 1997b
-
Schmidhuber, J., Zhao, J., and Wiering, M. (1997b).
Shifting inductive bias with success-story algorithm, adaptive
Levin search, and incremental self-improvement.
Machine Learning, 28:105-130.
- Singh, 1992
-
Singh, S. (1992).
The efficient learning of multiple task sequences.
In Moody, J., Hanson, S., and Lippman, R., editors, Advances in
Neural Information Processing Systems 4, pages 251-258, San Mateo, CA.
Morgan Kaufmann.
- Sondik, 1971
-
Sondik, E. J. (1971).
The Optimal Control of Partially Observable Markov Decision
Processes.
PhD thesis, Stanford University, Stanford, California.
- Storck et al., 1995
-
Storck, J., Hochreiter, S., and Schmidhuber, J. (1995).
Reinforcement driven information acquisition in non-deterministic
environments.
In Proceedings of the International Conference on Artificial
Neural Networks, Paris, volume 2, pages 159-164. EC2 & Cie.
- Sutton, 1988
-
Sutton, R. S. (1988).
Learning to predict by the methods of temporal differences.
Machine Learning, 3:9-44.
- Sutton, 1995
-
Sutton, R. S. (1995).
TD models: Modeling the world at a mixture of time scales.
In Prieditis, A. and Russell, S., editors, Machine Learning:
Proceedings of the Twelfth International Conference, pages 531-539. Morgan
Kaufmann Publishers, San Francisco, CA.
- Teller, 1994
-
Teller, A. (1994).
The evolution of mental models.
In Kinnear, Jr., K. E., editor, Advances in Genetic
Programming, pages 199-219. MIT Press.
- Tham, 1995
-
Tham, C. (1995).
Reinforcement learning of multiple tasks using a hierarchical CMAC
architecture.
Robotics and Autonomous Systems, 15(4):247-274.
- Thrun, 1992
-
Thrun, S. (1992).
Efficient exploration in reinforcement learning.
Technical Report CMU-CS-92-102, Carnegie-Mellon University.
- Watkins, 1989
-
Watkins, C. (1989).
Learning from Delayed Rewards.
PhD thesis, King's College, Cambridge.
- Watkins and Dayan, 1992
-
Watkins, C. J. C. H. and Dayan, P. (1992).
Q-learning.
Machine Learning, 8:279-292.
- Whitehead, 1992
-
Whitehead, S. (1992).
Reinforcement Learning for the adaptive control of perception
and action.
PhD thesis, University of Rochester.
- Wiering and Schmidhuber, 1996
-
Wiering, M. and Schmidhuber, J. (1996).
Solving POMDPs with Levin search and EIRA.
In Saitta, L., editor, Machine Learning: Proceedings of the
Thirteenth International Conference, pages 534-542. Morgan Kaufmann
Publishers, San Francisco, CA.
- Wiering and Schmidhuber, 1997
-
Wiering, M. A. and Schmidhuber, J. (1997).
Fast online Q(lambda).
Technical Report IDSIA-21-97, IDSIA.
In preparation.
- Wilson, 1994
-
Wilson, S. (1994).
ZCS: A zeroth level classifier system.
Evolutionary Computation, 2:1-18.
- Wilson, 1995
-
Wilson, S. (1995).
Classifier fitness based on accuracy.
Evolutionary Computation, 3(2):149-175.
- Wilson, 1996
-
Wilson, S. (1996).
Generalization in XCS.
In ICML'96 Workshop on Evolutionary Computing and Machine
Learning. Morgan Kaufmann Publishers, San Francisco, CA.
- Zhao and Schmidhuber, 1996
-
Zhao, J. and Schmidhuber, J. (1996).
Incremental self-improvement for life-time multi-agent reinforcement
learning.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson,
S. W., editors, From Animals to Animats 4: Proceedings of the Fourth
International Conference on Simulation of Adaptive Behavior, Cambridge, MA,
pages 516-525. MIT Press, Bradford Books.
Figure 1:
Basic architecture. Three agents are connected sequentially.
Each agent has a Q-table, an HQ-table, and a control transfer unit,
except for the last agent, which has only a Q-table.
The Q-table stores estimates of observation/action values
and is used to select the next action.
The HQ-table stores estimated subgoal values and
is used to generate a subgoal once the agent becomes active.
The solid box indicates that the second agent is currently active.
Once the agent has achieved its subgoal,
the control transfer unit passes control to its successor.
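To make the architecture in Figure 1 concrete, the following minimal Python sketch shows one way the per-agent tables and the control-transfer step could be organized. It is an illustration only, not the paper's implementation: the names (ReactiveAgent, run_episode), the epsilon-greedy selection, and the env object with reset()/step() are assumptions, and the Q/HQ learning updates are omitted.

import random

class ReactiveAgent:
    def __init__(self, n_observations, n_actions, is_last=False):
        # Q-table: estimated observation/action values, used to select actions.
        self.Q = [[0.0] * n_actions for _ in range(n_observations)]
        # HQ-table: estimated subgoal values; the last agent needs none
        # because it never transfers control to a successor.
        self.HQ = None if is_last else [0.0] * n_observations
        self.subgoal = None

    def choose_subgoal(self, epsilon=0.1):
        # Called once when the agent becomes active.
        if self.HQ is None:
            self.subgoal = None
        elif random.random() < epsilon:
            self.subgoal = random.randrange(len(self.HQ))
        else:
            self.subgoal = max(range(len(self.HQ)), key=lambda o: self.HQ[o])
        return self.subgoal

    def choose_action(self, observation, epsilon=0.1):
        values = self.Q[observation]
        if random.random() < epsilon:
            return random.randrange(len(values))
        return max(range(len(values)), key=lambda a: values[a])

    def subgoal_reached(self, observation):
        # Control transfer unit: fires when the agent observes its subgoal.
        return self.subgoal is not None and observation == self.subgoal

def run_episode(agents, env, max_steps=1000):
    # Agents are activated one after another; each acts until it reaches
    # its subgoal, then hands control to its successor (hypothetical env
    # with reset()/step(); learning updates omitted).
    active = 0
    obs = env.reset()
    agents[active].choose_subgoal()
    for _ in range(max_steps):
        action = agents[active].choose_action(obs)
        obs, reward, done = env.step(action)
        if done:
            return True
        if active + 1 < len(agents) and agents[active].subgoal_reached(obs):
            active += 1
            agents[active].choose_subgoal()
    return False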
Figure 2:
A partially observable maze (POM).
The task is to find a path
leading from start S to goal G.
The optimal solution requires 28 steps and at least three reactive agents.
The figure shows a possible sub-optimal
solution that costs 30 steps.
Asterisks mark appropriate subgoals.
Figure 3:
Left: HQ-learning results for the partially observable maze,
for 3, 4, 6, 8, and 12 agents.
We plot average test run length against trial number
(means of 100 simulations).
The system almost always converges to near-optimal
solutions. Using more than the required 3 agents tends to
improve performance. Right: results for 4, 8, and 12 agents whose
actions are corrupted by 10% noise.
In most cases they find the goal, although
noisy actions decrease performance.
Table 1:
HQ-learning results for random actions
replacing the selected actions with
probability 0% and 10%.
All table entries refer to the final test trial.
The 2nd column lists average trial lengths.
The 3rd column lists goal hit percentages.
The 4th column lists average path lengths of solutions.
The 5th column lists
percentages of simulations during which the optimal
path is found.
System                 | Av. steps | Goal (%) | Av. sol. | Optimal (%)
3 agents               |       263 |       76 |       30 |           3
4 agents               |        60 |       97 |       31 |           6
6 agents               |        31 |      100 |       31 |          14
8 agents               |        31 |      100 |       31 |          12
12 agents              |        32 |      100 |       32 |           6
4 agents (10% noise)   |       177 |       86 |       43 |           2
8 agents (10% noise)   |       166 |       87 |       41 |           2
12 agents (10% noise)  |       196 |       84 |       43 |           0
Random                 |       912 |       19 |      537 |           0
Figure 4:
Subgoal combinations (SCs) generated by
a 3-agent system, sampled at intervals of 10 trials.
Initially many different SCs are tried out. After 10,000 trials
HQ explores fewer and fewer SCs until it finally converges to SC (3, 12).
Figure 5:
A partially observable maze containing a key K and a door (grey area).
Starting at S, the system first has to find the key to open
the door, then proceed to the goal G.
The shortest path costs 83 steps.
This optimal solution requires at least three reactive agents.
The number of possible world states is .
Figure 6:
Left: HQ-learning results for the ``key and door'' problem. We plot
average test run length against trial number
(means of 100 simulations).
Within 20,000 trials, systems with 3 (4, 6, and 8) agents
find good deterministic policies in 85% (96%, 96%,
and 99%) of the simulations. Right: HQ-learning results with an 8-agent
system whose actions are replaced by random actions with
probability 5%, 10%, and 25%.
Table 2:
Results of 100 HQ-learning simulations
for the ``key and door'' task.
All table entries refer to the final test trial.
The second column lists average trial lengths.
The third lists goal hit percentages.
The fourth lists average path lengths of solutions.
The fifth lists percentages of simulations during which the optimal
83 step path was found.
HQ-learning could solve the task
with a limit of 1000 steps per trial. Random search
needed a 10,000 step limit.
System                 | Av. steps | Goal (%) | Av. sol. | Optimal (%)
3 agents               |       224 |       85 |       87 |           8
4 agents               |       126 |       96 |       90 |           9
6 agents               |       127 |       96 |       91 |           8
8 agents               |       101 |       99 |       92 |           6
8 agents (5% noise)    |       360 |       92 |      304 |           0
8 agents (10% noise)   |       399 |       90 |      332 |           0
8 agents (25% noise)   |       442 |       84 |      336 |           0
Random                 |     *9310 |       19 |     6370 |           0