
Bibliography

Boutilier and Poole, 1996
Boutilier, C. and Poole, D. (1996).
Computing optimal policies for partially observable Markov decision processes using compact representations.
In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, OR.

Caironi and Dorigo, 1994
Caironi, P. V. C. and Dorigo, M. (1994).
Training Q-agents.
Technical Report IRIDIA-94-14, Université Libre de Bruxelles.

Chrisman, 1992
Chrisman, L. (1992).
Reinforcement learning with perceptual aliasing: The perceptual distinctions approach.
In Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), pages 183-188. AAAI Press, San Jose, California.

Cliff and Ross, 1994
Cliff, D. and Ross, S. (1994).
Adding temporary memory to ZCS.
Adaptive Behavior, 3:101-150.

Cohn, 1994
Cohn, D. A. (1994).
Neural network exploration using optimal experiment design.
In Cowan, J., Tesauro, G., and Alspector, J., editors, Advances in Neural Information Processing Systems 6, pages 679-686. Morgan Kaufmann.

Dayan and Hinton, 1993
Dayan, P. and Hinton, G. (1993).
Feudal reinforcement learning.
In Hanson, S. J., Cowan, J. D., and Giles, C. L., editors, Advances in Neural Information Processing Systems 5, pages 271-278. Morgan Kaufmann.

Digney, 1996
Digney, B. (1996).
Emergent hierarchical control structures: Learning reactive/hierarchical relationships in reinforcement environments.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge, MA, pages 363-372. MIT Press, Bradford Books.

Fedorov, 1972
Fedorov, V. V. (1972).
Theory of optimal experiments.
Academic Press.

Hihi and Bengio, 1996
Hihi, S. E. and Bengio, Y. (1996).
Hierarchical recurrent neural networks for long-term dependencies.
In Touretzky, D. S., Mozer, M. C., and Hasselmo, M. E., editors, Advances in Neural Information Processing Systems 8, pages 493-499. MIT Press.

Humphrys, 1996
Humphrys, M. (1996).
Action selection methods using reinforcement learning.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge, MA, pages 135-144. MIT Press, Bradford Books.

Jaakkola et al., 1995
Jaakkola, T., Singh, S. P., and Jordan, M. I. (1995).
Reinforcement learning algorithm for partially observable Markov decision problems.
In Tesauro, G., Touretzky, D. S., and Leen, T. K., editors, Advances in Neural Information Processing Systems 7, pages 345-352. MIT Press.

Jordan and Rumelhart, 1990
Jordan, M. I. and Rumelhart, D. E. (1990).
Supervised learning with a distal teacher.
Technical Report Occasional Paper #40, Center for Cognitive Science, Massachusetts Institute of Technology.

Kaelbling et al., 1995
Kaelbling, L., Littman, M., and Cassandra, A. (1995).
Planning and acting in partially observable stochastic domains.
Technical report, Brown University, Providence, RI.

Koenig and Simmons, 1996
Koenig, S. and Simmons, R. G. (1996).
The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms.
Machine Learning, 22:228-250.

Levin, 1973
Levin, L. A. (1973).
Universal sequential search problems.
Problems of Information Transmission, 9(3):265-266.

Lin, 1993
Lin, L. (1993).
Reinforcement Learning for Robots Using Neural Networks.
PhD thesis, Carnegie Mellon University, Pittsburgh.

Littman, 1994
Littman, M. (1994).
Memoryless policies: Theoretical limitations and practical results.
In Cliff, D., Husbands, P., Meyer, J.-A., and Wilson, S. W., editors, From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior, pages 297-305. MIT Press/Bradford Books.

Littman, 1996
Littman, M. (1996).
Algorithms for Sequential Decision Making.
PhD thesis, Brown University.

Littman et al., 1995
Littman, M., Cassandra, A., and Kaelbling, L. (1995).
Learning policies for partially observable environments: Scaling up.
In Prieditis, A. and Russell, S., editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 362-370. Morgan Kaufmann Publishers, San Francisco, CA.

McCallum, 1993
McCallum, R. A. (1993).
Overcoming incomplete perception with utile distinction memory.
In Machine Learning: Proceedings of the Tenth International Conference. Morgan Kaufmann, Amherst, MA.

McCallum, 1996
McCallum, R. A. (1996).
Learning to use selective attention and short-term memory in sequential tasks.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge, MA, pages 315-324. MIT Press, Bradford Books.

Moore and Atkeson, 1993
Moore, A. and Atkeson, C. G. (1993).
Prioritized sweeping: Reinforcement learning with less data and less time.
Machine Learning, 13:103-130.

Zhang and Liu, 1996
Zhang, N. L. and Liu, W. (1996).
Planning in stochastic domains: Problem characteristics and approximation.
Technical Report HKUST-CS96-31, The Hong Kong University of Science and Technology.

Nguyen and Widrow, 1989
Nguyen, D. and Widrow, B. (1989).
The truck backer-upper: An example of self learning in neural networks.
In Proceedings of the International Joint Conference on Neural Networks, pages 357-363. IEEE Press.

Parr and Russell, 1995
Parr, R. and Russell, S. (1995).
Approximating optimal policies for partially observable stochastic domains.
In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1088-1094. Morgan Kaufmann.

Peng and Williams, 1996
Peng, J. and Williams, R. (1996).
Incremental multi-step Q-learning.
Machine Learning, 22:283-290.

Ring, 1994
Ring, M. B. (1994).
Continual Learning in Reinforcement Environments.
PhD thesis, University of Texas at Austin.

Ron et al., 1994
Ron, D., Singer, Y., and Tishby, N. (1994).
Learning probabilistic automata with variable memory length.
In Aleksander, I. and Taylor, J., editors, Proceedings of Computational Learning Theory. ACM Press.

Schmidhuber, 1991a
Schmidhuber, J. (1991a).
Curious model-building control systems.
In Proceedings of the International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE press.

Schmidhuber, 1991b
Schmidhuber, J. (1991b).
Learning to generate sub-goals for action sequences.
In Kohonen, T., Mäkisara, K., Simula, O., and Kangas, J., editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland.

Schmidhuber, 1991c
Schmidhuber, J. (1991c).
Reinforcement learning in Markovian and non-Markovian environments.
In Lippmann, R. P., Moody, J. E., and Touretzky, D. S., editors, Advances in Neural Information Processing Systems 3, pages 500-506. Morgan Kaufmann.

Schmidhuber, 1992
Schmidhuber, J. (1992).
Learning complex, extended sequences using the principle of history compression.
Neural Computation, 4(2):234-242.

Schmidhuber, 1997
Schmidhuber, J. (1997).
What's interesting?
Technical Report IDSIA-35-97, IDSIA.
ftp://ftp.idsia.ch/pub/juergen/interest.ps.gz; extended abstract in Proc. Snowbird'98, Utah, 1998.

Schmidhuber et al., 1997a
Schmidhuber, J., Zhao, J., and Schraudolph, N. (1997a).
Reinforcement learning with self-modifying policies.
In Thrun, S. and Pratt, L., editors, Learning to learn, pages 293-309. Kluwer.

Schmidhuber et al., 1997b
Schmidhuber, J., Zhao, J., and Wiering, M. (1997b).
Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement.
Machine Learning, 28:105-130.

Singh, 1992
Singh, S. (1992).
The efficient learning of multiple task sequences.
In Moody, J. E., Hanson, S. J., and Lippmann, R. P., editors, Advances in Neural Information Processing Systems 4, pages 251-258, San Mateo, CA. Morgan Kaufmann.

Sondik, 1971
Sondik, E. J. (1971).
The Optimal Control of Partially Observable Markov Decision Processes.
PhD thesis, Stanford University, Stanford, California.

Storck et al., 1995
Storck, J., Hochreiter, S., and Schmidhuber, J. (1995).
Reinforcement driven information acquisition in non-deterministic environments.
In Proceedings of the International Conference on Artificial Neural Networks, Paris, volume 2, pages 159-164. EC2 & Cie.

Sutton, 1988
Sutton, R. S. (1988).
Learning to predict by the methods of temporal differences.
Machine Learning, 3:9-44.

Sutton, 1995
Sutton, R. S. (1995).
TD models: Modeling the world at a mixture of time scales.
In Prieditis, A. and Russell, S., editors, Machine Learning: Proceedings of the Twelfth International Conference, pages 531-539. Morgan Kaufmann Publishers, San Francisco, CA.

Teller, 1994
Teller, A. (1994).
The evolution of mental models.
In Kinnear, Jr., K. E., editor, Advances in Genetic Programming, pages 199-219. MIT Press.

Tham, 1995
Tham, C. (1995).
Reinforcement learning of multiple tasks using a hierarchical CMAC architecture.
Robotics and Autonomous Systems, 15(4):247-274.

Thrun, 1992
Thrun, S. (1992).
Efficient exploration in reinforcement learning.
Technical Report CMU-CS-92-102, Carnegie Mellon University.

Watkins, 1989
Watkins, C. (1989).
Learning from Delayed Rewards.
PhD thesis, King's College, Cambridge.

Watkins and Dayan, 1992
Watkins, C. J. C. H. and Dayan, P. (1992).
Q-learning.
Machine Learning, 8:279-292.

Whitehead, 1992
Whitehead, S. (1992).
Reinforcement Learning for the adaptive control of perception and action.
PhD thesis, University of Rochester.

Wiering and Schmidhuber, 1996
Wiering, M. and Schmidhuber, J. (1996).
Solving POMDPs with Levin search and EIRA.
In Saitta, L., editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 534-542. Morgan Kaufmann Publishers, San Francisco, CA.

Wiering and Schmidhuber, 1997
Wiering, M. A. and Schmidhuber, J. (1997).
Fast online Q($\lambda$).
Technical Report IDSIA-21-97, IDSIA.
In preparation.

Wilson, 1994
Wilson, S. (1994).
ZCS: A zeroth level classifier system.
Evolutionary Computation, 2:1-18.

Wilson, 1995
Wilson, S. (1995).
Classifier fitness based on accuracy.
Evolutionary Computation, 3(2):149-175.

Wilson, 1996
Wilson, S. (1996).
Generalization in XCS.
In ICML'96 Workshop on Evolutionary Computing and Machine Learning. Morgan Kaufmann Publishers, San Francisco, CA.

Zhao and Schmidhuber, 1996
Zhao, J. and Schmidhuber, J. (1996).
Incremental self-improvement for life-time multi-agent reinforcement learning.
In Maes, P., Mataric, M., Meyer, J.-A., Pollack, J., and Wilson, S. W., editors, From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, Cambridge, MA, pages 516-525. MIT Press, Bradford Books.

Figure 1: Basic architecture. Three agents are connected sequentially. Each agent has a Q-table, an HQ-table, and a control transfer unit, except for the last agent, which has only a Q-table. The Q-table stores estimated observation/action values and is used to select the next action. The HQ-table stores estimated subgoal values and is used to generate a subgoal once the agent becomes active. The solid box indicates that the second agent is the currently active one. Once an agent has achieved its subgoal, its control transfer unit passes control to its successor.
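To make the structure in Figure 1 concrete, here is a minimal Python sketch of the agent chain. It is an illustration under stated assumptions, not the authors' implementation: the names (HQAgent, q_table, hq_table, choose_subgoal, subgoal_reached) and the epsilon-greedy selection are hypothetical choices made only for this example.

import numpy as np

class HQAgent:
    """One agent in the sequential chain of Figure 1 (illustrative sketch only)."""

    def __init__(self, n_observations, n_actions, is_last=False):
        # Q-table: estimated value of each action given the current observation.
        self.q_table = np.zeros((n_observations, n_actions))
        if not is_last:
            # HQ-table: estimated value of each candidate subgoal (an observation
            # this agent tries to reach before handing over control).
            self.hq_table = np.zeros(n_observations)
        self.is_last = is_last
        self.subgoal = None

    def choose_subgoal(self, epsilon=0.1):
        # Called once when the agent becomes active; the last agent has no subgoal.
        if self.is_last:
            return None
        if np.random.rand() < epsilon:
            self.subgoal = int(np.random.randint(len(self.hq_table)))
        else:
            self.subgoal = int(np.argmax(self.hq_table))
        return self.subgoal

    def choose_action(self, observation, epsilon=0.1):
        # Epsilon-greedy action selection from the Q-table.
        if np.random.rand() < epsilon:
            return int(np.random.randint(self.q_table.shape[1]))
        return int(np.argmax(self.q_table[observation]))

    def subgoal_reached(self, observation):
        # Control transfer test: pass control to the successor once the
        # current observation matches this agent's subgoal.
        return (not self.is_last) and observation == self.subgoal

# Three agents connected sequentially; only the last one lacks an HQ-table.
agents = [HQAgent(16, 4), HQAgent(16, 4), HQAgent(16, 4, is_last=True)]

An outer loop would activate the first agent, let it pick a subgoal and act until subgoal_reached fires, then transfer control to its successor, exactly as described in the caption.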

Figure 2: A partially observable maze (POM). The task is to find a path leading from start S to goal G. The optimal solution requires 28 steps and at least three reactive agents. The figure shows a possible sub-optimal solution that costs 30 steps. Asterisks mark appropriate subgoals.

Figure 3: Left: HQ-learning results for the partially observable maze with 3, 4, 6, 8, and 12 agents. We plot average test run length against trial number (means of 100 simulations). The system almost always converges to near-optimal solutions. Using more than the required 3 agents tends to improve performance. Right: results for 4, 8, and 12 agents whose actions are corrupted by 10% noise. In most cases they still find the goal, although noisy actions decrease performance.




Table 1: HQ-learning results with the selected actions replaced by random actions with probability 0% and 10%. All table entries refer to the final test trial. The second column lists average trial lengths, the third goal hit percentages, the fourth average path lengths of solutions, and the fifth the percentage of simulations in which the optimal path was found.
System                  Av. steps   Goal (%)   Av. sol.   Optimal (%)
3 agents                      263         76         30             3
4 agents                       60         97         31             6
6 agents                       31        100         31            14
8 agents                       31        100         31            12
12 agents                      32        100         32             6
4 agents (10% noise)          177         86         43             2
8 agents (10% noise)          166         87         41             2
12 agents (10% noise)         196         84         43             0
Random                        912         19        537             0

Figure 4: Subgoal combinations (SCs) generated by a 3-agent system, sampled at intervals of 10 trials. Initially many different SCs are tried out. After 10,000 trials HQ explores fewer and fewer SCs until it finally converges to SC (3, 12).

Figure 5: A partially observable maze containing a key K and a door (grey area). Starting at $S$, the system first has to find the key to open the door, then proceed to the goal $G$. The shortest path costs 83 steps. This optimal solution requires at least three reactive agents. The number of possible world states is $960$.

Figure 6: Left: HQ-learning results for the ``key and door'' problem. We plot average test run length against trial number (means of 100 simulations). Within 20,000 trials, systems with 3 (4, 6, and 8) agents find good deterministic policies in 85% (96%, 96%, and 99%) of the simulations. Right: HQ-learning results for an 8-agent system whose actions are replaced by random actions with probability 5%, 10%, and 25%.




Table 2: Results of 100 HQ-learning simulations for the ``key and door'' task. All table entries refer to the final test trial. The second column lists average trial lengths. The third lists goal hit percentages. The fourth lists average path lengths of solutions. The fifth lists percentages of simulations during which the optimal 83 step path was found. HQ-learning could solve the task with a limit of 1000 steps per trial. Random search needed a 10,000 step limit.
System                  Av. steps   Goal (%)   Av. sol.   Optimal (%)
3 agents                      224         85         87             8
4 agents                      126         96         90             9
6 agents                      127         96         91             8
8 agents                      101         99         92             6
8 agents (5% noise)           360         92        304             0
8 agents (10% noise)          399         90        332             0
8 agents (25% noise)          442         84        336             0
Random                      *9310         19       6370             0

