2010: First Journal Paper on Deep Reinforcement Learning with Policy Gradients for Long Short-Term Memory (LSTM)
Abstract. In 2020, we are celebrating the 10-year anniversary of the first journal paper [RPG] on Deep Reinforcement Learning (RL) with Policy Gradients for LSTM, based on work of 2007 [RPG07]. In the 2010s, this became the method of choice for training RL agents in complex environments. Impressive recent applications include DeepMind's StarCraft video game player and OpenAI's Dota player & dextrous robot hand.
The Long Short-Term Memory (LSTM) recurrent neural network [LSTM1-6] overcomes the Fundamental Deep Learning Problem [VAN1]. See Sec. 3 & Sec. 4 of [MIR] and Sec. XVII of [T20]. In 2000, LSTM learned to solve previously unsolvable language learning tasks [LSTM13]. In 2007, CTC-trained LSTM [CTC] learned to recognize speech [LSTM4] [LSTM14]. By the 2010s, when compute had become cheap enough, CTC-LSTM dramatically improved speech recognition [GSR] [GSR15] on billions of smartphones and other computers [DL4]. LSTM also revolutionised machine translation [S2S] [GT16] [WU] [FB17] and many other fields. Compare Sec. A & B & XVII of [T20].
However, these applications were about Supervised Learning where the network learns to imitate a teacher on a training set. But since 2002 we have also used LSTM for Reinforcement Learning (RL) without a teacher [LSTM-RL]. (And for Neuroevolution [LSTM12].)
In particular, my former PhD student Daan Wierstra and my PostDoc Alexander Förster and our collaborator Jan Peters applied Policy Gradient (PG) methods [PG] to RL LSTM. The first conference publication on this came out in 2007 [RPG07], the first journal publication in 2010 [RPG]. (Daan later became employee number 1 of DeepMind, the company co-founded by his friend Shane Legg, another PhD student from my lab—Shane and Daan were the first persons at DeepMind with AI publications and PhDs in computer science.)
During this time, my team with Frank Sehnke & Thomas Rückstiess & Christian Osendorfer & Alex Graves & Martin Felder & Tom Schaul & Sun Yi & Mandy Grüttner also published lots of additional work on this type of Direct Policy Search [DS], e.g., [SDE] [EPRL]. A particularly successful method was called Policy Gradients with Parameter-based Exploration (PGPE) [PGPE08] [PGPE]. Already in 2010, we used PGPE to train multi-dimensional LSTM on Atari-Go [ATA].
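For readers who have not seen PGPE before, the following is a minimal sketch of the basic idea, not the code used in [PGPE08] [PGPE]. Instead of adding noise to the actions, it keeps a Gaussian search distribution (mean mu, standard deviation sigma) over the policy parameters, evaluates a pair of mirrored parameter samples per step, and moves mu and sigma along a reward-weighted estimate of the gradient. Everything concrete here is an assumption made for illustration: the names pgpe_maximize and toy_reward, the toy quadratic reward that stands in for an episode return, the learning rates, the moving-average baseline, and the clamping of sigma. In a real application the parameter vector would hold the weights of the controller (e.g., an LSTM) and reward_fn would run one episode in the environment.

```python
import random

TARGET = [3.0, -2.0]  # hypothetical optimum of the toy problem


def toy_reward(theta):
    """Stand-in for an episode return: highest (zero) when theta equals TARGET."""
    return -sum((t - g) ** 2 for t, g in zip(theta, TARGET))


def pgpe_maximize(reward_fn, dim, iterations=2000, alpha_mu=0.05,
                  alpha_sigma=0.005, sigma_init=1.0, sigma_min=0.01,
                  sigma_max=2.0, seed=0):
    """Sketch of PGPE with symmetric sampling; all defaults are illustrative."""
    rng = random.Random(seed)
    mu = [0.0] * dim              # mean of the parameter distribution
    sigma = [sigma_init] * dim    # per-parameter exploration width
    baseline = None               # running estimate of the typical reward

    for _ in range(iterations):
        # Draw one perturbation and evaluate both mirrored parameter vectors.
        eps = [rng.gauss(0.0, s) for s in sigma]
        r_plus = reward_fn([m + e for m, e in zip(mu, eps)])
        r_minus = reward_fn([m - e for m, e in zip(mu, eps)])
        r_diff = 0.5 * (r_plus - r_minus)   # drives the update of mu
        r_mean = 0.5 * (r_plus + r_minus)   # drives the update of sigma
        if baseline is None:
            baseline = r_mean

        for i in range(dim):
            # Move the mean towards the better of the two mirrored samples.
            mu[i] += alpha_mu * r_diff * eps[i]
            # Widen sigma if unusually large perturbations paid off, else shrink.
            step = (r_mean - baseline) * (eps[i] ** 2 - sigma[i] ** 2) / sigma[i]
            new_sigma = sigma[i] + alpha_sigma * step
            # Clamping is a safeguard added for this sketch.
            sigma[i] = min(sigma_max, max(sigma_min, new_sigma))

        # Moving-average baseline (an assumption; other baselines are possible).
        baseline = 0.9 * baseline + 0.1 * r_mean

    return mu, sigma


if __name__ == "__main__":
    mu, sigma = pgpe_maximize(toy_reward, dim=2)
    print("mean parameters:", mu)
    print("exploration widths:", sigma)
    print("reward at mean:", toy_reward(mu))
```

On this toy problem the mean should drift from the origin towards TARGET while sigma shrinks. The point of the sketch is only that each step needs nothing but returns of whole episodes for sampled parameter vectors, so no gradient has to be propagated through the environment.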
Policy Gradients for LSTM have become important. In 2018, a PG-trained [PPO17] LSTM was the core of OpenAI's impressive Dactyl which learned to control a dextrous robot hand without a teacher [OAI1] [OAI1a]. A PG-trained LSTM (with 84% of the model's total parameter count) also was the core of the famous OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018) [OAI2]. Bill Gates called this a "huge milestone in advancing artificial intelligence" [OAI2a].
In 2019, DeepMind beat a pro player in the game of StarCraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar, whose brain also has a PG-trained deep LSTM core [DM3]. See Sec. 4 of [MIR] and Sec. C of [T20].
Our company NNAISENSE also sometimes uses variants and extensions of PG-trained LSTMs to control complex industrial processes in the physical world.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634. PDF.
[RPG07] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs with Recurrent Policy Gradients. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.
[PGPE08] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, and J. Schmidhuber. Policy gradients with parameter-based exploration for control. In J. Koutnik, V. Kurkova, R. Neruda, editors, Proceedings of the International Conference on Artificial Neural Networks (ICANN 2008), Prague, LNCS 5163, pages 387-396. Springer-Verlag Berlin Heidelberg, 2008. PDF.
[PGPE] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks 23(2), 2010. PDF.
[SDE] T. Rückstiess, M. Felder, J. Schmidhuber. State-Dependent Exploration for Policy Gradient Methods. 19th European Conference on Machine Learning ECML, 2008. PDF.
[EPRL] T. Rückstiess, F. Sehnke, T. Schaul, D. Wierstra, S. Yi, J. Schmidhuber. Exploring Parameter Space in Reinforcement Learning. Paladyn Journal of Behavioral Robotics, 2010. PDF.
[ATA] M. Grüttner, F. Sehnke, T. Schaul, J. Schmidhuber. Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients. Proceedings of the International Conference on Artificial Neural Networks (ICANN-2010), Greece, 2010. PDF.
[DS] J. Schmidhuber. Sequential decision making based on direct search. In R. Sun and C. L. Giles, eds., Sequence Learning: Paradigms, Algorithms, and Applications. Lecture Notes on AI 1828, p. 203-240, Springer, 2001. PDF.
[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117. More.
[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.
[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.
[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.
[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.
[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF.
[LSTM10] A. Graves, D. Eck and N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.
[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.
[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-1340, 2001. PDF.
[LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk). PDF.
[LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.
[GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team. Google voice search: faster and more accurate. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM.
[GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016.
[LSTM-RL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.
[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p. 529-533, 26 Feb. 2015. Link.
[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019. [AlphaStar has a "deep LSTM core."]
[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.
[OAI1a] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. arxiv:1808.00177 (PDF).
[OAI2] OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arxiv:1912.06680. [An LSTM composes 84% of the model's total parameter count.]
[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. [An LSTM was the core of OpenAI Five.]
[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award. Link.
[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.