## 2010: First Journal Paper on Deep Reinforcement Learning with Policy Gradients for Long Short-Term Memory (LSTM)
The Long Short-Term Memory (LSTM) recurrent neural network [LSTM1-6] overcomes the Fundamental Deep Learning Problem [VAN1]. See Sec. 3 & Sec. 4 of [MIR] and Sec. XVII of [T20][T22]. In 2000, LSTM learned to solve previously unsolvable language learning tasks [LSTM13]. In 2007, CTC-trained LSTM [CTC] learned to recognize speech [LSTM4][LSTM14]. By the 2010s, when compute had become cheap enough, CTC-LSTM dramatically improved speech recognition [GSR][GSR15] on billions of smartphones and other computers [DL4]. LSTM also revolutionized machine translation [S2S][GT16][WU][FB17] and many other fields [DEC][MOST]. Compare Sec. A & B & XVII of [T22].
However, these applications were about supervised learning rather than reinforcement learning (RL). In particular, my former PhD student Daan Wierstra was first author of our 2010 paper on recurrent policy gradients [RPG], the first journal publication on deep RL with policy gradients (PG) for LSTM (see also the 2007 conference paper [RPG07]). During this time, my team also published closely related work on policy gradient methods with parameter-based exploration [PGPE08][PGPE][SDE][EPRL][ATA].
Policy Gradients for LSTM have since become important.
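The flavor of these methods can be sketched in a few lines. Below is a minimal, illustrative sketch in the spirit of parameter-exploring policy gradients [PGPE]: instead of perturbing the actions of a recurrent policy, whole parameter vectors are perturbed, complete episodes are evaluated, and the mean parameters follow the return-weighted perturbations. Note the simplifications: the "episode return" is a toy stand-in (a quadratic reward around a hypothetical optimum), not a real control task or an LSTM controller, and the real algorithm also adapts the per-parameter exploration variances, which is omitted here.

```python
import numpy as np

def episode_return(theta, target):
    """Toy stand-in for an episodic return: larger when parameters are near the optimum."""
    return -np.sum((theta - target) ** 2)

def pgpe(n_params=5, iterations=200, pop=20, sigma=0.5, lr=0.1, seed=0):
    """Simplified PGPE: perturb parameters, evaluate episodes, follow return-weighted noise."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=n_params)   # hypothetical unknown optimum of the toy task
    theta = np.zeros(n_params)           # mean of the parameter search distribution
    baseline = 0.0                       # moving-average baseline for variance reduction
    for _ in range(iterations):
        eps = rng.normal(scale=sigma, size=(pop, n_params))  # parameter-space exploration noise
        returns = np.array([episode_return(theta + e, target) for e in eps])
        baseline = 0.9 * baseline + 0.1 * returns.mean()
        # Return-weighted average of the perturbations approximates the gradient
        # of the expected return w.r.t. the mean parameters.
        grad = ((returns - baseline)[:, None] * eps).mean(axis=0) / sigma**2
        theta = theta + lr * grad
    return theta, target

theta, target = pgpe()
```

Because exploration happens in parameter space, one noise draw is held fixed for a whole episode, which is what makes this style of PG well suited to recurrent controllers such as LSTM, whose behavior unfolds over many time steps.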
In 2018, a PG-trained [PPO17] LSTM was the core of OpenAI's impressive Dactyl, which learned to control a dexterous robot hand without a teacher [OAI1][OAI1a].
A PG-trained LSTM (accounting for 84% of the model's total parameter count) was also the core of the famous OpenAI Five, which learned to defeat human experts in the Dota 2 video game in 2018 [OAI2].
Bill Gates called this "a huge milestone in advancing artificial intelligence" [OAI2a].

In 2019, DeepMind beat a pro player in the game of Starcraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar, whose brain also has a PG-trained deep LSTM core [DM3]. See Sec. 4 of [MIR] and Sec. C of [T22]. Our company NNAISENSE also sometimes uses variants and extensions of PG-trained LSTMs to control complex industrial processes in the physical world.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## References

[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634. PDF.
[RPG07] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs with Recurrent Policy Gradients. Proc. Intl. Conf. on Artificial Neural Networks (ICANN), 2007.
[PGPE08] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, J. Schmidhuber. Policy gradients with parameter-based exploration for control. In V. Kurkova, R. Neruda, J. Koutnik, editors, Proceedings of the Intl. Conf. on Artificial Neural Networks (ICANN), 2008.
[PGPE] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.
[SDE] T. Rückstiess, M. Felder, J. Schmidhuber. State-Dependent Exploration for Policy Gradient Methods. Proc. ECML PKDD 2008.
[EPRL] T. Rückstiess, F. Sehnke, T. Schaul, D. Wierstra, S. Yi, J. Schmidhuber. Exploring Parameter Space in Reinforcement Learning.
[ATA] M. Grüttner, F. Sehnke, T. Schaul, J. Schmidhuber. Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients.
[PG] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[DS] J. Schmidhuber. Sequential decision making based on direct search. In R. Sun and C. L. Giles, eds., Sequence Learning: Paradigms, Algorithms, and Applications. Lecture Notes on AI 1828, p. 203-240, Springer, 2001. PDF.

[PPO17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347 [cs.LG], 2017.

[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61:85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF.
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.
[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Proc. Intl. Conf. on Artificial Neural Networks (ICANN), 2007.
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems (NIPS) 21, p. 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF.

[LSTM10] A. Graves, D. Eck, N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.

[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D.C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks, 12(6):1333-1340, 2001. PDF.

[LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk). PDF.
[LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk (Google Speech Team). Google voice search: faster and more accurate. Google Research Blog, Sep 2015; see also the Aug 2015 post on Google's speech recognition based on CTC and LSTM.

[GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015.

[GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.

[S2S] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014, p. 3104-3112.

[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016.

[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016.

[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also the Facebook blog by J. M. Pino, A. Sidorov, N. F. Ayan (August 3, 2017).

[LSTM-RL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.

[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p. 529-533, 26 Feb. 2015. Link.
[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019.
[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.

[OAI1a] OpenAI: M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).
[OAI2] OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arXiv:1912.06680.
[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018.

[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award. Link.

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs.