## 2010: First Journal Paper on Deep Reinforcement Learning with Policy Gradients for Long Short-Term Memory (LSTM)
The Long Short-Term Memory (LSTM) recurrent neural network [LSTM1-6] overcomes the Fundamental Deep Learning Problem [VAN1]. See Sec. 3 & Sec. 4 of [MIR] and Sec. XVII of [T20][T22]. In 2000, LSTM learned to solve previously unsolvable language learning tasks [LSTM13]. In 2007, CTC-trained LSTM [CTC] learned to recognize speech [LSTM4][LSTM14]. By the 2010s, when compute had become cheap enough, CTC-LSTM dramatically improved speech recognition [GSR][GSR15] on billions of smartphones and other computers [DL4]. LSTM also revolutionized machine translation [S2S][GT16][WU][FB17] and many other fields [DEC][MOST]. Compare Sec. A & B & XVII of [T22].
However, these applications were about supervised learning rather than reinforcement learning (RL). In particular, my former PhD student Daan Wierstra was first author of our 2010 paper on recurrent policy gradients [RPG], the first journal publication on deep RL with policy gradients (PG) for LSTM (see also the 2007 conference paper [RPG07]). During this time, my team also published closely related work on policy gradient methods with parameter-based exploration [PGPE08][PGPE][SDE][EPRL][ATA].
Policy Gradients for LSTM have since become important.
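The flavor of these methods can be sketched in a few lines. Below is a minimal, illustrative sketch in the spirit of parameter-exploring policy gradients [PGPE]: instead of perturbing the actions of a recurrent policy, whole parameter vectors are perturbed, complete episodes are evaluated, and the mean parameters follow the return-weighted perturbations. Note the simplifications: the "episode return" is a toy stand-in (a quadratic reward around a hypothetical optimum), not a real control task or an LSTM controller, and the real algorithm also adapts the per-parameter exploration variances, which is omitted here.

```python
import numpy as np

def episode_return(theta, target):
    """Toy stand-in for an episodic return: larger when parameters are near the optimum."""
    return -np.sum((theta - target) ** 2)

def pgpe(n_params=5, iterations=200, pop=20, sigma=0.5, lr=0.1, seed=0):
    """Simplified PGPE: perturb parameters, evaluate episodes, follow return-weighted noise."""
    rng = np.random.default_rng(seed)
    target = rng.normal(size=n_params)   # hypothetical unknown optimum of the toy task
    theta = np.zeros(n_params)           # mean of the parameter search distribution
    baseline = 0.0                       # moving-average baseline for variance reduction
    for _ in range(iterations):
        eps = rng.normal(scale=sigma, size=(pop, n_params))  # parameter-space exploration noise
        returns = np.array([episode_return(theta + e, target) for e in eps])
        baseline = 0.9 * baseline + 0.1 * returns.mean()
        # Return-weighted average of the perturbations approximates the gradient
        # of the expected return w.r.t. the mean parameters.
        grad = ((returns - baseline)[:, None] * eps).mean(axis=0) / sigma**2
        theta = theta + lr * grad
    return theta, target

theta, target = pgpe()
```

Because exploration happens in parameter space, one noise draw is held fixed for a whole episode, which is what makes this style of PG well suited to recurrent controllers such as LSTM, whose behavior unfolds over many time steps.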
In 2018, a PG-trained [PPO17] LSTM was the core of OpenAI's impressive Dactyl, which learned to control a dexterous robot hand without a teacher [OAI1][OAI1a].
A PG-trained LSTM (accounting for 84% of the model's total parameter count) was also the core of the famous OpenAI Five, which learned to defeat human experts in the Dota 2 video game in 2018 [OAI2].
Bill Gates called this "a huge milestone in advancing artificial intelligence" [OAI2a].

In 2019, DeepMind beat a pro player in the game of Starcraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar, whose brain also has a PG-trained deep LSTM core [DM3]. See Sec. 4 of [MIR] and Sec. C of [T22]. Our company NNAISENSE also sometimes uses variants and extensions of PG-trained LSTMs to control complex industrial processes in the physical world.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## References

[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634. PDF.
[RPG07] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs with Recurrent Policy Gradients. Proc. Intl. Conf. on Artificial Neural Networks (ICANN), 2007.
[PGPE08] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, J. Schmidhuber. Policy gradients with parameter-based exploration for control. In V. Kurkova, R. Neruda, J. Koutnik, editors, Proceedings of the Intl. Conf. on Artificial Neural Networks (ICANN), 2008.
[PGPE] F. Sehnke, C. Osendorfer, T. Rückstiess, A. Graves, J. Peters, J. Schmidhuber. Parameter-exploring policy gradients. Neural Networks, 23(4):551-559, 2010.
[SDE] T. Rückstiess, M. Felder, J. Schmidhuber. State-Dependent Exploration for Policy Gradient Methods. Proc. ECML PKDD 2008.
[EPRL] T. Rückstiess, F. Sehnke, T. Schaul, D. Wierstra, S. Yi, J. Schmidhuber. Exploring Parameter Space in Reinforcement Learning.
[ATA] M. Grüttner, F. Sehnke, T. Schaul, J. Schmidhuber. Multi-Dimensional Deep Memory Atari-Go Players for Parameter Exploring Policy Gradients.
[PG] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[DS] J. Schmidhuber. Sequential decision making based on direct search. In R. Sun and C. L. Giles, eds., Sequence Learning: Paradigms, Algorithms, and Applications. Lecture Notes on AI 1828, p. 203-240, Springer, 2001. PDF.

[PPO17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov. Proximal Policy Optimization Algorithms. Preprint arXiv:1707.06347 [cs.LG], 2017.

[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61:85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF.
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.
[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Proc. Intl. Conf. on Artificial Neural Networks (ICANN), 2007.
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems (NIPS) 21, p. 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF.

[LSTM10] A. Graves, D. Eck, N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.

[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D.C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks, 12(6):1333-1340, 2001. PDF.

[LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk). PDF.
[LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk (Google Speech Team). Google voice search: faster and more accurate. Google Research Blog, Sep 2015; see also the Aug 2015 post on Google's speech recognition based on CTC and LSTM.

[GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015.

[GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019.

[S2S] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems (NIPS), 2014, p. 3104-3112.

[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016.

[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016.

[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also the Facebook blog by J. M. Pino, A. Sidorov, N. F. Ayan (August 3, 2017).

[LSTM-RL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.

[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p. 529-533, 26 Feb. 2015. Link.
[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019.
[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.

[OAI1a] OpenAI: M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).
[OAI2] OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arXiv:1912.06680.
[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018.

[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award. Link.

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs.