**Abstract.**
Modern Artificial Intelligence is dominated by artificial neural networks (NNs) and deep learning.^{[DL1-4]} Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I discuss: **(1)** Long Short-Term Memory^{[LSTM0-17]} (LSTM),
the most cited NN of the 20th century, **(2)** ResNet, the most cited NN of the 21st century (which is an open-gated version of our earlier Highway Net:^{[HW1-3]} the first working really deep feedforward NN), **(3)** AlexNet and VGG Net, the 2nd and 3rd most cited NNs of the 21st century (both building on our similar earlier DanNet:^{[GPUCNN1-9]} the first deep convolutional NN^{[CNN1-4]} to win
image recognition competitions),
**(4)** Generative Adversarial Networks^{[GAN0-1]} (an instance of my earlier
Adversarial Artificial Curiosity^{[AC90-20]}), and **(5)** variants of Transformers (linear Transformers are formally equivalent to my earlier Fast Weight Programmers).^{[TR1-6][FWP0-1,6]}
Most of this started with our
Annus Mirabilis of 1990-1991^{[MIR]} when compute was a million times more expensive than today.

**(1) LSTM**
According to Google Scholar, the most cited NN paper of the 20th century is our 1997 journal publication on Long Short-Term Memory (LSTM).^{[LSTM1]} LSTMs are now permeating the modern world, with innumerable applications
including in healthcare,^{[DEC]} learning robots,^{[LSTM-RL][LSTMPG][OAI1,1a]}
game playing,^{[LSTMPG][OAI2,2a][DM3]}
speech processing,^{[AM16][GSR][GSR15-19]}
and machine translation.^{[GT16][WU][FB17][DEC]} They are used billions of times a day by countless people.^{[DL4]} This led Bloomberg to say LSTM is "arguably the most commercial AI achievement."^{[AV1][DL4][MIR](Sec. 4)} LSTMs as we know them today go beyond earlier work^{[MOZ]} and were
made possible through my students
Sepp Hochreiter, Felix Gers, Alex Graves, Daan Wierstra, and others.^{[LSTM0-17,PG]}

**(2) Highway Net to ResNet**
The most cited NN paper of the 21st century introduced the name "ResNet" (Dec 2015).^{[HW2]} It cites our earlier Highway Net (May 2015) of which ResNet is a version.^{[HW1-3][R5]} Highway Nets were the first working feedforward NNs with over 100 layers (previous NNs had at most a few tens of layers). ResNets are, in fact, Highway Nets whose gates are initialized such that they always remain open.^{[HW1-3]}
Highway Nets
showed how very deep NNs with skip connections work, and
perform roughly as well as ResNets on ImageNet.^{[HW3]}
Highway Nets were made possible through my students Rupesh Kumar Srivastava and Klaus Greff.
The USPTO granted a patent for this invention to NNAISENSE in 2021.
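
The relation between the two layer types can be made concrete with a small sketch. It follows the formula given in the note to [HW1], where each highway layer computes g(x)x + t(x)h(x), and checks that fixing both gates at 1 gives the residual form x + h(x). The transform `h`, the gate functions, and the input values are hypothetical toy choices for illustration only, not the trained components of either paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, h, g, t):
    """Elementwise y = g(x) * x + t(x) * h(x) for a list of floats x."""
    hx, gx, tx = h(x), g(x), t(x)
    return [gi * xi + ti * hi for xi, hi, gi, ti in zip(x, hx, gx, tx)]


def residual_layer(x, h):
    """y = x + h(x): the highway layer with both gates fixed at 1."""
    return [xi + hi for xi, hi in zip(x, h(x))]


# Hypothetical toy transform; a real layer would use trained weights.
def h(x):
    return [math.tanh(2.0 * xi) for xi in x]


def open_gate(x):
    # Gate that is always open: g(x) = t(x) = 1.
    return [1.0 for _ in x]


def toy_sigmoid_gate(x):
    # Made-up gate with a fixed bias of -1, standing in for a learned gate.
    return [sigmoid(xi - 1.0) for xi in x]


x = [0.5, -1.0, 2.0]
y_open = highway_layer(x, h, open_gate, open_gate)
y_res = residual_layer(x, h)
y_gated = highway_layer(x, h, toy_sigmoid_gate, toy_sigmoid_gate)

print(y_open == y_res)   # open gates reproduce the residual layer
print(y_gated)           # partially closed gates give a different output
```

With open gates the identity path x is always passed through unchanged, which is the skip connection both architectures rely on; with other gate values the layer can damp either the carried input or the new transform.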

Remarkably, the most cited NNs of the 20th and 21st centuries **(1 & 2)** are closely connected, because
the Highway Net is actually the *feedforward* NN version of our
recurrent
LSTM.^{[LSTM2]}
Deep learning
is all about NN depth.^{[DL1]}
LSTMs
brought essentially *unlimited* depth to supervised *recurrent* NNs; Highway Nets brought it to *feedforward* NNs.

**(3) DanNet to AlexNet to VGG Net**
The 2nd most cited NN paper of the 21st century describes AlexNet (2012),^{[GPUCNN4]} a convolutional NN^{[CNN1-4]} (CNN) similar to our earlier DanNet (2011)^{[GPUCNN1-3]} which had a temporary
monopoly on winning computer vision contests and
won 4 of them before AlexNet arrived on the scene.^{[GPUCNN5][R5-6]}
AlexNet cited DanNet but also used ReLUs (1973)^{[CMB]} and stochastic delta rule/dropout (1990)^{[Drop1-2]} without citation.^{[T20][T21](Sec. XIV)}
DanNet and AlexNet actually followed
our earlier work on
supervised deep NNs (2010)^{[MLP1-2]}
which abandoned the
unsupervised pre-training for deep NNs
introduced
by myself in 1991^{[UN][UN0-3]}—and later
championed by an AlexNet co-author.^{[UN4][VID1]}
The 21st century's 3rd most cited NN, the VGG network,^{[GPUCNN9]} is also similar to DanNet (also using its trick of
increasing NN depth through small convolution filters).
Other highly cited CNNs^{[RCNN1-3]}
further extended the work of 2011.
DanNet was made possible through my postdoc Dan Ciresan with the help of Ueli Meier and Jonathan Masci.^{[GPUCNN1-3,5-8]}

**(4) Curiosity to GANs**
Another highly cited NN paper of 2014
on Generative Adversarial Networks (GANs)^{[GAN1]}
describes a system similar to
my
adversarial NNs using Predictability Minimization for creating disentangled representations
(1991).^{[PM0-2][AC20][R2][MIR](Sec. 7)}
In fact, GANs are
a simple application
of my even earlier popular Adversarial
Curiosity Principle
from 1990^{[AC90-20][MIR](Sec. 5)} where
two dueling NNs (a generator and a predictor) are trying to maximize each other's loss in a minimax game.^{[AC](Sec. 1)}
GANs are an instance of this where the trials are constrained such that they remain very short, like in bandit problems.^{[AC20][AC][T20][T21](Sec. XVII)}
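
As a rough illustration of the minimax game, the sketch below evaluates the value function in the form used by the 2014 paper,^{[GAN1]} V = E[log D(x)] + E[log(1 - D(G(z)))], which one network tries to maximize and the other to minimize. The function name `gan_value` and the probability lists are hypothetical; no networks are trained here, the numbers merely stand in for the second network's outputs on real and generated samples.

```python
import math


def gan_value(d_real, d_fake):
    """V = mean(log D(x)) + mean(log(1 - D(G(z)))).

    d_real: probabilities assigned to real samples being real.
    d_fake: probabilities assigned to generated samples being real.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


# Made-up outputs: a network that separates real from generated data well ...
v_sharp = gan_value([0.9, 0.8], [0.1, 0.2])
# ... and one that has been fooled into answering 0.5 everywhere.
v_fooled = gan_value([0.5, 0.5], [0.5, 0.5])

print(v_sharp > v_fooled)  # True: what one player gains, the other loses
```

The zero-sum structure is the point: any change that raises V helps one network exactly as much as it hurts the other.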

**(5) Fast Weight Programmers to Transformers**
Recently, Transformers^{[TR1]} (2017) have been all the rage, e.g., generating human-sounding texts.^{[GPT3]}
It turns out that Transformers are an extension of my Fast Weight Programmers of 1991^{[FWP0-1,6]} which amount to being *linear* Transformers^{[TR5-6][FWP][FWP6]} (apart from normalization).
The "self-attention" in standard Transformers^{[TR1-4]} combines this with a projection and *softmax* (using
attention terminology like the one I introduced in 1993^{[ATT][FWP2][R4]}).
The linear Transformers of 1991^{[FWP0-1]} separated storage and control like in traditional computers,
but in an adaptive and fully neural way (rather than in a hybrid fashion^{[PDA1-2][DNC]}).
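
The claimed equivalence can be checked numerically on a toy example. The sketch below stores key/value pairs in a fast weight matrix through additive outer products and reads it out with a query, then compares the result with unnormalized linear attention, i.e. a sum of values weighted by key-query dot products. All function names, vectors, and dimensions are hypothetical, and the slow network that would generate keys, values, and queries is left out.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def fast_weight_readout(keys, values, query):
    """Write W += v k^T for every pair, then read out W q."""
    d_out, d_in = len(values[0]), len(keys[0])
    w = [[0.0] * d_in for _ in range(d_out)]
    for k, v in zip(keys, values):
        # Additive outer-product update of the fast weights.
        w = [[w[i][j] + v[i] * k[j] for j in range(d_in)] for i in range(d_out)]
    return [dot(row, query) for row in w]


def linear_attention_readout(keys, values, query):
    """Unnormalized linear attention: sum_i (k_i . q) v_i."""
    out = [0.0] * len(values[0])
    for k, v in zip(keys, values):
        score = dot(k, query)
        out = [o + score * vi for o, vi in zip(out, v)]
    return out


# Made-up keys, values, and query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [-1.0, 0.0, 1.0]]
query = [2.0, -1.0]

fast = fast_weight_readout(keys, values, query)
lin = linear_attention_readout(keys, values, query)
print(fast)  # [-3.0, -1.0, 1.0]
print(lin)   # [-3.0, -1.0, 1.0]
```

Both functions compute the same quantity because (sum_i v_i k_i^T) q = sum_i v_i (k_i . q); the fast weight view simply keeps the running sum in a fixed-size matrix instead of revisiting every stored pair.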

Some of the world's most valuable companies were deeply influenced by our contributions **(1-5)** above.^{[DL4][DEC]} The paper on
ResNet—the open-gated version of our
Highway Net^{[HW1-3]} **(2)**—was published by **Microsoft**, and its first author was hired by **Facebook**. Most of the
AlexNet/VGG Net authors^{[GPUCNN4,9]}—who built on our 2011
DanNet^{[GPUCNN1-3,5-8]}
**(3)**—went to **Google**. Google also
published the 2017 Transformers^{[TR1]} related to my
linear Transformers of 1991^{[FWP0-1,6]} **(5)**, and bought the company **DeepMind** co-founded by a student from my lab.^{[MIR]}
The second author of the DanNet papers^{[GPUCNN1-3]} **(3)**
and the first author of a
2014 paper on GANs^{[GAN1]} (an instance of my ancient Adversarial Curiosity^{[AC90-20]}) **(4)** were hired by **Apple**. All of these companies have also
made extensive use of our LSTM^{[DL4][DEC]} **(1)**.

## Concluding remarks

**Disclaimer.**
Of course, citation counts
are poor indicators of truly pioneering work. As I pointed out in
*Nature* (2011):
"like the less-than-worthless collateralized debt obligations that drove the 2008 financial bubble, citations are easy to print and inflate, providing an incentive for professors to maximize citation counts instead of scientific progress—witness how relatively unknown scientists can now collect more citations than the most influential founders of their fields."^{[NAT1]}

**Deep Learning History.**
As mentioned earlier,^{[MIR](Sec. 21)}
when only consulting surveys from the Anglosphere,
it is not always clear^{[DLC]}
that deep learning was first conceived outside of it. Deep learning was, in fact, born in 1965 in Ukraine (back then part of the USSR) with the first nets of arbitrary depth that really learned,^{[DEEP1-2][R8]} going beyond the "shallow learning" (linear regression) of Gauss and Legendre around 1800.^{[DL1]}
Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD1-2]} A few years later,
modern backpropagation^{[BP1-6][BPA-C]} (the reverse mode of automatic differentiation)
was published in Finland (1970).^{[BP1]} The basic deep convolutional NN architecture (now widely used) was invented in the 1970s in Japan^{[CNN1]} where NNs with convolutions were later also combined with "weight sharing" and backpropagation (1987).^{[CNN1a]} We are standing on the shoulders of these works and many others (see the 888 references in my
award-winning survey^{[DL1]} if you want to understand just how much we borrow from these).

Gradient-based unsupervised or self-supervised adversarial networks that duel each other in a minimax game originated in Munich^{[AC,AC90-20]} (also the birthplace of the first truly self-driving cars in the 1980s).
The principles of
linear Transformers,^{[FWP,FWP0]}
NN distillation,^{[UN][UN0-3]}
and the
fundamental problem of backpropagation-based Deep Learning^{[VAN1]} were also discovered in Munich (1991). So were the first "modern" Deep Learners to overcome this problem, through (1) unsupervised pre-training^{[UN-UN2]} (1991), and (2) Long Short-Term Memory.^{[LSTM0-7]}
LSTM was further developed in Switzerland, which is also home of
the first image recognition contest-winning
deep
GPU-based CNNs,^{[DAN][DAN1][GPUCNN3,5]}
the first
superhuman visual pattern recognition (2011),^{[GPUCNN3,5][DAN]}
and the first very deep, working feedforward NNs with hundreds of layers.^{[HW1-3]}

In the 2010s, all of this work was feverishly built on by an outstanding community of machine learning researchers, engineers, and practitioners to create amazing things that have impacted the lives of billions of people worldwide.^{[DL4][DEC]}

## Acknowledgments

Thanks for useful comments to Dylan Ashley, Kazuki Irie, Sjoerd van Steenkiste, Aleksandar Stanic, Cesare Alippi, Róbert Csordás, Sepp Hochreiter, Mike Mozer, Michael Bronstein, Christoph von der Malsburg, David Ha, and Stephen J. Hanson. Since science is about self-correction, let me know under *juergen@idsia.ch* if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

## References

[AC]
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. *Our artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.*

[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
*The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
where a generator NN is fighting a predictor NN in a minimax game
(more).*

[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, *Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats*, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
More.

[AC09]
J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.
PDF. (More on
artificial scientists and artists.)

[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). *IEEE Transactions on Autonomous Mental Development*, 2(3):230-247, 2010.
IEEE link.
PDF.

[AC20]
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p 58-66, 2020.
Preprint arXiv/1906.04493.

[AM16]
Blog of Werner Vogels, CTO of Amazon (Nov 2016):
Amazon's Alexa
*"takes advantage of bidirectional long short-term memory (LSTM) networks using a massive amount of data to train models that convert letters to sounds and predict the intonation contour. This technology enables high naturalness, consistent intonation, and accurate processing of texts."*

[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. *We had both hard attention*^{[ATT0-2]} (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.

[ATT0] J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.

[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990.
PDF.
More.

[ATT2]
J. Schmidhuber.
Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton,
editors, *Proc. of the 1990 Connectionist Models Summer School*, pages
52-61. San Mateo, CA: Morgan Kaufmann, 1990.
PS. (PDF.)

[AV1] A. Vance. Google, Amazon, and Facebook Owe Jürgen Schmidhuber a Fortune: This Man Is the Godfather the AI Community Wants to Forget. Business Week,
Bloomberg, May 15, 2018.

[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
*Precursor of modern backpropagation.*^{[BP1-4]}

[BPB]
A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
*See chapters 6-7 and FORTRAN code on pages 58-60.*
PDF.
See also BIT 16, 146-160, 1976.
Link.
*The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.*

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
*First application of backpropagation*^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).

[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}

[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BP6]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
*See Section 3.1 on using gradient descent for learning in multilayer networks.*

[CMB]
C. v. d. Malsburg (1973).
Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. *See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.*

[CNN1] K. Fukushima: Neural network model for a mechanism of pattern
recognition unaffected by shift in position—Neocognitron.
Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
*The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia.*

[CNN1+]
K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980).
Link.

[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. *First application of backpropagation*^{[BP1][BP2]} and weight-sharing
to a convolutional architecture.

[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.

[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
PDF.

[CNN3] Weng, J.,
Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. *A CNN whose downsampling layers use Max-Pooling
(which has become very popular) instead of Fukushima's
Spatial Averaging.*^{[CNN1]}

[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007

[DAN]
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. *Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.*^{[DAN1]}
Now everybody is using this approach.

[DAN1]
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
*At the IJCNN 2011 computer vision competition in Silicon Valley,
our artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor, and six times better than the best non-neural method.*

[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. *The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.*

[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. *First working Deep Learners with many layers, learning internal representations.*

[DEEP1a]
Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.

[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
*Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.*

[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.

[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444.
HTML. See [DLC].

[DLC] J. Schmidhuber, 2015. Critique of Paper [DL3] by "Deep Learning Conspiracy" (Nature 521 p 436). June 2015.
HTML.
*The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it).*

[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... *By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019
on-device speech recognition
(on the phone, not the server)
is still based on
LSTM.*

[DM3]
S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019.
*Alphastar has a "deep LSTM core."*

[DNC] Hybrid computing using a neural network with dynamic external memory.
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis.
Nature, 538:7626, p 471, 2016.

[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, Physica D, 42, 265-272.
*Dropout is a variation of the stochastic delta rule—compare preprint
arXiv:1808.03578, 2018.*

[Drop2]
N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-1032.

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie,
Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981.
*First paper on fast weights or dynamic links.*

[FASTa]
J. A. Feldman. Dynamic connections in neural networks.
Biological Cybernetics, 46(1):27-39, 1982.
*2nd paper on fast weights.*

[FB17]
By 2017, Facebook
used LSTM
to handle
over 4 billion automatic translations per day (The Verge, August 4, 2017);
see also
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)

[FWP]
J. Schmidhuber (AI Blog, 26 March 2021).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
*30-year anniversary of a now popular
alternative*^{[FWP0-1]} to recurrent NNs.
A *slow* feedforward NN learns by gradient descent *to program the changes* of
the fast weights^{[FAST,FASTa]} of
another NN.
Such *Fast Weight Programmers*^{[FWP0-6,FWPMETA1-7]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called *keys* and *values* for *self-attention*^{[TR1-6]}).
The similar *Transformers*^{[TR1-2]} combine this with projections
and *softmax* and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
*linear* Transformers or Performers^{[TR5-6]}
which are *formally equivalent* to the 1991 Fast Weight Programmers (apart from normalization).
In 1993, I introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
extended the approach to
RNNs that program themselves.

[FWP0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
PDF.
*First paper on fast weight programmers: a slow net learns by gradient descent to compute weight changes of a fast net.*

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992.
PDF.
HTML.
Pictures (German).

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
PDF.
*First recurrent fast weight programmer based on outer products. Introduced the terminology of learning "internal spotlights of attention."*

[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.

[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143. PDF.

[FWP4a] J. Ba, G. Hinton, V. Mnih, J. Z. Leibo, C. Ionescu. Using Fast Weights to Attend to the Recent Past. NIPS 2016.
PDF. *Like [FWP0-2].*

[FWP4b]
D. Bahdanau, K. Cho, Y. Bengio (2014).
Neural Machine Translation by Jointly Learning to Align and Translate. Preprint arXiv:1409.0473 [cs.CL].

[FWP4d]
Y. Tang, D. Nguyen, D. Ha (2020).
Neuroevolution of Self-Interpretable Agents.
Preprint: arXiv:2003.08165.

[FWP5]
F. J. Gomez and J. Schmidhuber.
Evolving modular fast-weight networks for control.
In W. Duch et al. (Eds.):
*Proc. ICANN'05,*
LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005.
PDF.
HTML overview.
*Reinforcement-learning fast weight programmer.*

[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
Preprint: arXiv:2106.06295 (June 2021).

[FWPMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
*First recurrent fast weight programmer that can learn
to run a learning algorithm or weight change algorithm on itself.*

[FWPMETA2] J. Schmidhuber. A self-referential weight matrix.
In *Proceedings of the International Conference on Artificial
Neural Networks, Amsterdam*, pages 446-451. Springer, 1993.
PDF.

[FWPMETA3] J. Schmidhuber.
An introspective network that can learn to run its own weight change algorithm. In *Proc. of the Intl. Conf. on Artificial Neural Networks,
Brighton*, pages 191-195. IEE, 1993.

[FWPMETA4]
J. Schmidhuber.
A neural network that embeds its own meta-levels.
In *Proc. of the International Conference on Neural Networks '93,
San Francisco*. IEEE, 1993.

[FWPMETA5]
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
*A recurrent neural net with a self-referential, self-reading, self-modifying weight matrix
can be found here.*

[FWPMETA6]
L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It. Metalearning Workshop at NeurIPS, 2020.
Preprint arXiv:2012.14905 [cs.LG], 2020.

[FWPMETA7]
I. Schlag, T. Munkhdalai, J. Schmidhuber.
Learning Associative Inference Using Fast Weight Memory.
ICLR 2021.
Report arXiv:2011.07831 [cs.AI], 2020.

[GAN0]
O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context.
Blog post, Internet Archive, 2010.
*A blog post describing the basic ideas*^{[AC][AC90, AC90b][AC20]} of GANs.

[GAN1]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio.
Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.
*Description of GANs that does not cite the original work of 1990*^{[AC][AC90, AC90b][AC20][R2]} (also containing wrong claims about
Predictability Minimization^{[PM0-2][AC20]}).

[GD1]
S. I. Amari (1967).
A theory of adaptive pattern classifiers, IEEE Trans, EC-16, 279-307 (Japanese version published in 1965).
PDF.
*Probably the first paper on using stochastic gradient descent for learning in multilayer neural networks
(without specifying the specific gradient descent method now known as reverse mode of automatic differentiation or backpropagation*^{[BP1]}).

[GD2]
S. I. Amari (1968).
Information Theory—Geometric Theory of Information, Kyoritsu Publ., 1968 (in Japanese).
PDF.
*Contains computer simulation results for a five layer network (with 2 modifiable layers) which learns internal representations to classify
non-linearly separable pattern classes.*

[GPT3]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei.
Language Models are Few-Shot Learners (2020).
Preprint arXiv/2005.14165.

[GPUNN]
Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. *Speeding up traditional NNs on GPU by a factor of 20.*

[GPUCNN]
K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. *Speeding up shallow CNNs on GPU by a factor of 4.*

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. *International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona)*, 2011. PDF. ArXiv preprint.
*Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.*

[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
*International Joint Conference on Neural Networks (IJCNN-2011, San Francisco)*, 2011.
PDF.
HTML overview.
*First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.*^{[DAN1]} This led to massive interest from industry.

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. *IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012*, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012.
PDF.

[GPUCNN5]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.

[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011.
PDF.

[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
PDF.

[GPUCNN8] J. Schmidhuber. First deep learner to win a contest on object detection in large images—
first deep learner to win a medical imaging contest (AI Blog, 2012). HTML.
*How IDSIA used GPU-based CNNs to win the
ICPR 2012 Contest on Mitosis Detection
and the
MICCAI 2013 Grand Challenge.*

[GPUCNN9]
K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556 (2014).

[GSR]
H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team.
Google voice search: faster and more accurate.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.

[GSR15] Dramatic
improvement of Google's speech recognition through LSTM:
Alphr Technology, Jul 2015, or 9to5google, Jul 2015

[GSR19]
Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein.
Streaming end-to-end speech recognition for mobile devices. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019.

[GT16] Google's
dramatically improved Google Translate of 2016 is based on LSTM, e.g.,
WIRED, Sep 2016,
or
siliconANGLE, Sep 2016

[HIN] J. Schmidhuber (AI Blog, 2020). Critique of 2019 Honda Prize. *Science must not allow corporate PR to distort the academic record.*

[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. *The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates*^{[LSTM2]} for RNNs.) ResNets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well.^{[HW3]}
More.

[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015.
Link.

[HW2] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015). *Residual nets are a version of Highway Nets*^{[HW1]}
where the gates are always open:
g(x)=1 (a typical highway net initialization) and t(x)=1.
More.

[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arXiv:1612.07771 (2016). Also at ICLR 2017.

[LSTM0]
S. Hochreiter and J. Schmidhuber.
Long Short-Term Memory.
TR FKI-207-95, TUM, August 1995.
PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
PDF.
*The "vanilla LSTM architecture" with forget gates
that everybody is using today, e.g., in Google's Tensorflow.*

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
PDF.

[LSTM4]
S. Fernandez, A. Graves, J. Schmidhuber. An application of
recurrent neural networks to discriminative keyword
spotting.
*Intl. Conf. on Artificial Neural Networks ICANN'07,*
2007.
PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009.
PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber.
Evolving memory cell structures for sequence learning.
Proc. ICANN-09, Cyprus, 2009.
PDF.

[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013.
PDF.

[LSTM9]
O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton.
Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].

[LSTM10]
A. Graves, D. Eck, N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004.
PDF.

[LSTM11]
N. Beringer, A. Graves, F. Schiel, J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl. Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575-581, Springer-Verlag Berlin Heidelberg, 2005.

[LSTM12]
D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.

[LSTM13]
F. A. Gers and J. Schmidhuber.
LSTM Recurrent Networks Learn Simple Context Free and
Context Sensitive Languages.
IEEE Transactions on Neural Networks 12(6):1333-1340, 2001.
PDF.

[LSTM14]
S. Fernandez, A. Graves, J. Schmidhuber.
Sequence labelling in structured domains with
hierarchical recurrent neural networks. In Proc.
IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk).
PDF.

[LSTM15]
A. Graves, J. Schmidhuber.
Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.
*Advances in Neural Information Processing Systems 22, NIPS'22,* p 545-552,
Vancouver, MIT Press, 2009.
PDF.

[LSTM16]
M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015.
Preprint: arXiv:1506.07452.

[LSTM17]
J. A. Perez-Ortiz, F. A. Gers, D. Eck, J. Schmidhuber.
Kalman filters improve LSTM network performance in
problems unsolvable by traditional recurrent nets.
Neural Networks 16(2):241-250, 2003.
PDF.

[LSTM-RL]
B. Bakker, F. Linaker, J. Schmidhuber.
Reinforcement Learning in Partially Observable Mobile Robot
Domains Using Unsupervised Event Extraction.
In Proceedings of the 2002
IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2002), Lausanne, 2002.
PDF.

[LSTMPG]
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous applications of policy gradients to LSTM: DeepMind's Starcraft player (2019) and OpenAI's dextrous robot hand & Dota player (2018)—Bill Gates called this a huge milestone in advancing AI.

[MC43]
W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020.
*The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich.*

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
*Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.*

[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
*By 2010, when compute was 100 times more expensive than today, both our feedforward NNs*^{[MLP1]} and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.

[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011.
HTML.

[MOZ]
M. Mozer. A Focused Backpropagation Algorithm for Temporal Pattern Recognition.
Complex Systems, 1989.

[NYT1]
NY Times article
by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'

[OAI1]
G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker.
Learning Dexterity. OpenAI Blog, 2018.

[OAI1a]
OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba.
Learning Dexterous In-Hand Manipulation. arXiv:1808.00177 (PDF).

[OAI2]
OpenAI:
C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019).
Dota 2 with Large Scale Deep Reinforcement Learning.
Preprint
arXiv:1912.06680.
*An LSTM composes 84% of the model's total parameter count.*

[OAI2a]
J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. *An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.*

[PDA1]
G. Z. Sun, H. H. Chen, C. L. Giles, Y. C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol. II, pp. 649-653, Princeton University, Princeton, NJ, 1990.

[PDA2]
M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Proc. NIPS 1993.

[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.

[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF.
More.

[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
PDF. More.

*Relevant threads with many comments at reddit.com/r/MachineLearning, the largest machine learning forum with over 800k subscribers in 2019 (note that my name is often misspelled):*

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.

[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.

[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.

[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun

[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers

[RCNN]
R. Girshick, J. Donahue, T. Darrell, J. Malik.
Rich feature hierarchies for accurate object detection and semantic segmentation.
Preprint arXiv:1311.2524, Nov 2013.

[RCNN2]
R. Girshick.
Fast R-CNN. Proc. of the IEEE international conference on computer vision, p. 1440-1448, 2015.

[RCNN3]
K. He, G. Gkioxari, P. Dollar, R. Girshick.
Mask R-CNN.
Preprint arXiv:1703.06870, 2017.

[RPG]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.

[RPG07]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs
with Recurrent Policy Gradients.
*Intl. Conf. on Artificial Neural Networks ICANN'07,*
2007.
PDF.

[T20] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Deep Learning.

[T21] J. Schmidhuber (AI Blog, 2021).
Scientific Integrity, the 2021 Turing Lecture, and the 2018 Turing Award for Deep Learning. Technical Report IDSIA-77-21 (v1), IDSIA, Lugano, Switzerland, 24 Sep 2021.

[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4]
M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.

[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.

[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.

[UN]
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. *Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.*

[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
*First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding.
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.*

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
*An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN
can be found here (depth > 1000).*

[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
*Proc. of Intl. Workshop on Neural Networks, RWTH Aachen*, pages 87-95.
Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF.

[UN5]
Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle.
Greedy layer-wise training of deep networks.
Proc. NIPS 06, pages 153-160, Dec. 2006.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
*More on the Fundamental Deep Learning Problem.*

[VID1] G. Hinton.
The Next Generation of Neural Networks.
Youtube video [see 28:16].
GoogleTechTalk, 2007.
*Quote: "Nobody in their right mind would ever suggest"
to use plain backpropagation for training deep networks.
But in 2010, our team showed*^{[MLP1-2]}
that
unsupervised pre-training is not necessary
to train deep NNs.

[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Preprint arXiv:1609.08144 (PDF), 2016. *Based on LSTM which it mentions at least 50 times.*
