Jürgen Schmidhuber (9/2/2020, 2022)
Pronounce: You_again Shmidhoobuh
See also the 2025 update! AI Blog
Twitter: @SchmidhuberAI

2010: Breakthrough of supervised deep learning. No unsupervised pre-training. The rest is history.

In 2020, we are celebrating the 10-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained by plain gradient descent on GPU. Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on the (back then famous) machine learning benchmark MNIST. That is, by 2010, when compute was 100 times more expensive than today, both our feedforward NNs and our earlier recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all competing algorithms on important problems of that time. In the 2010s, this deep learning revolution quickly spread from Europe to America and Asia.

Just one decade ago, many thought that deep NNs cannot learn much without unsupervised pre-training, a technique introduced by myself in 1991 [UN0-UN3][UN] and later also championed by others, e.g., [UN4-5][VID1][T20][T22]. In fact, it was claimed [VID1] that "nobody in their right mind would ever suggest" to use plain gradient descent through backpropagation [BP1] (see also [BPA-C] [BP2-6][R7]) to train feedforward NNs (FNNs) with many layers of neurons.

However, in March 2010, our team with my outstanding Romanian postdoc Dan Ciresan [MLP1] showed that deep FNNs can indeed be trained by plain backpropagation for important applications. This neither required unsupervised pre-training nor Ivakhnenko's incremental layer-wise training of 1965 [DEEP1-2]. By the standards of 2010, our supervised NN had many layers. It set a new performance record [MLP1] on the back then famous and widely used image recognition benchmark called MNIST [MNI]. This was achieved by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units called GPUs, going beyond the important GPU work of Jung & Oh (2004) [GPUNN]. A reviewer called this a "wake-up call to the machine learning community."

Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This greatly improved earlier work [GPUCNN]. The so-called DanNet [GPUCNN1][R6] broke several benchmark records [DAN]. In May 2011, DanNet was the first deep CNN to win a computer vision competition [GPUCNN5,3]. In August 2011, it was the first to win a vision contest with superhuman performance [GPUCNN5][DAN1]. Our team kept winning vision contests in 2012 [GPUCNN5]. Subsequently, many researchers adopted this technique. By May 2015, we had the first extremely deep FNNs with more than 100 layers [HW1] (compare [HW2][HW3]).

The original successes required a precise understanding of the inner workings of GPUs [MLP1][GPUCNN1]. Today, convenient software packages shield the user from such details. Compute is roughly 100 times cheaper than a decade ago, and many commercial NN applications are based on what started in 2010 [MLP1-2][DL1-4][DEC].

In this context it should be mentioned that right before the 2010s, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs) whose basic architectures were introduced in the 1920s [L20][I25][K41][MC43][W45][K56][AMH1-2]. My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research groups at TU Munich and the Swiss AI Lab IDSIA: Supervised LSTM RNNs (1990s-2005) [LSTM0-6] (which overcome the famous vanishing gradient problem analyzed by my PhD student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006). CTC-trained LSTM was the first RNN to win international contests. Compare Sec. 4 of [MIR] and Sec. A & B & XVII of [T22].

That is, by 2010, both our supervised FNNs and our supervised RNNs were able to outperform all other methods on important problems. In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia, with enormous impact on industry and daily life [DL4][DEC][MOST]. However, it should be mentioned that the conceptual roots of deep learning reach back deep into the previous millennium [DEEP1-2][DL1-2][MIR](Sec. 21 & Sec. 19) [T20][T22](e.g., Sec. II & D).

Finally let me emphasize that the above-mentioned supervised deep learning revolutions of the early 1990s (for recurrent NNs) [MIR] and of 2010 (for feedforward NNs) [MLP1-2] did not at all kill unsupervised learning. For example, pre-trained language models are now heavily used by Transformers which excel at the traditional LSTM domain of Natural Language Processing [TR1-6] (although there are still many language tasks that LSTM can rapidly learn to solve quickly [LSTM13] while plain Transformers can't). Remarkably, Transformers with linearized self-attention were also first published [FWP0-7] in our Annus Mirabilis of 1990-1991 [MIR][MOST], together with unsupervised pre-training for deep learning [UN-UN3]. And our unsupervised generative adversarial NNs since 1990 [AC90-AC20][PLAN][AC] are still used to endow agents with artificial curiosity [MIR](Sec. 5 & Sec. 6)—see also a version of our adversarial NNs [AC90b] called GANs [AC20][R2][PLAN][MOST][T22](Sec. XVII). Unsupervised learning still has a bright future!

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

References

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.

[MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both our feedforward NNs^[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.

[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich.

[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win image recognition competitions), (4) Generative Adversarial Networks (an instance of my earlier Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers). Most of this started with our Annus Mirabilis of 1990-1991.^[MIR]

[MNI] Y. LeCun (1998). The MNIST database of handwritten digits. Link.

[AMH1] S. I. Amari (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions, C 21, 1197-1206, 1972. PDF. First published learning RNN. First publication of what was later sometimes called the Hopfield network^[AMH2] or Amari-Hopfield Network.

[AMH2] J. J. Hopfield (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Sciences, vol. 79, pages 2554-2558, 1982. The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.^[AMH1]

[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention^[ATT0-2] (1990) and soft attention (1991-93).^[FWP] Today, both types are very popular.

[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.

[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019 on-device speech recognition (on the phone, not the server) is still based on LSTM.

[FWP] J. Schmidhuber (AI Blog, 26 March 2021). 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! 30-year anniversary of a now popular alternative^[FWP0-1] to recurrent NNs. A slow feedforward NN learns by gradient descent to program the changes of the fast weights of another NN. Such Fast Weight Programmers^[FWP0-7] can learn to memorize past data, e.g., by computing fast weight changes through additive outer products of self-invented activation patterns^[FWP0-1] (now often called keys and values for self-attention^[TR1-6]). The similar Transformers^[TR1-2] combine this with projections and softmax and are now widely used in natural language processing. For long input sequences, their efficiency was improved through Transformers with linearized self-attention^[TR5-6] which are formally equivalent to the 1991 Fast Weight Programmers (apart from normalization). In 1993, I introduced the attention terminology^[FWP2] now used in this context,^[ATT] and extended the approach to RNNs that program themselves.

[FWP0] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. First paper on fast weight programmers: a slow net learns by gradient descent to compute weight changes of a fast net.

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. PDF. HTML. Pictures (German).

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent fast weight programmer based on outer products. Introduced the terminology of learning "internal spotlights of attention."

[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.

[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018. Preprint: arXiv:1811.12143. PDF.

[FWP5] F. J. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In W. Duch et al. (Eds.): Proc. ICANN'05, LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005. PDF. HTML overview. Reinforcement-learning fast weight programmer.

[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. Preprint: arXiv:2106.06295 (June 2021).

[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16]. GoogleTechTalk, 2007. Quote: "Nobody in their right mind would ever suggest" to use plain backpropagation for training deep networks. But in 2010, our [MLP1] showed that unsupervised pre-training is not necessary to train deep feedforward nets.

[T20] J. Schmidhuber (2020). Critique of 2018 Turing Award: http://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.

[I25] E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925. First non-learning recurrent NN architecture: the Lenz-Ising model.

[K41] H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.

[W45] G. H. Wannier (1945). The Statistical Problem in Cooperative Phenomena. Rev. Mod. Phys. 17, 50.

[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.

[L20] W. Lenz (1920). Beiträge zum Verständnis der magnetischen Eigenschaften in festen Körpern. Physikalische Zeitschrift, 21: 613-615.

[MC43] W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. [More on the Fundamental Deep Learning Problem.]

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-1340, 2001. PDF.

[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[HW1] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) Resnets [HW2] are a version of this where g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. More.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a version of highway nets [HW1], with g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.

[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs). 1st superhuman result in 2011.^[DAN1] Now everybody is using this approach.

[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. At the IJCNN 2011 computer vision competition in Silicon Valley, our artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor, and six times better than the best non-neural method.

[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]

[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint (1 Feb 2011). [Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.]

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.

[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF. [First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.]

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. [An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here (depth > 1000).]

[UN3] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507, 2006. PDF.

[UN5] Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics processors. In Proc. ICML 26, p 873-880, ACM.

[AC] J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.

[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. This report introduced a whole bunch of concepts that are now widely used: Planning with recurrent world models ([MIR], Sec. 11), high-dimensional reward signals as extra NN inputs / general value functions ([MIR], Sec. 13), deterministic policy gradients ([MIR], Sec. 14), unsupervised NNs that are both generative and adversarial ([MIR], Sec. 5), for Artificial Curiosity and related concepts.

[AC90b] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. Based on [AC90]. More.

[AC91b] J. Schmidhuber. Curious model-building control systems. Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE, 1991. PDF.

[AC95] J. Storck, S. Hochreiter, and J. Schmidhuber. Reinforcement-driven information acquisition in non-deterministic environments. In Proc. ICANN'95, vol. 2, pages 159-164. EC2 & CIE, Paris, 1995. PDF.

[AC97] J. Schmidhuber. What's interesting? Technical Report IDSIA-35-97, IDSIA, July 1997.

[AC99] J . Schmidhuber. Artificial Curiosity Based on Discovering Novel Algorithmic Predictability Through Coevolution. In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z. Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618, IEEE Press, Piscataway, NJ, 1999.

[AC02] J. Schmidhuber. Exploring the Predictable. In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing, p. 579-612, Springer, 2002. PDF.

[AC06] J. Schmidhuber. Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, 18(2): 173-187, 2006. PDF.

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.

[AC11] Sun Yi, F. Gomez, J. Schmidhuber. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011. PDF.

[AC13] J. Schmidhuber. POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Cognitive Science, 2013. Preprint (2011): arXiv:1112.5309 [cs.AI]

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.

[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. First application of backpropagation^[BP1] to NNs (concretizing thoughts in his 1974 thesis).

[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.^[DL2]

[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BP6] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. See Section 3.1 on using gradient descent for learning in multilayer networks.

[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.

[DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data of handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.

[TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021.

.