2010: Breakthrough of supervised deep learning. No unsupervised pre-training. The rest is history. Juergen Schmidhuber.

Jürgen Schmidhuber (9/2/2020)
Pronounce: You_again Shmidhoobuh

In 2020, we are celebrating the 10-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained by plain gradient descent on GPUs. Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on MNIST, the then-famous machine learning benchmark. That is, by 2010, when compute was roughly 100 times more expensive than today, both our feedforward NNs and our earlier recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all competing algorithms on important problems of that time. In the 2010s, this deep learning revolution quickly spread from Europe to North America and Asia.

Just one decade ago, many thought that deep NNs could not learn much without unsupervised pre-training, a technique that I introduced in 1991 [UN0-UN3] and that was later also championed by others, e.g., [UN4-5] [VID1] [T20]. In fact, it was claimed [VID1] that "nobody in their right mind would ever suggest" using plain gradient descent through backpropagation [BP1] (see also [BPA-C] [BP2-6] [R7]) to train feedforward NNs (FNNs) with many layers of neurons.

However, in March 2010, our team with my outstanding Romanian postdoc Dan Ciresan [MLP1] showed that deep FNNs can indeed be trained by plain backpropagation for important applications. This required neither unsupervised pre-training nor Ivakhnenko's incremental layer-wise training of 1965 [DEEP1-2]. By the standards of 2010, our supervised NN had many layers. It set a new performance record [MLP1] on the then-famous and widely used image recognition benchmark MNIST [MNI]. This was achieved by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units (GPUs), going beyond the important GPU work of Oh & Jung (2004) [GPUNN]. A reviewer called this a "wake-up call to the machine learning community."
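To make "plain backpropagation" concrete, here is a minimal sketch of a multilayer perceptron trained by gradient descent alone, with no pre-training stage. It runs on the CPU in pure Python and makes no attempt to reproduce the GPU implementation or the network sizes of [MLP1]. The tanh units, the squared-error loss, the initialization, the layer sizes and all function names (`init_mlp`, `forward`, `backprop`, `sgd_step`) are assumptions chosen for illustration.

```python
import math
import random


def init_mlp(sizes, seed=0):
    """Create one (W, b) pair per layer; W is a list of weight rows."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)  # assumed init, not taken from [MLP1]
        W = [[rng.uniform(-scale, scale) for _ in range(n_in)]
             for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers


def forward(layers, x):
    """Return the activations of every layer, the input included."""
    acts = [x]
    for W, b in layers:
        prev = acts[-1]
        acts.append([math.tanh(sum(w * p for w, p in zip(row, prev)) + bi)
                     for row, bi in zip(W, b)])
    return acts


def loss(layers, x, target):
    """Squared error between the network output and the target."""
    out = forward(layers, x)[-1]
    return 0.5 * sum((o - t) ** 2 for o, t in zip(out, target))


def backprop(layers, x, target):
    """Gradient of the loss w.r.t. every weight and bias (reverse mode)."""
    acts = forward(layers, x)
    # delta holds dLoss/d(pre-activation), starting at the output layer.
    delta = [(o - t) * (1.0 - o * o) for o, t in zip(acts[-1], target)]
    grads = [None] * len(layers)
    for k in range(len(layers) - 1, -1, -1):
        W, _ = layers[k]
        prev = acts[k]  # input that layer k received
        grads[k] = ([[d * p for p in prev] for d in delta], list(delta))
        if k > 0:
            # Push the error one layer down: (W^T delta) * tanh'(.)
            delta = [sum(W[j][i] * delta[j] for j in range(len(delta)))
                     * (1.0 - prev[i] ** 2) for i in range(len(prev))]
    return grads


def sgd_step(layers, grads, lr):
    """Return new parameters after one plain gradient descent step."""
    return [([[w - lr * g for w, g in zip(row, grow)]
              for row, grow in zip(W, gW)],
             [bi - lr * gi for bi, gi in zip(b, gb)])
            for (W, b), (gW, gb) in zip(layers, grads)]


if __name__ == "__main__":
    net = init_mlp([4, 8, 8, 8, 8, 3])       # toy sizes, several hidden layers
    x, target = [0.5, -0.2, 0.1, 0.9], [0.3, -0.4, 0.8]
    print("loss before:", loss(net, x, target))
    for _ in range(200):
        net = sgd_step(net, backprop(net, x, target), lr=0.1)
    print("loss after: ", loss(net, x, target))
```

The same gradient computation applies unchanged to every layer, however many there are; the sketch only illustrates that nothing beyond the chain rule and a step size is involved.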

Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This greatly improved earlier work [GPUCNN]. The so-called DanNet [GPUCNN1] [R6] broke several benchmark records. In May 2011, DanNet was the first deep CNN to win a computer vision competition [GPUCNN5] [GPUCNN3]. In August 2011, it was the first to win a vision contest with superhuman performance [GPUCNN5]. Our team kept winning vision contests in 2012 [GPUCNN5]. Subsequently, many researchers adopted this technique. By May 2015, we had the first extremely deep FNNs with more than 100 layers [HW1] (compare [HW2] [HW3]).
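The highway layers behind those very deep FNNs are easy to sketch. As the [HW1] entry in the reference list notes, each non-input layer computes g(x)x + t(x)h(x), and a residual layer [HW2] is the special case g(x)=t(x)=1. The code below is a pure-Python illustration of that formula only. The helper names and the gated parameterisation in `make_gated_layer` (tanh transform, sigmoid gate t, carry gate g = 1 - t) are assumptions made for the example, not a statement of how any particular published model was configured.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def highway_layer(x, g, t, h):
    """Elementwise y = g(x) * x + t(x) * h(x); g, t, h map vectors to vectors."""
    gx, tx, hx = g(x), t(x), h(x)
    return [gi * xi + ti * hi for gi, xi, ti, hi in zip(gx, x, tx, hx)]


def residual_layer(x, h):
    """Special case g(x) = t(x) = 1, i.e. y = x + h(x)."""
    ones = lambda v: [1.0] * len(v)
    return highway_layer(x, ones, ones, h)


def make_gated_layer(W_h, b_h, W_t, b_t):
    """Assumed parameterisation: h = tanh(.), t = sigmoid(.), g = 1 - t."""
    def h(x):
        return [math.tanh(z) for z in affine(W_h, b_h, x)]

    def t(x):
        return [1.0 / (1.0 + math.exp(-z)) for z in affine(W_t, b_t, x)]

    def g(x):
        return [1.0 - ti for ti in t(x)]

    return lambda x: highway_layer(x, g, t, h)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 4

    def rand_matrix():
        return [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]

    # Stack 100 layers; a negative gate bias makes each layer mostly carry x.
    stack = [make_gated_layer(rand_matrix(), [0.0] * n,
                              rand_matrix(), [-2.0] * n) for _ in range(100)]
    y = [1.0, -1.0, 0.5, 0.0]
    for layer in stack:
        y = layer(y)
    print(y)
```

With g = 1 - t and a bounded h, each output is a convex combination of the layer input and h(x), so the signal passes through many layers without blowing up. That is one way to read the comparison with LSTM forget gates made in the [HW1] note.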

The original successes required a precise understanding of the inner workings of GPUs [MLP1] [GPUCNN1]. Today, convenient software packages shield the user from such details. Compute is roughly 100 times cheaper than a decade ago, and many commercial NN applications are based on what started in 2010 [MLP1] [DL1-4] [DEC].

In this context it should be mentioned that right before the 2010s, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs) whose basic architectures were introduced over half a century earlier [MC43] [K56]. My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research groups at TU Munich and the Swiss AI Lab IDSIA: Supervised LSTM RNNs (1990s-2005) [LSTM0-6] (which overcome the famous vanishing gradient problem analyzed by my PhD student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006). CTC-trained LSTM was the first RNN to win international contests. Compare Sec. 4 of [MIR] and Sec. A & B of [T20].

That is, by 2010, both our supervised FNNs and our supervised RNNs were able to outperform all other methods on important problems. In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia, with enormous impact on industry and daily life [DL4] [DEC]. However, it should be mentioned that the conceptual roots of deep learning reach back deep into the previous millennium [DEEP1-2] [DL1-2] [MIR] (Sec. 21 & Sec. 19) [T20] (e.g., Sec. II & D).

Finally, let me emphasize that the supervised deep learning revolution of the 2010s did not really kill all variants of unsupervised learning. Many are still important. For example, pre-trained language models are now heavily used in the context of transfer learning, e.g., [TR2]. And our active & generative unsupervised NNs since 1990 [AC90-AC20] are still used to endow agents with artificial curiosity [MIR] (Sec. 5 & Sec. 6). See also GANs [AC20] [R2] [T20] (Sec. XVII), a special case of our adversarial NNs [AC90b]. Unsupervised learning still has a bright future!

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint (1 March 2010). [Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.]

[MNI] Y. LeCun (1998). The MNIST database of handwritten digits. Link.

[MIR] J. Schmidhuber (2019). Deep Learning: Our Miraculous Year 1990-1991. See also arxiv:2005.05744.

[DEC] J. Schmidhuber (2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.

[DL1] J. Schmidhuber, 2015. Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.

[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16]. GoogleTechTalk, 2007. [Quote: "Nobody in their right mind would ever suggest" to use plain backpropagation for training deep networks.] But in 2010, our [MLP1] showed that unsupervised pre-training is not necessary to train deep feedforward nets.

[T20] J. Schmidhuber (2020). Critique of 2018 Turing Award: http://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html

[MC43] W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. [More on the Fundamental Deep Learning Problem.]

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[HW1] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) ResNets [HW2] are a special case of this where g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets [HW2] on ImageNet [HW3]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [HW3]. More.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of highway nets [HW1], with g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.

[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]

[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint (1 Feb 2011). [Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.]

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN5] J. Schmidhuber. History of computer vision contests won by deep CNNs on GPU. March 2017. HTML. [How IDSIA used GPU-based CNNs to win four important computer vision competitions 2011-2012 before others started using similar approaches.]

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[UN0] J.  Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF. [First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills; such approaches are now widely used. More.]

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. [An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here (depth > 1000).]

[UN3] J.  Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507, 2006. PDF.

[UN5] Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics processors. In Proc. ICML 26, p 873-880, ACM.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[AC90] J.  Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. This report introduced a whole bunch of concepts that are now widely used: Planning with recurrent world models ([MIR], Sec. 11), high-dimensional reward signals as extra NN inputs / general value functions ([MIR], Sec. 13), deterministic policy gradients ([MIR], Sec. 14), unsupervised NNs that are both generative and adversarial ([MIR], Sec. 5), for Artificial Curiosity and related concepts.

[AC90b] J.  Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. Based on [AC90]. More.

[AC91b] J.  Schmidhuber. Curious model-building control systems. Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE, 1991. PDF.

[AC95] J. Storck, S. Hochreiter, and J.  Schmidhuber. Reinforcement-driven information acquisition in non-deterministic environments. In Proc. ICANN'95, vol. 2, pages 159-164. EC2 & CIE, Paris, 1995. PDF.

[AC97] J. Schmidhuber. What's interesting? Technical Report IDSIA-35-97, IDSIA, July 1997.

[AC99] J. Schmidhuber. Artificial Curiosity Based on Discovering Novel Algorithmic Predictability Through Coevolution. In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z. Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618, IEEE Press, Piscataway, NJ, 1999.

[AC02] J.  Schmidhuber. Exploring the Predictable. In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing, p. 579-612, Springer, 2002. PDF.

[AC06] J.  Schmidhuber. Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, 18(2): 173-187, 2006. PDF.

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.

[AC11] Sun Yi, F. Gomez, J. Schmidhuber. Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011. PDF.

[AC13] J. Schmidhuber. POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Cognitive Science, 2013. Preprint (2011): arXiv:1112.5309 [cs.AI]

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.

[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. [The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.]

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. [First application of backpropagation [BP1] to neural networks. Extending preliminary thoughts in his 1974 thesis.]

[BP4] J. Schmidhuber. Who invented backpropagation? More in [DL2].

[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BP6] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. [See Section 3.1 on using gradient descent for learning in multilayer networks.]

[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. [First working Deep Learners with many layers, learning internal representations.]

[DEEP1a] Ivakhnenko, A. G. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.

