
(a Highway Net with open gates) 
Background: GPUs were originally developed for the video game industry. But they can also be used to speed up artificial neural networks (NNs), as shown in 2004 by Jung & Oh [1]. Nevertheless, until 2010, many researchers thought that one cannot train deep NNs by plain backpropagation, the popular technique published by Linnainmaa in 1970 [5, 5a-c, 6].
Why not? Because of the fundamental deep learning problem identified in 1991 by my very first student Sepp Hochreiter [2c]: In typical deep or recurrent networks, backpropagated error signals either grow or decay exponentially in the number of layers.
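A toy illustration of this exponential behaviour, under strong simplifying assumptions that are not taken from [2c]: one unit per layer, the same weight in every layer, and a constant activation slope. The function name and the numbers are hypothetical. In this setting the backpropagated error is multiplied by the same factor at every layer, so its magnitude shrinks or grows geometrically with depth.

```python
def backpropagated_error_scale(weight, activation_slope, num_layers):
    """Factor by which an error signal is scaled after travelling back
    through num_layers layers of a toy one-unit-per-layer chain.

    Assumption: every layer has the same weight and the same slope.
    """
    scale = 1.0
    for _ in range(num_layers):
        # Chain rule: each layer contributes one factor weight * slope.
        scale *= abs(weight * activation_slope)
    return scale


# Per-layer factor below 1 makes the signal decay; above 1 makes it grow.
for weight in (0.5, 1.5):
    for depth in (1, 10, 50):
        scale = backpropagated_error_scale(weight, 1.0, depth)
        print(f"weight={weight} depth={depth} scale={scale:.3e}")
```

Real networks have weight matrices and input-dependent slopes, so the per-layer factor varies, but the same multiplicative structure is what drives the effect.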
That's why many scientists thought that NNs have to be pretrained by unsupervised learning, something that I did first for general purpose deep recurrent NNs in 1991 (my first very deep learner) [2,2a], and that others did for less general feedforward NNs in 2006 [19a] (in 2008 also on GPU [1b]).
In 2010, however, our team at IDSIA (Dan Ciresan et al. [18a,18a+]) showed that GPUs can be used to train deep standard supervised NNs by plain backpropagation [5, 7], achieving a 50-fold speedup over CPUs, and breaking the long-standing famous MNIST [15c] benchmark record [18a], using pattern distortions [15d]. This really was all about GPUs: no novel NN techniques were necessary, no unsupervised pretraining, only decades-old stuff. One of the reviewers called this a "wake-up call" to the machine learning community, which quickly adopted the method [18a+]. Compare also Sec. 19 of [24].
In 2011, we extended [18b-g] this approach to the convolutional NNs (CNNs) developed by Fukushima (1979), Waibel (1987), LeCun (1989), Weng (1993), and others [4] (more on the history of CNNs in Sec. D & Sec. XVIII & Sec. XIV of [19a]). Our GPU-based deep DanNet [18b-g] was 60 times faster [18b] than CPU-based CNNs, and much faster and deeper than previous shallow GPU-CNNs [1b]. It became the basis for a whole series of victories in computer vision contests (see Table 1). In 2011, this attracted enormous interest from industry. Today, the world's most famous IT companies are heavily using such techniques.
In particular, in 2011, DanNet was the first pure, deep GPU-CNN to win international pattern recognition contests [18c-g]. The very first event won by DanNet was the Chinese handwriting recognition contest at ICDAR 2011 [18e], which is highly important for all those cell phone makers who want to build smartphones that can read signs and restaurant menus in foreign languages.
This attracted a lot of industry attention: it became clear that this was the way forward in computer vision. In particular, Apple hired a member of our award-winning team. (Some people think that Apple came late to the deep learning GPU-CNN party, but no, they got active as soon as this became commercially relevant.)
Less than 3 months later, in August 2011 in Silicon Valley, DanNet achieved the first superhuman pattern recognition result in the history of computer vision [18c-g]. Our system was twice as good as humans, three times as good as the closest artificial competitor (from NYU), and six times as good as the best non-neural method.
And then it kept winning those contests with larger and larger images, as shown in Table 1 (compare Kurzweil AI interview of 2012).
Table 1 also reflects that DanNet was the first neural network to win an image segmentation contest (Mar 2012) [20d,20d+], the first NN to win a contest on object detection in large images (10 Sep 2012) [20a,c], the first to win medical imaging contests in general, and the first to win cancer detection contests in particular (Mitosis Detection in Breast Cancer Histological Images, 2012 & 2013) [20a-c]. Our fast CNN image scanners were over 1000 times faster than previous methods [20e].
Today, many startups as well as established companies such as IBM & Google are using such deep GPU-CNNs for healthcare applications (note that healthcare makes up 10% of the world's GDP) [25].
In 2011-2012, DanNet won every contest it entered. It did not participate in ImageNet competitions, focusing instead on contests with larger images (ISBI 2012, ICPR 2012, MICCAI 2013, see Table 1). However, the ImageNet 2012 winner AlexNet [19] (see Table 1) is similar to DanNet [18b-g]. Compare Sec. XIV of [19a].
We continued to make NNs even deeper and better. Until 2015, deep NNs had at most a few tens of layers, e.g., 20-30 layers. But in May 2015, there was something new: our Highway Network [11a-d] was the first working really deep feedforward NN with hundreds of layers, based on the LSTM principle [8,9] which enabled much deeper learning. The ImageNet 2015 winner ResNet [12] of Dec 2015 (Table 1) is a variant thereof. In fact, ResNets are Highway Nets whose gates are always open.
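The relation between the two layer types can be sketched in a few lines of plain Python. The sketch follows the formula given in reference [11a], where each non-input layer computes g(x)x + t(x)h(x), and a residual layer is the special case g(x)=t(x)=1. Everything else here is an assumption made for illustration: the elementwise treatment of lists of floats, the function names, and the toy choices of gate and transform, none of which are the trained functions of the actual networks.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, g, t, h):
    """Elementwise g(x)*x + t(x)*h(x) for a list of floats x."""
    gx, tx, hx = g(x), t(x), h(x)
    return [gi * xi + ti * hi for gi, xi, ti, hi in zip(gx, x, tx, hx)]


def open_gate(x):
    """A gate that is always open: constant 1 for every component."""
    return [1.0 for _ in x]


def toy_gate(x):
    """Hypothetical input-dependent gate with values in (0, 1)."""
    return [sigmoid(xi - 1.0) for xi in x]


def toy_transform(x):
    """Hypothetical nonlinear transform h."""
    return [math.tanh(0.5 * xi) for xi in x]


def residual_layer(x, h):
    """Highway layer with both gates fixed to 1, i.e. x + h(x)."""
    return highway_layer(x, open_gate, open_gate, h)


x = [0.0, 1.0, -2.0]
print(highway_layer(x, toy_gate, toy_gate, toy_transform))
print(residual_layer(x, toy_transform))
```

With gates that depend on the input, the layer can decide per component how much of x to carry forward and how much of h(x) to add; fixing both gates to 1 removes that choice and leaves the plain sum.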
(Table 1 does not list contests won through combinations of CNNs and other techniques such as Support Vector Machines and Bag of Features, e.g., the 2009 TRECVID competitions [21, 22]. It also does not include benchmark records broken outside of contests with concrete deadlines.)
We never needed any of the popular NN regularisers, which tend to improve error rates by at most a few percent, an effect that pales against the dramatic improvements brought by sheer GPU computing power. Compare Sec. XIV of [19a].
We used the GPUs of NVIDIA, which rebranded itself as a deep learning company during the period covered by the competitions in Table 1. BTW, thanks to NVIDIA and its CEO Jensen H. Huang (see image above) for our 2016 NN Pioneers of AI Award, and for generously funding our research!
Most of the major IT companies such as Facebook are now using such deep GPU-CNNs for image recognition and a multitude of other applications [22]. Arcelor Mittal, the world's largest steel maker, worked with us to greatly improve steel defect detection [3].
However, long before our feedforward DanNet started winning competitions in 2011, our CTC-trained Long Short-Term Memory (LSTM) [8,9,10,10a] became the first general purpose recurrent NN to win competitions, namely, three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). By the mid 2010s, LSTM was heavily used for natural language processing, image captioning, speech recognition and generation, chatbots, smart assistants, prediction, etc. Remarkably, LSTM concepts also invaded CNN territory [11a,b,c,d,12], also through GPU-friendly multi-dimensional LSTMs such as PyraMiD-LSTM [23].
We are proud that our deep learning methods developed since 1991 have transformed machine learning and Artificial Intelligence (AI), and became available to billions of users through the world's four most valuable public companies: Apple (#1 as of March 31, 2017), Google (Alphabet, #2), Microsoft (#3), and Amazon (#4).
[1] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]
[1a] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a relatively small factor of 4.]
[1b] Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873-880. ACM. Based on a NIPS 2008 workshop paper.
[2] Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242. Based on TR FKI-148-91, TUM, 1991. More.
[2a] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment with credit assignment across 1200 time steps or virtual layers and unsupervised pretraining for a stack of recurrent NNs can be found here.
[2c] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, in J. Schmidhuber's lab, 1991.
[3] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012.
[4] Fukushima's CNN architecture [13a, 13b] (1979) (with Weng's Max-Pooling [14], 1993) is trained [6] in the shift-invariant 1D case [15a, 15b] or 2D case [15c, 16, 17] by Linnainmaa's automatic differentiation or backpropagation algorithm of 1970 [5, 7] (extending earlier work in control theory [5a-c]).
[5] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. [The first publication of "modern" backpropagation, also known as the reverse mode of automatic differentiation.]
[5a] Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947-954.
[5b] Bryson, A. E. (1961). A gradient method for optimizing multistage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications.
[5c] Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30-45.
[6] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. [First application of backpropagation [5, 7] to neural networks. Extending preliminary thoughts in his 1974 thesis.]
[7] J. Schmidhuber. Who invented backpropagation? More.
[8] Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735-1780. Based on TR FKI-207-95, TUM (1995). More.
[9] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow.]
[10] Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. Proc. ICML'06, pp. 369-376.
[10a] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
[11a] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (Jul 2015). Also at NIPS'2015. [The first working very deep feedforward nets with over 100 layers. Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM [8] with forget gates [9] for RNNs.) The later ResNets [12] are Highway Nets whose gates are always open, that is, g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets on ImageNet [11c]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [11c]. More.]
[11b] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.
[11c] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.
[11d] J. Schmidhuber (2015): Overview of Highway Networks: First working really deep feedforward neural networks with over 100 layers. (Updated 2020 for 5-year anniversary.)
[12] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual Nets [12] are Highway Nets [11] whose gates are always open, with g(x)=1 (a typical Highway Net initialisation) and t(x)=1.
[13a] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. [The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [13b]. More in Scholarpedia.]
[13b] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193-202, 1980. Scholarpedia.
[14] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3D objects from 2D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128.
[15a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. [First application of backpropagation [5] and weight-sharing to a convolutional network.]
[15b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.
[15c] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
[15d] Baird, H. (1990). Document image defect models. In Proc. IAPR Workshop on Syntactic and Structural Pattern Recognition, Murray Hill, NJ.
[16] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007.
[17] D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. ICANN 2010.
[18a] Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207-3220.
[18a+] J. Schmidhuber (Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pretraining. The rest is history.
[18b] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. [DanNet: Speeding up deep CNNs on GPU by a factor of 60. Basis of all our computer vision contest winners since 2011.]
[18c] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
[18c+] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks 32: 333-338, 2012. PDF of preprint.
[18d] Results of 2011 IJCNN traffic sign recognition contest
[18e] Results of 2011 ICDAR Chinese handwriting recognition competition: WWW site, PDF.
[18f] Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012c). Multi-column deep neural networks for image classification. Proc. CVPR, July 2012. Long preprint arXiv:1202.2745v1 [cs.CV], February 2012.
[18g] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[19] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, December 2012.
[19a] J. Schmidhuber (2020). Critique of 2018 Turing Award for deep learning.
[20a] Results of 2012 ICPR cancer detection contest
[20b] Results of 2013 MICCAI Grand Challenge (cancer detection)
[20c] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
[20d] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. NIPS 2012, Lake Tahoe, 2012.
[20d+] I. Arganda-Carreras, S. C. Turaga, D. R. Berger, D. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, D. Laptev, S. Dwivedi, J. M. Buhmann, T. Liu, M. Seyedhosseini, T. Tasdizen, L. Kamentsky, R. Burget, V. Uher, X. Tan, C. Sun, T. Pham, E. Bas, M. G. Uzunbas, A. Cardona, J. Schindelin, H. S. Seung. Crowdsourcing the creation of image segmentation algorithms for connectomics. Front. Neuroanatomy, November 2015.
[20e] J. Masci, A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.
[21] Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231.
[22] Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More. Short version at Scholarpedia.
[23] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, with Application to Fast Biomedical Volumetric Image Segmentation. NIPS 2015; arxiv:1506.07452.
[24] J. Schmidhuber (2019). Deep Learning: Our Miraculous Year 1990-1991. See also arxiv:2005.05744.
[25] J. Schmidhuber (2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.