(a Highway Net with
Background: GPUs were originally developed for the video game industry. But they can also be used to speed up artificial neural networks (NNs), as shown in 2004 by Jung & Oh . Nevertheless, until 2010, many researchers thought that one cannot train deep NNs by plain backpropagation, the popular technique published by Linnainmaa in 1970 [5, 5a-c, 6].
Why not? Because of the fundamental deep learning problem identified in 1991 by my very first student Sepp Hochreiter [2c]: In typical deep or recurrent networks, back-propagated error signals either grow or decay exponentially in the number of layers.
That's why many scientists thought that NNs have to be pre-trained by unsupervised learning - something that I did first for general purpose deep recurrent NNs in 1991 (my first very deep learner) [2,2a], and that others did for less general feedforward NNs in 2006 [19a] (in 2008 also on GPU [1b]).
In 2010, however, our team at IDSIA (Dan Ciresan et al [18a,18a+]) showed that GPUs can be used to train deep standard supervised NNs by plain backpropagation [5, 7], achieving a 50-fold speedup over CPUs, and breaking the long-standing famous MNIST [15c] benchmark record [18a], using pattern distortions [15d]. This really was all about GPUs—no novel NN techniques were necessary, no unsupervised pre-training, only decades-old stuff. One of the reviewers called this a "wake-up call" to the machine learning community, which quickly adopted the method [18a+]. Compare also Sec. 19 of .
In 2011, we extended [18b-g] this approach to the convolutional NNs (CNNs) developed by Fukushima (1979), Waibel (1987), LeCun (1989), Weng (1993), and others  (more on the history of CNNs in Sec. D & Sec. XVIII & Sec. XIV of [19a]). Our GPU-based deep DanNet [18b-g] was 60 times faster [18b] than CPU-based CNNs, and much faster and deeper than previous shallow GPU-CNNs [1b]. It became the basis for a whole series of victories in computer vision contests (see Table 1). In 2011, this attracted enormous interest from industry. Today, the world's most famous IT companies are heavily using such techniques.
In particular, in 2011, DanNet was the first pure, deep GPU-CNN to win international pattern recognition contests [18c-g]. The very first event won by DanNet was the Chinese handwriting recognition contest at ICDAR 2011 [18e]—highly important for all those cell phone makers who want to build smartphones that can read signs and restaurant menus in foreign languages.
This attracted a lot of industry attention—it became clear that this was the way forward in computer vision. In particular, Apple hired one of our award-winning team. (Some people think that Apple came late to the deep learning GPU-CNN party, but no, they got active as soon as this became commercially relevant.)
Less than 3 months later, in August 2011 in Silicon Valley, DanNet achieved the first superhuman pattern recognition result in the history of computer vision [18c-g]. Our system was twice better than humans, three times better than the closest artificial competitor (from NYU), and six times better than the best non-neural method.
And then it kept winning those contests with larger and larger images, as shown in Table 1 (compare Kurzweil AI interview of 2012).
Table 1 also reflects that DanNet was the first neural network to win an image segmentation contest (Mar 2012) [20d,20d+], the first NN to win a contest on object detection in large images (10 Sep 2012) [20a,c], the first to win medical imaging contests in general, and the first to win cancer detection contests in particular (Mitosis Detection in Breast Cancer Histological Images, 2012 & 2013) [20a-c]. Our fast CNN image scanners were over 1000 times faster than previous methods [20e].
Today, many startups as well as established companies such as IBM & Google are using such deep GPU-CNNs for healthcare applications (note that healthcare makes up 10% of the world's GDP) .
In 2011-2012, DanNet won every contest it entered. It did not participate in ImageNet competitions, focusing instead on contests with larger images (ISBI 2012, ICPR 2012, MICCAI 2013, see Table 1). However, the ImageNet 2012 winner AlexNet  (see Table 1) is similar to DanNet [18b-g] (compare Sec. XIV of [19a]). So is the ImageNet 2014 winner VGG Net .
We continued to make NNs even deeper and better. Until 2015, deep NNs had at most a few tens of layers, e.g., 20-30 layers. But in May 2015, there was something new: our Highway Network [11a-d] was the first working really deep feedforward NN with hundreds of layers, based on the LSTM principle [8,9] which enabled much deeper learning. The ImageNet 2015 winner ResNet  of Dec 2015 (Table 1) is a variant thereof. In fact, ResNets are Highway Nets whose gates are always open.
(Table 1 does not list contests won through combinations of CNNs and other techniques such as Support Vector Machines and Bag of Features, e.g., the 2009 TRECVID competitions [21, 22]. It also does not include benchmark records broken outside of contests with concrete deadlines.)
We never needed any of the popular NN regularisers, which tend to improve error rates by at most a few percent, which pales against the dramatic improvements brought by sheer GPU computing power. Compare Sec. XIV of [19a].
We used the GPUs of NVIDIA, which rebranded itself as a deep learning company during the period covered by the competitions in Table 1. BTW, thanks to NVIDIA and its CEO Jensen H. Huang (see image above) for our 2016 NN Pioneers of AI Award, and for generously funding our research!
Most of the major IT companies such as Facebook are now using such deep GPU-CNNs for image recognition and a multitude of other applications . Arcelor Mittal, the world's largest steel maker, worked with us to greatly improve steel defect detection .
However, long before our feedforward DanNet started winning competitions in 2011, our CTC-trained Long Short-Term Memory (LSTM) [8,9,10,10a] became the first general purpose recurrent NN to win competitions, namely, three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic). By the mid 2010s, LSTM was heavily used for natural language processing, image captioning, speech recognition and generation, chatbots, smart assistants, prediction, etc. Remarkably, LSTM concepts also invaded CNN territory [11a,b,c,d,12], also through GPU-friendly multi-dimensional LSTMs such as PyraMiD-LSTM .
We are proud that our deep learning methods developed since 1991 have transformed machine learning and Artificial Intelligence (AI), and became available to billions of users through the world's four most valuable public companies: Apple (#1 as of March 31, 2017), Google (Alphabet, #2), Microsoft (#3), and Amazon (#4).
Thanks to several expert reviewers for useful comments. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
 Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]
[1a] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a relatively small factor of 4.]
[1b] Raina, R., Madhavan, A., and Ng, A. (2009). Large-scale deep unsupervised learning using graphics processors. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), pages 873-880. ACM. Based on a NIPS 2008 workshop paper.
 Schmidhuber, J. (1992). Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242. Based on TR FKI-148-91, TUM, 1991. More.
[2a] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NNs can be found here.
[2c] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, in J. Schmidhuber's lab, 1991.
 J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012.
 Fukushima's CNN architecture [13a, 13b] (1979) (with Weng's Max-Pooling , 1993) is trained  in the shift-invariant 1D case [15a, 15b] or 2D case [15c, 16, 17] by Linnainmaa's automatic differentiation or backpropagation algorithm of 1970 [5, 7] (extending earlier work in control theory [5a-c]).
 S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. [The first publication of "modern" backpropagation, also known as the reverse mode of automatic differentiation.]
[5a] Kelley, H. J. (1960). Gradient theory of optimal flight paths. ARS Journal, 30(10):947-954.
[5b] Bryson, A. E. (1961). A gradient method for optimizing multi-stage allocation processes. In Proc. Harvard Univ. Symposium on digital computers and their applications.
[5c] Dreyfus, S. E. (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30-45.
 P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. [First application of backpropagation [5, 7] to neural networks. Extending preliminary thoughts in his 1974 thesis.]
 Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation, 9(8):1735-1780. Based on TR FKI-207-95, TUM (1995). More.
 F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow.]
 Graves, A., Fernandez, S., Gomez, F. J., and Schmidhuber, J. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural nets. Proc. ICML'06, pp. 369-376.
[10a] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
[11a] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (Jul 2015). Also at NIPS'2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM  with forget gates  for RNNs.) The later ResNets  are Highway Nets whose gates are always open, that is, g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets on ImageNet [11c]. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well [11c]. More.]
[11b] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.
[11c] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.
[11d] J. Schmidhuber (AI Blog, 2015): Overview of Highway Networks: First working really deep feedforward neural networks with over 100 layers. (Updated 2020 for 5-year anniversary.)
 He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual Nets  are Highway Nets  whose gates are always open, with g(x)=1 (a typical Highway Net initialisation) and t(x)=1.
[13a] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. [The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [13b]. More in Scholarpedia.]
[13b] K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4): 193-202, 1980. Scholarpedia.
 Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128.
[15a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. [First application of backpropagation  and weight-sharing to a convolutional network.]
[15b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.
[15c] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
[15d] Baird, H. (1990). Document image defect models. In Proc. IAPR Workshop on Syntactic and Structural Pattern Recognition, Murray Hill, NJ.
 M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007
 D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. ICANN 2010.
[18a] Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep big simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207-3220.
[18a+] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. The rest is history
[18b] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. [DanNet: Speeding up deep CNNs on GPU by a factor of 60. Basis of all our computer vision contest winners since 2011.]
[18c] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
[18c+] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. Multi-Column Deep Neural Network for Traffic Sign Classification. Neural Networks 32: 333-338, 2012. PDF of preprint.
[18f] Ciresan, D. C., Meier, U., and Schmidhuber, J. (2012c). Multi-column deep neural networks for image classification. Proc. CVPR, July 2012. Long preprint arXiv:1202.2745v1 [cs.CV], February 2012.
 A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, December 2012.
[19a] J. Schmidhuber (2020). Critique of 2018 Turing Award for deep learning.
[20a] Results of 2012 ICPR cancer detection contest
[20b] Results of 2013 MICCAI Grand Challenge (cancer detection)
[20c] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
[20d] D. Ciresan, A. Giusti, L. Gambardella, J. Schmidhuber. Deep Neural Networks Segment Neuronal Membranes in Electron Microscopy Images. NIPS 2012, Lake Tahoe, 2012.
[20d+] I. Arganda-Carreras, S. C. Turaga, D. R. Berger, D. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber, D. Laptev, S. Dwivedi, J. M. Buhmann, T. Liu, M. Seyedhosseini, T. Tasdizen, L. Kamentsky, R. Burget, V. Uher, X. Tan, C. Sun, T. Pham, E. Bas, M. G. Uzunbas, A. Cardona, J. Schindelin, H. S. Seung. Crowdsourcing the creation of image segmentation algorithms for connectomics. Front. Neuroanatomy, November 2015.
 Ji, S., Xu, W., Yang, M., and Yu, K. (2013). 3D convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):221-231.
 M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, with Application to Fast Biomedical Volumetric Image Segmentation. NIPS 2015; arxiv:1506.07452.
 J. Schmidhuber (AI Blog, 2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
 K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556 (2014).