In 2021, we are celebrating the 10-year anniversary of DanNet,
named after my outstanding Romanian postdoc Dan Claudiu Cireșan (aka Dan Ciresan).
In 2011, DanNet was the first pure deep convolutional neural network (CNN)
to win computer vision contests.
For a while, it enjoyed a monopoly.
From 2011 to 2012 it won every contest it entered,
winning four of them
in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012), driven
by a very fast implementation based on graphics processing units (GPUs).
Already in 2011, DanNet achieved the first superhuman performance
in a vision challenge,
although compute was still 100 times more expensive than today.
In July 2012, our
CVPR paper on DanNet
hit the computer vision community. The similar
AlexNet and VGG Net joined the party in 2012-2014.
Our even deeper
Highway Net (May 2015) and its variant
ResNet (Dec 2015)
further improved performance (a ResNet is a Highway Net whose gates are always open).
Today, a decade after DanNet, everybody is using fast deep CNNs for computer vision.
CNNs originated over 4 decades ago
The basic CNN architecture with convolutional layers
and downsampling layers is due to Kunihiko Fukushima (1979) [CNN1,CNN1+].
In 1987, NNs with convolutions were combined by Alex Waibel [CNN1a,b] with
backpropagation, a technique from 1970
[BP1-4][R7], and with weight sharing.
Waibel did not call this CNNs but TDNNs.
Yann LeCun's team later contributed important improvements, especially for images, e.g., [CNN2][CNN4][T20][T22](Sec. XVIII).
The popular downsampling variant
called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990 [CNN3a] and by Weng et al. for higher-dimensional CNNs in 1993 [CNN3].
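To make the two building blocks concrete, here is a minimal sketch in plain Python of a convolutional layer with weight sharing followed by a downsampling layer. The toy image, the 2x2 kernel, and the helper names `conv2d_valid` and `pool2d` are illustrative assumptions, not code from any of the cited papers. Averaging is shown next to max-pooling only as one simple alternative way to downsample.

```python
def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (weight sharing, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(kernel[a][b] * image[i + a][j + b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def pool2d(fmap, size, reduce_fn):
    """Downsample with non-overlapping size x size windows."""
    return [
        [
            reduce_fn([fmap[i + a][j + b]
                       for a in range(size) for b in range(size)])
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def mean(values):
    return sum(values) / len(values)


# Hypothetical 5x5 toy image and 2x2 kernel (illustration only).
image = [
    [1, 2, 0, 1, 3],
    [0, 1, 2, 3, 1],
    [1, 0, 1, 2, 0],
    [2, 1, 0, 1, 1],
    [0, 3, 1, 0, 2],
]
kernel = [[1, 0], [0, -1]]

features = conv2d_valid(image, kernel)   # 4x4 feature map
max_pooled = pool2d(features, 2, max)    # 2x2 map, max-pooling
avg_pooled = pool2d(features, 2, mean)   # 2x2 map, averaging
```

Max-pooling keeps the strongest response in each window, whereas averaging can cancel a strong positive and a strong negative response within the same window.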
In 2010, my own team at the Swiss AI Lab IDSIA showed [MLP1] that
unsupervised pre-training is not necessary
to train deep NNs
(a reviewer called this a
"wake-up call to the machine learning community"—
compare the survey blog post [MLP2]).
One year later, our team with my postdocs Dan Cireșan & Ueli Meier and my PhD student
Jonathan Masci (a fellow co-founder of NNAISENSE)
greatly sped up the training of deep CNNs.
Our fast GPU-based [GPUNN] CNN of 1 Feb 2011 [GPUCNN1,2,6], often called "DanNet,"
was a practical breakthrough. Published later that year at IJCAI [GPUCNN1],
it was much deeper and faster than earlier GPU-accelerated
CNNs of 2006 [GPUCNN].
DanNet showed that deep CNNs worked
far better than the existing state-of-the-art for recognizing objects in images [GPUCNN2,2+,5,6].
On a sunny day
in Silicon Valley, at IJCNN 2011, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest [GPUCNN2-3,5]. Even the New York Times mentioned this.
DanNet performed twice as well as human test subjects and three times better than the already
impressive second place entry by LeCun's team [SER11].
Compare Sec. D & Sec. XVIII of [T20].
DanNet has attracted tremendous interest from industry. Its temporary
monopoly on winning computer
vision competitions made it the first deep CNN to win:
a Chinese handwriting contest (ICDAR, May 2011),
a traffic sign recognition contest (IJCNN, Aug 2011),
an image segmentation contest (ISBI, May 2012),
and a contest on object detection in large images (ICPR, Sept 2012). The latter was actually
a medical imaging contest on cancer detection [GPUCNN8].
Our CNN image scanners were 1000 times faster than previous methods [SCAN]. The significance of this kind of improvement for the health care industry is obvious. Today IBM, Siemens, Google, and many startups are pursuing this approach.
In 2011, we also introduced our deep neural nets to Arcelor Mittal, the world's largest steel producer,
and were able to greatly improve steel defect detection [ST].
To the best of my knowledge, this was the first deep learning breakthrough in heavy industry.
A significant part of modern computer vision is extending our work of 2011, e.g., [DL1-4] and Sec. 19 of [MIR].
A follow-up
technical report on DanNet in Feb 2012
summarized some of the recent breakthroughs. In July 2012, DanNet
was also presented at CVPR, the leading computer vision conference [GPUCNN3].
This helped to spread the word in the computer vision community.
As of 2020, the CVPR article was the most cited DanNet paper, albeit not the first [GPUCNN1-3,6].
After DanNet had won 4 image recognition competitions, the similar GPU-accelerated "AlexNet"
won the ImageNet [IM09] 2012 contest [GPUCNN4-5][R6]. Unlike DanNet,
AlexNet used Kunihiko Fukushima's
rectified linear neurons (ReLUs) [RELU1-2] (1969) and a variant of Stephen J. Hanson's stochastic delta rule (1990) called "dropout" without citing the original work [Drop1,4][T20][T22].
While both of these techniques helped,
they are not really
required to win vision contests.
Back then, the only really important CNN-related task was to greatly
accelerate known techniques for training CNNs through GPUs.
We continued to make CNNs and other neural nets even deeper and better.
Until 2015, deep networks had at most a few tens of layers, e.g., 20-30 layers.
But in May 2015, our Highway Net [HW1]
was the first working extremely deep feedforward neural net with hundreds of layers [MOST].
The Highway Net is based on the LSTM
principle [LSTM1-2] which enables much deeper learning.
Its special case called
"ResNet" [HW2] (the ImageNet 2015 winner of Dec 2015)
is a Highway Net whose gates are always open
(compare [HW] & Sec. 4 of [MIR]).
Highway Nets perform roughly as well as ResNets on ImageNet [HW3].
Highway layers are also often used for natural language processing [HW3].
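The relation between the two architectures follows directly from the formula given in [HW1]: each non-input layer of a highway net computes g(x)x + t(x)h(x), and a residual layer is the case g(x)=t(x)=1. The following is a minimal elementwise sketch in plain Python. The toy transform `toy_h`, the gates `toy_t` and `toy_g`, and the coupling g = 1 - t are illustrative assumptions, not the trained functions of any published model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def highway_layer(x, h, t, g):
    """Elementwise highway layer: y_i = g(x)_i * x_i + t(x)_i * h(x)_i."""
    hx, tx, gx = h(x), t(x), g(x)
    return [gi * xi + ti * hi for xi, hi, ti, gi in zip(x, hx, tx, gx)]


def residual_layer(x, h):
    """Highway layer whose gates are always open: g(x) = t(x) = 1."""
    def ones(v):
        return [1.0] * len(v)
    return highway_layer(x, h, t=ones, g=ones)


# Hypothetical toy functions (assumptions for illustration only).
def toy_h(x):
    return [math.tanh(2.0 * xi) for xi in x]


def toy_t(x):
    return [sigmoid(xi - 1.0) for xi in x]


def toy_g(x):
    # Assumed coupling of the two gates: carry gate = 1 - transform gate.
    return [1.0 - ti for ti in toy_t(x)]


x = [0.5, -1.0, 2.0]
y_highway = highway_layer(x, toy_h, toy_t, toy_g)
y_residual = residual_layer(x, toy_h)   # equals x + h(x) elementwise
```

With the transform gate closed (t(x)=0) and the carry gate open (g(x)=1), the layer simply copies its input, which is what lets information pass unchanged through many layers.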
The original successes of DanNet required a precise understanding of
the inner workings of GPUs [GPUCNN1-3].
Today, convenient software packages shield the user from such details, and
compute is roughly 100 times cheaper than 10 years ago when
our results set the stage for the
recent decade of deep learning [DEC].
Many current commercial neural net applications are based on what started in 2011.
Thanks to several expert reviewers for useful comments. Since science is about self-correction, let me know under firstname.lastname@example.org if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
See also BIT 16, 146-160, 1976.
The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
First application of backpropagation[BP1] to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
[CNN1] K. Fukushima: Neural network model for a mechanism of pattern
recognition unaffected by shift in position—Neocognitron.
Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia.
[CNN1+] K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980).
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation[BP1][BP2] and weight-sharing
to a convolutional architecture.
[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CNN1a].
[CNN1c] Bower Award Ceremony 2021:
Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021.
[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
[CNN3a] K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990.
An NN with convolutions using Max-Pooling instead of Fukushima's
[CNN3] Weng, J.,
Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A CNN whose downsampling layers use Max-Pooling
(which has become very popular) instead of Fukushima's
[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007
S. Behnke. Learning iterative image reconstruction in the neural abstraction pyramid. International Journal of Computational Intelligence and Applications, 1(4):427-438, 1999.
S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92-101, 2010.
[DAN] J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.[DAN1]
Now everybody is using this approach.
[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
At the IJCNN 2011 computer vision competition in Silicon Valley,
our artificial neural network called DanNet performed twice as well as humans, three times better than the closest artificial competitor (by LeCun's team), and six times better than the best non-neural method.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.
[DL2] J. Schmidhuber, 2015.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019
on-device speech recognition
(on the phone, not the server)
is still based on
[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, PHYSICA D,42, 265-272.
What's now called "dropout" is a variation of the stochastic delta rule—compare [Drop4].
N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-1032.
J. Hertz, A. Krogh, R. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, California: Addison-Wesley Pub. Co., pp. 45-46.
N. Frazier-Logue, S. J. Hanson (2018). Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning.
[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20.
[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. Speeding up shallow CNNs on GPU by a factor of 4.
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
Speeding up deep CNNs on GPU by a factor of 60.
Used to win four important computer vision competitions 2011-2012 before others won any
with similar approaches.
[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.[DAN1] This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012.
This paper describes AlexNet, which is similar to the earlier DanNet,
the first pure deep CNN
to win computer vision contests in 2011[GPUCNN2-3,5] (AlexNet and VGG Net[GPUCNN9] followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)[RELU1] and dropout (a variant of Hanson 1990 stochastic delta rule)[Drop1-4] but neither cites the original work[RELU1][Drop1] nor the basic CNN architecture (Fukushima, 1979).[CNN1]
[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the ResNet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011.
[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2022 for 10th anniversary of DanNet's victory).
First deep learner to win a contest on object detection in large images—
first deep learner to win a medical imaging contest (2012). Link.
How the Swiss AI Lab IDSIA used GPU-based CNNs to win the
ICPR 2012 Contest on Mitosis Detection
and the MICCAI 2013 Grand Challenge.
[GPUCNN9] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556 (2014).
[HW] J. Schmidhuber (AI Blog, 2015, updated 2020 for 5-year anniversary).
Overview of Highway Networks: First working really deep feedforward neural networks with over 100 layers.
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates[LSTM2] for RNNs.) Resnets[HW2] are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
(Dec 2015). Residual nets are a version of Highway Nets[HW1]
where the gates are always open:
g(x)=1 (a typical highway net initialization) and t(x)=1.
[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arXiv:1612.07771 (2016). Also at ICLR 2017.
[IM09] J. Deng, R. Socher, L.J. Li, K. Li, L. Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
The "vanilla LSTM architecture" with forget gates
that everybody is using today, e.g., in Google's Tensorflow.
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.
[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both our feedforward NNs[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The deep learning neural networks of Schmidhuber's team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrated that many of the basic ideas behind this revolution were published within fewer than 12 months in the "Annus Mirabilis" 1990-1991 at TU Munich.
[MOST] J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) Generative Adversarial Networks (an instance of the much earlier
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Most of this started with the
Annus Mirabilis of 1990-1991.[MIR]
[NDR] R. Csordas, K. Irie, J. Schmidhuber.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv:2110.07732.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[RELU1] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225.
This work introduced rectified linear units or ReLUs.
[RELU2] C. v. d. Malsburg (1973).
Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.
[SCAN] J. Masci,
A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.
[SER11] P. Sermanet, Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. Proc. IJCNN 2011, p 2809-2813, IEEE, 2011.
[ST] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber.
Steel Defect Classification with Max-Pooling Convolutional Neural Networks.
Proc. IJCNN 2012.
Apparently, this was the first deep learning breakthrough in heavy industry.
[T20] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.