2010: Breakthrough of supervised deep learning. No unsupervised pre-training. The rest is history.
In 2020, we are celebrating the 10-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained by plain gradient descent on GPU. Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on the then famous machine learning benchmark MNIST. That is, by 2010, when compute was 100 times more expensive than today, both our feedforward NNs and our earlier recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all competing algorithms on important problems of that time. In the 2010s, this deep learning revolution quickly spread from Europe to America and Asia.
Just one decade ago, many thought that deep NNs could not learn much without unsupervised pre-training, a technique I introduced in 1991 [UN0-UN3], later also championed by others, e.g., [UN4-5] [VID1] [T20]. In fact, it was claimed [VID1] that "nobody in their right mind would ever suggest" using plain gradient descent through backpropagation [BP1] (see also [BPA-C] [BP2-6] [R7]) to train feedforward NNs (FNNs) with many layers of neurons.
However, in March 2010, our team, including my outstanding Romanian postdoc Dan Ciresan [MLP1], showed that deep FNNs can indeed be trained by plain backpropagation for important applications. This required neither unsupervised pre-training nor Ivakhnenko's incremental layer-wise training of 1965 [DEEP1-2]. By the standards of 2010, our supervised NN had many layers. It set a new performance record [MLP1] on the then famous and widely used image recognition benchmark MNIST [MNI].
This was achieved by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units (GPUs), going beyond the important GPU work of Oh & Jung (2004) [GPUNN].
A reviewer called this a
"wake-up call to the machine learning community."
Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This
greatly improved earlier work
[GPUCNN].
The so-called DanNet
[GPUCNN1] [R6] broke several benchmark records.
In May 2011, DanNet was
the first deep CNN to win a computer vision competition [GPUCNN5] [GPUCNN3].
In August 2011, it was
the first to win a vision contest with superhuman performance
[GPUCNN5].
Our team kept winning vision contests in 2012 [GPUCNN5].
Subsequently, many researchers adopted this technique.
By May 2015, we had the first extremely deep
FNNs with more than 100 layers [HW1] (compare [HW2] [HW3]).
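To illustrate the relation spelled out under [HW2] in the reference list: a highway layer [HW1] sends its input through a gated mixture of a learned transformation and an identity skip path, and fixing both gates at 1 recovers the residual form y = h(x) + x. The fully connected sketch below is only an illustration under assumed dimensions and gate parameterization, not the original code.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One fully connected highway layer: y = t(x)*h(x) + c(x)*x  [HW1]."""
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # candidate transformation h(x)
        self.t = nn.Linear(dim, dim)   # transform gate t(x)
        self.c = nn.Linear(dim, dim)   # carry gate c(x); often coupled as c = 1 - t

    def forward(self, x):
        h = torch.tanh(self.h(x))
        t = torch.sigmoid(self.t(x))
        c = torch.sigmoid(self.c(x))
        # The gated identity path lets gradients flow through very many layers.
        return t * h + c * x

class ResidualLayer(nn.Module):
    """Special case with both gates fixed at 1: y = h(x) + x (compare [HW2])."""
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)

    def forward(self, x):
        return torch.tanh(self.h(x)) + x
```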
The original successes required a precise understanding of
the inner workings of GPUs [MLP1] [GPUCNN1].
Today, convenient software packages shield the user from such details.
Compute is roughly 100 times cheaper than a decade ago,
and many commercial NN applications are based on what started in 2010 [MLP1] [DL1-4] [DEC].
In this context it should be mentioned that, right before the 2010s, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs), whose basic architectures had been introduced over half a century earlier [MC43] [K56].
My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research
groups at TU Munich and the Swiss AI Lab IDSIA: Supervised LSTM RNNs (1990s-2005) [LSTM0-6]
(which overcome the famous
vanishing gradient problem
analyzed by my PhD student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006).
CTC-trained LSTM was the first RNN
to win international contests.
Compare Sec. 4 of [MIR] and Sec. A & B of [T20].
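For readers who want to see the CTC-LSTM combination in code, here is a minimal sketch using PyTorch's built-in nn.LSTM and nn.CTCLoss rather than the original implementations; the feature dimension, alphabet size, and sequence lengths are made-up placeholders.

```python
import torch
import torch.nn as nn

class CTCLSTMRecognizer(nn.Module):
    """Bidirectional LSTM emitting per-frame label probabilities for CTC."""
    def __init__(self, n_features=32, n_hidden=128, n_labels=27):  # 26 letters + blank
        super().__init__()
        self.lstm = nn.LSTM(n_features, n_hidden, num_layers=2, bidirectional=True)
        self.out = nn.Linear(2 * n_hidden, n_labels)

    def forward(self, x):                       # x: (time, batch, n_features)
        h, _ = self.lstm(x)
        return self.out(h).log_softmax(dim=-1)  # log-probabilities per frame

model = CTCLSTMRecognizer()
ctc_loss = nn.CTCLoss(blank=0)   # CTC sums over all alignments of labels to frames

T, N, S = 50, 4, 10                             # frames, batch size, target length
x = torch.randn(T, N, 32)                       # dummy feature sequences
targets = torch.randint(1, 27, (N, S))          # label indices 1..26 (0 = blank)
loss = ctc_loss(model(x), targets,
                input_lengths=torch.full((N,), T, dtype=torch.long),
                target_lengths=torch.full((N,), S, dtype=torch.long))
loss.backward()                                 # gradients for supervised training
```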
That is, by 2010, both our supervised FNNs and our supervised RNNs
were able to outperform
all other methods on important problems.
In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia,
with enormous impact on industry and daily life [DL4] [DEC].
However, it should be mentioned that
the conceptual roots of deep learning
reach back deep into the previous millennium [DEEP1-2] [DL1-2]
[MIR] (Sec. 21 & Sec. 19) [T20] (e.g., Sec. II & D).
Finally let me emphasize that the
supervised deep learning revolution of the 2010s did
not really kill all variants of unsupervised learning.
Many are still important.
For example, pre-trained language models are now heavily
used in the context of transfer learning, e.g., [TR2].
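As a concrete illustration of such transfer learning (a sketch assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint; none of this is part of [TR2] itself): an encoder pre-trained without labels is fine-tuned on a small supervised task by attaching a fresh classification head.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load weights obtained by unsupervised pre-training, plus a new classifier head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# One supervised fine-tuning step on a tiny toy batch.
batch = tokenizer(["an example sentence", "another one"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])
loss = model(**batch, labels=labels).loss
loss.backward()   # gradients flow into the pre-trained encoder weights
```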
Our active & generative unsupervised NNs, developed since 1990 [AC90-AC20], are still used to endow agents with artificial curiosity [MIR] (Sec. 5 & Sec. 6); see also GANs [AC20] [R2] [T20] (Sec. XVII), a special case of our adversarial NNs [AC90b].
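Here is a minimal sketch of the adversarial game behind GANs, for illustration only (network sizes, optimizers, and the toy 1-D data distribution are assumptions): a generator G learns to produce samples that a discriminator D cannot distinguish from real data, while D learns to tell them apart.

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 1))  # noise -> sample
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))  # sample -> logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(1000):
    real = torch.randn(64, 1) * 0.5 + 2.0      # toy "real" data around 2.0
    fake = G(torch.randn(64, 8))               # generated samples

    # Discriminator step: label real samples 1, generated samples 0.
    opt_d.zero_grad()
    loss_d = (bce(D(real), torch.ones(64, 1)) +
              bce(D(fake.detach()), torch.zeros(64, 1)))
    loss_d.backward()
    opt_d.step()

    # Generator step: try to make D classify generated samples as real.
    opt_g.zero_grad()
    loss_g = bce(D(fake), torch.ones(64, 1))
    loss_g.backward()
    opt_g.step()
```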
Unsupervised learning still has a bright future!
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint (1 March 2010).
[Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.]
[MNI]
Y. LeCun (1998). The MNIST database of handwritten digits.
Link.
[DEC] J. Schmidhuber (2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.
[DL1] J. Schmidhuber, 2015.
Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.
[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ...
HTML.
[VID1] G. Hinton. The Next Generation of Neural Networks. YouTube video [see 28:16]. GoogleTechTalk, 2007. [Quote: "Nobody in their right mind would ever suggest" using plain backpropagation for training deep networks. But in 2010, our paper [MLP1] showed that unsupervised pre-training is not necessary to train deep feedforward nets.]
[T20] J. Schmidhuber (2020). Critique of 2018 Turing Award: http://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html
[MC43]
W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.
[K56]
S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
[More on the Fundamental Deep Learning Problem.]
[LSTM0]
S. Hochreiter and J. Schmidhuber.
Long Short-Term Memory.
TR FKI-207-95, TUM, August 1995.
PDF.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
PDF.
[The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
PDF.
[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009.
PDF.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006.
PDF.
[HW2] K. He, X. Zhang, S. Ren, J. Sun. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). [Residual nets are a special case of highway nets [HW1], with g(x)=1 (a typical highway net initialization) and t(x)=1.] More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arXiv:1612.07771 (2016). Also at ICLR 2017.
[GPUNN]
K.-S. Oh, K. Jung (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]
[GPUCNN]
K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint (1 Feb 2011).
[Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.]
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
[An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NNs (depth > 1000).]
[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95.
Augustinus, 1993.
[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507, 2006. PDF.
[UN5] Raina, R., Madhavan, A., and Ng, A. (2009).
Large-scale deep unsupervised learning using graphics processors.
In Proc. ICML 26, p 873-880, ACM.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
This report introduced several concepts that are now widely used: planning with recurrent world models ([MIR], Sec. 11), high-dimensional reward signals as extra NN inputs / general value functions ([MIR], Sec. 13), deterministic policy gradients ([MIR], Sec. 14), and unsupervised NNs that are both generative and adversarial ([MIR], Sec. 5), used for Artificial Curiosity and related concepts.
[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
Based on [AC90].
More.
[AC91b]
J. Schmidhuber.
Curious model-building control systems.
Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
PDF.
[AC95]
J. Storck, S. Hochreiter, and J. Schmidhuber.
Reinforcement-driven information acquisition in non-deterministic
environments.
In Proc. ICANN'95, vol. 2, pages 159-164.
EC2 & CIE, Paris, 1995.
PDF.
[AC97]
J. Schmidhuber.
What's interesting?
Technical Report IDSIA-35-97, IDSIA, July 1997.
[AC99]
J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z.
Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999.
[AC02]
J. Schmidhuber.
Exploring the Predictable.
In A. Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
PDF.
[AC06]
J. Schmidhuber.
Developmental Robotics,
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2): 173-187, 2006.
PDF.
[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
[AC11]
Y. Sun, F. Gomez, J. Schmidhuber.
Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments.
In Proc. Fourth Conference on Artificial General Intelligence (AGI-11),
Google, Mountain View, California, 2011.
PDF.
[AC13]
J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
Preprint (2011):
arXiv:1112.5309 [cs.AI]
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
[BPB]
A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
PDF.
See also BIT 16, 146-160, 1976.
Link.
[The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.]
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
[First application of backpropagation [BP1] to neural networks. Extending preliminary thoughts in his 1974 thesis.]
[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[BP6]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
[See Section 3.1 on using gradient descent for learning in multilayer networks.]
[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. [First working Deep Learners with many layers, learning internal representations.]
[DEEP1a]
Ivakhnenko, A. G. (1968). The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control, 13:43-55.
[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.