2010: Breakthrough of supervised deep learning. No unsupervised pre-training. The rest is history.
In 2020, we are celebrating the 10-year anniversary of our publication [MLP1] in Neural Computation (2010) on deep multilayer perceptrons trained by plain gradient descent on GPU. Surprisingly, our simple but unusually deep supervised artificial neural network (NN) outperformed all previous methods on the then-famous machine learning benchmark MNIST. That is, by 2010, when compute was 100 times more expensive than today, both our feedforward NNs and our earlier recurrent NNs (e.g., CTC-LSTM for connected handwriting recognition) were able to beat all competing algorithms on important problems of that time. In the 2010s, this deep learning revolution quickly spread from Europe to North America and Asia.
Just one decade ago, many thought that deep NNs could not learn much without unsupervised pre-training, a technique introduced by me in 1991 [UN0-UN3][UN] and later also championed by others, e.g., [UN4-5][VID1][T20][T22]. In fact, it was claimed [VID1] that "nobody in their right mind would ever suggest" using plain gradient descent through backpropagation [BP1] (see also [BPA-C][BP2-6][R7]) to train feedforward NNs (FNNs) with many layers of neurons.
However, in March 2010, our team with my outstanding Romanian
postdoc Dan Ciresan [MLP1]
showed that deep FNNs
can indeed be trained by plain backpropagation
for important applications.
This required neither unsupervised pre-training nor Ivakhnenko's incremental layer-wise training of 1965 [DEEP1-2].
By the standards of 2010, our supervised NN had many layers.
It set a new performance record [MLP1] on
the then-famous and widely used image recognition benchmark MNIST [MNI].
This was achieved by greatly accelerating traditional multilayer perceptrons on highly parallel graphics processing units (GPUs), going beyond the important GPU work of Oh & Jung (2004) [GPUNN].
A reviewer called this a
"wake-up call to the machine learning community."
Our results set the stage for the recent decade of deep learning [DEC]. In February 2011, our team extended the approach to deep Convolutional NNs (CNNs) [GPUCNN1]. This
greatly improved earlier work
[GPUCNN].
The so-called DanNet
[GPUCNN1][R6] broke several benchmark records [DAN].
In May 2011, DanNet was
the first deep CNN to win a computer vision competition [GPUCNN5,3].
In August 2011, it was
the first to win a vision contest with superhuman performance
[GPUCNN5][DAN1].
Our team kept winning vision contests in 2012 [GPUCNN5].
Subsequently, many researchers adopted this technique.
By May 2015, we had the first extremely deep
FNNs with more than 100 layers [HW1] (compare [HW2][HW3]).
The original successes required a precise understanding of
the inner workings of GPUs [MLP1][GPUCNN1].
Today, convenient software packages shield the user from such details.
Compute is roughly 100 times cheaper than a decade ago,
and many commercial NN applications are based on what started in 2010 [MLP1-2][DL1-4][DEC].
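The extremely deep FNNs of 2015 mentioned above were Highway Nets [HW1]. Here is a rough sketch of a single highway layer (PyTorch-style, my own simplified notation, not the original code): a transform gate t(x) and a carry gate g(x) decide how much of the nonlinear transformation h(x) and how much of the unchanged input are passed on, which is what keeps stacks of 100+ layers trainable by plain gradient descent; fixing t(x)=1 and g(x)=1 yields the residual update later popularized by ResNets [HW2].

```python
# Sketch of a highway layer: y = t(x) * h(x) + g(x) * x.
# Simplified illustration; gate parameterization and sizes are arbitrary.
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)   # nonlinear transformation
        self.t = nn.Linear(dim, dim)   # transform gate
        self.g = nn.Linear(dim, dim)   # carry gate

    def forward(self, x):
        h = torch.tanh(self.h(x))
        t = torch.sigmoid(self.t(x))   # in (0, 1)
        g = torch.sigmoid(self.g(x))   # in (0, 1)
        # With t(x)=1 and g(x)=1 this becomes the residual update h(x) + x [HW2].
        return t * h + g * x

# Gating lets gradients pass through very deep stacks of such layers:
net = nn.Sequential(*[HighwayLayer(64) for _ in range(100)])
out = net(torch.randn(8, 64))
```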
In this context it should be mentioned that, shortly before 2010, our team had already achieved another breakthrough in supervised deep learning with the more powerful recurrent NNs (RNNs), whose basic architectures date back to the 1920s [L20][I25][K41][MC43][W45][K56][AMH1-2].
My PhD student Alex Graves won three connected handwriting competitions (French, Farsi, Arabic) at ICDAR 2009, the famous conference on document analysis and recognition. He used a combination of two methods developed in my research
groups at TU Munich and the Swiss AI Lab IDSIA: Supervised LSTM RNNs (1990s-2005) [LSTM0-6]
(which overcome the famous
vanishing gradient problem
analyzed by my PhD student Sepp Hochreiter [VAN1] in 1991) and Connectionist Temporal Classification [CTC] (2006).
CTC-trained LSTM was the first RNN
to win international contests.
Compare Sec. 4 of [MIR] and
Sec. A & B & XVII of [T22].
That is, by 2010, both our supervised FNNs and our supervised RNNs
were able to outperform
all other methods on important problems.
In the 2010s, this supervised deep learning revolution quickly spread from Europe to North America and Asia,
with enormous impact on industry and daily life [DL4][DEC][MOST].
However, it should be mentioned that
the conceptual roots of deep learning
reach far back into the previous millennium [DEEP1-2][DL1-2][MIR](Sec. 21 & Sec. 19) [T20][T22](e.g., Sec. II & D).
Finally, let me emphasize that the above-mentioned supervised deep learning revolutions of the early 1990s (for recurrent NNs) [MIR] and of 2010 (for feedforward NNs) [MLP1-2] did not at all kill unsupervised learning. For example, unsupervised pre-training of language models is now heavily used by Transformers, which excel at the traditional LSTM domain of Natural Language Processing [TR1-6] (although there are still many language tasks that LSTM can quickly learn to solve [LSTM13] while plain Transformers cannot).
Remarkably,
Transformers with linearized self-attention were also first published [FWP0-7] in
our
Annus Mirabilis of 1990-1991 [MIR][MOST],
together with
unsupervised pre-training for deep learning
[UN-UN3].
And our unsupervised generative adversarial NNs (developed since 1990) [AC90-AC20][PLAN][AC] are still used to endow agents with artificial curiosity [MIR](Sec. 5 & Sec. 6); see also a version of our adversarial NNs [AC90b] now called GANs [AC20][R2][PLAN][MOST][T22](Sec. XVII). Unsupervised learning still has a bright future!
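To make the above-mentioned link between linearized self-attention and the Fast Weight Programmers of 1991 [FWP0-7][FWP6] a bit more tangible, here is a minimal toy sketch (my own notation and dimensions, omitting normalization and the slow net that would generate keys, values, and queries): the fast weight matrix is updated by additive outer products of values and keys, and a query then retrieves from it.

```python
# Toy sketch of linear (unnormalized) self-attention as fast weight programming.
# Dimensions and random vectors are placeholders for a slow net's outputs.
import torch

d_k, d_v = 16, 16
W_fast = torch.zeros(d_v, d_k)            # fast weights, (re)programmed at run time

def write(W_fast, k, v):
    """Additive outer-product update: the fast net is 'programmed' with (key, value)."""
    return W_fast + torch.outer(v, k)

def read(W_fast, q):
    """Retrieve with a query; no softmax, hence 'linearized' self-attention."""
    return W_fast @ q

# Store three (key, value) pairs, then answer one query.
for _ in range(3):
    k, v = torch.randn(d_k), torch.randn(d_v)
    W_fast = write(W_fast, k, v)
y = read(W_fast, torch.randn(d_k))
```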
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.
[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training.
By 2010, when compute was 100 times more expensive than today, both our feedforward NNs[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, revised 2021). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The deep learning neural networks of our team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in our "Annus Mirabilis" 1990-1991 at TU Munich.
[MOST]
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in my labs at TU Munich and IDSIA. Here I mention: (1) Long Short-Term Memory (LSTM), (2) ResNet (which is our earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on our similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) Generative Adversarial Networks (an instance of my earlier
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to my earlier Fast Weight Programmers).
Most of this started with our
Annus Mirabilis of 1990-1991.[MIR]
[MNI]
Y. LeCun (1998). The MNIST database of handwritten digits.
Link.
[AMH1]
S. I. Amari (1972).
Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions on Computers, C-21, 1197-1206, 1972.
PDF.
First published learning RNN.
First publication of what was later sometimes called the Hopfield network[AMH2] or Amari-Hopfield Network.
[AMH2]
J. J. Hopfield (1982). Neural networks and physical systems with emergent
collective computational abilities. Proc. of the National Academy of Sciences,
vol. 79, pages 2554-2558, 1982.
The Hopfield network or Amari-Hopfield Network was published in 1972 by Amari.[AMH1]
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. We had both hard attention[ATT0-2] (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[DL1] J. Schmidhuber, 2015.
Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019
on-device speech recognition
(on the phone, not the server)
is still based on LSTM.
[FWP]
J. Schmidhuber (AI Blog, 26 March 2021).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
30-year anniversary of a now popular
alternative[FWP0-1] to recurrent NNs.
A slow feedforward NN learns by gradient descent to program the changes of
the fast weights of
another NN.
Such Fast Weight Programmers[FWP0-7] can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns[FWP0-1]
(now often called keys and values for self-attention[TR1-6]).
The similar Transformers[TR1-2] combine this with projections
and softmax and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
Transformers with linearized self-attention[TR5-6]
which are formally equivalent to the 1991 Fast Weight Programmers (apart from normalization).
In 1993, I introduced
the attention terminology[FWP2] now used
in this context,[ATT] and
extended the approach to
RNNs that program themselves.
[FWP0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
PDF.
First paper on fast weight programmers: a slow net learns by gradient descent to compute weight changes of a fast net.
[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992.
PDF.
HTML.
Pictures (German).
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
PDF.
First recurrent fast weight programmer based on outer products. Introduced the terminology of learning "internal spotlights of attention."
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143. PDF.
[FWP5]
F. J. Gomez and J. Schmidhuber.
Evolving modular fast-weight networks for control.
In W. Duch et al. (Eds.):
Proc. ICANN'05,
LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005.
PDF.
HTML overview.
Reinforcement-learning fast weight programmer.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
Preprint: arXiv:2106.06295 (June 2021).
[VID1] G. Hinton.
The Next Generation of Neural Networks.
YouTube video [see 28:16].
GoogleTechTalk, 2007.
Quote: "Nobody in their right mind would ever suggest"
using plain backpropagation for training deep networks.
But in 2010, our paper [MLP1] showed
that
unsupervised pre-training is not necessary
to train deep feedforward nets.
[T20] J. Schmidhuber (2020). Critique of 2018 Turing Award: http://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21 (v3), IDSIA, Lugano, Switzerland, 22 June 2022.
[I25]
E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925.
First non-learning recurrent NN architecture: the Lenz-Ising model.
[K41]
H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.
[W45]
G. H. Wannier (1945).
The Statistical Problem in Cooperative Phenomena.
Rev. Mod. Phys. 17, 50.
[K56]
S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.
[L20]
W. Lenz (1920). Beiträge zum Verständnis der magnetischen
Eigenschaften in festen Körpern. Physikalische Zeitschrift, 21:
613-615.
[MC43]
W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
[More on the Fundamental Deep Learning Problem.]
[LSTM0]
S. Hochreiter and J. Schmidhuber.
Long Short-Term Memory.
TR FKI-207-95, TUM, August 1995.
PDF.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
PDF.
[The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
PDF.
[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 21 (NIPS), p. 545-552, Vancouver, MIT Press, 2009.
PDF.
[LSTM13]
F. A. Gers and J. Schmidhuber.
LSTM Recurrent Networks Learn Simple Context Free and
Context Sensitive Languages.
IEEE Transactions on Neural Networks 12(6):1333-1340, 2001.
PDF.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006.
PDF.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015). Residual nets are a version of highway nets [HW1], with
g(x)=1 (a typical highway net initialization) and t(x)=1.
More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arxiv:1612.07771 (2016). Also at ICLR 2017.
[DAN]
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.[DAN1]
Now everybody is using this approach.
[DAN1]
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
At the IJCNN 2011 computer vision competition in Silicon Valley,
our artificial neural network called DanNet performed twice as well as humans, three times as well as the closest artificial competitor, and six times as well as the best non-neural method.
[GPUNN]
Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]
[GPUCNN]
K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint (1 Feb 2011).
[Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.]
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN5]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[UN]
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised pre-training. Unsupervised hierarchical predictive coding finds compact internal representations of sequential data to facilitate downstream learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
[An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NNs can be found here (depth > 1000).]
[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95.
Augustinus, 1993.
[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504 - 507, 2006. PDF.
[UN5] Raina, R., Madhavan, A., and Ng, A. (2009).
Large-scale deep unsupervised learning using graphics processors.
In Proc. ICML 26, p 873-880, ACM.
[AC]
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.
[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
This report
introduced a whole bunch of concepts that are now widely used:
Planning with recurrent world models
([MIR], Sec. 11),
high-dimensional reward signals as extra NN inputs / general value functions
([MIR], Sec. 13),
deterministic policy gradients
([MIR], Sec. 14),
unsupervised NNs that are both generative and adversarial
([MIR], Sec. 5), for Artificial Curiosity and related concepts.
[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
Based on [AC90].
More.
[AC91b]
J. Schmidhuber.
Curious model-building control systems.
Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
PDF.
[AC95]
J. Storck, S. Hochreiter, and J. Schmidhuber.
Reinforcement-driven information acquisition in non-deterministic
environments.
In Proc. ICANN'95, vol. 2, pages 159-164.
EC2 & CIE, Paris, 1995.
PDF.
[AC97]
J. Schmidhuber.
What's interesting?
Technical Report IDSIA-35-97, IDSIA, July 1997.
[AC99]
J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z.
Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999.
[AC02]
J. Schmidhuber.
Exploring the Predictable.
In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
PDF.
[AC06]
J. Schmidhuber.
Developmental Robotics,
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2): 173-187, 2006.
PDF.
[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
[AC11]
Y. Sun, F. Gomez, J. Schmidhuber.
Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments.
In Proc. Fourth Conference on Artificial General Intelligence (AGI-11),
Google, Mountain View, California, 2011.
PDF.
[AC13]
J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
Preprint (2011):
arXiv:1112.5309 [cs.AI]
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
[BPB]
A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
PDF.
See also BIT 16, 146-160, 1976.
Link. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
First application of backpropagation[BP1] to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.[DL2]
[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[BP6]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
See Section 3.1 on using gradient descent for learning in multilayer networks.
[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.
[DEEP1a]
Ivakhnenko, A. G. (1968). The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13: 43-55.
[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.
[TR4]
M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.
[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.
[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.