Who Invented Backpropagation?
Jürgen Schmidhuber, 2014 (updated 2015)
Pronounce: You_again Shmidhoobuh
Efficient backpropagation (BP) is central to the ongoing
Neural Network (NN) ReNNaissance and "Deep Learning."
Who invented it?
Its modern version (also called automatic differentiation) was first published in 1970
by Finnish master student Seppo Linnainmaa. Important
concepts of BP were known even earlier though.
It is easy to find misleading accounts of BP's history (as of July 2014). I had a look at the original papers from the 1960s and 70s, and talked to BP pioneers. Here is a summary derived from my
survey (2014),
which has additional references:
The minimisation of errors through gradient descent (Cauchy 1847, Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multistage, NNrelated systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969), initially within the framework of EulerLaGrange equations in the Calculus of Variations (e.g., Euler, 1744).
Steepest descent in the weight space of such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the chain rule (Leibniz, 1676; L'Hopital, 1696) à la Dynamic Programming (DP, Bellman, 1957). A simplified derivation of this backpropagation method uses the chain rule only (Dreyfus, 1962).
The systems of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, without explicitly addressing either direct links across several layers or potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NNlike networks apparently was first described in a 1970 master's thesis (Linnainmaa, 1970, 1976), albeit without reference to NNs. BP is also known as the reverse mode of automatic differentiation (e.g., Griewank, 2012), where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early BP FORTRAN code (Linnainmaa, 1970) and closely related work (Ostrovskii et al., 1971).
BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). This was followed by some preliminary, NNspecific discussion (Werbos, 1974, section 5.5.1), and a computer program for automatically deriving and implementing BP for any given differentiable system (Speelpenning, 1980).
To my knowledge, the first NNspecific application of efficient BP as above was described by Werbos (1982). Related work was published several years later (Parker, 1985; LeCun, 1985). When computers had become 10,000 times faster per Dollar and much more accessible than those of 19601970, a paper of 1986 significantly contributed to the popularisation of BP for NNs (Rumelhart et al., 1986), experimentally demonstrating the emergence of useful internal representations in hidden layers.
Compare also the first adaptive, deep, multilayer perceptrons (the GMDH networks; Ivakhnenko et al., since 1965), whose layers are incrementally grown and trained by regression analysis, as well as a more recent method for multilayer threshold NNs (Bobrowski, 1978).
Precise references and more history in:
J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, 61, p 85117, 2015.
(Based on 2014 TR with 88 pages and 888 references, with PDF & LATEX source & complete public BIBTEX file).
J. Schmidhuber.
Deep Learning.
Scholarpedia, 10(11):32832, 2015.
See also this Google+ post
and
backprop history in a nutshell at the
AMA (Ask Me Anything) on
reddit.
The contents of this site may be used for educational and noncommercial purposes, including articles for Wikipedia and similar sites.
Overview web sites with lots of additional details and papers on Deep Learning
[A] 1991: Fundamental Deep Learning Problem discovered and analysed: in standard NNs, backpropagated error gradients tend to vanish or explode. More
[B] Our first Deep Learner of 1991 (RNN stack pretrained in unsupervised fashion): More, also under www.deeplearning.me
[C] 2009: First recurrent Deep Learner to win international competitions with secret test sets: deep LSTM recurrent neural networks [H] won three connected handwriting contests at ICDAR 2009 (French, Arabic, Farsi), performing simultaneous segmentation and recognition. More
[D] Deep Learning 19912013  our deep NNs have, so far, won 9 important contests in pattern recognition, image segmentation, object detection.
More, also under www.deeplearning.it
[E] 2011: First superhuman visual pattern recognition in an official international competition (with secret test set known only to the organisers)  twice better than humans, three times better than the closest artificial NN competitor, six times better than the best nonneural method. More
[F] 2012: First Deep Learner to win a contest on object detection in large images: our deep NNs won both the ICPR 2012 Contest and the MICCAI 2013 Grand Challenge on Mitosis Detection (important for cancer prognosis etc, perhaps the most important application area of Deep Learning). More
[G] 2012: First Deep Learner to win a pure image segmentation competition: our deep NNs won the ISBI'12 Brain Image Segmentation Challenge (relevant for the billion Euro brain projects in EU and US). More
[H] Deep LSTM recurrent NNs since 1995: More
[I] Deep Evolving NNs: More
[J] Deep Reinforcement Learning NNs: More
[K] Compressed NN Search for Huge RNNs: More
.
