Who Invented Backpropagation?

Jürgen Schmidhuber, 2014 (updated 2015)
Pronounce: You_again Shmidhoobuh

Efficient backpropagation (BP) is central to the ongoing Neural Network (NN) ReNNaissance and "Deep Learning." Who invented it?

Its modern version (also called the reverse mode of automatic differentiation) was first published in 1970 by the Finnish master's student Seppo Linnainmaa.

Important concepts of BP were known even earlier, though, and it is easy to find misleading accounts of BP's history (as of July 2014). I had a look at the original papers from the 1960s and 70s and talked to BP pioneers. Here is a summary derived from my survey (2014), which has additional references:

The minimisation of errors through gradient descent (Cauchy, 1847; Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s (e.g., Kelley, 1960; Bryson, 1961; Bryson and Denham, 1961; Pontryagin et al., 1961; Dreyfus, 1962; Wilkinson, 1965; Amari, 1967; Bryson and Ho, 1969; Director and Rohrer, 1969), initially within the framework of Euler-Lagrange equations in the Calculus of Variations (e.g., Euler, 1744).
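
For concreteness, a minimal sketch of the update rule (the notation is mine, not taken from the papers above): gradient descent repeatedly adjusts a parameter vector w of a differentiable error function E by stepping against the gradient, with some step size (learning rate) eta:

    \[
    w^{(t+1)} \;=\; w^{(t)} \;-\; \eta \, \nabla_{w} E\bigl(w^{(t)}\bigr), \qquad \eta > 0 .
    \]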

Steepest descent in the weight space of such systems can be performed (Bryson, 1961; Kelley, 1960; Bryson and Ho, 1969) by iterating the chain rule (Leibniz, 1676; L'Hôpital, 1696) à la Dynamic Programming (DP, Bellman, 1957). A simplified derivation of this backpropagation method uses the chain rule only (Dreyfus, 1962).
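
In the multi-stage setting of these papers, the chain-rule iteration can be sketched roughly as follows (again my notation, a simplified summary rather than any particular author's formulation): for stages x_{k+1} = f_k(x_k, w_k) with terminal cost E(x_N), a backward recursion over adjoint vectors lambda_k yields all weight gradients:

    \[
    \lambda_N = \nabla_{x_N} E, \qquad
    \lambda_k = \Bigl(\tfrac{\partial f_k}{\partial x_k}\Bigr)^{\!\top} \lambda_{k+1}, \qquad
    \tfrac{\partial E}{\partial w_k} = \Bigl(\tfrac{\partial f_k}{\partial w_k}\Bigr)^{\!\top} \lambda_{k+1}.
    \]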

The systems of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, without explicitly addressing either direct links across several layers or potential additional efficiency gains due to network sparsity (but perhaps such enhancements seemed obvious to the authors).
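
To illustrate the layer-by-layer Jacobian style in modern terms (a minimal NumPy sketch of my own, with hypothetical function names, not historical code): the backward pass multiplies the error derivative by each layer's Jacobian, i.e. the transposed weight matrix and the derivative of the nonlinearity, one layer at a time.

    import numpy as np

    def forward(ws, bs, x):
        """Forward pass through dense tanh layers; store activations and pre-activations."""
        acts, zs = [x], []
        for W, b in zip(ws, bs):
            z = W @ acts[-1] + b
            zs.append(z)
            acts.append(np.tanh(z))
        return acts, zs

    def backward(ws, zs, acts, dE_dy):
        """Propagate dE/dy backwards, layer by layer, via each layer's Jacobian."""
        grads, delta = [], dE_dy
        for W, z, a_prev in zip(reversed(ws), reversed(zs), reversed(acts[:-1])):
            delta = delta * (1.0 - np.tanh(z) ** 2)   # through the tanh nonlinearity
            grads.append(np.outer(delta, a_prev))     # dE/dW for this layer
            delta = W.T @ delta                       # Jacobian transpose: back to the previous layer
        return list(reversed(grads))                  # gradients ordered from first to last layer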

Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NN-like networks apparently was first described in a 1970 master's thesis (Linnainmaa, 1970, 1976), albeit without reference to NNs. BP is also known as the reverse mode of automatic differentiation (e.g., Griewank, 2012), where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early BP FORTRAN code (Linnainmaa, 1970) and closely related work (Ostrovskii et al., 1971).
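
The reverse mode itself can be sketched with a tiny tape-based scalar example (a toy illustration in present-day style, with names of my choosing, not Linnainmaa's FORTRAN): the forward sweep records each operation's inputs and local partial derivatives, and a single backward sweep over that record accumulates all derivatives of the output, at a cost of the same order as the forward sweep.

    class Var:
        """Scalar node in a toy tape-based reverse-mode AD sketch."""
        def __init__(self, value, parents=()):
            self.value = value
            self.parents = parents   # pairs (parent node, local partial derivative)
            self.grad = 0.0

    def add(a, b):
        return Var(a.value + b.value, [(a, 1.0), (b, 1.0)])

    def mul(a, b):
        return Var(a.value * b.value, [(a, b.value), (b, a.value)])

    def reverse_sweep(out):
        """Visit nodes in reverse topological order; the chain rule accumulates d(out)/d(node)."""
        order, seen = [], set()
        def topo(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent, _ in node.parents:
                    topo(parent)
                order.append(node)
        topo(out)
        out.grad = 1.0
        for node in reversed(order):
            for parent, local in node.parents:
                parent.grad += local * node.grad

    # Example: f(x, y) = (x + y) * x at x = 3, y = 2; df/dx = 2x + y = 8, df/dy = x = 3
    x, y = Var(3.0), Var(2.0)
    f = mul(add(x, y), x)
    reverse_sweep(f)
    print(f.value, x.grad, y.grad)   # 15.0 8.0 3.0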

BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). This was followed by some preliminary, NN-specific discussion (Werbos, 1974, section 5.5.1), and a computer program for automatically deriving and implementing BP for any given differentiable system (Speelpenning, 1980).
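
In today's terms, such cost minimisation amounts to coupling the backward sweep with a gradient-descent weight update. A schematic sketch, reusing the hypothetical forward and backward functions from the Jacobian sketch above, with arbitrary illustrative sizes and learning rate:

    import numpy as np

    rng = np.random.default_rng(0)
    ws = [rng.standard_normal((4, 3)) * 0.5, rng.standard_normal((2, 4)) * 0.5]
    bs = [np.zeros(4), np.zeros(2)]
    x, target = rng.standard_normal(3), np.array([0.5, -0.5])

    learning_rate = 0.1                        # illustrative value only
    for step in range(100):
        acts, zs = forward(ws, bs, x)          # forward sweep (see the sketch above)
        dE_dy = acts[-1] - target              # gradient of a squared-error cost w.r.t. the output
        grads = backward(ws, zs, acts, dE_dy)  # backward sweep yields dE/dW per layer
        ws = [W - learning_rate * g for W, g in zip(ws, grads)]   # biases left fixed for brevity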

To my knowledge, the first NN-specific application of efficient BP as above was described by Werbos (1982). Related work was published several years later (Parker, 1985; LeCun, 1985). When computers had become 10,000 times faster per dollar and much more accessible than those of 1960-1970, a 1986 paper significantly contributed to the popularisation of BP for NNs (Rumelhart et al., 1986), experimentally demonstrating the emergence of useful internal representations in hidden layers.

Compare also the first adaptive, deep, multilayer perceptrons (the GMDH networks; Ivakhnenko et al., since 1965), whose layers are incrementally grown and trained by regression analysis, as well as a more recent method for multilayer threshold NNs (Bobrowski, 1978).
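
For contrast, here is a rough sketch of the incremental GMDH idea in present-day code (a simplified illustration with names of my choosing, not Ivakhnenko's exact procedure): each candidate unit is a quadratic polynomial of two inputs fitted by least squares, and only the candidates that do best on held-out data survive as inputs to the next layer; layers are added as long as the held-out error keeps improving.

    import numpy as np
    from itertools import combinations

    def poly_features(xi, xj):
        """Quadratic polynomial basis of two input columns, as in classic GMDH units."""
        return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])

    def grow_layer(X_train, y_train, X_val, y_val, keep=4):
        """Fit one candidate unit per input pair; keep the `keep` best on validation data."""
        candidates = []
        for i, j in combinations(range(X_train.shape[1]), 2):
            coef, *_ = np.linalg.lstsq(poly_features(X_train[:, i], X_train[:, j]),
                                       y_train, rcond=None)
            val_err = np.mean((poly_features(X_val[:, i], X_val[:, j]) @ coef - y_val) ** 2)
            candidates.append((val_err, i, j, coef))
        best = sorted(candidates, key=lambda c: c[0])[:keep]
        to_layer = lambda X: np.column_stack(
            [poly_features(X[:, i], X[:, j]) @ coef for _, i, j, coef in best])
        return to_layer(X_train), to_layer(X_val), min(c[0] for c in best)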

Precise references and more history in:

J. Schmidhuber. Deep Learning in Neural Networks: An Overview. Neural Networks, 61, pp. 85-117, 2015. (Based on a 2014 technical report with 88 pages and 888 references, with PDF & LaTeX source & complete public BibTeX file.)

J. Schmidhuber. Deep Learning. Scholarpedia, 10(11):32832, 2015.

See also this Google+ post and backprop history in a nutshell at the AMA (Ask Me Anything) on Reddit.

The contents of this site may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.


Overview web sites with lots of additional details and papers on Deep Learning

[A] 1991: Fundamental Deep Learning Problem discovered and analysed: in standard NNs, backpropagated error gradients tend to vanish or explode (a small numerical sketch follows after this list). More

[B] Our first Deep Learner of 1991 (RNN stack pre-trained in unsupervised fashion): More, also under www.deeplearning.me

[C] 2009: First recurrent Deep Learner to win international competitions with secret test sets: deep LSTM recurrent neural networks [H] won three connected handwriting contests at ICDAR 2009 (French, Arabic, Farsi), performing simultaneous segmentation and recognition. More

[D] Deep Learning 1991-2013 - our deep NNs have, so far, won 9 important contests in pattern recognition, image segmentation, and object detection. More, also under www.deeplearning.it

[E] 2011: First superhuman visual pattern recognition in an official international competition (with secret test set known only to the organisers) - performing twice as well as humans, three times as well as the closest artificial NN competitor, and six times as well as the best non-neural method. More

[F] 2012: First Deep Learner to win a contest on object detection in large images: our deep NNs won both the ICPR 2012 Contest and the MICCAI 2013 Grand Challenge on Mitosis Detection (important for cancer prognosis etc.; perhaps the most important application area of Deep Learning). More

[G] 2012: First Deep Learner to win a pure image segmentation competition: our deep NNs won the ISBI'12 Brain Image Segmentation Challenge (relevant for the billion-Euro brain projects in the EU and US). More

[H] Deep LSTM recurrent NNs since 1995: More

[I] Deep Evolving NNs: More

[J] Deep Reinforcement Learning NNs: More

[K] Compressed NN Search for Huge RNNs: More
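
Regarding [A] above, a tiny numerical sketch (my own illustration, with arbitrary example weight scales): the backward pass multiplies the error signal by one Jacobian per layer, so its norm tends to shrink or grow roughly geometrically with depth, depending on whether the layer Jacobians are contracting or expanding.

    import numpy as np

    rng = np.random.default_rng(1)
    depth, width = 50, 32
    delta0 = rng.standard_normal(width)              # error signal at the topmost layer

    for scale, label in [(0.5, "small weights"), (2.0, "large weights")]:
        delta = delta0.copy()
        for _ in range(depth):
            W = rng.standard_normal((width, width)) * scale / np.sqrt(width)
            delta = W.T @ delta                      # one backward step through a linear layer
        print(label, np.linalg.norm(delta))          # vanishes for 0.5, explodes for 2.0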
