Who Invented Backpropagation?
Efficient backpropagation (BP) is central to the ongoing
Neural Network (NN) ReNNaissance and "Deep Learning."
Who invented it?
BP's modern version (also called the reverse mode of automatic differentiation) was first published in 1970
by Finnish master student Seppo Linnainmaa [BP1] [R7]. In 2020, we are celebrating BP's half-century anniversary!
A precursor of BP was published by Henry J. Kelley in 1960
[BPA]—in 2020, we are celebrating its 60-year anniversary.
As of 2019, it was still easy to find misleading accounts of BP's history [HIN] [T20]. I had a look at the original papers from the 1960s and 70s, and talked to BP pioneers. Here is a summary derived from my
2014 survey [DL1]
which includes most of the references mentioned below.
The minimisation of errors through gradient descent (Cauchy 1847, Hadamard, 1908) in the parameter space of complex, nonlinear, differentiable, multi-stage, NN-related systems has been discussed at least since the early 1960s,
e.g., Kelley (1960) [BPA]; Bryson (1961) [BPB]; Bryson and Denham, (1961); Pontryagin et al. (1961); Dreyfus (1962) [BPC]; Wilkinson (1965); Amari (1967, 1977) [BP6]; Bryson and Ho (1969); Director and Rohrer (1969), initially within the framework of Euler-LaGrange equations in the Calculus of Variations, e.g., Euler (1744).
Steepest descent in the weight space of such systems can be performed (Kelley, 1960 [BPA]; Bryson, 1961) by iterating the chain rule (Leibniz, 1676; L'Hopital, 1696) à la Dynamic Programming (DP, e.g., Bellman, 1957). A simplified derivation of this backpropagation method uses the chain rule only (Dreyfus, 1962) [BPC].
The systems of the 1960s were already efficient in the DP sense. However, they backpropagated derivative information through standard Jacobian matrix calculations from one "layer" to the previous one, without explicitly addressing either direct links across several layers or potential additional efficiency gains due to network sparsity.
Explicit, efficient error backpropagation (BP) in arbitrary, discrete, possibly sparsely connected, NN-like networks apparently was first described in a 1970 master's thesis (Linnainmaa, 1970, 1976) [BP1] [R7], albeit without reference to NNs. This kind of BP is also known as the reverse mode of automatic differentiation (e.g., Griewank, 2012 [BP5]), where the costs of forward activation spreading essentially equal the costs of backward derivative calculation. See early BP FORTRAN code (Linnainmaa, 1970) [BP1] and closely related but slightly later work (Ostrovskii et al., 1971). As of 2020, all modern software packages for NNs (such as Google's Tensorflow) are based on Linnainmaa's method of 1970.
BP was soon explicitly used to minimize cost functions by adapting control parameters (weights) (Dreyfus, 1973). This was followed by some preliminary, NN-specific discussion (Werbos, 1974, section 5.5.1) and a computer program for automatically deriving and implementing BP in differentiable systems (Speelpenning, 1980).
The first NN-specific application of efficient BP as above was apparently described by Werbos in 1982 [BP2] (but not yet in his 1974 thesis, as is sometimes claimed). However, compare Amari's paper of 1977
Similar work was published several years later (Parker, 1985; LeCun, 1985).
By 1985, compute was about 1,000 times cheaper than in 1970 [BP1], and
the first desktop computers
became accessible in wealthier academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs (Rumelhart et al., 1986) [RUM]. This was essentially an experimental analysis of a known method [BP1][BP2] [HIN] [T20].
Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes—see Sec. XII of [T20]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].
Compare also the first deep learning multilayer perceptrons (the GMDH networks; Ivakhnenko et al., since 1965) whose layers are incrementally grown and trained by regression analysis [DEEP1-2] [R8], as well as a more recent method for multilayer threshold NNs (Bobrowski, 1978).
Thanks to several expert reviewers for useful comments. Since science is about self-correction, let me know under firstname.lastname@example.org if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References (more in [DL1])
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
See also BIT 16, 146-160, 1976.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
[Extending thoughts in his 1974 thesis.]
[BP4] J. Schmidhuber (2014).
Who invented backpropagation?
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
[See Section 3.1 on using gradient descent for learning in multilayer networks.]
B. Speelpenning (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD
thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.
[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation. TR No. ICS-8506, California Univ San Diego La Jolla Inst for Cognitive Science. Later version published as:
Learning representations by back-propagating errors. Nature, 323, p. 533-536 (1986).
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. [First working Deep Learners with many layers, learning internal representations.]
Ivakhnenko, Alexey Grigorevich. The group method of data of handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
[DL2] J. Schmidhuber, 2015.
[DLC] J. Schmidhuber, 2015. Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436). June 2015.
[HIN] J. Schmidhuber (2020). Critique of 2019 Honda Prize.
[T20] J. Schmidhuber (June 2020). Critique of 2018 Turing Award.
[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.
[MIR] J. Schmidhuber (2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint
[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.