Critique of Honda Prize for Dr. Hinton

Jürgen Schmidhuber
We must stop crediting the wrong people for inventions made by others. Instead, let us heed the recent call in the journal Nature: "Let 2020 be the year in which we value those who ensure that science is self-correcting." [SV20] As those who know me can testify, finding and citing the original sources of scientific and technological innovations is important to me, whether they are mine or other people's [DL1] [DL2] [NASC1-9]. The present page is offered as a resource for members of the machine learning community who share this inclination. I also invite others to contribute additional relevant references. In grounding research in its true intellectual foundations, I do not mean to diminish important contributions made by others. My goal is to encourage the entire community to be more scholarly in its efforts and to recognize foundational work that sometimes gets lost in the frenzy of modern AI and machine learning.

Here I will focus on six false and/or misleading attributions of credit to Dr. Hinton in the press release of the 2019 Honda Prize [HON]. For each claim there is a paragraph (I, II, III, IV, V, VI) labeled "Honda," followed by a critical comment labeled "Critique." Reusing material and references from recent blog posts [MIR] [DEC], I will point out that Hinton's most visible publications failed to mention essential relevant prior work; this may explain some of Honda's misattributions.

Executive Summary. Hinton has made significant contributions to artificial neural networks (NNs) and deep learning, but Honda credits him for fundamental inventions of others whom he did not cite. Science must not allow corporate PR to distort the academic record. Sec. I: Modern backpropagation was created by Linnainmaa (1970), not by Rumelhart & Hinton & Williams (1985). Ivakhnenko's deep feedforward nets (since 1965) learned internal representations long before Hinton's shallower ones (1980s). Sec. II: Hinton's unsupervised pretraining for deep NNs in the 2000s was conceptually a rehash of my unsupervised pretraining for deep NNs in 1991. And it was irrelevant for the deep learning revolution of the early 2010s, which was mostly based on supervised learning; twice my lab spearheaded the shift from unsupervised pretraining to pure supervised learning (1991-95 and 2006-11). Sec. III: The first superior end-to-end neural speech recognition was based on two methods from my lab: LSTM (1990s-2005) and CTC (2006). Hinton et al. (2012) still used an old hybrid approach of the 1980s and 90s, and did not compare it to the revolutionary CTC-LSTM (which was soon on most smartphones). Sec. IV: Our group at IDSIA had superior award-winning computer vision through deep learning (2011) before Hinton's (2012). Sec. V: Hanson (1990) had a variant of "dropout" long before Hinton (2012). Sec. VI: In the 2010s, most major AI-based services across the world (speech recognition, language translation, etc.) on billions of devices were mostly based on our deep learning techniques, not on Hinton's. Repeatedly, Hinton omitted references to fundamental prior art (Sec. I & II & III & V) [DL1] [DL2] [DLC] [MIR] [R4-R8]. However, as Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."

I. Honda: "Dr. Hinton has created a number of technologies that have enabled the broader application of AI, including the backpropagation algorithm that forms the basis of the deep learning approach to AI."

Critique: Hinton and his co-workers have made certain significant contributions to deep learning, e.g., [BM] [CDI] [RMSP] [TSNE] [CAPS]. However, the claim above is plainly wrong. He was the second of three authors of an article on backpropagation [RUM] (1985) which failed to mention that three years earlier, Paul Werbos had proposed to train neural networks (NNs) with this method (1982) [BP2].
And the article [RUM] even failed to mention Seppo Linnainmaa, the inventor of this famous algorithm for credit assignment in networks [BP1] (1970), also known as the "reverse mode of automatic differentiation." (In 1960, Kelley already had a precursor thereof in the field of control theory [BPA]; compare [BPB] [BPC].) See also [R7]. By 1985, compute had become about 1,000 times cheaper than in 1970, and desktop computers had become accessible in some academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs [RUM]. But this was essentially just an experimental analysis of a known method [BP1] [BP2]. And the authors [RUM] did not cite the prior art [DLC]. (BTW, Honda [HON] claims over 60,000 academic references to [RUM], which seems exaggerated [R5].) More on the history of backpropagation can be found at Scholarpedia [DL2] and in my award-winning survey [DL1].

The first successful method for learning useful internal representations in hidden layers of deep nets was published two decades before [RUM]. In 1965, Ivakhnenko & Lapa had the first general, working learning algorithm for deep multilayer perceptrons with arbitrarily many layers (also with multiplicative gates, which have since become popular) [DEEP1-2] [DL1] [DL2]. Ivakhnenko's paper of 1971 [DEEP2] already described a deep learning feedforward net with 8 layers, much deeper than those of 1985 [RUM], trained by a highly cited method which was still popular in the new millennium [DL2], especially in Eastern Europe, where much of machine learning was born. (Ivakhnenko did not call it an NN, but that's what it was.) Hinton has never cited this, not even in his recent survey [DLC]. Compare [MIR] (Sec. 1) [R8].

Note that there is a misleading "history of deep learning" propagated by Hinton and co-authors, e.g., Sejnowski [S20]. It goes more or less like this: In 1958, there was "shallow learning" in NNs without hidden layers [R58].
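As a technical aside, the method at the center of Sec. I, Linnainmaa's 1970 reverse mode of automatic differentiation [BP1], can be sketched in a few lines of present-day code. The following minimal scalar version is purely illustrative (the class and variable names are mine, not from any of the cited papers): the forward pass records local derivatives on a small graph, and a single reverse sweep accumulates the chain-rule products.

```python
import math

class Var:
    """A scalar node in a computation graph, supporting reverse-mode
    automatic differentiation (the core mechanism of backpropagation)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # list of (parent_var, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def tanh(self):
        t = math.tanh(self.value)
        return Var(t, [(self, 1.0 - t * t)])

    def backward(self):
        # One reverse sweep: visit nodes in reverse topological order
        # and accumulate chain-rule products into each parent's grad.
        topo, seen = [], set()
        def visit(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    visit(p)
                topo.append(v)
        visit(self)
        self.grad = 1.0
        for v in reversed(topo):
            for p, local in v.parents:
                p.grad += v.grad * local

# Toy "neuron": y = tanh(w*x + b); one sweep yields dy/dw, dy/dx, dy/db.
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = (w * x + b).tanh()
y.backward()
print(y.value, w.grad, x.grad, b.grad)
```

The property that made the method famous, already present in [BP1], is that one reverse sweep delivers the gradients with respect to all inputs at roughly the cost of a single forward pass, regardless of the number of inputs.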
In 1969, Minsky & Papert [M69] showed that such NNs are very limited, "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s" [S20]. However, "shallow learning" (through linear regression and the method of least squares) has actually existed since about 1800 (Gauss & Legendre [DL1] [DL2]). Ideas from the early 1960s on deeper adaptive NNs [R61] [R62] did not get very far, but by 1965, deep learning worked [DEEP1-2] [DL2] [R8]. So the 1969 book [M69] addressed a "problem" that had already been solved for four years. (Maybe Minsky really did not know; he should have known, though.)

II. Honda: "In 2002, he introduced a fast learning algorithm for restricted Boltzmann machines (RBM) that allowed them to learn a single layer of distributed representation without requiring any labeled data. These methods allowed deep learning to work better and they led to the current deep learning revolution."

Critique: No, Hinton's interesting unsupervised [CDI] pretraining for deep NNs (e.g., [UN4]) was irrelevant for the current deep learning revolution. In 2010, our team showed that deep feedforward NNs (FNNs) can be trained by plain backpropagation and do not at all require unsupervised pretraining for important applications [MLP1]; see Sec. 2 of [DEC]. This was achieved by greatly accelerating traditional FNNs on highly parallel graphics processing units (GPUs). Subsequently, in the early 2010s, this type of unsupervised pretraining was largely abandoned in commercial applications; see [MIR], Sec. 19. Apart from this, Hinton's unsupervised pretraining for deep FNNs (2000s, e.g., [UN4]) was conceptually a rehash of my unsupervised pretraining for deep recurrent NNs (RNNs) (1991) [UN0-UN3], which he did not cite.
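To make the pretraining dispute concrete: the core idea of the 1991 scheme is that each level of the hierarchy need only deal with whatever the level below failed to predict, thereby shortening the description the higher level must process. Here is a deliberately tiny illustration of that principle (a lookup-table predictor stands in for the RNN predictors of the original method; nothing below is the 1991 algorithm itself, and all names are mine):

```python
def compress(sequence):
    """Pass upward only the symbols the lower level failed to predict.
    A last-successor lookup table stands in for a learned predictor."""
    table = {}        # last observed successor of each symbol
    surprises = []    # the shorter description seen by the next level up
    prev = None
    for sym in sequence:
        if prev is not None:
            if table.get(prev) != sym:   # prediction failed: a "surprise"
                surprises.append((prev, sym))
            table[prev] = sym            # update the predictor
        prev = sym
    return surprises

# A regular sequence becomes almost fully predictable after one cycle,
# so the level above sees a far shorter sequence of surprises:
seq = list("abcabcabcabcabc")
upward = compress(seq)
print(len(seq), len(upward))
```

In the hierarchical setting, the same trick is applied again to the surprise stream at the next level, each level further reducing the description length (or negative log probability) of the data.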
Hinton's 2006 justification was essentially the one I used for my stack of RNNs called the neural history compressor [UN1-2]: each higher level in the NN hierarchy tries to reduce the description length (or negative log probability) of the data representation in the level below. (BTW, [UN1-2] also introduced the concept of "compressing" or "collapsing" or "distilling" one NN into another, another technique later reused by Hinton without citing it; see Sec. 2 of [MIR] and [R4].) By 1993, my method was able to solve previously unsolvable "Very Deep Learning" tasks of depth > 1000 [UN2] [DL1]. See [MIR], Sec. 1: First Very Deep NNs, Based on Unsupervised Pre-Training (1991). (See also our 1996 work on unsupervised neural probabilistic models of text [SNT] and on unsupervised pretraining of FNNs through adversarial NNs [PM2].) Then, however, we replaced the history compressor by the even better, purely supervised LSTM; see Sec. III. That is, twice my lab spearheaded a shift from unsupervised to supervised learning (which dominated the deep learning revolution of the early 2010s [DEC]). See [MIR], Sec. 19: From Unsupervised Pre-Training to Pure Supervised Learning (1991-95 & 2006-11).

III. Honda: "In 2009, Dr. Hinton and two of his students used multilayer neural nets to make a major breakthrough in speech recognition that led directly to greatly improved speech recognition."

Critique: This is very misleading. See Sec. 1 of [DEC]: The first superior end-to-end neural speech recogniser that outperformed the state of the art was based on two methods from my lab: (1) Long Short-Term Memory (LSTM, 1990s-2005) [LSTM0-6] (overcoming the famous vanishing gradient problem first analysed by my student Sepp Hochreiter in 1991 [VAN1]); (2) Connectionist Temporal Classification [CTC] (my student Alex Graves et al., 2006). Our team successfully applied CTC-trained LSTM to speech in 2007 [LSTM4] (also with hierarchical LSTM stacks [LSTM14]).
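Since LSTM recurs throughout Sec. III and VI, a minimal sketch of what happens inside a single LSTM cell may help. This scalar, single-unit toy version with made-up weights is my own simplification for illustration; real cells are vector-valued and their weights are learned:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W):
    """One forward step of a single-unit LSTM cell (scalar toy version).
    W maps each gate name to an (input weight, recurrent weight, bias) triple."""
    def gate(name, squash):
        wx, wh, b = W[name]
        return squash(wx * x + wh * h_prev + b)
    i = gate("input",  sigmoid)    # how much of the candidate to write
    f = gate("forget", sigmoid)    # how much of the old cell state to keep
    o = gate("output", sigmoid)    # how much of the cell state to expose
    g = gate("cell",   math.tanh)  # candidate update
    c = f * c_prev + i * g         # nearly additive cell state: this is what
                                   # lets error signals survive long delays
    h = o * math.tanh(c)
    return h, c

# Arbitrary toy weights; the large forget bias keeps the cell state around.
W = {"input": (1.0, 0.0, 0.0), "forget": (0.0, 0.0, 4.0),
     "output": (1.0, 0.0, 0.0), "cell": (1.0, 0.0, 0.0)}
h, c = 0.0, 0.0
for x in [1.0, 0.0, 0.0, 0.0]:     # one input "blip", then silence
    h, c = lstm_step(x, h, c, W)
# After three silent steps, c still carries most of the blip's information.
print(h, c)
```

The nearly additive update of the cell state c is the architectural answer to the vanishing gradient problem analysed in [VAN1]: with the forget gate near 1, the stored value (and the gradient flowing through it) decays only slowly across time steps.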
This was very different from previous hybrid methods since the late 1980s, which combined NNs and traditional approaches such as Hidden Markov Models (HMMs), e.g., [BW] [BRI] [BOU]. Hinton et al. (2009-2012) still used the old hybrid approach [HYB12]. They did not compare their hybrid to CTC-LSTM. Alex later reused our superior end-to-end neural approach [LSTM4] [LSTM14] as a postdoc in Hinton's lab [LSTM8]. By 2015, when compute had become cheap enough, CTC-LSTM dramatically improved Google's speech recognition [GSR] [GSR15] [DL4]. This was soon on almost every smartphone. Google's on-device speech recognition of 2019 (no longer on the server) was still based on LSTM. See [MIR], Sec. 4.

IV. Honda: "In 2012, Dr. Hinton and two more students revolutionized computer vision by showing that deep learning worked far better than the existing state-of-the-art for recognizing objects in images."

Critique: See Sec. 2 of [DEC] (relevant parts repeated here for convenience): The basic ingredients of the computer vision revolution through convolutional NNs (CNNs) were developed by Fukushima (1979), Waibel (1987), LeCun (1989), Weng (1993), and others since the 1970s [CNN1-4]. A success of Hinton's team (ImageNet, Dec 2012) [GPUCNN4] was mostly due to GPUs used to speed up CNNs (they also used Malsburg's ReLUs [CMB] and a variant of Hanson's rule [Drop1] without citation; see Sec. V). However, the first superior award-winning GPU-based CNN was created earlier, in 2011, by our team in Switzerland (my postdoc Dan Ciresan et al.) [GPUCNN1,3,5] [R6]. Our deep and fast CNN, sometimes called "DanNet," was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated CNNs [GPUCNN]. Already in 2011, it showed "that deep learning worked far better than the existing state-of-the-art for recognizing objects in images."
In fact, it won 4 important computer vision competitions in a row between May 15, 2011, and September 10, 2012 [GPUCNN5], before the similar GPU-accelerated CNN of Hinton's student Krizhevsky won the ImageNet 2012 contest [GPUCNN4-5] [R6]. At IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest (where a team of Hinton's frequent co-author LeCun took second place). Even the NY Times mentioned this. DanNet was also the first deep CNN to win: a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), a contest on object detection in large images (ICPR, 10 Sept 2012), and, at the same time, a medical imaging contest on cancer detection. All before ImageNet 2012 [GPUCNN4-5] [R6]. Our CNN image scanners were 1,000 times faster than previous methods [SCAN]. The tremendous importance for health care etc. is obvious. Today, IBM, Siemens, Google, and many startups are pursuing this approach. Much of modern computer vision is extending the work of 2011, e.g., [MIR], Sec. 19.

V. Honda: "To achieve their dramatic results, Dr. Hinton also invented a widely used new method called 'dropout' which reduces overfitting in neural networks by preventing complex co-adaptations of feature detectors."

Critique: However, "dropout" is actually a variant of Hanson's much earlier stochastic delta rule (1990) [Drop1]. Hinton's 2012 paper [GPUCNN4] did not cite this. Apart from this, already in 2011 we showed that dropout is not necessary to win computer vision competitions and achieve superhuman results; see Sec. IV above. Back then, the only really important task was to make CNNs deep and fast on GPUs [GPUCNN1,3,5] [R6]. (Today, dropout is rarely used for CNNs.)

VI. Honda: "Of the countless AI-based technological services across the world, it is no exaggeration to say that few would have been possible without the results Dr. Hinton created."
Critique: Name one that would NOT have been possible! Most famous AI applications are based on results created by others. Here is a representative list of our contributions, taken from Sec. 1 and Sec. 2 of [DEC]:

1. Computer vision. See Sec. IV, V above, and Sec. 2 of [DEC].

2. Speech recognition. See Sec. III above, and Sec. 1 of [DEC].

3. Language processing. The first superior end-to-end neural machine translation was also based on our LSTM. In 1995, we already had excellent neural probabilistic models of text [SNT]. In 2001, we showed that our LSTM can learn languages unlearnable by traditional models such as HMMs [LSTM13]. That is, a neural "subsymbolic" model suddenly excelled at learning "symbolic" tasks. Compute still had to get 1,000 times cheaper, but by 2016-17, both Google Translate [GT16] [WU] (which mentions LSTM over 50 times) and Facebook Translate [FB17] were based on two connected LSTMs [S2S], one for incoming texts, one for outgoing translations; much better than what existed before [DL4]. By 2017, Facebook's users made 30 billion LSTM-based translations per week [FB17] [DL4]. Compare: the most popular YouTube video needed 2 years to achieve only 6 billion clicks.

4. Connected handwriting recognition. Already in 2009, through the efforts of Alex, CTC-LSTM [CTC] [LSTM16] became the first recurrent NN (RNN) to win international competitions, namely, three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).

5. Robotics. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics, e.g., [LSTMRL] [RPG]. In the 2010s, combinations of RL and LSTM have become standard. For example, in 2018, an RL LSTM was the core of OpenAI's famous Dactyl, which learned to control a dexterous robot hand without a teacher [OAI1] [OAI1a].

6. Video games. In 2019, DeepMind famously beat a pro player in the game of StarCraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar, whose brain has a deep LSTM core trained by RL [DM3].
An RL LSTM (with 84% of the model's total parameter count) was also the core of the famous OpenAI Five, which learned to defeat human experts in the Dota 2 video game (2018) [OAI2] [OAI2a]. See [MIR], Sec. 4.

In the recent decade of deep learning, all of 2-6 above depended on our LSTM. And there are innumerable additional LSTM applications, ranging from healthcare & chemistry & molecule design to stock market prediction and self-driving cars [DEC]. By 2016, more than a quarter of the power of all those Tensor Processing Units in Google's datacenters was used for LSTM (only 5% for CNNs) [JOU17]. Apparently, [LSTM1] has become the most cited AI and NN research paper of the 20th century [R5]. By 2019, it got more citations per year than any other computer science paper of the 20th century [DEC]. The current record holder of the 21st century [HW2] [R5] is also related to LSTM, since ResNet [HW2] (Dec 2015) is a special case of our Highway Net (May 2015) [HW1], the feedforward net version of vanilla LSTM [LSTM2] and the first working, really deep feedforward NN with over 100 layers. (Admittedly, however, citations are a highly questionable measure of true impact [NAT1].)

7. Medical imaging etc. Some of the most important NN applications are in healthcare. In 2012, our Deep Learner was the first to win a medical imaging contest (on cancer detection), before ImageNet 2012 [GPUCNN5] [R6]. Similarly for materials science and quality control: already in 2010, we introduced our deep and fast GPU-based NNs to ArcelorMittal, the world's largest steel maker, and were able to greatly improve steel defect detection [ST]. This may have been the first deep learning breakthrough in heavy industry.

There are many other early applications of our deep learning methods, which were frequently used by Hinton. Our additional priority disputes with Hinton include: compressing / distilling one NN into another [MIR] (Sec. 2), learning sequential attention with NNs [MIR] (Sec. 9), fast weights through outer products [MIR] (Sec. 8), unsupervised pretraining for deep NNs [MIR] (Sec. 1), and other topics. Compare [R4].

Concluding Remarks

Dr. Hinton and co-workers have made certain significant contributions to NNs and deep learning, e.g., [BM] [CDI] [RMSP] [TSNE] [CAPS]. But his most visible work (lauded by Honda) popularized methods created by other researchers whom he did not cite. As emphasized earlier [DLC]: "The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)." It is a sign of our field's immaturity that popularizers are sometimes still credited for inventions of others. Honda should correct this. Else others will. Science must not allow corporate PR to distort the academic record. The same goes for certain scientific journals, which "need to make clearer and firmer commitments to self-correction" [SV20].

Unfortunately, Hinton's frequent failures to credit essential prior work by others cannot serve as a role model for PhD students, who are told by their advisors to perform meticulous research on prior art and to avoid at all costs the slightest hint of plagiarism.

Yes, this critique is also an implicit critique of certain other awards to Dr. Hinton. It is also related to some of the most popular posts and comments of 2019 at reddit/ml, the largest machine learning forum, with over 800k subscribers. See, e.g., posts [R4-R8] influenced by [MIR] (although my name is frequently misspelled there). Note that I insist on proper credit assignment not only in my own research field but also in quite disconnected areas, as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation [NASC1-2], the telephone [NASC3], the computer [NASC4-7], resilient robots [NASC8], and scientists of the 19th century [NASC9].
At least in science, by definition, the facts will always win in the end. As long as the facts have not yet won, it is not yet the end. (No fancy award can ever change that.) As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."

Edit of 4/24/2020: Reply to Dr. Hinton's Reply

Dr. Hinton's response to a relevant post on Reddit/ML [R11] is copied below. I am inserting answers marked by "Reply." Summary: The facts presented in Sec. I, II, III, IV, V, VI still stand.

Dr. Hinton: Having a public debate with Schmidhuber about academic credit is not advisable because it just encourages him and there is no limit to the time and effort that he is willing to put into trying to discredit his perceived rivals.

Reply: This is apparently an ad hominem argument [AH3] [AH2], true to the motto: "If you cannot dispute a fact-based message, attack the messenger himself." Obviously, I am not "discrediting" others (e.g., popularisers) by crediting the inventors.

Dr. Hinton: He has even resorted to tricks like having multiple aliases in Wikipedia to make it look as if other people are agreeing with what he says.

Reply: Another ad hominem attack, which I reject. (Many of my web pages do, however, encourage others through this statement: "The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.")

Dr. Hinton: The page on his website about Alan Turing is a nice example of how he goes about trying to diminish other people's contributions.

Reply: This is yet another fact-free comment that has nothing to do with the contents of my post. Nevertheless, I'll take the bait and respond (skip this reply if you are not interested in this deviation from the topic).
I believe that my web pages on Kurt Gödel (the founder of theoretical computer science in 1931 [GOD]) and Alan Turing paint an accurate picture of the origins of our field (also crediting important pioneers ignored by certain movies about Turing). As always, in the interest of self-correcting science [SV20], I'll be happy to correct the pages based on evidence. But what exactly should I correct? Here is a brief summary: Both Gödel and the American computer science pioneer Alonzo Church (1935) [CHU] were cited by Turing, who published later (1936) [TUR]. Gödel introduced the first universal coding language (based on the integers). He used it to represent both data (such as axioms and theorems) and programs (such as proof-generating sequences of operations on the data). He famously constructed formal statements that talk about the computation of other formal statements, especially self-referential statements which state that they are not provable by any computational theorem prover. Thus he exhibited the fundamental limits of mathematics, computing, and Artificial Intelligence [GOD]. Compare [MIR] (Sec. 18). Church (1935) extended Gödel's result to the famous Entscheidungsproblem (decision problem) [CHU], using his alternative universal language called the Lambda Calculus, the basis of LISP. Later, Turing introduced yet another universal model (the Turing Machine) to do the same (1936) [TUR]. Nevertheless, although he was standing on the shoulders of others, Turing was certainly one of the most important computer science pioneers.

Dr. Hinton: Despite my own best judgement, I feel that I cannot leave his charges completely unanswered so I am going to respond once and only once. I have never claimed that I invented backpropagation. David Rumelhart invented it independently long after people in other fields had invented it. It is true that when we first published we did not know the history so there were previous inventors that we failed to cite.
What I have claimed is that I was the person to clearly demonstrate that backpropagation could learn interesting internal representations and that this is what made it popular. I did this by forcing a neural net to learn vector representations for words such that it could predict the next word in a sequence from the vector representations of the previous words. It was this example that convinced the Nature referees to publish the 1986 paper. It is true that many people in the press have said I invented backpropagation and I have spent a lot of time correcting them. Here is an excerpt from the 2018 book by Martin Ford entitled "Architects of Intelligence": "Lots of different people invented different versions of backpropagation before David Rumelhart. They were mainly independent inventions and it's something I feel I have got too much credit for. I've seen things in the press that say that I invented backpropagation, and that is completely wrong. It's one of these rare cases where an academic feels he has got too much credit for something! My main contribution was to show how you can use it for learning distributed representations, so I'd like to set the record straight on that."

Reply: This is finally a response related to my post. However, it does not at all contradict what I wrote in the relevant Sec. I. It is true that Dr. Hinton credited in 2018 his co-author Rumelhart [RUM] with the "invention" of backpropagation [AOI]. But neither in [AOI] nor in his 2015 survey [DL3] did he mention Linnainmaa (1970) [BP1], the true inventor of this efficient algorithm for applying the chain rule to networks with differentiable nodes [BP4]. It should be mentioned that [DL3] does cite Werbos (1974), who, however, described the method correctly only later, in 1982 [BP2], and who also failed to cite [BP1]. Linnainmaa's method was well-known, e.g., [BP5] [DL1] [DL2] [DLC].
It wasn't created by "lots of different people" but by exactly one person, who published first [BP1] and therefore should get the credit. (Sec. I above also mentions the method's precursors [BPA] [BPB] [BPC].) Dr. Hinton accepted the Honda Prize although he apparently agrees that Honda's claims (e.g., Sec. I) are false. He should ask Honda to correct their statements.

Dr. Hinton: Maybe Juergen would like to set the record straight on who invented LSTMs?

Reply: This question again deviates from what's in my post. Nevertheless, I'll happily respond: See [MIR], Sec. 3 and Sec. 4, on the fundamental contributions of my former student Sepp Hochreiter in his 1991 diploma thesis [VAN1], which I called "one of the most important documents in the history of machine learning." (Sec. 4 also mentions later great contributions by other students, including Felix Gers and Alex Graves.)

To summarize, Dr. Hinton's comments and ad hominem arguments diverge from the contents of my post and do not challenge any of the facts presented in Sec. I, II, III, IV, V, VI. The facts still stand.

Acknowledgments

Thanks to several expert reviewers for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.

References (mostly from [DEC])

[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. (Containing most references cited above. For convenience also appended below.)

[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.

[HON] Honda Prize, Sept 20, 2019. WWW link. PDF. Local copy.

[Drop1] Hanson, S. J. (1990). A Stochastic Version of the Delta Rule. Physica D, 42, 265-272.
(Compare preprint arXiv:1808.03578 on dropout as a special case, 2018.)

[CMB] C. v. d. Malsburg (1973). Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. [See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.]

[HAH] Hahnloser et al. (2000). Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405, 2000.

[MAL] Malik, J. and Perona, P. (1990). Preattentive texture discrimination with early vision mechanisms. Journal of the Optical Society of America A, 7(5):923-932.

[BM] D. Ackley, G. Hinton, T. Sejnowski (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1):147-169.

[CDI] G. E. Hinton (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771-1800.

[RMSP] T. Tieleman, G. E. Hinton (2012). Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26-31.

[TSNE] L. van der Maaten, G. E. Hinton (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579-2605.

[CAPS] S. Sabour, N. Frosst, G. E. Hinton (2017). Dynamic routing between capsules. Proc. NIPS 2017, pp. 3856-3866.

[RUM] D. E. Rumelhart, G. E. Hinton, R. J. Williams (1985). Learning Internal Representations by Error Propagation. TR No. ICS-8506, California Univ San Diego La Jolla Inst for Cognitive Science. Later version published as: Learning representations by back-propagating errors. Nature, 323, p. 533-536 (1986).

[MC43] W. S. McCulloch, W. Pitts (1943). A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133.

[K56] S. C. Kleene (1956). Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C. E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J.

[R58] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386.

[R61] Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.

[R62] Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.

[M69] M. Minsky, S. Papert (1969). Perceptrons. MIT Press, Cambridge, MA.

[S20] T. Sejnowski (2020). The unreasonable effectiveness of deep learning in artificial intelligence. PNAS, January 28, 2020. Link.

[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421, p 689, Feb 2003.

[NASC2] J. Schmidhuber. Zooming in on aviation history. Correspondence, Nature, vol 566, p 39, 7 Feb 2019.

[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.

[NASC4] J. Schmidhuber. Turing: Keep his work in perspective. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.

[NASC5] J. Schmidhuber. Turing in Context. Letter, Science, vol 336, p 1639, June 2012. (On Gödel, Zuse, Turing.) See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639a).

[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441, p 25, May 2006.

[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429, p 501, June 2004.

[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825, p 688, May 2007.

[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.

[MIR] J. Schmidhuber (2019). Deep Learning: Our Miraculous Year 1990-1991. (Containing most references cited above and in [DEC]. For convenience also appended below.)

[BW] H. Bourlard, C. J. Wellekens (1989). Links between Markov models and multilayer perceptrons. NIPS 1989, p. 502-510.

[BRI] Bridle, J. S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden Markov Model Interpretation. Speech Communication, vol. 9, no. 1, pp. 83-92.

[BOU] H. Bourlard, N. Morgan (1993). Connectionist speech recognition. Kluwer, 1993.

[HYB12] Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97.

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.

[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton.

[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.

[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.

[DL1] J. Schmidhuber (2015). Deep Learning in neural networks: An overview. Neural Networks, 61, 85-117. More.

[DL2] J. Schmidhuber (2015). Deep Learning. Scholarpedia, 10(11):32832.

[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature, 521, 436-444. HTML.

[DL4] J. Schmidhuber (2017). Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.

[DLC] J. Schmidhuber (2015). Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436). June 2015. HTML.

[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p 529-533, 26 Feb. 2015. Link.

[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019. [AlphaStar has a "deep LSTM core."]

[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.

[OAI1a] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).

[OAI2] OpenAI et al. (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arXiv:1912.06680. [An LSTM composes 84% of the model's total parameter count.]

[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthroughs in the History of AI. Towards Data Science, 2018. [An LSTM was the core of OpenAI Five.]

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J.S.). PDF. [More on the Fundamental Deep Learning Problem.]

[VAN2] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), p 157-166, 1994.

[VAN3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001. PDF.

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" that everybody is using today, e.g., in Google's Tensorflow.]

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN'09, Cyprus, 2009. PDF.

[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013. PDF.

[LSTM9] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].

[LSTM10] A. Graves, D. Eck, N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.

[LSTM11] N. Beringer, A. Graves, F. Schiel, J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl.
Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575581, SpringerVerlag Berlin Heidelberg, 2005. [LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 17951802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award. [LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):13331340, 2001. PDF. [LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774779, Hyderabad, India, 2007 (talk). PDF. [LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, p 545552, Vancouver, MIT Press, 2009. PDF. [S2S] I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 31043112. [CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF. [GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk  Google Speech Team. Google voice search: faster and more accurate. Google Research Blog, Sep 2015, see also Aug 2015 [GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015 [WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016. 
[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016.
[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also Facebook blog by J. M. Pino, A. Sidorov, N. F. Ayan (August 3, 2017).
[LSTMRL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.
[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.
[HW1] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS'2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) Resnets [HW2] are a special case of this where g(x)=t(x)=const=1. More.
[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of highway nets [HW1], with g(x)=1 (a typical highway net initialization) and t(x)=1. More.
[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arXiv:1612.07771 (2016). Also at ICLR 2017.
[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. Preprint arXiv:1704.04760.
[CNN1] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position - Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. [The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. More in Scholarpedia.]
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. [First application of backpropagation [BP1][BP2] and weight-sharing to a convolutional architecture.]
[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.
[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition. Neural Computation, 1(4):541-551, 1989. PDF.
[CNN3] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. [A CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging [CNN1].]
[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007.
[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. [Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.]
[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011. PDF. HTML overview. [First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor. This led to massive interest from industry.]
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p. 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012. PDF.
[GPUCNN5] J. Schmidhuber. History of computer vision contests won by deep CNNs on GPU. March 2017. HTML. [How IDSIA used GPU-based CNNs to win four important computer vision competitions 2011-2012 before others started using similar approaches.]
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011. PDF.
[SCAN] J. Masci, A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.
[ST] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012. PDF.
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv preprint. [Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.]
[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. [Extending thoughts in his 1974 thesis.]
[BP4] J. Schmidhuber. Who invented backpropagation? More in [DL2].
[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF.
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF. [First working Deep Learner based on a deep RNN hierarchy, overcoming the vanishing gradient problem. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills; such approaches are now widely used. More.]
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. [An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NNs can be found here. Plus lots of additional material and images related to other refs in the present page.]
[UN3] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.
[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF.
[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996. PDF. More.
[SNT] J. Schmidhuber, S. Heil (1996). Sequential neural text compression. IEEE Trans. Neural Networks, 1996. PDF. (An earlier version appeared at NIPS 1995.)
[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. [First working Deep Learners with many layers, learning internal representations.]
[DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011. HTML.
[GOD] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931.
[CHU] A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the American Mathematical Society. Also in American Journal of Mathematics, 58(2), 345-363 (1 Apr 1936).
[TUR] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267. Received 28 May 1936. Errata appeared in Series 2, 43, pp. 544-546 (1937).
[AOI] M. Ford. Architects of Intelligence: The truth about AI from the people building it. Packt Publishing, 2018. (Preface to German edition by J. Schmidhuber.)
[AH2] F. H. van Eemeren, B. Garssen, B. Meuffels. The disguised abusive ad hominem empirically investigated: Strategic manoeuvring with direct personal attacks. Journal Thinking & Reasoning, Vol. 18, 2012, Issue 3, p. 344-364. Link.
[AH3] D. Walton (PhD Univ. Toronto, 1972), 1998. Ad hominem arguments. University of Alabama Press.
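As a concrete aside on the entry for [HW1] above: the highway layer computation g(x)x + t(x)h(x), and the residual special case noted under [HW2], can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation; the coupling g(x) = 1 - t(x) used below is one common design choice, and all weight names are made up for the example.

```python
import math

def _matvec(W, v, b):
    # Affine map W v + b for plain-list weights.
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi
            for row, bi in zip(W, b)]

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt):
    """Highway layer [HW1]: y = g(x)*x + t(x)*h(x), elementwise,
    with h = tanh(Wh x + bh), transform gate t = sigmoid(Wt x + bt),
    and carry gate g = 1 - t (a common coupling of the two gates)."""
    h = [math.tanh(z) for z in _matvec(Wh, x, bh)]
    t = [_sigmoid(z) for z in _matvec(Wt, x, bt)]
    return [(1.0 - ti) * xi + ti * hi for xi, ti, hi in zip(x, t, h)]

def residual_layer(x, Wh, bh):
    """Residual layer [HW2] as the special case g(x) = t(x) = 1."""
    h = [math.tanh(z) for z in _matvec(Wh, x, bh)]
    return [xi + hi for xi, hi in zip(x, h)]
```

With a strongly negative gate bias bt, the transform gate shuts and the layer approximates the identity map, which is what lets gradients pass unchanged through very deep stacks (the feedforward analogue of LSTM forget gates [LSTM2]).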
