Critique of 2018 Turing Award for Bengio & Hinton & LeCun


Jürgen Schmidhuber (25 June 2020)
Pronounce: You_again Shmidhoobuh
@SchmidhuberAI

Abstract. ACM's 2018 A.M. Turing Award was about deep learning in artificial neural networks. ACM lauds the awardees for work based on algorithms and conceptual foundations first published by other researchers whom the awardees failed to cite (see Executive Summary and Sec. I, V, II, XII, XIX, XXI, XIII, XIV, XX, XVII). ACM explicitly mentions "astonishing" deep learning breakthroughs in 4 fields: (A) speech recognition, (B) natural language processing, (C) robotics, (D) computer vision, as well as "powerful" new deep learning tools in 3 fields: (VII) medicine, astronomy, materials science. Most of these breakthroughs and tools, however, were directly based on the results of my own labs in the past 3 decades (e.g., Sec. A, B, C, D, VII, XVII, VI, XVI). I correct ACM's distortions of deep learning history (e.g., Sec. II, V, XX, XVIII) and also mention 8 of our direct priority disputes with Bengio & Hinton (Sec. XVII, I).

This document (~11,000 words) reuses and expands some of the material in my Critique of the 2019 Honda Prize [HIN] (~3,000 words). It has several layers of hierarchical abstraction: Abstract (150 words), Executive Summary with links to details (~1,000 words), Body with 21 comments on 21 claims by ACM (~7,700 words) and Conclusion (~1,700 words). All backed up by over 200 references (~6,500 words).

We must stop crediting the wrong people for inventions made by others. Instead, let's heed the recent call in the journal Nature: "Let 2020 be the year in which we value those who ensure that science is self-correcting" [SV20]. As those who know me can testify, finding and citing original sources of scientific and technological innovations is important to me, whether they are mine or other people's [DL1] [DL2] [HIN] [NASC1-9]. The present page is offered as a resource for computer scientists who share this inclination. By grounding research in its true intellectual foundations and crediting the original inventors, I am not diminishing important contributions made by popularizers of those inventions. My goal is to encourage the entire community to be more scholarly in its efforts, to recognize the foundational work that sometimes gets lost in the frenzy of modern AI and machine learning, and to fight plagiarism in all of its more or less subtle forms. I am also inviting others to contribute additional relevant references (send them to juergen@idsia.ch).

I will focus on contributions praised by ACM's official justification [T19] of the 2018 A.M. Turing Award for Drs. Bengio & Hinton & LeCun [R1] published in 2019. After the Executive Summary, ACM's full text [T19] is split into 21 parts labeled by "ACM:" I, II, III, IV, V, VI, VII, VIII, IX, X, XI, XII, XIII, XIV, XV, XVI, XVII, XVIII, XIX, XX, XXI. Each part is followed by a critical "Comment." Most of the comments are based on references to original papers and other material from recent blog posts [MIR] [DEC] [HIN]. I'll point out that highly cited publications of the awardees ignored fundamental relevant prior work—this may be the reason for some of ACM's misattributions. Since ACM's text is a bit repetitive and redundant, so are the partially overlapping sections of my critique.

Executive Summary (~1000 words, with links to details)

While Drs. LeCun & Bengio & Hinton (LBH for short) have made useful improvements of algorithms for artificial neural networks (NNs) and deep learning (e.g., Sec. I), ACM lauds them for more visible work based on fundamental methods whose inventors they did not cite, not even in later surveys (this may actually explain some of ACM's misattributions). I correct ACM's distortions of deep learning history. Numerous references can be found under the relevant section links I-XXI which adhere to the sequential order of ACM's text [T19] (while this summary groups related sections together).

Sec. II: In contrast to ACM's claims, NNs for pattern recognition etc. were introduced long before the 1980s. Deep learning with multilayer perceptrons started in 1965 through Ivakhnenko & Lapa long before LBH who have never cited them, not even in recent work. In the 1980s, "modern" gradient-based learning worked only for rather shallow NNs, but it became really deep in 1991 in my lab, first through unsupervised pre-training of NNs, then through supervised LSTM. Sec. I contains 4 subsections A, B, C, D on the 4 deep learning "breakthroughs" explicitly mentioned by ACM. ACM does not mention that they were mostly based on deep learning techniques of my team:

Sec. A: Speech Recognition (see also VI & XI & XV): The first superior end-to-end neural speech recognition combines two methods from my lab: LSTM (1990s-2005) and CTC (2006), applied to speech in 2007. Hinton (2012) and Bengio (XV) still used an old hybrid approach of the 1980s and 90s; Hinton et al. (2012) did not compare it to our revolutionary CTC-LSTM (which was soon on most smartphones).

Sec. B: Natural Language Processing (see also VI & XI & XVI): The first superior end-to-end neural machine translation (soon used for several billions of translations each day by the big platform companies) was also based on our LSTM.

Sec. C: Robotics. Our LSTM trained by Reinforcement Learning (RL) was also the core of the most visible breakthroughs in robotics and video games.

Sec. D: Computer Vision (see also XVIII & XIV & XI & VI) was revolutionized by convolutional NNs (CNNs). The basic CNN architecture is due to Fukushima (1979). NNs with convolutions were later (1987) combined by Waibel with backpropagation and weight sharing, and applied to speech. All before LeCun's CNN work (XVIII). We showed twice (1991-95 and 2006-10) that deep NNs don't need unsupervised pre-training (in contrast to Hinton's claims). Our team (Ciresan et al.) made CNNs fast & deep enough for superior computer vision in 2011, winning 4 image recognition contests in a row before Hinton's team won one. ResNet (ImageNet 2015 winner) is a special case of our earlier Highway Nets.

Sec. XIV: Again ACM recognizes work that failed to cite the pioneers. Long before Hinton (2012), Hanson (1990) had a variant of dropout, and v. d. Malsburg (1973) had rectified linear neurons; Hinton did not cite them. Already in 2011, our deep & fast CNN more than "halved the error rate for object recognition" (ACM's wording) in a computer vision contest (where LeCun participated), long before Hinton's similar CNN (2012). Sec. XI: ACM mentions GPU-accelerated NNs pioneered by Jung & Oh (2004). LBH did not cite them. Our deep GPU-NN of 2010 debunked unsupervised pre-training (introduced by myself in 1991 and later championed by Hinton), and our GPU-CNN of 2011 was the first to win contests in computer vision (explicitly mentioned by ACM).

Sec. XVIII: ACM credits LeCun for developing CNNs. However, the foundations of CNNs were laid earlier by Fukushima and Waibel (Sec. D). ACM also explicitly mentions autonomous driving and medical image analysis. The first team to win relevant international contests in these fields through deep CNNs was ours (2011, 2012, 2013). Sec. VII: ACM explicitly mentions medicine and materials science. Our deep NNs were the first to win medical imaging competitions in 2012 and 2013, and the first to apply deep NNs to material defect detection in industry (since 2010).

Sec. XII & XIX & XXI: Modern backpropagation was first published by Linnainmaa (1970), not by LeCun or Hinton or their collaborators (1985) who did not cite Linnainmaa, not even in later surveys. Sec. XIII & II & V (& III & IX & X & XX): Ivakhnenko's deep feedforward nets (since 1965) learned internal representations long before Hinton's shallower ones (1980s). Hinton has never cited him. Sec. XX: ACM credits LeCun for work on hierarchical feature representation which did not cite Ivakhnenko's much earlier work on this (since 1965). Sec. XXI: ACM credits LeCun for work on automatic differentiation which did not cite its inventor Linnainmaa (1970). And also for work on deep learning for graphs that failed to cite the earlier work by Sperduti & Goller & Küchler & Pollack.

Sec. XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences. His work was not the first on this topic, and is not important for modern deep learning speech recognition systems (mentioned by ACM) based on our CTC-LSTM (Sec. A & B). Sec. XVI: ACM credits Bengio for neural probabilistic language models. Our 1995 neural probabilistic text model greatly predates Bengio's. ACM mentions NNs that learn sequential attention. We started this in 1990-93 long before LBH who did not cite this.

Sec. XVII: ACM mentions Generative Adversarial Networks (GANs, 2010-14) of Bengio's team, a special case of my Adversarial Artificial Curiosity (1990) which he did not cite. I list 7 of our additional priority disputes with Bengio & Hinton (more than can be explained by chance), on vanishing gradients (1991), meta-learning (1987), unsupervised pre-training (1991), compressing or distilling one NN into another (1991), fast weights through outer products (1993), learning sequential attention with NNs (1990), and other topics [R2-R6].

Sec. IV is on Turing (1936) and his predecessors Gödel (1931) and Church (1935).

Sec. Conclusion: In the recent decade of deep learning, most major AI applications mentioned by ACM (speech recognition, language translation, etc.) on billions of devices (also healthcare applications) heavily depended on our deep learning techniques and conceptual foundations, while LBH's most visible work ignored essential prior art since the 1960s—see, e.g., Sec. II & III & V & XII & XIII & XVII & XIV & XIX & XX & XXI, [DL1] [DL2] [DLC] [MIR] [HIN] [R2-R8]. But in science, by definition, the facts will always win in the end. As long as the facts have not yet won it's not yet the end.

I. ACM: ACM named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM A.M. Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. ... Working independently and together, Hinton, LeCun and Bengio developed conceptual foundations for the field, identified surprising phenomena through experiments, and contributed engineering advances that demonstrated the practical advantages of deep neural networks. In recent years, deep learning methods have been responsible for astonishing breakthroughs in computer vision, speech recognition, natural language processing, and robotics—among other applications.

Comment: LBH and their co-workers have contributed useful improvements of deep learning methods, e.g., [CNN2] [CDI] [LAN] [CNN4] [RMSP] [XAV] [ATT14] [CAPS]. However, the essential "conceptual foundations" of deep learning (mentioned by ACM) were laid by others, e.g., deep learning multilayer perceptrons that learn internal representations (1965) [DEEP1-2] [R8], modern backpropagation (1970) [BP1] [R7] [BP2], architectures of recurrent NNs (1943-56) [MC43] [K56] and convolutional NNs (1979) [CNN1], principles of generative adversarial NNs and artificial curiosity (1990) [AC90, AC90b] [AC20], unsupervised pre-training for deep NNs (1991) [UN1-2], vanishing gradients (1991) [VAN1] & LSTM (Sec. A), supervised GPU-accelerated NNs (since 2004) [GPUNN] [GPUCNN5], and other foundations [DL1] [DL2] [R2-R8]. Often LBH failed to cite essential prior work, even in their later surveys [DLC] [HIN] [MIR] (Sec. 21) [R2-R5, R7-R8]. This may explain some of ACM's misattributions [T19]. Compare Sec. II & III & V & XIII & X & XVII & XII & XVIII & XX.

ACM's statement on "astonishing breakthroughs in computer vision, speech recognition, natural language processing, and robotics" is correct. Although ACM does not literally claim that LBH were somehow responsible for these breakthroughs, ACM's wording seems to suggest this. In particular, ACM does not mention that these breakthroughs were largely based on what happened in my own deep learning research group in the past 3 decades or so (e.g., A & B & C). The deep NNs of our team have revolutionized Pattern Recognition and Machine Learning. By the 2010s [DEC], they were heavily used in academia and industry [DL4], in particular, by Microsoft, Google & Facebook, former employers of Hinton & LeCun. I will focus on the 4 fields explicitly mentioned by ACM (labeled as A, B, C, D below):

A. Speech recognition. The first superior end-to-end neural speech recognizer that outperformed the state of the art was based on two methods from my lab: (A1) Long Short-Term Memory or LSTM (1990s-2005) [LSTM0-6] overcomes the famous vanishing gradient problem first analyzed by my student Sepp Hochreiter in 1991 [VAN1] long before Bengio (see XVII and [MIR], Sec. 3 & Sec. 4) and was refined with my student Felix Gers [LSTM2] through "forget gates" based on end-to-end-differentiable fast weights [MIR] (Sec. 8) [FAST0-1]. (A2) Connectionist Temporal Classification [CTC] (my student Alex Graves et al., 2006). Our team successfully applied CTC-trained LSTM to speech in 2007 [LSTM4] (also with hierarchical LSTM stacks [LSTM14]). This was very different from previous hybrid methods since the late 1980s which combined NNs and traditional approaches such as Hidden Markov Models (HMMs), e.g., [BW] [BRI] [BOU] (Sec. XV). Hinton et al. (2012) still used the old hybrid approach [HYB12], and did not compare it to CTC-LSTM. In 2009, through the efforts of Alex, CTC-trained LSTM became the first recurrent NN (RNN) to win international competitions. He later reused our end-to-end neural speech recognizer [LSTM4] [LSTM14] as a postdoc in Hinton's lab [LSTM8]. By 2015, when compute had become cheap enough, CTC-LSTM dramatically improved Google's speech recognition [GSR] [GSR15] [DL4]. By the time the Turing Award was handed out, this had been on most smartphones for years; Google's 2019 on-device speech recognition [GSR19] (no longer on the server) is still based on LSTM. See [MIR], Sec. 4 (and VI & XI & XV).

B. Natural Language Processing (NLP). The first superior end-to-end neural machine translation was also based on our LSTM. In 1995, we already had excellent neural probabilistic models of text [SNT] (compare XVI). In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs [LSTM13]. That is, a neural "subsymbolic" model suddenly excelled at learning "symbolic" tasks. Compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate [GT16] [WU] (which mentions LSTM over 50 times) and Facebook Translate [FB17] were based on two connected LSTMs [S2S], one for incoming texts, one for outgoing translations, which was much better than what existed before [DL4]. By 2017, Facebook's users made 30 billion LSTM-based translations per week [FB17] [DL4]. (It should be mentioned that further improvements were due to an attention mechanism of Bengio's team [ATT14]; see Sec. XVI.) Compare: the most popular YouTube video needed 2 years to achieve only 6 billion clicks. (See also VI & XI & XV.)

C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics, e.g., [LSTM-RL] [RPG]. In the 2010s, combinations of RL and LSTM have become standard. For example, in 2018, an RL LSTM was the core of OpenAI's famous Dactyl which learned to control a dextrous robot hand without a teacher [OAI1] [OAI1a]. Similarly for video games: In 2019, DeepMind (co-founded by a student from my lab) famously beat a pro player in the game of StarCraft, which is harder than Chess or Go [DM2] in many ways, using AlphaStar whose brain has a deep LSTM core trained by RL [DM3]. An RL LSTM (with 84% of the model's total parameter count) also was the core of the famous OpenAI Five which learned to defeat human experts in the Dota 2 video game (2018) [OAI2]. Bill Gates called this a "huge milestone in advancing artificial intelligence" [OAI2a]. See [MIR], Sec. 4.

Apart from A, B, C above, the 2010s saw many additional LSTM applications, e.g., in healthcare (lots of papers on this), chemistry, molecular design, lip reading, speech synthesis [AM16], stock market prediction, self-driving cars, mapping brain signals to speech, predicting what's going on in nuclear fusion reactors, and so on [DEC] [DL4]. By 2016, more than a quarter of the power of all those Tensor Processing Units in Google's data centers was used for LSTM (only 5% for the CNNs of Sec. D) [JOU17]. Apparently [LSTM1] has become the most cited AI and NN research paper of the 20th century [R5]. By 2019, it got more citations per year than any other computer science paper of the past century [DEC]. (Admittedly, however, citations are a highly questionable measure of true impact [NAT1].)

D. Computer Vision was revolutionized in the 2010s by a particular feedforward NN called the convolutional NN (CNN) [CNN1-4]. The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979) [CNN1]. The popular downsampling variant called max-pooling was introduced by Weng et al. (1993) [CNN3]. In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation [CNN1a]. LeCun's team later contributed improvements of CNNs, especially for images, e.g., [CNN2] [CNN4] (Sec. XVIII). Finally, my own team showed in 2010 [MLP1] that unsupervised pre-training is not necessary to train deep NNs, contrary to claims by Hinton [VID1] who said that "nobody in their right mind would ever suggest" this. Then we greatly sped up the training of deep CNNs (Dan Ciresan et al. 2011). Our fast GPU-based CNN of 2011 [GPUCNN1], sometimes called "DanNet," was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated CNNs of 2006 [GPUCNN]. Already in 2011, DanNet showed that deep learning worked far better than the existing state of the art for recognizing objects in images. In fact, DanNet won 4 important computer vision competitions in a row between May 15, 2011, and Sept 10, 2012 [GPUCNN5], before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky won the ImageNet [IM09] 2012 contest [GPUCNN4-5] [R6] (now also without unsupervised pre-training). In particular, at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition in an international contest where LeCun's team took a distant second place, with three times worse performance. Even the NY Times mentioned this.

DanNet was also the first deep CNN to win: a Chinese handwriting contest (ICDAR 2011), an image segmentation contest (ISBI, May 2012), and a contest on object detection in large images (ICPR, 10 Sept 2012), which was at the same time a medical imaging contest on cancer detection [GPUCNN8]. All before ImageNet 2012 [GPUCNN4-5] [R6]. Our CNN image scanners were 1000 times faster than previous methods [SCAN]. The tremendous importance for health care etc. is obvious. Today IBM, Siemens, Google and many startups are pursuing this approach. Much of modern computer vision is extending the work of 2011, e.g., [MIR], Sec. 19. ResNet, the ImageNet 2015 winner [HW2] (Dec 2015) which currently gets more citations per year than any other paper, is a special case of our earlier Highway Net (May 2015) [HW1] [HW3] [R5], the feedforward net version of vanilla LSTM [LSTM2] and the first working, really deep feedforward NN with over 100 layers. See also XVIII & XIV & XI & VI.

II. ACM: While the use of artificial neural networks as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s, ...

Comment: Perhaps ACM's lack of knowledge about NN history is the reason why they praise works by LBH that failed to cite the original work. In fact, NNs of the kind mentioned by ACM appeared long before the 1980s. The most powerful NN architectures (recurrent NNs) were proposed already in the 1940s/50s [MC43] [K56] (but don't forget prior work in physics since the 1920s [L20] [I25] [K41] [W45]). Fukushima's now widely used deep convolutional NN architecture was proposed in the 1970s [CNN1]. Minsky's simple neural SNARC computer dates back to 1951. NNs without hidden layers learned in 1958 [R58] (such "shallow learning" started around 1800 when Gauss & Legendre introduced linear regression and the method of least squares [DL1] [DL2]). In the early 1960s, interesting ideas about deeper adaptive NNs [R61] [R62] did not get very far. Successful learning in deep architectures started in 1965 when Ivakhnenko & Lapa published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many hidden layers (already containing the now popular multiplicative gates) [DEEP1-2] [DL1] [DL2]. A paper of 1971 [DEEP2] already described a deep learning net with 8 layers, trained by their highly cited method which was still popular in the new millennium [DL2], especially in Eastern Europe, where much of Machine Learning was born. (Ivakhnenko did not call it an NN, but that's what it was.) LBH have never cited this. Compare [MIR] (Sec. 1) [R8]. (See also XIII & III & V & VIII & IX & X.)

ACM seems to be influenced by a misleading "history of deep learning" propagated by LBH & co-authors, e.g., Sejnowski [S20] (see XIII). It goes more or less like this: In 1969, Minsky & Papert [M69] showed that shallow NNs without hidden layers are very limited "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s" [S20]. However, as mentioned above, the 1969 book [M69] addressed a "problem" that had already been solved for 4 years by Ivakhnenko & Lapa whose popular deep learning method [DEEP1-2] [DL2] has been used by many throughout the decades. Minsky should have known, or at least corrected this later. Compare [HIN] (Sec. I).

In the 1980s, "modern" gradient-based learning worked only for rather shallow NNs (see [MOZ] though). However, it became really deep in 1991 in my lab [UN0-UN3] which has always focused on the depth in deep learning. See [MIR], Sec. 1: First Very Deep NNs, Based on Unsupervised Pre-Training (1991). By 1993, my unsupervised pre-training helped to solve previously unsolvable "Very Deep Learning" tasks of depth > 1000 [UN2] [DL1]. Then, however, we replaced it by the even better, purely supervised LSTM—see Sec. A and [MIR] (Sec. 4). (By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000 [LSTM17] and more.) In fact, twice my lab drove the shift from unsupervised pre-training to purely supervised learning (1991-95 and 2006-10). See [HIN] (Sec. II) & [MIR] (Sec. 19) & Sec. III.

III. ACM: ... by the early 2000s, LeCun, Hinton and Bengio were among a small group who remained committed to this approach. Though their efforts to rekindle the AI community's interest in neural networks were initially met with skepticism, their ideas recently resulted in major technological advances, and their methodology is now the dominant paradigm in the field.

Comment: However, it isn't "their" methodology because it was introduced much earlier by others whom they did not cite [DLC], e.g., [DEEP1-2] [BP1] [DL1] [DL2] [R7-R8] [R2-R4]. As mentioned above, others introduced deep learning multilayer perceptrons (1965) [DEEP1-2] [R8], modern backpropagation (1970) [BP1] [R7] [BP2], architectures of recurrent NNs (1943-56) [MC43] [K56] and convolutional NNs (1979) [CNN1], principles of generative adversarial NNs and artificial curiosity (1990) [AC90, AC90b] [AC20], unsupervised pre-training for deep NNs (1991) [UN1-2], the vanishing gradient problem (1991) [VAN1] & LSTM (Sec. A), supervised GPU-accelerated NNs (since 2004) [GPUNN] [GPUCNN5], and other foundations [DL1] [DL2] [R2-R8]. Often LBH failed to cite essential prior work [DLC] [HIN] [MIR] (Sec. 21). Compare Sec. II & V & XIII & IX & X & XVII & XII & XVIII & XX & I, and Sec. 21 of [MIR].

ACM may have been misled by LBH's web site deeplearning.net which until 2019 advertised deep learning as "moving beyond shallow machine learning since 2006" [DL7], referring to Hinton's [UN4] and Bengio's [UN5] unsupervised layer-wise pre-training for deep NNs (2006) although we had this type of deep learning already in 1991 [UN1-2]; see Sec. II & XVII (5). Not to mention Ivakhnenko's even earlier supervised layer-wise training of deep NNs [DEEP1-2] which Hinton [UN4] & Bengio [UN5] & LBH [DL3] did not cite either. See also Sec. X.

IV. ACM: The ACM A.M. Turing Award, often referred to as the "Nobel Prize of Computing," carries a $1 million prize, with financial support provided by Google, Inc. It is named for Alan M. Turing, the British mathematician who articulated the mathematical foundation and limits of computing.

Comment: Skip this comment if you are not interested in deviating from the topic of LBH—this comment appears here only because my comments track the sequential order of ACM's claims [T19].

ACM's statement on Turing is not wrong. But it is misleading, like some of its other statements [T19]. ACM correctly states that Turing "articulated the mathematical foundation and limits of computing." However, many have done this, and in science the important question is: Who did it first? It wasn't Turing. Both the Austrian mathematician Kurt Gödel, the very founder of theoretical computer science (1931) [GOD], and the American Alonzo Church (1935) [CHU], were cited by Turing who published later (1936) [TUR]. Gödel introduced the first universal coding language (based on the integers). He used it to represent both data (such as axioms and theorems) and programs (such as proof-generating sequences of operations on the data). He famously constructed formal statements that talk about the computation of other formal statements, especially self-referential statements which state that they are not provable by any computational theorem prover. Thus he exhibited the fundamental limits of mathematics and computing and Artificial Intelligence [GOD]. Compare [MIR] (Sec. 18). Church (1935) extended Gödel's result to the famous Entscheidungsproblem (decision problem) [CHU], using his alternative universal language called Lambda Calculus, basis of LISP. Later, Turing introduced yet another universal model (the Turing Machine) to do the same (1936) [TUR]. (See also my reply to Hinton who criticized my website on Turing without suggesting any fact-based corrections [HIN].) Nevertheless, although he was standing on the shoulders of others, Turing was certainly one of the most important computer science pioneers.

Remarkably, Gödel (1906-1978) never got a Turing Award, although he not only laid the foundations of the field, but also identified its most famous open problem "P=NP?" in his famous letter to von Neumann (1956). Neither did Church. Likewise, Konrad Zuse (1910-1995) never got a Turing Award despite having built the world's first working programmable computer in 1935-41. (This was not just a theoretical pen & paper construct like those of Gödel & Church & Turing.) There would have been plenty of time though: these pioneers died years after the award was introduced in 1966.

V. ACM: "Artificial intelligence is now one of the fastest-growing areas in all of science and one of the most talked-about topics in society," said ACM President Cherri M. Pancake. "The growth of and interest in AI is due, in no small part, to the recent advances in deep learning for which Bengio, Hinton and LeCun laid the foundation."

Comment: As mentioned above, the foundations of deep learning were actually laid by others much earlier, e.g., deep learning multilayer perceptrons that learn internal representations (1965) [DEEP1-2] [R8], modern backpropagation (1970) [BP1] [R7] [BP2], architectures of recurrent NNs (1943-56) [MC43] [K56] and convolutional NNs (1979) [CNN1], principles of generative adversarial NNs and artificial curiosity (1990) [AC90, AC90b] [AC20], unsupervised pre-training for deep NNs (1991) [UN1-2], vanishing gradients (1991) [VAN1] & LSTM (Sec. A), supervised GPU-accelerated NNs (since 2004) [GPUNN] [GPUCNN5], and other foundations [DL1] [DL2] [R2-R8]. Often LBH failed to cite essential prior work [DLC] [HIN] [MIR] (Sec. 21) [R2-R5, R7, R8, R11]. Compare Sec. II & I & III & XIII & X & XVII & XII & XVIII & XX.

VI. ACM: These technologies are used by billions of people. Anyone who has a smartphone in their pocket can tangibly experience advances in natural language processing and computer vision that were not possible just 10 years ago.

Comment: ACM's statement is true. However, those "advances in natural language processing" and in speech in the past 10 years came mainly through the LSTM and CTC of our group [LSTM1-6] [CTC] (1991-2007)—see Sec. B & Sec. A. And even the "advances in computer vision" were possible only through the speedups of supervised NNs and CNNs achieved in our group 2010-2011 [MLP1] [GPUCNN5] [R6] and through Highway Net-like NNs (2015) [HW1] [HW2] [HW3] [R5], although the principles of CNNs were invented and developed by others since the 1970s [CNN1-4]. See Sec. D & XVIII & XIV as well as Sec. 4 & Sec. 19 of [MIR].

VII. ACM: In addition to the products we use every day, new advances in deep learning have given scientists powerful new tools—in areas ranging from medicine, to astronomy, to materials science."

Comment: ACM's statement is true. But who really started this? ACM explicitly mentions medicine. Our team was the first to win a medical imaging contest through deep learning (Sept 2012, on cancer detection) [GPUCNN5] [GPUCNN8]. ACM also explicitly mentions materials science. In 2010, we introduced our deep and fast GPU-based NNs to Arcelor Mittal, the world's largest steel producer, and were able to greatly improve steel defect detection [ST]. To the best of my knowledge, this was the first deep learning breakthrough in heavy industry. (All of this happened before the similar GPU-accelerated CNN of Hinton's student Krizhevsky won ImageNet 2012 [GPUCNN5] [R6].) One year later, our team also won the MICCAI Grand Challenge on mitosis detection [MGC] [GPUCNN5] [GPUCNN8]. Our approach of 2012-2013 has transformed medical imaging. Many major companies are using it now. See Sec. D & XI. And of course, our LSTM (Sec. A & B & C) is also massively used in healthcare and medical diagnosis—one can find thousands of articles on this at Google Scholar.

VIII. ACM: "Deep neural networks are responsible for some of the greatest advances in modern computer science, helping make substantial progress on long-standing problems in computer vision, speech recognition, and natural language understanding," said Jeff Dean, Google Senior Fellow and SVP, Google AI.

"At the heart of this progress are fundamental techniques developed starting more than 30 years ago by this year's Turing Award winners, Yoshua Bengio, Geoffrey Hinton, and Yann LeCun."

Comment: However, as mentioned above, LBH actually used the "fundamental techniques" invented by others, including our team, often without citing them [DL1] [DLC] [HIN] [R2-R4] [R7-R8]. See Sec. V & XII & XIX & II & III & XIII & XVII & X & I.

IX. ACM: By dramatically improving the ability of computers to make sense of the world, deep neural networks are changing not just the field of computing, but nearly every field of science and human endeavor."

Machine Learning, Neural Networks and Deep Learning

In traditional computing, a computer program directs the computer with explicit step-by-step instructions. In deep learning, a subfield of AI research, the computer is not explicitly told how to solve a particular task such as object classification. Instead, it uses a learning algorithm to extract patterns in the data that relate the input data, such as the pixels of an image, to the desired output such as the label "cat." The challenge for researchers has been to develop effective learning algorithms that can modify the weights on the connections in an artificial neural network so that these weights capture the relevant patterns in the data.

Geoffrey Hinton, who has been advocating for a machine learning approach to artificial intelligence since the early 1980s, looked to how the human brain functions to suggest ways in which machine learning systems might be developed. Inspired by the brain, he and others proposed "artificial neural networks" as a cornerstone of their machine learning investigations.

Comment: ACM's statement is not wrong. However, as mentioned above, those "others" mentioned by ACM proposed such systems decades before Hinton, who often failed to cite them, even in later work, e.g., [HIN] [DLC] [DL1] [DL2] [DEEP1-2] [CMB] [R7-R8], Sec. II & III & XIII & V & X & XIV & I.

X. ACM: In computer science, the term "neural networks" refers to systems composed of layers of relatively simple computing elements called "neurons" that are simulated in a computer. These "neurons," which only loosely resemble the neurons in the human brain, influence one another via weighted connections. By changing the weights on the connections, it is possible to change the computation performed by the neural network. Hinton, LeCun and Bengio recognized the importance of building deep networks using many layers—hence the term "deep learning."

Comment: The ancient term "deep learning" (explicitly mentioned by ACM) was actually first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000) [DL2]. To my knowledge, LBH have never cited them. Apparently our 2005 paper on deep RL [DL6] was the first machine learning publication with the word combination "learn deep" in the title. Later LBH started talking about "deep learning ... moving beyond shallow machine learning since 2006" [DL7]; see, e.g., Sec. III.

It is true though that LBH "recognized the importance of building deep networks using many layers". However, others recognized this much earlier than LBH, e.g., [DEEP1-2] [CNN1] [HIN] [R8] [DL1] [DLC] (also for deep learning through unsupervised pre-training [UN1-3] [R4]; see [HIN], Sec. II). See also Sec. II & III & XIII & V & I.

XI. ACM: The conceptual foundations and engineering advances laid by LeCun, Bengio and Hinton over a 30-year period were significantly advanced by the prevalence of powerful graphics processing unit (GPU) computers, as well as access to massive datasets. In recent years, these and other factors led to leap-frog advances in technologies such as computer vision, speech recognition and machine translation.

Comment: Again ACM lauds work that failed to cite the pioneers. As mentioned above, the essential "conceptual foundations" of deep learning were laid by others ignored by LBH's papers (see Sec. V & II & III & I & XIII & XII & XIX & X & XVII and [HIN] [R7-R8] [R2-R5]).

ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004) [GPUNN] [GPUCNN5], apparently never cited by LBH. In 2010, our team (Dan Ciresan et al.) was the one that made GPU-based NNs fast and deep enough to break an important benchmark record [MLP1], demonstrating that unsupervised pre-training (pioneered by myself in 1991 [UN1]) is not necessary to train deep NNs, contrary to Hinton's claims [VID1]. In 2011, our CNNs were deep and fast enough to win competitions in computer vision (explicitly mentioned by ACM) for the first time [GPUCNN5] [R6]—see Sec. D.

Furthermore, by the mid 2010s, speech recognition and machine translation (explicitly mentioned by ACM) were actually dominated by LSTM and CTC of our team [LSTM1-4] [CTC]. In particular, as mentioned in Sec. A, the CTC-LSTM combination (2006-2007) was the first superior end-to-end neural speech recogniser, while previous methods since the late 1980s (including Bengio's and Hinton's) combined NNs with traditional models such as HMMs, e.g., [BW] [BOU] [BRI] [HYB12]. As mentioned in Sec. B, XVI, the first superior end-to-end neural machine translation was also based on LSTM.

XII. ACM: ... Select Technical Accomplishments ...

Geoffrey Hinton

Backpropagation: In a 1986 paper, "Learning Internal Representations by Error Propagation," co-authored with David Rumelhart and Ronald Williams, Hinton demonstrated that the backpropagation algorithm allowed neural nets to discover their own internal representations of data, making it possible to use neural nets to solve problems that had previously been thought to be beyond their reach. The backpropagation algorithm is standard in most neural networks today.

Comment: ACM credits Hinton for work that failed to cite the origins of the backpropagation algorithm. ACM's statement is "less wrong" than Honda's [HIN] (Sec. I) but still very misleading since non-experts (and apparently even other award committees, e.g., [HIN], Sec. I) are left with the impression that Hinton and colleagues created this method. In fact, Hinton was co-author of an article on backpropagation by Rumelhart et al. (1985-86) [RUM] which did not state that 3 years earlier, Werbos proposed to train NNs in this way (1982) [BP2]. And the article [RUM] even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970) [BP1], also known as "reverse mode of automatic differentiation." (In 1960, Kelley already had a precursor thereof in the field of control theory [BPA]; compare [BPB] [BPC].) See also [R7]. By 1985, compute had become about 1,000 times cheaper than in 1970, and the first desktop computers had just become accessible in wealthier academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs [RUM]. But this was essentially just an experimental analysis of a known method [BP1][BP2]. And the authors [RUM] did not cite the prior art [DLC]. More on the history of backpropagation can be found at Scholarpedia [DL2] and in my award-winning survey [DL1]. Compare Sec. XIX, II.

Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes. (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].
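The efficiency point can be illustrated with a toy sketch. The code below is only an illustration under stated assumptions, not the formulation of [BP1]: the tiny tanh network, the loss 0.5 * sum(out^2), and all function names are hypothetical choices made for this example. It computes all weight gradients with one forward and one backward sweep, and compares them with a deliberately inefficient baseline that perturbs one weight at a time and therefore needs one extra loss evaluation per weight.

```python
import math

def forward(weights, x):
    """Forward pass through a chain of tanh layers; returns all activations."""
    activations = [x]
    for W in weights:
        prev = activations[-1]
        activations.append(
            [math.tanh(sum(w * p for w, p in zip(row, prev))) for row in W]
        )
    return activations

def loss(weights, x):
    """Toy loss (an assumption for this sketch): 0.5 * sum of squared outputs."""
    out = forward(weights, x)[-1]
    return 0.5 * sum(o * o for o in out)

def backprop(weights, x):
    """Reverse sweep: one forward and one backward pass give all gradients."""
    acts = forward(weights, x)
    delta = list(acts[-1])  # dLoss/d(output) for the toy loss above
    grads = [None] * len(weights)
    for k in reversed(range(len(weights))):
        W, inp, out = weights[k], acts[k], acts[k + 1]
        # Chain rule through tanh: d tanh(a)/da = 1 - tanh(a)^2
        pre = [d * (1.0 - o * o) for d, o in zip(delta, out)]
        grads[k] = [[p * i for i in inp] for p in pre]
        # Pass the error signal back to the inputs of this layer
        delta = [sum(W[j][i] * pre[j] for j in range(len(W)))
                 for i in range(len(inp))]
    return grads

def finite_differences(weights, x, eps=1e-6):
    """Inefficient baseline: one additional loss evaluation per weight."""
    base = loss(weights, x)
    grads = []
    for W in weights:
        grad_W = []
        for row in W:
            grad_row = []
            for i in range(len(row)):
                old = row[i]
                row[i] = old + eps
                grad_row.append((loss(weights, x) - base) / eps)
                row[i] = old  # restore the weight
            grad_W.append(grad_row)
        grads.append(grad_W)
    return grads

# Hypothetical example values: a 3-2-1 network and one input vector.
weights = [[[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]], [[0.7, -0.8]]]
x = [1.0, 0.5, -1.5]
g_bp = backprop(weights, x)
g_fd = finite_differences(weights, x)
max_diff = max(abs(a - b)
               for Wa, Wb in zip(g_bp, g_fd)
               for ra, rb in zip(Wa, Wb)
               for a, b in zip(ra, rb))
print("largest disagreement between the two methods:", max_diff)
```

Both routines approximate the same gradients, but the cost of the baseline grows with the number of weights times the cost of a forward pass, while the reverse sweep costs roughly as much as a single forward pass.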

See the recent debate [HIN]: It is true that in 2018, Hinton [AOI] did not credit himself but his co-author Rumelhart [RUM] with the "invention" of backpropagation. Nevertheless, he accepted the Honda Prize for "creating" the method and for other things he didn't do [HIN]. Neither in [AOI] nor in other recent work [DL3] did he cite Linnainmaa (1970) [BP1], the true creator [BP4] [BP5]. It should be mentioned that [DL3] does cite Werbos (1974), who, however, described the method correctly only later in 1982 [BP2] and also failed to cite [BP1]. (Compare also [BP6].) Linnainmaa's method was well-known, e.g., [BP5] [DL1] [DL2] [DLC]. It wasn't created by "lots of different people" as Hinton suggested [AOI] [HIN] [R11] but by exactly one person who published first [BP1] and therefore should get the credit. For decades, Hinton has not published errata or corrigenda of his papers.

XIII. ACM: Boltzmann Machines: In 1983, with Terrence Sejnowski, Hinton invented Boltzmann Machines, one of the first neural networks capable of learning internal representations in neurons that were not part of the input or output.

Comment: Again ACM credits work that failed to cite the pioneers. I have called the Boltzmann Machine [BM] a significant contribution to deep learning [HIN]. Recently, however, I learnt through a reader of [HIN] that even [BM] did not cite prior relevant work by Sherrington & Kirkpatrick [SK75] and Glauber [G63] (compare also [H86] [H88] [S93]). ACM may be right by calling [BM] one of the first NNs capable of learning internal representations. Nevertheless, two decades before [BM], in 1965, Ivakhnenko & Lapa published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many layers [DEEP1-2] [HIN]. These networks were fully "capable of learning internal representations in neurons that were not part of the input or output." [BM] did not cite this. LBH have never cited this, not even in recent work. Compare [MIR] (Sec. 1) [R8] and Sec. II & V & X.

As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims: In 1969, Minsky & Papert [M69] showed that shallow NNs are very limited "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s" [S20]. However, the 1969 book [M69] addressed a "deep learning problem" that had already been solved for 4 years (Sec. II), and deep learning research was alive and kicking also in the 1970s, at least outside of the Anglosphere, e.g., [DEEP2] [BP6] [CNN1] [DL1] [DL2].

XIV. ACM: Improvements to convolutional neural networks: In 2012, with his students, Alex Krizhevsky and Ilya Sutskever, Hinton improved convolutional neural networks using rectified linear neurons and dropout regularization. In the prominent ImageNet competition, Hinton and his students almost halved the error rate for object recognition and reshaped the computer vision field.

Comment: Again ACM recognizes work that failed to cite the pioneers. Rectified linear neurons (ReLUs) were actually known much earlier—see v. d. Malsburg's work [CMB] (1973). Hinton's 2012 paper [GPUCNN4] did not cite their origins. Instead, it cited another paper by Hinton which also did not cite the original work. Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990) [Drop1]. Hinton's 2012 paper did not cite this either.

Apart from this, as we showed already in 2011 in a contest where LeCun's team participated as well, neither dropout nor ReLUs are necessary to win computer vision competitions and achieve superhuman results—see Sec. D above. Back then, the only really important CNN-related task was to greatly accelerate the training of deep CNNs through GPUs [GPUCNN1,3,5] [R6]. Already before ImageNet 2012 [R6], our earlier fast implementation of deep CNNs (using neither ReLUs nor dropout / Hanson's rule) had a monopoly on winning computer vision competitions [GPUCNN5]. It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011 [GPUCNN2], long before the similar system of Hinton's student. See Sec. D, and Sec. 19 of [MIR].

XV. ACM: Yoshua Bengio

Probabilistic models of sequences: In the 1990s, Bengio combined neural networks with probabilistic models of sequences, such as hidden Markov models. These ideas were incorporated into a system used by AT&T/NCR for reading handwritten checks, were considered a pinnacle of neural network research in the 1990s, and modern deep learning speech recognition systems are extending these concepts.

Comment: However, such hybrids of NNs and Hidden Markov Models (HMMs) etc. have existed since the late 1980s, e.g., [BW] [BRI] [BOU]. And it is not true that "modern deep learning speech recognition systems are extending these concepts" (ACM's wording) because they basically abandon HMMs and are based on two methods from my lab: LSTM (1990s-2005) [LSTM0-6] and CTC [CTC] (2006), applied to speech in 2007 [LSTM4] (also with hierarchical LSTM stacks [LSTM14]). CTC-LSTM is end-to-end-neural and thus very different from (and superior to) hybrid methods since the late 1980s [BW] [BRI] [BOU] [HYB12]. By the time the 2018 Turing Award was handed out, CTC-LSTM-based speech recognition was on most smartphones. See Sec. A.

XVI. ACM: High-dimensional word embeddings and attention: In 2000, Bengio authored the landmark paper, "A Neural Probabilistic Language Model," that introduced high-dimension word embeddings as a representation of word meaning. Bengio's insights had a huge and lasting impact on natural language processing tasks including language translation, question answering, and visual question answering. His group also introduced a form of attention mechanism which led to breakthroughs in machine translation and form a key component of sequential processing with deep learning.

Comment: In 1995, we already had a similar, excellent neural probabilistic text model [SNT]. Bengio [NPM] characterizes [SNT] only briefly as "related." (See also Pollack's earlier work on embeddings of words and other structures [PO87] [PO90].) And in the 2010s, the central method in the mentioned fields of "language translation, question answering, and visual question answering" was actually the LSTM of our team [LSTM0-6], "arguably the most commercial AI achievement" [AV1]. See Sec. B, and Sec. 4 of [MIR].

The particular attention mechanism of Bengio's team [ATT14] has indeed become important. For example, it helped to further improve Facebook's LSTM-based translation (Sec. B). Nevertheless, already in 1990-93, we had both of the now common types of adaptive neural sequential attention: end-to-end-differentiable "soft" attention (in latent space) through multiplicative units within NNs [FAST2], and "hard" attention (in observation space) in the context of Reinforcement Learning (RL) [ATT0] [ATT1] (1990).

See Sec. 9 of [MIR] and [R4] for my related priority dispute on attention with Hinton. He reviewed my 1990 paper [ATT2] which summarised in Section 5 our early work on attention, to my knowledge the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller). Two decades later Hinton wrote about his own work [ATT3]: "To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)."

It should be mentioned that towards the end of the 2010s [DEC], despite their limited time windows, attention-based non-recurrent Transformers [TR1] [TR2] started to excel at Natural Language Processing, a traditional LSTM domain (Sec. B). Nevertheless, there are still many language tasks that LSTM can rapidly learn to solve in time proportional to sentence length [LSTM13] [LSTM17] while plain Transformers can't. See [TR3] [TR4] for additional limitations of Transformers.

XVII. ACM: Generative adversarial networks: Since 2010, Bengio's papers on generative deep learning, in particular the Generative Adversarial Networks (GANs) developed with Ian Goodfellow, have spawned a revolution in computer vision and computer graphics. In one fascinating application of this work, computers can actually create original images, reminiscent of the creativity that is considered a hallmark of human intelligence.

Comment: Again ACM lauds Bengio for work that did not cite the original work. GANs [GAN0] [GAN1] (2010-2014) are actually a simple application of my popular adversarial curiosity principle from 1990 [AC90, AC90b] [AC20] (see also surveys [AC09] [AC10]). This principle is now widely used for exploration in Reinforcement Learning (RL, e.g., Sec. C) and for image synthesis [GAN1] (also mentioned by ACM in Sec. XVIII). It works as follows. One NN probabilistically generates outputs, another NN sees those outputs and predicts environmental reactions to them. Using gradient descent, the predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error. One net's loss is the other net's gain. GANs are a special case of this where the environment simply returns 1 or 0 depending on whether the generator's output is in a given set [AC20]. (Other early adversarial machine learning settings [S59] [H90] were very different—they neither involved unsupervised NNs nor were about modeling data nor used gradient descent [AC20].) Bengio et al. neither cited the original work nor corrected their erroneous claims [GAN1] about my other adversarial NNs using "predictability minimization" for creating disentangled representations (1991) [PM1-2] [AC20]. Compare [R2] and [MIR], Sec. 5.
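The two-player set-up described above (a generator that makes outputs, a predictor that minimizes its error on the environment's reactions, and a generator that maximizes that same error) can be sketched in a few lines. This is a minimal toy under loud assumptions, not the 1990 system or a GAN: small tables stand in for the two NNs, the generator picks one of four discrete outputs via a softmax over logits, the hypothetical environment returns a fixed reaction of 1 or 0 per output, and both players use exact expected gradients instead of sampling. All names and learning rates are illustrative.

```python
import math

def softmax(logits):
    """Turn generator logits into a probability distribution over outputs."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def squared_errors(pred, reaction):
    """Predictor error for each possible generator output."""
    return [(p - r) ** 2 for p, r in zip(pred, reaction)]

def adversarial_step(logits, pred, reaction, lr_gen=0.5, lr_pred=0.5):
    """One simultaneous update of both players on the same objective.

    Objective: E = sum_x probs[x] * (pred[x] - reaction[x])^2.
    The predictor descends on E; the generator ascends on E.
    """
    probs = softmax(logits)
    errs = squared_errors(pred, reaction)
    expected_err = sum(q * e for q, e in zip(probs, errs))
    # Predictor: dE/dpred[x] = probs[x] * 2 * (pred[x] - reaction[x])
    new_pred = [p - lr_pred * q * 2.0 * (p - r)
                for p, q, r in zip(pred, probs, reaction)]
    # Generator: dE/dlogit[k] = probs[k] * (errs[k] - E) for a softmax
    new_logits = [t + lr_gen * q * (e - expected_err)
                  for t, q, e in zip(logits, probs, errs)]
    return new_logits, new_pred

# Hypothetical environment: reaction is 1 if the output is in the set {0, 2}.
reaction = [1.0, 0.0, 1.0, 0.0]
logits = [0.0, 0.0, 0.0, 0.0]   # generator starts uniform
pred = [0.9, 0.5, 0.2, 0.1]     # predictor starts with uneven errors

for _ in range(50):
    logits, pred = adversarial_step(logits, pred, reaction)

print("generator probabilities:", softmax(logits))
print("predictor errors:", squared_errors(pred, reaction))
```

In this toy the generator shifts probability towards whatever the predictor currently gets most wrong, and the predictor's gain is exactly the generator's loss, which is the zero-sum structure the paragraph above refers to.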

The priority dispute above was picked up by the popular press including Bloomberg [AV1] after the following event. Bengio's student Goodfellow gave a talk on GANs at NIPS. In the beginning, he encouraged people to ask questions (a normal thing to do at academic conferences). I did, addressing problems with their NIPS 2014 paper [GAN1] which contains false claims about our much earlier work [AC20]. Goodfellow interrupted this. Subsequent efforts to correct this in a common paper failed. The authors of [GAN1] have not published an erratum or corrigendum since then. I had to do this myself [AC20]. (And LeCun called GANs "the coolest idea in machine learning in the last twenty years" without mentioning that they are special cases of my earlier work [R2] [AC20].)

My group has had additional highly visible priority disputes with Bengio's, also going back 3 decades and more:

(2) The dispute on the famous vanishing gradient problem [MIR] (Sec. 3) [VAN1] [VAN2] was settled in favor of my brilliant student Sepp Hochreiter [VAN1]. However, even after a common publication [VAN3], Bengio published papers (e.g., [VAN4] [XAV]) that cited only his own 1994 paper but not Sepp's original work (1991). Disturbingly, this has apparently helped him to get more citations for vanishing gradients than Sepp—another sign that citation counts are poor indicators of truly pioneering work [NAT1]. In fact, Bengio states [YB20] that in 2018 he "ranked as the most cited computer scientist worldwide." The above illustrates what such citation counts are really worth. The deontology of science requires: If one "re-invents" something that was already known, and only becomes aware of it later, one must at least clarify it later [DLC], and correctly give credit in every related follow-up paper or presentation.

(3) Bengio also claims [YB20] that in 1995 he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue" although my publications on exactly this topic date back to 1991-93 [UN0-2].

(4) Another dispute was on meta-learning (learning to learn—now a hot topic) which I started in 1987 [META1] long before Bengio who suggested in public that he did it before me [R3].

There is more. For example, Bengio also writes [YB20] that in 1999 he "introduced, for the first time, auto-regressive neural networks for density estimation" although we used a very similar set-up for text compression in 1995 [SNT] (compare Sec. XVI). He has also heavily used our LSTM (Sec. A-C), but for some reason he introduced in 2014 the new name "gated recurrent units (GRU)" [LSTMGRU] for a variant of our vanilla LSTM architecture [LSTM2] (2000) which he did not cite although [LSTM2] introduced gated recurrent units. (And our team automatically evolved lots of additional LSTM variants and topologies already in 2009 [LSTM7] without changing the name of the basic method.) BTW, GRU cells lack an important gate and can neither learn to count [LSTMGRU2] nor learn simple non-regular languages [LSTMGRU2]. They also do not work as well for challenging translation tasks, according to Google Brain [LSTMGRU3].

Additional priority disputes (many more than can be explained by chance) with Hinton since 1990 included the following:

(5) The dispute on unsupervised pre-training for deep NNs [UN0-4] [HIN] (Sec. II) [MIR] (Sec. 1). Hinton's paper [UN4] (2006) got more citations than my earlier work on this [UN1-2] although [UN1] led to the first NNs shown to solve very deep problems (see Sec. II above). [UN1] was published in 1991-92 when compute was about 1000 times more expensive than in 2006. Hinton did not mention [UN1], not even in LBH's later survey [DL3] [DLC] (2015), although he and Bengio knew it well (also from discussions by email). This illustrates once more what citation counts are really worth. Compare Sec. II & III.

(6) Similar for the dispute on compressing or distilling one NN into another [UN0-2] [DIST1] [DIST2] [MIR] (Sec. 2). Hinton [DIST2] (2006) did not cite my much earlier work on this [UN1] (1991), not even in his later patent application US20150356461A1.

(7) The dispute on fast weights [FAST-FAST3a] through tensor-like outer products [FAST2] (1993) [FAST4a] (2016) [MIR] (Sec. 8).

(8) The dispute on learning sequential attention with NNs [MIR] (Sec. 9). Hinton [ATT3] (2010) did not mention our much earlier work on this [ATT1] although he was both reviewer and editor of my summary [ATT2] (1990). See Sec. XVI above.

The eight priority disputes mentioned in the present Sec. XVII are not the only ones, e.g., [R4]. Remarkably, three of them are related to [UN1] which in many ways started "modern" deep learning, going beyond Ivakhnenko's "early" deep learning [DEEP1-2] (which LBH did not cite either [DLC]; see Sec. II & III). Six of them go back to work of 1990-91 [MIR]. See Sec. I for additional related issues of credit assignment.

XVIII. ACM: Yann LeCun

Convolutional neural networks: In the 1980s, LeCun developed convolutional neural networks, a foundational principle in the field, which, among other advantages, have been essential in making deep learning more efficient.

In the late 1980s, while working at the University of Toronto and Bell Labs, LeCun was the first to train a convolutional neural network system on images of handwritten digits. Today, convolutional neural networks are an industry standard in computer vision, as well as in speech recognition, speech synthesis, image synthesis, and natural language processing. They are used in a wide variety of applications, including autonomous driving, medical image analysis, voice-activated assistants, and information filtering.

Comment: It is true that LeCun's team has made important contributions to CNNs since 1989, e.g., [CNN2] [CNN4]. However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979) [CNN1]. NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation [CNN1a]. Waibel also was the first to apply this to speech (explicitly mentioned by ACM). All of this happened before LeCun's work on CNNs. See Sec. 21 of [MIR] and Sec. D above.

ACM explicitly mentions autonomous driving. The first team to win a relevant international contest through deep CNNs was ours: at IJCNN 2011 in Silicon Valley, our DanNet [GPUCNN1-3] won the traffic sign recognition competition with superhuman performance while LeCun's team took a distant second place (with three times worse performance). See Sec. D.

ACM explicitly mentions medical image analysis. The first team to win a medical image analysis competition through deep CNNs was again ours: at ICPR 2012, our DanNet [GPUCNN1-3] won the medical imaging contest (Sept 2012, on detection of mitosis/cancer) [GPUCNN5] [GPUCNN8] (before the similar AlexNet won ImageNet 2012 [GPUCNN5] [R6]). One year later, our team also won the MICCAI Grand Challenge on mitosis detection [MGC] [GPUCNN5] [GPUCNN8]. This approach has transformed medical imaging. Many major companies are using it now. See Sec. D & VII.

ACM also addresses image synthesis—see Sec. XVII. ACM also explicitly mentions speech recognition, speech synthesis [AM16] [DL1], natural language processing, voice-activated assistants, and information filtering. All of these fields were heavily shaped in the 2010s by our non-CNN methods, e.g., [DL1] [DL4] [AM16] [GSR] [GSR15] [GT16] [WU] [FB17]—see Sec. A, B, VI, XI.

XIX. ACM: Improving backpropagation algorithms: LeCun proposed an early version of the backpropagation algorithm (backprop), and gave a clean derivation of it based on variational principles. His work to speed up backpropagation algorithms included describing two simple methods to accelerate learning time.

Comment: ACM recognizes LeCun for work that did not cite the pioneers of this method. As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982) [BP2] [BP4] (see also [BP6]). And already in 1970, the modern backpropagation algorithm itself (the real centerpiece of all this later applied work, also known as the reverse mode of automatic differentiation) was published by Linnainmaa [BP1] [BP4] [R7] (with a "clean derivation," of course). LeCun has never cited this, not even in recent work [DL3] [DLC]. In 1960, Kelley already had a precursor of the algorithm [BPA]. And many besides LeCun have worked "to speed up backpropagation algorithms", e.g., [DL1]. More on the history of backpropagation can be found at Scholarpedia [DL2] and in [BP4].

XX. ACM: Broadening the vision of neural networks: LeCun is also credited with developing a broader vision for neural networks as a computational model for a wide range of tasks, introducing in early work a number of concepts now fundamental in AI. For example, in the context of recognizing images, he studied how hierarchical feature representation can be learned in neural networks—a concept that is now routinely used in many recognition tasks.

Comment: However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965) [DEEP1-2] (and also Fukushima [CNN1] [DL2]) had long before LeCun. ACM may have been misled by the fact that LeCun has never cited Ivakhnenko, not even in his later survey [DL3] [DLC]. See also Sec. D & II & XIII & V.

XXI. ACM: Together with Leon Bottou, he proposed the idea, used in every modern deep learning software, that learning systems can be built as complex networks of modules where backpropagation is performed through automatic differentiation. They also proposed deep learning architectures that can manipulate structured data, such as graphs.

Comment: What does ACM mean by "modules"? Neuron-like elements? Bigger modules? Anyway, LeCun et al. neither cited the origins [BP1] (1970) of this widely used type of automatic differentiation for differentiable networks of modules [DL2] [BP4] [BP5] [DLC] nor a computer program (1980) for automatically deriving and implementing backpropagation for such systems [S80]. See also Sec. XIX & XII.

And "deep learning architectures that can manipulate structured data, such as graphs" were proposed by Sperduti & Goller & Küchler in the 1990s [GOL] [KU] [SP93-97] before LeCun who did not cite them. See also Pollack's even earlier relevant work [PO87] [PO90].

(And "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993) [UN2]. For example, our adaptive subgoal generators (1991) [HRL0] [HRL1] [HRL2] were trained through end-to-end-differentiable chains of such modules—see Sec. 10 of [MIR]. Same for our "Planning and Reinforcement Learning with Recurrent Neural World Models" (1990)—see Sec. 11 of [MIR]. Same for my fast weight systems consisting of chains of several modules (since 1991)—see Sec. 8 of [MIR].)

Concluding Remarks

In the hard sciences, the only things that count are the facts. Science is not democratic. If 100 persons claim one thing, and only one person claims the opposite, but she can back it up through facts, then she wins. Compare "100 Authors against Einstein" [AH1].

The deontology of science enforces proper scientific standards and behavior when it comes to identifying prior art and assigning credit. Unlike politics, science is immune to ad hominem attacks [AH2] [AH3] true to the motto: "If you cannot dispute a fact-based message, attack the messenger himself" [HIN]. Science has a well-established way of dealing with plagiarism and priority disputes, based on facts such as time stamps of publications and patents. Sometimes it may take a while to settle disputes, but in the end, the facts always win. As long as the facts have not yet won it's not yet the end. No fancy award can ever change that [HIN].

Hinton & LeCun & Bengio and their co-workers have contributed useful improvements of deep learning methods, e.g., [CNN2] [CDI] [LAN] [CNN4] [RMSP] [XAV] [ATT14] [CAPS]. But their most visible work (praised by ACM) mainly helped to popularize methods created by other researchers whom they did not cite, not even in later surveys (e.g., Sec. II & V & XII & XIX & XXI & XIII & XIV & XI & XX). My lab is especially affected by ACM's misleading statements (e.g., Sec. I & A & B & C & D & XVII & VI & XVI). As emphasized earlier [DLC] [HIN]: "The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)." If one "re-invents" something that was already known, and only becomes aware of it later, one must at least clarify it later, and correctly give credit in every related follow-up paper or presentation.

It is a sign of our field's immaturity that popularizers are sometimes still credited for the creations of other researchers whom they ignored. Of course, ACM (or anyone for that matter) is free to hand out awards to anybody, but one should not decorate anybody for work based on unmentioned contributions of others. In the interest of the reputation of the Turing Award, ACM should revise its statements. Else others will. Similar for scientific journals, which "need to make clearer and firmer commitments to self-correction" [SV20].

Could it be that seemingly unbiased award committees are actually affected by PR efforts in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video [VID2] is thanking Hinton for speech recognition and machine translation, although both were actually done (at production time of the video) on billions of smartphones by deep learning methods developed in my labs in Germany and Switzerland (LSTM & CTC, Sec. A) long before Hinton's less powerful methods. Similarly, in 2016, the NY Times published an article [NYT3] about the new, greatly improved, LSTM-based Google Translate without even mentioning our LSTM (instead featuring Hinton who had little to do with it), although Google's original 2016 paper on Google Translate [WU] mentions LSTM over 50 times (see Sec. B). In ad hominem style [AH2] [AH3], LeCun stated in the NY Times that "Jürgen ... keeps claiming credit he doesn't deserve for many, many things" [NYT1], without providing a single example. LeCun also called the GANs of Bengio's team [GAN1] "the coolest idea in machine learning in the last twenty years" without mentioning that GANs are a special case of my work in 1990 [AC90, AC90b] [AC20] [R2]. According to Bloomberg [AV2], Bengio has simply "denied my claims" without backing up his denial by any facts; compare Sec. XVII.

LBH have cited and otherwise supported each other through interviews and other PR at the expense of the true pioneers. Apparently this has earned them many citations, which is just another sign that citation counts are poor indicators of truly pioneering work—see Sec. XVII. As I pointed out in Nature (2011) [NAT1]: like the less-than-worthless collateralized debt obligations that drove the 2008 financial bubble, citations are easy to print and inflate, providing an incentive for professors to maximize citation counts instead of scientific progress—witness how relatively unknown scientists can now collect more citations than the most influential founders of their fields.

Nevertheless, many of my critical comments above do address highly cited work. Our [LSTM1] has got more citations than any paper by Bengio or LeCun [R5]. Hinton's most cited paper (2012) is the one on GPU-based CNNs [GPUCNN4] [R5]. It follows our earlier work on supervised deep NNs (2010) [MLP1] (which abandoned the unsupervised pre-training for deep NNs introduced by myself [UN0-UN3] and later championed by Hinton [UN4] [VID1]; Sec. D). Hinton [GPUCNN4] (2012) characterizes our deep and fast CNN "DanNet" (2011) [GPUCNN1-3] as "somewhat similar"—DanNet won 4 computer vision contests before Hinton's AlexNet won one [R6]; see Sec. D, XIV. Hinton's 2nd most cited paper [RUM] [R5] is the one on experiments with backpropagation (note that in 2019 his Google Scholar page greatly exaggerated the citation count of [RUM], adding citations for a book by Rumelhart & McClelland [R5]). Backpropagation is a previously invented method [BP1] whose origins Hinton did not cite, not even in later surveys [R7]; see Sec. XII. His nets learned internal representations two decades after the nets of Ivakhnenko whom he has never cited [DEEP1-2] [R7-R8]; see Sec. II, XIII. Bengio's 2nd most cited research paper is the one on GANs (2014) [GAN1], a special case of my artificial curiosity (1990) [AC90, AC90b] [AC20] [R2] which he did not cite; see Sec. XVII. As of 2019, the paper with the most citations per year is the one on ResNet (2015) [HW2] [R5], a special case of our earlier Highway Net [HW1] [HW3], the first working feedforward NN with over 100 layers; see Sec. D. Hinton's highly cited papers on unsupervised pre-training for deep NNs (2006-) [UN4] were preceded by ours (1991-) by 15 years, but he did not cite them [R4]—see Sec. II & III and [HIN] (Sec. II). His papers on dropout and rectified neurons were preceded by Hanson's [Drop1] and v. d. Malsburg's [CMB] by decades, but he did not cite them—see Sec. XIV. Consult the Executive Summary and Sec. I-XXI of this critique for more.

So virtually all the algorithms that have attracted many citations in the recent deep learning revolution have their conceptual and technical roots in my labs in Munich & Lugano, apart from the old basic principles of deep learning MLPs (Sec. II, XX) since 1965 [DEEP1-2] and backpropagation (Sec. XIX, XII, 1960-70) [BPA] [BP1] and convolutional NNs (Sec. XVIII, D) since 1979 [CNN1-4]. Here is an overview of relevant work compressed into a few lines that link to subsections of the present article: Our LSTM (1990s, Sec. A, B; also for RL, 2003-, Sec. C) → our Highway Net (May 2015) → ResNet (Dec 2015, Sec. D). Our adversarial Artificial Curiosity (1990) → GANs (2010s, Sec. XVII). We abandoned our own unsupervised pre-training of deep NNs (1991, Sec. II & III) for recurrent NNs in the 1990s → our LSTM, Sec. A-C and for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012) → our Highway Net → ResNet (Sec. D). Our DanNet brought superior computer vision (2011, Sec. D, XVIII) & medical diagnosis (2012, Sec. VII, XVIII) and many other applications [DEC] (Sec. 2). Our LSTM brought superior speech recognition (with our CTC, 2007-15, Sec. A) & machine translation (2016, Sec. B) & robotics & video game players (2018-19, Sec. C) and many other applications [DEC] (Sec. 1). In fact, our methods and conceptual foundations shaped most of the application areas mentioned by ACM, e.g., Sec. I, A, B, C, D, VII, XVIII.

As mentioned in Sec. 21 of ref [MIR], LBH's survey does not make clear [DLC] that deep learning was invented outside of the Anglosphere. It started in 1965 in the Ukraine (back then the USSR) with the first nets of arbitrary depth that really learned [DEEP1-2] [R8]. Five years later, modern backpropagation was published "next door" in Finland (1970) [BP1]. The basic deep convolutional NN architecture (now widely used) was invented in the 1970s in Japan [CNN1], where NNs with convolutions were later (1987) also combined with "weight sharing" and backpropagation [CNN1a]. We are standing on the shoulders of these authors and many others—see 888 references in ref [DL1]. Our own work since the 1980s mostly took place in Germany and Switzerland.

Unfortunately, LBH's frequent failures to credit essential prior work by others cannot serve as a role model for PhD students who are told by their advisors to perform meticulous research on prior art, and to avoid at all costs the slightest hint of plagiarism. It is worrisome that the 2018 Turing award seems to reward LBH for this behavior. I encourage all students to ignore the award and keep doing what's right.

Yes, this critique is also an implicit critique of certain other awards to LBH, e.g., [HIN]. It is also related to some of the most popular posts and comments of 2019 at reddit.com/r/MachineLearning, the largest machine learning forum, which had over 800k subscribers back then. See, e.g., posts [R1-R11], many of them influenced by [MIR] (although my name is frequently misspelled).

Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas [HIN], as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation [NASC1-2], the telephone [NASC3], the computer [NASC4-7], resilient robots [NASC8], and scientists of the 19th century [NASC9].

As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away." It is fun to speculate how future supersmart AI scientists and AI historians equipped with artificial curiosity [SA17] [AC90-AC20] [PP-PP2] will be fascinated by their own roots, and how they will rummage through all available data (old papers, email messages, videos, etc.) to fully understand every little detail of their origins in human civilization. However, today's scientists won't have to wait for AI historians to establish proper credit assignment. It is easy enough to do the right thing right now.

Acknowledgments

Thanks to several expert reviewers for useful comments. Since science is about self-correction, let me know at juergen@idsia.ch if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. Many additional relevant publications can be found in my publication page and my arXiv page.

200+ References (mostly taken from [MIR], [DEC], [HIN])

[MIR] J. Schmidhuber (2019). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020.

[DEC] J. Schmidhuber (02/20/2020). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s.

[HIN] J. Schmidhuber (2020). Critique of Honda Prize for Dr. Hinton.

[T19] ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only).

[T20] J. Schmidhuber (2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun: http://people.idsia.ch/~juergen/critique-turing-award-bengio-hinton-lecun.html. (The present critique with all the WWW links.)

[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.

[Drop1] Hanson, S. J. (1990). A Stochastic Version of the Delta Rule. Physica D, 42, 265-272. (Compare preprint arXiv:1808.03578, 2018.)

[BW] H. Bourlard, C. J. Wellekens (1989). Links between Markov models and multilayer perceptrons. NIPS 1989, p. 502-510.

[BRI] Bridle, J.S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden Markov Model Interpretation, Speech Communication, vol. 9, no. 1, pp. 83-92.

[BOU] H Bourlard, N Morgan (1993). Connectionist speech recognition. Kluwer, 1993.

[HYB12] Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97.

[CMB] C. v. d. Malsburg (1973). Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. [See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.]

[BM] D. Ackley, G. Hinton, T. Sejnowski (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1):147-169.

[L20] W. Lenz (1920). Beitraege zum Verstaendnis der magnetischen Eigenschaften in festen Koerpern. Physikalische Zeitschrift, 21: 613-615.

[I25] E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925.

[K41] H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.

[W45] G. H. Wannier (1945). The Statistical Problem in Cooperative Phenomena. Rev. Mod. Phys. 17, 50.

[G63] R. J Glauber (1963). Time-dependent statistics of the Ising model. Journal of Mathematical Physics, 4(2):294-307, 1963.

[SK75] D. Sherrington, S. Kirkpatrick (1975). Solvable Model of a Spin-Glass. Phys. Rev. Lett. 35, 1792, 1975.

[H86] J. L. van Hemmen (1986). Spin-glass models of a neural network. Phys. Rev. A 34, 3435, 1 Oct 1986.

[H88] H. Sompolinsky (1988). Statistical Mechanics of Neural Networks. Physics Today 41, 12, 70, 1988.

[S93] D. Sherrington (1993). Neural networks: the spin glass approach. North-Holland Mathematical Library, vol 51, 1993, p. 261-291.

[CDI] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation 14.8 (2002): 1771-1800.

[LAN] J. L. Ba, J. R. Kiros, G. E. Hinton. Layer Normalization. arXiv:1607.06450, 2016.

[RMSP] T. Tieleman, G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.

[CAPS] S. Sabour, N. Frosst, G. E. Hinton (2017). Dynamic routing between capsules. Proc. NIPS 2017, pp. 3856-3866.

[ATT14] D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014-16. Preprint arXiv/1409.0473, 2014-16.

[XAV] X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Intl. Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.

[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation. TR No. ICS-8506, California Univ San Diego La Jolla Inst for Cognitive Science. Later version published as: Learning representations by back-propagating errors. Nature, 323, p. 533-536 (1986).

[S20] T. Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. PNAS, January 28, 2020. Link.

[M69] M. Minsky, S. Papert. Perceptrons (MIT Press, Cambridge, MA, 1969).

[GOL] C. Goller & A. Küchler (1996). Learning task-dependent distributed representations by backpropagation through structure. Proceedings of International Conference on Neural Networks (ICNN'96). Vol. 1, p. 347-352 IEEE, 1996. Based on TR AR-95-02, TU Munich, 1995.

[KU] A. Küchler & C. Goller (1996). Inductive learning in symbolic domains using structure-driven recurrent neural networks. Lecture Notes in Artificial Intelligence, vol 1137. Springer, Berlin, Heidelberg.

[SP93] A. Sperduti (1993). Encoding Labeled Graphs by Labeling RAAM. NIPS 1993: 1125-1132

[SP94] A. Sperduti (1994). Labelling Recursive Auto-associative Memory. Connect. Sci. 6(4): 429-459 (1994)

[SP95] A. Sperduti (1995). Stability properties of labeling recursive auto-associative memory. IEEE Trans. Neural Networks 6(6): 1452-1460 (1995)

[SPG95] A. Sperduti, A. Starita, C. Goller (1995). Learning Distributed Representations for the Classification of Terms. IJCAI 1995: 509-517

[SPG96] A. Sperduti, D. Majidi, A. Starita (1996). Extended Cascade-Correlation for Syntactic and Structural Pattern Recognition. SSPR 1996: 90-99

[SPG97] A. Sperduti, A. Starita (1997). Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks 8(3): 714-735, 1997.

[PO87] J. B. Pollack. On Connectionist Models of Natural Language Processing. PhD thesis, Computer Science Department, University of Illinois, Urbana, 1987.

[PO90] J. B. Pollack. Recursive Distributed Representations. Artificial Intelligence, 46(1-2):77-105, 1990.

[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.

[NASC2] J. Schmidhuber. Zooming in on aviation history. Correspondence, Nature, vol 566, p 39, 7 Feb 2019.

[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.

[NASC4] J. Schmidhuber. Turing: Keep his work in perspective. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.

[NASC5] J. Schmidhuber. Turing in Context. Letter, Science, vol 336, p 1639, June 2012. (On Gödel, Zuse, Turing.) See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)

[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.

[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004

[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.

[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.

Relevant threads with many comments at reddit.com/r/MachineLearning, the largest machine learning forum with over 800k subscribers in 2019 (note that my name is often misspelled):

[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award.

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.

[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.

[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.

[R9] Reddit/ML, 2019. We find it extremely unfair that Schmidhuber did not get the Turing award. That is why we dedicate this song to Juergen to cheer him up.

[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton

[AH1] Hentschel K. (1996) A. v. Brunn: Review of "100 Authors against Einstein" [March 13, 1931]. In: Hentschel K. (eds) Physics and National Socialism. Science Networks—Historical Studies, vol 18. Birkhaeuser Basel. Link.

[AH2] F. H. van Eemeren, B. Garssen & B. Meuffels. The disguised abusive ad hominem empirically investigated: Strategic manoeuvring with direct personal attacks. Journal Thinking & Reasoning, Vol. 18, 2012, Issue 3, p. 344-364. Link.

[AH3] D. Walton (PhD Univ. Toronto, 1972), 1998. Ad hominem arguments. University of Alabama Press.

[AOI] M. Ford. Architects of Intelligence: The truth about AI from the people building it. Packt Publishing, 2018. (Preface to German edition by J. Schmidhuber.)

[DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More.

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444. HTML.

[DL4] J. Schmidhuber, 2017. Our impact on the world's most valuable public companies: 1. Apple, 2. Alphabet (Google), 3. Microsoft, 4. Facebook, 5. Amazon ... HTML.

[DLC] J. Schmidhuber, 2015. Critique of Paper by "Deep Learning Conspiracy" (Nature 521 p 436). June 2015. HTML.

[DL6] F. Gomez and J. Schmidhuber. Co-evolving recurrent neurons learn deep memory POMDPs. In Proc. GECCO'05, Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. PDF.

[DL7] "Deep Learning ... moving beyond shallow machine learning since 2006!" Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020), referring to Hinton's [UN4] and Bengio's [UN5] unsupervised pre-training for deep NNs [UN4] (2006) although this type of deep learning dates back to 1991 [UN1-2]. Compare Sec. II & XVII & III.

[AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This Man Is the Godfather the AI Community Wants to Forget. Business Week, Bloomberg, May 15, 2018.

[AV2] A. Vance. Apple and Its Rivals Bet Their Futures on These Men's Dreams. Business Week, Bloomberg, May 17, 2018.

[NYT1] NY Times article by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'

[NYT3] NY Times article by G. Lewis-Kraus, Dec. 14, 2016: The Great A.I. Awakening

[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16]. GoogleTechTalk, 2007. [Quote: "Nobody in their right mind would ever suggest" to use plain backpropagation for training deep networks. But our [MLP1] showed that unsupervised pre-training is not necessary to train deep NNs.]

[VID2] Bloomberg Hello World. The Rise of AI. Youtube video, 2018.

[MC43] W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.

[R58] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386.

[R61] Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.

[R62] Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.

[ROB] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[CUB0] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Northeastern University, College of Computer Science, 1989.

[NHE] J. Schmidhuber. The Neural Heat Exchanger. Oral presentations since 1990 at various universities including TUM and the University of Colorado at Boulder. Also in S. Amari, L. Xu, L. Chan, I. King, K. Leung, eds., Proceedings of the Intl. Conference on Neural Information Processing (1996), pages 194-197, Springer, Hongkong. Link.

[HEL] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7:889-904, 1995.

[ATT0] J. Schmidhuber and R. Huber. Learning to generate focus trajectories for attentive vision. Technical Report FKI-128-90, Institut für Informatik, Technische Universität München, 1990. PDF.

[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990. PDF. More.

[ATT2] J. Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52-61. San Mateo, CA: Morgan Kaufmann, 1990. PS. (PDF.)

[ATT3] H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010.

[HRL0] J. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, 1990. PDF.

[HRL1] J. Schmidhuber. Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990. HTML & images in German.

[HRL2] J. Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196-202. MIT Press, 1992. PDF. HTML & images in German.

[HRL3] P. Dayan and G. E. Hinton. Feudal Reinforcement Learning. Advances in Neural Information Processing Systems 5, NIPS, 1992.

[HRL4] M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997. PDF.

[UN0] J. Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF.

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991 [UN0]. PDF. [First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding. Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.]

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. [An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised pre-training for a stack of recurrent NN can be found here (depth > 1000).]

[UN3] J. Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF.

[UN5] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. Proc. NIPS 06, pages 153-160, Dec. 2006.

[SNT] J. Schmidhuber, S. Heil (1996). Sequential neural text compression. IEEE Trans. Neural Networks, 1996. PDF. (An earlier version appeared at NIPS 1995.)

[NPM] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, p 1137-1155, 2003.

[HB96] S. El Hihi, Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. NIPS, 1996.

[YB20] Y. Bengio. Notable Past Research. WWW link (retrieved 15 May 2020). Local copy (plain HTML only).

[CW] J. Koutnik, K. Greff, F. Gomez, J. Schmidhuber. A Clockwork RNN. Proc. 31st International Conference on Machine Learning (ICML), p. 1845-1853, Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981. [First neural network with fast weights or dynamic links.]

[FASTb] G. E. Hinton, D. C. Plaut. Using fast weights to deblur old memories. Proc. 9th annual conference of the Cognitive Science Society (pp. 177-186), 1987.

[FAST0] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, March 1991. PDF. [First neural network with end-to-end-differentiable control of fast weights. More.]

[FAST1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. PDF. More.

[FAST2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. More. [First neural network with fast weight control through tensor-like outer products. Designed to learn so-called "internal spotlights of attention" in end-to-end-differentiable fashion.]

[FAST3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @NIPS 2017, Long Beach, CA, USA.

[FAST3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. Preprint: arXiv:1811.12143. PDF.

[FASTMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992. More.

[FASTMETA2] J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 446-451. Springer, 1993. PDF. More.

[FASTMETA3] J. Schmidhuber. An introspective network that can learn to run its own weight change algorithm. Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pages 191-195. IEE, 1993. More.

[FAST4a] J. Ba, G. Hinton, V. Mnih, J. Z. Leibo, C. Ionescu. Using Fast Weights to Attend to the Recent Past. NIPS 2016. PDF.

[FAST5] F. J. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In W. Duch et al. (Eds.): Proc. ICANN'05, LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005. PDF. HTML overview.

[KO2] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857-873, 1997. PDF.

[CO1] J. Koutnik, F. Gomez, J. Schmidhuber (2010). Evolving Neural Networks in Compressed Weight Space. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2010), Portland, 2010. PDF.

[CO2] J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez. Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), Amsterdam, July 2013. PDF.

[CO3] R. K. Srivastava, J. Schmidhuber, F. Gomez. Generalized Compressed Network Search. Proc. GECCO 2012. PDF.

[DM1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with Deep Reinforcement Learning. Tech Report, 19 Dec. 2013, arxiv:1312.5602.

[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p 529, 26 Feb. 2015. Link.

[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019. [Alphastar has a "deep LSTM core."]

[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.

[OAI1a] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. arXiv:1808.00177 (PDF).

[OAI2] OpenAI et al. (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arxiv:1912.06680. [An LSTM composes 84% of the model's total parameter count.]

[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. [An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.]

[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More.

[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF. More.

[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996. PDF. More.

[S59] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 3:210-229, 1959.

[H90] W. D. Hillis. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.

[GAN0] O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context. Blog post, Internet Archive, 2010

[GAN1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.

[GOD] Kurt Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931.

[CHU] A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the American Mathematical Society. Also in American Journal of Mathematics, 58(2), 345-363 (1 Apr 1936).

[TUR] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267. Received 28 May 1936. Errata appeared in Series 2, 43, pp 544-546 (1937).

[PHD] J. Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem). Dissertation, Institut für Informatik, Technische Universität München, 1990. PDF. HTML.

[AC90] J. Schmidhuber. Making the world differentiable: On using fully recurrent self-supervised neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. More.

[AC90b] J. Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. More.

[AC91] J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Inst. f. Informatik, Tech. Univ. Munich, April 1991. PDF.

[AC91b] J. Schmidhuber. Curious model-building control systems. Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE, 1991. PDF.

[AC06] J. Schmidhuber. Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, 18(2): 173-187, 2006. PDF.

[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112. PDF. (More on artificial scientists and artists.)

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF.

[AC19] J. Schmidhuber. Unsupervised Minimax: Adversarial Curiosity, Generative Adversarial Networks, and Predictability Minimization. Preprint arXiv/1906.04493, 2019.

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[PP] J. Schmidhuber. POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Cognitive Science, 2013. ArXiv preprint (2011): arXiv:1112.5309 [cs.AI]

[PP1] R. K. Srivastava, B. Steunebrink, J. Schmidhuber. First Experiments with PowerPlay. Neural Networks, 2013. ArXiv preprint (2012): arXiv:1210.8385 [cs.AI].

[PP2] V. Kompella, M. Stollenga, M. Luciw, J. Schmidhuber. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artificial Intelligence, 2015.

[PLAN2] J. Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253-258, 1990. Based on [AC90]. More.

[PLAN3] J. Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San Mateo, CA: Morgan Kaufmann, 1991. PDF. Partially based on [AC90].

[PLAN4] J. Schmidhuber. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models. Report arXiv:1210.0118 [cs.AI], 2015.

[PLAN5] J. Schmidhuber. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.

[PLAN6] D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.) Preprint: arXiv:1809.01999. Github: World Models.

[BPTT1] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78.10, 1550-1560, 1990.

[BPTT2] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks. In: Backpropagation: Theory, architectures, and applications, p 433, 1995.

[PG] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.3-4: 229-256, 1992.

[BB2] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989. (The Neural Bucket Brigade—figures omitted!). PDF. HTML. Compare TR FKI-124-90, TUM, 1990. PDF.

[META1] J. Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diploma thesis, Tech Univ. Munich, 1987. HTML. Searchable PDF scan (created by OCRmypdf which uses LSTM).

[FM] S. Hochreiter and J. Schmidhuber. Flat minimum search finds simple nets. Technical Report FKI-200-94, Fakultät für Informatik, Technische Universität München, December 1994. PDF.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. [More on the Fundamental Deep Learning Problem.]

[VAN2] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), p 157-166, 1994.

[VAN3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE press, 2001. PDF.

[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. [The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's TensorFlow.]

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF.

[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013. PDF.

[LSTM9] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].

[LSTM10] A. Graves, D. Eck, N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.

[LSTM11] N. Beringer, A. Graves, F. Schiel, J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl. Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575-581, Springer-Verlag Berlin Heidelberg, 2005.

[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. [Received a GECCO best paper award.]

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-1340, 2001. PDF.

[LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk). PDF.

[LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM16] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015. Preprint: arXiv:1506.07452.

[LSTM17] J. A. Perez-Ortiz, F. A. Gers, D. Eck, J. Schmidhuber. Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks 16(2):241-250, 2003. PDF.

[LSTM-RL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.

[LSTMGRU] J. Chung, C. Gulcehre, K. Cho, Y. Bengio (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Preprint arXiv:1412.3555 [cs.NE].

[LSTMGRU2] G. Weiss, Y. Goldberg, E. Yahav. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. Preprint arXiv:1805.04908.

[LSTMGRU3] D. Britz et al. (2017). Massive Exploration of Neural Machine Translation Architectures. Preprint arXiv:1703.03906.

[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.

[S2S] I. Sutskever, O. Vinyals, Q. V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 3104-3112.

[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[DNC] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:7626, p 471, 2016.

[PDA1] G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol.II, pp. 649-653, Princeton University, Princeton, NJ, 1990.

[PDA2] M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Proc. NIPS 1993.

[MOZ] M. Mozer. A Focused Backpropagation Algorithm for Temporal Pattern Recognition. Complex Systems, 1989.

[GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team. Google voice search: faster and more accurate. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM.

[GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015.

[GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. Proc. 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019). IEEE, 2019.

[AM16] Blog of Werner Vogels, CTO of Amazon (Nov 2016): Amazon's Alexa "takes advantage of bidirectional long short-term memory (LSTM) networks using a massive amount of data to train models that convert letters to sounds and predict the intonation contour. This technology enables high naturalness, consistent intonation, and accurate processing of texts."

[NAS] B. Zoph, Q. V. Le. Neural Architecture Search with Reinforcement Learning. Preprint arXiv:1611.01578 (PDF), 2017.

[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016. [Based on LSTM which it mentions at least 50 times.]

[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016.

[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017).

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.

[HW1] Srivastava, R. K., Greff, K., Schmidhuber, J. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS'2015. The first working very deep feedforward nets with over 100 layers. Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates [LSTM2] for RNNs.) Resnets [HW2] are a special case of this where g(x)=t(x)=const=1. See also [HW3]. More.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a special case of highway nets [HW1], with g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017. Highway Nets perform roughly as well as ResNets on ImageNet. Highway layers are also often used for natural language processing, where the simpler residual layers do not work as well. More.
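The layer formula quoted in [HW1] and [HW2] can be illustrated with a minimal sketch. It is not code from any of the cited papers: the helper names (highway_layer, residual_layer, toy_h, toy_gate) are hypothetical, the transform h is a plain elementwise tanh without learned weights, and the coupling g = 1 - t is just one illustrative gate choice. It only shows that setting g(x) = t(x) = 1 turns the highway layer into a residual layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, g, t, h):
    """Compute g(x)*x + t(x)*h(x) elementwise for a list of floats x.

    g, t, h are functions mapping a list of floats to a list of the same length.
    """
    gx, tx, hx = g(x), t(x), h(x)
    return [gi * xi + ti * hi for gi, xi, ti, hi in zip(gx, x, tx, hx)]

def ones(x):
    # Constant gate equal to 1 everywhere (the residual special case).
    return [1.0] * len(x)

def toy_h(x):
    # Hypothetical transform: elementwise tanh, no learned weights.
    return [math.tanh(xi) for xi in x]

def toy_gate(x):
    # Hypothetical transform gate t(x): elementwise sigmoid, no learned weights.
    return [sigmoid(xi) for xi in x]

def residual_layer(x, h):
    # Highway layer with g(x) = t(x) = 1, i.e. x + h(x).
    return highway_layer(x, ones, ones, h)

x = [0.5, -1.0, 2.0]
residual_out = residual_layer(x, toy_h)
# Illustrative coupled gates: carry gate g(x) = 1 - t(x).
highway_out = highway_layer(
    x, lambda v: [1.0 - a for a in toy_gate(v)], toy_gate, toy_h
)
```

With these toy functions, residual_out equals x + tanh(x) elementwise, while highway_out interpolates between x and tanh(x) according to the gate.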

[THE17] S. Baker (2017). Which countries and universities are leading on AI research? Times Higher Education World University Rankings, 2017. Link.

[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. Preprint arXiv:1704.04760.

[CNN1] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. [The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. More in Scholarpedia.]

[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. [First application of backpropagation [BP1][BP2] and weight-sharing to a convolutional architecture.]

[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989.

[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989. PDF.

[CNN3] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. [A CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging [CNN1].]

[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007.

[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. [Speeding up traditional NNs on GPU by a factor of 20.]

[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. [Speeding up shallow CNNs on GPU by a factor of 4.]

[IM09] J. Deng, R. Socher, L.J. Li, K. Li, L. Fei-Fei (2009). ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248-255. IEEE, 2009.

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. [Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.]

[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011. PDF. HTML overview. [First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor. This led to massive interest from industry.]

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012. PDF.

[GPUCNN5] J. Schmidhuber. History of computer vision contests won by deep CNNs on GPU. March 2017. HTML. [How IDSIA used GPU-based CNNs to win four important computer vision competitions 2011-2012 before others started using similar approaches.]

[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011. PDF.

[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013. PDF.

[MGC] MICCAI 2013 Grand Challenge on Mitosis Detection, organised by M. Veta, M.A. Viergever, J.P.W. Pluim, N. Stathonikos, P. J. van Diest of University Medical Center Utrecht.

[GPUCNN8] J. Schmidhuber. First deep learner to win a contest on object detection in large images; first deep learner to win a medical imaging contest (2012). HTML. [How IDSIA used GPU-based CNNs to win the ICPR 2012 Contest on Mitosis Detection and the MICCAI 2013 Grand Challenge.]

[SCAN] J. Masci, A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.

[ST] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012. PDF.

[DIST1] J. Schmidhuber, 1991. See [UN1].

[DIST2] O. Vinyals, J. A. Dean, G. E. Hinton. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531 [stat.ML], 2015.

[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. [Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.]

[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.

[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link.

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. [Extending thoughts in his 1974 thesis.]

[BP4] J. Schmidhuber. Who invented backpropagation? More in [DL2].

[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BP6] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. [See Section 3.1 on using gradient descent for learning in multilayer networks.]

[S80] B. Speelpenning (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.

[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. [First working Deep Learners with many layers, learning internal representations.]

[DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.

[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011. HTML.

[SA17] J. Schmidhuber. Falling Walls: The Past, Present and Future of Artificial Intelligence. Scientific American, Observations, Nov 2017.