How 3 Turing Awardees Republished Key Methods and Ideas Whose Creators They Failed to Credit


Jürgen Schmidhuber
Pronounce: You_again Shmidhoobuh
Technical Report IDSIA-23-23, IDSIA
AI Blog
Twitter: @SchmidhuberAI
14 December 2023



This write-up is meant to correct an inaccurate history of Artificial Intelligence (AI) propagated by recent uninformed news articles, posts in social media, and a large language model. Most of its statements are taken from a less streamlined report[T22] that has been reviewed on relevant AI mailing lists, benefiting from feedback by many experts and well-known AI pioneers. The piece is aimed at people who are not aware of the numerous AI priority disputes, but are willing to check the facts.

Abstract. It is well known that plagiarism can be either "unintentional" or "intentional or reckless."[PLAG1-6] A paper in Nature argues that "intelligent plagiarists are the most dangerous" as they "rewrite previous findings in different words, purposely hiding the sources of their ideas, and then during subsequent years forcefully claim that they have discovered new phenomena."[FAKE2] The deontology of science requires: unintentional plagiarists must correct their publications through errata, and credit the original sources in all follow-up papers and presentations. Here I will show how 3 researchers who each got 1/3 of the ACM 2018 Turing Award for deep learning[R1] have frequently republished methods and concepts first described by my team without proper attribution, often without rectifying this in later publications. Their most visible work builds directly on ours.[MOST][T22] ACM's Code of Ethics and Professional Conduct[ACM18] states that computing professionals should "credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works." The awardees didn't; instead they credited each other (and collected citations) for inventions of other researchers.[T19,T22] I mention no fewer than seven of our direct priority disputes with Dr. Bengio (B1-B7), six with Dr. Hinton (H1-H6), and four with Dr. LeCun (L1-L4), all backed up by plenty of references.[T22] Many of the disputes have their roots in a time when compute was a million times more expensive than it is today. Even in recent years, however, the awardees have kept diminishing our work in a way that's incompatible with the basic universally-accepted rules of scientific integrity.[T22][LEC,LEC22c-d] Sec. 3 also mentions other scientists whose relevant work was not credited by the awardees.[T22]

Disclaimer. Following a recent paper,[LEC] I would like to start by acknowledging that I am not without a conflict of interest here. My seeking to correct the record will naturally seem self-interested; the truth of the matter is that it is. Much of the closely related work pointed to below was done in my lab, and I naturally wish that it be acknowledged and recognized. Setting my conflict aside, I ask the reader to study the original papers and judge the scientific content of these remarks for themselves, as I seek to set emotions aside and minimize bias as much as I am able.

Request for comments and errata. While the present write-up makes a clear case for its claims, the public has a right to be informed about any opposing views of the 3 awardees. I ask Drs. LeCun & Bengio & Hinton (LBH) for a written response that addresses the priority disputes one by one (LBH's response can be public if they wish). If they consider some claim to be factually inaccurate, they should clearly explain why—science is about self-correction,[SV20] and I'll happily publish any fact-based correction in a revised version of this report. The responses, however, should focus on concrete scientific facts and contain neither ad hominem arguments[AH1-3][HIN] nor merely unspecific general statements about the claims being wrong.[LEC] Furthermore, I ask LBH to publish errata or corrigenda for their affected papers, and to correctly credit the creators of relevant work in all future papers and presentations.[ACM18] Finally, I also invite others to send comments and additional relevant references (please send them directly to juergen@idsia.ch).

The "Policy for Honors Conferred by ACM"[ACM23] mentions that ACM "retains the right to revoke an Honor previously granted if ACM determines that it is in the best interests of the field to do so." So I ask ACM to evaluate the presented evidence and decide about further actions.

A fish rots from the head down: deep learning plagiarism


1. Overview

B: Priority disputes with Dr. Bengio (original date v Bengio's date):
B1: Generative adversarial networks or GANs (1990 v 2014)
B2: Vanishing gradient problem (1991 v 1994)
B3: Metalearning (1987 v 1991)
B4: Learning soft attention (1991-93 v 2014) for Transformers etc.
B5: Gated recurrent units (2000 v 2014)
B6: Auto-regressive neural nets for density estimation (1995 v 1999)
B7: Time scale hierarchy in neural nets (1991 v 1995)

H: Priority disputes with Dr. Hinton (original date v Hinton's date):
H1: Unsupervised/self-supervised pre-training for deep learning (1991 v 2006)
H2: Distilling one neural net into another neural net (1991 v 2015)
H3: Learning sequential attention with neural nets (1990 v 2010)
H4: NNs program NNs: fast weight programmers (1991 v 2016) and linear Transformers
H5: Speech recognition through deep learning (2007 v 2012)
H6: Biologically plausible forward-only deep learning (1989, 1990, 2021 v 2022)

L: Priority disputes with Dr. LeCun (original date v LeCun's date):
L1: Differentiable architectures / intrinsic motivation (1990 v 2022)
L2: Multiple levels of abstraction and time scales (1990-91 v 2022)
L3: Informative yet predictable representations (1997 v 2022)
L4: Learning to act largely by observation (2015 v 2022)

2. The awardees promoted each other while downplaying our contributions
3. Other researchers who were not credited by the awardees
4. Ad hominem attacks
5. Effects on other researchers
6. On fathers and godfathers
7. Discussion
8. Encouraging scientific integrity by a pillory list of plagiarism cases
9. Acknowledgments
10. 300+ partially annotated references (many more in the award-winning survey[DL1])


B. Priority disputes with Dr. Bengio since 1987

I will show how Dr. Bengio and his team have been republishing important parts of our work without proper attribution for over three decades. In fact, his most visible work is directly based on ours, but does not correctly cite it. In what follows, I will list no fewer than seven priority disputes with him and his co-workers. Most of the disputes go back to work we published over 3 decades ago when compute was a million times more expensive than today. Much of it has become important through the ongoing hardware acceleration.


B1. Generative adversarial networks or GANs (1990 v 2014)

In 2014, Bengio and his team published a paper on gradient-based Generative Adversarial Networks (GANs).[GAN1] This has become his most cited research paper. It fails to cite my original gradient-based NNs of 1990 which are effectively the same thing.[AC][AC90,AC90b][AC20][R2][T22] It also severely misrepresents—to its own advantage—my other gradient-based adversarial neural nets (NNs) of 1991.[PM0-2][AC20][T22] Bengio has never corrected this in later papers, despite the issue being raised with him on several occasions.[T22] There is a peer-reviewed journal publication on this dispute.[AC20]

To justify ACM's Turing Award for Bengio, ACM wrote:[T19,T22] "Since 2010, Bengio's papers on generative deep learning, in particular the Generative Adversarial Networks (GANs) developed with Ian Goodfellow, have spawned a revolution in computer vision and computer graphics. In one fascinating application of this work, computers can actually create original images, reminiscent of the creativity that is considered a hallmark of human intelligence."

Artificial Curiosity & Creativity Since 1990-91

However, GANs[GAN0-1] (2010-2014) are actually just a simple application[AC] of my artificial curiosity (AC) principle from 1990[AC90,90b][AC20] (see also surveys[AC09-10]). This principle is now widely used for exploration in Reinforcement Learning (RL) and for image synthesis.[GAN1][T22] (Note that any (un)supervised learning task can be formulated as an RL task, and many of them can be solved by simple gradient descent.) The principle works as follows: one NN—the controller—probabilistically generates outputs. Another NN—the world model—sees the outputs of the controller and predicts environmental reactions to them. Using gradient descent, the predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain.
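To make the principle concrete, here is a minimal sketch in modern PyTorch (my own toy illustration with an assumed toy environment, not code from the 1990 papers): a controller net generates outputs, a world-model net predicts the environment's reactions to them, and the two are trained by gradient descent with opposite objectives.

```python
# Minimal sketch of the adversarial curiosity principle described above
# (my own illustration; the environment and network sizes are arbitrary assumptions).
import torch
import torch.nn as nn

torch.manual_seed(0)

def environment(action):
    # Toy environment reaction: some fixed function unknown to the agent.
    return torch.sin(3.0 * action) + 0.5 * action

controller = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))   # generates outputs
world_model = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 1))  # predicts reactions

opt_c = torch.optim.Adam(controller.parameters(), lr=1e-3)
opt_m = torch.optim.Adam(world_model.parameters(), lr=1e-3)

for step in range(1000):
    noise = torch.randn(32, 4)                  # source of stochasticity for the controller
    action = controller(noise)
    reaction = environment(action).detach()

    # The world model minimizes its prediction error ...
    prediction = world_model(action.detach())
    model_loss = ((prediction - reaction) ** 2).mean()
    opt_m.zero_grad(); model_loss.backward(); opt_m.step()

    # ... while the controller's intrinsic reward is that same error:
    # it learns to produce outputs the world model cannot yet predict.
    prediction = world_model(action)
    controller_loss = -((prediction - reaction) ** 2).mean()
    opt_c.zero_grad(); controller_loss.backward(); opt_c.step()
```

Replacing the environment's reaction by a "real vs. fake" signal over a fixed data set turns this minimax setup into the GAN special case discussed below.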

4 years before Bengio's GAN paper,[GAN1] my well-known 2010 survey[AC10] summarised the generative adversarial NNs of 1990 as follows: a "neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" (which are minimized). GANs are a version of this where the trials are very short (like in bandit problems) and the environment simply returns the values "fake" or "real" depending on whether the controller's (or generator's) output is in a given set.[AC20][AC]

Other early adversarial machine learning settings since 1959[S59][H90] were very different—they neither involved unsupervised NNs, nor were about modeling data, nor used gradient descent.[AC20]

Bengio et al. neither cited the original work[AC90,90b][AC20] nor corrected their erroneous claims[GAN1] about my other adversarial NNs for "predictability minimization" (PM) which create disentangled representations of input data (1991).[PM1-2][AC20][R2][MIR](Sec. 5) In particular, they wrongly claim that PM is not a minimax game.

Predictability Minimization: unsupervised minimax game where one neural network minimizes the objective function maximized by another

The priority dispute above was picked up by the popular press, e.g., Bloomberg,[AV1] after a particularly notable encounter between me and Bengio's student Dr. Goodfellow at a N(eur)IPS conference. He gave a talk on GANs, encouraging people to ask questions. I did, addressing problems in their NIPS 2014 paper[GAN1] and some of the erroneous claims it made about my prior work.[AC20] Subsequent efforts to correct these issues in a joint paper did not work out. Goodfellow eventually admitted that my PM is adversarial (his paper[GAN1] still claims the opposite), but emphasized that it is not generative. However, the even earlier AC[AC90,90b][AC10][AC20] is both adversarial and generative (its generator contains probabilistic units[AC90] like those in StyleGANs[GAN2]). It is actually a generalized version of GANs. When the authors[GAN1] did not produce an erratum, I published a peer-reviewed one myself in the hope of correcting the record.[AC20]

Remarkably, Bengio was backed by LeCun who called GANs "the coolest idea in machine learning in the last twenty years" without mentioning that they are instances of my earlier work.[R2][AC20] See also dispute L1.

Even in their much more recent 2021 Turing lecture,[DL3a] LeCun & Bengio & Hinton (LBH) cite only Bengio's 2014 paper on Generative Adversarial Networks (GANs)[GAN0-1] without mentioning that GANs are instances of my Artificial Curiosity Principle of 1990.[AC90-20][MIR](Sec. 5) As recently as December 2022, LeCun said[LEC22c] that my claims about "GANs and other things [...] didn't turn out to be true." This claim has no justification and no references, and is both false and misleading.[LEC] As mentioned above, a previous peer-reviewed publication of mine[AC20] clearly demonstrates the correctness of my claim, and this work remains unchallenged. See Addendum III of [LEC].


B2. Vanishing gradient problem (1991 v 1994)

Sepp Hochreiter's Analysis of the Fundamental Deep Learning Problem (1991)

Deep learning is hard because of the Fundamental Deep Learning Problem identified and analyzed in 1991 by my first student Sepp Hochreiter in his diploma thesis, which I had the pleasure of supervising.[VAN1] He showed that deep NNs suffer from the now famous problem of vanishing or exploding gradients: in typical deep or recurrent networks, back-propagated error signals either shrink rapidly or grow out of bounds. In both cases, learning fails.
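A tiny numerical illustration of the effect (my own toy example, not Hochreiter's original analysis): in a deep chain of sigmoid units, the back-propagated error is multiplied at every layer by a weight and by the sigmoid's derivative (at most 0.25), so its magnitude typically decays exponentially with depth; with large weights it can instead explode.

```python
import numpy as np

np.random.seed(0)
depth = 50
x = 0.5
w = np.random.randn(depth)                  # one weight per layer in a simple chain
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Forward pass through a deep chain of sigmoid units.
activations = [x]
for k in range(depth):
    activations.append(sigmoid(w[k] * activations[-1]))

# Backward pass: the error signal is multiplied by w[k] * sigmoid'(.) at each layer.
delta = 1.0
for k in reversed(range(depth)):
    a = activations[k + 1]
    delta *= w[k] * a * (1.0 - a)
    if k % 10 == 0:
        print(f"layer {k:2d}: |delta| = {abs(delta):.3e}")  # typically shrinks toward 0
```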

3 years later, Bengio published another vanishing gradient analysis,[VAN2] without citing Hochreiter. A showdown at the 1996 N(eur)IPS conference where I defended Hochreiter's work[VAN1] settled this dispute in his favor.

However, even after a common publication,[VAN3] Bengio published papers[VAN4-5][XAV] that cited only his own 1994 paper but not Hochreiter's original work (1991). Disturbingly, this has apparently helped him to get more citations for vanishing gradients than Hochreiter—another sign that citation counts are poor indicators of truly pioneering work.[NAT1] (Margin note: Bengio stated[YB20] that in 2018 he "ranked as the most cited computer scientist worldwide"—the above illustrates what such citation counts are really worth.)

The deontology of science requires: if one "re-invents" something that was already known, and only becomes aware of it later, one must at least clarify this later,[DLC] and correctly give credit in all follow-up papers and presentations. Bengio didn't. Isn't failure to do so a form of intentional plagiarism, since it turns even unintentional plagiarism[PLAG1-6] into an intentional one?[FAKE2]


B3. Metalearning (1987 v 1991)

Metalearning or learning to learn is now a hot topic. The most widely used machine learning algorithms were invented and hardwired by humans. Can we also construct metalearning algorithms that can learn better learning algorithms, to build truly self-improving AIs without any limits other than the limits of computability and physics? I started this type of research in my 1987 diploma thesis.[META1][META] Bengio, however, suggested in public at N(eur)IPS 2019 that he did it before me, citing his much later 1991 paper.[R3]

metalearning machines that learn to learn (1987)

(Margin note: In 2003, I published a mathematically optimal, self-referential, meta-learning, universal problem solver making provably optimal self-improvements by re-writing its own computer code: the Gödel Machine[GM3-9] which extended my earlier work on self-modifying code.[METARL2-9] Decades later, Bengio's collaborator, Hinton (who has never published on this topic), started warning of this kind of research in interviews (2023).)


B4. Learning soft attention (1991-93 v 2014) for Transformers etc.

Recently, attention-based Transformers[TR1] have been all the rage, e.g., generating human-sounding texts through the famous ChatGPT.[GPT3] In March 1991, I published a first Transformer variant, now called the unnormalised Transformer with linearized self-attention[TR5-7][FWP0-1][FWP6][FWP] (see the tweet of 2022 for the 30-year anniversary). My "linear" Transformer was more efficient than the later "quadratic" Transformer variants,[TR1-2] with costs that scale linearly in input size rather than quadratically.

My so-called "Fast Weight Programmers" or "Fast Weight Controllers"[FWP0-1] separated storage and control like in traditional computers, but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion[PDA1-2][DNC]). The "self-attention" in more recent "quadratic" Transformer types[TR1-4] combines this with a projection and softmax,[FWP] using attention terminology like the one I introduced in a follow-up paper (1993).[FWP2][R4][ATT]

To justify a Turing Award for Bengio, ACM wrote[T19,T22] that Bengio's group "introduced a form of attention mechanism which led to breakthroughs in machine translation and form a key component of sequential processing with deep learning."

However, Bengio's work on soft "attention"[ATT14] failed to cite my much earlier original work of 1991-1993 on soft attention ("learning internal spotlights of attention"[FWP2]) and linear Transformers.[FWP,FWP0-2,6][ATT] Even LBH's recent 2021 Turing lecture[DL3a] cites only Bengio's much later work. While it has extra sections[DL3a] on Transformers[TR1-7][DLH] and self-supervised pre-training, it fails to clarify that over 3 decades ago we laid the foundations of Generative AI, introducing the principles of soft attention-based Transformers (1991; the "T" in "ChatGPT" stands for "Transformer"),[TR1-7][FWP0-1,6][DLH] self-supervised pre-training for deep NNs (1991; the "P" in "GPT" stands for "pre-trained"),[UN][UN0-3][MOST] and GANs (1990; now used for deepfakes).[AC90-20][DLH] Compare disputes B1, H4, H1.

26 March 1991: Neural nets learn to program neural nets with fast weights—like today's Transformer variants. 2021: New stuff!

Furthermore, in the 2010s, what the ACM called the key "breakthrough in machine translation"[T19,T22] was not due to Bengio but to the LSTM of our team.[LSTM0-6] It greatly improved Google Translate in 2016[GT16][S2S][WU][DL4] and, by 2017, Facebook's users made 30 billion LSTM-based translations per week.[FB17][DL4]


B5. Gated recurrent units (2000 v 2014)

Our impact on the world's most valuable public companies as of March 2017: Apple (#1), Alphabet (#2), Microsoft (#3), Amazon (#4)

Bengio has heavily used our LSTM,[LSTM1][R5] but for some reason he and his team invented the new name "gated recurrent units" (GRU, 2014)[LSTMGRU] for a variant of our vanilla LSTM architecture[LSTM2] (2000), which he did not cite, although our work[LSTM2] was the one that introduced these so-called gated recurrent units. He cited only the earlier 1997 LSTM,[LSTM1] which did not yet have recurrent units with "forget gates."[LSTM2]

Note also that long before Bengio's 2014 paper, in 2009, my team already automatically evolved lots of additional LSTM variants and topologies.[LSTM7] Of course, unlike Bengio, we did not change the name[FAKE2] of the basic method.

Footnote: GRU cells lack an important gate and can neither learn to count[LSTMGRU2] nor learn simple non-regular languages.[LSTMGRU2] According to Google Brain, they also do not work as well for challenging translation tasks.[LSTMGRU3]
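For reference, here is a minimal sketch of one step of a recurrent cell with input, forget, and output gates, my own simplified illustration in modern notation (peephole connections and other details of the vanilla LSTM[LSTM2] are omitted):

```python
import numpy as np

np.random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U, b):
    """One step of a simplified gated cell: W maps the input, U the recurrent state,
    b is the bias; the rows of the result are split among the gates."""
    n = h.shape[0]
    z = W @ x + U @ h + b                  # shape (4 * n,)
    i = sigmoid(z[:n])                     # input gate
    f = sigmoid(z[n:2 * n])                # forget gate, introduced in the 2000 vanilla LSTM
    o = sigmoid(z[2 * n:3 * n])            # output gate
    g = np.tanh(z[3 * n:])                 # cell candidate
    c_new = f * c + i * g                  # fixing f = 1 everywhere recovers the 1997 cell
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy usage with assumed sizes.
n, m = 3, 2
h, c = np.zeros(n), np.zeros(n)
W, U, b = np.random.randn(4 * n, m), np.random.randn(4 * n, n), np.zeros(4 * n)
for x in np.random.randn(5, m):            # process a short input sequence
    h, c = lstm_step(x, h, c, W, U, b)
```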


B6. Auto-regressive NNs for density estimation (1995 v 1999)

Bengio wrote[YB20] that in 1999 he "introduced, for the first time, auto-regressive neural networks for density estimation." To justify a Turing Award for Bengio, ACM even wrote:[T19][T22] "Bengio authored the landmark paper, "A Neural Probabilistic Language Model," that introduced high-dimension word embeddings as a representation of word meaning. Bengio's insights had a huge and lasting impact on natural language processing tasks including language translation, question answering, and visual question answering."

However, several years earlier, in 1995, we already had a similar, excellent, auto-regressive, neural probabilistic text model.[SNT] In 2003, Bengio[NPM] only briefly characterized it as "related" (see also Pollack's earlier work on embeddings of words and other structures[PO87][PO90]).

Furthermore, in the 2010s,[DEC] the central method in the fields of "language translation, question answering, and visual question answering" mentioned by ACM[T19][T22] was actually the LSTM of our team,[LSTM0-6][DL4] which Bloomberg called "arguably the most commercial AI achievement."[AV1][MIR](Sec. 4)

Even much later, in 2021, LBH still claimed[DL3a] that Bengio's team[NPM] first showed in 2002 on "real sentences" that "activity vectors can be used to model the structure inherent in a set of symbol strings by learning appropriate activity vectors for each symbol and learning non-linear transformations that allow the activity vectors that correspond to missing elements of a symbol string to be filled in." However, we showed this on real sentences already in 1995.[SNT]


B7. Time scale hierarchy in neural nets (1991 v 1995)

In 2020, Bengio claimed[YB20] that in 1995 he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue"[HB96] although my publications on a hierarchy of time scales to combat the vanishing gradients issue date back to 1991-93.[UN0-2][UN]

See also the dispute H1 on unsupervised/self-supervised pre-training for downstream deep learning, the dispute B2 on vanishing gradients, the dispute L2 on multiple levels of abstraction and time scales, and the dispute H2 on distilling neural networks. LeCun, Bengio, and Hinton (LBH) have republished several different ideas from the same 1991 paper without citing it.[UN0-1]


H. Priority disputes with Dr. Hinton (1990-)

Like Dr. Bengio, Dr. Hinton has republished important aspects of our work without proper attribution for over three decades. In what follows, I will list no fewer than six priority disputes with him. Like Bengio's most cited work, Hinton's most cited work builds directly on ours. As in Bengio's case, most of the disputes with Hinton go back to work we published over 3 decades ago, when compute was a million times more expensive than today.


H1. Deep un(self-)supervised pre-training (1991 v 2006)

At least until 2019, LBH's web site deeplearning.net advertised deep learning as "moving beyond shallow machine learning since 2006",[DL7] referring to Hinton's[UN4] and Bengio's[UN5] unsupervised layer-wise pre-training for deep NNs (2006), as if deep learning had started with this work. However, we had this type of deep learning already in 1991.[UN][UN1-2] Hinton & Bengio did not mention the prior work, not even in later surveys.[DL3,DL3a][T22] More on this below.

Background: today's most powerful NNs tend to be very deep, that is, they have many layers of neurons or many subsequent computational stages.[MIR] Before the 1990s, however, gradient-based training did not work well for deep NNs, only for shallow ones[DL1-2][DLH] (but see a 1989 paper[MOZ]). This deep learning problem was most obvious for recurrent NNs. Like the human brain, but unlike the more limited feedforward NNs (FNNs), RNNs have feedback connections. This makes RNNs powerful, general-purpose, parallel-sequential computers that can process input sequences of arbitrary length (think of human speech or videos). RNNs can in principle implement any program that can run on your laptop or any other computer in existence. If we want to build an Artificial General Intelligence (AGI), then its underlying computational substrate would be something more like an RNN than an FNN, since FNNs are fundamentally insufficient: RNNs and similar systems are to FNNs what general-purpose computers are to pocket calculators. In particular, unlike FNNs, RNNs can in principle deal with problems of arbitrary depth.[DL1] Before the 1990s, however, RNNs failed to learn deep problems in practice.[MIR]

First Very Deep Learner of 1991

To overcome this drawback through RNN-based "general deep learning," I built a self-supervised RNN hierarchy that learns representations at multiple levels of abstraction and multiple self-organizing time scales:[LEC] the Neural Sequence Chunker[UN0] (1991) or Neural History Compressor.[UN1] Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs (and therefore also targets) to the next RNN above. The resulting compressed sequence representations greatly facilitate downstream supervised deep learning such as sequence classification.
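A minimal sketch of the chunking principle (my own simplified illustration; in the original system the predictor at each level is an RNN trained by gradient descent, not the trivial rule used here): a lower level passes upward only the inputs it fails to predict, together with their timing, so the level above sees a much shorter, compressed sequence.

```python
def compress(sequence, predict):
    """Pass up only the inputs the lower-level predictor gets wrong,
    together with their positions, so the higher level sees a shorter sequence."""
    unexpected = []
    prev = None
    for t, x in enumerate(sequence):
        if prev is None or predict(prev) != x:
            unexpected.append((t, x))      # surprising input: forward it upward
        prev = x
    return unexpected

# Toy example: a predictor that simply expects the previous symbol to repeat.
sequence = list("aaaabaaaacaaaab")
level_1 = compress(sequence, predict=lambda prev: prev)
print(level_1)   # only 6 surprising events out of 15 inputs survive at the next level
```

Applying the same compression again to the surviving events yields the multi-level hierarchy of self-organizing time scales described above.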

Although desktop computers back then were about a million times slower than today, by 1993, the Neural History Compressor above was able to solve previously unsolvable "very deep learning" tasks of depth > 1000[UN2] (requiring more than 1,000 subsequent computational stages—the more such stages, the deeper the learning). In 1993, we even published a continuous version of the Neural History Compressor.[UN3]

More than a decade after this work,[UN1] Hinton published a similar unsupervised method for more limited feedforward NNs (FNNs), facilitating supervised learning by unsupervised pre-training of stacks of FNNs called Deep Belief Networks (DBNs).[UN4] The 2006 justification was essentially the one I used in the early 1990s for my RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.[HIN][T22][MIR] Hinton did not mention the 1991 work, not even in later surveys.[T22]

Bengio also published similar work (2006) without citing the original method,[UN5] not even in LBH's much later surveys (2015-2021),[DL3,DL3a][DLC] although both Hinton and Bengio knew it well (also from discussions by email). Even LBH's 2021 Turing Lecture[DL3a] dedicates an extra section to their unsupervised pre-training of deep neural networks (NNs) around 2006, without mentioning that I pioneered this class of methods in 1991.[UN-UN2]

Remarkably, no fewer than four of our priority disputes with LBH (H1, H2, B7, L2) are related to this work of 1991-92.[UN0-1][UN] Today, self-supervised pre-training is heavily used for famous applications such as ChatGPT—the "P" stands for "pre-trained," and the "T" for "Transformer." Note that my first Transformer variant (the unnormalised linear Transformer) also dates back to 1991;[FWP0-1,6][TR1-7][DLH] see disputes H4, B4.


H2. Distilling one neural net into another neural net (1991 v 2015)

The hierarchical internal representations of the neural history compressor above (see dispute H1) can be collapsed into a single recurrent NN (RNN), using my NN distillation procedure of 1991.[UN0-1][MIR] Here the knowledge of a teacher NN is "distilled" into a student NN, by training the student NN to imitate the behavior of the teacher NN (while also re-training the student NN on previously learned skills such that it does not forget them). This is widely used today.
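In modern notation, the core of the procedure looks roughly like this (my own minimal PyTorch sketch using today's teacher/student terminology, not the original 1991 implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
teacher = nn.Sequential(nn.Linear(10, 64), nn.Tanh(), nn.Linear(64, 5))  # e.g. the trained higher-level net
student = nn.Sequential(nn.Linear(10, 16), nn.Tanh(), nn.Linear(16, 5))  # the net the knowledge is distilled into

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
for step in range(2000):
    x = torch.randn(64, 10)                      # inputs (here random; in general, the training data)
    with torch.no_grad():
        target = teacher(x)                      # the teacher's behavior to be imitated
    loss = ((student(x) - target) ** 2).mean()   # train the student to imitate the teacher
    opt.zero_grad(); loss.backward(); opt.step()
    # (In the 1991 setup the student is also re-trained on previously learned
    #  skills so that it does not forget them.)
```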

Hinton republished NN distillation many years later in 2015.[DIST2][MIR][HIN][T22] Again, he did not cite my much earlier original work on this (1991),[UN0-1][UN] not even in his later patent application US20150356461A1.


H3. Learning sequential attention with neural nets (1990 v 2010)

In 1990, we showed how an NN can learn to steer an artificial fovea to find objects in visual scenes through sequences of saccades.[ATT0-1][ATT][MIR](Sec. 9)

Hinton was both reviewer and editor of my 1990 summary[ATT2] which summarised this work in its Section 5: the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller).

1990: end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning

Astonishingly, 20 years later, Hinton republished very similar work[ATT3] (2010) without mentioning ours (which he reviewed in 1990),[ATT1-2][ATT] claiming:[ATT3] "To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)."[MIR](Sec. 9)[R4] Even in later surveys, Hinton did not make any attempt to correct this or to recognize the 1990 work he undeniably was aware of.[T22]


H4. NNs program NNs: fast weight programmers (1991 v 2016)

Hinton's 2016 paper[FWP4a] did not make clear that much of it was a rehash of my papers on fast weight programmers (FWPs, 1991-93)[FWP0-2][FWP] and one of their motivations: in standard fully connected RNNs with n hidden units, the ratio between the numbers of adaptive parameters (weights) and temporary variables (unit activations) is O(n), but through FWPs we can bring it down to O(1).[FWP2][MIR](Sec. 8)

Hinton writes[FWP4a] that my 1993 paper[FWP2] did not implement "this method of achieving recursion," ignoring my even earlier papers[FWP0-1] which did describe experiments with such an implementation.

Even LBH's 2021 Turing Lecture[DL3a] refers to Hinton's 2016 paper on "a high-capacity, short-term memory" through fast weights without clarifying that this was first described in the 1991-93 papers on Fast Weight Programmers and unnormalised linear Transformers.[FWP0-1,6] (Note that earlier work on fast weights (1981-82)[FAST,FASTa-b][FWP] did not consider NNs programming the fast weight changes of other NNs.)

In fact, LBH's 2021 Turing Lecture[DL3a] talks a lot about Transformers[TR1-7][DLH] and their self-supervised pre-training, without mentioning that in 1991, I published both the first Transformer variant (the unnormalised linear Transformer)[TR1-7][FWP0-1,6][DLH] and self-supervised pre-training for deep NNs.[UN][UN0-3] Recall that the "T" in the famous "ChatGPT" stands for "Transformer," and the "P" stands for "pre-trained."[MOST]

As recently as 2022, Hinton[HIN22] asks: "For sequential data, is it possible to use fast weights to mimic a simplified Transformer?" Here again he cites only his own 2016 paper,[FWP4a] but not my original 1991 fast weight programmer,[FWP0] which does implement a Transformer variant, as was well known by 2022.[FWP6]

See also the dispute B4 on learning soft attention, the dispute H1 on self-supervised pre-training, and the dispute H6 on biologically plausible deep learning.


H5. Speech recognition through deep learning (2007 v 2012)

Hinton's 2012 publication on speech recognition[HYB12] did not cite our much earlier LSTM[LSTM0-6] trained by our Connectionist Temporal Classification (CTC, 2006).[CTC] CTC-LSTM was successfully applied to speech in 2007[LSTM4] (also with hierarchical LSTM stacks[LSTM14]). It became the first end-to-end neural speech recogniser to outperform the state of the art, dramatically improving Google's speech recognition.[GSR][GSR15][DL4] CTC-LSTM was very different from previous hybrid methods used since the late 1980s, which combined NNs with traditional approaches such as hidden Markov models (HMMs).[BW][BRI][BOU]

Hinton[HYB12] still used the old hybrid approach, without comparing it to CTC-LSTM. Later, however, when my former PhD student and postdoc Alex Graves joined Hinton's group, Hinton adopted our approach.[LSTM8]

Recurrent Neural Networks, especially LSTM

By the time the 2018 Turing Award was handed out, our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years.[GSR][GSR15-19][DL4] Similarly for Natural Language Processing (NLP) and machine translation, which were also revolutionized by LSTM, due to its ability to learn languages unlearnable by traditional models such as HMMs.[LSTM13] By 2016-17, both Google Translate[GT16]—whose whitepaper[WU] mentions LSTM over 50 times—and Facebook Translate[FB17] were based on two connected LSTMs,[S2S] one for incoming texts, and one for outgoing translations—much better than what existed before.[DL4] By 2017, Facebook's users made 30 billion LSTM-based translations per week.[FB17][DL4]

Our impact on the world's most valuable public companies (Google, Apple, Microsoft, Facebook, Amazon etc)

However, even LBH's much later 2021 Turing Lecture[DL3a] cites only Hinton et al.'s later work on speech recognition since 2009, without mentioning our earlier and vastly superior methods from 2007[LSTM4,14] based on LSTM[LSTM0-6] (1990s-2005) and CTC (2006).[CTC]


H6. Biologically plausible deep learning (1989, 1990, 2021 v 2022)

In 2022, Hinton proposed[HIN22] a biologically more plausible "forward-only" deep learning algorithm that—unlike backpropagation—is local in space and time.[BB2] While he did not directly plagiarize our earlier related proposals of 1989,[BB2] 1990,[NAN1-5][NHE][HEL] and 2021,[FWPMETA6] he did not cite them either. See also the overviews: [MIR](Sec. 15, Sec. 17, Sec. 18). See also the following tweets on this: tweet1, tweet2, tweet3.

In particular, Hinton's 2022 paper[HIN22] ignores our 2021 paper[FWPMETA6] on meta-learning "forward-only" learning algorithms implemented in LSTMs without a separate backward phase (see tweet2). Hinton even asks the closely related question[HIN22] whether it is "possible to use fast weights to mimic a simplified transformer," citing his own 2016 paper[FWP4a] but not my original 1991 Transformer variant[FWP0] providing the affirmative answer (reviewed in 2021).[FWP6] See also disputes H4 and B4.

Neural Economy / Neural Bucket Brigade


L. Priority disputes with Dr. LeCun (1990-)

Some of our priority disputes with Dr. LeCun are covered here.[LEC] In particular, years ago we published most of what LeCun, as late as 2022, called his "main original contributions":[LEC22a] neural nets that learn multiple time scales and levels of abstraction, generate subgoals, and use intrinsic motivation to improve world models and plan (1990); controllers that learn informative predictable representations (1997); etc.[LEC22a-c] This was also discussed in the media, on reddit, and on Hacker News. More on this below. Compare tweets of 7 Jul 2022 and 6 Dec 2023.


L1. Differentiable architectures / intrinsic motivation (1990 v 2022)

In 2022, LeCun[LEC22a] claimed that one of his "main original contributions" are "cognitive architectures in which all modules are differentiable and many of them are trainable," with "behavior driven through intrinsic motivation" (see LeCun's abstract).

1990: Planning & Reinforcement Learning with Recurrent World Models and Artificial Curiosity

He did not cite my differentiable 1990 neural architecture for online learning & planning (through what's now called "rollouts")[AC90-90b][PLAN2] which was the first with "intrinsic motivation" for a controller NN incentivised to improve a predictive world model NN. It was both generative and adversarial; its principles have been frequently cited, implemented, and used. The 2014 GAN cited by LeCun is a version thereof, as pointed out in a peer-reviewed publication.[AC20][R2] See also the dispute B1 with Bengio. This work extended feedforward NN-based system identification and control of the 1980s.[WER87-89][MUN87][NGU89][LEC]


L2. Multiple levels of abstraction and time scales (1990-91 v 2022)

In 2022, LeCun claimed that one of his "main original contributions"[LEC22a] is a "hierarchical architecture for predictive world models that learn representations at multiple levels of abstraction and multiple time scales." He did not mention that I implemented such an architecture 3 decades earlier through my 1991 self-supervised neural history compressor.[UN0-1]

Using predictive coding, the history compressor learns in a self-supervised fashion hierarchical internal representations of long sequences of data, to greatly facilitate downstream learning. These representations can be collapsed into a single recurrent NN (RNN), using my NN distillation procedure of 1991.[UN0-1][UN] Remarkably, Bengio and Hinton republished without attribution several ideas from the same 1991 paper[UN1][UN] in different contexts (see the disputes H1, H2, and B7).

Hierarchical Reinforcement Learning with neural subgoal generators (1990)

In his paper[LEC22a] LeCun also writes about predictive differentiable models "for hierarchical planning under uncertainty": "One question that is left unanswered is how the configurator can learn to decompose a complex task into a sequence of subgoals that can individually be accomplished by the agent. I shall leave this question open for future investigation."

Far from being left to future investigation, I published exactly this over 3 decades ago, in 1990: a controller NN gets extra command inputs of the form (start, goal). An evaluator NN (similar to LeCun's much later so-called "JEPA"[LEC22a]) learns to predict the expected costs of going from start to goal. A differentiable (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimizing intermediate subgoals.[HRL0-1] Compare a tweet of 6 Dec 2023.
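A minimal sketch of that 1990 recipe in modern autodiff notation (my own toy illustration, not the original code; network sizes and data are arbitrary assumptions): a differentiable evaluator predicts the cost of going from one state to another, and the subgoal generator is trained by gradient descent, through frozen copies of the evaluator, to emit an intermediate subgoal that minimizes the total predicted cost of start -> subgoal -> goal.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 6   # state dimension (toy choice)

evaluator = nn.Sequential(nn.Linear(2 * d, 32), nn.Tanh(), nn.Linear(32, 1))  # predicts cost(start, goal)
subgoaler = nn.Sequential(nn.Linear(2 * d, 32), nn.Tanh(), nn.Linear(32, d))  # proposes a subgoal

def predicted_cost(a, b):
    return evaluator(torch.cat([a, b], dim=-1))

# Assume the evaluator has already been trained to predict costs; here we only
# show how the subgoal generator is trained through (frozen copies of) it.
for p in evaluator.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(subgoaler.parameters(), lr=1e-3)
for step in range(1000):
    start, goal = torch.randn(32, d), torch.randn(32, d)
    sub = subgoaler(torch.cat([start, goal], dim=-1))
    # Total predicted cost of the decomposed plan start -> subgoal -> goal:
    loss = (predicted_cost(start, sub) + predicted_cost(sub, goal)).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```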


L3. Informative yet predictable representations (1997 v 2022)

In 2022, LeCun claimed[LEC22a] that one of his "main original contributions" is a "self-supervised learning paradigm that produces representations that are simultaneously informative and predictable." He did not mention that this was implemented in the context of control by my 1997 system.[AC97][AC99][AC02] Instead of predicting all details (e.g. pixels) of future inputs,[AC90-95] it can ask arbitrary abstract questions with computable answers in what LeCun calls "representation space." Two learning, reward-maximizing adversaries called "left brain" and "right brain" play a zero-sum game, trying to surprise each other, occasionally betting on different yes/no outcomes of such computational experiments, until the outcomes become predictable and boring. See also the dispute B1 with Bengio on my even earlier generative adversarial system of 1990.

The "chunker NN" of my earlier but less general neural history compressor (1991)[UN0-1] also produces "representations that are simultaneously informative and predictable." An "automatizer NN" learns to predict them. Thus the chunker's knowledge gets distilled into the automatizer. This facilitates downstream deep learning.[UN] See also the dispute H1 on unsupervised/self-supervised pre-training for downstream deep learning, the dispute L2 on multiple levels of abstraction and time scales, and the dispute H2 on distilling neural networks, which are all about the same paper.[UN0-1]

Yet another non-generative, supervised neural network of 1992 also discovers informative yet predictable representations,[PMax0-1] complementing my earlier 1991 work on adversarial NNs that learn to create informative yet unpredictable internal representations[PM0-2] (see B1).


L4. Learning to act largely by observation (2015 v 2022)

In 2022, LeCun emphasized NNs that "learn to act largely by observation." He did not mention that we addressed this a long time ago, e.g., in 2015.[PLAN4] A neural world model M may be good at predicting some things but uncertain about others. For example, M may have been trained to predict/encode lots of YouTube videos showing humans and robots interacting with the world. A neural controller C wants to improve its skills by extracting relevant information from M. C learns to become a prompt engineer that maximizes its objective function by learning to query (a copy of) M through sequences of self-invented questions (activation patterns) and to interpret the answers (more activation patterns).[PLAN4] C may profit from learning to extract any type of algorithmic information from M, e.g., for hierarchical planning and reasoning, exploiting passive observations encoded in M, etc. Compare tweets of 7 Jul 2022, 30 Nov 2023, 6 Dec 2023.

LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Juergen Schmidhuber

LeCun asked me[LEC] for only four relevant publications on the disputes L1-L4 above; I gave him five: (I),[AC90] (II),[UN1] (III),[AC02] (IV),[HRL1] (V).[PLAN4] Subsequently, however, LeCun did not follow the standard scientific procedure: either defend his work on OpenReview (where he posted his report) with facts against my critique (see Addendum I of [LEC]), or accept my arguments and correct his papers. Instead he gave an interview to the popular science venue ZDNet,[LEC22c] in which my critique was mentioned. There he simply made wrong and misleading claims about my work, without any justification and without any references. I debunked these claims in Addendum III of [LEC].


2. LBH promoted each other while downplaying our contributions

Sometimes LBH ignored or downplayed our work while promoting each other. For example, as mentioned in Sec. B1, LeCun called Bengio's GANs "the coolest idea in machine learning in the last twenty years," without mentioning that they are instances of my much earlier work.[R2][AC90][AC10][AC20] While LBH's deep learning surveys attribute to LBH several important concepts pioneered by my team,[DLC][T22] some of our breakthroughs they attribute to others,[T22] thus further diminishing our contributions.

For example, in their 2021 Turing lecture,[DL3a] LBH mention the "most popular class of convolutional net architecture for computer vision," the "ResNet family," without clarifying that the ResNet is little more than an (open-gated) version of our Highway Net, the first really deep feedforward NN.[HW1-3][HW] The so-called "ResNet family" is actually our Highway Net family. For the cognoscenti: essentially, if you set the gates of a Highway Net to 1.0, you get a ResNet. Likewise, the 1997 LSTM[LSTM1] with its residual connections[VAN1] (1991) is an open-gated version of our 2000 vanilla LSTM[LSTM2] (which inspired our Highway Net) where the "forget gate" is set to 1.0 (e.g., through a strong positive bias weight). See also dispute B5.
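In code, the relationship can be sketched as follows (my own illustration of the generic layer forms, not of any particular published implementation):

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """y = H(x) * T(x) + x * C(x); with both gates opened (T = C = 1),
    this reduces to the residual block y = H(x) + x used by ResNets."""
    def __init__(self, d):
        super().__init__()
        self.H = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.T = nn.Linear(d, d)   # transform gate
        self.C = nn.Linear(d, d)   # carry gate

    def forward(self, x, open_gates=False):
        if open_gates:             # "ResNet mode"
            return self.H(x) + x
        t, c = torch.sigmoid(self.T(x)), torch.sigmoid(self.C(x))
        return self.H(x) * t + x * c

layer = HighwayLayer(16)
y = layer(torch.randn(8, 16))      # gated; layer(x, open_gates=True) gives the ResNet form
```

Analogously, fixing the vanilla LSTM's forget gate at 1.0 recovers the 1997 LSTM cell.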

Highway Networks:
First Working Feedforward Networks With Over 100 Layers

In this context, LBH devote an extra section to the importance of NN depth, without mentioning that the relevant breakthroughs emphasized by LBH were all driven by my lab:[MOST] In 1991, I had the first very deep NNs based on unsupervised or self-supervised pre-training[UN-UN2] (see dispute H1); soon afterwards our LSTMs brought essentially unlimited depth to gradient-based supervised recurrent NNs[LSTM0-17][25y97] (see disputes H5 and B5); later our Highway Nets[HW1-3] brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet is the most cited NN of the 21st.[MOST]

LBH also devote extra sections/paragraphs to Transformers, self-supervised pre-training, and GANs, all central to Generative AI, without clarifying that their principles were introduced by myself over 30 years ago: generalised GANs in 1990 (now used for deepfakes),[AC90-20][DLH] unnormalized linear Transformers in 1991 (the "T" in "ChatGPT" stands for "Transformer"),[TR1-7][FWP0-1,6][DLH] and self-supervised pre-training for deep NNs in 1991 (the "P" in "GPT" stands for "pre-trained").[UN][UN0-3][MOST] Compare disputes B1, B4, H4, H1.

LBH have appropriated our core contributions to deep learning, without clarifying their origins.

Sometimes LBH built on our work and even cited it, but downplayed or ignored it later. For example, in 2007, Hinton claimed that "nobody in their right mind would ever suggest" to train deep NNs by backpropagation.[VID1] Instead he promoted the concept of pre-training NNs in an unsupervised or self-supervised fashion, the approach pioneered by myself in 1991 (although Hinton has not admitted this—see dispute H1[UN][UN0-5]). However, in 2010, my team with Dan Ciresan showed[MLP1-2] that unsupervised pre-training is not necessary to train deep feedforward NNs, contrary to Hinton's claims. In fact, twice my lab drove a shift from unsupervised pre-training to purely supervised learning (1991-95; 2006-10).[MIR](Sec. 19)

10-year anniversary of supervised deep learning breakthrough (2010)

Then, in 2011, we also greatly sped up the training (without unsupervised pre-training) of deep convolutional NNs (CNNs[CNN1-5]). Our fast GPU-based CNN of 2011[GPUCNN1] known as DanNet[DAN,DAN1][R6] was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated CNNs of 2006.[GPUCNN]

In 2011, DanNet was the first pure deep CNN to win computer vision contests. For a while, it enjoyed a monopoly. From 2011 to 2012 it won every contest it entered, winning four of them in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).[GPUCNN5] In particular, at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition[DAN1] in an international contest (where LeCun's team took a distant second place, with three times worse performance). In July 2012, our CVPR paper on DanNet[GPUCNN3] hit the computer vision community.

Hinton's team followed in our footsteps, first by discarding unsupervised pre-training,[MLP1-2] and then by fully adopting our approach,[GPUCNN4] citing the "somewhat similar" DanNet,[GPUCNN3][DAN] and winning the ImageNet[IM09] contest in Dec 2012,[GPUCNN4-5][R6] although Hinton had at first discouraged his student Krizhevsky from pursuing this approach. This has become Hinton's most cited paper.[GPUCNN4][R5][R6]

In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution

In their 2021 Turing lecture,[DL3a] however, LBH mention none of this background. Instead they claim that ReLUs of Fukushima[RELU1-2] (whom they do not cite either!) enabled deep learning to outperform previous methods for object recognition, referring to their GPU-based ImageNet 2012 winner,[GPUCNN4] without mentioning that our earlier groundbreaking deep GPU-based DanNet[GPUCNN1-3,5-8][DAN] did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011.[GPUCNN1-8][R5-6]

Competition[GPUCNN5]              Date/Deadline   Image size    Improvement          Winner
ICDAR 2011 Chinese handwriting    May 15, 2011    variable      3.8% / 28.9%         DanNet[GPUCNN1-3]
IJCNN 2011 traffic signs          Aug 06, 2011    variable      68.0% (superhuman)   DanNet[DAN,DAN1][R6]
ISBI 2012 image segmentation      Mar 01, 2012    512x512       26.1%                DanNet[GPUCNN3a]
ICPR 2012 medical imaging         Sep 10, 2012    2048x2048x3   8.9%                 DanNet[GPUCNN8]
ImageNet 2012                     Sep 30, 2012    256x256x3     41.4%                AlexNet[GPUCNN4]
MICCAI 2013 Grand Challenge       Sep 08, 2013    2048x2048x3   26.5%                DanNet[GPUCNN8]
ImageNet 2014                     Aug 18, 2014    256x256x3     -                    VGG Net[GPUCNN9]
ImageNet 2015                     Sep 30, 2015    256x256x3     15.8%                ResNet,[HW2] a Highway Net[HW1] with open gates

LBH have also participated in other PR work that has misled many. For example, the narrator of a popular 2018 Bloomberg video[VID2] thanks Hinton for speech recognition and machine translation, although both were actually done (at the time the video was produced) on billions of smartphones by deep learning methods developed in my labs in Germany and Switzerland (LSTM & CTC), long before Hinton's less useful and more traditional methods (see dispute H5). Similarly, in 2016, the NY Times published an article[NYT3] about the new, greatly improved, LSTM-based Google Translate without even mentioning our LSTM (instead featuring Hinton, who had little to do with it), although Google's original 2016 paper on Google Translate[WU] mentions LSTM over 50 times.

The most cited neural networks all build on work done in my labs (Juergen Schmidhuber)

In 2022, LeCun listed the 5 best ideas 2012-2022 without mentioning that most of them are from my lab, and much older. See Addendum II of [LEC] and this tweet (22 Nov 2022).

LeCun also claimed about me:[LEC22c] "... there's a big difference between just having the idea, and then getting it to work on a toy problem, and then getting it to work on a real problem, and then doing a theory that shows why it works, and then deploying it. There's a whole chain, and his idea of scientific credit is that it's the very first person who just, sort-of, you know, had the idea of that, that should get all the credit. And that's ridiculous."

In no universe is this straw man argument true.[LEC] As I wrote in a previous critique (one which LBH know well):[DLC] "the inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)." Nothing more or less than the standard elementary principles of scientific credit assignment.[T22] LBH, however, apparently aren't satisfied with credit for popularising the inventions of others; they also want the inventor's credit.[LEC]


3. Other researchers who were not credited by LBH

As recently as July 2021, Dr. LeCun, Dr. Bengio, Dr. Hinton (LBH), and the ACM continued to promulgate their revisionist "history" of deep learning by publishing yet another misleading overview of the field based on LBH's Turing Lecture.[DL3a][T22] LBH again credit themselves for fundamental work first done by others, and fail to correct LBH's well-known earlier omissions.[DLC][HIN][T22] In this section, I focus on disputes involving researchers other than my own team.

★1. LBH claim to "briefly describe the origins of deep learning"[DL3a] without even mentioning the world's first working deep learning networks by Ivakhnenko and Lapa (1965).[DEEP1-2][R8]

Moreover, LBH fail to cite Amari's 1967-68 work—which included computer simulations—on learning internal representations of multilayer perceptrons through stochastic gradient descent,[GD1-3] almost two decades before LBH's first experimental work on learning internal representations.

★2. LBH[DL3a] cite Hinton[GPUCNN4] (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule which he did not cite.[Drop1-4][GPUCNN4]

★3. Several times, LBH[DL3a] mention backpropagation—and LBH's papers on applications of this algorithm—but cite neither its inventor Linnainmaa (1970),[BP1-5][BPA-C] nor its first application to NNs by Werbos (1982),[BP2] nor Kelley's precursor of the method (1960).[BPA][T22]

who invented backpropagation?

Some claim that the backpropagation algorithm is just the chain rule of Leibniz (1676)[LEI07-10][DLH] popularised by L'Hopital (1696).[CONN21] No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this).[T22] It was not published until 1970.[BP1,4,5]

★4. LBH[DL3a] devote an extra section to rectified linear units (ReLUs), citing papers of the 2000s by Hinton and his former students, without citing Fukushima who introduced ReLUs in 1969.[RELU1-2]

★5. LBH[DL3a] refer to LeCun's work on CNNs, citing neither Fukushima, who created the basic CNN architecture in the 1970s,[CNN1-4] nor Waibel, who in 1987 was the first to combine convolutional NNs with backpropagation[BP1-6] and weight sharing, nor the first backprop-trained two-dimensional CNNs of Zhang (1988).[CNN1a+] Modern CNNs originated before LeCun's team helped to improve them.

★6. LBH[DL3a] cite Hinton (1981) for multiplicative gating, without mentioning Ivakhnenko and Lapa who had multiplicative gating in deep networks already in 1965.[DEEP1-2][R8]

★7. LBH[DL3a] cite the "fast weights" of Hinton (1987) without mentioning the earlier fast weights of v. d. Malsburg (1981) and Feldman (1982).[FAST,FASTa-b][FWP]

★8. ACM lauds LeCun for "deep learning architectures that can manipulate structured data, such as graphs."[T19,T22] However, such architectures were proposed by Sperduti, Goller, and Küchler in the 1990s[SP93-97][GOL][KU] before LeCun, who cited neither them nor our graph NN-like, Transformer-like Fast Weight Programmers of 1991[FWP0-1][FWP6][FWP] which learn to continually rewrite NN-based mappings from queries to answers. See also Pollack's even earlier relevant work[PO87-90] and compare the important work of Baldi and colleagues.[BA96-03]

★9. Hinton's 1985 Boltzmann Machine paper about learning internal representations[BM] lauded by ACM[T19] neither cited relevant prior work by Sherrington & Kirkpatrick[SK75] & Glauber[G63] nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)[DEEP1-2][HIN] nor Amari's work (1967-68)[GD1-2] on learning internal representations in deep nets end-to-end through stochastic gradient descent. Even later surveys by the authors[S20][DLC] failed to cite the prior art.[T22]

Here one must also mention the unfortunate role of the well-known N(eur)IPS conference. Hinton's 1985 co-author Sejnowski[BM] has been its president for decades. Over the years, N(eur)IPS has frequently invited LBH to give keynotes, especially Hinton, continually providing a platform for Sejnowski's and LBH's revisionist history of deep learning.[S20] In our 2021 debate on the Connectionists Mailing List,[CONN21] perhaps the oldest mailing list on NNs, I blasted Sejnowski's 2020 deep learning survey in PNAS[S20] which wrongly claims that his 1985 Boltzmann machine[BM] was the first NN to learn internal representations (see ★9 above), although it is well-known that such networks emerged much earlier in Ukraine[DEEP1-2][HIN] (1965) and Japan[GD1-2] (1967). In an interview, Sejnowski claimed: "Our goal was to try to take a network with multiple layers—an input layer, an output layer and layers in between—and make it learn. It was generally thought, because of early work that was done in AI in the 60s, that no one would ever find such a learning algorithm because it was just too mathematically difficult." My reply was:[CONN21] "You are a well-known scientist, head of NeurIPS, and chief editor of a major journal. You must correct this. We must all be better than this as scientists. We owe it to both the past, present, and future scientists as well as those we ultimately serve." Generally speaking, I am encouraging N(eur)IPS to fight systemic academic corruption.

In sum, for decades, much of the more prominent work of Dr. Hinton, Dr. Bengio, and Dr. LeCun has simply consisted of repackaged versions of earlier work, produced without proper citation.[T22] The repetitive nature of LBH's and ACM's failures to uphold basic scientific standards represents a serious attack on the integrity of the field of Artificial Intelligence. If we, in turn, choose to ignore this, then we will be committing a sin against ourselves and our scientific predecessors.[T22]


4. Ad hominem attacks

Apparently to avoid a fact-focused scientific debate, Hinton and LeCun even conducted ad hominem attacks[AH2-3] against me, true to the motto: "If you cannot dispute a fact-based message, attack the messenger himself."[HIN][T22][LEC]

In particular, LeCun stated in the NY Times that "Jürgen ... keeps claiming credit he doesn't deserve for many, many things,"[NYT1] without any evidence, without providing a single example.[T22] Likewise, in the popular science venue ZDNet,[LEC22c] he made wrong and misleading claims about my work, without any justification, without any references. I debunked these claims in Addendum III of [LEC]. In conjunction with previous work,[T22][LEC] the present piece makes clear that it is actually LBH themselves who "keep claiming credit they don't deserve for many, many things," providing numerous examples, plus the references to back them up.

Fortunately, unlike politics, science is immune to ad hominem attacks—at least in the long run. In the hard sciences, the only things that count are the facts. Science is not democratic: if 100 people claim one thing, and only one person claims the opposite but can back it up with facts, then that person wins. If you haven't already read it, see "100 Authors against Einstein."[AH1]


5. Effects on other researchers

ACM's Turing award for LBH may already have encouraged other machine learning researchers to follow in their footsteps and conduct what can only be described as bad science.[T22][DLH] Some seem to think: if these guys can get away with it, I can do so, too. In particular, certain researchers at companies that employed Hinton & LeCun apparently felt encouraged to abandon scientific integrity as well (while those companies remained deeply influenced by our contributions[DL4][DEC]).

The famous ResNet paper[HW2] was published by Microsoft, and its first author was hired by Meta (formerly Facebook). The paper did cite our earlier Highway Net,[HW1-3] but did not explicitly mention that ResNet is an (open-gated) version of the Highway Net (a ResNet is like a Highway Net whose gates are set to 1.0).

Google published a famous 2017 paper on attention-based "quadratic" Transformers[TR1][FWP] without mentioning my closely related unnormalised "linear" Transformers of 1991,[FWP,FWP0-1,6] not even in later papers after the formal connection was very concretely pointed out to them in a peer-reviewed publication (2021).[FWP6] The 1991 Transformer variant[FWP0-1] already learned to generate what are now called KEY and VALUE patterns[TR1] to create an efficient linearized version of what's now called "self-attention,"[TR1] in 1993 called "internal spotlights of attention."[FWP2] Sec. 2 of the blog post[FWP] reviews the roles of QUERY/KEY/VALUE patterns in linear (1991) and quadratic (2017) Transformers. Google has done great work scaling this old principle up, but should note this connection now that it is known. See also disputes B4, H4.


Google also acquired the company DeepMind (co-founded by a student from my lab[MIR]) which published quite a few well-known papers that did not mention our closely related earlier work.[DM1][DNC][NAN5]

The first author of a 2014 paper on GANs[GAN1] (an instance of my ancient Artificial Curiosity[AC90-20]) went to Apple and then to Google DeepMind, where he never admitted that his paper contains wrong statements about our earlier work—see dispute B1. Even those with a terminally short attention span can easily find many additional examples of recent misattributions.

In 2023, the company Meta (whose Chief AI Scientist is LeCun) released its LLaMA 2 software. As a pre-trained Transformer-based large language model, LLaMA inherits many of my 1991 ideas.[FWP,FWP0-2,6][UN0-1][MOST] However, LLaMA has propagated provably false rhetoric, to the detriment of science itself, claiming that I "have been involved in harmful activities" and have not made "positive contributions to society, such as pioneers in their field." See this tweet (25 July 2023) and disputes B4, H4, H1. Obviously, large language models can be used to propagate a misleading history of AI.


6. On fathers and godfathers

Before the term "godfather" became overwhelmingly associated with the mafia through the famous 1972 movie,[GF1] the word instead meant a person who bears witness to the baptism (christening) of a child, usually in the presence of the child's father and mother. Both "father" and "godfather" are occasionally applied in the field of Artificial Intelligence (AI), often in an inconsistent and misleading fashion.

For example, LeCun & Bengio & Hinton (LBH) have been called "godfathers of AI,"[GF2] based on a suggestion of one of Hinton's former students. This makes sense only from the "deep learning mafia"[DLC2] point of view: for decades, LBH have kept renaming inventions of other researchers, without citing the original inventors. That is, these so-called godfathers baptised creations fathered by others without approval of the fathers.[T22] The sections above are full of examples.

Who were the true AI pioneers? The 20th century's "father of practical AI" was Leonardo Torres y Quevedo.[DLH] He built the first working chess endgame player in 1914[BRU1-4] (back then, chess was considered an activity restricted to the realm of intelligent creatures).

Quevedo's machine was still considered impressive decades later, when another AI pioneer, Norbert Wiener,[WI48] played against it at the 1951 Paris conference,[AI51][BRO21][BRU4] now often viewed as the first conference on AI. It predated the 1956 Dartmouth conference, where the name "AI" was coined by John McCarthy and colleagues, the true "godfathers of AI," who introduced a new name for the works of earlier "fathers of AI."

The "father of AI theory" was Kurt Gödel, who, in 1931-34, identified fundamental limits of any type of computation-based AI—and of computation/theorem proving/math in general.[GOD][BIB3][GOD21,a] In 1935, Alonzo Church[CHU]—and then in 1936-37 also Alan Turing[TUR][TUR21]—extended Gödel's result; later Church and Turing also discussed and named certain aspects of AI.[TUR1-3ab][TUR21]

In modern AI, deep learning is king. The "fathers of deep learning" were Alexey Ivakhnenko and Valentin Lapa, who, in 1965, had the first general, working learning algorithm for deep neural networks with many hidden layers.[DEEP1-2][DL1-2][DLH]

By a logic similar to that of McCarthy et al., Aizenberg et al. might be called the "godfathers of deep learning," because they introduced the term "deep learning" to NNs in 2000 (after Rina Dechter had introduced it to AI/ML in 1986).[DL2]

The mathematical roots of deep learning, however, go back centuries: the chain rule—the heart of modern deep learning—is due to Gottfried Wilhelm Leibniz (1676), and the first NNs (now called linear NNs) are due to Johann Carl Friedrich Gauss & Adrien-Marie Legendre (circa 1800).[DLH] The recent survey[DLH] lists many additional pioneers of the field who may merit yet-unclaimed titles not mentioned above.


7. Discussion

As those who know me can testify, finding and citing the original sources of scientific and technological innovations is important to me, whether they are mine or other people's.[DL1-2][DLH][HIN][T22][NASC1-9] The present page is offered as a resource for all good scientists who share this inclination.

LBH and their co-workers have contributed certain useful improvements of existing deep learning methods.[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS] The essential foundations of deep learning, however, were laid by others whom they did not cite.[T22]

Remarkably, four of the numerous priority disputes mentioned above (H1, H2, B7, L2) are related to my 1991 paper[UN0-1][UN] which in many ways kickstarted what people now call deep learning, going beyond Ivakhnenko's "early" deep learning[DEEP1-2] which LBH did not cite either.[DLC] In fact, most of the disputes go back to the work in our Annus Mirabilis 1990-91:[MIR] B1, B2, B4, B7, H1, H2, H3, H4, L1, L2.

Since one cannot easily look into the minds of people, we cannot exclude the possibility, however small, that all misattributions mentioned above were unintentional[PLAG1-6][CONN21] rather than intentional.[FAKE2] This might even be forgivable to some extent save for the fact that LBH have not corrected their misattributions and self-aggrandizations in later surveys.[DL3-3a][T22] From the standpoint of scientific integrity, this is unacceptable.

The deontology of science enforces proper scientific standards and behavior when it comes to identifying prior art and assigning credit. Science has a well-established way of dealing with "multiple discovery" and plagiarism, be it unintentional[PLAG1-6][CONN21] or not,[FAKE2] based on facts such as time stamps of publications and patents. Sometimes it may take a while to settle disputes, but in the end, the facts must always win. As long as the facts have not yet won, it's not yet the end. No fancy award can ever change that.[HIN][T22]

As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away."[T22]

Scientific Integrity, the 2021 Turing Lecture, and the 2018 Turing Award for Deep Learning (Juergen Schmidhuber)

On social media, a frequent but misinformed comment on all this is that it sometimes took me years to notice connections of recent papers to our old work of the 1990s. However, there are hundreds of new AI papers each month, and it is impossible to immediately catch every case of intentional or accidental plagiarism. A new paper may take years to become visible enough to attract my attention, and only then will I recognize that it is a rehash of previous work. Anyway, all of this is irrelevant in science: it's not the job of old authors (who might be dead already) to study new related work; it's the job of new authors to study old related work!

Perhaps the greatest existential risk to scientific integrity is that the ongoing attack on it is not yet being discussed more widely.[T22] Dear reader, let me urge you: don't become part of the problem! For example, don't simply cite (without rectification) a flawed paper that misrepresents or does not mention earlier relevant work, just because it was cited by many others. Don't fall prey to the ancient idiom: "eat dung—a billion flies can't be wrong!"[T22] Have you failed to correctly assign credit in the past? Then you must rectify this in future publications. Don't participate in systemic academic corruption. Don't wait until your name appears on the pillory list of Sec. 8. It's a shame for our field that such elementary rules of scientific conduct need to be emphasized.[T22]


8. Pillory list of additional plagiarism cases

To discourage AI plagiarism in the future, I propose to establish a web site listing papers on deep learning and AI that duplicate previous work without proper attribution, and without rectifying this in later publications: the pillory list you don't want to be on. An international unbiased committee of AI experts should be established (perhaps with the help of ACM[ACM18-23]) to discuss candidate papers brought to their attention by members of the machine learning community, and decide about adding papers to the list, or removing them once the concerns have been adequately addressed. The following paragraphs list a few older non-LBH candidate cases and may be viewed as a start.

Case 1. Around 1960, Frank Rosenblatt not only had linear NNs with threshold functions, but also much more interesting MLPs with a non-learning first layer of randomized weights and an adaptive output layer.[R62] So Rosenblatt basically had what was much later rebranded by Huang and others as Extreme Learning Machines (ELMs)[ELM1] without proper attribution.[DLH] The revisionist narrative of ELMs[ELM2][CONN21] is a bit like the revisionist narrative of deep learning criticized by the present report. The "ELM conspiracy" apparently feels it can get away with outrageous improper credit assignment, just like the self-proclaimed "deep learning conspiracy"[DLC1-2] seems to get away with it on an even grander scale.
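
For illustration only (my own minimal sketch, not code from Rosenblatt's or Huang's papers), the architecture at issue looks like this: a first layer with fixed random weights that is never trained, followed by an adaptive output layer. Here the output layer is fitted by least squares as in ELMs; Rosenblatt's original used threshold units and perceptron-style learning.

```python
import numpy as np

def fit_random_hidden_net(X, Y, n_hidden=100, seed=0):
    # First layer: fixed random weights, never trained.
    rng = np.random.default_rng(seed)
    W_in = rng.standard_normal((X.shape[1], n_hidden))
    b = rng.standard_normal(n_hidden)
    H = np.tanh(X @ W_in + b)
    # Output layer: the only adaptive part, here fitted in closed form.
    W_out, *_ = np.linalg.lstsq(H, Y, rcond=None)
    return W_in, b, W_out

def predict(X, W_in, b, W_out):
    return np.tanh(X @ W_in + b) @ W_out
```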

Case 2. In 1972, Shun-Ichi Amari made the original Lenz-Ising recurrent architecture[L20][I24,I25][K41][W45][T22] adaptive, such that it could learn to associate input patterns with output patterns by changing its connection weights.[AMH1] Ten years later, the Amari network was republished by Hopfield,[AMH2] who did not cite Amari, not even in later papers. Subsequently, this network was frequently called the Hopfield Network![DLH] See also this tweet of 2022.
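
A compact sketch of the kind of network at issue (illustrative only, following the standard textbook formulation rather than either original paper): the recurrent weights are set from outer products of the stored +1/-1 patterns, and recall repeatedly applies a sign update until the state settles in an attractor.

```python
import numpy as np

def store_patterns(patterns):
    # Outer-product (Hebbian-style) learning of the recurrent weight matrix
    # from patterns given as +1/-1 vectors.
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0.0)          # no self-connections
    return W / patterns.shape[0]

def recall(W, state, steps=50):
    # Synchronous recall: update all units by the sign of their net input
    # until the state stops changing (ideally one of the stored patterns).
    s = state.copy()
    for _ in range(steps):
        new_s = np.sign(W @ s)
        new_s[new_s == 0] = 1
        if np.array_equal(new_s, s):
            break
        s = new_s
    return s
```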

Case 3. ... (to be continued)

...


9. Acknowledgments

Some of the material above was taken from previous AI Blog posts.[MIR][DEC][GOD21][LEI21][AC][ATT][DAN][DAN1][DL4][GPUCNN5,8][DLC][DLH][FWP][LEC][META][MLP2][MOST][PLAN][UN][LSTMPG][BP4][DL6a][HIN][T22] Thanks to many expert reviewers (including several famous neural net pioneers) for useful comments.[T22] Since science is about self-correction, let me know at juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found on my publication page and my arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.


10. 333+ partially annotated references

[25y97] In 2022, we are celebrating the following works from a quarter-century ago. 1. Journal paper on Long Short-Term Memory, the most cited neural network (NN) of the 20th century (and basis of the most cited NN of the 21st). 2. First paper on physical, philosophical and theological consequences of the simplest and fastest way of computing all possible metaverses (= computable universes). 3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments. 4. Journal paper on meta-reinforcement learning. 5. Journal paper on hierarchical Q-learning. 6. First paper on reinforcement learning to play soccer: start of a series. 7. Journal papers on flat minima & low-complexity NNs that generalize well. 8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age. 9. Journal paper on probabilistic incremental program evolution.

2022: 25th anniversary of 1997 papers: Long Short-Term Memory. All computable metaverses. Hierarchical reinforcement learning (RL). Meta-RL. Abstractions in generative adversarial RL. Soccer learning. Low-complexity neural nets. Low-complexity art. Others. Juergen Schmidhuber, 2022

[AC] J.  Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Schmidhuber's artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.

[AC90] J.  Schmidhuber. Making the world differentiable: On using self-supervised fully recurrent neural networks for dynamic reinforcement learning and planning in non-stationary environments. Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990. PDF. The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks where a generator NN is fighting a predictor NN in a minimax game (more).

[AC90b] J.  Schmidhuber. A possibility for implementing curiosity and boredom in model-building neural controllers. In J. A. Meyer and S. W. Wilson, editors, Proc. of the International Conference on Simulation of Adaptive Behavior: From Animals to Animats, pages 222-227. MIT Press/Bradford Books, 1991. PDF. More.

[AC91] J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Inst. f. Informatik, Tech. Univ. Munich, April 1991. PDF.

[AC91b] J.  Schmidhuber. Curious model-building control systems. Proc. International Joint Conference on Neural Networks, Singapore, volume 2, pages 1458-1463. IEEE, 1991. PDF.

[AC97] J. Schmidhuber. What's interesting? Technical Report IDSIA-35-97, IDSIA, July 1997. Focus on automatic creation of predictable internal abstractions of complex spatio-temporal events: two competing, intrinsically motivated agents agree on essentially arbitrary algorithmic experiments and bet on their possibly surprising (not yet predictable) outcomes in zero-sum games, each agent potentially profiting from outwitting / surprising the other by inventing experimental protocols where both modules disagree on the predicted outcome. The focus is on exploring the space of general algorithms (as opposed to traditional simple mappings from inputs to outputs); the general system focuses on the interesting things by losing interest in both predictable and unpredictable aspects of the world. Unlike Schmidhuber et al.'s previous systems with intrinsic motivation,[AC90-AC95] the system also takes into account the computational cost of learning new skills, learning when to learn and what to learn. See later publications.[AC99][AC02]

[AC99] J. Schmidhuber. Artificial Curiosity Based on Discovering Novel Algorithmic Predictability Through Coevolution. In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z. Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618, IEEE Press, Piscataway, NJ, 1999.

[AC02] J. Schmidhuber. Exploring the Predictable. In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing, p. 579-612, Springer, 2002. PDF.

[AC06] J.  Schmidhuber. Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts. Connection Science, 18(2): 173-187, 2006. PDF.

[AC09] J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112. PDF. (More on artificial scientists and artists.)

[AC10] J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010. IEEE link. PDF. With a brief summary of the generative adversarial neural networks of 1990[AC90,90b][AC20] where a generator NN is fighting a predictor NN in a minimax game (more).

[AC20] J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991). Neural Networks, Volume 127, p 58-66, 2020. Preprint arXiv/1906.04493.

[ACM18] ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018. Quote: "Computing professionals should therefore credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works."

[ACM23] Policy for Honors Conferred by ACM. Association for Computing Machinery (ACM), 2023. Quote: "ACM also retains the right to revoke an Honor previously granted if ACM determines that it is in the best interests of the field to do so." Copy in the Internet Archive (2023).

[AH1] Hentschel K. (1996) A. v. Brunn: Review of "100 Authors against Einstein" [March 13, 1931]. In: Hentschel K. (eds) Physics and National Socialism. Science Networks—Historical Studies, vol 18. Birkhaeuser Basel. Link.

[AH2] F. H. van Eemeren, B. Garssen & B. Meuffels. The disguised abusive ad hominem empirically investigated: Strategic manoeuvring with direct personal attacks. Journal Thinking & Reasoning, Vol. 18, 2012, Issue 3, p. 344-364. Link.

[AH3] D. Walton (PhD Univ. Toronto, 1972), 1998. Ad hominem arguments. University of Alabama Press.

[AIB] J. Schmidhuber. AI Blog. Includes variants of chapters of the AI Book.

[AI51] Les Machines a Calculer et la Pensee Humaine: Paris, 8-13 January 1951, Colloques internationaux du Centre National de la Recherche Scientifique; no. 37, Paris 1953. [H. Bruderer rightly calls this the first conference on AI.]

[AM16] Blog of Werner Vogels, CTO of Amazon (Nov 2016): Amazon's Alexa "takes advantage of bidirectional long short-term memory (LSTM) networks using a massive amount of data to train models that convert letters to sounds and predict the intonation contour. This technology enables high naturalness, consistent intonation, and accurate processing of texts."

[AMH0] S. I. Amari (1972). Characteristics of random nets of analog neuron-like elements. IEEE Trans. Syst. Man Cybernetics, 2, 643-657. First published 1969 in Japanese, long before Wilson & Cowan's very similar work (1972-73).

[AMH1] S. I. Amari (1972). Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions, C 21, 1197-1206, 1972. PDF. First publication of what was later sometimes called the Hopfield network[AMH2] or Amari-Hopfield Network,[AMH3] based on the (uncited) Lenz-Ising recurrent architecture.[L20][I25][T22] See also this tweet.

[AMH1b] W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19.1-2, p. 101-120, 1974. Mentions the recurrent Ising model[L20][I25] on which the (uncited) Amari network[AMH1,2] is based.

[AMH2] J. J. Hopfield (1982). Neural networks and physical systems with emergent collective computational abilities. Proc. of the National Academy of Sciences, vol. 79, pages 2554-2558, 1982. The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.[AMH1] [AMH2] did not cite [AMH1].

[AMH3] A. P. Millan, J. J. Torres, J. Marro. How Memory Conforms to Brain Development. Front. Comput. Neuroscience, 2019

[AOI] M. Ford. Architects of Intelligence: The truth about AI from the people building it. Packt Publishing, 2018. Preface to German edition by J. Schmidhuber.

[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention (1990) and soft attention (1991-93).[FWP] Today, both types are very popular.

[ATT0] J. Schmidhuber and R. Huber. Learning to generate focus trajectories for attentive vision. Technical Report FKI-128-90, Institut für Informatik, Technische Universität München, 1990. PDF.

[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990. PDF. More.

[ATT2] J.  Schmidhuber. Learning algorithms for networks with internal and external feedback. In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton, editors, Proc. of the 1990 Connectionist Models Summer School, pages 52-61. San Mateo, CA: Morgan Kaufmann, 1990. PS. (PDF.) Reviewed by Dr. Hinton.

[ATT3] H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite. In fact, Hinton was the reviewer of a 1990 paper[ATT2] which summarised in its Section 5 Schmidhuber's early work on attention: the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component with an attentional component (the fixation controller). Two decades later, Hinton wrote about his own work:[ATT3] "To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].

[ATT14] D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014-16. Preprint arXiv/1409.0473, 2014-16. This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.[FWP,FWP0-2,6][ATT]

[AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This Man Is the Godfather the AI Community Wants to Forget. Business Week, Bloomberg, May 15, 2018.

[AV2] A. Vance. Apple and Its Rivals Bet Their Futures on These Men's Dreams. Business Week, Bloomberg, May 17, 2018.

[BA93] P. Baldi and Y. Chauvin. Neural Networks for Fingerprint Recognition, Neural Computation, Vol. 5, 3, 402-418, (1993). First application of CNNs with backpropagation to biomedical/biometric images.

[BA96] P. Baldi and Y. Chauvin. Hybrid Modeling, HMM/NN Architectures, and Protein Applications, Neural Computation, Vol. 8, 7, 1541-1565, (1996). One of the first papers on graph neural networks.

[BA99] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the Past and the Future in Protein Secondary Structure Prediction, Bioinformatics, Vol. 15, 11, 937-946, (1999).

[BA03] P. Baldi and G. Pollastri. The Principled Design of Large-Scale Recursive Neural Network Architectures-DAG-RNNs and the Protein Structure Prediction Problem. Journal of Machine Learning Research, 4, 575-602, (2003).

[BB2] J. Schmidhuber. A local learning algorithm for dynamic feedforward and recurrent networks. Connection Science, 1(4):403-412, 1989. (The Neural Bucket Brigade—figures omitted!). PDF. HTML. Compare TR FKI-124-90, TUM, 1990. PDF. Proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Based on a "neural economy" for reinforcement learning. See also this tweet.

[BIB3] W. Bibel (2003). Mosaiksteine einer Wissenschaft vom Geiste. Invited talk at the conference on AI and Gödel, Arnoldsheim, 4-6 April 2003. Manuscript, 2003.

[BM] D. Ackley, G. Hinton, T. Sejnowski (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1):147-169. This paper neither cited relevant prior work by Sherrington & Kirkpatrick[SK75] & Glauber[G63] nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)[DEEP1-2][HIN] nor Amari's work (1967-68)[GD1-2] on learning internal representations in deep nets through stochastic gradient descent. Even later surveys by the authors[S20][DLC] failed to cite the prior art.[T22]

[BOU] H Bourlard, N Morgan (1993). Connectionist speech recognition. Kluwer, 1993.

[BPA] H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960. Precursor of modern backpropagation.[BP1-4]

[BPB] A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.

[BPC] S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.

[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 6-7 and FORTRAN code on pages 58-60. PDF. See also BIT 16, 146-160, 1976. Link. The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.

[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP, Springer, 1982. PDF. First application of backpropagation[BP1] to NNs (concretizing thoughts in Werbos' 1974 thesis).

[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020). Who invented backpropagation? More.[DL2]

[BP5] A. Griewank (2012). Who invented the reverse mode of differentiation? Documenta Mathematica, Extra Volume ISMP (2012): 389-400.

[BPTT1] P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78.10, 1550-1560, 1990.

[BPTT2] R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks. In: Backpropagation: Theory, architectures, and applications, p 433, 1995.

[BRI] Bridle, J.S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden Markov Model Interpretation, Speech Communication, vol. 9, no. 1, pp. 83-92.

[BRU1] H. Bruderer. Computing history beyond the UK and US: selected landmarks from continental Europe. Communications of the ACM 60.2 (2017): 76-84.

[BRU2] H. Bruderer. Meilensteine der Rechentechnik. 2 volumes, 3rd edition. Walter de Gruyter GmbH & Co KG, 2020.

[BRU3] H. Bruderer. Milestones in Analog and Digital Computing. 2 volumes, 3rd edition. Springer Nature Switzerland AG, 2020.

[BRU4] H. Bruderer. The Birthplace of Artificial Intelligence? Communications of the ACM, BLOG@CACM, Nov 2017. Link.

[BRO21] D. C. Brock (2021). Cybernetics, Computer Design, and a Meeting of the Minds. An influential 1951 conference in Paris considered the computer as a model of—and for—the human mind. IEEE Spectrum, 2021. Link.

[BW] H. Bourlard, C. J. Wellekens (1989). Links between Markov models and multilayer perceptrons. NIPS 1989, p. 502-510.

[CAPS] S. Sabour, N. Frosst, G. E. Hinton (2017). Dynamic routing between capsules. Proc. NIPS 2017, pp. 3856-3866.

[CDI] G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation 14.8 (2002): 1771-1800.

[CHU] A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the American Mathematical Society. Also in American Journal of Mathematics, 58(2), 345-363 (1 Apr 1936). First explicit proof that the Entscheidungsproblem (decision problem) does not have a general solution.

[CNN1] K. Fukushima: Neural network model for a mechanism of pattern recognition unaffected by shift in position—Neocognitron. Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979. The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia.

[CNN1+] K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980). Link.

[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation[BP1][BP2] and weight-sharing to a convolutional architecture.

[CNN1a+] W. Zhang. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988. First backpropagation-trained 2D CNN.

[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CNN1a].

[CNN1c] Bower Award Ceremony 2021: Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021.

[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989. PDF.

[CNN3a] K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990. An NN with convolutions using Max-Pooling instead of Fukushima's Spatial Averaging.[CNN1]

[CNN3] Weng, J., Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A CNN whose downsampling layers use Max-Pooling (which has become very popular) instead of Fukushima's Spatial Averaging.[CNN1]

[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007

[CNN5a] S. Behnke. Learning iterative image reconstruction in the neural abstraction pyramid. International Journal of Computational Intelligence and Applications, 1(4):427-438, 1999.

[CNN5b] S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.

[CNN5c] D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92-101, 2010.

[CO1] J. Koutnik, F. Gomez, J. Schmidhuber (2010). Evolving Neural Networks in Compressed Weight Space. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2010), Portland, 2010. PDF.

[CO2] J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez. Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), Amsterdam, July 2013. PDF. The first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, without any unsupervised pre-training.

Compressed Network Search Finds Complex Neural Controllers with a Million Weights

[CO3] R. K. Srivastava, J. Schmidhuber, F. Gomez. Generalized Compressed Network Search. Proc. GECCO 2012. PDF.

[CONN21] Since November 2021: Comments on earlier versions of the report[T22] in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.

[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006. PDF.

[CUB0] R. J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Northeastern University, College of Computer Science, 1989.

[CW] J. Koutnik, K. Greff, F. Gomez, J. Schmidhuber. A Clockwork RNN. Proc. 31st International Conference on Machine Learning (ICML), p. 1845-1853, Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].

[DAN] J. Schmidhuber (AI Blog, 2021). 10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs). 1st superhuman result in 2011.[DAN1] Now everybody is using this approach.

[DAN1] J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition. At the IJCNN 2011 computer vision competition in Silicon Valley, our artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor (by LeCun's team), and six times better than the best non-neural method.

[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2022). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.

2010-2020: our decade of deep learning. Juergen Schmidhuber

[DEEP1] Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.

[DEEP1a] Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.

[DEEP2] Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.

[DIST2] O. Vinyals, J. A. Dean, G. E. Hinton. Distilling the Knowledge in a Neural Network. Preprint arXiv:1503.02531 [stat.ML], 2015. The authors did not cite Schmidhuber's original 1991 NN distillation procedure,[UN0-2][MIR](Sec. 2) not even in the later patent application US20150356461A1. See also this tweet.

[DL1] J. Schmidhuber, 2015. Deep learning in neural networks: An overview. Neural Networks, 61, 85-117. More. Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.

Deep Learning in Neural Networks: An Overview

[DL2] J. Schmidhuber, 2015. Deep Learning. Scholarpedia, 10(11):32832.

[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436-444. HTML. A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].

[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML. Local copy (HTML only). Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].

[DL4] J. Schmidhuber (AI Blog, 2017). Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019 on-device speech recognition (on the phone, not the server) is still based on LSTM.

[DL6] F. Gomez and J. Schmidhuber. Co-evolving recurrent neurons learn deep memory POMDPs. In Proc. GECCO'05, Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. PDF.

[DL6a] J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?

15-year anniversary: 1st paper with "learn deep" in the title (2005). Juergen Schmidhuber.

[DL7] "Deep Learning ... moving beyond shallow machine learning since 2006!" Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the Internet Archive), referring to Hinton's[UN4] and Bengio's[UN5] unsupervised pre-training for deep NNs[UN4] (2006) although this type of deep learning dates back to Schmidhuber's work of 1991.[UN1-2][UN] Compare Sec. II & XVII & III.

[DLC] J. Schmidhuber (AI Blog, June 2015). Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436). The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it).

[DLC1] Y. LeCun. IEEE Spectrum Interview by L. Gomes, Feb 2015. Quote: "A lot of us involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua Bengio, and myself—the so-called 'Deep Learning conspiracy' ..."

[DLC2] M. Bergen, K. Wagner (2015). Welcome to the AI Conspiracy: The 'Canadian Mafia' Behind Tech's Latest Craze. Vox recode, 15 July 2015. Quote: "... referred to themselves as the 'deep learning conspiracy.' Others called them the 'Canadian Mafia.'"

[DLH] J. Schmidhuber (AI Blog, 2022). Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Lugano, Switzerland, 2022. Preprint arXiv:2212.11279. Tweet of 2022.

The road to modern AI: artificial neural networks up to 1979—from shallow learning to deep learning

[DM1] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with Deep Reinforcement Learning. Tech Report, 19 Dec. 2013, arxiv:1312.5602.

[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p 1529, 26 Feb. 2015. Link. DeepMind's first famous paper. Its abstract claims: "While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces." It also claims to bridge "the divide between high-dimensional sensory inputs and actions." Similarly, the first sentence of the abstract of the earlier tech report version[DM1] of [DM2] claims to "present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning." However, the first such system (requiring no unsupervised pre-training) was created earlier by Jan Koutnik et al. in Schmidhuber's lab.[CO2] DeepMind was co-founded by Shane Legg, a PhD student from this lab; he and Daan Wierstra (another PhD student of Schmidhuber and DeepMind's 1st employee) were the first persons at DeepMind who had AI publications and PhDs in computer science. More.

[DM3] S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019. AlphaStar has a "deep LSTM core."

[DM4] J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Zidek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. Romera-Paredes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli & D. Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589, 2021. DeepMind's breakthrough application of deep learning did not cite Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).

[DNC] A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:7626, p 471, 2016. This work of DeepMind did not cite the original work of the early 1990s on neural networks learning to control dynamic external memories.[PDA1-2][FWP0-1]

[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, PHYSICA D,42, 265-272. What's now called "dropout" is a variation of the stochastic delta rule—compare preprint arXiv:1808.03578, 2018.

[Drop2] N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-1032.

[Drop3] J. Hertz, A. Krogh, R. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, California: Addison-Wesley Pub. Co., pp. 45-46.

[Drop4] N. Frazier-Logue, S. J. Hanson (2018). Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning. Preprint arXiv:1808.03578, 2018.

[ELM1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew. Extreme learning machine: A new learning scheme of feedforward neural networks. Proc. IEEE Int. Joint Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work around 1960.[R62][T22]

[ELM2] ELM-ORIGIN, 2004. The Official Homepage on Origins of Extreme Learning Machines (ELM). "Extreme Learning Machine Duplicates Others' Papers from 1988-2007." Local copy. This overview does not mention that the "ELM" concept goes back to Rosenblatt's work around 1960.[R62][T22]

[FAKE] H. Hopf, A. Krief, G. Mehta, S. A. Matlin. Fake science and the knowledge crisis: ignorance can be fatal. Royal Society Open Science, May 2019. Quote: "Scientists must be willing to speak out when they see false information being presented in social media, traditional print or broadcast press" and "must speak out against false information and fake science in circulation and forcefully contradict public figures who promote it."

[FAKE2] L. Stenflo. Intelligent plagiarists are the most dangerous. Nature, vol. 427, p. 777 (Feb 2004). Quote: "What is worse, in my opinion, ..., are cases where scientists rewrite previous findings in different words, purposely hiding the sources of their ideas, and then during subsequent years forcefully claim that they have discovered new phenomena."

[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie, Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981. First paper on fast weights or dynamic links.

[FASTa] J. A. Feldman. Dynamic connections in neural networks. Biological Cybernetics, 46(1):27-39, 1982. 2nd paper on fast weights.

[FASTb] G. E. Hinton, D. C. Plaut. Using fast weights to deblur old memories. Proc. 9th annual conference of the Cognitive Science Society (pp. 177-186), 1987. 3rd paper on fast weights (two types of weights with different learning rates).

[FAT1] H. Jones. Juergen Schmidhuber, Renowned 'Father Of Modern AI,' Says His Life's Work Won't Lead To Dystopia. Forbes Magazine, 26 May 2023. Link.

[FAT2] E. Colton. 'Father of AI' says tech fears misplaced: 'You cannot stop it'. Fox News, 7 May 2023. Link.

[FAT3] J. Taylor. Rise of artificial intelligence is inevitable but should not be feared, 'father of AI' says. The Guardian, 7 May 2023. Link.

[FB17] By 2017, Facebook used LSTM to handle over 4 billion automatic translations per day (The Verge, August 4, 2017); see also Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)

[FM] S. Hochreiter and J. Schmidhuber. Flat minimum search finds simple nets. Technical Report FKI-200-94, Fakultät für Informatik, Technische Universität München, December 1994. PDF.

[FWP] J.  Schmidhuber (AI Blog, 26 March 2021, updated 2023). 26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff! In 2022, ChatGPT took the world by storm, generating large volumes of text that are almost indistinguishable from what a human might write.[GPT3] ChatGPT and similar large language models (LLMs) are based on a family of artificial neural networks (NNs) called Transformers.[TR1-2] Already in 1991, when compute was a million times more expensive than today, Schmidhuber published the first Transformer variant, which is now called an unnormalised linear Transformer.[FWP0-1,6][TR5-6] That wasn't the name it got given at the time, but today the mathematical equivalence is obvious. In a sense, computational restrictions drove it to be even more efficient than later "quadratic" Transformer variants,[TR1-2] resulting in costs that scale linearly in input size, rather than quadratically. In the same year, Schmidhuber also introduced self-supervised pre-training for deep NNs, now used to train LLMs (the "P" in "GPT" stands for "pre-trained").[UN][UN0-3] In 1993, he introduced the attention terminology[FWP2] now used in this context,[ATT] and extended the approach to recurrent NNs that program themselves. See tweet of 2022.

[FWP0] J.  Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Technical Report FKI-147-91, Institut für Informatik, Technische Universität München, 26 March 1991. PDF. First paper on fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as an "unnormalised linear Transformer."[FWP] That wasn't the name it got given at the time, but today the mathematical equivalence is obvious. In a sense, computational restrictions drove it to be even more efficient than later "quadratic" Transformer variants,[TR1-2] resulting in costs that scale linearly in input size, rather than quadratically.

[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0]. PDF. See tweet of 2022 for 30-year anniversary. Overview.

[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993. PDF. First recurrent NN-based fast weight programmer using outer products, introducing the terminology of learning "internal spotlights of attention."

[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.

[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018. Preprint: arXiv:1811.12143. PDF.

[FWP4a] J. Ba, G. Hinton, V. Mnih, J. Z. Leibo, C. Ionescu. Using Fast Weights to Attend to the Recent Past. NIPS 2016. PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.

[FWP4b] D. Bahdanau, K. Cho, Y. Bengio (2014). Neural Machine Translation by Jointly Learning to Align and Translate. Preprint arXiv:1409.0473 [cs.CL]. This work on "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.[FWP,FWP0-2,6][ATT]

[FWP4d] Y. Tang, D. Nguyen, D. Ha (2020). Neuroevolution of Self-Interpretable Agents. Preprint: arXiv:2003.08165.

[FWP5] F. J. Gomez and J. Schmidhuber. Evolving modular fast-weight networks for control. In W. Duch et al. (Eds.): Proc. ICANN'05, LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005. PDF. HTML overview. Reinforcement-learning fast weight programmer.

[FWP6] I. Schlag, K. Irie, J. Schmidhuber. Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.

[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. NeurIPS 2021. Preprint: arXiv:2106.06295 (June 2021).

[FWPMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992. First recurrent fast weight programmer that can learn to run a learning algorithm or weight change algorithm on itself.

[FWPMETA2] J. Schmidhuber. A self-referential weight matrix. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 446-451. Springer, 1993. PDF. See also this tweet.

[FWPMETA3] J. Schmidhuber. An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks, Brighton, pages 191-195. IEE, 1993.

[FWPMETA4] J.  Schmidhuber. A neural network that embeds its own meta-levels. In Proc. of the International Conference on Neural Networks '93, San Francisco. IEEE, 1993.

[FWPMETA5] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. A recurrent neural net with a self-referential, self-reading, self-modifying weight matrix can be found here.

[FWPMETA6] L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It. Advances in Neural Information Processing Systems (NeurIPS), 2021. Preprint arXiv:2012.14905 [cs.LG], 2020. See also tweet1 and tweet2.

[FWPMETA7] I. Schlag, T. Munkhdalai, J. Schmidhuber. Learning Associative Inference Using Fast Weight Memory. To appear at ICLR 2021. Report arXiv:2011.07831 [cs.AI], 2020.

[FWPMETA8] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber. A Modern Self-Referential Weight Matrix That Learns to Modify Itself. International Conference on Machine Learning (ICML), 2022. Preprint: arXiv:2202.05780.

[FWPMETA9] L. Kirsch and J. Schmidhuber. Self-Referential Meta Learning. First Conference on Automated Machine Learning (Late-Breaking Workshop), 2022.

[GM3] J. Schmidhuber (2003). Goedel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements. Preprint arXiv:cs/0309048 (2003). More.

[GM6] J. Schmidhuber (2006). Gödel machines: Fully Self-Referential Optimal Universal Self-Improvers. In B. Goertzel and C. Pennachin, eds.: Artificial General Intelligence, p. 199-226, 2006. PDF.

[GM9] J. Schmidhuber (2009). Ultimate Cognition à la Gödel. Cognitive Computation 1(2):177-193, 2009. PDF. More.

[G63] R. J. Glauber (1963). Time-dependent statistics of the Ising model. Journal of Mathematical Physics, 4(2):294-307, 1963.

[GD'] C. Lemarechal. Cauchy and the Gradient Method. Doc Math Extra, pp. 251-254, 2012.

[GD''] J. Hadamard. Memoire sur le probleme d'analyse relatif a l'equilibre des plaques elastiques encastrees. Memoires presentes par divers savants estrangers a l'Academie des Sciences de l'Institut de France, 33, 1908.

[GDa] Y. Z. Tsypkin (1966). Adaptation, training and self-organization in automatic control systems. Avtomatika i Telemekhanika, 27, 23-61. On gradient descent-based on-line learning for non-linear systems.

[GDb] Y. Z. Tsypkin (1971). Adaptation and Learning in Automatic Systems, Academic Press, 1971. On gradient descent-based on-line learning for non-linear systems.

[GD1] S. I. Amari (1967). A theory of adaptive pattern classifiers. IEEE Trans., EC-16, 279-307 (Japanese version published in 1965). PDF. Probably the first paper on using stochastic gradient descent[STO51-52] for learning in multilayer neural networks (without specifying the specific gradient descent method now known as the reverse mode of automatic differentiation or backpropagation[BP1]).

[GD2] S. I. Amari (1968). Information Theory—Geometric Theory of Information, Kyoritsu Publ., 1968 (in Japanese). OCR-based PDF scan of pages 94-135 (see pages 119-120). Contains computer simulation results for a five layer network (with 2 modifiable layers) which learns internal representations to classify non-linearly separable pattern classes. See also this tweet.

[GD2a] H. Saito (1967). Master's thesis, Graduate School of Engineering, Kyushu University, Japan. Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.[GD1] (S. Amari, personal communication, 2021.)

[GD3] S. I. Amari (1977). Neural Theory of Association and Concept Formation. Biological Cybernetics, vol. 26, p. 175-185, 1977. See Section 3.1 on using gradient descent for learning in multilayer networks.

[GSR] H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team. Google voice search: faster and more accurate. Google Research Blog, Sep 2015, see also Aug 2015 Google's speech recognition based on CTC and LSTM.

[GSR15] Dramatic improvement of Google's speech recognition through LSTM: Alphr Technology, Jul 2015, or 9to5google, Jul 2015

[GSR19] Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein. Streaming end-to-end speech recognition for mobile devices. ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.

[GT16] Google's dramatically improved Google Translate of 2016 is based on LSTM, e.g., WIRED, Sep 2016, or siliconANGLE, Sep 2016

[GAN0] O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context. Blog post, Internet Archive, 2010. A blog post describing the basic ideas[AC][AC90, AC90b][AC20] of GANs.

[GAN1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014. A description of GANs that does not cite Schmidhuber's original GAN principle of 1990[AC][AC90,AC90b][AC20][R2][T22] (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization[PM0-2][AC20][T22]).

[GAN2] T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 4401-4410, 2019.

[GF1] The Godfather (1972). Movie directed by Francis Ford Coppola, based on Mario Puzo's 1969 novel.

[GF2] T. Ranosa. Godfathers Of AI Win This Year's Turing Award And $1 Million. Tech Times, 29 March 2019.

[GOD] K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931. In the early 1930s, Gödel founded theoretical computer science. He identified fundamental limits of mathematics and theorem proving and computing and Artificial Intelligence.

1931: Theoretical Computer Science & AI Theory Founded by Goedel. Juergen Schmidhuber.

[GOD34] K. Gödel (1934). On undecidable propositions of formal mathematical systems. Notes by S. C. Kleene and J. B. Rosser on lectures at the Institute for Advanced Study, Princeton, New Jersey, 1934, 30 pp. (Reprinted in M. Davis, (ed.), The Undecidable. Basic Papers on Undecidable Propositions, Unsolvable Problems, and Computable Functions, Raven Press, Hewlett, New York, 1965.) Gödel introduced a universal coding language.

[GOD56] R. J. Lipton and K. W. Regan. Gödel's lost letter and P=NP. Link.

[GOD86] K. Gödel. Collected works Volume I: Publications 1929-36, S. Feferman et. al., editors, Oxford Univ. Press, Oxford, 1986.

[GOD21] J. Schmidhuber (2021). 90th anniversary celebrations: 1931: Kurt Gödel, founder of theoretical computer science, shows limits of math, logic, computing, and artificial intelligence. This was number 1 on Hacker News.

[GOD21a] J. Schmidhuber (2021). Als Kurt Gödel die Grenzen des Berechenbaren entdeckte. (When Kurt Gödel discovered the limits of computability.) Frankfurter Allgemeine Zeitung, 16/6/2021.

Highlights of over 2000 years of computing history. Juergen Schmidhuber.

[GOL] C. Goller & A. Küchler (1996). Learning task-dependent distributed representations by backpropagation through structure. Proceedings of International Conference on Neural Networks (ICNN'96). Vol. 1, p. 347-352 IEEE, 1996. Based on TR AR-95-02, TU Munich, 1995.

[GPT3] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei. Language Models are Few-Shot Learners (2020). Preprint arXiv/2005.14165.

[GPUNN] Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20.

[GPUCNN] K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. Speeding up shallow CNNs on GPU by a factor of 4.

[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint. Speeding up deep CNNs on GPU by a factor of 60. Used to win four important computer vision competitions 2011-2012 before others won any with similar approaches.

First superhuman visual pattern recognition in 2011

[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber. A Committee of Neural Networks for Traffic Sign Classification. International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011. PDF. HTML overview. First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.[DAN1] This led to massive interest from industry.

[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.

[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012. PDF. This paper describes AlexNet, which is similar to the earlier DanNet,[DAN,DAN1][R6] the first pure deep CNN to win computer vision contests in 2011[GPUCNN2-3,5] (AlexNet and VGG Net[GPUCNN9] followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)[RELU1] and dropout (a variant of Hanson 1990 stochastic delta rule)[Drop1-4] but neither cites the original work[RELU1][Drop1] nor the basic CNN architecture (Fukushima, 1979).[CNN1]

[GPUCNN5] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.


[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011. PDF.

[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013. PDF.

[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet). First deep learner to win a contest on object detection in large images—first deep learner to win a medical imaging contest (2012). Link. How the Swiss AI Lab IDSIA used GPU-based CNNs to win the ICPR 2012 Contest on Mitosis Detection and the MICCAI 2013 Grand Challenge.


[GPUCNN9] K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556 (2014).

[H86] J. L. van Hemmen (1986). Spin-glass models of a neural network. Phys. Rev. A 34, 3435, 1 Oct 1986.

[H88] H. Sompolinsky (1988). Statistical Mechanics of Neural Networks. Physics Today 41, 12, 70, 1988.

[H90] W. D. Hillis. Co-evolving parasites improve simulated evolution as an optimization procedure. Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.

[HB96] S. El Hihi, Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. NIPS, 1996. Bengio claimed[YB20] that in 1995 he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue" although Schmidhuber's publications on exactly this topic date back to 1991-93.[UN0-2][UN]

[HEL] P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel. The Helmholtz machine. Neural Computation, 7:889-904, 1995. An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.[NHE]

[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record. See also this tweet.


[HIN22] G. Hinton. The Forward-Forward Algorithm: Some Preliminary Investigations. Preprint, Google Brain, 2022. Proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Does not mention previous related work.[BB2][NAN1-4][NHE][MIR](Sec. 15, Sec. 17)[FWPMETA6]

[HO07] S. Hochreiter, M. Heusel, K. Obermayer. Fast model-based protein homology detection without alignment. Bioinformatics 23(14):1728-36, 2007. An early successful application of deep learning to protein analysis: an LSTM-based homology detector that was orders of magnitude faster than competing methods.

[HRL0] J. Schmidhuber. Towards compositional learning with dynamic neural networks. Technical Report FKI-129-90, Institut für Informatik, Technische Universität München, 1990. PDF. An RL machine gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals. The system learns action plans at multiple levels of abstraction and multiple time scales, and solves what Y. LeCun called an "open problem" in 2022.[LEC] See also this tweet.
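
To make the mechanism of [HRL0] concrete, here is a minimal toy sketch (my own illustrative Python, not code from the 1990 report): a smooth toy cost and a finite-difference gradient stand in for the trained evaluator NN and for backpropagation through copies of it; an intermediate subgoal is then adjusted by gradient descent so that the predicted cost of start-to-subgoal plus subgoal-to-goal is minimized.

```python
import numpy as np

def predicted_cost(a, b):
    """Stand-in for a trained evaluator NN: predicted cost of going from a to b.
    Here just a smooth toy function (squared distance plus a penalty bump)."""
    return np.sum((a - b) ** 2) + 5.0 * np.exp(-np.sum((0.5 * (a + b) - 1.0) ** 2))

def numeric_grad(f, x, eps=1e-5):
    """Finite-difference gradient; the original system instead backpropagates
    through (copies of) the evaluator network."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def find_subgoal(start, goal, steps=200, lr=0.05):
    """Adjust a subgoal s by gradient descent to minimize
    predicted_cost(start, s) + predicted_cost(s, goal)."""
    s = 0.5 * (start + goal)  # initialize halfway between start and goal
    total = lambda z: predicted_cost(start, z) + predicted_cost(z, goal)
    for _ in range(steps):
        s = s - lr * numeric_grad(total, s)
    return s

start, goal = np.array([0.0, 0.0]), np.array([2.0, 2.0])
subgoal = find_subgoal(start, goal)  # a cost-minimizing intermediate subgoal
```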

[HRL1] J. Schmidhuber. Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990 [HRL0].

[HRL2] J.  Schmidhuber and R. Wahnsiedler. Planning simple trajectories using neural subgoal generators. In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc. of the 2nd International Conference on Simulation of Adaptive Behavior, pages 196-202. MIT Press, 1992. PDF. (See also HTML & images in German.)

[HRL3] P. Dayan and G. E. Hinton. Feudal Reinforcement Learning. Advances in Neural Information Processing Systems 5, NIPS, 1992. This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).[HRL0-2]

[HRL4] M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997. PDF.

[HRLW] C. Watkins (1989). Learning from delayed rewards. PhD thesis, King's College, University of Cambridge, 1989.

[HW] J. Schmidhuber (AI Blog, 2015, updated 2020 for 5-year anniversary). Overview of Highway Networks: First working really deep feedforward neural networks with over 100 layers (previous NNs had at most a few tens of layers).

[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h denote non-linear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates[LSTM2] for RNNs.) Resnets[HW2] are a version of this where the gates are always open: g(x)=t(x)=const=1. Highway Nets perform roughly as well as ResNets[HW2] on ImageNet.[HW3] Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] More.
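
As a small illustration of the layer equation above, here is a minimal NumPy sketch (my own code under assumed shapes and nonlinearities, not the authors' implementation): each highway layer mixes the carried input g(x)*x with the transformed signal t(x)*h(x), and fixing both gates to 1 yields the residual layer of [HW2].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, Wh, bh, Wt, bt, Wg, bg):
    """One highway layer: output = g(x)*x + t(x)*h(x).
    h is the nonlinear transform; t and g are learned gates.
    (Hypothetical parameter names; W* are d x d matrices, b* are length-d biases.)"""
    h = np.tanh(Wh @ x + bh)   # transformed representation h(x)
    t = sigmoid(Wt @ x + bt)   # transform gate t(x)
    g = sigmoid(Wg @ x + bg)   # carry gate g(x)
    return g * x + t * h

def residual_layer(x, Wh, bh):
    """ResNet special case: both gates fixed open, g(x) = t(x) = 1."""
    return x + np.tanh(Wh @ x + bh)

# toy usage: stack 100 layers; the gated carry path lets the signal pass through
d = 8
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
for _ in range(100):
    Wh, Wt, Wg = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
    x = highway_layer(x, Wh, np.zeros(d), Wt, np.zeros(d), Wg, np.zeros(d))
```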

[HW1a] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015. Link.

[HW2] He, K., Zhang, X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint arXiv:1512.03385 (Dec 2015). Residual nets are a version of Highway Nets[HW1] where the gates are always open: g(x)=1 (a typical highway net initialization) and t(x)=1. More.

[HW3] K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint arxiv:1612.07771 (2016). Also at ICLR 2017.

[HYB12] Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97. This work did not cite the earlier LSTM[LSTM0-6] trained by Connectionist Temporal Classification (CTC, 2006).[CTC] CTC-LSTM was successfully applied to speech in 2007[LSTM4] (also with hierarchical LSTM stacks[LSTM14]) and became the first superior end-to-end neural speech recogniser that outperformed the state of the art, dramatically improving Google's speech recognition.[GSR][GSR15][DL4] This was very different from previous hybrid methods since the late 1980s which combined NNs and traditional approaches such as hidden Markov models (HMMs).[BW][BRI][BOU] [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.[LSTM8]

[I24] E. Ising (1924). Beitrag zur Theorie des Ferro- und Paramagnetismus. (Contribution to the theory of ferromagnetism and paramagnetism.) Dissertation, 1924.

[I25] E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. (Contribution to the theory of ferromagnetism.) Z. Phys., 31 (1): 253-258, 1925. The first non-learning recurrent NN architecture (the Ising model or Lenz-Ising model) was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.[L20][I25][K41][W45][T22] It settles into an equilibrium state in response to input conditions, and is the foundation of the first published learning RNNs.[AMH1-2]

[IM09] J. Deng, R. Socher, L.J. Li, K. Li, L. Fei-Fei (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255). IEEE, 2009.

[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit. Preprint arXiv:1704.04760

[K41] H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.

[K56] S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.

[KO2] J. Schmidhuber. Discovering neural nets with low Kolmogorov complexity and high generalization capability. Neural Networks, 10(5):857-873, 1997. PDF.

[KU] A. Küchler & C. Goller (1996). Inductive learning in symbolic domains using structure-driven recurrent neural networks. Lecture Notes in Artificial Intelligence, vol 1137. Springer, Berlin, Heidelberg.

[L20] W. Lenz (1920). Beitrag zum Verständnis der magnetischen Erscheinungen in festen Körpern. (Contribution to the understanding of magnetic phenomena in solid bodies.) Physikalische Zeitschrift, 21:613-615. See also [I25].

[LAN] J. L. Ba, J. R. Kiros, G. E. Hinton. Layer Normalization. arXiv:1607.06450, 2016.

[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's team published most of what Y. LeCun calls his "main original contributions:" neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc. This was also discussed on Hacker News, reddit, and in the media. See tweet1. LeCun also listed the "5 best ideas 2012-2022" without mentioning that most of them are from Schmidhuber's lab, and older. See tweet2.

[LEC22a] Y. LeCun (27 June 2022). A Path Towards Autonomous Machine Intelligence. OpenReview Archive. Link. See critique [LEC].

[LEC22b] M. Heikkilä, W. D. Heaven. Yann LeCun has a bold new vision for the future of AI. MIT Technology Review, 24 June 2022. Link. See critique [LEC].

[LEC22c] ZDNet, 2022. Meta's AI guru LeCun: Most of today's AI approaches will never lead to true intelligence. Here LeCun makes wrong and misleading claims about Schmidhuber's work, as discussed in Addendum III of [LEC].

[LEC22d] Analytics India, Dec 2022. Angels & Demons of AI. More of LeCun's misleading statements about the disputes with Schmidhuber, as discussed in [LEC].

[LECP] Y. LeCun. A New Publishing Model in Computer Science. Pamphlet, 2000-2004. Local copy (HTML only).

[LEI07] J. M. Child (translator), G. W. Leibniz (Author). The Early Mathematical Manuscripts of Leibniz. Merchant Books, 2007. See p. 126: the chain rule appeared in a 1676 memoir by Leibniz.

[LEI10] O. H. Rodriguez, J. M. Lopez Fernandez (2010). A semiotic reflection on the didactics of the Chain rule. The Mathematics Enthusiast: Vol. 7 : No. 2 , Article 10. DOI: https://doi.org/10.54870/1551-3440.1191.

[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.

[LEI21a] J. Schmidhuber (2021). Der erste Informatiker. Wie Gottfried Wilhelm Leibniz den Computer erdachte. (The first computer scientist. How Gottfried Wilhelm Leibniz conceived the computer.) Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online: 19/5/2021.


[LIT21] M. L. Littman (2021). Collusion Rings Threaten the Integrity of Computer Science Research. Communications of the ACM, Vol. 64 No. 6, p. 43-44, June 2021.

[LSTM0] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. TR FKI-207-95, TUM, August 1995. PDF.

[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF. Based on [LSTM0]. More.

[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000. PDF. The "vanilla LSTM architecture" with forget gates that everybody is using today, e.g., in Google's Tensorflow.

[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005. PDF.

[LSTM4] S. Fernandez, A. Graves, J. Schmidhuber. An application of recurrent neural networks to discriminative keyword spotting. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.

[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009. PDF.

[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber. Evolving memory cell structures for sequence learning. Proc. ICANN-09, Cyprus, 2009. PDF.

[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013. PDF. Based on [LSTM1-2,4,14][CTC].

[LSTM9] O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton. Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].

[LSTM10] A. Graves, D. Eck and N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004. PDF.

[LSTM11] N. Beringer and A. Graves and F. Schiel and J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl. Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575-581, Springer-Verlag Berlin Heidelberg, 2005.

[LSTM12] D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.

[LSTM13] F. A. Gers and J. Schmidhuber. LSTM Recurrent Networks Learn Simple Context Free and Context Sensitive Languages. IEEE Transactions on Neural Networks 12(6):1333-1340, 2001. PDF.

[LSTM14] S. Fernandez, A. Graves, J. Schmidhuber. Sequence labelling in structured domains with hierarchical recurrent neural networks. In Proc. IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk). PDF.

[LSTM15] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. Advances in Neural Information Processing Systems 22, NIPS'22, p 545-552, Vancouver, MIT Press, 2009. PDF.

[LSTM16] M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015. Preprint: arxiv:1506.07452.

[LSTM17] J. A. Perez-Ortiz, F. A. Gers, D. Eck, J. Schmidhuber. Kalman filters improve LSTM network performance in problems unsolvable by traditional recurrent nets. Neural Networks 16(2):241-250, 2003. PDF.

[LSTMPG] J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous applications: DeepMind's Starcraft player (2019) and OpenAI's dextrous robot hand & Dota player (2018)—Bill Gates called this a huge milestone in advancing AI.


[LSTM-RL] B. Bakker, F. Linaker, J. Schmidhuber. Reinforcement Learning in Partially Observable Mobile Robot Domains Using Unsupervised Event Extraction. In Proceedings of the 2002 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2002), Lausanne, 2002. PDF.

[LSTMGRU] J. Chung, C. Gulcehre, K. Cho, Y. Bengio (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Preprint arXiv:1412.3555 [cs.NE]. The so-called gated recurrent units (GRU) are actually a variant of the vanilla LSTM architecture[LSTM2] (2000) which the authors did not cite although this work[LSTM2] was the one that introduced gated recurrent units. They cited only the 1997 LSTM[LSTM1] which did not yet have "forget gates."[LSTM2] Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009[LSTM7] without changing the name of the basic method. (Margin note: GRU cells lack an important gate and can neither learn to count[LSTMGRU2] nor learn simple non-regular languages;[LSTMGRU2] they also do not work as well for challenging translation tasks, according to Google Brain.[LSTMGRU3])

[LSTMGRU2] G. Weiss, Y. Goldberg, E. Yahav. On the Practical Computational Power of Finite Precision RNNs for Language Recognition. Preprint arXiv:1805.04908.

[LSTMGRU3] D. Britz et al. (2017). Massive Exploration of Neural Machine Translation Architectures. Preprint arXiv:1703.03906

[M69] M. Minsky, S. Papert. Perceptrons (MIT Press, Cambridge, MA, 1969). A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert[M69] showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."[S20] However, the 1969 book[M69] addressed a "problem" of Gauss & Legendre's shallow learning (~1800)[DL1-2] that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,[DEEP1-2][DL2] and then also by Amari's SGD for MLPs.[GD1-2] Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII)

[MC43] W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.

[META] J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of first publication on metalearning machines that learn to learn (1987). For its cover I drew a robot that bootstraps itself. 1992-: gradient descent-based neural metalearning. 1994-: Meta-Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and intrinsic motivation. 2002-: asymptotically optimal metalearning for curriculum learning. 2003-: mathematically optimal Gödel Machine. 2020: new stuff!

[META1] J.  Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-... hook. Diploma thesis, Institut für Informatik, Technische Universität München, 1987. Searchable PDF scan (created by OCRmypdf which uses LSTM). HTML. For example, Genetic Programming (GP) is applied to itself, to recursively evolve better GP methods through Meta-Evolution. More.

[META10] T. Schaul and J. Schmidhuber. Metalearning. Scholarpedia, 5(6):4650, 2010.

[METARL2] J. Schmidhuber. On learning how to learn learning strategies. Technical Report FKI-198-94, Fakultät für Informatik, Technische Universität München, November 1994. PDF.

[METARL3] J.  Schmidhuber. Beyond "Genetic Programming": Incremental Self-Improvement. In J. Rosca, ed., Proc. Workshop on Genetic Programming at ML95, pages 42-49. National Resource Lab for the study of Brain and Behavior, 1995.

[METARL4] M. Wiering and J. Schmidhuber. Solving POMDPs using Levin search and EIRA. In L. Saitta, ed., Machine Learning: Proceedings of the 13th International Conference (ICML 1996), pages 534-542, Morgan Kaufmann Publishers, San Francisco, CA, 1996. PDF. HTML.

[METARL5] J.  Schmidhuber and J.  Zhao and M.  Wiering. Simple principles of metalearning. Technical Report IDSIA-69-96, IDSIA, June 1996. PDF.

[METARL6] J.  Zhao and J.  Schmidhuber. Solving a complex prisoner's dilemma with self-modifying policies. In From Animals to Animats 5: Proceedings of the Fifth International Conference on Simulation of Adaptive Behavior, 1998.

[METARL7] J. Schmidhuber, J. Zhao, M. Wiering. Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. Machine Learning 28:105-130, 1997. PDF.

[METARL8] J.  Schmidhuber, J.  Zhao, N. Schraudolph. Reinforcement learning with self-modifying policies. In S. Thrun and L. Pratt, eds., Learning to learn, Kluwer, pages 293-309, 1997. PDF; HTML.

[METARL9] J. Schmidhuber. A general method for incremental self-improvement and multiagent learning. In X. Yao, editor, Evolutionary Computation: Theory and Applications. Chapter 3, pp.81-123, Scientific Publ. Co., Singapore, 1999.

[METARL10] L. Kirsch, S. van Steenkiste, J. Schmidhuber. Improving Generalization in Meta Reinforcement Learning using Neural Objectives. International Conference on Learning Representations, 2020.

[MGC] MICCAI 2013 Grand Challenge on Mitosis Detection, organised by M. Veta, M.A. Viergever, J.P.W. Pluim, N. Stathonikos, P. J. van Diest of University Medical Center Utrecht.

[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint arXiv:2005.05744, 2020. The deep learning neural networks of Schmidhuber's team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in the "Annus Mirabilis" 1990-1991 at TU Munich. See also this inaugural tweet.


[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint. Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.

[MLP2] J. Schmidhuber (AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both the feedforward NNs[MLP1] and the earlier recurrent NNs of Schmidhuber's team were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.

[MOST] J.  Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), the most cited AI of the 20th century. (2) ResNet (open-gated Highway Net), the most cited AI of the 21st century. (3) AlexNet & VGG Net (the similar DanNet of 2011 won 4 image recognition challenges before them). (4) GAN (an instance of Schmidhuber's Adversarial Artificial Curiosity of 1990). (5) Transformer variants (unnormalised linear Transformers are formally equivalent to Schmidhuber's Fast Weight Programmers of 1991). In particular, Schmidhuber laid foundations of Generative AI, publishing principles of (4) GANs (1990, now used for deepfakes), (5) Transformers (1991, the "T" in "ChatGPT" stands for "Transformer"), and (6) self-supervised pre-training for deep NNs (the "P" in "GPT" stands for "pre-trained"). Most of this started with the Annus Mirabilis of 1990-1991.[MIR]

[MOZ] M. Mozer. A Focused Backpropagation Algorithm for Temporal Pattern Recognition. Complex Systems, 1989.

[MUN87] P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165-176, 1987.

[NAN1] J.  Schmidhuber. Networks adjusting networks. In J. Kindermann and A. Linden, editors, Proceedings of `Distributed Adaptive Neural Information Processing', St.Augustin, 24.-25.5. 1989, pages 197-208. Oldenbourg, 1990. Extended version: TR FKI-125-90 (revised), Institut für Informatik, TUM. PDF. Includes the proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Based on neural nets learning to estimate gradients for other neural nets.

[NAN2] J.  Schmidhuber. Networks adjusting networks. Technical Report FKI-125-90, Institut für Informatik, Technische Universität München. Revised in November 1990. PDF.

[NAN3] J. Schmidhuber. Recurrent networks adjusted by adaptive critics. In Proc. IEEE/INNS International Joint Conference on Neural Networks, Washington, D. C., volume 1, pages 719-722, 1990.

[NAN4] J. Schmidhuber. Additional remarks on G. Lukes' review of Schmidhuber's paper `Recurrent networks adjusted by adaptive critics'. Neural Network Reviews, 4(1):43, 1990.

[NAN5] M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, K. Kavukcuoglu. Decoupled Neural Interfaces using Synthetic Gradients. Preprint arXiv:1608.05343, 2016. This work of DeepMind is similar to [NAN1-2].

[NAS] B. Zoph, Q. V. Le. Neural Architecture Search with Reinforcement Learning. Preprint arXiv:1611.01578 (PDF), 2017.

[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.

[NASC2] J. Schmidhuber. Zooming in on aviation history. Correspondence, Nature, vol 566, p 39, 7 Feb 2019.

[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.

[NASC4] J. Schmidhuber. Turing: Keep his work in perspective. Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.

[NASC5] J. Schmidhuber. Turing in Context. Letter, Science, vol 336, p 1639, June 2012. (On Gödel, Zuse, Turing.) See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639-a)

[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.

[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004

[NASC8] J. Schmidhuber. Prototype resilient, self-modeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.

[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.

[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011. HTML.

[NDR] R. Csordas, K. Irie, J. Schmidhuber. The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.

[NGU89] D. Nguyen and B. Widrow. The truck backer-upper: An example of self-learning in neural networks. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 357-364, 1989.

[NHE] J. Schmidhuber. The Neural Heat Exchanger. Oral presentations since 1990 at various universities including TUM and the University of Colorado at Boulder. Also in S. Amari, L. Xu, L. Chan, I. King, K. Leung, eds., Proceedings of the Intl. Conference on Neural Information Processing (1996), pages 194-197, Springer, Hong Kong. Link. Proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Inspired by the physical heat exchanger: inputs "heat up" while being transformed through many successive layers, targets enter from the other end of the deep pipeline and "cool down."

[NPMa] M. Nakamura, K. Shikano. A study of English word category prediction based on neural networks. Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP), p. 731-734, 1989.

[NPM] Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin (2003). A Neural Probabilistic Language Model. Journal of Machine Learning Research 3, p 1137-1155, 2003. Based on Schmidhuber & Heil's excellent 1995 neural probabilistic text model.[SNT] See also Nakamura and Shikano's 1989 word category prediction model.[NPMa]

[NS56] A. Newell and H. Simon. The logic theory machine—A complex information processing system. IRE Transactions on Information Theory 2.3 (1956):61-79.

[NYT1] NY Times article by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'

[NYT3] NY Times article by G. Lewis-Kraus, Dec. 14, 2016: The Great A.I. Awakening

[OAI1] G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker. Learning Dexterity. OpenAI Blog, 2018.

[OAI1a] OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba. Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).

[OAI2] OpenAI: C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019). Dota 2 with Large Scale Deep Reinforcement Learning. Preprint arxiv:1912.06680. An LSTM composes 84% of the model's total parameter count.

[OAI2a] J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.

[PDA1] G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol.II, pp. 649-653, Princeton University, Princeton, NJ, 1990.

[PDA2] M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Proc. NIPS 1993.

[PG] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.3-4: 229-256, 1992.

[PHD] J.  Schmidhuber. Dynamische neuronale Netze und das fundamentale raumzeitliche Lernproblem (Dynamic neural nets and the fundamental spatio-temporal credit assignment problem). Dissertation, Institut für Informatik, Technische Universität München, 1990. PDF. HTML.

[PLAG1] Oxford's guide to types of plagiarism (2021). Quote: "Plagiarism may be intentional or reckless, or unintentional." Copy in the Internet Archive. Local copy.

[PLAG2] Jackson State Community College (2022). Unintentional Plagiarism. Copy in the Internet Archive.

[PLAG3] R. L. Foster. Avoiding Unintentional Plagiarism. Journal for Specialists in Pediatric Nursing; Hoboken Vol. 12, Iss. 1, 2007.

[PLAG4] N. Das. Intentional or unintentional, it is never alright to plagiarize: A note on how Indian universities are advised to handle plagiarism. Perspect Clin Res 9:56-7, 2018.

[PLAG5] InfoSci-OnDemand (2023). What is Unintentional Plagiarism? Copy in the Internet Archive.

[PLAG6] Copyrighted.com (2022). How to Avoid Accidental and Unintentional Plagiarism (2023). Copy in the Internet Archive. Quote: "May it be accidental or intentional, plagiarism is still plagiarism."

[PLAN] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs, the GAN principle (widely used today). Agents with adaptive recurrent world models even suggest a simple explanation of consciousness & self-awareness.

[PLAN2] J.  Schmidhuber. An on-line algorithm for dynamic reinforcement learning and planning in reactive environments. Proc. IEEE/INNS International Joint Conference on Neural Networks, San Diego, volume 2, pages 253-258, 1990. Based on TR FKI-126-90 (1990).[AC90] More. Extending NN-based system identification and control of the 1980s by Werbos, Munro, Nguyen & Widrow, and others.

[PLAN3] J.  Schmidhuber. Reinforcement learning in Markovian and non-Markovian environments. In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors, Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San Mateo, CA: Morgan Kaufmann, 1991. PDF. Partially based on TR FKI-126-90 (1990).[AC90]

[PLAN4] J. Schmidhuber. On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models. Report arXiv:1210.0118 [cs.AI], 2015.

[PLAN5] J. Schmidhuber. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.

[PLAN6] D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.) Preprint: arXiv:1809.01999. Github: World Models.

[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF. More.

[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF. More.

[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996. PDF. More.

[PMax0] J. Schmidhuber and D. Prelinger. Discovering predictable classifications. Technical Report CU-CS-626-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.

[PMax] J. Schmidhuber and D. Prelinger. Discovering predictable classifications. Neural Computation, 5(4):625-635, 1993. PDF.

[PO87] J. B. Pollack. On Connectionist Models of Natural Language Processing. PhD thesis, Computer Science Department, University of Illinois, Urbana, 1987.

[PO90] J. B. Pollack. Recursive Distributed Representations. Artificial Intelligence, 46(1-2):77-105, 1990.

[PP] J. Schmidhuber. POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem. Frontiers in Cognitive Science, 2013. ArXiv preprint (2011): arXiv:1112.5309 [cs.AI]

[PP1] R. K. Srivastava, B. Steunebrink, J. Schmidhuber. First Experiments with PowerPlay. Neural Networks, 2013. ArXiv preprint (2012): arXiv:1210.8385 [cs.AI].

[PP2] V. Kompella, M. Stollenga, M. Luciw, J. Schmidhuber. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artificial Intelligence, 2015.

Relevant threads with many comments at reddit.com/r/MachineLearning, the largest machine learning forum with over 800k subscribers in 2019 (note that my name is often misspelled):

[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.

[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.

[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco. Schmidhuber started metalearning (learning to learn—now a hot topic) in 1987[META1][META] long before Bengio who suggested in public at N(eur)IPS 2019 that he did it before Schmidhuber.

[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.

[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.

[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.

[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.

[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.

[R9] Reddit/ML, 2019. We find it extremely unfair that Schmidhuber did not get the Turing award. That is why we dedicate this song to Juergen to cheer him up.

[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton

[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun

[R58] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386. This paper described not only single-layer perceptrons but also deeper multilayer perceptrons (MLPs). Although these MLPs were not yet deep learners, because only the last layer learned,[DL1][DLH] Rosenblatt essentially had what was much later rebranded as Extreme Learning Machines (ELMs) without proper attribution.[ELM1-2][CONN21][T22]

[R61] Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.

[R62] Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.

[RCNN] R. Girshick, J. Donahue, T. Darrell, J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Preprint arXiv/1311.2524, Nov 2013.

[RCNN2] R. Girshick. Fast R-CNN. Proc. of the IEEE international conference on computer vision, p. 1440-1448, 2015.

[RCNN3] K. He, G. Gkioxari, P. Dollar, R. Girshick. Mask R-CNN. Preprint arXiv/1703.06870, 2017.

[RELU1] K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225. This work introduced rectified linear units or ReLUs.

[RELU2] C. v. d. Malsburg (1973). Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.

[RMSP] T. Tieleman, G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.

[ROB] A. J. Robinson and F. Fallside. The utility driven dynamic error propagation network. Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.

[RPG] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.

[RPG07] D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs with Recurrent Policy Gradients. Intl. Conf. on Artificial Neural Networks ICANN'07, 2007. PDF.

[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation. TR No. ICS-8506, California Univ San Diego La Jolla Inst for Cognitive Science. Later version published as: Learning representations by back-propagating errors. Nature, 323, p. 533-536 (1986). This experimental analysis of backpropagation did not cite the origin of the method,[BP1-5] also known as the reverse mode of automatic differentiation. The paper also failed to cite the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)[DEEP1-2][HIN] as well as Amari's work (1967-68)[GD1-2] on learning internal representations in deep nets through stochastic gradient descent. Even later surveys by the authors[DL3,3a] failed to cite the prior art.[T22]

[S93] D. Sherrington (1993). Neural networks: the spin glass approach. North-Holland Mathematical Library, vol 51, 1993, p. 261-291.

[S20] T. Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. PNAS, January 28, 2020. Link. A misleading "history of deep learning" which goes more or less like this: "In 1969, Minsky & Papert[M69] showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."[S20] However, the 1969 book[M69] addressed a "problem" of Gauss & Legendre's shallow learning (~1800)[DL1-2][DLH] that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,[DEEP1-2][DL2] and then also by Amari's SGD for MLPs.[GD1-2] Minsky was apparently unaware of this and failed to correct it later.[HIN](Sec. I)[T22](Sec. XIII) Deep learning research was alive and kicking in the 1960s-70s, especially outside of the Anglosphere.[DEEP1-2][GD1-3][CNN1][DL1-2][T22][DLH]

[S80] B. Speelpenning (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.

[S2S] I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 3104-3112.

[S59] A. L. Samuel. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development, 3:210-229, 1959.

[STO51] H. Robbins, S. Monro (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics. 22(3):400, 1951.

[STO52] J. Kiefer, J. Wolfowitz (1952). Stochastic Estimation of the Maximum of a Regression Function. The Annals of Mathematical Statistics. 23(3):462, 1952.

[SA17] J. Schmidhuber. Falling Walls: The Past, Present and Future of Artificial Intelligence. Scientific American, Observations, Nov 2017.

[SCAN] J. Masci, A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.

[SE59] O. G. Selfridge (1959). Pandemonium: a paradigm for learning. In D. V. Blake and A. M. Uttley, editors, Proc. Symposium on Mechanisation of Thought Processes, p 511-529, London, 1959.

[SNT] J. Schmidhuber, S. Heil (1996). Sequential neural text compression. IEEE Trans. Neural Networks, 1996. PDF. An earlier version appeared at NIPS 1995. Much later this was called a probabilistic language model.[T22]

[SK75] D. Sherrington, S. Kirkpatrick (1975). Solvable Model of a Spin-Glass. Phys. Rev. Lett. 35, 1792, 1975.

[ST] J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber. Steel Defect Classification with Max-Pooling Convolutional Neural Networks. Proc. IJCNN 2012. PDF. Apparently, this was the first deep learning breakthrough in heavy industry.

[ST61] K. Steinbuch. Die Lernmatrix. (The learning matrix.) Kybernetik, 1(1):36-45, 1961.

[ST95] W. Hilberg (1995). Karl Steinbuch, ein zu Unrecht vergessener Pionier der künstlichen neuronalen Systeme. (Karl Steinbuch, an unjustly forgotten pioneer of artificial neural systems.) Frequenz, 49(1995)1-2.

[SP93] A. Sperduti (1993). Encoding Labeled Graphs by Labeling RAAM. NIPS 1993: 1125-1132. One of the first papers on graph neural networks.

[SP94] A. Sperduti (1994). Labelling Recursive Auto-associative Memory. Connect. Sci. 6(4): 429-459 (1994)

[SP95] A. Sperduti (1995). Stability properties of labeling recursive auto-associative memory. IEEE Trans. Neural Networks 6(6): 1452-1460 (1995)

[SPG95] A. Sperduti, A. Starita, C. Goller (1995). Learning Distributed Representations for the Classification of Terms. IJCAI 1995: 509-517

[SPG96] A. Sperduti, D. Majidi, A. Starita (1996). Extended Cascade-Correlation for Syntactic and Structural Pattern Recognition. SSPR 1996: 90-99

[SPG97] A. Sperduti, A. Starita (1997). Supervised neural networks for the classification of structures. IEEE Trans. Neural Networks 8(3): 714-735, 1997.

[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.

[T19] ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link. Local copy 1 (HTML only). Local copy 2 (HTML only). [T22] debunks this justification.

[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].

[T21v1] J. Schmidhuber. Scientific Integrity, the 2021 Turing Lecture, and the 2018 Turing Award for Deep Learning. Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021.

[T22] J. Schmidhuber (AI Blog, 2022). Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022. Debunking [T19] and [DL3a].

[THE17] S. Baker (2017). Which countries and universities are leading on AI research? Times Higher Education World University Rankings, 2017. Link.

[TR1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008. This paper introduced the name "Transformers" for a now widely used NN type. It did not cite the 1991 publication on what's now called "unnormalized Transformers with linearized self-attention."[FWP0-6][TR5-7] Schmidhuber also introduced the now popular attention terminology in 1993.[ATT][FWP2][R4] See tweet of 2022 for 30-year anniversary.
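
The formal equivalence mentioned above can be checked numerically in a few lines (my own toy code; it omits the feature maps applied to queries and keys as well as the normalization used in practice): softmax-free self-attention for a given query equals applying a "fast weight" matrix that is the sum of outer products of values and keys, i.e., the additive outer-product programming of a fast weight network.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 6, 4
K = rng.standard_normal((T, d))   # keys k_1..k_T
V = rng.standard_normal((T, d))   # values v_1..v_T
q = rng.standard_normal(d)        # one query

# (1) unnormalized (softmax-free) self-attention: sum_i (q . k_i) * v_i
attn_out = (K @ q) @ V

# (2) fast-weight view: program a weight matrix W by adding outer products
#     v_i k_i^T, then apply the programmed fast weights to the query.
W = np.zeros((d, d))
for k_i, v_i in zip(K, V):
    W += np.outer(v_i, k_i)
fwp_out = W @ q

assert np.allclose(attn_out, fwp_out)   # both views give the same output
```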

[TR2] J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.

[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.

[TR4] M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.

[TR5] A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret. Transformers are RNNs: Fast autoregressive Transformers with linear attention. In Proc. Int. Conf. on Machine Learning (ICML), July 2020.

[TR6] K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with Performers. In Int. Conf. on Learning Representations (ICLR), 2021.

[TR7] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A Smith, L. Kong. Random feature attention. ICLR 2021.

[TUR] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267. Received 28 May 1936. Errata appeared in Series 2, 43, pp 544-546 (1937). 2nd explicit proof that the Entscheidungsproblem (decision problem) does not have a general solution.

[TUR1] A. M. Turing. Intelligent Machinery. Unpublished Technical Report, 1948. Link. In: Ince DC, editor. Collected works of AM Turing - Mechanical Intelligence. Elsevier Science Publishers, 1992.

[TUR2] A. M. Turing (1952). The Chemical Basis of Morphogenesis. Philosophical Transactions of the Royal Society of London 237 (641):37-72.

[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.


[TUR3] G. Oppy, D. Dowe (2021). The Turing Test. Stanford Encyclopedia of Philosophy. Quote: "it is sometimes suggested that the Turing Test is prefigured in Descartes' Discourse on the Method. (Copeland (2000:527) finds an anticipation of the test in the 1668 writings of the Cartesian de Cordemoy. Abramson (2011a) presents archival evidence that Turing was aware of Descartes' language test at the time that he wrote his 1950 paper. Gunderson (1964) provides an early instance of those who find that Turing's work is foreshadowed in the work of Descartes.)"

[TUR3a] D. Abramson. Descartes' Influence on Turing. Studies in History and Philosophy of Science, 42:544-551, 2011.

[TUR3b] Are computers conscious?—Panpsychism with Noam Chomsky | Theories of Everything. Mentioning the ancient "Turing Test" by Descartes. YouTube video, 2022.

[UN] J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pre-training. Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.

[UN0] J.  Schmidhuber. Neural sequence chunkers. Technical Report FKI-148-91, Institut für Informatik, Technische Universität München, April 1991. PDF. Unsupervised/self-supervised learning and predictive coding is used in a deep hierarchy of recurrent neural networks (RNNs) to find compact internal representations of long sequences of data, across multiple time scales and levels of abstraction. Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above. The resulting compressed sequence representations greatly facilitate downstream supervised deep learning such as sequence classification. By 1993, the approach solved problems of depth 1000 [UN2] (requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning). A variant collapses the hierarchy into a single deep net. It uses a so-called conscious chunker RNN which attends to unexpected events that surprise a lower-level so-called subconscious automatiser RNN. The chunker learns to understand the surprising events by predicting them. The automatiser uses a neural knowledge distillation procedure to compress and absorb the formerly conscious insights and behaviours of the chunker, thus making them subconscious. The systems of 1991 allowed for much deeper learning than previous methods. More.
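
A minimal sketch of the history-compression principle described above (my own illustration; a trivial frequency-based next-symbol predictor stands in for each predictive RNN): every level tries to predict its next input, and only the unexpected inputs are passed up, so higher levels see much shorter sequences.

```python
from collections import defaultdict, Counter

class NextSymbolPredictor:
    """Trivial stand-in for a predictive RNN: predicts the most frequent
    successor of the previous symbol seen so far."""
    def __init__(self):
        self.counts = defaultdict(Counter)
    def predict(self, prev):
        c = self.counts[prev]
        return c.most_common(1)[0][0] if c else None
    def update(self, prev, nxt):
        self.counts[prev][nxt] += 1

def compress_level(sequence):
    """Return only the unexpected (mispredicted) symbols with their positions;
    these are all that a higher level needs to see."""
    pred, prev, unexpected = NextSymbolPredictor(), None, []
    for t, sym in enumerate(sequence):
        if pred.predict(prev) != sym:    # surprise: the prediction failed
            unexpected.append((t, sym))  # only surprises are passed up
        pred.update(prev, sym)
        prev = sym
    return unexpected

seq = list("abababababXabababab")        # mostly predictable, one surprise
level1 = compress_level(seq)             # far shorter than the raw sequence
level2 = compress_level([s for _, s in level1])
```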

[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.[UN0] PDF. First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales), overcoming the vanishing gradient problem through unsupervised pre-training and predictive coding (with self-supervised target generation). Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. See also this tweet. More.

[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF. An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised / self-supervised pre-training for a stack of recurrent NN can be found here (depth > 1000).

[UN3] J.  Schmidhuber, M. C. Mozer, and D. Prelinger. Continuous history compression. In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors, Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95. Augustinus, 1993.

[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF. This work describes unsupervised pre-training of stacks of feedforward NNs (FNNs) called Deep Belief Networks (DBNs). It did not cite the much earlier 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)[UN0-3] which introduced the first NNs shown to solve very deep problems. The 2006 justification of the authors was essentially the one Schmidhuber used for the 1991 RNN stack: each higher level tries to reduce the description length (or negative log probability) of the data representation in the level below.[HIN][T22][MIR] This can greatly facilitate very deep downstream learning.[UN0-3]

[UN5] Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle. Greedy layer-wise training of deep networks. Proc. NIPS 06, pages 153-160, Dec. 2006. The comment under reference[UN4] applies here as well.

[URQ10] A. Urquhart. Von Neumann, Gödel and complexity theory. Bulletin of Symbolic Logic 16.4 (2010): 516-530. Link.

[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. (Investigations on dynamic neural networks.) Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF. More on the Fundamental Deep Learning Problem.

[VAN2] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), p 157-166, 1994. Results are essentially identical to those of Schmidhuber's diploma student Hochreiter (1991).[VAN1] Even after a common publication,[VAN3] the first author of [VAN2] published papers[VAN4-5] that cited only their own [VAN2] but not the original work.

[VAN3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE press, 2001. PDF.

[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.

[VAN5] R. Pascanu, T. Mikolov, Y. Bengio. On the difficulty of training Recurrent Neural Networks. ICML 2013.

[VAR13] M. Y. Vardi (2013). Who begat computing? Communications of the ACM, Vol. 56(1):5, Jan 2013. Link.

[VID1] G. Hinton. The Next Generation of Neural Networks. Youtube video [see 28:16]. GoogleTechTalk, 2007. Quote: "Nobody in their right mind would ever suggest" to use plain backpropagation for training deep networks. However, in 2010, Schmidhuber's team in Switzerland showed[MLP1-2] that unsupervised pre-training is not necessary to train deep NNs.

[VID2] Bloomberg Hello World. The Rise of AI. Youtube video, 2018. The narrator of this 2018 Bloomberg video thanks Hinton for speech recognition and machine translation, although both were actually done (at production time of the video) on billions of smartphones by deep learning methods developed in Schmidhuber's labs in Germany and Switzerland (LSTM & CTC) long before Hinton's less successful methods.

[W45] G. H. Wannier (1945). The Statistical Problem in Cooperative Phenomena. Rev. Mod. Phys. 17, 50.

[WER87] P. J. Werbos. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, 1987.

[WER89] P. J. Werbos. Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209-216, 1989.

[WI48] N. Wiener (1948). Time, communication, and the nervous system. Teleological mechanisms. Annals of the N.Y. Acad. Sci. 50 (4): 197-219. Quote: "... the general idea of a computing machine is nothing but a mechanization of Leibniz's calculus ratiocinator."

[WID62] Widrow, B. and Hoff, M. (1962). Associative storage and retrieval of digital information in networks of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160, 1962.

[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.

[XAV] X. Glorot, Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. Proc. 13th Intl. Conference on Artificial Intelligence and Statistics, PMLR 9:249-256, 2010.

[YB20] Y. Bengio. Notable Past Research. WWW link (retrieved 15 May 2020). Local copy (plain HTML only). The author claims that in 1995 he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue"[HB96] although Schmidhuber's publications on exactly this topic date back to 1991-93.[UN0-2][UN] The author also writes that in 1999 he "introduced, for the first time, auto-regressive neural networks for density estimation" although Schmidhuber & Heil used a very similar set-up for text compression already in 1995.[SNT]
