Scientific Integrity and the History of Deep Learning:
The 2021 Turing Lecture, and the 2018 Turing Award
This is a point-for-point critique of ACM's justification of the ACM A. M. Turing Award for deep learning, as well as a critique of the Turing Lecture given by the awardees (published by ACM in July 2021). In brief, three Europeans went to North America, where they republished methods and concepts first published by others whom they did not cite—not even in later surveys. Instead, they credited each other at the expense of the field's pioneers. Apparently, the ACM was not aware of this. The critique supplements my award-winning
deep learning survey,^{[DL1]} and can also be seen as a short history of the deep learning revolution, at least as far as ACM's erroneous laudation^{[T19]} and the Turing Lecture^{[DL3a]} are concerned.
Disclaimer.
Following a recent paper,^{[LEC]} I would like to start by acknowledging that I am not without a conflict of interest here. My seeking to correct the record will naturally seem self-interested; the truth is that it is. Much of the closely related work pointed to below was done in my lab, and I naturally wish to see it acknowledged and recognized. Setting this conflict aside, I ask the reader to study the original papers and judge the scientific content of these remarks for themselves, as I try to set emotions aside and minimize bias as much as I am able.
Note on version 1 of Sep 2021:
Following the great success of massive open online peer review (MOOR) for my
2015 survey of deep learning^{[DL1]}
(now the most cited article ever published in
the journal Neural Networks), this extended version of a
June 2020 article^{[T20a][R12]}
is currently undergoing MOOR as well.
Please send suggestions for improvements and additional relevant
references to juergen@idsia.ch.
Note on version 2 of Dec 2021:
In the wake of MOOR, public comments—such as those on the Connectionists mailing list^{[CONN21]}—and many additional private comments (some by well-known deep learning pioneers) helped to update and improve upon
version 1 of the present report.
The essential statements of the text remain unchanged, as their accuracy remains unchallenged. I'd like to thank everyone from the bottom of my heart for their feedback so far, and I hope everyone will be satisfied with the changes.
Note on version 3 of June 2022:
Additional comments by deep learning pioneers helped to correct a few historic details.
The essential statements of the text remain unchanged, as their accuracy remains unchallenged. To cite this article in publications, use: J. Schmidhuber. Scientific Integrity and the History of Deep Learning:
The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022.
Note on version 4 of Nov 2022:
Perhaps the greatest existential risk to scientific integrity is that the ongoing attack on it isn't discussed more.
Dear reader, let me urge you: don't become part of the problem! For example, don't simply cite (without rectification) a flawed paper that misrepresents or does not mention earlier relevant work, just because it was cited by many others. Have you failed to correctly assign credit in the past? Then you must rectify this in future publications.
Don't participate in unintentional^{[PLAG1][CONN21]} or intentional^{[FAKE2]} plagiarism and systemic academic corruption. It's a shame for our field that such elementary rules of scientific conduct need to be emphasized.
Abstract. ACM's 2018 A. M. Turing Award was about
deep learning in artificial neural networks.
ACM lauds Dr. LeCun, Dr. Bengio, and Dr. Hinton (LBH) for work based on algorithms
and conceptual foundations first published by other researchers
whom the awardees failed to cite
(see Executive Summary
and Sec.
I,
V,
II,
XII,
XIX,
XXI,
XIII,
XIV,
XX,
XVII).
ACM explicitly mentions "astonishing" deep learning breakthroughs in 4 fields:
(A) speech recognition,
(B) natural language processing,
(C) robotics,
(D) computer vision,
as well as "powerful" new deep learning tools in 3 fields:
(VII) medicine, astronomy, materials science.
Most of these breakthroughs and tools, however, were direct consequences of
the breakthroughs of my lab and other labs in the past 3 decades
(see Sec.
A,
B,
C,
D,
VII,
XVII,
VI,
XVI).
I correct ACM's distortions of deep learning history (see Sec.
II,
V,
XX,
XVIII)
and
mention no fewer than 10 of our direct priority disputes
with Dr. Bengio & Dr. Hinton (see Sec. XVII, I), plus several with Dr. LeCun (see Sec. XVII-3).
Furthermore,
I respond to LBH's recent ACM article (July 2021).
Outline.
This document (over 25,000 words) greatly
expands material in my Critique of the 2019 Honda Prize^{[HIN]} (~3,000 words).
It has several layers of hierarchical abstraction:
Abstract & Outline (~300 words),
Introduction (~300 words),
Critique of LBH's ACM article (Turing Lecture) of July 2021^{[DL3a]} (~900 words: a compact list of many issues—hurried readers may start here, then follow the links to the details in later sections),
Executive summary of what's wrong with ACM's laudation (~1,100 words),
21 comments on 21 claims by ACM (~8,500 words),
Conclusion (~2,000 words).
All backed up by over 300 references (over 12,000 words).
The text contains numerous hyperlinks to relevant overview sites from the AI Blog.
We must stop crediting the wrong people for inventions made by others.
Instead, let's heed the recent call in the journal Nature: let us value "those who ensure that
science is self-correcting."^{[SV20]}
As those who know me can testify, finding and citing original sources of scientific and technological innovations is important to me, whether they are mine or other people's.^{[DL1-2][DLH][HIN][NASC19]} The present page is offered as a resource for all good computer scientists who share this inclination.
By grounding research in its true intellectual foundations and crediting the original inventors,
I am not diminishing important contributions made by popularizers of those inventions.
My goal is to encourage the entire community to be more scholarly in its efforts, to recognize the foundational work that sometimes gets lost in the frenzy of modern AI and machine learning,
and to fight plagiarism,^{[FAKE2]}
collusion rings,^{[LIT21]} and systemic academic corruption in all of their more and less subtle forms.^{[FAKE]}
I am also inviting others to contribute additional relevant
references (please send any and all directly to me at juergen@idsia.ch).
Sec. 2
will start with a critique of
LBH's 2021 ACM article^{[DL3a]} which necessitated an extension of the
first version
of this post.^{[T20a][R12]}
Subsequent sections will focus on
contributions praised by
ACM's official justification^{[T19]} of the
2018 A.M. Turing Award^{[R1]}
published in 2019.
After the Executive Summary in Sec. 3, Sec. 4 will split
ACM's full text^{[T19]}
into 21 parts
labeled by "ACM:"
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Each part is marked by a blue bar and followed by a critique.
Most of the critiques are based on references to original papers and material from the AI Blog.^{[AIB][MIR][DEC][HIN]}
They'll point out
that highly cited publications of the awardees ignored fundamental
relevant prior work—this may be the reason for some of ACM's misattributions.
As recently as July 2021, Dr. LeCun, Dr. Bengio, Dr. Hinton (LBH), and the ACM
continued to promulgate their revisionist "history" of deep learning by
publishing yet another misleading overview of the field, this time based on LBH's Turing Lecture.^{[DL3a]}
In the new piece, LBH again credit themselves
for fundamental work first done by others, and fail to correct
their well-known earlier omissions.^{[DLC][HIN][T20a]}
★1.
LBH claim to "briefly describe the origins of deep learning"^{[DL3a]} without even mentioning the world's first working deep learning nets by
Ivakhnenko and Lapa in 1965^{[DEEP1-2][R8]} (see Sec. II).
★2.
LBH
dedicate an extra section to
their unsupervised pretraining of deep neural networks (NNs) around 2006, without mentioning that
this class of methods was pioneered in 1991^{[UNUN2]} (see Sec. II, III).
★3.
LBH mention the "most popular class of convolutional net architecture for computer vision," the "ResNet family," without clarifying that ResNet is just an (open-gated)
Highway Net,
the first really deep feedforward NN.^{[HW1-3]}
The so-called "ResNet family" is actually the "Highway Net family"
(see Sec. D, VI).
★4.
In this context, LBH devote an extra section to the importance of NN depth,
without mentioning that the relevant breakthroughs emphasized by LBH
were all driven by my lab:^{[MOST]} In 1991, I had the
first very deep NNs based on unsupervised pretraining;^{[UNUN2]}
soon afterwards our
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs;^{[LSTM017][25y97]}
later our Highway Nets^{[HW1-3]} brought it to feedforward NNs.
★5.
LBH cite Hinton et al.'s work on speech recognition since 2009 without mentioning our earlier and superior methods
from 2007^{[LSTM4,14]}
based on LSTM^{[LSTM06]} (1990s-2005) and CTC (2006).^{[CTC]}
By the time the Turing Award was handed out,
our CTC-LSTM-based speech recognition (not that of Hinton) had been on most smartphones for years^{[GSR][GSR15-19][DL4]} (see Sec. A, VI, XI, XV). Similarly for machine translation (see Sec. B).
★6.
LBH cite Hinton (2012) for "dropout" without mentioning that dropout is just a variant of Hanson's 1990 stochastic delta rule^{[Drop14]} (see Sec. XIV).
★7.
Several times, LBH mention backpropagation—and LBH's papers on applications of this method—but neither its inventor Linnainmaa (1970)^{[BP15][BPAC]} nor Werbos, who first applied it to NNs in 1982 (see Sec. XII, XIX, XXI).
LBH also fail to cite Amari's 1967-68 work—which included computer simulations—on learning internal representations in multilayer perceptrons through stochastic gradient descent^{[GD1-3]} (without reverse-mode backpropagation^{[BP1]}).
★8.
LBH devote an extra section to rectified linear units (ReLUs), citing papers of the 2000s by Hinton and his former students, without citing Fukushima, who introduced ReLUs in 1969^{[RELU1-2]} (see Sec. XIV).
★9.
LBH claim ReLUs enabled deep learning to outperform previous methods for object recognition, referring to their GPU-based ImageNet 2012 winner called AlexNet,^{[GPUCNN4]} without mentioning that our earlier groundbreaking deep GPU-based DanNet^{[GPUCNN13,58][DAN]} did not need ReLUs at all to win 4 earlier object recognition competitions and to achieve superhuman results already in 2011^{[GPUCNN18][R56]} (see Sec. XIV).
★10.
LBH refer to LeCun's work on CNNs, citing neither Fukushima—who created the basic CNN architecture in the 1970s^{[CNN1-4]}—nor Waibel, who in 1987 was the first to combine NNs with convolutions with backpropagation^{[BP16]} and weight sharing (see Sec. D,
XVIII).
★11.
LBH cite Hinton (1981) for multiplicative gating, without mentioning Ivakhnenko and Lapa, who had multiplicative gating in deep networks already in 1965^{[DEEP1-2][R8]} (see Sec. II).
★12.
LBH cite the "fast weights" of Hinton (1987) without mentioning the earlier fast weights of von der Malsburg (1981) and Feldman (1982).^{[FAST,FASTab][FWP]} LBH refer to Hinton's 2014 paper on "a high-capacity, short-term memory" through fast weights without clarifying that this was first described in the 1991-93 papers on Fast Weight Programmers and Transformers with linearized self-attention^{[FWP01,6]} (see Sec. XVI, XVII-2).
★13.
LBH
dedicate an extra section to attention-based Transformers,^{[TR16]} citing Bengio's team (2014) for "soft attention"^{[ATT14]} without citing the much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention^{[FWP,FWP02,6][ATT]} (see Sec. XVII-1, XVI, and this tweet of 2022).
★14.
LBH claim that Bengio's team^{[NPM]} first showed in 2002 on real sentences that "activity vectors can be used to model the structure inherent in a set of symbol strings by learning appropriate activity vectors for each symbol and learning nonlinear transformations that allow the activity vectors that correspond to missing elements of a symbol string to be filled in." However, this was shown on real sentences already in 1995 in the context of text compression^{[SNT]} (see Sec. XVI, XVII-1).
★15.
LBH cite Bengio's 2014 paper on Generative Adversarial Networks (GANs)^{[GAN01]} without mentioning that
GANs are instances
of the Adversarial Curiosity Principle of 1990^{[AC9020][MIR](Sec. 5)} (see Sec. XVII).
In summation, LBH have repeatedly chosen to ignore the previous well-known critiques^{[DLC][HIN][T20a]} and deep learning surveys,^{[DL1-2]} and ACM's peer review process failed to catch this. ACM's Code of Ethics and Professional Conduct^{[ACM18]} states: "Computing professionals should therefore credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works." LBH didn't.
The repetitive nature of LBH's and ACM's failures to uphold basic scientific standards represents a serious attack on the integrity of the field of Artificial Intelligence.
If we, in turn, choose to ignore this, then we will be committing a grievous sin against ourselves and our scientific predecessors.
It is clear from the diligence with which Turing cited his predecessors,
such as Gödel and Church^{[GOD][CHU][TUR]} (see Sec. IV),
that he would never have approved of being associated with something like this.
While Dr. LeCun, Dr. Bengio, and Dr. Hinton (LBH for short)
have made useful improvements to algorithms for
artificial neural networks (NNs)
and deep learning (e.g., Sec. I), ACM lauds
them for more visible
work based on fundamental methods whose inventors they did not cite—not even in later surveys
(this may actually explain some of ACM's misattributions). I correct ACM's distortions of deep learning history.
Numerous references can be found under the relevant section links I-XXI
which adhere to the sequential order of ACM's text^{[T19]}
(while this summary groups related sections together).
Sec. II:
In contrast to ACM's claims,
NNs for pattern recognition etc. were introduced long before the 1980s.
Deep learning with multilayer perceptrons started in 1965 through Ivakhnenko & Lapa
long before LBH, who have never cited them—not even in recent work.
In the 1980s, "modern" gradient-based learning in Amari's style (1967)
worked only for rather shallow NNs,
but
it became really deep in 1991 in my lab,
first through
unsupervised pretraining of NNs,
then through the
supervised LSTM.
Sec. I contains 4 subsections
A, B, C, D
on the 4 deep learning "breakthroughs" explicitly
mentioned by ACM. ACM does not mention that they were
mostly based on the deep learning techniques developed by my team:
Sec.
A: Speech Recognition (see also Sec. VI & XI & XV): The first superior end-to-end neural speech recognition
combined two methods from my lab: LSTM (1990s-2005) and CTC (2006), which were
applied to speech in 2007.
Hinton (2012) and Bengio (XV)
still used an old hybrid approach of the 1980s and 90s;
Hinton et al. (2012) did not compare it to
our revolutionary CTC-LSTM, which was soon on most smartphones.
Sec. B: Natural Language Processing (see also Sec. VI & XI & XVI):
The first superior end-to-end neural machine translation
(soon used for several billion
translations each day by the big platform companies)
was also based on our LSTM.
Sec. C: Robotics.
Our LSTM trained by Reinforcement Learning (RL) was also the core of the
most visible breakthroughs
in robotics and video games.
Sec. D: Computer Vision
(see also Sec.
XVIII & XIV & XI & VI)
was revolutionized by convolutional NNs (CNNs).
The basic CNN architecture is due to Fukushima (1979).
NNs with convolutions were later (1987) combined by Waibel with backpropagation and weight sharing,
and applied to speech. All before LeCun's CNN work (XVIII).
We showed twice (1991-95 and 2006-10) that
deep NNs
don't need unsupervised
pretraining (in contrast to Hinton's claims). Our DanNet was the first CNN fast & deep enough for
superior computer vision in 2011,
winning 4 image recognition contests in a row
before Hinton's team won one. ResNet (ImageNet 2015 winner)
is an open-gated version of our earlier Highway Nets.
Sec. XIV:
Again ACM recognizes work that failed to cite the pioneers.
Long before Hinton (2012), Hanson (1990) had a variant of dropout,
and Fukushima (1969) had rectified linear neurons; Hinton did not cite them.
Already in 2011,
our
deep & fast CNN
more than "halved the error rate for object recognition" (ACM's wording)
in a computer vision contest
(where LeCun participated),
long before Hinton's similar CNN (2012).
Sec. XI: ACM mentions GPUaccelerated NNs
pioneered by Jung & Oh (2004). LBH
did not cite them.
Our
deep GPU-NN of 2010
debunked unsupervised pretraining (introduced by myself in 1991 and later championed by Hinton),
and our GPU-CNN of 2011 (DanNet) was the first
to win contests in computer
vision (explicitly mentioned by ACM).
Sec.
XVIII:
ACM credits LeCun for developing CNNs. However, the foundations of CNNs were laid earlier by
Fukushima and Waibel (see Sec. D).
ACM also explicitly mentions autonomous driving and medical image analysis.
The first application of CNNs with backpropagation to biomedical/biometric images is due to Baldi and Chauvin.^{[BA93]}
The first team to win relevant international contests in these fields
through deep CNNs was ours (2011, 2012, 2013).
Sec.
VII: ACM explicitly mentions medicine and
materials science. Our deep NNs were the
first to win medical imaging competitions
in 2012 and 2013; we were also the first to apply deep NNs to material defect detection in industry (since 2010).
Sec. XII & XIX & XXI: Modern
backpropagation
was first published by Linnainmaa (1970),
not by LeCun or Hinton or their collaborators (1985), who did not cite Linnainmaa,
not even in later surveys.
Sec.
XIII &
II &
V
(&
III &
IX &
X &
XX):
Ivakhnenko's deep feedforward nets (since 1965) and Amari's (since 1967) learned
internal representations long before Hinton's shallower ones (1980s).
Hinton has never cited them.
Sec. XX: ACM credits LeCun for work on
hierarchical feature representation which did not cite Ivakhnenko's much earlier work
on this (since 1965).
Sec. XXI: ACM credits LeCun for work on
automatic differentiation which did not cite its inventor Linnainmaa (1970).
And also for work on
deep learning for graphs that failed to cite
the earlier work by Sperduti & Goller & Küchler & Pollack.
Sec.
XV: ACM credits Bengio for hybrids of NNs and probabilistic models of sequences.
His work
was not the first on this topic, and is
not important for modern deep learning speech recognition systems (mentioned by ACM) based on our
CTC-LSTM
(see Sec.
A &
B).
Sec.
XVI: ACM
credits Bengio for neural probabilistic language models.
Our 1995 neural probabilistic text model greatly predates Bengio's.
ACM mentions NNs that learn
sequential attention.
We started this in 1990-93
long before LBH
who did not cite the relevant prior work.
Sec. XVII:
ACM mentions
Generative Adversarial Networks (GANs, 2010-14) of Bengio's team, an instance of
my Adversarial
Artificial Curiosity
(1990) which he did not cite.
I list 10 of
our priority disputes with Bengio & Hinton (many more than can be explained by chance),
on
vanishing gradients (1991),
metalearning (1987),
unsupervised pretraining (1991),
compressing or distilling one NN into another (1991),
learning sequential attention with NNs (1990),
Transformer-like attention-based
fast weight programmers using
outer products (1991),
and other topics.^{[R2-R6]}
I also mention
several priority disputes with LeCun since 1990.^{[LEC]}
Sec. IV is on Turing (1936) and his predecessors
Gödel (1931) and Church (1935).
The 21 comments also contain details of issues raised in the
Critique of LBH's ACM article (Turing Lecture) of July 2021.
Sec. Conclusion:
In the recent decade of deep learning,
most major AI applications mentioned by ACM
(speech recognition, language translation, etc.) on billions of devices (also healthcare applications)
heavily depended on my lab's deep learning techniques and conceptual foundations,
while LBH's most visible work ignored
essential prior art since the 1960s—see, e.g.,
Sec. II &
III &
V &
XII &
XIII &
XVII &
XIV &
XIX &
XX &
XXI.
Much of LBH's more prominent work consists simply of repackaged versions of earlier work by others, presented without proper citation.
But in science, by definition, the facts will always win in the end.
As long as the facts have not yet won, it's not yet the end.
In what follows, ACM's full text [T19] is split into 21 parts
labeled by "ACM:"
I,
II,
III,
IV,
V,
VI,
VII,
VIII,
IX,
X,
XI,
XII,
XIII,
XIV,
XV,
XVI,
XVII,
XVIII,
XIX,
XX,
XXI.
Each part is marked by a blue bar and followed by a critique.
I. ACM:
ACM named Yoshua Bengio, Geoffrey Hinton, and Yann LeCun recipients of the 2018 ACM A.M. Turing Award for conceptual and engineering breakthroughs that have made deep neural networks a critical component of computing. ...
Working independently and together, Hinton, LeCun and Bengio developed conceptual foundations for the field, identified surprising phenomena through experiments, and contributed engineering advances that demonstrated the practical advantages of deep neural networks. In recent years, deep learning methods have been responsible for astonishing breakthroughs in computer vision, speech recognition, natural language processing, and robotics—among other applications.
Comment:
LBH and their coworkers have contributed some useful improvements to existing deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]}
However, the essential
"conceptual foundations" of deep learning (mentioned by ACM)
were laid by others, e.g., deep learning multilayer perceptrons
that learn internal representations
(1965),^{[DEEP1-2][R8]}
stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]}
modern backpropagation
(1970),^{[BP1-2][R7]}
architectures of recurrent NNs (1920s-1956)^{[I24,I25][MC43][K56]}
and convolutional NNs (1979),^{[CNN1]}
principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]}^{[R2]}
unsupervised pretraining for deep NNs (1991),^{[UN1-2]}
vanishing gradients (1991)^{[VAN1]} &
Long Short-Term Memory or LSTM (Sec. A),
supervised
GPU-accelerated NNs (2004),^{[GPUNN][DAN][DAN1][GPUCNN5]}
super deep
NNs with over 100 layers (2015),^{[HW1-3][R5]}
and Transformers with linearized self-attention through
fast weight programmers (1991).^{[FWP02,6][TR16][FWP][ATT]}
Often LBH failed to cite essential prior work, even in their later surveys.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5, R7-R8]}
This may explain some of ACM's misattributions.^{[T19]}
See also
Sec.
II &
III &
V &
XIII &
X &
XVII &
XII &
XVIII &
XX.
Although ACM does not literally claim
that LBH were somehow responsible for the
"astonishing breakthroughs in computer vision, speech recognition, natural language processing, and robotics,"
ACM's wording seems to suggest this.
In particular,
ACM does not mention that these breakthroughs were
fundamentally derived from
three decades of research that came out of other deep learning groups—including my own (e.g., A & B & C).
The deep NNs
of our team, for example, revolutionized Pattern Recognition and Machine Learning.
By the 2010s,^{[DEC]} they were
heavily used in
academia and industry,^{[DL4]}
in particular, by Microsoft, Google & Facebook, past and present employers of Hinton & LeCun.
I will focus on the 4 fields explicitly
mentioned by ACM (labeled as A, B, C, D) below:
A. Speech recognition. The first superior end-to-end neural speech recognizer that outperformed the
state of the art was based on two methods from my lab:
(A1)
Long Short-Term Memory
or LSTM (1990s-2005),^{[LSTM06]}
which overcomes the famous
vanishing gradient problem
first analyzed by my
student Sepp Hochreiter in 1991.^{[VAN1]}
This happened long before the similar work of Bengio (see Sec. XVII).^{[MIR](Sec. 3, Sec. 4)}
LSTM was refined with my student Felix Gers^{[LSTM2]}
through "forget gates" based on endtoenddifferentiable fast weights.^{[MIR](Sec. 8)[FWP,FWP01]}
(A2) Connectionist Temporal Classification (CTC) by my student Alex Graves et al. (2006).^{[CTC]} Our team successfully applied CTC-trained LSTM to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}). This was very different from previous hybrid methods, used since the late 1980s, which combined NNs and traditional approaches such as hidden Markov models (HMMs)^{[BW][BRI][BOU]} (Sec. XV). Hinton et al. (2012) still used the old hybrid approach^{[HYB12]} and did not compare it to CTC-LSTM.
In 2009, through the efforts of Alex, CTC-trained LSTM
became the first recurrent NN (RNN) to win international competitions.
He later reused our end-to-end neural speech recognizer^{[LSTM4][LSTM14]} as a postdoc in Hinton's lab.^{[LSTM8]}
By 2015, when compute had become cheap enough,
CTC-LSTM dramatically improved Google's speech recognition.^{[GSR][GSR15][DL4]}
By the time the Turing Award was handed out,
this had been on most smartphones for years;
Google's 2019
on-device speech recognition^{[GSR19]}
(no longer on the server)
is still based on
LSTM^{[MIR](Sec. 4)}
(see Sec. VI & XI & XV).
B. Natural Language Processing (NLP). The first superior end-to-end neural machine translation was also based on our LSTM.
In 1995, we already had excellent neural probabilistic models
of text^{[SNT]} (see Sec. XVI).
In 2001, we showed that LSTM can learn languages unlearnable by traditional models such as HMMs,^{[LSTM13]} i.e., a neural "subsymbolic" model suddenly excelled at learning "symbolic" tasks. Compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate^{[GT16]}—whose whitepaper^{[WU]} mentions LSTM over 50 times—and Facebook Translate^{[FB17]} were based on two connected LSTMs,^{[S2S]} one for incoming texts, and one for outgoing translations—much better than what existed before.^{[DL4]} By 2017, Facebook's users made 30 billion LSTM-based translations per week^{[FB17][DL4]}
(the most popular YouTube video needed 2 years to achieve only 6 billion clicks).
See also Sec. VI & XI & XV.
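To illustrate the general idea of such an encoder-decoder setup built from two connected LSTMs, here is a minimal sketch in PyTorch (a hedged illustration only: the class name, dimensions, and the toy usage below are mine and are not taken from the cited production systems):

```python
import torch
import torch.nn as nn

class Seq2SeqLSTM(nn.Module):
    """Toy encoder-decoder: one LSTM reads the source sentence,
    a second LSTM emits the target sentence from the encoder's final state."""
    def __init__(self, src_vocab, tgt_vocab, dim=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True)  # reads incoming text
        self.decoder = nn.LSTM(dim, dim, batch_first=True)  # writes outgoing translation
        self.out = nn.Linear(dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, state = self.encoder(self.src_emb(src_ids))      # compress source into (h, c)
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)                            # logits per target position

# Toy usage with random token ids (illustration only):
model = Seq2SeqLSTM(src_vocab=1000, tgt_vocab=1000)
src = torch.randint(0, 1000, (2, 7))   # batch of 2 source sentences, length 7
tgt = torch.randint(0, 1000, (2, 9))   # teacher-forced target inputs, length 9
logits = model(src, tgt)               # shape: (2, 9, 1000)
```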
It should be mentioned that further improvements were due to an
attention mechanism
tailored by Bengio's team.^{[ATT14][FWP]}
However, such attention mechanisms also
have their roots in my lab (1991),
in the form of what's now called
Transformers with linearized self-attention (1991).^{[FWP02,6][TR16][FWP][ATT]}
See tweet of 2022 and Sec. XVI.
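For readers who wonder what "linearized self-attention" means in the simplest case, here is a hedged toy sketch (my own minimal formulation for illustration; it omits the feature maps, normalization, and the slow network that generates the keys and values in the cited Fast Weight Programmer papers): a fast weight matrix is programmed by summing outer products of value and key vectors, and each query retrieves from it by a plain matrix-vector product.

```python
import numpy as np

def linearized_self_attention(keys, values, queries):
    """Fast-weight view of (unnormalized, causal) linear attention:
    a weight matrix W is programmed by summing outer products
    value_t * key_t^T, and each query is answered by multiplying
    it with the current W, i.e., by attending to all past (k, v) pairs."""
    d_k, d_v = keys.shape[1], values.shape[1]
    W = np.zeros((d_v, d_k))              # the "fast weights"
    outputs = []
    for k, v, q in zip(keys, values, queries):
        W += np.outer(v, k)               # additive outer-product update
        outputs.append(W @ q)             # retrieval by matrix-vector product
    return np.stack(outputs)

# Toy usage: a sequence of 5 steps, key/query dimension 4, value dimension 3.
rng = np.random.default_rng(0)
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 3))
Q = rng.normal(size=(5, 4))
print(linearized_self_attention(K, V, Q).shape)   # (5, 3)
```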
C. Robotics & RL etc. Since 2003, our team has used LSTM for Reinforcement Learning (RL) and robotics.^{[LSTMRL][RPG][LSTMPG]}
In the 2010s,
combinations of RL and LSTM have become standard,
in particular, our
LSTM trained by policy gradients—or PGs (2007).^{[RPG07][RPG][LSTMPG]}
For example, in 2018, a PG-trained LSTM was the core of OpenAI's famous Dactyl, which learned to control a dexterous robot hand without a teacher.^{[OAI1][OAI1a]}
Similarly for video games: in 2019, DeepMind (co-founded by a student from my lab) famously
beat a pro player in the game of StarCraft, which is theoretically harder than Chess or Go^{[DM2]} in many ways, using
AlphaStar, whose brain has a deep LSTM core trained by PG.^{[DM3]}
An RL LSTM (with 84% of the model's total parameter count) was also the core of the famous
OpenAI Five,
which learned to defeat human experts in the
Dota 2 video game (2018).^{[OAI2]}
Bill Gates called this a "huge milestone in advancing artificial intelligence".^{[OAI2a][MIR](Sec. 4)[LSTMPG]}
Apart from A, B, C above,
the 2010s saw many additional LSTM applications, e.g.,
in healthcare,
chemistry, molecular design, lip reading, speech synthesis,^{[AM16]}
stock market prediction, self-driving cars,
mapping brain signals to speech,
predicting what's going on in nuclear fusion reactors, and so on.^{[DEC][DL4]}
By 2016, more than a quarter of the power of all the
Tensor Processing Units in Google's data centers
was being used for LSTM (only 5% for the CNNs of Sec. D).^{[JOU17]}
The first LSTM journal paper^{[LSTM1][R5]} is now apparently the 20th-century
computer science paper with the most citations per year—though citations are a highly questionable measure of true impact.^{[NAT1]}
D. Computer Vision was revolutionized in the 2010s by
a particular feedforward neural net (NN) called the convolutional NN (CNN).^{[CNN1-4]} The basic CNN architecture with convolutional and downsampling layers is due to Fukushima (1979),^{[CNN1]} who also introduced the now widely used
rectified linear units (ReLUs) in 1969.^{[RELU1]}
In 1987, NNs with convolutions were combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel did not call them CNNs but TDNNs.
The popular downsampling variant
called max-pooling was introduced by Yamaguchi et al. for TDNNs in 1990^{[CNN3a]} and by Weng et al. for higher-dimensional CNNs in 1993.^{[CNN3]} Since 1989,
LeCun's team has contributed improvements to CNNs, especially for images^{[CNN2,4]} (see Sec. XVIII); see also Behnke's work.^{[CNN5a-c]}
Finally, my own team showed in 2010^{[MLP1]}
that
unsupervised pretraining is not necessary
to train deep NNs, contrary to claims by Hinton,^{[VID1]} who said that "nobody in their right mind would ever suggest" this. Then we
greatly sped up the training of deep
CNNs (Dan Ciresan et al., 2011).
Our fast GPU-based CNN of 2011,^{[GPUCNN1]} known as DanNet,^{[DAN,DAN1][R6]}
was a practical breakthrough. It was much deeper and faster than earlier GPU-accelerated
CNNs of 2006.^{[GPUCNN]}
In 2011, DanNet was the first pure deep CNN
to win computer vision contests.
For a while, it enjoyed a monopoly.
From 2011 to 2012 it won every contest it entered,
winning four of them
in a row (15 May 2011, 6 Aug 2011, 1 Mar 2012, 10 Sep 2012).^{[GPUCNN5]}
In particular,
at IJCNN 2011 in Silicon Valley, DanNet blew away the competition and achieved the first superhuman visual pattern recognition^{[DAN1]} in an international contest (where LeCun's team took a distant second place, with
an error rate three times higher).
Even the NY Times mentioned this.
DanNet was also the first deep CNN to win:
a Chinese handwriting contest (ICDAR 2011),
an image segmentation contest (ISBI, May 2012),
a contest on object detection in large images (ICPR, 10 Sept 2012), and—
at the same time—a medical imaging contest on cancer detection.^{[GPUCNN8]}
In July 2012, our
CVPR paper on DanNet^{[GPUCNN3]}
hit the computer vision community.
All of this happened before
the similar GPU-accelerated AlexNet (Dec 2012)
of Hinton's student Krizhevsky won the ImageNet^{[IM09]} 2012 contest^{[GPUCNN4-5][R6]} (now also without unsupervised pretraining, citing DanNet).
Our CNN image scanners were 1000 times faster than previous methods.^{[SCAN]} This attracted tremendous interest from the healthcare industry. Today IBM, Siemens, Google and many startups are pursuing this approach.
The VGG network (ImageNet 2014 winner)^{[GPUCNN9]}
and other highly cited CNNs^{[RCNN13]}
further extended the work of 2011.^{[MIR](Sec. 19)}
ResNet, the ImageNet 2015 winner^{[HW2]} (Dec 2015) and currently the
most cited neural network,^{[MOST]} is a version (with open gates) of our earlier
Highway Net (May 2015).^{[HW1-3][R5]} The Highway Net is actually the feedforward net version of vanilla LSTM.^{[LSTM2]} It was the first working, really deep feedforward NN with hundreds of layers (previous NNs had at most a few tens of layers).
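As a minimal sketch of this relation (notation mine, for illustration): a Highway layer mixes a learned transform of its input with the input itself via learned gates, and fixing both gates open yields exactly the residual block of ResNet.

```latex
% Highway layer with transform gate T and carry gate C (often coupled as C = 1 - T):
\[ y \;=\; T(x)\odot H(x) \;+\; C(x)\odot x \]
% With both gates fixed open, T(x) \equiv C(x) \equiv 1, this reduces to the
% ResNet-style residual block:
\[ y \;=\; H(x) \;+\; x \]
```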
See also Sec. XVIII & XIV & XI & VI.
II. ACM:
While the use of artificial neural networks as a tool to help computers recognize patterns and simulate human intelligence had been introduced in the 1980s, ...
Comment:
Perhaps ACM's lack of knowledge about NN history^{[DLH]} is the reason why
they praise works by LBH that failed to cite the original work.
In fact, NNs of the kind mentioned by ACM
appeared long before the 1980s.
The first non-learning recurrent NN (RNN) architecture (the Lenz-Ising model) was analyzed in the 1920s.^{[L20][I24,I25][K41][W45]}
Non-learning RNNs
were also discussed in 1943 by McCulloch and Pitts^{[MC43]} and formally analyzed in 1956 by Kleene.^{[K56]}
In 1972, Amari made the Lenz-Ising recurrent architecture adaptive.^{[AMH12]} See also Grossberg's work on biological networks,^{[GRO69]} Marr's^{[MAR71]} and Kohonen's^{[KOH72]} work,
and Nakano's learning RNN.^{[NAK72]}
Already in 1948, Turing wrote up
ideas related to
artificial evolution^{[TUR1]} and
learning NNs. However, he did not formally publish these ideas, which is why they remained obscure.
Minsky's simple neural SNARC computer dates back to 1951. Rosenblatt's perceptron with a
single adaptive layer learned in 1958^{[R58]} (Joseph^{[R61]}
mentions an even earlier perceptron-like device by Farley & Clark);
Widrow & Hoff's similar Adaline learned in 1962.^{[WID62]}
Such single-layer
"shallow learning" actually
started around 1800 when Gauss & Legendre introduced linear
regression and the method of least squares^{[DL1-2][DLH]} (formally equivalent to linear NNs)—a famous early example of pattern recognition and generalization from training data through a parameterized predictor is Gauss' rediscovery of the asteroid Ceres based on previous astronomical observations.
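To spell out the claimed equivalence (a minimal sketch of standard textbook material, not taken from the cited references): a linear NN with weight vector w and bias b computes the prediction w·x + b, and fitting it by minimizing the summed squared error over the training data is exactly the least-squares problem of Gauss and Legendre:

```latex
\[ \min_{w,\,b}\; \sum_{i=1}^{N} \left( w^\top x_i + b - y_i \right)^2 \]
```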
Deeper
multilayer perceptrons (MLPs) were discussed by Steinbuch^{[ST6195]} (1961), Joseph^{[R61]} (1961), and Rosenblatt^{[R62]} (1962),
who wrote about "backpropagating errors" in an MLP with a hidden layer,^{[R62]} but did not yet have
a general deep learning algorithm for deep MLPs (what's now called backpropagation is quite different and was first published by Linnainmaa in 1970^{[BP1-BP5][BPAC]}).
Compare also Selfridge's multilayer Pandemonium^{[SE59]} (1959).
Successful learning in deep architectures started in 1965 in Ukraine, when
Ivakhnenko & Lapa introduced the first general, working learning algorithms for deep MLPs with arbitrarily many hidden layers (already containing the now popular multiplicative gates).^{[DEEP1-2][DL1-2][DLH]} A paper of 1971^{[DEEP2]} already described a deep learning net with 8 layers, trained by their highly cited method, which was still popular in the new millennium,^{[DL2]} especially in Eastern Europe, where much of Machine Learning was born.^{[MIR](Sec. 1)[R8]} LBH failed to cite this, just as they failed to cite Amari,^{[GD1]} who in 1967 proposed stochastic gradient descent^{[STO51-52]} (SGD) for MLPs and whose implementation^{[GD2,GD2a]} (with Saito) learned internal representations at a time when compute was billions of times more expensive than today (see also Tsypkin's work^{[GDab]}).
Fukushima's now widely used
deep convolutional NN architecture was first introduced in the 1970s;^{[CNN1]} his very popular ReLU already in 1969.^{[RELU1-2]}
See also Sec.
XIII,
III,
V,
VIII,
IX, and
X.
ACM seems to be influenced by a misleading "history of deep learning" propagated by
LBH & co-authors, e.g., Sejnowski^{[S20]} (see Sec. XIII). It goes more or less like this: "In 1969, Minsky & Papert^{[M69]} showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."^{[S20]} However, as mentioned above, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL1-2][DLH]} that had already been solved four years prior by Ivakhnenko & Lapa's popular deep learning method^{[DEEP1-2][DL2]}
(and then also by Amari's SGD for MLPs^{[GD1-2]}).
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[DLH]}
In the 1980s, "modern" gradientbased learning worked only for rather shallow NNs
(but see a 1989 paper^{[MOZ]}).
However, it became really deep in 1991 in my lab,^{[UNUN3]} which has
always focused on the depth in deep learning.
See Sec. 1 of the overview:^{[MIR]}
First Very Deep NNs, Based on Unsupervised Pre-Training (1991).
By 1993, my unsupervised pretraining helped to solve previously unsolvable
"Very Deep Learning" tasks of depth > 1000.^{[UN2][DL1][UN]}
Then, however, we replaced it with the even better, purely supervised LSTM—see Sec. A.^{[MIR](Sec. 4)}
(By 2003, LSTM variants successfully dealt with language problems of depth up to 30,000^{[LSTM17]}
and
more.)
In fact,
my lab twice
drove the shift
from unsupervised pretraining to purely supervised learning (1991-95; 2006-10).^{[HIN](Sec. II)[MIR]
(Sec. 19)}
Also see Sec.
III. Note that
LSTMs
brought essentially unlimited depth to gradient-based supervised recurrent NNs; Highway Nets^{[HW1-3]} brought it to feedforward NNs.^{[MOST][DLH]}
III. ACM:
... by the early 2000s, LeCun, Hinton and Bengio were among a small group who remained committed to this approach. Though their efforts to rekindle the AI community's interest in neural networks were initially met with skepticism, their ideas recently resulted in major technological advances, and their methodology is now the dominant paradigm in the field.
Comment: However, it isn't "their" methodology because it
was introduced much earlier
by others (Sec.
III).^{[DLC][DEEP1-2][BP1][DL1-2][DLH][R7-R8][R2-R4]}
As mentioned above, others introduced
deep learning multilayer perceptrons (1965),^{[DEEP1-2][R8]}
stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]}
modern backpropagation
(1970),^{[BP1,2][R7]}
architectures of recurrent NNs (1920s-1956)^{[I24,I25][MC43][K56]}
and convolutional NNs (1979),^{[CNN1]}
principles of generative adversarial NNs and artificial curiosity (1990),^{[AC90,90b][AC20]}
unsupervised pretraining for deep NNs,^{[UN1-2]}
the vanishing gradient problem (1991)^{[VAN1]} &
solutions to it (Sec. A),
supervised
GPU-accelerated NNs (2004),^{[GPUNN][GPUCNN5]}
and other foundations.^{[DL1-2][R2-R8]}
Often LBH failed to cite essential prior work.^{[DLC][HIN][MIR](Sec. 21)}
Also see
Sec.
II &
V &
XIII &
IX &
X &
XVII &
XII &
XVIII &
XX &
I.
ACM may have been misled by LBH's website
deeplearning.net, which until 2019 advertised
deep learning as "moving beyond shallow machine learning since 2006",^{[DL7]}
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised layerwise pretraining for deep NNs
(2006), although
we had this type of deep learning already in 1991;^{[UN][UN1-2]} see Sec.
II & XVII (5).
Not to mention Ivakhnenko's even earlier supervised layerwise training of deep NNs,^{[DEEP1-2]}
which Hinton,^{[UN4]} Bengio,^{[UN5]} and
LBH^{[DL3,DL3a]} did not cite either.
See Sec. X.
IV. ACM:
The ACM A.M. Turing Award, often referred to as the "Nobel Prize of Computing," carries a $1 million prize, with financial support provided by Google, Inc. It is named for Alan M. Turing, the British mathematician who articulated the mathematical foundation and limits of computing.
Comment:
Skip this comment if you are not interested in deviating from the topic of
LBH—this comment appears here only because
my comments systematically track the sequential order of ACM's claims.^{[T19]}
ACM's statement on Turing is greatly misleading, like some of its other statements.^{[T19]}
It is correct that Turing "articulated the mathematical foundation and limits of computing." However, many
have done this over the decades, and when it comes to credit assignment in science,
the important question is: Who did it first? It wasn't Turing.
Turing published five years after the groundbreaking work of
the Austrian mathematician Kurt Gödel (1931)^{[GOD][GOD21,21a]} and one year after the American Alonzo Church (1935),^{[CHU]} Turing's PhD advisor. Of course, he cited both of them in his 1936 paper.^{[TUR]}
With that in mind, let us look more closely at the birth of modern computer science.
In the early 1930s, Gödel founded modern theoretical computer science.^{[GOD][GOD34][LEI21,21a]} He introduced a universal coding language (1931-34).^{[GOD][GOD3421a]} It was
based on the integers
and allowed for formalizing the operations of any digital computer in axiomatic form.
Gödel used it to represent both data (such as axioms and theorems) and programs^{[VAR13]} (such as proof-generating sequences of operations on the data).
He famously constructed formal statements that talk about the computation of other formal statements—especially self-referential statements which imply that they are not decidable, given a computational theorem prover that systematically enumerates all possible theorems from an enumerable set of axioms. Thus he identified fundamental limits of algorithmic theorem proving, computing, and
any type of computation-based AI.^{[GOD][BIB3][MIR](Sec. 18)[GOD21,21a]}
Much of early AI in the 1940s-70s was actually about theorem proving^{[ZU48][NS56]}
and deduction in Gödel style through expert systems and logic programming.
In 1935, Church derived a corollary / extension of Gödel's result by demonstrating that Hilbert & Ackermann's Entscheidungsproblem (decision problem) does not have a general solution.^{[CHU]} To do this, he used his alternative universal coding language called Untyped Lambda Calculus, which forms the basis of the
highly influential programming language LISP.
In 1936, Turing
introduced yet another universal model: the
Turing Machine.^{[TUR]} He rederived the abovementioned result,^{[CHU][TUR][HIN][GOD21,21a][TUR21][LEI21,21a]}
citing both Gödel and Church.^{[TUR]}
In the same year of 1936, Emil Post published yet another independent universal model of computing,^{[POS]}
also citing Gödel and Church.
Today we know many such models.
Nevertheless, although he was standing on the shoulders of others, Turing
was certainly an important computer science pioneer.
(See also
my reply to Hinton,
who criticized my website on Turing
without suggesting any fact-based corrections.^{[HIN]})
The Gödel Prize for theoretical computer science is named after Gödel.
The currently more lucrative ACM A. M. Turing Award was created in 1966 for
contributions "of lasting and major technical importance to the computer field."
It is funny—and at the same time embarrassing—that Gödel (1906-1978) never got one, although he not only laid the foundations of the "modern" version of the field, but also identified its most famous open problem "P=NP?" in his well-known letter to John von Neumann (1956).^{[GOD56][URQ10]}
Neither did Church (1903-1995). There would have been plenty of time though—these pioneers died years after the award was introduced.
Likewise, Konrad Zuse (1910-1995)
never got a Turing Award despite having
created the world's first working programmable general-purpose computer in 1935-41.
His patent application of 1936^{[ZU36-38][Z36][RO98][ZUS21]}
described the digital circuits required by programmable physical hardware,
predating Claude Shannon's 1937 thesis on digital circuit design.^{[SHA37]}
Zuse also created the first high-level programming language in the early 1940s.^{[BAU][KNU]}
Zuse's Z3 computer of 1941 was a working practical device, not just a
theoretical and impractical pen & paper construct like
those of Gödel (1931-34), Church (1935), Turing (1936), and Post (1936).
Ignoring the inevitable storage limitations of any physical computer,
the physical hardware of Z3 was indeed
universal in the modern sense of the
theory papers above—simple arithmetic tricks
can compensate for its lack of an explicit
conditional jump instruction.^{[RO98]}
(BTW, programming a Turing machine or a Post machine is much more awkward than programming the Z3.)
In sum, the two founders of the theory and practice of modern computing never got Turing Awards.
V. ACM:
"Artificial intelligence is now one of the fastestgrowing areas in all of science and one of the most talkedabout topics in society," said ACM President Cherri M. Pancake. "The growth of and interest in AI is due, in no small part, to the recent advances in deep learning for which Bengio, Hinton and LeCun laid the foundation."
Comment:
The foundations of deep learning
were actually laid by others much earlier, e.g.,
deep learning multilayer perceptrons
that learn internal representations (1965),^{[DEEP1-2][R8]}
stochastic gradient descent for multilayer perceptrons (1967),^{[GD1-3]}
modern backpropagation
(1970),^{[BP1,2][R7]}
architectures of recurrent NNs (1920s-1956)^{[I24,I25][MC43][K56]}
and convolutional NNs (1979),^{[CNN1]}
principles of generative adversarial NNs and artificial curiosity (1990),^{[AC][AC90,90b][AC10][AC20]}
unsupervised pretraining for deep NNs (1991),^{[UN1-2][UN]}
vanishing gradients (1991)^{[VAN1]} &
solutions to it (Sec. A),^{[LSTM017][CTC]}
supervised GPU-accelerated NNs
(2004),^{[GPUNN][GPUCNN5]}
record-breaking deep supervised NNs
(2010)^{[MLP1-2]}
and contest-winning deep CNNs (2011),^{[DAN][DAN1][GPUCNN5]}
super deep
NNs with over 100 layers (2015),^{[HW1-3][R5]}
Transformers with linearized self-attention through
fast weight programmers (1991),^{[FWP02,6][TR16][FWP][ATT]}
and more.^{[DL1-2][DLH][R2-R8]}
Often LBH failed to cite essential prior work.^{[DL3,DL3a][DLC][HIN][MIR](Sec. 21)[R2-R5,R7,R8,R11]}
Also see
Sec.
II &
I &
III &
XIII &
X &
XVII &
XII &
XVIII &
XX.
VI. ACM:
These technologies are used by billions of people. Anyone who has a smartphone in their pocket can tangibly experience advances in natural language processing and computer vision that were not possible just 10 years ago.
Comment:
However, those
"advances in natural language processing" and in speech
in the past 10 years
came mainly through
LSTM and CTC, which were developed not by LBH's groups but by our group^{[LSTM16][CTC]} (1991-2007)—see Sec. B & Sec. A. And even the "advances in computer vision" were possible only through the speedups of
supervised NNs and
CNNs
achieved by our group in 2010-2011^{[MLP1-2][DAN][DAN1][GPUCNN5][R6]}
and through Highway Net-like NNs (2015),^{[HW1-3][R5]} although the principles of CNNs were invented and developed by others since the 1970s.^{[CNN1-4]} See Sec. D & XVIII & XIV
as well as Sec. 4 & Sec. 19 of the overview.^{[MIR]}
VII. ACM:
In addition to the products we use every day, new advances in deep learning have given scientists powerful new tools—in areas ranging from medicine, to astronomy, to materials science."
Comment: But who really started this?
ACM explicitly mentions medicine.
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
Our
DanNet^{[DAN][DAN1][GPUCNN5]}
was
the first NN to win a medical imaging contest through deep learning
(Sept 2012, on cancer detection).^{[GPUCNN5,8]}
ACM also explicitly mentions materials science. In 2010, we introduced our
deep and fast GPU-based NNs to Arcelor Mittal, the world's largest steel producer,
and were able to greatly improve steel defect detection.^{[ST]}
To the best of my knowledge, this was the first deep learning breakthrough in heavy industry.
All of this happened before the similar GPU-accelerated AlexNet of Hinton's student Krizhevsky^{[GPUCNN4-5][R6]} and the VGG network^{[GPUCNN9]}
won ImageNet contests in 2012 and 2014.
One year later, our team also won the MICCAI Grand Challenge on
mitosis detection.^{[MGC][GPUCNN5,8]}
Our
approach of
2012-2013
has transformed medical imaging, and
many major companies are using it now (see Sec.
D &
XI).
And of course,
our LSTM (see Sec. A & B & C) is also massively used in healthcare and medical diagnosis—a simple Google Scholar search turns up thousands of such articles.
VIII. ACM:
"Deep neural networks are responsible for some of the greatest advances in modern computer science, helping make substantial progress on longstanding problems in computer vision, speech recognition, and natural language understanding," said Jeff Dean, Google Senior Fellow and SVP, Google AI.
"At the heart of this progress are fundamental techniques developed starting more than 30 years ago by this year's Turing Award winners, Yoshua Bengio, Geoffrey Hinton, and Yann LeCun."
Comment:
As pointed out above,
LBH actually used the "fundamental techniques" invented by others, including our team, often
without citing them.^{[DL1][DLC][HIN][R2-R4][R7-R8]}
See Sec.
V &
XII &
XIX &
II &
III &
XIII &
XVII &
X &
I.
IX. ACM:
By dramatically improving the ability of computers to make sense of the world, deep neural networks are changing not just the field of computing, but nearly every field of science and human endeavor."
Machine Learning, Neural Networks and Deep Learning
In traditional computing, a computer program directs the computer with explicit stepbystep instructions. In deep learning, a subfield of AI research, the computer is not explicitly told how to solve a particular task such as object classification. Instead, it uses a learning algorithm to extract patterns in the data that relate the input data, such as the pixels of an image, to the desired output such as the label "cat." The challenge for researchers has been to develop effective learning algorithms that can modify the weights on the connections in an artificial neural network so that these weights capture the relevant patterns in the data.
Geoffrey Hinton, who has been advocating for a machine learning approach to artificial intelligence since the early 1980s, looked to how the human brain functions to suggest ways in which machine learning systems might be developed. Inspired by the brain, he and others proposed "artificial neural networks" as a cornerstone of their machine learning investigations.
Comment:
However, as mentioned above, those "others" mentioned by ACM
proposed such systems decades before Hinton,
who failed to cite them, even in later
work.^{[HIN][DLC][DL1-2][DEEP1-2][RELU1-2][R7-R8][DLH]} See Sec.
II &
III &
XIII &
V &
X &
XIV &
I.
X. ACM:
In computer science, the term "neural networks" refers to systems composed of layers of relatively simple computing elements called "neurons" that are simulated in a computer. These "neurons," which only loosely resemble the neurons in the human brain, influence one another via weighted connections. By changing the weights on the connections, it is possible to change the computation performed by the neural network. Hinton, LeCun and Bengio recognized the importance of building deep networks using many layers—hence the term "deep learning."
Comment:
The ancient term "deep learning" (explicitly mentioned by ACM) was actually
first introduced to Machine Learning by Dechter (1986), and to NNs by Aizenberg et al. (2000).^{[DL2]} To my knowledge, LBH have never cited them.
(Margin note: our 2005 paper on deep RL^{[DL6,6a]} was
the first machine learning
publication with the word combination "learn deep" in the title.)
Later
LBH started talking about "deep learning ... moving beyond shallow machine learning since 2006",^{[DL7]} referring to their unsupervised pretraining methods of 2006.
See Sec. III.
It is true, though, that LBH "recognized the importance of building deep networks using many layers." However,
others built careers on this notion long before LBH recognized this.^{[DEEP1-2][CNN1][HIN][R8][DL1][DLC]} Even deep learning through unsupervised pretraining was introduced by others.^{[UN1-3][R4][HIN](Sec. II)}
See also Sec.
II &
III &
XIII &
V &
I.
XI. ACM:
The conceptual foundations and engineering advances laid by LeCun, Bengio and Hinton over a 30year period were significantly advanced by the prevalence of powerful graphics processing unit (GPU) computers, as well as access to massive datasets. In recent years, these and other factors led to leapfrog advances in technologies such as computer vision, speech recognition and machine translation.
Comment:
Again ACM lauds work that failed to cite the pioneers.
As mentioned above,
the essential "conceptual foundations" of deep learning were laid by others
ignored by LBH's papers^{[HIN][R7-R8][R2-R5]} (see Sec.
V &
II &
III &
I &
XIII &
XII & XIX &
X & XVII).
ACM correctly mentions advancements through GPUs. The first to use GPUs for NNs were Jung & Oh (2004),^{[GPUNN][GPUCNN5]}
apparently never cited by LBH.
In 2010,
it was our team (Dan Ciresan et al.)
that
made GPU-based NNs fast and deep enough
to break
an important benchmark record,^{[MLP1-2]}
demonstrating that
unsupervised pretraining (pioneered by myself in 1991)
is not necessary
to train deep NNs, contrary to Hinton's claims.^{[VID1]}
By 2011,
our CNNs were deep and fast enough^{[DAN][DAN1][GPUCNN5]}
to win competitions in computer
vision (explicitly mentioned by ACM) for the first time^{[R6]} (see Sec. D).
Furthermore, by the mid-2010s, speech recognition and machine translation
(explicitly mentioned by ACM) were actually dominated by our team's LSTM and CTC.^{[LSTM14][CTC]}
In particular, as mentioned in Sec. A,
the CTC-LSTM combination (2006-2007) was the first superior end-to-end neural speech recognizer, while previous methods since the late 1980s (including Bengio's and Hinton's) combined NNs with traditional models such as HMMs.^{[BW][BOU][BRI][HYB12]}
As mentioned in Sec. B and XVI, the first superior end-to-end neural machine translation was also based on LSTM.
XII. ACM:
... Select Technical Accomplishments ...
Geoffrey Hinton
Backpropagation: In a 1986 paper, "Learning Internal Representations by Error Propagation," coauthored with David Rumelhart and Ronald Williams, Hinton demonstrated that the backpropagation algorithm allowed neural nets to discover their own internal representations of data, making it possible to use neural nets to solve problems that had previously been thought to be beyond their reach. The backpropagation algorithm is standard in most neural networks today.
Comment:
ACM credits Hinton for work that failed to cite the origins of the backpropagation algorithm.
ACM's statement is "less wrong" than Honda's^{[HIN](Sec. I)} but still
very misleading since nonexperts
(and apparently even other award committees^{[HIN](Sec. I)})
are left with the impression that
Hinton and colleagues created this method. They didn't. In fact,
Hinton was co-author of an article on
backpropagation by Rumelhart et al. (1985-86)^{[RUM]}
which did not state that Werbos had proposed training NNs in this way 3 years earlier
(1982).^{[BP2]}
And the article^{[RUM]} even failed to mention Linnainmaa, the inventor of this famous algorithm for credit assignment in networks (1970),^{[BP1]} also known as the "reverse mode of automatic differentiation." In 1960, Kelley already had a precursor thereof in the field of control theory;^{[BPA]} see also later work of the early 1960s.^{[BPB][BPC]}^{[R7]}
By 1985, compute had become about 1,000 times cheaper than in 1970, and
the first desktop computers
had just become accessible in wealthier academic labs. Computational experiments then demonstrated that backpropagation can yield useful internal representations in hidden layers of NNs.^{[RUM]} But this was essentially just an experimental analysis of a known method.^{[BP1-2]} And
the authors did not cite the prior art—not even in later surveys.^{[DL3,DL3a][DLC]}
More on the
history of backpropagation
can be found at Scholarpedia^{[DL2]} and in my awardwinning survey.^{[DL1]}
Also see Sec. XIX, II.
Some claim that "backpropagation is just the chain rule of Leibniz (1676)^{[DLH]} & L'Hopital (1696)." No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (there are also many inefficient ways of doing this). It was not published until 1970.^{[BP1][DLH]}
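To make the distinction concrete, here is a hedged toy sketch (my own example, not taken from the cited papers): a single reverse-mode sweep through a tiny "network" yields the partial derivatives of its scalar output with respect to all weights and inputs, at a cost proportional to one forward evaluation—this efficiency, not the chain rule itself, is the point.

```python
import numpy as np

# Forward pass of a tiny "network": y = sum(tanh(W @ x))
rng = np.random.default_rng(1)
W, x = rng.normal(size=(3, 4)), rng.normal(size=4)
a = W @ x                 # pre-activations
h = np.tanh(a)            # hidden activations
y = h.sum()               # scalar output

# Reverse-mode sweep: propagate dy/d(node) from the output back to the inputs.
dy_dh = np.ones_like(h)           # d(sum)/dh = 1
dy_da = dy_dh * (1 - h ** 2)      # tanh'(a) = 1 - tanh(a)^2
dy_dW = np.outer(dy_da, x)        # dy/dW: all 12 entries in one sweep
dy_dx = W.T @ dy_da               # dy/dx: all 4 entries in one sweep
# (Forward mode would need one such pass per input dimension.)

# Check one entry against a finite difference:
eps = 1e-6
x2 = x.copy(); x2[0] += eps
y2 = np.tanh(W @ x2).sum()
print(dy_dx[0], (y2 - y) / eps)   # nearly identical
```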
See the
recent debate:^{[HIN]} It is true that in 2018,
Hinton^{[AOI]}
did not credit himself but his coauthor
Rumelhart^{[RUM]} with the "invention" of backpropagation.
Nevertheless, he accepted the Honda Prize
for "creating" the method and for other things he didn't do.^{[HIN]}
Neither in a popular book^{[AOI]}
nor in other recent work^{[DL3,DL3a]} did he
cite Linnainmaa (1970),^{[BP1]} the true creator.^{[BP4-5]}
It should be mentioned
that his 2015 survey^{[DL3]} does cite Werbos (1974), who, however, described the method correctly only
later, in 1982,^{[BP2]} and
also failed to cite Linnainmaa.^{[BP1]}
Compare the 1967-68 work of Amari:^{[GD1-3]} to my knowledge the first to propose and implement stochastic gradient descent^{[STO51-52]} for training multilayer perceptrons (without specifying the specific reverse-mode gradient descent^{[GD', GD'']} method now known as backpropagation^{[BP1]});
see also Tsypkin's work of 1966.^{[GDab]}
Linnainmaa's backpropagation method was well-known.^{[BP5][DL1-2][DLC]}
It wasn't created by "lots of different people," as Hinton suggested,^{[AOI][HIN][R11]}
but by exactly
one person, who published it first^{[BP1][DLH]} and therefore should get the credit.
XIII. ACM:
Boltzmann Machines: In 1983, with Terrence Sejnowski, Hinton invented Boltzmann Machines, one of the first neural networks capable of learning internal representations in neurons that were not part of the input or output.
Comment:
Again ACM credits work that failed to cite the pioneers.
I once called the
Boltzmann Machine (BM)^{[BM]} a
significant contribution to deep
learning.^{[HIN]}
Recently, however, I learnt through a reader that even the BM paper^{[BM]} did not cite prior relevant work
by Sherrington & Kirkpatrick^{[SK75]} and Glauber.^{[G63]}
(Compare related work.^{[H86][H88][S93]})
The BM paper should also have mentioned
that already two decades earlier, in 1965, Ivakhnenko & Lapa published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many layers.^{[DEEP12][HIN]} These
networks were fully "capable of learning internal representations in neurons that were not part of the input or output." The BM paper^{[BM]} did not cite this. LBH have never cited this—not even in recent work. See also
Sec. II
&
V &
X.^{[MIR](Sec. 1)[R8]}
As mentioned in Sec. II, Sejnowski's rather self-serving "history of deep learning" [S20] claims: In 1969, Minsky & Papert^{[M69]} showed that shallow NNs are very limited "and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "deep learning problem"
(a limitation of Gauss & Legendre's shallow learning around 1800^{[DL12][DLH]}) that had already been solved four years prior (see Sec. II),
and deep learning research was alive and kicking
also in the 1970s, especially outside of the Anglosphere.^{[DEEP2][GD13][CNN1][DL12][DLH]}
XIV. ACM:
Improvements to convolutional neural networks: In 2012, with his students, Alex Krizhevsky and Ilya Sutskever, Hinton improved convolutional neural networks using rectified linear neurons and dropout regularization. In the prominent ImageNet competition, Hinton and his students almost halved the error rate for object recognition and reshaped the computer vision field.
Comment: Again ACM recognizes work that failed to cite the pioneers.
Rectified linear neurons (ReLUs) were actually known much earlier—see Fukushima (1969)^{[RELU1]} and v. d. Malsburg (1973).^{[RELU2]} Hinton's 2012 paper^{[GPUCNN4]}
did not cite their origins. Instead, it cited another paper
by Hinton which also did not cite the original work.
Dropout is actually a variant of Hanson's much earlier stochastic delta rule (1990).^{[Drop14]} Hinton's 2012 paper and his later patent did not cite this either.
Apart from this,
as we showed already in 2011 in a contest where LeCun's team participated as well,^{[DAN1]}
neither dropout nor ReLUs are necessary
to win computer vision competitions and achieve
superhuman results—see
Sec. D above. Back then, the only really
important CNN-related task was to greatly accelerate the training
of deep CNNs through GPUs.^{[GPUCNN1,3,5][R6]}
Already before ImageNet 2012,^{[R6]}
our earlier
fast deep CNN called DanNet
(using
neither ReLUs nor dropout / Hanson's rule) had
a monopoly on winning computer vision competitions.^{[GPUCNN5]} It more than "halved the error rate for object recognition" (ACM's wording) in a contest already in 2011,^{[GPUCNN2][DAN,DAN1][R6]} long before the similar system of Hinton's student.
See Sec. D
as well as Sec. 19 of the overview.^{[MIR]}
XV. ACM:
Yoshua Bengio
Probabilistic models of sequences: In the 1990s, Bengio combined neural networks with probabilistic models of sequences, such as hidden Markov models. These ideas were incorporated into a system used by AT&T/NCR for reading handwritten checks, were considered a pinnacle of neural network research in the 1990s, and modern deep learning speech recognition systems are extending these concepts.
Comment:
However, such hybrids of NNs and hidden Markov models (HMMs) etc.
have existed
since the late 1980s.^{[BW][BRI][BOU]}
It is not true that
"modern deep learning speech recognition systems are extending these concepts"
(ACM's wording)
because they basically abandon HMMs and are based on
two methods from my lab:
LSTM (1990s-2005)^{[LSTM06]}
and CTC^{[CTC]} (2006), which were applied to speech
in 2007.^{[LSTM4][LSTM14]}
CTC-LSTM is end-to-end neural and thus very different from (and superior to) the hybrid methods since the late 1980s.^{[BW][BRI][BOU][HYB12]}
By the time the 2018 Turing Award was handed out, our
CTC-LSTM-based speech recognition was on most smartphones.
See also Sec. A.
XVI. ACM:
High-dimensional word embeddings and attention: In 2000, Bengio authored the landmark paper, "A Neural Probabilistic Language Model," that introduced high-dimension word embeddings as a representation of word meaning. Bengio's insights had a huge and lasting impact on natural language processing tasks including language translation, question answering, and visual question answering. His group also introduced a form of attention mechanism which led to breakthroughs in machine translation and form a key component of sequential processing with deep learning.
Comment:
Five years earlier, in 1995, we already had a similar, excellent neural probabilistic text model.^{[SNT]} Bengio^{[NPM]} characterizes it only briefly as "related"
(see also Pollack's earlier work on embeddings of words and other structures^{[PO87][PO90]}).
In the 2010s,
the central method in
the mentioned fields of "language translation, question answering, and visual question answering"
was actually the LSTM of our team,^{[LSTM06]} which Bloomberg called "arguably the most commercial AI achievement."^{[AV1][MIR](Sec. 4)} See Sec. B.
A particular attention mechanism tailored to NLP by
Bengio's team^{[ATT14]} has indeed become important.
For example, it helped to further improve Facebook's LSTMbased translation (see Sec. B).
However,
already in 1990-93, we had both of the now common types of
adaptive neural sequential attention: end-to-end-differentiable
"soft" attention in the latent space of Fast Weight Programmers (FWPs),^{[FWP2][FWP]} and "hard" attention (in observation space) in the context of RL^{[ATT][ATT01]} (1990).
In fact, the now widely used
attentionbased Transformers^{[TR16]} are
closely related to my
FWPs of 1991^{[FWP01]}
which have become a popular alternative to RNNs.
A traditional slow neural net (NN) learns by gradient descent to program the changes of
the fast weights of
another NN.
Like RNNs, FWPs can learn to memorize past data.
My FWP of 1991^{[FWP01]}
computed its fast weight changes through
additive outer products of self-invented activation patterns
(now often called keys and values for self-attention).^{[TR16][FWP]}
Transformers combine this with projections
and softmax. Towards the end of
the 2010s,^{[DEC]}
despite their limited time windows,
Transformers^{[TR12]}
started to excel at Natural Language Processing,
a traditional LSTM domain (see Sec. B).
Nevertheless, there are still many language tasks that LSTM can
learn to solve quickly^{[LSTM13,17]}
(in time proportional to sentence length)
while plain Transformers can't—see^{[TR34]}
for additional limitations of Transformers.
For long input sequences, the efficiency of Transformers was improved through
Transformers with linearized selfattention^{[TR56]}
which are formally equivalent to the outer product version (Eq. 5) of my 1991 FWPs (apart from normalization; see tweet of 2022).^{[FWP6][FWP]}
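To make this correspondence concrete, here is a minimal sketch (toy Python with hypothetical dimensions, not taken from the cited papers, and omitting the projections, softmax, and normalization mentioned above): the fast weight matrix accumulates additive outer products of keys and values, and a query then retrieves a stored value, as in unnormalized linear attention.

```python
# Minimal sketch (toy code, not from the cited papers) of an outer product fast
# weight update followed by an attention-like retrieval: the fast weights are
# programmed by key/value pairs and later queried.
import numpy as np

def write(W_fast, key, value):
    """A slow-net-generated key/value pair programs the fast weights additively."""
    return W_fast + np.outer(value, key)

def read(W_fast, query):
    """Retrieval: the fast weights map a query to a stored value."""
    return W_fast @ query

# Hypothetical usage: store two associations, then query the first one.
d_key, d_value = 4, 3
rng = np.random.default_rng(1)
k1, v1 = rng.standard_normal(d_key), rng.standard_normal(d_value)
k2, v2 = rng.standard_normal(d_key), rng.standard_normal(d_value)
W_fast = np.zeros((d_value, d_key))
W_fast = write(W_fast, k1, v1)
W_fast = write(W_fast, k2, v2)
retrieved = read(W_fast, k1)   # ~ v1*(k1.k1) + v2*(k2.k1); softmax/normalization omitted
```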
In 1993, I introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
extended the approach to
RNNs that program themselves.
See^{[MIR](Sec. 9)[R4]} for my related priority dispute on attention with Hinton.
He was the reviewer of my 1990 paper^{[ATT2]} which summarised in its Section 5 our early work on attention, to my knowledge the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component
with an attentional component (the fixation controller).
Two decades later Hinton wrote about
his own work:^{[ATT3]}
"To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)."
XVII. ACM:
Generative adversarial networks: Since 2010, Bengio's papers on generative deep learning, in particular the Generative Adversarial Networks (GANs) developed with Ian Goodfellow, have spawned a revolution in computer vision and computer graphics. In one fascinating application of this work, computers can actually create original images, reminiscent of the creativity that is considered a hallmark of human intelligence.
Comment:
Again ACM lauds Bengio for work that failed to cite the prior art.
GANs^{[GAN01]} (20102014) are actually
a simple application^{[AC]}
of the adversarial curiosity (AC) principle
from 1990^{[AC90,90b][AC20]} (see also surveys^{[AC0910]}). This principle
is now widely used for exploration in RL (e.g., Sec. C) and
for image synthesis^{[GAN1]} (also mentioned by ACM in Sec. XVIII). It
works as follows. One NN—the controller—probabilistically generates outputs.
Another NN—the world model—sees the outputs of the controller and predicts environmental reactions to them. Using gradient descent, the predictor NN minimizes its error, while the generator NN tries to make outputs that maximize this error: one net's loss is the other net's gain. Four years before the GAN paper,^{[GAN1]} a well-known 2010 survey^{[AC10]} summarised the generative adversarial NNs of 1990 as follows: a
"neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" (which are minimized).
GANs are a version of this where the trials are very short (like in bandit problems) and the environment simply returns 1 or 0 depending on whether the controller's (or generator's) output is in a given set.^{[AC20][AC]}
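For illustration only, here is a minimal sketch (toy Python with hypothetical shapes and a trivial environment, not taken from the 1990 or 2014 papers) of this minimax game in the short-trial, binary-feedback special case just described: the predictor descends on its prediction error while the generator ascends on the very same error.

```python
# Minimal sketch (toy code, not from the cited papers): a generator produces an
# output, the environment answers 1 or 0 depending on whether the output lies in a
# given set, a predictor learns to predict that answer, and the generator is updated
# in the opposite direction so as to make the prediction fail.
import numpy as np

rng = np.random.default_rng(0)
dim_z, dim_x = 2, 3
W_gen = rng.standard_normal((dim_x, dim_z)) * 0.1   # generator (controller) weights
w_pred = np.zeros(dim_x)                            # predictor (world model) weights
lr = 0.05

def environment(x):
    # The "given set": here simply the half-space where the coordinates sum to > 0.
    return 1.0 if x.sum() > 0 else 0.0

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for _ in range(1000):
    z = rng.standard_normal(dim_z)          # probabilistic generator input
    x = W_gen @ z                           # generator output
    y = environment(x)                      # environment's 1/0 reaction
    p = sigmoid(w_pred @ x)                 # predictor's guess of y
    g = p - y                               # gradient of cross-entropy error w.r.t. the logit

    # Predictor descends on its prediction error ...
    w_pred -= lr * g * x
    # ... while the generator ascends on the same error (one net's loss is the
    # other net's gain); the environment's binary response y is treated as constant.
    grad_x = g * w_pred                     # error gradient w.r.t. x, through the predictor
    W_gen += lr * np.outer(grad_x, z)
```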
(Other
early adversarial machine learning settings^{[S59][H90]}
were very different—they
neither involved unsupervised NNs nor were about modeling data nor used gradient descent.^{[AC20]}) Bengio et al. neither cited the original work^{[AC90,90b][AC20]} nor corrected
their erroneous claims^{[GAN1]} about
the other
adversarial NNs using "predictability minimization" (PM) for creating disentangled representations
(1991).^{[PM12][AC20][R2][MIR](Sec. 5)}
The priority dispute above was picked up by the popular press, e.g.,
Bloomberg,^{[AV1]}
after a particularly notable encounter between me and Bengio's student Dr. Goodfellow at a N(eur)IPS conference.
He gave a talk on GANs, encouraging people to ask questions.
I did, addressing problems in
their NIPS 2014 paper^{[GAN1]}
and some of the erroneous claims it made about my prior work.^{[AC20]}
Subsequent efforts to correct these issues in a joint paper did not work out.
Goodfellow eventually admitted that PM is adversarial (his paper^{[GAN1]} still claims the opposite), but emphasized that it's not generative. However, the even earlier AC^{[AC90,90b][AC10][AC20]} is both adversarial and generative (its generator contains probabilistic units^{[AC90]} like in StyleGANs^{[GAN2]}).
When the authors^{[GAN1]}
did not produce an erratum,
I published one myself in the hopes of correcting the annals of history.^{[AC20]}
Remarkably, Bengio was backed by LeCun who called GANs
"the coolest idea in machine learning in the last twenty years" without mentioning
that they are instances of my earlier work.^{[R2][AC20]}
XVII-1. Additional priority disputes with Bengio's group, mostly going back 3 decades and more
(B2) 3 years after my student Sepp Hochreiter had published his analysis of
the famous
vanishing gradient problem,^{[MIR](Sec. 3)[VAN1]} Bengio published his own,^{[VAN2]} without citing Sepp.
At the 1996 N(eur)IPS conference, this dispute was settled in favor of Sepp.^{[VAN1]}
However, even after a common publication,^{[VAN3]} Bengio published papers^{[VAN4][XAV]}
that cited only his own 1994 paper but not Sepp's original work (1991).
Disturbingly, this has apparently helped him to get more citations for vanishing gradients
than Sepp—another sign that citation counts
are poor indicators of truly pioneering work.^{[NAT1]}
(Margin note: Bengio states^{[YB20]} that in 2018 he
"ranked as the most cited computer scientist worldwide"—the above illustrates what such citation counts are really worth.)
The deontology of science requires:
If one "reinvents" something that was already known,
and only becomes aware of it later,
one must at least clarify it later,^{[DLC]}
and correctly give credit
in all follow-up papers and presentations.
(B3)
Bengio also claims^{[YB20]} that in 1995
he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue"^{[HB96]}
although
my publications on exactly this topic
date back to 1991-93.^{[UN02][UN]}
(B4) Another dispute was on
metalearning (learning to learn—now a hot topic)
which I started in 1987^{[META1][META]} long before Bengio
who suggested in public at N(eur)IPS 2019
that he did it before me.^{[R3]}
(B5)
Bengio also writes^{[YB20]} that in
1999 he "introduced, for the first time, autoregressive neural networks for density estimation"
although we used a very similar setup for text compression
in 1995^{[SNT]}—see Sec. XVI.
(B6)
Regarding attention-based Transformers,^{[TR16]} Bengio^{[DL3a]} cites his own team (2014) for "soft attention" without citing my much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP02,6][TR16][FWP][ATT]} See also this tweet of 2022.
There is more. For example,
Bengio has also heavily used our LSTM (see Sec. AC),
but for some reason he introduced in 2014 what he called
"gated recurrent units (GRU)"^{[LSTMGRU]}
for a variant of our vanilla LSTM architecture^{[LSTM2]} (2000) which he did not cite
although our work^{[LSTM2]} was the one that introduced gated recurrent units.
In addition, our team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
(Margin note: GRU cells lack an important gate and can neither
learn to count^{[LSTMGRU2]} nor learn simple non-regular
languages;^{[LSTMGRU2]} they
also do not work as well for challenging translation tasks,
according to Google Brain.^{[LSTMGRU3]})
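To make the architectural comparison concrete, here is a minimal sketch (standard textbook formulations in toy Python, not taken from the cited papers) of a vanilla LSTM cell next to a GRU cell; in this formulation the GRU has no output gate and no separate cell state.

```python
# Minimal sketch (toy code, textbook formulations, biases omitted for brevity):
# one step of a vanilla LSTM cell versus one step of a GRU cell.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_cell(x, h, c, p):
    """Vanilla LSTM step with input, forget, and output gates and a separate cell state."""
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h)   # input gate
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h)   # forget gate
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h)   # output gate (absent in the GRU)
    c_new = f * c + i * np.tanh(p["Wc"] @ x + p["Uc"] @ h)
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def gru_cell(x, h, p):
    """GRU step: update and reset gates only, no output gate, no separate cell state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)   # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)   # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1.0 - z) * h + z * h_cand

# Hypothetical usage with input size 3 and hidden size 4.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = lambda: rng.standard_normal((n_hid, n_in))
U = lambda: rng.standard_normal((n_hid, n_hid))
lstm_p = {"Wi": W(), "Ui": U(), "Wf": W(), "Uf": U(),
          "Wo": W(), "Uo": U(), "Wc": W(), "Uc": U()}
gru_p = {"Wz": W(), "Uz": U(), "Wr": W(), "Ur": U(), "Wh": W(), "Uh": U()}
x, h, c = rng.standard_normal(n_in), np.zeros(n_hid), np.zeros(n_hid)
h_lstm, c_lstm = lstm_cell(x, h, c, lstm_p)
h_gru = gru_cell(x, h, gru_p)
```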
XVII-2. Additional priority disputes with Hinton's group, going back 3 decades and more
(H1) The dispute on
unsupervised pretraining
for deep NNs.^{[UN04][HIN](Sec. II)[MIR](Sec. 1)}
Hinton's paper^{[UN4]} (2006) appeared long after my earlier
work on this^{[UN02]}
which introduced
the first NNs shown to solve very deep problems
(see Sec. II above).^{[UN]}
It was published in 1991-92^{[UN1]} when compute was about 1000 times more expensive than in 2006.
Hinton
did not mention it—not even in LBH's later
survey (2015),^{[DL3][DLC]}
although he and Bengio knew it well (also from discussions by email).
See also Sec. II & III.
(H2) The dispute on
compressing or distilling
one NN into another.^{[UN02][DIST2][MIR](Sec. 2)}
Hinton^{[DIST2]} (2015) did not cite my much earlier original
work on this (1991),^{[UN1][UN]} not even in his later patent application
US20150356461A1.
(H3) The dispute on
fast weight programmers^{[FWP][FWP04a]}
through tensorlike outer products (19912016) and their motivation^{[FWP2][FWP4a][MIR](Sec. 8)} (see also Sec. XVI above).
(H4) The dispute on
learning sequential attention
with NNs.^{[MIR](Sec. 9)}
Hinton^{[ATT3]} (2010)
did not mention
our much earlier work on this^{[ATT1][ATT]} although
he was both reviewer and editor of my summary^{[ATT2]} (1990; see Sec. XVI above).
The ten priority disputes mentioned in the present Sec. XVII are not the only ones.^{[R4]} Remarkably, three of them
are related to the 1991 paper^{[UN1][UN]} which in many ways started what people now call deep learning, going beyond
Ivakhnenko's "early" deep learning^{[DEEP12]} (which LBH did not cite either^{[DLC]}—see Sec. II & III).
Most of them go back to work of 199091.^{[MIR]}
See Sec. I for additional related issues of credit assignment.
For decades, it seems, much of the more prominent work of Dr. Hinton and Dr. Bengio has simply repackaged earlier work by others without citing it properly.
XVII-3. Additional priority disputes with LeCun's group, going back 3 decades and more
Some of the disputes with LeCun are covered here.^{[LEC]} For example, years ago, my team published most of what LeCun called his "main original contributions" in 2022: neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc.^{[LEC22ab]} This was also discussed on Hacker News, reddit, and in the media.
XVIII. ACM:
Yann LeCun
Convolutional neural networks: In the 1980s, LeCun developed convolutional neural networks, a foundational principle in the field, which, among other advantages, have been essential in making deep learning more efficient.
In the late 1980s, while working at the University of Toronto and Bell Labs, LeCun was the first to train a convolutional neural network system on images of handwritten digits. Today, convolutional neural networks are an industry standard in computer vision, as well as in speech recognition, speech synthesis, image synthesis, and natural language processing. They are used in a wide variety of applications, including autonomous driving, medical image analysis, voice-activated assistants, and information filtering.
Comment:
LeCun's team has made important contributions to CNNs since 1989.^{[CNN2,4]}
However, the basic CNN architecture with convolutional and downsampling layers is actually due to Fukushima (1979).^{[CNN1]} NNs with convolutions were later (1987) combined by Waibel with weight sharing and backpropagation.^{[CNN1a]} Waibel called this the TDNN and
was also the first to apply it to speech (a field explicitly mentioned by ACM).
All of this happened before LeCun's work on CNNs. See Sec. D above and Sec. 21 of the overview of our Annus Mirabilis 1990-1991.^{[MIR]}
ACM explicitly mentions autonomous driving.
The first team to win a relevant international contest through deep CNNs was ours:
at IJCNN 2011 in Silicon Valley, our DanNet^{[DAN][GPUCNN13]} won the
traffic sign recognition competition with
superhuman performance
while LeCun's team took a distant second place (with
three times worse performance).^{[DAN1]} Again see Sec. D.
ACM explicitly mentions medical image analysis.
Baldi and Chauvin (1993) had the first application of CNNs with backpropagation to biomedical/biometric images.^{[BA93]}
The first team to win a medical image analysis competition through deep CNNs was again ours:
at ICPR 2012, our DanNet^{[GPUCNN13]} won the
medical imaging contest
(Sept 2012, on detection of mitosis/cancer)^{[GPUCNN5,7,8]}
(before the similar AlexNet won ImageNet 2012^{[GPUCNN5][R6]} and the similar VGG network^{[GPUCNN9]} won ImageNet 2014).
One year later, our team also won the MICCAI Grand Challenge on
mitosis detection.^{[MGC][GPUCNN5,7,8]}
This approach has transformed medical imaging.
Many major companies are using it now. See Sec. D & VII.
ACM also addresses image synthesis—see Sec. XVII.
ACM also explicitly mentions speech recognition, speech synthesis,^{[AM16][DL1]}
natural language processing,
voice-activated assistants, and information filtering.
All of these fields were heavily shaped in the 2010s by our non-CNN methods.^{[DL1][DL4][AM16][GSR][GSR15][GT16][WU][FB17]} See
Sec. A, B, VI, XI.
XIX. ACM:
Improving backpropagation algorithms: LeCun proposed an early version of the backpropagation algorithm (backprop), and gave a clean derivation of it based on variational principles. His work to speed up backpropagation algorithms included describing two simple methods to accelerate learning time.
Comment: ACM recognizes
LeCun for work that did not cite the pioneers of this method.
As mentioned in Sec. XII, backpropagation was actually proposed earlier as a learning method for NNs by Werbos (1982)^{[BP24][DLH]} (see also Amari's work on SGD for MLPs of 1967-68^{[GD12a]}). And already in 1970, the modern backpropagation algorithm itself—the real centerpiece of all this later applied work, also known as the reverse mode of automatic differentiation—was published by Linnainmaa^{[BP1,4][R7]} (with a "clean derivation," of course).
LeCun has never cited this—not even in
recent work.^{[DL3,DL3a][DLC]}
In 1960, Kelley already had a precursor of the algorithm.^{[BPA]} Furthermore, many
besides LeCun have worked "to speed up backpropagation algorithms"^{[DL1]} (ACM's wording). More on the history of backpropagation can be found at Scholarpedia^{[DL2]}^{[BP4]} and in my annotated history of modern AI and deep learning (2022).^{[DLH]}
XX. ACM:
Broadening the vision of neural networks: LeCun is also credited with developing a broader vision for neural networks as a computational model for a wide range of tasks, introducing in early work a number of concepts now fundamental in AI. For example, in the context of recognizing images, he studied how hierarchical feature representation can be learned in neural networks—a concept that is now routinely used in many recognition tasks.
Comment:
However, "hierarchical feature representation" in deep learning networks is what Ivakhnenko & Lapa (1965)^{[DEEP12]}
and Amari^{[GD12]}
(and also Fukushima^{[CNN1][DL2]}) had long before LeCun.
ACM may have been misled by the fact that LeCun has never cited Ivakhnenko—not even in his later survey.^{[DL3][DLC]}
See Sec. D &
II &
XIII &
V.
XXI. ACM:
Together with Leon Bottou, he proposed the idea, used in every modern deep learning software, that learning systems can be built as complex networks of modules where backpropagation is performed through automatic differentiation. They also proposed deep learning architectures that can manipulate structured data, such as graphs.
Comment:
What does ACM mean by "modules"? Neuron-like elements? Bigger modules? Anyway,
LeCun et al. neither cited the origins^{[BP1]} (1970) of this
widely used type of automatic differentiation for differentiable networks of modules^{[DL2][BP45][DLC]}
nor a computer program (1980) for automatically deriving and implementing backpropagation
for such systems.^{[S80]} See also
Sec. XIX & XII.
And "deep learning architectures that can manipulate structured data, such as graphs" were
proposed by Sperduti, Goller, and Küchler in the 1990s^{[SP9397][GOL][KU]}
before LeCun who did not cite them. See also Pollack's even earlier relevant work;^{[PO8790]} compare the important work of Baldi and colleagues.^{[BA9603]}
(Furthermore, "complex networks of modules where backpropagation is performed" were the central theme of my much earlier habilitation thesis (1993).^{[UN2]} For example, our
adaptive subgoal generators
(1991)^{[HRL02]} were trained through end-to-end-differentiable chains of such modules.^{[MIR](Sec. 10)}
Same for my
planning and reinforcement learning with recurrent neural world models
(1990).^{[PLAN][MIR](Sec. 11)} Same for my
fast weight programmers^{[FWP02][FWP][ATT][MIR](Sec. 8)} since 1991 (see Sec. XVI)
consisting of chains of several modules—the outer product-based version is now known as a
Transformer with linearized self-attention.^{[FWP02,6][TR56]})
In the hard sciences, the only things that count are the facts. Science is not democratic. If 100 persons claim one thing, and only one person claims the opposite, but he/she can back it up through facts, then he/she wins. If you haven't already read it, see "100 Authors against Einstein."^{[AH1]}
The deontology of science enforces proper scientific standards and behavior when it comes to identifying prior art and assigning credit.
Unlike politics, science is immune to ad hominem attacks^{[AH23][HIN]} in the style of the motto: "If you cannot dispute a fact-based message, attack the messenger himself."^{[HIN]}
Science has a wellestablished way of dealing with plagiarism (which may be unintentional^{[PLAG1][CONN21]} or not^{[FAKE2]}) and priority disputes, based on facts such as time stamps of publications and patents. Sometimes it may take a while to settle disputes, but in the end, the facts must always win.
As long as the facts have not yet won it's not yet the end. No fancy award can ever change that.^{[HIN]}
Dr. Hinton, Dr. LeCun, Dr. Bengio,
and their coworkers have contributed useful improvements of deep learning methods.^{[CNN2,4][CDI][LAN][RMSP][XAV][ATT14][CAPS]}
But their most visible work (praised by ACM) mainly helped to
popularize methods created by other researchers
whom they did not cite, in contrast to
ACM's Code of Ethics and Professional Conduct^{[ACM18]}
(see, e.g., Sec.
II,
V,
XII,
XIX,
XXI,
XIII,
XIV,
XI, and
XX, and 2).
My lab is especially affected by ACM's misleading statements (see, e.g.,
Sec. I, A, B, C, D, XVII, VI, and XVI).
As emphasized earlier:^{[DLC][HIN]}
"The inventor of an important method should get credit for inventing it. They may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it—but not for inventing it."
If one "reinvents" something that was already known,
and only becomes aware of it later,
one must at least clarify it later,
and correctly give credit
in followup papers and presentations.
It is a sign of our field's immaturity that popularizers
are sometimes still credited for the creations of other researchers whom they
ignored.
Of course,
ACM (or anyone for that matter) is free to hand out awards to anybody,
but one should not decorate anybody for work based on unmentioned contributions of others.
To fulfill its mandate,
ACM should revise its statements so that it can preserve the reputation of the Turing award
and its significance to computer science—otherwise others will. The same holds for scientific journals, which "need to make clearer and firmer commitments to self-correction,"^{[SV20]} as is already the standard in other scientific fields.
Could it be that seemingly unbiased award committees are actually affected by PR efforts
in popular science venues without peer review? For example, the narrator of a popular 2018 Bloomberg video^{[VID2]} thanks Hinton for speech recognition and machine translation, although both were actually done (at the time the video was produced) on billions of smartphones by deep learning methods developed in my labs in Germany and Switzerland (LSTM & CTC; see Sec. A) long before Hinton's methods. Similarly, in 2016, the NY Times published an article^{[NYT3]} about the new, greatly improved, LSTM-based Google Translate without even mentioning our LSTM (instead featuring Hinton who had little to do with it), although
Google's original 2016 paper
on Google Translate^{[WU]} mentions LSTM over 50 times (see Sec. B).
In ad hominem style,^{[AH23]}
LeCun stated in the NY Times that "Jürgen ... keeps
claiming credit he doesn't deserve for many, many things",^{[NYT1]} without
providing a single example.
LeCun also called the GANs of Bengio's team^{[GAN1]}
"the coolest idea in machine learning in the last twenty years" without mentioning
that
GANs are variations
of my work in 1990.^{[AC90,90b][AC20][R2]} According to Bloomberg,^{[AV2]} Bengio has simply "denied my claims" without
backing up his denial by any facts; see Sec. XVII.
It has been requested that
"scientists must be willing to speak out when they see false information being presented in social media, traditional print or broadcast press" and "must speak out against false information and fake science in circulation
and forcefully contradict public figures who promote it."^{[FAKE]}
LBH, who called themselves the deep learning conspiracy,^{[DLC][DLC12]}
have cited
and otherwise supported each other through interviews and other PR
at the expense of the true pioneers.
Apparently this has earned them many citations, which
is just another sign that citation counts
are poor indicators of truly pioneering work—see Sec. XVII. As I pointed out in Nature (2011):^{[NAT1]}
like the less-than-worthless collateralized debt obligations that drove the 2008 financial bubble, citations are easy to print and inflate, providing an incentive for professors to maximize citation counts instead of scientific progress—witness how relatively unknown scientists can now collect more citations than the most influential founders of their fields.
In fact, many of my critical comments above do address highly cited work.
Our LSTM paper^{[LSTM1]} has received more citations
than any paper by Bengio or LeCun,^{[R5]}
and more per year than any other computer science paper of the 20th century.
Hinton's most cited paper (2012) is the one on GPU-based CNNs.^{[GPUCNN4][R5]} It follows our earlier work on supervised
deep NNs (2010)^{[MLP1]}
(which abandoned the
unsupervised pretraining for deep NNs
introduced
by myself
^{[UN][UN03]} and later
championed by Hinton;^{[UN4][VID1]} see Sec. D).
Hinton (2012)^{[GPUCNN4]} characterizes
our deep and fast DanNet (2011)^{[GPUCNN13]} as
"somewhat similar"—DanNet won 4 computer vision contests before Hinton's
AlexNet won one;^{[R6]}
see Sec. D, XIV.
The highly cited VGG network (2014)^{[GPUCNN9]}
further extended our work.
Hinton's 2nd most cited paper^{[RUM][R5]}
is the one on experiments with backpropagation (note that in 2019
his Google Scholar page greatly exaggerated the citation count
of Hinton's paper,^{[RUM]} adding citations
for a book by Rumelhart & McClelland^{[R5]}).
Backpropagation is an earlier invention of others,^{[BP1]} whose origins Hinton did not cite—not even in later surveys;^{[R7]} see Sec. XII.
His nets learned internal representations two decades after the nets
of Ivakhnenko whom he has never cited;^{[DEEP12][R7R8]} see Sec. II, XIII.
Bengio's 2nd most cited research paper is the one on GANs (2014),^{[GAN1]} which are instances of my
artificial curiosity
(1990)^{[AC90,90b][AC20][R2]} which he did not cite;
see Sec. XVII.
As of 2021, the most cited machine learning paper
is the one on ResNet (2015),^{[HW2][R5]} a version of our earlier Highway Net,^{[HW13]} which was the first working feedforward NN with over 100 layers—see Sec. D (in fact, ResNets are just Highway Nets whose gates are initialized such that they remain always open).
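To illustrate the relation claimed here, a minimal sketch (toy Python with hypothetical shapes, not taken from the cited papers) of a gated Highway layer next to a plain residual layer: with both gates fixed fully open, what remains is the ungated sum of transform and input, i.e. a ResNet-style layer.

```python
# Minimal sketch (toy code, biases omitted for brevity, not from the cited papers):
# a Highway layer with a transform gate t and a carry gate c (a common variant
# couples them as c = 1 - t), and the ungated residual special case.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def highway_layer(x, W_h, W_t, W_c):
    """Gated layer: t scales the candidate transform h(x), c scales the carried input x."""
    h = np.tanh(W_h @ x)      # candidate transform
    t = sigmoid(W_t @ x)      # transform gate
    c = sigmoid(W_c @ x)      # carry gate
    return t * h + c * x

def residual_layer(x, W_h):
    """Special case with both gates fixed fully open (t = c = 1): y = h(x) + x."""
    return np.tanh(W_h @ x) + x

# Hypothetical usage with a 4-dimensional state.
rng = np.random.default_rng(0)
d = 4
x = rng.standard_normal(d)
W_h, W_t, W_c = (rng.standard_normal((d, d)) for _ in range(3))
y_highway = highway_layer(x, W_h, W_t, W_c)
y_residual = residual_layer(x, W_h)
```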
Hinton's highly cited papers on unsupervised pretraining for deep NNs (2006)^{[UN4]}
were preceded
by ours^{[UN02][UN]}
(1991) by 15 years, but he did not cite them^{[UN02][R4][HIN](Sec. II)}—see Sec. II & III.
His papers/patents on dropout and rectified neurons
were preceded by Hanson's^{[Drop14]}
and Fukushima's^{[RELU12]} by decades, but he did not cite them—see Sec. XIV.
As recently as of 2021, ACM published yet another misleading deep learning "survey" by LBH,^{[DL3a]} again heavily citing LBH without
correcting the previous omissions.
Consult the Executive Summary and Sec. IXXI of this critique for more.
So virtually all the algorithms that have attracted
many citations in the recent deep learning revolution
have their conceptual and technical roots in my labs in Munich and Lugano,^{[MOST]}
apart from the old basic principles
of deep learning MLPs since 1965^{[DEEP12][GD12a]} (see Sec. II, XX)
and backpropagation (1960-70)^{[BPA][BP1]} (see Sec. XIX, XII)
and convolutional NNs since 1979^{[CNN14]} (see Sec. XVIII, D).
Here is an overview of our relevant work compressed into a few lines that link to
subsections of the present article, where "A→B" indicates that A conceptually led to B:
Our LSTM
(1990s, see Sec. A, B; also for RL, 2003, see Sec. C)
→ our Highway Net (May 2015) → ResNet (Dec 2015, see Sec. D).
Our adversarial Artificial Curiosity (1990) → GANs (2010s, see Sec. XVII).
We abandoned
our own unsupervised pretraining of deep NNs
(1991, see Sec. II & III)
for recurrent NNs in the 1990s → our LSTM (see Sec. AC) and
for feedforward NNs in 2010 → our DanNet (2011) → AlexNet (2012); VGG Net (2014) (see Sec. D).
As mentioned earlier,
our LSTM
brought essentially unlimited depth to gradient-based supervised recurrent NNs in the 1990s; our Highway Nets^{[HW13]} brought it to feedforward NNs in May 2015.^{[MOST]}
Even earlier, our DanNet brought
superior computer vision (2011, see Sec. D, XVIII),
medical diagnosis (2012, see Sec. VII, XVIII), and many other applications.^{[DEC]}
Our LSTM brought superior
speech recognition (with our CTC, 2007-15, see Sec. A),
machine translation (2016, see Sec. B),
robotics & video game players (2018-19, see Sec. C),
and many other applications.^{[DEC]}
Our outer product-based
Fast Weight Programmers (1991, see Sec. XVI) are formally equivalent to Transformers (now popular in NLP) with linearized self-attention.^{[FWP,FWP02,6][ATT]}
In fact, our methods and conceptual foundations shaped most
of the application areas mentioned by ACM—see, e.g.,
Sec.
I, A, B, C, D, VII, XVIII.
As mentioned earlier,^{[MIR](Sec. 21)}
when only consulting surveys from the Anglosphere,
it is not always clear^{[DLC]}
that Deep Learning was first conceived outside of it. It started in 1965 in the Ukraine (back then the USSR) with the first nets of arbitrary depth that really learned.^{[DEEP12][R8]}
Soon afterwards, multilayer perceptrons learned internal representations through stochastic gradient descent in Japan.^{[GD12a]} A few years later, modern
backpropagation
was published in Finland (1970).^{[BP1]} The basic deep convolutional NN architecture (now widely used) was invented in the 1970s in Japan^{[CNN1]} where NNs with convolutions were later (1987) also combined with "weight sharing" and backpropagation.^{[CNN1a]} We are standing on the shoulders of these authors and many others—see 888 references in the survey.^{[DL1]}
Our own work since the 1980s mostly took place in Germany and Switzerland.
Unfortunately, LBH's frequent failures to credit essential prior work by others
cannot serve as a role model for PhD students who are told by their advisors
to perform meticulous research on prior art, and to avoid at all costs
the slightest hint of plagiarism, be it
unintentional^{[PLAG1][CONN21]} or intentional.^{[FAKE2]}
It is worrisome that the 2018 Turing award seems to reward LBH for this behavior.
Yes, this critique is also an implicit critique of certain other awards to LBH.^{[HIN]}
It is also related to some of the most popular posts and comments of 2019 at
reddit.com/r/MachineLearning^{[R1R12]} (the largest machine learning forum, with over 800k subscribers at the time),
many of them influenced by my overview.^{[MIR]}
Dr. LeCun himself is well aware of the challenges to scientific integrity in our field and lists potential rewards of academic corruption:^{[LECP]} "... citing an obscure paper, rather than an accepted paper by a prominent author is dangerous, and has zero benefits.
Sure, author A might be upset, but who cares about upsetting some guy from the university of Oriental Syldavia that you will never have to confront at a conference and who will never be asked to write a letter for your tenure case? On the other hand, author B might be asked to write a review for your next paper, your next grant application, or your tenure case. So, voicing the fact that he doesn't deserve all the credit for the idea is very dangerous. Hence, you don't cite what's right. You cite what everybody else cites."^{[LECP]}
This sounds like the ancient advice: "eat dung—a billion flies can't be wrong!" See Note 4 above.
The blatant misattribution in the field of deep learning may already have inspired others. For example, around 1960, Rosenblatt not only had linear NNs plus threshold functions, he also had much more interesting MLPs with a non-learning first layer with randomized weights and an adaptive output layer.^{[R62]} So Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs)^{[ELM1]}
without proper attribution. The
revisionist narrative of ELMs^{[ELM2][CONN21]}
is a bit like the revisionist narrative of deep learning criticized by the present report. The "ELM conspiracy" apparently feels it can get away with outrageous improper credit assignment, just like the self-proclaimed "deep learning conspiracy"^{[DLC12]} seems to get away with it on an even grander scale. What an embarrassing lack of maturity of our field. ACM's Turing award for LBH may already have encouraged other machine learning researchers to follow in their footsteps and conduct what can only be described as bad science.
Note that I am insisting on proper credit assignment not only in my own research field but also in quite disconnected areas,^{[HIN]} as demonstrated by my numerous letters in this regard published in Science and Nature, e.g., on the history of aviation,^{[NASC12]} the telephone,^{[NASC3]} the computer,^{[NASC47]} resilient robots,^{[NASC8]} and scientists of the 19th century.^{[NASC9]}
As Elvis Presley put it, "Truth is like the sun. You can shut it out for a time, but it ain't goin' away." It is fun to speculate how future supersmart
AI scientists and AI historians
equipped with artificial curiosity^{[SA17][AC90AC20][PPPP2][R1]}
will be fascinated by their own roots, and how
they
will rummage through all available data (old papers, email messages, videos, etc) to fully understand every little detail of their origins in human civilization. However, today's scientists won't have to wait for AI historians to establish proper credit assignment. It is easy enough to do the right thing right now.
6. Acknowledgments
Thanks to many expert reviewers (including several famous neural net pioneers) for useful comments. Since science is about self-correction, let me know at juergen@idsia.ch if you can spot any remaining error. Many additional relevant publications can be found in my
publication page and my
arXiv page. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
333+ References
[25y97]
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long ShortTerm Memory, the
most cited neural network (NN) of the 20th century
(and basis of the most cited NN of the 21st).
2. First paper on physical, philosophical and theological consequences of the simplest and fastest way of computing
all possible metaverses
(= computable universes).
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Journal paper on
metareinforcement learning.
5. Journal paper on hierarchical Q-learning.
6. First paper on reinforcement learning to play soccer: start of a series.
7. Journal papers on flat minima & low-complexity NNs that generalize well.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
9. Journal paper on probabilistic incremental program evolution.
[AC]
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Schmidhuber's artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.
[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
where a generator NN is fighting a predictor NN in a minimax game
(more).
[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
modelbuilding neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
More.
[AC91]
J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Inst. f. Informatik, Tech. Univ. Munich, April 1991.
PDF.
[AC91b]
J. Schmidhuber.
Curious modelbuilding control systems.
Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
PDF.
[AC97]
J. Schmidhuber.
What's interesting?
Technical Report IDSIA-35-97, IDSIA, July 1997.
Focus
on automatic creation of predictable internal
abstractions of complex spatiotemporal events:
two competing, intrinsically motivated agents agree on essentially
arbitrary algorithmic experiments and bet
on their possibly surprising (not yet predictable)
outcomes in zero-sum games,
each agent potentially profiting from outwitting / surprising
the other by inventing experimental protocols where both
modules disagree on the predicted outcome. The focus is on exploring
the space of general algorithms (as opposed to
traditional simple mappings from inputs to
outputs); the
general system
focuses on the interesting
things by losing interest in both predictable and
unpredictable aspects of the world. Unlike Schmidhuber et al.'s previous
systems with intrinsic motivation,^{[AC90AC95]} the system also
takes into account
the computational cost of learning new skills, learning when to learn and what to learn.
See later publications.^{[AC99][AC02]}
[AC99]
J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z.
Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999.
[AC02]
J. Schmidhuber.
Exploring the Predictable.
In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
PDF.
[AC06]
J. Schmidhuber.
Developmental Robotics,
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2): 173-187, 2006.
PDF.
[AC09]
J. Schmidhuber. Art & science as byproducts of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.
PDF. (More on
artificial scientists and artists.)
[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
With a brief summary of the generative adversarial neural networks of 1990^{[AC90,90b][AC20]}
where a generator NN is fighting a predictor NN in a minimax game (more).
[AC20]
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p. 58-66, 2020.
Preprint arXiv/1906.04493.
[ACM18]
ACM Code of Ethics and Professional Conduct. Association for Computing Machinery (ACM), 2018. Quote: "Computing professionals should therefore credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works."
[AH1]
Hentschel K. (1996) A. v. Brunn: Review of "100 Authors against Einstein" [March 13, 1931]. In: Hentschel K. (eds) Physics and National Socialism. Science Networks—Historical Studies, vol 18. Birkhaeuser Basel.
Link.
[AH2]
F. H. van Eemeren, B. Garssen & B. Meuffels.
The disguised abusive ad hominem empirically investigated: Strategic manoeuvring with direct personal attacks.
Journal Thinking & Reasoning, Vol. 18, 2012, Issue 3, p. 344-364.
Link.
[AH3]
D. Walton (PhD Univ. Toronto, 1972), 1998. Ad hominem arguments. University of Alabama Press.
[AIB] J. Schmidhuber. AI Blog.
Includes variants of chapters of the AI Book.
[AM16]
Blog of Werner Vogels, CTO of Amazon (Nov 2016):
Amazon's Alexa
"takes advantage of bidirectional long shortterm memory (LSTM) networks using a massive amount of data to train models that convert letters to sounds and predict the intonation contour. This technology enables high naturalness, consistent intonation, and accurate processing of texts."
[AMH0]
S. I. Amari (1972).
Characteristics of random nets of analog neuron-like elements. IEEE Trans. Syst. Man Cybernetics, 2, 643-657. First published 1969 in Japanese, long before Wilson & Cowan's very similar work (1972-73).
[AMH1]
S. I. Amari (1972).
Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions, C-21, 1197-1206, 1972.
PDF.
First publication of what was later sometimes called the Hopfield network^{[AMH2]} or Amari-Hopfield Network,^{[AMH3]} based on the (uncited) Lenz-Ising recurrent architecture.^{[L20][I25][T22]}
[AMH1b]
W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19.12, p. 101-120, 1974.
Mentions the recurrent Ising model^{[L20][I25]} on which the (uncited) Amari network^{[AMH1,2]} is based.
[AMH2]
J. J. Hopfield (1982). Neural networks and physical systems with emergent
collective computational abilities. Proc. of the National Academy of Sciences,
vol. 79, pages 2554-2558, 1982.
The Hopfield network or Amari-Hopfield Network was first published in 1972 by Amari.^{[AMH1]} [AMH2] did not cite [AMH1].
[AMH3]
A. P. Millan, J. J. Torres, J. Marro.
How Memory Conforms to Brain Development.
Front. Comput. Neuroscience, 2019
[AOI] M. Ford. Architects of Intelligence: The truth about AI from the people building it. Packt Publishing, 2018.
Preface to German edition by J. Schmidhuber.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention (1990) and soft attention (1991-93).^{[FWP]} Today, both types are very popular.
[ATT0] J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990.
PDF.
More.
[ATT2]
J. Schmidhuber.
Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton,
editors, Proc. of the 1990 Connectionist Models Summer School, pages
52-61. San Mateo, CA: Morgan Kaufmann, 1990.
PS. (PDF.)
[ATT3]
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT02] which the authors did not cite.
In fact, Hinton was the reviewer of a 1990 paper^{[ATT2]} which summarised in its Section 5 Schmidhuber's early work on attention: the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component
with an attentional component (the fixation controller).
Two decades later, Hinton wrote about
his own work:^{[ATT3]}
"To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)."
See [MIR](Sec. 9)[R4].
[ATT14]
D. Bahdanau, K. Cho, Y. Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. 2014-16.
Preprint
arXiv/1409.0473, 2014-16.
This work on soft "attention" did not cite Schmidhuber's much earlier original work of 1991-1993 on soft attention and Transformers with linearized self-attention.^{[FWP,FWP02,6][ATT]}
[AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This Man Is the Godfather the AI Community Wants to Forget. Business Week,
Bloomberg, May 15, 2018.
[AV2] A. Vance. Apple and Its Rivals Bet Their Futures on These Men's Dreams. Business Week,
Bloomberg, May 17, 2018.
[BA93]
P. Baldi and Y. Chauvin. Neural Networks for Fingerprint Recognition, Neural Computation, Vol. 5, 3, 402-418, (1993).
First application of CNNs with backpropagation to biomedical/biometric images.
[BA96]
P. Baldi and Y. Chauvin. Hybrid Modeling, HMM/NN Architectures, and Protein Applications, Neural Computation, Vol. 8, 7, 1541-1565, (1996).
One of the first papers on graph neural networks.
[BA99]
P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri. Exploiting the Past and the Future in Protein Secondary Structure Prediction, Bioinformatics, Vol. 15, 11, 937-946, (1999).
[BA03]
P. Baldi and G. Pollastri. The Principled Design of Large-Scale Recursive Neural Network Architectures: DAG-RNNs and the Protein Structure Prediction Problem. Journal of Machine Learning Research, 4, 575-602, (2003).
[BAU]
F. L. Bauer, H. Woessner (1972). The "Plankalkül" of Konrad Zuse: A Forerunner of Today's Programming Languages.
[BB2]
J. Schmidhuber.
A local learning algorithm for dynamic feedforward and
recurrent networks.
Connection Science, 1(4):403-412, 1989.
(The Neural Bucket Brigade—figures omitted!).
PDF.
HTML.
Compare TR FKI-124-90, TUM, 1990.
PDF.
[BIB3]
W. Bibel (2003).
Mosaiksteine einer Wissenschaft vom Geiste. Invited talk at
the conference on AI and Gödel, Arnoldsheim, 4-6 April 2003.
Manuscript, 2003.
[BM]
D. Ackley, G. Hinton, T. Sejnowski (1985). A Learning Algorithm for Boltzmann Machines. Cognitive Science, 9(1):147-169.
This paper neither cited relevant prior work
by Sherrington & Kirkpatrick^{[SK75]} & Glauber^{[G63]} nor the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP12][HIN]} nor
Amari's work (1967-68)^{[GD12]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[S20][DLC]} failed to cite the prior art.^{[T22]}
[BOU] H Bourlard, N Morgan (1993). Connectionist speech recognition. Kluwer, 1993.
[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
Precursor of modern backpropagation.^{[BP14]}
[BPB]
A. E. Bryson. A gradient method for optimizing multistage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
PDF.
See also BIT 16, 146-160, 1976.
Link.
The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389400.
[BPTT1]
P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78.10, 1550-1560, 1990.
[BPTT2]
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks. In: Backpropagation: Theory, architectures, and applications, p 433, 1995.
[BRI] Bridle, J.S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden Markov Model Interpretation, Speech Communication, vol. 9, no. 1, pp. 83-92.
[BW] H. Bourlard, C. J. Wellekens (1989).
Links between Markov models and multilayer perceptrons. NIPS 1989, p. 502-510.
[CAPS]
S. Sabour, N. Frosst, G. E. Hinton (2017).
Dynamic routing between capsules. Proc. NIPS 2017, pp. 3856-3866.
[CDI]
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation 14.8 (2002): 1771-1800.
[CHU]
A. Church (1935). An unsolvable problem of elementary number theory. Bulletin of the American Mathematical Society, 41: 332-333. Abstract of a talk given on 19 April 1935, to the American Mathematical Society.
Also in American Journal of Mathematics, 58(2), 345-363 (1 Apr 1936).
First explicit proof that the Entscheidungsproblem (decision problem) does not have a general solution.
[CNN1] K. Fukushima: Neural network model for a mechanism of pattern
recognition unaffected by shift in position—Neocognitron.
Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia.
[CNN1+]
K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980).
Link.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing
to a convolutional architecture.
[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CNN1a].
[CNN1c] Bower Award Ceremony 2021:
Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021.
[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
PDF.
[CNN3a]
K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990.
An NN with convolutions using MaxPooling instead of Fukushima's
Spatial Averaging.^{[CNN1]}
[CNN3] Weng, J.,
Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3D objects from 2D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A CNN whose downsampling layers use MaxPooling
(which has become very popular) instead of Fukushima's
Spatial Averaging.^{[CNN1]}
[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007
[CNN5a]
S. Behnke. Learning iterative image reconstruction in the neural abstraction pyramid. International Journal of Computational Intelligence and Applications, 1(4):427-438, 1999.
[CNN5b]
S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
[CNN5c]
D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92-101, 2010.
[CO1]
J. Koutnik, F. Gomez, J. Schmidhuber (2010). Evolving Neural Networks in Compressed Weight Space. Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO2010), Portland, 2010.
PDF.
[CO2]
J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez.
Evolving LargeScale Neural Networks for VisionBased Reinforcement Learning.
Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), Amsterdam, July 2013.
PDF.
The first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, without any unsupervised pretraining.
[CO3]
R. K. Srivastava, J. Schmidhuber, F. Gomez.
Generalized Compressed Network Search.
Proc. GECCO 2012.
PDF.
[CONN21]
Since November 2021: Comments on version 1 of the present report^{[T21v1]}
in the Connectionists Mailing List, perhaps the oldest mailing list on artificial neural networks. Link to the archive.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006.
PDF.
[CUB0]
R. J. Williams.
Complexity of exact gradient computation algorithms for recurrent
neural networks. Technical Report NU-CCS-89-27, Northeastern University,
College of Computer Science, 1989.
[CW]
J. Koutnik, K. Greff, F. Gomez, J. Schmidhuber. A Clockwork RNN. Proc. 31st International Conference on Machine Learning (ICML), p. 1845-1853, Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN]
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after my outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.^{[DAN1]}
Now everybody is using this approach.
[DAN1]
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
At the IJCNN 2011 computer vision competition in Silicon Valley,
our artificial neural network called DanNet performed twice as well as humans, three times as well as the closest artificial competitor (by LeCun's team), and six times as well as the best non-neural method.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.
[DEEP1a]
Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[DIST2]
O. Vinyals, J. A. Dean, G. E. Hinton.
Distilling the Knowledge in a Neural Network.
Preprint arXiv:1503.02531 [stat.ML], 2015.
The authors did not cite Schmidhuber's original
1991 NN distillation procedure,^{[UN0-2][MIR](Sec. 2)}
not even in the later patent application US20150356461A1.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.
[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.
[DL3] Y. LeCun, Y. Bengio, G. Hinton (2015). Deep Learning. Nature 521, 436444.
HTML.
A "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL3a] Y. Bengio, Y. LeCun, G. Hinton (2021). Turing Lecture: Deep Learning for AI. Communications of the ACM, July 2021. HTML.
Local copy (HTML only).
Another "survey" of deep learning that does not mention the pioneering works of deep learning [T22].
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in my labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and QuickType on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019
ondevice speech recognition
(on the phone, not the server)
is still based on
LSTM.
[DL6]
F. Gomez and J. Schmidhuber.
Coevolving recurrent neurons learn deep memory POMDPs.
In Proc. GECCO'05, Washington, D. C.,
pp. 1795-1802, ACM Press, New York, NY, USA, 2005.
PDF.
[DL6a]
J. Schmidhuber (AI Blog, Nov 2020). 15-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.^{[DL6]} Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[DL7]
"Deep Learning ... moving beyond shallow machine learning since 2006!"
Web site deeplearning.net of Y. Bengio's MILA (2015, retrieved May 2020; compare the version in the
Internet Archive),
referring to Hinton's^{[UN4]} and Bengio's^{[UN5]}
unsupervised pretraining for deep NNs^{[UN4]} (2006) although
this type of deep learning dates back to Schmidhuber's work of 1991.^{[UN12][UN]}
Compare
Sec.
II &
XVII &
III.
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p. 436).
The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it).
[DLC1]
Y. LeCun. IEEE Spectrum Interview by L. Gomes, Feb 2015.
Quote: "A lot of us involved in the resurgence of Deep Learning in the mid2000s, including Geoff Hinton, Yoshua Bengio, and myself—the socalled 'Deep Learning conspiracy' ..."
[DLC2]
M. Bergen, K. Wagner (2015).
Welcome to the AI Conspiracy: The 'Canadian Mafia' Behind Tech's Latest Craze. Vox recode, 15 July 2015.
Quote: "... referred to themselves as the 'deep learning conspiracy.' Others called them the 'Canadian Mafia.'"
[DLH]
J. Schmidhuber (AI Blog, 2022).
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA2222, IDSIA, Lugano, Switzerland, 2022.
Preprint arXiv:2212.11279.
Tweet of 2022.
[DM1]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with Deep Reinforcement Learning. Tech Report, 19 Dec. 2013,
arxiv:1312.5602.
[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Humanlevel control through deep reinforcement learning. Nature, vol. 518, p. 529-533, 26 Feb. 2015.
Link.
DeepMind's first famous paper. Its abstract claims: "While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces." It also claims to bridge "the divide between high-dimensional sensory inputs and actions." Similarly, the first sentence of the abstract of the earlier tech report version^{[DM1]} of [DM2] claims to "present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning."
However, the first such system (requiring no unsupervised pretraining) was created earlier by Jan Koutnik et al. in Schmidhuber's lab.^{[CO2]}
DeepMind was co-founded by Shane Legg, a PhD student from this lab; he and Daan Wierstra (another PhD student of Schmidhuber's and DeepMind's first employee) were the first people at DeepMind with AI publications and PhDs in computer science. More.
[DM3]
S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019.
AlphaStar has a "deep LSTM core."
[DM4]
J. Jumper, R. Evans, A. Pritzel, T. Green, M. Figurnov, O. Ronneberger, K. Tunyasuvunakool, R. Bates, A. Zidek, A. Potapenko, A. Bridgland, C. Meyer, S. A. A. Kohl, A. J. Ballard, A. Cowie, B. RomeraParedes, S. Nikolov, R. Jain, J. Adler, T. Back, S. Petersen, D. Reiman, E. Clancy, M. Zielinski, M. Steinegger, M. Pacholska, T. Berghammer, S. Bodenstein, D. Silver, O. Vinyals, A. W. Senior, K. Kavukcuoglu, P. Kohli & D. Hassabis. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583589, 2021.
DeepMind's breakthrough application of deep learning did not cite
Hochreiter et al.'s first successful application [HO07] of deep learning to protein folding (2007).
[DNC]
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. GrabskaBarwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis.
Hybrid computing using a neural network with dynamic external memory.
Nature, 538:7626, p 471, 2016.
This work of DeepMind did not cite the original work of the early 1990s on
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, Physica D, 42, 265-272.
What's now called "dropout" is a variation of the stochastic delta rule—compare preprint
arXiv:1808.03578, 2018.
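To make the claimed relationship concrete, here is a minimal NumPy sketch (the weight shapes, noise levels, and the Gaussian vs. Bernoulli choices are illustrative assumptions, not the exact settings of [Drop1] or [Drop4]): the stochastic delta rule samples each weight from a distribution on every forward pass, and dropout corresponds to the special case of a Bernoulli mask.

```python
# Minimal sketch: stochastic delta rule vs. dropout as stochastic weight/unit noise (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
W_mean = rng.standard_normal((4, 3))   # mean weights
W_std = 0.1 * np.ones((4, 3))          # per-weight noise level (the SDR adapts this)
x = rng.standard_normal(4)

W_sdr = rng.normal(W_mean, W_std)              # stochastic delta rule: Gaussian weight noise
keep = rng.binomial(1, 0.8, W_mean.shape)      # dropout: Bernoulli mask with keep probability 0.8
W_dropout = W_mean * keep / 0.8                # rescaled ("inverted dropout") weights
print(x @ W_sdr)       # one stochastic forward pass under the SDR
print(x @ W_dropout)   # one stochastic forward pass under dropout
```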
[Drop2]
N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-1032.
[Drop3]
J. Hertz, A. Krogh, R. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, California: Addison-Wesley Pub. Co., pp. 45-46.
[Drop4]
N. Frazier-Logue, S. J. Hanson (2018). Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning.
Preprint arXiv:1808.03578, 2018.
[ELM1]
G.B. Huang, Q.Y. Zhu, and C.K. Siew. Extreme learning machine: A new learning scheme of feedforward neural networks. Proc. IEEE Int. Joint Conf. on Neural Networks, Vol. 2, 2004, pp. 985-990. This paper does not mention that the "ELM" concept goes back to Rosenblatt's work around 1960.^{[R62][T22]}
[ELM2]
ELMORIGIN, 2004.
The Official Homepage on Origins of Extreme Learning Machines (ELM).
"Extreme Learning Machine Duplicates Others' Papers from 19882007."
Local copy.
This overview does not mention that the "ELM" concept goes back to Rosenblatt's work around 1960.^{[R62][T22]}
[FAKE]
H. Hopf, A. Krief, G. Mehta, S. A. Matlin.
Fake science and the knowledge crisis: ignorance can be fatal.
Royal Society Open Science, May 2019.
Quote: "Scientists must be willing to speak out when they see false information being presented in social media, traditional print or broadcast press" and "must speak out against false information and fake science in circulation
and forcefully contradict public figures who promote it."
[FAKE2]
L. Stenflo.
Intelligent plagiarists are the most dangerous. Nature, vol. 427, p. 777 (Feb 2004).
Quote: "What is worse, in my opinion, ..., are cases where scientists rewrite previous findings in different words, purposely hiding the sources of their ideas, and then during subsequent years forcefully claim that they have discovered new phenomena.
[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie,
Max-Planck-Institut f. Biophysik und Chemie, Goettingen, 1981.
First paper on fast weights or dynamic links.
[FASTa]
J. A. Feldman. Dynamic connections in neural networks.
Biological Cybernetics, 46(1):27-39, 1982.
2nd paper on fast weights.
[FASTb]
G. E. Hinton, D. C. Plaut. Using fast weights to deblur old memories. Proc. 9th annual conference of the Cognitive Science Society (pp. 177-186), 1987.
3rd paper on fast weights (two types of weights with different learning rates).
[FB17]
By 2017, Facebook
used LSTM
to handle
over 4 billion automatic translations per day (The Verge, August 4, 2017);
see also
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FM]
S. Hochreiter and J. Schmidhuber.
Flat minimum search finds simple nets.
Technical Report FKI-200-94, Fakultät für Informatik,
Technische Universität München, December 1994.
PDF.
[FWP]
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
30-year anniversary of a now popular
alternative^{[FWP01]} to recurrent NNs.
A slow feedforward NN learns by gradient descent to program the changes of
the fast weights^{[FAST,FASTa]} of
another NN, separating memory and control like in traditional computers.
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
and softmax and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
Transformers with linearized self-attention^{[TR5-6]}
which are formally equivalent to Schmidhuber's 1991 outer product-based Fast Weight Programmers (apart from normalization).
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
extended the approach to
RNNs that program themselves.
See tweet of 2022.
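As a concrete illustration of the outer-product fast weight update described above, here is a minimal NumPy sketch (the dimensionality, the random stand-ins for the slow net's outputs, and the variable names are assumptions for illustration only, not code from the cited papers):

```python
# Minimal sketch of a fast weight memory programmed by additive outer products:
# a slow net would emit key/value patterns; random vectors stand in for them here.
import numpy as np

d = 4
W_fast = np.zeros((d, d))           # fast weights acting as memory
rng = np.random.default_rng(0)
for _ in range(3):                  # a few time steps
    key = rng.standard_normal(d)    # in the real model, produced by the slow net
    value = rng.standard_normal(d)
    W_fast += np.outer(value, key)  # additive outer-product weight change
query = rng.standard_normal(d)
retrieved = W_fast @ query          # the fast net applies its programmed weights
print(retrieved.shape)              # (4,)
```

This additive key/value association is the unnormalized update that the equivalence to Transformers with linearized self-attention (apart from normalization) refers to.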
[FWP0]
J. Schmidhuber.
Learning to control fastweight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
PDF.
First paper on fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as a "Transformer with linearized self-attention."^{[FWP]}
[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0].
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30year anniversary.
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
PDF.
First recurrent NN-based fast weight programmer using outer products, introducing the terminology of learning "internal spotlights of attention."
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for OnTheFly Neural Program Generation. Workshop on MetaLearning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143. PDF.
[FWP4a] J. Ba, G. Hinton, V. Mnih, J. Z. Leibo, C. Ionescu. Using Fast Weights to Attend to the Recent Past. NIPS 2016. PDF. Very similar to [FWP0-2], in both motivation [FWP2] and execution.
[FWP4b]
D. Bahdanau, K. Cho, Y. Bengio (2014).
Neural Machine Translation by Jointly Learning to Align and Translate. Preprint arXiv:1409.0473 [cs.CL].
This work on "attention" did not cite Schmidhuber's much earlier original work of 19911993 on soft attention and Transformers with linearized selfattention.^{[FWP,FWP02,6][ATT]}
[FWP4d]
Y. Tang, D. Nguyen, D. Ha (2020).
Neuroevolution of Self-Interpretable Agents.
Preprint: arXiv:2003.08165.
[FWP5]
F. J. Gomez and J. Schmidhuber.
Evolving modular fast-weight networks for control.
In W. Duch et al. (Eds.):
Proc. ICANN'05,
LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005.
PDF.
HTML overview.
Reinforcementlearning fast weight programmer.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
Preprint: arXiv:2106.06295 (June 2021).
[FWPMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
First recurrent fast weight programmer that can learn
to run a learning algorithm or weight change algorithm on itself.
[FWPMETA2] J. Schmidhuber. A selfreferential weight matrix.
In Proceedings of the International Conference on Artificial
Neural Networks, Amsterdam, pages 446-451. Springer, 1993.
PDF.
[FWPMETA3] J. Schmidhuber.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
Brighton, pages 191-195. IEE, 1993.
[FWPMETA4]
J. Schmidhuber.
A neural network that embeds its own metalevels.
In Proc. of the International Conference on Neural Networks '93,
San Francisco. IEEE, 1993.
[FWPMETA5]
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
A recurrent neural net with a self-referential, self-reading, self-modifying weight matrix
can be found here.
[FWPMETA6]
L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It. Advances in Neural Information Processing Systems (NeurIPS), 2021. Preprint arXiv:2012.14905 [cs.LG], 2020.
[FWPMETA7]
I. Schlag, T. Munkhdalai, J. Schmidhuber.
Learning Associative Inference Using Fast Weight Memory.
To appear at ICLR 2021.
Report arXiv:2011.07831 [cs.AI], 2020.
[FWPMETA8]
K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
A Modern Self-Referential Weight Matrix That Learns to Modify Itself.
International Conference on Machine Learning (ICML), 2022.
Preprint: arXiv:2202.05780.
[FWPMETA9]
L. Kirsch and J. Schmidhuber.
Self-Referential Meta Learning.
First Conference on Automated Machine Learning (Late-Breaking Workshop), 2022.
[G63] R. J. Glauber (1963). Time-dependent statistics of the Ising model.
Journal of Mathematical Physics, 4(2):294-307, 1963.
[GD']
C. Lemarechal. Cauchy and the Gradient Method. Doc Math Extra, pp. 251-254, 2012.
[GD'']
J. Hadamard. Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées. Mémoires présentés par divers savants étrangers à l'Académie des Sciences de l'Institut de France, 33, 1908.
[GDa]
Y. Z. Tsypkin (1966). Adaptation, training and self-organization in automatic control systems,
Avtomatika i Telemekhanika, 27, 23-61.
On gradient descentbased online learning for nonlinear systems.
[GDb]
Y. Z. Tsypkin (1971). Adaptation and Learning in Automatic Systems, Academic Press, 1971.
On gradient descentbased online learning for nonlinear systems.
[GD1]
S. I. Amari (1967).
A theory of adaptive pattern classifiers, IEEE Trans. EC-16, 279-307 (Japanese version published in 1965).
PDF.
Probably the first paper on using stochastic gradient descent^{[STO5152]} for learning in multilayer neural networks
(without specifying the specific gradient descent method now known as reverse mode of automatic differentiation or backpropagation^{[BP1]}).
[GD2]
S. I. Amari (1968).
Information Theory—Geometric Theory of Information, Kyoritsu Publ., 1968 (in Japanese).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Contains computer simulation results for a five layer network (with 2 modifiable layers) which learns internal representations to classify
nonlinearly separable pattern classes.
[GD2a]
H. Saito (1967). Master's thesis, Graduate School of Engineering, Kyushu University, Japan.
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.^{[GD1]} (S. Amari, personal communication, 2021.)
[GD3]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
See Section 3.1 on using gradient descent for learning in multilayer networks.
[GSR]
H. Sak, A. Senior, K. Rao, F. Beaufays, J. Schalkwyk—Google Speech Team.
Google voice search: faster and more accurate.
Google Research Blog, Sep 2015, see also
Aug 2015 Google's speech recognition based on CTC and LSTM.
[GSR15] Dramatic
improvement of Google's speech recognition through LSTM:
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
[GSR19]
Y. He, T. N. Sainath, R. Prabhavalkar, I. McGraw, R. Alvarez, D. Zhao, D. Rybach, A. Kannan, Y. Wu, R. Pang, Q. Liang, D. Bhatia, Y. Shangguan, B. Li, G. Pundak, K. Chai Sim, T. Bagby, S. Chang, K. Rao, A. Gruenstein.
Streaming end-to-end speech recognition for mobile devices. ICASSP 2019, IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2019.
[GT16] Google's
dramatically improved Google Translate of 2016 is based on LSTM, e.g.,
WIRED, Sep 2016,
or
siliconANGLE, Sep 2016
[GAN0]
O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context.
Blog post, Internet Archive, 2010.
A blog post describing the basic ideas^{[AC][AC90, AC90b][AC20]} of GANs.
[GAN1]
I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair,
A. Courville, Y. Bengio.
Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for
Predictability Minimization^{[PM0-2][AC20][T22]}).
[GAN2]
T. Karras, S. Laine, T. Aila. A stylebased generator architecture for generative adversarial
networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages
4401-4410, 2019.
[GOD]
K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931.
In the early 1930s,
Gödel founded theoretical computer science. He identified fundamental limits of mathematics, theorem proving, computing, and Artificial Intelligence.
[GOD34]
K. Gödel (1934).
On undecidable propositions of formal mathematical
systems. Notes by S. C. Kleene and J. B. Rosser on lectures
at the Institute for Advanced Study, Princeton, New Jersey, 1934, 30
pp. (Reprinted in M. Davis, (ed.), The Undecidable. Basic Papers on Undecidable
Propositions, Unsolvable Problems, and Computable Functions,
Raven Press, Hewlett, New York, 1965.)
Gödel introduced a universal coding language.
[GOD56]
R. J. Lipton and K. W. Regan.
Gödel's lost letter and P=NP.
Link.
[GOD86]
K. Gödel.
Collected works Volume I: Publications 192936,
S. Feferman et. al., editors, Oxford Univ. Press, Oxford, 1986.
[GOD21] J. Schmidhuber (2021). 90th anniversary celebrations: 1931: Kurt Gödel, founder of theoretical computer science,
shows limits of math, logic, computing, and artificial intelligence.
This was number 1 on Hacker News.
[GOD21a]
J. Schmidhuber (2021). Als Kurt Gödel die Grenzen des Berechenbaren entdeckte.
(When Kurt Gödel discovered the limits of computability.)
Frankfurter Allgemeine Zeitung, 16/6/2021.
[GOL]
C. Goller & A. Küchler (1996). Learning taskdependent distributed representations by backpropagation through structure. Proceedings of International Conference on Neural Networks (ICNN'96). Vol. 1, p. 347352 IEEE, 1996.
Based on TR AR9502, TU Munich, 1995.
[GPT3]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. HerbertVoss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei.
Language Models are FewShot Learners (2020).
Preprint arXiv/2005.14165.
[GPUNN]
Oh, K.S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20.
[GPUCNN]
K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. Speeding up shallow CNNs on GPU by a factor of 4.
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI2011, Barcelona), 2011. PDF. ArXiv preprint.
Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.
[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
International Joint Conference on Neural Networks (IJCNN2011, San Francisco), 2011.
PDF.
HTML overview.
First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p. 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012.
PDF.
This paper describes AlexNet, which is similar to the earlier
DanNet,^{[DAN,DAN1][R6]}
the first pure deep CNN
to win computer vision contests in 2011^{[GPUCNN2-3,5]} (AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)^{[RELU1]} and dropout (a variant of Hanson's 1990 stochastic delta rule)^{[Drop1-4]} but cites neither the original work^{[RELU1][Drop1]} nor the basic CNN architecture (Fukushima, 1979).^{[CNN1]}
[GPUCNN5]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI11), Google, Mountain View, California, 2011.
PDF.
[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
First deep learner to win a contest on object detection in large images—
first deep learner to win a medical imaging contest (2012). Link.
How the Swiss AI Lab IDSIA used GPUbased CNNs to win the
ICPR 2012 Contest on Mitosis Detection
and the MICCAI 2013 Grand Challenge.
[GPUCNN9]
K. Simonyan, A. Zisserman. Very deep convolutional networks for largescale image recognition. Preprint arXiv:1409.1556 (2014).
[GRO69]
S. Grossberg. Some networks that can learn, remember, and reproduce any number of complicated
space-time patterns, Indiana University Journal of Mathematics and Mechanics, 19:53-91, 1969.
[H86] J. L. van Hemmen (1986). Spin-glass models of a neural network.
Phys. Rev. A 34, 3435, 1 Oct 1986.
[H88]
H. Sompolinsky (1988). Statistical Mechanics of Neural Networks.
Physics Today 41, 12, 70, 1988.
[H90]
W. D. Hillis.
Coevolving parasites improve simulated evolution as an optimization
procedure.
Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.
[HB96]
S. El Hihi, Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. NIPS, 1996.
Bengio claimed^{[YB20]}
that in 1995
he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue"
although
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
[HEL]
P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel.
The Helmholtz machine.
Neural Computation, 7:889904, 1995.
Related to Schmidhuber's Neural Heat Exchanger.^{[NHE]}
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
[HO07]
S. Hochreiter, M. Heusel, K. Obermayer. Fast modelbased protein homology detection without alignment.
Bioinformatics 23(14):1728-36, 2007.
Successful application of deep learning to protein folding problems,
through an LSTM that was orders of magnitude faster than competing methods.
[HRL0]
J. Schmidhuber.
Towards compositional learning with dynamic neural networks.
Technical Report FKI-129-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
An RL machine gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals.
The system learns action plans
at multiple levels of abstraction and multiple time scales, and solves what Y. LeCun called an "open problem" in 2022.^{[LEC]}
[HRL1]
J. Schmidhuber. Learning to generate subgoals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
HTML & images in German.
[HRL2]
J. Schmidhuber and R. Wahnsiedler.
Planning simple trajectories using neural subgoal generators.
In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc.
of the 2nd International Conference on Simulation of Adaptive Behavior,
pages 196-202. MIT Press, 1992.
PDF.
HTML & images in German.
[HRL3]
P. Dayan and G. E. Hinton.
Feudal Reinforcement Learning.
Advances in Neural Information Processing Systems 5, NIPS, 1992.
This work did not cite Schmidhuber's gradient-based subgoal generators for hierarchical reinforcement learning (1990).^{[HRL0-2]}
[HRL4]
M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997.
PDF.
[HRLW]
C. Watkins (1989). Learning from delayed rewards. PhD thesis, King's College, University of Cambridge, 1989.
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
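For illustration, here is a minimal NumPy sketch of one such highway layer (the tanh candidate function h, the sigmoid transform gate t, the coupled carry gate g(x) = 1 - t(x), and the random placeholder weights are illustrative assumptions, not the exact setup of [HW1]):

```python
# Minimal highway-layer sketch (assumptions: NumPy, tanh candidate, sigmoid gate,
# coupled carry gate g(x) = 1 - t(x)); computes g(x)*x + t(x)*h(x) as described above.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    h = np.tanh(x @ W_h + b_h)   # candidate transformation h(x)
    t = sigmoid(x @ W_t + b_t)   # transform gate t(x)
    g = 1.0 - t                  # carry gate g(x)
    return g * x + t * h         # highway output

# Usage with random placeholder weights: 5 examples of width 8.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 8))
W_h, W_t = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
b_h, b_t = np.zeros(8), np.full(8, -1.0)  # negative gate bias initially favors carrying x through
y = highway_layer(x, W_h, b_h, W_t, b_t)
print(y.shape)  # (5, 8)
```

With the gates fixed open (g(x)=t(x)=1), the same computation reduces to the residual layer of [HW2].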
[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 1011, 2015.
Link.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
where the gates are always open:
g(x)=1 (a typical highway net initialization) and t(x)=1.
More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arxiv:1612.07771 (2016). Also at ICLR 2017.
[HYB12]
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling
in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag.,
29(6):82-97.
This work did not cite the earlier LSTM^{[LSTM0-6]} trained by Connectionist Temporal Classification (CTC, 2006).^{[CTC]} CTC-LSTM was successfully applied to speech in 2007^{[LSTM4]} (also with hierarchical LSTM stacks^{[LSTM14]}) and became the first superior end-to-end neural speech recogniser that outperformed the
state of the art, dramatically improving Google's speech recognition.^{[GSR][GSR15][DL4]}
This was very different from previous hybrid methods since the late 1980s which combined NNs and traditional approaches such as hidden Markov models (HMMs).^{[BW][BRI][BOU]} [HYB12] still used the old hybrid approach and did not compare it to CTC-LSTM. Later, however, Hinton switched to LSTM, too.^{[LSTM8]}
[I24]
E. Ising (1925). Beitrag zur Theorie des Ferro- und Paramagnetismus. Dissertation, 1924.
[I25]
E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925.
The first non-learning recurrent NN architecture (the Ising model or Lenz-Ising model) was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.^{[L20][I25][K41][W45][T22]} It settles into an equilibrium state in response to input conditions, and is the foundation of learning RNNs.^{[AMH1-2]}
[IM09]
J. Deng, R. Socher, L.J. Li, K. Li, L. FeiFei (2009). Imagenet: A largescale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255. IEEE, 2009.
[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit.
Preprint arXiv:1704.04760
[K41]
H. A. Kramers and G. H. Wannier (1941). Statistics of the TwoDimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.
[K56]
S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.
[KOH72]
T. Kohonen. Correlation Matrix Memories. IEEE Transactions on Computers, C-21, p. 353-359, 1972.
[KNU]
D. E. Knuth, L. T. Pardo (1976). The Early Development of Programming Languages. Stanford University, Computer Science Department.
PDF.
[KO2]
J. Schmidhuber.
Discovering neural nets with low Kolmogorov complexity
and high generalization capability.
Neural Networks, 10(5):857-873, 1997.
PDF.
[KU] A. Küchler & C. Goller (1996). Inductive learning in symbolic domains using structuredriven recurrent neural networks. Lecture Notes in Artificial Intelligence, vol 1137. Springer, Berlin, Heidelberg.
[L20]
W. Lenz (1920). Beitrag zum Verständnis der magnetischen
Erscheinungen in festen Körpern. Physikalische Zeitschrift, 21:613-615. See also [I25].
[LAN]
J. L. Ba, J. R.Kiros, G. E. Hinton. Layer Normalization.
arXiv:1607.06450, 2016.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's team published most of what Y. LeCun calls his "main original contributions:" neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc. This was also discussed on Hacker News, reddit, and in the media.
See tweet1.
LeCun also listed the "5 best ideas 20122022" without mentioning that
most of them are from Schmidhuber's lab, and older.
See tweet2.
[LEC22a]
Y. LeCun (27 June 2022).
A Path Towards Autonomous Machine Intelligence.
OpenReview Archive.
Link. See critique [LEC].
[LEC22b]
M. Heikkilä, W. D. Heaven.
Yann LeCun has a bold new vision for the future of AI.
MIT Technology Review, 24 June 2022.
Link. See critique [LEC].
[LECP]
Y. LeCun.
A New Publishing Model in Computer Science.
Pamphlet, 20002004.
Local copy (HTML only).
[LEI21] J. Schmidhuber (AI Blog, 2021). 375th birthday of Leibniz, founder of computer science.
[LEI21a]
J. Schmidhuber (2021). Der erste Informatiker. Wie Gottfried Wilhelm Leibniz den Computer erdachte.
(The first computer scientist. How Gottfried Wilhelm Leibniz conceived the computer.)
Frankfurter Allgemeine Zeitung (FAZ), 17/5/2021. FAZ online:
19/5/2021.
[LIT21]
M. L. Littman (2021).
Collusion Rings Threaten the Integrity of Computer Science Research.
Communications of the ACM, Vol. 64 No. 6, p. 43-44, June 2021.
[LSTM0]
S. Hochreiter and J. Schmidhuber.
Long Short-Term Memory.
TR FKI-207-95, TUM, August 1995.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
PDF.
The "vanilla LSTM architecture" with forget gates
that everybody is using today, e.g., in Google's Tensorflow.
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
PDF.
[LSTM4]
S. Fernandez, A. Graves, J. Schmidhuber. An application of
recurrent neural networks to discriminative keyword
spotting.
Intl. Conf. on Artificial Neural Networks ICANN'07,
2007.
PDF.
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
PDF.
[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p. 545-552, Vancouver, MIT Press, 2009.
PDF.
[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber.
Evolving memory cell structures for sequence learning.
Proc. ICANN09, Cyprus, 2009.
PDF.
[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013.
PDF.
Based on [LSTM12,4,14][CTC].
[LSTM9]
O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton.
Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].
[LSTM10]
A. Graves, D. Eck and N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, BioADIT 2004, Lausanne, Switzerland, p. 175-184, 2004.
PDF.
[LSTM11]
N. Beringer and A. Graves and F. Schiel and J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl. Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575-581, Springer-Verlag Berlin Heidelberg, 2005.
[LSTM12]
D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.
[LSTM13]
F. A. Gers and J. Schmidhuber.
LSTM Recurrent Networks Learn Simple Context Free and
Context Sensitive Languages.
IEEE Transactions on Neural Networks 12(6):1333-1340, 2001.
PDF.
[LSTM14]
S. Fernandez, A. Graves, J. Schmidhuber.
Sequence labelling in structured domains with
hierarchical recurrent neural networks. In Proc.
IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk).
PDF.
[LSTM15]
A. Graves, J. Schmidhuber.
Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.
Advances in Neural Information Processing Systems 22, NIPS'22, p. 545-552,
Vancouver, MIT Press, 2009.
PDF.
[LSTM16]
M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel MultiDimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015.
Preprint: arxiv:1506.07452.
[LSTM17]
J. A. PerezOrtiz, F. A. Gers, D. Eck, J. Schmidhuber.
Kalman filters improve LSTM network performance in
problems unsolvable by traditional recurrent nets.
Neural Networks 16(2):241-250, 2003.
PDF.
[LSTMPG]
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous applications: DeepMind's StarCraft player (2019) and OpenAI's dexterous robot hand & Dota player (2018)—Bill Gates called this a huge milestone in advancing AI.
[LSTMRL]
B. Bakker, F. Linaker, J. Schmidhuber.
Reinforcement Learning in Partially Observable Mobile Robot
Domains Using Unsupervised Event Extraction.
In Proceedings of the 2002
IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2002), Lausanne, 2002.
PDF.
[LSTMGRU] J. Chung, C. Gulcehre, K. Cho, Y. Bengio (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Preprint arXiv:1412.3555 [cs.NE].
The socalled gated recurrent units (GRU)
are actually a variant of the vanilla LSTM architecture^{[LSTM2]} (2000) which the authors did not cite
although this work^{[LSTM2]} was the one that introduced gated recurrent units.
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009^{[LSTM7]} without changing the name of the basic method.
(Margin note: GRU cells lack an important gate and can neither
learn to count^{[LSTMGRU2]} nor learn simple nonregular
languages;^{[LSTMGRU2]} they
also do not work as well for challenging translation tasks,
according to Google Brain.^{[LSTMGRU3]})
[LSTMGRU2] G. Weiss, Y. Goldberg, E. Yahav. On the Practical Computational Power of Finite Precision RNNs for Language Recognition.
Preprint arXiv:1805.04908.
[LSTMGRU3] D. Britz et al. (2017). Massive Exploration of Neural Machine Translation
Architectures. Preprint arXiv:1703.03906
[M69] M. Minsky, S. Papert. Perceptrons (MIT Press, Cambridge, MA, 1969).
A misleading "history of deep learning" goes more or less like this: "In 1969, Minsky & Papert^{[M69]} showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL12]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP12][DL2]}
and then also by Amari's SGD for MLPs.^{[GD12]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
[MAR71]
D. Marr. Simple memory: a theory for archicortex. Philos Trans R Soc Lond B Biol Sci, 262(841), p. 23-81, 1971.
[MC43]
W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.
[META]
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
first publication on metalearning machines that learn to learn (1987).
For its cover I drew a robot that bootstraps itself.
1992: gradient descent-based neural metalearning. 1994: Meta-Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and intrinsic motivation.
2002: asymptotically optimal metalearning for curriculum learning. 2003: mathematically optimal Gödel Machine. 2020: new stuff!
[META1]
J. Schmidhuber.
Evolutionary principles in selfreferential learning, or on learning
how to learn: The meta-meta-... hook. Diploma thesis,
Institut für Informatik, Technische Universität München, 1987.
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
For example,
Genetic Programming
(GP) is applied to itself, to recursively evolve
better GP methods through Meta-Evolution. More.
[MGC] MICCAI 2013 Grand Challenge on Mitosis Detection, organised by M. Veta, M.A. Viergever, J.P.W. Pluim, N. Stathonikos, P. J. van Diest of University Medical Center Utrecht.
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 19901991. Preprint
arXiv:2005.05744, 2020. The deep learning neural networks of Schmidhuber's team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 202021, we celebrated that many of the basic ideas behind this revolution were published within fewer than 12 months in the "Annus Mirabilis" 19901991 at TU Munich.
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 32073220, 2010. ArXiv Preprint.
Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pretraining.
[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10year anniversary of supervised deep learning breakthrough (2010). No unsupervised pretraining.
By 2010, when compute was 100 times more expensive than today, both the feedforward NNs^{[MLP1]} and the earlier recurrent NNs of Schmidhuber's team were able to beat all competing algorithms on important problems of that time. This deep learning revolution quickly spread from Europe to North America and Asia. The rest is history.
[MOST]
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long ShortTerm Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) Generative Adversarial Networks (an instance of the much earlier
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized selfattention are formally equivalent to the much earlier Fast Weight Programmers).
Most of this started with the
Annus Mirabilis of 19901991.^{[MIR]}
[MOZ]
M. Mozer. A Focused Backpropagation Algorithm for Temporal Pattern Recognition.
Complex Systems, 1989.
[NAK72]
K. Nakano. Associatron—A Model of Associative Memory.
IEEE Transactions on Systems, Man, and Cybernetics, SMC-2:3, p. 380-388, 1972.
[NAS] B. Zoph, Q. V. Le. Neural Architecture Search with Reinforcement Learning.
Preprint arXiv:1611.01578 (PDF), 2017.
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature, 421 p 689, Feb 2003.
[NASC2]
J. Schmidhuber. Zooming in on aviation history.
Correspondence, Nature, vol 566, p 39, 7 Feb 2019.
[NASC3] J. Schmidhuber. The last inventor of the telephone. Letter, Science, 319, no. 5871, p. 1759, March 2008.
[NASC4] J. Schmidhuber. Turing: Keep his work in perspective.
Correspondence, Nature, vol 483, p 541, March 2012, doi:10.1038/483541b.
[NASC5] J. Schmidhuber. Turing in Context.
Letter, Science, vol 336, p 1639, June 2012.
(On Gödel, Zuse, Turing.)
See also comment on response by A. Hodges (DOI:10.1126/science.336.6089.1639a)
[NASC6] J. Schmidhuber. Colossus was the first electronic digital computer. Correspondence, Nature, 441 p 25, May 2006.
[NASC7] J. Schmidhuber. Turing's impact. Correspondence, Nature, 429 p 501, June 2004
[NASC8] J. Schmidhuber. Prototype resilient, selfmodeling robots. Correspondence, Science, 316, no. 5825 p 688, May 2007.
[NASC9] J. Schmidhuber. Comparing the legacies of Gauss, Pasteur, Darwin. Correspondence, Nature, vol 452, p 530, April 2008.
[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011.
HTML.
[NDR]
R. Csordas, K. Irie, J. Schmidhuber.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
[NHE] J. Schmidhuber. The Neural Heat Exchanger.
Oral presentations since 1990 at various universities including TUM and the
University of Colorado at Boulder. Also in S. Amari, L. Xu, L. Chan, I. King, K. Leung, eds., Proceedings of the Intl. Conference on Neural Information Processing (1996), pages 194-197, Springer, Hongkong.
Link.
[NPMa]
M. Nakamura, K. Shikano.
A study of English word category prediction based on neural networks.
Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP),
p. 731-734, 1989.
[NPM]
Y. Bengio, R. Ducharme, P. Vincent, C. Jauvin (2003).
A Neural Probabilistic Language Model.
Journal of Machine Learning Research 3, p. 1137-1155, 2003.
Based on Schmidhuber & Heil's
excellent 1995 neural probabilistic text model.^{[SNT]} See also Nakamura and Shikano's 1989 word category prediction model.^{[NPMa]}
[NS56]
A. Newell and H. Simon.
The logic theory machine—A complex information processing system.
IRE Transactions on Information Theory 2.3 (1956):61-79.
[NYT1]
NY Times article
by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'
[NYT3]
NY Times article
by G. LewisKraus, Dec. 14, 2016: The Great A.I. Awakening
[OAI1]
G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker.
Learning Dexterity. OpenAI Blog, 2018.
[OAI1a]
OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba.
Learning Dexterous In-Hand Manipulation. Preprint arXiv:1808.00177 (PDF).
[OAI2]
OpenAI:
C. Berner, G. Brockman, B. Chan, V. Cheung, P. Debiak, C. Dennison, D. Farhi, Q. Fischer, S. Hashme, C. Hesse, R. Jozefowicz, S. Gray, C. Olsson, J. Pachocki, M. Petrov, H. P. de Oliveira Pinto, J. Raiman, T. Salimans, J. Schlatter, J. Schneider, S. Sidor, I. Sutskever, J. Tang, F. Wolski, S. Zhang (Dec 2019).
Dota 2 with Large Scale Deep Reinforcement Learning.
Preprint
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
[OAI2a]
J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. An LSTM with 84% of the model's total parameter count was the core of OpenAI Five.
[PDA1]
G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol. II, pp. 649-653, Princeton University, Princeton, NJ, 1990.
[PDA2]
M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of contextfree languages. Proc. NIPS 1993.
[PG]
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.3-4: 229-256, 1992.
[PHD]
J. Schmidhuber.
Dynamische neuronale Netze und das fundamentale raumzeitliche
Lernproblem
(Dynamic neural nets and the fundamental spatiotemporal
credit assignment problem).
Dissertation,
Institut für Informatik, Technische
Universität München, 1990.
PDF.
HTML.
[PLAG1]
Oxford's guidance to types of plagiarism (2021).
Quote: "Plagiarism may be intentional or reckless, or unintentional."
Link.
[PLAN]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs,
the GAN principle (widely used today). Agents with adaptive recurrent world models even suggest a simple explanation of consciousness & selfawareness.
[PLAN2]
J. Schmidhuber.
An online algorithm for dynamic reinforcement learning and planning
in reactive environments.
Proc. IEEE/INNS International Joint Conference on Neural
Networks, San Diego, volume 2, pages 253-258, 1990.
Based on TR FKI-126-90 (1990).^{[AC90]}
More.
[PLAN3]
J. Schmidhuber.
Reinforcement learning in Markovian and nonMarkovian environments.
In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors,
Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San
Mateo, CA: Morgan Kaufmann, 1991.
PDF.
Partially based on TR FKI-126-90 (1990).^{[AC90]}
[PLAN4]
J. Schmidhuber.
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models.
Report arXiv:1210.0118 [cs.AI], 2015.
[PLAN5]
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
[PLAN6]
D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.)
Preprint: arXiv:1809.01999.
Github: World Models.
[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF.
More.
[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
PDF. More.
[PO87] J. B. Pollack. On Connectionist Models of Natural Language Processing.
PhD thesis, Computer Science Department, University of Illinois, Urbana, 1987.
[PO90] J. B. Pollack. Recursive Distributed Representations. Artificial Intelligence,
46(1-2):77-105, 1990.
[PP] J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
ArXiv preprint (2011):
arXiv:1112.5309 [cs.AI]
[PP1] R. K. Srivastava, B. Steunebrink, J. Schmidhuber.
First Experiments with PowerPlay.
Neural Networks, 2013.
ArXiv preprint (2012):
arXiv:1210.8385 [cs.AI].
[PP2] V. Kompella, M. Stollenga, M. Luciw, J. Schmidhuber. Continual curiositydriven skill acquisition from highdimensional video inputs for humanoid robots. Artificial Intelligence, 2015.
Relevant threads with many comments at reddit.com/r/MachineLearning, the largest machine learning forum with over 800k subscribers in 2019 (note that my name is often misspelled):
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber MetaLearning Fiasco.
Schmidhuber started
metalearning (learning to learn—now a hot topic)
in 1987^{[META1][META]} long before Bengio
who suggested in public at N(eur)IPS 2019
that he did it before Schmidhuber.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R9] Reddit/ML, 2019. We find it extremely unfair that Schmidhuber did not get the Turing award. That is why we dedicate this song to Juergen to cheer him up.
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
[R58]
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological review, 65(6):386.
This paper not only described single layer perceptrons, but also deeper multilayer perceptrons (MLPs).
Although these MLPs did not yet involve deep learning, because only the last layer learned,^{[DL1]}
Rosenblatt basically had what much later was rebranded as Extreme Learning Machines (ELMs) without proper attribution.^{[ELM1-2][CONN21][T22]}
[R61]
Joseph, R. D. (1961). Contributions to perceptron theory. PhD thesis, Cornell Univ.
[R62]
Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan, New York.
[RCNN]
R. Girshick, J. Donahue, T. Darrell, J. Malik.
Rich feature hierarchies for accurate object detection and semantic segmentation.
Preprint arXiv/1311.2524, Nov 2013.
[RCNN2]
R. Girshick.
Fast RCNN. Proc. of the IEEE international conference on computer vision, p. 14401448, 2015.
[RCNN3]
K. He, G. Gkioxari, P. Dollar, R. Girshick.
Mask RCNN.
Preprint arXiv/1703.06870, 2017.
[RELU1]
K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225.
This work introduced rectified linear units or ReLUs.
[RELU2]
C. v. d. Malsburg (1973).
Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.
[RO98]
R. Rojas (1998). How to make Zuse's Z3 a universal computer. IEEE Annals of Computing, vol. 19:3, 1998.
[RMSP]
T. Tieleman, G. E. Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning 4.2 (2012): 26-31.
[ROB]
A. J. Robinson and F. Fallside.
The utility driven dynamic error propagation network.
Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
[RPG]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.
[RPG07]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs
with Recurrent Policy Gradients.
Intl. Conf. on Artificial Neural Networks ICANN'07,
2007.
PDF.
[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation. TR No. ICS8506, California Univ San Diego La Jolla Inst for Cognitive Science. Later version published as:
Learning representations by backpropagating errors. Nature, 323, p. 533536 (1986).
This experimental analysis of backpropagation did not cite the origin of the method,^{[BP1-5]} also known as the reverse mode of automatic differentiation.
The paper also failed to cite
the first working algorithms for deep learning of internal representations (Ivakhnenko & Lapa, 1965)^{[DEEP1-2][HIN]} as well as
Amari's work (1967-68)^{[GD1-2]} on learning internal representations in deep nets through stochastic gradient descent.
Even later surveys by the authors^{[DL3,3a]} failed to cite the prior art.^{[T22]}
[S93]
D. Sherrington (1993).
Neural networks: the spin glass approach.
North-Holland Mathematical Library,
vol. 51, 1993, pp. 261-291.
[S20]
T. Sejnowski. The unreasonable effectiveness of deep learning in artificial intelligence. PNAS, January 28, 2020.
Link.
A misleading "history of deep learning" which goes more or less like this: "In 1969, Minsky & Papert^{[M69]} showed that shallow NNs without hidden layers are very limited and the field was abandoned until a new generation of neural network researchers took a fresh look at the problem in the 1980s."^{[S20]} However, the 1969 book^{[M69]} addressed a "problem" of Gauss & Legendre's shallow learning (~1800)^{[DL12]} that had already been solved 4 years prior by Ivakhnenko & Lapa's popular deep learning method,^{[DEEP12][DL2]}
and then also by Amari's SGD for MLPs.^{[GD12]}
Minsky was apparently unaware of this and failed to correct it later.^{[HIN](Sec. I)[T22](Sec. XIII)}
Deep learning research was alive and kicking
in the 1960s70s, especially outside of the Anglosphere.^{[DEEP12][GD13][CNN1][DL12][T22]}
[S80]
B. Speelpenning (1980). Compiling Fast Partial Derivatives of Functions Given by Algorithms. PhD
thesis, Department of Computer Science, University of Illinois, Urbana-Champaign.
[S2S]
I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, pp. 3104-3112.
[S59]
A. L. Samuel.
Some studies in machine learning using the game of checkers.
IBM Journal on Research and Development, 3:210-229, 1959.
[STO51]
H. Robbins, S. Monro (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics. 22(3):400, 1951.
[STO52]
J. Kiefer, J. Wolfowitz (1952). Stochastic Estimation of the Maximum of a Regression Function.
The Annals of Mathematical Statistics. 23(3):462, 1952.
[SA17] J. Schmidhuber.
Falling Walls:
The Past, Present and Future of Artificial Intelligence.
Scientific American, Observations, Nov 2017.
[SCAN] J. Masci,
A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.
[SE59]
O. G. Selfridge (1959). Pandemonium: a paradigm for learning. In D. V. Blake and A. M. Uttley, editors, Proc. Symposium on Mechanisation of Thought Processes, pp. 511-529, London, 1959.
[SHA37]
C. E. Shannon (1938). A Symbolic Analysis of Relay and Switching Circuits. Trans. AIEE. 57 (12): 713-723. Based on his thesis, MIT, 1937.
[SNT]
J. Schmidhuber, S. Heil (1996).
Sequential neural text compression.
IEEE Trans. Neural Networks, 1996.
PDF.
An earlier version appeared at NIPS 1995.
Much later this was called a probabilistic language model.^{[T22]}
[SK75]
D. Sherrington, S. Kirkpatrick (1975).
Solvable Model of a SpinGlass.
Phys. Rev. Lett. 35, 1792, 1975.
[ST]
J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber
Steel Defect Classification with Max-Pooling Convolutional Neural Networks.
Proc. IJCNN 2012.
PDF.
Apparently, this was the first deep learning breakthrough in heavy industry.
[ST61]
K. Steinbuch. Die Lernmatrix. (The learning matrix.) Kybernetik, 1(1):36-45, 1961.
[ST95]
W. Hilberg (1995). Karl Steinbuch, ein zu Unrecht vergessener Pionier
der künstlichen neuronalen Systeme. (Karl Steinbuch, an unjustly forgotten pioneer of artificial neural systems.) Frequenz, 49(1995)1-2.
[SP93] A. Sperduti (1993).
Encoding Labeled Graphs by Labeling RAAM. NIPS 1993: 1125-1132
One of the first papers on graph neural networks.
[SP94] A. Sperduti (1994).
Labelling Recursive Autoassociative Memory. Connect. Sci. 6(4): 429-459 (1994)
[SP95] A. Sperduti (1995).
Stability properties of labeling recursive autoassociative memory. IEEE Trans. Neural Networks 6(6): 1452-1460 (1995)
[SPG95] A. Sperduti, A. Starita, C. Goller (1995).
Learning Distributed Representations for the Classification of Terms. IJCAI 1995: 509-517
[SPG96] A. Sperduti, D. Majidi, A. Starita (1996).
Extended Cascade-Correlation for Syntactic and Structural Pattern Recognition. SSPR 1996: 90-99
[SPG97] A. Sperduti, A. Starita (1997).
Supervised neural networks for the classification of structures.
IEEE Trans. Neural Networks 8(3): 714-735, 1997.
[SV20] S. Vazire (2020). A toast to the error detectors. Let 2020 be the year in which we value those who ensure that science is self-correcting. Nature, vol 577, p 9, 2/2/2020.
[T19]
ACM's justification of the 2018 A.M. Turing Award (announced in 2019). WWW link.
Local copy 1 (HTML only).
Local copy 2 (HTML only).
[T22] debunks this justification.
[T20a] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Drs. Bengio & Hinton & LeCun. A precursor of [T22].
[T21v1]
J. Schmidhuber.
Scientific Integrity, the 2021 Turing Lecture, and the 2018 Turing Award for Deep Learning.
Technical Report IDSIA-77-21 (v1), IDSIA, 24 Sep 2021.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022. Debunking [T19] and [DL3a].
[THE17] S. Baker (2017). Which countries and universities are leading on AI research? Times Higher Education World University Rankings, 2017.
Link.
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.
This paper introduced the name "Transformers" for a now widely used NN type. It did not cite
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
Schmidhuber also introduced the now popular
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for the 30-year anniversary.
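For readers comparing the two attention variants, here is a rough NumPy sketch (an illustration only, with arbitrary toy shapes and a simple positive feature map; it is not the exact formulation of any of the cited papers): standard softmax self-attention costs O(n^2) in the sequence length n, whereas the linearized variant applies a feature map to queries and keys and reorders the matrix products so the cost grows linearly with n.

    import numpy as np

    def softmax_attention(Q, K, V):
        # standard scaled dot-product attention: O(n^2) in sequence length n
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ V

    def linearized_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
        # linearized self-attention: map Q and K through a positive feature
        # map phi, then reorder the products so cost is linear in n
        Qp, Kp = phi(Q), phi(K)
        KV = Kp.T @ V                    # (d, d_v) summary of keys and values
        Z = Qp @ Kp.sum(axis=0)          # per-query normalization term
        return (Qp @ KV) / Z[:, None]

    n, d = 8, 4
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linearized_attention(Q, K, V).shape)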
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pretraining of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, pp. 4731-4736. Preprint arXiv:1803.03585.
[TR4]
M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, pp. 156-171, 2020.
[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.
[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.
[TUR]
A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, Series 2, 41:230-267. Received 28 May 1936. Errata appeared in Series 2, 43, pp. 544-546 (1937).
2nd explicit proof that the Entscheidungsproblem (decision problem) does not have a general solution.
[TUR1]
A. M. Turing. Intelligent Machinery. Unpublished Technical Report, 1948.
Link.
In: D. C. Ince, editor, Collected Works of A. M. Turing - Mechanical Intelligence. Elsevier Science Publishers, 1992.
[TUR21] J. Schmidhuber (AI Blog, Sep 2021). Turing Oversold. It's not Turing's fault, though.
[UN]
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pretraining. Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.
Unsupervised/self-supervised learning and predictive coding are used
in a deep hierarchy of recurrent neural networks (RNNs)
to find compact internal
representations of long sequences of data,
across multiple time scales and levels of abstraction.
Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above.
The resulting compressed sequence representations
greatly facilitate downstream supervised deep learning such as sequence classification.
By 1993, the approach solved problems of depth 1000 [UN2]
(requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning).
A variant collapses the hierarchy into a single deep net.
It uses a so-called conscious chunker RNN
which attends to unexpected events that surprise
a lower-level so-called subconscious automatiser RNN.
The chunker learns to understand the surprising events by predicting them.
The automatiser uses a
neural knowledge distillation procedure
to compress and absorb the formerly conscious insights and
behaviours of the chunker, thus making them subconscious.
The systems of 1991 allowed for much deeper learning than previous methods. More.
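The core idea (pass only unexpected inputs up to the next level) can be illustrated with a deliberately simple toy sketch in Python; everything here is an illustrative assumption, with frequency-count predictors standing in for the RNNs of the cited work.

    import numpy as np

    # Toy two-level "history compressor": each level predicts its next input;
    # only inputs the lower level fails to predict are passed upward.
    class CountPredictor:
        """Predicts the most frequent successor seen so far (stand-in for an RNN)."""
        def __init__(self):
            self.counts = {}
            self.prev = None
        def step(self, symbol):
            surprising = True
            if self.prev is not None:
                table = self.counts.setdefault(self.prev, {})
                if table and max(table, key=table.get) == symbol:
                    surprising = False        # prediction was correct
                table[symbol] = table.get(symbol, 0) + 1
            self.prev = symbol
            return surprising

    low, high = CountPredictor(), CountPredictor()
    sequence = list("abababababcabababab")
    passed_up = []
    for s in sequence:
        if low.step(s):                       # only unexpected symbols go upward
            passed_up.append(s)
            high.step(s)
    print("input length:", len(sequence), "passed to higher level:", len(passed_up))

Once the lower level predicts the regular parts of the sequence, the higher level only sees the rare, surprising events, i.e. a much shorter (compressed) sequence.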
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pretraining and predictive coding (with self-supervised target generation).
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. More.
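The distillation step can likewise be sketched generically (illustrative Python/NumPy only; tiny feedforward nets replace the chunker and automatiser RNNs of the cited work, and the additional rehearsal of the student's old targets, which prevents forgetting, is omitted for brevity): the student is simply trained to reproduce the teacher's outputs.

    import numpy as np

    # Generic distillation sketch: a "student" net learns to reproduce the
    # outputs of an already-trained "teacher".
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))
    W_teacher = rng.normal(size=(5, 1))
    targets = np.tanh(X @ W_teacher)         # teacher outputs become training targets

    W1 = 0.1 * rng.normal(size=(5, 20))      # student: tiny 2-layer net
    W2 = 0.1 * rng.normal(size=(20, 1))
    for _ in range(2000):                    # plain gradient descent on the MSE
        H = np.tanh(X @ W1)
        err = H @ W2 - targets
        gW2 = H.T @ err / len(X)
        gH = (err @ W2.T) * (1.0 - H ** 2)
        gW1 = X.T @ gH / len(X)
        W1 -= 0.5 * gW1
        W2 -= 0.5 * gW2
    print("student-vs-teacher MSE:", float(np.mean((np.tanh(X @ W1) @ W2 - targets) ** 2)))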
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised/self-supervised pretraining for a stack of recurrent NNs
can be found here (depth > 1000).
[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95.
Augustinus, 1993.
[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313, no. 5786, pp. 504-507, 2006. PDF.
This work describes unsupervised pretraining of stacks of feedforward NNs (FNNs)
called Deep Belief Networks (DBNs).
It did not cite the much earlier 1991 unsupervised pretraining of stacks of more general recurrent NNs (RNNs)^{[UN0-3]}
which introduced
the first NNs shown to solve very deep problems.
The 2006 justification of the authors was essentially the one Schmidhuber used for the 1991 RNN stack:
each higher level tries to reduce the description length
(or negative log probability) of the data representation in the level below.^{[HIN][T22][MIR]}
This can greatly facilitate very deep downstream learning.^{[UN0-3]}
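In minimum-description-length terms: under a model assigning probability p(x) to a lower-level representation x, its coding cost is L(x) = -log2 p(x) bits, so the better the level above predicts the representations below, the shorter their description becomes.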
[UN5]
Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle.
Greedy layerwise training of deep networks.
Proc. NIPS 06, pages 153160, Dec. 2006.
The comment under reference^{[UN4]} applies here as well.
[URQ10]
A. Urquhart. Von Neumann, Gödel and complexity theory. Bulletin of Symbolic Logic 16.4 (2010): 516-530.
Link.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. (Studies of dynamic neural networks.) Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
More on the Fundamental Deep Learning Problem.
[VAN2] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), pp. 157-166, 1994.
The results are essentially identical to those of Schmidhuber's diploma student Hochreiter (1991).^{[VAN1]} Even after a joint publication,^{[VAN3]} the first author of [VAN2] published papers^{[VAN4]} that cited only their own [VAN2] but not the original work.
[VAN3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press, 2001.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
[VAR13]
M. Y. Vardi (2013). Who begat computing? Communications of the ACM, Vol. 56(1):5, Jan 2013.
Link.
[VID1] G. Hinton.
The Next Generation of Neural Networks.
Youtube video [see 28:16].
GoogleTechTalk, 2007.
Quote: "Nobody in their right mind would ever suggest"
to use plain backpropagation for training deep networks.
However, in 2010, Schmidhuber's team in Switzerland showed^{[MLP1-2]}
that
unsupervised pretraining is not necessary
to train deep NNs.
[VID2] Bloomberg Hello World.
The Rise of AI.
Youtube video, 2018.
The narrator of this 2018 Bloomberg video thanks Hinton for speech recognition and machine translation, although at the time the video was produced, both ran on billions of smartphones using deep learning methods (LSTM & CTC) developed in Schmidhuber's labs in Germany and Switzerland, long before Hinton's less successful methods.
[W45]
G. H. Wannier (1945).
The Statistical Problem in Cooperative Phenomena.
Rev. Mod. Phys. 17, 50.
[WID62]
Widrow, B. and Hoff, M. (1962). Associative storage and retrieval of digital information in networks
of adaptive neurons. Biological Prototypes and Synthetic Systems, 1:160, 1962.
[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Preprint arXiv:1609.08144 (PDF), 2016. Based on LSTM which it mentions at least 50 times.
[XAV]
X. Glorot, Y. Bengio.
Understanding the difficulty of training deep feedforward neural networks.
Proc. 13th Intl. Conference on Artificial Intelligence and Statistics,
PMLR 9:249-256, 2010.
[YB20]
Y. Bengio. Notable Past Research.
WWW link (retrieved 15 May 2020).
Local copy (plain HTML only).
The author claims that in 1995
he "introduced the use of a hierarchy of time scales to combat the vanishing gradients issue"^{[HB96]}
although
Schmidhuber's publications on exactly this topic
date back to 1991-93.^{[UN0-2][UN]}
The author also writes that in
1999 he "introduced, for the first time, autoregressive neural networks for density estimation"
although Schmidhuber & Heil used a very similar setup for text compression
already in 1995.^{[SNT]}
[ZU36]
K. Zuse (1936).
Verfahren zur selbsttätigen Durchführung von Rechnungen mit Hilfe von Rechenmaschinen. (Method for the automatic execution of calculations with the aid of calculating machines.) Patent application Z 23 139 / GMD Nr. 005/021, 1936.
First patent application describing
a general, practical, program-controlled computer.
[ZU48]
K. Zuse (1948). Über den Plankalkül als Mittel zur Formulierung schematisch kombinativer Aufgaben. (On the Plankalkül as a means of formulating schematically combinatorial tasks.)
Archiv der Mathematik 1(6), 441-449 (1948).
PDF.
Apparently the first practical design of an automatic theorem prover (based on Zuse's high-level programming language Plankalkül).
[ZUS21]
J. Schmidhuber (AI Blog, 2021). 80th anniversary celebrations: 1941: Konrad Zuse completes the first working general computer, based on his 1936 patent application.