On 14 June 2022, a science tabloid that later published
this article [LEC22b] (24 June)
on LeCun's report "A Path Towards Autonomous Machine Intelligence" [LEC22a] (27 June) sent me a draft of [LEC22a] (back then still under embargo) and asked for comments. I wrote a review (see below), telling them that the report is essentially a rehash of our previous work, which LeCun did not mention. My comments, however, fell on deaf ears. Now I am posting my not-so-enthusiastic remarks here so that the history of our field does not become further corrupted. The images below link to relevant blog posts from the
AI Blog.
I would like to start by acknowledging that I am not without a conflict of interest here; my seeking to correct the record will naturally seem self-interested. The truth of the matter is that it is. Much of the closely related work pointed to below was done in my lab, and I naturally wish that it be acknowledged and recognized. Setting my conflict aside, I ask the reader to study the original papers and judge the scientific content of these remarks for themselves, as I seek to set emotions aside and minimize bias as much as I am capable.
★ LeCun writes: "Many ideas described in this paper (almost all of them) have been formulated by many authors in various contexts in various form."
In fact, unfortunately, much of the paper reads like a déjà vu of our papers since 1990, without citation. Years ago we already published most of what LeCun calls his "main original contributions" [LEC22a]. More on this below.
★ LeCun writes: "There are three main challenges that AI research must address today: (1) How can machines learn to represent the world, learn to predict, and learn to act largely by observation? ... (2) How can machines reason and plan in ways that are compatible with gradient-based learning? ... (3) How can machines learn to represent percepts (3a) and action plans (3b) in a hierarchical manner, at multiple levels of abstraction, and multiple time scales?"
These questions were addressed in detail in a series of papers published in 1990, 1991, 1997, and 2015. Since then we have elaborated upon these papers. Let me first focus on (1) and (2), then on (3a) and (3b).
In 1990, I published the first works on gradient-based artificial neural networks (NNs) for long-term planning & reinforcement learning (RL) & exploration through artificial curiosity [AC90][PLAN2]. The well-known report "Making the world differentiable ..."
[AC90] (which spawned several conference publications, e.g., [PLAN2-3]) introduced several concepts mentioned by LeCun that are now widely used.
It describes a combination of two
recurrent neural networks (RNNs, the most powerful NNs) called the controller and the world model. The controller tries to emit sequences of actions that maximize cumulative expected
vector-valued (not necessarily scalar) pain and pleasure signals (special inputs to the controller) in an initially unknown environment. The world model learns to predict the consequences of the controller's actions. The controller can use the world model to plan ahead for several time steps through what is now called a rollout, selecting action sequences that maximize predicted reward [AC90][PLAN2].
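The rollout idea above can be sketched in a few lines of Python. This is only a toy illustration, not the 1990 system: the "world model" below is a fixed hand-coded stand-in rather than a learned RNN, and all names and numbers are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model(state, action):
    """Stand-in for the learned model M: predicts (next_state, reward).
    Hypothetical toy dynamics; the real M would be a trained RNN."""
    next_state = state + action          # pretend dynamics
    reward = -abs(next_state - 3.0)      # pretend goal: reach state 3
    return next_state, reward

def plan_by_rollout(state, horizon=5, n_candidates=64):
    """Controller C scores candidate action sequences by rolling them out
    through M, and picks the one with the best predicted return."""
    best_seq, best_return = None, -np.inf
    for _ in range(n_candidates):
        seq = rng.uniform(-1.0, 1.0, size=horizon)
        s, total = state, 0.0
        for a in seq:
            s, r = world_model(s, a)
            total += r
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq, best_return

actions, ret = plan_by_rollout(state=0.0)
# doing nothing (all-zero actions) would score -15; a good rollout beats that
```

Modern model-based RL replaces the random candidate sampling with gradient descent through the differentiable model, which is closer to the scheme of [AC90][PLAN2].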
This integrated architecture for learning, planning, and reacting was apparently published
[AC90][PLAN2]
before the important, related RL DYNA planner [DYNA90-91] cited by LeCun. [AC90] also cites relevant work on less general control and
system identification with feedforward
NNs (FNNs) that predates the FNN work cited by LeCun, who claims that this goes back to the early 1990s although the first papers appeared in the 1980s, e.g., [MUN87][WER87-89][NGU89] (compare Sec. 6 of [DL1]).
See also Sec. 11 of [MIR]
and our 1990 application of world models to the
learning of sequential attention and active foveation
[ATT][ATT0-2] (emphasized by LeCun [LEC22b]).
The approach led to many follow-up publications, not only
in 1990-91 [PLAN2-3][PHD],
but also in more recent years, e.g., [PLAN4-6].
In the beginning, the world model knows nothing. Which experiments should the controller invent to obtain data that will improve the world model? To solve this problem,
the 1990 paper [AC90] introduced
artificial curiosity [AC90b] or intrinsic motivation (emphasized by LeCun's abstract [LEC22b]) through
NNs that are both generative and adversarial: the 2014 generative adversarial NN [GAN1] cited by LeCun is actually a simple version of my 1990 approach [AC20][R2].
My wellknown 2010 survey [AC10] summarised the GANs of 1990 as follows: a
"neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" (which are minimized).
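The quoted principle, an intrinsic reward proportional to the very prediction error that the world model simultaneously minimizes, can be illustrated with a deliberately tiny sketch. Everything here is a stand-in: a one-weight linear "model", a random "controller", and invented constants.

```python
import numpy as np

rng = np.random.default_rng(1)
w = 0.0          # world model M: one learnable weight, predicts w * action
true_w = 2.0     # unknown environment dynamics (invented for the example)
lr = 0.1

intrinsic_rewards = []
for step in range(200):
    action = rng.uniform(-1, 1)               # C's "experiment"
    next_obs = true_w * action                # the environment's answer
    pred = w * action                         # M's prediction
    err = (pred - next_obs) ** 2
    intrinsic_rewards.append(err)             # C is paid for M's surprise
    w -= lr * 2 * (pred - next_obs) * action  # M learns, shrinking future reward

# As M improves, the intrinsic reward dries up:
# curiosity steers C towards the not-yet-predictable.
early = np.mean(intrinsic_rewards[:20])
late = np.mean(intrinsic_rewards[-20:])
```

In the full 1990 setup the controller actively chooses actions to maximize this reward, rather than acting at random as in this sketch; that closed loop is what makes the pair adversarial.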
See my
priority dispute on GANs [T22] with LeCun's coauthor (who unsurprisingly had rather positive comments [LEC22b] on LeCun's article [LEC22a]).
In the 2010s [DEC], these concepts
became popular as
compute became cheaper. Our work of 1997-2002 [AC97][AC99][AC02] and more recent work since 2015 [PLAN4-5][OBJ2-4] go beyond the "millisecond by millisecond
planning" [PLAN] of 1990 [AC90][PLAN2], addressing planning and reasoning in abstract
concept spaces (emphasized by LeCun) and
learning to think [PLAN4],
including LeCun's "learning to act largely by observation" (see item (1) above).
The cartoon on top of the present page is based on Figure 1 in [PLAN4]: C denotes the recurrent control network, M the recurrent predictive world model, which may be good at predicting some things but uncertain about others. C maximizes its objective function by learning to query (a copy of) M through sequences of self-invented questions (activation patterns) and to interpret the answers (more activation patterns). That is, in principle, C can profit from being able to extract any type of algorithmic information [KO0-2][CO1-3][PLAN4-5] from M, e.g., for hierarchical planning and reasoning, analogy building, etc.
Here is an illustrative quote from [PLAN4] (2015) on how C can learn from passive observations (frequently mentioned by LeCun [LEC22a]) encoded in M: "For example, suppose that M has learned to represent (e.g., through predictive coding)
videos of people placing toys in boxes,
or to summarize such videos through textual outputs.
Now suppose C's task is to learn to control a robot that places toys in boxes.
Although the robot's actuators may be quite different from human arms and hands,
and although videos and video-describing texts are quite different from desirable trajectories of
robot movements, M is expected to convey algorithmic information about C's task, perhaps in form of connected
high-level spatio-temporal feature detectors representing typical movements of hands and elbows independent of arm size.
Learning a [weight matrix of C] that addresses and extracts this information from M and partially reuses it to solve the robot's task may
be much faster than learning to solve the task from scratch without access to M" [PLAN4-5].
(See also the related "fingerprinting" and its recent applications [PEVN][GPE][GGP].)
[PLAN4] also explains concepts such as mirror neurons.
My agents with adaptive recurrent
world models even suggest a simple explanation of self-awareness and consciousness (mentioned by LeCun), dating back three decades [CON16]. Here is my 2020 overview page [PLAN] on this:
30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990).
(3a) Answer regarding NN-based hierarchical percepts: this was already at least partially solved by my
first very deep learning machine
of 1991,
the neural sequence chunker aka neural history compressor
[UN][UN0-UN2] (see also [UN3]). It uses
unsupervised learning and predictive coding
in a deep hierarchy of recurrent neural networks (RNNs)
to find compact internal
representations of long sequences of data, at multiple levels of abstraction and multiple time scales (exactly what LeCun is writing about).
This greatly facilitates downstream supervised deep learning such as sequence classification.
By 1993, the approach solved problems of depth 1000
(requiring 1000 subsequent computational stages/layers; the more such stages, the deeper the learning).
A variant collapses the hierarchy into a single deep net.
It uses a so-called conscious chunker RNN
which attends to unexpected events that surprise
a lower-level so-called subconscious automatiser RNN.
The chunker learns to understand the surprising events by predicting them.
The automatiser uses my
neural knowledge distillation procedure
of 1991
[UN0UN2]
to compress and absorb the formerly conscious insights and
behaviours of the chunker, thus making them subconscious.
The systems of 1991 allowed for much deeper learning than previous methods. Here is my 2021 overview page [UN] on this:
30-year anniversary. 1991: First very deep learning with unsupervised pre-training.
(See also the 1993 work on continuous history compression [UN3] and our 1995 neural probabilistic language model based on predictive coding [SNT].)
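The chunker/automatizer interplay can be caricatured without any neural network at all: a lower level predicts the next symbol, and only its failures (the surprises) are passed upward. The lookup-table predictor below is a hypothetical stand-in for the automatiser RNN; it merely illustrates how predictive coding collapses a predictable stream into a much shorter one.

```python
def compress(sequence):
    """Lower level ('automatizer') predicts the next symbol from the previous
    one. Only unexpected symbols are passed up to the higher level
    ('chunker'), so a predictable stream shrinks drastically."""
    predictor = {}        # last_symbol -> predicted next symbol
    passed_up = []
    prev = None
    for sym in sequence:
        if predictor.get(prev) != sym:    # surprise: prediction failed
            passed_up.append(sym)
            predictor[prev] = sym         # learn: predict this next time
        prev = sym
    return passed_up

stream = "abcabcabcabcabc"
surprises = compress(stream)
# only the first cycle (and the first wrap-around) surprises the predictor
```

The real 1991 system does this with RNN levels operating on progressively slower, more abstract time scales, and can then distill the hierarchy back into a single net.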
Furthermore, our more recent 2021 hierarchical world model [OBJ4] also explicitly distinguishes multiple levels of abstraction (to capture objects, parts, and their relations) to better model the visual world. Regarding LeCun's section 8.3.3 "Do We Need Symbols for Reasoning?," we have previously argued [BIND] for the importance of incorporating inductive biases in NNs that enable them to efficiently learn about symbols (e.g., [SYM1-3]) and the processes for manipulating them. Currently, many NNs suffer from a binding problem, which affects their ability to dynamically and flexibly combine (bind) information that is distributed throughout the NN, as is required to effectively form, represent, and relate symbol-like entities. Our 2020 position/survey paper [BIND] offers a conceptual framework for addressing this problem and provides an in-depth analysis of the challenges, requirements, and corresponding inductive biases required for symbol manipulation to emerge naturally in NNs.
(3b) Answer regarding NN-based hierarchical action plans: already in 1990, this problem was at least partially solved through my Hierarchical Reinforcement Learning (HRL) with
end-to-end differentiable NN-based subgoal generators [HRL0], also with
recurrent NNs that learn to generate sequences of subgoals [HRL1-2][PHD].
An RL machine gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals.
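A minimal sketch of this gradient-based subgoal search follows, with a hand-coded quadratic cost standing in for the learned evaluator NN (all specifics are invented for the example). Because long jumps are disproportionately expensive under a quadratic cost, the cost-minimising single subgoal is the midpoint, and plain gradient descent finds it.

```python
def evaluator(x, y):
    """Stand-in for the learned evaluator NN: predicted cost of going
    from x to y. Quadratic, so long jumps are disproportionately costly."""
    return (y - x) ** 2

start, goal = 0.0, 10.0
subgoal = 9.0        # arbitrary initialization
lr = 0.1
for _ in range(100):
    # gradient of evaluator(start, s) + evaluator(s, goal) w.r.t. s
    grad = 2 * (subgoal - start) + 2 * (subgoal - goal)
    subgoal -= lr * grad
# the cost-minimizing intermediate subgoal converges to the midpoint (5.0)
```

With a learned, differentiable evaluator, the same gradient flows through copies of the evaluator network instead of this hand-derived expression, and whole sequences of subgoals can be tuned jointly.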
LeCun writes: "A general formulation can be done with the framework of Energy-Based Models (EBM).
The system is a scalar-valued function F(x,y) that produces low energy values when x and
y are compatible and higher values when they are not." That's exactly what the evaluator of our 1990 subgoal generator implements,
where x and y are start and goal, respectively.
The system is learning action plans
at multiple levels of abstraction and multiple time scales (exactly what LeCun is writing about).
Here is my 2021 overview [MIR] (Sec. 10):
Deep Learning: Our Miraculous Year 1990-1991.
(There are many more recent papers on "work on command," e.g.,
[ATT1][PP][PPa][PP1][SWA22][UDRL1-2][GGP].)
★ LeCun writes: "Our best approaches to learning rely on estimating and using the gradient of a loss."
This is true for some tasks, but not for many others. For example, even simple problems such as the general parity problem [GUESS][LSTM1] or Towers of Hanoi [OOPS2] cannot easily be learned by gradient descent, even given large numbers of training examples. See, e.g., our work since 2002 on
asymptotically optimal curriculum learning through incremental universal search for problem-solving programs [OOPS1-3].
★ LeCun writes: "Because both submodules of the cost module are differentiable, the gradient of the energy can be backpropagated through the other modules, particularly the world model, the actor and the perception, for planning, reasoning, and learning."
That's exactly what I published in 1990 (see above), citing less general 1980s work on
system identification with feedforward
NNs [MUN87][WER87-89][NGU89] (see also Sec. 6 of [DL1]).
And in the early 2000s, my former postdoc Marcus Hutter even published theoretically optimal, universal, non-differentiable methods for learning both world model and controller [UNI].
(See also the mathematically optimal
self-referential AGI called the Gödel Machine [GM3-9].)
★ LeCun writes: "The short-term memory module ... architecture may be similar to that of Key-Value Memory Networks."
He does not mention, however, that I published the first such "Key-Value Memory Networks" in 1991 [FWP0-1,6], when I described sequence-processing "Fast Weight Controllers" or Fast Weight Programmers (FWPs). Such an FWP has a slow NN that learns by backpropagation [BP1-6][BPA-C] to rapidly modify the fast weights of another NN [FWP0-1]. The slow net does so by programming the fast net through outer products of self-invented Key-Value pairs (back then called FROM-TO pairs). Today this is known as a linear Transformer [TR5-6] (without the softmax operation of modern Transformers [TR1-4]). In 1993, I also introduced the attention terminology [FWP2] now used in this context [ATT][FWP2]. Basically, I separated storage and control as in traditional computers,
but in a fully neural way (rather than in a hybrid fashion [PDA1][PDA2][DNC]). Here is my 2021 overview page on this: 26 March 1991: Neural nets learn to program neural nets with fast weights, like today's Transformer variants. 2021: New stuff!
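The correspondence between such a fast weight matrix and unnormalized linear attention can be checked numerically. In the sketch below, the keys and values are drawn at random; in the 1991 system, a slow net would generate them from the input stream.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_fast = np.zeros((d, d))            # fast weights: the short-term memory
keys, values = [], []

for t in range(6):
    k = rng.standard_normal(d)       # in 1991, a slow net would emit these
    v = rng.standard_normal(d)       # from the input; here they are random
    W_fast += np.outer(v, k)         # program the fast net: rank-1 update
    keys.append(k)
    values.append(v)

q = rng.standard_normal(d)           # query
y_fwp = W_fast @ q                   # retrieval via the fast weight matrix

# the same result, written as unnormalized attention over stored pairs:
y_attn = sum(v * (k @ q) for k, v in zip(keys, values))
```

Both paths compute the same quantity: the fast weight matrix is simply the sum of value-key outer products, so querying it is attention with linear (softmax-free) weighting.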
In fact, our very recent work on goal-conditioned generators of deep policies [GGP] (submitted in May 2022) has a Fast Weight Programmer that learns to obey commands of the form "generate a policy (an NN weight matrix) that achieves a desired expected return," generalizing Upside-Down RL [UDRL1-2], building on
Parameter-Based Value Functions [PBVF] and Policy Evaluation Networks [PEVN].
[GGP] exhibits competitive performance on a set of continuous control tasks, basically playing the role of LeCun's unspecified "configurator" which is supposed to configure other modules "for the task at hand by modulating their parameters and their attention circuits" [LEC22a].
See also our other very recent paper [GPE] (and references therein) on evaluating/improving weight matrices that are control policies.
On the other hand, LeCun also writes: "Perhaps the most important function of the configurator is to set subgoals for the agent and to configure the cost module for this subgoal." We implemented such an adaptive subgoal generator three decades ago [HRL0-2][PHD]; see item (3b) above.
★ LeCun writes: "The centerpiece of this paper is the Joint Embedding Predictive Architecture (JEPA). ... The main advantage of JEPA is that it performs predictions in representation space, eschewing the need to predict every detail of y."
This is what I published in the context of control in 1997-2002 [AC97][AC99][AC02]. Before 1997, the world models of our RL systems tried to predict all the details of future inputs, e.g., pixels [AC90-95]. But in 1997, a quarter-century ago [25y97], I built more general adversarial RL machines that could ignore many or all of these details and ask arbitrary abstract questions with computable answers in "representation space" [AC97][AC99][AC02]. For example, an agent may ask itself: if we run this policy (or program) for a while until it executes a special interrupt action, will the internal storage cell number 15 (a latent variable in representation space) contain the value 5, or not? The agent actually consists of two learning, reward-maximizing adversaries (called "left brain" and "right brain") playing a zero-sum game, occasionally betting on different yes/no outcomes of such computational experiments. The winner of such a bet gets a reward of +1, the loser -1. So each brain is motivated to come up with questions whose answers surprise the other, until the answers become predictable and boring. Experiments showed that this type of abstraction-based curiosity may also accelerate the intake of external reward [AC97][AC02]. Here is my 2021 overview blog page [AC] on this (see Sec. 4):
Artificial Curiosity & Creativity Since 1990-91.
Note also that even our earlier, less general approaches to artificial curiosity since 1991 [AC91-95] naturally direct the world model towards representing predictable details in the environment, by rewarding a data-selecting controller for improvements of the world model. See Sections 2-5 of the overview [AC].
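The betting game of [AC97][AC02] can be caricatured in a few lines. Everything below is a stand-in: the real adversaries learn their betting policies, and the experiments are self-invented programs; here one fixed "brain" has discovered a regularity that the other has not, so it wins every disputed zero-sum bet.

```python
def experiment(x):
    """A computational experiment with a verifiable yes/no outcome."""
    return x % 3 == 0

left_score = right_score = 0
for x in range(30):
    left_bet = True              # left brain's naive prediction: always 'yes'
    right_bet = (x % 3 == 0)     # right brain has discovered the regularity
    if left_bet != right_bet:    # a bet happens only when they disagree
        outcome = experiment(x)
        winner_is_left = (left_bet == outcome)
        left_score += 1 if winner_is_left else -1   # zero-sum payoff: +1/-1
        right_score += -1 if winner_is_left else 1
```

Since losing bets are costly, the left brain is pressured to learn the regularity too, after which the two stop disagreeing and must invent new, not-yet-predictable experiments; that pressure is the curiosity drive.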
★ LeCun writes: "a JEPA can choose to train its encoders to eliminate irrelevant details of the inputs so as to make the representations more predictable. In other words, a JEPA will learn abstract representations that make the world predictable."
That's what we published in very general form for RL systems in 1997
[AC97][AC99][AC02] (title: "Exploring the Predictable"). See also earlier work on much less general supervised systems, e.g., "Discovering Predictable Classifications" (1992) [PMax], extending [IMAX] (1989). The
science tabloid article [LEC22b] also focuses on this issue, acting as if LeCun had some novel insight here, although it's really an old hat.
★ LeCun writes: "One question that is left unanswered is how the configurator can learn to decompose a complex task into a sequence of subgoals that can individually be accomplished by the agent. I shall leave this question open for future investigation."
Far from being a matter for future investigation, we published the first systems doing exactly this three decades ago, when compute was a million times more expensive than today: learning to decompose by gradient descent "a complex task into a sequence of subgoals that can individually be accomplished by the agent" [HRL0-2][PHD]; see (3b) above and
Sec. 10 of "Deep Learning: Our Miraculous Year 1990-1991."
See also [HRL4] on a different approach (1997) to this problem, with my student Marco Wiering. I could point to many additional papers of ours on exactly this topic.
★ LeCun writes: "Perhaps the main original contributions of the paper reside in
(I) an overall cognitive architecture in which all modules are differentiable and many of
them are trainable.
(II) H-JEPA: a non-generative hierarchical architecture for predictive world models that
learn representations at multiple levels of abstraction and multiple time scales.
(III) a family of non-contrastive self-supervised learning paradigm that produces representations
that are simultaneously informative and predictable.
(IV) a way to use H-JEPA as the basis of predictive world models for hierarchical planning
under uncertainty."
Given my comments above, I cannot see any significant novelty here. Of course, I am not claiming that everything is solved. Nevertheless, in the past 32 years, we have already made substantial progress along the lines "proposed" by LeCun. I am referring the interested reader again to (I) our
"cognitive architectures in which all modules are differentiable and many of them are trainable" [HAB][PHD][AC90][AC90b][AC][HRL0-2][PLAN2-5], (II) our "hierarchical architecture for predictive world models that
learn representations at multiple levels of abstraction and multiple time scales" [UN][UN0-3], (III) our "self-supervised learning paradigm that produces
representations that are simultaneously informative and predictable" [AC97][AC99][AC02] ([PMax]), and (IV) our predictive models "for hierarchical planning under uncertainty" [PHD][HRL0-2][PLAN4-5]. In particular, the work of 1997-2002 [AC97][AC99][AC02][AC] and more recent work since 2015 [PLAN4-5][OBJ2-4][BIND] focuses on reasoning in abstract
concept spaces and learning to think [PLAN4]. I am also recommending the work on Fast Weight Programmers (FWPs) and "Key-Value Memory Networks" since 1991 [FWP0-6][FWPMETA1-10] (recall LeCun's "configurator" [LEC22a]), including our recent work since 2020 [FWP6-7][FWPMETA6-9][GGP][GPE]. All of this is closely connected to our
meta-learning machines that learn to learn (since 1987) [META].
★ LeCun writes: "Below is an attempt to connect the present proposal with relevant prior work. Given the scope of the proposal, the references cannot possibly be exhaustive." Then he goes on citing a few somewhat related, mostly relatively recent works, while ignoring most of the directly relevant original work mentioned above, possibly encouraged by an
award that he and his colleagues shared for inventions of other researchers whom they did not cite [T22].
Perhaps some of the prior work that I note here was simply unknown to LeCun. The point of this post is not to attack the ideas reflected in the paper under review, or its author. The point is that these ideas are not as new as may be understood by reading LeCun's paper. There is much prior work that is directly along the lines proposed, by my lab, and by others. I have naturally placed some emphasis on my own prior work, which has focused for decades on what LeCun now calls his "main original contributions,"
and hope the readers will judge for themselves the validity of my comments.
Acknowledgments
Thanks to several machine learning experts for useful comments. Since science is about self-correction, let me know under juergen@idsia.ch if you can spot any remaining error. The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Addendum I (8/24/2022): LeCun's response and my reply
On 8 Jul 2022, I posted
the link to the present critique in OpenReview:
This paper rehashes but does not cite vital work from 1990-2015
On 14 June 2022, a science tabloid that published a 24 June article on LeCun's 27 June report sent me a draft of the report (back then still under embargo) and asked for comments. I wrote a review, telling them that this is essentially a rehash of our previous work that LeCun did not mention. My comments, however, fell on deaf ears. Now I am posting a link to my not-so-enthusiastic remarks here: https://people.idsia.ch/~juergen/lecunrehash19902022.html
Before you read that, though, I want to acknowledge that I am not without a conflict of interest here; my seeking to correct the record will naturally seem self-interested. The truth of the matter is that it is. Much of the closely related work pointed to below was done in my lab, and I naturally wish that it be acknowledged and recognized. Setting my conflict aside, I ask the reader to study the original papers and judge the scientific content of these remarks for themselves, as I seek to set emotions aside and minimize bias as much as I am capable.
TL;DR: Years ago we published most of what LeCun calls his "main original contributions:" (I) our "cognitive architectures in which all modules are differentiable and many of them are trainable" (1990), (II) our "hierarchical architecture for predictive world models that learn representations at multiple levels of abstraction and multiple time scales" (1991), (III) our "self-supervised learning paradigm that produces representations that are simultaneously informative and predictable" (since 1997 for reinforcement learning/world models), and (IV) our predictive models "for hierarchical planning under uncertainty," including gradient-based neural subgoal generators (1990), reasoning in abstract concept spaces (1997), neural nets that "learn to act largely by observation" (2015), and learn to think (2015).
More details and numerous references to the original papers can be found under https://people.idsia.ch/~juergen/lecunrehash19902022.html
On 14 Jul 2022, Yann LeCun responded:
Let's be constructive, please?
I don't want to get into a sterile dispute about who invented what by plowing through the 160 references listed in your response piece. It would be more constructive to point to 4 publications that you think may contain the ideas and methods in my list of 4 contributions.
As I say at the beginning of the paper, there are many concepts that have been around for a long time that neither you nor I invented:
- the concept of differentiable world model goes back to early work in optimal control.
- trainable world models is the whole idea of systems identification.
- using neural nets to learn world models goes back to the late 1980s with work by Michael Jordan, Bernie Widrow, Robinson & Fallside, Kumpathi Narendra, Paul Werbos, all predating your own work.
This straw man reply seems designed to distract from the problems with what LeCun calls his "main original contributions." I responded on 14 Jul 2022:
1990-91: neural nets (NNs) learn multiple time scales and levels of abstraction, generate subgoals, use GAN-like intrinsic motivation to improve world models, and plan. 1997: controllers learn informative predictable representations. 2015: controller NN queries world model NN to extract arbitrary algorithmic information.
Regarding what "neither you nor I invented:" your paper claims that system identification with neural nets (NNs) goes back to the early 1990s. In your reply above, however, you seem to agree now with what I wrote: the first papers on this appeared in the 1980s.
Regarding your "main original contributions" (I-IV below):
(I) your "cognitive architectures in which all modules are differentiable and many of them are trainable," with "behavior driven through intrinsic motivation:"
My
differentiable 1990 architecture for online learning & planning
(through "rollouts") [AC90][PLAN2] was the first with "intrinsic motivation" for the controller C to improve the world model M. It was both generative and adversarial; the 2014 GAN you cite is a version thereof.
(II) your "hierarchical architecture for predictive world models that learn representations at multiple levels of abstraction and multiple time scales:"
This was implemented by my 1991 neural history compressor [UN1]. Using "predictive coding," it learns in self-supervised fashion hierarchical internal representations of long sequences of data, to greatly facilitate downstream learning. These representations can be collapsed into a single recurrent NN (RNN), using my NN distillation procedure of 1991 [UN1].
(III) your "self-supervised learning paradigm that produces representations that are simultaneously informative and predictable" in the context of control:
See my 1997 system [AC97][AC99][AC02]. Instead of predicting all details (e.g., pixels) of future inputs [AC90-95], it can ask arbitrary abstract questions with computable answers in what you call "representation space." Two learning, reward-maximizing adversaries called "left brain" and "right brain" play a zero-sum game, trying to surprise each other, occasionally betting on different yes/no outcomes of such computational experiments, until the outcomes
become predictable and boring.
(IV) your predictive differentiable models "for hierarchical planning under uncertainty," where you write: "One question that is left unanswered is how the configurator can learn to decompose a complex task into a sequence of subgoals that can individually be accomplished by the agent. I shall leave this question open for future investigation."
Far from being a matter for future investigation, I published exactly this over three decades ago: a controller NN gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the expected costs of going from start to goal. A differentiable (R)NN-based subgoal generator
also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimizing intermediate subgoals [HRL1].
(V) You also emphasize NNs that "learn to act largely by observation." We addressed this a long time ago, e.g., [PLAN4] (2015). M may be good at predicting some things but uncertain about others. C maximizes its objective function by learning to query (a copy of) M through sequences of self-invented questions (activation patterns) and to interpret the answers (more activation patterns). C may profit from learning to extract any type of algorithmic information from M, e.g., for hierarchical planning and reasoning, exploiting passive observations encoded in M, etc.
You want only four relevant publications? Take five: (I) [AC90], (II) [UN1], (III) [AC02], (IV) [HRL1], (V) [PLAN4]. References and details under https://people.idsia.ch/~juergen/lecunrehash19902022.html
Addendum II (10/4/2022): LeCun on the "5 best ideas 2012-2022": most of them are from my lab, and much older
On 13 Sep 2022, Prof. David Chalmers tweeted: "what are the most important intellectual breakthroughs (new ideas) in AI in the last ten years?" See LeCun's answer in the tweet to the right.
Remarkably, most of this stems from my lab! Most of it is much older than 10 years, and was already mentioned above:
1. "Self-Supervised Learning" with automatic label generation through neural nets (NNs) dates back at least to our work of 1990-91:
(I) self-supervised target generation through predictive coding in a recurrent NN (RNN) hierarchy that learns to compress data sequences across multiple time scales and levels of abstraction [UN][UN0-UN2]. Here an "automatizer" RNN learns pretext tasks of the type "predict the next input," and sends unexpected observations in the incoming data stream as targets to a "chunker" RNN, which learns higher-level regularities and later distills its acquired predictive knowledge back into the automatizer through appropriate training targets [UN1]. This greatly facilitates the previously unsolvable downstream deep learning task of sequence classification. (II) Self-supervised label generation through intrinsic motivation of the GAN type,
where a world model NN learns to predict the consequences of the actions of an adversarial, label-generating, experiment-inventing controller NN [AC90, AC90b][AC20][AC][LEC]. The 1990 paper [AC90] even has "self-supervised" in the title. (However, much older papers do so, too, e.g., [SS78].)
2. "ResNets (not intellectually deep, but useful):"
ResNets are actually just our earlier Highway Nets whose gates are initialized such that they remain always open [HW1-3].
Before Highway Nets entered the scene, feedforward NNs had at most a few tens of layers, e.g., 20-30 layers.
Highway Nets were the first working really deep feedforward neural networks with hundreds of layers.
It breaks my heart that LeCun does not find them intellectually deep :(
On the other hand, they represent the essence of
deep learning,
which is all about the depth of NNs [DL1].
In the 1990s, our
LSTMs
brought essentially unlimited depth to supervised recurrent NNs; in the 2000s, our LSTM-inspired Highway Nets brought it to feedforward NNs. LSTM has become the most cited NN of the 20th century; the Highway Net version called ResNet, the most cited NN of the 21st [MOST].
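The relationship between the two layer types can be made concrete. Below is a schematic highway layer with separate transform and carry gates; the weights and dimensions are arbitrary toy choices. Driving both gate biases high keeps the gates open, and the layer reduces to the residual form y = H(x) + x.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 3
W_h = rng.standard_normal((d, d)) * 0.1   # candidate-transform weights (toy)
W_t = np.zeros((d, d))                    # transform-gate weights (toy: zero)
W_c = np.zeros((d, d))                    # carry-gate weights (toy: zero)

def highway_layer(x, b_t, b_c):
    """Highway layer: y = T(x) * H(x) + C(x) * x,
    with transform gate T and carry gate C."""
    H = np.tanh(W_h @ x)
    T = sigmoid(W_t @ x + b_t)
    C = sigmoid(W_c @ x + b_c)
    return T * H + C * x

x = rng.standard_normal(d)
# with both gate biases pushed high, T and C saturate near 1 and the layer
# reduces to the ResNet-style residual form y = H(x) + x:
y_open = highway_layer(x, b_t=20.0, b_c=20.0)
y_res = np.tanh(W_h @ x) + x
```

Note this sketch uses the uncoupled two-gate variant; the coupled variant ties C = 1 - T, and the gateless residual layer is the special case with both gates pinned open.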
3. "Gating > Attention > Dynamic connection graphs:"
dating back at least to my
Fast Weight Programmers (FWPs) and "Key-Value Memory Networks" of 1991-93 (where "Key-Value" was called "FROM-TO") [FWP0-2,6][FWPMETA1-5][LEC].
In 1993, I introduced
the attention terminology [FWP2] now used
in this context [ATT]. It should be mentioned, however, that the first multiplicative gates in NNs date back to Ivakhnenko & Lapa's deep learning machines of 1965 [DEEP1-2][DL2].
4. "Differentiable memory:" the neural version thereof dates back at least to our
Fast Weight Programmers (FWPs) or "Key-Value Memory Networks" of 1991 ("Key-Value" was called "FROM-TO") [FWP0-2,6][LEC],
separating storage and control as in traditional computers,
but in an end-to-end-differentiable, adaptive, fully neural way (rather than in a hybrid fashion [PDA1-2][DNC]).
5. "Permutation-equivariant modules, e.g. multi-head self-attention" > Transformers:
We published Transformers with linearized self-attention in 1991 [FWP0-1][MOST]; the corresponding attention terminology (learning "internal spotlights of attention") dates back to 1993 [FWP2][ATT]. See this 2022 tweet for the 30-year anniversary of the 1992 journal publication.
6. Roger Borràs replied to LeCun: "In your keynote from NIPS 2016 in BCN (I was there) you said that GANs were the best machine learning idea of the last 10 years. How do you see it now im (sic) perspective?"
The GAN principle was actually introduced in
1990 [AC90] under the moniker
artificial curiosity [AC90b][AC20]. The 2014 generative adversarial NN [GAN1] referred to by LeCun is actually a simple version of my 1990 approach [AC20][R2].
See the
wellknown priority dispute on GANs [T22].
In the context of the above disputes, LeCun also posted the tweet to the right.
I am the "French aviation buff" who touted French aviation pioneers about two decades ago in Nature and Newsweek; see my 2003 letter
"First Pow(d)ered Flight / Plane Truth" [NASC1].
Mocking Ader's plane is like mocking LeCun's convolutional NN variants (CNNs [CNN1-4]) just
because our
awardwinning
superhuman CNN called DanNet [GPUCNN1-3,5-8] [DAN] was three times better than his NN in the famous competition of 2011 [DAN1][GPUCNN1-8][R6].
In ad hominem style,
LeCun stated in the NY Times that "Jürgen ... keeps
claiming credit he doesn't deserve for many, many things" [NYT1], without any justification, without providing a single example [T22]. In conjunction with reference [T22], the present piece makes clear that it is actually LeCun himself who "keeps
claiming credit he doesn't deserve for many, many things," providing numerous examples, plus the references to back them up.
Addendum III (2/9/2023): LeCun's subsequent claims in popular science venues, and my reply
For a while now, LeCun has not followed standard scientific procedure: either defend his work on OpenReview (where he posted his report) with facts against my critique (see Addendum I above), or acquiesce to my arguments and correct his papers.
Instead, he gave an interview to the popular science venue ZDNet [LEC22c], in which my critique was mentioned,
and said: "I'm not claiming any priority on most of what I wrote in that paper, essentially."
He said this although the target of that critique was what he called the "main original contributions" of his paper [LEC22a] on OpenReview; the critique showed that his "main original contributions" were anything but [LEC].
LeCun also claimed about me:
"... the main paper that he says I should cite doesn't have any of the main ideas that I talk about in the paper. He's done this also with GANs and other things, which didn't turn out to be true."
This claim has no justification, no references, and is both false and misleading.
First of all, I listed not just one but several relevant papers [LEC] (including [AC90][UN1][AC02][HRL1][PLAN4])
that include most of what LeCun explicitly calls his "main original contributions" [LEC22a].
The so-called "main paper" (presumably [UN1]) has but one of these: a neural "hierarchical architecture for predictive world models that learn representations at multiple levels of abstraction and multiple time scales." See the main text and Addendum I of the present report.
Any expert in the field can easily validate this in a few minutes.
On the topic of GANs, it is wholly unclear how LeCun can believe my claim "didn't turn out to be true." The claim in question is that my gradient-based generative and adversarial NNs of 1990 [AC90-AC90b] (whose principles have been frequently cited, implemented, and used) were an earlier version of the 2014 GAN, whose paper [GAN1] failed to correctly assign credit [T22]. My peer-reviewed publication [AC20] clearly demonstrated the correctness of this claim; that work remains unchallenged.
LeCun also writes: "I think the arguments that he made on social networks that he basically invented all of this in 1991, as he's done in other cases, is just not the case. I mean, it's very easy to do flagplanting, and to, kindof, write an idea without any experiments, without any theory, just suggest that you could do it this way. But, you know, there's a big difference between just having the idea, and then getting it to work on a toy problem, and then getting it to work on a real problem, and then doing a theory that shows why it works, and then deploying it. There's a whole chain, and his idea of scientific credit is that it's the very first person who just, sortof, you know, had the idea of that, that should get all the credit. And that's ridiculous."
This is a straw man reply designed to distract. First of all, the "main paper" mentioned above [UN1] (1991) on the "hierarchical architecture for predictive world models that learn representations at multiple levels of abstraction and multiple time scales" does include experiments (although back then compute was a million times more expensive than today). Same for LeCun's "selfsupervised learning paradigm that produces representations that are simultaneously informative and predictable" published long after my work on this in the 1990s [AC97][AC99][AC02]([PMax]). Same for LeCun's predictive differentiable models "for hierarchical planning under uncertainty" which are present in my neural subgoal learners of the early 1990s [HRL0-2][PHD]. See again the main text and Addendum I of the present report.
Also, LeCun seems to conflate Science and Engineering. Ideas and understanding are the realm of science; operationalising ideas and rendering them practical is the realm of engineering. Einstein did not build the GPS that is based on his ideas.
Perhaps the most preposterous quote from LeCun, one that has no doubt made all the great scientists from Archimedes to Einstein roll in their graves, is his claim that my
"idea of scientific credit is that it's the very first person who just, sortof, you know, had the idea of that, that should get all the credit." In no universe is this true. As I wrote in a previous critique (one which he knows well) [DLC]:
"the inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it)." Nothing more or less than the standard elementary principles of scientific credit assignment [T22]. LeCun, however, apparently isn't satisfied with credit for popularising the inventions of others; he also wants the inventor's credit.
In another popular science journal LeCun also said in the context of our disputes [LEC22c]: "Sometimes, what's hard is actually to instantiate those ideas into things that work. That's where the difficult part starts very often. I can write 'f' of 'x' equals zero. Absolutely every theoretical statement in all of science can be reduced to this data. If you know what 'f' and 'x' mean, you might have a general idea, but then you need to contribute something concrete and operationalise this idea."
Nobody is debating this straw man argument, which is certainly irrelevant to things such as the extremely close relationship between my 1990 artificial curiosity (AC) [AC90-AC90b] and the 2014 version thereof called GANs [GAN1]. By LeCun's "f(x)=0" reasoning, identical human twins would be no more similar than two randomly chosen persons. Identical twins such as AC (1990) and GAN (2014), however, are extremely similar in very specific ways, as discussed in the unchallenged peer-reviewed publication [AC20]: both have a generative neural net (NN) whose outputs are fed into a predictor NN, which minimises by gradient descent its error, which in turn is maximised by the generative NN. Many later well-known papers compactly summarised this 1990 GAN principle, e.g., 1 year later [AC91]: "Spend reinforcement for a [generative, reinforcement-maximizing] model-building controller whenever there is a mismatch between the expectations of the adaptive world model and reality." Or 20 years later [AC10]: a
"neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" (which are minimized).
Much more specific than some general f(x)=0! Before 1990, there was nothing like AC; other
early adversarial machine learning settings since 1959 [S59][H90] were very different: they
neither involved unsupervised/self-supervised NNs nor were about modeling data nor used gradient descent [AC20].
Furthermore, AC was subsequently frequently cited and implemented and used before 2014. Similar for the other disputes mentioned above.
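For readers who want to see the 1990 adversarial principle in its barest form, here is a toy scalar sketch of my own (hypothetical names, a fixed quadratic "world"; the original work used recurrent networks and a reinforcement learning controller): a predictor performs gradient descent on a squared prediction error while a bounded generator performs gradient ascent on the very same error, its intrinsic reward.

```python
import numpy as np

rng = np.random.default_rng(1)

def world(g):
    """The environment's true response, unknown to the predictor."""
    return g ** 2

wg, wp = 0.5, 0.0   # generator and predictor parameters
lr = 0.01
for step in range(200):
    z = rng.standard_normal()       # generator input
    g = np.tanh(wg * z)             # bounded generator output
    pred = wp * g                   # predictor's guess of world(g)
    diff = pred - world(g)
    # Predictor: gradient DESCENT on the squared error (pred - world(g))**2.
    wp -= lr * 2 * diff * g
    # Generator: gradient ASCENT on the very same error, via the chain
    # rule through g = tanh(wg * z).
    wg += lr * 2 * diff * (wp - 2 * g) * (1 - g ** 2) * z
```

The two update lines are the whole minimax game: one parameter descends an error surface that the other simultaneously climbs, which is the structural core shared by AC (1990) and GAN (2014).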
LeCun's claims about "flag-planting" are once more designed to evade the real issue: that some of his work has failed to credit those who invented what he described [PLAG1][FAKE][FAKE2]. The Code of Ethics and Professional Conduct by ACM (the organisation handing out Turing Awards) [ACM18] states that computing professionals should "credit the creators of ideas, inventions, work, and artifacts, and respect copyrights, patents, trade secrets, license agreements, and other methods of protecting authors' works." Much of LeCun's work does not do this [T22]. And with this interview [LEC22c] he is doubling down on an untenable position that's incompatible with the basic universally accepted rules of scientific integrity [T22].
References
[25y97]
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the
most cited neural network (NN) of the 20th century
(and basis of the most cited NN of the 21st).
2. First paper on physical, philosophical and theological consequences of the simplest and fastest way of computing
all possible metaverses
(= computable universes).
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Journal paper on
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
6. First paper on reinforcement learning to play soccer: start of a series.
7. Journal papers on flat minima & low-complexity NNs that generalize well.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
9. Journal paper on probabilistic incremental program evolution.
[AC]
J. Schmidhuber (AI Blog, 2021). 3 decades of artificial curiosity & creativity. Our artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.
[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in nonstationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
The first paper on long-term planning with self-supervised reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
where a generator NN is fighting a predictor NN in a minimax game
(more).
[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
More.
[AC91]
J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Inst. f. Informatik, Tech. Univ. Munich, April 1991.
PDF.
[AC91b]
J. Schmidhuber.
Curious modelbuilding control systems.
In Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
PDF.
[AC95]
J. Storck, S. Hochreiter, and J. Schmidhuber. Reinforcement-driven information acquisition in non-deterministic environments. In Proc. ICANN'95, vol. 2, pages 159-164. EC2 & CIE, Paris, 1995. PDF.
[AC97]
J. Schmidhuber.
What's interesting?
Technical Report IDSIA-35-97, IDSIA, July 1997.
Focus
on automatic creation of predictable internal
abstractions of complex spatiotemporal events:
two competing, intrinsically motivated agents agree on essentially
arbitrary algorithmic experiments and bet
on their possibly surprising (not yet predictable)
outcomes in zero-sum games,
each agent potentially profiting from outwitting / surprising
the other by inventing experimental protocols where both
modules disagree on the predicted outcome. The focus is on exploring
the space of general algorithms (as opposed to
traditional simple mappings from inputs to
outputs); the
general system
focuses on the interesting
things by losing interest in both predictable and
unpredictable aspects of the world. Unlike our previous
systems with intrinsic motivation,^{[AC90-AC95]} the system also
takes into account
the computational cost of learning new skills, learning when to learn and what to learn.
See later publications.^{[AC99][AC02]}
[AC98]
M. Wiering and J. Schmidhuber.
Efficient model-based exploration.
In R. Pfeiffer, B. Blumberg, J. Meyer, S. W. Wilson, eds.,
From Animals to Animats 5: Proceedings
of the Fifth International Conference on Simulation of Adaptive
Behavior, p. 223-228, MIT Press, 1998.
[AC98b]
M. Wiering and J. Schmidhuber.
Learning exploration policies with models.
In Proc. CONALD, 1998.
[AC99]
J. Schmidhuber.
Artificial Curiosity Based on Discovering Novel Algorithmic
Predictability Through Coevolution.
In P. Angeline, Z. Michalewicz, M. Schoenauer, X. Yao, Z.
Zalzala, eds., Congress on Evolutionary Computation, p. 1612-1618,
IEEE Press, Piscataway, NJ, 1999.
[AC02]
J. Schmidhuber.
Exploring the Predictable.
In Ghosh, S. Tsutsui, eds., Advances in Evolutionary Computing,
p. 579-612, Springer, 2002.
PDF.
[AC05]
J. Schmidhuber.
Self-Motivated Development Through
Rewards for Predictor Errors / Improvements.
Developmental Robotics 2005 AAAI Spring Symposium,
March 21-23, 2005, Stanford University, CA.
PDF.
[AC06]
J. Schmidhuber.
Developmental Robotics,
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2):173-187, 2006.
PDF.
[AC07]
J. Schmidhuber.
Simple Algorithmic Principles of Discovery, Subjective Beauty,
Selective Attention, Curiosity & Creativity.
In V. Corruble, M. Takeda, E. Suzuki, eds.,
Proc. 10th Intl. Conf. on Discovery Science (DS 2007)
p. 26-38, LNAI 4755, Springer, 2007.
Also in M. Hutter, R. A. Servedio, E. Takimoto, eds.,
Proc. 18th Intl. Conf. on Algorithmic Learning Theory (ALT 2007)
p. 32, LNAI 4754, Springer, 2007.
(Joint invited lecture for DS 2007 and ALT 2007, Sendai, Japan, 2007.)
Preprint: arxiv:0709.0674.
PDF.
Curiosity as the drive to improve the compression
of the lifelong sensory input stream: interestingness as
the first derivative of subjective "beauty" or compressibility.
[AC08]
J. Schmidhuber. Driven by Compression Progress. In Proc.
Knowledge-Based Intelligent Information and
Engineering Systems KES 2008,
Lecture Notes in Computer Science LNCS 5177, p 11, Springer, 2008.
(Abstract of invited keynote talk.)
PDF.
[AC09]
J. Schmidhuber. Art & science as byproducts of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.
PDF. (More on
artificial scientists and artists.)
[AC09a]
J. Schmidhuber.
Driven by Compression Progress: A Simple Principle Explains Essential Aspects of Subjective Beauty, Novelty, Surprise, Interestingness, Attention, Curiosity, Creativity, Art, Science, Music, Jokes.
Based on keynote talk for KES 2008 (below) and joint invited
lecture for ALT 2007 / DS 2007 (below). Short version: ref 17 below. Long version in G. Pezzulo, M. V. Butz, O. Sigaud, G. Baldassarre, eds.: Anticipatory Behavior in Adaptive Learning Systems, from Sensorimotor to Higher-level Cognitive Capabilities, Springer, LNAI, 2009.
Preprint (2008, revised 2009): arXiv:0812.4360.
PDF (Dec 2008).
PDF (April 2009).
[AC09b]
J. Schmidhuber.
Simple Algorithmic Theory of Subjective Beauty, Novelty, Surprise,
Interestingness, Attention, Curiosity, Creativity, Art,
Science, Music, Jokes. Journal of SICE, 48(1):21-32, 2009.
PDF.
[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
[AC10a]
J. Schmidhuber. Artificial Scientists & Artists Based on the Formal Theory of Creativity.
In
Proceedings of the Third Conference on Artificial General Intelligence (AGI2010), Lugano, Switzerland.
PDF.
[AC11]
Sun Yi, F. Gomez, J. Schmidhuber.
Planning to Be Surprised: Optimal Bayesian Exploration in Dynamic Environments.
In Proc. Fourth Conference on Artificial General Intelligence (AGI-11),
Google, Mountain View, California, 2011.
PDF.
[AC11a]
V. Graziano, T. Glasmachers, T. Schaul, L. Pape, G. Cuccu, J. Leitner, J. Schmidhuber. Artificial Curiosity for Autonomous Space Exploration. Acta Futura 4:41-51, 2011 (DOI: 10.2420/AF04.2011.41). PDF.
[AC11b]
G. Cuccu, M. Luciw, J. Schmidhuber, F. Gomez.
Intrinsically Motivated Evolutionary Search for Vision-Based Reinforcement Learning.
In Proc. Joint IEEE International Conference on Development and Learning (ICDL) and on Epigenetic Robotics (ICDL-EpiRob 2011), Frankfurt, 2011.
PDF.
[AC11c]
M. Luciw, V. Graziano, M. Ring, J. Schmidhuber.
Artificial Curiosity with Planning for Autonomous Visual and Perceptual Development.
In Proc. Joint IEEE International Conference on Development and Learning (ICDL) and on Epigenetic Robotics (ICDL-EpiRob 2011), Frankfurt, 2011.
PDF.
[AC11d]
T. Schaul, L. Pape, T. Glasmachers, V. Graziano J. Schmidhuber.
Coherence Progress: A Measure of Interestingness Based on Fixed Compressors.
In Proc. Fourth Conference on Artificial General Intelligence (AGI-11),
Google, Mountain View, California, 2011.
PDF.
[AC11e]
T. Schaul, Yi Sun, D. Wierstra, F. Gomez, J. Schmidhuber. Curiosity-Driven Optimization. IEEE Congress on Evolutionary Computation (CEC-2011), 2011.
PDF.
[AC11f]
H. Ngo, M. Ring, J. Schmidhuber.
Curiosity Drive based on Compression Progress for Learning Environment Regularities.
In Proc. Joint IEEE International Conference on Development and Learning (ICDL) and on Epigenetic Robotics (ICDL-EpiRob 2011), Frankfurt, 2011.
[AC12]
L. Pape, C. M. Oddo, M. Controzzi, C. Cipriani, A. Foerster, M. C. Carrozza, J. Schmidhuber.
Learning tactile skills through curious exploration.
Frontiers in Neurorobotics 6:6, 2012, doi: 10.3389/fnbot.2012.00006
[AC12a]
H. Ngo, M. Luciw, A. Foerster, J. Schmidhuber.
Learning Skills from Play: Artificial Curiosity on a Katana Robot Arm.
Proc. IJCNN 2012.
PDF.
Video.
[AC12b]
V. R. Kompella, M. Luciw, M. Stollenga, L. Pape, J. Schmidhuber.
Autonomous Learning of Abstractions using Curiosity-Driven Modular Incremental Slow Feature Analysis.
Proc. IEEE Conference on Development and Learning / EpiRob 2012
(ICDL-EpiRob'12), San Diego, 2012.
[AC12c]
J. Schmidhuber. Maximizing Fun By Creating Data With Easily Reducible Subjective Complexity.
In G. Baldassarre and M. Mirolli (eds.), Roadmap for Intrinsically Motivated Learning.
Springer, 2012.
[AC20]
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p 58-66, 2020.
Preprint arXiv/1906.04493.
[ATT] J. Schmidhuber (AI Blog, 2020). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention for foveas (1990) and soft attention in form of Transformers with linearized self-attention (1991-93).^{[FWP]} Today, both types are very popular.
[ATT0] J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990.
PDF.
More.
[ATT2]
J. Schmidhuber.
Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton,
editors, Proc. of the 1990 Connectionist Models Summer School, pages
52-61. San Mateo, CA: Morgan Kaufmann, 1990.
PS. (PDF.)
[BIND]
K. Greff, S. Van Steenkiste, J. Schmidhuber. On the binding problem in artificial neural networks.
Preprint: arXiv:2012.05208 (2020).
[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
Precursor of modern backpropagation.^{[BP1-4]}
[BPB]
A. E. Bryson. A gradient method for optimizing multistage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1):30-45, 1962.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
PDF.
See also BIT 16, 146-160, 1976.
Link.
The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
First application of backpropagation^{[BP1]} to NNs (concretizing thoughts in his 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2020).
Who invented backpropagation?
More.^{[DL2]}
[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[BP6]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
See Section 3.1 on using gradient descent for learning in multilayer networks.
[CNN1] K. Fukushima: Neural network model for a mechanism of pattern
recognition unaffected by shift in position—Neocognitron.
Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CNN1+]. More in Scholarpedia.
[CNN1+]
K. Fukushima: Neocognitron: a selforganizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980).
Link.
[CNN1a] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. First application of backpropagation^{[BP1][BP2]} and weight-sharing
to a convolutional architecture.
[CNN1b] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CNN1a].
[CNN1c] Bower Award Ceremony 2021:
Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021.
[CNN2] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel: Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
PDF.
[CNN3a]
K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990.
An NN with convolutions using Max-Pooling instead of Fukushima's
Spatial Averaging.^{[CNN1]}
[CNN3] Weng, J.,
Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3D objects from 2D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A CNN whose downsampling layers use Max-Pooling
(which has become very popular) instead of Fukushima's
Spatial Averaging.^{[CNN1]}
[CNN4] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007
[CO1]
J. Koutnik, F. Gomez, J. Schmidhuber (2010). Evolving Neural Networks in Compressed Weight Space. Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO-2010), Portland, 2010.
PDF.
[CO2]
J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez.
Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning.
Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), Amsterdam, July 2013.
PDF.
The first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning, without any unsupervised pre-training.
[CO3]
R. K. Srivastava, J. Schmidhuber, F. Gomez.
Generalized Compressed Network Search.
Proc. GECCO 2012.
PDF.
[CON16]
J. Carmichael (2016).
Artificial Intelligence Gained Consciousness in 1991.
Why A.I. pioneer Jürgen Schmidhuber is convinced the ultimate breakthrough already happened.
Inverse, 2016. Link.
[DAN]
J. Schmidhuber (AI Blog, 2021).
10year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after Schmidhuber's outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.^{[DAN1]} Now everybody is using this approach.
[DAN1]
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
At the IJCNN 2011 computer vision competition in Silicon Valley, the artificial neural network called DanNet performed twice as well as humans, three times better than the closest artificial competitor (from LeCun's team), and six times better than the best non-neural method.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020; revised 2021). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.
[DEEP1a]
Ivakhnenko, Alexey Grigorevich. The group method of data handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.
[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed^{[DLC1-2]} "Deep Learning Conspiracy" (Nature 521 p 436).
The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it).
[DLC1]
Y. LeCun. IEEE Spectrum Interview by L. Gomes, Feb 2015.
Quote: "A lot of us involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua Bengio, and myself—the so-called 'Deep Learning conspiracy' ..."
[DLC2]
M. Bergen, K. Wagner (2015).
Welcome to the AI Conspiracy: The 'Canadian Mafia' Behind Tech's Latest Craze. Vox recode, 15 July 2015.
Quote: "... referred to themselves as the 'deep learning conspiracy.' Others called them the 'Canadian Mafia.'"
[DNC]
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis.
Hybrid computing using a neural network with dynamic external memory.
Nature, 538:7626, p 471, 2016.
This work of DeepMind did not cite the original work of the early 1990s on
neural networks learning to control dynamic external memories.^{[PDA1-2][FWP0-1]}
[DYNA90]
R. S. Sutton (1990). Integrated Architectures for Learning, Planning, and Reacting Based on Approximating Dynamic Programming. In Machine Learning: Proceedings of the Seventh International Conference, Austin, Texas, June 21-23,
1990, p 216-224.
[DYNA91]
R. S. Sutton (1991). Dyna, an integrated architecture for learning, planning, and reacting. ACM SIGART Bulletin 2.4 (1991): 160-163.
[FAKE]
H. Hopf, A. Krief, G. Mehta, S. A. Matlin.
Fake science and the knowledge crisis: ignorance can be fatal.
Royal Society Open Science, May 2019.
Quote: "Scientists must be willing to speak out when they see false information being presented in social media, traditional print or broadcast press" and "must speak out against false information and fake science in circulation
and forcefully contradict public figures who promote it."
[FAKE2]
L. Stenflo.
Intelligent plagiarists are the most dangerous. Nature, vol. 427, p. 777 (Feb 2004).
Quote: "What is worse, in my opinion, ..., are cases where scientists rewrite previous findings in different words, purposely hiding the sources of their ideas, and then during subsequent years forcefully claim that they have discovered new phenomena."
[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie,
Max-Planck-Institut f. Biophysik und Chemie, Goettingen, 1981.
First paper on fast weights or dynamic links.
[FASTa]
J. A. Feldman. Dynamic connections in neural networks.
Biological Cybernetics, 46(1):27-39, 1982.
2nd paper on fast weights.
[FWP]
J. Schmidhuber (AI Blog, 26 March 2021, updated 2022).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
30year anniversary of a now popular
alternative^{[FWP0-1]} to recurrent NNs.
A slow feedforward NN learns by gradient descent to program the changes of
the fast weights^{[FAST,FASTa]} of
another NN, separating memory and control like in traditional computers.
Such Fast Weight Programmers^{[FWP0-6,FWPMETA1-8]} can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns^{[FWP0-1]}
(now often called keys and values for self-attention^{[TR1-6]}).
The similar Transformers^{[TR1-2]} combine this with projections
and softmax and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
Transformers with linearized self-attention^{[TR5-6]}
which are formally equivalent to Schmidhuber's 1991 outer productbased Fast Weight Programmers (apart from normalization).
In 1993, he introduced
the attention terminology^{[FWP2]} now used
in this context,^{[ATT]} and
extended the approach to
RNNs that program themselves.
See tweet of 2022.
[FWP0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
PDF.
First paper on fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as a "Transformer with linearized self-attention."^{[FWP]}
[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0].
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
PDF.
First recurrent NNbased fast weight programmer using outer products, introducing the terminology of learning "internal spotlights of attention."
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for OnTheFly Neural Program Generation. Workshop on MetaLearning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143. PDF.
[FWP4d]
Y. Tang, D. Nguyen, D. Ha (2020).
Neuroevolution of Self-Interpretable Agents.
Preprint: arXiv:2003.08165.
[FWP5]
F. J. Gomez and J. Schmidhuber.
Evolving modular fastweight networks for control.
In W. Duch et al. (Eds.):
Proc. ICANN'05,
LNCS 3697, pp. 383-389, Springer-Verlag Berlin Heidelberg, 2005.
PDF.
HTML overview.
Reinforcement-learning fast weight programmer.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
Advances in Neural Information Processing Systems (NeurIPS), 2021.
Preprint: arXiv:2106.06295 . See also the
Blog Post.
[FWPMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
First recurrent fast weight programmer that can learn
to run a learning algorithm or weight change algorithm on itself.
[FWPMETA2] J. Schmidhuber. A self-referential weight matrix.
In Proceedings of the International Conference on Artificial
Neural Networks, Amsterdam, pages 446-451. Springer, 1993.
PDF.
[FWPMETA3] J. Schmidhuber.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
Brighton, pages 191-195. IEE, 1993.
[FWPMETA4]
J. Schmidhuber.
A neural network that embeds its own meta-levels.
In Proc. of the International Conference on Neural Networks '93,
San Francisco. IEEE, 1993.
[FWPMETA5]
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
A recurrent neural net with a self-referential, self-reading, self-modifying weight matrix
can be found here.
[FWPMETA6]
L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It.
Advances in Neural Information Processing Systems (NeurIPS), 2021.
Preprint: arXiv:2012.14905.
[FWPMETA7]
I. Schlag, T. Munkhdalai, J. Schmidhuber.
Learning Associative Inference Using Fast Weight Memory.
ICLR 2021.
Report arXiv:2011.07831 [cs.AI], 2020.
[FWPMETA8]
K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
A Modern Self-Referential Weight Matrix That Learns to Modify Itself.
International Conference on Machine Learning (ICML), 2022.
Preprint: arXiv:2202.05780.
[FWPMETA9]
L. Kirsch and J. Schmidhuber.
Self-Referential Meta Learning.
First Conference on Automated Machine Learning (Late-Breaking Workshop), 2022.
[FWPMETA10]
L. Kirsch, S. Flennerhag, H. van Hasselt, A. Friesen, J. Oh, Y. Chen.
Introducing symmetries to black box meta reinforcement learning.
AAAI 2022, vol. 36(7), pp. 7207-7210, 2022.
[GAN1]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio.
Generative adversarial nets. NIPS 2014, pp. 2672-2680, Dec 2014.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990^{[AC][AC90,AC90b][AC20][R2][T22]} (also containing wrong claims about Schmidhuber's adversarial NNs for Predictability Minimization^{[PM0-2][AC20][T22]}).
[GGP]
F. Faccio, V. Herrmann, A. Ramesh, L. Kirsch, J. Schmidhuber.
GoalConditioned Generators of Deep Policies.
Preprint arXiv/2207.01570, 4 July 2022 (submitted in May 2022).
[GM3]
J. Schmidhuber (2003).
Gödel Machines: Self-Referential Universal Problem Solvers Making Provably Optimal Self-Improvements.
Preprint
arXiv:cs/0309048 (2003).
More.
[GM6]
J. Schmidhuber (2006).
Gödel machines:
Fully Self-Referential Optimal Universal Self-Improvers.
In B. Goertzel and C. Pennachin, eds.: Artificial
General Intelligence, pp. 199-226, 2006.
PDF.
[GM9]
J. Schmidhuber (2009).
Ultimate Cognition à la Gödel.
Cognitive Computation 1(2):177-193, 2009. PDF.
More.
[GPE]
F. Faccio, A. Ramesh, V. Herrmann, J. Harb, J. Schmidhuber.
General Policy Evaluation and Improvement by Learning to Identify Few But Crucial States.
Preprint arXiv/2207.01566, 4 July 2022 (submitted in May 2022).
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.
[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
PDF.
HTML overview.
First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.^{[DAN1]} This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, pp. 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012.
PDF.
The paper describes AlexNet, which is similar to the earlier
DanNet,^{[DAN,DAN1][R6]}
which was the first pure deep CNN
to win computer vision contests in 2011.^{[GPUCNN2-3,5]} AlexNet and VGG Net^{[GPUCNN9]} followed in 2012-2014 (using stochastic delta rule/dropout^{[Drop13]} and ReLUs^{[RELU1]} without citation).
[GPUCNN5]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet won 4 of them in a row before the similar AlexNet/VGG Net and the Resnet (a Highway Net with open gates) joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI11), Google, Mountain View, California, 2011.
PDF.
[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
First deep learner to win a contest on object detection in large images—
first deep learner to win a medical imaging contest (2012). Link.
How the Swiss AI Lab IDSIA used GPUbased CNNs to win the
ICPR 2012 Contest on Mitosis Detection
and the MICCAI 2013 Grand Challenge.
[GUESS]
J. Schmidhuber and S. Hochreiter.
Guessing can outperform many long time lag algorithms.
Technical Note IDSIA-19-96, IDSIA, May 1996.
[H90]
W. D. Hillis.
Coevolving parasites improve simulated evolution as an optimization
procedure.
Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.
[HAB]
J. Schmidhuber.
Netzwerkarchitekturen, Zielfunktionen und Kettenregel
(Network architectures, objective functions, and chain rule).
Habilitation (postdoctoral thesis, qualification for a
tenure professorship),
Institut für Informatik, Technische Universität
München, 1993.
PDF.
HTML.
[HRLW]
C. Watkins (1989). Learning from delayed rewards.
[HRL0]
J. Schmidhuber.
Towards compositional learning with dynamic neural networks.
Technical Report FKI-129-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
An RL machine gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals.
The system is learning action plans
at multiple levels of abstraction and multiple time scales and solves what LeCun called an "open problem" in 2022.^{[LEC]}
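The subgoal-generation idea annotated above can be illustrated with a toy example. This is not the original system: [HRL0] learns both components as NNs by gradient descent, whereas here the evaluator is a hand-made quadratic cost stand-in and the generator does a simple grid search, purely to show why intermediate subgoals reduce estimated total cost:

```python
def evaluator(a, b):
    # Assumed cost model: quadratic in distance, so one long hop is
    # disproportionately expensive and a midpoint subgoal pays off.
    return (a - b) ** 2

def generate_subgoal(start, goal, candidates):
    """Pick the subgoal s minimizing cost(start, s) + cost(s, goal)."""
    return min(candidates, key=lambda s: evaluator(start, s) + evaluator(s, goal))

candidates = [i / 10 for i in range(11)]      # candidate subgoals on a 1-D grid
sub = generate_subgoal(0.0, 1.0, candidates)
print(sub)  # 0.5: splitting the journey halves the quadratic cost
```

Applying the generator recursively to (start, sub) and (sub, goal) yields subgoal sequences at multiple time scales, in the spirit of the 1990-1991 papers.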
[HRL1]
J. Schmidhuber. Learning to generate subgoals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
[HRL2]
J. Schmidhuber and R. Wahnsiedler.
Planning simple trajectories using neural subgoal generators.
In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc.
of the 2nd International Conference on Simulation of Adaptive Behavior,
pages 196-202. MIT Press, 1992.
PDF.
[HRL4]
M. Wiering and J. Schmidhuber.
HQ-Learning.
Adaptive Behavior 6(2):219-246, 1997.
PDF.
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (July 2015). Also at NIPS 2015. The first working very deep feedforward nets with over 100 layers (previous NNs had at most a few tens of layers). Let g, t, h denote nonlinear differentiable functions. Each non-input layer of a highway net computes g(x)x + t(x)h(x), where x is the data from the previous layer. (Like LSTM with forget gates^{[LSTM2]} for RNNs.) Resnets^{[HW2]} are a version of this where the gates are always open: g(x)=t(x)=const=1.
Highway Nets perform roughly as well as ResNets^{[HW2]} on ImageNet.^{[HW3]} Variants of highway gates are used for certain algorithmic tasks, where the simpler residual layers do not work as well.^{[NDR]}
More.
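The layer formula g(x)x + t(x)h(x) from the annotation above can be sketched directly. This is an illustrative stand-in, not the published implementation: the gates g, t and transform h are learned nonlinear functions in the real network, and this sketch uses the common coupled-gate variant g = 1 - t with fixed scalar weights:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, w_h, w_t):
    """One highway layer: y = g(x)*x + t(x)*h(x), elementwise.

    Here h is a tanh transform, t a sigmoid transform gate, and
    g = 1 - t the carry gate (coupled-gate variant of [HW1]).
    """
    h = [math.tanh(w_h * xi) for xi in x]
    t = [sigmoid(w_t * xi) for xi in x]
    return [(1 - ti) * xi + ti * hi for xi, ti, hi in zip(x, t, h)]

def residual_layer(x, w_h):
    """ResNet special case: gates always open, g(x) = t(x) = 1."""
    return [xi + math.tanh(w_h * xi) for xi in x]

x = [0.5, -1.0, 2.0]
print(highway_layer(x, w_h=1.0, w_t=0.0))  # half carry, half transform
print(residual_layer(x, w_h=1.0))          # plain additive skip connection
```

The gates let gradient descent decide, per unit, how much raw input to carry through, which is what makes nets with 100+ such layers trainable.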
[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 1011, 2015.
Link.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015). Residual nets are a version of Highway Nets^{[HW1]}
where the gates are always open:
g(x)=1 (a typical highway net initialization) and t(x)=1.
More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arxiv:1612.07771 (2016). Also at ICLR 2017.
[IMAX]
S. Becker, G. E. Hinton. Spatial coherence as an internal teacher for a neural network. TR CRG-TR-89-7, Dept. of CS, U. Toronto, 1989.
[KO0]
J. Schmidhuber.
Discovering problem solutions with low Kolmogorov complexity and
high generalization capability.
Technical Report FKI-194-94, Fakultät für Informatik,
Technische Universität München, 1994.
PDF.
[KO1] J. Schmidhuber.
Discovering solutions with low Kolmogorov complexity
and high generalization capability.
In A. Prieditis and S. Russell, editors, Machine Learning:
Proceedings of the Twelfth International Conference (ICML 1995),
pages 488-496. Morgan
Kaufmann Publishers, San Francisco, CA, 1995.
PDF.
[KO2]
J. Schmidhuber.
Discovering neural nets with low Kolmogorov complexity
and high generalization capability.
Neural Networks, 10(5):857-873, 1997.
PDF.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
More.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's team published most of what LeCun calls his "main original contributions:" neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc. This was also discussed on Hacker News, reddit, and in the media.
[LEC22a]
Y. LeCun (27 June 2022).
A Path Towards Autonomous Machine Intelligence.
OpenReview Archive.
Link. See critique [LEC].
[LEC22b]
M. Heikkilä, W. D. Heaven.
Yann LeCun has a bold new vision for the future of AI.
MIT Technology Review, 24 June 2022.
Link. See critique [LEC].
[LEC22c]
ZDNet, 2022. Meta's AI guru LeCun: Most of today's AI approaches will never lead to true intelligence.
Here LeCun makes wrong and misleading claims about Schmidhuber's work, as discussed in Addendum III of [LEC].
[LEC22d]
Analytics India, Dec 2022. Angels & Demons of AI.
More of LeCun's misleading statements about the dispute with Schmidhuber, as discussed in Addendum III of [LEC].
[META]
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
first publication on meta-learning machines that learn to learn (1987).
For its cover Schmidhuber drew a robot that bootstraps itself.
1992: gradient descent-based neural meta-learning. 1994: Meta-Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and intrinsic motivation. 2002: asymptotically optimal meta-learning for curriculum learning. 2003: mathematically optimal Gödel Machine. 2020: new stuff!
[MIR] J. Schmidhuber (AI Blog, Oct 2019, updated 2021, 2022). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744, 2020. The deep learning neural networks of Schmidhuber's team have revolutionised pattern recognition and machine learning, and are now heavily used in academia and industry. In 2020-21, we celebrate that many of the basic ideas behind this revolution were published within fewer than 12 months in the "Annus Mirabilis" 1990-1991 at TU Munich.
[MOST]
J. Schmidhuber (AI Blog, 2021). The most cited neural networks all build on work done in my labs. Foundations of the most popular NNs originated in Schmidhuber's labs at TU Munich and IDSIA. (1) Long Short-Term Memory (LSTM), (2) ResNet (which is the earlier Highway Net with open gates), (3) AlexNet and VGG Net (both building on the similar earlier DanNet: the first deep convolutional NN to win
image recognition competitions),
(4) Generative Adversarial Networks (an instance of the much earlier
Adversarial Artificial Curiosity), and (5) variants of Transformers (Transformers with linearized self-attention are formally equivalent to the much earlier Fast Weight Programmers).
Most of this started with the
Annus Mirabilis of 1990-1991.^{[MIR]}
[MUN87]
P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165-176, 1987.
[NASC1] J. Schmidhuber. First Pow(d)ered flight / plane truth. Correspondence, Nature 421, p. 689, Feb 2003.
[NDR]
R. Csordas, K. Irie, J. Schmidhuber.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
[NGU89]
D. Nguyen and B. Widrow. The truck backer-upper: An example of self-learning in neural networks. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 357-364, 1989.
[NYT1]
NY Times article
by J. Markoff, Nov. 27, 2016: When A.I. Matures, It May Call Jürgen Schmidhuber 'Dad'
[OBJ1] K. Greff, A. Rasmus, M. Berglund, T. Hao, H. Valpola, J. Schmidhuber (2016). Tagger: Deep unsupervised perceptual grouping. NIPS 2016, pp. 4484-4492.
Preprint arXiv/1606.06724.
[OBJ2] K. Greff, S. van Steenkiste, J. Schmidhuber (2017). Neural expectation maximization. NIPS 2017, pp. 6691-6701.
Preprint arXiv/1708.03498.
[OBJ3] S. van Steenkiste, M. Chang, K. Greff, J. Schmidhuber (2018). Relational neural expectation maximization: Unsupervised discovery of objects and their interactions. ICLR 2018.
Preprint arXiv/1802.10353.
[OBJ4]
A. Stanic, S. van Steenkiste, J. Schmidhuber (2021). Hierarchical Relational Inference. AAAI 2021.
Preprint arXiv/2010.03635.
[OBJ5]
A. Gopalakrishnan, S. van Steenkiste, J. Schmidhuber (2020). Unsupervised Object Keypoint Learning using Local Spatial Predictability.
Preprint arXiv/2011.12930.
[OOPS1]
J. Schmidhuber. BiasOptimal Incremental Problem Solving.
In S. Becker, S. Thrun, K. Obermayer, eds.,
Advances in Neural Information Processing Systems 15, N(eur)IPS'15, MIT Press, Cambridge MA, pp. 1571-1578, 2003.
PDF
[OOPS2]
J. Schmidhuber.
Optimal Ordered Problem Solver.
Machine Learning, 54, 211-254, 2004.
PDF.
HTML.
HTML overview.
Download
OOPS source code in crystalline format.
[OOPS3]
Schmidhuber, J., Zhumatiy, V. and Gagliolo, M. BiasOptimal
Incremental Learning of Control Sequences for Virtual Robots. In Groen,
F., Amato, N., Bonarini, A., Yoshida, E., and Kroese, B., editors:
Proceedings of the 8th conference
on Intelligent Autonomous Systems, IAS8, Amsterdam,
The Netherlands, pp. 658-665, 2004.
PDF.
[PDA1]
G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context-Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol. II, pp. 649-653, Princeton University, Princeton, NJ, 1990.
[PDA2]
M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Proc. N(eur)IPS 1993.
[PBVF]
F. Faccio, L. Kirsch, J. Schmidhuber.
Parameterbased value functions.
Preprint arXiv/2006.09226, 2020.
[PEVN]
J. Harb, T. Schaul, D. Precup, P. Bacon.
Policy Evaluation Networks.
Preprint arXiv/2002.11833, 2020.
[PLAG1]
Oxford's guide to types of plagiarism (2021).
Quote: "Plagiarism may be intentional or reckless, or unintentional."
Link.
Local copy.
[PLAN]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs,
the GAN principle (widely used today). Agents with adaptive recurrent world models even suggest a simple explanation of consciousness & self-awareness.
[PLAN2]
J. Schmidhuber.
An online algorithm for dynamic reinforcement learning and planning
in reactive environments.
In Proc. IEEE/INNS International Joint Conference on Neural
Networks, San Diego, volume 2, pages 253-258, June 17-21, 1990.
Based on [AC90].
[PLAN3]
J. Schmidhuber.
Reinforcement learning in Markovian and nonMarkovian environments.
In R. P. Lippman, J. E. Moody, and D. S. Touretzky, editors,
Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San
Mateo, CA: Morgan Kaufmann, 1991.
PDF.
Partially based on [AC90].
[PLAN4]
J. Schmidhuber.
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models.
Report arXiv:1210.0118 [cs.AI], 2015.
[PLAN5]
J. Schmidhuber. One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
[PLAN6]
D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.)
Preprint: arXiv:1809.01999.
Github: World Models.
[PHD]
J. Schmidhuber.
Dynamische neuronale Netze und das fundamentale raumzeitliche
Lernproblem
(Dynamic neural nets and the fundamental spatiotemporal
credit assignment problem).
Dissertation,
Institut für Informatik, Technische
Universität München, 1990.
PDF.
HTML.
[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF.
More.
[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimization produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
PDF. More.
[PMax0]
J. Schmidhuber and D. Prelinger. Discovering predictable classifications. Technical Report CU-CS-626-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
[PMax]
J. Schmidhuber and D. Prelinger.
Discovering
predictable classifications.
Neural Computation, 5(4):625-635, 1993.
PDF.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[RPG]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.
[RPG07]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs
with Recurrent Policy Gradients.
Intl. Conf. on Artificial Neural Networks ICANN'07,
2007.
PDF.
[PP] J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
ArXiv preprint (2011):
arXiv:1112.5309 [cs.AI].
[PPa]
R. K. Srivastava, B. R. Steunebrink, M. Stollenga, J. Schmidhuber.
Continually Adding Self-Invented
Problems to the Repertoire: First
Experiments with POWERPLAY.
Proc. IEEE Conference on Development and Learning / EpiRob 2012
(ICDL-EpiRob'12), San Diego, 2012.
PDF.
[PP1] R. K. Srivastava, B. Steunebrink, J. Schmidhuber.
First Experiments with PowerPlay.
Neural Networks, 2013.
ArXiv preprint (2012):
arXiv:1210.8385 [cs.AI].
[PP2] V. Kompella, M. Stollenga, M. Luciw, J. Schmidhuber. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artificial Intelligence, 2015.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[S59]
A. L. Samuel.
Some studies in machine learning using the game of checkers.
IBM Journal on Research and Development, 3:210-229, 1959.
[SNT]
J. Schmidhuber, S. Heil (1996).
Sequential neural text compression.
IEEE Trans. Neural Networks, 1996.
PDF.
An earlier version appeared at NIPS 1995.
Much later this was called a probabilistic language model.^{[T22]}
[SS78]
S. K. Pal, A. K. Datta, D. D. Majumder.
Computer recognition of vowel sounds using a self-supervised learning algorithm.
J. Anatomical Soc. India 6:117-123, 1978.
[SYM1]
P. Smolensky (1988). On the proper treatment of connectionism. Behavioral and Brain Sciences, 11(1), 1-23. doi:10.1017/S0140525X00052432
[SYM2]
Mozer, M. C. (1990). The perception of multiple objects: A connectionist approach. Cambridge, MA: MIT Press.
[SYM3]
C. McMillan, M. C. Mozer, P. Smolensky. Rule induction through integrated symbolic and subsymbolic processing. Advances in Neural Information Processing Systems 4 (1991).
[SWA22]
J. Swan, E. Nivel, N. Kant, J. Hedges, T. Atkinson, B. Steunebrink (2022).
Work on Command: The Case for Generality. In: The Road to General Intelligence. Studies in Computational Intelligence, vol 1049. Springer, Cham. https://doi.org/10.1007/978-3-031-08020-3_6.
[T22] J. Schmidhuber (AI Blog, 2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2021-2022.
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.
This paper introduced the name "Transformers" for a now widely used NN type. It did not cite
the 1991 publication on what's now called "Transformers with linearized self-attention."^{[FWP0-6][TR5-6]}
Schmidhuber also introduced the now popular
attention terminology in 1993.^{[ATT][FWP2][R4]}
See tweet of 2022 for 30-year anniversary.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pretraining of deep bidirectional transformers for language understanding. Preprint arXiv:1810.04805.
[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, pp. 4731-4736. ArXiv preprint 1803.03585.
[TR4]
M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, pp. 156-171, 2020.
[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.
[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.
[UDRL1]
J. Schmidhuber.
Reinforcement Learning Upside Down: Don't Predict Rewards—Just Map Them to Actions.
Preprint arXiv/1912.02875, 5 Dec 2019.
[UDRL2]
R. K. Srivastava, P. Shyam, F. Mutz, W. Jaskowski, J. Schmidhuber.
Training Agents using Upside-Down Reinforcement Learning.
Preprint arXiv/1912.02877, 5 Dec 2019.
[UN]
J. Schmidhuber (AI Blog, 2021). 30-year anniversary. 1991: First very deep learning with unsupervised or self-supervised pretraining. Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.
Unsupervised/self-supervised learning and predictive coding is used
in a deep hierarchy of recurrent neural networks (RNNs)
to find compact internal
representations of long sequences of data,
across multiple time scales and levels of abstraction.
Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above.
The resulting compressed sequence representations
greatly facilitate downstream supervised deep learning such as sequence classification.
By 1993, the approach solved problems of depth 1000 [UN2]
(requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning).
A variant collapses the hierarchy into a single deep net.
It uses a so-called conscious chunker RNN
which attends to unexpected events that surprise
a lower-level so-called subconscious automatiser RNN.
The chunker learns to understand the surprising events by predicting them.
The automatiser uses a
neural knowledge distillation procedure
to compress and absorb the formerly conscious insights and
behaviours of the chunker, thus making them subconscious.
The systems of 1991 allowed for much deeper learning than previous methods. More.
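The compression principle annotated above can be shown with a toy stand-in. This is not the 1991 system, which uses learned RNN predictors at every level; here the "predictor" simply expects each symbol to repeat the previous one, so only the surprising (unpredicted) symbols are passed up the hierarchy:

```python
def compress(seq):
    """Forward only the symbols the trivial repeat-predictor gets wrong.

    A higher level thus receives a shorter, slower sequence of
    "unexpected" events, mirroring 1991 neural history compression.
    """
    surprises = []
    prev = None
    for sym in seq:
        if sym != prev:          # prediction "same as last time" failed
            surprises.append(sym)
        prev = sym
    return surprises

seq = ["a", "a", "a", "b", "b", "c", "c", "c", "c"]
print(compress(seq))  # ['a', 'b', 'c']: the higher level sees only the changes
```

Stacking such levels shortens credit-assignment paths, which is how the 1991-1993 systems handled sequences with effective depth beyond 1000.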
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.^{[UN0]} PDF.
First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pretraining and predictive coding (with self-supervised target generation).
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used. See also this tweet. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised / self-supervised pretraining for a stack of recurrent NNs
can be found here (depth > 1000).
[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95.
Augustinus, 1993.
[UNI]
Theory of Universal Learning Machines & Universal AI.
Work of Marcus Hutter (in the early 2000s) on J.
Schmidhuber's SNF project 2061847:
Unification of universal induction and sequential decision theory.
[WER87]
P. J. Werbos. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, 1987.
[WER89]
P. J. Werbos. Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209216, 1989.