Abstract.
The Deep Learning Artificial Neural Networks (NNs)
of our team have
revolutionised
Machine Learning & AI.
Many of the basic ideas behind this revolution were published within the 12 months of our "Annus Mirabilis" 1990-1991 at our lab in TU Munich.
Back then, few people were interested. But a quarter century later, NNs based on our "Miraculous Year"
were on over 3 billion devices,
and used many billions of times per day,
consuming a significant fraction of the world's compute.
In particular, in 1990-91, we laid foundations of Generative AI, publishing principles of (1)
Generative Adversarial Networks for Artificial Curiosity and Creativity (now used for deepfakes), (2) Transformers (the T in ChatGPT—see the 1991 Unnormalized Linear Transformer), (3) Pre-training for deep NNs (see the P in ChatGPT), (4) NN distillation (key for DeepSeek), and (5) recurrent World Models for
Reinforcement Learning and Planning in partially observable environments. The year 1991 also marks the emergence of the defining features of (6)
LSTM, the most cited AI paper of the 20th century (based on deep residual learning and constant error flow through residual NN connections), and (7) the most cited paper of the 21st century, based on our LSTM-inspired Highway Net that was
10 times deeper than previous feedforward NNs.
As of 2025, the two most frequently cited scientific articles of all time (with the most Google Scholar citations within 3 years—manuals excluded) are both directly based on our 1991 work.[MOST26]
The following summary of what happened in 1990-91 both contains
some high-level context which is accessible by a general audience,
but also references to original sources for NN experts. I also mention selected later work
which further developed the ideas of 1990-91 (at TU Munich, the Swiss AI Lab IDSIA, and other places), as well as
related work by others.
Table of contents
Sec. 0: Background on Deep Learning in Artificial Neural Nets (NNs)
Sec. 1: First Very Deep Learner, Based on Pre-Training (1991; see the P in ChatGPT)
Sec. 2: Distilling one Neural Net into Another (1991; key for DeepSeek 2025)
Sec. 3: The Fundamental Deep Learning Problem (Vanishing / Exploding Gradients, 1991)
Sec. 4: Long Short-Term Memory: Supervised Very Deep Learning (basic insights since 1991)
Sec. 5: Artificial Curiosity Through Generative Adversarial NNs (GANs, 1990)
Sec. 6: Artificial Curiosity Through NNs that Maximize Learning Progress (1991)
Sec. 7: Adversarial Networks Create Disentangled Data Representations (1991)
Sec. 8: Unnormalized Linear Transformers (1991). NNs Learn to Program NNs
Sec. 9: Learning Sequential Attention with NNs (1990)
Sec. 10: Gradient Descent Finds Subgoals for Hierarchical Reinforcement Learning (1990)
Sec. 11: Planning and Reinforcement Learning with Recurrent Neural World Models (1990)
Sec. 12: Goal-Defining Commands as Extra NN Inputs (1990)
Sec. 13: High-Dimensional Reward Signals as NN Inputs / General Value Functions (1990)
Sec. 14: Deterministic Policy Gradients (1990)
Sec. 15: Networks Adjust Networks / Synthetic Gradients (1990)
Sec. 16: O(n3) Gradient Computation for Online Recurrent NNs (1991)
Sec. 17: The Neural Heat Exchanger (1990): Deep Learning Through Local Computations
Sec. 18: My PhD Thesis (1991)
Sec. 19: From Unsupervised Pre-Training to Pure Supervised Learning (1991-95 and 2006-11)
Sec. 20: The Amazing FKI Tech Report Series on Artificial Intelligence in the 1990s
Sec. 21: Concluding Remarks
Background on Deep Learning in Neural Nets (NNs)
The human brain has on the order of 100 billion neurons, each connected to 10,000 other neurons on average. Some are input neurons that feed the rest with data (sound, vision, tactile, pain, hunger). Others are output neurons that control muscles. Most neurons are hidden in between, where thinking takes place. Your brain apparently learns by changing the strengths or weights of the connections, which determine how strongly neurons influence each other, and which seem to encode all your lifelong experience. Similar for our artifical neural networks (NNs), which learn better than previous methods to recognize speech or handwriting or video, minimize pain, maximize pleasure, drive cars, etc.[DL1-4][DLH]
Most current commercial applications focus on
supervised learning to make NNs imitate human teachers.[DL1-4]
In the course of many trials, Seppo Linnainmaa's gradient-computing
algorithm of 1970,[BP1][DLH] today often called backpropagation or the reverse mode of automatic differentiation,[BP4] is used to
incrementally weaken certain NN connections and strengthen others,
such that the NN behaves more and more like the teacher.[BPA-C][BP2][HIN][T22][DLP][NOB]
Today's most powerful NNs tend to be very deep, that is, they have many layers of neurons or many subsequent computational stages.
In the 1980s, however, gradient-based training did not work well for deep NNs, only for shallow
ones.[DL1-2]
This Deep Learning Problem
was most obvious for recurrent NNs (RNNs). RNN architectures have been studied since the 1920s.[L20][I24-I25][K41][MC43][W45][K56][AMH1-2][NOB]
Like the human brain,
but unlike the more limited feedforward NNs (FNNs),
RNNs have feedback connections.
This makes RNNs powerful,
general purpose, parallel-sequential computers
that can process input sequences of arbitrary length (think of speech data or videos).
RNNs can in principle implement any program that can run on your laptop or any other computer in existence.
If we want to build an Artificial General Intelligence (AGI),
then its underlying computational substrate must be something more like an RNN than an FNN as FNNs are fundamentally insufficient; RNNs are to FNNs as general computers are to pocket calculators.
In particular, unlike FNNs, RNNs can in principle deal with problems
of arbitrary depth.[DL1]
Early RNNs of the 1980s, however, failed to learn deep problems in practice.
From early in my career,
I wanted to overcome this drawback, to achieve
RNN-based "general purpose Deep Learning" or "general Deep Learning."
1. First Very Deep NNs, Based on Unsupervised or Self-Supervised Pre-Training (1991, see the P in ChatGPT)
My first idea to overcome the Deep Learning Problem mentioned above was to
facilitate supervised learning in deep RNNs
by unsupervised pre-training of a hierarchical stack of RNNs (1991, see the P in ChatGPT),
to obtain a first "Very Deep Learner" called the
Neural Sequence Chunker[UN0] or
Neural History Compressor.[UN1]
Each higher level minimizes the description length (or negative log probability)
of the data representation in the level below,
using the Predictive Coding trick: try to predict the next
input in the incoming data stream, given the previous inputs, and update neural activations only in case of unpredictable data,
thus storing only what's not yet known.
In other words, the chunker learns to compress the data stream such that the
Deep Learning Problem
becomes less severe, and can be solved by standard backpropagation.
Although computers back then were about a million times slower per dollar than today,
by 1993, my method was able to solve previously unsolvable
"Very Deep Learning" tasks of depth > 1000[UN2] (requiring
more than 1,000 subsequent computational stages—the more such stages, the deeper the learning).
In 1993, we also published a continuous version of the
Neural History Compressor.[UN3]
To my knowledge, the Sequence Chunker[UN0]
also was the first system made of RNNs operating on different
(self-organizing) time scales.[UN][DLP] (But I also had a way of
distilling
all those RNNs down into a single deep RNN operating on a single time scale—see
Sec. 2.) A few years later, others also started publishing
on multi-time scale RNNs[HB96][DLP]
(see also the Clockwork RNN[CW]).
More than a decade after this work,[UN1]
a similar method for more limited feedforward NNs (FNNs) was published, facilitating
supervised learning by unsupervised pre-training of stacks of FNNs
called Deep Belief Networks (DBNs).[UN4][DLP][NOB]
The 2006 justification was essentially the one I used in the early 1990s for my RNN stack:
each higher level tries to reduce the description length
(or negative log probability) of the data representation in the level below.[HIN]
Soon after the unsupervised pre-training-based Very Deep Learner above,
the Deep Learning Problem (see Sec. 3) was also overcome
through our purely supervised LSTM (Sec. 4).
Much later, between 2006 and 2011, my lab also drove
a very similar shift from unsupervised pre-training to pure supervised learning,
two decades after our Miraculous Year,
this time for the less general feedforward NNs (FNNs) rather
than recurrent NNs (RNNs), with revolutionary applications
to
cancer detection and many other problems. See Sec. 19 for more on this.
Of course, Deep Learning in feedforward NNs started much earlier, with Ivakhnenko & Lapa, who published the first general, working learning algorithms for deep multilayer perceptrons with arbitrarily many layers back in 1965.[DEEP1][NOB][DLH] For example, Ivakhnenko's paper from 1971[DEEP2] already described a Deep Learning net with 8 layers, trained by a highly cited method still popular in the new millennium.[DL2] But unlike the deep FNNs of Ivakhnenko and his successors of the 1970s and 80s, our deep RNNs had general purpose parallel-sequential computational architectures.[UN-UN3] By the early 1990s, most NN research was still limited to rather shallow nets with fewer than 10 subsequent computational stages, while our methods already enabled over 1,000 such stages. In this sense we were the ones who made NNs really deep, especially RNNs, the deepest and most powerful nets of them all.
See also: who invented deep learning?[WHO5]
2. Compressing/Collapsing/Distilling an NN into Another NN (1991; key for DeepSeek 2025)
My above-mentioned paper on the
Neural History Compressor (see Sec. 1) also
introduced a way of compressing the network hierarchy
(whose higher levels are typically running on much slower self-organising time scales than lower levels)
into a single
deep RNN[UN1] which thus learned to solve very
deep problems despite the obstacles mentioned in Sec. 0.
This is described in
Section 4 of the paper on the
"conscious" chunker and a "subconscious" automatiser,[UN1] which
introduced a general principle for
transferring the knowledge of one NN to another.
Suppose a teacher NN has learned to predict (conditional expectations of) data,
given other data. Its knowledge can be compressed into a student NN,
by training the student NN to imitate the behavior of the teacher NN
(while also re-training the student NN on previously learned skills such that it does not forget them).
I called this "collapsing" or "compressing" the behavior of one net into another. Today, this is widely used,
and also called "distilling"[DIST2][HIN][DLP] or "cloning" the behavior of a teacher net into a student net. See also: who invented knowledge distillation with artificial neural networks?[WHO9]
In January 2025, the DeepSeek "Sputnik"[DS1] wiped out a trillion USD from the stock market. DeepSeek-R1[DS1] used elements of my 2015 reinforcement learning (RL) prompt engineer[PLAN4] and its 2018 refinement[PLAN5] which collapses the 2015 RL machine and its world model[PLAN4] into a single net through the neural net distillation procedure of 1991[UN0-3][UN][DLP]: a distilled chain of thought system. See the popular tweet of 31 Jan 2025.
3. The Fundamental Deep Learning Problem: Vanishing / Exploding Gradients (1991)
The background section Sec. 0
pointed out that Deep Learning
is hard. But why is it hard? A main reason is what I like to call the Fundamental Deep Learning Problem identified and analyzed in 1991 by my first student Sepp Hochreiter in his diploma thesis.[VAN1]
As a part of his thesis,
Sepp implemented the Neural History Compressor above (see Sec. 1)
and other RNN-based systems (see Sec. 11).
However, he did much more:
His work formally showed that deep NNs suffer from the now famous problem of vanishing or exploding gradients:
in typical deep or recurrent networks, back-propagated error signals either shrink rapidly, or grow out of bounds.
In both cases, learning fails. This analysis led to basic principles of what's now called LSTM (see Sec. 4).
Note that Sepp's thesis identified those problems of backpropagation in deep NNs
two decades after another student with a similar first name (Seppo Linnainmaa) published modern backpropagation or the reverse mode of automatic differentiation
in his own thesis of 1970.[BP1]
In 1994, others published results[VAN2] essentially identical to the 1991 vanishing gradient results of Sepp.[VAN1] Even after a common publication[VAN3] the first author of reference[VAN2] published papers[VAN4] that cited only their
own 1994 paper but not Sepp's original work.[DLP]
4. Long Short-Term Memory Recurrent Networks: Supervised Very Deep Learning / Residual Connections
The
Long Short-Term Memory (LSTM) recurrent neural network[LSTM1-6] overcomes
the Fundamental Deep Learning Problem identified by Sepp in his above-mentioned 1991
diploma thesis[VAN1] (see Sec. 3), which I consider one of the most important documents in the history of machine learning. It also
provided essential insights for overcoming the problem, through basic principles (such as constant error flow through what's now called "residual connections") of what we called LSTM in a tech report of 1995.[LSTM0] This led to lots of follow-up work described below.
See also: who invented deep residual learning?[WHO11]
In 2020 we celebrated the quarter-century anniversary of
LSTM's first failure to pass peer review.
After the main peer-reviewed publication in 1997[LSTM1] (now the most cited article in the history of Neural Computation[25y97]), LSTM and its training procedures were further improved on my Swiss LSTM grants at IDSIA through the work of my later students Felix Gers, Alex Graves, and others. A milestone was the "vanilla LSTM architecture" with forget gate[LSTM2a][LSTM2]—the LSTM variant of 1999-2000 that everybody is using today, e.g., in Google's Tensorflow.
Alex was lead author of our first successful application of LSTM to speech (2004).[LSTM10]
2005 saw the first publication of LSTM with full backpropagation through time and of bi-directional LSTM[LSTM3] (now widely used). Another milestone of 2006 was the training method "Connectionist Temporal Classification" or CTC[CTC] for simultaneous alignment and recognition of sequences. Our team successfully applied CTC-trained LSTM to speech in 2007[LSTM4] (also with hierarchical LSTM stacks[LSTM14]). This was the first superior end-to-end neural speech recognition. It was
very different from hybrid methods since the late 1980s which combined NNs and traditional approaches such as Hidden Markov Models (HMMs).[BW][BRI][BOU][HYB12][DLP]
In 2015, the CTC-LSTM combination dramatically improved
Google's speech recognition on the Android smartphones.[GSR15][DL4][DLH]
The first superior end-to-end neural machine translation
was also based on our LSTM.
In 1995, we already had an excellent neural probabilistic text model.[SNT]
In the early 2000s, we showed how LSTM can learn languages unlearnable by traditional models such as Hidden Markov Models.[LSTM13] This took a while to sink in, and compute still had to get 1000 times cheaper, but by 2016-17, both Google Translate[WU][GT16] and Facebook Translate[FB17] were based on two
connected LSTMs, one for the incoming text, one for the outgoing translation,[S2Sa,b,c,d] much better than what they had before.[DL4]
In 2009, my PhD student Justin Bayer was lead author of a system that automatically designed LSTM-like architectures
outperforming vanilla LSTM in certain applications.[LSTM7] In 2017, Google started using
similar "neural architecture search."[NAS]
Since 2006, we have worked with the software industry (e.g., LifeWare) to greatly improve handwriting recognition.
In 2009, through the efforts of Alex, LSTM trained by CTC became the first RNN to win international competitions, namely,
three ICDAR 2009 Connected Handwriting Competitions (French, Farsi, Arabic).
This attracted enormous interest from industry.
LSTM was soon used for everything that involves sequential data such as language and speech[LSTM10-11][LSTM4][DL1] and videos.
By 2017,
LSTM powered Facebook's machine translation (over 30 billion translations per week),[FB17][DL4] Apple's Quicktype on roughly 1 billion iPhones,[DL4] the voice of Amazon's Alexa,[DL4]
Google's speech recognition (on Android smartphones since 2015)[GSR15][DL4] &
image caption generation[DL4] &
machine translation[GT16][DL4] &
automatic email answering[DL4] etc. Business Week called LSTM "arguably the most commercial AI achievement."[AV1]
The first Large Language Models (LLMs) were based on LSTM as well.
By 2016, more than a quarter of the awesome computational
power for inference in Google's datacenters
was used for LSTM (and 5% for another popular Deep Learning technique called CNNs—see Sec. 19).[JOU17]
Google's
on-device speech recognition of 2019
(now on the phone, not on the server) was still based on
LSTM.
LSTM for Healthcare.[DEC]
A simple Google Scholar search turns up innumerable medical articles that have "LSTM" in their title, e.g.,
for learning to diagnose, ECG time signal analysis and classification, patient subtyping,
clinical concept extraction, diagnosis of arrhythmia,
drug-drug interaction extraction, hospitalization prediction, monitoring on personal wearable devices,
automatic pain level classification, cardiovascular disease risk factors prediction, 4D medical image segmentation, detection of radiological abnormalities, automated sleep stage classification, blood glucose prediction,
diabetes detection, lung cancer detection,
respiration prediction, real-time tumor tracking, breast cancer detection from histopathological images,
air pollution forecasting, protein model quality assessment, protein secondary structure prediction, modeling genome data, generation of drug-like chemical matter, pandemic forecasting,
Covid-19
detection, Covid-19 classification, Covid-19 prediction, and many more.
Through the work of my students
Rupesh Kumar Srivastava and Klaus Greff,
the LSTM principle also led to
our Highway Networks[HW1] of May 2015, the first working very deep FNNs with hundreds of layers (previous NNs had at most a few tens of layers). 7 months later, Microsoft won the
ImageNet 2015 contest with an open-gated Highway Net variant called ResNet[HW2]
(ResNets are like Highway Nets whose gates are always open).
The earlier Highway Nets perform roughly as well as ResNets on ImageNet.[HW3]
Variants of Highway gates are also used for certain algorithmic tasks, where the simpler residual layers do not work as well.[NDR] See also: who invented deep residual learning?[WHO11]
Deep learning
is all about NN depth.[DL1][DLH]
LSTMs
brought essentially unlimited depth to supervised recurrent NNs; Highway Nets brought it to feedforward NNs. Interestingly, LSTM has become the most cited NN of the 20th century; an open-gated Highway Net variant the most cited NN of the 21st.[MOST][MOST26]
We also trained LSTM through Reinforcement Learning (RL) for robotics
without a teacher, e.g., with my postdoc Bram Bakker[LSTM-RL] (2002).
And also through Neuroevolution and
policy gradients, e.g., with my PhD student Daan Wierstra,[LSTM12][RPG07][RPG][LSTMPG] who later became employee number 1 of DeepMind, the company co-founded by his friend Shane Legg, another PhD student from my lab (Shane and Daan were the first persons at DeepMind with AI publications and PhDs in computer science).
RL with LSTM has become important.
For example, in 2019, DeepMind beat a pro player in the game of Starcraft, which is harder than Chess or Go[DM2] in many ways, using
Alphastar whose brain has a deep LSTM core trained by RL.[DM3]
An RL LSTM (with 84% of the model's total parameter count) also was the core of the famous
OpenAI Five
which learned to defeat human experts in the
Dota 2 video game (2018).[OAI2]
Bill Gates called this a "huge milestone in advancing artificial intelligence."[OAI2a]
Essential foundations for all of this were laid in 1991.
My team subsequently developed LSTM & CTC etc. with the help of
basic funding from TU Munich and the (back then private) Swiss Dalle Molle Institute for AI (IDSIA), as well as
public funding which I acquired from Switzerland, Germany, and EU during the "Neural Network Winter" of the 1990s and early 2000s, trying
to keep the field alive when few were interested in NNs. I am especially thankful to Professors Kurt Bauknecht, Leslie Kaelbling, Ron Williams, and Ray Solomonoff whose positive reviews of my grant proposals have greatly helped to obtain financial support from SNF since the 1990s.
5. Artificial Curiosity / Generative Adversarial NNs (1990)
As humans interact with the world, they learn to predict the consequences of their actions. They are also curious, designing experiments that lead to novel data from which they can learn more.
To build
curious artificial agents,[AC] I introduced
a new type of active unsupervised or self-supervised learning in 1990.[AC90,AC90b]
It is based on a
minimax game where one NN minimizes the objective function maximized by another NN.
Today, I refer to
this duel between two unsupervised adversarial NNs as Adversarial Artificial Curiosity,[AC20]
to distinguish it from our later types of Artificial Curiosity since 1991 (see Sec. 6).
How does Adversarial Curiosity work?
The first NN is called the controller C. C (probabilistically) generates outputs that may influence an environment. The second NN is called the world model M. It predicts the environmental reactions to C's outputs.
Using gradient descent, M minimizes its error, thus becoming a better predictor. But in a zero sum game, C tries to find outputs that maximize the error of M. M's loss is the gain of C.
That is, C is motivated to invent novel outputs or experiments that yield data that M still finds surprising, until the data becomes familiar and eventually boring. Compare more recent summaries and extensions of this principle.[AC09] A 2010 survey[AC10] summarised the work of 1990 as follows: a
"neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors."
So in 1990 we already had
unsupervised or self-supervised neural nets that were both
generative and adversarial (using much later terminology from 2014[GAN14]),
generating experimental outputs yielding novel data,
not only for stationary
patterns but also for pattern sequences, and even for the general case of
Reinforcement Learning (RL).
The popular
Generative Adversarial Networks (GANs)[GAN90-25]
(2010-2014)
are an
Instance of Adversarial Curiosity[GAN90] where the
environment simply returns whether C's current output is in a given set.[GAN20-25][DLP]
BTW, note that the closely related Adversarial Curiosity[GAN90,90b] & GANs[GAN14]
& Adversarial Predictability
Minimization (see Sec. 7) are
very different from other
early adversarial machine learning settings[S59][H90]
which
neither involved unsupervised NNs nor were about modeling data nor used gradient descent.[GAN20]
See also: who invented generative adversarial networks?[WHO8]
6. Artificial Curiosity Through NNs That Maximize Learning Progress (1991)
Numerous improvements of the original Adversarial Curiosity of 1990 (AC1990, Sec. 5)
are summarized in more recent surveys.[AC06][AC09][AC10][AC] Here I focus on the first important improvement of 1991.[AC91][AC91b]
The errors of AC1990's world model M (to be minimized, Sec. 5) are the rewards of the controller C (to be maximized).
This makes for a fine exploration strategy in many deterministic environments.
In stochastic environments, however, this might fail.
C might learn to focus on situations where M can always
get high prediction errors due to randomness,
or due to its computational limitations.
For example, an agent controlled by C might get stuck in front of
a TV screen showing highly unpredictable white
noise.[AC10]
Therefore, as pointed out in 1991,
in stochastic environments,
C's reward should not be the errors of M,
but (an approximation of) the first derivative of M's errors across subsequent training iterations,
that is, M's improvements.[AC91][AC91b]
As a consequence, despite its high errors in front of
the noisy TV screen above,
C won't get rewarded for getting stuck there.
Both the totally predictable and the fundamentally unpredictable will get boring.
This insight led to lots of follow-up work[AC10] on
artificial scientists and artists.[AC09][AC]
7. Adversarial Networks Create Disentangled Data Representations (1991)
Soon after my first work on adversarial generative networks in 1990 (see Sec. 5),
I introduced a variation of the unsupervised adversarial minimax principle while I was a postdoc
at the University of Colorado at Boulder.
One of the most important NN tasks is to learn the statistics
of given data such as images.
To achieve this,
I used again the principles of gradient descent/ascent in a
minimax game where one NN minimizes the objective function maximized by another.
This duel between two unsupervised adversarial NNs was called
Predictability Minimization
(PM, 1990s).[PM0-2]
Contrary to later claims,[GAN14]
PM is indeed a pure minimax game, e.g., Equation 2 of [PM2].[GAN20][T20][DLP]
The first toy experiments with PM[PM1] were
conducted nearly three decades ago when compute was about a million times more expensive than today.
When it had become about 10 times cheaper 5 years later,
we could show that semi-linear PM variants applied to images automatically generate
feature detectors well-known from neuroscience, such as
on-center-off-surround detectors,
off-center-on-surround detectors, and
orientation-sensitive bar detectors.[PM2]
8. Unnormalized Linear Transformers (1991). Fast Weight Programmers: NNs Learn to Program NNs
The first Large Language Models (LLMs) were based on LSTM (see Sec. 4). However, in the late 2010s,
despite their limited time windows, recurrent-connection-free
Transformers[TR1,TR25]
(see the T in ChatGPT[GPT3])
started to excel at Natural Language Processing (NLP), a traditional LSTM domain. Remarkably, Transformers also have their roots in our Miraculous Year of 1991!
In March 1991, when compute was a million times more expensive than in 2022, even before the LSTM, I published the first Transformer variant, which is now called the unnormalized linear Transformer (ULTRA).[ULTRA][FWP0]
It had to be more efficient than Google's 2017 quadratic Transformer:[TR1][TR25] ULTRA's computational costs scale linearly in input size, rather than quadratically (in 1991, no journal would have accepted an NN that scales quadratically).
My 1993 paper on recurrent ULTRA extensions[FWP2] talked about learning "internal spotlights of attention”—compare the recent attention terminology, e.g., "attention is all you need,"[TR1] and tweets of
2022 &
2023.
The ULTRA was a by-product of more general research on NNs learning to program other NNs.
A typical NN has many more connections than neurons.
In traditional NNs, neuron activations change quickly,
while connection weights change slowly.
That is, the numerous weights cannot implement short-term memories
or temporal variables, only the few neuron activations can.
Non-traditional NNs with quickly changing "fast weights" overcome this limitation. Dynamic links or fast weights for NNs were introduced by Christoph v. d. Malsburg in 1981[FAST] and further studied by others.[FASTa,b]
However, before 1991, no network learned by gradient descent to quickly compute the changes of the fast weight storage of another network or of itself.
Enter the end-to-end-differentiable
Fast Weight Programmers (FWPs)[FWP] published in 1991-93.[FWP0-2]
There a slow NN learns to program the weights of a separate fast NN.
That is, I separated storage and control like in traditional computers,
but in a fully neural way (rather than in a hybrid fashion[PDA1-2][DNC]).
FWPs embody the principles found in what is now called
attention[ATT]
and Transformers.[TR1-6][FWP]
In fact, one of my FWPs is the 1991 unnormalized linear Transformer above (with what's now called "linearized self-attention").[TR5-6][FWP][ATT] See this tweet celebrating the 30-year anniversary of the 1992 journal publication.[FWP0-1]
Some of my FWPs used gradient descent-based, active control of fast weights through 2D tensors or outer product updates[FWP1-2] (compare our more recent work on this[FWP3-3a][FWP6-7]).
One of the motivations[FWP2]
was to get many more temporal variables under end-to-end differentiable control than what's possible in standard RNNs of the same size: O(H^2) instead of O(H), where H is the number of hidden units. A quarter century later, others followed this approach.[FWP4a][DLP]
The 1993 paper[FWP2] also explicitly introduced the "modern" terminology of learning internal spotlights of attention in end-to-end-differentiable networks. Compare Sec. 9 on
learning attention.
Work on extensions of ULTRAs and other FWPs has become mainstream research, aiming to develop sequence models that are both efficient and powerful.[TR6,TR6a][LT23-25][FWP23-25b]
See also: who invented transformer neural networks?[WHO10]
I also showed how FWPs can be used for meta-learning or
learning to learn, one of my main research topics
since 1987.[META1][META]
In follow-up work[FWPMETA1-10] since 1992, the slow RNN and the fast RNN are identical: the initial weight of each connection in the net is trained by gradient descent, but during an episode, each connection can be addressed and read and modified by the net itself (through O(log n) special output units where n is the number of connections), and the connection's weight may rapidly change—the network becomes self-referential in the sense that it can in principle learn to run arbitrary computable weight change algorithms or learning algorithms (for all of its weights) on itself. This led to
many
follow-up papers in the 1990s and 2000s.
Deep Reinforcement Learning (RL) without a teacher can also profit from fast weights even when the system's dynamics are not differentiable, as shown in 2005
by my former postdoc Faustino Gomez[FWP5]
(now CEO of NNAISENSE)
when affordable computers were about 1000 times faster than in the early 1990s.
Interestingly, our related work on Deep RL in the same year (but without fast weights) to my knowledge was the
first machine learning
publication with the word combination "learn deep" in the title[DL6-6a] (2005; soon afterwards
many started talking about "Deep Learning").
Over the decades we have published quite a few additional ways of learning to generate quickly numerous weights of large NNs through very compact codes.[KO0-2][CO1-3] Here we exploited that the
Kolmogorov complexity or algorithmic information content of successful huge NNs may actually be rather small.
In particular, in July 2013,
Compressed Network Search[CO2]
was the
first Deep Learning model to successfully learn control policies directly from high-dimensional sensory input (video) using reinforcement learning,
without any unsupervised pre-training (unlike in Sec. 1).
Soon afterwards,
DeepMind also had a Deep RL system for high-dimensional sensory input.[DM1-2]
9. Learning Sequential Attention with NNs (1990)
Unlike traditional NNs, humans use sequential gaze shifts and selective attention to detect and recognize patterns.
This can be much more efficient than the highly parallel approach of traditional FNNs.
That's why we introduced
sequential attention-learning NNs three decades ago (1990 and onwards).[ATT0-1]
Shortly afterwards, I also explicitly addressed the learning of
"internal spotlights of attention" in RNNs[FWP2] (see Sec. 8).
So back then we already had both of the now common types of neural sequential attention:
end-to-end-differentiable "soft" attention (in latent space)
through multiplicative units within NNs,[FWP][FWP2]
and
"hard" attention (in observation space) in
the context of Reinforcement Learning (RL).[ATT0-1]
In particular, in 1991, I had
what's now called the
unnormalised linear Transformer[ULTRA]
with linearized self-attention.[TR1-6][FWP0-1][FWP6][FWP] (see Sec. 8).
This led to lots of follow-up work.
Today, many are using sequential attention-learning NNs.
My
overview paper for CMSS 1990[ATT2] summarised in Section 5 our early work on attention, to my knowledge the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component
with an attentional component (the fixation controller).
As with the work of Sepp (see Sec. 3), this has often been misattributed. In 2010, the reviewer of my 1990 paper wrote about his own work:[ATT3] "To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)"[DLP]
(see also Sec. 10).
10. Gradient Descent Finds Subgoals for Hierarchical Reinforcement Learning (1990)
Traditional Reinforcement Learning (RL) without a teacher does not hierarchically decompose problems into
easier subproblems.[HRLW] That's why in 1990 I introduced Hierarchical RL (HRL) with
end-to-end differentiable NN-based subgoal generators,[HRL0] also with
recurrent NNs that learn to generate sequences of subgoals.[HRL1-2][LEC]
An RL machine gets extra inputs of the form (start, goal). An evaluator NN learns to predict the rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals.
The system is learning action plans
at multiple levels of abstraction and multiple time scales. At least in principle, it solves what was called an "open problem" in 2022.[LEC]
Our 1990-91 papers[HRL0-1] started a series of papers on HRL.(e.g., [HRL4])
The reviewer of my 1990 paper[ATT2] (which summarised in Section 6 our early work on
HRL) was last author of a 1992 paper on HRL.[HRL3] Compare Sec. 9.
11. Planning with Recurrent World Models (1990)
The concept of a mental model of the world—a world model—dates back millennia. Plato suggested[PLA1] that we recognize objects by recollecting internal blueprints or templates, today often called internal representations. Aristotle wrote that phantasia or mental images allow humans to imagine the future and to plan action sequences by mentally manipulating images in the absence of the actual objects.[ARI1]
Only 2,370 years later—a mere blink of an eye by cosmical standards—we are witnessing a boom in world models based on NNs for AI in the physical world.[WM26][WM26b] New startups on this are emerging.
Much of this goes back to 1990, when I introduced RNNs as general purpose world models.[AC90][PLAN2][PLAN3]
I studied adaptive agents living in partially observable environments where non-trivial kinds of memory are required to act successfully. I used the term world model for an RNN that learns to predict the agent's sensory inputs (including pain and reward signals) reflecting the consequences of the actions of a separate controller RNN steering the agent. The controller C used the world model M to plan its action sequences through mental experiments.[PLAN]
Since RNNs are general purpose computers, this approach went beyond previous, less powerful, FNN-based systems (since 1987) for fully observable environments.[WER87-89][MUN87][NGU89]
In the beginning, my 1990 world model M knew nothing. That's why my 1990 controller C (a generative model with stochastic neurons) was intrinsically motivated through adversarial artificial curiosity[AC90][AC90b][AC] to invent action sequences or experiments that yield data from which M can learn something: C simply tried to maximize the prediction error minimized by M. Today, they call this a generative adversarial network (GAN).[GAN90-25] See also Sec. 5 and Sec. 6.
The 1990 system didn't learn like today's foundation models and large language models (LLMs) by downloading and imitating the web. No, it generated its own self-invented experiments to collect limited but relevant data from the environment, like a physicist, or a baby.[AC90,AC90b][AC] It was a simple kind of artificial scientist.
This led to lots of follow-up publications, also in recent years.[PLAN-PLAN6][LEC][WM26] In 2014, we even founded an AGI company for Physical AI in the real world, based on neural world models.[NAI] It achieved lots of remarkable milestones in collaboration with world-famous companies. Alas, like some of our projects, the company may have been a bit ahead of time, because real world robots and hardware are so challenging. Nevertheless, it's great that in the 2020s, new world model startups have been created!
The 1990 FKI report[AC90] also
introduced several other concepts that have become popular. See Sec. 12,
Sec. 13,
Sec. 14,
Sec. 5,
Sec. 20.
12. Goal-Defining Commands as Extra NN Inputs (1990)
One concept that is widely used in today's RL NNs are
extra goal-defining input patterns that encode various tasks,
such that the NN knows which task to execute next. We introduced this in 1990 in various contexts.[ATT0-1][HRL0-1]
In [ATT0-1] a reinforcement learning neural controller learned to control a fovea through sequences of saccades to find particular objects in visual scenes, thus learning sequential attention (see Sec. 9). User-defined goals were provided to the system by special "goal input vectors" that remained constant (Sec. 3.2 of [ATT1]) while the system shaped its stream of visual inputs through fovea-shifting actions.
Hierarchical RL (HRL, Sec. 10) with
end-to-end differentiable subgoal generators[HRL0-1][LEC] also uses an NN with task-defining inputs
of the form (start, goal), learning to predict the costs of going from start to goal.
(Compare my former student Tom Schaul's "universal value function approximator" at DeepMind a quarter century later.[UVF15])
This led to lots of follow-up work. For example, our POWERPLAY RL system (2011)[PP][PP1] also uses task-defining inputs to distinguish between tasks,
continually inventing on its own new goals and tasks, incrementally learning to become a more and more general problem solver in an active, partially unsupervised or self-supervised fashion.
RL robots with high-dimensional video inputs and intrinsic motivation (like in PowerPlay) learned to explore in 2015.[PP2]
13. High-Dimensional Reward Signals As NN Inputs (1990)
Traditional RL is based on one-dimensional reward signals.
Humans, however, have millions of informative sensors for different types of pain and pleasure etc.
To my knowledge, the 1990 tech report[AC90] was the first paper on RL with
multi-dimensional, vector-valued pain and reward signals coming in through
many different sensors,
where cumulative values are predicted for all those sensors,
not just for a single scalar overall reward.
This later emerged again in RL in what we now call
general value functions.[GVF]
Unlike previous adaptive critics, the one of 1990[AC90]
was multi-dimensional and recurrent.
Unlike in traditional RL,
those reward signals were also used as informative inputs to the controller NN
learning to execute actions that maximise cumulative reward.[PLAN][LEC][DLH]
Note that such a setup is essential for meta learning or
learning to learn, one of my main research topics
since 1987.[META1][META][FWPMETA1-10] A meta learner must see its own reward/pain/error signals as inputs, otherwise it cannot learn better learning algorithms that map such signals to useful weight changes or policy changes!
14. Deterministic Policy Gradients (1990)
The section "Augmenting the Algorithm by Temporal Difference Methods" of the 1990 paper[AC90]
also combined the Dynamic Programming-based
Temporal Difference method[TD] for predicting cumulative (possibly multi-dimensional, Sec. 13) rewards
with a gradient-based predictive
model of the world (see Sec. 11),
to compute weight changes for the separate control network.
See also Sec. 2.4 of the 1991 follow-up paper[PLAN3]
(and compare [NAN1]).
A quarter century later, a
variant of this emerged as Deterministic Policy Gradient algorithm (DPG) at DeepMind.[DPG][DDPG]
15. Networks Adjust Networks / Synthetic Gradients (1990)
In 1990, I proposed various NNs that learn to adjust other NNs,[NAN1] a prelude to the
1991 end-to-end-differentiable Fast Weight Programmers of Sec. 8.
Here I focus
on the section "An Approach to Local Supervised Learning in Recurrent Networks".[NAN1]
The global error measure to be minimized is the sum of all errors received at an RNN's output units over time. In conventional backpropagation through time (see surveys[BPTT1-2]),
each unit needs a stack for remembering past activations which are used to compute contributions to weight changes during the error propagation phase. Instead of allowing unlimited storage capacities in the form of stacks, I introduced a second adaptive NN that learns to associate states of the RNN with corresponding error vectors. These locally estimated error gradients (rather than the true gradients) are used to adjust the RNN.[NAN1-4]
Unlike standard backpropagation, the method is local in space and time.[BB1-2][NAN1]
A quarter century later, DeepMind called this "Synthetic Gradients."[NAN5]
16. O(n3) Gradients for Online Recurrent NNs (1991)
The original 1987 fixed-size storage learning algorithm for
fully recurrent continually running networks[ROB]
requires O(n4) computations per time step, where n is the number of non-input units.
I published a method which computes exactly the same gradient and requires fixed-size
storage of the same order as the previous algorithm.
But, the average time complexity per
time step is only O(n3).[CUB1-2] However, this work does not really count,
since the great RNN pioneer Ron Williams had derived this method first![CUB0]
BTW, I committed a similar error in 1987 when I published
what I thought was the first paper on
Genetic Programming (GP), that is, on automatically
evolving computer programs[GP1][GP] (authors in alphabetic order).
Only later I found out that Nichael Cramer had published GP already in 1985[GP0]
(and that Stephen F. Smith had proposed a related approach as part of a larger system[GPA] in 1980).
Since then I have been trying to do the right thing and correctly attribute credit.
At least our 1987 paper[GP1] seems to be
the first on GP for codes with loops and codes of variable size,
and the first on GP implemented in a Logic Programming language.
17. The Deep Neural Heat Exchanger (1990)
The Neural Heat Exchanger is supervised learning method for deep multi-layer NNs. It is inspired by the physical heat exchanger. Inputs "heat up" while being transformed through many successive layers, targets enter from the other end of the deep pipeline and "cool down." Unlike backpropagation, the method is entirely local in space and time.[BB1-2][NAN1][NOB25a] This makes its parallel implementation trivial. It was first presented during occasional talks at various universities since 1990,[NHE] and is closely related to the later Helmholtz Machine.[HEL][DLP] Again, experiments were conducted by my brilliant student
Sepp Hochreiter (see Sec. 3, Sec. 4).
18. PhD Thesis (1990)
My doctoral dissertation at TUM[PHD] also came out in 1991,
summarising some of my earlier work since 1989,
including the first Reinforcement Learning (RL) Neural Economy (the Neural Bucket Brigade),[BB1-2][DLP]
learning algorithms for RNNs that are local in space and time,[BB1]
hierarchical RL (HRL) with
end-to-end differentiable subgoal generators (see Sec. 10),
RL and planning through a combination of two RNNs called the controller C and the world model M (see Sec. 11),
sequential attention-learning NNs (see Sec. 9),
NNs that learn to adjust other NNs (including "synthetic gradients;" see Sec. 15),
and unsupervised or self-supervised, generative, adversarial networks (see Sec. 5) for implementing curiosity.
Back then, much of the NN research by others was inspired by statistical mechanics.[L20][I25][K41][W45][AMH1-2][NOB]
The works of 1990-91 (and my even earlier diploma thesis of 1987[META1])
embodied an alternative program-oriented view of Machine Learning.
When Kurt Gödel
laid the foundations of theoretical computer science in 1931,[GOD][GOD34][GOD21-21b]
he represented both data (such as axioms and theorems) and programs
(such as proof-generating sequences of operations on the data)
in a universal coding language based on the integers.
He famously used this language to construct formal statements that talk about the computation of other formal statements—especially self-referential statements which imply that they are not decidable, given a computational theorem prover that systematically enumerates all possible theorems from an enumerable set of axioms. Thus he identified fundamental limits of algorithmic theorem proving, computing, and
any type of computation-based AI.
As I have frequently pointed out since 1990,[AC90]
the weights of an NN should be viewed as its program.
Some argue that the goal of a deep NN is to learn useful internal representations of observed data
(there even is an international conference on learning representations called ICLR),
but
the NN's goal is actually to learn a program (the parameters)
that computes such representations.
Inspired by Gödel,
I built NNs whose outputs are programs or weight matrices of other NNs (see Sec. 8),
and even self-referential RNNs
that can run and inspect their own weight change algorithms or learning algorithms (see Sec. 8).
A difference to Gödel's work is that the universal programming language is not based on the integers,
but on real values, such that
the outputs of typical NNs are differentiable with respect to their programs.
That is, a simple program generator (the efficient
gradient descent procedure[BP1])
can compute a direction in program space where one may find a better program,[AC90]
in particular, a
better program-generating program (see Sec. 8).
Much of my work since 1989 has exploited this fact.
19. From Unsupervised Pre-Training to Pure Supervised Learning (1991-95; 2006-11)
As mentioned in Sec. 1, my first Very Deep Learner was the RNN stack of 1991
which used unsupervised pre-training
to learn problems of depth greater than 1000.
Soon afterwards, however,
we published more powerful ways
of overcoming the Deep Learning Problem (see Sec. 3) without any unsupervised pre-training,
replacing the unsupervised RNN stack[UN1-3]
by the purely supervised Long Short-Term Memory (LSTM) (Sec. 4).
That is,
already in the previous millennium, unsupervised pre-training lost
significance as LSTM did not require it.
In fact, this shift from unsupervised pre-training to pure supervised learning started
already in 1991.
A very similar shift
took place much later between 2006 and 2010, this time for the less general feedforward NNs (FNNs) rather
than recurrent NNs (RNNs). Again, my little lab played a central role in this transition.
In 2006,
supervised learning in FNNs was facilitated by unsupervised pre-training of stacks of FNNs[UN4] (see Sec. 1).
But in 2010, our team with my outstanding Romanian
postdoc Dan Ciresan[MLP1-3]
showed that deep FNNs
can be trained by plain backpropagation and do not at all require unsupervised
pre-training for important applications.[MLP2-3]
Our system set a new performance record[MLP1] on
the back then famous and widely used image recognition benchmark called MNIST.
This was achieved by greatly accelerating traditional FNNs on highly parallel
graphics processing units called GPUs. A reviewer called this a
"wake-up call to the machine learning community."
My team at the Swiss AI Lab IDSIA further improved the above-mentioned work (2010)
on purely supervised Deep Learning in FNNs[MLP1-3] by replacing the traditional FNNs
through another old NN type called convolutional NNs or CNNs, invented and developed by others 1979-1988 in Japan.[CN69-88]
See also: Who invented convolutional neural networks?[WHO7]
Our supervised fast deep CNN called DanNet (Ciresan et al., 2011)[GPUCNN1]
was a practical breakthrough (much faster than early work on accelerating CNNs[GPUCNN]). DanNet was the first pure CNN to win
international computer vision competitions, and
won 4 of them in a row
between May 15, 2011, and September 10, 2012.[GPUCNN5]
(All of this happened before a similar GPU-CNN by others won ImageNet 2012.[GPUCNN5])
In particular,
DanNet was the first deep CNN to win a
Chinese handwriting contest (ICDAR 2011),
the first to achieve
superhuman visual pattern recognition
in any international contest (IJCNN 2011),
the first to win an image segmentation contest (ISBI, May 2012),
and the first to win a
contest on object detection in large images (ICPR, 10 Sept 2012),
at the same time the first to win a medical imaging contest (on cancer detection).[GPUCNN5]
One year later, our team also won the MICCAI Grand Challenge on
mitosis detection.[MGC][GPUCNN5-8]
Our fast CNN image scanners were over 1000 times faster than previous methods.[SCAN]
This Deep Learning approach has transformed medical imaging.
DanNet more than halved the error rate for object recognition in a contest
already in 2011, 20 years after our Annus Mirabilis.[GPUCNN2]
Soon afterwards,
others applied similar approaches in image recognition contests.[GPUCNN5][MOST]
Like our LSTM results of 2009 (see Sec. 4),
the above-mentioned results with feedforward NNs of 2010-11 attracted enormous interest from industry.
For example, in 2010, we introduced our
deep and fast GPU-based NNs to Arcelor Mittal, the world's largest steel maker,
and were able to greatly improve steel defect detection.[ST]
This may have been the first Deep Learning breakthrough in heavy industry.
Today, most AI startups and major IT firms as well as many other famous companies
are using such supervised fast GPU-NNs.
Let me emphasize, however, that the above-mentioned
supervised deep learning revolutions of
the early 1990s (for recurrent NNs)
and of
2010 (for feedforward NNs)[MLP1-3] did
not at all kill unsupervised learning.
For example, pre-trained language models are now heavily
used by Transformers (see Sec. 8) which
excel at the traditional LSTM domain of
Natural Language Processing[TR1-6]
(although there are still many language tasks that LSTM can
rapidly learn to solve quickly[LSTM13]
while plain Transformers can't).
Remarkably,
unnormalized linear Transformers[ULTRA] (see Sec. 8) were also first published[FWP0-2] in
our
Annus Mirabilis of 1990-1991,[MOST]
together with
unsupervised pre-training for deep learning[UN-UN3] (see the P and the T in ChatGPT).
And
our unsupervised generative adversarial NNs since
1990[AC90-AC20][PLAN][AC]
are still used to endow agents with
artificial curiosity (see Sec. 5 & Sec. 6)—see also a version of our adversarial NNs[AC90b] called GANs.[AC20][R2][PLAN][MOST][DLP] Unsupervised learning still has a bright future!
20. The Amazing FKI Tech Report Series on Artificial Intelligence in the 1990s
In hindsight,
many of the later widely used basic ideas of "modern" Deep Learning were published in
our Miraculous Year 1990-1991 at TU Munich, soon after the fall of the Berlin Wall:
unsupervised or self-supervised, data-generating, adversarial networks (for artificial curiosity and related concepts; see Sec. 5; see also follow-up work at CU in Sec. 7),
the Fundamental Deep Learning Problem (vanishing / exploding gradients; see Sec. 3) and its solutions through (a)
unsupervised pre-training for very deep (recurrent) networks (see Sec. 1) and (b)
basic insights leading to LSTM (see Sec. 3, Sec. 4). We also introduced sequential attention-learning NNs back then—another concept that has become popular (see Sec. 9 on both hard and soft attention, in observation space and in latent space), as well as NNs that learn to program the
fast weights of another NN,
and even their own weights. In particular, we already had what's now called unnormalized linear Transformers with "linearized self-attention" (see Sec. 8).
Plus all the other
things mentioned above, from Hierarchical Reinforcement Learning (see Sec. 10) to
planning with recurrent neural world models (see Sec. 11).
Of course, one had to wait for faster computers to commercialize such algorithms. By the mid 2010s, however, our stuff was
massively used by Apple, Google, Facebook, Amazon, Samsung, Baidu, Microsoft, etc,
many billions of times per day on billions of computers.[DL4]
Most of the results above were actually first published in TU Munich's FKI Tech Report series,
for which I drew many illustrations by hand,
some of them shown in the present page
(see Sec. 10, Sec. 12, Sec. 8, Sec. 18).
The FKI series
now plays an important role in the history of Artificial Intelligence, as it introduced several important concepts:
unsupervised pre-training for very Deep Learning (FKI-148-91;[UN0] see Sec. 1 and the P in ChatGPT),
compressing / distilling
one NN into another (FKI-148-91;[UN0] see Sec. 2), the vanishing gradient problem and
deep residual learning[WHO11] and
Long Short-Term Memory (FKI-207-95;[LSTM0] see Sec. 3, Sec. 4),
Artificial Curiosity through NNs that maximize learning progress (FKI-149-91;[AC91] see Sec. 6),
end-to-end-differentiable Fast Weight Programmers that
learn to program other NNs (separating storage and control for NNs like in
traditional computers; see FKI-147-91[FWP0] and Sec. 8—the outer product version of 1991 is an unnormalized linear Transformer with linearized self-attention),
learning of sequential attention with NNs (FKI-128-90;[ATT0] see Sec. 9),
goal-defining commands as extra NN inputs (FKI-128-90,[ATT0] FKI-129-90;[HRL0] see Sec. 12),
end-to-end-differentiable Hierarchical Reinforcement Learning (FKI-129-90;[HRL0] see Sec. 10),
NNs adjusting NNs / synthetic gradients (FKI-125-90;[NAN2] see Sec. 15).
(Cubic gradient computation for online recurrent NNs also was published as FKI-151-91,[CUB1]
but this one does not really count; see Sec. 16.) In particular, the report
FKI-126-90[AC90] introduced a whole bunch of concepts that are now widely used:
planning in partially observable environments with recurrent world models (see Sec. 11),
high-dimensional reward signals as extra NN inputs / general value functions (see Sec. 13),
deterministic policy gradients (see Sec. 14),
NNs that are both generative and adversarial (GANs; see Sec. 5; see also Sec. 7), for Artificial Curiosity and related concepts.
Later remarkable FKI Tech Reports from the 1990s describe
ways of greatly compressing NNs[KO0][FM] to improve their
generalisation capability.
Peer-reviewed versions came out soon after the tech reports. For example, in 1992, I had a fun contest with the great David MacKay as to who'd have more publications within a single year in Neural Computation,
back then the leading journal of our field. By the end of 1992, both of us had four. But David won,
because his publications (mostly on Bayesian approaches for NNs)
were much longer than mine :-) Disclaimer: Of course, silly measures like number of publications and h-index etc should not matter in science.[NAT1]
21. Concluding Remarks
When only consulting surveys from the Anglosphere,
it is not always clear[DLC]
that Deep Learning was first conceived outside of it.[DLH][NOB][WHO5-11]
It started in 1965 in the Ukraine (back then the USSR) with the first nets of arbitrary depth that really learned[DEEP1-2] (see Sec. 1). A few years later, stochastic gradient descent was successfully applied in Japan to learn internal representations in deep multi-layer perceptrons.[GD1-2a][DLH][DLP][NOB]
Soon afterwards, modern backpropagation was published in Finland (1970)[BP1] (see Sec. 0). The basic
deep convolutional NN architecture
(now widely used) was invented in the 1970s in Japan[CN79][CN86] where NNs with convolutions were later (1987-88) also combined with backpropagation.[CN87][CN88][CN89]
Unsupervised or self-supervised generative adversarial networks that duel each other in a minimax game
to implement artificial curiosity etc (now widely used) originated in Munich (1990, Sec. 5). So did the first self-driving cars in traffic (1994), the
unnormalized linear Transformer[ULTRA][FWP,FWP0] (1991, see the T in ChatGPT & Sec. 8), and neural network distillation (1991, see DeepSeek tweet & Sec. 2). The
fundamental problem of backpropagation-based Deep Learning[VAN1] (1991, see Sec. 3) was also discovered in Munich. So were the first "modern" Deep Learners to overcome this problem, through (1) unsupervised pre-training[UN1-2] (1991, see the P in ChatGPT & Sec. 1), and (2) deep residual learning[WHO11] for very deep gradient-based NNs such as Long Short-Term Memory,[LSTM0-7] "arguably the most commercial AI achievement"[AV1] (see Sec. 4). In fact, as of 2025, the two most frequently cited scientific articles of all time (with the most Google Scholar citations within 3 years—manuals excluded) are both directly based on the 1991 work at TUM.[MOST26]
LSTM was further developed in Switzerland (see Sec. 4), which is also home of
the first image recognition contest-winning
deep GPU-based CNNs (2011, Sec. 19—everybody in computer vision is using this approach now),
the first
superhuman visual pattern recognition (2011),
and the first very deep, working feedforward NNs with hundreds of layers[HW1] based on deep residual learning[WHO11] (see Sec. 4).
Around 1990,
Switzerland also became origin of the World Wide Web,
which allowed for quickly spreading AI around the globe. As of 2025, Switzerland is still
leading the world in AI research in terms of citation impact, though since 2017, China has been the nation that produces the most papers on AI.[THE17]
Of course,
Deep Learning is just a small
part of AI, mostly limited to passive
pattern recognition.
We view it as a by-product of our research on more general AI
through
meta-learning or "learning to learn learning algorithms" (publications since 1987), systems with
artificial curiosity and creativity that
invent their own problems and set their own goals (since 1990),
evolutionary computation (since 1987) and
RNN evolution and
compressed network search,
reinforcement learning (RL) for
agents in realistic partially observable environments
where traditional RL (for board games etc) does not work (since 1989),
artificial super intelligence through
optimal universal learning machines such as the Gödel machine
(2003-),
optimal search for programs
running on general purpose computers such as RNNs, etc.
And of course, AI itself is just part of a grander scheme driving the universe from simple initial conditions
to more and more unfathomable complexity.[SA17]
Finally, even this awesome process may be just a tiny part of the even grander,
optimally efficient computation of all logically possible universes.[ALL1-3]
Acknowledgments
Thanks to several expert reviewers for useful comments. (Let me know under juergen@idsia.ch if you can spot any remaining error.)
The present article[MIR] had an impact on later reports and posts which contain additional relevant references. It also influenced
some of the most popular posts and comments of 2019 at reddit/ML: at the time the largest machine learning forum with over 800k subscribers.[R2-R8]
The contents of this article may be used for educational and non-commercial purposes, including articles for Wikipedia and similar sites.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
References
[25y97]
In 2022, we are celebrating the following works from a quarter-century ago.
1. Journal paper on Long Short-Term Memory, the
most cited neural network (NN) of the 20th century
(and basis of the most cited NN of the 21st).
2. First paper on physical, philosophical and theological consequences of the simplest and fastest way of computing
all possible metaverses
(= computable universes).
3. Implementing artificial curiosity and creativity through generative adversarial agents that learn to design abstract, interesting computational experiments.
4. Journal paper on
meta-reinforcement learning.
5. Journal paper on hierarchical Q-learning.
6. First paper on reinforcement learning to play soccer: start of a series.
7. Journal papers on flat minima & low-complexity NNs that generalize well.
8. Journal paper on Low-Complexity Art, the Minimal Art of the Information Age.
9. Journal paper on probabilistic incremental program evolution.
[AC]
J. Schmidhuber (AI Blog, 2021, updated 2025). 3 decades of artificial curiosity & creativity. Our artificial scientists not only answer given questions but also invent new questions. They achieve curiosity through: (1990) the principle of generative adversarial networks, (1991) neural nets that maximise learning progress, (1995) neural nets that maximise information gain (optimally since 2011), (1997) adversarial design of surprising computational experiments, (2006) maximizing compression progress like scientists/artists/comedians do, (2011) PowerPlay... Since 2012: applications to real robots.
[AC90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
where a generator NN is fighting a predictor NN in a minimax game
(more).
[AC90b]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
HTML.
[AC91]
J. Schmidhuber. Adaptive confidence and adaptive curiosity. Technical Report FKI-149-91, Inst. f. Informatik, Tech. Univ. Munich, April 1991.
PDF.
[AC91b]
J. Schmidhuber.
Curious model-building control systems.
In Proc. International Joint Conference on Neural Networks,
Singapore, volume 2, pages 1458-1463. IEEE, 1991.
PDF.
[AC06]
J. Schmidhuber.
Developmental Robotics,
Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts.
Connection Science, 18(2): 173-187, 2006.
PDF.
[AC09]
J. Schmidhuber. Art & science as by-products of the search for novel patterns, or data compressible in unknown yet learnable ways. In M. Botta (ed.), Et al. Edizioni, 2009, pp. 98-112.
PDF. (More on
artificial scientists and artists.)
[AC10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
[AC20]
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p 58-66, 2020.
Preprint arXiv/1906.04493.
[AIB]
J. Schmidhuber's AI Blog.
With lessons on the history of AI & computing, e.g.:
Who invented deep learning?
Who invented backpropagation?
Who invented convolutional neural networks?
Who invented artificial neural networks?
Who invented generative adversarial networks?
Who invented Transformer neural networks?
Who invented deep residual learning?
Who invented neural knowledge distillation?
Who invented the computer?
Who invented the transistor?
Who invented the integrated circuit?
...
[ALL1]
A Computer Scientist's View of Life, the Universe, and Everything.
LNCS 201-288, Springer, 1997 (submitted 1996).
PDF.
More.
[ALL2]
Algorithmic theories of everything
(2000).
ArXiv:
quant-ph/ 0011122.
See also:
International Journal of Foundations of Computer Science 13(4):587-612, 2002:
PDF.
See also: Proc. COLT 2002:
PDF.
More.
[ALL3]
J. Schmidhuber. The Fastest Way of Computing All Universes. In H. Zenil, ed.,
A Computable Universe.
World Scientific, 2012. PDF of preprint.
More.
[AMH1]
S. I. Amari (1972).
Learning patterns and pattern sequences by self-organizing nets of threshold elements. IEEE Transactions, C 21, 1197-1206, 1972.
PDF.
First publication of the basic equations of the so-called Hopfield network[AMH2] or Amari-Hopfield Network,[AMH3] based on the (uncited) Lenz-Ising recurrent architecture.[L20][I25][DLP][NOB]
See also Little's work (1974-1980)[AMH1b-d] and this tweet.
[AMH1b]
W. A. Little. The existence of persistent states in the brain. Mathematical Biosciences, 19.1-2, p. 101-120, 1974.
Little uses Wannier's ideas of the 1940s[K41][W45] to express neural networks, and mentions
the recurrent Ising model[L20][I25]on which the (uncited) Amari network[AMH1,2] is based.
[AMH1c]
W. A. Little and G. L. Shaw (1978). Analytic Study of the Memory Capacity of a Neural Network. Math Biosci. 39, 281–290 (1978).
This paper shows explicitly how to store-recall patterns with the Ising-Lenz model.
[AMH1d]
W. A. Little (1980). An Ising Model of a Neural Network. In:
W. Jaeger, H. Rost, P. Tautu (eds), Biological Growth and Spread.
Lecture Notes in Biomathematics, vol 38. Springer, Berlin, Heidelberg.
https://doi.org/10.1007/978-3-642-61850-5_18
[AMH2]
J. J. Hopfield (1982). Neural networks and physical systems with emergent
collective computational abilities. Proc. of the National Academy of Sciences,
vol. 79, pages 2554-2558, 1982.
The basic equations of the Hopfield network or Amari-Hopfield Network were first published in 1972 by Amari.[AMH1] [AMH2] did not cite [AMH1]. See [NOB].
[AMH3]
A. P. Millan, J. J. Torres, J. Marro.
How Memory Conforms to Brain Development.
Front. Comput. Neuroscience, 2019
[ARI1]
Aristotle (circa 350 BC). De Anima (On the Soul).
[ATT] J. Schmidhuber (AI Blog, 2020, updated 2025). 30-year anniversary of end-to-end differentiable sequential neural attention. Plus goal-conditional reinforcement learning. Schmidhuber had both hard attention for foveas (1990) and soft attention in form of Transformers with linearized self-attention (1991-93).[FWP] Today, both types are very popular.
[ATT0] J. Schmidhuber and R. Huber.
Learning to generate focus trajectories for attentive vision.
Technical Report FKI-128-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
[ATT1] J. Schmidhuber and R. Huber. Learning to generate artificial fovea trajectories for target detection. International Journal of Neural Systems, 2(1 & 2):135-141, 1991. Based on TR FKI-128-90, TUM, 1990.
PDF.
More.
[ATT2]
J. Schmidhuber.
Learning algorithms for networks with internal and external feedback.
In D. S. Touretzky, J. L. Elman, T. J. Sejnowski, and G. E. Hinton,
editors, Proc. of the 1990 Connectionist Models Summer School, pages
52-61. San Mateo, CA: Morgan Kaufmann, 1990.
PS. (PDF.)
[ATT3]
H. Larochelle, G. E. Hinton. Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS 2010. This work is very similar to [ATT0-2] which the authors did not cite.
In fact, the 2nd author was the reviewer of a 1990 paper[ATT2] which summarised in its Section 5 Schmidhuber's early work on attention: the first implemented neural system for combining glimpses that jointly trains a recognition & prediction component
with an attentional component (the fixation controller).
Two decades later, he wrote about
his own work:[ATT3]
"To our knowledge, this is the first implemented system for combining glimpses that jointly trains a recognition component ... with an attentional component (the fixation controller)." See [MIR](Sec. 9)[R4].
[AV1] A. Vance. Google Amazon and Facebook Owe Jürgen Schmidhuber a Fortune—This Man Is the Godfather the AI Community Wants to Forget. Business Week,
Bloomberg, May 15, 2018.
[BB1]
J. Schmidhuber.
A local learning algorithm for dynamic feedforward and recurrent
networks.
Technical Report FKI-124-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
[BB2]
J. Schmidhuber.
A local learning algorithm for dynamic feedforward and
recurrent networks.
Connection Science, 1(4):403-412, 1989.
(The Neural Bucket Brigade—figures omitted!).
PDF.
HTML.
Compare TR FKI-124-90, TUM, 1990.
PDF.
Proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Based on a "neural economy" for reinforcement learning.
[BP1] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970.
See chapters 6-7 and FORTRAN code on pages 58-60.
PDF.
See also BIT 16, 146-160, 1976.
Link.
The first publication on "modern" backpropagation, also known as the reverse mode of automatic differentiation.
[BP2] P. J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP,
Springer, 1982.
PDF.
First application of backpropagation[BP1] to NNs (concretizing thoughts in Werbos' 1974 thesis).
[BP4] J. Schmidhuber (AI Blog, 2014; updated 2025).
Who invented backpropagation?
See also LinkedIn post (2025).
[BP5]
A. Griewank (2012). Who invented the reverse mode of differentiation?
Documenta Mathematica, Extra Volume ISMP (2012): 389-400.
[BP6]
S. I. Amari (1977).
Neural Theory of Association and Concept Formation.
Biological Cybernetics, vol. 26, p. 175-185, 1977.
See Section 3.1 on using gradient descent for learning in multilayer networks.
[BPA]
H. J. Kelley. Gradient Theory of Optimal Flight Paths. ARS Journal, Vol. 30, No. 10, pp. 947-954, 1960.
Precursor of modern backpropagation.[BP1-4]
[BPB]
A. E. Bryson. A gradient method for optimizing multi-stage allocation processes. Proc. Harvard Univ. Symposium on digital computers and their applications, 1961.
[BPC]
S. E. Dreyfus. The numerical solution of variational problems. Journal of Mathematical Analysis and Applications, 5(1): 30-45, 1962.
[BPTT1]
P. J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 78.10, 1550-1560, 1990.
[BPTT2]
R. J. Williams and D. Zipser. Gradient-based learning algorithms for recurrent networks. In: Backpropagation: Theory, architectures, and applications, p 433, 1995.
[BOU] H Bourlard, N Morgan (1993). Connectionist speech recognition. Kluwer, 1993.
[BRI] Bridle, J.S. (1990). Alpha-Nets: A Recurrent "Neural" Network Architecture with a Hidden Markov Model Interpretation, Speech Communication, vol. 9, no. 1, pp. 83-92.
[BW] H. Bourlard, C. J. Wellekens (1989).
Links between Markov models and multilayer perceptrons. NIPS 1989, p. 502-510.
[CN69]
K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225. This work described rectified linear units or ReLUs, now widely used in CNNs and other neural nets.
[CN79] K. Fukushima (1979). Neural network model for a mechanism of pattern
recognition unaffected by shift in position—Neocognitron.
Trans. IECE, vol. J62-A, no. 10, pp. 658-665, 1979.
The first deep convolutional neural network architecture, with alternating convolutional layers and downsampling layers. In Japanese. English version: [CN80]. More in Scholarpedia.
[CN80]
K. Fukushima: Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position.
Biological Cybernetics, vol. 36, no. 4, pp. 193-202 (April 1980).
Link.
[CN86]
Movie produced by K. Fukushima, S. Miyake and T. Ito (NHK Science and Technical Research Laboratories), in 1986.
YouTube Link. See a popular video tweet by J. Schmidhuber (2025).
[CN87] A. Waibel. Phoneme Recognition Using Time-Delay Neural Networks. Meeting of IEICE, Tokyo, Japan, 1987. Application of backpropagation[BP1][BP2] and weight sharing
to a 1-dimensional convolutional architecture.
[CN87b]
T. Homma, L. Atlas; R. Marks II (1987). An Artificial Neural Network for Spatio-Temporal Bipolar Patterns: Application to Phoneme Classification. Advances in Neural Information Processing Systems (N(eur)IPS), 1:31-40.
[CN88]
W. Zhang, J. Tanida, K. Itoh, Y. Ichioka. Shift-invariant pattern recognition neural network and its optical architecture. Proc. Annual Conference of the Japan Society of Applied Physics, 1988.
PDF.
First "modern" backpropagation-trained 2-dimensional CNN, applied to character recognition.
[CN89]
W. Zhang, J. Tanida, K. Itoh, Y. Ichioka (received 13 April 1989). Parallel distributed processing model with local space-invariant interconnections and its optical architecture. Applied Optics / Vol. 29, No. 32, 1990. PDF.
First journal submission on a "modern" backpropagation-trained 2-dimensional CNN (applied to character recognition).
[CN89b] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel (received July 1989). Backpropagation Applied to Handwritten Zip Code Recognition, Neural Computation, 1(4):541-551, 1989.
Second journal submission on a "modern" backpropagation-trained 2-dimensional CNN (applied to character recognition). Compare [CN88][CN89].
[CN89c] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 3, pp. 328-339, March 1989. Based on [CN87] (1-dimensional convolutions).
[CN89d]
J. Hampshire, A. Waibel (1989). Connectionist architectures for multi-speaker phoneme recognition. In Advances in Neural Information Processing Systems, N(eur)IPS'2. Conference publication on 2D-TDNNs or 2D-CNNs for speech recognition.
[CN90]
K. Yamaguchi, K. Sakamoto, A. Kenji, T. Akabane, Y. Fujimoto. A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90), Kobe, Japan, Nov 1990.
A 1-dimensional NN with convolutions using Max-Pooling instead of Fukushima's
Spatial Averaging.[CN79]
[CN93] Weng, J.,
Ahuja, N., and Huang, T. S. (1993). Learning recognition and segmentation of 3-D objects from 2-D images. Proc. 4th Intl. Conf. Computer Vision, Berlin, Germany, pp. 121-128. A 2-dimensional CNN whose downsampling layers use Max-Pooling
(which has become very popular) instead of Fukushima's
Spatial Averaging.[CN79]
[CN91]
W. Zhang, A. Hasegawa, K. Itoh, Y. Ichioka. Image processing of human corneal
endothelium based on a learning network. Applied Optics, Vol. 30, No. 29, 1991.
First published CNN-based image segmentation.
[CN91b]
W. Zhang, A. Hasegawa, K. Itoh, Y. Ichioka.
Error Back Propagation With Minimum-Entropy Weights: A Technique for
Better Generalization of 2-D Shift-Invariant NNs. IJCNN, 1991.
Got an IJCNN student paper award.
[CN92]
W. Zhang, A. Hasegawa, O. Matoba, K. Itoh, Y. Ichioka, K. Doi. Shift-invariant Neural
Network for Image Processing: Learning and Generalization. SPIE Vol. 1709,
Application of Artificial Neural Networks III, Orlando, 1992.
[CN94]
W. Zhang, K. Doi, M. Giger, Y. Wu, R. Nishikawa, R. Schmidt. Computerized
detection of cluster microcalcifications in digital mammogram using a shift-invariant
neural network. Medical Physics, 21(4), 1994.
First CNN for object detection, commercialised by R2 Technology, which processed
over 30
million mammography exams annually to aid radiologists in breast cancer detection.
[CN98]
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86 (11): 2278-2324.
This work about backpropagation-trained 2-dimensional CNNs for character recognition failed to cite the original work on this by Zhang et al. (1988).[CN88][CN89]
[CN99]
S. Behnke. Learning iterative image reconstruction in the neural abstraction pyramid. International Journal of Computational Intelligence and Applications, 1(4):427-438, 1999.
[CN03]
S. Behnke. Hierarchical Neural Networks for Image Interpretation, volume LNCS 2766 of Lecture Notes in Computer Science. Springer, 2003.
[CN07] M. A. Ranzato, Y. LeCun: A Sparse and Locally Shift Invariant Feature Extractor Applied to Document Images. Proc. ICDAR, 2007
[CN08]
D. Ciresan.
Recognition of handwritten numerical strings (in Romanian).
UPT Doctoral Theses, Series 10, No. 10, Politehnica Publishing House,
2008. ISBN: 978-973-625-777-3. GPU-based
convolutional neural networks recognise
partially overlapping digits, without prior segmentation.
20 times faster than the CPU version.
[CN10]
D. Scherer, A. Mueller, S. Behnke. Evaluation of pooling operations in convolutional architectures for object recognition. In Proc. International Conference on Artificial Neural Networks (ICANN), pages 92-101, 2010.
[CN21] Bower Award Ceremony 2021:
Jürgen Schmidhuber lauds Kunihiko Fukushima. YouTube video, 2021.
[CN25]
J. Schmidhuber.
Who invented convolutional neural networks? Technical Note IDSIA-17-25, IDSIA, 2025. See popular tweet.
[CN25b]
F. Chollet, ex-Google, creator of Keras (tweets, 3 Aug 2025): "The big breakthrough for convnets was the first GPU-accelerated CUDA implementation, which immediately started winning first place in image classification competitions. Remember when that happened? I do. That was Dan Ciresan in 2011 [...] it was an actual system that won 2 competitions and even got significant media coverage at the time. Most computer vision researchers knew about it in late 2011 / early 2012."
[CO1]
J. Koutnik, F. Gomez, J. Schmidhuber (2010). Evolving Neural Networks in Compressed Weight Space. Proceedings of the Genetic and Evolutionary Computation Conference
(GECCO-2010), Portland, 2010.
PDF.
[CO2]
J. Koutnik, G. Cuccu, J. Schmidhuber, F. Gomez.
Evolving Large-Scale Neural Networks for Vision-Based Reinforcement Learning.
In Proceedings of the Genetic and Evolutionary
Computation Conference (GECCO), Amsterdam, July 2013.
PDF.
[CO3]
R. K. Srivastava, J. Schmidhuber, F. Gomez.
Generalized Compressed Network Search.
Proc. GECCO 2012.
PDF.
[CTC] A. Graves, S. Fernandez, F. Gomez, J. Schmidhuber. Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks. ICML 06, Pittsburgh, 2006.
PDF.
[CUB0]
R. J. Williams.
Complexity of exact gradient computation algorithms for recurrent
neural networks. Technical Report NU-CCS-89-27, Northeastern University,
College of Computer Science, 1989.
[CUB1]
An O(n3) learning algorithm for fully recurrent neural networks.
Technical Report FKI-151-91, Institut für Informatik, Technische
Universität München, 1991.
PDF.
[CUB2]
J. Schmidhuber.
A fixed size
storage O(n3) time complexity learning algorithm for fully recurrent
continually running networks.
Neural Computation, 4(2):243-248, 1992.
PDF.
[CW]
J. Koutnik, K. Greff, F. Gomez, J. Schmidhuber. A Clockwork RNN. Proc. 31st International Conference on Machine Learning (ICML), p. 1845-1853, Beijing, 2014. Preprint arXiv:1402.3511 [cs.NE].
[DAN]
J. Schmidhuber (AI Blog, 2021).
10-year anniversary. In 2011, DanNet triggered the deep convolutional neural network (CNN) revolution. Named after Schmidhuber's outstanding postdoc Dan Ciresan, it was the first deep and fast CNN to win international computer vision contests, and had a temporary monopoly on winning them, driven by a very fast implementation based on graphics processing units (GPUs).
1st superhuman result in 2011.[DAN1] Now everybody is using this approach.
[DAN1]
J. Schmidhuber (AI Blog, 2011; updated 2021 for 10th birthday of DanNet): First superhuman visual pattern recognition.
At the IJCNN 2011 computer vision competition in Silicon Valley,
the artificial neural network called DanNet performed twice better than humans, three times better than the closest artificial competitor (from LeCun's team), and six times better than the best non-neural method.
[DEC] J. Schmidhuber (AI Blog, 02/20/2020, updated 2025). The 2010s: Our Decade of Deep Learning / Outlook on the 2020s. The recent decade's most important developments and industrial applications based on our AI, with an outlook on the 2020s, also addressing privacy and data markets.
[DEEP1]
Ivakhnenko, A. G. and Lapa, V. G. (1965). Cybernetic Predicting Devices. CCM Information Corporation. First working Deep Learners with many layers, learning internal representations.
[DEEP1a]
Ivakhnenko, Alexey Grigorevich. The group method of data of handling; a rival of the method of stochastic approximation. Soviet Automatic Control 13 (1968): 43-55.
[DEEP2]
Ivakhnenko, A. G. (1971). Polynomial theory of complex systems. IEEE Transactions on Systems, Man and Cybernetics, (4):364-378.
[DIST2]
O. Vinyals, J. A. Dean, G. E. Hinton.
Distilling the Knowledge in a Neural Network.
Preprint arXiv:1503.02531 [stat.ML], 2015.
The authors did not cite the original
1991 NN distillation procedure,[UN0-2][MIR](Sec. 2)[DLP]
not even in the later patent application US20150356461A1.
[DIST25]
J. Schmidhuber. Who invented knowledge distillation with artificial neural networks? Technical Note IDSIA-12-25, IDSIA, Nov 2025.
[DL1] J. Schmidhuber, 2015.
Deep learning in neural networks: An overview. Neural Networks, 61, 85-117.
More.
Got the first Best Paper Award ever issued by the journal Neural Networks, founded in 1988.
[DL2] J. Schmidhuber, 2015.
Deep Learning.
Scholarpedia, 10(11):32832.
[DL25]
J. Schmidhuber. Who invented deep learning? Technical Note IDSIA-16-25, IDSIA, November 2025.
[DL4] J. Schmidhuber (AI Blog, 2017).
Our impact on the world's most valuable public companies: Apple, Google, Microsoft, Facebook, Amazon... By 2015-17, neural nets developed in Schmidhuber's labs were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute. Examples: greatly improved (CTC-based) speech recognition on all Android phones, greatly improved machine translation through Google Translate and Facebook (over 4 billion LSTM-based translations per day), Apple's Siri and Quicktype on all iPhones, the answers of Amazon's Alexa, etc. Google's 2019
on-device speech recognition
(on the phone, not the server)
is still based on LSTM.
[DL6]
F. Gomez and J. Schmidhuber.
Co-evolving recurrent neurons learn deep memory POMDPs.
In Proc. GECCO'05, Washington, D. C.,
pp. 1795-1802, ACM Press, New York, NY, USA, 2005.
PDF.
[DL6a]
J. Schmidhuber (AI Blog, Nov 2020, updated 2025). 20-year anniversary: 1st paper with "learn deep" in the title (2005). Our deep reinforcement learning & neuroevolution solved problems of depth 1000 and more.[DL6] Soon after its publication, everybody started talking about "deep learning." Causality or correlation?
[DLC] J. Schmidhuber (AI Blog, June 2015).
Critique of Paper by self-proclaimed[DLC1-2] "Deep Learning Conspiracy" (Nature 521 p 436).
The inventor of an important method should get credit for inventing it. She may not always be the one who popularizes it. Then the popularizer should get credit for popularizing it (but not for inventing it).
[DLC1]
Y. LeCun. IEEE Spectrum Interview by L. Gomes, Feb 2015.
Quote: "A lot of us involved in the resurgence of Deep Learning in the mid-2000s, including Geoff Hinton, Yoshua Bengio, and myself—the so-called 'Deep Learning conspiracy' ..."
[DLC2]
M. Bergen, K. Wagner (2015).
Welcome to the AI Conspiracy: The 'Canadian Mafia' Behind Tech's Latest Craze. Vox recode, 15 July 2015.
Quote: "... referred to themselves as the 'deep learning conspiracy.' Others called them the 'Canadian Mafia.'"
[DLH]
J. Schmidhuber.
Annotated History of Modern AI and Deep Learning. Technical Report IDSIA-22-22, IDSIA, Switzerland, 2022, updated 2025.
Preprint arXiv:2212.11279.
Tweet of 2022.
[DLP]
J. Schmidhuber.
How 3 Turing awardees republished key methods and ideas whose creators they failed to credit. Technical Report IDSIA-23-23, Swiss AI Lab IDSIA, 14 Dec 2023, updated 2025.
Tweet of 2023.
[DM1]
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller. Playing Atari with Deep Reinforcement Learning. Tech Report, 19 Dec. 2013,
arxiv:1312.5602.
[DM2] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis. Human-level control through deep reinforcement learning. Nature, vol. 518, p 1529, 26 Feb. 2015.
Link.
DeepMind's first famous paper. Its abstract claims: "While reinforcement learning agents have achieved some successes in a variety of domains, their applicability has previously been limited to domains in which useful features can be handcrafted, or to domains with fully observed, low-dimensional state spaces." It also claims to bridge "the divide between high-dimensional sensory inputs and actions." Similarly, the first sentence of the abstract of the earlier tech report version[DM1] of [DM2] claims to "present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning."
However, the first such system (requiring no unsupervised pre-training) was created earlier by Jan Koutnik et al. in Schmidhuber's lab.[CO2] Disclaimer:
DeepMind was co-founded by Shane Legg, a PhD student from this lab; he and Daan Wierstra (another PhD student of Schmidhuber and DeepMind's 1st employee) were the first persons at DeepMind who had AI publications and PhDs in computer science. More.
[DM3]
S. Stanford. DeepMind's AI, AlphaStar Showcases Significant Progress Towards AGI. Medium ML Memoirs, 2019.
Alphastar has a "deep LSTM core."
[DNC]
A. Graves, G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwinska, S. G. Colmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G. Ostrovski, A. Cain, H. King, C. Summerfield, P. Blunsom, K. Kavukcuoglu, D. Hassabis.
Hybrid computing using a neural network with dynamic external memory.
Nature, 538:7626, p 471, 2016.
This work of DeepMind did not cite the original work of the early 1990s on neural networks learning to control dynamic external memories.[PDA1-2][FWP0-1][ULTRA]
[DDPG]
T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra.
Continuous control with deep reinforcement learning.
Preprint arXiv:1509.02971, 2015.
[DPG]
D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller. Deterministic policy gradient algorithms.
Proceedings of ICML'31, Beijing, China, 2014. JMLR: W&CP volume 32.
[Drop1] S. J. Hanson (1990). A Stochastic Version of the Delta Rule, PHYSICA D,42, 265-272.
What's now called "dropout" is a variation of the stochastic delta rule—compare preprint
arXiv:1808.03578, 2018.
[Drop2]
N. Frazier-Logue, S. J. Hanson (2020). The Stochastic Delta Rule: Faster and More Accurate Deep Learning Through Adaptive Weight Noise. Neural Computation 32(5):1018-1032.
[Drop3]
J. Hertz, A. Krogh, R. Palmer (1991). Introduction to the Theory of Neural Computation. Redwood City, California: Addison-Wesley Pub. Co., pp. 45-46.
[Drop4]
N. Frazier-Logue, S. J. Hanson (2018). Dropout is a special case of the stochastic delta rule: faster and more accurate deep learning.
Preprint
arXiv:1808.03578, 2018.
[DS1]
DeepSeek-AI (2025).
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Preprint arXiv:2501.12948. See the popular DeepSeek tweet of Jan 2025.
[FAST] C. v.d. Malsburg. Tech Report 81-2, Abteilung f. Neurobiologie,
Max-Planck Institut f. Biophysik und Chemie, Goettingen, 1981.
First paper on fast weights or dynamic links.
[FASTa]
J. A. Feldman. Dynamic connections in neural networks.
Biological Cybernetics, 46(1):27-39, 1982.
2nd paper on fast weights.
[FASTb]
G. E. Hinton, D. C. Plaut. Using fast weights to deblur old memories. Proc. 9th annual conference of the Cognitive Science Society (pp. 177-186), 1987.
3rd paper on fast weights (two types of weights with different learning rates).
[FB17]
By 2017, Facebook
used LSTM
to handle
over 4 billion automatic translations per day (The Verge, August 4, 2017);
see also
Facebook blog by J.M. Pino, A. Sidorov, N.F. Ayan (August 3, 2017)
[FM]
S. Hochreiter and J. Schmidhuber.
Flat minimum search finds simple nets.
Technical Report FKI-200-94, Fakultät für Informatik,
Technische Universität München, December 1994.
PDF.
[FWP]
J. Schmidhuber (AI Blog, 26 March 2021, updated 2025).
26 March 1991: Neural nets learn to program neural nets with fast weights—like Transformer variants. 2021: New stuff!
30-year anniversary of a now popular
alternative[FWP0-1] to recurrent NNs.
A slow feedforward NN learns by gradient descent to program the changes of
the fast weights[FAST,FASTa] of
another NN, separating memory and control like in traditional computers.
Such Fast Weight Programmers[FWP0-6,FWPMETA1-8] can learn to memorize past data, e.g.,
by computing fast weight changes through additive outer products of self-invented activation patterns[FWP0-1]
(now often called keys and values for self-attention[TR1-6]).
The similar Transformers[TR1-2] combine this with projections
and softmax and
are now widely used in natural language processing.
For long input sequences, their efficiency was improved through
Transformers with linearized self-attention[TR5-6]
which are formally equivalent to Schmidhuber's 1991 outer product-based Fast Weight Programmers (apart from normalization), now called unnormalized linear Transformers.[ULTRA]
In 1993, he introduced
the attention terminology[FWP2] now used
in this context,[ATT] and
extended the approach to
RNNs that program themselves.
See tweet of 2022.
[FWP0]
J. Schmidhuber.
Learning to control fast-weight memories: An alternative to recurrent nets.
Technical Report FKI-147-91, Institut für Informatik, Technische
Universität München, 26 March 1991.
PDF.
First paper on neural fast weight programmers that separate storage and control: a slow net learns by gradient descent to compute weight changes of a fast net. The outer product-based version (Eq. 5) is now known as the unnormalized linear Transformer or the "Transformer with linearized self-attention."[ULTRA][FWP]
[FWP1] J. Schmidhuber. Learning to control fast-weight memories: An alternative to recurrent nets. Neural Computation, 4(1):131-139, 1992. Based on [FWP0].
PDF.
HTML.
Pictures (German).
See tweet of 2022 for 30-year anniversary.
[FWP2] J. Schmidhuber. Reducing the ratio between learning complexity and number of time-varying variables in fully recurrent nets. In Proceedings of the International Conference on Artificial Neural Networks, Amsterdam, pages 460-463. Springer, 1993.
PDF.
A recurrent extension of the unnormalized linear Transformer,[ULTRA] introducing the terminology of learning "internal spotlights of attention." First recurrent NN-based fast weight programmer using outer products to program weight matrices.
[FWP3] I. Schlag, J. Schmidhuber. Gated Fast Weights for On-The-Fly Neural Program Generation. Workshop on Meta-Learning, @N(eur)IPS 2017, Long Beach, CA, USA.
[FWP3a] I. Schlag, J. Schmidhuber. Learning to Reason with Third Order Tensor Products. Advances in Neural Information Processing Systems (N(eur)IPS), Montreal, 2018.
Preprint: arXiv:1811.12143. PDF.
[FWP6] I. Schlag, K. Irie, J. Schmidhuber.
Linear Transformers Are Secretly Fast Weight Programmers. ICML 2021. Preprint: arXiv:2102.11174.
[FWP7] K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
Going Beyond Linear Transformers with Recurrent Fast Weight Programmers.
NeurIPS 2021.
Preprint: arXiv:2106.06295 (June 2021).
[FWP8] K. Irie, F. Faccio, J. Schmidhuber.
Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules.
NeurIPS 2022.
[FWP9] K. Irie, J. Schmidhuber.
Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules.
ICLR 2023.
[FWP23]
J.von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, M. Vladymyrov.
Transformers learn in-context by gradient descent.
ICML 2023. The core FWP principle of "NNs that learn to program the fast weight changes of other NNs" [FWP0] and [FWP6] provide an intuitive conception of what's now called "in-context learning."
[FWP24]
A. Behrouz, P. Zhong, V. Mirrokni.
Titans: Learning to Memorize at Test Time.
Arxiv preprint 2501.00663, 2024.
[FWP25]
J. von Oswald, N. Scherrer, S. Kobayashi, L. Versari, S. Yang, M. Schlegel, K. Maile, Y. Schimpf, O. Sieberling, A. Meulemans, R. A. Saurous, G. Lajoie, C. Frenkel, R. Pascanu, B. Aguera y Arcas, J. Sacramento.
MesaNet: Sequence Modeling by Locally Optimal Test-Time Training.
Arxiv preprint 2506.05233, 2025.
[FWP25b]
Y. Sun, X. Li, K. Dalal, J. Xu, A. Vikram, G. Zhang, Y. Dubois, X. Chen, X. Wang, S. Koyejo, T. Hashimoto, C. Guestrin.
Learning to (Learn at Test Time): RNNs with Expressive Hidden States.
ICML 2025.
[FWP25c]
K. Irie, S. J. Gershman.
Fast weight programming and linear Transformers: from machine learning to neurobiology.
Arxiv preprint 2508.08435, 2025.
[FWPMETA1] J. Schmidhuber. Steps towards `self-referential' learning. Technical Report CU-CS-627-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
PDF.
[FWPMETA2] J. Schmidhuber. A self-referential weight matrix.
In Proceedings of the International Conference on Artificial
Neural Networks, Amsterdam, pages 446-451. Springer, 1993.
PDF.
[FWPMETA3] J. Schmidhuber.
An introspective network that can learn to run its own weight change algorithm. In Proc. of the Intl. Conf. on Artificial Neural Networks,
Brighton, pages 191-195. IEE, 1993.
[FWPMETA4]
J. Schmidhuber.
A neural network that embeds its own meta-levels.
In Proc. of the International Conference on Neural Networks '93,
San Francisco. IEEE, 1993.
[FWPMETA5]
J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
A recurrent neural net with a self-referential, self-reading, self-modifying weight matrix
can be found here.
[FWPMETA6]
L. Kirsch and J. Schmidhuber. Meta Learning Backpropagation & Improving It. Advances in Neural Information Processing Systems (NeurIPS), 2021.
Preprint arXiv:2012.14905 [cs.LG], 2020.
[FWPMETA7]
I. Schlag, T. Munkhdalai, J. Schmidhuber.
Learning Associative Inference Using Fast Weight Memory.
Report arXiv:2011.07831 [cs.AI], 2020.
[FWPMETA8]
K. Irie, I. Schlag, R. Csordas, J. Schmidhuber.
A Modern Self-Referential Weight Matrix That Learns to Modify Itself.
International Conference on Machine Learning (ICML), 2022.
Preprint: arXiv:2202.05780.
[FWPMETA9]
L. Kirsch and J. Schmidhuber.
Self-Referential Meta Learning.
First Conference on Automated Machine Learning (Late-Breaking Workshop), 2022.
[FWPMETA10] K. Irie, R. Csordas, J. Schmidhuber.
Metalearning Continual Learning Algorithms.
TMLR 2025.
[G63] R. J Glauber (1963). Time-dependent statistics of the Ising model.
Journal of Mathematical Physics, 4(2):294-307, 1963.
[GAN90]
J. Schmidhuber.
Making the world differentiable: On using fully recurrent
self-supervised neural networks for dynamic reinforcement learning and
planning in non-stationary environments.
Technical Report FKI-126-90, TUM, Feb 1990, revised Nov 1990.
PDF.
The first paper on planning with reinforcement learning recurrent neural networks (NNs) (more) and on generative adversarial networks
where a generator NN is fighting a predictor NN in a minimax game
(more).
See [AC90].
[GAN91]
J. Schmidhuber.
A possibility for implementing curiosity and boredom in
model-building neural controllers.
In J. A. Meyer and S. W. Wilson, editors, Proc. of the
International Conference on Simulation
of Adaptive Behavior: From Animals to
Animats, pages 222-227. MIT Press/Bradford Books, 1991.
PDF.
More.
Based on [GAN90].
See [AC90b].
[GAN10]
J. Schmidhuber. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3):230-247, 2010.
IEEE link.
PDF.
This well-known 2010 survey summarised the generative adversarial NNs of 1990 as follows: a
"neural network as a predictive world model is used to maximize the controller's intrinsic reward, which is proportional to the model's prediction errors" (which are minimized).
See [AC10].
[GAN10b]
O. Niemitalo. A method for training artificial neural networks to generate missing data within a variable context.
Blog post, Internet Archive, 2010.
A blog post describing the basic ideas[GAN90-91][GAN20][AC] of GANs.
[GAN14]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, Y. Bengio.
Generative adversarial nets. NIPS 2014, 2672-2680, Dec 2014.
A description of GANs that does not cite Schmidhuber's original GAN principle of 1990[GAN90-91][GAN20][AC][R2][DLP] and contains wrong claims about Schmidhuber's adversarial NNs for
Predictability Minimization.[PM0-2][GAN20][DLP]
[GAN19]
T. Karras, S. Laine, T. Aila. A style-based generator architecture for generative adversarial
networks. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages
4401-4410, 2019.
[GAN19b]
D. Fallis. The epistemic threat of deepfakes. Philosophy & Technology 34.4 (2021):623-643.
[GAN20]
J. Schmidhuber. Generative Adversarial Networks are Special Cases of Artificial Curiosity (1990) and also Closely Related to Predictability Minimization (1991).
Neural Networks, Volume 127, p 58-66, 2020.
Preprint arXiv/1906.04493. See [AC20].
[GAN25]
J. Schmidhuber. Who Invented Generative Adversarial Networks? Technical Note IDSIA-14-25, IDSIA, December 2025.
[GDa]
Y. Z. Tsypkin (1966). Adaptation, training and self-organization automatic control systems,
Avtomatika I Telemekhanika, 27, 23-61.
On gradient descent-based on-line learning for non-linear systems.
[GDb]
Y. Z. Tsypkin (1971). Adaptation and Learning in Automatic Systems, Academic Press, 1971.
On gradient descent-based on-line learning for non-linear systems.
[GD1]
S. I. Amari (1967).
A theory of adaptive pattern classifier, IEEE Trans, EC-16, 279-307 (Japanese version published in 1965).
PDF.
Probably the first paper on using stochastic gradient descent[STO51-52] for learning in multilayer neural networks
(without specifying the specific gradient descent method now known as reverse mode of automatic differentiation or backpropagation[BP1]).
[GD2]
S. I. Amari (1968).
Information Theory—Geometric Theory of Information, Kyoritsu Publ., 1968 (in Japanese).
OCR-based PDF scan of pages 94-135 (see pages 119-120).
Contains computer simulation results for a five layer network (with 2 modifiable layers) which learns internal representations to classify
non-linearily separable pattern classes.
[GD2a]
S. Saito (1967). Master's thesis, Graduate School of Engineering, Kyushu University, Japan.
Implementation of Amari's 1967 stochastic gradient descent method for multilayer perceptrons.[GD1] (S. Amari, personal communication, 2021.)
[GOD]
K. Gödel. Über formal unentscheidbare Sätze der Principia Mathematica und verwandter Systeme I. Monatshefte für Mathematik und Physik, 38:173-198, 1931.
In the early 1930s,
Gödel founded theoretical computer science. He identified fundamental limits of mathematics and theorem proving and computing and Artificial Intelligence.
[GOD21] J. Schmidhuber (AI Blog, 2021). 90th anniversary celebrations: 1931: Kurt Gödel, founder of theoretical computer science, shows limits of math, logic, computing, and artificial intelligence. This was number 1 on Hacker News.
[GOD21a]
J. Schmidhuber (2021). Als Kurt Gödel die Grenzen des Berechenbaren entdeckte.
(When Kurt Gödel discovered the limits of computability.)
Frankfurter Allgemeine Zeitung, 16/6/2021.
[GOD21b]
J. Schmidhuber (AI Blog, 2021). 80. Jahrestag: 1931: Kurt Gödel, Vater der theoretischen Informatik, entdeckt die Grenzen des Berechenbaren und der künstlichen Intelligenz.
[GOD34]
K. Gödel (1934).
On undecidable propositions of formal mathematical
systems. Notes by S. C. Kleene and J. B. Rosser on lectures
at the Institute for Advanced Study, Princeton, New Jersey, 1934, 30
pp. (Reprinted in M. Davis, (ed.), The Undecidable. Basic Papers on Undecidable
Propositions, Unsolvable Problems, and Computable Functions,
Raven Press, Hewlett, New York, 1965.)
Gödel introduced a universal coding language.
[GP]
J. Schmidhuber (AI Blog, 2020).
Genetic Programming for code of unlimited size (1987).
[GP0]
N. Cramer.
A Representation for the Adaptive Generation of Simple Sequential Programs, Proc. of an Intl. Conf. on Genetic Algorithms and their Applications, Carnegie-Mellon University, July 24-26, 1985.
[GP1]
D. Dickmanns, J. Schmidhuber, and A. Winklhofer.
Der genetische Algorithmus: Eine Implementierung in Prolog.
Fortgeschrittenenpraktikum, Institut für Informatik, Lehrstuhl
Prof. Radig, Technische Universität München, 1987.
Probably the first work on Genetic Programming for evolving programs
of unlimited size written in a universal coding language. Based on Schmidhuber's work since 1985 on a Symbolics Lisp Machine of SIEMENS AG. Authors in alphabetical order. More.
[GPA]
S. F. Smith. A Learning System Based on Genetic Adaptive Algorithms, PhD Thesis, Univ. Pittsburgh, 1980.
[GPT3]
T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei.
Language Models are Few-Shot Learners (2020).
Preprint arXiv/2005.14165.
[GPUNN]
Oh, K.-S. and Jung, K. (2004). GPU implementation of neural networks. Pattern Recognition, 37(6):1311-1314. Speeding up traditional NNs on GPU by a factor of 20.
[GPUCNN]
K. Chellapilla, S. Puri, P. Simard. High performance convolutional neural networks for document processing. International Workshop on Frontiers in Handwriting Recognition, 2006. Speeding up shallow CNNs on GPU by a factor of 4.
[GPUCNN1] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, J. Schmidhuber. Flexible, High Performance Convolutional Neural Networks for Image Classification. International Joint Conference on Artificial Intelligence (IJCAI-2011, Barcelona), 2011. PDF. ArXiv preprint.
Speeding up deep CNNs on GPU by a factor of 60.
Used to
win four important computer vision competitions 2011-2012 before others won any
with similar approaches.
[GPUCNN2] D. C. Ciresan, U. Meier, J. Masci, J. Schmidhuber.
A Committee of Neural Networks for Traffic Sign Classification.
International Joint Conference on Neural Networks (IJCNN-2011, San Francisco), 2011.
PDF.
HTML overview.
First superhuman performance in a computer vision contest, with half the error rate of humans, and one third the error rate of the closest competitor.[DAN1] This led to massive interest from industry.
[GPUCNN3] D. C. Ciresan, U. Meier, J. Schmidhuber. Multi-column Deep Neural Networks for Image Classification. Proc. IEEE Conf. on Computer Vision and Pattern Recognition CVPR 2012, p 3642-3649, July 2012. PDF. Longer TR of Feb 2012: arXiv:1202.2745v1 [cs.CV]. More.
[GPUCNN4] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. NIPS 25, MIT Press, Dec 2012.
PDF.
This paper describes AlexNet, which is similar to the earlier
DanNet,[DAN,DAN1][R6]
the first pure deep CNN
to win computer vision contests in 2011[GPUCNN2-3,5] (AlexNet and VGG Net[GPUCNN9] followed in 2012-2014). [GPUCNN4] emphasizes benefits of Fukushima's ReLUs (1969)[CN69] and dropout (a variant of Hanson 1990 stochastic delta rule)[Drop1-4] but neither cites the original work[CN69][Drop1] nor the basic CNN architecture (Fukushima, 1979).[CN79]
[GPUCNN5]
J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet): History of computer vision contests won by deep CNNs since 2011. DanNet was the first CNN to win one, and won 4 of them in a row before the similar AlexNet/VGG Net and a Highway Net with open gates joined the party. Today, deep CNNs are standard in computer vision.
[GPUCNN6] J. Schmidhuber, D. Ciresan, U. Meier, J. Masci, A. Graves. On Fast Deep Nets for AGI Vision. In Proc. Fourth Conference on Artificial General Intelligence (AGI-11), Google, Mountain View, California, 2011.
PDF.
[GPUCNN7] D. C. Ciresan, A. Giusti, L. M. Gambardella, J. Schmidhuber. Mitosis Detection in Breast Cancer Histology Images using Deep Neural Networks. MICCAI 2013.
PDF.
[GPUCNN8] J. Schmidhuber (AI Blog, 2017; updated 2021 for 10th birthday of DanNet).
First deep learner to win a contest on object detection in large images—
first deep learner to win a medical imaging contest (2012). Link.
How the Swiss AI Lab IDSIA used GPU-based CNNs to win the
ICPR 2012 Contest on Mitosis Detection
and the MICCAI 2013 Grand Challenge.
[GPUCNN9]
K. Simonyan, A. Zisserman. Very deep convolutional networks for large-scale image recognition. Preprint arXiv:1409.1556 (2014).
[GSR15] Dramatic
improvement of Google's speech recognition through LSTM:
Alphr Technology, Jul 2015, or 9to5google, Jul 2015
[GT16] Google's
dramatically improved Google Translate of 2016 is based on LSTM, e.g.,
WIRED, Sep 2016,
or
siliconANGLE, Sep 2016
[GVF]
R. Sutton, J. Modayil, M. Delp, T. De-gris, P. M. Pilarski, A. White, AD. Precup. Horde: A scalable real-time architecture for learning knowledge from unsupervised sensorimotor interaction. 10th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp.761-768, 2011.
[H86] J. L. van Hemmen (1986). Spin-glass models of a neural network.
Phys. Rev. A 34, 3435, 1 Oct 1986.
[H88]
H. Sompolinsky (1988). Statistical Mechanics of Neural Networks.
Physics Today 41, 12, 70, 1988.
[H90]
W. D. Hillis.
Co-evolving parasites improve simulated evolution as an optimization
procedure.
Physica D: Nonlinear Phenomena, 42(1-3):228-234, 1990.
[HB96]
S. El Hihi, Y. Bengio. Hierarchical recurrent neural networks for long-term dependencies. NIPS, 1996.
[HEL]
P. Dayan, G. E. Hinton, R. M. Neal, and R. S. Zemel.
The Helmholtz machine.
Neural Computation, 7:889-904, 1995.
An unsupervised learning algorithm related to Schmidhuber's supervised Neural Heat Exchanger.[NHE]
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of 2019 Honda Prize. Science must not allow corporate PR to distort the academic record.
[HRLW]
C. Watkins (1989). Learning from delayed rewards.
[HRL0]
J. Schmidhuber.
Towards compositional learning with dynamic neural networks.
Technical Report FKI-129-90, Institut für Informatik, Technische
Universität München, 1990.
PDF.
An RL machine gets extra command inputs of the form (start, goal). An evaluator NN learns to predict the current rewards/costs of going from start to goal. An (R)NN-based subgoal generator also sees (start, goal), and uses (copies of) the evaluator NN to learn by gradient descent a sequence of cost-minimising intermediate subgoals. The RL machine tries to use such subgoal sequences to achieve final goals.
The system is learning action plans
at multiple levels of abstraction and multiple time scales and solves what Y. LeCun called an "open problem" in 2022.[LEC]
[HRL1]
J. Schmidhuber. Learning to generate sub-goals for action sequences. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 967-972. Elsevier Science Publishers B.V., North-Holland, 1991. PDF. Extending TR FKI-129-90, TUM, 1990.
HTML & images in German.
[HRL2]
J. Schmidhuber and R. Wahnsiedler.
Planning simple trajectories using neural subgoal generators.
In J. A. Meyer, H. L. Roitblat, and S. W. Wilson, editors, Proc.
of the 2nd International Conference on Simulation of Adaptive Behavior,
pages 196-202. MIT Press, 1992.
PDF.
HTML & images in German.
[HRL3]
P. Dayan and G. E. Hinton.
Feudal Reinforcement Learning.
Advances in Neural Information Processing Systems 5, NIPS, 1992.
[HRL4]
M. Wiering and J. Schmidhuber. HQ-Learning. Adaptive Behavior 6(2):219-246, 1997.
PDF.
[HUT4]
M. Hutter.
Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004. (Based on work done under J. Schmidhuber's SNF grant 20-61847: unification of universal induction and sequential decision theory, 2000).
[HW]
J. Schmidhuber
(AI Blog, 2015, updated 2025 for 10-year anniversary).
Overview of Highway Networks: First working really deep feedforward neural networks with hundreds of layers.
[HW1] R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks.
Preprints arXiv:1505.00387 (May 2015) and arXiv:1507.06228 (Training Very Deep Networks; July 2015). Also at NeurIPS 2015. The first working very deep gradient-based feedforward neural nets (FNNs) with hundreds of layers, ten times deeper than previous gradient-based FNNs. Let g, t, h, denote non-linear differentiable functions. Each non-input layer of a Highway Net computes g(x)x + t(x)h(x), where x is the data from the previous layer. The gates g(x) are typically initialised to 1.0, to obtain plain residual connections (weight 1.0) [VAN1][HW25]. This allows for very deep error propagation, which makes Highway NNs so deep. The later Resnet (Dec 2015) [HW2] adopted this principle. It is like a Highway net variant whose gates are always open: g(x)=t(x)=const=1. That is, Highway Nets are gated ResNets: set the gates to 1.0→ResNet.
The residual parts of a Highway Net are like those of an unfolded 1999 LSTM [LSTM2a], while the residual parts of a ResNet are like those of an unfolded 1997 LSTM [LSTM1][HW25].
Highway Nets perform roughly as well as ResNets on ImageNet [HW3]. Variants of Highway gates are also used for certain algorithmic tasks, where plain residual layers do not work as well [NDR]. See also [HW25]: who invented deep residual learning?
More.
[HW1a]
R. K. Srivastava, K. Greff, J. Schmidhuber. Highway networks. Presentation at the Deep Learning Workshop, ICML'15, July 10-11, 2015.
Link.
[HW2] He, K., Zhang,
X., Ren, S., Sun, J. Deep residual learning for image recognition. Preprint
arXiv:1512.03385
(Dec 2015).
Microsoft's ResNet paper refers to the Highway Net (May 2015) [HW1] as 'concurrent'. However, this is incorrect: ResNet was published seven months later. Although the ResNet paper acknowledges the problem of vanishing/exploding gradients, it fails to recognise that S. Hochreiter first identified the issue in 1991 and developed the residual connection solution (weight 1.0) [VAN1][HW25]. The ResNet paper cites the earlier Highway Net in a way that does not make it clear that ResNets are essentially open-gated Highway Nets and that Highway Nets are gated ResNets. It also fails to mention that the gates of residual connections in Highway Nets are initially open (1.0), meaning that Highway Nets start out with standard residual connections, to achieve deep residual learning (Highway Nets were ten times deeper than previous gradient-based feedforward nets). The residual parts of a Highway Net are like those of an unfolded 1999 LSTM [LSTM2a], while the residual parts of a ResNet are like those of an unfolded 1997 LSTM [LSTM1][HW25].
A follow-up paper by the ResNet authors was flawed in its design, leading to incorrect conclusions about gated residual connections [HW25b]. See also [HW25]: who invented deep residual learning?
More.
[HW3]
K. Greff, R. K. Srivastava, J. Schmidhuber. Highway and Residual Networks learn Unrolled Iterative Estimation. Preprint
arxiv:1612.07771 (2016). Also at ICLR 2017.
[HW25]
J. Schmidhuber. Who Invented Deep Residual Learning? Technical Report IDSIA-09-25, IDSIA, 2025. Preprint arXiv:2509.24732.
[HW25b]
R. K. Srivastava (January 2025). Weighted Skip Connections are Not Harmful for Deep Nets.
Shows that a follow-up paper by the authors of [HW2] suffered from
design flaws leading to incorrect conclusions about gated residual connections.
[HYB12]
Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V.,
Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling
in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag.,
29(6):82-97.
[I24]
E. Ising (1925). Beitrag zur Theorie des Ferro- und Paramagnetismus. Dissertation, 1924.
[I25]
E. Ising (1925). Beitrag zur Theorie des Ferromagnetismus. Z. Phys., 31 (1): 253-258, 1925.
The first non-learning recurrent NN architecture (the Ising model or Lenz-Ising model) was introduced and analyzed by physicists Ernst Ising and Wilhelm Lenz in the 1920s.[L20][I25][K41][W45][NOB] It settles into an equilibrium state in response to input conditions, and is the foundation of learning RNNs.[AMH1-2]
[JOU17] Jouppi et al. (2017). In-Datacenter Performance Analysis of a Tensor Processing Unit.
Preprint arXiv:1704.04760
[K41]
H. A. Kramers and G. H. Wannier (1941). Statistics of the Two-Dimensional Ferromagnet. Phys. Rev. 60, 252 and 263, 1941.
[K56]
S.C. Kleene. Representation of Events in Nerve Nets and Finite Automata. Automata Studies, Editors: C.E. Shannon and J. McCarthy, Princeton University Press, p. 3-42, Princeton, N.J., 1956.
[KO0]
J. Schmidhuber.
Discovering problem solutions with low Kolmogorov complexity and
high generalization capability.
Technical Report FKI-194-94, Fakultät für Informatik,
Technische Universität München, 1994.
PDF.
[KO1] J. Schmidhuber.
Discovering solutions with low Kolmogorov complexity
and high generalization capability.
In A. Prieditis and S. Russell, editors, Machine Learning:
Proceedings of the Twelfth International Conference (ICML 1995),
pages 488-496. Morgan
Kaufmann Publishers, San Francisco, CA, 1995.
PDF.
[KO2]
J. Schmidhuber.
Discovering neural nets with low Kolmogorov complexity
and high generalization capability.
Neural Networks, 10(5):857-873, 1997.
PDF.
[L20]
W. Lenz (1920). Beitraege zum Verständnis der magnetischen
Eigenschaften in festen Körpern. Physikalische Zeitschrift, 21:
613-615.
[LEC] J. Schmidhuber (AI Blog, 2022). LeCun's 2022 paper on autonomous machine intelligence rehashes but does not cite essential work of 1990-2015. Years ago, Schmidhuber's team published most of what Y. LeCun calls his "main original contributions:" neural nets that learn multiple time scales and levels of abstraction, generate subgoals, use intrinsic motivation to improve world models, and plan (1990); controllers that learn informative predictable representations (1997), etc. This was also discussed on Hacker News, reddit, and in the media.
See tweet1.
LeCun also listed the "5 best ideas 2012-2022" without mentioning that most of them are from Schmidhuber's lab, and older. See tweet2.
[LT23]
K. Irie, R. Csordas, J. Schmidhuber.
Practical Computational Power of Linear Transformers and Their Recurrent and Self-Referential Extensions.
EMNLP 2023.
[LT24]
S. Yang, B. Wang, Y. Zhang, Y. Shen, Y. Kim.
Parallelizing Linear Transformers with the Delta Rule over Sequence Length.
NeurIPS 2024.
[LT25]
S. Yang, J. Kautz, A. Hatamizadeh.
Gated Delta Networks: Improving Mamba2 with Delta Rule.
ICLR 2025. "Mamba2" is the 1991 ULTRA with a scalar time-decay factor on the fast weight matrix.
[LT25b]
R. Grazzi, J. Siems, A. Zela, J. K.H. Franke, F. Hutter, M. Pontil.
Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues.
ICLR 2025.
Shows that the delta-rule extension [FWP6][LT23] is more expressive than the quadratic transformer and other naive linear transformers (e.g., it can do parity and modular arithmetics).
[LT25c]
J. Siems, T. Carstensen, A. Zela, F. Hutter, M. Pontil, R. Grazzi.
DeltaProduct: Improving State-Tracking in Linear RNNs via Householder Products
ICLR 2025 Workshop FM-Wild.
Extending the DeltaNet [FWP6][LT23] through additional "micro-steps."
[LSTM0]
S. Hochreiter and J. Schmidhuber.
Long Short-Term Memory.
TR FKI-207-95, TUM, August 1995.
PDF.
[LSTM1a]
S. Hochreiter and J. Schmidhuber.
LSTM can solve hard long time lag problems. Proceedings of the 9th International Conference on Neural Information Processing Systems (NIPS'96). Cambridge, MA, USA, MIT Press, p. 473–479.
[LSTM1] S. Hochreiter, J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997. PDF.
Based on [LSTM0]. More.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. In Proc. Int. Conf. on Artificial Neural Networks (ICANN'99), Edinburgh, Scotland, p. 850-855, IEE, London, 1999.
The "vanilla LSTM architecture" with forget gates
that everybody is using today, e.g., in Google's Tensorflow.
[LSTM2] F. A. Gers, J. Schmidhuber, F. Cummins. Learning to Forget: Continual Prediction with LSTM. Neural Computation, 12(10):2451-2471, 2000.
PDF.
The "vanilla LSTM architecture" with forget gates
that everybody is using today, e.g., in Google's Tensorflow.
[LSTM3] A. Graves, J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18:5-6, pp. 602-610, 2005.
PDF.
[LSTM4]
S. Fernandez, A. Graves, J. Schmidhuber. An application of
recurrent neural networks to discriminative keyword
spotting.
Intl. Conf. on Artificial Neural Networks ICANN'07,
2007.
PDF.
[LSTM5] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, J. Schmidhuber. A Novel Connectionist System for Improved Unconstrained Handwriting Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, 2009.
PDF.
[LSTM6] A. Graves, J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. NIPS'22, p 545-552, Vancouver, MIT Press, 2009.
PDF.
[LSTM7] J. Bayer, D. Wierstra, J. Togelius, J. Schmidhuber.
Evolving memory cell structures for sequence learning.
Proc. ICANN-09, Cyprus, 2009.
PDF.
[LSTM8] A. Graves, A. Mohamed, G. E. Hinton. Speech Recognition with Deep Recurrent Neural Networks. ICASSP 2013, Vancouver, 2013.
PDF.
[LSTM9]
O. Vinyals, L. Kaiser, T. Koo, S. Petrov, I. Sutskever, G. Hinton.
Grammar as a Foreign Language. Preprint arXiv:1412.7449 [cs.CL].
[LSTM10]
A. Graves, D. Eck and N. Beringer, J. Schmidhuber. Biologically Plausible Speech Recognition with LSTM Neural Nets. In J. Ijspeert (Ed.), First Intl. Workshop on Biologically Inspired Approaches to Advanced Information Technology, Bio-ADIT 2004, Lausanne, Switzerland, p. 175-184, 2004.
PDF.
[LSTM11]
N. Beringer and A. Graves and F. Schiel and J. Schmidhuber. Classifying unprompted speech by retraining LSTM Nets. In W. Duch et al. (Eds.): Proc. Intl. Conf. on Artificial Neural Networks ICANN'05, LNCS 3696, pp. 575-581, Springer-Verlag Berlin Heidelberg, 2005.
[LSTM12]
D. Wierstra, F. Gomez, J. Schmidhuber. Modeling systems with internal state using Evolino. In Proc. of the 2005 conference on genetic and evolutionary computation (GECCO), Washington, D. C., pp. 1795-1802, ACM Press, New York, NY, USA, 2005. Got a GECCO best paper award.
[LSTM13]
F. A. Gers and J. Schmidhuber.
LSTM Recurrent Networks Learn Simple Context Free and
Context Sensitive Languages.
IEEE Transactions on Neural Networks 12(6):1333-1340, 2001.
PDF.
[LSTM14]
S. Fernandez, A. Graves, J. Schmidhuber.
Sequence labelling in structured domains with
hierarchical recurrent neural networks. In Proc.
IJCAI 07, p. 774-779, Hyderabad, India, 2007 (talk).
PDF.
[LSTM15]
A. Graves, J. Schmidhuber.
Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks.
Advances in Neural Information Processing Systems 22, NIPS'22, p 545-552,
Vancouver, MIT Press, 2009.
PDF.
[LSTM16]
M. Stollenga, W. Byeon, M. Liwicki, J. Schmidhuber. Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation. Advances in Neural Information Processing Systems (NIPS), 2015.
Preprint: arxiv:1506.07452.
[LSTM17]
J. A. Perez-Ortiz, F. A. Gers, D. Eck, J. Schmidhuber.
Kalman filters improve LSTM network performance in
problems unsolvable by traditional recurrent nets.
Neural Networks 16(2):241-250, 2003.
PDF.
[LSTMGRU] J. Chung, C. Gulcehre, K. Cho, Y. Bengio (2014). Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. Preprint arXiv:1412.3555 [cs.NE].
The so-called gated recurrent units (GRU)
are actually a variant of the vanilla LSTM architecture[LSTM2] (2000) which the authors did not cite
although this work[LSTM2] was the one that introduced gated recurrent units.[DLP]
Furthermore, Schmidhuber's team automatically evolved lots of additional LSTM variants and topologies already in 2009[LSTM7] without changing the name of the basic method.
(Margin note: GRU cells lack an important gate and can neither
learn to count[LSTMGRU2] nor learn simple non-regular
languages;[LSTMGRU2] they
also do not work as well for challenging translation tasks,
according to Google Brain.[LSTMGRU3])
[LSTMGRU2] G. Weiss, Y. Goldberg, E. Yahav. On the Practical Computational Power of Finite Precision RNNs for Language Recognition.
Preprint arXiv:1805.04908.
[LSTMGRU3] D. Britz et al. (2017). Massive Exploration of Neural Machine Translation
Architectures. Preprint arXiv:1703.03906
[LSTMPG]
J. Schmidhuber (AI Blog, Dec 2020). 10-year anniversary of our journal paper on deep reinforcement learning with policy gradients for LSTM (2007-2010). Recent famous applications of policy gradients to LSTM: DeepMind's Starcraft player (2019) and OpenAI's dextrous robot hand & Dota player (2018)—Bill Gates called this a huge milestone in advancing AI.
[LSTM-RL]
B. Bakker, F. Linaker, J. Schmidhuber.
Reinforcement Learning in Partially Observable Mobile Robot
Domains Using Unsupervised Event Extraction.
In Proceedings of the 2002
IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2002), Lausanne, 2002.
PDF.
[MC43]
W. S. McCulloch, W. Pitts. A Logical Calculus of Ideas Immanent in Nervous Activity.
Bulletin of Mathematical Biophysics, Vol. 5, p. 115-133, 1943.
[META]
J. Schmidhuber (AI Blog, 2020). 1/3 century anniversary of
first publication on metalearning machines that learn to learn (1987).
For its cover I drew a robot that bootstraps itself.
1992-: gradient descent-based neural metalearning. 1994-: Meta-Reinforcement Learning with self-modifying policies. 1997: Meta-RL plus artificial curiosity and intrinsic motivation. 2002-: asymptotically optimal metalearning for curriculum learning. 2003-: mathematically optimal Gödel Machine. 2020: new stuff!
[META1]
J. Schmidhuber.
Evolutionary principles in self-referential learning, or on learning
how to learn: The meta-meta-... hook. Diploma thesis,
Institut für Informatik, Technische Universität München, 1987.
Searchable PDF scan (created by OCRmypdf which uses
LSTM).
HTML.
For example,
Genetic Programming
(GP) is applied to itself, to recursively evolve
better GP methods through Meta-Evolution. More.
[MGC] MICCAI 2013 Grand Challenge on Mitosis Detection, organised by M. Veta, M.A. Viergever, J.P.W. Pluim, N. Stathonikos, P. J. van Diest of University Medical Center Utrecht.
[MIR] J. Schmidhuber (Oct 2019, updated '21, '22, '25, '26). Deep Learning: Our Miraculous Year 1990-1991. Preprint
arXiv:2005.05744. The Deep Learning Artificial Neural Networks (NNs)
of our team have
revolutionised
Machine Learning & AI.
Many of the basic ideas behind this revolution were published within the 12 months of our "Annus Mirabilis" 1990-1991 at our lab in TU Munich.
Back then, few people were interested. But a quarter century later, NNs based on our "Miraculous Year"
were on over 3 billion devices,
and used many billions of times per day,
consuming a significant fraction of the world's compute.
In particular, in 1990-91, we laid foundations of Generative AI, publishing principles of (1)
Generative Adversarial Networks for Artificial Curiosity and Creativity (now used for deepfakes), (2) Transformers (the T in ChatGPT—see the 1991 Unnormalized Linear Transformer), (3) Pre-training for deep NNs (see the P in ChatGPT), (4) NN distillation (key for DeepSeek), and (5) recurrent World Models for
Reinforcement Learning and Planning in partially observable environments. The year 1991 also marks the emergence of the defining features of (6)
LSTM, the most cited AI paper of the 20th century (based on deep residual learning through residual NN connections), and (7) the most cited paper of the 21st century, based on our LSTM-inspired Highway Net that was
10 times deeper than previous feedforward NNs.
As of 2025, the two most frequently cited scientific articles of all time (with the most Google Scholar citations within 3 years—manuals excluded) are both directly based on our 1991 work.
[MLP1] D. C. Ciresan, U. Meier, L. M. Gambardella, J. Schmidhuber. Deep Big Simple Neural Nets For Handwritten Digit Recognition. Neural Computation 22(12): 3207-3220, 2010. ArXiv Preprint.
Showed that plain backprop for deep standard NNs is sufficient to break benchmark records, without any unsupervised pre-training.
[MLP2] J. Schmidhuber
(AI Blog, Sep 2020). 10-year anniversary of supervised deep learning breakthrough (2010). No unsupervised pre-training. By 2010, when compute was 100 times more expensive than today, both the feedforward NNs[MLP1] and the earlier recurrent NNs of Schmidhuber's team were able to beat all competing algorithms on important problems of that time.
[MLP3] J. Schmidhuber
(AI Blog, 2025). 2010: Breakthrough of end-to-end deep learning (no layer-by-layer training, no unsupervised pre-training). The rest is history.
By 2010, when compute was 1000 times more expensive than in 2025, both our feedforward NNs[MLP1] and our earlier recurrent NNs were able to beat all competing algorithms on important problems of that time.
This deep learning revolution quickly spread from Europe to North America and Asia.
[MOST]
J. Schmidhuber (AI Blog, 2021, updated 2025). The most cited neural networks all build on work done in my labs: 1. Long Short-Term Memory (LSTM), the most cited AI of the 20th century. 2. ResNet (open-gated Highway Net), the most cited AI of the 21st century. 3. AlexNet & VGG Net (the similar but earlier DanNet of 2011 won 4 image recognition challenges before them). 4. GAN (an instance of Adversarial Artificial Curiosity of 1990). 5. Transformer variants—see the 1991 unnormalised linear Transformer (ULTRA). Foundations of Generative AI were published in 1991: the principles of GANs (now used for deepfakes), Transformers (the T in ChatGPT), Pre-training for deep NNs (the P in ChatGPT), NN distillation, and the famous DeepSeek—see the tweet. As of 2025, the two most frequently cited scientific articles of all time (with the most Google Scholar citations within 3 years—manuals excluded) are both directly based on our 1991 work.
[MOST25]
H. Pearson, H. Ledford, M. Hutson, R. Van Noorden.
Exclusive: the most-cited papers of the twenty-first century.
Nature, 15 April 2025.
[MOST25b]
R. Van Noorden.
Science’s golden oldies: the decades-old research papers still heavily cited today.
Nature, 15 April 2025.
[MOST26]
J. Schmidhuber. The two most frequently cited papers of all time are based on our 1991 work. Technical Note IDSIA-1-26, January 2026.
[MUN87]
P. W. Munro. A dual back-propagation scheme for scalar reinforcement learning. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pages 165-176, 1987.
[NAI] NNAISENSE, the AGI company for AI in the physical world, founded in 2014, based on neural network world models. J. Schmidhuber was its President and Chief Scientist. See the 2020 NNAISENSE web page in the Internet Archive. (Lately, however, NNAISENSE has become less AGI-focused and more specialised, with a focus on asset management.)
[NAN1]
J. Schmidhuber.
Networks adjusting networks.
In J. Kindermann and A. Linden, editors, Proceedings of
`Distributed Adaptive Neural Information Processing', St.Augustin, 24.-25.5.
1989, pages 197-208. Oldenbourg, 1990.
Extended version: TR FKI-125-90 (revised),
Institut für Informatik, TUM.
PDF.
Includes the proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Based on neural nets learning to estimate gradients for other neural nets.
[NAN2]
J. Schmidhuber.
Networks adjusting networks.
Technical Report FKI-125-90, Institut für Informatik,
Technische Universität München. Revised in November 1990.
PDF.
[NAN3]
Recurrent networks adjusted by adaptive critics.
In Proc. IEEE/INNS International Joint Conference on Neural
Networks, Washington, D. C., volume 1, pages 719-722, 1990.
[NAN4]
J. Schmidhuber.
Additional remarks on G. Lukes' review of Schmidhuber's paper
`Recurrent networks adjusted by adaptive critics'.
Neural Network Reviews, 4(1):43, 1990.
[NAN5]
M. Jaderberg, W. M. Czarnecki, S. Osindero, O. Vinyals, A. Graves, D. Silver, K. Kavukcuoglu.
Decoupled Neural Interfaces using Synthetic Gradients.
Preprint arXiv:1608.05343, 2016.
This work of DeepMind is similar to [NAN1-2].
[NAS] B. Zoph, Q. V. Le. Neural Architecture Search with Reinforcement Learning.
Preprint arXiv:1611.01578 (PDF), 2017.
[NAT1] J. Schmidhuber. Citation bubble about to burst? Nature, vol. 469, p. 34, 6 January 2011.
HTML.
[NDR]
R. Csordas, K. Irie, J. Schmidhuber.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization. Proc. ICLR 2022. Preprint arXiv/2110.07732.
[NGU89]
D. Nguyen and B. Widrow; The truck backer-upper: An example of self learning in neural networks. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 357-364, 1989.
[NHE] J. Schmidhuber. The Neural Heat Exchanger.
Oral presentations since 1990 at various universities including TUM and the
University of Colorado at Boulder. Also in In S. Amari, L. Xu, L. Chan, I. King, K. Leung, eds., Proceedings of the Intl. Conference on Neural Information Processing (1996), pages 194-197, Springer, Hongkong.
Link. Proposal of a biologically more plausible deep learning algorithm that—unlike backpropagation—is local in space and time. Inspired by the physical heat exchanger: inputs "heat up" while being transformed through many successive layers, targets enter from the other end of the deep pipeline and "cool down."
[NOB] J. Schmidhuber.
A Nobel Prize for Plagiarism.
Technical Report IDSIA-24-24 (7 Dec 2024, updated Oct 2025).
The 2024 Nobel Prize in Physics was awarded to Hopfield & Hinton. They republished foundational methodologies for artificial neural networks developed by Ivakhnenko, Amari and others in Ukraine and Japan during the 1960s and 1970s, as well as other techniques, without citing the original papers. Even in their subsequent surveys and recent 2025 articles, they failed to acknowledge the original inventors. This apparently turned what may have been unintentional plagiarism into a deliberate act. Hopfield and Hinton did not invent any of the key algorithms that underpin modern artificial intelligence.
See also popular
tweet1,
tweet2, and
LinkedIn post.
See also [PLAG1-6][FAKE1-3][DLH].
[NOB25a]
G. Hinton. Nobel Lecture: Boltzmann machines. Rev. Mod. Phys. 97, 030502, 25 August 2025. One of the many problematic statements in this lecture is this: "Boltzmann machines are no longer used, but they were historically important [...] In the 1980s, they demonstrated that it was possible to learn appropriate weights for hidden neurons using only locally available information without requiring a biologically implausible backward pass." Again, Hinton fails to mention Ivakhnenko who had shown this 2 decades earlier in the 1960s [DEEP1-2]. He has plagiarized Ivakhnenko and others in many additional ways [NOB][DLP].
[OAI1]
G. Powell, J. Schneider, J. Tobin, W. Zaremba, A. Petron, M. Chociej, L. Weng, B. McGrew, S. Sidor, A. Ray, P. Welinder, R. Jozefowicz, M. Plappert, J. Pachocki, M. Andrychowicz, B. Baker.
Learning Dexterity. OpenAI Blog, 2018.
[OAI1a]
OpenAI, M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, J. Schneider, S. Sidor, J. Tobin, P. Welinder, L. Weng, W. Zaremba.
Learning Dexterous In-Hand Manipulation. arxiv:1312.5602 (PDF).
[OAI2]
OpenAI et al. (Dec 2019).
Dota 2 with Large Scale Deep Reinforcement Learning.
Preprint
arxiv:1912.06680.
An LSTM composes 84% of the model's total parameter count.
[OAI2a]
J. Rodriguez. The Science Behind OpenAI Five that just Produced One of the Greatest Breakthrough in the History of AI. Towards Data Science, 2018. An LSTM was the core of OpenAI Five.
[PDA1]
G.Z. Sun, H.H. Chen, C.L. Giles, Y.C. Lee, D. Chen. Neural Networks with External Memory Stack that Learn Context—Free Grammars from Examples. Proceedings of the 1990 Conference on Information Science and Systems, Vol.II, pp. 649-653, Princeton University, Princeton, NJ, 1990.
[PDA2]
M. Mozer, S. Das. A connectionist symbol manipulator that discovers the structure of context-free languages. Proc. NIPS 1993.
[PLA1]
M. Uehleke (2022).
Plato's Theory of Forms: How Ancient Philosophy Still Shapes Modern Thinking. The Philosopher's Shirt, 2022.
[PLAN]
J. Schmidhuber (AI Blog, 2020). 30-year anniversary of planning & reinforcement learning with recurrent world models and artificial curiosity (1990). This work also introduced high-dimensional reward signals, deterministic policy gradients for RNNs,
the GAN principle (widely used today). Agents with adaptive recurrent world models even suggest a simple explanation of consciousness & self-awareness.
[PLAN2]
J. Schmidhuber.
An on-line algorithm for dynamic reinforcement learning and planning
in reactive environments.
Proc. IEEE/INNS International Joint Conference on Neural
Networks, San Diego, volume 2, pages 253-258, 1990.
Based on TR FKI-126-90 (1990).[AC90]
More.
[PLAN3]
J. Schmidhuber.
Reinforcement learning in Markovian and non-Markovian environments.
In D. S. Lippman, J. E. Moody, and D. S. Touretzky, editors,
Advances in Neural Information Processing Systems 3, NIPS'3, pages 500-506. San
Mateo, CA: Morgan Kaufmann, 1991.
PDF.
Partially based on [AC90].
[PLAN4]
J. Schmidhuber.
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models.
Report arXiv:1210.0118 [cs.AI], 2015.
[PLAN5]
One Big Net For Everything. Preprint arXiv:1802.08864 [cs.AI], Feb 2018.
[PLAN6]
D. Ha, J. Schmidhuber. Recurrent World Models Facilitate Policy Evolution. Advances in Neural Information Processing Systems (NIPS), Montreal, 2018. (Talk.)
Preprint: arXiv:1809.01999.
Github: World Models.
[PG]
R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8.3-4: 229-256, 1992.
[PHD]
J. Schmidhuber.
Dynamische neuronale Netze und das fundamentale raumzeitliche
Lernproblem
(Dynamic neural nets and the fundamental spatio-temporal
credit assignment problem).
Dissertation,
Institut für Informatik, Technische
Universität München, 1990.
PDF.
HTML.
[PM0] J. Schmidhuber. Learning factorial codes by predictability minimization. TR CU-CS-565-91, Univ. Colorado at Boulder, 1991. PDF.
More.
[PM1] J. Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863-879, 1992. Based on [PM0], 1991. PDF.
More.
[PM2] J. Schmidhuber, M. Eldracher, B. Foltin. Semilinear predictability minimzation produces well-known feature detectors. Neural Computation, 8(4):773-786, 1996.
PDF. More.
[PMax0]
J. Schmidhuber and D. Prelinger. Discovering predictable classifications. Technical Report CU-CS-626-92, Dept. of Comp. Sci., University of Colorado at Boulder, November 1992.
[PMax]
J. Schmidhuber and D. Prelinger.
Discovering
predictable classifications.
Neural Computation, 5(4):625-635, 1993.
PDF.
[PP] J. Schmidhuber.
POWERPLAY: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem.
Frontiers in Cognitive Science, 2013.
ArXiv preprint (2011):
arXiv:1112.5309 [cs.AI]
[PP1] R. K. Srivastava, B. Steunebrink, J. Schmidhuber.
First Experiments with PowerPlay.
Neural Networks, 2013.
ArXiv preprint (2012):
arXiv:1210.8385 [cs.AI].
[PP2] V. Kompella, M. Stollenga, M. Luciw, J. Schmidhuber. Continual curiosity-driven skill acquisition from high-dimensional video inputs for humanoid robots. Artificial Intelligence, 2015.
[R1] Reddit/ML, 2019. Hinton, LeCun, Bengio receive ACM Turing Award. This announcement contains more comments about Schmidhuber than about any of the awardees.
[R2] Reddit/ML, 2019. J. Schmidhuber really had GANs in 1990.
[R3] Reddit/ML, 2019. NeurIPS 2019 Bengio Schmidhuber Meta-Learning Fiasco.
Schmidhuber started
metalearning (learning to learn—now a hot topic)
in 1987[META1][META] long before Bengio
who suggested in public at N(eur)IPS 2019
that he did it before Schmidhuber.
[R4] Reddit/ML, 2019. Five major deep learning papers by G. Hinton did not cite similar earlier work by J. Schmidhuber.
[R5] Reddit/ML, 2019. The 1997 LSTM paper by Hochreiter & Schmidhuber has become the most cited deep learning research paper of the 20th century.
[R6] Reddit/ML, 2019. DanNet, the CUDA CNN of Dan Ciresan in J. Schmidhuber's team, won 4 image recognition challenges prior to AlexNet.
[R7] Reddit/ML, 2019. J. Schmidhuber on Seppo Linnainmaa, inventor of backpropagation in 1970.
[R8] Reddit/ML, 2019. J. Schmidhuber on Alexey Ivakhnenko, godfather of deep learning 1965.
[R11] Reddit/ML, 2020. Schmidhuber: Critique of Honda Prize for Dr. Hinton
[R12] Reddit/ML, 2020. J. Schmidhuber: Critique of Turing Award for Drs. Bengio & Hinton & LeCun
[R15] Reddit/ML, 2021. J. Schmidhuber's work on fast weights from 1991 is similar to linearized variants of Transformers
[RAW]
J. Schmidhuber (AI Blog, 2001). Raw Computing Power.
[RCNN]
R. Girshick, J. Donahue, T. Darrell, J. Malik.
Rich feature hierarchies for accurate object detection and semantic segmentation.
Preprint arXiv/1311.2524, Nov 2013.
[RCNN2]
R. Girshick.
Fast R-CNN. Proc. of the IEEE international conference on computer vision, p. 1440-1448, 2015.
[RCNN3]
K. He, G. Gkioxari, P. Dollar, R. Girshick.
Mask R-CNN.
Preprint arXiv/1703.06870, 2017.
[RELU1]
K. Fukushima (1969). Visual feature extraction by a multilayered network of analog threshold elements. IEEE Transactions on Systems Science and Cybernetics. 5 (4): 322-333. doi:10.1109/TSSC.1969.300225. This work introduced rectified linear units or ReLUs, now widely used.
[RELU2]
C. v. d. Malsburg (1973).
Self-Organization of Orientation Sensitive Cells in the Striate Cortex. Kybernetik, 14:85-100, 1973. See Table 1 for rectified linear units or ReLUs. Possibly this was also the first work on applying an EM algorithm to neural nets.
[ROB]
A. J. Robinson and F. Fallside.
The utility driven dynamic error propagation network.
Technical Report CUED/F-INFENG/TR.1, Cambridge University Engineering Department, 1987.
[RPG]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber (2010). Recurrent policy gradients. Logic Journal of the IGPL, 18(5), 620-634.
[RPG07]
D. Wierstra, A. Foerster, J. Peters, J. Schmidhuber. Solving Deep Memory POMDPs
with Recurrent Policy Gradients.
Intl. Conf. on Artificial Neural Networks ICANN'07,
2007.
PDF.
[S2Sa] M.L. Forcada and R.P. Ñeco. Recursive hetero-associative memories for translation.
International Work-Conference on Artificial Neural Networks, 1997
[S2Sb] T. Mikolov and G. Zweig, G. December. Context dependent recurrent neural network language model.
IEEE Spoken Language Technology Workshop (SLT), 2012
[S2Sc] A. Graves. Sequence transduction with recurrent neural networks.
Representation Learning Workshop, Int. Conf. on Machine Learning (ICML), 2012
[S2Sd]
I. Sutskever, O. Vinyals, Quoc V. Le. Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems (NIPS), 2014, 3104-3112.
[S59]
A. L. Samuel.
Some studies in machine learning using the game of checkers.
IBM Journal on Research and Development, 3:210-229, 1959.
[S93]
D. Sherrington (1993).
Neural networks: the spin glass approach.
North-Holland Mathematical Library,
vol 51, 1993, p. 261-291.
[SNT]
J. Schmidhuber, S. Heil (1996).
Sequential neural text compression.
IEEE Trans. Neural Networks, 1996.
PDF.
A probabilistic language model based on predictive coding;
an earlier version appeared at NIPS 1995.
[SK75]
D. Sherrington, S. Kirkpatrick (1975).
Solvable Model of a Spin-Glass.
Phys. Rev. Lett. 35, 1792, 1975.
[STO51]
H. Robbins, S. Monro (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics. 22(3):400, 1951.
[STO52]
J. Kiefer, J. Wolfowitz (1952). Stochastic Estimation of the Maximum of a Regression Function.
The Annals of Mathematical Statistics. 23(3):462, 1952.
[SA17] J. Schmidhuber.
Falling Walls:
The Past, Present and Future of Artificial Intelligence.
Scientific American, Observations, Nov 2017.
[SCAN] J. Masci,
A. Giusti, D. Ciresan, G. Fricout, J. Schmidhuber. A Fast Learning Algorithm for Image Segmentation with Max-Pooling Convolutional Networks. ICIP 2013. Preprint arXiv:1302.1690.
[ST]
J. Masci, U. Meier, D. Ciresan, G. Fricout, J. Schmidhuber
Steel Defect Classification with Max-Pooling Convolutional Neural Networks.
Proc. IJCNN 2012.
PDF.
[T20] J. Schmidhuber (AI Blog, 25 June 2020). Critique of 2018 Turing Award for Deep Learning. See [DLP].
[T22] J. Schmidhuber (2022).
Scientific Integrity and the History of Deep Learning: The 2021 Turing Lecture, and the 2018 Turing Award. Technical Report IDSIA-77-21, IDSIA, Lugano, Switzerland, 2022. See [DLP].
[TD]
R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning. 3 (1): 9-44, 1988.
[THE17] S. Baker (2017). Which countries and universities are leading on AI research? Times Higher Education World University Rankings, 2017.
Link.
[TR1]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin (2017). Attention is all you need. NIPS 2017, pp. 5998-6008.
This paper introduced the name "Transformers" for a now widely used NN type. It did not cite
the 1991 publication on what's now called unnormalized "linear Transformers" with "linearized self-attention."[ULTRA]
Schmidhuber also introduced the now popular
attention terminology in 1993.[ATT][FWP2][R4]
See tweet of 2022 for 30-year anniversary.
[TR2]
J. Devlin, M. W. Chang, K. Lee, K. Toutanova (2018). Bert: Pre-training of deep bidirectional Transformers for language understanding. Preprint arXiv:1810.04805.
[TR3] K. Tran, A. Bisazza, C. Monz. The Importance of Being Recurrent for Modeling Hierarchical Structure. EMNLP 2018, p 4731-4736. ArXiv preprint 1803.03585.
[TR4]
M. Hahn. Theoretical Limitations of Self-Attention in Neural Sequence Models. Transactions of the Association for Computational Linguistics, Volume 8, p.156-171, 2020.
[TR5]
A. Katharopoulos, A. Vyas, N. Pappas, F. Fleuret.
Transformers are RNNs: Fast autoregressive Transformers
with linear attention. In Proc. Int. Conf. on Machine
Learning (ICML), July 2020.
[TR5a] Z. Shen, M. Zhang, H. Zhao, S. Yi, H. Li.
Efficient Attention: Attention with Linear Complexities.
WACV 2021.
[TR6]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song,
A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin,
L. Kaiser, et al. Rethinking attention with Performers.
In Int. Conf. on Learning Representations (ICLR), 2021.
[TR6a] H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, L. Kong.
Random Feature Attention.
ICLR 2021.
[TR7]
S. Bhattamishra, K. Ahuja, N. Goyal.
On the Ability and Limitations of Transformers to Recognize Formal Languages.
EMNLP 2020.
[TR8]
W. Merrill, A. Sabharwal.
The Parallelism Tradeoff: Limitations of Log-Precision Transformers.
TACL 2023.
[TR25]
J. Schmidhuber. Who Invented Transformer Neural Networks? Technical Note IDSIA-11-25, IDSIA, Switzerland, Nov 2025.
[ULTRA]
References on the 1991 unnormalized linear Transformer (ULTRA): original tech report (March 1991) [FWP0]. Journal publication (1992) [FWP1]. Recurrent ULTRA extension (1993) introducing the terminology of learning "internal spotlights of attention” [FWP2]. Modern "quadratic" Transformer (2017: "attention is all you need") scaling quadratically in input size [TR1]. 2020 paper [TR5] using the terminology
"linear Transformer" for a more efficient Transformer variant that scales linearly, leveraging linearized attention [TR5a].
2021 paper [FWP6] pointing out that ULTRA dates back to 1991 [FWP0] when compute was a million times more expensive.
Overview of ULTRA and other Fast Weight Programmers (2021) [FWP].
See the T in ChatGPT! See also surveys [DLH][DLP], 2022 tweet for ULTRA's 30-year anniversary, and 2024 tweet.
[UN]
J. Schmidhuber (AI Blog, 2021, updated 2025). 1991: First very deep learning with unsupervised pre-training (see the P in ChatGPT). First neural network distillation (key for DeepSeek). Unsupervised hierarchical predictive coding (with self-supervised target generation) finds compact internal representations of sequential data to facilitate downstream deep learning. The hierarchy can be distilled into a single deep neural network (suggesting a simple model of conscious and subconscious information processing). 1993: solving problems of depth >1000.
[UN0]
J. Schmidhuber.
Neural sequence chunkers.
Technical Report FKI-148-91, Institut für Informatik, Technische
Universität München, April 1991.
PDF.
Unsupervised/self-supervised pre-training for deep neural networks
(see the P in ChatGPT) and predictive coding is used
in a deep hierarchy of recurrent nets (RNNs)
to find compact internal
representations of long sequences of data,
across multiple time scales and levels of abstraction.
Each RNN tries to solve the pretext task of predicting its next input, sending only unexpected inputs to the next RNN above.
The resulting compressed sequence representations
greatly facilitate downstream supervised deep learning such as sequence classification.
By 1993, the approach solved problems of depth 1000 [UN2]
(requiring 1000 subsequent computational stages/layers—the more such stages, the deeper the learning).
A variant collapses the hierarchy into a single deep net.
It uses a so-called conscious chunker RNN
which attends to unexpected events that surprise
a lower-level so-called subconscious automatiser RNN.
The chunker learns to understand the surprising events by predicting them.
The automatiser uses a
neural knowledge distillation procedure (key for the famous 2025 DeepSeek)
to compress and absorb the formerly conscious insights and
behaviours of the chunker, thus making them subconscious.
The systems of 1991 allowed for much deeper learning than previous methods.
[UN1] J. Schmidhuber. Learning complex, extended sequences using the principle of history compression. Neural Computation, 4(2):234-242, 1992. Based on TR FKI-148-91, TUM, 1991.[UN0] PDF.
First working Deep Learner based on a deep RNN hierarchy (with different self-organising time scales),
overcoming the vanishing gradient problem through unsupervised pre-training of deep NNs (see the P in ChatGPT) and predictive coding (with self-supervised target generation).
Also: compressing or distilling a teacher net (the chunker) into a student net (the automatizer) that does not forget its old skills—such approaches are now widely used, e.g., by DeepSeek. See also this tweet. More.
[UN2] J. Schmidhuber. Habilitation thesis, TUM, 1993. PDF.
An ancient experiment on "Very Deep Learning" with credit assignment across 1200 time steps or virtual layers and unsupervised / self-supervised pre-training for a stack of recurrent NN
can be found here (depth > 1000).
[UN3]
J. Schmidhuber, M. C. Mozer, and D. Prelinger.
Continuous history compression.
In H. Hüning, S. Neuhauser, M. Raus, and W. Ritschel, editors,
Proc. of Intl. Workshop on Neural Networks, RWTH Aachen, pages 87-95.
Augustinus, 1993.
[UN4] G. E. Hinton, R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, Vol. 313. no. 5786, pp. 504—507, 2006. PDF.
This work describes unsupervised layer-wise pre-training of stacks of feedforward NNs (FNNs)
called Deep Belief Networks (DBNs).
It neither cited the original layer-wise training of deep NNs by Ivakhnenko & Lapa (1965)[DEEP1-2][NOB] nor
the 1991 unsupervised pre-training of stacks of more general recurrent NNs (RNNs)[UN0-3]
which introduced
the first NNs shown to solve very deep problems.
The 2006 justification of the authors was essentially the one Schmidhuber used for the 1991 RNN stack:
each higher level tries to reduce the description length
(or negative log probability) of the data representation in the level below.[HIN][T22][MIR]
This can greatly facilitate very deep downstream learning.[UN0-3]
[UN5]
Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle.
Greedy layer-wise training of deep networks.
Proc. NIPS 06, pages 153-160, Dec. 2006.
The comment under reference[UN4] applies here as well.
[UVF15]
T. Schaul, D. Horgan, K. Gregor, D. Silver. Universal value function approximators. Proc. ICML 2015, pp. 1312-1320, 2015.
[VAN1] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TUM, 1991 (advisor J. Schmidhuber). PDF.
More on the Fundamental Deep Learning Problem.
[VAN2] Y. Bengio, P. Simard, P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE TNN 5(2), p 157-166, 1994.
Results are essentially identical to those of Schmidhuber's diploma student Sepp Hochreiter (1991).[VAN1] Even after a common publication,[VAN3] the first author of [VAN2] published papers[VAN4] that cited only their
own [VAN2] but not the original work.
[VAN3] S. Hochreiter, Y. Bengio, P. Frasconi, J. Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE press, 2001.
PDF.
[VAN4] Y. Bengio. Neural net language models. Scholarpedia, 3(1):3881, 2008. Link.
[W45]
G. H. Wannier (1945).
The Statistical Problem in Cooperative Phenomena.
Rev. Mod. Phys. 17, 50.
[WER87]
P. J. Werbos. Building and understanding adaptive systems: A statistical/numerical approach to factory automation and brain research. IEEE Transactions on Systems, Man, and Cybernetics, 17, 1987.
[WER89]
P. J. Werbos. Backpropagation and neurocontrol: A review and prospectus. In IEEE/INNS International Joint Conference on Neural Networks, Washington, D.C., volume 1, pages 209-216, 1989.
[WHO4]
J. Schmidhuber. Who invented artificial neural networks? Technical Note IDSIA-15-25, IDSIA, Switzerland, Nov 2025.
[WHO5]
J. Schmidhuber. Who invented deep learning? Technical Note IDSIA-16-25, IDSIA, Switzerland, Nov 2025.
[WHO6] J. Schmidhuber (AI Blog, 2014; updated 2025).
Who invented backpropagation?
See also LinkedIn post.
[WHO7]
J. Schmidhuber.
Who invented convolutional neural networks? Technical Note IDSIA-17-25, IDSIA, Switzerland, 2025. See popular tweet.
[WHO8]
J. Schmidhuber. Who Invented Generative Adversarial Networks? Technical Note IDSIA-14-25, IDSIA, Switzerland, Dec 2025.
[WHO9]
J. Schmidhuber. Who invented knowledge distillation with artificial neural networks? Technical Note IDSIA-12-25, IDSIA, Nov 2025.
[WHO10]
J. Schmidhuber. Who Invented Transformer Neural Networks? Technical Note IDSIA-11-25, IDSIA, Switzerland, Nov 2025.
[WHO11]
J. Schmidhuber. Who Invented Deep Residual Learning? Technical Report IDSIA-09-25, IDSIA, Switzerland, Sept 2025. Preprint arXiv:2509.24732.
[WM26] J. Schmidhuber. The Neural World Model Boom. Technical Note IDSIA-2-26, 4 Feb 2026.
[WM26b] J. Schmidhuber. Simple but powerful ways of using world models and their latent space. Opening Keynote at the World Modeling Workshop, Agora, Mila - Quebec AI Institute, 4 Feb 2026. On YouTube.
[WU] Y. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
Preprint arXiv:1609.08144 (PDF), 2016.
.