Sepp Hochreiter's Fundamental Deep Learning Problem (1991)
Jürgen Schmidhuber, 2013
Two decades later everybody is talking about Deep Learning! A first milestone of Deep Learning research was the 1991 diploma thesis of Sepp Hochreiter [1], my very first student, who is now a professor in Linz. His work formally showed that deep neural networks are hard to train, because they suffer from the now famous problem of vanishing or exploding gradients: in typical deep or recurrent networks, backpropagated error signals [3a,3] either shrink rapidly, or grow out of bounds [1]. In fact, they decay exponentially in the number of layers, or they explode. All our subsequent Deep Learning research of the 1990s and 2000s was motivated by this insight.
The thesis is in German, but don't worry, all basic results are documented in the universal language of mathematics [1]. (Google Translate does a reasonable job on it.)
Ten years later, an additional survey came out in English [2].
We have found four ways of partially overcoming the Fundamental Deep Learning Problem:
I. My first Deep Learner of 1991 overcame it through unsupervised pretraining for a hierarchy of (recurrent) neural networks [4]. This greatly facilitated subsequent supervised credit assignment through backpropagation.
II. LSTMlike networks (e.g., [5]) avoid the problem through special architecture unaffected by it.
III. Today, a million times faster GPUbased computers allow for propagating errors a few layers further down within reasonable time, even in traditional NN  that's basically what's winning many of the image competitions now, e.g., [6]. (Although this does not really overcome the problem in a fundamental way.)
IV.
The space of NN weight matrices can also be searched without relying on error gradients,
thus avoiding the Fundamental Deep Learning Problem altogether.
In fact, some simple problems requiring credit assignment
across many successive nonlinear operations
can better be solved by randomly guessing weights until a solution is found [7].
Other problems are better solved by using
universal search for weight matrixcomputing programs written in
a universal programming language [8].
Some are better solved by using linear methods
to obtain optimal weights for connections to output events,
and evolving weights of connections to other events 
this is called Evolino [9].
References
[1] Sepp Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, TU Munich, 1991.
PDF. (Alternative PDFs can be found here  scroll all the way down).
[2] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Juergen Schmidhuber. Gradient flow in recurrent nets: the difficulty of learning longterm dependencies. In S. C. Kremer and J. F. Kolen, eds., A Field Guide to Dynamical Recurrent Neural Networks. IEEE press, 2001.
PDF.
[3] Paul J. Werbos. Applications of advances in nonlinear sensitivity analysis. In R. Drenick, F. Kozin, (eds): System Modeling and Optimization: Proc. IFIP (1981), Springer, 1982.
[3a] S. Linnainmaa. The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors. Master's Thesis (in Finnish), Univ. Helsinki, 1970. See chapters 67 and FORTRAN code on pages 5860.
[4]
J. Schmidhuber. My first Deep Learner of 1991 + Deep Learning timeline 19622013
[5] 1990s: Deep Learner based on LSTM RNN  first international
pattern recognition competitions won in 2009
[6] 2011: Superhuman Visual Pattern Recognition through GPUbased deep feedforward NN
[7]
J. Schmidhuber and S. Hochreiter.
Guessing can outperform many long time lag algorithms.
Technical Note IDSIA1996, 1996.
[8]
J. Schmidhuber.
Discovering neural nets with low Kolmogorov complexity
and high generalization capability.
Neural Networks, 10(5):857873, 1997.
PDF
.
[9]
J. Schmidhuber, D. Wierstra, M. Gagliolo, F. Gomez.
Training Recurrent Networks by Evolino.
Neural Computation, 19(3): 757779, 2007.
PDF (preprint).
Compare Evolino overview (since IJCAI 2005).
.
