How can we train deep neural networks with very long credit assignment paths from inputs to outputs to solve complex AI problems? Many solutions to this problem have been proposed over the years, but most do not work well for very deep networks with arbitrary non-linear (and possibly recurrent) transformations between layers.

This project presents a different take on the problem. We simply redesign neural networks in a way that makes them easier to optimize even for very large depths.

Our **Highway Networks** are based on Long Short-Term Memory (LSTM) recurrent networks and allow training of deep, efficient networks (even with hundreds of layers) with conventional gradient-based methods. Even when large depth is not required, highway layers can be used instead of traditional neural layers to allow the network to adaptively copy or transform representations.
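The copy-or-transform behavior comes from a learned transform gate T(x): each layer outputs H(x)·T(x) + x·(1 − T(x)), so the gate interpolates between a plain non-linear transform and the identity. A minimal NumPy sketch (the function and weight names here are illustrative, not taken from any released code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """One fully connected highway layer.

    H(x) is a plain non-linear transform, T(x) is the transform gate.
    When T -> 0 the layer copies its input; when T -> 1 it fully
    transforms it.
    """
    h = np.tanh(x @ W_h + b_h)      # candidate transform H(x)
    t = sigmoid(x @ W_t + b_t)      # transform gate T(x)
    return h * t + x * (1.0 - t)    # gated mix of transform and copy
```

With a strongly negative gate bias the layer behaves as a near-identity map at initialization, which is what makes very deep stacks of such layers trainable.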

NEW: In recent work we have used highway networks to easily train multi-layer state transitions in recurrent neural networks. The resulting **Recurrent Highway Networks** outperform existing recurrent architectures substantially, and open up a new degree of freedom in recurrent model design.
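In a Recurrent Highway Network, the state transition within a single time step is itself a stack of highway layers, with the external input entering only at the first layer. A hedged NumPy sketch of one time step under the coupled-gate variant (carry gate fixed to 1 − T); all names and the weight layout are assumptions for illustration, not the released implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rhn_step(x, s_prev, Wx_h, Wx_t, R_h, R_t, b_h, b_t, depth):
    """One time step of a Recurrent Highway Network (coupled gates).

    The recurrent state passes through `depth` highway layers per
    time step; the input x contributes only to the first layer.
    """
    s = s_prev
    for l in range(depth):
        x_in_h = x @ Wx_h if l == 0 else 0.0
        x_in_t = x @ Wx_t if l == 0 else 0.0
        h = np.tanh(x_in_h + s @ R_h[l] + b_h[l])     # candidate update
        t = sigmoid(x_in_t + s @ R_t[l] + b_t[l])     # transform gate
        s = h * t + s * (1.0 - t)                     # carry = 1 - t
    return s
```

The recurrence depth (`depth` above) is the new degree of freedom: a conventional gated RNN corresponds to depth 1.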

**Recurrent Highway Networks**

J. G. Zilly, R. K. Srivastava, J. Koutnik and J. Schmidhuber

arXiv preprint arXiv:1607.03474

Code coming soon.

**Training Very Deep Networks**

R. K. Srivastava, K. Greff and J. Schmidhuber

Neural Information Processing Systems (NIPS 2015, spotlight). arXiv:1507.06228

Download logs for all 800 optimization runs here, with instructions.

**Highway Networks**

R. K. Srivastava, K. Greff and J. Schmidhuber

Deep Learning Workshop (ICML 2015). arXiv:1505.00387 poster

- You can easily use IDSIA's deep learning library Brainstorm to experiment with Highway Networks. [Python]
- We also have Caffe code for convolutional highways using cuDNN. [C++]
- If you use Theano, there is a simple example of Highway Networks in the open source library Lasagne. There is also example code in the library deepy. [Python]
- If you use Torch, you may want to look at Yoon Kim's code for fully connected highway layers. [Lua/Torch]

**Q**: How do I set the bias for the transform gates when initializing a highway network?

**A**: You can think of the initial bias as a prior over the behavior of your network at initialization. In general, this is a hyperparameter that depends on the given problem and network architecture. However, here are some general suggestions that have worked for certain problems:

For convolutional highway networks, initialize the transform gate bias to -1 for network depths of around 10-20, -2 for depths of 20-30, and -3 for depths of 30-40.

For feedforward layers, Kim et al. [1] found a bias of -2 to work well even when using only a couple of highway layers. Note that this application exploits the adaptive processing capabilities of highway networks; the goal there is not to train a very deep network.

For very deep networks, start from -1 and decrement by 0.5 or 1.0 until the network starts training easily. Note, however, that with a proper hyperparameter search a bias of -2 was sufficient to train networks of up to 100 layers. A lower bias may allow a wider range of hyperparameters to work, or may speed up learning.
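The intuition behind these values: at initialization the gate pre-activations are near zero, so the transform gate activation is roughly sigmoid(bias), and a more negative bias makes the network start out closer to a stack of identity maps. A quick check of what each suggested bias implies:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Approximate transform gate activation at initialization, when the
# learned pre-activation terms are still near zero: T ~ sigmoid(bias).
gate_at_init = {b: sigmoid(b) for b in (-1.0, -2.0, -3.0)}
for b, t in gate_at_init.items():
    # bias -1 -> ~0.269, bias -2 -> ~0.119, bias -3 -> ~0.047
    print(f"bias {b:+.0f} -> initial transform gate ~ {t:.3f}")
```

So with a bias of -3, each layer initially transforms only about 5% of its activation and copies the rest, which is why deeper stacks call for more negative biases.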

**Q**: Is the highway gating mechanism related to how information flow is regulated in the brain?

**A**: Information processing in the brain is not yet well understood. However, the idea that the brain uses similar gating mechanisms has been seriously considered by neuroscientists.

For example, see: Gisiger, T., & Boukadoum, M. (2011). Mechanisms Gating the Flow of Information in the Cortex: What They Might Look Like and What Their Uses May Be. Frontiers in Computational Neuroscience, 5, 1. http://doi.org/10.3389/fncom.2011.00001

**References**

1. Kim, Yoon, et al. "Character-Aware Neural Language Models." arXiv preprint arXiv:1508.06615 (2015).
2. Zhang et al. "Highway Long Short-Term Memory RNNs for Distant Speech Recognition." arXiv preprint arXiv:1510.08983 (2015).
3. Bowman, Samuel R., et al. "Generating Sentences from a Continuous Space." arXiv preprint arXiv:1511.06349 (2015).
4. Jozefowicz, Rafal, et al. "Exploring the Limits of Language Modeling." arXiv preprint arXiv:1602.02410 (2016).
5. Schmaltz, Allen, et al. "Sentence-Level Grammatical Error Identification as Sequence-to-Sequence Correction." arXiv preprint arXiv:1604.04677 (2016).
6. Lu, Liang. "Sequence Training and Adaptation of Highway Deep Neural Networks." arXiv preprint arXiv:1607.01963 (2016).
7. Lu, Liang, Michelle Guo, and Steve Renals. "Knowledge Distillation for Small-Footprint Highway Networks." arXiv preprint arXiv:1608.00892 (2016).
8. Vylomova, Ekaterina, et al. "Word Representation Models for Morphologically Rich Languages in Neural Machine Translation." arXiv preprint arXiv:1606.04217 (2016).