How can we train deep neural networks with very long credit assignment paths from inputs to outputs to solve complex AI problems? Many solutions to this problem have been proposed over the years, but most do not work well for very deep networks with arbitrary non-linear (and possibly recurrent) transformations between layers.
This project presents a different take on the problem. We simply design neural networks in a way that makes them easier to optimize even for very large depths.
Our Highway Networks take inspiration from Long Short-Term Memory (LSTM) and allow training of deep, efficient networks (even with hundreds of layers) with conventional gradient-based methods. Even when large depths are not required, highway layers can be used instead of traditional neural layers to allow the network to adaptively copy or transform representations.
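To make "adaptively copy or transform" concrete: a highway layer, as defined in the papers listed below, computes y = H(x) * T(x) + x * (1 - T(x)), where H is a non-linear transform and T is a sigmoid transform gate. The following is only a minimal NumPy sketch of that formula. The tanh non-linearity, the fully connected form, and the names `highway_layer`, `W_h`, `b_h`, `W_t`, `b_t` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway_layer(x, W_h, b_h, W_t, b_t):
    """y = H(x) * T(x) + x * (1 - T(x)) for a batch x of shape (batch, dim)."""
    h = np.tanh(x @ W_h + b_h)    # candidate transformation H(x); tanh is an assumption
    t = sigmoid(x @ W_t + b_t)    # transform gate T(x), values in (0, 1)
    return h * t + x * (1.0 - t)  # the carry gate is 1 - T(x)

rng = np.random.default_rng(0)
dim = 8
x = rng.standard_normal((4, dim))
W_h = rng.standard_normal((dim, dim)) * 0.1
W_t = rng.standard_normal((dim, dim)) * 0.1
b_h = np.zeros(dim)

# A strongly negative gate bias closes the gate: the layer copies its input.
y_copy = highway_layer(x, W_h, b_h, W_t, np.full(dim, -20.0))
# A strongly positive gate bias opens the gate: the layer acts like a plain tanh layer.
y_transform = highway_layer(x, W_h, b_h, W_t, np.full(dim, 20.0))
```

Because the input and output are mixed elementwise, this sketch requires the layer's input and output to have the same dimensionality.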
Training Very Deep Networks. R. K. Srivastava, K. Greff and J. Schmidhuber. Neural Information Processing Systems (NIPS 2015 Spotlight). arXiv:1507.06228. Logs for all 800 optimization runs are available for download, with instructions.
Highway Networks. R. K. Srivastava, K. Greff and J. Schmidhuber. Deep Learning Workshop (ICML 2015). arXiv:1505.00387 (poster).
Q: How do I set the bias for the transform gates when initializing a highway network?
A: You can think of the initial bias as a prior over the behavior of your network at initialization. In general, this is a hyper-parameter which will depend on the given problem and network architecture. However, here are some general suggestions which have worked for certain problems:
For convolutional highway networks, initialize to -1 for network depths of around 10-20, -2 for depths of 20-30, and -3 for depths of 30-40.
For feedforward layers, Kim et al. found a bias of -2 to work well even when using only a couple of highway layers. Note that this application utilizes the adaptive processing capabilities of highway networks, and the goal is not to train a very deep network.
For very deep networks, start from -1 and decrement by 0.5 or 1.0 until the network starts training easily. Note that when a proper hyperparameter search was performed, a bias of -2 was sufficient to train networks of up to 100 layers. A lower bias may allow a larger range of hyperparameters to work, or make learning faster.
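One way to read these suggestions: with a sigmoid transform gate, the bias fixes how open the gate is before any learning happens. Biases of -1, -2 and -3 correspond to initial gate activations of roughly 0.27, 0.12 and 0.05 when the gate's weighted input is near zero. The sketch below shows that relationship and the "decrement until it trains" procedure. The `trains_easily` callback and the `lowest` cutoff are hypothetical placeholders; what counts as "training easily" has to be defined for the problem at hand.

```python
import math

def initial_gate_activation(bias):
    """Transform gate value sigmoid(bias) when the gate's weighted input is zero."""
    return 1.0 / (1.0 + math.exp(-bias))

def search_gate_bias(trains_easily, start=-1.0, step=0.5, lowest=-10.0):
    """Lower the bias from `start` in steps of `step` until `trains_easily(bias)` is True.

    `trains_easily` is a hypothetical user-supplied callback that runs a short
    training trial; `lowest` is an assumed cutoff, not a value from the text above.
    Returns the first bias that works, or None if none does.
    """
    bias = start
    while bias >= lowest:
        if trains_easily(bias):
            return bias
        bias -= step
    return None

for b in (-1.0, -2.0, -3.0):
    print(f"bias {b:+.1f} -> initial transform gate ~ {initial_gate_activation(b):.3f}")

# Stand-in for a real training trial: pretend any bias at or below -2 trains.
chosen = search_gate_bias(lambda b: b <= -2.0)
```

In practice the callback would train the network for a small number of updates and check whether the loss decreases; the lambda here only stands in for that check.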
Q: Is the highway gating mechanism related to how information flow is regulated in the brain?
A: Information processing in the brain is not understood very well yet. However, the idea that the brain uses similar gating mechanisms has definitely been considered seriously by neuroscientists.
For example, see: Gisiger, T., & Boukadoum, M. (2011). Mechanisms Gating the Flow of Information in the Cortex: What They Might Look Like and What Their Uses may be. Frontiers in Computational Neuroscience, 5, 1. http://doi.org/10.3389/fncom.2011.00001