Lecture 21, March 30th, 2015: Recurrent Neural Networks

In this lecture we will discuss Recurrent Neural Networks.

Please study the following material in preparation for class:

Other relevant material:


16 Replies to “Lecture 21, March 30th, 2015: Recurrent Neural Networks”

  1. 1. Not sure that I fully understand this figure in Hinton’s lecture “Why is it difficult to train RNNs”:

    Does it mean that when you want to train attractors with an RNN, you will inevitably (provided it learns) end up with vanishing/exploding gradients? The beginning of training is fine, but as the attractors are gradually learnt, vanishing/exploding gradients start appearing. Exploding gradients seem even more dangerous than vanishing gradients because they can completely change the weights and “reset” the optimization, while vanishing gradients only slow down or stop the training. Is that correct?

    2. In RNNs, I guess ReLUs reduce the risk of vanishing gradients (though not completely if the sequences are very long), but it seems that ReLU RNNs are more prone to exploding gradients than sigmoid/tanh RNNs (when they saturate). Is that correct?


  2. 1. Basically, there are very high, steep walls between attractors in the error surface of RNNs. That’s because the recurrent (hidden-to-hidden) weight matrix W is shared through time, so it gets multiplied by itself many times. (At time step T we have a term like W^T, i.e. W raised to the power T.)
    It can be shown that if the largest singular value of W is less than one, the gradient will vanish. A largest singular value greater than one is necessary for the gradient to explode, but not sufficient on its own. (These thresholds hold for tanh units; for sigmoid units the threshold is 4 instead of 1.)
    Here’s the border between two attractors from a very nice paper by Razvan:

    Pascanu, Razvan, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks.”: http://arxiv.org/pdf/1211.5063.pdf

    – Although exploding gradients seem more dangerous, there is a very simple and effective solution: “gradient clipping”. Just clip the gradients so that they stay below a threshold.
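To make the clipping idea concrete, here is a minimal NumPy sketch of one common variant: rescaling the whole gradient when its norm exceeds a threshold, which is the form used in the Pascanu et al. paper linked above. The function name `clip_by_global_norm` and the toy gradient values are made up for illustration, and the threshold of 1.0 is arbitrary.

```python
import numpy as np

def clip_by_global_norm(grads, threshold):
    """Rescale a list of gradient arrays so their joint L2 norm is <= threshold."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > threshold:
        # Shrink every array by the same factor, so the direction is unchanged.
        scale = threshold / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Toy gradients (illustrative values): joint norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clipped, norm_before = clip_by_global_norm(grads, threshold=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
```

Rescaling by the norm keeps the update direction and only limits the step size, whereas clipping each component separately can change the direction.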

    2. As I said, the reason for vanishing gradients is not just the saturation of the activation function, but also having a shared W that gets multiplied by itself many times.
    But ReLUs can definitely help if we manage to use them in RNNs. As far as I know, no one has used ReLUs in RNNs extensively because the hidden values themselves explode (not just the gradient). That’s because the ReLU is unbounded and the network is so deep in time.
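The effect of repeatedly multiplying by a shared W can be seen in a small NumPy experiment. This is only a sketch under simplifying assumptions: the recurrence is linear (no activation function), and W is deliberately chosen as a scaled orthogonal matrix so that all its singular values equal the scale. The helper names and the sizes (50 units, 100 steps) are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 50, 100

def scaled_orthogonal(scale):
    """Random matrix whose singular values are all equal to `scale`."""
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return scale * q

def backprop_norm(W, steps):
    """Norm of a unit gradient after being pushed back `steps` time steps."""
    g = np.ones(n) / np.sqrt(n)
    for _ in range(steps):
        g = W.T @ g   # one step of backpropagation through a linear recurrence
    return np.linalg.norm(g)

small = backprop_norm(scaled_orthogonal(0.9), steps)   # shrinks like 0.9**steps
large = backprop_norm(scaled_orthogonal(1.1), steps)   # grows like 1.1**steps
```

With a general W and a saturating nonlinearity the picture is less clean, but the same geometric shrinking or growth is what drives the problem.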


    1. Also, I would like to add that the vanishing gradient will only affect the long-term dependencies. You can train a network that suffers from this and it will learn something. It’s just that it will have a hard time using important information from the distant past and it will mainly use information from the recent past.

      This is because in the BPTT algorithm, the updates are related to the sum of all the gradients (with respect to x_k). The vanishing gradient problem means that the recent components (bigger k) will be weighted much more than the long-term components (small k).
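For reference, the sum can be written out in the notation of the Pascanu et al. paper linked earlier in this thread. The symbols other than x_k are taken from that paper rather than from this comment: E_t is the error at step t, θ the parameters, W_rec the recurrent weight matrix, σ the activation function, and ∂⁺ the "immediate" partial derivative that treats x_{k-1} as a constant.

```latex
\frac{\partial E_t}{\partial \theta}
  = \sum_{1 \le k \le t}
    \frac{\partial E_t}{\partial x_t}\,
    \frac{\partial x_t}{\partial x_k}\,
    \frac{\partial^{+} x_k}{\partial \theta},
\qquad
\frac{\partial x_t}{\partial x_k}
  = \prod_{t \ge i > k} \frac{\partial x_i}{\partial x_{i-1}}
  = \prod_{t \ge i > k} W_{rec}^{\top}\,
    \operatorname{diag}\!\big(\sigma'(x_{i-1})\big)
```

The product has t - k factors, so terms with small k (the distant past) are the ones that shrink when the gradient vanishes.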


        1. In my (small) experience, this is a very problem-dependent and model-dependent issue. I’ve had problems with sequences of length 30; on the other hand, I’ve successfully trained models on much longer sequences.

          There are strategies that are very important when training and that can define whether the model will work. Orthonormal initialization of the recurrent weights and gradient clipping are two that I can recall.

          Furthermore, using LSTMs (or GRUs) has become very important since it seems to solve most of the problems by allowing the gradient to flow more easily through the network. However, I don’t think that the vanishing gradient problem is solved yet.
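As an illustration of the orthonormal initialization mentioned above, here is one way it could be done in NumPy: take the Q factor from a QR decomposition of a Gaussian random matrix. The function name `orthonormal_init` and the size 64 are placeholders; this is a sketch, not the recipe of any particular library.

```python
import numpy as np

def orthonormal_init(n_hidden, seed=0):
    """Return an (n_hidden, n_hidden) matrix with orthonormal columns."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((n_hidden, n_hidden))
    q, r = np.linalg.qr(a)
    # Flip column signs using the diagonal of R so the result does not
    # depend on the sign convention of the QR routine.
    q = q * np.sign(np.diag(r))
    return q

W = orthonormal_init(64)
singular_values = np.linalg.svd(W, compute_uv=False)
```

All singular values of such a matrix are 1, so at initialization repeated multiplication by W neither shrinks nor grows a vector.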


    1. I suppose it gives one (or two, including the bias) additional parameter and increases the capacity of the model.

      I’ve also heard that for some applications it may be reasonable to set the initial state to all zeros in order to indicate the first symbol of the sequence. For deeper recurrent networks, people recommend initializing the initial state of the upper RNN with the last state of the previous recurrent layer, as in the encoder-decoder model.


  3. Sorry, this is a little off topic, but I have a nagging question. I was wondering if anyone has tried LeCun’s preprocessing recipe of YUV conversion plus local contrast normalization? I’m interested in what kind of performance this gives compared to methods like ZCA.


  4. Hinton mentioned Echo State Networks. It seems nobody is using them anymore – but why not?

    Obviously, by not training the hidden-to-hidden weights, they have much lower capacity but they could still be useful if the inputs are of sufficiently small size (e.g. in language modeling with low-dimensional word embeddings).
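For readers who have not seen one, a minimal echo state network might look like the NumPy sketch below. The hidden-to-hidden and input weights are random and stay fixed; only a linear readout is fitted, here with ridge regression. Everything concrete in it is an assumption chosen for illustration: the reservoir size, the spectral radius of 0.9, the ridge strength, and the toy task of predicting the next value of a sine wave.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, washout, ridge = 100, 50, 1e-6

# Fixed random weights: these are never trained.
W_in = rng.uniform(-0.5, 0.5, size=n_res)
W = rng.uniform(-0.5, 0.5, size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # rescale spectral radius to 0.9

def run_reservoir(u):
    """Drive the reservoir with a scalar sequence and collect its states."""
    x = np.zeros(n_res)
    states = []
    for u_t in u:
        x = np.tanh(W_in * u_t + W @ x)
        states.append(x.copy())
    return np.array(states)

# Toy task: predict the next value of a sine wave.
signal = np.sin(0.1 * np.arange(1000))
u, y = signal[:-1], signal[1:]

X = run_reservoir(u)[washout:]   # drop the initial transient
Y = y[washout:]

# Only the linear readout is fitted (closed-form ridge regression).
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
mse = np.mean((X @ W_out - Y) ** 2)   # training error only
```

Note that `mse` here is the error on the training sequence, so it says nothing about how such a model compares with a trained RNN on a real task.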


    1. From what I understand, it takes a staggering number of (sparse!) units in a reservoir network (the family that includes echo state nets) to get performance equivalent to a small LSTM. Like most other “random” networks (Extreme Learning Machines (ELM), feedback alignment, random matrix stuff in general) it is really cool, but performance bounds seem very loose in practice, and the theoretical guarantees, while fascinating, are fairly weak compared to the guarantees of other algorithms.

      There is not much of a convincing argument for *not* learning, especially since once the weights are learned it shouldn’t be any slower to apply at test time than random weights. It takes some expert knowledge to be able to train these things (deep RNN, LSTM, GRU) effectively, but that’s exactly why we are in this class in the first place 🙂

      One interesting application I saw for ELM (random affine transformation followed by linear or logistic regression) was for real-time control of engine firing for a small engine. These kinds of embedded, online applications that need to change with the environment seem like a neat area to apply these kinds of pseudo-deep learning algorithms.

      Philemon used to be in the Reservoir Lab (http://reslab.elis.ugent.be/), he knows quite a lot about it. You could probably get more information from him.

      As for your comment on word embeddings, the “hashing trick” is a huge part of applying linear models over inputs like words, or website links, or database inputs, or lots of other types of data. It is unreasonably effective in many applications, and much of the theory underlying random projections is also what guarantees its properties. See http://arxiv.org/pdf/0902.2206.pdf and http://en.wikipedia.org/wiki/Feature_hashing#Feature_vectorization_using_the_hashing_trick for more details, as well as https://gist.github.com/kastnerkyle/ed2827312e1e10f494de for some code in pure Python (not originally mine, but modified and stored for my own uses). It was used heavily in a CTR prediction competition over ~45 GB (uncompressed) of data stored on disk, while learning only required 200 *MB* of RAM. Half a step from magic, in my experience.

      Obviously, learned embeddings, as proposed by Bengio et al. here http://jmlr.csail.mit.edu/papers/volume3/bengio03a/bengio03a.pdf, are better, but if you can get close to the same performance without learning the embedding (or storing it), this still has interesting applications.
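The hashing trick itself fits in a few lines. The sketch below is not the code from the gist linked above; it is a simplified illustration in which `hash_features`, the bucket count of 16 and the example tokens are all made up. It uses `zlib.crc32` because Python's built-in `hash` for strings is salted per process, and it takes the sign from the top bit of the same hash, which is a shortcut compared with using a second, independent hash function.

```python
import zlib
import numpy as np

def hash_features(tokens, n_buckets=16):
    """Map a list of string tokens to a fixed-size vector without a vocabulary."""
    vec = np.zeros(n_buckets)
    for tok in tokens:
        h = zlib.crc32(tok.encode("utf-8"))       # stable 32-bit unsigned hash
        index = h % n_buckets                     # which bucket the token lands in
        sign = 1.0 if (h >> 31) & 1 else -1.0     # signed update reduces collision bias
        vec[index] += sign
    return vec

# Hypothetical categorical features, in the style of click-through data.
v = hash_features(["user=42", "site=example.com", "device=mobile"])
```

No dictionary from tokens to indices is ever stored, which is why memory use stays fixed no matter how many distinct tokens appear in the data.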

