Lecture 17, March 16th, 2015: RBMs continued and Deep Belief Nets

In this lecture we will continue our discussion of probabilistic undirected graphical models such as the Restricted Boltzmann Machine, and then move on to the Deep Belief Network. We will also review the quiz from last Monday.

Please study the following material in preparation for the class:

  • Lecture 7 (esp. 7.7 to 7.9) of Hugo Larochelle’s course on Neural Networks.
  • Chapter 21 (sec. 21.3) of the Deep Learning Textbook on deep generative models.

Other relevant material:

  •  Lecture 14 of Geoff Hinton’s Coursera course on Neural Networks.

10 thoughts on “Lecture 17, March 16th, 2015: RBMs continued and Deep Belief Nets”

  1. Hugo at one point described “widening” and “narrowing” networks, referring to how the number of feature detectors changes from layer to layer, but the performance graphs he showed didn’t really convince me that either was better than the other. What guidelines can we use to choose whether our network widens or narrows? This isn’t really related to RBMs but I was wondering 🙂


  2. From Hinton’s earlier lectures, the wake-sleep algorithm is a very crude approximate learning procedure (which only approximately optimizes a lower bound on the log-likelihood). Yet, after greedy layerwise training, applying the crude wake-sleep algorithm still improves the deep belief network. Why does the wake-sleep algorithm improve over greedy layerwise training? Why is it better to train layers jointly in a crude fashion (with a factorial distribution approximation) than to train each layer greedily but much more exactly?

    Maybe this is explained somewhere in Hinton’s papers, but it would be nice if we could briefly go over it in class.


    1. Aaron, perhaps the inefficiency of wake-sleep (with fixed “recognition weights”) led people to use (mean field) approximate inference instead? Maybe you already gave this argument last term?


  3. I have a question: does adding another layer to a DBN always improve its representational power? It seems that even without fine-tuning (back-propagation at the end), the results are pretty good.


    1. Adding an additional layer improves your prior and the lower bound on the likelihood.
      You can get a better sense of this by reading Yoshua’s paper “Learning deep architectures for AI”.


          1. You can think of log P(v) = log \sum_h P(v, h).
            Using Jensen’s inequality, log P(v) is lower bounded by
            H(Q(h|v)) + \sum_h Q(h|v) (log P(h) + log P(v|h)).
            If the conditional distributions Q(h|v) and P(v|h) are held fixed, the entropy term is constant, I guess.
            Therefore, when you improve P(h) (which is the P(v) of the higher-layer RBM), you improve this lower bound on log P(v), though not necessarily log P(v) itself. Stacking a layer means you improve P(h).
            Please correct me if I said something wrong.
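
The bound in the comment above can be checked numerically on a toy model. The following is only a sketch under stated assumptions: a binary RBM small enough to enumerate every state, with random parameters; the helper names (`all_binary`, `rbm_joint`, `lower_bound`, `check_bound`) are made up for this illustration and do not come from the lectures or the textbook. It evaluates H(Q(h|v)) + \sum_h Q(h|v) (log P(h) + log P(v|h)) once with Q equal to the exact posterior and once with a uniform Q.

```python
import itertools
import numpy as np


def all_binary(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))


def rbm_joint(W, b, c):
    """Exact joint P(v, h) of a tiny binary RBM (rows index v, columns index h)."""
    V, H = all_binary(W.shape[0]), all_binary(W.shape[1])
    neg_energy = V @ W @ H.T + (V @ b)[:, None] + (H @ c)[None, :]
    unnorm = np.exp(neg_energy)
    return unnorm / unnorm.sum()


def lower_bound(joint, q):
    """H(Q(h|v)) + sum_h Q(h|v) (log P(h) + log P(v|h)), one value per v."""
    log_p_h = np.log(joint.sum(axis=0))
    log_p_v_given_h = np.log(joint) - log_p_h[None, :]
    entropy = -(q * np.log(q)).sum(axis=1)
    return entropy + (q * (log_p_h[None, :] + log_p_v_given_h)).sum(axis=1)


def check_bound(seed=0, n_v=3, n_h=2):
    """Return log P(v), the bound with exact Q, and the bound with uniform Q."""
    rng = np.random.default_rng(seed)
    joint = rbm_joint(rng.normal(size=(n_v, n_h)),
                      rng.normal(size=n_v),
                      rng.normal(size=n_h))
    log_p_v = np.log(joint.sum(axis=1))
    q_exact = joint / joint.sum(axis=1, keepdims=True)      # P(h|v)
    q_uniform = np.full_like(joint, 1.0 / joint.shape[1])   # ignores v
    return log_p_v, lower_bound(joint, q_exact), lower_bound(joint, q_uniform)


log_p_v, tight, loose = check_bound()
print(np.allclose(tight, log_p_v))   # bound is tight when Q is the exact posterior
print(np.all(loose <= log_p_v))      # any other Q can only give a smaller value
```

Only the log P(h) term depends on the prior over h, so with Q(h|v) and P(v|h) frozen, raising \sum_h Q(h|v) log P(h) is the one way left to raise the bound.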


  4. In Hugo’s lecture 7.9 he explains how initializing the weights of the second layer of the DBN as the transpose of the weights of the first layer makes the bound initially tight.
    How important is that? Can’t we achieve good results with another type of initialization?
    Plus, it implies that the dimension of our second hidden layer will (always?) be the same as the dimension of our input data.
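
As a rough numerical illustration of the transpose initialization asked about above (a sketch under assumptions, not code from Hugo's lecture): take a toy binary RBM with random parameters, build a second RBM whose weights are the transpose of the first and whose biases are swapped, and compare the marginal over v of the resulting two-layer model with that of the original RBM. The function names are hypothetical.

```python
import itertools
import numpy as np


def all_binary(n):
    """All 2**n binary vectors of length n, one per row."""
    return np.array(list(itertools.product([0.0, 1.0], repeat=n)))


def rbm_joint(W, b_vis, b_hid):
    """Exact joint of a tiny binary RBM (rows: visible states, columns: hidden states)."""
    V, H = all_binary(W.shape[0]), all_binary(W.shape[1])
    neg_energy = V @ W @ H.T + (V @ b_vis)[:, None] + (H @ b_hid)[None, :]
    unnorm = np.exp(neg_energy)
    return unnorm / unnorm.sum()


def tied_init_check(seed=0, n_v=3, n_h=2):
    """Compare P(v) of the first RBM with P(v) of the DBN right after transpose init."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(size=(n_v, n_h))
    b, c = rng.normal(size=n_v), rng.normal(size=n_h)

    joint1 = rbm_joint(W1, b, c)              # P1(v, h1)
    p1_h = joint1.sum(axis=0)                 # P1(h1)
    p_v_given_h = joint1 / p1_h[None, :]      # P1(v | h1), kept as the DBN's bottom layer

    # Second RBM: its visible units are h1, its hidden layer h2 has n_v units,
    # its weights are W1 transposed, and the two bias vectors swap roles.
    joint2 = rbm_joint(W1.T, c, b)            # P2(h1, h2)
    p2_h1 = joint2.sum(axis=1)                # the DBN's new prior over h1

    rbm_p_v = joint1.sum(axis=1)
    dbn_p_v = p_v_given_h @ p2_h1             # sum_h1 P1(v | h1) P2(h1)
    return rbm_p_v, dbn_p_v, p1_h, p2_h1


rbm_p_v, dbn_p_v, p1_h, p2_h1 = tied_init_check()
print(np.allclose(p1_h, p2_h1))      # same prior over h1 before any further training
print(np.allclose(rbm_p_v, dbn_p_v)) # so the stacked model starts from the RBM's P(v)
```

In this toy setting the transposed weights force the second hidden layer to have as many units as the input, which is the constraint the comment points out; the sketch says nothing about how well other initializations or layer sizes work in practice.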

