In this lecture we will continue our discussion of probabilistic undirected graphical models such as the **Restricted Boltzmann Machine**, then move on to the **Deep Belief Network**. We will also review the quiz from last Monday.

**Please study the following material in preparation for the class:**

- Lecture 7 (esp. 7.7 to 7.9) of Hugo Larochelle’s course on Neural Networks.
- Chapter 21 (sec. 21.3) of the Deep Learning Textbook on deep generative models.

**Other relevant material:**

- Lecture 14 of Geoff Hinton’s Coursera course on Neural Networks.


Hugo at one point described “widening” and “narrowing” networks, referring to the number of feature detectors per layer, but the performance graphs he showed didn’t really convince me that either was better than the other. What guidelines can we use to choose whether our network widens or narrows? This isn’t really related to RBMs, but I was wondering 🙂


From Hinton’s earlier lectures, the wake-sleep algorithm is a very crude approximate learning procedure (which only approximately optimizes a lower bound on the log-likelihood). Yet, after greedy layerwise training, applying the crude wake-sleep algorithm still improves the deep belief network. Why does the wake-sleep algorithm improve over greedy layerwise training? Why is it better to train the layers jointly in a crude fashion (with a factorial distribution approximation) than to train each layer greedily but much more exactly?

Maybe this is explained somewhere in Hinton’s papers, but it would be nice if we could briefly go over it in class.


Aaron, perhaps the inefficiency of wake-sleep (with fixed “recognition weights”) led people to use (mean-field) approximate inference instead? Maybe you already gave this argument last term?


I have a question: does adding another layer to a DBN always improve its representational power? It seems that even without fine-tuning (back-propagation at the end), the results are pretty good.


Adding an additional layer improves your prior and the lower bound on the likelihood.

You can get more intuition by reading Yoshua’s paper “Learning Deep Architectures for AI”.


Why does adding a layer improve the prior and the lower bound?


There is a typo in the textbook, Fig 9.3, p. 164: it should read “colleague’s health h_c” and “your health h_y” instead.

This paper, a very nice introduction to RBMs, may interest some of you:

http://link.springer.com/chapter/10.1007/978-3-642-33275-3_2#page-1


You can think of log P(v) = log \sum_h P(v, h).

Using Jensen’s inequality, for any distribution Q(h|v), log P(v) is lower bounded by

H(Q(h|v)) + \sum_h Q(h|v) (log P(h) + log P(v|h)).

The conditional distribution P(v|h) is fixed after training the first layer, and the entropy term should be constant, I guess.

Therefore, when you improve P(h) (which is the P(v) of the higher-layer RBM) you improve the overall lower bound on log P(v). Stacking another layer means improving P(h).
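A quick numeric sanity check of the bound above, on a toy discrete model (all distributions here are made up for illustration, not from any lecture): for any choice of Q(h|v), the Jensen lower bound never exceeds log P(v).

```python
import numpy as np

rng = np.random.default_rng(0)

n_h = 4                                        # number of hidden configurations
p_h = rng.dirichlet(np.ones(n_h))              # toy prior P(h)
p_v_given_h = rng.dirichlet(np.ones(2), n_h)   # toy P(v|h) for a binary v

v = 0
p_v = np.sum(p_h * p_v_given_h[:, v])          # P(v) = sum_h P(h) P(v|h)

q = rng.dirichlet(np.ones(n_h))                # an arbitrary approximate posterior Q(h|v)
entropy = -np.sum(q * np.log(q))               # H(Q(h|v))
bound = entropy + np.sum(q * (np.log(p_h) + np.log(p_v_given_h[:, v])))

# Jensen's inequality: the variational bound is below log P(v)
assert bound <= np.log(p_v)
```

The assertion holds for any Q, which is what makes it safe to train a higher layer to improve the P(h) term of the bound.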

Please correct me if I said something wrong.


Why are the conditional distributions fixed and the entropy term constant? I didn’t see that. 😦


In Hugo’s lecture 7.9 he explains how initializing the weights of the DBN’s second layer as the transpose of the first layer’s weights makes the bound initially tight.

How important is that? Can’t we achieve good results with another type of initialization?

Plus, it implies that the dimension of our second hidden layer will (always?) be the same as the dimension of our input data.
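For concreteness, a minimal sketch of the transpose initialization (sizes here are illustrative assumptions, e.g. MNIST-like 784 visible units and 500 first-layer hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden1 = 784, 500                # assumed layer sizes

# Weights of the first (already trained) RBM: visible x hidden1
W1 = rng.normal(0.0, 0.01, size=(n_visible, n_hidden1))

# Transpose initialization for the second RBM: its visible layer is h1,
# so its hidden layer is forced to have n_visible units -- exactly the
# constraint raised in the comment above.
W2 = W1.T.copy()
assert W2.shape == (n_hidden1, n_visible)
```

This makes plain why the second hidden layer's size ends up tied to the input dimension: the transpose fixes its shape.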
