In this lecture we will begin our discussion of probabilistic undirected graphical models. In particular, we will study the **Restricted Boltzmann Machine**.

**Please study the following material in preparation for the class:**

- Lecture 5 (5.1 to 5.8) of Hugo Larochelle’s course on Neural Networks.
- Chapter 9 of the Deep Learning Textbook (important background on probabilistic models).
- Chapter 15 (sec. 15.2) of the Deep Learning Textbook (approximate maximum likelihood training).
- Chapter 21 (sec. 21.2) of the Deep Learning Textbook on deep generative models.

**Other relevant material:**

- Lectures 11 and 12 (especially 11d-11e and 12a-12e) of Geoff Hinton’s Coursera course on Neural Networks.


Hi Aaron,

The post is not categorized, so it does not appear in the “Lectures” section.

best


Argh! Sorry about that.

Thanks for letting me know.

– Aaron


Two questions about contrastive divergence:

When sampling, we start from a training sample in order to avoid the burn-in period of the Markov chain.

1. How many iterations do we really save? (How long is the burn-in in the first place?)

2. How many iterations does it take to decorrelate the next sample from the previous one?


So far we have seen

1. RBMs

2. Auto-encoders

3. Sparse coding

Given a known task, how should we choose which one to use?


From “Sparse coding with an overcomplete basis set” [Olshausen and Field, 1997!]:

“Sparse coding only applies if the data actually have sparse structure”.

So my impression is that if you know that the underlying factors are sparse, then it’s reasonable to use sparse coding. Otherwise, sparse coding is not efficient in comparison with auto-encoders.

Besides, it has been shown that for natural image data like ImageNet, the features extracted using sparse coding and auto-encoders are very similar.

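Also, under certain conditions, denoising auto-encoders can be shown to be closely related to RBMs (for example, through the connection to score matching for Gaussian RBMs).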

Finally, RBMs (and their descendants such as DBNs and DBMs) are generative models that you can sample from, while auto-encoders and sparse coding are deterministic models.


Thank you Mohammad!


Check out these papers by Adam Coates; we will discuss them in class.

http://machinelearning.wustl.edu/mlpapers/paper_files/AISTATS2011_CoatesNL11.pdf

http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Coates_485.pdf


I don’t really understand the concept of sampling: 1) Why and 2) How do we use it?


In a Markov Chain structure, when we want to compute the partition function or marginal probabilities, we need to sum over all joint configurations of the variables, and the number of configurations grows exponentially with the number of variables.

One solution is to use Markov chain Monte Carlo (MCMC), but it takes a long time to converge.

Hinton proposed that we can run MCMC for just a few steps, starting at the last example seen. He showed that it works well! This algorithm is called Contrastive Divergence. So we can estimate that “sum” (or “expectation” or “likelihood gradient”) by moving back and forth between the seen and unseen variables.
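To make the cost of that sum concrete, here is a minimal sketch that computes the partition function of a tiny binary RBM by brute force. The energy form and the symbols `W`, `b` (hidden biases) and `c` (visible biases) follow the usual RBM convention and are assumptions on my part, as are the toy parameter values.

```python
import itertools
import math

def energy(x, h, W, b, c):
    """RBM energy E(x, h) = -h^T W x - c^T x - b^T h for binary x and h."""
    interaction = sum(W[j][i] * h[j] * x[i]
                      for j in range(len(h)) for i in range(len(x)))
    visible_bias = sum(c[i] * x[i] for i in range(len(x)))
    hidden_bias = sum(b[j] * h[j] for j in range(len(h)))
    return -(interaction + visible_bias + hidden_bias)

def partition_function(W, b, c):
    """Sum exp(-E) over all 2^(n_visible + n_hidden) binary configurations."""
    n_hidden, n_visible = len(W), len(W[0])
    total = 0.0
    for x in itertools.product((0, 1), repeat=n_visible):
        for h in itertools.product((0, 1), repeat=n_hidden):
            total += math.exp(-energy(x, h, W, b, c))
    return total

# Toy parameters (hypothetical values, chosen only for illustration).
W = [[0.5, -0.3, 0.1],
     [-0.2, 0.4, 0.0]]   # shape: n_hidden x n_visible
b = [0.1, -0.1]          # hidden biases
c = [0.0, 0.2, -0.2]     # visible biases

Z = partition_function(W, b, c)
n_terms = 2 ** (len(b) + len(c))   # number of terms in the sum
```

With 3 visible and 2 hidden units the sum has 32 terms. For an MNIST-sized RBM (784 visible units plus a few hundred hidden units) the same loop is hopeless, which is why we fall back on sampling.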


Thanks Mohammad,

When you mention at the end of your explanation that you move back and forth, you mean that we try to reconstruct x given h, right?

As you mention, a Markov chain is used to estimate the probability of a sequence, but I don’t understand how it applies in our case … is each state in the sequence a data sample?

I’m confused.


Christian,

Sorry. I meant “Markov Random Field”, not “Markov Chain”, and I cannot edit my comment!

You’re right. Basically, instead of computing the expectation, we use a point estimate which can be obtained using a sampling chain. (x -> h -> x -> h -> … -> x)

In RBMs, this algorithm is very efficient. Why? Because given a sample x, we can easily sample h: given x, the h’s are conditionally independent. The same rule applies when we want to sample x given h (the h that we sampled in the previous step).
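The chain x -> h -> x described above can be sketched as follows. This is a rough illustration for a binary RBM, not a reference implementation: the function names, the parameter layout (`W` as a list of rows, one per hidden unit) and the choice to use hidden probabilities rather than samples in the statistics are all assumptions of mine.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sample_hidden(x, W, b, rng):
    """Sample every h_j independently from p(h_j = 1 | x) = sigmoid(b_j + W_j . x)."""
    probs = [sigmoid(b[j] + sum(W[j][i] * x[i] for i in range(len(x))))
             for j in range(len(b))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def sample_visible(h, W, c, rng):
    """Sample every x_i independently from p(x_i = 1 | h) = sigmoid(c_i + sum_j W_ji h_j)."""
    probs = [sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(h))))
             for i in range(len(c))]
    return [1 if rng.random() < p else 0 for p in probs], probs

def cd_k_weight_update(x_data, W, b, c, k, rng):
    """CD-k estimate of the log-likelihood gradient with respect to W."""
    _, h_data_probs = sample_hidden(x_data, W, b, rng)
    x_model = list(x_data)             # the chain starts at the training example
    for _ in range(k):                 # x -> h -> x, repeated k times
        h_sample, _ = sample_hidden(x_model, W, b, rng)
        x_model, _ = sample_visible(h_sample, W, c, rng)
    _, h_model_probs = sample_hidden(x_model, W, b, rng)
    # Positive phase (data) minus negative phase (end of the chain).
    return [[h_data_probs[j] * x_data[i] - h_model_probs[j] * x_model[i]
             for i in range(len(c))] for j in range(len(b))]

# Toy usage with hypothetical parameter values.
rng = random.Random(0)
W = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.0]]
b = [0.1, -0.1]
c = [0.0, 0.2, -0.2]
grad_W = cd_k_weight_update([1, 0, 1], W, b, c, k=1, rng=rng)
```

Note that each conditional is a product of independent Bernoulli distributions, so a whole layer is sampled in one pass. That is the efficiency being referred to.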


Normally in Gibbs sampling you want to wait a while in between samples so that the values you get aren’t correlated, because in a Markov chain samples depend on the previous ones. Here we don’t have dependence on previous samples, so we don’t have to wait and K=1 works. But we do Gibbs sampling to estimate a joint probability distribution, and we appear to know the distribution, because we have the data we’re trying to model. So what is the point of doing this at all?


Why “Here we don’t have dependence on previous samples”? I think that in an RBM the current sample is not independent of the previous one. That is also the reason why, if one wants to get good samples from the RBM’s distribution, one should run the Gibbs chain for N steps and keep the last sample, then run another N steps and keep the last sample again, and so on. N should be large enough to eliminate the dependence between the samples you get.
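The "run N steps, keep the last sample, repeat" procedure can be sketched like this. It is only an illustration for a binary RBM: the helper names and the parameter layout are hypothetical, and how large `n_steps` needs to be depends on how well the chain mixes, which this sketch does not try to measure.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gibbs_step(x, W, b, c, rng):
    """One full Gibbs sweep x -> h -> x for a binary RBM."""
    h = [1 if rng.random() < sigmoid(b[j] + sum(W[j][i] * x[i] for i in range(len(x)))) else 0
         for j in range(len(b))]
    return [1 if rng.random() < sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(h)))) else 0
            for i in range(len(c))]

def thinned_samples(x_start, W, b, c, n_samples, n_steps, rng):
    """Run one chain and keep only the state reached after every n_steps sweeps."""
    x = list(x_start)
    kept = []
    for _ in range(n_samples):
        for _ in range(n_steps):       # intermediate states are discarded
            x = gibbs_step(x, W, b, c, rng)
        kept.append(list(x))
    return kept

# Toy usage with hypothetical parameter values.
rng = random.Random(0)
W = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.0]]
b = [0.1, -0.1]
c = [0.0, 0.2, -0.2]
samples = thinned_samples([1, 0, 1], W, b, c, n_samples=5, n_steps=50, rng=rng)
```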


I think Saizheng is right.

From “Justifying and Generalizing Contrastive Divergence (Yoshua)”:

“The surprising empirical result is that even k = 1 (CD-1) often gives good results. An extensive numerical comparison of training with CD-k versus exact log-likelihood gradient has been presented in (Carreira-Perpiñán & Hinton, 2005). In these experiments, taking k larger than 1 gives more precise results, although very good approximations of the solution can be obtained even with k = 1.”


“But we do Gibbs sampling to estimate a joint probability distribution, and we appear to know the distribution, because we have the data we’re trying to model.”

We actually know the analytical form of the joint distribution represented by the model. Gibbs sampling is used to sample from it. It appears that K=1 works already surprisingly well.

The Markov chain in the Gibbs sampler iteratively produces dependent samples by definition of the Markov chain (each sample depends on the previous one).


In Hugo’s video 5.8, he shows feature maps obtained from training an RBM on MNIST, and although he refers to them as clear pen strokes, I find them less obvious compared to the features extracted by the other unsupervised models seen previously. There are a lot of round blobs, particularly black blobs. Would you say that this is due to the nature of the images (digits that are in most cases roundish)? Could you comment on this?


In the negative phase of contrastive divergence, given that we want to make the energy high everywhere except at the sample location, what motivates the use of the current sample x(t) as the initial point of the Gibbs sampling? Why not start from a random point of the training distribution, or even from a completely random point?


We want to generate a faithful sample from the model. Assuming the distribution modelled by the RBM is not too far from the actual training data, it is reasonable to start the Markov chain from a datapoint. Why choose the current point and not a random datapoint? I think it won’t change much, but it seems quite convenient to choose the current datapoint so that we explore the space around the data manifold well (by choosing randomly, we might be unlucky and never pick a particular datapoint). I don’t think that starting from a completely random point would work well, because the RBM may behave very unpredictably if we are too far from the data manifold, which may result in stuck or divergent Markov chains (to be confirmed in class).

The persistent contrastive divergence (PCD) extension does not start from the current datapoint but instead from the latest point sampled from the Markov chain.
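The difference in chain initialization between CD and its persistent variant can be sketched as below. This is a deliberately stripped-down illustration: the parameter update between examples is omitted, the function names are hypothetical, and initializing the persistent chain at the first training example is an arbitrary choice of mine.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def run_chain(x, W, b, c, k, rng):
    """Run k Gibbs sweeps x -> h -> x of a binary RBM starting from x."""
    x = list(x)
    for _ in range(k):
        h = [1 if rng.random() < sigmoid(b[j] + sum(W[j][i] * x[i] for i in range(len(x)))) else 0
             for j in range(len(b))]
        x = [1 if rng.random() < sigmoid(c[i] + sum(W[j][i] * h[j] for j in range(len(h)))) else 0
             for i in range(len(c))]
    return x

def negative_samples(data, W, b, c, k, rng, persistent):
    """Collect one negative sample per training example.

    CD (persistent=False): every chain restarts at the current datapoint.
    PCD (persistent=True): the chain continues from where it last stopped.
    """
    chain = list(data[0])              # initial state of the persistent chain
    negatives = []
    for x_data in data:
        start = chain if persistent else x_data
        chain = run_chain(start, W, b, c, k, rng)
        negatives.append(chain)
        # A real training loop would update W, b and c here.
    return negatives

# Toy usage with hypothetical parameter values.
rng = random.Random(0)
W = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.0]]
b = [0.1, -0.1]
c = [0.0, 0.2, -0.2]
data = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
cd_negatives = negative_samples(data, W, b, c, k=1, rng=rng, persistent=False)
pcd_negatives = negative_samples(data, W, b, c, k=1, rng=rng, persistent=True)
```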


Apologies for the late comment:

I just wanted to point out one thing about contrastive divergence. As Prof Courville pointed out in class, there has been a lot of work on studying the bias of CD. You can prove, in addition, that contrastive divergence doesn’t follow the gradient of any function.

You can show this with a simple proof by contradiction. You first assume that the CD update is the gradient of some function, and then compute its Hessian. You can show that this Hessian is not symmetric (the mixed partial derivatives do not commute), which contradicts the original assumption.
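The symmetry test behind this argument can be illustrated numerically on toy vector fields. This is not the CD update itself, only a sketch of the criterion: a smooth vector field can be the gradient of a function only if its Jacobian (which would be the Hessian of that function) is symmetric. The two example fields and all function names are hypothetical.

```python
def jacobian(field, point, eps=1e-5):
    """Central finite-difference Jacobian: jac[i][j] = d field_i / d point_j."""
    n = len(point)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        plus, minus = list(point), list(point)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = field(plus), field(minus)
        for i in range(n):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

def is_symmetric(matrix, tol=1e-6):
    n = len(matrix)
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in range(n) for j in range(n))

def gradient_field(p):
    """Gradient of f(a, b) = a^2 * b, so its Jacobian is a genuine Hessian."""
    a, b = p
    return [2 * a * b, a * a]

def rotation_field(p):
    """A field that is not the gradient of any function."""
    a, b = p
    return [-b, a]

point = [0.7, -1.3]
gradient_ok = is_symmetric(jacobian(gradient_field, point))   # symmetric
rotation_ok = is_symmetric(jacobian(rotation_field, point))   # not symmetric
```

The proof in the reference plays the same game analytically with the expected CD update in place of `rotation_field`.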

The reference for this is:

http://www.cs.utoronto.ca/~ilya/pubs/2010/cd.pdf
