Lecture 20, March 26th, 2015: The Variational Autoencoder

In this lecture we will discuss Variational Autoencoders.

Please study the following material in preparation for class:

Other relevant material:

7 Replies to “Lecture 20, March 26th, 2015: The Variational Autoencoder”

  1. Can we go over the semi-supervised VAE M2 model in some detail? It seems you pay a penalty (section 3.3) that grows with the number of classes. Could you get around this in practice if you wanted samples from every class?

    It would also be good to talk about how sigma is parameterized for numerical stability: exp(log_sigma) vs. softplus vs. others. It is hard to see what the practical difference is between applying a softplus and parameterizing log_sigma directly (see the sketch below).

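    For concreteness, here is a minimal NumPy sketch (not from either paper; the variable names are illustrative) of the two ways of producing a positive sigma from an unconstrained encoder output:

        import numpy as np

        def softplus(a):
            # numerically stable log(1 + exp(a))
            return np.logaddexp(0.0, a)

        # unconstrained "scale" output of the encoder for q(z|x);
        # in a real model this would come from a neural network
        raw = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])

        # Option 1: treat the output as log_sigma and exponentiate.
        # Always positive, and the KL term can use log_sigma directly,
        # but large positive outputs blow up quickly (exp(5) ~ 148).
        sigma_exp = np.exp(raw)

        # Option 2: pass the output through a softplus.
        # Also positive, but it grows only linearly for large inputs,
        # which is sometimes preferred for stability.
        sigma_softplus = softplus(raw)

        print(sigma_exp)
        print(sigma_softplus)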

  2. In the VAE paper, they compared the VAE to Monte Carlo EM (a variant from 1987!) in figure 3. There, Monte Carlo EM clearly seems to perform better than the VAE for a small number of training examples, and (judging from the rising blue lines in the right plot) perhaps also in the case of many training examples. However, it is well known that online EM usually works even better than batch EM on large datasets… So, has the VAE ever been compared to other online variational EM algorithms?

    Aaron, I guess you plan to do this anyway, but it would be nice to discuss the shortcomings of VAEs as well. The VAE paper doesn’t really touch on that.


  3. The DeepMind paper (Rezende et al.) seems to obtain worse samples on MNIST than Kingma et al. in “Auto-Encoding Variational Bayes”. I can’t quite tell whether that is because they use the binarized version of MNIST, or because each layer of the deep latent Gaussian model includes Gaussian noise. Could it be the latter, given that their NORB samples are also quite blurry?


  4. Section 2.1 of Kingma & Welling’s paper mentions that they “do not make the common simplifying assumptions about the marginal or posterior”. Which assumptions do they mean?


  5. Could you give us some insight into how to choose g(.) when q(z|x) doesn’t fit any of the approaches listed in section 2.4? Also, in practice, how is q(z|x) chosen for a case less obvious than a Gaussian? (See the sketch below.)

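    For the Gaussian case the comment mentions, section 2.4’s recipe is the usual reparameterization trick; here is a minimal NumPy sketch (illustrative names, not code from the paper):

        import numpy as np

        rng = np.random.default_rng(0)

        # Illustrative encoder outputs for one data point x;
        # in a real model mu(x) and log_sigma(x) come from a network.
        mu = np.array([0.5, -1.0])
        log_sigma = np.array([-0.2, 0.3])

        # Diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)):
        # sample parameter-free noise and push it through the
        # deterministic, differentiable map g(eps, x) = mu + sigma * eps,
        # so that gradients can flow back into mu and log_sigma.
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(log_sigma) * eps

        # For a non-Gaussian q(z|x), section 2.4 of the paper suggests
        # e.g. an inverse-CDF transform of uniform noise, other
        # location-scale families, or compositions of such maps;
        # the pattern is the same: z = g(eps, x) with a fixed p(eps).
        print(z)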
