Today we will finish our overview of Machine Learning and dive into a detailed review of Neural Networks.
Please study the following material in preparation for the lecture:
- Hugo Larochelle’s video lectures 1.1 to 1.6.
- Chapter 6 of the Deep Learning textbook (sections 6.1 and 6.2)
Do not forget to leave questions / comments / answers.
Additional reference material:
- Hinton’s Coursera lecture 1, videos 1 to 5.
My understanding of the argument for unsupervised pre-training is that the data generating distribution shares properties with the class distribution conditional on the data. I was surprised to hear that unsupervised pre-training is no longer needed for classification.
What made this true? Based on Aaron’s comment yesterday, I understand that convolutional architectures, rectified linear units and more data were the main reasons. Is that it? Are there any more factors?
Does this only say that supervised training architectures and optimization have become much better, or does it also say something about how well unsupervised pre-training does?
Following today’s discussion in class, unsupervised pre-training is no longer needed given the right non-linearities (rectified linear units), a large amount of data, sufficient processing speed, the right regularization (dropout), and possibly other reasons that I forget. Aaron suggests that it is still useful with smaller datasets.
Sorry, in the first paragraph I should have written “… needed to train deep architectures” rather than “… needed for classification”.
A question about the neural networks that predict the means and covariances of a GMM (chapter 6, page 109 of the Deep Learning book): it is suggested that optimization is difficult because of the covariance matrix inversion. Couldn’t we simply learn the inverse covariance matrix directly? Has this been tried?
I think NICE (http://arxiv.org/pdf/1410.8516v3.pdf), from the LISA lab, actually does this or at least *can* do this. It would be good to confirm with Laurent Dinh. I think learning the inverse covariance would be just as difficult as (if not more so than) learning the covariance, since the inversion process is simply a procedural solution to the covariance, and is in a way a more constrained solution.
A lot of people try to get around this elsewhere in machine learning by working with the inverse covariance and then using things like low-rank approximations, Sherman-Morrison-Woodbury updates, etc. I wonder if you could learn low-rank updates for a particular inverse covariance rather than forcing it to be rank 1 and following the Sherman-Morrison-Woodbury formula.
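To make the low-rank idea concrete, here is a minimal NumPy sketch of a rank-k Woodbury update of a known inverse. The function name `woodbury_inverse` and the toy matrices are made up for illustration; nothing here is taken from NICE or from the book.

```python
import numpy as np

def woodbury_inverse(A_inv, U, V):
    """Return inv(A + U @ V.T) given A_inv = inv(A), with U and V of shape (d, k)."""
    k = U.shape[1]
    # Only a small k x k system is solved, instead of inverting a d x d matrix.
    small = np.eye(k) + V.T @ A_inv @ U
    return A_inv - A_inv @ U @ np.linalg.solve(small, V.T @ A_inv)

rng = np.random.default_rng(0)
d, k = 6, 2
A = np.diag(rng.uniform(1.0, 2.0, size=d))    # easy-to-invert diagonal base matrix
A_inv = np.diag(1.0 / np.diag(A))
U = 0.1 * rng.standard_normal((d, k))         # hypothetical learned low-rank factor

updated_inv = woodbury_inverse(A_inv, U, U)   # symmetric update A + U U^T
direct_inv = np.linalg.inv(A + U @ U.T)       # reference: direct inversion
max_error = np.max(np.abs(updated_inv - direct_inv))
```

With k = 1 this reduces to the Sherman-Morrison formula; letting k grow is one way to read the "learn low-rank updates" suggestion above.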
Kyle, thanks for the feedback. I will have a look at the paper you pointed to when I have time.
Following today’s class discussion, learning the inverse of the covariance matrix (i.e. the precision matrix) directly does not solve the numerical instability. The reason is that the gradient of a Gaussian with respect to its precision matrix involves the inverse of that precision matrix. This can be easily checked with the help of the matrix cookbook (see formula 49 in http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=3274).
To summarize, using the precision matrix instead of the covariance matrix removes the matrix inversion from the forward pass, but its gradient still requires computing an inverse, which is likely to be unstable.
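For anyone who wants to check this numerically, here is a small NumPy sketch. It writes the Gaussian log-density in terms of the precision matrix `P`, uses the standard identity that the derivative of log det(P) with respect to P is inv(P).T, and compares the resulting gradient with finite differences. The function names and the test matrix are my own choices for illustration.

```python
import numpy as np

def gaussian_logpdf_precision(x, mu, P):
    """Log-density of N(mu, inv(P)) at x; no inverse is needed here."""
    d = x.shape[0]
    r = x - mu
    _, logdet = np.linalg.slogdet(P)
    return -0.5 * d * np.log(2 * np.pi) + 0.5 * logdet - 0.5 * r @ P @ r

def grad_wrt_precision(x, mu, P):
    """Gradient of the log-density with respect to the entries of P."""
    r = x - mu
    # The inverse of P shows up here, coming from the log det(P) term.
    return 0.5 * np.linalg.inv(P).T - 0.5 * np.outer(r, r)

rng = np.random.default_rng(0)
d = 3
B = rng.standard_normal((d, d))
P = B @ B.T + d * np.eye(d)        # a well-conditioned positive definite precision
x = rng.standard_normal(d)
mu = rng.standard_normal(d)

analytic = grad_wrt_precision(x, mu, P)
numeric = np.zeros((d, d))
eps = 1e-6
for i in range(d):
    for j in range(d):
        E = np.zeros((d, d))
        E[i, j] = eps
        # Central finite difference on a single entry of P.
        numeric[i, j] = (gaussian_logpdf_precision(x, mu, P + E)
                         - gaussian_logpdf_precision(x, mu, P - E)) / (2 * eps)
grad_error = np.max(np.abs(analytic - numeric))
```

The forward pass only needs a log-determinant, while the gradient contains `np.linalg.inv(P)`, which is where a nearly singular `P` would cause trouble.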
This makes me wonder if there is any way to constrain or regularize the optimization so that the covariance matrix stays positive definite (and therefore invertible) throughout the optimization process.
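One common way to get positive definiteness by construction, offered only as a sketch and not as something covered in class, is to have the network output an unconstrained vector and map it to a lower-triangular Cholesky factor with a strictly positive diagonal. The helper name `covariance_from_unconstrained` and the `jitter` parameter are hypothetical.

```python
import numpy as np

def covariance_from_unconstrained(theta, d, jitter=1e-6):
    """Map d*(d+1)/2 unconstrained reals to a positive definite d x d matrix."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = theta          # fill the lower triangle
    diag = np.diag_indices(d)
    L[diag] = np.exp(L[diag])              # force a strictly positive diagonal
    # L @ L.T is positive definite when L has a positive diagonal.
    # The jitter is an extra safety margin (an assumption, not a requirement).
    return L @ L.T + jitter * np.eye(d)

rng = np.random.default_rng(0)
d = 4
theta = rng.standard_normal(d * (d + 1) // 2)   # stand-in for a network output
sigma = covariance_from_unconstrained(theta, d)
min_eigenvalue = np.linalg.eigvalsh(sigma).min()
```

This guarantees invertibility in exact arithmetic, but it does not by itself guarantee a well-conditioned matrix, so some regularization of the diagonal may still be needed.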
It was mentioned in class that imperceptibly perturbing the pixel values of an image in an adversarial way can cause a model to misclassify it. A natural question to ask is whether injecting a reasonable amount of noise into an adversarially perturbed image before feeding it into a susceptible model can rescue it from misclassification. A recent paper titled Explaining and Harnessing Adversarial Examples by Goodfellow et al. suggests the answer is no. They find that by adding to an input an imperceptibly small vector whose elements are equal to the sign of the elements of the gradient of the cost function with respect to the input, they can easily generate adversarial examples. By training on such examples they achieve a more regularized model with superior performance on the original task. Regularizing a model by training on merely noisy examples is, however, found to be very inefficient at reducing susceptibility to adversarial examples.
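Here is a minimal NumPy sketch of that sign-of-the-gradient perturbation on a toy logistic regression model, compared with random sign noise of the same size. The random weights, the dimension and `epsilon = 0.25` are arbitrary choices for illustration, not the paper's experimental setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_gradient_wrt_input(x, y, w):
    """Gradient of the logistic loss -log(sigmoid(y * w.x)) w.r.t. x, for y in {-1, +1}."""
    return -y * sigmoid(-y * (w @ x)) * w

def fast_gradient_sign(x, y, w, epsilon):
    """Move every input coordinate by epsilon in the direction that increases the loss."""
    return x + epsilon * np.sign(loss_gradient_wrt_input(x, y, w))

rng = np.random.default_rng(0)
d = 100
w = rng.standard_normal(d)             # stand-in for trained weights
x = rng.standard_normal(d)
y = 1.0 if w @ x > 0 else -1.0         # label chosen so the clean input is classified correctly
epsilon = 0.25

x_adv = fast_gradient_sign(x, y, w, epsilon)
x_noisy = x + epsilon * rng.choice([-1.0, 1.0], size=d)   # random signs, same max-norm

# Margin y * w.x: positive means correctly classified.
clean_margin = y * (w @ x)
adv_margin = y * (w @ x_adv)
noisy_margin = y * (w @ x_noisy)
```

For this linear model the adversarial margin equals the clean margin minus `epsilon` times the L1 norm of `w`, which is the worst case over all perturbations of that max-norm. Random noise of the same size mostly cancels out across coordinates, which gives some intuition for why training on noise alone is inefficient.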
In class we discussed deconvolutional views of network weights. For those interested in the implementation, there is a reasonable amount of discussion on the pylearn2 mailing list, seen [here](https://groups.google.com/forum/#!topic/pylearn-users/hbQJwM7iS3A). Some is pylearn2 specific, some is Theano specific.