In this lecture we will begin our discussion of unsupervised learning methods. In particular, we will study a kind of neural network known as the autoencoder.
Please study the following material in preparation for the class:
- Lecture 6 (6.1 to 6.7) of Hugo Larochelle’s course on Neural Networks.
- Chapter 10 of the Deep Learning Textbook.
Other relevant material:
- Lecture 15 (15a–15f) of Geoff Hinton’s Coursera course on Neural Networks.
Hugo Larochelle mentions that adding input noise is equivalent to adding weight decay, but the denoising autoencoder performs better than an autoencoder with weight decay. What is the reason for this? Also, the Jacobian with respect to the inputs is proportional to the weight matrix. Why is it different from weight decay?
Dima, the Jacobian used is the derivative of the loss function w.r.t. the input (e.g. the pixels of an image) and not the weights of the model. This is very different, and in fact I’m using a similar derivative for visualizing chess features (which I will talk more about on Friday…).
As we’ve already discussed in class, weight decay and adding (Gaussian) input noise are only equivalent in the case of a quadratic bowl (a second-order Taylor approximation). Probably something else is happening when we add a hidden layer with non-linear activation functions…
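For the purely linear case the equivalence is easy to check numerically: under Gaussian input noise with std sigma, the expected squared loss of a linear model equals the clean loss plus a sigma²·‖w‖² term, i.e. exactly a weight-decay penalty. A minimal sketch (all values hypothetical) comparing a Monte Carlo estimate against that closed form:

```python
import random

random.seed(0)

# Hypothetical toy linear model and data point (all values made up).
w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, -0.5]
y = 0.3
sigma = 0.1  # std of the Gaussian input noise

def sq_loss(xv):
    pred = sum(wi * xi for wi, xi in zip(w, xv))
    return (pred - y) ** 2

# Monte Carlo estimate of E[loss(x + noise)].
n = 200_000
mc = sum(
    sq_loss([xi + random.gauss(0.0, sigma) for xi in x])
    for _ in range(n)
) / n

# Closed form for a linear model: clean loss + sigma^2 * ||w||^2,
# i.e. an L2 (weight-decay) penalty with coefficient sigma^2.
closed = sq_loss(x) + sigma ** 2 * sum(wi ** 2 for wi in w)

print(mc, closed)  # the two values should agree closely
```

With non-linear hidden units this identity no longer holds exactly, which is presumably where the denoising autoencoder’s advantage comes from.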
Whoops, I was too fast there! The derivative (Jacobian) used for contractive autoencoders is the derivative of the output value of the hidden units w.r.t. the input. For a sigmoid unit in the first hidden layer this is its weight matrix times a function involving exponentials…
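Concretely, for a sigmoid layer h = σ(Wx + b) the Jacobian entries are ∂h_i/∂x_j = h_i(1−h_i)·W_ij — the sigmoid derivative h(1−h) is the “function involving exponentials”, so the Jacobian is the weight matrix scaled row-wise. A toy sketch with made-up numbers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tiny encoder: 3 sigmoid hidden units, 2 inputs (made-up weights).
W = [[0.3, -0.2], [0.5, 0.1], [-0.4, 0.8]]
b = [0.0, 0.1, -0.1]
x = [0.5, -1.0]

h = [sigmoid(sum(Wij * xj for Wij, xj in zip(Wi, x)) + bi)
     for Wi, bi in zip(W, b)]

# Jacobian of the hidden layer w.r.t. the input:
# dh_i/dx_j = h_i * (1 - h_i) * W_ij
# (h * (1 - h) is the sigmoid derivative, so J is the weight matrix
# with each row rescaled by that unit's local slope.)
J = [[hi * (1.0 - hi) * Wij for Wij in Wi] for hi, Wi in zip(h, W)]

# The contractive penalty is the squared Frobenius norm of J.
penalty = sum(Jij ** 2 for Ji in J for Jij in Ji)
print(penalty)
```

This makes the difference from weight decay visible: the penalty shrinks W only where the units are not saturated, rather than uniformly.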
I think what he means is that weight noise/weight decay/Jacobian regularization are all equivalent in the linear case, but this connection falls apart when you have the non-linearities. They might still be weakly related in the non-linear case, but I’m not sure what that relationship is.
Could we briefly discuss tied versus untied autoencoders, and how we should choose between them when designing a model?
I am also interested in this. Can we relate it to the approximation deconvnets are making?
For example, the assumption that the inverse matrix of the input weights W is approximately its transpose, if it is sparse enough and close to orthonormal (which it should be anyway in a completely linear autoencoder)? Or is it simply another weight-sharing scheme to improve the sample-size statistical properties of the model?
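For reference, “tied” just means the decoder reuses the transpose of the encoder matrix W instead of learning a separate matrix, which halves the parameter count. A minimal forward-pass sketch (toy sizes, made-up weights, no training loop):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tied-weight autoencoder: 2 hidden units, 3 inputs
# (made-up weights; only the forward pass is shown).
W = [[0.6, -0.1, 0.2],
     [0.3, 0.5, -0.4]]

def encode(x):
    # h = sigmoid(W x)
    return [sigmoid(sum(Wij * xj for Wij, xj in zip(Wi, x))) for Wi in W]

def decode(h):
    # Tied weights: the decoder reuses W^T. An untied model would learn
    # a separate matrix V here, roughly doubling the parameter count.
    return [sum(W[i][j] * h[i] for i in range(len(W)))
            for j in range(len(W[0]))]

x = [1.0, 0.0, -0.5]
x_hat = decode(encode(x))
print(x_hat)
```

The trade-off in practice: tying acts as a regularizer (fewer parameters, and it nudges W toward orthonormal rows so that W^T behaves like an approximate inverse), while untying gives the decoder more flexibility when you have enough data.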
1. Has the contractive penalty been applied to other models such as feedforward neural networks? It seems like a reasonable thing to do at first glance…
2. The book first states that sparse coding uses a non-parametric function to encode the input. That is, it can choose any input representation for each unique training example. Yet in “10.2.4 Sparse Coding as a Generative Model” it is clearly defined as a parametric model. Are the two forms different (or am I misunderstanding something?), and if so, what is the non-parametric form useful for / how could you use it to learn a representation for a problem you need to solve?
With respect to 1): the contractive penalty has been applied to supervised problems. In this recent paper they propose using it to make models more robust to adversarial examples. (The penalty really encourages smooth input–output mappings.)
http://arxiv.org/abs/1412.5068
This paper might be interesting for the topic:
Link: science.pdf
The following two papers by Roland might be interesting.
This one is on the zero-bias autoencoder, which also offers an interesting insight in suggesting that AEs learn a clustering of the data.
Link: 1402.3337v2.pdf
This one relates autoencoders to RBMs by showing that their energies are the same in a specific case.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6918504&tag=1
I thought section 10.2 of the DL book about factor analysis illustrates well why we want to keep the weights W from growing outrageously. Indeed, as the model weights increase, the covariance (WW^T + some term) will increase and the model pdf will become narrower and therefore not be able to represent unseen variations of the data.
Just realized a mistake here… This doesn’t make sense. Sorry for the useless comment.
In section 10.2.3 of the DL book, I don’t think I understand the argument suggesting that imposing independence among Gaussian factors does not allow one to disentangle them. I can see how this is true if the input has diagonal covariance, but not otherwise.
Are there ways to train deep unsupervised autoencoders other than greedy layer-wise pre-training?
Yes, see the post by Jose Sotelo above.