In this lecture we will begin our discussion of unsupervised learning methods. In particular, we will study a kind of neural network known as the autoencoder.
Please study the following material in preparation for the class:
- Lecture 6 (6.1 to 6.7) of Hugo Larochelle’s course on Neural Networks.
- Chapter 10 of the Deep Learning Textbook.
Other relevant material:
- Lecture 15 (15a–15f) of Geoff Hinton’s Coursera course on Neural Networks.
Hugo Larochelle mentions that adding input noise is equivalent to adding weight decay, but the denoising autoencoder performs better than an autoencoder with weight decay. What is the reason for this? Also, the Jacobian with respect to the inputs is proportional to the weight matrix. Why is it different from weight decay?
Dima, the Jacobian used is the derivative of the loss function w.r.t. the input (e.g. the pixels of an image) and not the weights of the model. This is very different, and in fact I’m using a similar derivative for visualizing chess features (which I will talk more about on Friday…).
As we’ve already discussed in class, weight decay and adding (Gaussian) input noise are only equivalent in the case of a quadratic bowl (a second-order Taylor approximation). Probably something else is happening when we add a hidden layer with non-linear activation functions…
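For the purely linear case the equivalence is easy to check numerically: under Gaussian input noise with std sigma, the expected squared loss of a linear model equals the clean loss plus a sigma²·‖w‖² term, i.e. exactly a weight-decay penalty. A minimal sketch (all values hypothetical) comparing a Monte Carlo estimate against that closed form:

```python
import random

random.seed(0)

# Hypothetical toy linear model and data point (all values made up).
w = [0.5, -1.0, 2.0]
x = [1.0, 0.2, -0.5]
y = 0.3
sigma = 0.1  # std of the Gaussian input noise

def sq_loss(xv):
    pred = sum(wi * xi for wi, xi in zip(w, xv))
    return (pred - y) ** 2

# Monte Carlo estimate of E[loss(x + noise)].
n = 200_000
mc = sum(
    sq_loss([xi + random.gauss(0.0, sigma) for xi in x])
    for _ in range(n)
) / n

# Closed form for a linear model: clean loss + sigma^2 * ||w||^2,
# i.e. an L2 (weight-decay) penalty with coefficient sigma^2.
closed = sq_loss(x) + sigma ** 2 * sum(wi ** 2 for wi in w)

print(mc, closed)  # the two values should agree closely
```

With non-linear hidden units this identity no longer holds exactly, which is presumably where the denoising autoencoder’s advantage comes from.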
Whoops, I was too fast there! The derivative (Jacobian) used for contractive autoencoders is the derivative of the output value of the hidden units w.r.t. the input. For a sigmoid unit in the first hidden layer this is its weight matrix times a function involving exponentials…
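Concretely, for a sigmoid layer h = σ(Wx + b) the Jacobian entries are ∂h_i/∂x_j = h_i(1−h_i)·W_ij — the sigmoid derivative h(1−h) is the “function involving exponentials”, so the Jacobian is the weight matrix scaled row-wise. A toy sketch with made-up numbers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tiny encoder: 3 sigmoid hidden units, 2 inputs (made-up weights).
W = [[0.3, -0.2], [0.5, 0.1], [-0.4, 0.8]]
b = [0.0, 0.1, -0.1]
x = [0.5, -1.0]

h = [sigmoid(sum(Wij * xj for Wij, xj in zip(Wi, x)) + bi)
     for Wi, bi in zip(W, b)]

# Jacobian of the hidden layer w.r.t. the input:
# dh_i/dx_j = h_i * (1 - h_i) * W_ij
# (h * (1 - h) is the sigmoid derivative, so J is the weight matrix
# with each row rescaled by that unit's local slope.)
J = [[hi * (1.0 - hi) * Wij for Wij in Wi] for hi, Wi in zip(h, W)]

# The contractive penalty is the squared Frobenius norm of J.
penalty = sum(Jij ** 2 for Ji in J for Jij in Ji)
print(penalty)
```

This makes the difference from weight decay visible: the penalty shrinks W only where the units are not saturated, rather than uniformly.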
I think what he means is that weight noise/weight decay/Jacobian regularization are all equivalent in the linear case, but this connection falls apart when you have the non-linearities. They might still be weakly related in the non-linear case, but I’m not sure what that relationship is.
Could we briefly discuss tied versus untied autoencoders, and how we should choose between them when designing a model?
I am also interested in this. Can we relate it to the approximation deconvnets are making?
For example, the assumption that the inverse matrix of the input weights W is approximately its transpose, if it is sparse enough and close to orthonormal (which it should be anyway in a completely linear autoencoder)? Or is it simply another weight-sharing scheme to improve the sample-size statistical properties of the model?
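For reference, “tied” just means the decoder reuses the transpose of the encoder matrix W instead of learning a separate matrix, which halves the parameter count. A minimal forward-pass sketch (toy sizes, made-up weights, no training loop):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical tied-weight autoencoder: 2 hidden units, 3 inputs
# (made-up weights; only the forward pass is shown).
W = [[0.6, -0.1, 0.2],
     [0.3, 0.5, -0.4]]

def encode(x):
    # h = sigmoid(W x)
    return [sigmoid(sum(Wij * xj for Wij, xj in zip(Wi, x))) for Wi in W]

def decode(h):
    # Tied weights: the decoder reuses W^T. An untied model would learn
    # a separate matrix V here, roughly doubling the parameter count.
    return [sum(W[i][j] * h[i] for i in range(len(W)))
            for j in range(len(W[0]))]

x = [1.0, 0.0, -0.5]
x_hat = decode(encode(x))
print(x_hat)
```

The trade-off in practice: tying acts as a regularizer (fewer parameters, and it nudges W toward orthonormal rows so that W^T behaves like an approximate inverse), while untying gives the decoder more flexibility when you have enough data.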
1. Has the contractive penalty been applied to other models such as feedforward neural networks? It seems like a reasonable thing to do at first glance…
2. The book first states that sparse coding uses a non-parametric function to encode the input. That is, it can choose any input representation for each unique training example. Yet in “10.2.4 Sparse Coding as a Generative Model” it is clearly defined as a parametric model. Are the two forms different (or am I misunderstanding something?), and if so, what is the non-parametric form useful for / how could you use it to learn a representation for a problem you need to solve?
With respect to 1): the contractive penalty has been applied to supervised problems. In this recent paper they propose using it to make models more robust to adversarial examples. (The penalty really encourages smooth input–output mappings.)
http://arxiv.org/abs/1412.5068
This paper might be interesting for the topic:
Link: science.pdf
The following two papers by Roland might be interesting.
This one is on the zero-bias autoencoder, which also offers an interesting insight in suggesting that AEs learn a clustering of the data.
Link: 1402.3337v2.pdf
This one relates autoencoders to RBMs by showing that their energies are the same in a specific case.
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6918504&tag=1
I thought section 10.2 of the DL book about factor analysis illustrates well why we want to keep the weights W from growing outrageously. Indeed, as the model weights increase, the covariance (WW^T + some term) will increase and the model pdf will become narrower and therefore not be able to represent unseen variations of the data.
Just realized a mistake here… This doesn’t make sense. Sorry for the useless comment.
In section 10.2.3 of the DL book, I don’t think I understand the argument suggesting that imposing independence among Gaussian factors does not allow one to disentangle them. I can see how this is true if the input has diagonal covariance, but not otherwise.
Are there ways to train deep unsupervised autoencoders other than greedy layer-wise pre-training?
Yes, see the post by Jose Sotelo above.