Lecture 7, Jan. 29th, 2015: Neural Networks, Odds and Ends

In this lecture, we will conclude our discussion of the standard multi-layer perceptron.

Please study the following material in preparation for the class:

Other relevant material:


32 Replies to “Lecture 7, Jan. 29th, 2015: Neural Networks, Odds and Ends”

  1. Fisher vectors tend to perform well when it comes to hand-crafted features, yet blow up the dimensionality of the original data. Does this not imply that the manifold hypothesis might not hold, or at least not in some settings? Is it simply that while data lies on a lower-dimensional manifold, such a representation is not immediately useful?

    In general, I tend to see quite a lot of work that depends on, or at least seems motivated by, the manifold hypothesis, as opposed to work that is agnostic to it or adopts a contradictory hypothesis (though it could be chance). Christopher Olah’s blog also hedges with “if you believe [in the manifold hypothesis]”.
    In “practice”, what is the other side of the coin exactly? (Surely someone has explored a more interesting idea than the trivial counter-hypothesis?) Are there particular cases that seem to truly contradict the manifold hypothesis?

    1. @jz: The first part of your question doesn’t necessarily imply that the manifold hypothesis fails. Those Fisher features are used for classification (I guess), so although the data lives on a low-dimensional manifold, discriminating between the classes on that manifold may require a very complicated decision surface. Projecting to a high-dimensional space makes it possible to use a simple (e.g. linear) decision surface.
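
      A minimal sketch of that last point, using a toy XOR-style dataset rather than Fisher vectors: no line separates the two classes in the original two dimensions, but after appending one product feature a single hyperplane does. The `lift` feature map and the coarse weight grid are assumptions made purely for illustration.

```python
# XOR-style labels: class +1 when x1 and x2 share a sign, class -1 otherwise.
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [1 if x1 * x2 > 0 else -1 for x1, x2 in points]


def linear_classifier(features, weights, bias=0.0):
    """Sign of a linear score: a hyperplane decision surface."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def lift(x1, x2):
    """Hypothetical feature map: append the product x1 * x2 as a third coordinate."""
    return (x1, x2, x1 * x2)


# Try every (w1, w2, b) on a coarse grid in the original 2-D space.
grid = [i / 2 for i in range(-6, 7)]
separable_in_2d = any(
    all(linear_classifier(p, (w1, w2), b) == y for p, y in zip(points, labels))
    for w1 in grid for w2 in grid for b in grid
)

# In the lifted 3-D space, the hyperplane with normal (0, 0, 1) is enough.
lifted_predictions = [
    linear_classifier(lift(x1, x2), (0.0, 0.0, 1.0)) for x1, x2 in points
]
```

      The grid search only samples a few candidate lines, but it agrees with the standard argument that XOR is not linearly separable in two dimensions.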


      1. It is not necessarily true that a low-dimensional manifold corresponds to a more complex decision surface. In fact, consider a 784-200-10 MLP on MNIST: it maps the 784-dimensional input space to a lower, 200-dimensional hidden space (imagine that the manifold lies in that hidden space). I bet that this MLP will perform better than a classifier layer applied directly to the 784-dimensional input. In other words, this MLP classifies better on a lower-dimensional representation, which means that the classifier layer (softmax, linear, etc.) has a simpler decision surface to learn in the lower dimension.

        Here also comes the side-question: How does one define the “dimension” of the “manifold”?


  2. A historical question: why were the no free lunch theorems for machine learning discovered so late (at the beginning of the 90’s)? If we say that machine learning really started in the 60’s, this means we did 30 years of research before we finally understood that there are fundamental limitations to learning algorithms?

    1. Well, in fairness, the no free lunch theorem seems like quite a difficult thing to formalize. I found a small site that talks a little more about it and distinguishes between the no free lunch theorems for optimization and for supervised learning.


      It also points out that Hume believed something similar in the 18th century, but it seems that no one was able to prove it back then.

  3. Christopher Olah’s blog talks about using a K-NN cost for neural networks, which seems like a pretty interesting approach for semi-supervised learning. Yann LeCun also discusses training a Parzen window on top of a trained neural network to categorize new classes of images unseen during training. Have there been other approaches like this which take something like, say, label propagation, and try to use it for a cost when very little labeled data is available?

    Yann LeCun’s webinar can be found here – http://on-demand-gtc.gputechconf.com/gtcnew/on-demand-gtc.php?searchByKeyword=GTC+Express&searchItems=&sessionTopic=&sessionEvent=&sessionYear=&sessionFormat=&submit=&select=+

  4. As the book says (page 113): *It can be readily proven that the back propagation algorithm has optimal computational complexity in the sense that there is no algorithm that can compute the gradient faster*. However, is the algorithm efficient enough? I found an interesting paper named “Back Propagation Is Not Efficient”, which was published in 1996 (link is here: http://citeseerx.ist.psu.edu/viewdoc/summary?doi= ). It says that training even a three-node sigmoid network is NP-hard. I’m not sure whether there have been any updates on this topic in recent years. Does anyone have any idea?

    1. This doesn’t really answer your question, but I think the word efficient is being used differently between the two things you linked. In the back propagation case, I think it just refers to computing the gradient with minimal computation. This would be a function of network configuration.

      In “Back Propagation Is Not Efficient”, they are dealing with the decision problem of whether or not, for some training input, there exist weights that satisfy what they call the output separation condition. Efficient in that case refers to whether that decision problem can be decided in polynomial time.

    2. Yes, Yoshua, Aaron and Ian should certainly clarify what they mean by “efficient” in their book. I guess their definition of efficiency is related to how much we can decrease the loss function (e.g. the cross-entropy error) for a certain number of arithmetic operations. Nesterov has some nice proofs in this direction.

      I only skimmed the paper, but it seems their definition of efficiency is related to how close we are to an / the optimal solution? This is a very different problem…


  5. And I have a question about the paper Aaron mentioned last Monday (“Understanding the difficulty of training deep feedforward neural networks”).
    In section 3.1 (page 251), the explanation of figure 2 doesn’t really convince me. The authors propose: *The logistic layer output softmax(b + W h) might initially rely more on its biases b (which are learned very quickly) than on the top hidden activations h derived from the input image (because h would vary in ways that are not predictive of y, maybe correlated mostly with other and possibly more dominant variations of x). Thus the error gradient would tend to push Wh towards 0, which can be achieved by pushing h towards 0.* The explanation seems very intuitive, but how could the top layer actually succeed in pushing h towards 0 (how does it learn to do that)? It seems quite a vague argument to account for the beautiful experimental figure shown in the paper, at least from my perspective.
    Is there a more rigorous explanation for the experimental result? Do you have any thoughts?


      1. I meant something like having a layer where half of the activations are sigmoid and half are tanh. This seems to be useful, for example, for RNNs, where you could leave some neurons linear (with no activation), perhaps even with some of the weights fixed, so that the gradient flows easily and the ‘vanishing gradient’ problem is avoided. This idea was proposed to me today by Mohammed.

        Another option would be to have the activation function be learned by the training algorithm, but that sounds a little bit more complicated, especially since this would be constrained to the case where there’s a continuity measure between the different activation functions.

        1. In RNNs, ‘vanishing gradients’ come from the extreme depth in the temporal direction.
          If one uses a linear layer, yes, you may avoid vanishing gradients, but you will get exploding gradients. Having shortcut connections (not necessarily linear) in a deep RNN structure helps a lot with the optimization problem.
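
          As a rough illustration of the point about depth in time, assume a scalar linear recurrence h_t = w * h_{t-1} (a hypothetical simplification, not a full RNN). The gradient of the final state with respect to the initial state is then w raised to the number of time steps, so it shrinks or grows geometrically depending on whether |w| is below or above 1.

```python
def gradient_through_time(recurrent_weight, num_steps):
    """|d h_T / d h_0| for the scalar linear recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(num_steps):
        grad *= recurrent_weight  # chain rule: each time step multiplies by w
    return abs(grad)


# Assumed toy settings: 100 time steps and three recurrent weights around 1.
vanishing = gradient_through_time(0.9, 100)
stable = gradient_through_time(1.0, 100)
exploding = gradient_through_time(1.1, 100)
```

          In this toy setting only w = 1 keeps the gradient magnitude constant, which is one way to see why removing the nonlinearity alone does not settle the issue.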

  6. In video 2.10, Hugo Larochelle talks about model selection and describes two search modes: grid search and random search.

    Can’t we use something smarter than random or grid search?

    In random or grid search, once we have tried the first n configurations, we do not take their results into account when we process configuration n+1, which seems suboptimal… Maybe we can already see that we are definitely not going in the right direction, but we will still compute them.

    If we assume that for each hyper-parameter the associated curve has only one local minimum, we could use dichotomic search to find the optimal value.

    Am I missing something?
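
    A minimal sketch of the dichotomic idea proposed above, written here as a ternary search over a single hyper-parameter. It is only valid under the stated assumption of a single local minimum, and `toy_validation_loss` is a hypothetical stand-in for an expensive training run.

```python
def ternary_search(loss, low, high, tolerance=1e-6):
    """Minimize a function assumed to have one local minimum on [low, high]."""
    while high - low > tolerance:
        third = (high - low) / 3.0
        left, right = low + third, high - third
        if loss(left) < loss(right):
            high = right  # under the assumption, the minimum is not in (right, high]
        else:
            low = left    # under the assumption, the minimum is not in [low, left)
    return (low + high) / 2.0


def toy_validation_loss(log10_learning_rate):
    # Hypothetical unimodal curve, minimized at log10(learning rate) = -2.
    return (log10_learning_rate + 2.0) ** 2


best_log10_learning_rate = ternary_search(toy_validation_loss, -6.0, 0.0)
```

    Each iteration costs two evaluations and discards a third of the interval. With noisy validation losses or several interacting hyper-parameters the unimodality assumption is much harder to justify.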

    1. The assumption that each hyper-parameter has only one local minimum is not very reasonable.

      And even if that were true for each hyper-parameter taken individually, I think it’s possible to construct a case where it isn’t true for all of them taken together. Basically, the assumption amounts to saying that if a function is convex in each variable separately (fixing the other values), then the function is convex. And I don’t think that’s true.
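
      One standard counterexample, sketched here for illustration: f(x, y) = x * y is linear, hence convex, in each variable when the other is held fixed, yet it is not jointly convex. The check below evaluates the convexity inequality at the midpoint of two hand-picked points.

```python
def f(x, y):
    """Linear (hence convex) in x for fixed y, and in y for fixed x."""
    return x * y


point_a = (1.0, -1.0)
point_b = (-1.0, 1.0)
midpoint = ((point_a[0] + point_b[0]) / 2.0, (point_a[1] + point_b[1]) / 2.0)

# Joint convexity would require f(midpoint) <= average of f at the endpoints.
value_at_midpoint = f(*midpoint)
average_of_endpoints = (f(*point_a) + f(*point_b)) / 2.0
violates_joint_convexity = value_at_midpoint > average_of_endpoints
```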

      That being said, there are other ways to search for hyper-parameters, and Kevin Swersky has a startup that deals with that. You train your model on your own, but you send the hyper-parameters to their server, along with the loss value obtained, and the server suggests the next hyper-parameters to try. They’re in beta, though.

      Some people also fit Gaussian processes to the hyper-parameter space to decide which areas should be explored next. In those cases, the n-th configuration makes good use of all the other configurations tried so far.

  7. Hyper-parameter optimization is a pretty big area of research. So to answer your question, people are actively trying to do something smarter.

    One of the problems is that often we have so many hyperparameters that simple intuition about what should happen is completely useless.

    In some ways, in my opinion this makes hyper-parameter optimization a good candidate for more principled approaches similar to what you suggest (except more formal). Recently, there has been a bit of excitement and research around Bayesian Optimization for machine learning.

    http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf (Not coincidentally, one of the authors is Hugo Larochelle, who is one of the biggest researchers on this problem)

    The method they use in that paper is based on Gaussian processes, which are a very different tool for machine learning. The intuition I have as to why they use this approach is that you can compute marginals and conditionals very easily with Gaussians, and also that the function we are trying to optimize is assumed to be smooth, which you can encode in your covariance kernel to give you “likely functions”.

    I don’t know too much in depth about the topic, but I do recommend lectures by Nando de Freitas on the basics of Gaussian Processes if you are interested:

  8. – In Chapter 6.6 it is mentioned that rectified linear units are simpler and may behave well with small sample sizes. Is that because, being a less complex function, they need fewer samples to converge with SGD?

    – It is also mentioned that a rectified linear unit cannot learn when its activation is zero, and it is proposed to initialize the biases to a small positive number in order to make sure that all hidden units can learn at the beginning. An arbitrary positive number for the bias does not guarantee that all units will be able to learn at initialization. Would it make sense to do a forward prop through all hidden layers and set each bias so that the activation is on the positive side?

    – Could you explain why Maxout units do not have this problem?
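
    A small sketch of the two behaviours asked about above, under assumed toy values (one scalar input, unit weight, bias 0.1, and a two-piece maxout unit whose second piece is hypothetical). A rectifier passes no gradient when its pre-activation is negative, even with a small positive bias, whereas a maxout unit always passes the gradient to whichever linear piece is currently the largest.

```python
def relu_grad(pre_activation):
    """Derivative of max(0, z) with respect to z (taken as 0 at z = 0)."""
    return 1.0 if pre_activation > 0 else 0.0


def maxout_grad(pre_activations):
    """Derivative of max_k z_k w.r.t. each z_k: 1 for the winning piece, else 0."""
    winner = max(range(len(pre_activations)), key=lambda k: pre_activations[k])
    return [1.0 if k == winner else 0.0 for k in range(len(pre_activations))]


inputs = [-2.0, -0.5, 1.0]
weight, small_bias = 1.0, 0.1

# The small positive bias does not rescue the two negative inputs.
relu_grads = [relu_grad(weight * x + small_bias) for x in inputs]

# Two linear pieces per maxout unit; exactly one receives gradient per input.
maxout_grads = [maxout_grad([weight * x + small_bias, -weight * x]) for x in inputs]
```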


  9. Two questions:

    1- Regarding regularization, a quick look around the Internet led me to believe L1 regularizers tend to be preferred because they produce sparse weight matrices. Is this right, and in what situation would an L2 regularizer be used?

    2- Regarding hyperparameter selection, is there a rule of thumb to bound the initial search space? I.e. how do I decide what a “reasonable” range would be for any given hyperparameter. I’m thinking specifically of the number of hidden layers and the number of neurons per hidden layer.
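
    The sparsity claim in the first question can be illustrated with a sketch that applies only the penalty, with no data term (an assumption made to isolate the effect). The L2 update is a plain gradient step on (strength / 2) * w^2, and the L1 update is the soft-thresholding (proximal) step for strength * |w|. The step size, strength and initial weights are arbitrary illustrative values.

```python
def l2_step(weight, step_size, strength):
    """Gradient step on (strength / 2) * w**2: multiplicative shrinkage."""
    return weight - step_size * strength * weight


def l1_step(weight, step_size, strength):
    """Soft-thresholding step for strength * |w|: small weights become exactly 0."""
    threshold = step_size * strength
    if weight > threshold:
        return weight - threshold
    if weight < -threshold:
        return weight + threshold
    return 0.0


initial_weights = [1.0, 0.05, -0.02]
l1_weights = list(initial_weights)
l2_weights = list(initial_weights)
for _ in range(50):
    l1_weights = [l1_step(w, 0.1, 0.1) for w in l1_weights]
    l2_weights = [l2_step(w, 0.1, 0.1) for w in l2_weights]
```

    Under these assumptions the L2 penalty scales every weight by the same factor and never reaches zero, while the L1 penalty removes a fixed amount per step and zeroes out the small weights.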


    1. My 2 pennies for your question:

      1. It is quite common to use early stopping when training NNets. There is a connection between L2 penalization and early stopping. In contrast, the kind of regularization induced by L1 penalization is quite different (it tends to produce sparse results).

      This behaviour is mentioned here:

      The original reference is this paper:

      2. I would say that this depends on the context of the problem. However, a smart approach is to choose large search spaces (I know there’s a little bit of ambiguity here) and to search in log-space. Furthermore, you can refine your search after you find a good region of the hyper-parameter space.

      In the specific case of the number of neurons per hidden layer, it seems to be better if all the layers have the same number of units. Furthermore, because there are powerful regularization methods, it seems that the problem is more likely to lie in not having enough capacity. These two conclusions are also mentioned in Yoshua’s paper that I linked before.
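
      The suggestion above to use a wide range and search in log-space could look roughly like the following random search. The ranges, the hyper-parameter names and `toy_loss` are all hypothetical placeholders, not recommendations.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def random_search(loss, num_trials, rng):
    """Keep the best configuration among independently sampled ones."""
    best_config, best_loss = None, float("inf")
    for _ in range(num_trials):
        config = {
            "learning_rate": sample_log_uniform(rng, 1e-5, 1.0),
            "num_hidden": int(round(sample_log_uniform(rng, 16, 2048))),
        }
        value = loss(config)
        if value < best_loss:
            best_config, best_loss = config, value
    return best_config, best_loss


def toy_loss(config):
    # Hypothetical stand-in for a validation error, not a real training run.
    lr_term = (math.log10(config["learning_rate"]) + 2.0) ** 2
    size_term = (math.log2(config["num_hidden"]) - 8.0) ** 2 / 10.0
    return lr_term + size_term


rng = random.Random(0)
best_config, best_loss = random_search(toy_loss, 200, rng)
```

      Sampling in log-space spends as many trials between 1e-5 and 1e-4 as between 0.1 and 1, which a uniform draw over the raw range would not do. A refinement pass could rerun the search on a narrower range around `best_config`.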


  10. In colah’s blog entry, I do not understand what the first graph of the section “Continuous Visualization of Layers” represents. What does he mean by “The tricky part is in understanding how we go from one to another.”?


  11. How can topology help us understand disentanglement in neural networks?

    More specifically, in his blog Christopher discusses how neural networks attempt to untangle non-trivial knots. He mentions that neural networks will try to stretch these areas a lot and that “contractive penalties, penalizing the derivatives of the layers at data points, are the natural way to fight this.” How do we reason that penalizing large derivatives fights this? If we just stop stretching them out, the entanglement will remain at that layer. In that case, is the hope then that higher layers (with more hidden units and thus a non-singular weight matrix, which therefore according to Christopher also cannot be an ambient isotopy) untangle them? If so, should the contractive penalty not be restricted mainly to the lower layers of the neural network?

    And how does that fit together with the many papers which successfully use neural network architectures of the form: input layer => large layer => medium-size layer => small layer => output layer? Doesn’t this mean that disentanglement (at least of non-trivial knots) only happens in the first layer? Assuming the manifold hypothesis is true (something that would also be nice to discuss in class), does this then imply that the low-dimensional manifold found by these models did not disentangle any non-trivial knots?


  12. Let’s consider a classification problem that we want to solve with a neural network. In order to be able to use the negative log-likelihood cost function, the network has to model P(Y|X) and thus, the outputs have to be positive and sum to one. A softmax activation function is commonly used for this purpose. But there are other possible functions of the following form: f(z_k) = g(z_k) / (sum g(z_i)), where g is an arbitrary positive function. For the softmax, g(x) = exp(x). Do you know if any other functions g have been investigated?

    g(x) = exp(x) has the property of strongly boosting the highest output relative to the others. I guess this might be what we want when the targets are one-hot coded. Why not go even further and consider, say, g(x) = exp(exp(x))? If the targets are probabilistic, we might consider a smoother function like g(x) = x^2.
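
    To make the family f(z_k) = g(z_k) / (sum g(z_i)) concrete, here is a small sketch that plugs the three choices of g mentioned above into one normalizer. The score vector is an arbitrary example, and the helper name `normalize` is made up for this illustration.

```python
import math


def normalize(scores, g):
    """f(z_k) = g(z_k) / sum_i g(z_i) for a non-negative function g."""
    values = [g(z) for z in scores]
    total = sum(values)
    return [v / total for v in values]


scores = [2.0, 1.0, 0.0]
softmax_probs = normalize(scores, math.exp)                            # g(x) = exp(x)
double_exp_probs = normalize(scores, lambda z: math.exp(math.exp(z)))  # g(x) = exp(exp(x))
square_probs = normalize(scores, lambda z: z * z)                      # g(x) = x^2
```

    On this example the double exponential puts more mass on the largest score than the softmax does. Note also that g(x) = x^2 is not monotone, so a score of -2 would receive the same weight as a score of +2, and exp(exp(x)) overflows floating point for moderately large x.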

    1. It seems that exp(exp(x)) suffers from vanishing gradients, at least in some possible settings:
      For g(x) = exp(exp(x)), the gradient of the NLL w.r.t. a_i, where
      P_i = exp(exp(a_i)) / sum_j exp(exp(a_j)), is:
      d(-log P_y)/d a_i = P_i * exp(a_i) - exp(a_y) * 1_(i=y)

      Consider a scenario where the network strongly predicts a wrong class, e.g. the real output should be [0, 1] and the network predicts pre-activation a = b + Wh = [+3, -3] (assuming a_i can take large negative/positive values). The gradient w.r.t. a_1 is almost zero [= exp(a_1) * (P_1 - 1) ~= -exp(a_1)], while the gradient for a_0 is large (here ~20).

      In contrast, for g(x) = exp(x) we have P_i = exp(a_i) / sum_j exp(a_j), and the gradient of the NLL w.r.t. a_i is d(-log P_y)/d a_i = P_i - 1_(i=y), which does not have this issue.
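
      The formulas above can be checked numerically on the same example, a = [+3, -3] with true class y = 1. The sketch below compares the analytic gradient for g(x) = exp(exp(x)) against a central finite difference and evaluates the softmax gradient alongside it. The function names are invented for this illustration.

```python
import math


def nll_double_exp(a, y):
    """-log P_y with P_i = exp(exp(a_i)) / sum_j exp(exp(a_j))."""
    log_z = math.log(sum(math.exp(math.exp(a_j)) for a_j in a))
    return -(math.exp(a[y]) - log_z)


def analytic_grad_double_exp(a, y):
    """P_i * exp(a_i) - exp(a_y) * 1_(i=y), as in the formula above."""
    z = sum(math.exp(math.exp(a_j)) for a_j in a)
    probs = [math.exp(math.exp(a_j)) / z for a_j in a]
    return [
        probs[i] * math.exp(a[i]) - (math.exp(a[y]) if i == y else 0.0)
        for i in range(len(a))
    ]


def analytic_grad_softmax(a, y):
    """P_i - 1_(i=y) for the ordinary softmax."""
    z = sum(math.exp(a_j) for a_j in a)
    return [math.exp(a[i]) / z - (1.0 if i == y else 0.0) for i in range(len(a))]


def numeric_grad(f, a, y, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    grads = []
    for i in range(len(a)):
        up, down = list(a), list(a)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up, y) - f(down, y)) / (2.0 * eps))
    return grads


a, y = [3.0, -3.0], 1
grad_double_exp = analytic_grad_double_exp(a, y)
grad_numeric = numeric_grad(nll_double_exp, a, y)
grad_softmax = analytic_grad_softmax(a, y)
```

      For the true class, the double-exponential gradient has magnitude close to exp(-3), while the softmax gradient has magnitude close to 1, which matches the argument above.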


  13. In the paper “Understanding the difficulty of training deep feedforward neural networks”, the intuition for ‘normalized initialization’ is to keep information flowing in forward and backward propagation (eqs. 8, 9). But what was the intuition behind the standard initialization method?
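
    For reference, the two schemes compared in that paper can be written down as uniform bounds: the standard heuristic draws W from U[-1/sqrt(n), 1/sqrt(n)] with n the size of the previous layer, and the normalized initialization uses U[-sqrt(6 / (n_in + n_out)), sqrt(6 / (n_in + n_out))]. The sketch below only compares the resulting quantity n * Var(W). The layer sizes are arbitrary examples and the function names are made up for this illustration.

```python
import math


def standard_init_bound(fan_in):
    """Standard heuristic: W ~ U[-1/sqrt(fan_in), 1/sqrt(fan_in)]."""
    return 1.0 / math.sqrt(fan_in)


def normalized_init_bound(fan_in, fan_out):
    """Normalized initialization: bound = sqrt(6 / (fan_in + fan_out))."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def uniform_variance(bound):
    """Variance of the uniform distribution on [-bound, bound]."""
    return bound ** 2 / 3.0


fan_in = fan_out = 500  # assumed equal layer sizes, for illustration only
standard_scale = fan_in * uniform_variance(standard_init_bound(fan_in))
normalized_scale = fan_in * uniform_variance(normalized_init_bound(fan_in, fan_out))
```

    With equal layer sizes the standard heuristic gives n * Var(W) = 1/3 and the normalized scheme gives 1, which is the quantity the paper's variance argument is concerned with.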

