Lecture 13, Feb. 19th, 2015: Optimization II

In this lecture, we will discuss second-order optimization methods for neural networks (and deep models in general).

Please study the following material in preparation for the class:


10 thoughts on “Lecture 13, Feb. 19th, 2015: Optimization II”

  1. First, thank you for the link; it is a very nice read about many optimization methods.

    All these fancy methods suffer from the same basic problem: the function we optimize is not really quadratic (and not even convex). For example, a second-order method that assumes a quadratic can behave like SGD with a learning rate temporarily set too high, because it might jump far away from the closest local minimum.
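
To make the overshoot concrete, here is a minimal one-dimensional sketch. The test function f(x) = sqrt(1 + x^2) and the names `newton_step` and `gd_step` are illustrative assumptions, not taken from the lecture. The function is convex with its minimum at 0, but its curvature vanishes far from the minimum, so the local quadratic model is a poor fit there.

```python
import math

def f_prime(x):
    # First derivative of f(x) = sqrt(1 + x^2)
    return x / math.sqrt(1.0 + x * x)

def f_second(x):
    # Second derivative of f(x) = sqrt(1 + x^2)
    return (1.0 + x * x) ** -1.5

def newton_step(x):
    # Jump to the minimum of the local quadratic model.
    # For this f the update simplifies analytically to -x**3.
    return x - f_prime(x) / f_second(x)

def gd_step(x, lr=0.5):
    # Plain gradient step with a fixed (assumed) learning rate
    return x - lr * f_prime(x)

x0 = 2.0
x_newton = newton_step(x0)  # lands farther from the minimum than x0
x_gd = gd_step(x0)          # moves closer to the minimum
```

Starting from |x| > 1, each Newton step in this example increases the distance to the minimum, which is the "too-high learning rate" behavior described above; starting from |x| < 1, it converges very quickly.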

    Could you please post the link for the research about the shape of these high-dimensional functions? I think it could be very helpful to understand what we really deal with as an optimization problem.



  2. Q1) What do Hessian-free optimization methods have that makes them suitable for training RNNs?

    Q2) We already talked about deep nets and the associated training difficulties (vanishing gradient, etc.). I am curious how the dynamics of wide nets work. Right now I can think of:
    1. Greater fan-in, hence initialization proportional to (fan-in)^-0.5, so that larger fan-in gives smaller weights (Hinton’s Coursera lecture 6), or initializing weights with U(-sqrt(6/(n1+n2)), sqrt(6/(n1+n2))).
    2. A wider net means it is easier to untangle the n-dimensional manifold. (All n-dimensional manifolds can be untangled in 2n+2 dimensions. – Christopher Olah’s blog – NN Manifolds Topology)
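
The two initialization schemes in point 1 can be sketched as follows. This is a minimal illustration using only the standard library; the function names `glorot_uniform` and `fan_in_scale` are hypothetical, `n_in` and `n_out` stand for n1 and n2, and the fan-in rule is assumed to mean a scale of 1/sqrt(fan-in), so that larger fan-in gives smaller weights.

```python
import math
import random

def glorot_uniform(n_in, n_out, rng=None):
    # Sample an n_in x n_out weight matrix from
    # U(-sqrt(6/(n_in+n_out)), sqrt(6/(n_in+n_out))).
    rng = rng or random.Random(0)  # fixed seed only for reproducibility
    limit = math.sqrt(6.0 / (n_in + n_out))
    return [[rng.uniform(-limit, limit) for _ in range(n_out)]
            for _ in range(n_in)]

def fan_in_scale(n_in):
    # Assumed reading of the fan-in rule: scale weights by 1/sqrt(fan-in)
    return 1.0 / math.sqrt(n_in)

W = glorot_uniform(4, 3)
narrow_scale = fan_in_scale(100)
wide_scale = fan_in_scale(10000)  # wider layer, smaller initial weights
```

Under both rules, widening a layer shrinks the range of the initial weights, which keeps the variance of each unit's summed input roughly constant.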


    1. Q1- “There is no clear theoretical justification why using the Hessian matrix would help.” from Mikolov, Tomas, et al. “Learning Longer Memory in Recurrent Neural Networks.” arXiv preprint arXiv:1412.7753 (2014).


  3. It seems LeCun disagrees with Harm, because he writes “If the Hessian is not positive definite (…) then the Newton algorithm will diverge…”. But that is not quite true: near a saddle point, a step that uses the inverse of the Hessian moves you toward the saddle point.
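
A small worked example of this point, assuming the textbook saddle f(x, y) = x^2 - y^2 (the function and the names `newton_step` and `gd_step` are illustrative choices, not from the lecture). Its Hessian is diag(2, -2), which is not positive definite, yet the Newton step does not diverge: it jumps straight to the saddle at the origin.

```python
def grad(x, y):
    # Gradient of f(x, y) = x^2 - y^2
    return (2.0 * x, -2.0 * y)

def newton_step(x, y):
    # The Hessian is diag(2, -2), so its inverse is diag(1/2, -1/2).
    gx, gy = grad(x, y)
    return (x - gx / 2.0, y - gy / (-2.0))

def gd_step(x, y, lr=0.1):
    # Plain gradient descent with an assumed learning rate
    gx, gy = grad(x, y)
    return (x - lr * gx, y - lr * gy)

start = (1.0, 0.5)
after_newton = newton_step(*start)  # the saddle point itself
after_gd = gd_step(*start)          # |y| grows: escapes along y
```

Gradient descent moves away from the saddle along the negative-curvature direction y, while the Newton step treats the saddle as the stationary point of its quadratic model and is attracted to it.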


  4. Why is it that CG and L-BFGS need the whole batch? Isn’t it possible to derive a mini-batch version of these methods? More generally, is there any way to adapt a line search procedure to mini-batches?


  5. I would also like more clarification regarding Hessian-free (HF) optimization.

    Do we go all the way down along one axis of the curvature, over multiple iterations, modifying only one weight at a time?

    It seems to me that the choice of the first weight is very important, since it will influence the rest of the procedure… so how do we choose it?
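
For reference, Hessian-free methods are usually described as running conjugate gradient (CG) on the local quadratic model, using only Hessian-vector products. The sketch below is a minimal illustration under that assumption, on a hypothetical 2x2 positive-definite Hessian `H` and gradient `g`; the helper names are made up for the example, and in practice the product `hvp(v)` would be computed without ever forming `H`.

```python
def matvec(H, v):
    # Explicit matrix-vector product; stands in for a Hessian-vector product
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(hvp, b, iters, tol=1e-12):
    # Approximately solve H x = b using only products hvp(v) = H v.
    x = [0.0] * len(b)
    r = list(b)   # residual b - H x, with x = 0
    p = list(r)   # first search direction: a full vector, not one weight
    rs = dot(r, r)
    for _ in range(iters):
        if rs < tol:
            break
        Hp = hvp(p)
        alpha = rs / dot(p, Hp)  # exact minimizer along direction p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, Hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

H = [[4.0, 1.0], [1.0, 3.0]]  # hypothetical positive-definite Hessian
g = [1.0, 2.0]                # hypothetical gradient
step = conjugate_gradient(lambda v: matvec(H, v), [-gi for gi in g], iters=2)
```

In this sketch each CG iteration minimizes exactly along one search direction, but that direction generally changes all weights at once. The first direction is the negative gradient, so nothing has to be chosen by hand.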

