In this lecture, we will discuss second-order optimization methods for neural networks (and deep models in general).

**Please study the following material in preparation for the class:**

- LeCun et al.’s (1998) Efficient Backprop, especially sections 5-9.
- Geoff Hinton’s Coursera lecture 8a.
- Lecture slides


First, thank you for the link; it is a very nice read on many optimization methods.

All these fancy methods suffer from the same basic problem: we are not really optimizing a quadratic (or even convex) function. For example, a second-order method that assumes the function is quadratic can be equivalent to temporarily setting too high a learning rate in SGD, because it may jump far away from the closest local minimum.
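To make the "too-high learning rate" analogy concrete, here is a toy 1-D sketch (my own illustration, not from the comment; the function and step sizes are arbitrary choices): on the nonconvex function f(x) = x^4 - x^2, a full Newton step taken where the curvature is small jumps far past the nearest minimum, while a fixed small gradient step does not.

```python
import numpy as np

# Hypothetical 1-D illustration: f(x) = x^4 - x^2 is nonconvex,
# with local minima at x = ±1/sqrt(2) ≈ ±0.707.
grad = lambda x: 4 * x**3 - 2 * x
hess = lambda x: 12 * x**2 - 2

x = 0.45  # start just past the inflection point at sqrt(1/6) ≈ 0.408

# Full Newton step: divides by the (here small) curvature,
# so it overshoots the nearest minimum at 0.707 badly...
newton_step = x - grad(x) / hess(x)   # ≈ 1.70

# ...while plain gradient descent with a modest fixed rate
# moves only a short, safe distance.
sgd_step = x - 0.1 * grad(x)          # ≈ 0.50
```

Near the inflection point the second derivative passes through zero, so the quadratic model becomes arbitrarily bad and the Newton step arbitrarily large, which is exactly the "temporarily too-high learning rate" behavior described above.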

Could you please post the link to the research about the shape of these high-dimensional functions? I think it would be very helpful for understanding what we are really dealing with as an optimization problem.

Thanks!


Here is one link I am aware of:

“Exact solutions to the nonlinear dynamics of learning in deep linear neural networks”

Andrew M. Saxe, James L. McClelland, Surya Ganguli

http://arxiv.org/abs/1312.6120

If others can think of additional papers, I would appreciate you posting them here.


See also Yann Dauphin’s paper, which argues that the loss surface of neural nets most likely suffers from saddle points rather than local minima:

http://arxiv.org/abs/1406.2572

Also, Yann LeCun’s group had a similar paper recently:

http://arxiv.org/abs/1412.0233


Q1) What do Hessian-free optimization methods have that makes them suitable for training RNNs?

Q2) We already talked about deep nets and their associated training difficulties (vanishing gradients, etc.). What interests me is how the dynamics of wide nets work. Right now I can think of:

1. Greater fan-in, hence initialization proportional to 1/sqrt(fan-in) (Hinton’s Coursera lecture 6), or initializing weights with U(-sqrt(6/(n1+n2)), sqrt(6/(n1+n2))).

2. A wider net makes it easier to untangle the n-dimensional manifold. (All n-dimensional manifolds can be untangled in 2n+2 dimensions; see Christopher Olah’s blog post “Neural Networks, Manifolds, and Topology.”)
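A minimal sketch of the two initialization schemes mentioned in point 1 (my own code, not from the lecture; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out = 784, 256  # hypothetical layer sizes

# Scale Gaussian weights by 1/sqrt(fan_in), as in Hinton's
# Coursera lecture 6.
W_hinton = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)

# Glorot/Xavier uniform: U(-sqrt(6/(n1+n2)), sqrt(6/(n1+n2))).
limit = np.sqrt(6.0 / (fan_in + fan_out))
W_glorot = rng.uniform(-limit, limit, size=(fan_in, fan_out))
```

Both schemes shrink the weight scale as the layer gets wider, so the variance of each unit's summed input stays roughly constant regardless of fan-in, which is precisely why the width of the net matters for initialization.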


Q1) “There is no clear theoretical justification why using the Hessian matrix would help.” From Mikolov, Tomas, et al. “Learning Longer Memory in Recurrent Neural Networks.” arXiv preprint arXiv:1412.7753 (2014).


It seems LeCun disagrees with Harm, because he writes, “If the Hessian is not positive definite (…) then the Newton algorithm will diverge…”. But that is not true: using the inverse of the Hessian near a saddle point will move you toward the saddle point, not off to infinity.
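A toy numeric check of this claim (my own sketch, not from the comment): on the quadratic f(x, y) = x² - y², the Hessian is indefinite and the origin is a saddle, yet a full Newton step from any starting point lands exactly on the saddle rather than diverging.

```python
import numpy as np

# f(x, y) = x^2 - y^2: the origin is a saddle point and the
# (constant) Hessian is indefinite.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])
grad = lambda p: H @ p          # gradient of the quadratic

p = np.array([3.0, -1.5])       # arbitrary starting point
p_new = p - np.linalg.solve(H, grad(p))   # full Newton step
# For a quadratic, solve(H, H @ p) == p, so p_new is exactly
# the origin: Newton is attracted to the saddle, it does not diverge.
```

This is the standard argument for why saddle points (rather than divergence) are the real concern for plain Newton's method on non-convex losses; methods like saddle-free Newton modify the Hessian to repel from these points.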


Why is it that CG and L-BFGS need the whole batch? Isn’t it possible to generalize these methods to mini-batches? In general, is there any way to generalize a line-search procedure to mini-batches?


I would also like more clarification regarding Hessian-free optimization.

Do we go all the way down along one axis of curvature, using multiple iterations that modify only one weight at a time?

It seems to me that the choice of the first weight is very important, since it will influence the rest of the procedure… so how do we choose it?


Late notice: today, class will be in Z300!


As mentioned in class, this paper by Maclaurin, Duvenaud, and Adams does hyperparameter optimization by reversing the dynamics of SGD. http://arxiv.org/abs/1502.03492
