In this lecture, we will discuss gradient optimization methods for neural networks (and deep models in general).

**Please study the following material in preparation for the class:**

- Lecture 6 of Geoff Hinton’s Coursera course.
- Chapter 4 and Chapter 8 of the Deep Learning textbook.
- Lecture slides (covering SGD, SGD with momentum, Nesterov momentum, Adagrad, RMSprop, and Adadelta).

I cannot understand why it is recommended to “initialize the weights to be proportional to sqrt(fan-in)” [Hinton]

How does this put us in a better starting point for the optimization process?

(And if it does, I could not find such a feature in Pylearn2, short of calculating these values manually.)

The idea is to normalize the inputs/outputs of the neurons so that there are fewer saturation problems with certain non-linear functions (which reduces vanishing gradient problems). By normalizing, I mean having the same output variance for all the neurons. Let’s try to find the maximal range [-r, r] that complies with this constraint.

Let’s consider a neuron and assume it has n inputs x_i (n = “fan-in”) that are already normalized (same var(x_i) for all the inputs) and considered independent (!). Let’s denote the weights of the neuron by w_i and suppose they are initialized in [-r, r].

Then the pre-activation of the neuron is y = sum(x_i * w_i). Taking the variance, we have: var(y) = var(sum(x_i * w_i)) ~ sum(w_i^2 * var(x_i)). We want var(y) = var(x_i), i.e.

sum(w_i^2) = 1, i.e. r^2 * n >= 1, i.e. r >= 1/sqrt(n).

Yes, many approximations here!
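
The variance argument can be checked numerically. The sketch below is only an illustration: it assumes NumPy, standard-normal inputs, and a hypothetical helper named `preactivation_variance`. It also uses the exact variance of a uniform distribution on [-r, r], which is r^2 / 3, so the range that gives var(y) = var(x_i) is r = sqrt(3 / n) rather than the rougher 1 / sqrt(n).

```python
import numpy as np

def preactivation_variance(n, r, num_neurons=2000, num_samples=500, seed=0):
    """Empirical variance of y = sum_i x_i * w_i with w_i ~ Uniform(-r, r)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_samples, n))      # inputs with var(x_i) = 1
    w = rng.uniform(-r, r, size=(n, num_neurons))  # one column of weights per neuron
    y = x @ w                                      # pre-activations of every neuron
    return float(y.var())

n = 256  # fan-in (an arbitrary value chosen for the illustration)
scaled = preactivation_variance(n, np.sqrt(3.0 / n))  # range shrinks with fan-in
unscaled = preactivation_variance(n, 1.0)             # fixed range [-1, 1]

print(f"range scaled by fan-in: var(y) ~ {scaled:.2f}")
print(f"fixed range [-1, 1]:    var(y) ~ {unscaled:.2f}")
```

With the scaled range the pre-activation variance stays close to the input variance, while the fixed range lets it grow in proportion to the fan-in, which is what pushes saturating non-linearities into their flat regions.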

For “smarter/more accurate” initialization schemes, have a look at this paper: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf. It works for specific non-linearities and under other assumptions. In addition to having the same variance for the inputs/outputs, they try to ensure that the errors have the same variance too (they find a trade-off).

I don’t completely understand the analysis of section 8.1.4 (page 154).

Maybe Aaron can elaborate on this analysis based on the log-normal assumption?

For instance, why does the log-normal sum grow linearly but at the same time as sqrt(T)? And why is the final product e^{T} instead of e^{1/2 T}, as would be expected under a log-normal distribution: http://en.wikipedia.org/wiki/Log-normal_distribution#Arithmetic_moments.

In Hinton’s video (6b) he mentions normalizing the input vector to have zero mean over the whole dataset. Does he refer to the mean of all the training data or the mean of each input?

He talks about the mean over different inputs (for x1 = 101, x2 = 99 the mean is 100).

Edit: over different features.

Yeah, so basically you compute a mean over the training set for each dimension (e.g. if you have 10-dimensional vectors, you will get a mean with 10 values). Then you subtract this mean from the training set, the validation set and the test set.
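
As a minimal sketch of that recipe, assuming NumPy and made-up 10-dimensional data (the array names and sizes are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: rows are examples, columns are the 10 input features.
train_set = rng.normal(loc=5.0, scale=2.0, size=(1000, 10))
valid_set = rng.normal(loc=5.0, scale=2.0, size=(200, 10))
test_set = rng.normal(loc=5.0, scale=2.0, size=(200, 10))

# One mean per feature, computed on the training set only.
feature_mean = train_set.mean(axis=0)  # shape (10,)

# The same training-set mean is subtracted from every split.
train_centered = train_set - feature_mean
valid_centered = valid_set - feature_mean
test_centered = test_set - feature_mean

print(feature_mean.shape)
print(np.abs(train_centered.mean(axis=0)).max())  # zero up to rounding error
```

Only the training split is exactly zero-mean afterwards. The validation and test splits are shifted by the same vector, so no statistics from held-out data enter the preprocessing.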

If second-order methods are powerful, why don’t we address the computation problem by taking the gradient from the whole training set and the Hessian from a small subsample of the training set? There should still be some information about the curvature of the error surface in subsamples, right?

Here is another thought. Is it reasonable to use a second-order method (with a small subsample of the training set) to find a better initial point for other optimization methods?

A quick review of RMSProp, AdaDelta, AdaGrad, etc. in class would be appreciated.

I would have also appreciated a quick review of LeCun’s method.

Soroush, you cannot compute the Hessian effectively for large models, regardless of the number of training examples. The problem is that the Hessian requires O(n^2) operations, where n is the number of parameters of the model. So if you have a model with one million parameters (a very typical scenario) you will end up with on the order of a trillion (10^12) calculations!

This excludes any true second-order method for most of deep learning. Most methods used rely on some approximation of the Hessian (e.g. RMSProp which approximates the inverse of the absolute diagonal Hessian) or a subset of it (e.g. computing the diagonal and assuming everything off the diagonal is zero).
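
For reference, here is a rough sketch of the commonly stated RMSProp update, written with NumPy. The function name `rmsprop_step`, the hyperparameter values, and the toy quadratic loss are assumptions made for this illustration, not something taken from the lecture.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSProp update: scale each gradient entry by a running RMS of itself."""
    cache = decay * cache + (1.0 - decay) * grad ** 2  # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(cache) + eps)         # per-parameter step size
    return w, cache

# Toy loss 0.5 * sum(curvature * w**2), with very different curvature per axis.
curvature = np.array([1.0, 100.0])
w = np.array([1.0, 1.0])
cache = np.zeros_like(w)

initial_loss = 0.5 * np.sum(curvature * w ** 2)
for _ in range(200):
    grad = curvature * w  # gradient of the toy loss
    w, cache = rmsprop_step(w, grad, cache)
final_loss = 0.5 * np.sum(curvature * w ** 2)

print(f"loss before: {initial_loss:.3f}, loss after: {final_loss:.6f}")
```

Only one extra vector of the same size as the parameters is stored, so memory and compute stay linear in n. Each coordinate is divided by a running estimate of its own gradient magnitude, which is why the method copes with the 100x difference in curvature between the two axes here.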

I like Julian’s answer.

But just to add, I think there is a common association between batch-learning methods and second-order methods (or is this just me?). I think there is also an association between “batch learning” and “non-starter” for the problems we deal with.

With this association, it may be easy to forget where the real scaling issues are coming from. For the purposes of the deep learning book, it might be worth making that point more explicit.

In lecture 6a, Hinton suggests that the function to be optimized with large NNs is quite different from the functions that have been studied by the optimization community. What’s the basis for that comment?

One argument Hinton gives is that the optimization community has focused on convex problems. This makes it easier to derive mathematical results about performance, yet convexity does not hold for most deep learning models. In fact, they typically have many different saddle points and local minima.

A second argument could be that they haven’t focused on “redundancy”. For example, when we apply deep learning to images, we implicitly assume that the same features exist across a wide range of training samples (which is why we use mini-batches in the first place). I don’t think this kind of assumption has been investigated by researchers in optimization theory.

Thanks for the comment Julian.

I am not well versed in optimization, but I believe that trust-region methods and L-BFGS are non-convex optimization methods, am I right? Are you or somebody else aware of people benchmarking such methods against SGD and its variants on deep networks?

I guess another point against many optimization methods is that they involve computing a Hessian, which might be overwhelmingly computationally intensive in deep models with a large number of parameters.

I added a comment below which may answer part of your questions or spark more discussion. I agree about the Hessian being a primary issue with second-order methods, though I think memory problems are even more constraining than compute. As far as benchmarking goes, there have not been many thorough investigations, but one that did a pretty good job was this paper (http://arxiv.org/pdf/1312.6055v3.pdf), Unit Tests for Stochastic Optimization, T. Schaul et al., though it did not test against any second-order methods.

It is also missing comparisons for the most recent optimizers (Adam, ESGD, and Adasecant), though none of these have been officially published yet.

In general I think there is still a lot of value in reading the “standard” optimization literature and trying to find areas where we can bring new tricks into deep learning. The two communities seem to have very different focuses and bridging the gap could be very fruitful.

I also think some of the differences come down to how acceptable empirical results are, compared with theoretical ones, in the two communities (optimization/deep learning). Optimization seems to desire and require strong proofs and convergence bounds, which is good for proving generality but also limits these proofs to convex settings.

Deep learning has been driven almost entirely by empirical results, so we seem much more open to “it worked X% better on these datasets” style papers. However, there is also a case to be made that provable optimizers may still be beneficial even outside the applications where they guarantee results.

SGD is also quite easy to implement and works annoyingly well, whereas more complex optimizers are tricky mathematically and numerically. With all the other complexities of implementing deep learning algorithms, adding *even more* moving parts is not always a great idea.

Interestingly, Eren Golge just released a blog post comparing different optimization algorithms, including SGD, RMSProp, and momentum, which is quite useful.

Here’s the link:

http://www.erogol.com/comparison-sgd-vs-momentum-vs-rmsprop-vs-momentumrmsprop/

In lecture 6b, Hinton talks about 2 common problems that occur in multilayer networks. I don’t fully understand the second situation where “if classification networks that use a squared error or a cross-entropy error, the best guessing strategy is to make each unit always produce an output equal to the proportion of time it should be a 1.” What does it mean? And why is it another plateau which looks like a local minimum?

Another question is why we say adaptive learning rates only deal with axis-aligned effects, while momentum doesn’t care about the alignment.

Suppose you’re trying to minimize your objective without actually learning any information about the data itself (that is called “guessing”). Then a good strategy is to take the average over the entire dataset. This way, your error is the sum over D of [d - mean(D)]^2. If you move your guess in any direction, you get the sum over D of [d - mean(D) - delta]^2, and this is smaller than guessing the mean if and only if mean(D) < d (which is impossible by definition, since min(D) < mean(D) < max(D)) or mean(D) > d (likewise impossible).

This is thus an optimal parameter-free solution, which means it is the best possible guess.

Indeed, guessing E[x] – that is, the proportion of time x should be 1: 0*p(0) + 1*p(1) – minimizes MSE.

Because adding any constant (rather than a function of d) increases the MSE, it forms a surface that may look like a local minimum. I think it would actually form a saddle point, but I am not sure.
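
The claim that shifting the guess away from the mean can only increase the squared error can also be written out directly. As a sketch, with the assumed notation $N$ for the number of targets in $D$ and $\bar{d}$ for mean(D):

```latex
\sum_{d \in D} \left(d - \bar{d} - \delta\right)^2
  = \sum_{d \in D} \left(d - \bar{d}\right)^2
    - 2\delta \sum_{d \in D} \left(d - \bar{d}\right)
    + N\delta^2
  = \sum_{d \in D} \left(d - \bar{d}\right)^2 + N\delta^2
```

The cross term vanishes because the deviations from the mean sum to zero, so the total error grows by $N\delta^2$ for any non-zero $\delta$, whatever its sign.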

A chunk of the reply magically disappeared.

For the first inequality, you need delta > 0. If delta < 0, then you need mean(D) > d.

This might be useful:

http://ai.stanford.edu/~ang/papers/icml11-OptimizationForDeepLearning.pdf

In particular, I don’t know why L-BFGS is not more commonly used.

This blog post by Criteo (http://labs.criteo.com/2014/09/poh-part-3-distributed-optimization/) talks about some current uses of L-BFGS in a convex setting, but for neural nets in particular I think it was semi-popular before “deep learning” took off. Ultimately I don’t know that L-BFGS buys much over RMSProp and friends for non-convex problems since L-BFGS is approximating the diagonal of the inverse Hessian. Maybe someone could comment more on this.

This paper (http://users.iems.northwestern.edu/~nocedal/PDFfiles/representations.pdf) has a nice overview of limited memory Hessian approximations. Limited memory seems to be our primary issue in deep learning optimization, since any memory complexity besides linear (or close to it) is a non-starter for large networks with millions of parameters. The same goes for computational complexity but we have more tricks for avoiding computational bottlenecks.
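
To put rough numbers on the memory argument, here is a back-of-the-envelope sketch in Python. The helper `memory_gib`, the 32-bit float size, and the history size `m = 10` are assumptions chosen for illustration. The L-BFGS figure counts only the m stored pairs of parameter-sized vectors.

```python
def memory_gib(num_floats, bytes_per_float=4):
    """Memory needed to store `num_floats` values, in GiB (32-bit floats by default)."""
    return num_floats * bytes_per_float / 2 ** 30

n = 1_000_000  # number of parameters, as in the example earlier in the thread
m = 10         # hypothetical number of L-BFGS history pairs

dense_hessian = memory_gib(n * n)      # full n x n matrix: quadratic in n
lbfgs_history = memory_gib(2 * m * n)  # m pairs of length-n vectors: linear in n
diagonal_only = memory_gib(n)          # one value per parameter: linear in n

print(f"dense Hessian: {dense_hessian:,.0f} GiB")
print(f"L-BFGS (m={m}): {lbfgs_history:.3f} GiB")
print(f"diagonal only: {diagonal_only:.4f} GiB")
```

The dense matrix needs thousands of GiB for a one-million-parameter model, while both linear-memory alternatives fit comfortably in well under one GiB.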
