In this lecture, we will have a rather detailed discussion of regularization methods and their interpretation.

**Please study the following material in preparation for the class:**

- Geoff Hinton’s Coursera lectures, week 9.
- Chapter 7 of the Deep Learning textbook, Sec. 7.1-7.11 and 7.14.

**Slides from class**

Question:

If we want sparsity in our network, why don’t we use the L0 norm? Is it too hard to optimize?

Note:

I guess there is a typo in the pseudo-inverse equations:

Page 132, 2nd equation: (Xw – y)^T * (Xw – y)^T should be: (Xw – y)^T * (Xw – y)

3rd equation: same mistake

4th equation: w = (X^T * X^T)^-1 * X^T * y should be: w = (X^T * X)^-1 * X^T * y

5th equation: w = (X^T * X^T – aI)^-1 * X^T * y should be: w = (X^T * X – aI)^-1 * X^T * y

I hope I’m right 😛
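For what it’s worth, the corrected normal equations can be checked numerically. Note that for the L2-penalized objective (Xw – y)^T (Xw – y) + a * w^T w, the solution comes out with a plus sign: w = (X^T X + aI)^-1 * X^T y. A minimal numpy sketch with illustrative data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
alpha = 0.1

# Closed-form ridge solution: w = (X^T X + alpha*I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# Sanity check: the gradient of (Xw - y)^T (Xw - y) + alpha * w^T w
# should vanish at the closed-form solution
grad = 2 * X.T @ (X @ w - y) + 2 * alpha * w
print(np.abs(grad).max())
```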

Using L0 norm was also raised as question back in 2013:

https://ift6266h13.wordpress.com/2013/02/07/l0-norm/

The main argument Xavier gives is that it is more difficult to optimize, because it is not convex. In fact, it is a piecewise-constant “norm”, which (intuitively) means we lose many of the benefits of gradient descent…
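To make the difficulty concrete: the L1 penalty has a continuous proximal operator (soft thresholding) that still produces exact zeros, while the analogous L0 step (hard thresholding) is discontinuous, with zero gradient almost everywhere. A small numpy sketch, with illustrative function names:

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal step for the L1 norm: shrinks weights toward zero and
    # produces exact zeros (sparsity) while remaining continuous in w.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def hard_threshold(w, lam):
    # Analogous step for L0: keep or kill each weight outright.
    # The resulting map is discontinuous, which is part of what makes
    # the L0 penalty hard to optimize with gradient-based methods.
    return np.where(np.abs(w) > lam, w, 0.0)

w = np.array([-2.0, -0.3, 0.0, 0.5, 1.5])
print(soft_threshold(w, 0.4))  # [-1.6, 0., 0., 0.1, 1.1]
print(hard_threshold(w, 0.4))  # [-2., 0., 0., 0.5, 1.5]
```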

In Hinton’s lectures from section 10, he gives an argument for averaging classification predictions based simply on applying Jensen’s inequality to the log probability. If we don’t use log probability as a measure of success, but instead use, say, accuracy, can we still prove that an average of predictors fares better than a single one?

Also, are there other arguments for simply averaging predictors?

Julian,

I like this question, but this will be the subject of Thursday’s lecture. Today’s lecture was concerning Hinton’s lecture 9.

In unsupervised pretraining, I’m not clear on what the network is trying to learn exactly, i.e. what the learning rule is. Are we trying to minimize the mutual information between features at each layer?

In unsupervised pretraining you are trying to maximize the log-likelihood of the input data (typically in an RBM model).

On page 131 of the DL book, I can’t see how to get the unnumbered equation by changing the variable H to its factorized form in eq. 7.7.

On page 141, where did w_i^{\tau – 1} go in eq. (7.18)?

It would be nice if we could go through the proof of early stopping being equivalent to L2.
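In the meantime, here is a numerical illustration of the correspondence (not the book’s proof): for a quadratic loss, tau steps of gradient descent from w = 0 behave approximately like L2 regularization with penalty strength alpha ≈ 1/(eps * tau), where eps is the learning rate. All names and data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) / np.sqrt(50)  # scale so X^T X has eigenvalues near 1
y = rng.normal(size=50)

eps, tau = 0.01, 10  # learning rate and number of gradient-descent steps

# tau steps of gradient descent on 0.5 * ||Xw - y||^2, starting from w = 0
w_gd = np.zeros(3)
for _ in range(tau):
    w_gd -= eps * X.T @ (X @ w_gd - y)

# Ridge solution with the corresponding penalty strength alpha ~ 1/(eps * tau)
alpha = 1.0 / (eps * tau)
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(3), X.T @ y)

# The two solutions should be close when eps * tau * (eigenvalues) is small
print(np.linalg.norm(w_gd - w_ridge) / np.linalg.norm(w_ridge))
```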

In Hinton’s lecture, he explains why early stopping works: when we initialize the weights to be very small, the network is no more than a linear network, which has little capacity. Then, as we train the model, the weights get bigger and enter the non-linear regime, and the capacity of the model grows.

However, I have 2 questions concerning this intuition:

0. He says the model cannot tell which regularities are real regularities to learn and which are just dataset-specific regularities that result in overfitting, so it learns both and gets bad performance. But why does the initial “linear model” tend to learn the real regularities? Why does a model with low capacity end up learning the real regularities?

1. Early stopping and initializing the weights to be small have also been found useful with other kinds of networks, such as ReLU networks, whose activation function is not linear in the vicinity of the origin. So initially the network cannot be equated with a linear network. Why do the two methods still work so well then?

0.

The real regularities have to be stronger than the spurious, data-specific regularities, otherwise we cannot learn anything. So the linear model will learn the real regularities first because they are supposed to be stronger. More generally, I think that a model (with any capacity) will likely FIRST start learning the most obvious regularities in the data (real or spurious) and, if it still has capacity, it will THEN learn the spurious as well. This gives a slightly different interpretation of early stopping. To be confirmed in class.

1.

I thought about the same counterexample when I was watching. Maybe a possible interpretation: a ReLU network with small weights is super “flat”. By flat, I mean that the derivatives of the outputs with respect to the inputs are very small (because they are equal to products of small weights). And a flat, nearly constant function doesn’t have much capacity. With bigger weights, the model will still have the same number of non-linearity points (hyperplanes) as before, but its shape will be much more complicated and thus more flexible to fit the data. In other words, the angles between the hyperplanes (those that delineate the linear regions of the ReLU network) can take a much broader range of values than with small weights.
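This flatness is easy to check numerically: for a ReLU network, scaling all the weights by s scales the whole input-output map by s^2 (positive scalars pass through the ReLU), so small weights give a nearly constant function. A quick sketch with a hypothetical 3-6-2 net:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 3))  # hypothetical 3-6-2 ReLU network
W2 = rng.normal(size=(2, 6))

def relu_net(x, scale):
    # Same architecture, with every weight multiplied by `scale`
    return (scale * W2) @ np.maximum((scale * W1) @ x, 0.0)

x = rng.normal(size=3)
dx = 1e-3 * rng.normal(size=3)

# How much does the output move for a small input perturbation?
deltas = {s: np.linalg.norm(relu_net(x + dx, s) - relu_net(x, s))
          for s in (0.01, 1.0)}
print(deltas)
```

With the weights scaled by 0.01, the output perturbation shrinks by exactly 0.01^2, i.e. the function is ten thousand times flatter.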

edit:

replace “it will THEN learn the spurious as well” by “it will THEN learn the less obvious regularities (i.e., the weaker ones, which may be real or spurious) as well”

I really like your intuition on “1.” Thanks a lot.

In 1), I think your intuition is right, Alexandre. In fact, it probably holds more generally: for hidden units consisting of a linear transformation followed by any monotone non-linear function, having small weights across all hidden units means all the non-linear functions get a similar input (something very close to zero). Even if they are not linear, because they are monotonically increasing they will still produce more or less the same output…

But I think we are overlooking an important part of this: initialization! We typically initialize RELU units with small, but positive weights. Hence they are effectively linear units at the beginning of training! This might explain the success of early stopping.

Q.1 In section 7.1.1 of the book, I didn’t quite understand the intuition behind the use of the eigenvalues and the effect they have on the regularization. How does that explain what we see in Fig. 7.1? And how do they relate to the “normal” regularization term (lambda)?

Q.2 I understand that big values of the weights imply a more complex model in the case of a tanh or softmax activation (since we would no longer be in the linear regime), but why is that the case with ReLU (this might have been explained in previous weeks, but I didn’t get it)? Actually, I just don’t understand why an L1 or L2 regularization would have any effect when using a ReLU activation.

In Hinton’s lecture 9.1, he discusses about cross-validation. After N-fold cross-validation, I guess the best way to select the hyperparameters is simply to take an average of the optimal hyperparameter of each fold? Am I correct in suggesting that?

When you do a single cross-validation, you are only evaluating one single set of hyperparameters. So you have to do as many cross-validations as the number of hyperparameter sets you want to test.

The best set of hyperparameters corresponds to the best cross validation score.
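Concretely, the procedure looks something like this (a plain numpy sketch of k-fold cross-validation over a grid of ridge penalties; data and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=60)

def cv_score(alpha, k=5):
    # Mean validation MSE of ridge regression over k folds,
    # for ONE candidate hyperparameter value.
    folds = np.array_split(np.arange(len(y)), k)
    errs = []
    for val_idx in folds:
        tr_idx = np.setdiff1d(np.arange(len(y)), val_idx)
        Xtr, ytr = X[tr_idx], y[tr_idx]
        w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(4), Xtr.T @ ytr)
        errs.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return np.mean(errs)

# One full cross-validation per candidate value, then keep the value
# with the best (lowest) cross-validation score.
grid = [0.01, 0.1, 1.0, 10.0]
best_alpha = min(grid, key=cv_score)
print(best_alpha)
```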

Alexandre, thanks for the reply.

Sometimes the best configurations of hyperparameters don’t fall into a single region; i.e., the validation score is not a convex function of the hyperparameters and may have multiple modes. For example, maybe for one validation set the best hyperparameters are 3000 hidden units with 0.5 noise, and for another validation set 6000 hidden units with 0.7 noise. The two configurations may work equally well. But if you average them and choose 4500 hidden units with 0.6 noise, the performance can be terrible.

In Hinton’s lecture 9.1, it is suggested that starting with small weights limits the capacity of a logistic NN since most units will operate in the linear regime. Taking the example of a NN with 3 inputs, 6 hidden units and 2 outputs, Hinton suggests that in the linear regime this network acts more like a single layer with 3 inputs and 2 outputs. Though I can see how this argument applies with tanh hidden activations, I do not see how it applies to logistic units.
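One possible answer: near zero the logistic is affine rather than linear, sigmoid(x) ≈ 1/2 + x/4, so each small-weight hidden unit computes an approximately affine function of its input, and a composition of affine maps is still a single affine map, just as with tanh. A quick numerical check of the approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-0.1, 0.1, 5)
# First-order Taylor expansion of the sigmoid around 0
affine = 0.5 + x / 4.0
print(np.max(np.abs(sigmoid(x) - affine)))  # tiny for small |x|
```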

Why is it that larger datasets make unsupervised pre-training less necessary?

How big does the dataset need to be so that we can skip it?

I think it’s because with enough data, you can effectively bruteforce the problem. That is, due to backprop, learning starts from the top and goes on down. The better you’re doing on top given the current configuration of lower layers, the better you’re going to be able to optimize your lower layers.

However, if your lower layers are too bad to give you anything useful to work with, then you’re only going to get a really weak signal at the top no matter how good you are (after all, you’re only ever classifying rubbish). Still, you’ll make “baby step” progress throughout.

If you have more data available, you can just accumulate enough “baby steps” that you get “unstuck” and start learning useful representations at the lower levels (and this would happen because your data will show the model more and more valid examples where the “bad features” that it’s become too sensitive to are really irrelevant, so only the real salient features remain after backprop). At this point, you can move toward a good configuration for the whole network.

On the other hand, if you don’t have enough data to do that, you’ll just stay stuck in the current configuration without being able to explore the larger space provided by your full model capacity.

Layerwise pretraining bypasses the problem by setting up the lower layers so that (hopefully) they provide “useful” features to the higher layers. Of course, if you have enough data, that’s not really necessary.

“How big” would depend on the capacity of your model, I would assume. The bigger your model, the more data you’d need.

My intuition/interpretation.

If you have a small labelled dataset and you train a big model on it, your model will learn a little of the real regularities (there are not many present in the data) but also the spurious regularities, i.e. the noise of the data, which might be relatively important. On the other hand, if your dataset is big, then the real regularities will likely prevail over the noise of the data. In this case, your network can learn them.

So if your labelled dataset is too small, one possible solution is to pretrain it with a big unlabelled dataset (with similar regularities as the small labelled dataset) so that you have a reasonable initialization point in the weights space of the learning optimization problem. This initialization point captures some good regularities of the problem.

As Hinton says in lecture 9.3, using noise in the activities as a regularizer (making the units binary and stochastic on the forward pass but doing the backward pass with the real value) gives significantly better results on the test set. However, can we know more about the principle behind this phenomenon? As we know, adding Gaussian noise to the inputs while minimizing the squared error tends to penalize the squared weights.
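For a linear model, that last claim can be verified directly: E[(w·(x + eps) − y)^2] = (w·x − y)^2 + sigma^2 * ||w||^2 when eps ~ N(0, sigma^2 I), so training with Gaussian input noise adds an L2 penalty on the weights in expectation. A small Monte Carlo check with illustrative numbers:

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 0.1, -0.7])
y, sigma = 1.0, 0.2

# Monte Carlo estimate of E[(w.(x + eps) - y)^2], eps ~ N(0, sigma^2 I)
eps = sigma * rng.normal(size=(200000, 3))
noisy = np.mean((((x + eps) @ w) - y) ** 2)

# Analytic identity: clean squared error plus sigma^2 * ||w||^2
clean = (x @ w - y) ** 2 + sigma**2 * np.sum(w**2)
print(noisy, clean)
```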

Here, I mean a linear model in the last sentence.

Hinton explains it in more detail in videos in section 10….

In Hinton’s lecture 9.6, I do not see the intuition behind MacKay’s quick and dirty method. Is it possible to go over that? Also, I do not recall seeing this method used in recent machine learning competitions. Is anybody aware of its recent use? Is it still relevant?

It is a hack! It would only have been the right thing to do (more or less) if the data was actually generated by the model at the end of training and the hidden units were actual latent variables (something that will probably never happen in a thousand years!).

Aaron, could you please comment on my post?

I have decided to write down my question from class today:

https://my6266blog.wordpress.com/2015/02/10/regularization-early-stopping-criticism/

Thanks,

Maor.
