Lecture 10, Feb. 9th, 2015: Regularization I

In this lecture, we will have a rather detailed discussion of regularization methods and their interpretation.

Please study the following material in preparation for the class:

Slides from class


28 Replies to “Lecture 10, Feb. 9th, 2015: Regularization I”

  1. Question:
    If we want sparsity in our network, why don’t we use the L0 norm? Is it too hard to optimize?

    I guess there is a typo in the pseudo-inverse equations:
    Page 132, 2nd equation: (Xw – y)^T * (Xw – y)^T should be: (Xw – y)^T * (Xw – y)
    3rd equation: same mistake
    4th equation: w = (X^T * X^T)^-1 * X^T * y should be: w = (X^T * X)^-1 * X^T * y
    5th equation: w = (X^T * X^T – aI)^-1 * X^T * y should be: w = (X^T * X – aI)^-1 * X^T * y
    I hope I’m right 😛
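For reference, the corrected closed forms can be checked numerically. This is only a minimal sketch on made-up data: the names `least_squares` and `ridge` are hypothetical, and the ridge version uses the standard sign convention (X^T X + aI), which is an assumption about what the slide intends rather than a quote from it.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def ridge(X, y, alpha):
    """Solve (X^T X + alpha I) w = X^T y, i.e. L2-penalised least squares."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Synthetic data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

w_ls = least_squares(X, y)
w_ridge = ridge(X, y, alpha=1.0)
```

With `alpha = 0` the two functions coincide, and for `alpha > 0` the ridge weights have a smaller norm than the least-squares weights.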


  2. In Hinton’s lectures from section 10, he gives an argument for averaging classification predictions simply based on applying Jensen’s inequality to the log probability. If we don’t use log probability as a measure of success, but instead use, say, accuracy, can we still prove that an average of predictors fares better than a single one?

    Also, are there other arguments for simply averaging predictors?
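The log-probability argument can be checked numerically. The sketch below uses made-up probabilities (the array `p_correct` is hypothetical, not taken from the lecture) and only illustrates the concavity of the log; it says nothing about accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_examples = 5, 1000

# Hypothetical predictors: each row holds the probability that one model
# assigns to the correct class of each example.
p_correct = rng.uniform(0.05, 0.95, size=(n_models, n_examples))

# Average log probability of the individual models, per example.
mean_of_logs = np.log(p_correct).mean(axis=0)

# Log probability of the averaged prediction, per example.
log_of_mean = np.log(p_correct.mean(axis=0))
```

Because log is concave, `log_of_mean` is at least `mean_of_logs` on every example, so the averaged predictor is never worse than the average of the individual models under this measure.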


    1. In Hinton’s lecture, he explains why early stopping works: when we initialize the weights to be very small, the network is no more than a linear network, which does not have much capacity. Then, as we train the model, the weights get bigger and enter the non-linear region, and the capacity of the model grows.

      However, I have 2 questions concerning this intuition:
      0. He says the model cannot tell which regularities are real regularities to learn and which are just data-specific regularities that result in over-fitting, so it learns both and gets bad performance. But why does the initial “linear model” tend to learn the real regularities? Why does a model with low capacity always end up learning the real regularities?

      1. Early stopping and initializing the weights to be small are also useful for other kinds of networks, such as ReLU networks, whose activation function is not linear at all in the vicinity of the origin. So, initially, the network cannot be treated as equivalent to a linear network. Why do the two methods still work so well?


      1. 0.
        The real regularities have to be stronger than the spurious, data-specific regularities; otherwise we cannot learn anything. So the linear model will learn the real regularities first, because they are supposed to be stronger. More generally, I think that a model (with any capacity) will likely FIRST start learning the most obvious regularities in the data (real or spurious) and, if it still has capacity, it will THEN learn the spurious ones as well. This gives a slightly different interpretation of early stopping. To be confirmed in class.

        I thought about the same counterexample when I was watching. Maybe a possible interpretation: a RELU network with small weights is super “flat”. By flat, I mean that the derivatives of the outputs with respect to the inputs are very small (because they are equal to products of small weights). And a flat/constant function doesn’t have much capacity. With bigger weights, the model will still have the same number of non-linearity points (hyperplanes) as before, but its shape will be much more complicated and thus more flexible to fit the data. In other words, the angles between the hyperplanes (those that delineate linear regions of the RELU network) can take a much broader range of values than with small weights.
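The “flat” picture can be illustrated with a toy network. This is only a sketch under simplifying assumptions that are not in the lecture: two layers, no biases, random weights, and a hypothetical helper `relu_net`. Under those assumptions, multiplying every weight by a positive factor s multiplies the output (and hence its derivatives with respect to the input) by s^2, while the kinks stay on the same hyperplanes.

```python
import numpy as np

def relu_net(x, W1, W2):
    """Two-layer ReLU network without biases (illustrative only)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 3))
W2 = rng.normal(size=(2, 6))
x = rng.normal(size=3)

s = 0.01  # common positive scale applied to every weight
out_full = relu_net(x, W1, W2)
out_small = relu_net(x, s * W1, s * W2)

# Positive homogeneity of ReLU: relu(s * z) = s * relu(z) for s > 0,
# so the whole function is scaled by s**2.
ratio_ok = np.allclose(out_small, s**2 * out_full)
```

With s = 0.01 the output is ten thousand times smaller for the same input, which matches the description of a nearly constant function with the same set of hyperplanes.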


        1. In 1), I think your intuition is right, Alexandre. In fact, it probably holds more generally: for hidden units with linear transformations followed by any monotone non-linear function, having small weights (transformations) across all hidden units means all the non-linear functions get a similar input (something very close to zero). Even if they are not linear, because they are monotonically increasing they will still produce more or less the same output…

          But I think we are overlooking an important part of this: initialization! We typically initialize RELU units with small, but positive weights. Hence they are effectively linear units at the beginning of training! This might explain the success of early stopping.


  3. Q.1 In section 7.1.1 of the book, I didn’t quite understand the intuition behind the use of the eigenvalues and the effect they have on the regularization. How does that explain what we see in Fig 7.1, and how do they relate to the “normal” regularization term (lambda)?

    Q.2 I understand that big values of the weights imply a more complex model in the case of a tanh or softmax activation (since we would no longer be in the linear regime), but why is it the case with RELU (this might have been explained in the previous weeks but I didn’t get it)? Actually, I just don’t understand why an L1 or an L2 regularization would have any effect when using a RELU activation.
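On Q.1, the eigenvalue picture can be reproduced numerically. The sketch below assumes a purely quadratic objective with Hessian `H` and unregularized minimum `w_star`, plus an L2 penalty of the form (alpha / 2) * ||w||^2; all values are random and hypothetical. Under these assumptions, the regularized minimum is the unregularized one with each component along an eigenvector of `H` shrunk by lam / (lam + alpha).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
H = A @ A.T + 0.1 * np.eye(4)  # symmetric positive-definite Hessian
w_star = rng.normal(size=4)    # minimiser of the unregularised quadratic
alpha = 0.5                    # L2 coefficient

# Minimiser of 0.5 (w - w*)^T H (w - w*) + 0.5 * alpha * ||w||^2.
w_reg = np.linalg.solve(H + alpha * np.eye(4), H @ w_star)

# Same result in the eigenbasis of H: the component along the i-th
# eigenvector is multiplied by lam_i / (lam_i + alpha).
lam, Q = np.linalg.eigh(H)
shrink = lam / (lam + alpha)
w_reg_eig = Q @ (shrink * (Q.T @ w_star))
```

Directions with eigenvalues much larger than alpha are barely changed, while directions with eigenvalues much smaller than alpha are shrunk almost to zero, which is how the eigenvalues and the regularization coefficient interact.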


  4. In Hinton’s lecture 9.1, he discusses cross-validation. After N-fold cross-validation, I guess the best way to select the hyperparameters is simply to take an average of the optimal hyperparameters of each fold? Am I correct in suggesting that?


    1. When you do a single cross-validation, you are only evaluating one single set of hyperparameters. So you have to do as many cross-validations as the number of sets of hyperparameters you want to test.
      The best set of hyperparameters corresponds to the best cross validation score.
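The procedure described above can be sketched as follows. Everything here is an illustrative assumption: the model is a ridge regression on synthetic data, the only hyperparameter is `alpha`, and `ridge_fit` and `cv_score` are hypothetical helper names.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def cv_score(X, y, alpha, n_folds=5):
    """Mean validation MSE of ONE hyperparameter setting over all folds."""
    indices = np.arange(len(y))
    errors = []
    for val_idx in np.array_split(indices, n_folds):
        train_idx = np.setdiff1d(indices, val_idx)
        w = ridge_fit(X[train_idx], y[train_idx], alpha)
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=100)

# One full cross-validation per candidate; keep the best mean score.
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {alpha: cv_score(X, y, alpha) for alpha in candidates}
best_alpha = min(scores, key=scores.get)
```

Note that the selection compares whole candidate settings by their cross-validation score; no averaging of per-fold optima is involved.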


    2. Sometimes the best configurations of hyperparameters don’t fall into the same region, i.e., the validation performance is not guaranteed to be convex in the hyperparameters and may have multiple modes. For example, maybe for one validation set the best configuration is 3000 hiddens with 0.5 noise, and for another validation set it is 6000 hiddens with 0.7 noise. Both configurations work equally well. But if you average them and choose 4500 hiddens with 0.6 noise, the performance can be terrible.


  5. In Hinton’s lecture 9.1, it is suggested that starting with small weights limits the capacity of a logistic NN, since most units will work in the linear regime. Taking the example of an NN with 3 inputs, 6 hidden units and 2 outputs, Hinton suggests that in the linear regime this network acts more like a single layer with 3 inputs and 2 outputs. Though I can see how this argument would apply with tanh hidden activations, I do not see how this applies to logistic units.
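One way to probe this numerically is to compare the 3-6-2 network with a single affine layer. This is a sketch under assumptions that are not from the lecture: the output units are linear, the input-to-hidden weights are drawn with scale 0.01, and the surrogate uses the first-order expansion sigmoid(z) ≈ 0.5 + z / 4 around z = 0. The names `W_eff` and `b_eff` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = 0.01 * rng.normal(size=(6, 3))  # small input-to-hidden weights
W2 = rng.normal(size=(2, 6))         # hidden-to-output weights
X = rng.normal(size=(100, 3))

# Network output with logistic hidden units and linear outputs.
out_net = sigmoid(X @ W1.T) @ W2.T

# Affine surrogate: a single 3-to-2 layer with weights W2 W1 / 4
# and a constant offset coming from sigmoid(0) = 0.5.
W_eff = W2 @ W1 / 4.0
b_eff = 0.5 * W2.sum(axis=1)
out_affine = X @ W_eff.T + b_eff

max_gap = np.max(np.abs(out_net - out_affine))
```

The logistic unit is not centred at zero like tanh, but near zero input it is still close to an affine function, so the hidden layer mainly contributes a constant offset that can be absorbed into a bias.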


    1. I think it’s because with enough data, you can effectively brute-force the problem. That is, due to backprop, learning starts from the top and goes on down. The better you’re doing on top given the current configuration of the lower layers, the better you’re going to be able to optimize your lower layers.

      However, if your lower layers are too bad to give you anything useful to work with, then you’re only going to get a really weak signal at the top no matter how good you are (after all, you’re only ever classifying rubbish). Still, you’ll make “baby step” progress throughout.

      If you have more data available, you can just accumulate enough “baby steps” that you get “unstuck” and start learning useful representations at the lower levels (and this would happen because your data will show the model more and more valid examples where the “bad features” that it’s become too sensitive to are really irrelevant, so only the real salient features remain after backprop). At this point, you can move toward a good configuration for the whole network.

      On the other hand, if you don’t have enough data to do that, you’ll just stay stuck in the current configuration without being able to explore the larger space provided by your full model capacity.

      Layerwise pretraining bypasses the problem by setting up the lower layers so that (hopefully) they provide “useful” features to the higher layers. Of course, if you have enough data, that’s not really necessary.

      “How big” would depend on the capacity of your model, I would assume. The bigger your model, the more data you’d need.


    2. My intuition/interpretation.

      If you have a small labelled dataset and you train a big model on it, your model will learn a little bit of the real regularities (those present in the data (not many)) but also the spurious noise of the data (spurious regularities), which might be relatively important. On the other hand, if your dataset is big, then the real regularities will likely prevail over the noise of the data. In this case, your network can learn them.

      So if your labelled dataset is too small, one possible solution is to pretrain the model on a big unlabelled dataset (with regularities similar to those of the small labelled dataset) so that you have a reasonable initialization point in the weight space of the optimization problem. This initialization point captures some good regularities of the problem.


  6. As Hinton says in lecture 9.3, using noise in the activities as a regularizer (making the units binary and stochastic on the forward pass, but doing the backward pass with real values) gives significantly better results on the test set. However, can we know more about the principle behind this phenomenon? As we know, minimizing the squared error with Gaussian noise added to the inputs also tends to minimize the squared weights.
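The last statement, about Gaussian input noise, can be checked for a linear model. The sketch below is an illustration on synthetic data with hypothetical names; it covers only input noise in a linear model, not the stochastic binary units from lecture 9.3. For noise of variance sigma^2 added independently to each input, the expected squared error equals the clean squared error plus sigma^2 * ||w||^2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 50, 3, 0.3
X = rng.normal(size=(n, d))
w = rng.normal(size=d)  # an arbitrary weight vector to evaluate
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Closed form: clean squared error plus an L2 penalty on the weights.
clean_mse = np.mean((y - X @ w) ** 2)
predicted = clean_mse + sigma**2 * np.sum(w**2)

# Monte Carlo estimate of the squared error with noisy inputs.
n_draws = 20000
noise = sigma * rng.normal(size=(n_draws, n, d))
noisy_mse = np.mean((y - (X + noise) @ w) ** 2)
```

The two quantities agree up to Monte Carlo error, so in this setting input noise acts like an L2 penalty with coefficient sigma^2.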


  7. In Hinton’s lecture 9.6, I do not see the intuition behind MacKay’s quick & dirty method. Is it possible to go over that? Also, I do not recall seeing this method being used in recent machine learning competitions. Is anybody aware of its recent use? Is it still relevant?


    1. It is a hack! It would only have been the right thing to do (more or less) if the data was actually generated by the model at the end of training and the hidden units were actual latent variables (something that will probably never happen in a thousand years!).

