In this lecture, we will continue our discussion of regularization methods, focusing in particular on ensemble methods and dropout.

**Please study the following material in preparation for the class:**

- Geoff Hinton’s Coursera course, Lecture 10.
- Chapter 7 of the Deep Learning textbook, Sec. 7.12-7.13

**Slides from class**


Is it useful to use both L1/L2 and early stopping to regularize? Do they interact in any particular way?


I think this was answered in the previous lecture, but L2 regularization and early stopping can be seen as performing similar jobs. Both prevent your network from traveling “too far” in parameter space – L2 by penalizing the weights and early stopping by stopping training.
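As a rough illustration (plain NumPy, with invented toy data and hyperparameters), here is how the two mechanisms look side by side in a single training loop:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
w_true = rng.randn(5)
y = X @ w_true + 0.1 * rng.randn(100)

# hold out a validation set for early stopping
X_tr, y_tr, X_va, y_va = X[:80], y[:80], X[80:], y[80:]

w = np.zeros(5)
lr, lam = 0.01, 1e-3
best_va, best_w, patience = np.inf, w.copy(), 0
for step in range(1000):
    grad = X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * (grad + lam * w)          # L2: shrink weights toward zero
    va = np.mean((X_va @ w - y_va) ** 2)
    if va < best_va:
        best_va, best_w, patience = va, w.copy(), 0
    else:
        patience += 1
        if patience > 20:               # early stopping: halt when val loss stalls
            break
w = best_w
```

Both terms in the update limit how far `w` travels from its initialization: the `lam * w` term pulls it back continuously, and the patience counter cuts the trajectory short.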


What is the status with unsupervised pretraining?

In class you mentioned that we do not use this technique any more; what, however, are the criteria for still using it?

For example, it looked very beneficial on MNIST (from your book):

“the state of the art classification error rate on the MNIST dataset is attained by a classifier that uses both dropout regularization and deep Boltzmann machine pretraining. Combining dropout with unsupervised pretraining has not become a popular strategy for larger models and more challenging datasets.”

According to this, what makes MNIST a non-“challenging dataset”?


Typically, the only time you will see unsupervised pretraining these days (2012-present) is for applications with very little labeled data. If you also tie in transfer learning, there are applications where you want to train a very large system, then tie all the pieces together and fine-tune.

For example, the recent work on caption generation (http://www.iro.umontreal.ca/~bengioy/cifar/NCAP2014-summerschool/slides/ryan_kiros_cifar2014kiros.pdf) will tend to use models which are already trained for the dense feature extraction from images as well as the language model. Some papers then fine-tune the whole system together, and others just leave the “big” components fixed and only train the mapping from image to text.

You could also consider personalization applications where there may be a baseline model which is already very good, but training on a small amount of personalized data can lead to a better user experience. In general transfer learning has proved quite successful in a number of applications, and tends to “just work”, whereas unsupervised training can be pretty tricky to get working correctly.

MNIST is considered non-challenging for several reasons. The dataset is quite small (60k training examples). The algorithms and approaches for attacking MNIST have been covered extensively over the last 20 years, and it is fairly rare to see data of a similar nature (small images (28×28), basically black and white) outside of applications in handwriting recognition. Most of the work on handwriting recognition has moved to recognition in other languages such as Arabic and Chinese, and the “number recognition” work has moved to using Street View House Numbers (SVHN) which has much more challenging examples, larger images, and color.

It is still a decent sanity check, but it is hard to validate the usefulness of general algorithms or techniques on only MNIST data, as the ML community has basically overfit the data with models, data augmentation, and other tricks. We are at the point where the state of the art is only 30 or 40 errors on the whole dataset *including* samples with bad labeling! Be very wary anytime something claims to be state-of-the-art but only shows results on MNIST – this could be an indicator that the technique is not generally useful and is basically “overfit” to MNIST.

Since our goal is to use and develop general methods that allow learning from data, and MNIST doesn’t help evaluate whether a technique is generally useful, people tend to avoid using it as a primary dataset. However, it is equally strange if a paper *excludes* MNIST results. MNIST is like the Ford Focus of datasets – always there, always reliable, but not usually celebrated or held in particularly high esteem.


Thank you for an **excellent** answer!

BTW: How did you know I had a Focus? 🙂


Just to add to that: our new architectures (with rectified linear units and perhaps other elements) and training procedures (stochastic gradient descent over very tiny batches) have also helped to make supervised training easier and more effective. I think Aaron mentioned this in class when I asked him a couple of lectures ago…


Also, I don’t believe the technique mentioned in the book is the current state of the art. See http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html for more details. As far as I am aware, Convolutional Kernel Networks (0.39%) are the state of the art without ensembling or data augmentation, and they match the mentioned technique if the book is referencing Efficient Learning of Sparse Representations with an Energy-Based Model. Including ensembling and data augmentation, there are two others with notably lower scores.

There is some argument to be made that comparing ensembled and/or augmented results with single classifiers isn’t really a fair comparison, though others will say that the best is the best, period. Depends on personal perspective.


Thanks again,

The best MNIST classifier uses another regularization technique we did not cover in class: DropConnect (which looks very similar to dropout).

See http://cs.nyu.edu/~wanli/dropc/ for more details.
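For anyone comparing the two, here is a toy NumPy sketch of the masking difference (sizes and keep probability are made up): dropout zeroes whole activations, while DropConnect zeroes individual weights:

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(4)              # input activations
W = rng.randn(3, 4)           # weight matrix
p = 0.5                       # keep probability

# Dropout: one Bernoulli mask per unit/activation
unit_mask = rng.binomial(1, p, size=x.shape)
h_dropout = W @ (x * unit_mask) / p

# DropConnect: one Bernoulli mask per individual connection
weight_mask = rng.binomial(1, p, size=W.shape)
h_dropconnect = (W * weight_mask) @ x / p
```

Dividing by `p` keeps the expected pre-activation the same as without masking, so the same weight-scaling trick applies at test time.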


Dropout is a very useful and powerful regularizer. There are also some derivations for “fast dropout” by Wang and Manning (http://nlp.stanford.edu/pubs/sidaw13fast.pdf). However, I have heard that fast dropout isn’t as good as “true dropout” – is this really true? I was under the impression that fast dropout was nearly mathematically equivalent.

As far as I know, dropout can only be applied to fully connected networks. There are papers such as Bayer et al. (http://arxiv.org/abs/1311.0701) which tried to use dropout for RNNs and found it broke the recurrent connections, but isn’t adding Gaussian weight noise basically a form of dropout already, via the fast dropout work?

There are papers such as Zaremba, Sutskever and Vinyals (http://arxiv.org/abs/1409.2329) and Zeiler, Fergus (http://www.matthewzeiler.com/pubs/iclr2013/iclr2013.pdf) which try to apply dropout to other architectures. In the former paper, they simply use dropout on the fully connected layer between LSTM layers, and in the latter they accomplish something resembling dropout using stochastic pooling. Can we summarize where/which of these methods are currently used and places where it is currently *missing* for our models?
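As I understand the Zaremba et al. recipe, the masks are applied only to the non-recurrent (layer-to-layer) connections, leaving the recurrent path deterministic; a toy sketch with invented sizes:

```python
import numpy as np

rng = np.random.RandomState(0)
T, n_in, n_h = 5, 3, 4
W_in  = rng.randn(n_h, n_in) * 0.1
W_rec = rng.randn(n_h, n_h) * 0.1
p = 0.8                                    # keep probability

h = np.zeros(n_h)
for t in range(T):
    x_t = rng.randn(n_in)
    mask = rng.binomial(1, p, size=n_in)   # dropout only on the input
    x_t = x_t * mask / p                   # (non-recurrent) connection
    h = np.tanh(W_in @ x_t + W_rec @ h)    # recurrent path left untouched
```

The point is that noise injected on `W_rec @ h` would compound over the T timesteps, while noise on the input connection is applied fresh at each step.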

Research question: Dropout with other distributions. Could there be a version of fast dropout using student-t distributions to simulate an ensemble of networks with varying levels of connectedness? Student-t can be represented by superposition of Gaussians with different variance, hence you might be able to see it as simulating multiple networks with varying capacity of the affected network/layer trained on the same example, rather than simulating a lot of networks with the same capacity (~half the total network capacity) and relying on bagging. This would seem like it could give you more of a “random forest” effect, where some of the pseudo-ensemble networks would share a common space.


I am not familiar at all with the dropout literature, but I disagree with your last argument that, because we can interpret a student-t distribution as an infinite (weighted) sum of Gaussians, injecting student-t distributed noise into our network is similar to having an infinite number of models.

Here’s a counterargument: in principle, any (smooth) distribution can be approximated arbitrarily well by a mixture of Gaussians, but that does not mean that injecting noise from such a distribution will have an effect similar to dropout. The fact that the student-t distribution corresponds to an infinite weighted sum of Gaussians is just one interpretation; another interpretation is that it’s just another unimodal distribution with a simple pdf.


Earlier we discussed that a squared error applied to sigmoid outputs may lead to vanishing gradients when a network outputs a wrong answer. But the mixture-of-experts model uses squared error weighted by learned parameters `p_i`, as presented in Hinton’s course. Does this model suffer from the same vanishing-gradient problem?
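To make the worry concrete: the gradient of the squared error 0.5*(σ(z) − t)² with respect to the pre-activation z is (σ(z) − t)·σ(z)·(1 − σ(z)), which is tiny whenever the sigmoid saturates, even if the error itself is large:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

t = 1.0          # target
z = -10.0        # pre-activation: the unit is confidently wrong
y = sigmoid(z)   # output is near 0, so the error is near 1

# d/dz of 0.5*(y - t)^2 = (y - t) * y * (1 - y)
grad = (y - t) * y * (1.0 - y)
# |grad| is on the order of sigmoid(-10) ~ 4.5e-5:
# almost no learning signal despite a nearly maximal error
```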


When I saw Hinton’s lecture, I thought he was applying the model to continuous (real-valued) targets. In that case, `p_i` is the probability of picking a certain expert, not the probability of a given class. Once an expert has been picked, that expert gives its (real-valued) prediction.


So, this method doesn’t work for classification problems?


This is a more general question about regularization. Observing how a fully connected architecture fares against a convolutional architecture, it seems to me that parameter sharing is one of the most powerful regularizers we know. Any thoughts on that?

Also, the right parameter-sharing structure is somewhat easy to find when the input data topology is known. On the other hand, when the topology is unknown, is there any way to regularize the learned model toward a beneficial parameter-sharing scheme?

(BTW, I have probably asked this question in another form in previous lectures, but this question is really intriguing to me)


Discovering the underlying topology (or, you could say, the architecture of the optimal model) is an active area of research as far as I know. But perhaps one of the earliest successful and simple approaches was “soft weight sharing”, where basically you assume that there is a Gaussian distribution over different sets of weights, and then you let your model find out which set of weights belongs to which Gaussian:

http://www.cs.toronto.edu/~fritz/absps/sunspots.pdf
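If I read the paper correctly, the penalty is the negative log-likelihood of each weight under a mixture of Gaussians whose parameters are learned along with the weights; a rough sketch (mixture parameters are invented here rather than learned):

```python
import numpy as np

def soft_weight_sharing_penalty(w, pis, mus, sigmas):
    """-sum_i log sum_j pi_j * N(w_i | mu_j, sigma_j^2)."""
    w = w[:, None]                                  # (n_weights, 1)
    dens = pis * np.exp(-0.5 * ((w - mus) / sigmas) ** 2) \
           / (sigmas * np.sqrt(2 * np.pi))          # (n_weights, n_components)
    return -np.sum(np.log(dens.sum(axis=1)))

w = np.array([0.01, -0.02, 1.05, 0.98])   # weights clustering near 0 and 1
pis    = np.array([0.5, 0.5])             # mixing proportions (learned in the paper)
mus    = np.array([0.0, 1.0])             # component means (also learned)
sigmas = np.array([0.1, 0.1])
penalty = soft_weight_sharing_penalty(w, pis, mus, sigmas)
```

Weights that fall near a shared component mean are cheap under this penalty, so minimizing it pushes the weights to cluster, which is a soft form of sharing.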


Thanks for the reference Julian. I’ll certainly read it.


In Section 7.13 (dropout), page 150 of the DL book, it is suggested that the ensemble predictor is defined by renormalizing the geometric mean over all members. Is there a specific reason for using the geometric mean as the way of combining predictions (rather than the arithmetic mean, for example)?
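One way to see the difference is to compute both on some made-up member predictions; the geometric mean is an average in log-probability space, which is what dropout’s weight-scaling approximation corresponds to:

```python
import numpy as np

# softmax outputs of three ensemble members over 4 classes (made-up numbers)
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.60, 0.20, 0.10, 0.10],
              [0.50, 0.30, 0.10, 0.10]])

arith = P.mean(axis=0)

geo = np.exp(np.log(P).mean(axis=0))   # element-wise geometric mean
geo = geo / geo.sum()                  # renormalize so it sums to 1
```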


I disagree with the arguments given in favour of ensemble methods!

The book and Hinton’s lectures argue that ensemble methods reduce squared error by exploiting the low covariance between their predictions. But in fact, if we use bagging we are increasing the actual variance of each predictor! Suppose I have a regression problem and I use two systems to solve it: System A) trains a (least-squares) linear regression on the entire training data, and System B) trains two separate linear regressions on datasets produced from the original dataset with bagging. Then surely the predictions of the two linear regressions in System B) will have a much higher variance (w.r.t. their residual error) than the linear regression in System A). In fact, in most applications I think you will probably find System A) outperforms System B)…

Yet, we know that we improve performance even when we average the predictions of networks with the same architecture trained on the same training set. Couldn’t this performance boost be better explained from a Bayesian/regularization perspective? We have a prior over models (defined implicitly through our model architecture and training procedure) and a conditional distribution (given by our surrogate loss function, training set and validation set). Then, every time we train a new model, we are effectively drawing a sample from the posterior distribution over models. If we draw only one model, our prediction will have high variance (and, in fact, will be overconfident) – as Bayesian folk wisdom would tell us. But if we draw many models and average their predictions, we are doing Bayesian model averaging. It’s Bayesian precisely because each model we get at the end of training is a sample from the same posterior distribution we defined earlier (as opposed to simply drawing random weights, or using different loss functions for different models, which would in retrospect induce a different posterior distribution).


Great question / comment Julian,

You point to a question of some debate in the research community. Check out the refs below (esp. the first), which I believe argue against your Bayesian model averaging account of ensemble methods.

Check out:

http://research.microsoft.com/en-us/um/people/minka/papers/minka-bma-isnt-mc.pdf

http://homes.cs.washington.edu/~pedrod/papers/mlc00b.pdf


Thanks Aaron. Those papers clarified it, although my argument that bagging increases the variance of each individual model still holds, and I think it deserves a mention in your book too.

In the paper by Domingos, they found that Bayesian model averaging increases overfitting by trying out many models and (effectively) weighting one model much higher than all others based on its training log-likelihood. I wonder: what would happen if we weighted each model according to its test log-likelihood?
