Class project starts now!

The Pylearn2 implementation of the Dogs vs. Cats dataset is complete, which means that you now have everything you need to start training models for the class project.

This post details where to find and how to use the dataset implementation along with Pylearn2.


Requirements

  • Pylearn2 and its dependencies
  • PyTables

Getting the code

You can find the code for the dataset in Vincent Dumoulin’s repository.

Clone the repo in a place that’s listed in your PYTHONPATH, and you’re ready to go.
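A quick way to confirm that the clone landed somewhere on your PYTHONPATH is to try importing the package. The snippet below is only a minimal sketch: `is_importable` is a hypothetical helper name, and it assumes the top-level package is called `ift6266h15`, as in the import paths used later in this post.

```python
import importlib


def is_importable(module_name):
    """Return True if module_name can be imported from the current path."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


if __name__ == "__main__":
    # Assumed package names, taken from the import paths used in this post.
    for name in ("pylearn2", "tables", "ift6266h15"):
        status = "found" if is_importable(name) else "NOT found"
        print("%s: %s" % (name, status))
```

If `ift6266h15` is reported as not found, the directory containing the clone is probably missing from PYTHONPATH.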

Getting the data

N.B.: This part is only required if you’re working on your own machine. The data is already available on the LISA filesystem.

Downloading the images

First, you’ll need to make sure your PYLEARN2_DATA_PATH environment variable is set (e.g. through an export PYLEARN2_DATA_PATH=<path_of_your_choice> call in your .bashrc if you’re on Linux). This is where Pylearn2 expects your data to be found.

Create a dogs_vs_cats directory in ${PYLEARN2_DATA_PATH}.

Finally, download and unzip the file under the ${PYLEARN2_DATA_PATH}/dogs_vs_cats directory (many thanks to Kyle Kastner for making it available without having to go through the whole Kaggle signup process!).
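The environment check and the directory creation described above can be sketched in a few lines of Python. This is an illustration rather than part of the dataset code: `ensure_dataset_dir` is a hypothetical helper, and it only assumes the PYLEARN2_DATA_PATH variable and the dogs_vs_cats directory name mentioned in this post.

```python
import os


def ensure_dataset_dir(dirname="dogs_vs_cats"):
    """Return the dataset directory under PYLEARN2_DATA_PATH, creating it if needed."""
    data_path = os.environ.get("PYLEARN2_DATA_PATH")
    if not data_path:
        # Fail early with a clear message instead of a confusing path error later.
        raise RuntimeError("PYLEARN2_DATA_PATH is not set")
    target = os.path.join(data_path, dirname)
    if not os.path.isdir(target):
        os.makedirs(target)
    return target
```

Calling `ensure_dataset_dir()` before unzipping gives you the path where the images should end up.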

Generating the HDF5 dataset file

Once the images have been downloaded, unzipped and placed into the ${PYLEARN2_DATA_PATH}/dogs_vs_cats directory, run

python ift6266h15/code/pylearn2/datasets/

This will create an HDF5 file at ${PYLEARN2_DATA_PATH}/dogs_vs_cats/train.h5 which contains the whole training set. This may take some time. The file should weigh around 11 gigabytes.

Instantiating and iterating over the dataset

We’re going to use the ift6266h15.code.pylearn2.datasets.variable_image_dataset.DogsVsCats subclass of Dataset.

The dataset constructor takes up to three arguments: an instance of an ift6266h15.code.pylearn2.datasets.variable_image_dataset.BaseImageTransformer subclass and, optionally, a start and a stop index specifying which slice of the whole dataset to use.

The BaseImageTransformer subclass is responsible for transforming a variable-sized image to a fixed-sized one through some sort of preprocessing and is used by the dataset to construct batches of fixed-sized examples.

There is currently only one subclass implemented, ift6266h15.code.pylearn2.datasets.variable_image_dataset.RandomCrop, which scales the input image so that its smallest side has length scaled_size, and then takes a random square crop of side crop_size inside the scaled image (both scaled_size and crop_size are constructor arguments).
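To make the geometry concrete, here is a dependency-free sketch of the two steps described above: computing the scaled dimensions, then picking a random square crop box. It is not the RandomCrop implementation from the repository; the function names and the rounding behaviour are assumptions made for illustration.

```python
import random


def scaled_dimensions(width, height, scaled_size):
    """Scale (width, height) so that the smallest side equals scaled_size."""
    factor = float(scaled_size) / min(width, height)
    # max() guards against floating-point rounding pushing a side below scaled_size.
    return (max(scaled_size, int(round(width * factor))),
            max(scaled_size, int(round(height * factor))))


def random_crop_box(width, height, crop_size, rng=random):
    """Pick a random crop_size x crop_size box inside a width x height image.

    Returns (left, top, right, bottom) in pixel coordinates.
    """
    if crop_size > min(width, height):
        raise ValueError("crop_size must not exceed the smallest side")
    left = rng.randint(0, width - crop_size)   # randint is inclusive on both ends
    top = rng.randint(0, height - crop_size)
    return (left, top, left + crop_size, top + crop_size)
```

For example, a 500x375 image with scaled_size=256 is scaled to roughly 341x256, and a 221x221 crop box is then drawn somewhere inside that area.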

Here’s how we would instantiate and iterate over the dataset:

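from ift6266h15.code.pylearn2.datasets.variable_image_dataset import DogsVsCats, RandomCrop

# First 20,000 training examples: smallest side scaled to 256, random 221x221 crop
dataset = DogsVsCats(
    RandomCrop(256, 221),
    start=0, stop=20000)
# Assumed arguments: any valid Pylearn2 iteration mode and batch size work here
iterator = dataset.iterator(
    mode='batchwise_shuffled_sequential',
    batch_size=100)
for X, y in iterator:
    print X.shape, y.shape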

Note that by default the dataset iterates over both features and targets.

Here’s how you would use the dataset inside a YAML file to train a linear classifier on the dataset:

!obj:pylearn2.train.Train {
    dataset: &train !obj:ift6266h15.code.pylearn2.datasets.variable_image_dataset.DogsVsCats {
        transformer: &transformer !obj:ift6266h15.code.pylearn2.datasets.variable_image_dataset.RandomCrop {
            scaled_size: 256,
            crop_size: 221,
        },
        start: 0,
        stop: 20000,
    },
    model: !obj:pylearn2.models.mlp.MLP {
        nvis: 146523,
        layers: [
            !obj:pylearn2.models.mlp.Softmax {
                layer_name: 'y',
                n_classes: 2,
                irange: 0.01,
            },
        ],
    },
    algorithm: !obj:pylearn2.training_algorithms.sgd.SGD {
        batch_size: &batch_size 100,
        train_iteration_mode: 'batchwise_shuffled_sequential',
        batches_per_iter: 10,
        monitoring_batch_size: *batch_size,
        monitoring_batches: 10,
        monitor_iteration_mode: 'batchwise_shuffled_sequential',
        learning_rate: 1e-3,
        learning_rule: !obj:pylearn2.training_algorithms.learning_rule.Momentum {
            init_momentum: 0.95
        },
        monitoring_dataset: {
            'train' : *train,
            'valid': !obj:ift6266h15.code.pylearn2.datasets.variable_image_dataset.DogsVsCats {
                transformer: *transformer,
                start: 20000,
                stop: 25000,
            },
        },
        cost: !obj:pylearn2.costs.cost.MethodCost {
            method: 'cost_from_X',
        },
        termination_criterion: !obj:pylearn2.termination_criteria.EpochCounter {
            max_epochs: 10
        },
    },
}

This YAML file trains a softmax classifier for 10 epochs using the first 20,000 examples of the training set. An epoch consists of 10 batches of 100 random examples. Monitoring values are approximated with 10 batches of 100 random examples.
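The numbers in the YAML file can be checked with a bit of arithmetic. The sketch below assumes three colour channels (RGB), which is what makes nvis come out to 146523 for 221x221 crops; the variable names simply mirror the YAML keys.

```python
crop_size = 221
channels = 3  # assumption: RGB images

# Input dimensionality of the flattened 221x221 RGB crop.
nvis = crop_size * crop_size * channels

batch_size = 100
batches_per_iter = 10
max_epochs = 10

# Examples visited in one "epoch" as defined by batches_per_iter.
examples_per_epoch = batch_size * batches_per_iter

# Upper bound on example presentations over the whole run.
total_presentations = examples_per_epoch * max_epochs

print("nvis = %d" % nvis)
print("examples per epoch = %d" % examples_per_epoch)
print("total presentations = %d" % total_presentations)
```

With these settings the run draws at most 10,000 examples in total, so it does not even make one full pass over the 20,000 training examples. Treat it as a smoke test rather than a serious training configuration.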

Implementing a BaseImageTransformer subclass

In order to implement your own preprocessing of the images, your BaseImageTransformer subclass needs to implement two methods: get_shape and preprocess.

The get_shape method needs to return the width and height of preprocessed images.

The preprocess method does the actual preprocessing. Given an input image, it returns the preprocessed version whose shape needs to be consistent with what get_shape returns.

Have a look at how RandomCrop is implemented to get a better feel for how it’s done.
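As a rough illustration of the two-method interface, here is a hypothetical CenterCrop transformer. To keep the sketch runnable on its own, it defines a stand-in BaseImageTransformer and represents an image as a list of rows of pixels. Both are assumptions: the real base class lives in the repository and works on arrays, and its exact method signatures may differ. Unlike RandomCrop, this sketch does not rescale the image first.

```python
class BaseImageTransformer(object):
    """Stand-in for the repository's base class (assumption, not the real code)."""

    def get_shape(self):
        raise NotImplementedError()

    def preprocess(self, image):
        raise NotImplementedError()


class CenterCrop(BaseImageTransformer):
    """Hypothetical transformer: take a centered square crop of side crop_size."""

    def __init__(self, crop_size):
        self.crop_size = crop_size

    def get_shape(self):
        # Width and height of every preprocessed image.
        return (self.crop_size, self.crop_size)

    def preprocess(self, image):
        # image: list of rows, each row a list of pixels (assumed layout).
        height, width = len(image), len(image[0])
        if min(height, width) < self.crop_size:
            raise ValueError("image is smaller than the crop")
        top = (height - self.crop_size) // 2
        left = (width - self.crop_size) // 2
        return [row[left:left + self.crop_size]
                for row in image[top:top + self.crop_size]]
```

The important invariant is that whatever preprocess returns has the shape that get_shape announces, since the dataset relies on it to build fixed-size batches.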


20 Replies to “Class project starts now!”

  1. Hey Vincent,
    Thanks for the code (and Kyle for the dataset)! Is there a validation set for this project? If not, it would be great if everybody used the same subset of the training set for validation, so we can compare our networks fairly.


    1. Regarding my previous comment: I had a look at the Kaggle competition page and they allow post-deadline submissions.

      The only downside is that people would have to create a verified account on Kaggle (i.e., tied to their phone number) in order to make submissions.

      I would be in favour of that option, but I still want to have everyone’s input to make sure they’re comfortable with it.

      We’ll discuss this in class on Thursday.


  2. There is no ‘official’ validation set for the project, but there is an official test set from the Kaggle competition. The labels aren’t publicly available and I need to check whether they still accept submissions.

    Either I’ll define an ‘official’ split within the 25,000 labeled examples we have, or everybody will submit to Kaggle to get their results on the official test set.

    I’ll come to class on Thursday with a definitive answer to that question.


    1. Congratulations on getting things up and running!

      Looking at your YAML file, you’re using the ‘sequential’ iteration mode. This, combined with your small batch size (8) and the small number of batches per epoch (10), means you’ll see the same 80 training and validation examples at each epoch. Maybe you just got unlucky with which 80 examples you’re stuck with?

      Try using ‘batchwise_shuffled_sequential’ instead: this will reshuffle the batch order at every epoch so you’re not stuck with the same 80 examples all the time.

      Also, keep in mind that by setting the ‘monitoring_batches’ parameter, you’re computing monitoring values on a subset of the training and validation examples. You should treat those values as rough estimates.


      1. Thank you for your answer. I am now running the same experiment with ‘batchwise_shuffled_sequential’.
        Is there any way to use a batch with more examples than fit in my GPU’s memory (2GB)? If not, I am limited to around 80, given my architecture.


  3. Hi Vincent, I’m running into problems on the LISA servers.

    I get this error when running any script using the train.h5 dataset …
    ValueError: The file ‘/Tmp/data/dogs_vs_cats/train.h5’ is already opened. Please close it before reopening. HDF5 v.1.8.5-patch1, FILE_OPEN_POLICY = ‘strict’

    I have tried deleting it and re-running the script, but it ends up giving me the same error.
    Could you advise, please?


            1. I’ve been able to train one epoch on barney0 using your convnet_1.yaml file. Can you consistently reproduce the issue? If so, email me and we’ll arrange a meeting so I can look at the problem in person.


  4. For those of us who aren’t members of LISA and who don’t have a GPU available for testing would it be possible to have a temporary LISA account to run the tests? Are there any other alternatives?

