Lecture 8, Feb. 2nd, 2015: Convolutional Neural Networks I

In this lecture, we will discuss the architecture responsible for virtually all the successes of deep learning in computer vision: the convolutional neural network.

Please study the following material in preparation for the class:

Other relevant material:


25 Replies to “Lecture 8, Feb. 2nd, 2015: Convolutional Neural Networks I”

  1. Hi,

    I’ve been working on convolutional neural nets and I have a question about the kernel matrices.

    In Hugo Larochelle’s tutorials (https://www.youtube.com/watch?v=aAT1t9p7ShM), he presents the following example:
    The input is an 83×83 image, and layer 1 has 64 feature maps with 75×75 hidden units per feature map.
    Let K_{ij} be the kernel matrix connecting the i-th input channel to the j-th feature map, and let X_i be the matrix of the i-th receptive field. We convolve K_{ij} with X_i.
    My question is:
    The kernel is the same everywhere within the j-th feature map, so why do we distinguish between K_{2j} and K_{3j}, for example? Why is there an index i? I am missing something ^^

    Thanks for your help, Pierre.


    1. If the input has a single channel, you are correct, but that is not always the case. For instance, RGB images have one channel per color, for a total of 3 channels. K_{11} would be the kernel that maps the R channel of the image to the first feature map.
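To make the role of the index i concrete, here is a minimal NumPy sketch of the example from the question. It assumes a "valid" convolution with stride 1, so the 9×9 kernel size is inferred from 83 - 75 + 1 = 9 rather than stated in the thread. It also uses cross-correlation (no kernel flip), as most convnet code does. The function names are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D cross-correlation of one channel x with one kernel k."""
    kh, kw = k.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(kh):
        for c in range(kw):
            out += k[r, c] * x[r:r + out_h, c:c + out_w]
    return out

def feature_maps(x, kernels):
    """x: (channels, H, W); kernels: (channels, maps, kh, kw).
    Feature map j is the sum over input channels i of conv(x[i], K[i, j])."""
    n_in, n_out = kernels.shape[:2]
    return np.stack([
        sum(conv2d_valid(x[i], kernels[i, j]) for i in range(n_in))
        for j in range(n_out)
    ])

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 83, 83))   # RGB input, 83x83
K = rng.standard_normal((3, 64, 9, 9))     # K[i, j]: channel i -> feature map j
maps = feature_maps(image, K)
print(maps.shape)  # (64, 75, 75)
```

Each feature map j has its own kernel for every input channel, which is why K carries both indices. With a grayscale input the first axis has length 1 and the index i becomes redundant.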


  2. Hinton’s talk is very interesting.

    I don’t think this kind of architectural change is being completely ignored. In the other lecture we talked about different kinds of activation functions (such as linearly mixing linear and tanh functions), and later we will also talk about new units (such as LSTM units). Christopher Dolan’s blog also discussed “KNN-inspired” output layers (which, like Hinton’s lecture, are very much inspired by geometry). These are all different kinds of “capsules” or “hidden units” that people are researching right now.

    But perhaps there are some obstacles to this kind of research. First, we are very focused on state-of-the-art results in machine learning. Anything short of that won’t get much attention, and hence people will not actively build upon or extend many of these models. Second, our GPUs are currently a very good fit for standard convolutional neural networks. In fact, some professors (including Roland from our lab) believe that even if we find better models than ConvNets, new models that cannot be implemented efficiently on GPUs will take many more years to gain ground, because our computations are so constrained by what GPUs can execute fast and because many companies and research centers have invested heavily in GPUs.

    On another note, I wonder if we can mimic Hinton’s central idea of linear coordinate transformation through the layers by simply adding skip-connections (and, of course, not having any pooling layers). What do you think about that?


  3. I didn’t understand the part about learning different kinds of invariance using the pooling operation. For example, in fig. 11.7, p. 207, how was the pooling performed to build rotation invariance? Was it across all the feature maps?
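One possible reading of that figure is pooling across feature maps rather than across spatial positions. The sketch below is only an illustration under that assumption: the filters are hand-built rotated copies of a hypothetical template, whereas in the book they would be learned. Taking the max over the orientation-specific responses gives a value that does not change when the input is rotated by a multiple of 90 degrees.

```python
import numpy as np

# Hypothetical "corner" pattern; not taken from the book.
template = np.array([[1.0, 1.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0]])

# One filter per orientation: the template rotated by 0, 90, 180, 270 degrees.
filters = [np.rot90(template, k) for k in range(4)]

def pooled_response(patch):
    """Max over the responses of all orientation-specific filters."""
    responses = [np.sum(f * patch) for f in filters]
    return max(responses)

patch = np.array([[0.9, 1.1, 1.0],
                  [1.0, 0.1, 0.0],
                  [0.8, 0.0, 0.2]])

# Rotating the input permutes the individual responses but not their max.
values = [pooled_response(np.rot90(patch, k)) for k in range(4)]
```

Each individual filter responds strongly to only one orientation; it is the pooling over the filters that makes the output insensitive to which orientation was presented.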


  4. Could the weight sharing of convnets be interpreted as simply a form of regularization rather than a “new architecture”?
    The way I see it, we just restrict the solutions of a fully connected neural net to a subset of function space.


    1. As Aaron said today, yes it can be seen as a form of regularization.

      The set of functions that can be represented by a network architecture includes the set of functions that can be represented by the same network with some of its weights shared.

      More precisely, if you consider the optimization problem solved during training, you can see weight sharing as equality constraints applied to the optimization of the original network without shared weights. If weight1 is forced to be equal to weight2, this is equivalent to looking for solutions on the hyperplane defined by the equation weight1 = weight2. With more constraints, you look for solutions located on the intersection of the corresponding hyperplanes. In other words, you are optimizing the cost function on a subspace of the original weight space.
      In practice, to solve this constrained optimization problem, you substitute the constraints into the cost function, which is equivalent to solving an unconstrained optimization problem with a new cost function over a smaller weight space.

      Note that L1, L2 and dropout regularization change the cost function, but the optimization is still carried out in the original weight space, so they are slightly different.
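The equality-constraint view can be checked on a small 1-D case. The sketch below, with a hypothetical helper `conv_as_dense`, builds the dense weight matrix of a fully connected layer whose entries are tied so that it computes a "valid" cross-correlation. It then compares the result with `np.correlate`.

```python
import numpy as np

def conv_as_dense(kernel, n_in):
    """Dense weight matrix W such that W @ x equals the 'valid'
    cross-correlation of x with kernel. Every row reuses the same
    len(kernel) weights; all other entries are fixed to zero."""
    k = len(kernel)
    n_out = n_in - k + 1
    W = np.zeros((n_out, n_in))
    for row in range(n_out):
        W[row, row:row + k] = kernel
    return W

kernel = np.array([1.0, -2.0, 0.5])
x = np.arange(8, dtype=float)
W = conv_as_dense(kernel, len(x))

dense_out = W @ x
conv_out = np.correlate(x, kernel, mode="valid")

free_params_dense = W.size        # 6 * 8 = 48 weights if left unconstrained
free_params_shared = kernel.size  # 3 weights once the entries are tied
```

The dense layer and the convolution produce the same output, but the constrained version searches over 3 parameters instead of 48.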


  5. Is there a particular reason for zero-padding the input instead of wrapping the kernel around the input, to prevent the representation from shrinking? The way I understand it, we’re essentially filtering the image with all kinds of filters. To be consistent with the Fourier representation of a filtering operation, shouldn’t the kernel be wrapped around the image when it reaches an edge?


    1. For natural images, I believe there are patterns in the world that would not be well served by wrapping. For instance, people usually take pictures with the sky at the top, so wrapping would place the sky right next to the ground.

      I think that the strategy used at the edges would not change the results much. However, my intuition might well be wrong.

      I would like to extend this question. Is there any study that analyzes different edge strategies and their impact on the results?


  6. Off the top of my head, I think a Fourier transform assumes that the sequence is periodic. In that case it makes sense to wrap the kernel around at the end points…

    But for convolutional neural networks, we do not assume that the input is periodic (i.e. that the image is repeated horizontally and vertically over and over). Hence we do the next best thing, which is to assume that the pixels outside the image provide zero incoming values. If, as people (including LeCun) sometimes suggest, we also add a unique bias for each filter at each position, then filters that fall partly outside the image can still adjust their influence appropriately (for example, a filter that needs its entire receptive field to produce a high-confidence feature could push its post-activation value to zero by adjusting the bias at positions where it overlaps the border).
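The difference between the two edge strategies can be seen on a tiny 1-D signal. This is a sketch with made-up values: `np.pad` with `mode="constant"` gives zero-padding, `mode="wrap"` gives the periodic extension, and the FFT-based circular correlation matches the wrapped version once the kernel is re-centred.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
k = np.array([1.0, 0.0, -1.0])   # width 3, so pad by 1 on each side

def correlate_same(signal, kernel, mode):
    """Pad with np.pad(mode=...) and take a 'valid' cross-correlation,
    so the output has the same length as the input."""
    pad = len(kernel) // 2
    padded = np.pad(signal, pad, mode=mode)
    return np.correlate(padded, kernel, mode="valid")

zero_padded = correlate_same(x, k, "constant")  # pixels outside are 0
wrapped = correlate_same(x, k, "wrap")          # signal treated as periodic

# Circular cross-correlation through the FFT (correlation theorem).
k_full = np.zeros_like(x)
k_full[:len(k)] = k
circ = np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(k_full))))
circ = np.roll(circ, len(k) // 2)               # re-centre the kernel

print(zero_padded)  # [-2. -2. -2. -2.  4.]
print(wrapped)      # [ 3. -2. -2. -2.  3.]
```

The interior values agree; only the outputs whose receptive field crosses the border differ, which is consistent with the intuition that the choice matters mostly near the edges.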


  7. As I understand it, what makes CNNs so good is local connectivity and weight sharing within a layer. When a fully connected contractive autoencoder (or a variant) is trained on natural images, it does learn local features (at least in the first layer). Yet even with CAE weight initialization, it still seems impossible to train a fully connected architecture that ends up as effective as a CNN on object recognition tasks. Is there any explanation for that? Assuming that weight sharing is the missing component, is there any way for fully connected layers to learn a weight sharing scheme? Intuitively, I would think that gated units might make this possible, though I am not sure how to regularize a model towards weight sharing. Does anyone know of work on learning how to share weights?


    1. You can check a recent paper, http://arxiv.org/abs/1409.0473. They use recurrent neural networks, which also rely on weight sharing. The work uses a so-called attention mechanism: an MLP that chooses which element of one sequence to use when generating an element of a second sequence. You can view it as learning shared parameters (the MLP returns weights, which can be shared or not).


  8. I believe that the number of hidden units in a feature map is another hyper-parameter. If so, does varying this quantity have the same effect on the complexity of the model as it does for a regular MLP?
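One way to approach this question is to count things. The sketch below uses the standard output-size formula for a convolution, plus the assumption of one bias per feature map. The helper names are hypothetical, and the 9×9 kernel is inferred from the 83×83 to 75×75 example in comment 1. The number of units per map follows from the input size, kernel size, stride and padding, while the number of learned parameters does not depend on it.

```python
def feature_map_side(input_side, kernel_side, stride=1, padding=0):
    """Side length of a feature map for a square input and kernel."""
    return (input_side + 2 * padding - kernel_side) // stride + 1

def conv_layer_params(in_channels, out_maps, kernel_side):
    """Weights plus one bias per feature map (an assumption of this sketch)."""
    return out_maps * (in_channels * kernel_side ** 2 + 1)

side = feature_map_side(83, 9)                 # 75
units = 64 * side * side                       # 360000 hidden units in the layer
params = conv_layer_params(1, 64, 9)           # 5248 parameters

# A larger input gives more hidden units per map but the same parameters.
side_large = feature_map_side(165, 9)          # 157
params_large = conv_layer_params(1, 64, 9)     # still 5248
```

In a regular MLP, adding hidden units adds weights; here the weights are shared across positions, so the two quantities are decoupled.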


  9. I’m curious if anyone has an idea about how much it (typically) hurts performance if we train a convnet with only one channel (for example by gray scaling images), or if we resize our images rather than zero padding them?


  10. Q1) As mentioned in Hugo’s tutorials, the convnet idea is based on 1. local connectivity, 2. parameter sharing, and 3. pooling/subsampling of hidden units. Is there any work showing the contribution of each of these to the improvement in results?

    Q2) [related to previous lectures] Why should networks have ‘static’ activation functions? Why don’t we add ‘learnable behaviour’ by adding parameters to the activation units? For instance, instead of using y=tanh(x), maybe we can use y=a*tanh(b0 + b1*x + … + b5*x^5) (or a potentially more complex function, or even a combination of such functions), where (a,b0,b1,…,b5) are per-unit parameters that the network can learn, just as it learns the weights and biases.
    An obvious drawback is that it is computationally more expensive. On the other hand, this increases the non-linearity of the functions that the network can represent; as a result, perhaps we could train shallower networks with the same capacity as deeper ones.
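As a minimal sketch of the activation proposed in Q2, the code below implements y = a*tanh(b0 + b1*x + … + b5*x^5) for a single unit together with its gradients with respect to a and b, which is what backpropagation would need in order to learn them. The function names and the initialization are assumptions made for illustration.

```python
import numpy as np

def poly_tanh(x, a, b):
    """y = a * tanh(b[0] + b[1]*x + ... + b[5]*x**5)."""
    powers = x[..., None] ** np.arange(len(b))   # shape (..., len(b))
    return a * np.tanh(powers @ b)

def poly_tanh_grads(x, a, b):
    """Gradients of y with respect to a and b, for each input value."""
    powers = x[..., None] ** np.arange(len(b))
    t = np.tanh(powers @ b)
    dy_da = t
    dy_db = a * (1.0 - t ** 2)[..., None] * powers
    return dy_da, dy_db

x = np.linspace(-1.0, 1.0, 5)
a = 1.0
b = np.array([0.0, 1.0, 0.0, 0.0, 0.0, 0.0])   # initialised so that y = tanh(x)

y = poly_tanh(x, a, b)
dy_da, dy_db = poly_tanh_grads(x, a, b)
```

With this initialization the unit starts out as a plain tanh, and the extra cost is visible directly: each unit carries 7 additional parameters and needs the powers of x in both the forward and backward pass.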


    1. My explanation for (Q2) would be that, at some point, you need to fix some things and train the others. If you add too much flexibility/capacity to the activation functions, maybe you are no longer really using a neural net properly.

      The whole point of the neural net was that you would be learning little features that get re-used by multiple hidden units. If your activation functions are made to learn complex functions themselves, then they are not really doing any parameter sharing.

      That’s not to say that it’s a *bad* idea to have some flexibility in the activation functions, or that it would not work, but maybe it’s just not all that practical.


  11. It is mentioned in the book (11.5) that the input image size can vary and that a convnet can adapt to that, but I do not understand how it works. If we have a smaller image, the convolution will not be computed on some units and we could put zeros there. But if the new image is larger, we do not have units to compute the new rows and columns. Does the book assume that we resample every image before feeding it to the convnet?
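One common way to handle this, sketched here under my own assumptions rather than as the book's exact procedure, is to note that the kernel itself has no fixed number of positions. It slides over whatever input it is given, so the feature map grows or shrinks with the image. A pooling step whose regions scale with the feature map can then produce a fixed-size output for the layers that follow. The helper `pool_to_grid` is hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """'Valid' 2-D cross-correlation; output size depends on input size."""
    kh, kw = k.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(kh):
        for c in range(kw):
            out += k[r, c] * x[r:r + out_h, c:c + out_w]
    return out

def pool_to_grid(fmap, grid=(2, 2)):
    """Max-pool over a grid of regions whose size scales with the
    feature map, so the output shape is always `grid`."""
    row_groups = np.array_split(np.arange(fmap.shape[0]), grid[0])
    col_groups = np.array_split(np.arange(fmap.shape[1]), grid[1])
    return np.array([[fmap[np.ix_(r, c)].max() for c in col_groups]
                     for r in row_groups])

rng = np.random.default_rng(0)
kernel = rng.standard_normal((3, 3))       # the same weights for every input
small = rng.standard_normal((20, 20))
large = rng.standard_normal((50, 70))

fmap_small = conv2d_valid(small, kernel)   # shape (18, 18)
fmap_large = conv2d_valid(large, kernel)   # shape (48, 68)
fixed_small = pool_to_grid(fmap_small)     # shape (2, 2)
fixed_large = pool_to_grid(fmap_large)     # shape (2, 2)
```

No resampling of the image is needed in this sketch: the same 9 weights are applied to both inputs, and only the pooling regions change size.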


  12. At the end of this lecture, I asked Aaron how it could be that randomly initialized feature maps (i.e. randomly initialized convolution weights) produce gradient-like filters in the first layer. It turns out that this is due to the repetition of the convolution combined with the (squared) pooling operation. Aaron also pointed out the paper:


