In this lecture, we will conclude our discussion of **convolutional neural networks**.

**Please study the following material in preparation for the class:**

- Visualizing and Understanding Convolutional Networks by Matthew D. Zeiler and Rob Fergus (feel free to skip section 3, **Training Details**, as it deals with material we have not yet seen, e.g. dropout training and momentum).
- Hinton's talk on What's Wrong with Convolutional Neural Networks.

**We will also be following up our discussion of the material of the last lecture on Convolutional Neural Networks I.**

Aaron,

I could not find any resource that shows the maths of projecting the second/third/.. convolution layers weights back to the image space.

However, many papers do plot them.

I have done the tensor maths for the projection (while ignoring the pooling).

Could you show how to do this on the board in class? Ideally, we should also take care of unpooling as well.

Thanks.
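For what it's worth, in the fully linear case (ignoring the relu and the pooling, as you suggest) the projection reduces to composing the kernels, since convolution is associative. A minimal 1-D numpy sketch with made-up filters:

```python
import numpy as np

# Hypothetical 1-D kernels for two successive conv layers (made-up numbers):
w1 = np.array([1.0, 1.0])                      # layer-1 filter
w2 = np.array([1.0, -1.0])                     # layer-2 filter
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # toy input "image"

# Forward through both layers (linear, stride 1, no pooling):
two_step = np.convolve(np.convolve(x, w1, mode="valid"), w2, mode="valid")

# Projecting the layer-2 weights to input space = composing the kernels:
w_composed = np.convolve(w1, w2, mode="full")
one_step = np.convolve(x, w_composed, mode="valid")

assert np.allclose(two_step, one_step)
```

The 2-D case works the same way, but unpooling breaks this exact equivalence, which is why the papers resort to the deconvnet approximation.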

The paper on visualizing convnets is very interesting. Oh, how I wish we had scripts in Pylearn2 to visualize something similar (or, at least, anything else than the simple weights)!

But I don’t entirely understand the construction of the deconvnet. Why do the authors choose to use relu (rectified linear) units? Shouldn’t the inverse of the relu unit be a negative relu? And how can we approximate the inverse of the filter (the weight matrix for a given hidden unit) with its transpose? That only holds for orthogonal matrices, or is there some approximation result I am missing?

Also, do you think the same procedure could be used with tanh units, replacing the relu in the deconvnet with the inverse of the tanh function? That might be useful for one of my other projects…

Wait, I think the inverse of the relu unit should be the identity function. But maybe that’s equivalent to the relu unit in their setup, since it only receives non-negative values as input anyway?
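A quick numpy check of this point: activations leaving a relu layer are non-negative, and on non-negative inputs relu is the identity, so applying it again in the deconvnet changes nothing (a toy sketch; unpooling only introduces zeros, which are also unaffected):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
a = relu(x)                  # activations leaving a relu layer are non-negative
assert np.all(a >= 0)

# On non-negative inputs, applying relu again changes nothing:
assert np.allclose(relu(a), a)
```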

That’s quite strange. What is the point of always being in the linear regime? The result would be the same as if we had an identity activation.

The relu is non-invertible because of its many-to-one mapping of negative inputs; it can only be approximated. I am not sure the use of relu is equivalent to the identity function in this case: why would the signals projected back to input space always be non-negative?

Does someone understand why the authors of the article use the transposed filter while inverting the convolution? Suppose we have a vertical edge detector in the convnet. After transposition it becomes a horizontal edge detector, which is quite far from the inverse. Am I missing something?

That’s exactly my second question above! Aaron?

Although they mention that in practice they flip each filter vertically and horizontally.
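A small numpy sketch of why the flip appears: if you unroll the convnet’s "convolution" (really a correlation, no flip) into a matrix W, then applying W.T is a full convolution with the kernel, i.e. a correlation with the flipped kernel (toy 1-D case, made-up numbers):

```python
import numpy as np

n = 6
k = np.array([1.0, -2.0, 3.0])
m = len(k)

# Unroll the convnet "convolution" (a correlation, no flip) into a matrix W:
#   y[i] = sum_j k[j] * x[i + j]
W = np.zeros((n - m + 1, n))
for i in range(n - m + 1):
    W[i, i:i + m] = k

x = np.arange(1.0, n + 1.0)
y = W @ x
assert np.allclose(y, np.correlate(x, k, mode="valid"))

# Applying W.T is a *full* convolution with k -- equivalently, a correlation
# with the flipped kernel. This is why the deconvnet flips each filter.
assert np.allclose(W.T @ y, np.convolve(y, k, mode="full"))
```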

I was wrong that the authors of the paper transpose the kernels of the convolutions. What they do is transpose the unrolled kernel matrix. Since the kernel is small, that matrix is almost orthogonal (if we consider regions that are far enough apart, the kernels don’t overlap). Moreover, the procedure chooses only one non-zero element during unpooling, which also makes the matrix closer to orthogonal.
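A toy numpy illustration of the non-overlapping case: if the stride equals the kernel size, the rows of the unrolled matrix have disjoint support, so W @ W.T is exactly diagonal. Real convnets use overlapping windows, so this is only an approximation:

```python
import numpy as np

k = np.array([1.0, 2.0, -1.0])           # a small 1-D kernel (made-up numbers)
m, stride, n_patches = len(k), len(k), 4
n = stride * n_patches                   # input length; patches don't overlap

# Unrolled convolution matrix with stride == kernel size:
W = np.zeros((n_patches, n))
for i in range(n_patches):
    W[i, i * stride:i * stride + m] = k

# Rows have disjoint support, so W @ W.T is diagonal (= ||k||^2 * I):
gram = W @ W.T
assert np.allclose(gram, np.dot(k, k) * np.eye(n_patches))
```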

Could you explain more why small matrices are almost orthogonal?

The paper was awesome!

How are the images in Fig. 2 obtained? The authors remark that they are not samples of the model. In particular, are these images part of the training or validation sets?

This is my understanding, I’m not sure if it’s correct:

First, do fprop through the trained network for some image, then use the activations of the layers and the deconv network to project back to input space. Since the procedure is not invertible (mainly because of the unpooling), you’ll get a similar image but not the same one. Furthermore, this is not sampling, because you needed an image to initialize the procedure.

There seems to be a little missing piece (although it might be implicit in your post). After fprop, you must select the neuron of interest and set all other ones in that layer to zero. Then backprop that neuron back to input space.
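Putting the pieces of this thread together, here is a minimal 1-D numpy sketch of the projection procedure as I understand it (toy filter and input, pooling switches included; not the authors’ actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)                      # toy input "image"
k = np.array([1.0, -1.0, 0.5])                  # toy filter

# --- forward pass through one conv layer ---
pre = np.correlate(x, k, mode="valid")          # convnet "convolution" (no flip)
act = np.maximum(pre, 0.0)                      # relu
pairs = act[: len(act) // 2 * 2].reshape(-1, 2) # max-pool, width 2
switches = pairs.argmax(axis=1)                 # remember where each max came from
pooled = pairs.max(axis=1)

# --- deconvnet projection of a single pooled unit back to input space ---
unit = int(pooled.argmax())                     # pick one neuron, zero the rest
top = np.zeros_like(pooled)
top[unit] = pooled[unit]

unpooled = np.zeros_like(act)                   # unpool using the switches
for i, s in enumerate(switches):
    unpooled[2 * i + s] = top[i]

rectified = np.maximum(unpooled, 0.0)           # relu (identity here: input >= 0)
projection = np.convolve(rectified, k, mode="full")  # transposed (flipped) filtering

assert projection.shape == x.shape
```

With several layers you would just repeat the unpool / relu / transposed-filter steps downward until reaching pixel space.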

Hinton’s talk was really interesting. He showed that back-propagation on MNIST with 60,000 labeled examples makes 1.6% error while his model makes the same error with only 25 examples!

I didn’t understand it completely but, here’s my summary and questions. Please correct me if I’m wrong.

Neural networks have too few levels of structure (neurons, layers, and the whole network).

MY QUESTION: What’s wrong with a few levels of structure? We can approximate any function, by the universal approximation theorem.

Based on the above-mentioned argument about “standard” neural networks, let’s use capsules. In a hierarchical structure, capsules are connected to the capsules in the layer below and look for a dense cluster. If the capsule finds a dense cluster, it outputs two things:

1. A high probability of the existence of an entity in the layer below.

2. The center of gravity of the cluster, which is the generalized pose of that entity.

Using this, we can easily handle noise, because noise doesn’t form dense clusters.

MY QUESTION:

Is the “center of gravity” like a mean?

As I understood it, the pose is like features. Right?

Now, ConvNets have a similar structure to capsules. So what’s wrong?

Convolution is good.

Pooling is wrong. Why? 4 arguments:

1. It’s not consistent with human perception of shapes, as in his tetrahedron puzzle.

Humans impose rectangular coordinate frames on objects. (That’s why a rotated map of Africa looks like Australia.)

2. It solves the wrong problem: we want equivariance, not invariance; disentangling rather than discarding. He means we are OK with changes in the values of hidden units, but not with changes in the final results.

3. It fails to use the underlying linear structure.

It does not make use of the “natural linear manifold” that perfectly handles the largest source of variance in images.

MY QUESTION:

What is “natural linear manifold”? Why “linear”?

4. Pooling is a poor way to do dynamic routing:

We need to route each part of the input to the neurons that know how to deal with it. Finding the best routing is equivalent to parsing the image.

He then states that both our brains and computer graphics deal with different viewpoints easily.

MY QUESTION:

How do they do that?

He also says that an object is not described by “one neuron” or “a set of coarse-coded neurons”, but by a “group of neurons”. So if you translate an object at the pixel level, you see a translation in the activations of the layer above. But since the next layer has a bigger receptive field, its output remains unchanged. (Just like a cellphone network: when you move, the signal changes; as you go further from one antenna, its signal gets weaker until you are covered by another antenna.) These are the definitions of “place-coded” and “rate-coded”.

As a result,

“low level place-coded equivariance in lower layers”

will be converted to

“high-level rate-coded equivariance in upper layers”.

He then states that, using extrapolation, we can recognize different viewpoints even without seeing such examples in the training set.

Problem: pixel space is highly non-linear, so we cannot extrapolate.

Solution: transform to a space in which the manifold is globally linear.

MY QUESTION:

This is one of the basic ideas of deep learning, isn’t it?

We have already learned how to back-propagate gradients through pooling (Hinton calls this routing): by choosing the most active neuron and propagating gradients through it.

A better way is weighting. Let’s send the information to all capsules and ask each of them: is this in a dense cluster for you, or is it an outlier? (Do you need more information to be sent to you?)
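A toy numpy sketch of the difference (my own illustration, not Hinton’s actual routing algorithm): max-style routing keeps a single winner, while agreement-based weighting takes the center of gravity of the dense cluster and is robust to an outlier vote:

```python
import numpy as np

# Votes from lower-level capsules for some pose parameter (made-up numbers):
votes = np.array([1.0, 1.1, 0.9, 1.05, 5.0])    # four agree, one outlier

# Agreement-based weighting: down-weight votes far from the crowd, then take
# the weighted mean ("center of gravity" of the dense cluster):
weights = np.exp(-np.abs(votes - np.median(votes)))
weights /= weights.sum()
center = weights @ votes

# The weighted center stays close to the dense cluster despite the outlier:
assert abs(center - 1.0) < 0.5
```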

Then he introduces his model based on what he said above.

Q1) What are the data augmentation tricks in speech/audio domain?

Q2) Section 4, ‘Convnet Visualization’, subsection ‘Feature Evolution during Training’ of the paper: “The lower layers of the model can be seen to converge within a few epochs.” Is this true for other deep networks as well?

Q1) I don’t know if there are any “official” tricks for speech and audio, but I’m quite sure that some of the following would work:

– Inverting the phase

– Slightly changing the equalization

– Adding a bit of reverb

– Adding a bit of white noise

I hope this helps 🙂
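A toy numpy sketch of a few of these (a synthetic sine standing in for a real clip; all parameters are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000)             # one second at 16 kHz
x = np.sin(2 * np.pi * 440.0 * t)            # stand-in "audio clip"

inverted = -x                                          # phase inversion
noisy = x + 0.005 * rng.standard_normal(x.shape)       # a bit of white noise
ir = np.exp(-np.arange(2000) / 300.0)                  # crude decaying impulse response
reverb = np.convolve(x, ir / ir.sum(), mode="same")    # a bit of "reverb"

# Every variant is a valid training example of the same length:
assert inverted.shape == noisy.shape == reverb.shape == x.shape
```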

Coming back to today’s class, and sorry for possibly being redundant with Mohammad’s comment, but can we say that the capsule’s idea is simply to factor out a representation of an object (or part of it) from its position on a manifold?

In what I think is a direct reference to Prof. Hinton’s pursuit of doing “inverse graphics”, there is this paper that came out recently from Josh Tenenbaum and friends.

http://arxiv.org/pdf/1503.03167.pdf
