Lecture 4 will start with the end of Vincent’s Theano tutorial from Lecture 3 (15-20 min).

Then we will have our first discussion session (i.e. flipped class). **Please study the following material in preparation for the discussion:**

- Hugo Larochelle’s video lectures 1.1 to 1.6.
- Chapter 6 of the Deep Learning textbook (sections 6.1 and 6.2)

**Please post questions for the discussion** (we will cover all comments on Lectures 1-4). We can also cover background material (by request):

- Chapters 1-5 of a book draft on Deep Learning, by Yoshua Bengio, Ian Goodfellow and Aaron Courville.


When would you use a tanh activation function vs. a sigmoid activation function in a network? I understand that tanh can give you inhibitory behaviour in a neuron, but it’s unclear to me in what type of network you would want to use this.


My understanding is that the inhibitory behaviour comes from negative weight values, not from your choice of activation function.


Yeah, I have the same kind of question: as far as I know, those non-linearities are there to prevent the network from becoming a “simple” linear model. So why do we use piecewise-linear layers (such as ReLU or HardTanh)? And especially, why do we initialize the weights of the previous layer so that we “hit” the non-linearity in its linear (or most linear) part?


To answer my own previous comment: activation functions such as ReLU add non-linearity to the network by “shutting down” some of their inputs. They perform a kind of selection on the input, and this is why the whole system is non-linear.
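
To illustrate the point above, here is a minimal sketch (weights chosen only for illustration, not from the lecture) showing that a tiny ReLU net is non-linear precisely because different inputs “shut down” different units:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical fixed weights: two hidden units on a scalar input,
# then a second layer that sums the hidden activations.
w1 = np.array([1.0, -1.0])
w2 = np.array([1.0, 1.0])

def net(x):
    h = relu(w1 * x)   # each unit is active only on one side of 0
    return w2 @ h

# A linear map would satisfy net(a + b) == net(a) + net(b); here it fails:
print(net(1.0) + net(-1.0))  # 2.0
print(net(1.0 + -1.0))       # 0.0
```

Each half-line of the input space activates a different subset of units, so the overall function is piecewise linear but not linear.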


Cesar,

Your question seems reasonable to me, but I’m not sure I understand your follow-up answer.

Actually, the most non-linear function is the step function, but we don’t use it because it’s non-differentiable. Instead we use sigmoid, tanh, or ReLU.

But I’m still wondering: why do we initialize the weights so that the units stay in the linear part?


A: it’s easier to initialize sigmoid units that way.


It is the other way around, I guess: tanh is centered around 0, so you can initialize the weights with zero mean. With sigmoid, you have to add a bias term. On the other hand, the output of the sigmoid lies in the range [0, 1], so you can interpret its outputs as probabilities.
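
A quick numeric check of the centering claim (illustrative only): with zero-mean inputs, tanh outputs are themselves roughly zero-mean, while sigmoid outputs are centered near 0.5, so a sigmoid layer feeds a positive-mean signal to the next layer unless a bias compensates.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)   # zero-mean inputs

tanh_out = np.tanh(x)
sigm_out = 1.0 / (1.0 + np.exp(-x))

print(tanh_out.mean())   # close to 0
print(sigm_out.mean())   # close to 0.5
```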


In section 6.2 of the deep learning book, it’s mentioned that if f were unrestricted (non-parametric), minimizing the expected value of the loss function over some data-generating distribution P(X, Y) yields f(x) = E[Y|X], the true conditional expectation of Y given X. Why does the function need to be non-parametric?


The reason the function needs to be non-parametric is that most parametric functions cannot take an arbitrary value at every point of an infinite input space. For a finite input space, however, this can hold for some parametric functions, e.g. those parametrized by one free value per possible input.
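
A small sketch of the finite-input case (an assumed example, not from the book): when f is free to take any value at each input, the mean-squared-error minimizer is just the empirical conditional mean of Y per input, i.e. f(x) ≈ E[Y | X = x].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 3, size=50_000)          # X takes 3 values: 0, 1, 2
y = 2.0 * x + rng.standard_normal(x.size)    # so E[Y | X = x] = 2x

# With no restriction on f, we can choose f's value at each x
# independently; minimizing squared error gives the per-input mean of y.
f = np.array([y[x == v].mean() for v in range(3)])
print(f)   # close to [0, 2, 4]
```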


It seems to come down to optimization efficiency: tanh is just a rescaled version of the sigmoid function, which matters especially when you are normalizing your data (which is quite important).

This might help:

Section 4.4, “The Sigmoid”, in “Efficient BackProp” by LeCun et al.:

http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf
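
The rescaling relation mentioned above can be checked numerically: tanh(x) = 2·sigmoid(2x) − 1, so the two differ only by an affine change of input and output scale.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 101)
# Maximum deviation between tanh and the rescaled sigmoid (numerically ~0):
print(np.max(np.abs(np.tanh(x) - (2.0 * sigmoid(2.0 * x) - 1.0))))
```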
