Lecture 4, Jan. 19, 2015

Lecture 4 will start with the end of Vincent’s Theano tutorial from Lecture 3 (15-20 min).

Then we will have our first discussion session (i.e., a flipped class). Please study the following material in preparation for the discussion:

Please post questions for the discussion (we will cover all comments on Lectures 1-4). We can also cover background material on request:

  • Chapters 1-5 of a book draft on Deep Learning, by Yoshua Bengio, Ian Goodfellow and Aaron Courville.




10 Replies to “Lecture 4, Jan. 19, 2015”

  1. When would you use a tanh activation function vs. a sigmoid activation function in a network? I understand that tanh can give you inhibitory behaviour in a neuron, but it’s unclear to me in what type of network you would want to use this.


    1. Yeah, I have the same kind of question: as far as I know, those non-linearities are there to prevent the network from becoming a “simple” linear model. So why do we use piecewise-linear layers (such as ReLU or HardTanh)? And especially, why do we initialize the weights of the previous layer so that we “hit” the non-linearity in its linear (or most linear) part?


      1. To answer my previous comment: activation functions such as ReLU add non-linearity to the network by “shutting down” some of their inputs. They perform a kind of selection on the input, and this is why the whole system is non-linear.
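        This can be checked directly. Here is a minimal NumPy sketch (the weight matrix and inputs are arbitrary toy values) showing that a ReLU layer, although piecewise linear, is not a linear map: different inputs switch different units on or off, so additivity fails.

        ```python
        import numpy as np

        def relu(x):
            # ReLU passes positive inputs through and "shuts down" negative ones
            return np.maximum(0.0, x)

        # a tiny one-layer "network": linear map followed by ReLU (toy weights)
        W = np.array([[1.0, -1.0],
                      [0.5,  2.0]])

        def layer(x):
            return relu(W @ x)

        a = np.array([1.0, -1.0])
        b = np.array([-1.0, 1.0])

        # A linear map f would satisfy f(a + b) == f(a) + f(b).
        print(layer(a + b))            # [0. 0.] — both units shut off
        print(layer(a) + layer(b))     # a different vector: additivity fails
        ```

        The selection effect is visible here: for input `a` the first unit is active, for input `b` the second one is, and for `a + b` neither is, which no single linear map could reproduce.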


      2. Cesar,
        Your question seems reasonable to me, but I’m not sure I understand your follow-up answer.
        Actually, the most extreme non-linearity is the step function, but we don’t use it because it’s non-differentiable; instead we use sigmoid, tanh, or ReLU.
        I’m still wondering, though, why we initialize the weights so that the units stay in the linear part?


      1. It is the other way around, I guess: tanh is centered around 0, so you can initialize the weights with zero mean; with sigmoid, you have to add a bias term. On the other hand, the output of the sigmoid is in the range (0, 1), so you can interpret its outputs as probabilities.
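        A quick NumPy sketch of this point (the input grid is arbitrary): tanh is zero-centered while sigmoid is centered around 0.5, and the two are just shifted/rescaled versions of each other via the identity sigmoid(x) = (tanh(x/2) + 1) / 2.

        ```python
        import numpy as np

        def sigmoid(x):
            return 1.0 / (1.0 + np.exp(-x))

        # a symmetric grid of inputs around 0
        x = np.linspace(-4, 4, 1001)

        # tanh outputs lie in (-1, 1) and are zero-centered for symmetric inputs
        print(np.tanh(x).mean())    # ~0.0
        # sigmoid outputs lie in (0, 1), centered around 0.5 for the same inputs
        print(sigmoid(x).mean())    # ~0.5

        # the two non-linearities differ only by shift and scale
        assert np.allclose(sigmoid(x), (np.tanh(x / 2) + 1) / 2)
        ```

        So the choice is mostly about the output range you want: zero-centered activations (tanh) vs. probability-like outputs (sigmoid).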


  2. In section 6.2 of the deep learning book, it’s mentioned that if f were unrestricted (non-parametric), minimizing the expected value of the loss function over some data-generating distribution P(X, Y) would yield f(x) = E[Y|X], the true conditional expectation of Y given X. Why does the function need to be non-parametric?


    1. The reason the function needs to be non-parametric is that most parametric families cannot take an arbitrary value at every point of an infinite input space. For a finite input space, however, it can hold for some parametric functions, e.g. ones parametrized directly by their values at every input.
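      A small simulation illustrates the claim from section 6.2. Since an unrestricted f can take any value at a given point x, minimizing the squared loss at that point amounts to choosing the best constant c for Y ~ P(Y | X = x); the sketch below (with an assumed toy distribution where E[Y|X=x] = 3.0) shows the minimizer lands on the sample mean.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)

      # Fix one input x and draw samples of Y ~ P(Y | X = x);
      # in this toy model the true conditional mean E[Y | X = x] is 3.0
      y = rng.normal(loc=3.0, scale=1.0, size=100_000)

      # With f unrestricted, f(x) can be any constant c at this point;
      # grid-search the c minimizing the empirical squared loss E[(Y - c)^2]
      cs = np.linspace(0.0, 6.0, 601)
      losses = np.array([np.mean((y - c) ** 2) for c in cs])
      best_c = cs[int(np.argmin(losses))]

      print(best_c, y.mean())  # both ≈ 3.0: the minimizer is the conditional mean
      ```

      A parametric f (say, a line in x) could not match E[Y|X=x] at every x of an infinite input space, which is exactly why the statement requires f to be unrestricted.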

