# First assignment: MLP on MNIST

As a starting point for the class, you should have a good enough understanding of Python and numpy to work through the basic task of classifying MNIST digits with a one-hidden-layer MLP.

The due date for the assignment is Thursday, January 15, 2015. You’re not required to hand in anything. However, it is highly recommended that you complete it, as some of the material that will be covered in the next few weeks will build on the assumption that you’re comfortable with this assignment.

### Refresher: MNIST

The MNIST dataset contains labeled images of handwritten digits. Images have size 28 x 28 pixels and are divided into a training set containing 60,000 examples and a test set containing 10,000 examples. Here are some examples:

You can download the dataset as follows:

wget http://deeplearning.net/data/mnist/mnist.pkl.gz


In this version of the dataset, the training set has further been split into a training set containing 50,000 examples and a validation set containing 10,000 examples.

Once unzipped and unpickled, the file contains a tuple with three elements for the training, validation and test set respectively. Each element is itself a tuple containing a $n \times 784$ matrix of examples and a $n$-dimensional vector of labels (where $n$ is the number of examples in this specific set). Features in the example matrix are floats in [0, 1] (0 being black, and 1 being white), and labels are integers ranging between 0 and 9.

You can load the data like so:

import cPickle
import gzip

with gzip.open('mnist.pkl.gz', 'rb') as f:
train_set, valid_set, test_set = cPickle.load(f)
# Whatever else needs to be done


### Refresher: MLP

A multilayer perceptron (MLP) produces a prediction according to the following equations:

${\mathbf{h} = \sigma (\mathbf{W} \mathbf{x} + \mathbf{b}) = \frac{1}{1 + \exp(-\mathbf{W} \mathbf{x} - \mathbf{b})}}$

$\mathbf{y} = \frac{\exp(\mathbf{V} \mathbf{h} + \mathbf{c})}{\sum_i \exp(\mathbf{V}_i \mathbf{h} + c_i)}$

Here, $\mathbf{y}$ is interpreted as a vector of probabilities, i.e. $y_k$ is the probability that example $\mathbf{x}$ belongs to class $k$ and is normalized such that $\sum_k y_k = 1$.

Given a one-hot encoded target $\mathbf{t}$ (i.e., $\mathbf{t} = [t_1, \ldots, t_C]$ where $C$ is the number of classes, $t_k \in \{0, 1\}$ and $\sum_k t_k = 1$), the loss function is definded as

${\mathcal{L} = \sum_k -t_k \log(y_k)}$

1. Using the backpropagation principle, find the derivatives of the loss function with respect to parameters $W_{ij}, b_j, V_{jk}, c_k$.
2. Using what you derived in step 1, find an expression for the derivative of the loss function with respect to parameter vectors and matrices $\mathbf{W}, \mathbf{b}, \mathbf{V}, \mathbf{c}$. This is a good thing to do, because numpy has abstractions to represent dot products between vectors, vector-matrix products, elementwise operations, etc. that rely on heavily optimized linear algebra libraries.  Coding up all these operations using for-loops would be both inefficient and tedious.
3. Implement the code necessary to compute gradients using the expressions derived in step 2 for a pair of feature and target vectors $\mathbf{x}$ and $\mathbf{t}$.