Before neural netowrk frameworks were released, there was no effective way to automatically run gradient descent, defined model layer dimension, or initialize weights. All these features are baked into Tensorflow and Pytorch.

In a world without any neural network frameworks, we need to understand the fundamentals of a fully connected layer (aka MLP, neural network).

The following example assumes that we use the MNIST dataset.

# fully connected layer

A matrix represents an image of a digit, and each cell is a pixel value. The image has dimension n by d.

Before passing through the activation function, we need to multiply the input by the weight and add the bias offset.

Mathmatically, it is represented like so

Wx^T + b

W: dh by d, where dh is the dimension of the hidden layer x: n by d b: dh by n

which yields a dh by n matrix.

The activation is applied per element of the matrix

```
def relu(pre_activation):
'''
Relu using only numpy.
'''
relu_output = np.zeros(pre_activation.shape)
relu_flat = relu_output.flatten()
for i, neuron in enumerate(pre_activation.flatten()):
if neuron > 0:
relu_flat[i] = neuron
relu_output = relu_flat.reshape(pre_activation.shape)
return relu_output
```

The output after activation retains the same input matrix dimenion, in this case dh by n

# weight initialization

If we initialized all the weights to 0, backpropogation will always be 0. There is nothing to learn.

Instead, a simple and effective initialization uses the hidden layer dimensions, and samples a matrix from a normal distribution, with the value bounded by the square root of $\frac{6}{sum of layer dimensions}$

```
def glorot_init(d, dh1, dh2, m):
dl_1 = np.sqrt((6/(d + dh1)))
W_1 = np.random.uniform((-1)*dl_1, dl_1, (dh1, d))
return W_1
```

The intuition behind this design is the forward propagation Wx^T + b will sum over every row in W, so having the layer’s dimension in the denominator will prevent the result Wx^T from exploding.

# Output

Typically, we use softmax function to essentially *cast* a set of values into a probability distribution.
The output will retain the shape of the input to softmax, and each value can be interpreted as a probability.
The index of the maximum probability is the index of the predicted category.

```
def softmax(pre_activation):
'''
Numerically stable because subtracting the max value makes bit overflow impossible,
we will only have non-positive values in the vector
'''
exps = np.exp(pre_activation - np.max(pre_activation))
return exps / np.sum(exps)
```

# Backpropagation

The probability of an image of handwritten “1” to be 1 should be 100%. We want to maximize this probability. Since we have the softmax representation from the forward propagation’s output, we want to maximize the softmax, aka we want softmax of the correct class to equal 1.

The following will require you thinking like a true mathematician. In order to maximize this probaility, we can minimize negative of the log-likelihood to achieve the same effect.

We don’t have a magically formula to find the weight values to minimize the negative log likehood. Instead, we modify the values so we move closer to the minimum log likelihood iteration by iteration. This can be done using gradient descent. For every iteration, we update the value of the weights by the value of this gradient. If you are interested in the derivations, visit this blog

```
def backprop(self, batch_data, batch_target):
'''
dimensions:
o_s: m x1
grad_oa : m x 1
hs: dh x 1
grad_w2: m x dh
grad_oa: m x n
grad_b2: m x n
grad_oa: m x n
W(2): m x dh
grad_hs: dh x n
grad_oa: m x n
grad_ha: dh x n
x : n x d
grad_W1: dh x d
grad_ha: dh x n
grad_b1: dh x n
'''
self.grad_oa = self.o_s - batch_target
# hidden layer 3
self.grad_W3 = np.outer(self.grad_oa, self.h_s2.T)
self.grad_b3 = self.grad_oa
self.grad_hs2 = np.dot(self.W_3.T, self.grad_oa)
h_a_stack2 = np.where(self.h_a2 > 0, 1, 0)
self.grad_ha2 = np.multiply(self.grad_hs2, h_a_stack2)
self.grad_W2 = np.outer(self.grad_ha2, self.h_s1.T)
self.grad_b2 = self.grad_ha2
self.grad_hs1 = np.dot(self.W_2.T , self.grad_ha2)
h_a_stack1 = np.where(self.h_a1 > 0, 1, 0)
self.grad_ha1 = np.multiply(self.grad_hs1, h_a_stack1)
# hidden layer 1
self.grad_W1 = np.outer(self.grad_ha1, batch_data)
self.grad_b1 = self.grad_ha1
```

After we obtained the gradients for each layer, we apply the *descent* by subtracting it from the weights. The learning rate serves as a control, and can be designed to be adaptive according to the shape of the loss function to accelerate or decelrate the descent.

```
def update_weights(self):
self.W_1 -= self.grad_W1 * self.learning_rate
self.W_2 -= self.grad_W2 * self.learning_rate
self.W_3 -= self.grad_W3 * self.learning_rate
self.b_1 -= self.grad_b1 * self.learning_rate
self.b_2 -= self.grad_b2 * self.learning_rate
self.b_3 -= self.grad_b3 * self.learning_rate
```

The update is performed for every single image (aka our input data) over a number of times. Each complete set is referred to as an epoch. The number of epoch is usually determined at run time using early stopping. There is no consensus on how many epochs to run, and more isn’t always better.

There is probably no occasion where we will actually write everything using numpy and pretend we don’t have tensorflow or pytorch. This post aims to provide some visbility into a simple neural network training works, and why the frameworks are appreciated in the ML community.