I have always been curious about how deep learning frameworks are created. I use Keras, TensorFlow, and PyTorch and they all are really good, but sometimes I feel like I am playing with a black box (in some frameworks I feel it more than in others) that hides its secrets.

If you feel the same way, this post is for you.

We are going to **create a deep learning framework using Numpy arrays** while we briefly study the theory of basic artificial neural networks. I won’t go into much detail with the theory, but you will find really good resources at the end of the post.

## Why create this from scratch?

Well, firstly, it is cool. I am a big fan of doing things just because it is cool to do them.

Secondly, you will learn a lot by implementing things from scratch. For instance, the** backpropagation algorithm can be a little tricky** when you first study it. A project like this may help you to better understand what is going on inside a neural network.

## Neural networks in a nutshell

This is a brief summary of an immensely big topic. If you are a visual learner, I recommend you to check Fast.ai, 3Blue1Brown or, if you speak Spanish, DotCSV.

You can also check the post From the neuron to the net, where aporras explains the basics of a neural network.

Neural networks are universal function generators, so a **neural network is just a function that maps an input into an output. **

$$ f:x \rightarrow y $$

The basic unit of a neural networks is the neuron, which is itself a linear regression

\( s \ (w, \theta) \) (usually called “weighted sum”) passed through a non-linear function like Sigmoid or Hyperbolic Tangent.

$$

y_{j} = \sigma \left( \sum_{j=1}^{n} w_{jk} \cdot x_{j} + \theta_{j} \right)

$$

Neurons are grouped in a higher topological level called layer. Basically, the way a layer is constructed, and how the layers of a net interact among themselves, determines the type of neural network you are working with.

In this case, we will create a framework to construct Linear (Dense in Keras) neural networks.

Each linear layer has its own weights and biases matrices. The weights

\( w_{jk} \) estimate the connection strength between 2 neurons while the biases \( \theta_{j} \) act as a threshold, being a measure of how active a neuron is.

## Gradient descent and backpropagation

It is time we learn **how neural networks are trained**. Typically, the learning consists of reducing the error of the cost function used to evaluate the network. A common cost function may be `Mean Squared Error`

:

$$

J = \frac{1}{n} \sum_{i=1}^{n} \left(Y_{i} – \hat Y_{i} \right)^2

$$

How can we tune our parameters to reduce the loss? Well, we can find what is the error of each parameter and update it according to that value. That is, we want to know how each parameter contributes to the final error.

To do that, maths provides us with a very useful tool: **derivatives**. Let’s say you want to know how much contribution a change in the weights parameters has on the final loss. We can do:

$$

\frac{\partial J}{\partial w}

$$

Now, taking into account that a neural network is just a function composition:

$$

J (y \ (s \ (w,\theta)))

$$

We can compute \( w\) gradient by applying the `chain rule`

. For example, the weights gradient of the last layer:

$$

\frac{\partial J}{\partial w} = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial s} \cdot \frac{\partial s}{\partial w}

$$

And the same for the biases:

$$

\frac{\partial J}{\partial \theta} = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial s} \cdot \frac{\partial s}{\partial \theta}

$$

If we want to compute the gradient for the previous layers we just have to repeat the process recursively:

$$

\frac{\partial J}{\partial w^{L-1}} = \frac{\partial J}{\partial y^{L}} \cdot \frac{\partial y^{L}}{\partial s^{L}} \cdot \frac{\partial s^{L}}{\partial y^{L-1}} \cdot \frac{\partial y^{L-1}}{\partial s^{L-1}} \cdot \frac{\partial s^{L-1}}{\partial w^{L-1}}

$$

Once we have the gradient, we can use it to optimize our parameters:

$$

w = w – \eta \cdot \nabla J(w, \theta)

$$

$$

\theta = \theta – \eta \cdot \nabla J(w, \theta)

$$

The \( \eta \) parameter is what we call “learning rate”, and determines the step size of each iteration. These slides from the UCL may help you understand **how changes in the learning rate value influence the learning process**. The smaller the size, the more steps you need.

There are other algorithms to optimize the parameters, but we will stick to the most basic one.

After backpropagating the error, we compute, again, the forward results of the net to check if the error has changed.

Each iteration over the dataset is called `epoch`

. With enough epochs, we will reduce the error and increase the accuracy of our predictions.

## Putting things together

To understand backpropagation, we need to calculate the derivatives we were talking about.

Starting with the loss function:

$$

\frac{\partial J}{\partial y} =

\frac{2}{n} \sum_{i=1}^{n} \left(Y_{i} – \hat Y_{i} \right)

$$

Regarding the activation function, we will only solve for the sigmoid case (ReLU is on the code):

$$

\frac{\partial y}{\partial s} = \sigma \cdot (1 – \sigma(s))

$$

Lastly, we have the weighted sum \( s \):

$$

\frac{\partial s}{\partial w} = x

$$

$$

\frac{\partial s}{\partial \theta } = 1

$$

We usually name \( \delta\) to

$$

\delta = \frac{\partial J}{\partial y} \cdot \frac{\partial y}{\partial s}

$$

Now, the algorithm for backpropagation can be written as follows:

Note that we have added the superscripts to clarify the inputs and outputs of each step.

## Give me the code!

Here it is! A neural network will be created using the `Model`

class.

class Model: def __init__(self): self.layers = [] self.loss = [] def add(self, layer): self.layers.append(layer) def predict(self, X): # Forward pass for i, _ in enumerate(self.layers): forward = self.layers[i].forward(X) X = forward return forward def train( self, X_train, Y_train, learning_rate, epochs, verbose=False ): for epoch in range(epochs): loss = self._run_epoch(X_train, Y_train, learning_rate) if verbose: if epoch % 50 == 0: print(f'Epoch: {epoch}. Loss: {loss}') def _run_epoch(self, X, Y, learning_rate): # Forward pass for i, _ in enumerate(self.layers): forward = self.layers[i].forward(input_val=X) X = forward # Compute loss and first gradient bce = BinaryCrossEntropy(forward, Y) error = bce.forward() gradient = bce.backward() self.loss.append(error) # Backpropagation for i, _ in reversed(list(enumerate(self.layers))): if self.layers[i].type != 'Linear': gradient = self.layers[i].backward(gradient) else: gradient, dW, dB = self.layers[i].backward(gradient) self.layers[i].optimize(dW, dB, learning_rate) return error

As you can see, the class `Model`

has 3 methods: `add`

, `train`

and `predict`

that allow us to control the network behaviour.

The private method `_run_epoch`

computes only one epoch. It does it by following the next procedure:

- Compute forward pass.
- Calculate error and gradient on the last layer.
- Backpropagates the gradient.

Notice that we don’t actually need the error in backpropagation, just the gradient. **We use the error to see how far we are from our objective**.

You will find the code for the classes below:

class Layer: """Layer abstract class""" def __init__(self): pass def __len__(self): pass def __str__(self): pass def forward(self): pass def backward(self): pass def optimize(self): pass class Linear(Layer): def __init__(self, input_dim, output_dim): self.weights = np.random.rand(output_dim, input_dim) self.biases = np.random.rand(output_dim, 1) self.type = 'Linear' def __str__(self): return f"{self.type} Layer" def forward(self, input_val): self._prev_acti = input_val return np.matmul(self.weights, input_val) + self.biases def backward(self, dA): dW = np.dot(dA, self._prev_acti.T) dB = dA.mean(axis=1, keepdims=True) delta = np.dot(self.weights.T, dA) return delta, dW, dB def optimize(self, dW, dB, rate): self.weights = self.weights - rate * dW self.biases = self.biases - rate * dB class ReLU(Layer): def __init__(self, output_dim): self.units = output_dim self.type = 'ReLU' def __str__(self): return f"{self.type} Layer" def forward(self, input_val): self._prev_acti = np.maximum(0, input_val) return self._prev_acti def backward(self, dJ): return dJ * np.heaviside(self._prev_acti, 0) class Sigmoid(Layer): def __init__(self, output_dim): self.units = output_dim self.type = 'Sigmoid' def __str__(self): return f"{self.type} Layer" def forward(self, input_val): self._prev_acti = 1 / (1 + np.exp(-input_val)) return self._prev_acti def backward(self, dJ): sig = self._prev_acti return dJ * sig * (1 - sig)

To calculate the error, we have a lot of options. Probably, the most basic one is the Mean Squared Error we saw earlier. I have added another one called **Binary Cross-Entropy **(the one that is in the code) because we will test our model using the latter in the following sections.

class MeanSquaredError(Layer): def __init__(self, predicted, real): self.predicted = predicted self.real = real self.type = 'Mean Squared Error' def forward(self): return np.power(self.predicted - self.real, 2).mean() def backward(self): return 2 * (self.predicted - self.real).mean() class BinaryCrossEntropy(Layer): def __init__(self, predicted, real): self.real = real self.predicted = predicted self.type = 'Binary Cross-Entropy' def forward(self): n = len(self.real) loss = np.nansum(-self.real * np.log(self.predicted) - (1 - self.real) * np.log(1 - self.predicted)) / n return np.squeeze(loss) def backward(self): n = len(self.real) return (-(self.real / self.predicted) + ((1 - self.real) / (1 - self.predicted))) / n

The layers can compute in 2 directions: forward and backward. **This is an inherited behaviour from the **

design, and it makes computationally easier to calculate the derivatives. In fact, we could have split the Linear layer into “multiply and “add” classes, as TensorFlow does it.**computational graphs**

The weights and biases are initialized using a uniform distribution. There are other ways to initialize these parameters, like kaiming initialization.

The forward pass of a linear layer just computes the formula of a neuron we saw previously. The backward pass is a little trickier to understand: once we compute the gradient on the last layer, we backpropagate it by multiplying the corresponding derivatives of the actual layer with the incoming gradient of the following layer.

In the linear layer, we need to calculate the weights and biases gradients too.

The `optimize`

method updates the weights and biases parameters with the local gradient of the layer if it is of linear type.

Christopher Olah’s post contains more information on computing derivatives using computational graphs.

## The results

To check the results we will generate a dataset for binary classification using `sklearn`

and a little help from `pandas`

:

def generate_data(samples, shape_type='circles', noise=0.05): # We import in the method for the shake of simplicity import matplotlib import pandas as pd from matplotlib import pyplot as plt from sklearn.datasets import make_moons, make_circles if shape_type is 'moons': X, Y = make_moons(n_samples=samples, noise=noise) elif shape_type is 'circles': X, Y = make_circles(n_samples=samples, noise=noise) else: raise ValueError(f"The introduced shape {shape_type} is not valid. Please use 'moons' or 'circles' ") data = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=Y)) return data def plot_generated_data(data): ax = data.plot.scatter(x='x', y='y', figsize=(16,12), color=data['label'], cmap=matplotlib.colors.ListedColormap(['skyblue', 'salmon']), grid=True); return ax

The resulting data is shown in the following picture:

data = generate_data(samples=5000, shape_type='circles', noise=0.04) plot_generated_data(data);

The creation and addition of layers to the model is very straightforward because it works pretty much the same as in `Keras`

. Below you will find the code to create and train a classification model:

X = data[['x', 'y']].values Y = data['label'].T.values # Create model model = Model() # Add layers model.add(Linear(2, 5)) model.add(ReLU(5)) model.add(Linear(5,2)) model.add(ReLU(2)) model.add(Linear(2,1)) model.add(Sigmoid(1)) # Train model model.train(X_train = X.T, Y_train = Y, learning_rate = 0.05, epochs=9000, verbose=True)

After training, we can plot the loss of the model:

plt.figure(figsize=(17,10)) plt.plot(model.loss)

The loss curve is not ideal, but is good enough for our purposes.

Using a slightly modified version of this code, we can also visualize the decision boundary of our model:

Everything seems to work reasonably well, although we need a high number of epochs to converge. This is probably due to a lack of optimization when compared with other professional Frameworks like PyTorch.

The last test we can perform is to use a metric to test the classification result. I chose roc auc score.

from sklearn.metrics import roc_auc_score # Make predictions predictions = model.predict(X.T).T # Format the predictions new_pred = [] for p in predictions: if p < 0.5: new_pred.append(0) else: new_pred.append(1) # Calculate the score roc_auc_score(y_true=Y, y_score=new_pred)

On average in 10 different computations (with the same data and model configuration), the accuracy is 0.8061, which I consider a success.

## Conclusions

We have seen how neural networks work in a nutshell, we have also learned **how to create a really basic deep learning framework** that we can use to test our knowledge about the topic and to play around.

Of course, we left out of this post a lot of interesting theories: regularization, batch size, recurrent nets, overfitting, cross-validation, etc. Maybe, in another post, we will cover some of these topics; but now, you know the basics to keep researching on your own.

To go even deeper into the code, you can go to Github and check this repository, with the DOC string I omitted in this article and more activation functions to play with.

I hope you find this post useful. If you have any doubt, do not hesitate to leave a message below.

## References

- Sebastian Ruder – An overview of gradient descent optimization algorithms.
- Chris Olah – Calculus on Computational Graphs: Backpropagation.
- 3Blue1Brown – Neural Networks.
- DotCSV – Aprendiendo Inteligencia Artificial.
- Fast.ai – Making neural nets uncool again.
- Justin Johnson – Backpropagation for a linear layer.
- Pedro Almagro Blanco – Algoritmo de Retropropagación.
- Terence Parr and Jeremy Howard – The matrix calculus you need for Deep Learning.
- Fei-Fei Li, Justin Johnson & Serena Yeung – Lecture 4: Neural Networks and Backpropagation.