Understanding Neural Networks (with Graphs)

Pablo Leo



Artificial Neural Networks (ANNs) have been applied successfully to many everyday tasks that used to need human supervision, but due to their complexity, it is hard to understand how they work and how they are trained.

Throughout this blog, we have talked in depth about what Neural Networks are, how they work, and how to apply them to problems such as finding outliers or forecasting financial time series.

In this post, I try to show visually how a simple Feedforward Neural Network maps a set of inputs into a different space during its training process, so that these networks can be understood more easily.


To show how it works, I first create a ‘toy’ dataset. It contains 400 samples split equally between two classes (0 and 1), each sample having two dimensions (X0 and X1).

Figure: Synthetic 'toy' set generated from normal distributions

NOTE: All the data comes from three random normal distributions with means [-1, 0, 1] and standard deviations [0.5, 0.5, 0.5].
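A dataset like this can be sketched with NumPy. The post does not say how the three distributions are assigned to the two classes, so the split below is only an assumption made for illustration: class 0 is drawn from the distribution centred at 0, and class 1 is split between the ones centred at -1 and 1. The function name `make_toy_dataset` and the seed are also hypothetical.

```python
import numpy as np


def make_toy_dataset(n_samples=400, std=0.5, seed=0):
    """Return (X, y): X has shape (n_samples, 2), y holds the labels 0 or 1."""
    rng = np.random.default_rng(seed)
    half = n_samples // 2
    quarter = half // 2

    # ASSUMPTION: class 0 comes from the normal distribution with mean 0.
    class0 = rng.normal(loc=0.0, scale=std, size=(half, 2))

    # ASSUMPTION: class 1 is split between the distributions with means -1 and 1.
    class1_left = rng.normal(loc=-1.0, scale=std, size=(quarter, 2))
    class1_right = rng.normal(loc=1.0, scale=std, size=(half - quarter, 2))

    X = np.vstack([class0, class1_left, class1_right])
    y = np.concatenate([np.zeros(half, dtype=int), np.ones(half, dtype=int)])
    return X, y


X, y = make_toy_dataset()
```

Each row of `X` is one sample (X0, X1), and `y` holds 200 zeros followed by 200 ones; shuffle both with the same permutation before training in batches.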

Network Architecture

The next step is to define the structure of the ANN, which is the following:

Figure: ANN Architecture

The dimension of the hidden layer is minimal (2 neurons) to show where the network maps each sample in a 2D scatterplot.

Although the previous graph does not show it, each layer has an activation function that modifies its output.

  • The input layer has a linear activation function that copies its input value.
\[a(x) = x\]
  • The hidden layer has a ReLU or a tanh activation function.
\[a(x) = \max(0,x) \quad \quad a(x) = \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}\]
  • The output layer has a sigmoid activation function that ‘shrinks’ its input value to the range [0, 1].
\[a(x) = \frac{1}{1+e^{-x}}\]
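The activation functions above, and the way they chain together in a forward pass, can be written in a few lines of NumPy. This is a minimal sketch rather than the code behind the figures: the weight names (`W1`, `b1`, `W2`, `b2`), the random initialisation, and the single output neuron are assumptions.

```python
import numpy as np


def linear(x):
    # Input layer: copies its input value.
    return x


def relu(x):
    return np.maximum(0.0, x)


def tanh(x):
    return np.tanh(x)


def sigmoid(x):
    # Output layer: squashes its input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))


def forward(X, W1, b1, W2, b2, hidden_activation=relu):
    """Forward pass of a 2-2-1 network (output size 1 is an assumption)."""
    inputs = linear(X)                             # shape (n, 2)
    hidden = hidden_activation(inputs @ W1 + b1)   # shape (n, 2)
    prob = sigmoid(hidden @ W2 + b2)               # shape (n, 1)
    return hidden, prob


# Hypothetical weights, only to show the shapes involved.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)), np.zeros(1)

X_demo = np.array([[0.0, 0.0], [1.0, -1.0], [-1.0, 1.0]])
hidden, prob = forward(X_demo, W1, b1, W2, b2, hidden_activation=relu)
```

The `hidden` array is what makes a two-neuron hidden layer convenient: each row is a 2D point, so the position the network gives to every sample can be drawn directly in a scatterplot.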


Besides the architecture of the network, another key aspect of a Neural Network is the training process. There are many ways of training an ANN, but the most common one is backpropagation.

The backpropagation process starts by feeding all the training cases (or a batch) forward through the network. Afterward, the gradient of a loss function with respect to each weight is computed, and an optimizer uses it to decide how to update the weights, scaling the update by a learning rate (the higher this value, the more abrupt the updates).

The training process stops when the loss converges, a certain number of epochs has elapsed or the user stops it.
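To make the loop concrete, here is a minimal sketch of full-batch training for a 2-2-1 network with a tanh hidden layer. To keep it short it uses plain gradient descent rather than a more elaborate optimizer, and it stops on either of two criteria: a maximum number of epochs or a loss target. The function name `train`, the initialisation scale, and the demo data are assumptions for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def binary_cross_entropy(y, p, eps=1e-12):
    p = np.clip(p, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def train(X, y, learning_rate=0.1, max_epochs=200, loss_target=None, seed=0):
    rng = np.random.default_rng(seed)
    W1, b1 = rng.normal(scale=0.5, size=(2, 2)), np.zeros(2)
    W2, b2 = rng.normal(scale=0.5, size=(2, 1)), np.zeros(1)
    y = y.reshape(-1, 1).astype(float)
    history = []

    for epoch in range(max_epochs):
        # Forward pass.
        h = np.tanh(X @ W1 + b1)
        p = sigmoid(h @ W2 + b2)
        loss = binary_cross_entropy(y, p)
        history.append(loss)
        if loss_target is not None and loss < loss_target:
            break  # stop once the loss is low enough

        # Backward pass: gradients of the mean loss.
        dz2 = (p - y) / len(X)        # w.r.t. the output pre-activation
        dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * (1.0 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
        dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

        # Plain gradient descent update, scaled by the learning rate.
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2

    return (W1, b1, W2, b2), history


# Hypothetical demo data: two well-separated blobs.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(-1.0, 0.5, size=(50, 2)),
                    rng.normal(1.0, 0.5, size=(50, 2))])
y_demo = np.concatenate([np.zeros(50), np.ones(50)])
params, history = train(X_demo, y_demo, learning_rate=0.1, max_epochs=200)
```

`history` stores the loss at every epoch, so plotting it shows how a larger `learning_rate` makes the curve drop faster but less smoothly.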

In our case study, the architecture is trained using 2 different activation functions in the hidden layer (ReLU and Tanh) and 3 different learning rates (0.1, 0.01, and 0.001).

Around the input samples, there is a ‘mesh’ of points that shows the prediction probability that the model assigns to a sample at that position. This makes the decision boundaries that the model generates during training clearer, yielding these awesome GIFs.
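One way to build such a mesh is to evaluate the model on a regular grid that covers the samples. The sketch below assumes a hypothetical `predict_proba` callable that maps an array of shape (n, 2) to n probabilities; the padding and grid resolution are arbitrary choices, and a fixed sigmoid stands in for the trained network.

```python
import numpy as np


def prediction_mesh(predict_proba, X, padding=0.5, steps=50):
    """Evaluate predict_proba on a steps x steps grid surrounding the samples in X."""
    x0 = np.linspace(X[:, 0].min() - padding, X[:, 0].max() + padding, steps)
    x1 = np.linspace(X[:, 1].min() - padding, X[:, 1].max() + padding, steps)
    g0, g1 = np.meshgrid(x0, x1)

    # Flatten the grid into a list of (X0, X1) points, predict, and reshape back.
    grid_points = np.column_stack([g0.ravel(), g1.ravel()])
    probs = np.asarray(predict_proba(grid_points)).reshape(g0.shape)
    return g0, g1, probs


def demo_model(points):
    # Stand-in for a trained network: a fixed sigmoid of X0 + X1.
    return 1.0 / (1.0 + np.exp(-(points[:, 0] + points[:, 1])))


X_demo = np.array([[-1.0, -1.0], [0.0, 0.0], [1.0, 1.0]])
g0, g1, probs = prediction_mesh(demo_model, X_demo)
```

The three returned arrays have the same shape, so they can be passed directly to a filled contour plot (for example matplotlib's `contourf`) and redrawn after each epoch to obtain one animation frame.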

ReLU activation

Figure: Training evolution | Activation: ReLU | Learning rate: 0.1
Figure: Training evolution | Activation: ReLU | Learning rate: 0.01
Figure: Training evolution | Activation: ReLU | Learning rate: 0.001

Tanh activation

Figure: Training evolution | Activation: Tanh | Learning rate: 0.1
Figure: Training evolution | Activation: Tanh | Learning rate: 0.01
Figure: Training evolution | Activation: Tanh | Learning rate: 0.001

NOTE: The loss function is binary cross-entropy, because we are dealing with a binary classification problem, and the optimizer is Adam, a modification of the original Stochastic Gradient Descent (SGD). Training stops when it reaches epoch 200 or the loss falls below 0.263.
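For reference, a single Adam update can be sketched as follows. Adam keeps running averages of the gradient and of its square, and uses them to rescale each step. The values of `beta1`, `beta2`, and `eps` below are the commonly used defaults, not values stated in this post, and the function name `adam_step` and the one-dimensional demo problem are only illustrative.

```python
import numpy as np


def adam_step(w, grad, m, v, t, learning_rate=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the step counter, starting at 1."""
    m = beta1 * m + (1.0 - beta1) * grad        # running mean of the gradient
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # running mean of its square
    m_hat = m / (1.0 - beta1 ** t)              # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    w = w - learning_rate * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


# Demo: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = np.array([0.0])
m, v = np.zeros_like(w), np.zeros_like(w)
trajectory = []
for t in range(1, 301):
    grad = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, grad, m, v, t, learning_rate=0.1)
    trajectory.append(float(w[0]))
```

Note that the very first step has a size of almost exactly `learning_rate` regardless of the magnitude of the gradient, which is one reason the learning rate has such a visible effect on how abruptly the boundaries move early in training.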

The code to generate the graphs is publicly available at the following link:

Thanks for your attention!