Artificial Neural Networks (**ANNs**) have been applied with success to many everyday tasks that used to need human supervision, but because of their complexity, it is hard to understand how they work and how they are trained.

Throughout this blog, we have talked in depth about what Neural Networks are, how they work, and how to apply them to problems such as finding outliers or forecasting financial time series.

In this post, I try to show **visually** how a simple Feedforward Neural Network maps a set of inputs into a different space during its training process, so that the process can be more easily understood.

## Data

To show how it works, I first create a ‘toy’ dataset. It contains 400 samples equally distributed between two classes (0 and 1), each sample having two dimensions (X0 and X1).

**NOTE:** All the data comes from three random normal distributions with means [-1, 0, 1] and standard deviations [0.5, 0.5, 0.5].
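A minimal sketch of how such a dataset could be generated with NumPy. The post does not specify which distribution feeds which class, so the mapping below (class 0 from the central distribution, class 1 from the two outer ones) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 400
std = 0.5  # all three distributions share the same standard deviation

# Assumption: class 0 is drawn from the central distribution (mean 0) and
# class 1 from the two outer ones (means -1 and 1); the post only states
# that the data comes from three normal distributions.
X_class0 = rng.normal(0.0, std, size=(n // 2, 2))   # 200 samples, class 0
X_class1a = rng.normal(-1.0, std, size=(n // 4, 2)) # 100 samples, class 1
X_class1b = rng.normal(1.0, std, size=(n // 4, 2))  # 100 samples, class 1

X = np.vstack([X_class0, X_class1a, X_class1b])     # shape (400, 2): X0, X1
y = np.array([0] * (n // 2) + [1] * (n // 2))       # balanced labels
```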

## Network Architecture

The next step is to define the structure of the ANN, which is the following:

The dimension of the hidden layer is minimal (2 neurons) so that the point the network maps each sample to can be shown in a 2D scatterplot.

Although the previous graph does not show it, each layer has an **activation function** that modifies its output.

- The input layer has a **linear** activation function that copies its input value.
- The hidden layer has a **ReLU** or a **tanh** activation function.
- The output layer has a **sigmoid** activation function that ‘shrinks’ its input value to the range [0, 1].

## Training

Besides the architecture of the network, another key aspect of a Neural Network is the training process. There are many ways of training an ANN, but the most common one is **backpropagation**.

The backpropagation process starts by feeding all the training cases (or a batch of them) forward through the network. An optimizer then calculates ‘how’ to update the weights of the network according to a **loss function**, and applies those updates scaled by a **learning rate** (the higher this value, the more abrupt the updates).

The training process stops when the loss converges, when a certain number of epochs has elapsed, or when the user stops it.
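One backpropagation step for this 2-2-1 network can be sketched as follows. This is a simplified sketch using plain gradient descent with a ReLU hidden layer (the actual experiments use the Adam optimizer; all names here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, params, lr=0.1):
    """One backpropagation step minimising binary cross-entropy
    with plain gradient descent (Adam is used in the post)."""
    W1, b1, W2, b2 = params
    n = len(X)

    # Forward pass
    Z1 = X @ W1 + b1
    H = relu(Z1)
    p = sigmoid(H @ W2 + b2)

    # Binary cross-entropy loss (eps avoids log(0))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Backward pass: with sigmoid output + cross-entropy, dL/dz_out = p - y
    d_out = (p - y) / n          # shape (n, 1)
    dW2 = H.T @ d_out
    db2 = d_out.sum(axis=0)
    dH = d_out @ W2.T
    dZ1 = dH * (Z1 > 0)          # ReLU derivative: 1 where Z1 > 0, else 0
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)

    # Update each weight in the direction that reduces the loss,
    # scaled by the learning rate.
    new_params = (W1 - lr * dW1, b1 - lr * db1,
                  W2 - lr * dW2, b2 - lr * db2)
    return new_params, loss
```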

In our case study, the architecture is trained using 2 different activation functions in the hidden layer (**ReLU** and **tanh**) and 3 different learning rates (**0.1**, **0.01**, and **0.001**).

Around the input samples there is a ‘mesh’ of points showing the prediction probability the model assigns to a sample at that position. This makes the frontiers the model generates along the training process clearer, yielding these awesome GIFs.
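Such a mesh can be sketched with NumPy's `meshgrid`: a grid of points covering the input space is built and the model's predicted probability is evaluated at each one, which, plotted as a background colour map, reveals the decision frontier. Here `predict_proba` is a placeholder for whatever function returns the model's class-1 probability, and the grid ranges are illustrative:

```python
import numpy as np

def prediction_mesh(predict_proba, x_range=(-2.5, 2.5),
                    y_range=(-2.5, 2.5), step=0.05):
    # Build a regular grid of 2D points covering the input space.
    xs = np.arange(*x_range, step)
    ys = np.arange(*y_range, step)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]           # shape (n_points, 2)
    # Evaluate the model's class-1 probability at every grid point.
    probs = predict_proba(grid).reshape(xx.shape)
    return xx, yy, probs
```

With matplotlib, `plt.contourf(xx, yy, probs)` would then draw the probability background behind the sample scatterplot.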

### ReLU activation

### Tanh activation

**NOTE**: The loss function used is **binary cross-entropy**, because we are dealing with a binary classification problem, and the optimizer is a modification of the original **Stochastic Gradient Descent** (SGD) called **Adam**. Training stops when it reaches **epoch** 200 or when the **loss** drops below 0.263.

The code used to generate the graphs is publicly available at the following link:

Thanks for your attention!