Artificial Neural Networks (**ANNs**) have been applied with success to many everyday tasks that used to need human supervision, but because of their complexity, it is hard to understand how they work and how they are trained.

Throughout this blog, we have talked in depth about what Neural Networks are, how they work, and how to apply them to problems such as finding outliers or forecasting financial time series.

In this post, I try to show **visually** how a simple Feedforward Neural Network maps a set of inputs into a different space during its training process, so that the process can be more easily understood.

## Data

To show how it works, I first create a ‘toy’ dataset. It contains 400 samples equally distributed between two classes (0 and 1), each sample having two dimensions (X0 and X1).

**NOTE:** All the data comes from three random normal distributions with means [-1, 0, 1] and standard deviations [0.5, 0.5, 0.5].
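A minimal sketch of how such a dataset could be generated with NumPy. The post does not specify which distribution feeds which class, so the mapping below (class 0 from the central distribution, class 1 from the two outer ones) is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

n = 400
std = 0.5  # all three distributions share the same standard deviation

# Assumption: class 0 is drawn from the central distribution (mean 0) and
# class 1 from the two outer ones (means -1 and 1); the post only states
# that the data comes from three normal distributions.
X_class0 = rng.normal(0.0, std, size=(n // 2, 2))   # 200 samples, class 0
X_class1a = rng.normal(-1.0, std, size=(n // 4, 2)) # 100 samples, class 1
X_class1b = rng.normal(1.0, std, size=(n // 4, 2))  # 100 samples, class 1

X = np.vstack([X_class0, X_class1a, X_class1b])     # shape (400, 2): X0, X1
y = np.array([0] * (n // 2) + [1] * (n // 2))       # balanced labels
```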

## Network Architecture

The next step is to define the structure of the ANN, which is the following:

The dimension of the hidden layer is minimal (2 neurons) so that the point the network maps each sample to can be shown in a 2D scatterplot.

Although the previous graph does not show it, each layer has an **activation function** that modifies its output.

- The input layer has a **linear** activation function that copies its input value.
- The hidden layer has a **ReLU** or a **tanh** activation function.
- The output layer has a **sigmoid** activation function that ‘shrinks’ its input value to the range [0, 1].

## Training

Besides the architecture of the network, another key aspect of a Neural Network is the training process. There are many ways of training an ANN, but the most common one is **backpropagation**.

The backpropagation process starts by feeding all the training cases (or a batch of them) forward through the network. An optimizer then calculates ‘how’ to update the weights of the network according to a **loss function**, and applies those updates scaled by a **learning rate** (the higher this value, the more abrupt the updates).

The training process stops when the loss converges, when a certain number of epochs has elapsed, or when the user stops it.
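One backpropagation step for this 2-2-1 network can be sketched as follows. This is a simplified sketch using plain gradient descent with a ReLU hidden layer (the actual experiments use the Adam optimizer; all names here are illustrative):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, params, lr=0.1):
    """One backpropagation step minimising binary cross-entropy
    with plain gradient descent (Adam is used in the post)."""
    W1, b1, W2, b2 = params
    n = len(X)

    # Forward pass
    Z1 = X @ W1 + b1
    H = relu(Z1)
    p = sigmoid(H @ W2 + b2)

    # Binary cross-entropy loss (eps avoids log(0))
    eps = 1e-12
    loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

    # Backward pass: with sigmoid output + cross-entropy, dL/dz_out = p - y
    d_out = (p - y) / n          # shape (n, 1)
    dW2 = H.T @ d_out
    db2 = d_out.sum(axis=0)
    dH = d_out @ W2.T
    dZ1 = dH * (Z1 > 0)          # ReLU derivative: 1 where Z1 > 0, else 0
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)

    # Update each weight in the direction that reduces the loss,
    # scaled by the learning rate.
    new_params = (W1 - lr * dW1, b1 - lr * db1,
                  W2 - lr * dW2, b2 - lr * db2)
    return new_params, loss
```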

In our case study, the architecture is trained using 2 different activation functions in the hidden layer (**ReLU** and **tanh**) and 3 different learning rates (**0.1**, **0.01**, and **0.001**).

Around the input samples there is a ‘mesh’ of points showing the prediction probability the model assigns to a sample at that position. This makes the frontiers the model generates along the training process clearer, yielding these awesome GIFs.
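Such a mesh can be sketched with NumPy's `meshgrid`: a grid of points covering the input space is built and the model's predicted probability is evaluated at each one, which, plotted as a background colour map, reveals the decision frontier. Here `predict_proba` is a placeholder for whatever function returns the model's class-1 probability, and the grid ranges are illustrative:

```python
import numpy as np

def prediction_mesh(predict_proba, x_range=(-2.5, 2.5),
                    y_range=(-2.5, 2.5), step=0.05):
    # Build a regular grid of 2D points covering the input space.
    xs = np.arange(*x_range, step)
    ys = np.arange(*y_range, step)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]           # shape (n_points, 2)
    # Evaluate the model's class-1 probability at every grid point.
    probs = predict_proba(grid).reshape(xx.shape)
    return xx, yy, probs
```

With matplotlib, `plt.contourf(xx, yy, probs)` would then draw the probability background behind the sample scatterplot.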

### ReLU activation

### Tanh activation

**NOTE**: The loss function used is **binary cross-entropy**, because we are dealing with a binary classification problem, and the optimizer is a modification of the original **Stochastic Gradient Descent** (SGD) called **Adam**. Training stops when it reaches **epoch** 200 or when the **loss** drops below 0.263.

The code used to generate the graphs is publicly available at the following link:

Thanks for your attention!