# Dropout in feed-forward neural networks

### jramos

#### 06/12/2017

In this post we’ll talk about dropout: a technique used in Machine Learning to prevent complex and powerful models like neural networks from overfitting.

## Adaptive models and overfitting

Neural networks are a versatile family of models used to find relationships in enormous volumes of data, such as the ones we usually work with. They come in all shapes and sizes, and their accuracy depends heavily on both their structure and the size and quality of the data they are trained on.

Building models helps data scientists answer their questions. However, when we use adaptive models, like feed-forward neural networks, the risk of overfitting is almost always present. Thankfully, we can apply a number of procedures and techniques to avoid this overfitting, such as pruning when using classification trees, stopping criteria in genetic algorithms, or bagging in a more general context. Some Machine Learning methods, like ensemble methods (where many weak learners co-operate by smartly combining their predictions), were designed to avoid overfitting. Models that follow this pattern often show higher accuracy and more stable results (that is, they generalise better) than a single complex model.

As we said, the adaptability of feed-forward neural networks is a source of overfitting. Furthermore, the amount of data and computational effort required to train a single neural network grows rapidly as we add hidden layers to its architecture. Thus, separately training lots of different neural networks in an attempt to mimic ensemble methods is a rather daunting task.

Dropout is a technique that tackles both of these issues by exploiting a simple idea: dropping some of the neurons, together with their incoming and outgoing connections, during training.

### The process goes as follows:

- In every training batch, some neurons and their connections are temporarily removed, which yields a simpler and lighter version of the complete neural network. The most generic way to do this is to keep each neuron with probability $$p$$ (that is, to “drop” it with probability $$1 - p$$), independently of the others. A dropped neuron issues no output in the feed-forward pass, and its weights are not modified in the back-propagation pass for that batch.

- Once trained, at test time, every weight $$W_{ij}$$ in the complete neural network is scaled down by multiplying it by the probability that it was in use in any given lighter version. In the previous case, this just amounts to substituting $$W_{ij}$$ with its scaled-down value $$pW_{ij}$$.
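
The two steps above can be sketched for a single fully connected layer. This is a minimal illustration in NumPy, not a reference implementation. It assumes that $$p$$ is the probability that a neuron is kept, which is the reading under which scaling the weights by $$p$$ at test time is correct. The function name `dropout_forward` and the arguments `p_keep`, `rng` and `train` are hypothetical names chosen for this example.

```python
import numpy as np

def dropout_forward(x, W, b, p_keep, rng, train=True):
    """Fully connected layer with dropout applied to its input units.

    x: inputs, shape (batch, n_in); W: weights, shape (n_in, n_out).
    p_keep: probability that an input unit is retained (assumption, see text).
    """
    if train:
        # One Bernoulli(p_keep) draw per input unit. The same mask is shared
        # by the whole batch, following the per-batch description in the post.
        mask = rng.random(x.shape[-1]) < p_keep
        # A dropped unit emits 0, so it contributes nothing downstream and
        # its outgoing weights receive zero gradient for this batch.
        return (x * mask) @ W + b, mask
    # Test time: every unit is active and the weights are scaled by p_keep.
    return x @ (p_keep * W) + b, None
```

Calling it with `train=True` returns the output of one randomly thinned layer together with the mask that was drawn, while `train=False` runs the complete layer with weights $$pW_{ij}$$. Many libraries instead divide by $$p$$ during training (so-called inverted dropout) and leave the test-time weights untouched. That variant is not shown here.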

## Why would I want to cripple my neural network?

As weird as it may sound, cancelling some neurons’ ability to learn during training actually aims to obtain better trained neurons and reduce overfitting.

By doing so, we approximate the result of averaging the simpler trained models, which would otherwise take far more time and computational power to train one by one. But this isn’t the only reason. Training our neurons in this way also keeps them from co-adapting too tightly, that is, from relying on specific partners to compensate for their weaknesses, and ensures that the features they encapsulate work well with randomly chosen subsets of other neurons’ learned features. After all, during training they couldn’t rely on all of their colleagues to do the job, as most of the time some of them went missing.

This results in more demanding neurons that try to move past complicated, tailor-made features (which are prone to generalise poorly) and retain more useful information on their own. In the following figure (extracted from the paper Dropout: A Simple Way to Prevent Neural Networks from Overfitting), we find a comparison of the features learned on the MNIST dataset by an autoencoder with one hidden layer of 256 rectified linear units without dropout (left) and the features learned by the same structure using dropout in its hidden layer with $$p = 0.5$$ (right).
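
The averaging claim at the start of this section can be checked numerically in the simplest possible setting. The sketch below samples many randomly thinned versions of one linear layer, averages their outputs, and compares the result with a single pass through the weight-scaled layer. It assumes `p_keep` is the probability of keeping a unit. The helper `average_of_thinned_outputs` and all variable names are made up for this example. For a purely linear layer the two quantities agree in expectation. Once non-linear activations are involved, weight scaling is only an approximation of the average.

```python
import numpy as np

def average_of_thinned_outputs(x, W, p_keep, n_samples, rng):
    """Average the outputs of n_samples randomly thinned linear layers."""
    # One Bernoulli(p_keep) mask per sampled thinned layer: (n_samples, n_in).
    masks = rng.random((n_samples, x.shape[0])) < p_keep
    # Each row of (masks * x) is the input as seen by one thinned layer.
    return ((masks * x) @ W).mean(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=5)          # a single input vector with 5 units
W = rng.normal(size=(5, 3))     # weights of a linear layer with 3 outputs
p_keep = 0.5

averaged = average_of_thinned_outputs(x, W, p_keep, 200_000, rng)
scaled = x @ (p_keep * W)       # one pass through the weight-scaled layer

# Largest difference between the sampled average and the scaled layer.
max_gap = np.abs(averaged - scaled).max()
```

With enough samples, `max_gap` shrinks towards zero, which is why one scaled network can stand in for the average over the many thinned ones in this linear case.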

While the former shows unstructured, messy patterns which are impossible to interpret, the latter clearly exhibits purposeful weight distributions that detect strokes, edges and spots on their own, breaking their co-dependence with other neurons to carry out the job.