Neural networks have become one of the most popular families of machine learning algorithms. Progress in this field is on fire, and the goals nets have achieved are really impressive: painting à la Van Gogh, writing a Beatles song, creating a person who does not exist, making everybody dance and even bringing the Mona Lisa to life. These awesome accomplishments have applications across many industries.
This post is about the definition of a neural network and the main components that it involves. So let’s start from the very beginning…
Neural networks are inspired by our biological neural networks in the sense that stimuli are combined and interact until they produce a response. Their sophistication comes from the interactions of many simple parts that work jointly.
The neuron is the basic unit inside a neural net. A neuron receives several signals, which it combines and transforms into a single new signal that is sent to the next level. In fact, "neuron" is just a very cool name for a function that maps several inputs to one output.
Each input (the output of another neuron) has an associated weight, which can be interpreted as the intensity with which that signal reaches this particular neuron.
The propagation function plays the role of combining the input data. It consists of the weighted sum of the input values plus a bias, an independent term that does not depend on the inputs and shifts the weighted sum, much like the intercept in linear regression.
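To make the propagation function concrete, here is a minimal sketch in plain Python. The function name `propagate` and the example numbers are my own choices for illustration, not part of any library.

```python
def propagate(inputs, weights, bias):
    """Propagation function: weighted sum of the inputs plus the bias."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    return sum(w * x for w, x in zip(weights, inputs)) + bias


# Hypothetical example: three incoming signals and their weights.
z = propagate(inputs=[1.0, 2.0, 3.0], weights=[0.5, -1.0, 2.0], bias=1.0)
print(z)  # 0.5 - 2.0 + 6.0 + 1.0 = 5.5
```

Note that a weight of zero switches an input off completely, while the bias contributes the same amount whatever the inputs are.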
As you can see, the neuron looks very similar to linear regression, but it is not the same, because of the activation function. The activation distorts the previous linear combination and provides the network with nonlinearity, which is key when trying to find complex relationships in the data. The most common ones are the sigmoid, the hyperbolic tangent (tanh) and ReLU.
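As a rough illustration, the sketch below writes three widely used activation functions (sigmoid, tanh and ReLU) and a complete neuron in plain Python. The helper name `neuron` and the example values are assumptions made for this example only.

```python
import math


def sigmoid(z):
    """Squashes z into (0, 1); written in a numerically stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh(z):
    """Squashes z into (-1, 1)."""
    return math.tanh(z)


def relu(z):
    """Keeps positive values and replaces negative ones with zero."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation):
    """A neuron: weighted sum plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# The same weighted sum (z = -1.0 here) gives a different signal
# depending on the activation that is applied to it.
for name, fn in [("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(name, neuron([1.0, 2.0], [0.5, -1.0], 0.5, fn))
```

Removing the activation (or using the identity function) turns the neuron back into a plain linear model.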
A neural network is a set of neurons organized in levels called layers. Neurons of one layer interact with neurons of the next layer through weighted connections. Information flows from one level to the next: all the neurons in a layer receive the same inputs, namely the outputs of the neurons in the previous layer, and their own outputs are in turn combined in the neurons of the next layer.
The first layer is the input layer and has as many neurons as there are features used as inputs. The last one, the output layer, has as many neurons as there are outputs needed. When there are no more than these two layers, we call it a perceptron, the simplest net you can define. But the usual case is to place hidden layers between them; you can use as many hidden layers as you like, each with as many neurons as you want.
This increase in the number of layers, and therefore in the network's complexity, is what is known as deep learning, something like a conventional network on steroids. Click here to go in depth in deep learning.
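The layered structure can be sketched in a few lines of plain Python. This is only a toy under some assumptions of mine: the names `init_network` and `forward` are hypothetical, the weights are random (nothing is trained), and for simplicity the same activation is applied in every layer, including the output one.

```python
import random


def relu(z):
    return max(0.0, z)


def init_network(layer_sizes, seed=0):
    """Create random weights and zero biases for consecutive layers.

    layer_sizes = [3, 4, 4, 2] means 3 input features, two hidden
    layers of 4 neurons each and 2 output neurons.
    """
    rng = random.Random(seed)
    network = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # One row of weights per neuron, one weight per incoming signal.
        weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        network.append((weights, biases))
    return network


def forward(network, inputs, activation=relu):
    """Pass the inputs through every layer, one after another."""
    signal = inputs
    for weights, biases in network:
        # Every neuron of the layer receives the same incoming signal.
        signal = [
            activation(sum(w * x for w, x in zip(row, signal)) + b)
            for row, b in zip(weights, biases)
        ]
    return signal


net = init_network([3, 4, 4, 2])
output = forward(net, [0.2, -0.5, 1.0])
print(len(output))  # 2, one value per output neuron
```

The input layer does no computation here: it simply holds the features, so a list of four layer sizes produces three sets of weights.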
Remember that each of these neurons combines its inputs through a linear combination. It is easy to prove that the composition of linear layers is also linear, since you can simplify it and obtain a single equivalent layer. As mentioned before, activation functions are necessary to introduce nonlinearity. These nonlinear distortions are what give the sequence of layers its meaning and provide neural nets with the power to model intricate relationships in data.
Below is a geometric interpretation of three types of nets (a linear model, a simple net and a fairly complex one) and how well they do when trying to solve three different classification problems:
If you are further interested in nets from a geometric point of view, visit this fantastic post. And if you would like to play at configuring your own net, you can do it on the TensorFlow Playground website, which is where I generated the previous graphs.
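The claim about stacked linear layers is easy to check numerically. In the sketch below (the matrices and helper names are made up for illustration), two layers without activation, W2(W1 x + b1) + b2, are collapsed into a single layer with weights W2 W1 and bias W2 b1 + b2, and both give the same output.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def linear_layer(weights, biases, inputs):
    """A layer with no activation: weighted sums plus biases."""
    return [z + b for z, b in zip(matvec(weights, inputs), biases)]


# Two stacked linear layers: 2 inputs -> 3 neurons -> 1 neuron.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [1.0, 0.0, -2.0]
W2 = [[2.0, -1.0, 1.0]]
b2 = [4.0]

# The single equivalent layer.
W = matmul(W2, W1)
b = [z + c for z, c in zip(matvec(W2, b1), b2)]

x = [2.0, 3.0]
stacked = linear_layer(W2, b2, linear_layer(W1, b1, x))
collapsed = linear_layer(W, b, x)
print(stacked, collapsed)  # [32.0] [32.0]
```

As soon as a nonlinear activation is applied between the two layers, this simplification is no longer possible, which is exactly why the extra layer starts to add expressive power.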
Now maybe you are wondering…
How many layers and neurons should I choose?
Considering the previous explanation, you can imagine the answer depends entirely on the problem. The first obvious point is that if your data is linearly separable, then no hidden layers are required. But actually, this isn’t very helpful…
If you have some intuition about the problem to solve, then there are some tricks you can apply to choose a suitable architecture (look at this post for more insight). However, this is not the usual case. In real-world problems, there is no way to determine the best number of hidden layers and neurons without trying.
Normally, you can choose between two techniques: pruning, i.e. building a large net and then pruning it by deleting the nodes that don’t contribute to the result; and growing, i.e. starting with a simple net and adding nodes only when the improvement is clear.
On the other hand, the number of layers and the number of neurons are hyper-parameters of the net, so an optimization based on cross-validation can help to set the most suitable values for them.
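A cross-validated search over these two hyper-parameters could look roughly like the sketch below. Everything here is an assumption for illustration: `k_fold_indices` and `grid_search` are hypothetical helpers, the folds are not shuffled, and `dummy_score` is a placeholder standing in for "train a net with this architecture on the training indices and return its validation score", which is outside the scope of this post.

```python
from itertools import product


def k_fold_indices(n_samples, k):
    """Split sample indices into k (train, validation) pairs, unshuffled."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append((indices[:start] + indices[end:], indices[start:end]))
        start = end
    return folds


def grid_search(n_samples, layer_options, neuron_options, score_fn, k=5):
    """Return (mean score, layers, neurons) for the best combination."""
    best = None
    for n_layers, n_neurons in product(layer_options, neuron_options):
        scores = [
            score_fn(n_layers, n_neurons, train, val)
            for train, val in k_fold_indices(n_samples, k)
        ]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, n_layers, n_neurons)
    return best


def dummy_score(n_layers, n_neurons, train_idx, val_idx):
    """Placeholder score (higher is better); a real one would train a net."""
    return -abs(n_layers - 2) - abs(n_neurons - 8) / 8


best = grid_search(100, [1, 2, 3], [4, 8, 16], dummy_score)
print(best[1], best[2])  # 2 8 with this made-up score
```

In practice, libraries such as scikit-learn or Keras Tuner offer ready-made tools for this kind of search, so a hand-written loop like this one is mostly useful for understanding the idea.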
What happens if my net has an inappropriate number of hidden neurons?
Too little complexity will lead to under-fitting, while too much complexity tends to lead to over-fitting. The density of your net must be justified.
Which activation function should I choose?
Although there are no golden rules to answer this, the ReLU function is often a good default choice. It is computationally cheaper and mitigates the vanishing gradient problem, which is an inconvenience when using sigmoid and tanh. ReLU has disadvantages too, such as the dying ReLU problem, and some variants (Leaky ReLU, for example) exist to mitigate its weaknesses. It is quite common to find architectures that use ReLU for the hidden layers and a different function for the output layer, chosen according to the task (for example, sigmoid or softmax for classification and a linear output for regression).
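One rough way to get a feeling for the complexity of a net is to count its trainable values: one weight per connection plus one bias per neuron. The helper `count_parameters` below is a hypothetical name of mine, and the parameter count is only a crude proxy for complexity, not a rule for choosing an architecture.

```python
def count_parameters(layer_sizes):
    """Number of weights and biases in a fully connected net."""
    return sum(
        n_in * n_out + n_out  # weights of the layer + one bias per neuron
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


print(count_parameters([10, 1]))          # 11: no hidden layers
print(count_parameters([10, 16, 16, 1]))  # 465: two hidden layers of 16
```

Adding just two modest hidden layers multiplies the number of values to learn by more than forty in this example, which gives an idea of how quickly a net can become too flexible for a small data set.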
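The intuition behind these gradient problems can be sketched with a few lines of Python. This is a deliberate simplification of mine: it ignores the weights and just multiplies the derivative of the activation once per layer, which is the part of the chain rule that the choice of activation controls. The function names, and the slope of 0.01 for the Leaky ReLU variant, are assumptions for the example.

```python
import math


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def sigmoid_grad(z):
    """Derivative of the sigmoid; it never exceeds 0.25."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def leaky_relu_grad(z, slope=0.01):
    """Derivative of Leaky ReLU: a small slope keeps negative inputs alive."""
    return 1.0 if z > 0 else slope


def chained_gradient(grad_fn, z, depth):
    """Product of the same activation derivative across `depth` layers."""
    return grad_fn(z) ** depth


# Ten layers deep, the sigmoid factor has almost vanished even in its
# best case (z = 0), while ReLU passes the gradient through unchanged.
print(chained_gradient(sigmoid_grad, 0.0, 10))  # 0.25 ** 10, about 9.5e-07
print(chained_gradient(relu_grad, 1.0, 10))     # 1.0

# For negative inputs ReLU blocks the gradient completely,
# whereas the leaky variant still lets a little through.
print(relu_grad(-1.0), leaky_relu_grad(-1.0))   # 0.0 0.01
```

A neuron whose input stays negative for every training example gets a zero gradient from ReLU and stops learning, which is the weakness the leaky variants try to address.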
To discover more about it you can go through these two posts: Activation functions and its types & Understanding Activation Functions
To sum up…
A neural net consists of many small units called neurons that are grouped into several layers. Neurons of one layer interact with neurons on the next level through weighted connections. The interaction involves no more than a linear combination by a propagation function and a nonlinear transformation by an activation function. Thanks to stacking several layers and neurons it is possible to work out very complex solutions.
Note that these weighted connections are just real values to optimize. These values define the machine learning model that the net will come up with. The net must find the right weights to get the right results and, of course, it must learn them by itself. This happens during the training stage, which I leave for the next chapter: How to train your net. Coming soon!