
The secret sauce that makes Deep Learning frameworks so powerful

Alejandro Pérez



Inside most Deep Learning frameworks lies a powerful technique called Automatic Differentiation. If you have ever encountered these words but don’t know what they mean or how the procedure works, this post is for you.

In a previous post, we saw how to build a deep learning framework using NumPy. In that post, I mentioned that we could implement the computations at the operation level and track the gradients. This is what we call Automatic Differentiation.


Automatic differentiation is a numerical technique to automatically evaluate derivatives from a set of operations. As you may remember from the previous post, derivatives are the tool we use to compute the gradient of the loss function, and we use that gradient to update the parameters of our model.

No matter how complex a deep learning model is, it can be broken down into a set of elementary operations, from which the gradients are derived.
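To make this concrete, here is a minimal sketch in plain Python. The tiny linear model, the squared-error loss, and every variable name are assumptions picked for illustration; they are not taken from the previous post. Each line of the forward pass is one elementary operation, and the backward pass multiplies the local derivatives together from the loss back to the parameters.

```python
# Hypothetical model: loss = (w * x + b - t) ** 2, split into elementary steps.
w, x, b, t = 3.0, 2.0, 1.0, 5.0

# forward pass: one elementary operation per line
u = w * x        # multiplication
v = u + b        # addition
e = v - t        # subtraction
loss = e * e     # multiplication

# backward pass: local derivative of each step, chained from the loss back
dloss_de = 2.0 * e          # d(e * e)/de = 2e
dloss_dv = dloss_de * 1.0   # d(v - t)/dv = 1
dloss_du = dloss_dv * 1.0   # d(u + b)/du = 1
dloss_db = dloss_dv * 1.0   # d(u + b)/db = 1
dloss_dw = dloss_du * x     # d(w * x)/dw = x

print(loss, dloss_dw, dloss_db)  # 4.0 8.0 4.0
```

The two final values are the gradients a framework would use to update w and b.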

Under the hood, the most popular Deep Learning libraries are just automatic differentiation frameworks. Let’s see an example with PyTorch:

import torch

# notice how we force PyTorch to track gradients
a = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# forward pass
c = a * b

# backward pass (backward() returns None; the gradients are stored in .grad)
c.backward()

print(f"a gradient: {a.grad} | b gradient: {b.grad}")

The code above outputs the following line:

a gradient: tensor([0.]) | b gradient: tensor([1.])

Did PyTorch do it right? Of course, but let’s prove it:

Let c be
c(a,b) = a \cdot b

Then, the partial derivatives of c with respect to a and b are:
\dfrac{\partial c}{\partial a} = b , \quad \dfrac{\partial c}{\partial b} = a

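Substituting a = 1 and b = 0 gives 0 for the partial derivative with respect to a and 1 for the partial derivative with respect to b, exactly what PyTorch printed. Therefore, the computation performed by PyTorch is correct (we didn’t expect less :P).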
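If you prefer an independent check that does not rely on calculus or on PyTorch, the sketch below approximates both partial derivatives with central finite differences. Note that this is numerical differentiation, not automatic differentiation; the helper name central_difference and the step size h are assumptions made for this example.

```python
def c(a, b):
    return a * b

def central_difference(f, a, b, wrt, h=1e-6):
    """Approximate a partial derivative of f(a, b) with a central difference."""
    if wrt == "a":
        return (f(a + h, b) - f(a - h, b)) / (2 * h)
    return (f(a, b + h) - f(a, b - h)) / (2 * h)

# same point as the PyTorch example: a = 1, b = 0
dc_da = central_difference(c, 1.0, 0.0, wrt="a")
dc_db = central_difference(c, 1.0, 0.0, wrt="b")

print(round(dc_da, 6), round(dc_db, 6))  # 0.0 1.0
```

The approximations agree with the gradients reported by PyTorch for a and b.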

The theory

As we said before, automatic differentiation converts a program into a sequence of primitive operations, called a computational graph or Wengert list, each of which has a specified routine for computing its derivative [Roger Grosse].

Let’s see a slightly more complex function:

d(a,b,c) = a \cdot (b + c)

Let $$s(b,c) = b + c$$ be the intermediate sum, so that $$d = a \cdot s$$.

We are going to draw the computational graph of d; that is, we take our mathematical expression and write it as a graph where each node corresponds to a specific operation.
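The same graph can be written down as a Wengert list. The sketch below is one possible encoding, assuming a tuple layout of (output name, operation, input names) that I made up for this example; the input values are arbitrary too.

```python
import operator

# Hypothetical encoding of d = a * (b + c) as a Wengert list: one primitive
# operation per entry, written as (output name, operation, input names).
wengert_list = [
    ("s", operator.add, ("b", "c")),
    ("d", operator.mul, ("a", "s")),
]

def evaluate(wengert_list, inputs):
    """Run the operations in order and return every intermediate value."""
    values = dict(inputs)
    for output, operation, (left, right) in wengert_list:
        values[output] = operation(values[left], values[right])
    return values

values = evaluate(wengert_list, {"a": 2.0, "b": 3.0, "c": 4.0})
print(values["s"], values["d"])  # 7.0 14.0
```

Each entry of the list corresponds to one node of the graph, and the order of the entries is the order of the forward pass.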

Computational graph of the function d.

Remember that for each operation in the forward direction we have a backward one. The backward pass of a node computes the derivative of the operation that node implements, times the gradient coming from the operations that follow it. Do you see what is happening? Every time we evaluate an operation in reverse mode, we apply the chain rule! Thinking about backpropagation like this makes the whole process a lot easier to understand.

PyTorch keeps track of the operations and computes the derivatives of each one of them, applying the chain rule to compute the gradients.

Reverse-mode differentiation of the function d.
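In symbols, the rule each node applies can be written as follows. Take a node that computes z from an input x, where the final output is d. The gradient that reaches x is the incoming gradient times the local derivative:

\dfrac{\partial d}{\partial x} = \dfrac{\partial d}{\partial z} \cdot \dfrac{\partial z}{\partial x}

For the two nodes of our graph, with s = b + c and d = a \cdot s, the local derivatives are:

\dfrac{\partial s}{\partial b} = 1 , \quad \dfrac{\partial s}{\partial c} = 1 , \quad \dfrac{\partial d}{\partial a} = s , \quad \dfrac{\partial d}{\partial s} = a

The reverse pass starts at the output with \dfrac{\partial d}{\partial d} = 1 and works its way back through these factors.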

A numerical example using PyTorch

Using the same function d we defined previously, we set a value for the variables.

import torch

# torch.autograd.Variable is deprecated; plain tensors track gradients directly
a = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([-1.8], requires_grad=True)
c = torch.tensor([0.1], requires_grad=True)

# forward pass
d = a * (b + c)

# backward pass (backward() returns None; the gradients are stored in .grad)
d.backward()

print(f"a gradient: {a.grad} | b gradient: {b.grad} | c gradient: {c.grad}")

The resulting gradients are:

a gradient: tensor([-1.7000]) | b gradient: tensor([2.2000]) | c gradient: tensor([2.2000])

The forward pass in the graph looks like:

Forward pass of the function d.

The backward pass in the graph looks like:

Backward pass of the function d.

No surprise here: our theoretical calculations match PyTorch’s results.
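For reference, this is the hand calculation, using s = b + c and the values a = 2.2, b = -1.8 and c = 0.1:

\dfrac{\partial d}{\partial a} = b + c = -1.8 + 0.1 = -1.7

\dfrac{\partial d}{\partial b} = \dfrac{\partial d}{\partial s} \cdot \dfrac{\partial s}{\partial b} = a \cdot 1 = 2.2

\dfrac{\partial d}{\partial c} = \dfrac{\partial d}{\partial s} \cdot \dfrac{\partial s}{\partial c} = a \cdot 1 = 2.2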

Building it from scratch

To end this post, we will go even deeper and build our own autodiff system.

This section contains implementations of two operations that keep track of the gradient. They are designed for scalar values, although they can also process NumPy arrays element-wise.

We start by defining a base class that the operations can extend.

class _Operation:
    """Operation abstract class.

    An operation takes two operands and computes the result in the 'forward'
    method and the gradient in the 'backward' method.

    Attributes
    ----------
    a : object
        First operand of the operation.
    b : object
        Second operand of the operation.
    a_grad : object
        Gradient of the first operand.
    b_grad : object
        Gradient of the second operand.
    """

    def __init__(self):
        # operands
        self.a = None
        self.b = None
        # gradient of the operands
        self.a_grad = None
        self.b_grad = None

    def __call__(self, *args):
        return self.forward(*args)

    def forward(self, a, b):
        # subclasses compute and return the result of the operation
        raise NotImplementedError

    def backward(self, incoming_grad=1):
        # subclasses compute and return the gradients of both operands
        raise NotImplementedError

Notice how we only allow two operands. The next step is to define the operations. The forward pass is pretty straightforward. The backward pass is just the derivative of the operation times the incoming gradient (if any).

import numpy as np


class Add(_Operation):
    """Implements the operation a + b and computes its gradient."""

    def __init__(self):
        super().__init__()

    def forward(self, a, b):
        self.a = a
        self.b = b
        return np.add(a, b)

    def backward(self, incoming_grad=1):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        self.a_grad = incoming_grad * 1
        self.b_grad = incoming_grad * 1
        return self.a_grad, self.b_grad


class Multiply(_Operation):
    """Implements the operation a * b and computes its gradient."""

    def __init__(self):
        super().__init__()

    def forward(self, a, b):
        self.a = a
        self.b = b
        return np.multiply(a, b)

    def backward(self, incoming_grad=1):
        # d(a * b)/da = b and d(a * b)/db = a
        self.a_grad = incoming_grad * self.b
        self.b_grad = incoming_grad * self.a
        return self.a_grad, self.b_grad

As you can see, it is a really simple implementation. The following lines contain an example:

a = 2.2
b = -1.8
c = 0.1

s = Add()
d = Multiply()

# forward pass
res = d(a, s(b,c))

# gradients
dd_da, dd_ds = d.backward(incoming_grad=1)
ds_db, ds_dc = s.backward(incoming_grad=dd_ds)

print(f"a grad: {dd_da} | b grad: {ds_db} | c grad : {ds_dc}")

The code above outputs the following line:

a grad: -1.7 | b grad: 2.2 | c grad : 2.2

We made it! We can now extend this framework and build other operations to make it more complete. Here you can find a slightly more complete library that has more computation options.
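As an idea of what such an extension could look like, here is a sketch of a division operation that follows the same forward/backward interface. The Divide class is hypothetical and is not part of the code above or of the linked library. To keep the sketch runnable on its own, it repeats a trimmed-down stand-in for the base class and uses the plain / operator instead of a NumPy call.

```python
class _Operation:
    """Trimmed-down stand-in for the base class defined earlier, repeated
    here only so that this sketch runs on its own."""

    def __init__(self):
        self.a = None
        self.b = None
        self.a_grad = None
        self.b_grad = None

    def __call__(self, *args):
        return self.forward(*args)


class Divide(_Operation):
    """Hypothetical extra operation: implements a / b and its gradient."""

    def forward(self, a, b):
        self.a = a
        self.b = b
        return a / b

    def backward(self, incoming_grad=1):
        # d(a / b)/da = 1 / b   and   d(a / b)/db = -a / b**2
        self.a_grad = incoming_grad * (1 / self.b)
        self.b_grad = incoming_grad * (-self.a / self.b ** 2)
        return self.a_grad, self.b_grad


q = Divide()
res = q(6.0, 2.0)
dq_da, dq_db = q.backward()

print(f"result: {res} | a grad: {dq_da} | b grad: {dq_db}")
# result: 3.0 | a grad: 0.5 | b grad: -1.5
```

The only work needed for a new operation is its forward computation and its two local derivatives; the chaining through incoming_grad stays the same.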


In this post we’ve seen what automatic differentiation is and how Deep Learning frameworks use it. We even created our own basic and verbose but still functional autodiff framework.

More advanced libraries to perform these tasks can be found in the References section, as well as other resources. I recommend checking these links and trying the different libraries to get a sense of the kinds of tools that are available.