Inside most deep learning frameworks lies a powerful technique called Automatic Differentiation. If you have ever encountered these words but don’t know what they mean or how the procedure works, this post is for you.
In a previous post, we saw how to build a deep learning framework using NumPy. In that post, I mentioned that we could implement the computations at the operation level and track the gradients. This is what we called Automatic Differentiation.
Introduction
Automatic differentiation is a technique to automatically evaluate the derivatives of a function expressed as a set of elementary operations. As you may remember from the previous post, derivatives are the tool we use to compute the gradient of the loss function, and we use that gradient to update the parameters of our model.
No matter how complex a deep learning model is, it can be summarized as a set of elementary operations, from which the gradients are derived.
Under the hood, the most popular Deep Learning libraries are just automatic differentiation frameworks. Let’s see an example with PyTorch:
import torch
# notice how we force PyTorch to track gradients
a = torch.ones(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
# forward pass
c = a * b
# backward pass
c.backward()
print(f"a gradient: {a.grad} | b gradient: {b.grad}")
The code above outputs the following line:
a gradient: tensor([0.]) | b gradient: tensor([1.])
Did PyTorch do it right? Of course, but let’s prove it:
Let c be
$$
c(a,b) = a \cdot b
$$
Then, the partial derivatives of c with respect to a and b are:
$$
\dfrac{\partial c}{\partial a} = b , \quad \dfrac{\partial c}{\partial b} = a
$$
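Evaluating these derivatives at the values we used, a = 1 and b = 0:
$$
\dfrac{\partial c}{\partial a} = b = 0 , \quad \dfrac{\partial c}{\partial b} = a = 1
$$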
Therefore, the computation performed by PyTorch is correct (we didn’t expect less :P).
The theory
As we said before, automatic differentiation converts a program into a sequence of primitive operations, known as a computational graph or Wengert list, each of which has a specified routine for computing its derivative [Roger Grosse].
Let’s see a slightly more complex function:
$$
d(a,b,c) = a \cdot (b + c)
$$
To simplify the notation, define the intermediate operation $$s(b,c) = b + c$$.
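With this intermediate variable, the program for d becomes the following sequence of primitive operations (its Wengert list):
$$
s = b + c , \quad d = a \cdot s
$$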
We are going to draw the computational graph of d; that is, we take our mathematical expression and write it as a graph where each node corresponds to a specific operation.

Remember that, for each operation in the forward direction, we have a backward counterpart. The backward pass of a node contains the derivative of the operation that node implements, multiplied by the gradient coming from the operations that follow it. Do you see what is happening? Every time we evaluate an operation in reverse mode, we apply the chain rule! Thinking about backpropagation this way makes the whole process a lot easier to understand.
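Applying the chain rule to our example gives the backward pass of d:
$$
\dfrac{\partial d}{\partial s} = a , \quad
\dfrac{\partial d}{\partial a} = s = b + c , \quad
\dfrac{\partial d}{\partial b} = \dfrac{\partial d}{\partial s} \cdot \dfrac{\partial s}{\partial b} = a \cdot 1 = a , \quad
\dfrac{\partial d}{\partial c} = \dfrac{\partial d}{\partial s} \cdot \dfrac{\partial s}{\partial c} = a \cdot 1 = a
$$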
PyTorch keeps track of the operations and computes the derivatives of each one of them, applying the chain rule to compute the gradients.

A numerical example using PyTorch
Using the same function d we defined previously, we set values for the variables.
a = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([-1.8], requires_grad=True)
c = torch.tensor([0.1], requires_grad=True)
# forward pass
d = a * (b + c)
# backward pass
d.backward()
print(f"a gradient: {a.grad} | b gradient: {b.grad} | c gradient: {c.grad}")
The resulting gradients are:
a gradient: tensor([-1.7000]) | b gradient: tensor([2.2000]) | c gradient: tensor([2.2000])
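These are exactly the values given by the chain rule derivation above:
$$
\dfrac{\partial d}{\partial a} = b + c = -1.8 + 0.1 = -1.7 , \quad
\dfrac{\partial d}{\partial b} = a = 2.2 , \quad
\dfrac{\partial d}{\partial c} = a = 2.2
$$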
The forward pass in the graph looks like:

The backward pass in the graph looks like:

No surprise here: our theoretical calculations match PyTorch’s results.
Building it from scratch
To end this post, we will go even deeper and build our own autodiff system.
This section contains implementations for 2 operations that keep track of the gradient. They are designed for scalar values, although they can process NumPy arrays element-wise.
We start by defining a base class that the operations can extend.
class _Operation:
    """Operation abstract class.

    An operation takes two operands and computes the result in the 'forward'
    method and the gradient in the 'backward' method.

    Attributes
    ----------
    a : object
        First operand of the operation.
    b : object
        Second operand of the operation.
    a_grad : object
        Gradient of the first operand.
    b_grad : object
        Gradient of the second operand.
    """
    def __init__(self):
        # operands
        self.a = None
        self.b = None
        # gradient of the operands
        self.a_grad = None
        self.b_grad = None

    def __call__(self, *args):
        return self.forward(*args)

    def forward(self, a, b):
        raise NotImplementedError

    def backward(self, incoming_grad=1):
        raise NotImplementedError
Notice how we only allow two operands. The next step is to define the operations. The forward pass is pretty straightforward. The backward pass is just the derivative of the operation times the incoming gradient (if any).
# NumPy handles the element-wise forward computations
import numpy as np


class Add(_Operation):
    """Addition operation.

    Implements the operation a + b and computes its gradient.
    """
    def __init__(self):
        super(Add, self).__init__()

    def forward(self, a, b):
        # store the operands for the backward pass
        self.a = a
        self.b = b
        return np.add(a, b)

    def backward(self, incoming_grad=1):
        # d(a + b)/da = 1 and d(a + b)/db = 1
        self.a_grad = incoming_grad * 1
        self.b_grad = incoming_grad * 1
        return self.a_grad, self.b_grad


class Multiply(_Operation):
    """Multiplication operation.

    Implements the operation a * b and computes its gradient.
    """
    def __init__(self):
        super(Multiply, self).__init__()

    def forward(self, a, b):
        # store the operands for the backward pass
        self.a = a
        self.b = b
        return np.multiply(a, b)

    def backward(self, incoming_grad=1):
        # d(a * b)/da = b and d(a * b)/db = a
        self.a_grad = incoming_grad * self.b
        self.b_grad = incoming_grad * self.a
        return self.a_grad, self.b_grad
As you can see, it is a really simple implementation. The following lines contain an example:
a = 2.2
b = -1.8
c = 0.1
s = Add()
d = Multiply()
# forward pass
res = d(a, s(b,c))
# gradients
dd_da, dd_ds = d.backward(incoming_grad=1)
ds_db, ds_dc = s.backward(incoming_grad=dd_ds)
print(f"a grad: {dd_da} | b grad: {ds_db} | c grad : {ds_dc}")
The code above outputs the following line:
a grad: -1.7 | b grad: 2.2 | c grad : 2.2
We made it! We can now extend this framework and build other operations to make it more complete; a sketch of one possible extension is shown below. Here you can find a slightly more complete library that offers more operations.
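As an illustration, here is a minimal sketch of how one more operation could be added, reusing the _Operation base class defined above (the Subtract class is just a hypothetical example, not part of any library):
class Subtract(_Operation):
    """Subtraction operation (hypothetical example).

    Implements the operation a - b and computes its gradient.
    """
    def __init__(self):
        super(Subtract, self).__init__()

    def forward(self, a, b):
        # store the operands for the backward pass
        self.a = a
        self.b = b
        return np.subtract(a, b)

    def backward(self, incoming_grad=1):
        # d(a - b)/da = 1 and d(a - b)/db = -1
        self.a_grad = incoming_grad * 1
        self.b_grad = incoming_grad * -1
        return self.a_grad, self.b_grad
With this in place, a - b can be evaluated and differentiated exactly like the Add and Multiply examples above.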
Conclusions
In this post we’ve seen what automatic differentiation is and how Deep Learning frameworks use it. We even created our own basic and verbose but still functional autodiff framework.
More advanced libraries that perform these tasks can be found in the References section, along with other resources. I recommend checking these links and trying the different libraries to get a sense of what kinds of tools are available.
- William W. Cohen – Automatic Reverse-Mode Differentiation: Lecture Notes.
- Roger Grosse – CSC321 Lecture 10: Automatic Differentiation.
- Matthew Johnson – Autodidact.
- Harvard Intelligent Probabilistic Systems Group – Autograd.
- Andrej Karpathy – Micrograd.
- Christopher Olah – Calculus on Computational Graphs: Backpropagation.
- Alejandro Pérez – Toydiff.
- Facebook AI – PyTorch.