How does ChatGPT work? What lies behind deepfake images of celebrities? How do we deal with the lack of data in finance? All of these questions share the same underlying concept: generative models.
Generative models are algorithms that create new instances of data mimicking the data on which they were trained. Depending on the task, different generative models are preferred. For instance, ChatGPT belongs to the field of Natural Language Processing and is based on Transformer models. Deepfakes belong to image processing and financial data to time series, but both can be treated with the same kind of model: Generative Adversarial Networks, or GANs for short.
What is a GAN?
A Generative Adversarial Network is a deep learning model composed of two neural networks. The network that generates the samples is called the generator, \(G\). We want this network to generate feasible images or financial time series: as realistic as possible, but new, different from our current dataset. However, the quality of the generated samples may be poor at first and differ too much from the real data we are trying to replicate, since the network weights still need to be optimized. How can we train our generator to improve the quality of the samples? We can use another network to help us, called the discriminator, \(D\). It is responsible for classifying samples as real (from our training dataset) or fake (generated by \(G\)). From the discriminator's output, the generator improves its performance. But how are these two networks related?
How do the networks work?
We can sketch the workflow as follows:

[Figure: GAN workflow — a latent noise vector \(z\) feeds the generator \(G\), which outputs a fake sample; real and fake samples feed the discriminator \(D\), which outputs a real/fake probability.]
The generator network needs an input to produce an output sample. It takes a vector of “random noise”, \(z\), from the so-called latent space, passes it through the net and generates a fake sample, \(x_{fake}\). Mathematically, \(G\) is a function such that \(G(z) = x_{fake}\).
By “random noise” we mean a vector, with the same length as the number of neurons in the first layer, sampled randomly from a given distribution \(p_{z}\). Usually \(p_{z} \sim \mathcal{U}(0,1)\), the uniform distribution between 0 and 1, or \(p_{z} \sim \mathcal{N}(0,1)\), the normal distribution with mean 0 and standard deviation 1. This randomness of the latent space makes \(x_{fake}\) vary each time, so every generated sample is new and unique.
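In PyTorch, for instance, sampling \(z\) and producing \(x_{fake}\) might look like the following minimal sketch (the architecture and layer sizes are illustrative assumptions, not a prescribed design):

```python
import torch
import torch.nn as nn

latent_dim = 100  # length of z: the number of neurons in G's first layer
data_dim = 784    # e.g. a flattened 28x28 image

# A toy generator: any network mapping latent_dim -> data_dim would do
G = nn.Sequential(
    nn.Linear(latent_dim, 128),
    nn.ReLU(),
    nn.Linear(128, data_dim),
    nn.Tanh(),  # squash outputs into a fixed range
)

# Sample z from p_z, either U(0, 1) or N(0, 1)
z = torch.rand(16, latent_dim)  # p_z ~ U(0, 1); use torch.randn for N(0, 1)
x_fake = G(z)                   # G(z) = x_fake: a batch of 16 new samples
```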
How does the generator know what the real samples look like? In these first steps we got a fake sample \(x_{fake}\), but we did not feed the generator any information about the real samples \(x_{real}\). This is where the discriminator network comes in. We feed it with both real samples from our training dataset and fake samples from the generator. Because of this workflow, the number of neurons in the last generator layer, the number of neurons in the first discriminator layer, and the number of features (dimensions) of \(x_{fake}\) and \(x_{real}\) must all be the same.
The discriminator acts as a classic binary classifier. It discriminates between fake samples, which we label with 0, and real samples, which we label with 1. Its output is the probability (from 0 to 1) of the sample being real. The optimal discriminator will output \(D(x_{real})=1\) and \(D(x_{fake})=0\). Since we know which sample is fake and which is not, the data is labeled, so training the discriminator is supervised learning, while training the generator is unsupervised since no labels are needed. However, the training process of the GAN as a whole is considered unsupervised.
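Continuing the sketch above, a matching discriminator could look like this (again, the sizes are assumptions; the only hard constraint is that its input width equals the data dimension):

```python
# A toy discriminator: a binary classifier over data_dim features
D = nn.Sequential(
    nn.Linear(data_dim, 128),  # input width must match G's output width
    nn.LeakyReLU(0.2),
    nn.Linear(128, 1),
    nn.Sigmoid(),  # output in (0, 1): estimated probability of "real"
)

x_real = torch.randn(16, data_dim)  # stand-in for a real training batch
prob_real = D(x_real)               # ideally close to 1
prob_fake = D(x_fake)               # ideally close to 0
```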
We can think about the relationship between generator and discriminator as that between a student (\(G\)) trying to hand in copied work and a teacher (\(D\)) trying to verify whether each piece of work is original (\(x_{real}\)) or copied (\(x_{fake}\)).
How do I generate high-quality samples?
Once the discriminator outputs the probability, how does the generator improve? At this point we need a loss function, as in every neural network, to tune the parameters through gradient descent. We may choose binary cross-entropy, a well-known loss for binary classification problems. But since we have two neural networks, this is not the usual problem where we minimize a single cost function: there are two networks to optimize. We need one cost for \(G\), say \(J^{G}\), and another for \(D\), say \(J^{D}\), and they are exactly opposite.
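As a sketch, reusing `prob_real` and `prob_fake` from the snippets above, the discriminator's binary cross-entropy cost could be written as:

```python
import torch.nn.functional as F

ones = torch.ones(16, 1)    # label 1 = real
zeros = torch.zeros(16, 1)  # label 0 = fake

# J^D: push D(x_real) towards 1 and D(x_fake) towards 0
J_D = (F.binary_cross_entropy(prob_real, ones)
       + F.binary_cross_entropy(prob_fake, zeros))
# and, as discussed next, the generator's cost is just J^G = -J^D
```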
The two networks are adversaries trying to beat each other. Rather than an optimization problem, we can see it from the game theory point of view: they are players in a zero-sum game. To understand it better, let us go back to the school analogy. If the student copies and fools the teacher, he wins and the teacher loses. Vice versa, if the teacher realizes the work is a copy, he wins and the student loses. The payoffs of the players add up to 0. Because of this, and since the cost functions are opposite, \(J^{G} = -J^{D}\), we do not need to minimize two functions. We can summarize them in a single value function, \(V^{G,D} = -J^{D}\), and let \(G\) minimize it and \(D\) maximize it. But before writing the full equation we need to go deeper into the GAN maths…
Show me the maths!
Formally, a GAN is a structured probabilistic model [2] with latent variables \(z\) and observed variables \(x\), where \(G\) is a function that maps \(z \sim p_{z} \rightarrow x_{fake} \sim p_{G}\), and \(p_{G}\) is the probability density function (pdf) of the fake samples, that is, the pdf of the generator. When we say “the generator trains to try to replicate the real data”, what really happens is that with each GAN iteration the parameters of \(G\) change. That means its pdf changes, trying to get closer to the pdf of the real data, \(x_{real} \sim p_{real}\). For a better understanding we may look at the example below:
[Figure: evolution of the distributions during training, panels (a)–(d), as in the original GAN paper [3]: the generator distribution \(p_{G}\) (green), the real data distribution \(p_{real}\) (black, dotted) and the discriminator output (blue, dotted).]
The green line is the generator distribution \(p_{G}\); since the weights are randomly initialized, it could take any shape. We can see how \(G\) takes values from the uniform distribution of \(z\) and maps them to the \(p_{G}\) distribution. The black dotted line is the real (target) distribution \(p_{real}\), and the blue dotted one is the output of the discriminator, which does the classifying.
In (a) we can see the initial distributions. If we first train the discriminator, in (b) we can see how the blue line becomes more accurate at distinguishing the real (black) and generator (green) distributions. In the next step (c) we train the generator: since the discriminator got better at classifying data, the generator must update its parameters in the direction of the (minus) gradient to generate better samples and fool it, so its distribution gets closer to the real distribution after each iteration. In the end, if training succeeds, \(p_{G} = p_{real}\) and the discriminator can do no better than a \(50\%\) chance of classifying a sample correctly, because of the high quality of the fakes.
We can look at this sequence from the point of view of our analogy. In (a) there is no previous training; the student (green) and the teacher (blue) do not know each other. In (b) the student submits some copied homework and some original homework. The teacher classifies them and learns to distinguish them better (the blue line changes its shape). In (c) the teacher gives feedback to the student. The student realizes which parts of the copied work betrayed him and improves (the green line changes its shape). After several repetitions of (b) and (c) we arrive at (d): the student creates perfect copies of the homework. The teacher is so confused that he/she cannot distinguish them and fails half the time.
Now, we are ready to write the full value function.
Finally, the GAN value function
$$ \underset{G}{\min} \ \underset{D}{\max} \ V^{G,D} = \mathbb{E}_{x_{real}\sim p_{real}} [\log D(x_{real})] + \mathbb{E}_{z\sim p_{z}} [\log(1-D(G(z)))] $$
The expression may look scary, but it is easy to read and check that it makes sense. \(V^{G,D}\) is composed of two terms, both expected values. The first is the expected value of the logarithm of the output given by \(D\) for the real samples (remember, it is a probability between 0 and 1). The second is the expected value of the logarithm of 1 minus the output given by \(D\) for the fake samples.
Let us start with \(D\). The perfect \(D\) always classifies correctly, so \(D(x_{real})=1\) and \(D(x_{fake})=0\); this way the first and second terms reach their maximum value, so the aim of \(D\) is to maximize \(V^{G,D}\). The perfect \(G\) always fools the discriminator. \(G\) has nothing to do with the first term, only with the second. If \(G\) always fools \(D\), then \(D(G(z))=1\), meaning \(D\) labels the data wrongly; this drives \(V^{G,D}\) to its lowest possible value, so the aim of \(G\) is to minimize it.
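Written directly as code, the value function and the two opposing objectives might look like this sketch (again reusing `prob_real` and `prob_fake` from the snippets above):

```python
# V(G, D): the two expectations approximated by batch means
V = torch.log(prob_real).mean() + torch.log(1.0 - prob_fake).mean()

loss_D = -V  # D does gradient ascent on V, i.e. gradient descent on -V
loss_G = torch.log(1.0 - prob_fake).mean()  # only this term depends on G

# Note: in practice G often minimizes -log(D(G(z))) instead (the
# "non-saturating" variant from the original paper [3]) to avoid tiny
# gradients early in training.
```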
A few considerations…
So far we have seen the basics of how a GAN works and the statistical model behind it. But training two networks together raises many difficulties that are not present in stand-alone networks [4].
- Which one should we train first?
Once you have calculated \(V^{G,D}\) you can perform gradient ascent on \(D\) to update its parameters first and then perform gradient descent on \(G\), or vice versa. \(\underset{G}{\min} \ \underset{D}{\max} \ V^{G,D}\) is not the same as \(\underset{D}{\max} \ \underset{G}{\min} \ V^{G,D}\), and convergence problems may appear. To get better results, we update the discriminator in the inner loop.
- How many times?
For each state of \(G\) (fixed parameters), \(D\) is usually trained 5 times. So for each update of the \(G\) parameters, the \(D\) parameters are updated 5 times in the inner loop, as shown in the sketch after this list. Commonly the \(D\) network has a simpler architecture than \(G\), so this is not too time-consuming.
- Why 5 times and not 1 or 100?
There is no fixed number of iterations for the inner loop; this value is a suggestion and may vary depending on the GAN model, but there must be a balance between the two networks. Back to our school analogy: an over-trained teacher will always spot which homework has been copied, so the student will not have a clue about what could work and will not improve his “cheating”; this is the vanishing gradient problem. On the other hand, an under-trained teacher will always be fooled, so the student has no need to improve and generate better quality copies, since his work is easily done.
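Putting the pieces together, the alternating training loop could be sketched as follows (a schematic, assuming `G`, `D` and `latent_dim` from the snippets above plus a `dataloader` yielding batches of real samples; the optimizer choice and learning rates are illustrative):

```python
import torch.optim as optim

opt_D = optim.Adam(D.parameters(), lr=2e-4)
opt_G = optim.Adam(G.parameters(), lr=2e-4)
k = 5  # inner-loop D updates per G update

for x_real in dataloader:
    batch = x_real.size(0)

    # Inner loop: train D with G frozen (gradient ascent on V)
    for _ in range(k):
        z = torch.randn(batch, latent_dim)
        x_fake = G(z).detach()  # detach so no gradient reaches G
        V = (torch.log(D(x_real)).mean()
             + torch.log(1.0 - D(x_fake)).mean())
        opt_D.zero_grad()
        (-V).backward()  # maximize V by minimizing -V
        opt_D.step()

    # Outer step: train G with D fixed (gradient descent on V)
    z = torch.randn(batch, latent_dim)
    loss_G = torch.log(1.0 - D(G(z))).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
```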
So far we have treated the vanilla GAN, but more sophisticated versions have arisen in recent years: the Deep Convolutional GAN (DCGAN) for image processing, the Wasserstein GAN (WGAN) for financial series, the Conditional GAN (cGAN) for multi-modal outputs… However, this is the underlying basis for all of them. Now that you are an expert in GANs you can go further into the different types, or go back to the basics and check the original GAN paper from Ian Goodfellow [3]. See you in the next GANs post!
References
[1] J. Hany and G. Walters (2019). Hands-On Generative Adversarial Networks with PyTorch 1.x. Packt Publishing.
[2] I. J. Goodfellow, Y. Bengio, and A. Courville (2016). Deep Learning. MIT Press.
[3] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems 27.
[4] I. J. Goodfellow (2017). NIPS 2016 Tutorial: Generative Adversarial Networks.