For every quantity to be estimated (estimand), there’s a plethora of ways to estimate it (estimators). This raises the question of what properties we should look for in order to make a sensible choice. One property that is often highlighted is unbiasedness, which we will discuss below with a particular focus on its shortcomings.

## Introducing unbiasedness

In estimation, we want to obtain the true value of an estimand \( \theta^*\), for which we use an estimator \( \hat{\theta}(\cdot) \). Note that the estimator is a function: when we evaluate it on a particular sample \(X\), we get an estimate \(\hat{\theta}(X) \):

$$ \hat{\theta}(X) \simeq \theta^*. \quad (1)$$

In this context, an unbiased estimator is one whose expected output, \( \mathrm{E}[\hat{\theta}(X)] \), is the true value being estimated. From a frequentist point of view, this means that if we repeated the process many times (each time with a different sample \( X_i \), thus getting a different estimate \( \hat{\theta}(X_i) \)), the average of those estimates would tend towards the true value \( \theta^*\). In other words, our errors will be equally distributed^{[1]} on each side of the estimand, which is why we say the estimator has no bias. Formally, the bias is defined as:

$$ \text{Bias} = \mathrm{E}[\hat{\theta}(X)] - \theta^*. \quad (2)$$

### Unbiasedness alone is not enough

To get a better feel for this concept, we can start with a basic example: the estimation of the population mean \( \mu^* \). Let’s say we have a sample \( X \), composed of \( n \) values, \( (x_1, x_2, \dots, x_n) \). The most common estimator would be the sample mean, \( \hat{\mu} \):

$$ \hat{\mu} = \frac{1}{n} \sum_i x_i. \quad (3)$$

Note that because the expected value of any \( x_i \) is by definition the population mean,

$$ \mathrm{E}[x_i] = \mu^*, \quad(4)$$

we have an unbiased estimator:

$$ \mathrm{E} [ \hat{\mu} ] = \mathrm{E} \left[ \frac{1}{n} \sum_i x_i \right] = \frac{1}{n} \sum_i \mathrm{E} [x_i ] = \frac{1}{n} n \mu^* = \mu^*. \quad (5)$$

However, there’s more to this. If you take another look at equation (4), you can see that a single observation \( x_i \) is by itself already an unbiased estimator of \( \mu^* \). Clearly the sample mean (3) is preferable to a single observation, but considering unbiasedness alone we cannot tell the two apart, so we need to incorporate some additional criterion.
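A quick simulation can illustrate the point. The following is a minimal sketch, assuming normally distributed data; the population parameters, sample size, and number of repetitions are arbitrary choices made for illustration, not values from the discussion above.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma_true = 3.0, 2.0   # assumed population mean and standard deviation
n, n_repeats = 10, 100_000       # assumed sample size and number of repeated samples

# Each row is one sample X of size n.
samples = rng.normal(mu_true, sigma_true, size=(n_repeats, n))

sample_mean = samples.mean(axis=1)   # estimator (3), one estimate per sample
single_value = samples[:, 0]         # a single observation x_i, as in (4)

# Monte Carlo approximation of the bias (2) for each estimator.
bias_mean = sample_mean.mean() - mu_true
bias_single = single_value.mean() - mu_true
print(f"approximate bias of the sample mean:   {bias_mean:+.4f}")
print(f"approximate bias of a single value:    {bias_single:+.4f}")
```

Both approximate biases should come out close to zero, so this criterion gives us no grounds to prefer one estimator over the other.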

## Variance to the rescue

The problem with our current condition of unbiasedness is that it is satisfied as long as our estimator is not systematically erring on either side of the true value, regardless of the magnitude of the deviations. Presumably, if we looked at this magnitude we could confirm our intuition that the sample mean is a better estimator of the population mean than just taking a random value from the sample.

To take this into account, we can add another criterion that penalizes estimators with greater dispersion, which we can measure through variance (i.e., expected squared deviation):

$$ \text{Variance} = \mathrm{E} \left[ \left( \hat{\theta} - \mathrm{E} [ \hat{\theta}] \right)^2 \right]. \quad (6)$$

The idea is that if our estimates are more concentrated around their average, and their average is close to the true value, our error will be smaller. When our estimator is unbiased, \( \mathrm{E} [ \hat{\theta}] = \theta^* \), we can do a straightforward substitution to see it more clearly:

$$ \text{Variance} = \mathrm{E} \left[ \left( \hat{\theta} - \theta^* \right)^2 \right]. \quad (7)$$

Now variance is equivalent to the expected squared deviation from the true value, so by minimizing it we minimize the error of our estimator, leading to what’s called the Minimum Variance Unbiased Estimator (MVUE, for short).
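To see the variance criterion separate the two unbiased estimators of the mean, here is a small sketch under the same kind of assumptions as before (normal data, arbitrary illustrative parameters). For independent draws the variance of the sample mean is \( \sigma^2 / n \), whereas a single observation has variance \( \sigma^2 \).

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_true = 3.0, 2.0   # assumed population parameters
n, n_repeats = 10, 100_000       # assumed sample size and number of repetitions

samples = rng.normal(mu_true, sigma_true, size=(n_repeats, n))

# Monte Carlo approximation of the variance (6) of each estimator.
var_mean = samples.mean(axis=1).var()   # sample mean
var_single = samples[:, 0].var()        # single observation

print(f"sample mean:  {var_mean:.3f} (theory: {sigma_true**2 / n:.3f})")
print(f"single value: {var_single:.3f} (theory: {sigma_true**2:.3f})")
```

With these settings the sample mean should show roughly a tenth of the variance of a single observation, which matches the intuition that it is the better estimator.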

## Why not minimize the error all along?

It seems like we have gone a long way to minimize the expected error, so why not do that in the first place? Not only would this be more straightforward, but one could argue it’s also a more meaningful approach, as we will see. Our objective would thus be to obtain an estimator with minimum mean squared error (MSE):

$$ \text{MSE} = \mathrm{E} \left[ \left( \hat{\theta} - \theta^* \right)^2 \right]. \quad (8)$$

To see how this relates to the concepts of bias and variance (we’ve seen MSE equals variance for unbiased estimators, but we are interested in the more general case) we can add and subtract \( \mathrm{E} [\hat{\theta}] \) and rearrange a little bit:

$$ \begin{aligned} \text{MSE} &= \mathrm{E} \left[ \left( \hat{\theta} - \mathrm{E} [\hat{\theta}] + \mathrm{E} [\hat{\theta}] - \theta^* \right)^2 \right] \\ &= \mathrm{E} \left[ \left( \hat{\theta} - \mathrm{E} [\hat{\theta}] \right)^2 \right] + \mathrm{E} \left[ \left( \mathrm{E} [\hat{\theta}] - \theta^* \right)^2 \right] + 2 \mathrm{E} \left[ \left( \hat{\theta} - \mathrm{E} [\hat{\theta}] \right) \left( \mathrm{E} [\hat{\theta}] - \theta^* \right) \right]. \end{aligned} \quad(9) $$

Since \( \mathrm{E} [ \hat{\theta} ] \) and \( \theta^* \) are constants, we can move them out of the expectation:

$$ \text{MSE} = \mathrm{E} \left[ \left( \hat{\theta} - \mathrm{E} [\hat{\theta}] \right)^2 \right] + \left( \mathrm{E} [\hat{\theta}] - \theta^* \right)^2 + 2 \mathrm{E} \left[ \hat{\theta} - \mathrm{E} [\hat{\theta}] \right] \left( \mathrm{E} [\hat{\theta}] - \theta^* \right). \quad (10) $$

Note that the first term is the variance, the second is the bias squared and the third is zero, given that \( \mathrm{E} \left[ \hat{\theta} - \mathrm{E} [\hat{\theta}] \right] = \mathrm{E} [ \hat{\theta} ] - \mathrm{E} [\hat{\theta}] = 0 \). Hence:

$$ \text{MSE} = \text{Variance} + \text{Bias}^2. \quad(11) $$

Observe that this is consistent with our previous result that, for unbiased estimators, variance equals MSE.
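The decomposition can also be checked numerically. The sketch below uses a deliberately biased estimator, a sample mean shrunk by a factor of 0.8, which is a hypothetical choice made only so that the bias term is not zero; the data distribution and parameters are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma_true = 3.0, 2.0   # assumed population parameters
n, n_repeats = 10, 50_000        # assumed sample size and number of repetitions

samples = rng.normal(mu_true, sigma_true, size=(n_repeats, n))

# Hypothetical biased estimator of the mean: shrink the sample mean towards zero.
estimates = 0.8 * samples.mean(axis=1)

mse = np.mean((estimates - mu_true) ** 2)   # Monte Carlo version of (8)
variance = np.var(estimates)                # Monte Carlo version of (6)
bias = np.mean(estimates) - mu_true         # Monte Carlo version of (2)

print(f"MSE:               {mse:.4f}")
print(f"Variance + Bias^2: {variance + bias**2:.4f}")
```

The two printed numbers agree up to floating-point error, because the identity (11) also holds for the empirical moments of any set of estimates.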

### The bias-variance trade-off

Looking at (11), we can see that increasing bias might be useful whenever it allows us to reduce variance by a greater amount (greater than the increase in bias squared, that is). In general, it is not uncommon to have a trade-off between bias and variance, especially when we are at the optimum MSE, where the only way to reduce one is by increasing the other (otherwise we could further decrease the MSE, meaning we weren’t at the optimum in the first place).

In this context, there’s no reason to prioritize eliminating bias over reducing variance, as the MVUE framework does, and in fact the opposite is more often the case. For instance, when models struggle to generalize (i.e., to achieve high out-of-sample performance), lowering the variance at the expense of bias (even at the expense of in-sample MSE) is a common recommendation. This is because an estimator with low variance will not vary much depending on the sample, and is thus less prone to overfitting. In general, the bias-variance trade-off constitutes one of the fundamental principles concerning parameter optimization in statistics and machine learning, and highlights the limitations of sticking to unbiasedness.
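As a toy illustration of the trade-off, consider the hypothetical family of estimators \( c \, \hat{\mu} \), with \( 0 \le c \le 1 \). Its variance is \( c^2 \sigma^2 / n \) and its bias is \( (c - 1) \mu^* \), so (11) gives the MSE in closed form. The sketch below evaluates it on a grid; the parameter values are assumptions chosen to make the effect visible.

```python
import numpy as np

mu_true, sigma_true, n = 1.0, 2.0, 5   # assumed population parameters and sample size


def mse_shrunk_mean(c):
    """MSE of the hypothetical estimator c * sample_mean, using equation (11)."""
    variance = c**2 * sigma_true**2 / n
    bias = (c - 1.0) * mu_true
    return variance + bias**2


shrink_factors = np.linspace(0.0, 1.0, 101)
mse_values = mse_shrunk_mean(shrink_factors)
best_c = shrink_factors[np.argmin(mse_values)]

print(f"MSE of the unbiased sample mean (c = 1): {mse_shrunk_mean(1.0):.3f}")
print(f"lowest MSE on the grid: {mse_values.min():.3f} at c = {best_c:.2f}")
```

Under these assumptions the lowest MSE is reached by a biased estimator (some \( c < 1 \)). Note that the best \( c \) depends on the unknown \( \mu^* \) and \( \sigma \), so this is an illustration of the trade-off rather than a practical recipe.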

## Is it even feasible to strive for unbiasedness?

Let’s consider the case of variance estimation, variance being (as we’ve seen) the expected squared deviation from the mean. The most straightforward estimator would be the following,

$$ \hat{\sigma}^2 = \frac{1}{n} \sum_i (x_i - \hat{\mu}) ^2, \quad (12) $$

which is the Maximum Likelihood Estimator (MLE) when the data are assumed to be normally distributed. However, since this estimator is biased (assuming we have estimated the mean using (3)), it is not uncommon to see the divisor \( n \) replaced by \( n - 1 \):

$$ \hat{\sigma}^2 = \frac{1}{n-1} \sum_i (x_i - \hat{\mu}) ^2. \quad (13) $$

This is called Bessel’s correction and leads to an unbiased estimation of variance. Then the standard deviation is often (for instance, it’s the default in pandas and scipy) obtained as follows:

$$ \hat{\sigma} = \sqrt{ \frac{1}{n-1} \sum_i (x_i - \hat{\mu}) ^2 } . \quad (14) $$

However, this is no longer an unbiased estimator; in fact, it will be biased downwards. This follows from Jensen’s inequality and the fact that the square root is a concave function:

$$ \mathrm{E} \left[ \sqrt{ \hat{\sigma}^2} \right] < \sqrt{ \mathrm{E} \left[ \hat{\sigma}^2 \right] } = \sqrt{\sigma^2} = \sigma. \quad (15) $$
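The following sketch checks these three claims by simulation, assuming standard normal data and a small sample size (both arbitrary choices for illustration). It relies on NumPy’s `ddof` argument, which sets the divisor to `n - ddof`.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_true = 1.0             # assumed population standard deviation
n, n_repeats = 5, 200_000    # assumed sample size and number of repetitions

samples = rng.normal(0.0, sigma_true, size=(n_repeats, n))

var_mle = samples.var(axis=1, ddof=0)      # estimator (12), divisor n
var_bessel = samples.var(axis=1, ddof=1)   # estimator (13), divisor n - 1
std_bessel = np.sqrt(var_bessel)           # estimator (14)

print(f"mean of (12): {var_mle.mean():.3f}  (true variance: {sigma_true**2:.3f})")
print(f"mean of (13): {var_bessel.mean():.3f}  (true variance: {sigma_true**2:.3f})")
print(f"mean of (14): {std_bessel.mean():.3f}  (true std: {sigma_true:.3f})")
```

On average, (12) should fall below the true variance by a factor of about \( (n-1)/n \), (13) should land close to it, and (14) should still fall below the true standard deviation, as (15) predicts.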

Correcting this bias is not straightforward, as it requires further assumptions about the underlying distribution from which the data have been obtained. In general, non-linear transformations are bound to introduce bias in the resulting estimate. Another example is put forward by Tom Leinster:

*“Being unbiased is perhaps a less crucial property of an estimator than it might at first appear. Suppose the boss of a chain of pizza takeaways wants to know the average size of pizzas ordered. “Size” could be measured by diameter — what you order by — or area — what you eat. But since the relationship between diameter and area is quadratic rather than linear, an unbiased estimator of one will be a biased estimator of the other.”*
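One possible reading of this example can be sketched as follows. The diameter distribution, its parameters, and the sample size are all invented for illustration. We estimate the mean diameter with the sample mean, which is unbiased, and then convert that estimate into an area.

```python
import numpy as np

rng = np.random.default_rng(4)
mu_diameter, sigma_diameter = 30.0, 5.0   # hypothetical diameters in cm
n, n_repeats = 10, 100_000                # assumed sample size and repetitions

diameters = rng.normal(mu_diameter, sigma_diameter, size=(n_repeats, n))

# Unbiased estimator of the mean diameter.
mean_diameter = diameters.mean(axis=1)

# Plug the diameter estimate into the area formula.
area_from_mean_diameter = np.pi * (mean_diameter / 2) ** 2

# True mean area, using E[d^2] = mu^2 + sigma^2.
true_mean_area = np.pi / 4 * (mu_diameter**2 + sigma_diameter**2)

print(f"average diameter estimate: {mean_diameter.mean():.2f} (true: {mu_diameter:.2f})")
print(f"average area estimate:     {area_from_mean_diameter.mean():.1f} (true: {true_mean_area:.1f})")
```

Under these assumptions the diameter estimates average out to the true mean diameter, while the derived area estimates systematically fall short of the true mean area.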

## Wrapping up

With all of this in mind, aiming for unbiasedness seems more problematic than useful, and it’s preferable to rely on other alternatives for building estimators. Here we have briefly discussed MSE, but Maximum Likelihood Estimation (MLE) is a reasonable default too. Or one could just go Bayesian, but that will be a topic for another time.

## Footnotes

[1] Actually this would only be true if the error distribution were symmetric, but it serves to get the point across.