A while ago, I began the course ‘Exploratory Data Analysis’ within the ‘Data Science’ specialisation on the Coursera website (which, by the way, I recommend to anyone who’s curious about the subject).
In one of the first classes, which outlined the basic principles of visual data analysis (and in particular the convenience of displaying multivariate data), we were given the following example, and then asked about it:
From the data of the following study, the relationship between the particle concentration in the air and daily mortality is plotted on a graph.
As you can see, the regression line slope is negative.
Now we split the sample by the season in which each measurement was taken. In every season, the slope of the regression line is positive.
How is this possible?
This is known as Simpson’s paradox.
“Simpson’s paradox, or the Yule–Simpson effect, is a phenomenon in probability and statistics, in which a trend appears in different groups of data but disappears or reverses when these groups are combined.” – Wikipedia.
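To make the reversal concrete, here is a minimal sketch in Python. The numbers are invented for illustration and are not the data from the course example; the group names and the `slope` helper are assumptions of this sketch. Within each group the fitted slope is positive, yet the slope fitted to the pooled points is negative, because the groups with larger x values start from a lower baseline.

```python
# Synthetic illustration only: these numbers are invented,
# not taken from the air-pollution study mentioned above.

def slope(xs, ys):
    """Least-squares slope of the line fitted to the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance

# Hypothetical groups: inside each one, y rises by 0.5 per unit of x,
# but groups with larger x values have a lower baseline.
groups = {
    "group_a": ([1, 2, 3], [10.5, 11.0, 11.5]),
    "group_b": ([4, 5, 6], [8.0, 8.5, 9.0]),
    "group_c": ([7, 8, 9], [5.5, 6.0, 6.5]),
}

# Slope fitted separately inside each group.
within_slopes = {name: slope(xs, ys) for name, (xs, ys) in groups.items()}

# Slope fitted to all points at once, ignoring the grouping.
all_x = [x for xs, _ in groups.values() for x in xs]
all_y = [y for _, ys in groups.values() for y in ys]
pooled_slope = slope(all_x, all_y)

print("within-group slopes:", within_slopes)
print("pooled slope:", pooled_slope)
```

The grouping variable (the season, in the course example) is what the pooled regression leaves out.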
One of the best-known examples of this paradox occurred in 1973 at the University of California, Berkeley.
The university’s admission figures for that year showed that female applicants were less likely to be accepted than male applicants. The gap was so large that it was very unlikely to be due to chance.
| | Men | Women |
| Admissions | 1198 | 557 |
| Rejections | 1493 | 1278 |
| % Admissions | 44.52% | 30.35% |
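The percentages in the table, and a rough idea of how unlikely such a gap is under pure chance, can be checked with a few lines of Python. This is only a sketch: the two-proportion z-statistic is my choice of check, not necessarily the test used in the original analysis, and the variable names are illustrative.

```python
import math

# Counts copied from the table above.
admitted = {"men": 1198, "women": 557}
rejected = {"men": 1493, "women": 1278}

# Total applicants and admission rate per group.
applicants = {g: admitted[g] + rejected[g] for g in admitted}
rate = {g: admitted[g] / applicants[g] for g in admitted}

# Two-proportion z-statistic (an assumed choice of test for this sketch):
# how many standard errors apart the two rates are if both groups
# actually shared the same underlying admission rate.
pooled_rate = sum(admitted.values()) / sum(applicants.values())
standard_error = math.sqrt(
    pooled_rate * (1 - pooled_rate)
    * (1 / applicants["men"] + 1 / applicants["women"])
)
z = (rate["men"] - rate["women"]) / standard_error

for group in rate:
    print(f"{group}: {rate[group]:.2%} admitted")
print(f"z-statistic: {z:.2f}")
```

A z-statistic above 9 corresponds to a vanishingly small probability under the no-difference assumption, which is consistent with the claim that chance alone is not a plausible explanation for the aggregate gap.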
Faced with this data, a young woman whose application had just been rejected filed a discrimination suit against the university.
However, when the data were later analysed department by department, no department showed evidence of bias against female applicants. In fact, most departments showed a small but statistically significant bias in favour of women.
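One way such a reversal can arise is when one group applies mostly to departments with low admission rates. The sketch below uses two hypothetical departments with invented counts (they are not the Berkeley figures, and the names `dept_x` and `dept_y` are made up): women have the higher admission rate in each department, yet the lower rate overall.

```python
# Hypothetical counts, invented for illustration: (applicants, admitted).
# dept_x admits most applicants; dept_y admits few.
departments = {
    "dept_x": {"men": (800, 480), "women": (100, 65)},
    "dept_y": {"men": (200, 40), "women": (900, 225)},
}

# Admission rate per department and gender.
per_department = {
    dept: {gender: adm / app for gender, (app, adm) in counts.items()}
    for dept, counts in departments.items()
}

# Admission rate per gender after combining both departments.
overall = {}
for gender in ("men", "women"):
    total_applicants = sum(d[gender][0] for d in departments.values())
    total_admitted = sum(d[gender][1] for d in departments.values())
    overall[gender] = total_admitted / total_applicants

for dept, rates in per_department.items():
    print(dept, {g: f"{r:.0%}" for g, r in rates.items()})
print("overall", {g: f"{r:.0%}" for g, r in overall.items()})
```

In this toy data most women apply to the selective department and most men to the lenient one, so the combined figures mainly reflect where each group applied rather than how each department treated its applicants.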