
The Benford law and the Zipf law

Jose Leiva

05/10/2018


Take your favourite newspaper and write down all the numbers you can find in it. Then take the first, leftmost digit of each number. This is called the most significant digit (MSD). What frequency do you expect to observe for each of these digits? If your answer is that all digits from 1 to 9 are equally likely, please think again!

To understand why the distribution is not uniform, consider the following news items, taken from The Guardian, The Times and the NY Times, with the numbers highlighted:

 

If we place all these numbers on an axis, we get the following:

OK, this representation on a natural (linear) scale is not useful. Since one of the numbers is huge, the rest are squeezed towards zero. What if we use a logarithmic scale instead?

 

 

Much better. This representation is more appropriate because the numbers we come across in daily life span a huge dynamic range (they can refer to goals, the GDP of countries or distances between galaxies), so we can assume they are uniformly distributed along a logarithmic axis. The American physicist Frank Benford realized this and stated his famous law.

Benford’s law

Coming back to our original question, what is the frequency of numbers starting with “1” compared with those starting with, for example, “4”? Under our previous assumption, we can answer the question by measuring the area corresponding to the numbers starting with “1” and “4”, respectively, on the logarithmic axis.

Numbers starting with “1” (blue) vs. numbers starting with “4” (red)

In the figure, we have painted in blue the area corresponding to numbers starting with 1 (from 1 to 2, from 10 to 20, etc.) and, for comparison, the area corresponding to numbers starting with 4 (from 4 to 5, from 40 to 50, etc.) in red. We can see that the blue area is wider than the red one.
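The argument can be checked with a small simulation. The following is a minimal sketch, assuming that "uniformly distributed across a logarithmic axis" means drawing the exponent uniformly over a whole number of decades. The function names, the number of decades and the sample size are illustrative choices, not part of the original post.

```python
import math
import random
from collections import Counter


def leading_digit(x: float) -> int:
    """Return the most significant digit of a positive number."""
    # Scale x into [1, 10) by dividing out its power of ten.
    exponent = math.floor(math.log10(x))
    digit = int(x / 10 ** exponent)
    # Guard against floating-point rounding at decade boundaries.
    return min(max(digit, 1), 9)


def simulate_leading_digits(n_samples: int = 100_000,
                            decades: int = 6,
                            seed: int = 0) -> dict[int, float]:
    """Draw numbers uniformly on a log axis; return leading-digit frequencies."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        # Assumption: the exponent is uniform, not the number itself.
        x = 10 ** rng.uniform(0, decades)
        counts[leading_digit(x)] += 1
    return {d: counts[d] / n_samples for d in range(1, 10)}


if __name__ == "__main__":
    for digit, freq in simulate_leading_digits().items():
        print(digit, round(freq, 3))
```

With enough samples, the frequency for 1 should come out clearly larger than the one for 4, mirroring the wider blue area in the figure.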

Formally, the probability mass function (PMF) that corresponds to this distribution is:

$$ p(n) = \log_{10}\frac{n+1}{n}, \quad n=1,\dots,9 \tag{1}$$

Proving that this is indeed a valid PMF, i.e. that \(\sum_{n=1}^9 p(n)=1\), is easy and fun (and left as an exercise). We plot the values provided by Eq. (1) in the following figure:

Probability mass function of Benford’s distribution.

As shown in the figure, numbers starting with 1 are more than six times as frequent as those starting with 9, so you will find that around 30% of numbers in the newspaper start with 1, and less than 5% start with a 9.
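The figures quoted above can be reproduced directly from Eq. (1). This is a minimal sketch; the function name `benford_pmf` is just an illustrative choice.

```python
import math


def benford_pmf(n: int) -> float:
    """Probability that the most significant digit equals n, per Eq. (1)."""
    if not 1 <= n <= 9:
        raise ValueError("n must be a digit from 1 to 9")
    return math.log10((n + 1) / n)


if __name__ == "__main__":
    probabilities = {n: benford_pmf(n) for n in range(1, 10)}
    for n, p in probabilities.items():
        print(f"{n}: {p:.3f}")
    # The nine probabilities should add up to 1 (up to rounding error).
    print("sum  :", round(sum(probabilities.values()), 12))
    # How much more frequent a leading 1 is than a leading 9.
    print("ratio:", round(probabilities[1] / probabilities[9], 2))
```

Here \(p(1)=\log_{10} 2 \approx 0.301\) and \(p(9)=\log_{10}(10/9) \approx 0.046\), which gives a ratio of roughly 6.6.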

OK, Benford’s law is cool, but is it also useful? Indeed it is. Keep reading!

The Enron scandal

In 2001, the American energy and commodities company Enron went bankrupt after the revelation of systematic accounting fraud. In other words, its books had been “cooked” for quite a while. It wasn’t the first company to “make up” its numbers, but it has been one of the biggest cases so far. The case raised concerns about the need for more powerful accounting techniques. Benford’s law could have helped to detect the fraud because “fake” numbers (those made up by humans) tend not to follow the Benford distribution. Because of that, Deutsche Bank has applied the law to the accounting books of a number of companies, with the surprising conclusion that companies that do not adhere to the law underperform the market!

Source: businessinsider.com

Also, the U.S. Internal Revenue Service is using Benford’s law to detect tax fraud, another kind of fraud that also involves replacing natural numbers with made-up ones. Unfortunately, a sophisticated fraudster can take countermeasures and make up numbers that follow Benford’s law.

So far we have considered a nice statistical property of “natural” numbers, i.e. those that we come across in our daily life. What about language? Does it hold any interesting statistical properties?
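To give an idea of what such a check might look like, here is a minimal sketch that compares the first digits of a list of amounts with Eq. (1) using a Pearson chi-square statistic. This is a generic goodness-of-fit test chosen for illustration; it is not claimed to be the method used by Deutsche Bank or the IRS. The function names and the synthetic data are assumptions.

```python
import math
import random
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DOF_5PCT = 15.507


def first_digit(value: float) -> int:
    """Most significant digit of a non-zero number, ignoring its sign."""
    x = abs(value)
    digit = int(x / 10 ** math.floor(math.log10(x)))
    # Guard against floating-point rounding at decade boundaries.
    return min(max(digit, 1), 9)


def benford_chi_square(values: list[float]) -> float:
    """Pearson chi-square statistic of observed first digits against Eq. (1)."""
    digits = [first_digit(v) for v in values if v != 0]
    total = len(digits)
    if total == 0:
        raise ValueError("need at least one non-zero value")
    observed = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = total * math.log10((d + 1) / d)
        statistic += (observed[d] - expected) ** 2 / expected
    return statistic


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic "natural" amounts: uniform on a log axis (an assumption).
    natural = [10 ** rng.uniform(0, 6) for _ in range(2000)]
    # Synthetic "made-up" amounts: every three-digit value equally likely.
    made_up = [rng.uniform(100, 999) for _ in range(2000)]
    for name, data in (("natural", natural), ("made-up", made_up)):
        stat = benford_chi_square(data)
        flag = "suspicious" if stat > CHI2_CRITICAL_8DOF_5PCT else "consistent"
        print(f"{name}: chi-square = {stat:.1f} ({flag})")
```

A large statistic is only a hint that the numbers deserve a closer look, not proof of fraud, and as noted above, a careful fraudster could produce numbers that pass this kind of test.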

Zipf’s law

We have the intuition that, regardless of the language, some words are more frequent than others. According to the Brown Corpus of American English text (a one-million-word repository of text for research purposes), the word “the” is the most frequent one, accounting for nearly 7% of all word occurrences. The second one is “of”, which accounts for slightly over 3.5% of words. The third one is “and”, present 2.8% of the time. Long before the Brown Corpus was built, the American linguist George K. Zipf had found this pattern and formulated his law as follows: the frequency of any word is inversely proportional to its rank in the frequency table. In other words, the second most frequent word is half as frequent as the first one, the third word’s frequency is a third of the first one’s, and so on. Mathematically:

$$ P(w) \propto \frac{1}{R(w)} $$

where \(P(w)\) is the probability (frequency) of the word \(w\) and \(R(w)\) is its rank in the frequency table.

As you can see, the law describes quite well the distribution of “the”, “of” and “and” we mentioned before. In fact, most studies have found that an exponent of 1.07 rather than 1 fits better:

$$ P(w) \propto \frac{1}{R(w)^{1.07}} $$
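To try this on a text of your own, the following minimal sketch builds the rank-frequency table and estimates the exponent with a least-squares fit of log-frequency against log-rank. The tokenization rule and the fitting method are simplifying assumptions for illustration; the studies mentioned above may have used different procedures.

```python
import math
import re
from collections import Counter


def rank_frequency(text: str) -> list[tuple[str, int, float]]:
    """Return (word, rank, relative frequency), most frequent word first."""
    # Assumption: a word is a run of ASCII letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return [(word, rank, count / total)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_exponent(frequencies: list[float]) -> float:
    """Estimate s in P(w) ~ 1 / R(w)^s from frequencies sorted by rank."""
    if len(frequencies) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    # The slope on a log-log plot is -s, so flip the sign.
    return -covariance / variance


if __name__ == "__main__":
    # Prediction of P(w) ~ 1/R(w) starting from the 7% quoted for "the".
    for rank in (1, 2, 3):
        print(rank, f"{0.07 / rank:.2%}")
```

Starting from 7% for the top word, an exponent of 1 predicts 3.5% for rank 2 and about 2.3% for rank 3, to be compared with the 3.5% and 2.8% quoted for the Brown Corpus.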

It is difficult to think of an application of Zipf’s law similar to the fraud detection based on Benford’s. Fake numbers might not follow Benford’s law if they are not carefully chosen, but “fake news” still follows the rules of grammar and fulfils Zipf’s law, so we cannot use it to identify them. Now imagine a text written in an unknown language: we could use Zipf’s law to help determine whether it is a real language or just a joke. Does such a text even exist?

The Voynich manuscript

The Voynich manuscript is probably the most enigmatic, mysterious book ever written. According to carbon-14 dating, it dates back to the beginning of the 15th century, and it is thought to have been written in the north of Italy. The book is full of illustrations of unknown plants; remarkably, one of them is very similar to an American variety of sunflower (remember: it was written around 1400). It looks like an ancient hoax, something similar to Borges’ Encyclopaedia of Tlön or the plot described in Umberto Eco’s Foucault’s Pendulum. But the most intriguing thing about the codex is that it is written in a completely unknown language and writing system.

Excerpt from the Voynich manuscript.

The text has been intensively studied by cryptographers, including WWII code-breakers and NSA experts, in order to decipher its contents. None of those efforts has succeeded. However, the entropy of the text is similar to that of real languages and the words seem to follow Zipf’s law. If this medieval book is a joke, it is an astonishingly sophisticated one. The mystery is still open!
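For readers curious about the entropy comparison, here is a minimal sketch of the plain Shannon entropy of a sequence of symbols (characters or words). Which entropy measure the Voynich studies actually used is not stated here, so treat this first-order version as an assumption made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(symbols) -> float:
    """Shannon entropy, in bits per symbol, of the empirical distribution."""
    counts = Counter(symbols)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one symbol")
    # H = -sum(p * log2(p)) over the observed symbols.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    sample = "the mystery is still open"
    print("per character:", round(shannon_entropy(sample), 3))
    print("per word     :", round(shannon_entropy(sample.split()), 3))
```

A text that repeats a single symbol has zero entropy, while a text in which all symbols are equally likely has the maximum possible value; natural languages fall somewhere in between.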