Benford’s law is a really fascinating observation that in many real-life sets of
numerical data, the first digit is most likely to be 1, and every digit `d` is
more common than `d+1`. Here’s a table of the probability distribution, from
Wikipedia:
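
For reference, the distribution is usually stated as `P(d) = log10(1 + 1/d)`
for `d` from 1 to 9. Here is a minimal Python sketch that prints those values;
the function name `benford_probability` is just an illustrative choice:

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the leading digit is d (1..9) under Benford's law."""
    return math.log10(1 + 1 / d)

# Print the distribution as percentages, one digit per line.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

The nine probabilities sum to 1, and each one is larger than the next.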

Now, the caveat “*real-life* data sets” is really important. Specifically, this
only applies when the data spans several orders of magnitude. Clearly, if we’re
measuring the height in inches of some large group of adults, the overwhelming
majority of data will lie between 50 and 85 inches, and won’t follow Benford’s
law. Another aspect of real-life data is that it’s non-random; if we take a
bunch of truly random (say, uniformly distributed) numbers spanning several
orders of magnitude, their leading digit won’t follow Benford’s law either.
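
To see the last point in practice, here is a small Python sketch. It assumes
that “truly random” means integers drawn uniformly from 1 to 999,999; the
range, sample size and seed are arbitrary choices made for illustration:

```python
import random
from collections import Counter

def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

rng = random.Random(0)  # fixed seed so the run is repeatable
# Assumption: "truly random" = uniform over 1..999,999 (six orders of magnitude).
samples = [rng.randint(1, 999_999) for _ in range(100_000)]

counts = Counter(leading_digit(n) for n in samples)
freqs = {d: counts[d] / len(samples) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: {freqs[d]:.3f}")
```

Every digit should come out close to 1/9 (about 0.111), because each leading
digit covers exactly the same number of integers in this range.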

In this short post I’ll try to explain how I understand Benford’s law, and why
it intuitively makes sense. During the post I’ll collect a set of **clues**,
which will help get the intuition in place eventually. By the way, we’ve already
encountered our first clues:

**Clue 1**: Benford’s law only works on real-life data.

**Clue 2**: Benford’s law isn’t just about the digit 1; 2 is more common than
3, 3 is more common than 4, etc.

## Real-world example

First, let’s start with a real-world demonstration of the law in action. I
found a data table of the populations of California’s ~480 largest cities, and
ran an analysis of the leading digit of each population number. Clearly, this is
real-life data, and it also spans many orders of magnitude (from LA at 3.9
million to Amador with 153 inhabitants). Indeed, Benford’s law applies
beautifully to this data:
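
An analysis like this takes only a few lines. The sketch below is not the
script used for the post; the CSV layout, the `population` column name and all
function names are assumptions made for illustration:

```python
import csv
import math
from collections import Counter

def leading_digit_shares(values):
    """Share of each leading digit 1..9 among the positive integers in values."""
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    total = sum(counts.values())
    return {d: counts[d] / total for d in range(1, 10)}

def load_populations(path, column="population"):
    """Read one integer column from a CSV file (hypothetical file layout)."""
    with open(path, newline="") as f:
        return [int(row[column]) for row in csv.DictReader(f)]

def compare_to_benford(values):
    """Pair each observed share with the Benford prediction log10(1 + 1/d)."""
    shares = leading_digit_shares(values)
    return {d: (shares[d], math.log10(1 + 1 / d)) for d in range(1, 10)}
```

With a file of city populations at hand, something like
`compare_to_benford(load_populations("cities.csv"))` (the file name is made up)
would give the observed and expected share for each digit side by side.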

Eyeballing the city population data, we’ll notice something important but also
totally intuitive: most cities are small. There are many more small cities than
large ones. Out of the 480 cities in our data set, only 74 have a population
over 100k, for example.

The same is true of other real-world data sets; for example, if we take a
snapshot of stock prices of S&P 500 companies at some historic point, the prices
range from $1806 to $2, though 90% are under $182 and 65% are under $100.

**Clue 3**: in real-world data distributed along many orders of magnitude,
smaller data points are more common than larger data points.

Statistically, this is akin to saying that the data follows the Pareto
distribution, of which the “80-20 rule” – known as the Pareto principle – is a
special case. Another similar mathematical description (applied to discrete
probability distributions) is Zipf’s law.

## Logarithmic scale

To reiterate, a lot of real-world data isn’t really uniformly distributed.
Rather, it follows a Pareto distribution where smaller numbers are more common.

Here’s a useful logarithmic scale borrowed from Wikipedia – this could be the
X axis of any logarithmic plot:

In this image, smaller values get more “real estate” on the X axis, which is
fair for our distribution if smaller numbers are more common than larger
numbers. It should not be hard to convince yourself that every time we “drop a
pin” on this scale, the chance of the leading digit being 1 is the highest.
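
The pin-dropping experiment is easy to simulate. The Python sketch below
assumes each pin lands uniformly at random along a log axis running from 1 to
1,000,000; the range, sample size and seed are arbitrary illustrative choices:

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number."""
    mantissa = x / 10 ** math.floor(math.log10(x))  # scale into [1, 10)
    return min(9, max(1, int(mantissa)))  # clamp guards against float rounding

rng = random.Random(0)  # fixed seed so the run is repeatable
# Each pin position is uniform on the log axis between 10^0 and 10^6.
pins = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

counts = Counter(leading_digit(x) for x in pins)
freqs = {d: counts[d] / len(pins) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {freqs[d]:.3f}, Benford {math.log10(1 + 1 / d):.3f}")
```

The observed shares should land close to the Benford values, with 1 the most
common leading digit and 9 the least common.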

Another (related) way to look at it is – when smaller numbers are more common,
it takes a 100% increase to go from the leading digit being 1 to it being 2,
but only a 50% increase to go from 2 to 3, etc.

**Clue 4**: on a logarithmic scale, the distance between numbers starting
with 1s and numbers starting with 2s is bigger than the distance between
numbers starting with 2s and numbers starting with 3s, and so on.
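
To make Clue 4 concrete: on a base-10 logarithmic axis, the stretch occupied by
numbers whose leading digit is $d$ has the same width in every decade,

$$
\log_{10}(d+1) - \log_{10}(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d = 1, \dots, 9,
$$

and the nine widths add up to exactly one decade:

$$
\sum_{d=1}^{9} \log_{10}\!\left(1 + \frac{1}{d}\right)
  = \log_{10}\!\left(\frac{2}{1} \cdot \frac{3}{2} \cdots \frac{10}{9}\right)
  = \log_{10} 10 = 1.
$$

Under the assumption that a data point is equally likely to land anywhere along
the log axis (an idealization, not something real data guarantees), these
widths are the leading-digit probabilities: about 0.301 for 1, 0.176 for 2,
0.125 for 3, down to about 0.046 for 9.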

We can visualize this in another way; let’s plot the ratio of numbers starting
with 1 among all numbers up to some point. On the X axis we’ll place N which
means “in all numbers up to N”, and on the Y axis we’ll place the ratio of
numbers `i` between 1 and N that start with 1:

Note that whenever some new order of magnitude is reached, the ratio starts to
climb steadily until it reaches ~0.5 (because there are roughly as many numbers
with up to D digits as numbers starting with 1 and followed by another D
digits); it then starts falling until it reaches ~0.11 just before we flip to
the next order of magnitude (because among all numbers with a given number of
digits, each of the nine possible leading digits accounts for one ninth of the
population). If we calculate the smoothed average of this graph, it ends up at
about 0.3, which corresponds to Benford’s law.
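
The curve can be reproduced with a few lines of Python. This is a sketch rather
than the code behind the plot; in particular, it assumes the “smoothed average”
is taken over values of N spaced evenly on a logarithmic scale, which the post
does not spell out:

```python
import math

def ratio_leading_one(n: int) -> float:
    """Fraction of the integers 1..n whose leading digit is 1."""
    count = 0
    power = 1
    while power <= n:
        # The integers in [power, 2*power - 1] are the ones starting with 1.
        count += min(n, 2 * power - 1) - power + 1
        power *= 10
    return count / n

# Troughs just before a new order of magnitude, peaks just before the 2s start.
for n in (99, 199, 999, 1999, 9999, 19999):
    print(n, round(ratio_leading_one(n), 3))

# Assumption: average over N spaced evenly on a log scale from 10^3 to 10^6.
steps = 3000
log_spaced = [round(10 ** (3 + 3 * i / steps)) for i in range(steps)]
average = sum(ratio_leading_one(n) for n in log_spaced) / steps
print("log-scale average:", round(average, 3))
print("log10(2):", round(math.log10(2), 3))
```

The troughs sit near 1/9 and the peaks a little above 0.5, and the log-scale
average comes out close to `log10(2)`, the Benford probability for the digit 1.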

## Summary

When I’m thinking of Benford’s law, the observation that really brings it home
for me is that “smaller numbers are more common than larger numbers” (this is
clue 3). This property of many realistic data sets, along with an understanding
of the logarithmic scale (the penultimate image above) is really all you need
to intuitively grok Benford’s law.

Benford’s law is also famous for being scale-invariant (by typically applying
regardless of the unit of measurement) and base-invariant (works in bases other
than 10). Hopefully, this post makes it clear why these properties are expected
to be true.