# Benford’s Law: around 30% of the first digits in many real-world datasets are “1”.

Yes, folks, it’s Benford’s Law – from Kaggle’s website.

One fun aspect of working with real data is that you get to observe real-life phenomena. For example, Benford’s Law (also known as the “first-digit law”) states:

“in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time.”
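The percentages in that statement come from a simple closed form: the expected share of leading digit d is log10(1 + 1/d). Here is a minimal Python sketch of that formula. It is not from the original post, and the function name `benford_expected` is just an illustrative choice.

```python
import math


def benford_expected(digit: int) -> float:
    """Expected probability that `digit` (1-9) is the leading digit under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Expected share for each possible leading digit.
expected = {d: benford_expected(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1, and the values for 1 and 9 line up with the “about 30%” and “less than 5%” figures in the quote.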

A simple SQL query on the training dataset gives us the leading-digit counts, which we can compare against the probabilities Benford’s Law predicts:

| digit | count | actual_probability | benford_expected_probability | abs diff |
|---|---|---|---|---|
| 1 | 3368866 | 27.9% | 30.1% | 2.3% |
| 2 | 1912850 | 15.8% | 17.6% | 1.8% |
| 3 | 1483366 | 12.3% | 12.5% | 0.2% |
| 4 | 1258157 | 10.4% | 9.7% | 0.7% |
| 5 | 1109766 | 9.2% | 7.9% | 1.3% |
| 6 | 933048 | 7.7% | 6.7% | 1.0% |
| 7 | 787636 | 6.5% | 5.8% | 0.7% |
| 8 | 668351 | 5.5% | 5.1% | 0.4% |
| 9 | 573359 | 4.7% | 4.6% | 0.1% |
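The original SQL query is not shown, so as a rough stand-in here is a Python sketch that rebuilds the probability columns from the counts in the table. The helper names `leading_digit` and `compare_to_benford` are hypothetical, and how the leading digit was actually extracted in the query is an assumption on my part.

```python
import math


def leading_digit(value: float) -> int:
    """Return the first non-zero digit of a number (assumed extraction rule)."""
    for ch in str(abs(value)):
        if ch in "123456789":
            return int(ch)
    raise ValueError("zero has no leading digit")


def compare_to_benford(counts: dict) -> list:
    """Return (digit, actual, expected, abs_diff) tuples for digits 1-9."""
    total = sum(counts.values())
    rows = []
    for digit in range(1, 10):
        actual = counts[digit] / total
        expected = math.log10(1 + 1 / digit)
        rows.append((digit, actual, expected, abs(actual - expected)))
    return rows


# Leading-digit counts copied from the table above.
counts = {
    1: 3368866, 2: 1912850, 3: 1483366,
    4: 1258157, 5: 1109766, 6: 933048,
    7: 787636, 8: 668351, 9: 573359,
}

rows = compare_to_benford(counts)
for digit, actual, expected, diff in rows:
    print(f"{digit}  {actual:6.1%}  {expected:6.1%}  {diff:5.1%}")
```

Given a raw column of values, the counts themselves could be built with something like `collections.Counter(leading_digit(v) for v in values)`.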

Sure enough, the data from millions of shopping visits tracks the law closely: no digit is off by more than 2.3 percentage points.

I just thought this was an interesting application of something you hear about all the time in statistics discussions.

10 years ago

If you express all your numbers in binary, the first digit is always 1.

derrida derider
10 years ago

Benford’s Law is well known as a way of testing whether data has been fabricated – it’s caught many cheaters in science and the social sciences. Unfortunately it’s become well known enough that the cheaters are adapting to it …

Rex
10 years ago

Clearly ‘1’ is a very productive digit. Punching well above its weight in world affairs. Have there been any studies done to see if some of the slacker numbers could either shape up or ship out?

Rex
10 years ago

That’s an even better angle, Nicholas. I wonder if Peter Foster has a prospectus out yet?