Yes, folks, it’s Benford’s Law – from Kaggle’s website.

One fun aspect of working with real data is that you get to observe real-life phenomena. For example, Benford’s Law (also known as the “first-digit law”) states:

“in lists of numbers from many (but not all) real-life sources of data, the leading digit is distributed in a specific, non-uniform way. According to this law, the first digit is 1 about 30% of the time, and larger digits occur as the leading digit with lower and lower frequency, to the point where 9 as a first digit occurs less than 5% of the time.”
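The percentages in that quote come from the standard statement of Benford’s Law, under which the probability of leading digit d is log10(1 + 1/d). As a rough sketch (the function name here is just an illustrative choice, not something from the Kaggle post), the expected share for each digit can be computed like this:

```python
import math

def benford_probability(digit: int) -> float:
    """Expected share of numbers whose leading digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

# Expected probability for each possible leading digit.
expected = {d: benford_probability(d) for d in range(1, 10)}

for d, p in expected.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1, with 1 coming out at roughly 30% and 9 at a little under 5%, in line with the quote.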

A simple SQL query on the training dataset gives us the observed first-digit counts, which we can compare with the probabilities Benford’s Law predicts:

digit    count      actual_probability    benford_expected_probability    abs_diff
1        3368866    27.9%                 30.1%                           2.3%
2        1912850    15.8%                 17.6%                           1.8%
3        1483366    12.3%                 12.5%                           0.2%
4        1258157    10.4%                 9.7%                            0.7%
5        1109766    9.2%                  7.9%                            1.3%
6        933048     7.7%                  6.7%                            1.0%
7        787636     6.5%                  5.8%                            0.7%
8        668351     5.5%                  5.1%                            0.4%
9        573359     4.7%                  4.6%                            0.1%

Sure enough, the data from millions of shopping visits demonstrates the validity of this law.
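The original SQL query isn’t shown, but the comparison can be rebuilt from the counts alone. Here is a minimal sketch that takes the nine counts quoted above, converts them to shares, and sets them beside the log10(1 + 1/d) expectation. The variable names are illustrative, and the output may differ from the figures above in the last decimal place because of rounding.

```python
import math

# First-digit counts as quoted in the post.
counts = {
    1: 3368866, 2: 1912850, 3: 1483366,
    4: 1258157, 5: 1109766, 6: 933048,
    7: 787636, 8: 668351, 9: 573359,
}

total = sum(counts.values())

rows = []
for digit, count in counts.items():
    observed = count / total               # actual share of this leading digit
    expected = math.log10(1 + 1 / digit)   # Benford's expected share
    rows.append((digit, observed, expected, abs(observed - expected)))

for digit, observed, expected, diff in rows:
    print(f"{digit}  {observed:6.1%}  {expected:6.1%}  {diff:5.1%}")
```

The largest gap is for the digit 1, at a little over two percentage points. The remaining digits are all within two points of the prediction.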

I just thought this was an interesting application of something you hear about all the time in statistics discussions.

If you express all your numbers in binary, the first digit is always 1.

Benford’s Law is well known as a way of testing whether data has been fabricated – it’s caught many cheaters in science and the social sciences. Unfortunately it’s become well known enough that the cheaters are adapting to it …

Clearly ‘1’ is a very productive digit. Punching well above its weight in world affairs. Have there been any studies done to see if some of the slacker numbers could either shape up or ship out?

That’s a good angle, Rex. I think we should have an international commission on it. Benford would also suggest that when they eventually come to privatise the digits, ‘1’ will have the highest market cap – that’s where the smart money will be.

That’s an even better angle, Nicholas. I wonder if Peter Foster has a prospectus out yet?

No 1 also has the best marketing lines. Who wants to be No 2?