Armchair Statistics: Benford’s Law and other Misconceptions in the Age of Data

“One of the pervasive risks that we face in the information age … is that even if the amount of knowledge in the world is increasing, the gap between what we know and what we think we know may be widening.” ― Nate Silver, FiveThirtyEight Editor-in-Chief

In the immediate wake of the November 2020 election, it was hard to find any part of the news that wasn’t discussing some new election fraud conspiracy theory. Even so, you might’ve missed the brief Twitter storm that centered on an odd statistical result known as Benford’s Law. Though this passing discussion of a century-old statistical observation seems rather isolated in today’s world, its implications point to a wider problem with how we, as everyday Americans, perceive data and statistics in the Information Age.

Benford’s Law (first noted by astronomer Simon Newcomb and later popularized by physicist Frank Benford) states that in a given numerical dataset (subject to appropriate constraints that will be detailed later), when looking at the leading digits of all the numbers, there will be more 1s than 2s, more 2s than 3s, and so on. Specifically, for a given digit d ∈ {1, …, 9}, the probability that it occurs as the leading digit is given by:

P(d) = log₁₀(1 + 1/d)
(For an in-depth dive into the math behind Benford’s Law, look here.) Benford’s Law pops up in a lot of seemingly random places, from atomic weights to stock prices to the populations of cities. Notably, Benford’s Law has been proposed as a way to fight tax fraud by comparing the distribution of leading digits in one’s tax filings to the expected Benford distribution (tax data is another instance of roughly Benfordian data). However, the recent stir about Benford’s Law comes from applying it to detect fraud in a dataset that is NOT Benfordian: precinct-level election results.
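The formula above is easy to check numerically. Here is a minimal sketch in Python (the function names are my own, not from any fraud-detection library) that computes the expected Benford probabilities and the observed leading-digit shares of a dataset:

```python
import math
from collections import Counter

def benford_expected(d):
    """Expected probability that digit d (1-9) leads a Benfordian number."""
    return math.log10(1 + 1 / d)

def leading_digit_shares(values):
    """Observed share of each leading digit 1-9 in a list of positive numbers."""
    first = [int(str(v).lstrip("0.")[0]) for v in values if v > 0]
    counts = Counter(first)
    return {d: counts.get(d, 0) / len(first) for d in range(1, 10)}

# The nine expected probabilities sum to 1, and a leading 1 should
# appear about 30.1% of the time.
print(round(benford_expected(1), 3))  # 0.301
```

Comparing the output of `leading_digit_shares` against `benford_expected` for each digit is, in essence, what a Benford-based fraud check does.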

Originally circulating on conservative websites such as Newsmax and the Gateway Pundit, and later picked up by right-wing podcast hosts, the theory held that President Biden’s win in key states like Michigan was fraudulent because Biden’s precinct-level vote counts showed a non-Benfordian distribution, whereas former President Trump’s counts followed the expected distribution. This claim was eventually debunked, and the reasoning as to why Biden’s non-Benfordian counts were nothing out of the ordinary can be summed up in two main points (remember the constraints mentioned earlier?):

  1. Benford’s Law applies only to datasets that span several orders of magnitude
  2. Benford’s Law applies only to datasets where each entry is independent of every other entry

Point 1 means that data following Benford’s Law generally range from values in the 10s or 100s up to the 1,000s or 10,000s and beyond (think of town population sizes in America, for example). Voting precincts inherently do not cover multiple orders of magnitude, since each precinct is sized similarly in order to make vote counting easier for local administrators. Point 2 means that, in general, the value of a given datum should not depend on any other value in the dataset (American tax reports, for example, are generally independent of one another). Here the precinct election data fails again, largely due to our two-party system.

Since third-party votes make up such a small percentage of the total that they’re typically statistically insignificant, finding the number of Democratic votes is (roughly) the same as taking the total number of votes and subtracting the Republican votes. Thus, the vote counts for Biden and Trump are inherently linked by this relationship, and their resulting distributions do not follow Benford’s Law. Republican media outlets were simply picking up on the fact that if Trump won only 10 to 30% of the vote in a precinct of, say, 100 people, then Biden would implicitly win 70 to 90% of it and thus look less Benfordian. As a side note, research has been done on the distribution of second digits as a marker for election fraud, but the results aren’t much better than chance.
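The first constraint can be illustrated with a quick simulation (a sketch only; the distributions and sample sizes below are illustrative choices of mine, not real election data). Values spread evenly across several orders of magnitude come out roughly Benfordian, while similar-sized, precinct-like counts do not:

```python
import random
from collections import Counter

def share_of_leading_ones(values):
    """Fraction of values whose decimal representation starts with 1."""
    counts = Counter(int(str(v)[0]) for v in values)
    return counts[1] / len(values)

random.seed(0)

# Spans orders of magnitude (10 to 100,000): roughly Benfordian,
# so about 30.1% of values should start with a 1.
wide = [int(10 ** random.uniform(1, 5)) for _ in range(100_000)]

# Precinct-like: counts clustered around ~600 votes, all the same
# order of magnitude, so leading 1s are rare.
narrow = [max(1, int(random.gauss(600, 120))) for _ in range(100_000)]

print(round(share_of_leading_ones(wide), 3))
print(round(share_of_leading_ones(narrow), 3))
```

Running this, the wide-ranging data lands near the 30.1% Benford predicts for a leading 1, while the precinct-like data falls far short, no fraud required.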

So at the end of the day, another election conspiracy theory was debunked. But the mass proliferation of an incorrect and misleading statistical claim taps into the wider issue of how we interact with statistics in our everyday lives. Lying with statistics has always been an issue (many people don’t realize that the popular book How to Lie with Statistics was written in 1954), but the massive accessibility of information in the Digital Age, coupled with multiple highly visible scandals involving “big data”, has led to an explosion of distrust in statistics, even from reputable sources (a trend most salient among American conservatives). This unfortunate interplay between distrusting and directly misunderstanding statistical principles is becoming apparent in multiple aspects of everyday life. Consider the now-infamous graph made by Florida Republicans showing the number of gun deaths after the passage of the state’s Stand Your Ground law:

For most of us, this graph would seem to suggest that the law helped lower gun deaths, except that the y-axis is inverted. While it may seem a bit ridiculous now that this chart was ever published, the reality is that without a basic understanding of data visualization, many people were probably fooled by this misleading graphic. Further graphical misunderstandings surfaced when a group of researchers at Yale Law found that people routinely failed to understand logarithmic graphs about COVID-19, even though some neuroscientists have found that people tend to think in logarithmic terms anyway. From misinformation about how COVID-19 data is reported to errors in reasoning about probabilistic poll results, these issues are becoming more pervasive as the world becomes more and more interconnected.

One could argue that, compared to the mass radicalization and misinformation spreading across the country, a few misleading results are not the biggest issue. But after the events of the January 6 insurrection, and the 554,000 COVID-19 deaths as of this writing, it would be irresponsible to understate the effects that fraudulent data and statistics have on real human lives. After all, what is more convincing to a conspiracy theorist than “hard” evidence that their beliefs are real? As a society, we must invest in improving the statistical literacy of our people, so that we will have a fighting chance in the struggles with misinformation that are already at our doorstep.

Zachary Novak is a junior studying Statistics and Machine Learning with a minor in math and music technology. Academically, he’s interested in issues regarding algorithmic bias, fair ML, and network analysis in fringe communities. In his free time, he likes to cook and bake bread.

The Triple Helix at Carnegie Mellon University promotes the interdisciplinary nature of public policy, science, technology, and society.