Statistics is Hard, Even for Scientists

As a burgeoning data scientist I have taken a more professional interest in statistics, which in this day and age means statistics blogs! In particular I really enjoy the prolific writings of Andrew Gelman. This guy is super organized, planning out his posts for an entire month. It’s either amazing or indicative of a psychological disorder or maybe both. Anyway, I have learned that while I “know” elementary statistics, the implications of even the basics were never imparted to me and are not easy to suss out on one’s own.

For instance, the elementary result that, under certain very common conditions, a statistically significant result systematically overestimates the magnitude of an effect was mind-blowing to me. The reason is selection: a noisy estimate only clears the significance threshold if it happens to come out large, so the estimates that survive the filter are biased upward. This becomes even more pronounced in the often underpowered studies endemic to the social sciences. I would say it is not entirely unfair to take a good 1/3 off any effect size in the social sciences as a rule of thumb; it’s really that enormous. This actually has implications in physics too. For instance, if we report that our entanglement measurement is statistically different from a situation with no entanglement, it is likely we are overestimating the quality of our entanglement, especially if we employ a bit of post-processing on the data. It really makes you discount every result in the scientific literature a bit.
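To convince myself of the overestimation effect, here is a minimal simulation sketch. The setup is my own assumption, not anything from Gelman's posts: every study measures the same true effect with normally distributed noise, and a study counts as "significant" when its estimate is more than 1.96 standard errors from zero. The function name and the numbers are purely illustrative.

```python
import random
import statistics


def exaggeration_ratio(true_effect, std_error, n_studies=50_000, seed=0):
    """Average |estimate| among significant studies, divided by the true effect."""
    rng = random.Random(seed)
    threshold = 1.96 * std_error  # two-sided 5% cutoff for a normal estimate
    significant = []
    for _ in range(n_studies):
        # Each simulated study reports an unbiased but noisy estimate.
        estimate = rng.gauss(true_effect, std_error)
        if abs(estimate) > threshold:
            significant.append(abs(estimate))
    return statistics.mean(significant) / true_effect


if __name__ == "__main__":
    # Hypothetical true effect of 1.0 in arbitrary units. A larger standard
    # error means a lower-powered study.
    for std_error in (0.25, 0.5, 1.0, 2.0):
        ratio = exaggeration_ratio(1.0, std_error)
        print(f"std_error={std_error}: significant results overstate by {ratio:.2f}x")
```

In this toy setup the well-powered studies are barely inflated, while the studies whose standard error is as large as the true effect overstate it by more than a factor of two once you only look at the significant ones.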

More fundamentally, statistical significance itself is kind of a red herring when it comes to significance. It is obvious, but not always appreciated, that a result being statistically significant does not mean it is significant in any practical sense. The classic examples are Facebook studies that take a sample of a million to tease out some trivial effect. Who really cares if sad Facebook posts adjust your mood by 1% if this is completely swamped by everything else in your life? On the other hand, statistical significance has become a signifier of “publishable” results without any further validation of the results. There are many examples of this effect; Gelman really likes a study that published an 8% effect size for how attractiveness influences the likelihood you will bear girls. This is an order of magnitude higher than pretty much any other influence on girl/boy sex ratios and furthermore lacks a credible explanation. Yet it got published in a prestigious journal just because it was statistically significant. It was quickly pointed out how ludicrous this was on its face, along with the bad statistics that went into the paper.
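The million-sample point is easy to check with a back-of-the-envelope calculation. The sketch below assumes a simple two-group z-test with unit variance, and it treats the "trivial effect" as a shift of 0.01 standard deviations. Both are my own stand-ins for illustration, not the design of any actual Facebook study.

```python
import math


def two_sided_p_value(effect_in_sd, n_per_group):
    """p-value of a z-test comparing two group means with unit variance."""
    std_error = math.sqrt(2.0 / n_per_group)  # SE of a difference in means
    z = effect_in_sd / std_error
    # For a standard normal, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


if __name__ == "__main__":
    tiny_effect = 0.01  # hypothetical: one hundredth of a standard deviation
    for n in (1_000, 100_000, 1_000_000):
        print(f"n per group = {n:>9,}: p = {two_sided_p_value(tiny_effect, n):.3g}")
```

The same negligible effect goes from nowhere near significant at a thousand people per group to overwhelmingly significant at a million, which says something about the sample size and nothing about whether the effect matters.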

Connected to this is the idea that Bayesian statistics is superior to frequentist statistics in almost all cases, and that is almost solely due to informed priors. Essentially, a lot of scientific research assumes flat priors, that is, that every possibility is equally likely. Researchers publish confidence intervals under this assumption and then misinterpret what a confidence interval means. Just for precision, an X% confidence interval does not indicate that the interval has an X% chance of containing the true value. After all, there are multiple ways to construct confidence intervals that give different results. Furthermore, you can often construct intervals that verifiably do or do not contain the true value. Instead, the guarantee belongs to the procedure: a particular confidence interval algorithm will, on average, contain the true value X% of the time. That is, if I sample the same population multiple times and use the same instructions for the confidence interval for each sample, then these myriad intervals I calculate will contain the true value X% of the time.
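The repeated-sampling reading of a confidence interval is simple to see in a simulation. This sketch assumes a normal population with a known standard deviation, so the textbook interval (sample mean plus or minus 1.96 standard errors) applies. The population parameters are made up for the example.

```python
import math
import random


def coverage(true_mean=10.0, sigma=2.0, n=25, n_repeats=20_000, seed=1):
    """Fraction of repeated 95% intervals that contain the true mean."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / math.sqrt(n)  # 95% interval, sigma known
    hits = 0
    for _ in range(n_repeats):
        # Draw a fresh sample and build the interval with the same recipe.
        sample_mean = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / n_repeats


if __name__ == "__main__":
    print(f"fraction of intervals covering the true mean: {coverage():.3f}")
```

The fraction comes out close to 0.95, but that number describes the recipe over many repetitions. Any single interval either contains the true mean or it does not.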

However, that is not the truly egregious error here; rather, it is the assumption that the prior probability of a result is the same for all possible results. Do we really think that eating a beet before a run will improve performance by 50%? Of course not, and a proper analysis should heavily discount that possibility rather than giving it equal weight with a more likely effect size of a few percent. Unfortunately, scientists are lazy or ignorant or ambitious, and a proper Bayesian analysis with informed priors might reveal the weakness in their study.
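Here is a sketch of what an informed prior does to the beet example, using the standard normal-normal conjugate update (a precision-weighted average of the prior mean and the data). All the numbers are invented for illustration: a noisy study reporting a 50% improvement with a standard error of 20 points, and a prior saying such effects are centered on zero with a standard deviation of 3 points.

```python
import math


def posterior_normal(estimate, std_error, prior_mean, prior_sd):
    """Posterior mean and sd for a normal prior and a normal likelihood."""
    data_precision = 1.0 / std_error**2
    prior_precision = 1.0 / prior_sd**2
    post_precision = data_precision + prior_precision
    # The posterior mean is the precision-weighted average of data and prior.
    post_mean = (estimate * data_precision + prior_mean * prior_precision) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)


if __name__ == "__main__":
    # Hypothetical study: 50% improvement, standard error of 20 points.
    flat = posterior_normal(50.0, 20.0, prior_mean=0.0, prior_sd=1e6)
    informed = posterior_normal(50.0, 20.0, prior_mean=0.0, prior_sd=3.0)
    print(f"nearly flat prior: mean {flat[0]:.1f}, sd {flat[1]:.1f}")
    print(f"informed prior:    mean {informed[0]:.1f}, sd {informed[1]:.1f}")
```

With the nearly flat prior the posterior just hands back the raw 50% estimate. With the informed prior the same data only moves the estimate to about one percent, because the study is far too noisy to overturn what we already believe about effect sizes of this kind.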

Apart from statistical analysis, informed priors are just a good common-sense check on results. You see, scientists are implicit data filters, constantly looking for patterns. Some people have accused scientists of fishing for statistical significance: if you try 20 different tests, then odds are one of them will be statistically significant at the 5% level. It’s more likely, though, that scientists take data, see a pattern and then analyze based on that. In doing so, however, they are implicitly making multiple comparisons. With enough dimensions you can pretty much always find a correlation between two of them. Scientists aren’t doing this explicitly, but it doesn’t matter. Implicitly or explicitly, the multiple comparisons are very likely to find something of statistical significance. An informed prior is a very easy check on your results. Another Gelman example was a study that posited that the ovulation cycle in women affects their political views at a 20% level despite evidence that political views are remarkably stable. A look at the literature for typical effect sizes might have prompted the authors to reexamine how they arrived at their result. It reminds me of my teaching experience, where I often asked my students if their answer made any sense in the context of the problem. These kinds of intuitive checks are highly valuable, particularly when the math is highly complicated or unintuitive to you.
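Both multiple-comparison claims can be sketched in a few lines. The first function is the textbook chance of at least one false positive among independent tests. The second generates pure noise and counts how many pairs of columns look "significantly" correlated anyway. The sizes (20 variables, 30 observations) are arbitrary choices of mine, and the cutoff of 0.361 is the approximate two-sided 5% critical value of Pearson's r for 30 observations.

```python
import itertools
import math
import random


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious_pairs(n_variables=20, n_obs=30, r_critical=0.361, seed=2):
    """Count column pairs of pure noise whose |r| exceeds the cutoff."""
    rng = random.Random(seed)
    # Every column is independent noise, so any "finding" is spurious.
    columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(n_variables)]
    return sum(
        1
        for a, b in itertools.combinations(columns, 2)
        if abs(pearson_r(a, b)) > r_critical
    )


if __name__ == "__main__":
    print(f"P(at least one hit in 20 tests): {familywise_error_rate(20):.2f}")
    print(f"spurious 'significant' pairs out of 190: {count_spurious_pairs()}")
```

Twenty tests at the 5% level give roughly a 64% chance of at least one false positive, and twenty noise variables offer 190 pairs to compare, so a handful of "significant" correlations is the expected outcome even though nothing is going on.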

I conclude by saying that all of this was novel to me and that I don’t want to judge the many excellent scientists who fall into these statistical traps. I guess I am more chagrined that these issues are still relatively unknown and that it is still seen as acceptable to use the rather crude statistical analysis tools wielded by most scientists when their issues are well documented. I mean, if a journal wanted to, it could require Bayesian credible intervals instead of confidence intervals and require a defense of the prior used in the calculation. This would require more engagement with prior literature by authors and readers and provide a better estimate of true effect sizes. Gelman suggests avoiding the effect of multiple comparisons by replicating your experiment on new data; essentially, the first run is for generating a hypothesis, with flexibility in interpreting and analyzing the data, but the second should be a rigid replication to see if the effect is still apparent. Again, there are professional bodies that could take steps toward requiring such experimental procedures. As it is, a lot of social science research, which I always took with a grain of salt due to the many implausible results I read, is looking a bit farcical.