Some thoughts on the New Significance

The busy corridors of psychology twitter have been abuzz over a new paper, written by some heavy-hitters in the field, arguing for a new threshold for statistical significance. Now, instead of \(p < .05\), the threshold would be \(p < .005\). If you’ve not read the paper, I recommend doing so; it’s short and won’t take too much time.

The argument isn’t new; Johnson made a similar case in a 2013 paper. But it has made quite a splash this time.

As a way of clarifying my own thoughts on this, I thought I’d work through the main discussion points around this proposal that I’ve encountered on Twitter. This isn’t comprehensive by a long shot. People had a lot to say about this. But these were the most common arguments.

1. This won’t fix NHST

This is probably the most common point. It’s true, of course–all the same issues with NHST (null-hypothesis significance testing), even when it is properly used, don’t disappear at \(p < .005\). People opposed to NHST in general won’t ever be pleased by changes like this; generally they want to see a move to estimation (effect sizes and confidence/credible intervals) and/or Bayesian methods. So do many of the authors on this paper, in fact.

I’m a fan of the effect size/estimation approach to reporting, and I’d like to see it used more widely (or at least always alongside p-values, if we’re starting gradually). On the other hand, change is slow; a lot of the issues we’re discussing now were being discussed decades ago, to little effect. In the meantime, it might not be an outrageous idea to be more stringent. \(p < .005\) is much harder to p-hack than \(p < .05\), and as the authors point out in the paper, it drops the “false discovery rate” across the board for all levels of power.
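To make that false-discovery-rate point a bit more concrete, here’s a minimal sketch of the calculation: of all the results that cross the threshold, what fraction are false positives? The prior odds below (1 real effect for every 10 nulls) are my own illustrative assumption, not a number taken from the paper.

# Sketch of a false discovery rate calculation: among "significant" results,
# what fraction come from true nulls? prior_real (odds of 1:10 that a tested
# effect is real) is an illustrative assumption.
false_discovery_rate <- function(alpha, power, prior_real = 1/11) {
  false_pos <- alpha * (1 - prior_real)  # true nulls that cross the threshold
  true_pos  <- power * prior_real        # real effects that cross the threshold
  false_pos / (false_pos + true_pos)
}

round(false_discovery_rate(alpha = .05,  power = seq(.2, .8, .2)), 2)
round(false_discovery_rate(alpha = .005, power = seq(.2, .8, .2)), 2)

Whatever prior odds you plug in, lowering \(\alpha\) lowers this number at every level of power, which is the “across the board” part of the argument.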

If you don’t like p-values and would like to see them discontinued, then proposals like this are going to look like so much wasted effort. But there is something to be said for accepting that the system can’t be changed overnight and looking at what changes we can make in the meantime.

2. This encourages dichotomous thinking

Related to #1, one argument against the proposal that I’ve seen articulated by a few different people is that it doesn’t fix the issue of the “significant-nonsignificant” dichotomy. This point is hard to argue with. The authors suggest treating \(p < .05\) as “suggestive,” not unlike the existing tendency to call \(p < .1\) “trending towards significance.” But, as some of these folks have pointed out, that’s still entirely based on these thresholds. It’s essentially letting the p-value do the thinking for you, rather than demanding more difficult, nuanced interpretation.

I think there’s definitely something to this point. When these thresholds become really important, either because they are de facto required for publication or required to avoid having others dismiss your work, results can get distilled down to the p-value and nothing more. For instance, an effect that is small, highly variable, and difficult to replicate even within the same lab is held up as “real” because it meets the threshold; the sixth of seven unplanned tests limbos under the cutoff and gets latched onto as the “important” result; and so on. The p-value becomes more important than whether or not the effect was predicted, how big it is, or how reliable it is.

I think this concern is reasonable. But, related to #1, change is gradual, and p-values are going to be here for a while yet (and may be here forever). So, is there any harm in a lower significance threshold? Sure, it might not help in terms of the philosophy of the p-value and NHST, but if these are your primary concerns, it can’t hurt, either. Unless you view it as a waste of time and resources (see #3).

3. Change the culture, not the arbitrary cutoffs

Why not both? This is somewhat related to #1 and #2. The argument goes that we should be spending time and effort pushing for systematic cultural change. We definitely should! Things would be better if “open science practices” were just “science practices.” But I don’t think this is zero-sum. I think we can keep trying to change the culture and change the arbitrary cutoffs at the same time, particularly if you think that changing the cutoffs could be a force for good.

4. This will throw the Type I to Type II error ratio out of whack

A Type I error is when you reject the null hypothesis when in reality it should have been retained; a Type II error is when you retain the null when you should have rejected it. Typically, part of NHST is striking a balance between these two errors. The alpha level (significance threshold) is the rate of Type I errors we are willing to accept. To put things very simply, the Type II error rate is controlled by power: how often will you detect a given effect size with your chosen sample size?

At the traditional values of \(\alpha = .05\) and 80% power, the ratio of Type I to Type II errors is 1:4. Under the new threshold, this ratio would be 1:40, which for some people is far too skewed.
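To spell out the arithmetic: at 80% power the Type II error rate is \(\beta = .20\), so the ratios above are just

\[
\frac{\alpha}{\beta} = \frac{.05}{.20} = \frac{1}{4}
\qquad \text{versus} \qquad
\frac{.005}{.20} = \frac{1}{40}.
\]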

The authors actually address this point in the paper, and Simine Vazire covers it on her blog. The short answer is that, right now, with p-hacking, multiple testing, and a lack of preregistration, our alpha level isn’t actually what it seems. Power, too, is often much lower than the typically-recommended 80% level, and hasn’t improved much since Cohen first wrote about it back in 1962. In practice, this means that the error ratio isn’t 1:4. As Simine points out, the 1:40 ratio probably would be too strict if preregistration and other changes were systematic. We’ve got a ways to go before we need to worry about that, though, since we’ve only just started to publicize and prevent the effects of HARKing, QRPs, and so on.
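One back-of-the-envelope way to see the “alpha isn’t what it seems” point (my own illustration, not a calculation from the paper or from Simine’s post): with several independent looks at a true null at \(\alpha = .05\), the chance of at least one false positive climbs quickly.

# Chance of at least one false positive across k independent tests of true
# nulls at alpha = .05 (a simplified stand-in for multiple testing / p-hacking)
k <- 1:10
data.frame(tests = k, chance_of_false_positive = round(1 - (1 - .05)^k, 2))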

5. This will make publication harder / What about expensive data and small populations of interest?

The authors stress that this threshold should not be used as a publication criterion. This is important, because using these cutoffs as requirements for publication inflates the effect size estimates in the literature; you only get unbiased estimates if there is no publication bias. This doesn’t mean it wouldn’t be adopted as a requirement, of course, but the authors are advocating this as a standard-of-evidence metric, not a publication requirement. (Insert the line about the paving material on the road to hell here.)
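The inflation mechanism is easy to see in a quick simulation. This is just a sketch: the true effect of \(d = 0.3\) and the 20-per-group samples are illustrative assumptions on my part, not numbers from the paper.

# Simulate many underpowered studies of a true effect of d = 0.3, then compare
# the average effect size estimate across all studies to the average among
# only the "significant" ones (i.e., what a significance filter would publish).
set.seed(1)
one_study <- function(n = 20, d = .3) {
  g1 <- rnorm(n, d, 1)
  g2 <- rnorm(n, 0, 1)
  pooled_sd <- sqrt((var(g1) + var(g2)) / 2)
  c('p' = t.test(g1, g2)$p.value, 'd_hat' = (mean(g1) - mean(g2)) / pooled_sd)
}
studies <- as.data.frame(t(replicate(5000, one_study())))
mean(studies$d_hat)                    # all studies: close to the true 0.3
mean(studies$d_hat[studies$p < .05])   # significant studies only: inflated

Conditioning on significance selects the studies that happened to overestimate the effect, which is why requiring any threshold for publication, whether .05 or .005, biases the published estimates upward.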

As for the second point, this concern was also true at \(p < .05\). I’ve seen it brought up in response to other proposed changes to the field, too, like encouraging more power. I don’t have much perspective to offer here, because I work with undergrads and MTurkers as subjects and have little difficulty getting enough people. There’s a little thread here about this sort of thing by people smarter than I am.

What does this actually look like?

I’ve done some simulations looking at p-hacking, etc. with the \(p < .05\) significance level. It’s helped me develop some intuitions, but I’m curious about the new significance level. I’d like to get to know it a little better.

So, I’m going to do some simulating. First, I want to look at how many significant results you’ll get at various effect sizes at various sample sizes. I’m going to simulate the plain old t-test because, hey, it’s just a first date.

I want to see how the “hit” rate (the chance of getting a p-value below the significance threshold when there is a real effect, a.k.a. power) varies with sample size and effect size under the two cutoffs.

suppressPackageStartupMessages({
  library(data.table)
  library(dplyr)
  library(ggplot2)
  library(tidyr)
  library(viridis)
})

# Run one two-sample t-test with n subjects per group and a true effect size
# of es (standardized mean difference); return the p-value and the parameters.
sim.es <- function(n, es) {
  g1 <- rnorm(n, es, 1)
  g2 <- rnorm(n, 0, 1)
  c('p' = t.test(g1, g2, paired = FALSE)$p.value, 'n' = n, 'es' = es)
}

# Every combination of sample size and effect size, 1000 simulations apiece
params <- expand.grid(list('n' = c(20, 50, 100, 500), 'es' = seq(0, 1, .1)))
simulation_results <- rbindlist(mapply(
  function(n, es) as.data.frame(t(replicate(1000, sim.es(n, es)))),
  n = params$n, es = params$es, SIMPLIFY = FALSE))

# Proportion of simulated tests under each threshold, by n and effect size
sim_graph <- ggplot(data = simulation_results %>%
                      group_by(n, es) %>%
                      summarize(sig_results_new = sum(p < .005) / 1000,
                                sig_results_old = sum(p < .05) / 1000) %>%
                      gather(group, sig_pct, sig_results_old, sig_results_new)) +
  geom_line(aes(x = es, y = sig_pct, lty = group, color = as.factor(n))) +
  scale_color_viridis(discrete = TRUE) +
  scale_linetype_discrete(labels = c('p < .005', 'p < .05')) +
  labs(color = 'N per cell', lty = 'Alpha level') +
  xlab('Effect Size') +
  ylab('Proportion of significant results') +
  theme_minimal()
sim_graph

So, yeah. Even at a big effect size, you’re not getting away with 20 per cell under the .005 threshold. Even 50 per cell is pretty lean unless you’re hunting a big effect.
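As a sanity check on those curves, you can also get the power analytically from base R’s power.t.test(); the \(d = 0.8\) below is Cohen’s conventional “large” effect, near the top of the simulated range.

# Analytic power for a two-sample t-test with 20 per cell and a large effect,
# under the old and new thresholds; these should roughly match the simulated
# curves at es = 0.8.
power.t.test(n = 20, delta = 0.8, sd = 1, sig.level = .05)$power
power.t.test(n = 20, delta = 0.8, sd = 1, sig.level = .005)$power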

So…

If nothing else, you’ve got to hand it to this paper for sparking a diverse and extensive discussion about where the field is now, and how best to encourage the changes that will make our science more robust.

I know you’re waiting with bated breath to hear my personal opinion on this. Call me an ideological slouch, but after reading the paper, and reading the arguments for and against, I remain solidly middle-ground. I think it’s true that changing this threshold wouldn’t change much; it doesn’t fix NHST or dichotomous, significance-based thinking. Many of the old problems would persist under the new world order. On the other hand, I don’t think this proposal is actively harmful (although others would vehemently disagree with me on that point), and I think it would even help a bit in terms of going to the mat for more power and higher standards of evidence.

This is definitely not the Thing That Will Save Psych Forever, and the authors don’t claim that it is. But I think it’s an interesting idea that has potential benefits, and merits consideration. It should also make us think about the strength of evidence we want to see, and how we should go about trying to effect that change in the field. Requiring a stricter cutoff is certainly a place to start.

If you’re interested in following this discussion, there’s a lot of it on Twitter. People are also getting into the mathematics behind the Bayesian perspective on p-values, prior probabilities, and so on; it’s worth a look if you’re interested. If this paper accomplishes nothing else, it’s sparked some interesting discussions.
