The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.
In particularly, I wanted to illustrate just how p-values vary with different effect and sample sizes.
Here are the details of the simulation. I simulated draws of my independent variable :
For each , I define a as
In other words, for each effect size, , the simulation draws and with some error . The following regression model is estimated and the p-value of is observed.
The drawing and the regression is done 1,000 times so that for each effect size – sample size combination, the simulation yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.
Note, these results are for a fixed . Higher sampling error would typically shift these curves upward, meaning that for each effect size, the same sample would yield a lower signal.
There are many take-aways from this plot.
First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).
Second, this figure illustrates an oft-heard warning about p-values: always interpret them within the context of sample size. Lack of statistical significance does not imply lack of an effect. An effect may exist, but the sample size may be insufficient to detect it (or the variability in the data set is too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the p-value of this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.
Code is available on GitHub.