The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particular, I wanted to illustrate just how p-values vary with different effect and sample sizes.
Here are the details of the simulation. I simulated draws of my independent variable x:

x_i ~ N(0, 1), i = 1, …, n

where n is the sample size. For each x_i, I define a y_i as

y_i = β·x_i + ε_i

where ε_i ~ N(0, σ²). In other words, for each effect size β, the simulation draws x and y with some error ε. The following regression model is estimated and the p-value of the estimate of β is observed:

y_i = α + β·x_i + ε_i

The drawing and the regression are repeated 1,000 times, so that each effect size – sample size combination yields 1,000 p-values. The average of these 1,000 p-values for each effect size and sample size combination is plotted below.
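The original simulation code isn't shown; a minimal sketch of this loop in Python, assuming standard-normal draws for x and the error and using scipy for the regression, might look like:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)

def average_p_value(effect_size, n, sigma=1.0, reps=1000):
    """Average p-value of the slope over `reps` simulated regressions."""
    pvals = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)                                  # independent variable
        y = effect_size * x + rng.normal(scale=sigma, size=n)   # outcome with noise
        pvals[r] = linregress(x, y).pvalue                      # p-value for the slope
    return pvals.mean()

# One curve per effect size, over a grid of sample sizes:
for beta in (0.05, 0.25):
    curve = [average_p_value(beta, n, reps=200) for n in (10, 50, 200)]
    print(beta, [round(p, 3) for p in curve])
```

`linregress` reports the two-sided p-value for the null hypothesis that the slope is zero, which is the quantity being averaged here.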

Note that these results are for a fixed error standard deviation σ. Higher sampling error (a larger σ) would shift these curves upward, meaning that for a given effect size, the same sample size would yield a weaker signal, i.e., a higher average p-value.

There are several takeaways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By “detected,” I mean found to be statistically significant at the .05 threshold. Larger effect sizes (e.g., .25) can be detected with relatively small samples (in this case, fewer than 10 observations). By contrast, a small effect size (e.g., .05) requires a larger sample to detect (more than 10).
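Detection can also be framed as an empirical power calculation: the fraction of simulated runs in which the slope's p-value falls below .05. A small sketch, assuming standard-normal x and errors (the exact crossover sample sizes depend on σ, so they won't match the figure above exactly):

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)

def power(effect_size, n, sigma=1.0, reps=1000, alpha=0.05):
    """Fraction of simulations in which the slope is significant at `alpha`."""
    hits = 0
    for _ in range(reps):
        x = rng.normal(size=n)
        y = effect_size * x + rng.normal(scale=sigma, size=n)
        if linregress(x, y).pvalue < alpha:
            hits += 1
    return hits / reps

# At the same sample size, the larger effect is detected far more often:
print(power(0.25, n=150, reps=500))
print(power(0.05, n=150, reps=500))
```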

Second, this figure illustrates an oft-heard warning about p-values: always interpret them in the context of sample size. A lack of statistical significance does not imply the absence of an effect. An effect may exist, but the sample may be too small (or the variability in the data too high) to detect it. Conversely, a statistically significant p-value does not mean the effect is practically meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the average p-value for this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.
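That last point can be demonstrated directly. An effect as small as 10⁻⁸ would require an astronomically large sample, so this sketch uses a still-negligible effect of .01 (a hypothetical value chosen for tractability, not one from the post) with a million observations:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(2)

beta = 0.01       # a practically negligible effect size
n = 1_000_000     # ...made "statistically significant" by sheer sample size

x = rng.normal(size=n)
y = beta * x + rng.normal(size=n)
res = linregress(x, y)
print(res.slope, res.pvalue)   # tiny slope, yet p-value far below .05
```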

Hi. You should consider also plotting the average effect size of those estimates that pass a given p-value threshold. It will show another central issue regarding power and p-values, namely that significant effects of underpowered studies have inflated effect size estimates. See e.g. http://pilab.psy.utexas.edu/publications/Yarkoni_PPS_2009.pdf

Yes, the p-value has a distribution which depends on the sample size. Seems to me the biggest problem is not whether one uses them or eschews them in favor of a Bayesian approach. The problem lies in the definitions of the regressed variables themselves. In the physical sciences the use of regression is usually not problematic. But as one moves out from them towards the so-called behavioral sciences, the definitions of the quantities become mushy and the causal relationship between them more questionable.
