# Exploring P-values with Simulations in R

The recent flare-up in discussions on p-values inspired me to conduct a brief simulation study.

In particular, I wanted to illustrate just how p-values vary with different effect sizes and sample sizes.
Here are the details of the simulation. I simulated $n$ draws of my independent variable $X$:

$X_n \sim N(100, 400)$

where

$n \in \{5,6,...,25\}$

For each $X_n$, I define a $Y_n$ as

$Y_n := 10+\beta X_n +\epsilon$

where

$\epsilon \sim N(0,1)$
$\beta \in \{.05, .06,..., .25 \}$

In other words, for each effect size, $\beta$, the simulation draws $X$ and $Y$ with some error $\epsilon$. The following regression model is estimated and the p-value of $\beta$ is observed.

$Y_n = \beta_0 + \beta X_n$

The drawing and the regression are done 1,000 times, so that each effect size – sample size combination yields 1,000 p-values. The average of these 1,000 p-values for each combination is plotted below.
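The simulation can be sketched in a few lines of Python (the original code is in R and lives on GitHub; the function name and grid here are illustrative):

```python
import numpy as np
from scipy.stats import linregress

def mean_pvalue(beta, n, reps=1000, seed=0):
    """Average p-value of the estimated slope over `reps` simulated regressions."""
    rng = np.random.default_rng(seed)
    pvals = []
    for _ in range(reps):
        x = rng.normal(100, 20, size=n)        # X ~ N(100, 400), so sd = 20
        eps = rng.normal(0, 1, size=n)         # eps ~ N(0, 1)
        y = 10 + beta * x + eps                # Y = 10 + beta*X + eps
        pvals.append(linregress(x, y).pvalue)  # p-value of the slope estimate
    return float(np.mean(pvals))

# sweep part of the same grid as in the post
for beta in (0.05, 0.15, 0.25):
    for n in (5, 15, 25):
        print(beta, n, mean_pvalue(beta, n))
```

With a large effect (.25) the average p-value is already tiny at small n, while a small effect (.05) needs a larger sample before the average p-value drops below .05.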

Note that these results are for a fixed $var(\epsilon)=1$. Higher sampling error would typically shift these curves upward, meaning that for a given effect size, the same sample would yield a weaker signal.

There are many take-aways from this plot.

First, for a given sample size, larger effect sizes are “detected” more easily. By detected, I mean found to be statistically significant using the .05 threshold. It’s possible to detect larger effect sizes (e.g. .25) with relatively low sample sizes (in this case <10). By contrast, if the effect size is small (e.g. .05), then a larger sample is needed to detect the effect (>10).

Second, this figure illustrates an oft-heard warning about p-values: always interpret them within the context of sample size. Lack of statistical significance does not imply lack of an effect. An effect may exist, but the sample size may be insufficient to detect it (or the variability in the data set is too high). On the other hand, just because a p-value signals statistical significance does not mean that the effect is actually meaningful. Consider an effect size of .00000001 (effectively 0). According to the chart, even the p-value of this effect size tends to 0 as the sample size increases, eventually crossing the statistical significance threshold.

Code is available on GitHub.

# Finding and Plotting Lorenz Solution using MATLAB

I use MATLAB to solve the following Lorenz initial value problem:
$\begin{cases} x'=-10(x-y) \\ y'=-x(z-28)-y \\ z'=xy-\frac{8}{3}z \\ x(0)=y(0)=z(0)=5 \end{cases}$

I wrote a function, LorenzRK4IVP(), that takes the system of three differential equations as input and solves the system using the Runge-Kutta method with step size $h=.01$. I plot the strange attractor as well as use MATLAB to produce a GIF of the solution.
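A minimal Python sketch of the same classical fourth-order Runge-Kutta scheme with step size $h=.01$ (the original solver is MATLAB; the code below is illustrative, not LorenzRK4IVP() itself):

```python
import numpy as np

def lorenz(state):
    """Right-hand side of the standard Lorenz system (sigma=10, rho=28, beta=8/3)."""
    x, y, z = state
    return np.array([-10 * (x - y),
                     -x * (z - 28) - y,
                     x * y - (8 / 3) * z])

def rk4(f, state0, h=0.01, steps=5000):
    """Classical 4th-order Runge-Kutta integration of x' = f(x)."""
    traj = [np.asarray(state0, dtype=float)]
    for _ in range(steps):
        s = traj[-1]
        k1 = f(s)
        k2 = f(s + h / 2 * k1)
        k3 = f(s + h / 2 * k2)
        k4 = f(s + h * k3)
        traj.append(s + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.array(traj)

traj = rk4(lorenz, [5, 5, 5])  # x(0) = y(0) = z(0) = 5
```

Plotting the three columns of `traj` against each other reproduces the familiar butterfly-shaped strange attractor.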

# Iterative OLS Regression Using Gauss-Seidel

I just finished covering a few numerical techniques for solving systems of equations, which can be applied to find best-fit lines through a given set of data points.

The four points $\{(0,0), (1,3), (2,3), (5,6)\}$ are arranged into an inconsistent system of four equations and two unknowns:

$b+a(0) = 0 \\ b+a(1) = 3 \\ b+a(2) = 3 \\ b+a(5) = 6$

The system can be represented in matrix form:

$\begin{bmatrix} 1 & 0 \\ 1 & 1 \\ 1 & 2 \\ 1 & 5 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 3 \\ 6 \end{bmatrix}$

The least-squares solution vector can be found by solving the normal equations

$A^TA\vec{x_*}=A^T\vec{b}$

Computing $A^TA$ and $A^T \vec{b}$ yields the following matrix system:

$\begin{bmatrix} 4 & 8 \\ 8 & 30 \end{bmatrix} \begin{bmatrix} b \\ a \end{bmatrix} = \begin{bmatrix} 12 \\ 39 \end{bmatrix}$
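The normal-equation arithmetic can be checked numerically; a quick verification in Python (not part of the original R post):

```python
import numpy as np

A = np.array([[1, 0], [1, 1], [1, 2], [1, 5]])  # design matrix from the four points
b = np.array([0, 3, 3, 6])                      # observed y-values

ATA = A.T @ A   # [[4, 8], [8, 30]]
ATb = A.T @ b   # [12, 39]

# exact least-squares solution: intercept b, slope a
x_star = np.linalg.solve(ATA, ATb)  # approximately [0.8571, 1.0714]
```

The exact solution is $b = 12/14 \approx .8571$ and $a = 15/14 \approx 1.0714$, which is the target that the Gauss-Seidel iterations below converge to.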

The system can be solved using the Gauss-Seidel method.

Below, I run 10 iterations of Gauss-Seidel (visualized in the figure above). The estimated line gets successively closer to the true solution in green. The estimates are shown in blue – each iteration is drawn in a darker shade than the previous one (see highlighted lines).

library(MASS) # package needed for ginv() function

A=cbind(c(4,8),c(8,30)) # coefficient matrix
b<-t(t(c(12,39))) # b-vector
x0<-t(t(c(0,0)))  # starting vector
iter<-10          # number of Gauss-Seidel iterations to run

L<-lower.tri(A)*A  # lower triang. A
U<-upper.tri(A)*A  # upper triang. A
D<-diag(diag(A))   # diag of A

# plot points
xc<-c(0,1,2,5)
yc<-c(0,3,3,6)
plot(xc,yc,
col='red',
xlab="X-Values",
ylab="Y-Values",
bty='n')
title(main="OLS Estimates")
legend("bottomright",
c("Data","Estimates","True Line"),
lty=c(0,1,1),
pch=c(1,NA,NA),
col=c("red","blue","green"),
bty='n')

# create color palette - lines will get darker with each iter
pal<-colorRampPalette(c("#f2f2f2", "Blue"))
colors<-pal(iter) # creates color palette of length(iter)

# plot true line: y = .8571 + 1.0714x
abline(a=.8571,b=1.0714,col='green')

n<-1
while(n<=iter){
print(x0)

# Gauss-Seidel update: x1 = (L+D)^-1 (b - U x0)
x1<-ginv(L+D)%*%(b-U%*%x0)
x0<-x1

# plot estimated line
abline(a=as.numeric(x0[1,1]), # y-intercept of estimated line
b=as.numeric(x0[2,1]),        # slope of estimated line
col=colors[n])                # nth color - darker with each iteration

n<-n+1
}


# Unstable Market

The necessary and sufficient condition for convergence is that the slope of the supply curve be greater than the absolute value of the slope of the demand curve. If the slope of the supply curve is smaller, then price and quantity diverge from equilibrium over time.
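This cobweb-style price dynamic can be illustrated with a short Python sketch; the linear curves, intercepts, and starting price below are my own hypothetical example, with slopes measured as price per unit quantity:

```python
def cobweb(demand_slope, supply_slope, p0=10.0, steps=20,
           demand_intercept=100.0, supply_intercept=0.0):
    """Iterate the cobweb dynamic: producers supply based on last period's
    price, and the demand curve then sets the market-clearing price."""
    prices = [p0]
    for _ in range(steps):
        q = (prices[-1] - supply_intercept) / supply_slope   # quantity supplied at last price
        prices.append(demand_intercept - demand_slope * q)   # price that clears that quantity
    return prices

stable = cobweb(demand_slope=1.0, supply_slope=2.0)    # supply steeper: converges
unstable = cobweb(demand_slope=2.0, supply_slope=1.0)  # supply flatter: diverges
```

In the first case the price oscillations shrink toward equilibrium; in the second they grow without bound, matching the condition above.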

# Monte Carlo Simulation of Pi

Partially out of boredom and partially because I was inspired by the movie title “Life of Pi”, I decided to build a Monte Carlo simulator that could approximate the value of pi.

Monte Carlo simulations are used in everything from derivative pricing to biology (and, in this case, boredom alleviation). Basically, they are good for problems that are hard to solve analytically.

The simulator throws a random point into a 2×2 square that circumscribes a circle of radius 1; that is one trial. It repeats this (in this case) for 6,000 trials, so the square contains 6,000 points and the circle contains some smaller number of them.

Once the simulation has done a really good job of filling in the square and the circle, the ratio of points inside the circle to total points approximates the ratio of the areas: areaCir/areaSq.

Since areaSq is 4 for a 2×2 square, multiplying that ratio by 4 cancels out the denominator and leaves areaCir. And since areaCir=pi*radius^2 with radius=1, areaCir=pi for our circle.

The more trials we conduct, the more filled our circle and square become and, thus, the more accurate our approximation of pi. Here is the result:

It’s clear that as trials → ∞, the error → 0% and the approximation converges to the true value of pi.
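The estimator itself fits in a few lines; here is an illustrative Python version (the original simulator was built in VBA, so the names below are my own):

```python
import random

def approx_pi(trials, seed=42):
    """Estimate pi from the fraction of random points in a 2x2 square
    that land inside the inscribed circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(trials):
        x = rng.uniform(-1, 1)
        y = rng.uniform(-1, 1)
        if x * x + y * y <= 1:    # point falls inside the circle
            inside += 1
    return 4 * inside / trials    # areaSq = 4, so ratio * 4 ~ areaCir = pi

print(approx_pi(6000))
```

The standard error of the estimate shrinks like 1/sqrt(trials), which is why the error column in the table above falls off slowly as trials grow.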

Doing more than 6,000 trials really slows down processing unless the simulation is done in code. So, I built a Monte Carlo simulator using VBA.

I ran 50,000,000 trials. It took 34.2187 seconds and approximated pi ≈ 3.14155704; pi is actually 3.14159265, for an error of about .001%.