What is R?
To quote the R project website:
R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.
What does that mean?
Why are we using R?
Some OS-specific extras
To help smooth some software installation issues further down the road, please also do the following (depending on your OS):
Checklist
[check] Do you have the most recent version of R?
version$version.string
## [1] "R version 4.1.3 (2022-03-10)"
[check] Do you have the most recent version of RStudio? (The preview version is fine.)
RStudio.Version()$version
## Requires an interactive session but should return something like "[1] ‘1.2.5001’"
[check] Have you updated all of your R packages?
update.packages(ask = FALSE, checkBuilt = TRUE)
R is a powerful calculator and recognizes all of the standard arithmetic operators:
1+2 ## Addition
## [1] 3
6-7 ## Subtraction
## [1] -1
5/2 ## Division
## [1] 2.5
2^3 ## Exponentiation
## [1] 8
We can also invoke the modulo operators (integer division and remainder), which are very useful when dealing with time, for example.
100 %/% 60 ## How many whole hours in 100 minutes?
## [1] 1
100 %% 60 ## How many minutes are left over?
## [1] 40
R also comes equipped with a full set of logical operators (and Boolean functions), which follow standard programming conventions. For example:
1 > 2
## [1] FALSE
1 == 2
## [1] FALSE
1 > 2 | 0.5 ## The "|" stands for "or" (not a pipe a la the shell)
## [1] TRUE
1 > 2 & 0.5 ## The "&" stands for "and"
## [1] FALSE
isTRUE(1 < 2)
## [1] TRUE
4 %in% 1:10
## [1] TRUE
is.na(1:10)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# etc..
You can read more about these logical operators here and here.
In R, we can use either = or <- to handle assignment. The <- is really a < followed by a -.
Assignment with <-
<- is normally read aloud as “gets”. You can think of it as a (left-facing) arrow saying assign in this direction.
a <- 10 + 5
a
## [1] 15
Of course, an arrow can point in the other direction too (i.e. ->). So, the following code chunk is equivalent to the previous one, although it is used much less frequently.
10 + 5 -> a
Assignment with =
You can also use = for assignment.
b = 10 + 10 ## Note that the assigned object *must* be on the left with "=".
b
## [1] 20
More discussion about <- vs =: https://github.com/Robinlovelace/geocompr/issues/319#issuecomment-427376764 and https://www.separatinghyperplanes.com/2018/02/why-you-should-use-and-never.html.
Different types (or classes) of objects.
The c() function creates vectors. This is one of the objects we’ll use to store data.
myvec <- c(1, 2, 3)
print(myvec)
## [1] 1 2 3
Shortcut for consecutive numbers:
myvec <- 1:3
print(myvec)
## [1] 1 2 3
Basic algebraic operations all work entrywise on vectors.
myvec <- c(1, 3, 7)
myvec2 <- c(5, 14, 3)
myvec3 <- c(9, 4, 8)
myvec + myvec2
## [1] 6 17 10
myvec / myvec2
## [1] 0.2000000 0.2142857 2.3333333
myvec * (myvec2^2 + sqrt(myvec3))
## [1] 28.00000 594.00000 82.79899
So are the binary logical operations &, |, and !=.
# logical vectors
logi_1 = c(T,T,F)
logi_2 = c(F,T,T)
logi_12 = logi_1 & logi_2
print(logi_12)
## [1] FALSE TRUE FALSE
You can “slice” a vector (grab subsets of it) in several different ways. Vector selection is specified in square brackets, a[ ], using either a vector of positive integers or a logical vector.
myvec <- 7:20
myvec[8]
## [1] 14
myvec[2:5]
## [1] 8 9 10 11
# vector of booleans for whether each entry of myvec is greater than 13
indvec <- myvec > 13
indvec
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE
indvec2 <- myvec == 8
indvec2
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE
# only outputs entries of myvec for which indvec is true
myvec[indvec]
## [1] 14 15 16 17 18 19 20
# same thing but all in one line, without having to define indvec
myvec[myvec>13]
## [1] 14 15 16 17 18 19 20
Some useful vector functions:
length(myvec)
## [1] 14
mean(myvec)
## [1] 13.5
var(myvec)
## [1] 17.5
Matrices are just collections of several vectors of the same length.
myvec <- c(1, 3, 7)
myvec2 <- c(5, 14, 3)
myvec3 <- c(9, 4, 8)
# creates a matrix whose columns are the input vectors
mat_1 <- cbind(myvec, myvec2, myvec3)
print(mat_1)
## myvec myvec2 myvec3
## [1,] 1 5 9
## [2,] 3 14 4
## [3,] 7 3 8
# now they're rows instead
mat_2 <- rbind(myvec, myvec2, myvec3)
print(mat_2)
## [,1] [,2] [,3]
## myvec 1 3 7
## myvec2 5 14 3
## myvec3 9 4 8
# Define a matrix by its elements
mat_3 <- matrix(1:8, 2, 4)
print(mat_3)
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
mat_4 <- matrix(1:8, 2, 4, byrow = TRUE)
print(mat_4)
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
Matrix algebra works the same way as vector algebra: it’s all done entrywise with the * and + operators. If you want to do matrix multiplication, use %*%.
dim(mat_1)
## [1] 3 3
mat_1 * mat_2
## myvec myvec2 myvec3
## [1,] 1 15 63
## [2,] 15 196 12
## [3,] 63 12 64
mat_1 %*% mat_2 #Note that this differs from the elementwise product
## [,1] [,2] [,3]
## [1,] 107 109 94
## [2,] 109 221 95
## [3,] 94 95 122
Other frequently used matrix operations:
t(mat_1) # Transpose
## [,1] [,2] [,3]
## myvec 1 3 7
## myvec2 5 14 3
## myvec3 9 4 8
solve(mat_1) # Inverse
## [,1] [,2] [,3]
## myvec -0.146842878 0.01908957 0.155653451
## myvec2 -0.005873715 0.08076358 -0.033773862
## myvec3 0.130690162 -0.04698972 0.001468429
eigen(mat_1) # eigenvalues and eigenvectors
## eigen() decomposition
## $values
## [1] 18.699429 8.556685 -4.256114
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4631214 0.3400869 0.87156297
## [2,] -0.7270618 -0.6713411 -0.03609066
## [3,] -0.5068528 0.6585150 -0.48895343
For more operations, check out https://www.statmethods.net/advstats/matrix.html.
“Slicing” matrices:
mat_1[1, 1]
## myvec
## 1
mat_1[2, ]
## myvec myvec2 myvec3
## 3 14 4
A data.frame is a two-dimensional table that stores data, similar to a spreadsheet in Excel. A matrix is also a two-dimensional table, but it only accommodates one type of element. Real-world data can be a collection of integers, real numbers, characters, categorical variables, and so on. A data frame is the best way to organize data of mixed types in R.
df_1 <- data.frame(a = 1:2, b = 3:4)
print(df_1)
## a b
## 1 1 3
## 2 2 4
df_2 <- data.frame(name = c("Jack", "Rose"), score = c(100, 90))
print(df_2)
## name score
## 1 Jack 100
## 2 Rose 90
print(df_2[, 1])
## [1] "Jack" "Rose"
print(df_2$name)
## [1] "Jack" "Rose"
A vector only contains one type of element. A list is a basket for objects of various types. It can serve as a container when a procedure returns more than one useful object. For example, recall that when we invoke eigen, we are interested in both the eigenvalues and the eigenvectors, which are stored in $values and $vectors, respectively.
x_list <- list(a = 1:2, b = "hello world")
print(x_list)
## $a
## [1] 1 2
##
## $b
## [1] "hello world"
print(x_list[[1]]) # Different from vectors and matrices
## [1] 1 2
print(x_list$a)
## [1] 1 2
You do things using functions. Functions come pre-written in packages (i.e. “libraries”), although you can, and should, write your own functions too.
- In the development stage, a function allows us to focus on a small chunk of code; it cuts an overwhelmingly big project into manageable pieces.
- A long script can have hundreds or thousands of variables. Variables defined inside a function are local: they will not be mixed up with those outside of the function. Only the input and the output of a function interact with the outside world.
- If a revision is necessary, we just need to change one place; we don’t have to repeat the work in every place where the function is invoked.
# Built-in function
sum(c(3, 4))
## [1] 7
# User-defined function
add_func <- function(x, y){
return(x + y)
}
add_func(3, 4)
## [1] 7
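A minimal added sketch of the scoping point made above: a variable assigned inside a function is local, so it leaves the outside world untouched.
# x inside the function is local; it does not overwrite the outer x
x <- 100
scope_demo <- function() {
  x <- 1
  return(x)
}
scope_demo()
## [1] 1
x
## [1] 100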
Package
A pure, clean installation of R is small, but R has an extensive ecosystem of add-on packages; this ecosystem is a unique treasure for R users. Most packages are hosted on CRAN. A common practice today is that statisticians upload a package to CRAN after they write or publish a paper with a new statistical method. They promote their work via CRAN, and users have easy access to state-of-the-art methods.
A package can be installed by install.packages("package_name") and invoked by library(package_name).
# Function from a package
stats::sd(1:10)
## [1] 3.02765
# Package install
# install.packages("glmnet")
library(glmnet)
It is also common for authors to make packages available on GitHub or on their websites. You can often use the devtools or remotes packages to install these, following the instructions on the project website.
if (!requireNamespace("remotes")) {
install.packages("remotes")
}
## Loading required namespace: remotes
remotes::install_github("kolesarm/RDHonest")
## Skipping install of 'RDHonest' from a github remote, the SHA1 (7b5b7475) has not changed since last install.
## Use `force = TRUE` to force installation
Help System
The help system is the first thing we must learn for a new language. In R, if we know the exact name of a function and want to check its usage, we can either call help(function_name) or use a single question mark: ?function_name. If we do not know the exact function name, we can instead use the double question mark ??key_words, which provides a list of related function names from a fuzzy search.
Example: ?seq, ??sequence
For many packages, you can also try the vignette() function, which provides an introduction to a package and its purpose through a series of helpful examples.
Example: vignette("dplyr")
OLS estimation with one \(x\) regressor and a constant. Graduate textbooks express OLS in matrix form: \[\hat{\beta} = (X' X)^{-1} X'y.\] To conduct OLS estimation in R, we literally translate the mathematical expression into code.
Step 1: We need data \(Y\) and \(X\) to run OLS. We simulate an artificial dataset.
# simulate data
set.seed(111) # can be removed to allow the result to change
# set the parameters
n <- 100
b0 <- matrix(1, nrow = 2 )
# generate the data
e <- rnorm(n)
X <- cbind( 1, rnorm(n) )
Y <- X %*% b0 + e
Step 2: translate the formula to code
# OLS estimation
(bhat <- solve( t(X) %*% X, t(X) %*% Y ))
## [,1]
## [1,] 0.9861773
## [2,] 0.9404956
class(bhat)
## [1] "matrix" "array"
# User-defined function
ols_est <- function(X, Y) {
bhat <- solve( t(X) %*% X, t(X) %*% Y )
return(bhat)
}
(bhat_2 <- ols_est(X, Y))
## [,1]
## [1,] 0.9861773
## [2,] 0.9404956
class(bhat_2)
## [1] "matrix" "array"
# Use built-in functions
(bhat_3 <- lsfit(X, Y, intercept = FALSE)$coefficients)
## X1 X2
## 0.9861773 0.9404956
class(bhat_3)
## [1] "numeric"
Step 3 (additional): plot the regression graph with the scatter points and the regression line. Further compare the regression line (black) with the true coefficient line (red).
# plot
plot(y = Y, x = X[,2], xlab = "X", ylab = "Y", main = "regression")
abline(a = bhat[1], b = bhat[2])
abline(a = b0[1], b = b0[2], col = "red")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
Step 4: In econometrics we are often interested in hypothesis testing. The t-statistic is widely used. To test the null \(H_0: \beta_2 = 1\), we compute the associated t-statistic. Again, this is a translation. \[ t = \frac{\hat{\beta}_2 - \beta_{02}}{ \hat{\sigma}_{\hat{\beta}_2} } = \frac{\hat{\beta}_2 - \beta_{02}}{ \sqrt{ \left[ (X'X)^{-1} \hat{\sigma}^2 \right]_{22} } }. \] where \([\cdot]_{22}\) is the (2,2)-element of a matrix.
# calculate the t-value
bhat2 = bhat[2] # the parameter we want to test
e_hat = Y - X %*% bhat
sigma_hat_square = sum(e_hat^2)/ (n-2)
Sigma_B = solve( t(X) %*% X ) * sigma_hat_square
t_value_2 = ( bhat2 - b0[2]) / sqrt( Sigma_B[2,2] )
print(t_value_2)
## [1] -0.5615293
Exercise: Can you write a function with both \(\hat{\beta}\) and \(t\)-value as outputs?
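One possible answer, as a sketch (the function name ols_est_t is ours, not from the notes): bundle both outputs in a list, the container type introduced earlier.
# User-defined function returning both the estimate and the t-value
ols_est_t <- function(X, Y, b_null = 1) {
  n <- nrow(X)
  bhat <- solve( t(X) %*% X, t(X) %*% Y )
  e_hat <- Y - X %*% bhat
  sigma_hat_square <- sum(e_hat^2) / (n - 2)
  Sigma_B <- solve( t(X) %*% X ) * sigma_hat_square
  t_value <- (bhat[2] - b_null) / sqrt( Sigma_B[2, 2] )
  return( list(bhat = bhat, t_value = t_value) )
}
# ols_est_t(X, Y, b_null = 1) reproduces bhat and t_value_2 from above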
We recommend the csv format, a plain ASCII file format. read.table() or read.csv() imports data from an ASCII file into an R session. write.table() or write.csv() exports the data in an R session to an ASCII file.
Example: FRED-MD Data
# Read data
fred_md <- read.csv("./data/fred_md_data.csv")
fred_md <- fred_md[-1, ] # remove the first row (transform code)
# Read in CPI and calculate the inflation
cpi <- fred_md$CPIAUCSL
infl_cpi = 100 * (cpi[-1] - cpi[-length(cpi)]) / cpi[-length(cpi)]
# Transform the format of date
date <- zoo::as.yearmon(zoo::as.Date( fred_md[, 1], format = "%m/%d/%Y" ))
date <- date[-1]
plot(date, infl_cpi, type = 'l', xlab = "Date", ylab = "Inflation")
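For completeness, the reverse direction is a one-liner. A sketch (with a hypothetical file path) that would export the computed inflation series, left commented out like the install.packages() call above:
# Export the inflation series to a local csv file (hypothetical path)
# write.csv(data.frame(date = as.character(date), infl = infl_cpi),
#           file = "./data/infl_cpi.csv", row.names = FALSE)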
library(AER)
data("CreditCard")
head(CreditCard)
## card reports age income share expenditure owner selfemp dependents
## 1 yes 0 37.66667 4.5200 0.033269910 124.983300 yes no 3
## 2 yes 0 33.25000 2.4200 0.005216942 9.854167 no no 3
## 3 yes 0 33.66667 4.5000 0.004155556 15.000000 yes no 4
## 4 yes 0 30.50000 2.5400 0.065213780 137.869200 no no 0
## 5 yes 0 32.16667 9.7867 0.067050590 546.503300 yes no 2
## 6 yes 0 23.25000 2.5000 0.044438400 91.996670 no no 0
## months majorcards active
## 1 54 1 12
## 2 34 1 13
## 3 58 1 5
## 4 25 1 7
## 5 64 1 5
## 6 54 1 1
Besides loading a data file from the local hard disk, we can directly download data from the internet. Here we show how to retrieve the daily stock data of Apple Inc. from Yahoo Finance, and save the dataset locally. A package called quantmod is used.
quantmod::getSymbols("AAPL", src = "yahoo")
## [1] "AAPL"
tail(AAPL)
## AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
## 2023-01-26 143.17 144.25 141.90 143.96 54105100 143.96
## 2023-01-27 143.16 147.23 143.08 145.93 70492800 145.93
## 2023-01-30 144.96 145.55 142.85 143.00 64015300 143.00
## 2023-01-31 142.70 144.34 142.28 144.29 65874500 144.29
## 2023-02-01 143.97 146.61 141.32 145.43 77663600 145.43
## 2023-02-02 148.90 151.18 148.17 150.82 118339000 150.82
plot(AAPL$AAPL.Close)
Another example: Industrial Production: Total Index in FRED data
quantmod::getSymbols.FRED(Symbols = c("IPB50001SQ"), env = .GlobalEnv)
## [1] "IPB50001SQ"
plot(IPB50001SQ)
Flow control is common in all programming languages: if is used for choice, and for or while is used for loops.
if ... else ...
x <- 1
if (x == 1) {
print("Yes!")
} else if (x == 0) {
print("No!")
} else {
print("I don't understand.")
}
## [1] "Yes!"
for loop
for (i in 1:3) {
x <- rnorm(i)
print(x)
}
## [1] -0.2986997
## [1] -0.5935356 0.8267038
## [1] -2.1614064 -0.9210425 0.5171631
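A while loop works analogously: the body repeats until the condition fails. A minimal added sketch that halves a number until it drops below 1:
x <- 10
while (x >= 1) {
  x <- x / 2 # halve x on each pass
}
print(x)
## [1] 0.625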
When we use loops, we should minimize the tasks within the loop. One principle is to always pre-specify the dimensions of, and allocate memory for, result variables outside the loop.
Example: Empirical coverage of confidence intervals.
CI <- function(x) {
# construct confidence interval
# x is a vector of random variables
n <- length(x)
mu <- mean(x)
sig <- sd(x)
upper <- mu + 1.96 / sqrt(n) * sig
lower <- mu - 1.96 / sqrt(n) * sig
return(list(lower = lower, upper = upper))
}
Rep <- 100000
sample_size <- 1000
mu <- 2
# Append a new outcome after each loop
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
x <- rpois(sample_size, mu)
bounds <- CI(x)
out_i <- ((bounds$lower <= mu) & (mu <= bounds$upper))
if (i == 1) {
out <- out_i
} else {
out <- c(out, out_i)
}
}
mean(out)
## [1] 0.95023
cat("Takes", Sys.time() - pts0, "seconds\n")
## Takes 11.76872 seconds
# Initialize the result vector
out <- rep(0, Rep)
pts0 <- Sys.time() # check time
for (i in 1:Rep) {
x <- rpois(sample_size, mu)
bounds <- CI(x)
out[i] <- ((bounds$lower <= mu) & (mu <= bounds$upper))
}
mean(out)
## [1] 0.94893
cat("Takes", Sys.time() - pts0, "seconds\n")
## Takes 4.198398 seconds
In R and other high-level languages, for loops are SLOW. If you have to loop through many elements and do expensive operations for each one, your code will take forever. There are several ways to speed up loops; one of them relies on vectorization.
Many operations naively done in for-loops can instead be done using matrix operations.
Consider an example of generating a 2-D random walk (example borrowed from Ross Ihaka’s online notes).
for loop
rw2d_loop <- function(n) {
xpos <- rep(0, n)
ypos <- rep(0, n)
xdir <- c(TRUE, FALSE)
step_size <- c(1, -1)
for (i in 2:n) {
if (sample(xdir, 1)) {
xpos[i] <- xpos[i - 1] + sample(step_size, 1)
ypos[i] <- ypos[i - 1]
} else {
xpos[i] <- xpos[i - 1]
ypos[i] <- ypos[i - 1] + sample(step_size, 1)
}
}
return(data.frame(x = xpos, y = ypos))
}
rw2d_vec <- function(n) {
xsteps <- c(-1, 1, 0, 0)
ysteps <- c(0, 0, -1, 1)
dir <- sample(1:4, n - 1, replace = TRUE)
xpos <- c(0, cumsum(xsteps[dir]))
ypos <- c(0, cumsum(ysteps[dir]))
return(data.frame(x = xpos, y = ypos))
}
n <- 100000
t0 <- Sys.time()
df <- rw2d_loop(n)
cat("Naive implementation: ", difftime(Sys.time(), t0, units = "secs"))
## Naive implementation: 0.46785
t0 <- Sys.time()
df2 <- rw2d_vec(n)
cat("Vectorized implementation: ", difftime(Sys.time(), t0, units = "secs"))
## Vectorized implementation: 0.002739906
We can exploit the power of multicore machines to speed up loops by parallel computing. The packages foreach and doParallel are useful for this.
# Wrap one replication into a function; the loop bodies below are
# equivalent to calling capture()
capture <- function() {
x <- rpois(sample_size, mu)
bounds <- CI(x)
return(((bounds$lower <= mu) & (mu <= bounds$upper)))
}
# Workhorse packages
library(foreach)
library(doParallel)
Rep <- 100000
sample_size <- 1000
registerDoParallel(8) # open 8 CPUs to accept incoming jobs.
pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %dopar% {
x <- rpois(sample_size, mu)
bounds <- CI(x)
((bounds$lower <= mu) & (mu <= bounds$upper))
}
cat("parallel loop takes", Sys.time() - pts0, "seconds\n")
## parallel loop takes 5.086341 seconds
pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %do% {
x <- rpois(sample_size, mu)
bounds <- CI(x)
((bounds$lower <= mu) & (mu <= bounds$upper))
}
cat("sequential loop takes", Sys.time() - pts0, "seconds\n")
## sequential loop takes 10.0201 seconds
We now change the nature of the task a bit: fewer replications, but a much larger sample size within each one.
Rep <- 200
sample_size <- 200000
pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %dopar% {
x <- rpois(sample_size, mu)
bounds <- CI(x)
((bounds$lower <= mu) & (mu <= bounds$upper))
}
cat("parallel loop takes", Sys.time() - pts0, "seconds\n")
## parallel loop takes 0.2538779 seconds
pts0 <- Sys.time() # check time
out <- foreach(icount(Rep), .combine = c) %do% {
x <- rpois(sample_size, mu)
bounds <- CI(x)
((bounds$lower <= mu) & (mu <= bounds$upper))
}
cat("sequential loop takes", Sys.time() - pts0, "seconds\n")
## sequential loop takes 1.465558 seconds
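One optional housekeeping step (not part of the timing runs above): once the parallel jobs are done, doParallel can release the workers registered with registerDoParallel(8).
# Release the implicit cluster created by registerDoParallel()
stopImplicitCluster()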
R is a language created by statisticians, and it has elegant built-in statistical functions. The prefixes p (probability), d (density for a continuous random variable, or mass for a discrete one), q (quantile), and r (random variable generator) are used ahead of the name of a probability distribution, such as norm (normal), chisq (\(\chi^2\)), t (t), weibull (Weibull), cauchy (Cauchy), binom (binomial), and pois (Poisson), to name a few.
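As a quick added illustration of the naming convention, the p and q functions are inverses of each other for any distribution:
pnorm(1.96) # P(Z <= 1.96) for a standard normal Z
## [1] 0.9750021
qnorm(0.975) # the 97.5% quantile, the inverse of pnorm
## [1] 1.959964
dnorm(0) # density of the standard normal at 0
## [1] 0.3989423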
Example
This example illustrates sampling error. We plot the exact density of the \(\chi^2(3)\) distribution over x_axis = seq(0.01, 15, by = 0.01) (the red line below) and compare it with the histogram of 1000 random draws from the same distribution.
set.seed(100)
x_axis <- seq(0.01, 15, by = 0.01)
y <- dchisq(x_axis, df = 3)
z <- rchisq(1000, df = 3)
hist(z, freq = FALSE, xlim = range(0.01, 15), ylim = range(0, 0.25))
lines(y = y, x = x_axis, type = "l", xlab = "x", ylab = "density", col = "red")
Statistical models are formulated as y ~ x, where y on the left-hand side is the dependent variable, and x on the right-hand side is the explanatory variable. The built-in OLS function is lm. It is called by lm(y ~ x, data = data_frame).
All built-in regression functions in R share the same structure. Once one type of regression is understood, it is easy to extend to other regressions.
# Linear models
n <- 100
x <- rnorm(n)
y <- 0.5 + 1 * x + rnorm(n)
result <- lm(y ~ x)
summary(result)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49919 -0.53555 -0.08413 0.46672 2.74626
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.43953 0.09161 4.798 5.74e-06 ***
## x 1.03031 0.10158 10.143 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9091 on 98 degrees of freedom
## Multiple R-squared: 0.5122, Adjusted R-squared: 0.5072
## F-statistic: 102.9 on 1 and 98 DF, p-value: < 2.2e-16
class(result)
## [1] "lm"
typeof(result)
## [1] "list"
result$coefficients
## (Intercept) x
## 0.4395307 1.0303095
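Because fitted models share this structure, generic accessor functions work uniformly across model classes. A short added illustration on the result object (outputs omitted, since the data above are random):
# Generic accessors that work for lm, glm, and many other model classes
coef(result) # coefficient estimates
vcov(result) # estimated variance-covariance matrix of the estimates
head(residuals(result)) # residuals
head(fitted(result)) # fitted values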
A simple example of logit regression using the Swiss Labor Market Participation Data (cross-section data originating from the health survey SOMIPOPS for Switzerland in 1981).
library("AER")
data("SwissLabor", package="AER")
head(SwissLabor)
## participation income age education youngkids oldkids foreign
## 1 no 10.78750 3.0 8 1 1 no
## 2 yes 10.52425 4.5 8 0 1 no
## 3 no 10.96858 4.6 9 0 0 no
## 4 no 11.10500 3.1 11 2 0 no
## 5 no 11.10847 4.4 12 0 2 no
## 6 yes 11.02825 4.2 12 0 1 no
library(dplyr) # for %>% and mutate()
SwissLabor <- SwissLabor %>%
mutate(participation = ifelse(participation == "yes", 1, 0))
glm_fit <- glm(
participation ~ age + education + youngkids + oldkids + foreign,
data = SwissLabor,
family = "binomial")
summary(glm_fit)
##
## Call:
## glm(formula = participation ~ age + education + youngkids + oldkids +
## foreign, family = "binomial", data = SwissLabor)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9110 -0.9783 -0.5891 1.1227 2.1989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.085961 0.540535 3.859 0.000114 ***
## age -0.527066 0.089670 -5.878 4.16e-09 ***
## education -0.001298 0.027502 -0.047 0.962371
## youngkids -1.326957 0.177893 -7.459 8.70e-14 ***
## oldkids -0.072517 0.071878 -1.009 0.313024
## foreignyes 1.353614 0.198598 6.816 9.37e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1203.2 on 871 degrees of freedom
## Residual deviance: 1069.9 on 866 degrees of freedom
## AIC: 1081.9
##
## Number of Fisher Scoring iterations: 4
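As a short added follow-up, the generic predict() turns the fitted logit model into participation probabilities:
# Predicted probabilities of participation (one per observation)
head(predict(glm_fit, type = "response"))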
We demonstrate the skills from these notes with a Monte Carlo simulation exercise. In such experiments, we sample data from a specified statistical model and examine the finite sample performance of estimation and inference procedures.
For different data generating processes and different sample sizes \(n\), we simulate data and estimate the model many times, and we summarize the results (in most cases) by measures such as (empirical) bias, RMSE, and size, by empirical power functions, or by empirical density plots.
Generically, the bias, RMSE, and size are calculated as \[bias = R^{-1}\sum_{r=1}^R \left( \hat{\theta}^{(r)} - \theta_0 \right),\] \[RMSE = \left(R^{-1}\sum_{r=1}^R \left( \hat{\theta}^{(r)} -\theta_0 \right)^2\right)^{1/2},\] \[Size = R^{-1}\sum_{r=1}^R \boldsymbol{1}\left[\left\vert \hat{\theta}^{(r)} - \theta_0 \right\vert / \hat{\sigma}_{\hat{\theta}}^{(r)} > \mathrm{cv}_{0.05} \right],\] where \(\theta_0\) is the true parameter, \(\hat{\theta}^{(r)}\) is its estimate in replication \(r\), and \(\hat{\sigma}_{\hat{\theta}}^{(r)}\) is the estimated standard error of \(\hat{\theta}^{(r)}\). The empirical power function evaluated at \(\theta_\delta\) is \[Power(\theta_\delta) = R^{-1}\sum_{r=1}^R \boldsymbol{1}\left[ \left\vert \hat{\theta}^{(r)} - \theta_\delta \right\vert / \hat{\sigma}_{\hat{\theta}}^{(r)} > \mathrm{cv}_{0.05} \right].\]
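To make the translation concrete, a minimal sketch (the helper name and its input are hypothetical, not from the notes) that turns a vector of replication t-statistics into the empirical size defined above:
# t_stats: hypothetical vector of t-statistics, one per replication,
# computed under the true null
empirical_size <- function(t_stats, cv = 1.96) {
  mean( abs(t_stats) > cv ) # fraction of (false) rejections
}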
Major components:
1. Data generating functions.
2. Estimation functions.
3. Master files, which repeatedly generate data and estimate the model.
4. Result-generating functions.
# Skeleton of a typical Monte Carlo experiment (pseudocode)
# Initialize result_container with pre-specified dimensions
for (i in seq_along(setups)) { # loop over different setups
  for (r in 1:number_of_replications) {
    data <- generate_data_function(sample_size, parameters)
    result_container[i, r] <- statistical_operation(data)
  }
}
# Summarize bias, RMSE, size, power, ... from result_container
results <- result_summary_function(result_container)
Example Omitted variable bias.
Consider a linear regression model \(y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + u_i\). \((y_i, x_i)\) are i.i.d. with \[u_i\sim i.i.d.\ N(0,1), \quad (x_{i1}, x_{i2})^\prime \sim i.i.d.\ N\left(\begin{pmatrix}0 \\ 1 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right).\] True parameters are \(\alpha = 0.11\), \(\beta = (0.22, 0.33)^\prime\), \(\rho = 0.1\), \(\sigma_1 = 1\), and \(\sigma_2 = 4\), matching the defaults of the dgp function below.
We investigate the finite sample performance of the OLS estimator in two scenarios:
1. Regress \(y\) on \(x_1\) and \(x_2\).
2. Regress \(y\) on \(x_1\) only.
Run \(1000\) replications and report the bias and RMSE.
# The data generating process (DGP)
dgp <- function(n, a = 0.11, b = c(0.22, 0.33), mu = c(0,1),
sigma1 = 1, sigma2 = 4, rho = 0.1){
Sigma <- matrix(c(sigma1^2, rho*sigma1*sigma2,
rho*sigma1*sigma2, sigma2^2), 2, 2)
X <- MASS::mvrnorm(n = n, mu = mu, Sigma = Sigma)
y <- a + X %*% b + rnorm(n)
return( list(y = y, X = X) )
}
# Fix the seed for random number generator
# so that the results are replicable
set.seed(200)
# Setup
n <- 1000
p <- 3
a0 <- 0.11
b0 <- c(0.22, 0.33)
Rep <- 1000
# Container
Coef_hat <- matrix(0, p, Rep)
Coef_hat_unspec <- matrix(0, p-1, Rep)
# Estimation
for(i in 1:Rep){
data <- dgp(n)
Coef_hat[, i] <- lsfit(data$X, data$y)$coefficients
Coef_hat_unspec[, i] <- lsfit(data$X[, 1], data$y)$coefficients
}
# A function that summarizes results
result_sum <- function(coef_hat, para_0) {
Rep <- ncol(coef_hat)
data.frame(mean = rowMeans(coef_hat),
bias = rowMeans(coef_hat) - para_0,
RMSE = sqrt(rowMeans((coef_hat - matrix(para_0, length(para_0), Rep))^2)))
}
cat("Correct-specified Model: \n")
## Correct-specified Model:
result_sum(Coef_hat, c(a0, b0))
## mean bias RMSE
## 1 0.1110109 0.0010109166 0.032403825
## 2 0.2198700 -0.0001299870 0.031024382
## 3 0.3295965 -0.0004034982 0.007871069
cat("Under-specified Model: \n")
## Under-specified Model:
result_sum(Coef_hat_unspec, c(a0, b0[1]))
## mean bias RMSE
## 1 0.4413665 0.3313665 0.3355070
## 2 0.3501253 0.1301253 0.1397925
R Markdown is a document format in which we can execute R code and write documents using Markdown syntax.
R Markdown provides an authoring framework for data science. You can use a single R Markdown file to both
- save and execute code, and
- generate high-quality reports that can be shared with an audience.
A friendly introduction to R Markdown by RStudio: https://rmarkdown.rstudio.com/lesson-1.html