**What is R?**

To quote the R project website:

R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS.

What does that mean?

- R was created for the statistical and graphical work required by econometrics.
- R has a vibrant, thriving online community (e.g., on Stack Overflow).
- Plus it’s **free** and **open source**.

**Why are we using R?**

- Many alternatives:
  - Stata vs. R vs. Python vs. MATLAB vs. Julia vs. C/C++
  - No clear answer…

- Advantages of R
  - R is **free** and **open source** (vs. Stata and MATLAB).
  - Designed for data work. Favored by academia, particularly the statistics and econometrics community, which means that most of the time you can apply the most up-to-date methods with packages developed by the authors (vs. all others).
    - Example: C-Lasso, Bubble test
  - R is very **flexible and powerful**, adaptable to nearly any task, *e.g.*, ’metrics, spatial data analysis, machine learning, web scraping, data cleaning, website building, teaching.
  - Well-developed environment. Rich sources of packages (vs. Julia).
  - R imposes **no limitations** on your number of observations, variables, memory, or processing power (vs. Stata).
  - If you put in the work, you will come away with a **valuable and marketable** tool.
  - https://github.com/matloff/R-vs.-Python-for-Data-Science


- Download and install R.

- It comes with a GUI, which opens when you run the R application (e.g. from the applications folder on Mac). You can also run it from the command line.

- Download and install RStudio.

- A powerful integrated development environment (IDE) can be very helpful.

*Download Git. (Optional but recommended.)*

- Version control is important in practice, particularly when many people work on the same project.

*Create an account on GitHub and register for a student/educator discount. (Optional but recommended.)*

- For installation guidance and troubleshooting: Jenny Bryan’s http://happygitwithr.com.

**Some OS-specific extras**

To help smooth some software installation issues further down the road, please also do the following (depending on your OS):

- **Windows:** Install Rtools.
- **Mac:** Install Homebrew. I also recommend that you configure/open your C++ toolchain (see here).
- **Linux:** None (you should be good to go).

Checklist

[check] Do you have the most recent version of R?

`version$version.string`

`## [1] "R version 4.3.3 (2024-02-29)"`

[check] Do you have the most recent version of RStudio? (The preview version is fine.)

```
RStudio.Version()$version
## Requires an interactive session but should return something like "[1] ‘1.2.5001’"
```

[check] Have you updated all of your R packages?

`update.packages(ask = FALSE, checkBuilt = TRUE)`

- Official Manual
- Course materials
  - EC 510 and EC 607 taught by Prof. Grant McDermott at the University of Oregon.
  - ECON 5170 taught by Prof. Zhentao Shi at the Chinese University of Hong Kong.
  - Statistics 36-350 taught by Prof. Ryan Tibshirani.
  - Coursera/Udacity/Udemy/YouTube/Bilibili…
- Books
  - Grolemund, G. Hands-On Programming with R
  - Wickham, H. and Grolemund, G. R for Data Science
  - Wickham, H. Advanced R
  - Adams, C. P. Learning Microeconometrics with R
- Stack Overflow/Google-oriented programming

R is a powerful calculator and recognizes all of the standard arithmetic operators:

`1+2 ## Addition`

`## [1] 3`

`6-7 ## Subtraction`

`## [1] -1`

`5/2 ## Division`

`## [1] 2.5`

`2^3 ## Exponentiation`

`## [1] 8`
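Multiplication uses `*`, and R follows the usual precedence rules for arithmetic. The short sketch below uses purely illustrative numbers (the variable names are not from the course material) to show how precedence and parentheses interact.

```r
prod_1 <- 3 * 4        # Multiplication: 12
prec_1 <- 2 + 3 * 4    # `*` binds tighter than `+`, so this is 2 + 12 = 14
prec_2 <- (2 + 3) * 4  # Parentheses override precedence: 20
prec_3 <- -2^2         # `^` binds tighter than unary minus, so this is -(2^2) = -4
c(prod_1, prec_1, prec_2, prec_3)
```

When in doubt, adding parentheses costs nothing and makes the intent explicit.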

We can also invoke modulo operators (integer division and remainder). These are very useful when dealing with time, for example.

`100 %/% 60 ## How many whole hours in 100 minutes?`

`## [1] 1`

`100 %% 60 ## How many minutes are left over?`

`## [1] 40`
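The same two operators can be chained to break a duration into several units. Here is a minimal sketch, assuming an illustrative duration of 3725 seconds; the variable names are hypothetical.

```r
total_seconds <- 3725L                          # illustrative value (integer)
hours   <- total_seconds %/% 3600L              # whole hours
minutes <- (total_seconds %% 3600L) %/% 60L     # whole minutes left after the hours
seconds <- total_seconds %% 60L                 # seconds left after the minutes
hms <- sprintf("%02d:%02d:%02d", hours, minutes, seconds)
hms  # "01:02:05"
```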

R also comes equipped with a full set of logical operators (and Boolean functions), which follow standard programming protocol. For example:

`1 > 2`

`## [1] FALSE`

`1 == 2`

`## [1] FALSE`

`1 > 2 | 0.5 ## The "|" stands for "or" (not a pipe a la the shell)`

`## [1] TRUE`

`1 > 2 & 0.5 ## The "&" stands for "and"`

`## [1] FALSE`

`isTRUE(1 < 2)`

`## [1] TRUE`

`4 %in% 1:10`

`## [1] TRUE`

`is.na(1:10)`

`## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE`

`# etc..`

You can read more about these logical operators here and here.
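Two details behind the examples above are easy to miss: comparisons are evaluated before `|` and `&`, and nonzero numbers are coerced to `TRUE`. The sketch below spells out how R reads `1 > 2 | 0.5`, and adds a related caution about testing floating-point numbers with `==`. The variable names are illustrative.

```r
or_1   <- (1 > 2) | 0.5     # how R reads `1 > 2 | 0.5`: FALSE | TRUE, so TRUE
coerce <- as.logical(0.5)   # nonzero numbers are coerced to TRUE
or_2   <- 1 > 2 | 1 > 0.5   # two explicit comparisons joined by "or": TRUE

exact  <- 0.1 + 0.2 == 0.3                   # FALSE, because of floating-point rounding
approx <- isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE, comparison up to a small tolerance
c(or_1, coerce, or_2, exact, approx)
```

Writing both comparisons explicitly, as in `or_2`, is usually what is intended.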

In R, we can use either `=` or `<-` to handle assignment. The `<-` is really a `<` followed by a `-`.

**Assignment with <-**

`<-` is normally read aloud as “gets”. You can think of it as a (left-facing) arrow saying *assign in this direction*.

```
a <- 10 + 5
a
```

`## [1] 15`

Of course, an arrow can point in the other direction too (i.e. `->`). So, the following code chunk is equivalent to the previous one, although it is used much less frequently.

`10 + 5 -> a`

**Assignment with =**

You can also use `=` for assignment.

```
b = 10 + 10 ## Note that the assigned object *must* be on the left with "=".
b
```

`## [1] 20`

More discussion about `<-` vs `=`: https://github.com/Robinlovelace/geocompr/issues/319#issuecomment-427376764 and https://www.separatinghyperplanes.com/2018/02/why-you-should-use-and-never.html.
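One practical difference between the two operators shows up inside function calls, where `=` names an argument rather than creating a variable. A minimal sketch with illustrative values (the names `m1`, `m2` and `y` are hypothetical):

```r
# `=` inside a call names the argument `x` of mean(); it does not assign
# a variable called `x` in the calling scope.
m1 <- mean(x = c(2, 4, 6))

# `<-` inside a call assigns `y` first, then passes its value to mean().
m2 <- mean(y <- c(2, 4, 6))

m1  # 4
m2  # 4
y   # 2 4 6, because `y` now exists as a variable
```

This is one reason many style guides reserve `=` for arguments and `<-` for assignment.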

- Everything is an object.
- Everything has a name.

Different *types* (or *classes*) of objects.
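To see which class an object belongs to, you can ask R directly with `class()`. The sketch below applies it to a few illustrative objects; the list name `cls` is hypothetical.

```r
cls <- list(
  int  = class(1:3),                # "integer"
  num  = class(c(1.5, 2, 3)),       # "numeric"
  chr  = class("hello"),            # "character"
  lgl  = class(TRUE),               # "logical"
  lst  = class(list(1, "a")),       # "list"
  df   = class(data.frame(a = 1)),  # "data.frame"
  fun  = class(sum)                 # "function"
)
str(cls)
```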

The c() function creates vectors. This is one of the objects we’ll use to store data.

```
myvec <- c(1, 2, 3)
print(myvec)
```

`## [1] 1 2 3`

Shortcut for consecutive numbers:

```
myvec <- 1:3
print(myvec)
```

`## [1] 1 2 3`
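For sequences that are not simple runs of consecutive integers, base R offers `seq()`, `rep()` and `seq_len()`. A short sketch with illustrative arguments (the variable names are hypothetical):

```r
s1 <- seq(1, 10, by = 3)          # 1 4 7 10
s2 <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
s3 <- rep(c(1, 2), times = 3)     # 1 2 1 2 1 2
s4 <- seq_len(0)                  # integer(0): an empty sequence
s5 <- 1:0                         # 1 0: the colon operator counts down
length(s4)
```

The last two lines show why `seq_len(n)` is often preferred over `1:n` in loops: when `n` is 0, `1:n` is not empty.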

Basic algebraic operations all work entrywise on vectors.

```
myvec <- c(1, 3, 7)
myvec2 <- c(5, 14, 3)
myvec3 <- c(9, 4, 8)
myvec + myvec2
```

`## [1] 6 17 10`

`myvec / myvec2`

`## [1] 0.2000000 0.2142857 2.3333333`

`myvec * (myvec2^2 + sqrt(myvec3))`

`## [1] 28.00000 594.00000 82.79899`
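Entrywise operations also work when the two vectors differ in length: R recycles the shorter one. The sketch below uses illustrative vectors to show this. R issues a warning when the longer length is not a multiple of the shorter one.

```r
r1 <- c(1, 2, 3, 4) + c(10, 20)  # c(10, 20) is recycled: 11 22 13 24
r2 <- c(1, 2, 3) * 2             # a single number is recycled too: 2 4 6
r1
r2
```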

The same holds for the binary logical operations `&`, `|`, and `!=`.

```
# logical vectors
logi_1 = c(TRUE, TRUE, FALSE)
logi_2 = c(FALSE, TRUE, TRUE)
logi_12 = logi_1 & logi_2
print(logi_12)
```

`## [1] FALSE TRUE FALSE`
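The other logical operations behave the same way, and a few helper functions summarize a logical vector. A minimal sketch reusing the same two vectors (the result names are illustrative):

```r
logi_1 <- c(TRUE, TRUE, FALSE)
logi_2 <- c(FALSE, TRUE, TRUE)

either   <- logi_1 | logi_2       # TRUE TRUE TRUE
differ   <- logi_1 != logi_2      # TRUE FALSE TRUE
negated  <- !logi_1               # FALSE FALSE TRUE
any_both <- any(logi_1 & logi_2)  # TRUE: at least one position is TRUE in both
n_true   <- sum(logi_1)           # 2: TRUE counts as 1, FALSE as 0
```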

You can “slice” a vector (grab subsets of it) in several different ways. Vector selection is specified in square brackets, `a[ ]`, using either a vector of positive integers or a logical vector.

```
myvec <- 7:20
myvec[8]
```

`## [1] 14`

`myvec[2:5]`

`## [1] 8 9 10 11`

```
# vector of booleans for whether each entry of myvec is greater than 13
indvec <- myvec > 13
indvec
```

```
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
## [13] TRUE TRUE
```

```
indvec2 <- myvec == 8
indvec2
```

```
## [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE
```

```
# only outputs entries of myvec for which indvec is true
myvec[indvec]
```

`## [1] 14 15 16 17 18 19 20`

```
# same thing but all in one line, without having to define indvec
myvec[myvec>13]
```

`## [1] 14 15 16 17 18 19 20`
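A few more indexing idioms are worth knowing: selecting several positions at once, dropping positions with negative indices, and recovering the positions that satisfy a condition with `which()`. A short sketch on the same vector (the result names are illustrative):

```r
myvec <- 7:20

picked  <- myvec[c(1, 3, 5)]                 # 7 9 11
dropped <- myvec[-(1:10)]                    # drop the first ten: 17 18 19 20
pos     <- which(myvec > 13)                 # positions, not values: 8 9 ... 14
even_gt <- myvec[myvec > 13 & myvec %% 2 == 0]  # 14 16 18 20
```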

Some useful vector functions:

`length(myvec)`

`## [1] 14`

`mean(myvec)`

`## [1] 13.5`

`var(myvec)`

`## [1] 17.5`
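It can be instructive to reproduce `var()` by hand. The sketch below computes the sample variance from its definition, which divides by `n - 1` rather than `n`, and checks it against the built-in. The names `n` and `manual_var` are illustrative.

```r
myvec <- 7:20
n <- length(myvec)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((myvec - mean(myvec))^2) / (n - 1)

manual_var                         # 17.5
all.equal(manual_var, var(myvec))  # TRUE
sd(myvec)                          # the standard deviation is sqrt(var(myvec))
```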

Matrices are just collections of several vectors of the same length.

```
myvec <- c(1, 3, 7)
myvec2 <- c(5, 14, 3)
myvec3 <- c(9, 4, 8)
# creates a matrix whose columns are myvec, myvec2 and myvec3
mat_1 <- cbind(myvec, myvec2, myvec3)
print(mat_1)
```

```
## myvec myvec2 myvec3
## [1,] 1 5 9
## [2,] 3 14 4
## [3,] 7 3 8
```

```
# now they're rows instead
mat_2 <- rbind(myvec, myvec2, myvec3)
print(mat_2)
```

```
## [,1] [,2] [,3]
## myvec 1 3 7
## myvec2 5 14 3
## myvec3 9 4 8
```

```
# Define a matrix by its elements (filled column by column by default)
mat_3 <- matrix(1:8, 2, 4)
print(mat_3)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8
```

```
mat_4 <- matrix(1:8, 2, 4, byrow = TRUE)
print(mat_4)
```

```
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
```

Matrix algebra works the same way as vector algebra: it is all done entrywise with the `*` and `+` operators. If you want to do matrix multiplication, use `%*%`.

`dim(mat_1)`

`## [1] 3 3`

`mat_1 * mat_2`

```
## myvec myvec2 myvec3
## [1,] 1 15 63
## [2,] 15 196 12
## [3,] 63 12 64
```

`mat_1 %*% mat_2 #Note that this differs from the elementwise product`

```
## [,1] [,2] [,3]
## [1,] 107 109 94
## [2,] 109 221 95
## [3,] 94 95 122
```
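The two kinds of product also differ in which dimensions they accept. A minimal sketch with a non-square matrix, assuming an illustrative 2-by-4 matrix `A`: the entrywise product needs identical dimensions, while `%*%` needs the inner dimensions to match.

```r
A <- matrix(1:8, 2, 4)

dim_entrywise <- dim(A * A)       # 2 4: same shape as A
AAt <- A %*% t(A)                 # (2 x 4) times (4 x 2) gives a 2 x 2 matrix
dim_matprod <- dim(AAt)           # 2 2
AAt
same <- all.equal(AAt, tcrossprod(A))  # tcrossprod(A) computes A %*% t(A)
```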

Other often used matrix operations:

`t(mat_1) # Transpose`

```
## [,1] [,2] [,3]
## myvec 1 3 7
## myvec2 5 14 3
## myvec3 9 4 8
```

`solve(mat_1) # Inverse`

```
## [,1] [,2] [,3]
## myvec -0.146842878 0.01908957 0.155653451
## myvec2 -0.005873715 0.08076358 -0.033773862
## myvec3 0.130690162 -0.04698972 0.001468429
```

`eigen(mat_1) # eigenvalues and eigenvectors`

```
## eigen() decomposition
## $values
## [1] 18.699429 8.556685 -4.256114
##
## $vectors
## [,1] [,2] [,3]
## [1,] -0.4631214 0.3400869 0.87156297
## [2,] -0.7270618 -0.6713411 -0.03609066
## [3,] -0.5068528 0.6585150 -0.48895343
```

For more operations, check out https://www.statmethods.net/advstats/matrix.html.
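A quick way to build confidence in `solve()` and `eigen()` is to check their defining properties numerically. The sketch below rebuilds the same 3-by-3 matrix (without column names, for brevity) and verifies that the inverse times the matrix is the identity, and that the first eigenpair satisfies \(A v = \lambda v\). The helper names are illustrative.

```r
mat_1 <- cbind(c(1, 3, 7), c(5, 14, 3), c(9, 4, 8))

# The inverse should satisfy A %*% A^{-1} = I (up to rounding error)
inv_1 <- solve(mat_1)
inverse_ok <- all.equal(mat_1 %*% inv_1, diag(3))

# Each eigenpair should satisfy A v = lambda v
eig <- eigen(mat_1)
lambda_1 <- eig$values[1]
v_1 <- eig$vectors[, 1]
eigen_ok <- all.equal(as.numeric(mat_1 %*% v_1), lambda_1 * v_1)

c(inverse_ok, eigen_ok)
```

Comparisons use `all.equal()` rather than `==` because the results are only exact up to floating-point error.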

“Slicing” matrices:

`mat_1[1, 1]`

```
## myvec
## 1
```

`mat_1[2, ]`

```
## myvec myvec2 myvec3
## 3 14 4
```
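Columns and submatrices can be extracted in the same way, by position or by name. One subtlety is that selecting a single row or column returns a plain vector unless `drop = FALSE` is given. A short sketch on the same matrix (the result names are illustrative):

```r
mat_1 <- cbind(myvec = c(1, 3, 7), myvec2 = c(5, 14, 3), myvec3 = c(9, 4, 8))

col_2   <- mat_1[, 2]                 # second column as a vector: 5 14 3
col_3   <- mat_1[, "myvec3"]          # select a column by name: 9 4 8
sub_mat <- mat_1[1:2, c(1, 3)]        # rows 1-2, columns 1 and 3 (a 2 x 2 matrix)
row_2   <- mat_1[2, , drop = FALSE]   # keep the result as a 1 x 3 matrix
dim(row_2)
```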

A *data.frame* is a two-dimensional table that stores data, similar to a spreadsheet in Excel. A matrix is also a two-dimensional table, but it accommodates only one type of element. Real-world data can be a collection of integers, real numbers, characters, categorical variables, and so on. A data frame is the standard way to organize data of mixed types in R.

```
df_1 <- data.frame(a = 1:2, b = 3:4)
print(df_1)
```

```
## a b
## 1 1 3
## 2 2 4
```

```
df_2 <- data.frame(name = c("Jack", "Rose"), score = c(100, 90))
print(df_2)
```

```
## name score
## 1 Jack 100
## 2 Rose 90
```

`print(df_2[, 1])`

`## [1] "Jack" "Rose"`

`print(df_2$name)`

`## [1] "Jack" "Rose"`
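Because each column keeps its own type, a data frame can be inspected, extended and filtered column by column. A minimal sketch on the same small table; the `passed` column and the cutoff of 95 are illustrative assumptions.

```r
df_2 <- data.frame(name = c("Jack", "Rose"), score = c(100, 90),
                   stringsAsFactors = FALSE)

col_types <- sapply(df_2, class)    # name is "character", score is "numeric"
df_2$passed <- df_2$score >= 95     # add a logical column
top <- df_2[df_2$score > 95, ]      # keep only the rows that satisfy a condition
top
dim(df_2)                           # 2 rows, 3 columns
```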

A vector contains only one type of element. A *list* is a basket for objects of various types. It can serve as a container when a procedure returns more than one useful object. For example, recall that when we invoke `eigen`, we are interested in both the eigenvalues and the eigenvectors, which are stored in `$values` and `$vectors`, respectively.

```
x_list <- list(a = 1:2, b = "hello world")
print(x_list)
```

```
## $a
## [1] 1 2
##
## $b
## [1] "hello world"
```

`print(x_list[[1]]) # Different from vectors and matrices`

`## [1] 1 2`

`print(x_list$a)`

`## [1] 1 2`
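The difference between single and double brackets matters for lists: `[ ]` returns a smaller list, while `[[ ]]` and `$` return the element itself. The sketch below illustrates this, and also inspects the list returned by `eigen()` for an illustrative diagonal matrix.

```r
x_list <- list(a = 1:2, b = "hello world")

class_single <- class(x_list["a"])    # "list": still a list with one element
class_double <- class(x_list[["a"]])  # "integer": the element itself

x_list$c <- c(TRUE, FALSE)            # add a new element by name
names(x_list)                         # "a" "b" "c"

eig <- eigen(matrix(c(2, 0, 0, 3), 2, 2))
names(eig)                            # "values" "vectors"
eig$values                            # 3 2 (sorted in decreasing order)
```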

You do things using functions. Functions come pre-written in packages (i.e. “libraries”), although you can, and should, write your own functions too.

- In the development stage, a function allows us to focus on a small chunk of code. It cuts an overwhelmingly big project into manageable pieces.
- A long script can have hundreds or thousands of variables. Variables defined inside a function are local, so they will not be mixed up with those outside the function. Only the input and the output of a function interact with the outside world.
- If a revision is necessary, we just need to change one place. We don’t have to repeat the work in every place where the function is invoked.

```
# Built-in function
sum(c(3, 4))
```

`## [1] 7`

```
# User-defined function
add_func <- function(x, y){
  return(x + y)
}
add_func(3, 4)
```

`## [1] 7`
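The point about local variables can be checked directly. In the sketch below, the function assigns to a variable called `total`, but the `total` defined outside the function is left untouched. The function name `sum_of_squares` and the values are illustrative.

```r
total <- 100   # a variable in the calling environment

sum_of_squares <- function(x) {
  total <- sum(x^2)   # this `total` is local to the function
  total               # the last expression is the return value
}

result <- sum_of_squares(1:3)
result   # 14
total    # still 100: the outer variable was not modified
```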

**Package**

A clean installation of R is small, but R has an extensive ecosystem of add-on packages. This ecosystem is a real treasure for R users. Most packages are hosted on CRAN. A common practice today is that statisticians upload a package to CRAN after they write or publish a paper with a new statistical method. They promote their work via CRAN, and users have easy access to state-of-the-art methods.

A package can be installed with `install.packages("package_name")` and loaded with `library(package_name)`.

```
# Function from a package
stats::sd(1:10)
```

`## [1] 3.02765`
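The `package::function` form is useful because it is unambiguous even when another object with the same name exists. A small sketch, in which the user-defined `sd` is a deliberately silly, hypothetical function created only to demonstrate masking:

```r
# A user-defined function with the same name masks the package version
sd <- function(x) "my own sd"

masked   <- sd(1:10)          # calls the user-defined function
explicit <- stats::sd(1:10)   # always calls the version in the stats package

rm(sd)                        # remove the mask again
masked
explicit
```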

```
# Package install (run once), then load
# install.packages("glmnet")
library(glmnet)
```

It is also common for authors to make packages available on GitHub or on their websites. You can often use the devtools package or the remotes package to install these, following the instructions on the project website.

```
if (!requireNamespace("remotes")) {
  install.packages("remotes")
}
```

`## Loading required namespace: remotes`

`remotes::install_github("kolesarm/RDHonest")`

`## Using GitHub PAT from the git credential store.`

```
## Skipping install of 'RDHonest' from a github remote, the SHA1 (7391e914) has not changed since last install.
## Use `force = TRUE` to force installation
```

**Help System**

The help system is the first thing we must learn for a new language. In R, if we know the exact name of a function and want to check its usage, we can either call `help(function_name)` or use a single question mark, `?function_name`. If we do not know the exact function name, we can instead use the double question mark, `??key_words`, which provides a list of related function names from a fuzzy search.

**Example**: `?seq`, `??sequence`
For many packages, you can also try the `vignette()` function, which will provide an introduction to a package and its purpose through a series of helpful examples.

**Example**: `vignette("dplyr")`
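Besides the help pages, a couple of base functions let you explore from the console without leaving your script. The sketch below lists the arguments of a function with `formals()` and searches for object names by pattern with `apropos()`. The choice of `sd` and of the pattern `"^seq_"` is purely illustrative.

```r
arg_names <- names(formals(sd))   # argument names of sd(): "x" "na.rm"
seq_funs  <- apropos("^seq_")     # objects whose names start with "seq_"

arg_names
seq_funs
```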

OLS estimation with one regressor \(x\) and a constant. Graduate textbooks express the OLS estimator in matrix form \[\hat{\beta} = (X' X)^{-1} X'y.\] To conduct OLS estimation in R, we literally translate the mathematical expression into code.
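As a reminder of where this expression comes from, here is a short sketch of the standard derivation, assuming \(X'X\) is invertible. The OLS estimator minimizes the sum of squared residuals, \[\hat{\beta} = \arg\min_{\beta} \, (y - X\beta)'(y - X\beta) = \arg\min_{\beta} \, \left( y'y - 2\beta' X'y + \beta' X'X \beta \right).\] Setting the gradient with respect to \(\beta\) to zero gives the normal equations, \[-2X'y + 2X'X\hat{\beta} = 0 \quad \Longrightarrow \quad X'X\hat{\beta} = X'y,\] and premultiplying both sides by \((X'X)^{-1}\) yields \(\hat{\beta} = (X' X)^{-1} X'y\).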

Step 1: We need data \(Y\) and \(X\) to run OLS. We simulate an artificial dataset.

```
# simulate data
set.seed(111) # can be removed to allow the result to change
# set the parameters
n <- 100
b0 <- matrix(1, nrow = 2 )
# generate the data
e <- rnorm(n)
X <- cbind( 1, rnorm(n) )
Y <- X %*% b0 + e
```

Step 2: Translate the formula into code.

```
# OLS estimation
(bhat <- solve( t(X) %*% X, t(X) %*% Y ))
```

```
## [,1]
## [1,] 0.9861773
## [2,] 0.9404956
```

`class(bhat)`

`## [1] "matrix" "array"`

```
# User-defined function
ols_est <- function(X, Y) {
  bhat <- solve( t(X) %*% X, t(X) %*% Y )
  return(bhat)
}
(bhat_2 <- ols_est(X, Y))
```

```
## [,1]
## [1,] 0.9861773
## [2,] 0.9404956
```

`class(bhat_2)`

`## [1] "matrix" "array"`

```
# Use built-in functions
(bhat_3 <- lsfit(X, Y, intercept = FALSE)$coefficients)
```

```
## X1 X2
## 0.9861773 0.9404956
```

`class(bhat_3)`

`## [1] "numeric"`
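In everyday work, the usual entry point for linear regression is `lm()` with a formula, which adds the intercept automatically. The sketch below is a self-contained comparison under the same simulation design (seed 111, 100 observations, both true coefficients equal to 1). The names `x`, `y`, `bhat_matrix` and `fit` are illustrative, and `crossprod(X)` is used as a shorthand for `t(X) %*% X`.

```r
set.seed(111)
n <- 100
e <- rnorm(n)
x <- rnorm(n)
y <- 1 + x + e              # true intercept and slope are both 1

# Matrix formula: solve the normal equations X'X b = X'y
X <- cbind(1, x)
bhat_matrix <- solve(crossprod(X), crossprod(X, y))

# Formula interface: lm() includes the constant by default
fit <- lm(y ~ x)
coef(fit)

# The two approaches should agree up to numerical error
agree <- all.equal(as.numeric(bhat_matrix), unname(coef(fit)))
agree
```

Beyond the coefficients, `summary(fit)` reports standard errors and other regression statistics.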

Step 3 (additional): plot the regression graph with the scatter points and the regression line. Further compare the regression line (black) with the true coefficient line (red).

```
# plot
plot(y = Y, x = X[,2], xlab = "X", ylab = "Y", main = "regression")
abline(a = bhat[1], b = bhat[2])
abline(a = b0[1], b = b0[2], col = "red")
abline(h = 0, lty = 2)
abline(v = 0, lty = 2)
```