ECON 4370 / 6370 Computing for Economics
Lecture 4: Version Control with Git
This set of notes is based on Grant McDermott’s lecture notes on version control with Git and GitHub.
Introduction
This lecture covers version control with Git and GitHub, essential tools for modern data science and research workflows. We’ll learn how to track changes, collaborate with others, and maintain organized project histories.
Learning Objectives
By the end of this lecture, you will be able to:
- Understand the purpose and benefits of version control
- Set up Git repositories and connect them to GitHub
- Perform basic Git operations (add, commit, push, pull)
- Handle merge conflicts
- Work with branches and pull requests
- Use Git effectively in your research projects
Why Version Control?
The Problem
Anyone who has worked on a project has experienced this scenario: multiple versions of files with unclear names, uncertainty about which version is current, and the fear of losing work when making changes.
You need a version control system (VCS) to track changes to your files.
As the name implies, these tools help maintain a history of changes; furthermore, they facilitate collaboration. VCSs track changes to a folder and its contents in a series of snapshots, where each snapshot encapsulates the entire state of files/folders within a top-level directory. VCSs also maintain metadata like who created each snapshot, messages associated with each snapshot, and so on.
Why is version control useful? Even when you’re working by yourself, it can let you look at old snapshots of a project, keep a log of why certain changes were made, work on parallel branches of development, and much more. When working with others, it’s an invaluable tool for seeing what other people have changed, as well as resolving conflicts in concurrent development.
Modern VCSs also let you easily (and often automatically) answer questions like:
- Who wrote this module?
- When was this particular line of this particular file edited? By whom? Why was it edited?
- Over the last 1000 revisions, when/why did a particular unit test stop working?
The Solution: Git and GitHub
Git
Git is a distributed version control system that solves these problems:
- Version Control: Track all changes to your files over time
- Collaboration: Multiple people can work on the same project simultaneously
- Backup: Your work is safely stored and can be recovered
- History: See exactly what changed, when, and why
Think of Git as if Dropbox and the “Track changes” feature in MS Word had a baby. It’s even better because Git is optimized for the things that economists and data scientists work on (code, data analysis, documentation).
To be further motivated, please refer to git vs. Dropbox from a researcher’s perspective by Michael Stepner.
GitHub
GitHub is an online hosting platform that provides services built on top of Git:
- Cloud Storage: Your repositories are stored online
- Collaboration Tools: Issues, pull requests, wikis
- Social Features: Follow projects, star repositories
- Integration: Works seamlessly with RStudio and other tools
Just like we don’t need RStudio to run R code, we don’t need GitHub to use Git… but it makes our lives much easier.
Git(Hub) for Scientific Research
From Software Development to Research
Git and GitHub’s role in global software development is well-established. There’s a high probability that your favorite app, program, or package is built using Git-based tools (RStudio is a case in point).
Scientists and academic researchers are increasingly adopting these tools:
- Open Science: Git(Hub) helps operationalize the ideals of open science and reproducibility
- Journal Requirements: Many journals have strict requirements regarding reproducibility and data access
- Research Workflow: Host code, data, and documentation for papers
- Collaboration: Work with co-authors and research teams
Nature article: “Democratic databases: science on GitHub” (Perkel, 2016).
Getting Started with Git and GitHub
Prerequisites
Before we start, make sure you have:
☑ Installed R
☑ Installed RStudio
☑ Installed Git
☑ Created an account on GitHub
If you need help with installation, consult Jenny Bryan’s excellent guide: Happy Git with R.
Git(Hub) + RStudio Integration
One of RStudio’s great features is how well it integrates version control into your everyday workflow. Even though Git is a completely separate program from R, they feel like part of the same “thing” in RStudio.
This integration greatly reduces the cognitive overhead associated with traditional workflows where you have to juggle multiple programs and languages simultaneously.
SourceTree and GitKraken are two popular Git GUIs.
Setting Up Your First Repository
Link a GitHub Repository to an RStudio Project
The starting point for our workflow is to link a GitHub repository to an RStudio Project. Here are the steps:
- Create the repo on GitHub and initialize with a README
- Copy the HTTPS/SSH link (from the green “Code” button, labeled “Clone or Download” in older versions of the GitHub interface)
- Open RStudio
- Navigate to File → New Project → Version Control → Git
- Paste your copied link into the “Repository URL:” box
- Choose the project path and click Create Project
It’s easiest to start with HTTPS, but SSH is advised for more advanced users.
Making Local Changes
Once you’ve cloned your repository, you can start making changes:
- Look at the top-right panel in RStudio for the “Git” tab
- Open the README file (in the “Files” tab in the bottom-right panel)
- Add some text like “Hello World!” and save the file
- Notice the changes appear in the “Git” panel
Main Git Operations
Now that you’ve cloned your first repo and made some local changes, it’s time to learn the four main Git operations:
1. Stage (or “Add”)
Tell Git that you want to add changes to the repo history (file edits, additions, deletions, etc.).
2. Commit
Tell Git that, yes, you are sure these changes should be part of the repo history.
3. Pull
Get any new changes made on the GitHub repo (i.e., the upstream remote), either by your collaborators or you on another machine.
4. Push
Push any (committed) local changes to the GitHub repo.
Visual Workflow
Step-by-Step Summary
Here’s what we just did:
- Made some changes to a file and saved them locally
- Staged these local changes
- Committed these local changes to our Git history with a helpful message
- Pulled from the GitHub repo (good practice, even on solo projects)
- Pushed our changes to the GitHub repo
Always pull from the upstream repo before you push any changes. Seriously, do this even on solo projects; making it a habit will save you headaches down the road.
Why This Workflow?
Creating the repo on GitHub first means that it will always be “upstream” of your (and any other) local copies. This allows GitHub to act as the central node in the distributed version control network.
This is especially valuable when collaborating on projects with others, but also has advantages when working alone.
RStudio Projects are excellent because they:
- Interact seamlessly with Git(Hub)
- Solve absolute vs. relative path problems (the .Rproj file acts as an anchor point)
Git from the Shell
While RStudio’s Git integration is ideal for new users, understanding shell commands helps you internalize the basics and provides more power and flexibility.
Main Git Shell Commands
Basic Operations
Clone a repository:
$ git clone REPOSITORY-URL
See the commit history:
$ git log
Check what has changed:
$ git status
Staging Files
Stage a specific file or folder:
$ git add NAME-OF-FILE-OR-FOLDER
Stage all changes in the repository (new, modified, and deleted files):
$ git add -A
Stage updated files only (modified or deleted, but not new):
$ git add -u
Stage all changes in the current directory and below (new, modified, and deleted files; this is the behavior since Git 2.0):
$ git add .
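To see how the staging flags differ in practice, here is a minimal sketch that compares git add -u with git add -A in a throwaway repository. The temporary directory, the file names, and the "Demo User" identity are placeholders chosen for illustration, not part of the lecture's workflow.

```shell
# Placeholder identity so the commit works even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

# Throwaway repository with two tracked files (hypothetical names).
demo=$(mktemp -d)
cd "$demo"
git init --quiet
echo "a" > modified.txt
echo "b" > deleted.txt
git add modified.txt deleted.txt
git commit --quiet -m "Initial commit"

# Make one edit, one deletion, and one brand-new file.
echo "more" >> modified.txt
rm deleted.txt
echo "c" > new.txt

# -u stages changes to tracked files only; new.txt stays untracked ("??").
git add -u
after_u=$(git status --short)
echo "$after_u"

# -A stages everything, including the new file ("A").
git add -A
after_A=$(git status --short)
echo "$after_A"
```

In the short status format, the first column describes the staging area, so a letter in that column means the change is staged.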
Committing and Syncing
Commit your changes:
$ git commit -m "Helpful message"
Pull from the upstream repository:
$ git pull
Push local changes to the upstream repo:
$ git push
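The commands above can be strung together into one full cycle. The sketch below assumes a local bare repository as a stand-in for GitHub, so it runs without a network connection or an account; the directory names and the "Demo User" identity are placeholders. With a real GitHub repository, only the clone URL would differ.

```shell
# Placeholder identity so commits work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# Stand-in for "create the repo on GitHub and initialize with a README":
# a seed repository with one commit, copied into a bare repository.
git init --quiet "$work/seed"
cd "$work/seed"
echo "# Demo project" > README.md
git add README.md
git commit --quiet -m "Initial commit"
git clone --quiet --bare "$work/seed" "$work/remote.git"

# Clone the "remote", as you would clone from GitHub.
git clone --quiet "$work/remote.git" "$work/local"
cd "$work/local"

# Edit a file, then stage and commit the change.
echo "Hello World!" >> README.md
git status --short
git add README.md
git commit --quiet -m "Add greeting to README"

# Pull first (nothing new here, but it is the habit that matters), then push.
git pull --quiet
git push --quiet

# The local branch and the remote should now point at the same commit.
local_head=$(git rev-parse HEAD)
remote_head=$(git --git-dir="$work/remote.git" rev-parse HEAD)
echo "local:  $local_head"
echo "remote: $remote_head"
```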
Other
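- Display the commit history, most recent commit first:
$ git log
- Visualize the history of all branches as a graph (a DAG):
$ git log --all --graph --decorate
- Add an annotated tag to a commit (the current commit if no commit ID is given):
$ git tag -a v1.0 -m 'message' [optional:commit-id]
- Stop tracking a file while keeping it in the working directory:
$ git rm --cached filename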
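As an illustration of the tag and remove commands, the following sketch tags a commit and then stops tracking a data file without deleting it from disk. The repository, the file names, the tag message, and the "Demo User" identity are all hypothetical.

```shell
# Placeholder identity so commits and tags work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init --quiet
echo "analysis" > analysis.R
echo "secret" > raw-data.csv
git add -A
git commit --quiet -m "Add analysis script and data"

# Attach an annotated tag to the current commit.
git tag -a v1.0 -m "First complete draft"
git tag --list

# Stop tracking the data file; the file itself stays in the working directory.
git rm --quiet --cached raw-data.csv
git commit --quiet -m "Stop tracking raw data"

# Only analysis.R is still tracked, and the tag shows up in the decorated log.
tracked=$(git ls-files)
echo "$tracked"
git log --all --graph --decorate --oneline
```

After git rm --cached, the file shows up as untracked, so it usually makes sense to list it in .gitignore as well.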
Eraser-like features
The commit that is currently checked out, usually the latest commit on the current branch, is called the HEAD commit.
- Display the HEAD commit:
$ git show HEAD
- Restore a file in the working directory to its state in the last commit (this discards uncommitted changes to that file):
$ git checkout HEAD filename
- Unstage a file. This command resets the file in the staging area to match the HEAD commit. It does not discard changes in the working directory; it only removes them from the staging area.
$ git reset HEAD filename
- Move the current branch back to a previous commit, identified by its SHA (the first 7 characters are usually enough). Later commits drop out of the branch history, but their changes stay in the working directory as unstaged edits.
$ git reset SHA
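A small sketch of these undo commands in sequence may help. It assumes a throwaway repository with a single hypothetical file, notes.txt, and a placeholder identity; none of these names come from the lecture.

```shell
# Placeholder identity so commits work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init --quiet
echo "version 1" > notes.txt
git add notes.txt
git commit --quiet -m "First version"
first=$(git rev-parse --short HEAD)   # abbreviated SHA of the first commit

echo "version 2" > notes.txt
git commit --quiet -am "Second version"

# Stage an edit, then change your mind about staging it.
echo "draft edit" >> notes.txt
git add notes.txt
git reset --quiet HEAD notes.txt      # unstaged; the edit is still in the file
after_unstage=$(git status --short)
echo "$after_unstage"

# Throw the edit away and go back to the committed content.
git checkout HEAD notes.txt
after_restore=$(cat notes.txt)
echo "$after_restore"

# Move the branch back to the first commit. The file keeps its
# "version 2" content, which now appears as an unstaged change.
git reset --quiet "$first"
git status --short
git log --oneline
```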
Merge Conflicts
Collaborators working separately on the same paragraph of a file may easily encounter conflicts. In such situations, git pull cannot merge the changes from the remote into your local repository automatically. You can expect to see something like this in the shell:
git pull
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From https://github.com/your-user-name/repo-name
9b02654..0462f26 master -> origin/master
Auto-merging README.md
CONFLICT (content): Merge conflict in README.md
Automatic merge failed; fix conflicts and then commit the result.
If you check the status with git status, you will be told that you have unmerged paths:
git status
On branch master
Your branch and 'origin/master' have diverged,
and have 1 and 1 different commits each, respectively.
(use "git pull" to merge the remote branch into yours)
You have unmerged paths.
(fix conflicts and run "git commit")
(use "git merge --abort" to abort the merge)
Unmerged paths:
(use "git add <file>..." to mark resolution)
both modified: README.md
no changes added to commit (use "git add" and/or "git commit -a")
To resolve the conflicts, open the conflicted file in a text editor such as Sublime Text, Atom, or Visual Studio Code. In this example, it is the README.md file.
# Conflict-Resolving
<<<<<<< HEAD
Local changes.
=======
Remote changes.
>>>>>>> 0462f26ddfe83c35424168c2d7a0bed62c653413
Git marks the conflicting region in the file. The content between <<<<<<< HEAD and ======= is your local version (HEAD), and the content between ======= and >>>>>>> comes from the remote commit 0462f26.
- <<<<<<< HEAD indicates the start of the merge conflict
- ======= indicates the break point used for comparison
- >>>>>>> <commit-hash> indicates the end of the conflicted lines
To resolve the conflict, remove the <<<<<<< HEAD, =======, and >>>>>>> <commit-hash> marker lines and keep the content you want in the file. Detailed steps:
- Edit the file to keep the content you want
- Delete the conflict markers (<<<<<<<, =======, >>>>>>>)
- Save the file
- Stage the resolved file:
git add README.md
- Commit the resolution:
git commit -m "Resolve merge conflict"
- Push the changes:
git push
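If you would like to rehearse this before it happens in a real project, the sketch below provokes a conflict on purpose and then resolves it. It is a simplification: a local branch named collaborator (a hypothetical name) stands in for the commits that would normally arrive through git pull, and the final push is omitted because there is no remote. The identity and file contents are placeholders.

```shell
# Placeholder identity so commits work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init --quiet
echo "Original line." > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)   # master or main, depending on your setup

# The collaborator's edit, on a separate branch.
git checkout --quiet -b collaborator
echo "Remote changes." > README.md
git commit --quiet -am "Collaborator edit"

# Your own edit to the same line, back on the main branch.
git checkout --quiet "$main_branch"
echo "Local changes." > README.md
git commit --quiet -am "Local edit"

# The merge stops with a conflict and a non-zero exit status.
git merge collaborator || echo "Merge stopped: conflict in README.md"
conflicted=$(cat README.md)
echo "$conflicted"

# Resolve: write the content you want (no markers), then stage and commit.
echo "Local and remote changes, reconciled." > README.md
git add README.md
git commit --quiet -m "Resolve merge conflict"
git log --oneline --graph
```

Because the conflict comes from a local branch here, the closing marker names the branch (collaborator) rather than a commit hash.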
In some cases, the difference between the local branch and the remote branch is significant, or the conflicted file is in a special format, such as a LyX file, and resolving every conflict in a text editor is cumbersome. Another way to resolve conflicts is to open the two versions of the file side by side, copy the new updates into one of them, and commit the updated file. This is easy to do in SourceTree: select the commit -> right-click the file -> click Open Selected Version.
In addition, we can resolve conflicts with SourceTree (preferred) or external merge tools, for example KDiff3. For details, it is recommended to refer to the answer on Stack Overflow.
Important Notes
- The person who fixes the merge conflict gets to decide what to keep
- The full commit history is preserved, so collaborators can always recover their changes
- A more elegant solution to merge conflicts is provided by Git branches
Line Endings and Different Operating Systems
The Problem
You may encounter situations where Git highlights differences on seemingly unchanged sentences. This often happens when collaborators use different operating systems.
The issue is that Git evaluates invisible characters at the end of every line:
- Linux and macOS: Line ending is “LF”
- Windows: Line ending is “CRLF”
The Solution
Configure Git to handle line endings automatically:
$ git config --global core.autocrlf input
(Windows users: change input to true.)
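To check what this setting does, one option is git ls-files --eol, which reports the line endings Git sees in the index (i/) and in the working tree (w/). The sketch below sets the option for a single throwaway repository rather than globally, and the file name windows.txt is a placeholder.

```shell
demo=$(mktemp -d)
cd "$demo"
git init --quiet

# Repository-level setting for this demo only (the notes use --global).
git config core.autocrlf input

# A file saved with Windows-style CRLF line endings.
printf 'first line\r\nsecond line\r\n' > windows.txt

# Git may warn that CRLF will be replaced by LF; that is the setting at work.
git add windows.txt

# i/ is what Git stored in the index, w/ is what sits in the working directory.
eol_report=$(git ls-files --eol windows.txt)
echo "$eol_report"
```

With the input setting, the copy in the index is normalized to LF while the file on disk keeps its CRLF endings.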
Branches and Forking
What Are Branches?
Branches are one of Git’s coolest features:
- Safe Experimentation: Take a snapshot of your repo and try new ideas without affecting the main branch
- Feature Development: Work on new features or bug fixes in isolation
- Research Applications: Try robustness checks, revisions, or alternative analyses
- Easy Cleanup: If you’re not happy with changes, just delete the branch
Only merge back to the main branch when you (and collaborators) are 100% satisfied.
Creating Branches in RStudio
Branch Shell Commands
Create a new branch and switch to it:
$ git checkout -b NAME-OF-YOUR-NEW-BRANCH
Push the new branch to GitHub:
$ git push origin NAME-OF-YOUR-NEW-BRANCH
List all branches:
$ git branch
Switch back to the master branch (named main in repositories created with newer defaults):
$ git checkout master
Delete a branch:
$ git branch -d NAME-OF-YOUR-FAILED-BRANCH
$ git push origin :NAME-OF-YOUR-FAILED-BRANCH
Merging Branches
You have two options for merging branches:
1. Local Merging
- Commit your final changes to the new branch
- Switch back to the master branch:
git checkout master
- Merge in the new branch:
git merge new-idea
- Delete the branch (optional):
git branch -d new-idea
2. Pull Requests on GitHub
Pull requests (PRs) are a way to notify collaborators that you’ve completed a feature:
- Create a summary of all changes in the branch
- Assign reviewers who can approve the changes
- Review and discuss the changes
- Merge the pull request when satisfied
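The local merging option can be sketched end to end as follows. The branch name new-idea matches the commands above; the repository, the file analysis.R, and the identity are placeholders, and the main branch name is read from Git rather than assumed.

```shell
# Placeholder identity so commits work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

demo=$(mktemp -d)
cd "$demo"
git init --quiet
echo "baseline" > analysis.R
git add analysis.R
git commit --quiet -m "Baseline analysis"
main_branch=$(git symbolic-ref --short HEAD)   # master or main, depending on your setup

# Try an idea on its own branch.
git checkout --quiet -b new-idea
echo "robustness check" >> analysis.R
git commit --quiet -am "Add robustness check"

# Happy with it: switch back, merge, and delete the branch.
git checkout --quiet "$main_branch"
git merge --quiet new-idea
git branch -d new-idea

# Only the main branch is left, and it contains the new work.
git branch
cat analysis.R
```

Had the experiment failed, skipping the merge and deleting the branch (with -D instead of -d, since it would be unmerged) leaves the main branch untouched.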
Forks
Git forks lie somewhere between cloning a repo and branching from it. When you fork a repo, you create an independent copy under your GitHub account.
How Forking Works
- Fork a repository by clicking the “Fork” button on GitHub
- Clone your fork to your local machine
- Make changes and push them to your fork
- Create a pull request to the original repository
This is how much of the world’s software is developed:
- An outside contributor forks a project
- Adds a new feature or fixes a bug
- Issues an upstream pull request
- The original maintainer reviews and potentially merges the contribution
Maintaining Forks
If you want to stay up to date with the original repository, you’ll need to sync your fork periodically. See GitHub’s guide on Syncing a fork.
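One common way to sync from the shell is to add the original repository as a second remote, conventionally named upstream, then fetch and merge from it. The sketch below simulates this with local bare repositories standing in for the original project and your fork on GitHub; all paths, the identity, and the commit contents are placeholders.

```shell
# Placeholder identity so commits work even if Git is not configured yet.
export GIT_AUTHOR_NAME="Demo User" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo User" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)

# The original project (stand-in for someone else's GitHub repository).
git init --quiet "$work/seed"
cd "$work/seed"
echo "# Project" > README.md
git add README.md
git commit --quiet -m "Initial commit"
main_branch=$(git symbolic-ref --short HEAD)
git clone --quiet --bare "$work/seed" "$work/original.git"

# "Fork" it (a copy under your account), then clone your fork.
git clone --quiet --bare "$work/original.git" "$work/fork.git"
git clone --quiet "$work/fork.git" "$work/local"

# Meanwhile, the maintainer adds a commit to the original project.
cd "$work/seed"
echo "New feature" >> README.md
git commit --quiet -am "Add feature upstream"
git push --quiet "$work/original.git" "$main_branch"

# Sync: add the original as "upstream", fetch, merge, and update your fork.
cd "$work/local"
git remote add upstream "$work/original.git"
git fetch --quiet upstream
git merge --quiet "upstream/$main_branch"
git push --quiet origin "$main_branch"
git remote -v
```

The remote only needs to be added once per clone; later syncs are just the fetch, merge, and push steps.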
Additional Tips and Best Practices
README Files
README files are special in GitHub because they act as repository landing pages:
- Project Documentation: Explain the goal, software requirements, how to run the analysis
- Research Papers: Link to published papers, describe methodology
- Standalone Documentation: Some repos are essentially version-controlled blog posts
- Subdirectory READMEs: Can be added to subdirectories for more detailed documentation
READMEs should be written in Markdown, which GitHub automatically renders.
.gitignore Files
A .gitignore file tells Git what files to ignore. This is especially useful for:
- Proprietary data files (if you plan to make the repo public)
- Large files (GitHub blocks individual files larger than 100 MB)
- Compiled datasets (focus on the code that produces them)
- System files (OS-specific files, temporary files)
Creating .gitignore Files
You can create a .gitignore file in several ways:
- Automatically generated when cloning with RStudio Project
- Added when creating a repo on GitHub
- Created manually with your text editor
.gitignore Syntax
- Ignore a single file:
FILE-I-WANT-TO-IGNORE.csv
- Ignore a whole folder:
FOLDER-NAME/**
- Ignore all CSV files:
*.csv
- Ignore files beginning with “test”:
test*
- Don’t ignore a particular file:
!somefile.txt
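To check that a set of patterns does what you expect, you can ask Git which untracked files it would pick up and which it ignores. The sketch below writes a small .gitignore using the pattern types listed above; every file and folder name in it is made up for illustration.

```shell
demo=$(mktemp -d)
cd "$demo"
git init --quiet

# A few hypothetical project files.
mkdir scratch
touch analysis.R data.csv important.csv notes-private.txt test_run.R scratch/tmp.txt

# One pattern of each kind; the negation comes after the rule it overrides.
cat > .gitignore <<'EOF'
notes-private.txt
scratch/**
*.csv
test*
!important.csv
EOF

# Untracked files that Git would stage with "git add -A".
visible=$(git ls-files --others --exclude-standard)
echo "Not ignored:"
echo "$visible"

# Untracked files that match an ignore pattern.
ignored=$(git ls-files --others --ignored --exclude-standard)
echo "Ignored:"
echo "$ignored"
```

Here important.csv stays visible even though *.csv is ignored, because the ! rule appears later in the file and re-includes it.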
GitHub Issues
GitHub Issues are another great way to interact with collaborators and package maintainers:
- Bug reports
- Feature requests
- General questions
- Project management
A sample .gitignore file:
.Rproj.user
.Rhistory
.RData
.Ruserdata
/reference/
/References/
/doc/
/Arc/
*.DS_Store
*ipynb_checkpoints*/
.ipynb_checkpoints/*.*
output_mac*
output_linux*
/DebFm/.vscode/
/*.zip
## LyX
*.lyx~
*.lyx#
*.tmp
## SWP
*.bak
*.aaa
## R
*.Rhistory
*_cache*
*.RData
## matlab
*.asv
## Core latex/pdflatex auxiliary files:
*.aux
*.lof
*.log
*.lot
*.fls
*.out
*.toc
*.gz
## Intermediate documents:
*.dvi
*-converted-to.*
# these rules might exclude image files for figures etc.
# *.ps
# *.eps
# *.pdf
## Bibliography auxiliary files (bibtex/biblatex/biber):
*.bbl
*.bcf
*.blg
*-blx.aux
*-blx.bib
*.brf
*.run.xml
## Build tool auxiliary files:
*.fdb_latexmk
*.synctex
*.synctex.gz
*.synctex.gz(busy)
*.pdfsync
## Auxiliary and intermediate files from other packages:
# algorithms
*.alg
*.loa
# achemso
acs-*.bib
# amsthm
*.thm
# beamer
*.nav
*.snm
*.vrb
#(e)ledmac/(e)ledpar
*.end
*.[1-9]
*.[1-9][0-9]
*.[1-9][0-9][0-9]
*.[1-9]R
*.[1-9][0-9]R
*.[1-9][0-9][0-9]R
*.eledsec[1-9]
*.eledsec[1-9]R
*.eledsec[1-9][0-9]
*.eledsec[1-9][0-9]R
*.eledsec[1-9][0-9][0-9]
*.eledsec[1-9][0-9][0-9]R
# glossaries
*.acn
*.acr
*.glg
*.glo
*.gls
# hyperref
*.brf
# knitr
*-concordance.tex
*.tikz
*-tikzDictionary
# listings
*.lol
# makeidx
*.idx
*.ilg
*.ind
*.ist
# minitoc
*.maf
*.mtc
*.mtc0
# minted
*.pyg
# morewrites
*.mw
# nomencl
*.nlo
# sagetex
*.sagetex.sage
*.sagetex.py
*.sagetex.scmd
# sympy
*.sout
*.sympy
sympy-plots-for-*.tex/
# todonotes
*.tdo
# xindy
*.xdy
IVX_paper/ref.bib.sav
IVX_paper/IVX_paper-annotate.pdf
Summary and Best Practices
The Git Workflow Recipe
1. Create a repo on GitHub and initialize with a README
2. Clone the repo to your local machine (preferably using RStudio Project)
3. Stage any changes:
git add -A
4. Commit your changes:
git commit -m "Helpful message"
5. Pull from GitHub:
git pull
6. Fix any merge conflicts (if they occur)
7. Push your changes:
git push
Repeat steps 3-7 often, especially steps 3 and 4.
Frequently Asked Questions
When should I commit (and push) changes?
Early and often. It’s not quite as important as saving your work regularly, but it’s a close second. You should certainly push everything that you want your collaborators to see.
Do I need branches if I’m working on a solo project?
You don’t need them, but they offer big advantages in maintaining a sane workflow:
- Experiment without any risk to the main project
- Compress significant additions into single branches
- Use pull requests to review your own work
What’s the difference between cloning and forking a repo?
- Cloning: Directly ties your local version to the original repo
- Forking: Creates a copy on your GitHub account (which you can then clone)
Cloning makes it easier to fetch updates, but forking has advantages for contributing to others’ projects.
What happens when something goes wrong?
Think: “Oh shit, Git!” Seriously, check out ohshitgit.com.
What happens when something goes horribly wrong?
Burn it down and start again. See Happy Git with R’s burn section. This is a great advantage of Git’s distributed nature - if something goes horribly wrong, there’s usually an intact version somewhere else.
Next Steps
Practice
- Create your own test repository on GitHub
- Practice the basic workflow (add, commit, push, pull)
- Try working with branches for experimental features
- Collaborate with a classmate to experience merge conflicts
Resources
- Happy Git with R - Comprehensive guide by Jenny Bryan
- GitHub Guides - Official GitHub documentation
- Pro Git Book - Free online book about Git
- GitHub Desktop - GUI alternative to command line
- Atlassian Online Tutorial - A good starting point for learning the basics of Git
- The version control module from The Missing Semester of Your CS Education, offered by MIT, is also a great reference for beginners. The online notes are accompanied by a lecture video. Though targeted at computer science majors, this open course is also helpful for econometrics researchers. Modules other than the version control one are also recommended.
- If more elaboration is needed, online courses (for example, the Udacity and Codecademy courses) and other video tutorials are helpful.