ECON 4370 / 6370 Computing for Economics
Lecture 3: Learning to Love the Shell
This set of notes is based on Grant McDermott’s lecture notes on the shell.
Introduction
The shell tools that we’re going to be using today have their roots in the Unix family of operating systems originally developed at Bell Labs in the 1970s.
Besides paying homage, acknowledging the Unix lineage is important because these tools still embody the “Unix philosophy”:
Do One Thing And Do It Well
By pairing and chaining well-designed individual components, we can build powerful and much more complex larger systems. You can see why the Unix philosophy is also referred to as “minimalist and modular”. Again, this philosophy is very clearly expressed in the design and functionality of the Unix shell.
Definitions
Don’t be thrown off by terminology: shell, terminal, tty, command prompt, etc. These are all basically just different names for the same thing. They are all referring to a command line interface (CLI).
There are many shell variants, but we’re going to focus on Bash (i.e. Bourne again shell).
- Included by default on Linux and MacOS.
- Windows users need to install a Bash-compatible shell first. (See here for more details.)
Why bother with the shell?
1. Power - Both for executing commands and for fixing problems. There are some things you just can’t do in an IDE or GUI. It also avoids memory complications associated with certain applications and/or IDEs.
2. Reproducibility - Scripting is reproducible, while clicking is not.
3. Interacting with servers and super computers - The shell is often the only game in town for high performance computing.
4. Automating workflow and analysis pipelines - Easily track and reproduce an entire project (e.g. use a Makefile to combine multiple programs, scripts, etc.)
We’re going to focus on 1, 2 and 3 in this course. That’s not to say that 4 is unimportant (far from it!), but we just won’t have time to cover it.
Things that I use the shell for
- Git
- Renaming and moving files en masse
- Finding things on my computer
- Combining and manipulating PDFs
- Installing and updating software
- Scheduling tasks
- Monitoring system resources
- Connecting to cloud environments
- Running analyses (“jobs”) on super computers
- etc.
With Generative AI tools (e.g. Warp, Claude Code, Cursor, etc.), you can achieve a lot more when interacting with the shell… but it is still important to understand the shell and how to use it.
Bash Shell Basics
First Look
Let’s open up our Bash shell. A convenient way to do this is through RStudio’s built-in Terminal.
- Hitting `Shift`+`Alt`+`T` will cause a “Terminal” tab to open up in the bottom-left window pane (i.e. next to the “Console” tab).
- This should run Bash by default if it is installed on your system.
You should see something like:
username@hostname:~$
This is shell-speak for: “Who am I and where am I?”
- `username` denotes a specific user (one of potentially many on this computer).
- `@hostname` denotes the name of the computer or server.
- `:~` denotes the directory path (where `~` signifies the user’s home directory).
- `$` denotes the start of the command prompt.
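You can confirm each piece of the prompt yourself with a few one-liners (a quick sketch; the exact output depends on your machine):

```shell
whoami        # prints your username
pwd           # prints the current directory path
echo "$HOME"  # the full path that '~' abbreviates
```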
Useful Keyboard Shortcuts
- `Tab` completion.
- Use the `↑` (and `↓`) keys to scroll through previous commands.
- `Ctrl`+`→` (and `Ctrl`+`←`) to skip whole words at a time.
- `Ctrl`+`a` moves the cursor to the beginning of the line.
- `Ctrl`+`e` moves the cursor to the end of the line.
- `Ctrl`+`k` deletes everything to the right of the cursor.
- `Ctrl`+`u` deletes everything to the left of the cursor.
- `Ctrl`+`Shift`+`c` to copy and `Ctrl`+`Shift`+`v` to paste.
- `clear` to clear your terminal.
Syntax
All Bash commands have the same basic syntax:
command option(s) argument(s)
Examples:
$ ls -lh ~/Documents/
$ sort -u myfile.txt
Commands
- You don’t always need options or arguments. (E.g. `$ ls ~/Documents/` and `$ ls -lh` are both valid commands that will yield output.)
- However, you always need a command.
Options (also called flags)
Start with a dash.
Usually one letter.
Multiple options can be chained together under a single dash.
$ ls -l -a -h /var/log ## This works
$ ls -lah /var/log ## So does this
An exception is with (rarer) options requiring two dashes.
$ ls --group-directories-first --human-readable /var/log
Arguments
- Tell the command what to operate on.
- Usually a file, path, or a set of files and folders.
Help: man
The `man` command (“manual pages”) is your friend if you ever need help.
Hit spacebar to scroll down a page at a time, “h” to see the help notes of the `man` command itself, and “q” to quit.
man ls
A useful feature of `man` is quick pattern searching with “/pattern”.
- Try this now by running `$ man ls` again and then typing “/human” and hitting the return key.
- To continue on to the next match, hit `n`.
Help: cheat
I also like the cheat utility, which provides a more readable summary / cheatsheet of various commands. You’ll need to install it first. (Linux and MacOS only.)
$ cheat ls
## # Displays everything in the target directory
## ls path/to/the/target/directory
##
## # Displays everything including hidden files
## ls -a
##
## # Displays all files, along with the size (with unit suffixes) and timestamp
## ls -lh
##
## # Display files, sorted by size
## ls -S
##
## # Display directories only
## ls -d */
##
## # Display directories only, include hidden
## ls -d .*/ */
Files and Directories
Listing Files and Their Properties
We’re about to go into more depth about the `ls` command.
- To do this effectively, it will be helpful if we’re all working off the same group of files and folders.
- Navigate to the directory containing these lecture notes (i.e. `03-shell`). Now list the contents of the `examples/` sub-directory with the `-lh` option (“long format”, “human readable”).
# cd PathWhereYouClonedThisRepo/lectures/03-shell ## change as needed
ls -lh examples
What does this all mean? Let’s focus on the top line.
drwxr-xr-x 2 grant users 4.0K Jan 12 22:12 ABC
- The first column denotes the object type: `d` (directory or folder), `l` (link), or `-` (file).
- Next, we see the permissions associated with the object’s three possible user types: 1) owner, 2) the owner’s group, and 3) all other users.
  - Permissions reflect `r` (read), `w` (write), or `x` (execute) access. `-` denotes missing permissions for a class of operations.
- The number of hard links to the object.
- We also see the identity of the object’s owner and their group.
- Finally, we see some descriptive elements about the object: size, date and time of last modification, and the object name.
Note: We’ll return to file permissions and ownership at the end of the lecture.
Create: touch and mkdir
One of the most common shell tasks is object creation (files, directories, etc.)
We use `mkdir` to create directories. E.g. To create a new “testing” directory:
mkdir testing
We use `touch` to create (empty) files. E.g. To add some files to our new directory:
touch testing/test1.txt testing/test2.txt testing/test3.txt
Check that it worked:
ls testing
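A flag worth knowing here is `mkdir -p`, which creates any missing parent directories in one go (plain `mkdir` fails if the intermediate folders don’t exist). A small self-contained sketch, using a throwaway `nested/` directory so we don’t disturb `testing/`:

```shell
mkdir -p nested/a/b        # creates nested/, nested/a/ and nested/a/b/ in one go
touch nested/a/b/deep.txt
ls nested/a/b              # deep.txt
rm -r nested               # clean up
```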
Remove: rm and rmdir
Let’s delete the objects that we just created. Start with one of the .txt files, by using `rm`.
- We could delete all the files at the same time, but you’ll see why I want to keep some.
rm testing/test1.txt
The equivalent command for directories is `rmdir`.
rmdir testing
Uh oh… It won’t let us delete the directory while it still has files inside of it. The solution is to use the `rm` command again with the “recursive” (`-r` or `-R`) and “force” (`-f`) options.
- Excluding the `-f` option is safer, but can trigger confirmation prompts (e.g. for write-protected files), which I’d rather avoid here.
rm -rf testing ## Success
Copy: cp
The syntax for copying is `$ cp object path/copyname`
- If you don’t provide a new name for the copied object, it will just take the old name.
- However, if there is already an object with the same name in the target destination, then you’ll have to use `-f` to force an overwrite.
## Create new "copies" sub-directory
mkdir examples/copies
## Now copy across a file (with a new name)
cp examples/reps.txt examples/copies/reps-copy.txt
## Show that we were successful
ls examples/copies
You can use `cp` to copy directories, although you’ll need the `-r` (or `-R`) flag if you want to recursively copy over everything inside of it too.
- Try this by copying over the `meals/` sub-directory to `copies/`.
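If you’d like to try the recursive-copy idea on throwaway objects first (so you can’t mangle the lecture’s `examples/` folder), here’s a minimal self-contained sketch; the `src/` and `dst/` names are made up for illustration:

```shell
mkdir -p src/sub dst       # a toy directory tree plus a destination
touch src/sub/file.txt
cp -r src dst/             # -r copies src and everything inside it
ls dst/src/sub             # file.txt came along for the ride
rm -r src dst              # clean up
```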
Move (and rename): mv
The syntax for moving is $ mv object path/newobjectname
## Move the abc.txt file and show that it worked
mv examples/ABC/abc.txt examples
ls examples/ABC ## empty
## Move it back again
mv examples/abc.txt examples/ABC
ls examples/ABC ## not empty
Note that “moving” an object within the same directory, but giving it a new name, is effectively the same as renaming it.
## Rename reps-copy to reps2 by "moving" it with a new name
mv examples/copies/reps-copy.txt examples/copies/reps2.txt
ls examples/copies
Rename en masse: rename
Speaking of renaming, a more convenient way to do this is with `rename`.
- The syntax is `rename pattern replacement file(s)`
For example, say we want to change the file type (i.e. extension) of a particular file in the `examples/meals` directory.
rename csv TXT examples/meals/monday.csv
ls examples/meals
Where `rename` really shines, however, is in conjunction with regular expressions and wildcards (more on the next slide).
- This works especially well for dealing with a whole list of files or folders.
For example, let’s change all of the file extensions in the `examples/meals` directory.
rename csv TXT examples/meals/*
ls examples/meals
Better change them back before we continue. (Confirm that this worked for yourself.)
rename TXT csv examples/meals/*
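One caveat: there are two common `rename` implementations in the wild. The util-linux version uses the literal `pattern replacement file(s)` syntax shown above, while the Perl-flavoured version (the default on Debian/Ubuntu) expects a sed-style expression instead, e.g. `rename 's/csv/TXT/' files`. If neither is installed, a portable fallback is a loop over `mv` (we’ll meet `for` loops properly later in the lecture). A self-contained sketch using a throwaway `demo/` directory:

```shell
mkdir -p demo && touch demo/a.csv demo/b.csv   # throwaway files

# "${f%.csv}" strips the .csv suffix via parameter expansion
for f in demo/*.csv; do
  mv "$f" "${f%.csv}.TXT"
done

ls demo      # a.TXT  b.TXT
rm -r demo   # clean up
```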
Wildcards
Wildcards are special characters that can be used as a replacement for other characters. The two most important ones are:
- Replace any number of characters with `*`.
  - Convenient when you want to copy, move, or delete a whole class of files.
cp examples/*.sh examples/copies ## Copy any file with an .sh extension to "copies"
rm examples/copies/* ## Delete everything in the "copies" directory
- Replace a single character with `?`.
  - Convenient when you want to discriminate between similarly named files.
ls examples/meals/??nday.csv
ls examples/meals/?onday.csv
We’ve already seen a related special character in the form of the backslash (`\`), which is used to escape spaces in file and folder names, e.g. `$ cd My\ Documents`.
Find
The last command that I want to mention w.r.t. navigation is `find`.
- This can be used to locate files and directories based on a variety of criteria, from pattern matching to object properties.
find examples -iname "monday.csv" ## will automatically do recursive
find . -iname "*.txt" ## must use "." to indicate pwd
find . -size +100k ## find files larger than 100 KB
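`find` can also act on whatever it matches via the `-exec` option: `{}` stands in for each result and `\;` terminates the command. A self-contained sketch using a throwaway `demo/` directory:

```shell
mkdir -p demo && touch demo/a.txt demo/b.txt demo/c.csv

# run 'wc -c' on every .txt file that find locates
find demo -iname "*.txt" -exec wc -c {} \;

rm -r demo   # clean up
```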
Working with Text Files
Motivation
Economists and other (data) scientists spend a lot of time working with text, including scripts, Markdown documents, and delimited text files like CSVs.
It therefore makes sense to spend a few slides showing off some Bash shell capabilities for working with text files.
- We’ll only scratch the surface, but hopefully you’ll get an idea of how powerful the shell is in the text domain.
Counting Text: wc
You can use the `wc` command to count: 1) lines of text, 2) the number of words, and 3) the number of characters.
Let’s demonstrate with a text file containing all of Shakespeare’s Sonnets.
wc examples/sonnets.txt
PS — You couldn’t tell here, but the character count is actually higher than we’d get if we (bothered) counting by hand, because `wc` counts the invisible newline character (`\n`) at the end of each line.
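You can also ask `wc` for each count individually via flags. A quick self-contained sketch on a two-line throwaway file:

```shell
printf 'one two\nthree\n' > tmp.txt

wc -l tmp.txt   # 2  (lines)
wc -w tmp.txt   # 3  (words)
wc -c tmp.txt   # 14 (characters/bytes, newlines included)

rm tmp.txt      # clean up
```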
Reading Text
Read everything: cat
The simplest way to read in text is with the `cat` (“concatenate”) command. Note that `cat` will read in all of the text. You can scroll back up in your shell window, but this can still be a pain.
Again, let’s demonstrate using Shakespeare’s Sonnets. (This will overflow the slide.)
- I’m also going to use the `-n` flag because I want to show line numbers.
cat -n examples/sonnets.txt
Scroll: more and less
The `more` and `less` commands provide extra functionality over `cat`. For example, they allow you to move through long text one page at a time.
- Try this yourself with `$ more examples/sonnets.txt`
- You can move forward and back using the “f” and “b” keys, and quit by hitting “q”.
Preview: head and tail
The `head` and `tail` commands let you limit yourself to a preview of the text, down to a specified number of rows. (The default is 10 rows if you don’t specify a number.)
head -n 3 examples/sonnets.txt ## First 3 rows
# head examples/sonnets.txt ## First 10 rows (default)
`tail` works very similarly to `head`, but starting from the bottom. For example, we can see the very last row of a file as follows:
tail -n 1 examples/sonnets.txt ## Last row
However, there’s one other neat option that I want to show you. By using the `-n +N` option, we can specify that we want to preview all lines starting from row N and after. E.g.
tail -n +3024 examples/sonnets.txt ## Show everything from line 3024
Find Patterns: grep
To find patterns in text, we can use regular expression-type matching with `grep`.
For example, say we want to find the famous opening line to Shakespeare’s Sonnet 18.
- I’m going to include the `-n` (“number”) flag to get the line that it occurs on.
grep -n "Shall I compare thee" examples/sonnets.txt
By default, `grep` returns all matching patterns.
- What happens if you run `$ grep -n "summer" examples/sonnets.txt`?
- Or, for that matter, `$ grep -n "the" examples/sonnets.txt`?
Note that `grep` can be used to identify patterns in a group of files (e.g. within a directory) too.
- This is particularly useful if you are trying to identify files that contain, say, a particular function name.
Here’s a simple example: Which days will I eat pasta this week?
- I’m using the `-R` (recursive) and `-l` (just list the files; don’t print the output) flags.
grep -Rl "pasta" examples/meals
What about muesli? And pizza?
Take a look at the `grep` man or cheat file for other useful examples and flags (e.g. `-i` to ignore case).
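Two more flags that come in handy (including for the exercises at the end of this lecture): `-c` counts matching lines, while `-o` prints each match on its own line, so piping it to `wc -l` counts total occurrences instead. A self-contained sketch on a throwaway file:

```shell
printf 'love is love\nhate\nlove\n' > tmp.txt

grep -c "love" tmp.txt           # 2 (lines containing at least one match)
grep -o "love" tmp.txt | wc -l   # 3 (total occurrences)

rm tmp.txt   # clean up
```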
PS — Another cool (and very fast) shell utility along these lines is the silver searcher. Check it out.
Manipulate Text: sed and awk
There are two main commands for manipulating text in the shell, namely `sed` and `awk`.
- Both of these are very powerful and flexible (`awk` is particularly good with CSVs).
I’m going to show two basic examples without going into depth, but I strongly encourage you to explore more on your own.
Example 1. Replace one text pattern with another.
cat examples/nursery.txt
Now, change “Jack” to “Bill”.
sed -i 's/Jack/Bill/g' examples/nursery.txt
cat examples/nursery.txt
Explanation of Example 1:
The `sed` command breaks down as follows:
- `sed` - The stream editor command
- `-i` - Edit the file “in-place” (modify the original file directly)
- `'s/Jack/Bill/g'` - The substitution command:
  - `s` = substitute command
  - `/Jack/` = pattern to find (the text “Jack”)
  - `/Bill/` = replacement text (“Bill”)
  - `g` = global flag (replace ALL occurrences, not just the first one)
- `examples/nursery.txt` - The target file
What it does:
- Before: The file contains “Jack and Jill” and “Went up the hill”
- After: The file contains “Bill and Jill” and “Went up the hill”
- The `-i` flag means the original file is permanently modified
The `cat` commands show the file contents before and after the change, demonstrating the before/after effect of the `sed` command.
Example 2. Find and count the 10 most commonly used words in Shakespeare’s Sonnets.
- Note: We’ll learn more about the pipe operator (`|`) in a few slides.
sed -e 's/\s/\n/g' < examples/sonnets.txt | sort | uniq -c | sort -nr | head -10
Explanation of Example 2:
This is a complex pipeline that finds the 10 most common words. Let me break it down step by step:
Step 1: `sed -e 's/\s/\n/g' < examples/sonnets.txt`
- `sed -e` - Execute the following expression
- `'s/\s/\n/g'` - Replace all whitespace characters (`\s`) with newlines (`\n`)
- `< examples/sonnets.txt` - Read from the sonnets file
- Result: Converts the entire text into one word per line

Step 2: `| sort`
- `|` - Pipe operator (passes output from previous command to next)
- `sort` - Alphabetically sorts all the words
- Result: All identical words are now grouped together

Step 3: `| uniq -c`
- `uniq -c` - Count unique consecutive lines
- Result: Shows each unique word with its count (e.g., “5 the”, “3 and”)

Step 4: `| sort -nr`
- `sort -nr` - Sort numerically (`-n`) in reverse order (`-r`)
- Result: Words sorted by frequency (highest count first)

Step 5: `| head -10`
- `head -10` - Show only the first 10 lines
- Result: The 10 most frequently used words
What the pipeline accomplishes:
1. Tokenization: Breaks the text into individual words
2. Grouping: Groups identical words together
3. Counting: Counts how many times each word appears
4. Ranking: Orders words by frequency
5. Filtering: Shows only the top 10
This is a powerful example of Unix philosophy: combining simple tools with pipes to create complex text processing workflows. Each command does one thing well, and they work together to solve a complex problem.
PS — You can also use double quotes (`"`) instead of single ones (`'`) for `sed` and `awk` commands. This can sometimes run you into trouble with special symbols or patterns in the text, though.
Text Processing with awk
`awk` is a powerful programming language designed for text processing and data extraction. It’s particularly useful for working with structured data like CSV files.
Example 3. Process meal data from CSV files.
First, let’s look at the data structure:
cat examples/meals/monday.csv
cat examples/meals/friday.csv
Now, let’s extract specific information using `awk`:
# Extract just the breakfast column from all meal files
awk -F',' 'NR>1 {print $2}' examples/meals/*.csv
# Find all meals that contain "pasta"
awk -F',' 'NR>1 && /pasta/ {print $1, $4}' examples/meals/*.csv
# Count total number of meal entries
awk -F',' 'NR>1 {count++} END {print "Total meals:", count}' examples/meals/*.csv
Explanation of Example 3:
Basic `awk` Structure
`awk` processes text line by line and can execute code for each line. The basic syntax is:
awk 'pattern { action }' filename
Command Breakdown:
Command 1: `awk -F',' 'NR>1 {print $2}' examples/meals/*.csv`
- `-F','` - Set field separator to comma (for CSV files)
- `NR>1` - Pattern: process only lines where the record number (NR) is greater than 1 (i.e. skip the header)
- `{print $2}` - Action: print the second field (breakfast column)
- `examples/meals/*.csv` - Process all CSV files in the meals directory
- Result: Lists all breakfast items from all days

Command 2: `awk -F',' 'NR>1 && /pasta/ {print $1, $4}' examples/meals/*.csv`
- `NR>1 && /pasta/` - Pattern: skip the header AND find lines containing “pasta”
- `{print $1, $4}` - Action: print day (field 1) and dinner (field 4)
- Result: Shows which days have pasta for dinner

Command 3: `awk -F',' 'NR>1 {count++} END {print "Total meals:", count}' examples/meals/*.csv`
- `NR>1 {count++}` - For each data line, increment a counter
- `END {print "Total meals:", count}` - After processing all lines, print the total
- Result: Counts total number of meal entries across all files
Key `awk` Concepts:
- `NR` - Number of Records (the current line number)
- `$1, $2, $3...` - Field variables (first, second, third column, etc.)
- `/pattern/` - Pattern matching (like grep)
- `&&` - Logical AND operator
- `END` - Special pattern that executes after all lines are processed
- `BEGIN` - Special pattern that executes before processing any lines
`awk` is incredibly powerful for data analysis and can handle complex calculations, conditional logic, and data transformations that would be difficult with other shell tools alone.
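To make the `END` pattern concrete, here’s a small self-contained sketch that sums a numeric column; the tiny CSV is invented on the spot purely for illustration:

```shell
# a made-up CSV with a header row
printf 'item,price\napples,3\nbread,2\nmilk,4\n' > tmp.csv

# skip the header, accumulate field 2, print the total at the end
awk -F',' 'NR>1 {total += $2} END {print "Total:", total}' tmp.csv   # Total: 9

rm tmp.csv   # clean up
```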
Sorting and Removing Duplicates: sort
We can remove duplicate lines in various ways in Bash, but I’ll demonstrate using `sort`.
cat examples/reps.txt
There’s a fair bit of repetition in this file (and a double entendre). Let’s fix that.
- Note the use of the `-u` (“unique”) flag to remove duplicates. I’ll also add a `-r` (“reverse”) flag, but only because `sort` orders alphabetically and this makes less sense for this simple example.
sort -ur examples/reps.txt
Redirecting and Pipes
Redirect: >
You can send output from the shell to a file using the redirect operator `>`.
For example, let’s print a message to the shell using the `echo` command.
echo "At first, I was afraid, I was petrified"
If you wanted to save this output to a file, you need simply redirect it to the filename of choice.
echo "At first, I was afraid, I was petrified" > survive.txt
find survive.txt ## Show that it now exists
If you want to append text to an existing file, then you should use `>>`.
- Using `>` will overwrite the existing file contents.
echo "'Kept thinking I could never live without you by my side" >> survive.txt
cat survive.txt
(Don’t be shy. You can hum the rest of the song to yourself now.)
Aside: I often use this sequence when adding files to my .gitignore. E.g. `$ echo "*.csv" >> .gitignore`.
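One subtlety worth flagging: `>` and `>>` only capture a command’s normal output (stdout). Error messages travel on a separate stream, stderr (file descriptor 2), which needs its own redirect. A self-contained sketch:

```shell
# stdout goes to out.txt; the error message goes to err.txt
# (the '|| true' just stops the failing ls from aborting a strict script)
ls no-such-file > out.txt 2> err.txt || true

cat err.txt        # the 'No such file or directory' complaint

# 2>&1 sends stderr to wherever stdout is pointing (both streams, one file)
ls no-such-file > all.txt 2>&1 || true

rm -f out.txt err.txt all.txt   # clean up
```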
Pipes: |
The pipe operator `|` is one of the coolest features in Bash.
- It lets you send (i.e. “pipe”) intermediate output to another command.
- In other words, it allows us to chain together a sequence of simple operations and thereby implement a more complex operation. (Remember the Unix philosophy!)
Let me demonstrate using a very simple example.
cat -n examples/sonnets.txt | head -n100 | tail -n10
An exercise: Say I want to pull out all of the text from (but limited to) Sonnet 18.
- How might you go about this task using the pipe and other Bash commands?
- Tip: Use your knowledge of the starting line (i.e. 336) and the fact that sonnets are 14 lines long.
tail -n +336 examples/sonnets.txt | head -n14
A final aside about pipes and friends: You can use them to search through your Bash command history.
- Every shell command you type is stored in a `~/.bash_history` file.
What happens if you type `$ cat ~/.bash_history | grep head`?
FWIW, I use this approach often to remind myself of certain shell commands that I tend to forget. A related and extremely useful command is `Ctrl`+`R`, which lets you search and cycle through your shell history.
Iteration (for loops)
for loop syntax
for loops in Bash work similarly to other programming languages that you are probably familiar with.
The basic syntax is
for i in LIST
do
OPERATION $i ## the $ sign indicates a variable in bash
done
We can also condense things into a single line by using “;” appropriately.
for i in LIST; do OPERATION $i; done
I find the top approach more readable, but I may use the single-line approach in these slides to save vertical space.
- Note: Using “;” isn’t limited to for loops. Semicolons are a standard way to denote line endings in Bash.
Example 1: Print a sequence of numbers
To help make things concrete, here’s a simple for loop in action.
for i in 1 2 3 4 5; do echo $i; done
FWIW, we can use bash’s brace expansion (`{1..n}`) to save us from having to write out a long sequence of numbers.
for i in {1..5}; do echo $i; done
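Brace expansion does more than plain number ranges. Letters work, step sizes work (in bash 4 and later), and the braces can sit inside a longer string, which is handy for generating file names:

```shell
echo {a..e}          # a b c d e
echo {0..10..2}      # 0 2 4 6 8 10  (step size needs bash 4+)
echo file{1..3}.txt  # file1.txt file2.txt file3.txt
```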
Example 2: Combine CSVs
Here’s a more realistic for loop use-case that I use quite often: Combining (i.e. concatenating) multiple CSVs.
Say we want to combine all the “daily” files in the `examples/meals` directory into a single CSV, which I’ll call `mealplan.csv`. Here’s one attempt that incorporates various bash commands and tricks that we’ve learned so far. The basic idea is:
1. Create a new (empty) CSV
2. Then, loop over the relevant input files, appending their contents to our new CSV
## create an empty CSV
$ touch examples/meals/mealplan.csv
## loop over the input files and append their contents to our new CSV
$ for i in $(ls examples/meals/*day.csv)
> do
> cat $i >> examples/meals/mealplan.csv
> done
Did it work? Let’s check:
cat examples/meals/mealplan.csv
Hmmm. Sort of, but we need to get rid of the repeating header.
Can you think of a way?
- Answer: Use `tail` and `head`…
Let’s try again. First delete the old file so we can start afresh.
rm -f examples/meals/mealplan.csv ## delete old file
Here’s our adapted gameplan:
- First, create the new file by grabbing the header (i.e. top line) from any of the input files and redirecting it. No need for `touch` this time.
- Next, loop over all the input files as before, but this time only append everything after the top line.
## create a new CSV by redirecting the top line of any file
$ head -1 examples/meals/monday.csv > examples/meals/mealplan.csv
## loop over the input files, appending everything after the top line
$ for i in $(ls examples/meals/*day.csv)
> do
> tail -n +2 $i >> examples/meals/mealplan.csv
> done
It worked!
cat examples/meals/mealplan.csv
We still have to sort into the correct week order, but that’s an easy job in R (or Stata, Python, etc.)
- The explicit benefit of doing the concatenating in the shell is that it is much more efficient, since all the files don’t have to be held in memory (i.e. RAM) simultaneously.
- This doesn’t matter here, but it can make a dramatic difference once we start working with lots of files (or even a few really big ones). We’ll revisit this idea later in the big data section of the course.
Scripting
Hello World!
Writing code and commands interactively in the shell makes a lot of sense when you are exploring data, file structures, etc.
However, it’s also possible (and often desirable) to write reproducible shell scripts that combine a sequence of commands.
- These scripts are demarcated by their `.sh` file extension.
Let’s look at the contents of a short shell script that I’ve included in the examples folder.
cat examples/hello.sh
I’m sure that you already have a good idea of what this script is meant to do, but it will prove useful to quickly go through some things together.
#!/bin/sh
echo -e "\nHello World!\n"
- `#!/bin/sh` is a shebang, telling the system which program should be used to run the script (here: the default Bourne-compatible shell). The shell itself treats the line as a comment (note that it begins with the hash character).
- `echo -e "\nHello World!\n"` is the actual command that we want to run. The `-e` flag tells `echo` to interpret backslash escapes like `\n` (newline) rather than printing them literally.
To run this simple script, you can just type in the file name and press enter.
examples/hello.sh
# bash examples/hello.sh ## Also works
Rscript
It’s important to realise that we aren’t limited to running shell scripts in the shell. The exact same principles carry over to other programs and files.
The most relevant case for this class is the `Rscript` command for (you guessed it) executing R scripts and expressions. For example:
Rscript -e "cat('Hello World, from R!')"
Of course, the more typical `Rscript` use case is to execute full-length R scripts. An optional, but very useful feature here is the ability to pass extra arguments from the shell to your R script. Consider the `hello.R` script that I’ve bundled in the examples folder.
cat examples/hello.R
The key step for using additional `Rscript` arguments is held within the top two lines.
args = commandArgs(trailingOnly = TRUE)
i = args[1]; j = args[2]
These tell Rscript to capture any trailing arguments (i.e. after the file name) and then pass them on as objects that can be used within R.
Let’s run the script to see it in action.
Rscript examples/hello.R 12 9
Again, including trailing arguments is entirely optional. You could run `Rscript myfile.R` without any problems. But it often proves very useful for the type of work that you’d likely be using `Rscript` for (e.g. batching big jobs).
Editing and Writing Scripts in the Shell
Say you want to edit my (amazing) `hello.sh` script.
- Maybe you want to add some additional lines of text, or maybe you’re bothered by the fact that there should be a comma after “Hello”. (It’s a salutation!)
We have already seen how to append text lines to a file. But when it comes to more complicated editing work, you’re better off using a dedicated shell editor.
- I use vim. Extremely powerful, but a steep learning curve.
- An easier starting point is nano.
Open up my script in nano by typing `$ nano examples/hello.sh`.
- Note that the functionality is more limited than a normal text editor.
- Once you are finished editing, hit “Ctrl+X”, then “y” and enter to exit.
- Finally, run the edited version of the script.
If you’ve been having trouble executing this script (or want to limit who else can execute it), then you need to alter its permissions. Which takes us neatly on to our final section…
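As a quick preview of the fix: the `x` (execute) permission bit is what lets you run a script by name, and `chmod` toggles it. A self-contained sketch with a throwaway script (your exact `ls -l` output will vary with your umask):

```shell
printf '#!/bin/sh\necho "hi"\n' > myscript.sh

ls -l myscript.sh        # typically -rw-r--r-- : no execute bits yet
chmod u+x myscript.sh    # add execute permission for the owner
ls -l myscript.sh        # now -rwxr--r-- (or similar)

./myscript.sh            # prints: hi
rm myscript.sh           # clean up
```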
User Roles and File Permissions
Disclaimer
This next section is tailored towards Unix-based operating systems, like Linux or MacOS. Windows users: Don’t be surprised if some commands don’t work, especially if you haven’t installed the WSL…
Regardless, the things we learn here will become relevant to everyone (even Windows users) once we start interacting with Linux servers, spinning up Docker containers, etc. later in the course.
The superuser: root, sudo, etc.
There are two main user roles on a Linux system:
1. Normal users
2. A superuser (AKA “root”)
The difference is one of privilege.
- Superusers can make system changes, install software, browse through different users’ home folders, etc. Normal users are much more restricted in what they can do.
- This explains why Unix-based OSs are much more resilient to security threats like viruses: you need superuser privileges to install (potentially malicious) software.
You can log in as the superuser… but this is generally considered very poor practice, since you needlessly risk messing up your system.
- There are no safety checks and no “undo” options.
Question: How, then, can normal users perform meaningful system operations (including installing new programs and updating software)?
Answer: Invoke temporary superuser status with `sudo`.
- Stands for “superuser do”.
- Simply prepend `sudo` to whatever command you want to run.
grant@laptop:~$ ls /root ## fails
grant@laptop:~$ sudo ls /root ## works
Changing Permissions and Ownership
Let’s think back to the ABC/
directory that we saw previously while exploring the ls command.
drwxr-xr-x 2 grant users 4.0K Jan 12 22:12 ABC
We can change the permissions and ownership of this folder with the `chmod` and `chown` commands, respectively. We’ll now review these in turn.
- Note that I’m going to use the “recursive” option (i.e. `-R`) in the examples that follow, but only because `ABC/` is a directory. You can drop that when modifying individual files.
chmod
Changing permissions using `chmod` depends on how those permissions are represented.
There are two options: 1) Octal notation and 2) Symbolic notation.
Example 1: rwxrwxrwx. Read, write and execute permission for all users.
- Octal: `$ chmod -R 777 ABC`
- Symbolic: `$ chmod -R a=rwx ABC`
Example 2: rwxr-xr-x. Read, write and execute permission for the main user (i.e. owner) of the file. For all other users, read and execute permission only.
- Octal: `$ chmod -R 755 ABC`
- Symbolic: `$ chmod -R u=rwx,g=rx,o=rx ABC`
Octal notation
Takes advantage of the fact that `4` (for “read”), `2` (for “write”), and `1` (for “execute”) can be combined in unambiguous ways.
- 7 (= 4 + 2 + 1) means read, write and execute permission.
- 5 (= 4 + 0 + 1) means read and execute permission, but not write permission.
- etc.
- Note that Octal notation requires a number for each of the three user types: owner, owner’s group, and all others. E.g. `$ chmod 777 myfile.txt`
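If you want to check your arithmetic, GNU `stat` (i.e. on Linux) can print a file’s permissions back in octal. (macOS uses a different flag syntax for its `stat`, so treat this as the Linux variant.) A self-contained sketch:

```shell
touch tmp.txt
chmod 644 tmp.txt           # rw-r--r--
stat -c "%a %n" tmp.txt     # 644 tmp.txt  (GNU/Linux syntax)
rm tmp.txt                  # clean up
```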
Symbolic notation
Links permissions to different symbols (i.e. abbreviations).
- Users: `u` (“User/owner”), `g` (“Group”), `o` (“Others”), `a` (“All”)
- Permissions: `r` (“read”), `w` (“write”), `x` (“execute”)
- Changes: `+` (“add permissions”), `-` (“remove permissions”), `=` (“set new permissions”)
Here’s a quick comparison table with some common permission levels.
Octal value | Symbolic value | Permission level |
---|---|---|
777 | a+rwx | rwxrwxrwx |
770 | u+rwx,g+rwx,o-rwx | rwxrwx--- |
755 | a+rwx,g-w,o-w | rwxr-xr-x |
700 | u+rwx,g-rwx,o-rwx | rwx------ |
644 | u=rw,g=r,o=r | rw-r--r-- |
PS — Note the Symbolic method allows for relative changes, which means that you don’t necessarily need to write out the whole entry in the table above. E.g. To go from the first line to the second line, you’d only need `$ chmod o-rwx myfile`.
chown
Changing file ownership is somewhat easier than changing permissions, because you don’t have to remember the different Octal and Symbolic notation mappings.
- E.g. Say there is another user on your computer called “alice”. You could assign her ownership of the ABC subfolder using: `$ chown -R alice ABC`
Things get a little more interesting when we want to add new users and groups, or change an existing user’s group.
- I’ll save that for a later lecture on cloud servers, though.
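A short sketch of inspecting and (in principle) changing ownership. Note that reassigning ownership requires root privileges and an existing “alice” account, so that command is shown commented out:

```shell
# Create a throwaway ABC folder and check who currently owns it.
mkdir -p ABC
stat -c '%U %G' ABC   # prints the current owner and group (you)

# Reassigning ownership needs sudo and an existing "alice" user,
# so it's illustrated here but not run:
# sudo chown -R alice ABC

rm -r ABC
```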
Next Steps
Things we didn’t cover today
I know we covered a lot of ground today. I hope that I’ve given you a sense of how Bash works and how powerful it is.
- My main goal has been to “demystify” the shell, so that you aren’t intimidated when we use shell commands later on.

At the same time, there’s loads that we didn’t cover.
- Environment variables, SSH, memory management (e.g. top and htop), GNU parallel, etc.
- We’ll get to some of these topics in later lectures, but please try to work through some of the suggested exercises on the next slide and make use of the recommended readings.
Exercises
Navigate to the examples/ sub-directory associated with this lecture. I want you to play around with the contents using some of the different Bash commands we practiced today.
- Change the permissions on an individual file or a whole directory.
- Read in (or fix) some lines of text from one file and pipe them to another file.
- Count the number of times Shakespeare refers to “mistress” or “love” in his Sonnets.
- Write a new bash script and execute it.
- Etc.
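For the counting exercise, one possible approach combines `grep` and `wc`. Here we fake a tiny sonnets.txt for illustration; in the actual exercise you’d point these commands at the real file in examples/ (whose exact name may differ):

```shell
# Stand-in for the real Sonnets file (illustration only).
printf 'My mistress eyes are nothing like the sun\nLove is not love\n' > sonnets.txt

# -o: one match per output line, -i: ignore case, -w: whole words only.
grep -oiw 'love' sonnets.txt | wc -l      # prints: 2
grep -oiw 'mistress' sonnets.txt | wc -l  # prints: 1

rm sonnets.txt
```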
Further Reading
- The Unix Shell (Software Carpentry)
- The Unix Workbench (Sean Kross)
- Data Science at the Command Line (Jeroen Janssens)
- Using AWK and R to parse 25tb (Nick Strayer)
- The Missing Semester of Your CS Education (MIT)
Appendix (Windows users only)
Bash on Windows
Windows users have two options:
1. Git Bash
- Pros: You should already have installed this as part of the previous lecture.
- Cons: Functionality is limited to Git-related commands, so various things that we’re going to practice today won’t work.
2. Windows Subsystem for Linux (WSL)
- Pros: A self-contained Linux image (terminal) that allows full Bash functionality.
- Cons: Must be installed first and only available to Windows 10 users.
I’m going to go out on a limb and recommend option 2 (WSL) if available to you. It’s more overhead, but I think worth it.
WSL
The basic WSL installation guide is here.
- Follow the guide to install your preferred Linux distro. Ubuntu is a good choice.
- Then, once you’ve restarted your PC, come back to these slides.
After installing your chosen WSL, you need to navigate to today’s lecture directory to run the examples. You have two options:
Option (i) Access WSL through RStudio (recommended)
If you access WSL through RStudio, then it will conveniently configure your path to the present working directory. So, here’s how to make WSL your default RStudio terminal:
- In RStudio, navigate to: Tools > Terminal > Terminal Options…
- Click on the dropdown menu for New terminals open with and select “Bash (Windows Subsystem for Linux)”, then click OK.
- Refresh your RStudio terminal (Alt+Shift+R).
- You should see the WSL Bash environment with the path automatically configured to the present working directory, mount point and all.
Option (ii) Go directly through the WSL
Presumably, you’ve cloned the course repo somewhere on your C drive.
- The way this works is that Windows drives are mounted under the WSL’s /mnt directory.
- Say you cloned the repo to “C:\Users\Grant\ec607\lectures”.
- The WSL equivalent is “/mnt/c/Users/Grant/ec607/lectures”.
- So you can navigate to today’s lecture through the WSL with: $ cd /mnt/c/Users/Grant/ec607/lectures/03-shell. Adjust as needed.
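The path translation above can be sketched with plain shell tools, as below. (WSL also ships a `wslpath` utility that does this conversion for you, but the `sed` version runs anywhere.)

```shell
# Example Windows path from the text above.
winpath='C:\Users\Grant\ec607\lectures'

# 1. Flip backslashes to forward slashes; 2. rewrite the "C:" drive prefix
#    as its WSL mount point "/mnt/c".
wsl_equiv=$(printf '%s\n' "$winpath" | sed -e 's|\\|/|g' -e 's|^C:|/mnt/c|')

echo "$wsl_equiv"   # prints: /mnt/c/Users/Grant/ec607/lectures
```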
Which option to choose?
Both are fine, but I recommend option (i). As a Windows user, being able to access a true Bash shell (i.e. terminal) conveniently from RStudio will make things much easier for you in my class. You can always switch to a different shell later if you want.
Footnotes
Truth be told, there are some subtle and sometimes important differences, as well as some interesting history behind the names. But we can safely ignore these here.↩︎