Bash

Connecting to the cluster (Day 1)

The command ssh lets you connect to the cluster. You will need VPN active to be able to connect if you are not using the university LAN or “eduroam” WLAN (when you are working from home).

ssh [username]@corso.came.sbg.ac.at
# For example:
ssh nfortelny@corso.came.sbg.ac.at
Tip

You will see a number of different commands over the next days. The key commands are listed at the bottom of this page.

Next you will change your password. Make sure to remember this password, which you will you throughout this course.

passwd

Now we are connected to the cluster. In the following we will explore this file system. Details will be explored in the next lecture of this course. For now, just execute the commands and see if you can make sense of the results.

The cluster has a file system like any computer. Currently you are in your home directory. The current “path” is shown by the following command:

pwd

You can list the content of your home directory using

ls
ls -l
Tip

Do not confuse the lowercase letter l (as used in ls -l) with the number 1 (“one”).

You can clear the terminal by typing

clear

Now, use the up- and down-arrows on your keyboard to look through the history of commands.

Now create a directory:

mkdir day1

List content of the directory again:

ls
ls -l                   # This shows you the home directory (and the day1 directory that is within the home directory)
ls -l day1/     # This shows you the content of the day1 directory
Tip

Note: folder and directory mean the same thing. We will use directory from now on.

What does mkdir do? Bring up the manual (press q to exit the manual).

man mkdir

Now create some files. There are many ways to create a file.

# Create an empty file in directory "day1"
touch day1/this_file_is_empty.txt

# Write the text "abcdefgh" in a file.
echo "abcdefgh"  > day1/letters.txt

List files again.

ls
ls -l
ls -l day1/

Take a look at the file content.

head day1/*

Files and file systems (Day 2)

Getting files

Make sure you are in your home directory.

pwd
cd ~/
# same as "cd " or "cd ~", since the default value
# for the dir argument is HOME (see "man cd")

pwd

Obtain the list of gRNAs.

# look at the man pages of the following commands – what is their purpose?
man mv
man cp

# now try it
mv /resources/bash/gRNAs.txt ~/  # this will not work
cp /resources/bash/gRNAs.txt ~/  # this works

# Why does mv not work? Look at file permissions:
ls -l /resources/bash/gRNAs.txt
ls -l ~/gRNAs.txt

Now explore the copied file.

head gRNAs.txt
head -5 gRNAs.txt
tail -5 gRNAs.txt
less gRNAs.txt
more gRNAs.txt

# Count the number of lines
wc -l gRNAs.txt
Tip

You may be used to a workflow where you first copy a file (e.g., Ctrl+C), then go to the destination directory, and paste it there (e.g., Ctrl+V). By contrast, the copy command cp does both the copying and pasting.

Exercises

Copy the file gRNAs.txt from your home directory into the directory day2. Then rename the copied file in this directory to gRNAs_exercise.txt.

Editing in nano

Now add a line at the end of the file gRNAs.txt that is located in your home directory, adding the gRNA sequence “ACTGACTG”. Use the nano editor for this purpose. To quit the nano-editor you need to press Ctrl+X. Then type y+Enter to save the changes. Nano commands are shown in the editor and can be found on the internet. A list is provided below.

Nano commands:

Command Function
ctrl+r read/insert file
ctrl+o save file
ctrl+x close file
alt+a start selecting text
ctrl+k cut selection
ctrl+u uncut (paste) selection
alt+/ go to end of the file
ctrl+a go to start of the line
ctrl+e go to end of the line
ctrl+c show line number
ctrl+_ go to line number
ctrl+w find matching word
alt+w find next match
ctrl+\ find and replace
nano gRNAs.txt

Now use nano to modify the shell to make things prettier. To do so change the file .bash_profile. This file contains settings for each user (the naming is just by convention). It starts with a ., which for Linux means the file is hidden.

nano ~/.bash_profile
# This creates the file and also opens it in nano

Add the following lines to .bash_profile in nano, then exit the file and save it - see the commands above.

Tip

In Windows PowerShell, copy/paste works best if you enable the option to do so via Ctrl+Shift+V, as shown below.

if [ -x /usr/bin/dircolors ]; then
    test -r ~/.dircolors && eval "$(dircolors -b ~/.dircolors)" || eval "$(dircolors -b)"
    alias ls='ls --color=auto'
    alias grep='grep --color=auto'
fi

The changes we added to .bash_profile will come into effect next time you log in. To also activate them for your current login, you can source the file, executing the commands stored within.

ls -l *
source ~/.bash_profile
ls -l *

Also, notice the difference in ls commands to show hidden files (like .bash_profile).

ls -l
ls -al
Tip

If your grade sheet does not show that the content of .bash_profile is correct but it still works, then leave it. There may be a small difference in between your version and the expected one that does not impact the functionality.

Zipped files

Now let’s download all human gene sequences from Ensembl. Download the file to your home directory.

man wget
wget http://ftp.ensembl.org/pub/release-103/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz

If the above fails, then you can also copy the file from the resources directory into you home directory.

cp /resources/bash/Homo_sapiens.GRCh38.cds.all.fa.gz ~/

To make sure you have the entire file properly downloaded, compare the MD5 hash of the file. MD5 hash functions are a compact digital fingerprint of a file. The MD5 hash of the file should be b16d46bf09c3b8b7909624f1e6c414ce.

md5sum ~/Homo_sapiens.GRCh38.cds.all.fa.gz
md5sum /resources/bash/Homo_sapiens.GRCh38.cds.all.fa.gz

Use the du command to determine the file size.

du Homo_sapiens.GRCh38.cds.all.fa.gz

# the -h argument displays the file size in a human-readable format
du -h Homo_sapiens.GRCh38.cds.all.fa.gz

Have a look at this file.

head Homo_sapiens.GRCh38.cds.all.fa.gz

This doesn’t look great. Remember to clean up your terminal.

clear

The above file is zipped. Now unzip it.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz
# This command will run through the entire file which is very long.
# Press Ctrl+C to stop the command.

man gunzip
# -c --stdout --to-stdout
#   Write output on standard output; keep original files unchanged.
#   If there are several input files, the output consists of a sequence 
#   of independently compressed members. To obtain better compression,
#   concatenate all input files before compressing them.

Again, remember to clean up your terminal.

clear

Can we use head on the unzipped output? Yes - this is done using a pipe.

Pipes

Linux pipes enables you to pass the output of one command to another command.

Pipe command Function
cmd < file use file as input for command cmd
cmd > file write output to file
cmd >> file append output to file
cmd 2> stderr write error output to file
cmd &> file send output and error to file
cmd1 | cmd2 send output of cmd1 to cmd2

Let’s have a look at the first few lines of this file.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head

Some programs let you look at decompressed output, for example

zless Homo_sapiens.GRCh38.cds.all.fa.gz

# very similar to:
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | zless

Now we can also count the number of lines in this file:

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | wc -l

Exercises

Place the following files into the directory ~/day2, using pipes:

  • Store the number of lines of Homo_sapiens.GRCh38.cds.all.fa.gz into the file lineNumber.txt.
  • Write the first 15 lines of Homo_sapiens.GRCh38.cds.all.fa.gz into the file lines1.txt.
  • Write the 31th to 35th line of Homo_sapiens.GRCh38.cds.all.fa.gz into the file lines2.txt.
  • Store the size of Homo_sapiens.GRCh38.cds.all.fa.gz in Megabytes into the file size.txt.

Patterns and regular expressions (Day 3)

File pattern matches

Commands can be executed on multiple files at the same time using pattern matches (we have used this already above).

ls -l
ls -l *
ls -l day*
wc -l *
wc -l *.gz
Description Pattern
Match zero or more characters *
Match any single character ?
Match any of the characters in a set [...]
Match zero or one occurrences of the patterns (extglob) ?(patterns)
Match zero or more occurrences of the patterns (extglob) *(patterns)
Match one or more occurrences of the patterns (extglob) +(patterns)
Match one occurrence of the patterns (extglob) @(patterns)
Match anything that doesn’t match one of the patterns (extglob) !(patterns)

Simple regular expressions

Regular expressions can be executed on file names but also content within files. Below is the example from the PDF presented during the lecture.

Tip

The file sample needs to be copied to your home directory from the directory /resources/bash.

cat sample
grep ^a sample
grep -E p\{2} sample
grep "a\+t" sample
sed "s/a/XXX/g" sample
Descriptions Symbol
replaces any character .
matches start of string ^
matches end of string $
matches up zero or more times the preceding character *
Represent special characters \
Groups regular expressions ()
Matches up exactly one character ?

Checking the nucleotides in the Ensembl fasta file

Now let’s use regulare expressions to explore the fasta file. Have a look at the first few lines.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -50

We expect all DNA sequences to be made up of A, C, T, and Gs. Now, we will verify this. First, we will get all DNA-sequences from this file, excluding lines starting with a > (in a FASTA files, these lines are the lines describing the sequence; see https://en.wikipedia.org/wiki/FASTA_format).

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">"

# from "man grep":
# -v Invert the sense of matching, to select non-matching lines.

Next we will use tr to delete newline characters. This will place all sequences in one (very long) line.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n'

Next we place every character on a different line.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o .

Finally, we want to remove repetition, to only see the unique nucleotides. To do so, we have to first sort the lines and then make them unique.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | sort | uniq

# what happens if we only make them unique?
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | head -20
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | head -20 | uniq
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | uniq | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | grep -v ">" | tr -d '\n' | grep -o . | sort | uniq | wc -l

Exercises

  • Create a directory day3 in your home directory.
  • Count the number of A, C, T, and G of the DNA sequences of the first 1000 lines of file Homo_sapiens.GRCh38.cds.all.fa.gz.
  • Store each number in a file called nt_A.txt, nt_C.txt,… in the directory day3.

Reformatting the Ensembl fasta file

Next, we will reformat the fasta file, such that every entry is on one line and the DNA sequence is at the end of the line.

Remind ourselves of how this file looks like:

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500

In the original format, the > separates different entries and sequences are in the lines below the line with >. First, we will remove newline characters that separate the sequences. To remove the newline characters, we will use the sed command.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g'

Let breack this up:

  • s/// tells sed to substitute.
  • s///g tells sed to substitue globally - replacing each occurance of x.
  • s/x/y/g means we substitute x by y.
  • ([ACTG]) matches any of the four letters A, C, T, and G. The brackets () tell the regex to store the match for later (see \1 and \2 below). If we want to use this storing we need to use sed -E.
  • \n matches the newline.
  • \1 is going to be replaced by the first match within the first brackets (). Therefore, here it will be replaced by the nucleotide before the new line.
  • \2 is going to be replaced by the second match within the second brackets (). Therefore, here it will be replaced by the nucleotide after the new line.
  • So ultimately the nulceotide before and after the newline are stored, and then they are written again but without the newline.

Next, we will remove the newline character that separates the sequence from the line with the entry information, which starts with >, and replace it with seq:.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | head -500 | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g' | sed -z 's/\n[^>]/ seq:/g'

Explanation:

  • [^>] means to NOT match >.
  • \n[^>] replaces all newline characters that are NOT before a >.

We developed the above approach on the first lines. Now run it an the full file. This may take a couple of minutes. The gzip command will compress the results. The > will store it in a new file GRCh38_reformatted.gz. The reformatted file should end up in your home directory.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | sed -zE 's/([ACTG])\n([ACTG])/\1\2/g' | sed -z 's/\n[^>]/ seq:/g' | gzip > GRCh38_reformatted.gz

Explore the generated file.

gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | wc -l
gunzip -c Homo_sapiens.GRCh38.cds.all.fa.gz | grep ">" | wc -l
gunzip -c GRCh38_reformatted.gz | wc -l
zless GRCh38_reformatted.gz

Identifying gRNA matches

Now, with the newly formatted file, we can easily identify genes in the database that match a specific gRNA.

gunzip -c GRCh38_reformatted.gz | head -30

We check that all entries have gene symbols:

gunzip -c GRCh38_reformatted.gz | grep "gene_symbol" | head -30 
gunzip -c GRCh38_reformatted.gz | grep -v "gene_symbol" | head -30 

# remember: grep -v means to *not* match

Next we extract those sequences matching the guide “TTAAGACA”:

gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | head -30 

Now we extract the gene symbols of these matches. First we match all text (.*) from the beginning of the line (^) before the text gene_symbol and delete it.

gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | head -30

Then we also remove all text after the gene symbol, in this case the space and everything (.*) after that until the end of the line ($).

gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g'  | head -30

Finally, we get unique gene IDs.

gunzip -c GRCh38_reformatted.gz | grep "seq:[ACTG]*TTAAGACA" | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g'  | sort | uniq

Exercises

Create the following files in the directory day3:

  • Write all entries with sequences matching the guide “GCGGTTTC” into the file guideMatch_GCGGTTTC.txt.
  • Write the count of entries with sequences starting with “TGC” into the file count_TGC.txt.
  • Write the count of entries of gene “MMP2” into the file count_MMP2.txt. Note: Do not count genes MMP20, MMP21,…
  • Write the count of unique genes whose symbol starts with “RPL” into the file count_RPL.txt.
  • Write the count of all unique protein coding genes into the file count_protein_coding.txt.

Loops and variables (Day 4)

Variables

Variables can hold information which can be passed to different programs.

x="test variable"
echo $x
echo $x | sed 's/a//g'

Now we will write the gRNAs to test and the regexp patterns into variables.

guide=TTAAGACA
echo $guide
pattern="seq:[ACTG]*${guide}"
echo $pattern
gunzip -c GRCh38_reformatted.gz | grep $pattern | head -30
gunzip -c GRCh38_reformatted.gz | grep $pattern | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | head -30

Loops

We will use while loops. They have the following syntax:

while [condition] do [something] done

Different types of loops exist (https://ryanstutorials.net/bash-scripting-tutorial/bash-loops.php), but those are not important for this course.

The following loop starts with x = 1. It then repeats printing “Welcome … times” and adding +1 to x, until x is greater than 5 (while x is lower or equal to 5 $x -le 5).

x=1
while [ $x -le 5 ]
do
  echo "Welcome $x times"
  x=$(( $x + 1 )) # syntax to add a number to x
done

Importantly, the loop is ONE construct. Thus when entering while [ $x -le 5 ], the terminal expects additional information, in particular do [...] done. While waiting for additional input, the terminal will start lines with >. Keep entering the do, etc in those lines. The final results will look like this:

[user]@corso:~$ while [ $x -le 5 ]
> do
> echo "Welcome $x times"
> x=$(( $x + 1 ))
> done
Welcome 1 times
Welcome 2 times
Welcome 3 times
Welcome 4 times
Welcome 5 times

Try to make the above loop count to 10. Or from 10 to 5.

Here are different ways to compare numbers. Assume variable a holds 10 and variable b holds 20 then:

Operator Description Example
-eq Checks if the value of two operands are equal or not; if yes, then the condition becomes true. [ $a -eq $b ] is not true.
-ne Checks if the value of two operands are equal or not; if values are not equal, then the condition becomes true. [ $a -ne $b ] is true.
-gt Checks if the value of left operand is greater than the value of right operand; if yes, then the condition becomes true. [ $a -gt $b ] is not true.
-lt Checks if the value of left operand is less than the value of right operand; if yes, then the condition becomes true. [ $a -lt $b ] is true.
-ge Checks if the value of left operand is greater than or equal to the value of right operand; if yes, then the condition becomes true. [ $a -ge $b ] is not true.
-le Checks if the value of left operand is less than or equal to the value of right operand; if yes, then the condition becomes true. [ $a -le $b ] is true.

You can also use nano to write the code above into a file (for example here: script.sh) and then use the following to execute it:

bash script.sh

We can also loop through the content of a file.

while read p; do
  echo "$p"
done < gRNAs.txt

Exercise

Now we will combine the above variables and loopes to test all guides in file gRNAs.txt against all DNA sequences in GRCh38_reformatted.gz. To do so you need the following steps (also see the hints below!!!):

  • Use the code above (under Variables) that matches guides against DNA sequences and extracts gene symbols for matched sequences.
  • Place this function into the loop that iterates through the file gRNAs.txt, using the while loop shown above that iterates through a file.
  • Store all matches (not just the top 30 coming from head -30) of each guide into a file that is named results_${guide}.txt. Place all files into a new directory day4 - ideally already within the loop.
  • Double-check the results by hand (no points).
    • How many files did you just create?
    • Look at one of the created files. Is the content what you would expect?
    • When you run it with only one guide (without the loop) and compare to the results generated within the loop: Do you get the right genes for this guide? Do you get the correct number of genes for this guide?
Tip

Hint: To complete the exercise, you can use the following code to extract genes for one guide. In this code, the guide needs to be defined first using guide=[...]. However, when placing this code block inside of the loop, you don’t define the guide - it is defined by the loop.

echo $guide
pattern="seq:[ACTG]*${guide}"
echo $pattern
gunzip -c GRCh38_reformatted.gz | grep $pattern | sed 's/^.*gene_symbol://g' | sed 's/ .*$//g' | sort | uniq

Place the above in the loop that iterates throught the gRNA file.

while read guide; do
  echo $guide
done < gRNAs.txt
Tip

Remember: you need to save the result of the last line (gunzip ... | uniq) to a file results_${guide}.txt.

Comparing files

Have a look at the files we generated.

cd ~/day4
ls results_*.txt
ls -l results_*.txt
head results_*.txt
wc -l results_*.txt

Now let’s compare results in a pairwise manner. The command comm will provide the gene symbols unique to one or the other file, plus the intersection of both.

cd ~/day4
comm results_AAGTTGGC.txt results_GCCATACA.txt
comm -12 results_AAGTTGGC.txt results_GCCATACA.txt

The count of genes in the overlap (intersection) can be stored in a new variable. To store results of an expression in a new variable, place the expression into parenthesis x=$().

cd ~/
guide1=AAGTTGGC
guide2=GCCATACA
overlap=$(comm -12 day4/results_${guide1}.txt day4/results_${guide2}.txt | wc -l)
echo "$guide1 $guide2 $overlap"

Exercise

Now we will compare each pair of guides to test the overlap of genes. Place the result in the directory day4:

  • Use one loop (that iterates through all guides) within a second loop (that also iterates through all guides) to compare the overlap between all pairs of guides (thus, for simplicity, we will compare all pairs of guides also the overlap of each guide with itself).
  • Use the comm command as shown above to extract the overlap, counting the number of genes, and writing the result into a file named overlap_${guide1}_${guide2}.txt.
  • “Manually” check results (no points):
    • How many files did you create? How many did you expect to create?
    • Look at one of the created files. Is the content what you would expect?
    • Compare individual examples (for example “AAGTTGGC” vs “GCCATACA”). Did you get the right overlap?

Special symbols

  • The ( is called a parenthesis
  • The [ is called a squared bracket (sometimes only bracket)
  • The { is called a curly bracket (sometimes only brace)
  • The ~ is called a tilde
  • The * is called an asterisk

Commands

Command function
basics
ssh connect to server
man show manual of a command
directories
mkdir create directory
cd change directory
pwd show current path
ls list files in path
files
cp copy file
mv move file
wc -l count lines in file
touch create empty file
chmod change permissions of file
file content
head look at the first lines of a file
tail look at the last lines of a file
less scroll through a file
cat print entire file
zip
gunzip unzip a file
gzip zip a file
zless scroll through a zipped file
regexp
grep extract regular expressions
sed replace text
sort
sort sort the lines in a file alphabetically
uniq remove duplicated lines