Introduction

Advances in next-generation sequencing technologies, alongside substantial reductions in the cost of sequencing, have made it possible to measure the expression of thousands of genes across thousands of samples in non-model organisms. Evolutionary genomic methods comparing gene expression differences between populations provide a forward genetic approach to identify the genetic basis of divergent phenotypes. The huge amount of data produced by these experiments makes it impractical to process results on a single desktop machine. Instead, many recent breakthroughs in genomic research were made possible by high-performance computing clusters that allow fast, parallel processing of many samples at once.


1. Open two terminals. Log onto farm on ONE of them.

2. What is a computer cluster?

A computer cluster is a set of connected computers (compute nodes) directed by a head node that runs centralized management software. Farm is a research and teaching cluster for the College of Agricultural and Environmental Sciences that uses the Slurm workload manager to handle jobs submitted by many users.

  • NOTE! Never run a job directly on the head node!
    • cat example
  • The only things you should do on the head node are (see the examples after this list):
    • Submit or check on jobs
    • Download files
    • Edit files
    • Install R packages
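
For example, these are all fine to run on the head node (they're the same commands we'll use later in this tutorial; replace <username> with your own):

sbatch trim_galore.sh       # submit a job
squeue -u <username>        # check on your jobs
wget https://github.com/joemcgirr/joemcgirr.github.io/raw/master/tutorials/farm_slurm/CPE1_R1.fastq
nano slurm_template.sh      # edit a file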




3. Why use a cluster?

Here is an example of a job I ran recently using this Slurm script:

#!/bin/bash

#SBATCH --job-name=CA17_angsd_downsample_sfs
#SBATCH --mem=40G
#SBATCH --ntasks=8
#SBATCH -e CA17_angsd_downsample_sfs_%A_%a.err
#SBATCH --time=48:00:00
#SBATCH --mail-user=jamcgirr@ucdavis.edu ##email you when job starts,ends,etc
#SBATCH --mail-type=ALL
#SBATCH -p high

# randomly select 41 bam files from the CA17 bam list
shuf /home/jamcgirr/ph/data/angsd/SFS/bamlist_test/CA17_bams_p_1_5_rm.txt | head -41 > /home/jamcgirr/ph/data/angsd/SFS/downsample/downsample_bams_CA17.txt

# calculate site allele frequency likelihoods for the downsampled CA17 bams with ANGSD
/home/jamcgirr/apps/angsd_sep_20/angsd/angsd -bam /home/jamcgirr/ph/data/angsd/SFS/downsample/downsample_bams_CA17.txt -doSaf 1 -doMajorMinor 1 -doMaf 3 -doCounts 1 -anc /home/jamcgirr/ph/data/c_harengus/c.harengus.fa -ref /home/jamcgirr/ph/data/c_harengus/c.harengus.fa -minMapQ 30 -minQ 20 -GL 1 -P 8 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -minInd 10 -setMinDepth 10 -setMaxDepth 100 -out /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30

# estimate the folded and unfolded site frequency spectra with realSFS
/home/jamcgirr/apps/angsd_sep_20/angsd/misc/realSFS /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30.saf.idx -P 8 -fold 1 -nSites 100000000 > /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30_folded.sfs
/home/jamcgirr/apps/angsd_sep_20/angsd/misc/realSFS /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30.saf.idx -P 8 -nSites 100000000 > /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30_unfolded.sfs


#run: sbatch script_CA17_angsd_downsample_sfs.sh
  • This job ran for 36 hours and created a file that is 157 GB (compressed! impressed?)
  • I needed to do this 14 times (CA17 is one of 14 different populations in my data set). Running the jobs one after another would take 3 weeks, but the cluster can run them all at once (see the job array sketch below).
  • In total this created 2.1 TB of data, which is twice the amount of space I have on my laptop.
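
Because jobs run in parallel on different compute nodes, all 14 populations can be processed at the same time. Below is a minimal sketch of how that could look using a Slurm job array. It assumes a hypothetical file called populations.txt listing one population name per line (CA17, etc.); only the downsampling step is shown, and the angsd/realSFS commands would follow with CA17 swapped for ${pop}.

#!/bin/bash

#SBATCH --job-name=angsd_downsample_sfs
#SBATCH --array=1-14                  # one array task per population
#SBATCH --mem=40G
#SBATCH --ntasks=8
#SBATCH -e angsd_downsample_sfs_%A_%a.err
#SBATCH --time=48:00:00
#SBATCH -p high

# grab the population name for this array task from the (hypothetical) populations.txt
pop=$(sed -n "${SLURM_ARRAY_TASK_ID}p" populations.txt)

# randomly select 41 bam files for this population, just like the CA17 command above
shuf /home/jamcgirr/ph/data/angsd/SFS/bamlist_test/"${pop}"_bams_p_1_5_rm.txt | head -41 > /home/jamcgirr/ph/data/angsd/SFS/downsample/downsample_bams_"${pop}".txt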


4. Let’s take a look at our resources using the sinfo command

sinfo  -o "%12P %.5D %.4c %.6mMB %.11l"
PARTITION    NODES CPUS MEMORYMB   TIMELIMIT
low2            34  64+ 256000MB  1-00:00:00
med2            34  64+ 256000MB 150-00:00:0
high2           34  64+ 256000MB 150-00:00:0
low            101   24  64300MB    13:20:00
med*           101   24  64300MB 150-00:00:0
high           101   24  64300MB 150-00:00:0
bigmeml          9  64+ 480000MB 150-00:00:0
bigmemm          9  64+ 480000MB 150-00:00:0
bigmemh          8  64+ 480000MB 150-00:00:0
bigmemht         1   96 970000MB 150-00:00:0
bit150h          1   80 500000MB 150-00:00:0
ecl243           1   80 500000MB 150-00:00:0
bml             18   96 480000MB 150-00:00:0
bmm             18   96 480000MB 150-00:00:0
bmh             18   96 480000MB 150-00:00:0
bgpu             1   40 128000MB 150-00:00:0
bmdebug          2   96 100000MB 3600-00:00:
gpuh             2   48 768000MB  7-00:00:00
gpum             2   48 768000MB  7-00:00:00
  • Unless you already had a farm account before this class, you can only request the ecl243 partition
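
You can also ask sinfo about a single partition with the -p flag, for example to check just the one you can use:

sinfo -p ecl243 -o "%12P %.5D %.4c %.6mMB %.11l"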







Questions?







Tutorial

Today we will learn how to use these resources to run the first step in an RNAseq pipeline. We are going to play with gene expression data from my graduate work with Cyprinodon pupfishes.




1. Log in with ssh and create directories

  • You should have already done this.
mkdir fastqs
mkdir scripts
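
If you haven't logged in yet, the ssh command should look like this (it uses the same port and hostname as the scp command in the next step; replace <username> with your farm username):

ssh -p 2022 <username>@farm.cse.ucdavis.edu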




2. File transfer

  • Open up another terminal so that you can interact with your local files

  • There are several ways to transfer files between farm and your computer. We will be using scp

  • Note: If you find yourself doing a lot of transferring in the future for your project, check out:

    • WinSCP (for Windows)
    • Filezilla
    • Slack me for help

Download slurm_template.sh from Slack and use the scp command to upload it to farm

Whenever you see these symbols ‘< >’, that means you need to replace what I have written with your own information (and leave out the brackets)

  • Remember to use your username
scp -P 2022 <path/to/>slurm_template.sh <username>@farm.cse.ucdavis.edu:~/scripts
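
Note that scp is run from your local terminal, not from farm; the side with <username>@farm.cse.ucdavis.edu: is the remote side. You can confirm the upload worked by listing the directory in your farm terminal:

ls ~/scripts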

Download .fastq files with wget

  • The files we will trim are uploaded to my GitHub. These are 150 bp paired-end Illumina reads.

  • Since we are just downloading small files with wget, we can run this on the head node

cd fastqs/
wget https://github.com/joemcgirr/joemcgirr.github.io/raw/master/tutorials/farm_slurm/CPE1_R1.fastq
wget https://github.com/joemcgirr/joemcgirr.github.io/raw/master/tutorials/farm_slurm/CPE1_R2.fastq
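
Each read in a .fastq file takes up four lines (a header starting with ‘@’, the sequence, a ‘+’ separator, and the quality string), so you can take a quick peek at what we just downloaded and count the reads:

head -4 CPE1_R1.fastq    # the first read: header, sequence, +, quality scores
wc -l CPE1_R1.fastq      # total lines; divide by 4 to get the number of reads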




3. Software

  • Many programs are installed and ready to use on farm.

List available programs with module avail

module avail
----------------------------- /share/apps/modulefiles/lang -----------------------------
aocc/2.1.0          intel/2013         julia/0.6.2  perlbrew/5.16.0  python3/3.6.1
gcc/4.5             intel/2019         julia/0.7.0  pgi/13.3         python3/3.7.4
gcc/4.7.3           ipython2/2.7.16    julia/1.0.0  pgi/13.4         python3/system
gcc/4.9.3           ipython3/3.6.9     julia/1.0.3  proj/4.9.3       R/3.6
gcc/5.5.0           java-jre/1.8.0_20  julia/1.1.0  proj/7.0.1       R/3.6.2
gcc/6.3.1           java/1.8           julia/1.1.1  python/2.7.4     R/3.6.3(default)
gcc/7.2.0           jdk/1.7.0_79       julia/1.2.0  python/2.7.6     R/4.0.2
gcc/7.3.0(default)  jdk/1.8.0.31       julia/1.3.0  python/2.7.14    tools/0.2
gcc/9.2.0           jdk/1.8.0.121      julia/1.3.1  python/2.7.15    udunits/2.2.2
golang/1.13.1       julia/0.6.0        julia/1.4.2  python2/system

----------------------------- /share/apps/modulefiles/hpc ------------------------------
a5miseq/0                        masurca/2.3.1                 velvet/1.2.10
a5miseq/20160825                 masurca/2.3.2                 ViennaRNA/2.1.8
a5pipeline/20130326              masurca/3.1.3                 ViennaRNA/2.4.11
abblast/4Jan2019                 matlab/1-2019a                VirtualGL/2.6.2
abyss/1.3.5                      matlab/7.11                   VirusDetect/1.7
abyss/1.5.1                      matlab/7.13                   vsearch/1.10.1
abyss/1.5.2                      matlab/2016b                  WASP/0.3.4
abyss/1.9.0                      matlab/2017a                  WinHAP2/1
AGOUTI/0.3.3                     matlab/2018b                  wise/2.2.3-rc7
aksrao/3.0                       matplotlib/2.0                wrf/4.0

We’ll be using one of these programs called trim_galore

  • We will use Trim Galore to trim Illumina adapters from the pupfish RNAseq reads
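
Assuming farm supports the standard Environment Modules commands (worth double-checking with module help if these don't work), you can search for the program, load it, and see what you currently have loaded:

module avail trim_galore    # list modules matching 'trim_galore'
module load trim_galore     # load it into your environment
module list                 # show the modules currently loaded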




4. Slurm headers

Take a look at slurm_template.sh

cd scripts/
cat slurm_template.sh
#!/bin/bash

#SBATCH --job-name=   # create a short name for your job
#SBATCH --ntasks=     # total number of tasks across all nodes
#SBATCH --mem=        # memory to allocate
#SBATCH --time=       # total run time limit (HH:MM:SS)
#SBATCH --partition=  # request a specific partition for the resource allocation
#SBATCH --error       # create a file that contains error messages
#SBATCH --mail-type=  # send email when job begins and ends
#SBATCH --mail-user=email@ucdavis.edu 

Use nano to edit slurm_template.sh to match what is shown below and save it as trim_galore.sh

  • Note to students who already had farm accounts and did not log in with an ecl243 username:
    • Use ‘--partition=high’
#!/bin/bash

#SBATCH --job-name=trim_galore        # create a short name for your job
#SBATCH --ntasks=1                    # total number of tasks across all nodes
#SBATCH --mem=8G                      # memory to allocate
#SBATCH --time=00:01:00               # total run time limit (HH:MM:SS)
#SBATCH --partition=ecl243            # request a specific partition for the resource allocation
#SBATCH --error trim_galore_%A_%a.err # create a file that contains error messages
#SBATCH --mail-type=ALL               # send email when job begins and ends
#SBATCH --mail-user=<email>@ucdavis.edu

# load Trim Galore and trim adapters and low-quality ends (Phred score < 20) from the paired reads
module load trim_galore
trim_galore -q 20 --paired ~/fastqs/CPE1_R1.fastq ~/fastqs/CPE1_R2.fastq




5. Submit our job!

Submit the job with sbatch

sbatch trim_galore.sh

Check on the status of the job with squeue

squeue -u <username>
JOBID PARTITION     NAME     USER ST        TIME  NODES CPU MIN_ME NODELIST(REASON)
29741087    ecl243 trim_gal ecl243-0  R        0:10      1 1   8G     bigmem9

If you need to stop a job, use scancel

scancel JOBID
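
scancel can also take a user filter if you want to cancel everything you have queued or running:

scancel -u <username>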

Checking stdout

  • A file will appear in the directory containing your trim_galore.sh that looks like slurm-<JOBID>.out

  • This file will contain anything that is written to standard out by Trim Galore, along with information about your job.

cat slurm-<JOBID>.out
==========================================
SLURM_JOB_ID = 29741087
SLURM_NODELIST = bigmem9
==========================================
1.15
Name                : trim_galore
User                : ecl243-06
Partition           : ecl243
Nodes               : bigmem9
Cores               : 1
GPUs                : 0
State               : COMPLETED
Submit              : 2021-01-22T16:46:26
Start               : 2021-01-22T16:46:26
End                 : 2021-01-22T16:46:49
Reserved walltime   : 00:01:00
Used walltime       : 00:00:23
Used CPU time       : 00:00:02
% User (Computation): 88.00%
% System (I/O)      : 11.95%
Mem reserved        : 8G/node
Max Mem used        : 0.00  (bigmem9)
Max Disk Write      : 0.00  (bigmem9)
Max Disk Read       : 0.00  (bigmem9)
  • It’s a good idea to run a test with dummy data and check the amount of memory used. This can let you know how much to request for future jobs when you scale up (unless you’re using really tiny dummy files like we are).
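
One way to check this after a job finishes (assuming Slurm job accounting is enabled on farm) is the sacct command:

sacct -j <JOBID> --format=JobID,Elapsed,MaxRSS,State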

Checking stderr

Another file will appear that looks like trim_galore_<JOBID_TASKID>.err

head trim_galore_<JOBID_TASKID>.err
Module perlbrew/5.16.0 loaded
 Please be sure your perl scripts hashbang line is #!/usr/bin/env perl
Module trim_galore/1 loaded
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)

Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')


AUTO-DETECTING ADAPTER TYPE
  • Most of this information is contained in CPE1_*.fastq_trimming_report.txt files







Questions?







6. Run multiqc with an interactive session

MultiQC is a really awesome quality control tool that can recognize summary files produced by popular bioinformatics software. You simply run multiqc . and the program searches the current directory (and its subdirectories) for summary files, then produces interactive plots you can view as an .html report.

Should I use interactive mode or submit a job?

  • Interactive sessions let you move to a compute node where you can test commands and run short jobs. You need to stay logged on the entire time the job runs.

  • Submitting jobs allows you to log off of the cluster and enjoy your day while your job runs.

Run multiqc

!!!!!! BUT FIRST !!!!!!

Initiate an interactive session with srun

srun -p ecl243 --mem 8G -c 4 -t 00:10:00 --pty bash

Load multiqc and run it in your home directory

cd
module load multiqc
multiqc .

Transfer the resulting multiqc_report.html file to a local directory

  • Remember to use your username
scp -P 2022 <username>@farm.cse.ucdavis.edu:multiqc_report.html <path/to/Downloads/>
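
Once the report is transferred, type exit in your farm terminal to leave the interactive session and return to the head node:

exit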




Congratulations! You’re a farmer!

Questions?