Advances in next-generation sequencing technologies, alongside substantial reductions in the cost of sequencing, have made it possible to measure the expression of thousands of genes across thousands of samples in non-model organisms. Evolutionary genomic methods that compare gene expression differences between populations provide a forward genetic approach to identifying the genetic basis of divergent phenotypes. The huge amount of data produced by these experiments makes it impractical to process results on a single desktop machine. Instead, many recent breakthroughs in genomic research were made possible by high performance computing clusters that allow fast, parallel processing of many samples at once.
A computer cluster is a set of connected computers (compute nodes) directed by a head node that runs centralized management software. Farm is a research and teaching cluster for the College of Agricultural and Environmental Sciences at UC Davis that uses the Slurm workload manager to handle jobs submitted by many users.
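Everything below assumes you are logged in to farm. A minimal login command, assuming the same hostname and SSH port (2022) used for the `scp` transfers later in this tutorial:

```
ssh -p 2022 <username>@farm.cse.ucdavis.edu
```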
Here is an example of a job I ran recently using this Slurm script:
```
#!/bin/bash
#SBATCH --job-name=CA17_angsd_downsample_sfs
#SBATCH --mem=40G
#SBATCH --ntasks=8
#SBATCH -e CA17_angsd_downsample_sfs_%A_%a.err
#SBATCH --time=48:00:00
#SBATCH --mail-user=jamcgirr@ucdavis.edu ##email you when job starts, ends, etc.
#SBATCH --mail-type=ALL
#SBATCH -p high

shuf /home/jamcgirr/ph/data/angsd/SFS/bamlist_test/CA17_bams_p_1_5_rm.txt | head -41 > /home/jamcgirr/ph/data/angsd/SFS/downsample/downsample_bams_CA17.txt
/home/jamcgirr/apps/angsd_sep_20/angsd/angsd -bam /home/jamcgirr/ph/data/angsd/SFS/downsample/downsample_bams_CA17.txt -doSaf 1 -doMajorMinor 1 -doMaf 3 -doCounts 1 -anc /home/jamcgirr/ph/data/c_harengus/c.harengus.fa -ref /home/jamcgirr/ph/data/c_harengus/c.harengus.fa -minMapQ 30 -minQ 20 -GL 1 -P 8 -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -minInd 10 -setMinDepth 10 -setMaxDepth 100 -out /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30
/home/jamcgirr/apps/angsd_sep_20/angsd/misc/realSFS /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30.saf.idx -P 8 -fold 1 -nSites 100000000 > /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30_folded.sfs
/home/jamcgirr/apps/angsd_sep_20/angsd/misc/realSFS /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30.saf.idx -P 8 -nSites 100000000 > /home/jamcgirr/ph/data/angsd/SFS/downsample/CA17_minQ20_minMQ30_unfolded.sfs

#run: sbatch script_CA17_angsd_downsample_sfs.sh
```
Check out the partitions available on farm with the `sinfo` command:

```
sinfo -o "%12P %.5D %.4c %.6mMB %.11l"
```

```
PARTITION    NODES  CPUS  MEMORYMB  TIMELIMIT
low2            34   64+  256000MB  1-00:00:00
med2            34   64+  256000MB  150-00:00:0
high2           34   64+  256000MB  150-00:00:0
low            101    24   64300MB  13:20:00
med*           101    24   64300MB  150-00:00:0
high           101    24   64300MB  150-00:00:0
bigmeml          9   64+  480000MB  150-00:00:0
bigmemm          9   64+  480000MB  150-00:00:0
bigmemh          8   64+  480000MB  150-00:00:0
bigmemht         1    96  970000MB  150-00:00:0
bit150h          1    80  500000MB  150-00:00:0
ecl243           1    80  500000MB  150-00:00:0
bml             18    96  480000MB  150-00:00:0
bmm             18    96  480000MB  150-00:00:0
bmh             18    96  480000MB  150-00:00:0
bgpu             1    40  128000MB  150-00:00:0
bmdebug          2    96  100000MB  3600-00:00:
gpuh             2    48  768000MB  7-00:00:00
gpum             2    48  768000MB  7-00:00:00
```

For this class, we will submit jobs to the `ecl243` partition.
Today we will learn how to use these resources to run the first step in an RNAseq pipeline. We are going to play with gene expression data from my graduate work with Cyprinodon pupfishes.
First, make two directories in your home directory on farm:

```
mkdir fastqs
mkdir scripts
```
Open up another terminal so that you can interact with your local files. There are several ways to transfer files between farm and your computer; we will be using `scp`.
Note: if you find yourself doing a lot of transferring for your project in the future, it is worth checking out faster alternatives.

Download `slurm_template.sh` from Slack and use the `scp` command to upload it to farm. Whenever you see the symbols `< >`, that means you need to change what I have written:

```
scp -P 2022 <path/to/>slurm_template.sh <username>@farm.cse.ucdavis.edu:~/scripts
```
Next, download the `.fastq` files with `wget`. The files we will trim are uploaded to my GitHub; they are 150 bp paired-end Illumina reads. Since we are just downloading small files, we can run this on the head node:
```
cd fastqs/
wget https://github.com/joemcgirr/joemcgirr.github.io/raw/master/tutorials/farm_slurm/CPE1_R1.fastq
wget https://github.com/joemcgirr/joemcgirr.github.io/raw/master/tutorials/farm_slurm/CPE1_R2.fastq
```
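As a quick sanity check, you can count the reads in each file; this assumes the standard fastq layout of four lines per record:

```
# Each fastq record spans 4 lines, so reads = line count / 4
for f in CPE1_R1.fastq CPE1_R2.fastq; do
  echo "$f: $(( $(wc -l < "$f") / 4 )) reads"
done
```

The two files of a read pair should report the same count.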
See which programs are already installed on farm with the `module avail` command:

```
module avail
```
```
----------------------------- /share/apps/modulefiles/lang -----------------------------
aocc/2.1.0          intel/2013         julia/0.6.2  perlbrew/5.16.0  python3/3.6.1
gcc/4.5             intel/2019         julia/0.7.0  pgi/13.3         python3/3.7.4
gcc/4.7.3           ipython2/2.7.16    julia/1.0.0  pgi/13.4         python3/system
gcc/4.9.3           ipython3/3.6.9     julia/1.0.3  proj/4.9.3       R/3.6
gcc/5.5.0           java-jre/1.8.0_20  julia/1.1.0  proj/7.0.1       R/3.6.2
gcc/6.3.1           java/1.8           julia/1.1.1  python/2.7.4     R/3.6.3(default)
gcc/7.2.0           jdk/1.7.0_79       julia/1.2.0  python/2.7.6     R/4.0.2
gcc/7.3.0(default)  jdk/1.8.0.31       julia/1.3.0  python/2.7.14    tools/0.2
gcc/9.2.0           jdk/1.8.0.121      julia/1.3.1  python/2.7.15    udunits/2.2.2
golang/1.13.1       julia/0.6.0        julia/1.4.2  python2/system

----------------------------- /share/apps/modulefiles/hpc ------------------------------
a5miseq/0            masurca/2.3.1   velvet/1.2.10
a5miseq/20160825     masurca/2.3.2   ViennaRNA/2.1.8
a5pipeline/20130326  masurca/3.1.3   ViennaRNA/2.4.11
abblast/4Jan2019     matlab/1-2019a  VirtualGL/2.6.2
abyss/1.3.5          matlab/7.11     VirusDetect/1.7
abyss/1.5.1          matlab/7.13     vsearch/1.10.1
abyss/1.5.2          matlab/2016b    WASP/0.3.4
abyss/1.9.0          matlab/2017a    WinHAP2/1
AGOUTI/0.3.3         matlab/2018b    wise/2.2.3-rc7
aksrao/3.0           matplotlib/2.0  wrf/4.0
```
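If you already know the program you need, you can filter the listing instead of scrolling through everything. Assuming farm's module system supports the standard environment-modules commands:

```
# List only modules whose names match a pattern
module avail trim_galore

# Show the modules currently loaded in your session
module list
```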
We will use `trim_galore` to trim adapters and low-quality bases from our reads. Move into your scripts directory and take a look at the template you uploaded:
```
cd scripts/
cat slurm_template.sh
```
```
#!/bin/bash
#SBATCH --job-name=              # create a short name for your job
#SBATCH --ntasks=                # total number of tasks across all nodes
#SBATCH --mem=                   # memory to allocate
#SBATCH --time=                  # total run time limit (HH:MM:SS)
#SBATCH --partition=             # request a specific partition for the resource allocation
#SBATCH --error                  # create a file that contains error messages
#SBATCH --mail-type=             # send email when job begins and ends
#SBATCH --mail-user=email@ucdavis.edu
```
Edit `slurm_template.sh` to match what is shown below and save it as `trim_galore.sh`:
```
#!/bin/bash
#SBATCH --job-name=trim_galore           # create a short name for your job
#SBATCH --ntasks=1                       # total number of tasks across all nodes
#SBATCH --mem=8G                         # memory to allocate
#SBATCH --time=00:01:00                  # total run time limit (HH:MM:SS)
#SBATCH --partition=ecl243               # request a specific partition for the resource allocation
#SBATCH --error trim_galore_%A_%a.err    # create a file that contains error messages
#SBATCH --mail-type=ALL                  # send email when job begins and ends
#SBATCH --mail-user=<email>@ucdavis.edu

module load trim_galore
trim_galore -q 20 --paired ~/fastqs/CPE1_R1.fastq ~/fastqs/CPE1_R2.fastq
```
Submit the job with `sbatch`:

```
sbatch trim_galore.sh
```
Check on your job with `squeue`:

```
squeue -u <username>
```

```
JOBID     PARTITION  NAME      USER      ST  TIME  NODES  CPU  MIN_ME  NODELIST(REASON)
29741087  ecl243     trim_gal  ecl243-0  R   0:10  1      1    8G      bigmem9
```
To cancel a running job, use `scancel`:

```
scancel <JOBID>
```
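Once a job finishes it disappears from `squeue`. Assuming Slurm accounting is enabled on farm, `sacct` can summarize what a completed job used (similar information appears in the `slurm-<JOBID>.out` file described next):

```
# State, wall time, and peak memory for each step of a finished job
sacct -j <JOBID> --format=JobID,JobName,State,Elapsed,MaxRSS
```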
A file will appear in the directory containing your `trim_galore.sh` that looks like `slurm-<JOBID>.out`. This file contains anything that trim galore writes to standard out, along with information about your job:
```
cat slurm-<JOBID>.out
```

```
==========================================
SLURM_JOB_ID = 29741087
SLURM_NODELIST = bigmem9
==========================================
1.15
Name                : trim_galore
User                : ecl243-06
Partition           : ecl243
Nodes               : bigmem9
Cores               : 1
GPUs                : 0
State               : COMPLETED
Submit              : 2021-01-22T16:46:26
Start               : 2021-01-22T16:46:26
End                 : 2021-01-22T16:46:49
Reserved walltime   : 00:01:00
Used walltime       : 00:00:23
Used CPU time       : 00:00:02
% User (Computation): 88.00%
% System (I/O)      : 11.95%
Mem reserved        : 8G/node
Max Mem used        : 0.00 (bigmem9)
Max Disk Write      : 0.00 (bigmem9)
Max Disk Read       : 0.00 (bigmem9)
```
Another file will appear that looks like `trim_galore_<JOBID>_<TASKID>.err`:

```
head trim_galore_<JOBID>_<TASKID>.err
```
```
Module perlbrew/5.16.0 loaded
Please be sure your perl scripts hashbang line is #!/usr/bin/env perl
Module trim_galore/1 loaded
No quality encoding type selected. Assuming that the data provided uses Sanger encoded Phred scores (default)
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
AUTO-DETECTING ADAPTER TYPE
```
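A note on the `%A` and `%a` placeholders in the `--error` filename: they stand for the array job ID and array task ID. Our trim job ran as a single task, but the same template scales to many samples with a Slurm job array. Here is a minimal sketch, assuming a hypothetical `samples.txt` in `~/fastqs/` listing one sample prefix (e.g. `CPE1`) per line:

```
#!/bin/bash
#SBATCH --job-name=trim_array
#SBATCH --ntasks=1
#SBATCH --mem=8G
#SBATCH --time=01:00:00
#SBATCH --partition=ecl243
#SBATCH --error trim_array_%A_%a.err  # %A = array job ID, %a = task ID
#SBATCH --array=1-4                   # one task per sample; match your sample count

# Pull this task's sample name from the (hypothetical) list
sample=$(sed -n "${SLURM_ARRAY_TASK_ID}p" ~/fastqs/samples.txt)

module load trim_galore
trim_galore -q 20 --paired ~/fastqs/${sample}_R1.fastq ~/fastqs/${sample}_R2.fastq
```

Each task writes its own `.err` file, which is why the filenames above include both IDs.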
Trim galore also writes `CPE1_*.fastq_trimming_report.txt` files summarizing what was trimmed from each read file. MultiQC is a really awesome quality control tool that can recognize summary files produced by popular bioinformatics software. You simply run `multiqc .` and the program will look through the current directory, find summary files, and produce interactive plots that can be viewed as `.html`.
Interactive sessions let you move to a compute node where you can test commands and run short jobs. You need to stay logged in the entire time the job runs. Submitting jobs with `sbatch`, on the other hand, allows you to log off of the cluster and enjoy your day while your job runs. But first, start an interactive session with `srun` so we can run MultiQC on our trimming reports:

```
srun -p ecl243 --mem 8G -c 4 -t 00:10:00 --pty bash
```
Once you are on the compute node, return to your home directory, load MultiQC, and run it:

```
cd
module load multiqc
multiqc .
```
Finally, use `scp` from the terminal on your local machine to copy the `multiqc_report.html` file to a local directory:

```
scp -P 2022 <username>@farm.cse.ucdavis.edu:multiqc_report.html path/to/Downloads/
```
For more tips on using farm, check out: https://github.com/RILAB/lab-docs/wiki/Using-Farm