1 A quick tutorial using my favorite R package called plyranges.

This notebook is used to find SNPs within gene coding regions and get .fasta files for your favorite pupfish genes.

Plyranges has lots of advanced options for overlapping genomic intervalsYou can find out more about plyranges here

The R markdown file used to generate this can be found here

2 Make sure you are running R v.3.6 or higher

3 Install plyranges.

if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install()
BiocManager::install("plyranges")

4 Install tidyverse.

BiocManager::install("tidyverse")

5 Load libraries.

library(tidyverse)
library(plyranges)

6 Read in two files

gene coordinates - c_brontotheroides.all.renamed.putative_function.genes_only.reformated.known.final.saf
exon coordinates - c_brontotheroides.all.renamed.putative_function.exons_only_reformated_known_final.saf

Normally this information is combined in a .gtf or .gff3 file

genes <- read.table("D:/Cyprinodon/bronto/c_brontotheroides.all.renamed.putative_function.genes_only.reformated.known.final.saf", header = TRUE, stringsAsFactors = FALSE)
exons <- read.table("https://github.com/joemcgirr/fishfASE/raw/master/markdown/data_files/c_brontotheroides.all.renamed.putative_function.exons_only_reformated_known_final.saf", header = FALSE, stringsAsFactors = FALSE)

head(genes)
head(exons)

7 Change the headers into labels recognized by plyranges.

plyranges works with objects called granges (genomic ranges) which require the following information:
seqnames (chromosome names)
start (start position of feature)
end (end position of feature)
strand (strand of feature)
All other columns with different names will be treated as metadata.

names(genes) <- c("gene","seqnames","start","end","strand")
names(exons) <- c("exon","seqnames","start","end","strand")

8 Convert genes and exons dataframes into grange objects.

genes <- genes %>% as_granges()
exons <- exons %>% as_granges()

9 Find overlap between gene regions and a list of interesting SNPs

Read in your SNP file and make sure it has the appropriate granges headers (seqnames, start, end)
I’ll show an example with these 5 made up SNP coordinates:

scaffolds <- c("HiC_scaffold_1089","HiC_scaffold_1089","HiC_scaffold_1089","HiC_scaffold_16","HiC_scaffold_16")
starts <- c(2400,2502,4334,10045453,10045738)
ends <- starts + 1

snps <- data.frame(seqnames=scaffolds, 
                   start=starts,
                   end=ends,
                   stringsAsFactors=FALSE)

snps

plyranges has lots of options for looking at overlaps between regions.
To find SNPs that overlap with any coding gene region, convert your SNP dataframe to a granges object and use join_overlap_intersect()

snps <- snps %>% as_granges()
hits <- join_overlap_intersect(genes,snps) %>% as.data.frame()
hits

10 For more practice, we can match exons to their gene regions and generate fasta files for interesting exons.

features <- join_overlap_intersect(genes,exons) %>% as.data.frame()
head(features)

11 Now lets grab the coordinates for the gene gpa33.

One of the exons in this gene is deleted in the scale-eater pupfish species.
see Fig. 3 from here

gpa33

gene_name <- "Gpa33"

gene_region <- features %>% filter(gene == gene_name)
gene_region

12 Notice that there are two regions of the genome with a gene called gpa33.

One version is on scaffold 5 and the other is on scaffold 1089.
This is one of the joys of working with non-model organisms! /s
Often times genome annotations are a little messy.
However, in cases like this, we can be reasonably sure that
these genes correspond to gpa33a and gpa33b
orthologs in zebrafish. The genome duplications that occured in
teleosts makes this kind of thing common in this system.
Lets only include the Gpa33 found on scaffold 1089.

gpa33 <- gene_region %>% filter(seqnames == "HiC_scaffold_1089")

13 The exon that is deleted in scale-eaters is labeled CBRO_00019393-RA:5

deleted_exon <- gpa33 %>% filter(exon == "CBRO_00019393-RA:5")

14 The last thing we will do in R is create two .bed files.

One with the location of the entire gpa33 coding region
and another with the location of the deleted exon.

gpa33 <- data.frame("seqnames" = c(as.character(gpa33$seqnames)[1]), 
                    "start"    = c(min(gpa33$start)), 
                    "end"      = c(max(gpa33$end)), 
                    stringsAsFactors = FALSE)

# #write.table(gpa33, "gpa33.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")
# #write.table(deleted_exon[c("seqnames","start","end")], "gpa33_exon5.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")

15 Finally, get the fasta sequence from the reference genome for each .bed file.

This should be done on the cluster or on a computer thats running bedtools.
I run the following on the cluster:

module load bedtools
bedtools getfasta -fi /path/to/asm.racon.fasta -bed /path/to/gpa33.bed -fo gpa33.fa
bedtools getfasta -fi /proj/cmarlab/users/joe/Cyprinodon/bronto/asm.racon.fasta -bed /pine/scr/j/m/jmcgirr/scratch/gpa33.bed -fo /pine/scr/j/m/jmcgirr/scratch/gpa33.fa