This notebook is used to find SNPs within gene coding regions and get .fasta
files for your favorite pupfish genes.
Plyranges has lots of advanced options for overlapping genomic intervalsYou can find out more about plyranges here
The R markdown file used to generate this can be found here
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install()
BiocManager::install("plyranges")
BiocManager::install("tidyverse")
library(tidyverse)
library(plyranges)
gene coordinates - c_brontotheroides.all.renamed.putative_function.genes_only.reformated.known.final.saf
exon coordinates - c_brontotheroides.all.renamed.putative_function.exons_only_reformated_known_final.saf
Normally this information is combined in a .gtf
or .gff3
file
genes <- read.table("D:/Cyprinodon/bronto/c_brontotheroides.all.renamed.putative_function.genes_only.reformated.known.final.saf", header = TRUE, stringsAsFactors = FALSE)
exons <- read.table("https://github.com/joemcgirr/fishfASE/raw/master/markdown/data_files/c_brontotheroides.all.renamed.putative_function.exons_only_reformated_known_final.saf", header = FALSE, stringsAsFactors = FALSE)
head(genes)
head(exons)
plyranges works with objects called granges (genomic ranges) which require the following information:
seqnames (chromosome names)
start (start position of feature)
end (end position of feature)
strand (strand of feature)
All other columns with different names will be treated as metadata.
names(genes) <- c("gene","seqnames","start","end","strand")
names(exons) <- c("exon","seqnames","start","end","strand")
genes <- genes %>% as_granges()
exons <- exons %>% as_granges()
Read in your SNP file and make sure it has the appropriate granges headers (seqnames, start, end)
I’ll show an example with these 5 made up SNP coordinates:
scaffolds <- c("HiC_scaffold_1089","HiC_scaffold_1089","HiC_scaffold_1089","HiC_scaffold_16","HiC_scaffold_16")
starts <- c(2400,2502,4334,10045453,10045738)
ends <- starts + 1
snps <- data.frame(seqnames=scaffolds,
start=starts,
end=ends,
stringsAsFactors=FALSE)
snps
plyranges has lots of options for looking at overlaps between regions.
To find SNPs that overlap with any coding gene region, convert your SNP dataframe to a granges object and use join_overlap_intersect()
snps <- snps %>% as_granges()
hits <- join_overlap_intersect(genes,snps) %>% as.data.frame()
hits
features <- join_overlap_intersect(genes,exons) %>% as.data.frame()
head(features)
One of the exons in this gene is deleted in the scale-eater pupfish species.
see Fig. 3 from here
gene_name <- "Gpa33"
gene_region <- features %>% filter(gene == gene_name)
gene_region
One version is on scaffold 5 and the other is on scaffold 1089.
This is one of the joys of working with non-model organisms! /s
Often times genome annotations are a little messy.
However, in cases like this, we can be reasonably sure that
these genes correspond to gpa33a and gpa33b
orthologs in zebrafish. The genome duplications that occured in
teleosts makes this kind of thing common in this system.
Lets only include the Gpa33 found on scaffold 1089.
gpa33 <- gene_region %>% filter(seqnames == "HiC_scaffold_1089")
deleted_exon <- gpa33 %>% filter(exon == "CBRO_00019393-RA:5")
.bed
files.One with the location of the entire gpa33 coding region
and another with the location of the deleted exon.
gpa33 <- data.frame("seqnames" = c(as.character(gpa33$seqnames)[1]),
"start" = c(min(gpa33$start)),
"end" = c(max(gpa33$end)),
stringsAsFactors = FALSE)
# #write.table(gpa33, "gpa33.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")
# #write.table(deleted_exon[c("seqnames","start","end")], "gpa33_exon5.bed", col.names = FALSE, row.names = FALSE, quote = FALSE, sep = "\t")
.bed
file.This should be done on the cluster or on a computer thats running bedtools.
I run the following on the cluster:
module load bedtools
bedtools getfasta -fi /path/to/asm.racon.fasta -bed /path/to/gpa33.bed -fo gpa33.fa
bedtools getfasta -fi /proj/cmarlab/users/joe/Cyprinodon/bronto/asm.racon.fasta -bed /pine/scr/j/m/jmcgirr/scratch/gpa33.bed -fo /pine/scr/j/m/jmcgirr/scratch/gpa33.fa