Annotating my genomic variants with snpEFF, snpSift, and ClinVar

Sequencing my genome

This is the first in a series of notebooks that document my attempt to learn more about myself through sequencing. I’m a bioinformatics scientist with lots of experience tinkering with genomic data sets. When I heard that Nebula Genomics would sequence my whole genome at 30x coverage for $300 AND let me download all of the data (raw .fastq, ..bam, and .vcf), I jumped on the chance to take a look at my own source code. Nebula provided me with a vcf file containing 4,785,184 QC passing variants. I want to prioritize which of these millions of variants might deserve a closer look. As a first pass, I annotated these variants to see if any were known to disrupt protein function or play a role in disease.

There are plenty of third-party genetic interpretation websites that can do this kind of thing with your direct-to-consumer genetic test results. I thought it would be more fun to run my own analyses and share them with others that might want to limit the number of companies that have access to their genome. Anyone who has a vcf file from Nebula and a little experience with Linux and R should be able to recreate these analyses for themselves. I ran everything below on my little personal laptop (i5, 16G RAM) running Windows 11 with an Ubuntu Virtualbox install.

Publically availible tools and databases

SnpEFF and SnpSift are popular tools for vcf annotation – and for good reason. They generate loads of useful information while remaining user friendly. They are actively maintained as of spring 2022 and have extensive documentation.

SnpEff is a variant annotation and effect prediction tool. It annotates and predicts the effects of genetic variants.

SnpSift is a toolbox that allows you to filter and manipulate annotated files.

Maintained by Pablo Cingolani

Publications: SnpEFF, SnpSift

GATK is incredible software with many tools for genomic analyses. I only use it here to convert the vcf into something easier to read.

ClinVar is a freely accessible, public archive of reports of the relationships among human variations and phenotypes, with supporting evidence. I use this database to help prioritize which of the millions of variants in my genome deserve a closer look.

SNPedia is a wiki investigating human genetics that shares information about the effects of variations in DNA, citing peer-reviewed scientific publications. I use this resource to read quick facts about specific genetic variants and find associated publications. Unfortunately, it doesn’t seem to be very actively maintained since it was purchased by myHeritage along with Promethease.

Load R libraries

start_time <- Sys.time()

suppressPackageStartupMessages(library(tidyverse))
library(vroom)
library(rsnps)
library(gt)
#library(beepr)

# Wong, B. Points of view: Color blindness. Nat Methods (2011).
bla <- '#000000'
blu <- '#0072b2'
grb <- '#56b4e9'
lir <- '#cc79a7'
gre <- '#009e73'
red <- '#d55e00'
org <- '#e69f00'
yel <- '#f0e442'
gry <- '#BBBBBB'

Annotating my vcf with SnpEFF

Setup

Downloading and installing SnpEFF is quick and easy.

wget https://snpeff.blob.core.windows.net/versions/snpEff_latest_core.zip
unzip snpEff_latest_core.zip

You must specify which genome was used during alignment to call varaiants in the vcf. Nebula currently aligns to hg38.

java -jar /media/sf_dna/apps/snpEff/snpEff.jar download -v hg38

This creates a new vcf with annotations for variants in the Nebula vcf.

java -Xmx8g -jar /media/sf_dna/apps/snpEff/snpEff.jar hg38 /media/sf_dna/nebula/vcf/nebula.vcf > myVariants.ann.vcf

GATK is also quick and easy to install.

sudo apt install default-jre
wget https://github.com/broadinstitute/gatk/releases/download/4.2.5.0/gatk-4.2.5.0.zip
unzip gatk-4.2.5.0.zip
# set alias in bash profile 
# alias gatk='/media/sf_dna/apps/gatk-4.2.5.0/gatk'

I convert the vcf annotated by SnpEFF to a tab delimited text file with GATK. This makes the data a little easier to read and can be fed into R as a data frame. I also replace my Nebula ID with ‘myVariants’ to make downstream scripts more generic.

gatk VariantsToTable -V myVariants.ann.vcf -F CHROM -F POS -F TYPE -F ID -F ANN -F LOF -F NMD -GF AD -GF DP -GF GQ -GF GT -O myVariants.ann.txt

nebulaID="##########" 
sed -i "1s/$nebulaID/myVariants/g" myVariants.ann.txt

Possible variant annotations

The annotated vcf output by SnpEFF has lots of information about how a variant influences molecular phenotypes (not necessarily disease phenotypes, which are explored below with SnpSift and Clinvar). Molecular effects are described by a sequence ontology term and associated with an estimate the magnitude of the functional impact.

annotated_vcf_table <- "C:/Users/jmcgirr/dna/nebula/vcf/myVariants.ann.txt"

ann <- vroom(annotated_vcf_table, guess_max = 1000000)

# table based on SnpEFF documentation
ann_types <- data.frame(impact = c("HIGH","HIGH","HIGH","HIGH","HIGH","HIGH","HIGH","HIGH","HIGH","HIGH","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","MODERATE","LOW","LOW","LOW","LOW","LOW","LOW","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER","MODIFIER"),
                     ontology_term = c("chromosome_number_variation","exon_loss_variant","frameshift_variant","rare_amino_acid_variant","splice_acceptor_variant","splice_donor_variant","start_lost","stop_gained","stop_lost","transcript_ablation","3_prime_UTR_truncation&exon_loss","5_prime_UTR_truncation&exon_loss_variant","coding_sequence_variant-moderate","conservative_inframe_deletion","conservative_inframe_insertion","disruptive_inframe_deletion","disruptive_inframe_insertion","missense_variant","regulatory_region_ablation","splice_region_variant-moderate","TFBS_ablation","5_prime_UTR_premature_start_codon_gain_variant","initiator_codon_variant","splice_region_variant-low","start_retained","stop_retained_variant","synonymous_variant","3_prime_UTR_variant","5_prime_UTR_variant","coding_sequence_variant-modifier","conserved_intergenic_variant","conserved_intron_variant","downstream_gene_variant","exon_variant","feature_elongation","feature_truncation","gene_variant","intergenic_region","intragenic_variant","intron_variant","mature_miRNA_variant","miRNA","NMD_transcript_variant","non_coding_transcript_exon_variant","non_coding_transcript_variant","regulatory_region_amplification","regulatory_region_variant","TF_binding_site_variant","TFBS_amplification","transcript_amplification","transcript_variant","upstream_gene_variant"))

ann_types |> gt() |> tab_header(title = "Putative impact for sequence ontology terms output by SnpEFF") 
Putative impact for sequence ontology terms output by SnpEFF
impact ontology_term
HIGH chromosome_number_variation
HIGH exon_loss_variant
HIGH frameshift_variant
HIGH rare_amino_acid_variant
HIGH splice_acceptor_variant
HIGH splice_donor_variant
HIGH start_lost
HIGH stop_gained
HIGH stop_lost
HIGH transcript_ablation
MODERATE 3_prime_UTR_truncation&exon_loss
MODERATE 5_prime_UTR_truncation&exon_loss_variant
MODERATE coding_sequence_variant-moderate
MODERATE conservative_inframe_deletion
MODERATE conservative_inframe_insertion
MODERATE disruptive_inframe_deletion
MODERATE disruptive_inframe_insertion
MODERATE missense_variant
MODERATE regulatory_region_ablation
MODERATE splice_region_variant-moderate
MODERATE TFBS_ablation
LOW 5_prime_UTR_premature_start_codon_gain_variant
LOW initiator_codon_variant
LOW splice_region_variant-low
LOW start_retained
LOW stop_retained_variant
LOW synonymous_variant
MODIFIER 3_prime_UTR_variant
MODIFIER 5_prime_UTR_variant
MODIFIER coding_sequence_variant-modifier
MODIFIER conserved_intergenic_variant
MODIFIER conserved_intron_variant
MODIFIER downstream_gene_variant
MODIFIER exon_variant
MODIFIER feature_elongation
MODIFIER feature_truncation
MODIFIER gene_variant
MODIFIER intergenic_region
MODIFIER intragenic_variant
MODIFIER intron_variant
MODIFIER mature_miRNA_variant
MODIFIER miRNA
MODIFIER NMD_transcript_variant
MODIFIER non_coding_transcript_exon_variant
MODIFIER non_coding_transcript_variant
MODIFIER regulatory_region_amplification
MODIFIER regulatory_region_variant
MODIFIER TF_binding_site_variant
MODIFIER TFBS_amplification
MODIFIER transcript_amplification
MODIFIER transcript_variant
MODIFIER upstream_gene_variant

My variant annotations

# count the number of variants annotated for each ontology
n_variants <- c()
for(ot in ann_types$ontology_term){
n_variants <- c(n_variants,(filter(ann, grepl(ot, ANN)) |> nrow()))
#print(ot)
}

ann_types$n_variants <- n_variants
ann_types <- arrange(ann_types, n_variants)

ann_types$ontology_term <- factor(ann_types$ontology_term, levels = ann_types$ontology_term)
ann_types$impact <- factor(ann_types$impact, levels = c("HIGH","MODERATE","LOW","MODIFIER"))
ann_types$log_n_variants <- log10(ann_types$n_variants)

p1 <- ann_types|> 
   ggplot(aes(ontology_term, log_n_variants, fill = impact)) +
   geom_col() +
   coord_flip() +
   theme_minimal() +
   theme(axis.title.x=element_text(size=14),
    axis.title.y=element_text(size=12),
    axis.title=element_text(size=14),
    axis.text=element_text(size=12),
    plot.title=element_text(size=18))+
   ylab("\nlog(number of annotated variants)")+ xlab("")+ ylim(0,8)+
   ggtitle("Number of variants showing each snpEFF annotation (log scale)")+
   geom_text(aes(label=n_variants), position=position_stack(), hjust=-0.5)+
   scale_fill_manual(values = c(red,yel,blu,gre))
print(p1)

p1 <- ann_types|> filter(impact %in% c("HIGH"), ontology_term != "missense_variant") |>
   ggplot(aes(ontology_term, n_variants, fill = impact)) +
   geom_col() +
   coord_flip() +
   theme_minimal() +
   theme(axis.title.x=element_text(size=14),
    axis.title.y=element_text(size=14),
    axis.title=element_text(size=18),
    axis.text=element_text(size=12),
    plot.title=element_text(size=18))+
   ylab("\nnumber of annotated variants")+ xlab("")+ylim(0,400)+
   ggtitle("Number of variants showing high impact snpEFF annotations")+
   geom_text(aes(label=n_variants), position=position_stack(), hjust=-0.5)+
   scale_fill_manual(values = c(red,yel,blu,gre))
print(p1)

Investigate high impact variants on SNPedia

Tables show my genotype (on the plus strand) along with genotype quality and allele depth. Links point to SNPedia entries for variants annotated as high impact. It seems only a few entries exist for each annotation type.

# get table with rsids for an ontology term
annotations.for.ontology.term <- function(ontology_term){
 
  ot_table <- filter(ann, grepl(ontology_term, ANN)) 
  anns <- ot_table$ANN
  
  ot_anns <- c()
  ot_alleles <- c()
  for(ann_string in anns){
    ot_anns <- c(ot_anns,paste0(paste(strsplit(grep(ontology_term, strsplit(ann_string, ",")[[1]], value = TRUE)[1], "\\|")[[1]][c(1,2,3,4,7)], collapse = "|"),"|"))
  }
  
  ot_table$ANN <- ot_anns
  ot_table <- separate(ot_table,ANN, c("allele","annotation","impact","gene_name","feature_id"), sep = "\\|", extra = "drop")
  return(ot_table)
}

# search for SNPedia annotations for rsids output by annotations.for.ontology.term()
get.snpedia.urls <- function(ot_table){
  
  snpedia_urls <- c()
  myGTs  <- myADs <- myGQs <- c() 
  ot_table <- filter(ot_table, grepl("rs",ID))
  
  for(rsid in ot_table$ID){
  snpedia_table <- annotations(snp = rsid, output = 'snpedia')
    if(nrow(snpedia_table)>0){
      snpedia_urls <- c(snpedia_urls,gsub("\\s*\\([^\\)]+\\)","",snpedia_table$url[1]))
      myGTs <- c(myGTs, filter(ot_table, ID == rsid) |> pull(myVariants.GT))
      myADs <- c(myADs, filter(ot_table, ID == rsid) |> pull(myVariants.AD))
      myGQs <- c(myGQs, filter(ot_table, ID == rsid) |> pull(myVariants.GQ))

    }
  }
  
  return(data.frame(SNPedia_urls = snpedia_urls, Genotype_PlusOrientation = myGTs,
                    Allele_Depth = myADs, Genotype_Quality = myGQs))
  
}

make.hyperlink <-  function(myurl,mytext=myurl) {
  paste('<a href="',myurl,'">',mytext,'</a>')
}

for(i_ontology_term in (filter(ann_types, impact == "HIGH", n_variants > 0) |> pull(ontology_term))){
print(get.snpedia.urls(annotations.for.ontology.term(i_ontology_term)) |> 
        gt() |> 
        tab_header(title = paste0("SNPedia entries for variants annotated as ", i_ontology_term)) |> 
        fmt (columns = 'SNPedia_urls',fns = make.hyperlink))
}
SNPedia entries for variants annotated as start_lost
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs3764880 G/G 0,116 99
SNPedia entries for variants annotated as stop_lost
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs241448 A/G 36,35 99
SNPedia entries for variants annotated as stop_gained
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs17602729 G/A 85,100 99
http://www.snpedia.com/index.php/Rs6661174 T/T 0,126 99
http://www.snpedia.com/index.php/Rs1815739 T/T 0,113 99
http://www.snpedia.com/index.php/Rs601338 G/A 59,63 99
SNPedia entries for variants annotated as splice_acceptor_variant
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs776746 C/C 0,117 99
http://www.snpedia.com/index.php/Rs3892097 C/T 134,68 99
SNPedia entries for variants annotated as splice_donor_variant
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs2004640 G/G 0,116 99
SNPedia entries for variants annotated as frameshift_variant
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs8176719 TC/TC 0,114 99
http://www.snpedia.com/index.php/Rs2066847 G/GC 87,87 99

Annotating my vcf with SnpSift and Clinvar

SnpSift is automatically installed along with SnpEFF.

SnpSift takes annotations from one vcf and transfers them to matching variants in another vcf.

I annotate my Nebula vcf using a vcf curated by Clinvar. The Clinvar vcf reports the clinical significance of each variant based on supporting literature. I use this to prioritize possible variants of concern (annotated as pathogenic).

Download the Clinvar vcf.

wget https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz

Annotate my vcf with Clinvar vcf and convert to tab delimited table with GATK.

java -jar /media/sf_dna/apps/snpEff/SnpSift.jar annotate /media/sf_dna/clinvar/clinvar.vcf.gz /media/sf_dna/nebula/vcf/NG1T6RKMCV.vcf > myVariants.clinvar.vcf

gatk VariantsToTable -V /media/sf_dna/nebula/vcf/myVariants.dbNSFP.vcf -F CHROM -F POS -F TYPE -F ID -F ALLELEID -F CLNDN -F CLNSIG -F CLNSIGCONF -F CLNSIGINCL -F CLNVC -F GENEINFO -GF AD -GF GQ -GF GT -O myVariants.clinvar.txt

sed -i '1s/NG1T6RKMCV/myVariants/g' myVariants.clinvar.txt

Identify possible pathogenic variants

annotated_vcf_table <- "C:/Users/jmcgirr/dna/nebula/vcf/myVariants.clinvar.txt"

ann <- vroom(annotated_vcf_table, guess_max = 1000000) |>
      filter(!is.na(ALLELEID))
## Rows: 4785183 Columns: 15
## -- Column specification --------------------------------------------------------
## Delimiter: "\t"
## chr (11): CHROM, TYPE, ID, CLNDN, CLNSIG, CLNSIGCONF, CLNSIGINCL, CLNVC, GEN...
## dbl  (4): POS, ALLELEID, myVariants.DP, myVariants.GQ
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
patho <- filter(ann, grepl("Pathogenic",CLNSIG) | grepl("Likely_pathogenic",CLNSIG)) |> separate("ID", c("ID","unknown"), sep = ";")
#unique(patho$CLNSIG)
#nrow(patho)
as.data.frame(patho) |> gt()
CHROM POS TYPE ID unknown ALLELEID CLNDN CLNSIG CLNSIGCONF CLNSIGINCL CLNVC GENEINFO myVariants.AD myVariants.DP myVariants.GQ myVariants.GT
chr1 161223893 SNP rs5082 17936 32975 Familial_hypercholesterolemia_1 Pathogenic NA NA single_nucleotide_variant APOA2:336 0,90 90 99 A/A
chr3 124055231 SNP rs9289231 5458 20497 Coronary_heart_disease_5 Pathogenic NA NA single_nucleotide_variant KALRN:8997 107,87 195 99 T/G
chr8 66178947 SNP rs12721510 38800 47403 Autosomal_dominant_nocturnal_frontal_lobe_epilepsy Pathogenic NA NA single_nucleotide_variant CRH:1392 66,75 141 99 G/T
chr8 91071260 SNP rs1395982289 1216417 1206397 not_provided Likely_pathogenic NA NA single_nucleotide_variant OTUD6B:51633 87,100 190 99 C/T
chr11 5249833 SNP rs368698783 1048099 1036027 Hereditary_persistence_of_fetal_hemoglobin Pathogenic NA NA single_nucleotide_variant HBG1:3047|LOC106099064:106099064 8,6 14 99 C/T
chr16 27344882 SNP rs1805010 14666 29705 Acquired_immunodeficiency_syndrome,_slow_progression_to|Atopy,_resistance_to Pathogenic|_protective NA NA single_nucleotide_variant IL4R:3566 53,40 95 99 A/G
chr17 34252769 SNP rs1024611 14207 29246 Spina_bifida,_susceptibility_to|Mycobacterium_tuberculosis,_susceptibility_to|Coronary_artery_disease,_modifier_of|Coronary_artery_disease,_development_of,_in_hiv Pathogenic|_risk_factor NA NA single_nucleotide_variant CCL2:6347 92,92 185 99 A/G

Investigate pathogenic variants on SNPedia

print(get.snpedia.urls(patho) |> 
      gt() |> 
      tab_header(title = "SNPedia entries for variants annotated as pathoginic or likely pathogenic") |> 
      fmt (columns = 'SNPedia_urls',fns = make.hyperlink))
SNPedia entries for variants annotated as pathoginic or likely pathogenic
SNPedia_urls Genotype_PlusOrientation Allele_Depth Genotype_Quality
http://www.snpedia.com/index.php/Rs5082 A/A 0,90 99
http://www.snpedia.com/index.php/Rs9289231 T/G 107,87 99
http://www.snpedia.com/index.php/Rs12721510 G/T 66,75 99
http://www.snpedia.com/index.php/Rs1805010 A/G 53,40 99
http://www.snpedia.com/index.php/Rs1024611 A/G 92,92 99

R run time and session info

end_time <- Sys.time()
print(end_time - start_time)
## Time difference of 19.4748 mins
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] gt_0.4.0        rsnps_0.5.0.0   vroom_1.5.7     forcats_0.5.1  
##  [5] stringr_1.4.0   dplyr_1.0.8     purrr_0.3.4     readr_2.1.2    
##  [9] tidyr_1.2.0     tibble_3.1.6    ggplot2_3.3.5   tidyverse_1.3.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.8.3     lubridate_1.8.0  assertthat_0.2.1 digest_0.6.29   
##  [5] utf8_1.2.2       plyr_1.8.7       R6_2.5.1         cellranger_1.1.0
##  [9] backports_1.4.1  reprex_2.0.1     evaluate_0.15    highr_0.9       
## [13] httr_1.4.2       pillar_1.7.0     rlang_1.0.2      curl_4.3.2      
## [17] readxl_1.4.0     rstudioapi_0.13  jquerylib_0.1.4  checkmate_2.0.0 
## [21] rmarkdown_2.13   urltools_1.7.3   labeling_0.4.2   triebeard_0.3.0 
## [25] bit_4.0.4        munsell_0.5.0    broom_0.7.12     compiler_4.1.3  
## [29] modelr_0.1.8     xfun_0.30        pkgconfig_2.0.3  htmltools_0.5.2 
## [33] tidyselect_1.1.2 httpcode_0.3.0   bookdown_0.25    fansi_1.0.3     
## [37] crayon_1.5.1     tzdb_0.3.0       dbplyr_2.1.1     withr_2.5.0     
## [41] crul_1.2.0       grid_4.1.3       jsonlite_1.8.0   gtable_0.3.0    
## [45] lifecycle_1.0.1  DBI_1.1.2        magrittr_2.0.3   scales_1.1.1    
## [49] rmdformats_1.0.3 cli_3.2.0        stringi_1.7.6    farver_2.1.0    
## [53] fs_1.5.2         xml2_1.3.3       bslib_0.3.1      ellipsis_0.3.2  
## [57] generics_0.1.2   vctrs_0.4.0      tools_4.1.3      bit64_4.0.5     
## [61] glue_1.6.2       hms_1.1.1        parallel_4.1.3   fastmap_1.1.0   
## [65] yaml_2.3.5       colorspace_2.0-3 rvest_1.0.2      knitr_1.38      
## [69] haven_2.4.3      sass_0.4.1