Variant calling with Snippy

Background

Variant calling is the process of identifying differences between two genome samples. Usually differences are limited to single nucleotide polymorphisms (SNPs) and small insertions and deletions (indels). Larger structural variation such as inversions, duplications and large deletions are not typically covered by “variant calling”.

In this tutorial, we will use the tool “Snippy” (link to Snippy is here). Snippy uses a tool to align the reads to a reference genome, and another tool to decide (“call”) if the discrepancies are real variants.

Learning Objectives

Find variants between a reference genome and a set of reads
Visualise the SNP in context of the reads aligned to the genome
Determine the effect of those variants on genomic features
Understand if the SNP is potentially affecting the phenotype

Prepare reference

For variant calling, we need a reference genome that is of the same strain as the input sequence reads (preferably, as closely related as possible).

For this tutorial, our reference is the wildtype.gbk file and our reads are mutant_R1.fastq and mutant_R2.fastq.

If these files are not in your Galaxy history, import them from the Training dataset page.

Call variants with Snippy

Go to Tools → NGS Analysis → NGS: Variant Analysis → snippy
For Reference type select Genbank.
Then for Reference Genbank choose the wildtype.gbk file.
For Single or Paired-end reads choose Paired.
Then choose the first set of reads, mutant_R1.fastq and second set of reads, mutant_R2.fastq.
For Cleanup the non-snp output files select No.

Your tool interface should look like this:

Snippy interface

Click Execute.

Examine Snippy output

From Snippy, there are 10 output files in various formats.

Go to the file called snippy on data XX, data XX and data XX table and click on the eye icon.
We can see a list of variants. Look in column 3 to see which types the variants are, such as a SNP or a deletion.
Look at the third variant called. This is a T→A mutation, causing a stop codon. Look at column 14: the product of this gene is a methicillin resistance protein. Methicillin is an antibiotic. What might be the result of such a mutation?

View Snippy output in JBrowse

Go to Statistics and Visualisation → Graph/Display Data → JBrowse genome browser.
Under Reference genome to display choose Use a genome from history.
Under Select the reference genome choose wildtype.fna. This sequence will be the reference against which annotations are displayed.
For Produce a Standalone Instance select Yes.
For Genetic Code choose 11: The Bacterial, Archaeal and Plant Plastid Code.
Under JBrowse-in-Galaxy Action choose New JBrowse Instance.
We will now set up three different tracks - these are datasets displayed underneath the reference sequence (which is displayed as nucleotides in FASTA format). We will choose to display the sequence reads (the .bam file), the variants found by snippy (the .gff file) and the annotated reference genome (the wildtype.gff)

Track 1 - sequence reads

Click Insert Track Group
For Track Cateogry name it “sequence reads”
Click Insert Annotation Track
For Track Type choose BAM Pileups
For BAM Track Data select the snippy bam file
For Autogenerate SNP Track select Yes
Under Track Visibility choose On for new users.

Track 2 - variants

Click Insert Track Group again
For Track Category name it “variants”
Click Insert Annotation Track
For Track Type choose GFF/GFF3/BED/GBK Features
For Track Data select the snippy snps gff file
Under Track Visibility choose On for new users.

Track 3 - annotated reference

Click Insert Track Group again
For Track Category name it “annotated reference”
Click Insert Annotation Track
For Track Type choose GFF/GFF3/BED/GBK Features
For Track Data select wildtype.gff
Under JBrowse Track Type [Advanced] select Canvas Features.
Click on JBrowse Styling Options [Advanced]
Under JBrowse style.description add in the word product so that it says note,description,product
Under Track Visibility choose On for new users.
Click Execute
A new file will be created, called JBrowse on data XX and data XX - Complete. Click on the eye icon next to the file name. The JBrowse window will appear in the centre Galaxy panel.
On the left, tick boxes to display the tracks
Use the minus button to zoom out to see:
- sequence reads and their coverage (the grey graph)
Use the plus button to zoom in to see:
- probable real variants (a whole column of snps)
- probable errors (single one here and there)

JBrowse screenshot

In the coordinates box, type in 47299 and then Go to see the position of the SNP discussed above.
- the correct codon at this position is TGT, coding for the amino acid Cysteine, in the middle row of the amino acid translations.
- the mutation of T → A turns this triplet into TGA, a stop codon.

JBrowse screenshot