Public data → assembly, annotation, MLST

Overview

Download a readset from a public database
Check the quality of the data and filter
Assemble the reads into a draft genome
Find antibiotic resistance genes
Annotate the genome
Find the sequence type (the MLST)

Background

Sequencing reads (readsets) for more than 100,000 isolates are available on public molecular sequence databases (GenBank/ENA/DDJB):

Most of these have been produced using the Illumina sequencing platform.
Most of these have no corresponding draft assembly.

Not all readsets are of high quality:

There may be insufficient reads (usually ~x20 is the minimum read coverage needed).
The reads could be from a mixed colony.
The classification could be incorrect (both genus and species).

It is VERY important to check that what you find in the readset makes sense!

Import data

Go to your Galaxy instance.
Set up a new History for this Activity.
- In the History panel, click on the cog icon, select Create New.
- A new empty history should appear; click on Unnamed history and re-name it (e.g. ENA Activity).
Choose an accession number.
- If you are working on this tutorial in a workshop: assign yourself a readset from the table of isolates provided. Put your name in Column B. The accession number for the readset that relates to each isolate is located in Column A. ERR019289 will be used in this demonstration.
- Alternatively, use accession number ERR019289. This is Vibrio cholerae.
In Galaxy, go to the Tools panel on the left, select Get Data → EBI SRA.
- This causes the ENA website to open.
- Enter the accession number in the ENA search bar.
(The search may find reads under Experiment and Run. If so, click on the Accession number under “Run”.)
Find the column called Fastq files (galaxy). Click on File 1.

ENA fastq

This file will download to your Galaxy history, and will return you to the Galaxy page.

Repeat the above steps for Get Data → EBI SRA and download File 2.
The files should now be in your Galaxy history.
Click on the pencil icon next to File 1.
- Re-name it ERR019289_1.fastq.gz. Save
- Change the datatype to fastqsanger (note: not fastqCsanger). Save
Repeat for File 2 (name it ERR019289_2.fastq.gz).

Evaluate quality

We will run FastQC on the pair of fastq files.

In the Galaxy tools panel, go to NGS Analysis: NGS QC and manipulation: FastQC.
Choose the Multiple datasets icon and then select both fastq files.
Your Galaxy window should look like this:

galaxy window

Click Execute
The output (4 files) will appear at the top of your Galaxy history.
Click on the eye icon next to FastQC on data 1: Web page
Scroll through the results. Take note of the maximum read length (e.g. 54 bp).

Trim

In this step we will remove adapters and trim low-quality sequence from the reads.

In the Galaxy tools panel, go to NGS Analysis: NGS QC and manipulation: Trimmomatic
Leave settings as they are except for:
- Input FASTQ file R1 - check this is File 1
- Input FASTQ file R2 - check this is File 2
Under Perform initial ILLUMINACLIP step choose Yes
- Under Adapter sequences to use choose Nextera(paired-ended)
- This trims particular adapters from the sequences.
Under Trimmomatic Operation leave the settings as they are.
- We will use the average quality across a 4-base sliding window to identify and delete bad sequence (and the flanking bases to the start or end of the sequences - whichever is nearest to the patch of poor quality sequence)

Your tool interface should look like this:

trimmomatic tool interface

Click Execute

There are four output files.

Because trimmomatic might have trimmed some reads to zero, there are now some reads reads with no pair. These are in the unpaired output files. These can be deleted (with the cross button).
Re-name the other two output files, e.g. as ERRxxxxx_T1.fastq.gz & ERRxxxxx_T2.fastq.gz. These properly paired fastq files will be the input for the Spades assembly.

Assemble

We will assemble the trimmed reads.

In the left hand tools panel, go to NGS Analysis: NGS Assembly: spades.

Leave the parameters as their defaults except:

Careful correction? No
Kmers to use, separated by commas: 21,33,51
- chosen kmers must be shorter than the maximum read length (see the FastQC output: sequence length)
Coverage Cutoff: Off
- using a coverage cutoff might cause a problem if there are high-copy-number plasmids
Forward reads: ERR019289_T1.fastq.gz
Reverse reads: ERR019289_T2.fastq.gz

Your tool interface should look like this:

spades tool interface

Click Execute

There are five output files.

SPAdes contigs (fasta) & SPAdes scaffolds (fasta): The draft genome assembly. (These should be identical with the conditions used here.)
SPAdes contig stats & SPAdes scaffold stats: A list of all the contigs and sizes in each of these files.
SPAdes log: A summary of the assembly run.

Rename SPAdes contigs (fasta) to something like ERR019289.fasta.

Check the size of your draft genome sequence

If you only have a few contigs, you can estimate the size from the SPAdes contig stats file by adding together the contig sizes.
Alternatively, go to NGS Common Toolsets: Fasta Statistics and input the SPAdes contigs (fasta) file. Click Execute. The output will show the draft genome size next to num_bp.

Compare your assembly size to others of the same species

Go to the NCBI website: Genome
Next to Genome , enter the name of your species; e.g. Vibrio cholerae.
Click on Genome ASsembly and Annotation report
View the table. Click on the Size column to sort by size. (Check for additional pages at the bottom right.)
Is your assembly size similar?

Find antibiotic resistance genes

Now that we have our draft genome sequence, we can search for particular genes.

We will use the tool called ABRicate to find antibiotic resistance genes in the genome.
ABRicate uses a database of these genes called ResFinder.

In the tools panel, go to NGS Analysis: NGS Annotation: ABRicate.

For Select fasta file choose SPAdes contigs (fasta) or whatever you renamed it (e.g. ERR019289.fasta).
Click Execute.

There is one output file. Click on the eye icon to view.

This shows a table with one line for each antibiotic resistance gene found, in which contig, at which position, and the % coverage.

abricate results

Find the sequence type (MLST)

Bacterial samples (isolates) are often assigned a “sequence type”. This is a number that defines the particular combination of alleles in that isolate, e.g. ST248.

Because several genes (loci) are used, this is termed Multi-Locus Sequence Typing (MLST).
There are different MLST schemes for different groups of bacteria.

In the tools panel, go to NGS Analysis: NGS Annotation: MLST

Under input_file choose choose SPAdes contigs (fasta) or whatever you renamed it (e.g. ERR019289.fasta).
Note: a specific MLST scheme can be specified if you wish, but by default all schemes are searched
Click Execute.

There is one output file. Click on the eye icon to view.

There is a one line output.

MLST result

Some symbols are used to describe missing or inexact matches to alleles:

n: Exact intact allele
~n : Novel allele similar to n
n,m : Multiple alleles
- : Allele missing

Annotate

We have found a list of resistance genes in the draft sequence, but we can also annotate the whole genome to find all the genes present.

In the tools panel, go to Tools → NGS Analysis → NGS: Annotation → Prokka

Set the following parameters (leave everything else unchanged):

Contigs to annotate: SPAdes contigs (fasta) (or equivalent)
Locus tag prefix (–locustag): P
Force GenBank/ENA/DDJB compliance (–compliant): No
Sequencing Centre ID (–centre): V
Click Execute

There are several output files:

gff: the master annotation in GFF format, containing both sequences and annotations
gbk: a standard GenBank file derived from the master .gff. If the input to prokka was a multi-FASTA, then this will be a multi-GenBank, with one record for each sequence
fna: nucleotide FASTA file of the input contig sequences
faa: protein FASTA file of the translated CDS sequences
ffn: nucleotide FASTA file of all the annotated sequences, not just CDS
sqn: an ASN1 format “Sequin” file for submission to GenBank. It needs to be edited to set the correct taxonomy, authors, related publication, etc.
fsa: nucleotide FASTA file of the input contig sequences, used by “tbl2asn” to create the .sqn file. It is mostly the same as the .fna file, but with extra Sequin tags in the sequence description lines
tbl: Feature Table file, used by “tbl2asn” to create the .sqn file
err: unacceptable annotations - the NCBI discrepancy report
log: contains all the output that Prokka produced during its run
txt: statistics relating to the annotated features found

Tabulate

If you are working on this tutorial as part of a class workshop:

Go to the table of isolates and add information about genome size, GC content, and number of contigs.