Assembly with PacBio data and SMRT Portal

Keywords: de novo assembly, PacBio, Pacific Biosciences, HGAP, SMRT Portal, Microbial Genomics Virtual Laboratory

This tutorial will show you how to assemble a bacterial genome de novo, using the PacBio SMRT Portal on the mGVL. We will use an analysis pipeline called HGAP, the Hierarchical Genome Assembly Process.

Start

Open your mGVL dashboard.
You should see SMRT Portal as one of the instance services on your GVL dashboard.
Open the SMRT Portal web link (to the right) and register or log on.

Input

We will use a dataset from a Streptococcus pyogenes bacterium.

If this has already been loaded onto SMRT Portal (e.g. for use during a workshop), proceed to the next step (“Assembly”).

Otherwise:

Load the PacBio data (your own, or the training dataset) onto your GVL.
In the SMRT Portal, go to Design Job, the top left tab.
Go to Import and Manage.
Click Import SMRT cells.
Work out where you put the data on your GVL, and make sure the file path is showing.
- If not, click Add and enter the file path to the data.
- A SMRT cell is the collection of data from a particular cell in the machine. It includes .bax.h5 files.
Click on the file path and then Scan to check for new data.
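
Before scanning, it can help to confirm that the directory you are about to add actually contains .bax.h5 files. The following is a minimal sketch, not part of SMRT Portal: the function name find_bax_files is hypothetical, and it only assumes the data sits somewhere under a directory you can read on your GVL.

```python
from pathlib import Path


def find_bax_files(smrt_cell_dir):
    """Return a sorted list of .bax.h5 files found anywhere under smrt_cell_dir.

    Returns an empty list if the directory does not exist.
    """
    root = Path(smrt_cell_dir)
    if not root.is_dir():
        return []
    # rglob searches the directory and all of its subdirectories.
    return sorted(root.rglob("*.bax.h5"))


# Example: check the current directory (replace "." with your own data path).
for path in find_bax_files("."):
    print(path)
```

If nothing is printed for the path you plan to enter in SMRT Portal, the Scan step is unlikely to find new data there either.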

Assembly

HGAP process overview

We will use the Hierarchical Genome Assembly Process (HGAP). This flowchart shows the steps in the process:

flowchart of HGAP process

Set up job

In the SMRT Portal, go to the top left tab, Design Job.
Go to Create New.
An Analysis window should appear. Tick all the boxes, then click Next.
Under Job Name enter a name.
To the right, under Groups choose all.
Under Protocols choose RS_HGAP_Assembly.3.
There is an ellipsis underneath Protocols - click on the ellipsis.

smrt portal screenshot

This brings up the settings. Click on Assembly.

For Compute Minimum Seed Read Length: ensure the box is ticked

For Number of Seed Read Chunks: enter 12
Change the Genome Size to an approximately correct size for the species. For S. pyogenes, enter 1800000.
For Target Coverage: enter 10
For Overlapper Error Rate: enter 0.04
Leave all other settings as they are.
Click Apply

Your protocol window should look like this:

smrt portal screenshot

Click Ok.
In the SMRT Cells Available window, select the files to be used. Click on the arrow to transfer these files to the SMRT Cells in Job window.
You can drag the column widths of the “Url” column so that you can see the URLs of the file paths better.

smrt portal screenshot

Click Save (bottom right hand side).
Next to Save, click Start.
The Monitor Jobs window should open.
- As each step proceeds, new items will appear under the Reports and Data tabs on the left.

smrt portal screenshot
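
To get a feel for what the Genome Size and Target Coverage settings imply, the arithmetic can be sketched as below. This assumes Target Coverage is read as a fold-coverage multiplier over Genome Size; the function name bases_for_coverage is hypothetical and is not part of SMRT Portal.

```python
# Values entered in the Assembly settings of the protocol.
GENOME_SIZE = 1_800_000   # approximate S. pyogenes genome size, in bases
TARGET_COVERAGE = 10      # the "Target Coverage" setting


def bases_for_coverage(genome_size, fold_coverage):
    """Number of bases needed to cover a genome at the given fold coverage."""
    return genome_size * fold_coverage


print(bases_for_coverage(GENOME_SIZE, TARGET_COVERAGE))  # 18000000
```

Under that assumption, a target coverage of 10 on a 1.8 Mb genome corresponds to about 18 million bases.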

Inputs and Outputs

The connections between the names of assembly stages and outputs are not always clear. This flowchart shows how each stage of the HGAP process corresponds to protocol window names and outputs:

inputs and outputs

Results

If the job is still running, click on the centre tab Monitor Jobs. Otherwise, click on the top right tab, View Data.

Double click on the job name to open its reports.
Click on different Reports in the left hand panel.

Things to look at:

General: Filtering (polymerase reads)

number of reads post-filter
read length (average)

filtering results

General: Subread Filtering (subreads)

number of reads post-filter
read length (average)

subreads results
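
The numbers in the two filtering reports (read count and average read length) can be related to expected coverage with a little arithmetic. The sketch below is illustrative only: summarise_reads is a hypothetical helper, and the read lengths in the example are made up rather than taken from the training dataset.

```python
def summarise_reads(lengths, genome_size):
    """Summarise a list of read lengths (in bases) against a genome size."""
    n_reads = len(lengths)
    total_bases = sum(lengths)
    mean_length = total_bases / n_reads if n_reads else 0.0
    return {
        "reads": n_reads,
        "total_bases": total_bases,
        "mean_length": mean_length,
        # Fold coverage = total sequenced bases / genome size.
        "fold_coverage": total_bases / genome_size,
    }


# Made-up example: three reads against a 1.8 Mb genome.
print(summarise_reads([9000, 12000, 15000], genome_size=1_800_000))
```

With real data, multiplying the post-filter read count by the average read length and dividing by the genome size gives the same rough fold-coverage estimate.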

Assembly: Pre-Assembly (pre-assembled reads)

length cutoff (the computed minimum seed read length)
read length (average)

preassembly results
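
To build intuition for the length cutoff, here is one plausible way such a cutoff could be derived: take the longest reads first until they add up to a chosen fold coverage of the genome, and use the length of the last read taken as the cutoff. This is an assumption for illustration, not a description of how HGAP actually computes it; seed_length_cutoff and its seed_coverage parameter are hypothetical.

```python
def seed_length_cutoff(lengths, genome_size, seed_coverage):
    """Return the shortest read length needed so that all reads at least
    that long total genome_size * seed_coverage bases.

    Returns None if the reads do not contain enough bases in total.
    """
    target_bases = genome_size * seed_coverage
    running_total = 0
    # Walk through the reads from longest to shortest.
    for length in sorted(lengths, reverse=True):
        running_total += length
        if running_total >= target_bases:
            return length
    return None


# Toy example: a 1,000-base "genome" and a 9x seed coverage target.
print(seed_length_cutoff([1000, 2000, 3000, 4000, 5000], 1000, 9))  # 4000
```

In the toy example, the two longest reads (5000 + 4000 bases) reach the 9,000-base target, so the cutoff is 4000.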

Assembly: Corrections

Consensus calling results:

Consensus concordance should be > 99%.

Graph: corrections across reference:

With the first run of polishing, we expect a lot of corrections but they should be randomly distributed.

corrections results

Note: only unitigs 0 and 1 shown.
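
One rough way to check whether corrections are spread out rather than clustered is to count them in fixed-size windows along a contig and look for windows with far more than the average. The sketch below is a simple heuristic under stated assumptions: positions are 0-based and smaller than the contig length, and the window size and the factor of 3 are arbitrary illustrative choices. Both function names are hypothetical.

```python
def corrections_per_window(positions, contig_length, window=10_000):
    """Count corrections in consecutive windows along a contig."""
    n_windows = (contig_length + window - 1) // window  # round up
    counts = [0] * n_windows
    for pos in positions:
        counts[pos // window] += 1
    return counts


def crowded_windows(counts, factor=3.0):
    """Return indices of windows with more than factor * mean corrections."""
    if not counts:
        return []
    mean = sum(counts) / len(counts)
    return [i for i, c in enumerate(counts) if mean > 0 and c > factor * mean]


# Made-up correction positions on a 50 kb contig.
positions = [500, 12_000, 12_100, 12_200, 12_300, 12_400, 12_500, 12_600,
             31_000, 45_000]
counts = corrections_per_window(positions, contig_length=50_000)
print(counts)                   # [1, 7, 0, 1, 1]
print(crowded_windows(counts))  # [1]
```

A window flagged this way is only a prompt to look more closely at that region in the corrections graph.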

Assembly: Top Corrections

This is a list of all the corrections made.

top corrections results

Note: only first 15 shown.

Resequencing: Coverage

Coverage across reference:

Discard contigs with less than 20X coverage.
The remaining contigs should have fairly consistent coverage.
Spikes could be collapsed repeats.
Valleys could indicate mis-assembly, e.g. the draft assembly was incorrect, so the remapped reads did not support this part of the assembly.

resequencing coverage results
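
The coverage checks above can be sketched in code. This is a minimal illustration, not a SMRT Portal feature: the contig names and coverage values are made up, the function names are hypothetical, and the spike and valley thresholds (2x and 0.5x the median) are arbitrary assumptions. Only the 20X cutoff comes from the guidance above.

```python
from statistics import median


def split_by_coverage(mean_coverage, min_coverage=20):
    """Split contigs into (keep, discard) by mean coverage."""
    keep = {name: cov for name, cov in mean_coverage.items()
            if cov >= min_coverage}
    discard = {name: cov for name, cov in mean_coverage.items()
               if cov < min_coverage}
    return keep, discard


def flag_windows(window_coverage, spike_factor=2.0, valley_factor=0.5):
    """Return (spikes, valleys): window indices far above or below the median."""
    if not window_coverage:
        return [], []
    mid = median(window_coverage)
    spikes = [i for i, c in enumerate(window_coverage) if c > spike_factor * mid]
    valleys = [i for i, c in enumerate(window_coverage) if c < valley_factor * mid]
    return spikes, valleys


# Made-up mean coverage per contig.
keep, discard = split_by_coverage(
    {"unitig_0": 85.2, "unitig_1": 64.0, "unitig_2": 7.5})
print(sorted(keep))     # ['unitig_0', 'unitig_1']
print(sorted(discard))  # ['unitig_2']

# Made-up per-window coverage along one contig.
print(flag_windows([60, 62, 58, 150, 61, 20, 59]))  # ([3], [5])
```

Flagged windows are candidates for closer inspection in the coverage graph: a spike may be a collapsed repeat and a valley a possible mis-assembly.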