Assembly using Velvet

Keywords: de novo assembly, Velvet, Galaxy, Microbial Genomics Virtual Lab

Background

Velvet is one of a number of de novo assemblers that take short read sets (e.g. Illumina reads) as input. Its assembly method is based on de Bruijn graphs. For information about Velvet see this link.

In this activity, we will perform a de novo assembly of a short read set using the Velvet assembler.

Learning objectives

At the end of this tutorial you should be able to:

  1. assemble the reads using Velvet, and
  2. examine the output assembly.

Import and view data

If you have completed the previous tutorial on Quality Control, you should already have the required files in your current Galaxy history. If not, see how to get them here.

The data

The read set for today is from an imaginary Staphylococcus aureus bacterium with a miniature genome. The read set for our mutant strain was produced by whole genome shotgun sequencing on an Illumina DNA sequencing instrument.

  • The files we need for assembly are mutant_R1.fastq and mutant_R2.fastq.
  • We don’t need the reference genome sequences for this tutorial.

  • The reads are paired-end.

  • Each read is 150 bases long.

  • The total number of bases sequenced is equivalent to 19 times the length of the wildtype genome, i.e. a read coverage of 19x, which is rather low.

  • Click on the View Data button (the Eye icon) next to one of the FASTQ sequence files.
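
The read coverage quoted above is simply the total number of sequenced bases divided by the genome length. A minimal sketch of that arithmetic follows. The function name and the numbers in the usage line are illustrative assumptions, not values taken from this data set.

```python
def read_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average read depth: total sequenced bases divided by genome length."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


# Hypothetical numbers, chosen only to illustrate the calculation.
coverage = read_coverage(num_reads=12_000, read_length=150, genome_size=100_000)
print(f"{coverage:.1f}x")
```

For paired-end data, num_reads here means the total number of reads across both files, not the number of pairs.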

Assemble reads with Velvet

  • We will perform a de novo assembly of the mutant FASTQ reads into long contiguous sequences (in FASTA format).
  • Velvet requires the user to input a value of k for the assembly process. K-mers are subsequences of length k taken from the sequence reads. Small k-mers give greater connectivity, while large k-mers give better specificity.
  • Go to Tools → NGS Analysis → NGS: Assembly → velvet
  • Set the following parameters (leave other settings as they are):

    • K-mer: Enter the value for k that you have been assigned in the spreadsheet.
    • Input file type: Fastq
    • Single or paired end reads: Paired
    • Select first set of reads: mutant_R1.fastq
    • Select second set of reads: mutant_R2.fastq
  • Your tool interface should look like this (you will most likely have a different value for k):

velvet interface

  • Click Execute
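
While the job runs, it may help to see what the k-mers mentioned above look like. The sketch below splits a single read into all of its k-mers. The function name and the toy read are assumptions made for illustration, and this is not how Velvet itself is implemented. A read of length L contains L - k + 1 k-mers, so larger values of k yield fewer, more specific k-mers per read.

```python
def kmers(read: str, k: int) -> list[str]:
    """Return every substring of length k in the read, in order."""
    if k <= 0 or k > len(read):
        return []
    return [read[i:i + k] for i in range(len(read) - k + 1)]


# Toy read, much shorter than the 150-base reads used in this tutorial.
read = "ATGGCGTGCA"
for k in (3, 7):
    found = kmers(read, k)
    print(k, len(found), found)
```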

Examine the output

  • Galaxy is now running velvet on the reads for you.
  • Press the refresh button in the history pane to see if it has finished.
  • When it is finished, you will have four new files in your history.

    • a Contigs file
    • a Contigs stats file
    • the velvet log file
    • an assembly Last Graph file
  • Click on the View Data button (the Eye icon) on each of the files.

  • The Contigs file will show each contig with the k-mer length and k-mer coverage listed as part of the header (however, these are just called length and coverage).

    • K-mer length: The contig length measured in k-mers, i.e. the number of consecutive k-mers (each shifted by 1 bp from the previous one) that make up the contig for the chosen value of k. The length in bases is this value plus k - 1.
    • K-mer coverage: For the value of k chosen in the assembly, a measure of how many k-mers overlap each base position (in the assembly).

Contigs output
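
As a rough illustration of how these header values can be read programmatically, the sketch below parses a Velvet-style contig header of the form >NODE_1_length_2000_cov_15.5 and converts the k-mer length into a length in bases (k-mer length + k - 1). The function name and the example header are made up for illustration. The value of k must be supplied by you, because it is not stored in the header.

```python
import re

# Velvet contig headers look like: >NODE_<id>_length_<kmer length>_cov_<kmer coverage>
HEADER = re.compile(r"^>NODE_(\d+)_length_(\d+)_cov_([\d.]+)$")


def parse_header(line: str, k: int) -> dict:
    """Extract node id, k-mer length and k-mer coverage from a contig header."""
    match = HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a Velvet contig header: {line!r}")
    kmer_length = int(match.group(2))
    return {
        "node": int(match.group(1)),
        "kmer_length": kmer_length,
        "kmer_coverage": float(match.group(3)),
        # Convert from k-mers to bases using the k chosen for the assembly.
        "length_bp": kmer_length + k - 1,
    }


# Made-up header, assuming the assembly was run with k = 31.
print(parse_header(">NODE_1_length_2000_cov_15.5", k=31))
```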

  • The Contigs stats file will show a list of these k-mer lengths and k-mer coverages.

Contigs stats output

  • We will now summarise the information in the Contigs file.
  • Go to NGS Common Toolsets → FASTA manipulation → Fasta statistics
  • For the required input file, choose the velvet Contigs file.
  • Click Execute.
  • A new file will appear called Fasta summary stats
  • Click the eye icon to look at this file.

Fasta stats

  • Look at:
    • num_seq: the number of contigs in the FASTA file.
    • num_bp: the number of assembled bases. Roughly proportional to genome size.
    • len_max: the biggest contig.
    • len_N50: N50 is a contig size. If the contigs are ordered by size, half of all the assembled nucleotides are in contigs of this size or larger.
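
The N50 definition above can be sketched in a few lines. This is a minimal illustration under the common convention of walking through the contigs from largest to smallest until half of the total bases are covered. The Fasta statistics tool may handle edge cases slightly differently, and the function name is an assumption.

```python
def n50(lengths: list[int]) -> int:
    """Contig length at which half of all assembled bases are in contigs this size or larger."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the largest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise AssertionError("unreachable: the loop always returns")


# Toy contig lengths, not taken from a real assembly.
print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```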

Now copy the relevant data back into the k-mer spreadsheet on your line.

Along with the demonstrator, have a look at the effect of the k-mer size on the output metrics of the assembly. Note that there are local maxima and minima in the charts.

Assembly with Velvet Optimiser

Now that we have seen the effect of k-mer size on the assembly, we will run the Velvet Optimiser to automatically choose the best k-mer size for us. It uses the N50 of each assembly to determine the best k-mer value. It then performs further graph cleaning steps and automatically chooses other parameters for Velvet. We should get a much better assembly than we did in our attempts with Velvet alone.
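
The selection step can be pictured with the sketch below. It is only a conceptual illustration, not the Velvet Optimiser's actual code. It assumes that only odd k values are tried (Velvet works with odd k) and that the k giving the largest N50 wins. Both function names are hypothetical.

```python
def candidate_ks(start: int, end: int) -> list[int]:
    """Odd k values from start to end, inclusive."""
    return [k for k in range(start, end + 1) if k % 2 == 1]


def best_k(n50_by_k: dict[int, int]) -> int:
    """Return the k whose assembly has the largest N50."""
    if not n50_by_k:
        raise ValueError("no assemblies to compare")
    return max(n50_by_k, key=n50_by_k.get)


# The k range used in this tutorial.
print(candidate_ks(45, 73))
```

To try it on the class results, pass best_k a dictionary mapping each k from the spreadsheet to the len_N50 value recorded for it.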

  • Go to Tools → NGS Analysis → NGS: Assembly → Velvet Optimiser
    • Set the following parameters (leave other settings as they are):

      • Start k-mer size: 45
      • End k-mer size: 73
      • Input file type: Fastq
      • Single or paired end reads: Paired
      • Select first set of reads: mutant_R1.fastq
      • Select second set of reads: mutant_R2.fastq

      • Click Execute

Use the Fasta Statistics tool you used earlier to summarise the Velvet Optimiser Contigs output. Examine the resulting table. What are the main differences?