DGE using kallisto

This tutorial is about differential gene expression in bacteria, using tools on the command-line tools (kallisto) and the web (Degust).

Background

Differential Gene Expression (DGE) is the process of determining whether any genes were expressed at a different level between two conditions. For example, the conditions could be wildtype versus mutant, or the same sample from two growth conditions. Usually multiple biological replicates are done for each condition - these are needed to separate variation within the condition from that between the conditions.

There are several ways to test for DGE. All involve these steps:

map/align reads to transcripts
count number of reads per transcript
see if counts differ between conditions

Some tools will combine two of these steps. For example, options include:

Map reads to reference genome with BWA-MEM, count reads per transcript with HTSeq-count, examine DGE using voom/limma (within Galaxy or Degust).
Pseudo-align reads to a reference transcriptome and count, using kallisto, then examine DGE using voom/limma (within Galaxy or Degust).

Learning Objectives

At the end of this tutorial you should be able to:

(Pseudo-)align RNA-Seq data to a reference transcriptome and count: kallisto
Perform statistical analysis to obtain a list of differentially expressed genes: Degust
Visualize and interpret the results

RNA-Seq reads

A typical experiment will have 2 conditions each with 3 replicates, for a total of 6 samples.

Our RNA-seq reads are from 6 samples in FASTQ format.
- We have single-end reads; so one file per sample.
- Data could also be paired-end reads, and there would be two files per sample.
These have been reduced to 1% of their original size for this tutorial.
The experiment used the bacteria E. coli grown in two conditions.
- Files labelled “LB” are the wildtype
- Files labelled “MG” have been exposed to 0.5% αMG - alpha methyglucoside (a sugar solution).

The files are from Study PRJNA194149 from EBI ENA. We are using 3 FASTQ files from the control set (SRR794833-835) and 3 FASTQ files from the experimental condition set (SRR794848-850).

Get data

Login to your GVL. e.g.:

ssh your_username@gvl.genome.edu.au
<enter your password>

Get the files:

wget https://swift.rc.nectar.org.au:8888/v1/AUTH_377/public/Microbial_tutorials/Ecoli_kallisto_files.tar.gz

Uncompress and extract the files:

tar -zxvf Ecoli_kallisto_files.tar.gz

Move into the new directory:

cd Ecoli_kallisto_files

You should have the following files:

6 x RNA-seq reads in fastq.gz
1 x reference transcriptome in .fasta
1 x table of features in .tsv

The reference transcriptome and features table have been produced from a genbank file, using a custom python script.

We need to count the number of RNA-seq reads (that exist as fragments) that match different transcripts in the genome, including those for protein-coding sequences (such as genes) and RNA sequences (such as tRNA and mRNA). Therefore, we need a subset of the whole genome - the reference transcriptome.

Generate counts

Kallisto will count the reads per transcript.

Index the transcripts file

kallisto index -i transcripts.idx Ecoli_transcripts.fasta

transcripts.idx: the name of the output index file
transcripts.ffn: the name of the input fasta file file

Run kallisto for every read set

First, run kallisto for the LB1.fastq.gz reads:

kallisto quant -i transcripts.idx -o LB1 --single -l 500 -s 50 LB1.fastq.gz

-o LB1: LB1 will be name of the output folder produced from this analysis
–single : single-end reads
-l : estimated length of library fragments
-s : estimated standard deviation of library fragments
LB1.fastq.gz : input FASTQ file

Repeat for every FASTQ file (LB2, LB3, MG1, MG2, MG3).

Run as above, but change the name of the output folder, and the file name at the end.
We then need to combine all the counts into one table.

Extract required columns

Each output folder includes an abundance.tsv file. For each of these files, extract the column of counts (column 4) and remove the table heading.

cut -f4 -d$'\t' abundance.tsv | tail -n +2 > LB1_headless.tsv

This cuts column 4, then removes the header, and saves as LB1_headless.tsv.

Then add column heading “LB1” and saves as LB1.tsv:

echo -e "LB1" | cat - LB1_headless.tsv > LB1.tsv

Move LB1.tsv into the main analysis folder.

Repeat with all the other abundance.tsv files from the kallisto analyses, giving the tables the correct LBx or MGx header.

Paste together with features table

paste LB1.tsv LB2.tsv LB3.tsv MG1.tsv MG2.tsv MG3.tsv Ecoli_features.tsv > counts.tsv

Examine the file:

less counts.tsv

There should be a column for every set of RNA-Seq reads, and then several columns of information including feature, name and description.

Download the counts.tsv file to your local computer.

Test for DGE

Degust is a tool on the web that can analyse the counts files produced in the step above, to test for differential gene expression.

(Degust can also display the results from DGE analyses performed elsewhere.)

Upload counts file

Go to the Degust web page. Click Get Started.

Degust webpage image

Click on Choose File.
Select the counts.tsv and click Open.
Click Upload.

A Configuation page will appear.

For Name type DGE in E coli
For Info columns select name
For EC Number column select EC
For Analyze server side leave box checked.
For Min read count put 10
Click Add condition
- Under Replicates, select LB1, LB2, LB3.
Click Add condition again
- Under Replicates, select MG1, MG2, MG3.
Save changes
View - this brings up the Degust viewing window.

Overview of Degust sections

Top black panel with Configure settings at right.
Left: Conditions: LB (control) and MG (treatment).
Left: Method selection for DGE.
Top centre: Plots, with options at right.
When either of the expression plots are selected, a heatmap appears below.
A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).

Degust

Analyze gene expression

Under Method, make sure that Voom/Limma is selected.
Click Apply. This runs Voom/Limma on the uploaded counts.

MDS plot

First, look at the MDS plot.

MDSplot

This is a multidimensional scaling plot which represents the variation between samples.
Ideally:
- All the LB samples would be close to each other
- All the MG samples would be close to each other
- The LB and MG groups would be far apart
The x-axis is the dimension with the highest magnitude. The control/treatment samples should be split along this axis.
Our LB samples are on the left and the MG samples are on the right, which means they are well separated on their major MDS dimension, which looks correct.

Expression - MA plot

Each dot shows the change in expression in one gene.

The average expression (over both condition and treatment samples) is represented on the x-axis.
- Plot points should be symmetrical around the x-axis.
- We can see that many genes are expressed at a low level, and some are highly expressed.
The fold change is represented on the y axis.
- If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
- At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.

Click on the dot to see the gene name.

MAplot

Expression - Parallel Coordinates and heatmap

Each line shows the change in expression in one gene, between control and treatment.

Go to Options at the right.
- For FDR cut-off set at 0.001.
- This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
Look at the Parallel Coordinates plot. There are two axes:
- Left: LB: Gene expression in the control samples. All values are set at zero.
- Right: MG Gene expression in the treatment samples, relative to expression in the control.
The blocks of blue and red underneath the plot are called a heatmap.
- Each block is a gene. Click on a block to see its line in the plot above.
- Look at the row for the Treatment. Relative to the control, genes expressed more are red; genes expressed less are blue.

Parallel Coordinates plot

Note:

for an experiment with multiple treatments, the various treatment axes can be dragged to rearrange. There is no natural order (such as a time series).

Table of genes

name: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
LB and MG: log2(Fold Change) of gene expression. The default display is of fold change in the treatment (MG) relative to the control (LB). Therefore, values in the “LB” column are zero. This can be changed in the Options panel at the top right.
In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.

Kegg Pathway

A pathway is a drawn network to show the interaction between molecules, including some or all of genes, proteins, RNAs, chemical reactions.

Click on Kegg Pathway, and select “Glycolysis”.

kegg

Genes in this pathway will be highlighted as you hover over them elsewhere in Degust (e.g., in the table).

What next?

To learn more about the differentially-expressed genes:

Go to the NCBI website.
Under All Databases, click on Gene
Enter the gene name in the search bar; e.g. ptsG
Click on the first result that matches the species (e.g. in this case, E. coli).
- This provides information about the gene, and may also show further references (e.g. in this case, a link to the EcoGene resource).

Next steps: Investigate the biochemical pathways involving the genes of interest.

Links

Kallisto paper
Kallisto + sleuth paper. Sleuth can test for differential gene expression.
Link to Voom paper