DGE using kallisto
This tutorial is about differential gene expression in bacteria, using tools on the command-line tools (kallisto) and the web (Degust).
Differential Gene Expression (DGE) is the process of determining whether any genes were expressed at a different level between two conditions. For example, the conditions could be wildtype versus mutant, or the same sample from two growth conditions. Usually multiple biological replicates are done for each condition - these are needed to separate variation within the condition from that between the conditions.
There are several ways to test for DGE. All involve these steps:
- map/align reads to transcripts
- count number of reads per transcript
- see if counts differ between conditions
Some tools will combine two of these steps. For example, options include:
Map reads to reference genome with BWA-MEM, count reads per transcript with HTSeq-count, examine DGE using voom/limma (within Galaxy or Degust).
Pseudo-align reads to a reference transcriptome and count, using kallisto, then examine DGE using voom/limma (within Galaxy or Degust).
At the end of this tutorial you should be able to:
- (Pseudo-)align RNA-Seq data to a reference transcriptome and count: kallisto
- Perform statistical analysis to obtain a list of differentially expressed genes: Degust
- Visualize and interpret the results
A typical experiment will have 2 conditions each with 3 replicates, for a total of 6 samples.
- Our RNA-seq reads are from 6 samples in
- We have single-end reads; so one file per sample.
- Data could also be paired-end reads, and there would be two files per sample.
- These have been reduced to 1% of their original size for this tutorial.
- The experiment used the bacteria E. coli grown in two conditions.
- Files labelled “LB” are the wildtype
- Files labelled “MG” have been exposed to 0.5% αMG - alpha methyglucoside (a sugar solution).
The files are from Study PRJNA194149 from EBI ENA. We are using 3 FASTQ files from the control set (SRR794833-835) and 3 FASTQ files from the experimental condition set (SRR794848-850).
Login to your GVL. e.g.:
ssh firstname.lastname@example.org <enter your password>
Get the files:
Uncompress and extract the files:
tar -zxvf Ecoli_kallisto_files.tar.gz
Move into the new directory:
You should have the following files:
- 6 x RNA-seq reads in
- 1 x reference transcriptome in
- 1 x table of features in
The reference transcriptome and features table have been produced from a genbank file, using a custom python script.
We need to count the number of RNA-seq reads (that exist as fragments) that match different transcripts in the genome, including those for protein-coding sequences (such as genes) and RNA sequences (such as tRNA and mRNA). Therefore, we need a subset of the whole genome - the reference transcriptome.
Kallisto will count the reads per transcript.
Index the transcripts file
kallisto index -i transcripts.idx Ecoli_transcripts.fasta
transcripts.idx: the name of the output index file transcripts.ffn: the name of the input fasta file file
Run kallisto for every read set
First, run kallisto for the
kallisto quant -i transcripts.idx -o LB1 --single -l 500 -s 50 LB1.fastq.gz
- -o LB1: LB1 will be name of the output folder produced from this analysis
- –single : single-end reads
- -l : estimated length of library fragments
- -s : estimated standard deviation of library fragments
LB1.fastq.gz: input FASTQ file
Repeat for every FASTQ file (LB2, LB3, MG1, MG2, MG3).
- Run as above, but change the name of the output folder, and the file name at the end.
- We then need to combine all the counts into one table.
Extract required columns
Each output folder includes an
cut -f4 -d$'\t' abundance.tsv | tail -n +2 > LB1_headless.tsv
This cuts column 4, then removes the header, and saves as
Then add column heading “LB1” and saves as
echo -e "LB1" | cat - LB1_headless.tsv > LB1.tsv
Repeat with all the other
Paste together with features table
paste LB1.tsv LB2.tsv LB3.tsv MG1.tsv MG2.tsv MG3.tsv Ecoli_features.tsv > counts.tsv
Examine the file:
There should be a column for every set of RNA-Seq reads, and then several columns of information including feature, name and description.
Test for DGE
Degust is a tool on the web that can analyse the counts files produced in the step above, to test for differential gene expression.
(Degust can also display the results from DGE analyses performed elsewhere.)
Upload counts file
Go to the Degust web page. Click
- Click on
- Select the
counts.tsvand click Open.
A Configuation page will appear.
Nametype DGE in E coli
Info columnsselect name
EC Number columnselect EC
Analyze server sideleave box checked.
Min read countput 10
Replicates, select LB1, LB2, LB3.
Replicates, select MG1, MG2, MG3.
Save changes View- this brings up the Degust viewing window.
Overview of Degust sections
- Top black panel with
Configuresettings at right.
- Left: Conditions: LB (control) and MG (treatment).
- Left: Method selection for DGE.
- Top centre: Plots, with options at right.
- When either of the expression plots are selected, a heatmap appears below.
- A table of genes (or features); expression in treatment relative to control (Treatment column); and significance (FDR column).
Analyze gene expression
Method, make sure that Voom/Limmais selected.
Apply. This runs Voom/Limma on the uploaded counts.
First, look at the MDS plot.
- This is a multidimensional scaling plot which represents the variation between samples.
- All the LB samples would be close to each other
- All the MG samples would be close to each other
- The LB and MG groups would be far apart
- The x-axis is the dimension with the highest magnitude. The control/treatment samples should be split along this axis.
- Our LB samples are on the left and the MG samples are on the right, which means they are well separated on their major MDS dimension, which looks correct.
Expression - MA plot
Each dot shows the change in expression in one gene.
- The average expression (over both condition and treatment samples) is represented on the x-axis.
- Plot points should be symmetrical around the x-axis.
- We can see that many genes are expressed at a low level, and some are highly expressed.
- The fold change is represented on the y axis.
- If expression is significantly different between treatment and control, the dots are red. If not, they are blue. (In Degust, significant means FDR <0.05).
- At low levels of gene expression (low values of the x axis), fold changes are less likely to be significant.
Click on the dot to see the gene name.
Expression - Parallel Coordinates and heatmap
Each line shows the change in expression in one gene, between control and treatment.
- Go to
Optionsat the right.
FDR cut-offset at 0.001.
- This is a significance level (an adjusted p value). We will set it quite low in this example, to ensure we only examine key differences.
Look at the Parallel Coordinates plot. There are two axes:
- Left: LB: Gene expression in the control samples. All values are set at zero.
- Right: MG Gene expression in the treatment samples, relative to expression in the control.
The blocks of blue and red underneath the plot are called a heatmap.
- Each block is a gene. Click on a block to see its line in the plot above.
- Look at the row for the Treatment. Relative to the control, genes expressed more are red; genes expressed less are blue.
- for an experiment with multiple treatments, the various treatment axes can be dragged to rearrange. There is no natural order (such as a time series).
Table of genes
- name: names of genes. Note that gene names are sometimes specific to a species, or they may be only named as a locus ID (a chromosomal location specified in the genome annotation).
- FDR: False Discovery Rate. This is an adjusted p value to show the significance of the difference in gene expression between two conditions. Click on column headings to sort. By default, this table is sorted by FDR.
- LB and MG: log2(Fold Change) of gene expression. The default display is of fold change in the treatment (MG) relative to the control (LB). Therefore, values in the “LB” column are zero. This can be changed in the
Optionspanel at the top right.
- In some cases, a large fold change will be meaningful but in others, even a small fold change can be important biologically.
A pathway is a drawn network to show the interaction between molecules, including some or all of genes, proteins, RNAs, chemical reactions.
- Click on Kegg Pathway, and select “Glycolysis”.
- Genes in this pathway will be highlighted as you hover over them elsewhere in Degust (e.g., in the table).
To learn more about the differentially-expressed genes:
- Go to the NCBI website.
All Databases, click on Gene
- Enter the gene name in the search bar; e.g. ptsG
- Click on the first result that matches the species (e.g. in this case, E. coli).
- This provides information about the gene, and may also show further references (e.g. in this case, a link to the EcoGene resource).
Next steps: Investigate the biochemical pathways involving the genes of interest.