Using the limma-voom workflow for RNA sequencing data analysis
- 16th January 2026
- Posted by: Breige McBride
- Categories: Bioinformatics, Sequencing
At Fios Genomics, our RNA sequencing (RNA-seq) FGEZ workflow involves processing RNA-seq data to assess gene expression differences between groups of interest. Variability in library size, composition bias and inflated variance in low-expressed genes pose challenges for data analysis, thus transformations and statistical methods tailored to RNA-seq data must be used.
Understanding the limma-voom workflow in our RNA sequencing FGEZ reports
Data transformations for exploratory data analysis
Raw RNA-seq reads are aligned to a reference genome or transcriptome to produce gene level integer counts. To compare gene expression across samples, the counts must be normalised to account for different library sizes (total number of reads) per sample. Counts per million (CPM) reads normalisation divides raw counts by the sample library size, multiplies by one million and converts the values to a logarithmic scale. Depending on the method, the details of the calculation may vary. For example, the log2-transformed CPM reads generated by the function cpm in the R package edgeR (Chen et al., Nucleic Acids Research, 2025) differ from those generated by the voom function from limma (Ritchie et al., Nucleic Acids Research, 2015).
A further challenge in RNA-seq analysis is the mean–variance relationship inherent to the data: variance is not constant but depends on expression level (heteroscedasticity), even after CPM transformation. This violates the homoscedasticity assumptions of many statistical models and leads to inflated and unreliable variance estimates for lowly expressed genes. Consequently, genes with low or zero expression across samples should be filtered, for example using filterByExpr from edgeR, prior to applying a variance stabilising transformation. DESeq2 (Love et al., Genome Biology, 2014) offers two methods: a variance stabilising transformation (vst) and a regularised logarithmic transformation (rlog). The vst function transforms RNA-seq data in such a way that the variance becomes approximately constant. The rlog transformation results in values close to log2-transformed CPM reads for genes with high expression, and close to the average expression for genes with low expression. The resultant data are suitable for exploratory analyses such as principal component analysis and pairwise correlation analysis.
Data transformations for differential expression analysis
Although rlog and vst address the mean–variance relationship, they do not correct for composition bias. Composition bias arises when highly expressed genes distort apparent expression levels of other genes and observed differences reflect changes in RNA composition rather than true differential expression. Based on the assumption that most genes are not differentially expressed, the Trimmed Mean of M-values (TMM) method corrects for this bias. The function calcNormFactors from edgeR compares each sample to a reference sample, excludes genes with extreme fold changes or with very low or very high expression, and calculates a scaling factor based on the remaining genes. Actual library size is scaled by this factor resulting in the effective library size. The limma function voom then generates log-scale CPM reads by adding an offset of 0.5 to the raw counts, dividing by the effective library size plus one, multiplying by one million, followed by log2-transformation.
Because voom’s log₂-CPM reads exhibit the mean–variance relationship, the function also generates per-gene, per-sample precision weights based on the estimated mean–variance trend, giving more weight to low-variance observations. The log₂-CPM values and associated weights can then be used to generate a linear regression model per gene, where gene expression is expressed as a function of the variable of interest, which is solved using limma’s lmFit.
Differential expression analysis
Using the model coefficients generated by lmFit, contrasts.fit estimates the difference in mean expression between the groups of interest. The function ebayes then applies an empirical Bayes method to improve variance estimates by borrowing information from across all genes, shrinking the gene-wise variance estimates towards a common value. This improves statistical reliability, especially for small sample sizes. Finally, ebayes calculates a moderated t-statistic using the shrunken variance estimates and the associated significance value.
Conclusions
The limma-voom workflow overcomes the key challenges of working with RNA-seq data. Understanding each step of the workflow is key to being confident that the results we report are based on sound data analysis. At Fios Genomics, we are committed to handling your data correctly and accurately, always using best practices. If you would like to discuss the above in more detail or if you require RNA-seq data analysis, contact us and we’ll be happy to help!
Authors: Fios Genomics Bioinformaticians
See Also:
Linear Mixed Effects Models
Bioinformatics Report Example – Your Data Explained
Compare Multiple Drug Targets at Once
