Full code in E5 deep-dive-expression repo here
Overview
This post documents a complete R-based pipeline for analyzing the correlation between long non-coding RNAs (lncRNAs) and mRNAs across three marine invertebrate species: Apul, Peve, and Ptuh. The analysis includes:
- Reading in raw count matrices for lncRNAs and mRNAs
- Cleaning and harmonizing sample names
- Filtering low-expression genes
- Applying DESeq2 normalization and variance-stabilizing transformation
- Computing Pearson correlations in a block-wise fashion using RAM-efficient matrices
- Exporting full and significant correlation results
- Visualizing the top 20 significant lncRNA–mRNA correlations via heatmaps
1. Libraries and Setup
All three analyses begin by loading the necessary R packages:
library(data.table)
library(DESeq2)
library(WGCNA) # for correlation p-values
if (!requireNamespace("bigmemory", quietly = TRUE)) install.packages("bigmemory")
library(bigmemory)
library(dplyr)
2. Data Input and Cleaning
Apul
- Loads lncRNA and mRNA count matrices from the local filesystem
- Removes blank Gene IDs and enforces uniqueness
- Harmonizes sample names using regex
- Rounds Kallisto-generated mRNA counts
Peve and Ptuh
- Uses
curl
to download mRNA matrices from GitHub - Performs the same harmonization and cleaning steps
- Each sample name is standardized via regex from raw filenames
3. Matching and Filtering
For all three species: - Retain only samples common to both lncRNA and mRNA matrices - Apply filters to exclude genes expressed in fewer than 3 samples with fewer than 10 counts
min_samples <- 3
min_counts <- 10
keep_lnc <- rowSums(lnc_counts >= min_counts) >= min_samples
keep_mrna <- rowSums(mrna_counts >= min_counts) >= min_samples
4. DESeq2 Normalization and VST
DESeq2 is used to normalize combined count matrices and extract variance-stabilized data:
dds <- DESeqDataSetFromMatrix(countData = combined,
colData = data.frame(cond = factor(rep("all", ncol(combined))),
row.names = colnames(combined)),
design = ~1)
dds <- estimateSizeFactors(dds)
vst_mat <- assay(vst(dds, blind = TRUE))
5. Pearson Correlation (Block-wise)
To manage memory, correlations are calculated in blocks:
for (start in seq(1, nr_lnc, by = block_size)) {
...
cor_mat [start:end, ] <- rblk
cor_padj[start:end, ] <- p.adjust(pblk, method = "BH")
}
- Correlations are calculated using
cor()
- P-values adjusted with Benjamini–Hochberg (BH) procedure
6. Export Results
- Full correlation and adjusted p-value matrices are exported to
.tsv
- Significant pairs (|r| ≥ 0.7 & padj ≤ 0.05) are extracted and saved separately
fwrite(sig, "significant_pairs.tsv")
7. Visualization
Top 20 most significant lncRNA–mRNA pairs are visualized using a heatmap:
ggplot(heat_long, aes(x = mRNA, y = lncRNA, fill = correlation)) +
geom_tile(color = "white") +
scale_fill_gradient2(...) +
labs(title = "Top lncRNA–mRNA Correlations")
8. Session Info
Each species’ script ends with:
sessionInfo()
This ensures reproducibility and version tracking.
Summary
This workflow efficiently computes transcript correlations across three species while accounting for memory constraints. By combining DESeq2 and WGCNA in a block-wise fashion and harmonizing sample names from disparate pipelines, we ensure robust comparisons of lncRNA-mRNA co-expression patterns across marine invertebrate datasets.