Code repo link: 13.1-lncRNA-cross-species.Rmd

Compared lncRNAs across Acropora pulchra (Apul), Porites evermanni (Peve), and Porites tuahiniensis (Ptuh) — to identify:

1:1 homologous lncRNAs between species (pairwise BRHs), and\
1:1:1 homologous lncRNAs across all three species (triplets).

First used strict, protein-like BLAST filters, then repeated the analysis with relaxed, lncRNA-appropriate thresholds, all using the same BLAST results.

Summary

Strict criteria (≥70% identity, ≥50% coverage, ≤ 1e-5) identify a very small set of strongly conserved lncRNAs.

Relaxed criteria (≥50% identity, ≥30% coverage, ≤ 1e-5) expand the set of shared lncRNAs while still enforcing 1:1 and 1:1:1 relationships.

Species	Category	Strict n	Relaxed n	Species total
Apul	Species-only	31,465	31,426	31,491
Apul	Apul–Peve	9	28	31,491
Apul	Apul–Ptuh	16	36	31,491
Apul	Apul–Peve–Ptuh	1	1	31,491
Peve	Species-only	10,074	10,048	10,090
Peve	Apul–Peve	9	28	10,090
Peve	Peve–Ptuh	6	13	10,090
Peve	Apul–Peve–Ptuh	1	1	10,090
Ptuh	Species-only	16,130	16,103	16,153
Ptuh	Apul–Ptuh	16	36	16,153
Ptuh	Peve–Ptuh	6	13	16,153
Ptuh	Apul–Peve–Ptuh	1	1	16,153

Overall, this workflow makes me feel better about the actual process of getting 1:1 and 1:1:1 hits, but I think this might be too stringent to really be meaningful. Since we don’t expect much sequence conservation, it seems silly to apply such strict sequence similarity standards. As you can see the number of conserved sequences is extremely small to the point of what feels like not comparing sequences in a biologically relevant way.

Next steps

I will do some re-runs with settings more similar to E5 deep-dive comparisons
Orthogroup analysis (like used in my previous post) needs further consideration as it likely allows for the expected biological reality of low conservation while still meaningfully comparing lncRNAs
Run blasts for the handful of conserved lncRNAs identified. These might actually be a really cool test of whether or not a small number of lncRNAs are conserved across species. Maybe we can get a hit with a previously described lncRNA of known function!

Breakdown of code for this task

Data download and setup

I started by downloading species-specific lncRNA FASTAs and filtered count matrices from the deep-dive-expression repository, then set up directories and loaded core packages:

library(tidyverse)
library(Biostrings)
library(purrr)

blast_dir      <- "~/github/deep-dive-expression/M-multi-species/output/13-lncRNA-cross-species/blast"
out_table_dir  <- "~/github/deep-dive-expression/M-multi-species/output/13-lncRNA-cross-species/length_expr_tables"
plot_dir       <- "~/github/deep-dive-expression/M-multi-species/output/13-lncRNA-cross-species/plots"

dir.create(out_table_dir, showWarnings = FALSE, recursive = TRUE)
dir.create(plot_dir,      showWarnings = FALSE, recursive = TRUE)

The FASTAs and count files are downloaded with a bash chunk in the Rmd, into:

~/github/deep-dive-expression/M-multi-species/data/13.1-lncRNA-cross-species

Helper functions: length, expression, and BLAST parsing

I defined helpers to extract lncRNA length, mean expression, and to filter BLAST hits:

get_lnc_lengths <- function(fasta_path) {
  dna <- Biostrings::readDNAStringSet(fasta_path)
  ids_raw   <- names(dna)
  ids_clean <- sub("^[^_]+_", "", ids_raw)  # Apul_lncRNA_001 -> lncRNA_001

  tibble(
    lnc_id    = ids_clean,
    length_nt = as.integer(width(dna))
  )
}

get_lnc_expression <- function(counts_path) {
  counts <- readr::read_tsv(counts_path, show_col_types = FALSE)

  if ("Geneid" %in% names(counts)) {
    counts <- counts %>% dplyr::rename(lnc_id = Geneid)
  } else if (!"lnc_id" %in% names(counts)) {
    names(counts)[1] <- "lnc_id"
  }

  meta_cols <- c("lnc_id", "Chr", "Start", "End", "Strand", "Length")

  expr_mat <- counts %>%
    dplyr::select(-dplyr::any_of(meta_cols)) %>%
    dplyr::mutate(across(everything(), as.numeric))

  tibble(
    lnc_id          = counts$lnc_id,
    mean_expr       = rowMeans(expr_mat, na.rm = TRUE),
    log10_mean_expr = log10(rowMeans(expr_mat, na.rm = TRUE) + 1)
  )
}

For BLAST parsing, I used a function that keeps only the single best hit per query, after filtering on percent identity, query coverage, and e-value:

read_blast_best_hits <- function(path,
                                 pident_min = 70,
                                 qcov_min   = 0.5,
                                 evalue_max = 1e-5) {
  empty <- tibble(
    qseqid   = character(),
    sseqid   = character(),
    pident   = numeric(),
    aln_len  = integer(),
    qlen     = integer(),
    slen     = integer(),
    evalue   = numeric(),
    bitscore = numeric(),
    qcov     = numeric()
  )

  if (!file.exists(path)) {
    warning("BLAST file not found: ", path)
    return(empty)
  }

  x <- readr::read_tsv(
    path,
    col_names = c("qseqid", "sseqid", "pident",
                  "aln_len", "qlen", "slen", "evalue", "bitscore"),
    show_col_types = FALSE
  )

  if (nrow(x) == 0) {
    warning("BLAST file is empty: ", path)
    return(empty)
  }

  x %>%
    dplyr::mutate(qcov = aln_len / qlen) %>%
    dplyr::filter(
      pident >= pident_min,
      qcov   >= qcov_min,
      evalue <= evalue_max
    ) %>%
    dplyr::group_by(qseqid) %>%
    dplyr::arrange(
      dplyr::desc(bitscore),
      evalue,
      dplyr::desc(qcov),
      .by_group = TRUE
    ) %>%
    dplyr::slice(1) %>%
    dplyr::ungroup()
}

Building the core lncRNA table

Using the FASTAs and count matrices, I built a combined table with length and expression for each lncRNA:

species_meta <- tibble::tribble(
  ~species, ~fasta, ~counts,
  "Apul", "~/github/deep-dive-expression/M-multi-species/data/13.1-lncRNA-cross-species/Apul-lncRNA.fasta",
  "Peve", "~/github/deep-dive-expression/M-multi-species/data/13.1-lncRNA-cross-species/Peve-lncRNA.fasta",
  "Ptuh", "~/github/deep-dive-expression/M-multi-species/data/13.1-lncRNA-cross-species/Ptuh-lncRNA.fasta"
)

all_species_df <- species_meta %>%
  dplyr::mutate(
    lengths = purrr::map(fasta,  get_lnc_lengths),
    expr    = purrr::map(counts, get_lnc_expression)
  ) %>%
  dplyr::mutate(
    merged = purrr::map2(
      lengths, expr,
      ~ dplyr::left_join(.x, .y, by = "lnc_id")
    )
  )

lnc_df <- all_species_df %>%
  dplyr::select(species, merged) %>%
  tidyr::unnest(merged)

This lnc_df is the base object I keep joining onto throughout.

Strict homology: 1:1 and 1:1:1 using BRHs

To define 1:1 homology between species, I used best reciprocal hits (BRHs). For each species pair, I read in filtered BLAST hits and constructed BRH pairs:

apul_vs_peve <- read_blast_best_hits(file.path(blast_dir, "Apul_vs_Peve.blastn.tsv"))
peve_vs_apul <- read_blast_best_hits(file.path(blast_dir, "Peve_vs_Apul.blastn.tsv"))

apul_vs_ptuh <- read_blast_best_hits(file.path(blast_dir, "Apul_vs_Ptuh.blastn.tsv"))
ptuh_vs_apul <- read_blast_best_hits(file.path(blast_dir, "Ptuh_vs_Apul.blastn.tsv"))

peve_vs_ptuh <- read_blast_best_hits(file.path(blast_dir, "Peve_vs_Ptuh.blastn.tsv"))
ptuh_vs_peve <- read_blast_best_hits(file.path(blast_dir, "Ptuh_vs_Peve.blastn.tsv"))

get_brh_pairs <- function(q_to_s, s_to_q, q_species, s_species) {
  dplyr::inner_join(
    q_to_s %>% dplyr::select(qseqid, sseqid),
    s_to_q %>% dplyr::select(qseqid, sseqid),
    by = c("qseqid" = "sseqid", "sseqid" = "qseqid")
  ) %>%
    dplyr::transmute(
      !!q_species := qseqid,
      !!s_species := sseqid
    ) %>%
    dplyr::mutate(
      !!q_species := sub("^[^_]+_", "", .data[[q_species]]),
      !!s_species := sub("^[^_]+_", "", .data[[s_species]])
    )
}

brh_Apul_Peve <- get_brh_pairs(apul_vs_peve, peve_vs_apul, "Apul", "Peve")
brh_Apul_Ptuh <- get_brh_pairs(apul_vs_ptuh, ptuh_vs_apul, "Apul", "Ptuh")
brh_Peve_Ptuh <- get_brh_pairs(peve_vs_ptuh, ptuh_vs_peve, "Peve", "Ptuh")

Three-way 1:1:1 triplets (Apul–Peve–Ptuh) were then defined as consistent BRHs across all three pairs:

triplets_1to1to1 <- brh_Apul_Peve %>%
  dplyr::inner_join(brh_Apul_Ptuh, by = "Apul") %>%
  dplyr::inner_join(brh_Peve_Ptuh, by = c("Peve", "Ptuh"))

Per-species categories under strict criteria

For each species, every lncRNA was assigned to exactly one category:

species-only\
pairwise shared (Apul–Peve, Apul–Ptuh, or Peve–Ptuh)\
shared across all three species (Apul–Peve–Ptuh)

Example for Apul:

apul_all <- lnc_df %>%
  dplyr::filter(species == "Apul") %>%
  dplyr::pull(lnc_id) %>% unique()

apul_membership <- tibble::tibble(
  species = "Apul",
  lnc_id  = apul_all
) %>%
  dplyr::mutate(
    in_triplet = lnc_id %in% triplets_1to1to1$Apul,
    in_APEVE  = lnc_id %in% brh_Apul_Peve$Apul,
    in_APPT   = lnc_id %in% brh_Apul_Ptuh$Apul,
    category  = dplyr::case_when(
      in_triplet             ~ "Apul_Peve_Ptuh",
      in_APEVE & !in_triplet ~ "Apul_Peve",
      in_APPT  & !in_triplet ~ "Apul_Ptuh",
      TRUE                   ~ "Apul_only"
    )
  )

I did the same for Peve and Ptuh, then combined:

membership_all <- dplyr::bind_rows(
  apul_membership,
  peve_membership,
  ptuh_membership
) %>%
  dplyr::select(species, lnc_id, category)

A quick sanity check confirms that categories sum to the FASTA totals per species:

membership_all %>%
  dplyr::count(species, category, name = "n_lncRNAs") %>%
  dplyr::group_by(species) %>%
  dplyr::mutate(species_total = sum(n_lncRNAs)) %>%
  dplyr::ungroup()

These strict categories were then joined back into lnc_df as sharing_simple.

Relaxed lncRNA-appropriate thresholds (post-hoc)

The strict criteria (≥70% identity, ≥50% coverage) produced very few shared lncRNAs, which is plausible for lncRNAs but too restrictive for exploring broader conservation. To loosen this without rerunning BLAST, I defined relaxed thresholds:

PIDENT_MIN_RELAXED <- 50
QCOV_MIN_RELAXED   <- 0.3
EVALUE_MAX_RELAXED <- 1e-5

Then I wrapped the BLAST reader:

read_blast_best_hits_relaxed <- function(path) {
  read_blast_best_hits(
    path,
    pident_min = PIDENT_MIN_RELAXED,
    qcov_min   = QCOV_MIN_RELAXED,
    evalue_max = EVALUE_MAX_RELAXED
  )
}

Using these new thresholds, I rebuilt pairwise BRHs and 1:1:1 triplets:

apul_vs_peve_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Apul_vs_Peve.blastn.tsv"))
peve_vs_apul_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Peve_vs_Apul.blastn.tsv"))

apul_vs_ptuh_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Apul_vs_Ptuh.blastn.tsv"))
ptuh_vs_apul_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Ptuh_vs_Apul.blastn.tsv"))

peve_vs_ptuh_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Peve_vs_Ptuh.blastn.tsv"))
ptuh_vs_peve_r <- read_blast_best_hits_relaxed(file.path(blast_dir, "Ptuh_vs_Peve.blastn.tsv"))

brh_Apul_Peve_r <- get_brh_pairs(apul_vs_peve_r, peve_vs_apul_r, "Apul", "Peve")
brh_Apul_Ptuh_r <- get_brh_pairs(apul_vs_ptuh_r, ptuh_vs_apul_r, "Apul", "Ptuh")
brh_Peve_Ptuh_r <- get_brh_pairs(peve_vs_ptuh_r, ptuh_vs_peve_r, "Peve", "Ptuh")

triplets_1to1to1_relaxed <- brh_Apul_Peve_r %>%
  dplyr::inner_join(brh_Apul_Ptuh_r, by = "Apul") %>%
  dplyr::inner_join(brh_Peve_Ptuh_r, by = c("Peve", "Ptuh"))

Relaxed category assignment and inspection

I repeated the per-species category assignment using the relaxed BRHs and triplets, storing labels as category_relaxed:

apul_membership_relaxed <- tibble::tibble(
  species = "Apul",
  lnc_id  = apul_all
) %>%
  dplyr::mutate(
    in_triplet = lnc_id %in% triplets_1to1to1_relaxed$Apul,
    in_APEVE  = lnc_id %in% brh_Apul_Peve_r$Apul,
    in_APPT   = lnc_id %in% brh_Apul_Ptuh_r$Apul,
    category_relaxed = dplyr::case_when(
      in_triplet             ~ "Apul_Peve_Ptuh",
      in_APEVE & !in_triplet ~ "Apul_Peve",
      in_APPT  & !in_triplet ~ "Apul_Ptuh",
      TRUE                   ~ "Apul_only"
    )
  )

Peve and Ptuh were handled analogously; then I combined:

membership_relaxed <- dplyr::bind_rows(
  apul_membership_relaxed,
  peve_membership_relaxed,
  ptuh_membership_relaxed
) %>%
  dplyr::select(species, lnc_id, category_relaxed)

Sanity check (again, per-species totals):

membership_relaxed %>%
  dplyr::count(species, category_relaxed, name = "n_lncRNAs") %>%
  dplyr::group_by(species) %>%
  dplyr::mutate(species_total = sum(n_lncRNAs)) %>%
  dplyr::ungroup()

Finally, I joined relaxed labels back into lnc_df:

lnc_df_relaxed <- lnc_df %>%
  dplyr::left_join(
    membership_relaxed %>% dplyr::select(species, lnc_id, category_relaxed),
    by = c("species", "lnc_id")
  )

Visualizing relaxed patterns

To explore how relaxed homology relates to length and expression, I derived a simplified sharing label:

lnc_df_plot_relaxed <- lnc_df_relaxed %>%
  dplyr::mutate(
    species_facet = dplyr::case_when(
      species == "Apul" ~ "italic('A. pulchra')",
      species == "Peve" ~ "italic('P. evermanni')",
      species == "Ptuh" ~ "italic('P. tuahiniensis')",
      TRUE ~ species
    ),
    sharing_simple_relaxed = dplyr::case_when(
      grepl("only$", category_relaxed)     ~ "Species-only",
      category_relaxed == "Apul_Peve_Ptuh" ~ "All three",
      TRUE                                 ~ "Pairwise"
    )
  )

Then I plotted lncRNA length distributions by relaxed sharing pattern, and length vs expression, colored by sharing pattern, in the Rmd using ggplot2.