TL;DR - Fixing the system environment settings for gffcompare and correcting formatting issues with our GTFs appear to have resolved the issues with the pipeline. The number of putative lncRNAs is now much lower (more in line with the deep-dive descriptive numbers), and no duplicates are detected in the BED, GTF, FASTA, or count files.

Ptuh lncRNA Pipeline: Step-by-Step Explanations

Below is a detailed breakdown of each major section of the Ptuh lncRNA pipeline, describing what each step does and why it’s important.


1. Variables and Environment Setup

# Global R options
knitr::opts_chunk$set(echo = TRUE)

# Define key paths and tool directories
DATA_DIR     <- "../data/01.6-Ptuh-lncRNA-pipeline"
OUTPUT_DIR   <- "../output/01.6-Ptuh-lncRNA-pipeline"
THREADS      <- "24"
FASTQ_SOURCE <- "…/P_meandrina/trimmed/"
FASTQ_SUFFIX <- "fastq.gz"
GENOME_SOURCE <- "…/Pocillopora_meandrina_HIv1.assembly.fasta"
GTF_SOURCE   <- "…/Pocillopora_meandrina_HIv1.genes-validated.gtf"
GFF_SOURCE   <- "…/Pocillopora_meandrina_HIv1.genes-validated.gff3"
GFFPATTERN   <- 'class_code "u"|class_code "x"|class_code "o"|class_code "i"'

HISAT2_DIR     <- "/home/shared/hisat2-2.2.1/"
SAMTOOLS_DIR   <- "/home/shared/samtools-1.12/"
STRINGTIE_DIR  <- "/home/shared/stringtie-2.2.1.Linux_x86_64"
GFFCOMPARE_DIR <- "/home/shared/gffcompare-0.12.6.Linux_x86_64"
BEDTOOLS_DIR   <- "/home/shared/bedtools2/bin"
CPC2_DIR       <- "/home/shared/CPC2_standalone-1.0.1/bin"
CONDA_PATH     <- "/opt/anaconda/anaconda3/bin"

GENOME_FASTA <- file.path(DATA_DIR, "genome.fasta")
GENOME_GTF   <- file.path(DATA_DIR, "genome.gtf")
GENOME_GFF   <- file.path(DATA_DIR, "genome.gff")
FASTQ_DIR    <- file.path(DATA_DIR, "fastq")
GENOME_INDEX <- file.path(OUTPUT_DIR, "genome.index")

Sys.setenv(
  THREADS=THREADS, DATA_DIR=DATA_DIR, OUTPUT_DIR=OUTPUT_DIR,
  FASTQ_SOURCE=FASTQ_SOURCE, FASTQ_SUFFIX=FASTQ_SUFFIX,
  GENOME_SOURCE=GENOME_SOURCE, GTF_SOURCE=GTF_SOURCE,
  GFF_SOURCE=GFF_SOURCE, GFFPATTERN=GFFPATTERN,
  HISAT2_DIR=HISAT2_DIR, SAMTOOLS_DIR=SAMTOOLS_DIR,
  STRINGTIE_DIR=STRINGTIE_DIR, GFFCOMPARE_DIR=GFFCOMPARE_DIR,
  BEDTOOLS_DIR=BEDTOOLS_DIR, CPC2_DIR=CPC2_DIR,
  CONDA_PATH=CONDA_PATH, GENOME_FASTA=GENOME_FASTA,
  GENOME_GTF=GENOME_GTF, GENOME_GFF=GENOME_GFF,
  FASTQ_DIR=FASTQ_DIR, GENOME_INDEX=GENOME_INDEX
)

Why?
Defining all variables and paths at the top centralizes configuration, making the pipeline reproducible and easy to update.
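
Because Sys.setenv() exports these values into the process environment, the bash chunks below can reference them directly as ${VAR}. A minimal sanity check (not part of the original pipeline) would be:

echo "DATA_DIR:     ${DATA_DIR}"
echo "OUTPUT_DIR:   ${OUTPUT_DIR}"
echo "GENOME_FASTA: ${GENOME_FASTA}"
echo "THREADS:      ${THREADS}"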


2. Directory Creation

mkdir -p "${DATA_DIR}" "${OUTPUT_DIR}"
  • Purpose: Create the required data and output folders in one go.

3. Download Genome and Reads

wget -nv -r --no-directories --no-parent \
  -P "${FASTQ_DIR}" -A "*${FASTQ_SUFFIX}" "${FASTQ_SOURCE}"

curl -o "${GENOME_FASTA}" "${GENOME_SOURCE}"
curl -o "${GENOME_GTF}"   "${GTF_SOURCE}"
curl -o "${GENOME_GFF}"   "${GFF_SOURCE}"
  • wget: Retrieves all trimmed FASTQ files.
  • curl: Downloads the reference genome, GTF, and GFF3.

4. Integrity Checks

head -1 "${GENOME_FASTA}"
head -2 "${GENOME_GFF}"
head -1 "${GENOME_GTF}"
  • Goal: Quickly verify that the downloads are actual sequence/annotation files rather than HTML errors.
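
If a download fails or redirects, these files can silently end up as HTML error pages. An optional extra check along these lines (a sketch, not part of the original pipeline) would catch that automatically:

# Optional: warn if a downloaded file looks like an HTML error page rather than data
for f in "${GENOME_FASTA}" "${GENOME_GTF}" "${GENOME_GFF}"; do
  if head -1 "$f" | grep -qi "<html"; then
    echo "WARNING: ${f} looks like an HTML page, not sequence/annotation data" >&2
  fi
done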

5. HISAT2 Indexing & Alignment

"${HISAT2_DIR}/hisat2_extract_exons.py" "${GENOME_GTF}" > exon.txt
"${HISAT2_DIR}/hisat2_extract_splice_sites.py" "${GENOME_GTF}" > splice_sites.txt

"${HISAT2_DIR}/hisat2-build" -p "${THREADS}"   "${GENOME_FASTA}" "${GENOME_INDEX}"   --exon exon.txt --ss splice_sites.txt
for r2 in "${FASTQ_DIR}"/*_R2_*."${FASTQ_SUFFIX}"; do
  sample="${r2##*/}"
  sample="${sample%%_R2_*}"
  r1="${r2/_R2_/_R1_}"
  "${HISAT2_DIR}/hisat2" -x "${GENOME_INDEX}" -p "${THREADS}"     -1 "$r1" -2 "$r2" -S "${OUTPUT_DIR}/${sample}.sam"
done
  • Extract hints (exons, splice sites) for better alignment.
  • Index the genome with those hints.
  • Align each paired sample to generate SAM files.
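
For clarity, the parameter expansions in the loop derive the sample name and the R1 mate from each R2 filename. A quick illustration with a hypothetical filename (not an actual Ptuh sample):

r2="sample-01_R2_001.fastq.gz"       # hypothetical filename for illustration
sample="${r2##*/}"                   # strip any leading directory path
sample="${sample%%_R2_*}"            # -> sample-01
r1="${r2/_R2_/_R1_}"                 # -> sample-01_R1_001.fastq.gz
echo "${sample}: ${r1} + ${r2}"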

6. SAM → Sorted & Indexed BAM

for sam in "${OUTPUT_DIR}"/*.sam; do
  bam="${sam%.sam}.bam"
  sorted="${sam%.sam}.sorted.bam"
  "${SAMTOOLS_DIR}samtools" view -bS -@ "${THREADS}" "$sam" > "$bam"
  "${SAMTOOLS_DIR}samtools" sort -@ "${THREADS}" "$bam" -o "$sorted"
  "${SAMTOOLS_DIR}samtools" index -@ "${THREADS}" "$sorted"
done
  • Convert SAM → BAM,
  • Sort BAM for downstream compatibility,
  • Index for rapid access.
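
As an optional QC pass (not in the original pipeline), samtools flagstat summarizes mapping rates for each sorted BAM:

for bam in "${OUTPUT_DIR}"/*.sorted.bam; do
  echo "== ${bam##*/} =="
  "${SAMTOOLS_DIR}samtools" flagstat "$bam"
done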

7. Transcript Assembly with StringTie

find "${OUTPUT_DIR}" -name "*sorted.bam"   | xargs -n1 -I{} "${STRINGTIE_DIR}/stringtie"       -p "${THREADS}" -G "${GENOME_GFF}"       -o "{}.gtf" "{}"

"${STRINGTIE_DIR}/stringtie" --merge -G "${GENOME_GFF}"   -o "${OUTPUT_DIR}/stringtie_merged.gtf"   "${OUTPUT_DIR}"/*.gtf
  • Per-sample assembly of transcript models.
  • Merge across samples into a unified annotation.

At this point there are 552268 transcripts.
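
For reference, that number can be reproduced by counting transcript records in the merged GTF:

awk '$3 == "transcript"' "${OUTPUT_DIR}/stringtie_merged.gtf" | wc -l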


8. GFFCompare & Candidate Selection

"${GFFCOMPARE_DIR}/gffcompare" -r "${GENOME_GFF}"   -o "${OUTPUT_DIR}/gffcompare_merged"   "${OUTPUT_DIR}/stringtie_merged.gtf"

awk '$3=="transcript" && !/^#/ && /'"$GFFPATTERN"'/'   gffcompare_merged.annotated.gtf   | awk '($5 - $4) > 199'   > lncRNA_candidates.gtf
  • Class codes filter novel/antisense classes.
  • Length filter ensures >200 bp.

After this filter step there are 18998 lncRNA candidates.
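
To see how the annotated transcripts break down by class code (and to sanity-check what the GFFPATTERN filter keeps), a quick tally over the gffcompare output works; the path assumes the outputs landed under ${OUTPUT_DIR} as above:

grep -o 'class_code "[a-z=]*"' "${OUTPUT_DIR}/gffcompare_merged.annotated.gtf" \
| sort | uniq -c | sort -rn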


9. Fasta Extraction & CPC2

Bedtools

"${BEDTOOLS_DIR}"/bedtools getfasta \
-fi "${GENOME_FASTA}" \
-bed "${OUTPUT_DIR}/lncRNA_candidates.gtf" \
-fo "${OUTPUT_DIR}/lncRNA_candidates.fasta" \
-name -split
fgrep -c ">" ${OUTPUT_DIR}/lncRNA_candidates.fasta

head ${OUTPUT_DIR}/lncRNA_candidates.fasta
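
Because the TL;DR notes that duplicates were an earlier problem, a cheap duplicate-header check on the candidate FASTA is worthwhile (expect no output):

grep "^>" "${OUTPUT_DIR}/lncRNA_candidates.fasta" | sort | uniq -d | head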

CPC2

eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
python /home/shared/CPC2_standalone-1.0.1/bin/CPC2.py \
-i "${OUTPUT_DIR}/lncRNA_candidates.fasta" \
-o "${OUTPUT_DIR}/CPC2"

Filter

awk '$8 == "noncoding" {print $1}' "${OUTPUT_DIR}/CPC2.txt" > "${OUTPUT_DIR}/noncoding_transcripts_ids.txt"

Subsetting new fasta

"${SAMTOOLS_DIR}samtools" faidx "${OUTPUT_DIR}/lncRNA_candidates.fasta" \
-r "${OUTPUT_DIR}/noncoding_transcripts_ids.txt" \
> "${OUTPUT_DIR}/lncRNA.fasta"

Generate new bed and new gtf

# Define input and output file paths using the OUTPUT_DIR variable
input="${OUTPUT_DIR}/noncoding_transcripts_ids.txt"
output="${OUTPUT_DIR}/lncRNA.bed"

# Process each line of the input file
while IFS= read -r line; do
    # Remove "transcript::" from the line
    line="${line//transcript::/}"
    
    # Split the line by ':' to get the chromosome and position string
    IFS=':' read -r chromosome pos <<< "$line"
    
    # Split the position string by '-' to separate start and end positions
    IFS='-' read -r start end <<< "$pos"
    
    # Convert the start position to 0-based by subtracting 1
    start=$((start - 1))
    
    # Write the chromosome, updated start, and end positions to the output file (tab-separated)
    printf "%s\t%s\t%s\n" "$chromosome" "$start" "$end"
done < "$input" > "$output"
awk 'BEGIN{OFS="\t"; count=1} {printf "%s\t.\tlncRNA\t%d\t%d\t.\t+\t.\tgene_id \"lncRNA_%03d\";\n", $1, $2, $3, count++;}' "${OUTPUT_DIR}/lncRNA.bed" \
> "${OUTPUT_DIR}/lncRNA.gtf"
  • Extract nucleotide sequences of candidates.
  • Predict coding potential; retain only noncoding.
  • Generate a new GTF, BED, and FASTA of the lncRNA candidates.

After this filter step there are 16153 lncRNA candidates in all three output file types.
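
That agreement can be confirmed directly; all three counts below should match, and the BED should contain no duplicate intervals:

grep -c ">" "${OUTPUT_DIR}/lncRNA.fasta"
wc -l < "${OUTPUT_DIR}/lncRNA.bed"
wc -l < "${OUTPUT_DIR}/lncRNA.gtf"
sort "${OUTPUT_DIR}/lncRNA.bed" | uniq -d | wc -l   # expect 0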


10. GTF format editing & featureCounts for expression matrices

New, more reliable position-adjustment step: add 1 to the start coordinates (column 4), which were carried over from the 0-based BED, so they match the 1-based convention that GTF-consuming tools expect downstream.

awk 'BEGIN{OFS="\t"} 
     { $4 = $4 + 1; print }' \
  ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/lncRNA.gtf \
> ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_fixed.gtf

Line 16104 still has a start position of 0, which featureCounts cannot accept (GTF coordinates are 1-based). This chunk fixes only that line, changing the start from 0 to 1.

awk 'BEGIN{OFS="\t"}
     # if the GTF start (col 4) is 0, set it to 1
     { if($4==0) $4=1
       print
     }' \
  ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_fixed.gtf \
> ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_zeroFix.gtf

Issue with column 9: the gene_id keyword and the lncRNA ID end up split into two tab-separated columns (the earlier awk steps rebuild each record with a tab OFS, turning the space inside the attribute field into a tab), so they need to be merged back into a single attribute column for valid GTF formatting.


awk 'BEGIN{OFS="\t"}
     {
       # Gather everything from col 9 onward into one string
       attr = $9
       for(i=10; i<=NF; i++){
         attr = attr " " $i
       }
       # Now print exactly nine columns, with our recombined attr
       print $1, $2, $3, $4, $5, $6, $7, $8, attr
     }' \
  ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_zeroFix.gtf \
> ~/github/deep-dive-expression/M-multi-species/output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_for_fc.gtf

The GTF is now ready for input to featureCounts to generate expression matrices.

/home/shared/subread-2.0.5-Linux-x86_64/bin/featureCounts \
-T 24 \
-a ../output/01.6-Ptuh-lncRNA-pipeline/Ptuh_lncRNA_for_fc.gtf \
-o ../output/01.6-Ptuh-lncRNA-pipeline/Ptuh_counts.txt \
-t lncRNA \
-g gene_id \
-p \
../output/01.6-Ptuh-lncRNA-pipeline/*sorted.bam
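
Tying back to the TL;DR duplicate check, the counts matrix can be screened for repeated gene_ids; featureCounts writes one comment line and one header line before the data, so those are skipped here (expect 0):

tail -n +3 ../output/01.6-Ptuh-lncRNA-pipeline/Ptuh_counts.txt | cut -f1 | sort | uniq -d | wc -l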

These fixes will still need to be incorporated into the original final GTF-generation step, now that the formatting issues blocking featureCounts input have been diagnosed.