Skip to content

lienhard/LRGASP_IsoTools

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 

LRGASP_IsoTools

This repository contains scripts, notebooks, and supplementary information related to the IsoTools LRGASP contribution. Please note that this repository uses an early pre-release version of IsoTools (version 0.2.5) that was used for the submission of the LRGASP challenge. Since then, the syntax and functionality of IsoTools have been greatly improved, including the filtering query syntax for better efficiency and ease of use.

This repository is intended to serve as documentation for the LRGASP contribution and is not a guideline for general IsoTools usage. For up-to-date information and guidelines, please refer to the official documentation at https://isotools.readthedocs.io.

TL;DR

The contribution is restricted on PacBio long read only challenge 1 (iso_detect_ref) and challenge 2 (iso_quant). After basic preprocessing, the individual "experiments" are processed with challenge.py, which uses IsoTools to create the required directory structure and output files for challenge 1 and challenge 2. The filtering and normalization strategies are outlined below This notebook contains additional analysis of the challenge dataset performed with IsoTools.

Preprocessing

After downloading relevant fastq files from ENCODE data portal, CapTrap adapter sequences and polyA tails are trimmed using cutadapt:

# PacBio CapTrap files
cutadapt -a "GCATCGCTGTCTCTTATACACATCT" -g "AGATGTGTATAAGAGACAG"  --error-rate .1 --cores 10 -o $fastq_out $fastq_in
# all PacBio files
cutadapt -a "A{100}" --error-rate .1 --cores 10 -o $fastq_out $fastq_in
#simulated files (surprisingly, the reads are 50% reverse complement which has to be detected by the presence of leading polyT)
cutadapt -a "A{100}" --error-rate .1 --cores 10 --discard-untrimmed -o  $fastq_trimmed_fwd $fastq_in
cutadapt -g "T{100}" --error-rate .1 --cores 10 --discard-untrimmed -o  $fastq_trimmed_rev $fastq_in
python reverse_complement.py $fastq_trimmed_rev |gzip > $fastq_trimmed_rev_rc
cat $fastq_trimmed_fwd $fastq_trimmed_rev_rc > $fastq_trimmed

Trimmed PacBio fastq files are aligned to the provided reference using minimap2-2.22

minimap2 -t 30 -ax splice:hq -uf --MD -a ${ref}_hq.mmi $fastq_file |samtools sort -O BAM -o ${out}.bam -

For benchmarking, Illlumina short read RNA-seq fastqs were processed with kallisto and STAR, both with the provided annotation and a long read derived annotation, such as submitted for challenge 1 but based on all samples of an organism.

kallisto quant -t 20 -i ${ref_tr}_kallisto -o ${out}_kallisto  $fq1 $fq2
STAR --quantMode TranscriptomeSAM GeneCounts --runThreadN 16 --genomeDir ${ref_tr}_star --readFilesIn $fq1 $fq2 --readFilesCommand zcat --outFileNamePrefix ${out}_star --outReadsUnmapped Fastx --outBAMcompression -1 --outSAMtype BAM SortedByCoordinate --outSAMattributes All

In addition, short read junction coverage was computed as suggested for the evaluation scripts.

IsoTools

All steps to create the directory structure and output files required for challenge 1 and challenge 2 are performed within the python script "challenge.py":

python challenge.py <sample> <protocol> <submission name>

Using functionality from IsoTools, this scripts imports the reference annotation and the long reads, generates and filters transcript models, and performs a length normalization to estimate transcript abundance from long reads.

After importing the reference annotation, the first step of the analysis is the import of long reads and the assignment to transcripts and genes. During import, the following corrections are performed:

  • If the same shift on 5' and 3' wrt reference junction is observed, this is considered an alignment ambiguity and corrected to match the reference.
  • Chimeric alignments on same chromosome and in correct orientation are considered to originate from long introns, not handled correctly by the aligner. Accordingly, the parts are chained into a single concordant alignment.

Reads are considered to be the same transcript if and only if they share all spice junction. The exact boundaries do not need to match. For mono-exonic transcripts, a overlap of at least 50% is required. The exact transcript boundaries (transcription start site/TSS and polyadenylation site/PAS) are determined after importing all long reads as the median position of all reads. In addition, transcripts are assigned to reference genes if they share at least one splice site or 50% exonic overlap for mono exonic transcripts. If they do not exactly match a reference transcript, the quality of the difference is described by 19 different categories.

Subchallenge 1

The more reads a gene has, the more likely is it that some of the reads will be affected by technical artifacts leading to incorrect transcript models. A priori, we have higher confidence in transcripts matching reference annotation compared to novel transcripts and genes. To distinguish genuine transcripts from technical artifacts, we apply the following filter criteria:

  • Transcripts where all splice sites match reference splice sites need support from at least 2 reads, and at least 2% of the total gene read support.
  • Transcripts with novel splice sites
    • need support from at least 5 reads, and at least 5% of the total gene read support.
    • are discarded if they contain more than 50% A content downstream.
    • are discarded if they contain a direct repeat of length 6 or longer at a novel splice site.

Subchallenge 2

In PacBio Isoseq data observed a marked difference in transcript length distribution compared to the reference annotation, but also compared to the distribution derived from short read transcript levels estimated by kallisto. For the cDNA protocol, this typically results in a depletion of both short (<1500 bases) and very long (>6000 bases) transcripts. To compensate for this length dependent effect, we suggest the following normalization strategy: We fit a lognormal model to both the reference and the observed transcript length distributions. The quotient of the two distribution serves as the normalization factor. As the fit is not good in the left tail (e.g. for very short transcripts ~ 100 bases), the factor is set constant for the first 1% percentile of the observed transcript length model.

Without normalization

without normalization

Length dependency

length dependency

After normalization

with normalization

About

Scripts, notebooks and supplementary information for the IsoTools LRGASP contribution

Resources

Stars

Watchers

Forks

Releases

No releases published