Analysis of 3'End sequencing and extension of gene annotation

Library

The library was prepared with MACE (Massive Analysis of cDNA Ends) from RNA samples of whole turtle or lizard brain. Reads were 68 nucleotides long and were filtered for PCR duplicates using unique molecular identifiers (UMIs).

Preprocessing

Reads quality control

Read base composition and quality per length are estimated using fastqSeqStats.

zcat data.fastq.gz | fastqSeqStats --ifastq - --otxt data_qual.txt --polyA

Alignment

Reads were aligned with STAR using the following parameters.

STAR --runMode alignReads --runThreadN 6 --genomeDir $genomePath --readFilesIn $queryFile --readFilesCommand zcat --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --alignSoftClipAtReferenceEnds No --outFilterScoreMinOverLread 0.25 --outFilterMatchNminOverLread 0.25;

Reads	Turtle	Lizard
raw	11099556	16061777
uniquely aligned	8468757 (76.30%)	12029909 (74.90%)
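
The percentages in the table are the uniquely aligned read counts divided by the raw read counts. A minimal sketch of that arithmetic follows; the helper name `pct` is hypothetical and is not part of the pipeline.

```shell
# pct ALIGNED RAW: print ALIGNED as a percentage of RAW with two decimals.
# Hypothetical helper for illustration only; not part of the original pipeline.
pct() {
  awk -v aligned="$1" -v raw="$2" 'BEGIN { printf "%.2f\n", 100 * aligned / raw }'
}

pct 8468757 11099556    # turtle: uniquely aligned / raw
pct 12029909 16061777   # lizard: uniquely aligned / raw
```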

Find 3'UTR isoforms

Prepare internal priming mask

To evaluate potential internal priming events, a poly(A/T) mask of 10 consecutive identical bases is aligned to the genome of interest with Bowtie.

bowtie /path_to_genome/bowtie_index/genome PolyTailMask.fa -f -v 2 --all --sam --threads 6 |samtools view -buS - | samtools sort - PolyTailMask
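
The content of PolyTailMask.fa is not shown in this document. One plausible way to create it is sketched below, assuming a single FASTA record of 10 A's (the record name `polyA` is hypothetical). Bowtie searches both reference strands by default, so poly(T) runs would be reported as reverse-strand hits of the same query.

```shell
# Write a one-record FASTA with a run of n identical bases (here 10 A's).
# The record name "polyA" and the single-record layout are assumptions.
awk -v n=10 'BEGIN {
  seq = ""
  for (i = 0; i < n; i++) seq = seq "A"
  printf ">polyA\n%s\n", seq
}' > PolyTailMask.fa
```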

The resulting BAM file is converted to BED format and compressed with BGzip.

bedtools bamtobed -i PolyTailMask.bam -split | bedtools merge -i - -s -c 4 -o count | sort -k1,1 -k2,2n | awk -F"\t" 'OFS="\t"{print $1,$2,$3,sprintf("poly%06d",NR),$5,$4}' | bgzip > species_PolyRegions.bed.gz

The final mask file is indexed with Tabix for random access.

tabix species_PolyRegions.bed.gz

Detection

Clusters of poly(A) sites are detected using PASSFinder, which relies on HTSLib. The BAM alignment file is processed so that each read is classified as either poly(A)-containing or not, based on a poly(A) recognition algorithm designed by Jim Kent at UCSC. Coverage is then piled up on each base. The resulting BED coverage files are sorted by position, compressed using BGzip and indexed using Tabix.

PASSFinder --input $inputBam -r species_PolyRegions.bed.gz --masksize 3 --polysize 3 --mapq 255 | sort -k1,1 -k2,2n | bgzip -@ 8 > $bedCoverageFile;
tabix --sequence 1 --begin 2 --end 2 --zero-based $bedCoverageFile;
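
To give a feel for the read classification step, the sketch below measures the run of A's at the 3' end of a read sequence and applies a minimum length. It is a deliberately simpler stand-in and is not the Kent scoring algorithm that PASSFinder uses; the function names and the threshold of 3 are assumptions chosen only for illustration.

```shell
# polya_tail_length SEQ: print the length of the run of A at the end of SEQ.
# Simplified stand-in, NOT the Kent algorithm used by PASSFinder.
polya_tail_length() {
  awk -v seq="$1" 'BEGIN { print (match(seq, /A+$/) ? RLENGTH : 0) }'
}

# has_polya SEQ [MIN]: succeed if the terminal A run is at least MIN long.
# The default MIN of 3 is an illustrative assumption.
has_polya() {
  [ "$(polya_tail_length "$1")" -ge "${2:-3}" ]
}

if has_polya ACGTTGCAAAAA; then echo "poly(A) read"; else echo "plain read"; fi
```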

Clustering

The 3'-base coverage holds the information on potential 3'UTR sites. To identify the positions of polyadenylation, we implemented a base clustering technique that merges positions within a user-defined window. During grouping, information about the best expressed base and the best expressed seed (3' base of reads containing a poly(A) tail) is maintained:

chrom
chromStart
chromEnd
groupId
span
strand
clusterBasesMasked
clusterBasesCounts
clusterReadsSumCoverage
clusterReadsMaxCoverage
clusterReadsBestBase
clusterSeedsCounts
clusterSeedsSumCoverage
clusterSeedsMaxCoverage
clusterSeedsBestBase

Use the PASSCluster.pl script to run the procedure. We used a clustering window of 25 nucleotides, as suggested in Müller et al.

perl PASSCluster.pl -gbed $bedCoverageFile -window 25 | sort -k1,1 -k2,2n | bgzip > $passCluseredFile.bed.gz;
tabix --sequence 1 --begin 2 --end 2 --zero-based $passCluseredFile.bed.gz;
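
The window-based grouping can be sketched in a few lines of awk. This is a minimal illustration and not the PASSCluster.pl implementation: it assumes a hypothetical four-column input (chrom, 0-based position, strand, coverage) sorted by chrom, strand and position, and it chains a position into the current cluster when it lies within the window of the previous position. The output columns loosely mirror chrom, chromStart, chromEnd, strand, clusterBasesCounts, clusterReadsSumCoverage, clusterReadsMaxCoverage and clusterReadsBestBase.

```shell
# cluster_positions [WINDOW]: merge nearby positions read from stdin.
# Input (assumed layout): chrom <TAB> position <TAB> strand <TAB> coverage,
# sorted by chrom, strand, position. Not the PASSCluster.pl code.
cluster_positions() {
  awk -F'\t' -v OFS='\t' -v window="${1:-25}" '
    # Print the cluster collected so far, if any.
    function emit() {
      if (n > 0) print chrom, start, last + 1, strand, n, sum, max, best
    }
    {
      # Start a new cluster on a chrom or strand change, or when the gap
      # to the previous position exceeds the window.
      if (!(n > 0 && $1 == chrom && $3 == strand && $2 - last <= window)) {
        emit()
        chrom = $1; strand = $3; start = $2
        n = 0; sum = 0; max = 0; best = $2
      }
      last = $2; n++; sum += $4
      # Remember the position with the highest coverage in the cluster.
      if ($4 > max) { max = $4; best = $2 }
    }
    END { emit() }
  '
}

printf 'chr1\t100\t+\t5\nchr1\t110\t+\t20\nchr1\t200\t+\t7\n' | cluster_positions 25
```

Chaining on the previous position means a cluster can span more than one window in total; a fixed-width variant would compare against the cluster start instead.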

Annotation

Prepare reference annotation map

Reference annotation maps were downloaded from the NCBI database for Chrysemys picta (turtle) and Pogona vitticeps (lizard). The GFF files were converted to Bed12 files. In the Bed12 reference annotation file, the name field contains transcript and gene information separated by ';'. To split the Bed12 file into a Bed6 file with the feature label (5'UTR, CDS, intron, 3'UTR) appended to the name, run BED12Split.pl.

Find closest gene to cluster

To associate PASS clusters with genes, use the PASSAnnotate.pl routine. The script requires the Bed6 features map created in the previous step and the PASS clusters file from the clustering procedure. It uses [bedtools closest](http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) as its internal engine for associating neighbouring genes and sites. The routine calculates the predicted UTR length and the relative contribution of each peak per gene. The following fields are added to the cluster information:

geneSymbol
geneFeature
upstreamStartPosition
upstreamSpan
filter (true if the PASS contributes >= 1% to total gene expression)

The window used to assign an upstream gene was 20 kb.

perl PASSAnnotate.pl $passCluseredFile.bed.gz refSeq_Features.bed 20000 |sort -k1,1 -k2,2n |gzip > speciesAnnotation_Date.bed.gz
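
The filter field can be illustrated with a small awk sketch. It is not the PASSAnnotate.pl code: the three-column input (gene, site identifier, coverage), the function name and the toy values are hypothetical. Each site is flagged true when its coverage is at least 1% of the summed coverage of all sites assigned to the same gene.

```shell
# flag_sites [MIN_FRACTION]: append true/false to each site read from stdin.
# Input (assumed layout): gene <TAB> site id <TAB> coverage.
# Illustrative only; not the PASSAnnotate.pl implementation.
flag_sites() {
  awk -F'\t' -v OFS='\t' -v min_fraction="${1:-0.01}" '
    # First pass over the stream: store rows and sum coverage per gene.
    { gene[NR] = $1; site[NR] = $2; cov[NR] = $3; total[$1] += $3 }
    END {
      # Second pass: compare each site with the total of its gene.
      for (i = 1; i <= NR; i++) {
        frac = (total[gene[i]] > 0) ? cov[i] / total[gene[i]] : 0
        print gene[i], site[i], cov[i], (frac >= min_fraction ? "true" : "false")
      }
    }
  '
}

printf 'geneA\tsite1\t990\ngeneA\tsite2\t5\n' | flag_sites 0.01
```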

Methods

The library was prepared with MACE (Massive Analysis of cDNA Ends) (cite: https://www.ncbi.nlm.nih.gov/pubmed/25468442) from RNA samples of whole turtle or lizard brain. Reads were 68 nucleotides long and were filtered for PCR duplicates using unique molecular identifiers (UMIs). Reads were aligned to the genome of interest using the STAR aligner (cite: https://www.ncbi.nlm.nih.gov/pubmed/23104886). The parameters outFilterScoreMinOverLread and outFilterMatchNminOverLread were set to allow soft clipping of the poly(A) tail at the read end. Only uniquely aligned reads were considered for downstream analysis. Internal priming events were avoided by filtering out alignments that hit poly(A)-rich genomic regions. Such regions were identified by aligning 10 As to the genome of interest (cite: https://www.ncbi.nlm.nih.gov/pubmed/25052703). Poly(A) supported sites (PASS) were identified using an HTSLib-driven tool called PASSFinder (https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). The tool classifies reads by the presence of a poly(A) tail, using an algorithm defined by Jim Kent at UCSC (https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Alignments are piled up and the coverage of the 3'-most non-A base is maximised to pinpoint the poly(A) site position precisely. PASS are then associated with upstream genes within 20 kb, and the current species annotation is extended accordingly.

Overview

Contribution of extended 3'UTRs

The plot shows the cumulative distribution of the distance of each identified site relative to the annotated 3' end. Sites identified upstream of the annotated end have a negative distance and represent alternative polyadenylation sites (~70% of all sites). Sites identified downstream have a positive distance and represent extended PASS (~30% of all sites). Both species' annotations show a similar degree of extension. The range of [-5,+5] kb accounts for more than 95% of all sites.

The plot shows, for each sequencing run per species, the relative contribution of the reads used in the analysis to gene features. Extended 3'UTRs contribute ~10%. The intronic contribution is relatively low, suggesting that the DropSeq method preserves cell integrity and captures pre-mRNA with low efficiency. Both genomes are still incompletely annotated, as the contribution of intergenic reads shows.

Example genome track for CALM3. The 3' bias of the method is recognizable, although most of the coverage falls in the last coding sequence exons. The red track shows the extension achieved with MACE sequencing, which recovers a substantial amount of gene coverage.
