Skip to content
This repository has been archived by the owner. It is now read-only.
Permalink
5059372a14
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time

Analysis of 3'End sequencing and extension of gene annotation

Library

Library called MACE (Massive Analysis of cDNA Ends) is used to prepare RNA samples from whole turtle or lizard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs).

Preprocessing

Reads quality control

Reads base composition and quality per length is estimated using fastqSeqStats.

zcat data.fastq.gz | fastqSeqStats --ifastq - --otxt data_qual.txt --polyA

Alignment

Reads alignment is done using STAR. The following parameters were utalized with the program.

STAR --runMode alignReads --runThreadN 6 --genomeDir $genomePath --readFilesIn $queryFile --readFilesCommand zcat --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --alignSoftClipAtReferenceEnds No --outFilterScoreMinOverLread 0.25 --outFilterMatchNminOverLread 0.25;
Reads Turtle Lizzard
raw 11099556 16061777
unq. aligned 8468757 (76.30%) 12029909 (74.90%)

Find 3'UTR isoforms

Prepare internal priming mask

To evaluate potential internal priming events, a poly(A/T) mask of a 10 consecutive mono bases is aligned to the genome of interest with Bowtie.

bowtie /path_to_genome/bowtie_index/genome PolyTailMask.fa -f -v 2 --all --sam --threads 6 |samtools view -buS - | samtools sort - PolyTailMask

The resulted Bam file is converted to Bed format and compressed with BGzip.

bedtools bamtobed -i PolyTailMask.bam -split | bedtools merge -i - -s -c 4 -o count | sort -k1,1 -k2,2n | awk -F"\t" 'OFS="\t"{print $1,$2,$3,sprintf("poly%06d",NR),$5,$4}' | bgzip > species_PolyRegions.bed.gz

The final mask file is indexed with Tabix for random access.

tabix species_PolyRegions.bed.gz

Detection

Clusters of poly(A) sites are detected using PASSFinder. It relies on HTSLib. Bam alignment file is processed in a way, where each read is classified as either poly(A) containing or not, based on a poly(A) recognition algorithm, designed by Jim Kent at UCSC. Coverage on each base is piled up. The resulted Bed coverage files are sorted by position, compressed using BGzip and indexed using Tabix.

PASSFinder --input $inputBam -r species_PolyRegions.bed.gz --masksize 3 --polysize 3 --mapq 255 | sort -k1,1 -k2,2n | bgzip -@ 8 > $bedCoverageFile;
tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;

Clustering

3'base coverage holds the infomation of potential 3'UTR sites. In order to identify the positions of polyadenylation we implemented a base clustering techniques that can merge positions in user defined window. During the grouping process information about best expressed base and best expressed seed (3'base of reads containing poly(A) tail) is maintained:

  1. chrom
  2. chromStart
  3. chromEnd
  4. groupId
  5. span
  6. strand
  7. clusterBasesMasked
  8. clusterBasesCounts
  9. clusterReadsSumCoverage
  10. clusterReadsMaxCoverage
  11. clusterReadsBestBase
  12. clusterSeedsCounts
  13. clusterSeedsSumCoverage
  14. clusterSeedsMaxCoverage
  15. clusterSeedsBestBase

Use PASSCluster.pl script to run the procedure. We used a clustering windows of 25 nucleotides as suggested in Müller et. al..

perl PASSCluster.pl -gbed $bedCoverageFile -window 25 | sort -k1,1 -k2,2n | bgzip > $passCluseredFile.bed.gz;
tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;

Annotation

Prepare reference annotation map

Reference annotation maps were downloaded from NCBI database for Chrysemys picta (Turtle) and Pogona vitticeps (Lizard). The GFF files were converted to Bed12 files. The Bed12 reference annotation file has the name field contains transcript and gene information separated by ';'. To split the Bed12 file into Bed6 file with feature label (5'UTR, CDS, intron, 3'UTR) appended to the name run BED12Split.pl.

Find closest gene to cluster

To associate PASS clusters with genes use PASSAnnotate.pl routine. The script requires a Bed6 features map created in the previous step and PASS clusters file from the clustering procedure. It uses [bedtools closest] (http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) as internal engine for assosiating neighbouring genes and sites. The routine will calculate UTR predicted length and relative contribution of peak per gene. The following fields are added to the information of the clusters:

  1. geneSymbol
  2. geneFeature
  3. upstreamStartPosition
  4. upstreamSpan
  5. filter (true if pass site contributes >= 1% to total gene expression)

The window used to assign upstream gene was 20kB.

perl PASSAnnotate.pl $passCluseredFile.bed.gz refSeq_Features.bed 20000 |sort -k1,1 -k2,2n |gzip > speciesAnnotation_Date.bed.gz

Methods

Library called MACE (Massive Analysis of cDNA Ends) (cite: https://www.ncbi.nlm.nih.gov/pubmed/25468442) is used to prepare RNA samples from whole turtle or lizzard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs). Reads were aligned to genome of interest using STAR aligner (cite: https://www.ncbi.nlm.nih.gov/pubmed/23104886). Parameters outFilterScoreMinOverLread and outFilterMatchNminOverLread are used to allow poly(A) tail soft clip at reads end. Only uniquely aligned reads are considered for downstream analysis. Internal priming events are avoided by filtering alignments hitting poly(A) rich genomic regions. Such regions are identified by aligning 10 As to genome of interest (cite: https://www.ncbi.nlm.nih.gov/pubmed/25052703). Poly(A) supported sites (PASS) are identified using a HTSLib driven tool called PASSFinder (https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). The tool will cluster reads based on Poly(A) tail using an algorithm defined by Jim Kent at UCSC (https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Alignments are piled up and 3'-most non-A base is maximised to precisely pinpoint poly(A) site position. PAS sites are then associated with upstream genes up to 20kB and current species annotation is accordingly extended.

Overview

Contribution of extended 3'UTRs

figure_LengthContribution

The plot represents a cummulative distribution of the relative distance from annotated 3'End. All upstream identified 3'UTRs will have negative distance and will represent an alternative polyadenylation site (~70% of all sites). All downstream identified 3'UTRs will have positive distance and will represent extended PASS (~30% of all sites). Both annotations overcome the same degree of extension. Range of [-5,+5] kB contributes to more than 95% of all sites.

figure_FeaturesContribution

The plot represents each sequencing run per species and the relative contribution of reads used in analysis to gene features. Extended 3'UTRs contribute with around ~10%. Intronic contribution is relatively low, suggesting that DropSeq method preserves cells integrity and has a low efficiency on pre-mRNA sequencing. Both genomes suffer some lack of annotation as we can recognise it from the contribution of intergenic reads.

figure_ExampleTrack

Example Genome track for CALM3. One can recognize the 3'bias of the method, although most of the coverage falls in the last coding sequence exons. Red track shows the extension achived with the MACE sequencing. Some substantial gene coverage is recovered by that technique.