README.md

# Analysis of 3'End sequencing and extension of gene annotation

## Library
Library called [MACE](http://genxpro.net/sequencing/transcriptome/mace-massive-analysis-of-cdna-ends/) (Massive Analysis of cDNA Ends) is used to prepare RNA samples from whole turtle or lizard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs).

## Preprocessing

### Reads quality control
Reads base composition and quality per length is estimated using [fastqSeqStats](https://github.molgen.mpg.de/MPIBR-Bioinformatics/fastqSeqStats).

```
zcat data.fastq.gz | fastqSeqStats --ifastq - --otxt data_qual.txt --polyA
```

### Alignment
Reads alignment is done using [STAR](https://github.com/alexdobin/STAR). The following parameters were utalized with the program.

```
STAR --runMode alignReads --runThreadN 6 --genomeDir $genomePath --readFilesIn $queryFile --readFilesCommand zcat --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --alignSoftClipAtReferenceEnds No --outFilterScoreMinOverLread 0.25 --outFilterMatchNminOverLread 0.25;
```

|Reads        |Turtle          |Lizzard          |
|:------------|:--------------:|:---------------:|
|raw          |11099556        |16061777         |
|unq. aligned |8468757 (76.30%)|12029909 (74.90%)|


## Find 3'UTR isoforms

### Prepare internal priming mask
To evaluate potential internal priming events, a poly(A/T) mask of a 10 consecutive mono bases is aligned to the genome of interest with [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml).

```
bowtie /path_to_genome/bowtie_index/genome PolyTailMask.fa -f -v 2 --all --sam --threads 6 |samtools view -buS - | samtools sort - PolyTailMask
```

The resulted Bam file is converted to Bed format and compressed with [BGzip](https://github.com/samtools/htslib).

```
bedtools bamtobed -i PolyTailMask.bam -split | bedtools merge -i - -s -c 4 -o count | sort -k1,1 -k2,2n | awk -F"\t" 'OFS="\t"{print $1,$2,$3,sprintf("poly%06d",NR),$5,$4}' | bgzip > species_PolyRegions.bed.gz
```

The final mask file is indexed with [Tabix](https://github.com/samtools/htslib) for random access.

```tabix species_PolyRegions.bed.gz```

### Detection
Clusters of poly(A) sites are detected using [PASSFinder](https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). It relies on [HTSLib](https://github.com/samtools/htslib). Bam alignment file is processed in a way, where each read is classified as either poly(A) containing or not, based on a poly(A) recognition algorithm, designed by [Jim Kent at UCSC](https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Coverage on each base is piled up. The resulted Bed coverage files are sorted by position, compressed using [BGzip](https://github.com/samtools/htslib) and indexed using [Tabix](https://github.com/samtools/htslib).

```
PASSFinder --input $inputBam -r species_PolyRegions.bed.gz --masksize 3 --polysize 3 --mapq 255 | sort -k1,1 -k2,2n | bgzip -@ 8 > $bedCoverageFile;
tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;
```

### Clustering
3'base coverage holds the infomation of potential 3'UTR sites. In order to identify the positions of polyadenylation we implemented a base clustering techniques that can merge positions in user defined window. During the grouping process information about best expressed base and best expressed seed (3'base of reads containing poly(A) tail) is maintained:

1. chrom
2. chromStart
3. chromEnd
4. groupId
5. span
6. strand
7. clusterBasesMasked
8. clusterBasesCounts
9. clusterReadsSumCoverage
10. clusterReadsMaxCoverage
11. clusterReadsBestBase
12. clusterSeedsCounts
13. clusterSeedsSumCoverage
14. clusterSeedsMaxCoverage
15. clusterSeedsBestBase

Use [PASSCluster.pl](scripts/PASSCluster.pl) script to run the procedure. We used a clustering windows of 25 nucleotides as suggested in [Müller et. al.](https://www.ncbi.nlm.nih.gov/pubmed/?term=25052703).

```
perl PASSCluster.pl -gbed $bedCoverageFile -window 25 | sort -k1,1 -k2,2n | bgzip > $passCluseredFile.bed.gz;
tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;
```

### Annotation

##### Prepare reference annotation map
Reference annotation maps were downloaded from [NCBI]() database for Chrysemys picta ([Turtle](https://goo.gl/H952Dr)) and Pogona vitticeps ([Lizard](https://goo.gl/2atdhP)). The GFF files were converted to Bed12 files.
The Bed12 reference annotation file has the name field contains transcript and gene information separated by ';'. To split the Bed12 file into Bed6 file with feature label (5'UTR, CDS, intron, 3'UTR) appended to the name run [BED12Split.pl](scripts/BED12Split.pl).

##### Find closest gene to cluster
To associate PASS clusters with genes use [PASSAnnotate.pl](scripts/PASSAnnotate.pl) routine. The script requires a Bed6 features map created in the previous step and PASS clusters file from the clustering procedure. It uses [bedtools closest] (http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) as internal engine for assosiating neighbouring genes and sites. The routine will calculate UTR predicted length and relative contribution of peak per gene.  The following fields are added to the information of the clusters:

16. geneSymbol
17. geneFeature
18. upstreamStartPosition
19. upstreamSpan
20. filter  (true if pass site contributes >= 1% to total gene expression)

The window used to assign upstream gene was 20kB.

```
perl PASSAnnotate.pl $passCluseredFile.bed.gz refSeq_Features.bed 20000 |sort -k1,1 -k2,2n |gzip > speciesAnnotation_Date.bed.gz
```

## Methods
Library called [MACE](http://genxpro.net/sequencing/transcriptome/mace-massive-analysis-of-cdna-ends/) (Massive Analysis of cDNA Ends) (cite: https://www.ncbi.nlm.nih.gov/pubmed/25468442) is used to prepare RNA samples from whole turtle or lizzard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs). Reads were aligned to genome of interest using STAR aligner (cite: https://www.ncbi.nlm.nih.gov/pubmed/23104886). Parameters *outFilterScoreMinOverLread* and *outFilterMatchNminOverLread* are used to allow poly(A) tail soft clip at reads end. Only uniquely aligned reads are considered for downstream analysis. Internal priming events are avoided by filtering alignments hitting poly(A) rich genomic regions. Such regions are identified by aligning 10 As to genome of interest (cite: https://www.ncbi.nlm.nih.gov/pubmed/25052703). Poly(A) supported sites (PASS) are identified using a HTSLib driven tool called PASSFinder (https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). The tool will cluster reads based on Poly(A) tail using an algorithm defined by Jim Kent at UCSC (https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Alignments are piled up and 3'-most non-A base is maximised to precisely pinpoint poly(A) site position. PAS sites are then associated with upstream genes up to 20kB and current species annotation is accordingly extended.

## Overview
### Contribution of extended 3'UTRs
![figure_LengthContribution](./figures/figureX_relativeExtension.png)

The plot represents a cummulative distribution of the relative distance from annotated 3'End. All upstream identified 3'UTRs will have negative distance and will represent an alternative polyadenylation site (~70% of all sites). All downstream identified 3'UTRs will have positive distance and will represent extended PASS (~30% of all sites). Both annotations overcome the same degree of extension. Range of [-5,+5] kB contributes to more than 95% of all sites.

![figure_FeaturesContribution](./figures/figureX_DistributionOfGeneFeatures.png)

The plot represents each sequencing run per species and the relative contribution of reads used in analysis to gene features. Extended 3'UTRs contribute with around ~10%. Intronic contribution is relatively low, suggesting that DropSeq method preserves cells integrity and has a low efficiency on pre-mRNA sequencing. Both genomes suffer some lack of annotation as we can recognise it from the contribution of intergenic reads.

![figure_ExampleTrack](./figures/track_CALM3_05Oct2017.pdf)

Example Genome track for CALM3. One can recognize the 3'bias of the method, although most of the coverage falls in the last coding sequence exons. Red track shows the extension achived with the MACE sequencing. Some substantial gene coverage is recovered by that technique.
	# Analysis of 3'End sequencing and extension of gene annotation

	## Library
	Library called [MACE](http://genxpro.net/sequencing/transcriptome/mace-massive-analysis-of-cdna-ends/) (Massive Analysis of cDNA Ends) is used to prepare RNA samples from whole turtle or lizard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs).

	## Preprocessing

	### Reads quality control
	Reads base composition and quality per length is estimated using [fastqSeqStats](https://github.molgen.mpg.de/MPIBR-Bioinformatics/fastqSeqStats).

	```
	zcat data.fastq.gz \| fastqSeqStats --ifastq - --otxt data_qual.txt --polyA
	```

	### Alignment
	Reads alignment is done using [STAR](https://github.com/alexdobin/STAR). The following parameters were utalized with the program.

	```
	STAR --runMode alignReads --runThreadN 6 --genomeDir $genomePath --readFilesIn $queryFile --readFilesCommand zcat --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical --alignSoftClipAtReferenceEnds No --outFilterScoreMinOverLread 0.25 --outFilterMatchNminOverLread 0.25;
	```

	\|Reads \|Turtle \|Lizzard \|
	\|:------------\|:--------------:\|:---------------:\|
	\|raw \|11099556 \|16061777 \|
	\|unq. aligned \|8468757 (76.30%)\|12029909 (74.90%)\|



	## Find 3'UTR isoforms

	### Prepare internal priming mask
	To evaluate potential internal priming events, a poly(A/T) mask of a 10 consecutive mono bases is aligned to the genome of interest with [Bowtie](http://bowtie-bio.sourceforge.net/index.shtml).

	```
	bowtie /path_to_genome/bowtie_index/genome PolyTailMask.fa -f -v 2 --all --sam --threads 6 \|samtools view -buS - \| samtools sort - PolyTailMask
	```

	The resulted Bam file is converted to Bed format and compressed with [BGzip](https://github.com/samtools/htslib).

	```
	bedtools bamtobed -i PolyTailMask.bam -split \| bedtools merge -i - -s -c 4 -o count \| sort -k1,1 -k2,2n \| awk -F"\t" 'OFS="\t"{print $1,$2,$3,sprintf("poly%06d",NR),$5,$4}' \| bgzip > species_PolyRegions.bed.gz
	```

	The final mask file is indexed with [Tabix](https://github.com/samtools/htslib) for random access.

	```tabix species_PolyRegions.bed.gz```

	### Detection
	Clusters of poly(A) sites are detected using [PASSFinder](https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). It relies on [HTSLib](https://github.com/samtools/htslib). Bam alignment file is processed in a way, where each read is classified as either poly(A) containing or not, based on a poly(A) recognition algorithm, designed by [Jim Kent at UCSC](https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Coverage on each base is piled up. The resulted Bed coverage files are sorted by position, compressed using [BGzip](https://github.com/samtools/htslib) and indexed using [Tabix](https://github.com/samtools/htslib).

	```
	PASSFinder --input $inputBam -r species_PolyRegions.bed.gz --masksize 3 --polysize 3 --mapq 255 \| sort -k1,1 -k2,2n \| bgzip -@ 8 > $bedCoverageFile;
	tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;
	```

	### Clustering
	3'base coverage holds the infomation of potential 3'UTR sites. In order to identify the positions of polyadenylation we implemented a base clustering techniques that can merge positions in user defined window. During the grouping process information about best expressed base and best expressed seed (3'base of reads containing poly(A) tail) is maintained:

	1. chrom
	2. chromStart
	3. chromEnd
	4. groupId
	5. span
	6. strand
	7. clusterBasesMasked
	8. clusterBasesCounts
	9. clusterReadsSumCoverage
	10. clusterReadsMaxCoverage
	11. clusterReadsBestBase
	12. clusterSeedsCounts
	13. clusterSeedsSumCoverage
	14. clusterSeedsMaxCoverage
	15. clusterSeedsBestBase

	Use [PASSCluster.pl](scripts/PASSCluster.pl) script to run the procedure. We used a clustering windows of 25 nucleotides as suggested in [Müller et. al.](https://www.ncbi.nlm.nih.gov/pubmed/?term=25052703).

	```
	perl PASSCluster.pl -gbed $bedCoverageFile -window 25 \| sort -k1,1 -k2,2n \| bgzip > $passCluseredFile.bed.gz;
	tabix --sequence 1 --begin 2 --end 2 --zero-based $file_out;
	```

	### Annotation

	##### Prepare reference annotation map
	Reference annotation maps were downloaded from [NCBI]() database for Chrysemys picta ([Turtle](https://goo.gl/H952Dr)) and Pogona vitticeps ([Lizard](https://goo.gl/2atdhP)). The GFF files were converted to Bed12 files.
	The Bed12 reference annotation file has the name field contains transcript and gene information separated by ';'. To split the Bed12 file into Bed6 file with feature label (5'UTR, CDS, intron, 3'UTR) appended to the name run [BED12Split.pl](scripts/BED12Split.pl).

	##### Find closest gene to cluster
	To associate PASS clusters with genes use [PASSAnnotate.pl](scripts/PASSAnnotate.pl) routine. The script requires a Bed6 features map created in the previous step and PASS clusters file from the clustering procedure. It uses [bedtools closest] (http://bedtools.readthedocs.io/en/latest/content/tools/closest.html) as internal engine for assosiating neighbouring genes and sites. The routine will calculate UTR predicted length and relative contribution of peak per gene. The following fields are added to the information of the clusters:

	16. geneSymbol
	17. geneFeature
	18. upstreamStartPosition
	19. upstreamSpan
	20. filter (true if pass site contributes >= 1% to total gene expression)

	The window used to assign upstream gene was 20kB.

	```
	perl PASSAnnotate.pl $passCluseredFile.bed.gz refSeq_Features.bed 20000 \|sort -k1,1 -k2,2n \|gzip > speciesAnnotation_Date.bed.gz
	```

	## Methods
	Library called [MACE](http://genxpro.net/sequencing/transcriptome/mace-massive-analysis-of-cdna-ends/) (Massive Analysis of cDNA Ends) (cite: https://www.ncbi.nlm.nih.gov/pubmed/25468442) is used to prepare RNA samples from whole turtle or lizzard brain. Reads were 68 nucleotides long and filtered for PCR duplicates by unique molecule identifiers (UMIs). Reads were aligned to genome of interest using STAR aligner (cite: https://www.ncbi.nlm.nih.gov/pubmed/23104886). Parameters outFilterScoreMinOverLread and outFilterMatchNminOverLread are used to allow poly(A) tail soft clip at reads end. Only uniquely aligned reads are considered for downstream analysis. Internal priming events are avoided by filtering alignments hitting poly(A) rich genomic regions. Such regions are identified by aligning 10 As to genome of interest (cite: https://www.ncbi.nlm.nih.gov/pubmed/25052703). Poly(A) supported sites (PASS) are identified using a HTSLib driven tool called PASSFinder (https://github.molgen.mpg.de/MPIBR-Bioinformatics/PASSFinder). The tool will cluster reads based on Poly(A) tail using an algorithm defined by Jim Kent at UCSC (https://github.com/jstjohn/KentLib/blob/master/lib/dnautil.c). Alignments are piled up and 3'-most non-A base is maximised to precisely pinpoint poly(A) site position. PAS sites are then associated with upstream genes up to 20kB and current species annotation is accordingly extended.

	## Overview
	### Contribution of extended 3'UTRs
	![figure_LengthContribution](./figures/figureX_relativeExtension.png)

	The plot represents a cummulative distribution of the relative distance from annotated 3'End. All upstream identified 3'UTRs will have negative distance and will represent an alternative polyadenylation site (~70% of all sites). All downstream identified 3'UTRs will have positive distance and will represent extended PASS (~30% of all sites). Both annotations overcome the same degree of extension. Range of [-5,+5] kB contributes to more than 95% of all sites.

	![figure_FeaturesContribution](./figures/figureX_DistributionOfGeneFeatures.png)

	The plot represents each sequencing run per species and the relative contribution of reads used in analysis to gene features. Extended 3'UTRs contribute with around ~10%. Intronic contribution is relatively low, suggesting that DropSeq method preserves cells integrity and has a low efficiency on pre-mRNA sequencing. Both genomes suffer some lack of annotation as we can recognise it from the contribution of intergenic reads.

	![figure_ExampleTrack](./figures/track_CALM3_05Oct2017.pdf)

	Example Genome track for CALM3. One can recognize the 3'bias of the method, although most of the coverage falls in the last coding sequence exons. Red track shows the extension achived with the MACE sequencing. Some substantial gene coverage is recovered by that technique.