Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
LTP-transcriptomics/README.md
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
225 lines (168 sloc)
8.27 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# Protocol for bioinformatics analysis of LTP project | |
## Experiment samples | |
| Sample Name | Experiment Name | | |
|-------------|-----------------------------| | |
| Sample 1 | LTP 5 min | | |
| Sample 2 | LTP 5 min (control) | | |
| Sample 3 | LTP 5 (no stimulation) | | |
| Sample 4 | LTP 5 min | | |
| Sample 5 | LTP 5 min (control) | | |
| Sample 6 | LTP 30 min | | |
| Sample 7 | LTP 30 min (control) | | |
| Sample 8 | LTP 30 min (no stimulation) | | |
| Sample 9 | LTP 30 min | | |
| Sample 10 | LTP 30 min (control) | | |
| Sample 11 | LTP 1 hr | | |
| Sample 12 | LTP 1 hr (control) | | |
| Sample 13 | LTP 1 hr (No stimulation) | | |
| Sample 14 | LTP 1 hr | | |
| Sample 15 | LTP 1 hr (control) | | |
| Sample 16 | LTP 1 hr | | |
## Organisation | |
**Raw and preprocessed files are located under:** | |
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910` | |
Directory structure: | |
- raw_data: contains raw sequenced reads and other data as obtained from Novogen. | |
- reads_quality: contains quality checks of each raw file | |
- alignments: contains information about the alignments | |
- bams: contains the aligned reads as BAM Files | |
- logs: contains the log files of each alignment run | |
- gene_counts: contains the read counts for each gene in the reference. | |
**Analysis files (means, after gene counting) are located under:** | |
`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis` | |
The folder `analysis` contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc. | |
The parent folder `LTP_transcriptomics` contains also a symlink `data` to the raw and preprocessed files mentioned above. | |
## Preprocessing | |
Preprocessing steps are run under the folder: | |
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910` | |
### 1. MD5 checksum | |
```bash | |
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt . | |
``` | |
### 2. Reads quality check | |
Checked the quality for each raw file using [fastqSeqStats](https://software.scic.brain.mpg.de/projects/MPIBR-Bioinformatics/fastqSeqStats). | |
```bash | |
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; | |
``` | |
```bash | |
for file in *.fq.gz; do file_name=$(basename "$file"); file_name="${file_name%.*.*}"; gunzip -c $file | fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done | |
``` | |
#### Results | |
The quality reports can be found under: | |
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality` | |
### 3. Reads alignment | |
Aligned with STAR v2.4.2a using following bash script: | |
```bash | |
#!/bin/bash | |
# $1 :: source directory | |
# $2 :: STAR genome dir | |
# $3 :: BAM dir out | |
# $4 :: LOG dir out | |
for file in $1*_1.fq.gz | |
do | |
file_name=$(basename "$file"); | |
extension="${file_name#*.}" | |
file_name="${file_name%\_*}"; | |
read_1=$1$file_name"_1."$extension; | |
read_2=$1$file_name"_2."$extension; | |
echo "-----------------"; | |
echo $file_name; | |
echo "-----------------"; | |
echo "Reads 1: "$read_1; | |
echo "Reads 2: "$read_2; | |
echo "-----------------"; | |
echo "Alignment"; | |
STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical; | |
echo "index alignment ..."; | |
mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam"; | |
samtools index $3$file_name".bam"; | |
echo "cleanup ..."; | |
mv "Log.final.out" $4$file_name"_seqlog.txt"; | |
rm -f "SJ.out.tab"; | |
rm -f "Log.out"; | |
rm -f "Log.progress.out"; | |
done | |
``` | |
Used rat's genome rn5_Mar2012 as reference for the mapping. | |
The command was: | |
`/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/` | |
#### Results | |
The aligned reads (BAMs) under: | |
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/` | |
These BAM files are sorted by coordinates. | |
Log files under: | |
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/` | |
### 4. Feature counting | |
Count the amount of reads per feature (gene) in each sample, using [featureCounts](http://bioinf.wehi.edu.au/featureCounts/). | |
Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort: | |
```bash | |
for file in *.bam; do file_name=$(basename "$file"); file_name="${file_name%.*}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done | |
mv *_sorted.bam bamsSortedByReadName/ | |
``` | |
Annotation file for rat under: | |
`/storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf` | |
featureCounts parameters: | |
-Q 255 | |
-T 6 | |
Change first to directory containing the sorted BAMs: | |
```bash | |
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName | |
``` | |
Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...): | |
```bash | |
featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam | sort -n -k1.2 | paste -s -d ' ' -) | |
``` | |
#### Extract mapped features only | |
Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them: | |
```bash | |
awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt | |
``` | |
Then we add a header to that file with labels and sample names: | |
```bash | |
paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary | cut -f2- | tr -s '\t' '\n' | awk 'OFS="\t"{split($1, a, "_"); print a[1]}' | paste -s -d '\t' -) | cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt | |
``` | |
## Analysis | |
Analysis steps are run under the folder: | |
`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis` | |
### 1. Normalization | |
Removed entries with RPKM values lower than 2 and applied deSeq normalisation: | |
``` | |
analysis_rpkm_deSeq_normalization.m | |
``` | |
### 2. Fold change | |
Visualize fold change for each experiment time-range pair. | |
``` | |
analysis_foldchange.m | |
analysis_clustergram_foldchange.m | |
``` | |
### 3. Binomial fit test | |
Visualize binomial fit test for each experiment time-range pair. | |
``` | |
analysis_binomial_test.m | |
``` | |
### 4. Development of gene expression over time | |
Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where: | |
- "d" = downregulated | |
- "n" = no regulation (normal) | |
- "u" = upregulated | |
``` | |
analysis_expression_development.m | |
``` | |
### 5. Gene ontology | |
Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script [GOAnalysis](https://software.scic.corp.brain.mpg.de/projects/MPIBR-Bioinformatics/GOAnalysis) by Georgi Tushev. | |
First change to the corresponding analsyis' data folder: | |
```bash | |
cd data/gene_ontology | |
``` | |
The run the annotation: | |
```bash | |
for file in target_genes_lists/*.txt; do file_name=$(basename "$file"); file_name="${file_name%.*}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done | |
``` | |
Get all unique GO terms available in the annotated files: | |
```bash | |
cat go_* | cut -f1 | sort -u > all_go_terms.txt | |
``` | |
Create table with the P-value of each condition for each GO Term available: | |
``` | |
analysis_gene_ontology.m | |
``` |