Skip to content
Permalink
26e8fc7843
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
225 lines (168 sloc) 8.27 KB
# Protocol for bioinformatics analysis of LTP project
## Experiment samples
| Sample Name | Experiment Name |
|-------------|-----------------------------|
| Sample 1 | LTP 5 min |
| Sample 2 | LTP 5 min (control) |
| Sample 3 | LTP 5 (no stimulation) |
| Sample 4 | LTP 5 min |
| Sample 5 | LTP 5 min (control) |
| Sample 6 | LTP 30 min |
| Sample 7 | LTP 30 min (control) |
| Sample 8 | LTP 30 min (no stimulation) |
| Sample 9 | LTP 30 min |
| Sample 10 | LTP 30 min (control) |
| Sample 11 | LTP 1 hr |
| Sample 12 | LTP 1 hr (control) |
| Sample 13 | LTP 1 hr (No stimulation) |
| Sample 14 | LTP 1 hr |
| Sample 15 | LTP 1 hr (control) |
| Sample 16 | LTP 1 hr |
## Organisation
**Raw and preprocessed files are located under:**
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`
Directory structure:
- raw_data: contains raw sequenced reads and other data as obtained from Novogen.
- reads_quality: contains quality checks of each raw file
- alignments: contains information about the alignments
- bams: contains the aligned reads as BAM Files
- logs: contains the log files of each alignment run
- gene_counts: contains the read counts for each gene in the reference.
**Analysis files (means, after gene counting) are located under:**
`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`
The folder `analysis` contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc.
The parent folder `LTP_transcriptomics` contains also a symlink `data` to the raw and preprocessed files mentioned above.
## Preprocessing
Preprocessing steps are run under the folder:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`
### 1. MD5 checksum
```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt .
```
### 2. Reads quality check
Checked the quality for each raw file using [fastqSeqStats](https://software.scic.brain.mpg.de/projects/MPIBR-Bioinformatics/fastqSeqStats).
```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data;
```
```bash
for file in *.fq.gz; do file_name=$(basename "$file"); file_name="${file_name%.*.*}"; gunzip -c $file | fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done
```
#### Results
The quality reports can be found under:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality`
### 3. Reads alignment
Aligned with STAR v2.4.2a using following bash script:
```bash
#!/bin/bash
# $1 :: source directory
# $2 :: STAR genome dir
# $3 :: BAM dir out
# $4 :: LOG dir out
for file in $1*_1.fq.gz
do
file_name=$(basename "$file");
extension="${file_name#*.}"
file_name="${file_name%\_*}";
read_1=$1$file_name"_1."$extension;
read_2=$1$file_name"_2."$extension;
echo "-----------------";
echo $file_name;
echo "-----------------";
echo "Reads 1: "$read_1;
echo "Reads 2: "$read_2;
echo "-----------------";
echo "Alignment";
STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical;
echo "index alignment ...";
mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam";
samtools index $3$file_name".bam";
echo "cleanup ...";
mv "Log.final.out" $4$file_name"_seqlog.txt";
rm -f "SJ.out.tab";
rm -f "Log.out";
rm -f "Log.progress.out";
done
```
Used rat's genome rn5_Mar2012 as reference for the mapping.
The command was:
`/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`
#### Results
The aligned reads (BAMs) under:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/`
These BAM files are sorted by coordinates.
Log files under:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`
### 4. Feature counting
Count the amount of reads per feature (gene) in each sample, using [featureCounts](http://bioinf.wehi.edu.au/featureCounts/).
Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort:
```bash
for file in *.bam; do file_name=$(basename "$file"); file_name="${file_name%.*}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done
mv *_sorted.bam bamsSortedByReadName/
```
Annotation file for rat under:
`/storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf`
featureCounts parameters:
-Q 255
-T 6
Change first to directory containing the sorted BAMs:
```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName
```
Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...):
```bash
featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam | sort -n -k1.2 | paste -s -d ' ' -)
```
#### Extract mapped features only
Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them:
```bash
awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt
```
Then we add a header to that file with labels and sample names:
```bash
paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary | cut -f2- | tr -s '\t' '\n' | awk 'OFS="\t"{split($1, a, "_"); print a[1]}' | paste -s -d '\t' -) | cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt
```
## Analysis
Analysis steps are run under the folder:
`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`
### 1. Normalization
Removed entries with RPKM values lower than 2 and applied deSeq normalisation:
```
analysis_rpkm_deSeq_normalization.m
```
### 2. Fold change
Visualize fold change for each experiment time-range pair.
```
analysis_foldchange.m
analysis_clustergram_foldchange.m
```
### 3. Binomial fit test
Visualize binomial fit test for each experiment time-range pair.
```
analysis_binomial_test.m
```
### 4. Development of gene expression over time
Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where:
- "d" = downregulated
- "n" = no regulation (normal)
- "u" = upregulated
```
analysis_expression_development.m
```
### 5. Gene ontology
Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script [GOAnalysis](https://software.scic.corp.brain.mpg.de/projects/MPIBR-Bioinformatics/GOAnalysis) by Georgi Tushev.
First change to the corresponding analsyis' data folder:
```bash
cd data/gene_ontology
```
The run the annotation:
```bash
for file in target_genes_lists/*.txt; do file_name=$(basename "$file"); file_name="${file_name%.*}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done
```
Get all unique GO terms available in the annotated files:
```bash
cat go_* | cut -f1 | sort -u > all_go_terms.txt
```
Create table with the P-value of each condition for each GO Term available:
```
analysis_gene_ontology.m
```