README.md

# Protocol for bioinformatics analysis of LTP project

## Experiment samples

| Sample Name | Experiment Name             |
|-------------|-----------------------------|
| Sample 1    | LTP 5 min                   |
| Sample 2    | LTP 5 min (control)         |
| Sample 3    | LTP 5 (no stimulation)      |
| Sample 4    | LTP 5 min                   |
| Sample 5    | LTP 5 min (control)         |
| Sample 6    | LTP 30 min                  |
| Sample 7    | LTP 30 min (control)        |
| Sample 8    | LTP 30 min (no stimulation) |
| Sample 9    | LTP 30 min                  |
| Sample 10   | LTP 30 min (control)        |
| Sample 11   | LTP 1 hr                    |
| Sample 12   | LTP 1 hr (control)          |
| Sample 13   | LTP 1 hr (No stimulation)   |
| Sample 14   | LTP 1 hr                    |
| Sample 15   | LTP 1 hr (control)          |
| Sample 16   | LTP 1 hr                    |

## Organisation

**Raw and preprocessed files are located under:**

`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`

Directory structure:
- raw_data: contains raw sequenced reads and other data as obtained from Novogen.
- reads_quality: contains quality checks of each raw file
- alignments: contains information about the alignments
  - bams: contains the aligned reads as BAM Files
  - logs: contains the log files of each alignment run
- gene_counts: contains the read counts for each gene in the reference.

**Analysis files (means, after gene counting) are located under:**

`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`

The folder `analysis` contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc.

The parent folder `LTP_transcriptomics` contains also a symlink `data` to the raw and preprocessed files mentioned above.


## Preprocessing

Preprocessing steps are run under the folder:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`

### 1. MD5 checksum

```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt .
```

### 2. Reads quality check
Checked the quality for each raw file using [fastqSeqStats](https://software.scic.brain.mpg.de/projects/MPIBR-Bioinformatics/fastqSeqStats).

```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data;
```
```bash
for file in *.fq.gz; do file_name=$(basename "$file"); file_name="${file_name%.*.*}"; gunzip -c $file | fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done
```
#### Results
The quality reports can be found under:
`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality`

### 3. Reads alignment
Aligned with STAR v2.4.2a using following bash script:

```bash
#!/bin/bash

# $1 :: source directory
# $2 :: STAR genome dir
# $3 :: BAM dir out
# $4 :: LOG dir out

for file in $1*_1.fq.gz
do
  file_name=$(basename "$file");
	extension="${file_name#*.}"
  file_name="${file_name%\_*}";

	read_1=$1$file_name"_1."$extension;
	read_2=$1$file_name"_2."$extension;

	echo "-----------------";
  echo $file_name;
	echo "-----------------";
	echo "Reads 1: "$read_1;
	echo "Reads 2: "$read_2;
	echo "-----------------";
  echo "Alignment";
      STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical;
  echo "index alignment ...";
      mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam";
      samtools index $3$file_name".bam";
  echo "cleanup ...";
      mv "Log.final.out" $4$file_name"_seqlog.txt";
      rm -f "SJ.out.tab";
      rm -f "Log.out";
      rm -f "Log.progress.out";
done
```

Used rat's genome rn5_Mar2012 as reference for the mapping.

The command was:

`/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`

#### Results

The aligned reads (BAMs) under:

`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/`

These BAM files are sorted by coordinates.

Log files under:

`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`

### 4. Feature counting
Count the amount of reads per feature (gene) in each sample, using [featureCounts](http://bioinf.wehi.edu.au/featureCounts/).

Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort:

```bash
for file in *.bam; do file_name=$(basename "$file"); file_name="${file_name%.*}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done

mv *_sorted.bam bamsSortedByReadName/
```

Annotation file for rat under:
`/storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf`

featureCounts parameters:
-Q 255
-T 6

Change first to directory containing the sorted BAMs:
```bash
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName
```

Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...):
```bash
featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam | sort -n -k1.2 | paste -s -d ' ' -)
```
#### Extract mapped features only
Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them:

```bash
awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt
```

Then we add a header to that file with labels and sample names:

```bash
paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary | cut -f2- | tr -s '\t' '\n' | awk 'OFS="\t"{split($1, a, "_"); print a[1]}' | paste -s -d '\t' -) | cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt
```

## Analysis

Analysis steps are run under the folder:
`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`

### 1. Normalization

Removed entries with RPKM values lower than 2 and applied deSeq normalisation:
```
analysis_rpkm_deSeq_normalization.m
```

### 2. Fold change

Visualize fold change for each experiment time-range pair.
```
analysis_foldchange.m
analysis_clustergram_foldchange.m
```

### 3. Binomial fit test
Visualize binomial fit test for each experiment time-range pair.
```
analysis_binomial_test.m
```

### 4. Development of gene expression over time
Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where:
- "d" = downregulated
- "n" = no regulation (normal)
- "u" = upregulated
```
analysis_expression_development.m
```

### 5. Gene ontology

Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script [GOAnalysis](https://software.scic.corp.brain.mpg.de/projects/MPIBR-Bioinformatics/GOAnalysis) by Georgi Tushev.

First change to the corresponding analsyis' data folder:
```bash
cd data/gene_ontology
```

The run the annotation:
```bash
for file in target_genes_lists/*.txt; do file_name=$(basename "$file"); file_name="${file_name%.*}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done
```

Get all unique GO terms available in the annotated files:
```bash
cat go_* | cut -f1 | sort -u > all_go_terms.txt
```

Create table with the P-value of each condition for each GO Term available:
```
analysis_gene_ontology.m
```
	# Protocol for bioinformatics analysis of LTP project

	## Experiment samples

	\| Sample Name \| Experiment Name \|
	\|-------------\|-----------------------------\|
	\| Sample 1 \| LTP 5 min \|
	\| Sample 2 \| LTP 5 min (control) \|
	\| Sample 3 \| LTP 5 (no stimulation) \|
	\| Sample 4 \| LTP 5 min \|
	\| Sample 5 \| LTP 5 min (control) \|
	\| Sample 6 \| LTP 30 min \|
	\| Sample 7 \| LTP 30 min (control) \|
	\| Sample 8 \| LTP 30 min (no stimulation) \|
	\| Sample 9 \| LTP 30 min \|
	\| Sample 10 \| LTP 30 min (control) \|
	\| Sample 11 \| LTP 1 hr \|
	\| Sample 12 \| LTP 1 hr (control) \|
	\| Sample 13 \| LTP 1 hr (No stimulation) \|
	\| Sample 14 \| LTP 1 hr \|
	\| Sample 15 \| LTP 1 hr (control) \|
	\| Sample 16 \| LTP 1 hr \|

	## Organisation

	Raw and preprocessed files are located under:

	`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`

	Directory structure:
	- raw_data: contains raw sequenced reads and other data as obtained from Novogen.
	- reads_quality: contains quality checks of each raw file
	- alignments: contains information about the alignments
	- bams: contains the aligned reads as BAM Files
	- logs: contains the log files of each alignment run
	- gene_counts: contains the read counts for each gene in the reference.

	Analysis files (means, after gene counting) are located under:

	`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`

	The folder `analysis` contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc.

	The parent folder `LTP_transcriptomics` contains also a symlink `data` to the raw and preprocessed files mentioned above.


	## Preprocessing

	Preprocessing steps are run under the folder:
	`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910`

	### 1. MD5 checksum

	```bash
	cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt .
	```

	### 2. Reads quality check
	Checked the quality for each raw file using [fastqSeqStats](https://software.scic.brain.mpg.de/projects/MPIBR-Bioinformatics/fastqSeqStats).

	```bash
	cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data;
	```
	```bash
	for file in .fq.gz; do file_name=$(basename "$file"); file_name="${file_name%..*}"; gunzip -c $file \| fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done
	```
	#### Results
	The quality reports can be found under:
	`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality`

	### 3. Reads alignment
	Aligned with STAR v2.4.2a using following bash script:

	```bash
	#!/bin/bash

	# $1 :: source directory
	# $2 :: STAR genome dir
	# $3 :: BAM dir out
	# $4 :: LOG dir out

	for file in $1*_1.fq.gz
	do
	file_name=$(basename "$file");
	extension="${file_name#*.}"
	file_name="${file_name%\_*}";

	read_1=$1$file_name"_1."$extension;
	read_2=$1$file_name"_2."$extension;

	echo "-----------------";
	echo $file_name;
	echo "-----------------";
	echo "Reads 1: "$read_1;
	echo "Reads 2: "$read_2;
	echo "-----------------";
	echo "Alignment";
	STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical;
	echo "index alignment ...";
	mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam";
	samtools index $3$file_name".bam";
	echo "cleanup ...";
	mv "Log.final.out" $4$file_name"_seqlog.txt";
	rm -f "SJ.out.tab";
	rm -f "Log.out";
	rm -f "Log.progress.out";
	done
	```

	Used rat's genome rn5_Mar2012 as reference for the mapping.

	The command was:

	`/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`

	#### Results

	The aligned reads (BAMs) under:

	`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/`

	These BAM files are sorted by coordinates.

	Log files under:

	`/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/`

	### 4. Feature counting
	Count the amount of reads per feature (gene) in each sample, using [featureCounts](http://bioinf.wehi.edu.au/featureCounts/).

	Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort:

	```bash
	for file in .bam; do file_name=$(basename "$file"); file_name="${file_name%.}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done

	mv *_sorted.bam bamsSortedByReadName/
	```

	Annotation file for rat under:
	`/storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf`

	featureCounts parameters:
	-Q 255
	-T 6

	Change first to directory containing the sorted BAMs:
	```bash
	cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName
	```

	Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...):
	```bash
	featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam \| sort -n -k1.2 \| paste -s -d ' ' -)
	```
	#### Extract mapped features only
	Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them:

	```bash
	awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt
	```

	Then we add a header to that file with labels and sample names:

	```bash
	paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary \| cut -f2- \| tr -s '\t' '\n' \| awk 'OFS="\t"{split($1, a, "_"); print a[1]}' \| paste -s -d '\t' -) \| cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt
	```

	## Analysis

	Analysis steps are run under the folder:
	`/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis`

	### 1. Normalization

	Removed entries with RPKM values lower than 2 and applied deSeq normalisation:
	```
	analysis_rpkm_deSeq_normalization.m
	```

	### 2. Fold change

	Visualize fold change for each experiment time-range pair.
	```
	analysis_foldchange.m
	analysis_clustergram_foldchange.m
	```

	### 3. Binomial fit test
	Visualize binomial fit test for each experiment time-range pair.
	```
	analysis_binomial_test.m
	```

	### 4. Development of gene expression over time
	Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where:
	- "d" = downregulated
	- "n" = no regulation (normal)
	- "u" = upregulated
	```
	analysis_expression_development.m
	```

	### 5. Gene ontology

	Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script [GOAnalysis](https://software.scic.corp.brain.mpg.de/projects/MPIBR-Bioinformatics/GOAnalysis) by Georgi Tushev.

	First change to the corresponding analsyis' data folder:
	```bash
	cd data/gene_ontology
	```

	The run the annotation:
	```bash
	for file in target_genes_lists/.txt; do file_name=$(basename "$file"); file_name="${file_name%.}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done
	```

	Get all unique GO terms available in the annotated files:
	```bash
	cat go_* \| cut -f1 \| sort -u > all_go_terms.txt
	```

	Create table with the P-value of each condition for each GO Term available:
	```
	analysis_gene_ontology.m
	```