Protocol for bioinformatics analysis of LTP project

Experiment samples

Sample Name	Experiment Name
Sample 1	LTP 5 min
Sample 2	LTP 5 min (control)
Sample 3	LTP 5 (no stimulation)
Sample 4	LTP 5 min
Sample 5	LTP 5 min (control)
Sample 6	LTP 30 min
Sample 7	LTP 30 min (control)
Sample 8	LTP 30 min (no stimulation)
Sample 9	LTP 30 min
Sample 10	LTP 30 min (control)
Sample 11	LTP 1 hr
Sample 12	LTP 1 hr (control)
Sample 13	LTP 1 hr (No stimulation)
Sample 14	LTP 1 hr
Sample 15	LTP 1 hr (control)
Sample 16	LTP 1 hr

Organisation

Raw and preprocessed files are located under:

/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910

Directory structure:

raw_data: contains raw sequenced reads and other data as obtained from Novogen.
reads_quality: contains quality checks of each raw file
alignments: contains information about the alignments
- bams: contains the aligned reads as BAM Files
- logs: contains the log files of each alignment run
gene_counts: contains the read counts for each gene in the reference.

Analysis files (means, after gene counting) are located under:

/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis

The folder analysis contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc.

The parent folder LTP_transcriptomics contains also a symlink data to the raw and preprocessed files mentioned above.

Preprocessing

Preprocessing steps are run under the folder: /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910

1. MD5 checksum

cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt .

2. Reads quality check

Checked the quality for each raw file using fastqSeqStats.

cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data;

for file in *.fq.gz; do file_name=$(basename "$file"); file_name="${file_name%.*.*}"; gunzip -c $file | fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done

Results

The quality reports can be found under: /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality

3. Reads alignment

Aligned with STAR v2.4.2a using following bash script:

#!/bin/bash

# $1 :: source directory
# $2 :: STAR genome dir
# $3 :: BAM dir out
# $4 :: LOG dir out

for file in $1*_1.fq.gz
do
  file_name=$(basename "$file");
	extension="${file_name#*.}"
  file_name="${file_name%\_*}";

	read_1=$1$file_name"_1."$extension;
	read_2=$1$file_name"_2."$extension;

	echo "-----------------";
  echo $file_name;
	echo "-----------------";
	echo "Reads 1: "$read_1;
	echo "Reads 2: "$read_2;
	echo "-----------------";
  echo "Alignment";
      STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical;
  echo "index alignment ...";
      mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam";
      samtools index $3$file_name".bam";
  echo "cleanup ...";
      mv "Log.final.out" $4$file_name"_seqlog.txt";
      rm -f "SJ.out.tab";
      rm -f "Log.out";
      rm -f "Log.progress.out";
done

Used rat's genome rn5_Mar2012 as reference for the mapping.

The command was:

/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/

Results

The aligned reads (BAMs) under:

/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/

These BAM files are sorted by coordinates.

Log files under:

/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/

4. Feature counting

Count the amount of reads per feature (gene) in each sample, using featureCounts.

Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort:

for file in *.bam; do file_name=$(basename "$file"); file_name="${file_name%.*}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done

mv *_sorted.bam bamsSortedByReadName/

Annotation file for rat under: /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf

featureCounts parameters: -Q 255 -T 6

Change first to directory containing the sorted BAMs:

cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName

Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...):

featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam | sort -n -k1.2 | paste -s -d ' ' -)

Extract mapped features only

Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them:

awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt

Then we add a header to that file with labels and sample names:

paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary | cut -f2- | tr -s '\t' '\n' | awk 'OFS="\t"{split($1, a, "_"); print a[1]}' | paste -s -d '\t' -) | cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt

Analysis

Analysis steps are run under the folder: /storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis

1. Normalization

Removed entries with RPKM values lower than 2 and applied deSeq normalisation:

analysis_rpkm_deSeq_normalization.m

2. Fold change

Visualize fold change for each experiment time-range pair.

analysis_foldchange.m
analysis_clustergram_foldchange.m

3. Binomial fit test

Visualize binomial fit test for each experiment time-range pair.

analysis_binomial_test.m

4. Development of gene expression over time

Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where:

"d" = downregulated
"n" = no regulation (normal)
"u" = upregulated

analysis_expression_development.m

5. Gene ontology

Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script GOAnalysis by Georgi Tushev.

First change to the corresponding analsyis' data folder:

cd data/gene_ontology

The run the annotation:

for file in target_genes_lists/*.txt; do file_name=$(basename "$file"); file_name="${file_name%.*}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done

Get all unique GO terms available in the annotated files:

cat go_* | cut -f1 | sort -u > all_go_terms.txt

Create table with the P-value of each condition for each GO Term available:

analysis_gene_ontology.m

MPIBR-Bioinformatics/LTP-transcriptomics

About

Resources

Stars

Watchers

Forks

Releases

Languages