Sample Name | Experiment Name |
---|---|
Sample 1 | LTP 5 min |
Sample 2 | LTP 5 min (control) |
Sample 3 | LTP 5 (no stimulation) |
Sample 4 | LTP 5 min |
Sample 5 | LTP 5 min (control) |
Sample 6 | LTP 30 min |
Sample 7 | LTP 30 min (control) |
Sample 8 | LTP 30 min (no stimulation) |
Sample 9 | LTP 30 min |
Sample 10 | LTP 30 min (control) |
Sample 11 | LTP 1 hr |
Sample 12 | LTP 1 hr (control) |
Sample 13 | LTP 1 hr (No stimulation) |
Sample 14 | LTP 1 hr |
Sample 15 | LTP 1 hr (control) |
Sample 16 | LTP 1 hr |
Raw and preprocessed files are located under:
/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910
Directory structure:
- raw_data: contains raw sequenced reads and other data as obtained from Novogen.
- reads_quality: contains quality checks of each raw file
- alignments: contains information about the alignments
- bams: contains the aligned reads as BAM Files
- logs: contains the log files of each alignment run
- gene_counts: contains the read counts for each gene in the reference.
Analysis files (means, after gene counting) are located under:
/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis
The folder analysis
contains this repository (all the scripts used for analysis) as well as generated data, external programs, etc.
The parent folder LTP_transcriptomics
contains also a symlink data
to the raw and preprocessed files mentioned above.
Preprocessing steps are run under the folder:
/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data; md5sum -c MD5.txt .
Checked the quality for each raw file using fastqSeqStats.
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data;
for file in *.fq.gz; do file_name=$(basename "$file"); file_name="${file_name%.*.*}"; gunzip -c $file | fastqSeqStats --ifastq - --otxt $file_name"_stats.txt"; done
The quality reports can be found under:
/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/reads_quality
Aligned with STAR v2.4.2a using following bash script:
#!/bin/bash
# $1 :: source directory
# $2 :: STAR genome dir
# $3 :: BAM dir out
# $4 :: LOG dir out
for file in $1*_1.fq.gz
do
file_name=$(basename "$file");
extension="${file_name#*.}"
file_name="${file_name%\_*}";
read_1=$1$file_name"_1."$extension;
read_2=$1$file_name"_2."$extension;
echo "-----------------";
echo $file_name;
echo "-----------------";
echo "Reads 1: "$read_1;
echo "Reads 2: "$read_2;
echo "-----------------";
echo "Alignment";
STAR --runMode alignReads --runThreadN 16 --genomeDir $2 --readFilesIn $read_1 $read_2 --readFilesCommand gunzip -c --outSAMattributes All --outStd Log --outSAMtype BAM SortedByCoordinate --outSAMstrandField intronMotif --outFilterIntronMotifs RemoveNoncanonical;
echo "index alignment ...";
mv "Aligned.sortedByCoord.out.bam" $3$file_name".bam";
samtools index $3$file_name".bam";
echo "cleanup ...";
mv "Log.final.out" $4$file_name"_seqlog.txt";
rm -f "SJ.out.tab";
rm -f "Log.out";
rm -f "Log.progress.out";
done
Used rat's genome rn5_Mar2012 as reference for the mapping.
The command was:
/storage/scic/Data/External/Bioinformatics/bin/pipelines/alignPairedReadsSTAR.sh /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/raw_data/ /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/stardb/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/ /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/
The aligned reads (BAMs) under:
/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bams/
These BAM files are sorted by coordinates.
Log files under:
/storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/logs/
Count the amount of reads per feature (gene) in each sample, using featureCounts.
Before doing the counting, we need the input BAM files to be sorted by read name, since featureCounts wants the mate pairs to be listed next to each other. For that we sort them with samtools sort:
for file in *.bam; do file_name=$(basename "$file"); file_name="${file_name%.*}"; samtools sort -n -o $file_name"_sortedByName.bam" $file; done
mv *_sorted.bam bamsSortedByReadName/
Annotation file for rat under:
/storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf
featureCounts parameters: -Q 255 -T 6
Change first to directory containing the sorted BAMs:
cd /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/alignments/bamsSortedByName
Run featureCounts on every BAM file. The files will be read in alphanumerical order (1,2,3...10,11,12...):
featureCounts -Q 255 -T 6 -p -a /storage/scic/Data/External/Bioinformatics/Rat/Genomes/rn5_Mar2012/annotation/annRefSeq_rn5_20Apr2016.gtf -o /storage/schu/ProjectGroups/RNA/Data/RNASequencing/Novogene/NHHW162910/gene_counts/gene_counts_180117.txt $(ls *.bam | sort -n -k1.2 | paste -s -d ' ' -)
Once the counting of the features/genes is done, we extract only those which have at least one read mapped to them:
awk -F"\t" 'OFS="\t"{if(($7+$8+$9+$10+$11+$12+$13+$14+$15+$16+$17+$18+$19+$20+$21+$22)>0){print $1,$6,$7,$8,$9,$10,$11,$12,$13,$14,$15,$16,$17,$18,$19,$20,$21,$22}}' gene_counts_180117.txt > gene_counts_180117_only_mapped.txt
Then we add a header to that file with labels and sample names:
paste -d '\t' <(echo "GeneId") <(echo "Length") <(head -1 gene_counts_180117.txt.summary | cut -f2- | tr -s '\t' '\n' | awk 'OFS="\t"{split($1, a, "_"); print a[1]}' | paste -s -d '\t' -) | cat - gene_counts_180117_only_mapped.txt > gene_counts_180117_only_mapped_with_header.txt
Analysis steps are run under the folder:
/storage/schu/ProjectGroups/RNA/Projects/LTP_transcriptomics/analysis
Removed entries with RPKM values lower than 2 and applied deSeq normalisation:
analysis_rpkm_deSeq_normalization.m
Visualize fold change for each experiment time-range pair.
analysis_foldchange.m
analysis_clustergram_foldchange.m
Visualize binomial fit test for each experiment time-range pair.
analysis_binomial_test.m
Analyze the changes in gene expression over time (for each experiment) for each gene. This classifies each gene into one of 27 possible combinations/conditions: ddu, uud, ndd, ... etc, where:
- "d" = downregulated
- "n" = no regulation (normal)
- "u" = upregulated
analysis_expression_development.m
Run gene ontology annotation for each condition/combination file generated in the previous step. We use the perl script GOAnalysis by Georgi Tushev.
First change to the corresponding analsyis' data folder:
cd data/gene_ontology
The run the annotation:
for file in target_genes_lists/*.txt; do file_name=$(basename "$file"); file_name="${file_name%.*}"; perl ../../bin/GOAnalysis/GOAnalysis.pl -obo annotations/go-basic.obo -ann annotations/gene_association.rgd.gz -background background_list.txt -target $file > "go_"$file_name".txt"; done
Get all unique GO terms available in the annotated files:
cat go_* | cut -f1 | sort -u > all_go_terms.txt
Create table with the P-value of each condition for each GO Term available:
analysis_gene_ontology.m