-
Notifications
You must be signed in to change notification settings - Fork 0
How to use the scripts provided by the workflow
The workflow is separated into three sections. Section A for analyzing the Nanopore data, Section B for analyzing the gRNA NGS data, and Section C for the final visualization.
The data files needed to run this analysis are:
- A FASTQ-file containing all NanoPore reads
- The reference gene in FASTA format
- Exon gene annotation file in BED3 format
- All guide sequences in FASTA format
- A TSV-file with guide pair NGS counts
- Example file:
GuideA GuideB Rb_Tiling_ctrl_A Rb_tiling_lib Rb_Tiling_ctrl_C Rb_Tiling_end_B Rb_Tiling_end_D exon1_1234_0.60 exon2_1234_0.60 100 50 112 345 367 exon1_1234_0.60 exon3_1234_0.60 20 25 10 490 510 exon1_1234_0.60 exon4_1234_0.60 101 120 39 1123 1000
- Example file:
Note: The guide names need to be in following format: Name[X]_[locus]_[doench score]
The A section of the workflow analyses the Nanopore sequences and identifies the excisions.
- The first step is to merge the FASTA files if the sequences are split up into multiple files. This can be easily done using basic shell commands:
cat *.fasta.txt > combined.fasta
- The combined FASTA file serves as the input for minimap2. Minimap2 aligns the sequences to the reference sequence. Minimap2 is run in the spliced alignment mode.
minimap2 -ax splice --secondary=no gene.fa query.fa > align.sam
- The aligned sequences are then filtered
awk '($2 == "2064" || $2 == "2048") && (NR >1) {next} {print}' align.sam> align_unique.sam
- The filtered SAM file is converted to a BED file using paftools. Paftools is a script provided by minimap2
paftools.js splice2bed align_unique.sam > align.bed
-
The resulting BED file contains the aligned exons of each sequence as the blocks in columns 10-12.
-
By overlapping these exons with the exons from the reference gene, the missing exons can be identified. This is done why a script provided by this workflow. The exon annotation needs to be respectively to the reference.
python ./ANGIE/scripts/exon_exon_excision_count.py -n alignment_unique.bed -e exon_annotation.bed -o nanopore_count.tsv
The B section of the workflow focuses on the analysis of the gRNA pair NGS counts. The section is split into B.1 and B.2
Subsection B.1 validates the position of the gRNAs on the reference sequence and the PAM orientation of the gRNAs in each pair.
This is done by aligning the guide sequences to the reference using bowtie2.
The resulting SAM file is converted to a TSV file with the following columns:
gRNA name | PAM orientation | reference name | position | sequence
This TSV-file serves as an input for the guide setup in subsection B.2
Subsection B.2 normalizes the gRNA counts. The counts are normalized using the DESeq normalization.
python ./ANGIE/scripts/deseq_wrapper.py -i gRNA_count.tsv -p /Project-Schnipp-Schnapp/scripts/deseq_norm.R -o ./output_dir/
The normalized counts serve as the input for the guide setup script. The setup script sets the guide pairs up to be quantified.
python ./ANGIE/scripts/guide_setup.py -i normalized.counts.tsv -a guide_align(from B.1).tsv -o guide_setup.tsv
The resulting guide_setup.tsv serves as the input for the quantification script. It counts the gRNA pairs and bins them into the intron-intron or exon-exon combinations.
python ./ANGIE/scripts/get_guide_pair_count.py -i guide_setup.tsv -o guide_pair_count.tsv
To combine the data from section A and section B the merge_count_dfs.py is called.
python ./ANGIE/scripts/merge_count_dfs.py -i nanopore_count.tsv -n guide_pair_count.tsv -o merged_counts.tsv
Section C is about plotting the data. The plot_log2fold.py script plots the log2 fold change of each pair as a scatter plot.
python ./ANGIE/scripts/plot_log2fold.py -i guide_pair_count.tsv -o lfc_scatter.png
Additionally, the scatter_exon_exon.py can be called to compare the bins with each other. For example, the exon-exon combinations.
python ./Project-Schnipp-Schnapp/scripts/scatter_exon_exon.py -i merged_counts.tsv -e [number of binds] -q [quadrant split] -o scatter.png