How to use the scripts provided by the workflow

The Workflow

Projekt-Schnipp-Schnapp_Pipeline

The workflow is separated into three sections. Section A for analyzing the Nanopore data, Section B for analyzing the gRNA NGS data, and Section C for the final visualization.

Input

The data files needed to run this analysis are:

A FASTQ-file containing all NanoPore reads
The reference gene in FASTA format
Exon gene annotation file in BED3 format
All guide sequences in FASTA format

A TSV-file with guide pair NGS counts

Example file:

GuideA	GuideB	Rb_Tiling_ctrl_A	Rb_tiling_lib	Rb_Tiling_ctrl_C	Rb_Tiling_end_B	Rb_Tiling_end_D
exon1_1234_0.60	exon2_1234_0.60	100	50	112	345	367
exon1_1234_0.60	exon3_1234_0.60	20	25	10	490	510
exon1_1234_0.60	exon4_1234_0.60	101	120	39	1123	1000

Note: The guide names need to be in following format: Name[X]_[locus]_[doench score]

Section A

The A section of the workflow analyses the Nanopore sequences and identifies the excisions.

The first step is to merge the FASTA files if the sequences are split up into multiple files. This can be easily done using basic shell commands:

cat *.fasta.txt > combined.fasta

The combined FASTA file serves as the input for minimap2. Minimap2 aligns the sequences to the reference sequence. Minimap2 is run in the spliced alignment mode.

minimap2 -ax splice --secondary=no gene.fa query.fa > align.sam

The aligned sequences are then filtered

awk '($2 == "2064" || $2 == "2048") && (NR >1) {next} {print}' align.sam> align_unique.sam

The filtered SAM file is converted to a BED file using paftools. Paftools is a script provided by minimap2

paftools.js splice2bed align_unique.sam > align.bed

The resulting BED file contains the aligned exons of each sequence as the blocks in columns 10-12.
By overlapping these exons with the exons from the reference gene, the missing exons can be identified. This is done why a script provided by this workflow. The exon annotation needs to be respectively to the reference.

python ./ANGIE/scripts/exon_exon_excision_count.py -n alignment_unique.bed -e exon_annotation.bed -o nanopore_count.tsv

Section B

The B section of the workflow focuses on the analysis of the gRNA pair NGS counts. The section is split into B.1 and B.2

Subsection B.1

Subsection B.1 validates the position of the gRNAs on the reference sequence and the PAM orientation of the gRNAs in each pair. This is done by aligning the guide sequences to the reference using bowtie2. The resulting SAM file is converted to a TSV file with the following columns:
gRNA name | PAM orientation | reference name | position | sequence

This TSV-file serves as an input for the guide setup in subsection B.2

Subsection B.2

Subsection B.2 normalizes the gRNA counts. The counts are normalized using the DESeq normalization.

python ./ANGIE/scripts/deseq_wrapper.py -i gRNA_count.tsv -p /Project-Schnipp-Schnapp/scripts/deseq_norm.R -o ./output_dir/

The normalized counts serve as the input for the guide setup script. The setup script sets the guide pairs up to be quantified.

python ./ANGIE/scripts/guide_setup.py -i normalized.counts.tsv -a guide_align(from B.1).tsv -o guide_setup.tsv

The resulting guide_setup.tsv serves as the input for the quantification script. It counts the gRNA pairs and bins them into the intron-intron or exon-exon combinations.

python ./ANGIE/scripts/get_guide_pair_count.py -i guide_setup.tsv -o guide_pair_count.tsv

Combine section A and B

To combine the data from section A and section B the merge_count_dfs.py is called.

python ./ANGIE/scripts/merge_count_dfs.py -i nanopore_count.tsv -n guide_pair_count.tsv -o merged_counts.tsv

Section C

Section C is about plotting the data. The plot_log2fold.py script plots the log2 fold change of each pair as a scatter plot.

python ./ANGIE/scripts/plot_log2fold.py -i guide_pair_count.tsv -o lfc_scatter.png

Additionally, the scatter_exon_exon.py can be called to compare the bins with each other. For example, the exon-exon combinations.

python ./Project-Schnipp-Schnapp/scripts/scatter_exon_exon.py -i merged_counts.tsv -e [number of binds] -q [quadrant split] -o scatter.png