SV conflict analysis

This repo contains various utilities and a pipeline to inspect(pipeline) conflicts of structural calls. The tool iterates over all given genomic regions and collects data on zygosity, repetitiveness, and different visualizations. The main utilities used in the pipeline are:

Multiple Sequence Alignment
PCA on pairwise read distances
Comparison of references and sample haplotypes

The result of a run is a directory with subdiretories for each region. In each region's directory, collected data on that region can be found. Aswell as PDF file, that inhabits all visualizations together for a quick overview of a region.

Usage of analysis.py

All paths passed to the program have to exist, except the result_dir argument, which will be created if necessary.

positional arguments:
  conflicts             Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41')
  vcf_dir               Directory with vcf files for all technologies.
  result_dir            Directory for subfolders for each region.
  hg38_reads            Path to hg38 PacBio read file.
  hg38_ref              Path to reference of hg38.
  t2t_ref               Path to reference of t2t.

optional arguments:
  -h, --help            show this help message and exit
  --hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX
                        Index file of hg38 reads (in case they are not in the same folder)
  --hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA
                        Illumina reads of hg38.
  --additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES
                        Path to vcf file with extra haplotypes (like HGSVC)
  --skip SKIP           Use this to skip the first N conflicts in the conflict file/string.
  --touch_only          Use this to just 'touch' the result directories so that they are in the correct order.
  --msa                 Perform MSA for each region. Increases the running time drastically.

Prerequisities

Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed:

Overview of the functionalities of the single utilities

conflict_positions.py

takes a conflict graph file and one vcf file as arguments
prints all graphs' starting and ending positions

ambiguity_scores.py

takes a conflict graph file and a chosen call method as arguments
for the chosen method the occurences in each graph are counted
the counts are put into a barplot that is saved to png file

amgiguity_distances.py

takes a conflict graph file and a chosen call method as arguments
for the chosen method the positions of each SV are saved
for each combination of SVs in a graph the break point distances are calculated
all those distances are then put into a barplot that is saved as a png file

multi_occurences.py

takes a conflict file as input
for each method we count in how many graphs they occur
furthermore, for each method we count the graphs where it occurs at least twice
this information is saved in a barplot (horizontal)

sv_to_gt.sh

takes a conflict file and a method
for each graph one line gets printed where all genotypes of the SVs of method in that graph are listed

genotype_stats.py

takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs
each combination is counted
after that each combination is printed with its count
if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1)

reference_comparison.py

takes two files of reads mapped to different references, a read name and the region in the first reference to look at
we get the sequence of both references where the read is mapped to and also the read sequence itself
all three sequences are saved into txt files and dotplot commands are suggested at the end of the script

msa_on_conflicts.py

takes a region file and a BAM file (and some more options)
for all reads that cover the region (+padding) make a MSA with clustalw
reinterpret the MSA as new alignments of the reads and put them into a new BAM file

An examplary workflow:

CONFLICTS="some_conflict_file.txt"
VCF_DIR="path/to/vcf/files/"
REGIONS="regions.txt"
POSITIONS="positions.txt"
GTS="genotypes.txt"
METHOD=svim

python multi_occurences.py $CONFLICTS > mult_occ.txt
grep $METHOD mult_occ.txt > $REGIONS

# check out genotype information in wanted regions
bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS"
python genotype_stats.py "$GTS" quality

python conflict_positions.py $REGIONS > $POSITIONS

python msa_on_conflicts.py $POSITIONS ...

To Document

sv_type_stat.py
trf_counter.sh
vcfs_comp.py
needleman_wunsch_omp.cpp
liftover.py
cmp_breakend.py
supp_align_query.sh

kriese/sv-conflict-analysis

About

Resources

Stars

Watchers

Forks

Releases

Languages