Skip to content

kriese/sv-conflict-analysis

main
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.

SV conflict analysis

This repo contains various utilities and a pipeline to inspect(pipeline) conflicts of structural calls. The tool iterates over all given genomic regions and collects data on zygosity, repetitiveness, and different visualizations. The main utilities used in the pipeline are:

  1. Multiple Sequence Alignment
  2. PCA on pairwise read distances
  3. Comparison of references and sample haplotypes

The result of a run is a directory with subdiretories for each region. In each region's directory, collected data on that region can be found. Aswell as PDF file, that inhabits all visualizations together for a quick overview of a region.

Usage of analysis.py

All paths passed to the program have to exist, except the result_dir argument, which will be created if necessary.

positional arguments:
  conflicts             Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41')
  vcf_dir               Directory with vcf files for all technologies.
  result_dir            Directory for subfolders for each region.
  hg38_reads            Path to hg38 PacBio read file.
  hg38_ref              Path to reference of hg38.
  t2t_ref               Path to reference of t2t.

optional arguments:
  -h, --help            show this help message and exit
  --hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX
                        Index file of hg38 reads (in case they are not in the same folder)
  --hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA
                        Illumina reads of hg38.
  --additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES
                        Path to vcf file with extra haplotypes (like HGSVC)
  --skip SKIP           Use this to skip the first N conflicts in the conflict file/string.
  --touch_only          Use this to just 'touch' the result directories so that they are in the correct order.
  --msa                 Perform MSA for each region. Increases the running time drastically.

Prerequisities

Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed:

Overview of the functionalities of the single utilities

conflict_positions.py

  • takes a conflict graph file and one vcf file as arguments
  • prints all graphs' starting and ending positions

ambiguity_scores.py

  • takes a conflict graph file and a chosen call method as arguments
  • for the chosen method the occurences in each graph are counted
  • the counts are put into a barplot that is saved to png file

amgiguity_distances.py

  • takes a conflict graph file and a chosen call method as arguments
  • for the chosen method the positions of each SV are saved
  • for each combination of SVs in a graph the break point distances are calculated
  • all those distances are then put into a barplot that is saved as a png file

multi_occurences.py

  • takes a conflict file as input
  • for each method we count in how many graphs they occur
  • furthermore, for each method we count the graphs where it occurs at least twice
  • this information is saved in a barplot (horizontal)

sv_to_gt.sh

  • takes a conflict file and a method
  • for each graph one line gets printed where all genotypes of the SVs of method in that graph are listed

genotype_stats.py

  • takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs
  • each combination is counted
  • after that each combination is printed with its count
  • if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1)

reference_comparison.py

  • takes two files of reads mapped to different references, a read name and the region in the first reference to look at
  • we get the sequence of both references where the read is mapped to and also the read sequence itself
  • all three sequences are saved into txt files and dotplot commands are suggested at the end of the script

msa_on_conflicts.py

  • takes a region file and a BAM file (and some more options)
  • for all reads that cover the region (+padding) make a MSA with clustalw
  • reinterpret the MSA as new alignments of the reads and put them into a new BAM file

An examplary workflow:

CONFLICTS="some_conflict_file.txt"
VCF_DIR="path/to/vcf/files/"
REGIONS="regions.txt"
POSITIONS="positions.txt"
GTS="genotypes.txt"
METHOD=svim

python multi_occurences.py $CONFLICTS > mult_occ.txt
grep $METHOD mult_occ.txt > $REGIONS

# check out genotype information in wanted regions
bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS"
python genotype_stats.py "$GTS" quality

python conflict_positions.py $REGIONS > $POSITIONS

python msa_on_conflicts.py $POSITIONS ...

To Document

  • sv_type_stat.py
  • trf_counter.sh
  • vcfs_comp.py
  • needleman_wunsch_omp.cpp
  • liftover.py
  • cmp_breakend.py
  • supp_align_query.sh

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published