This repo contains various utilities and a pipeline to inspect(pipeline) conflicts of structural calls. The tool iterates over all given genomic regions and collects data on zygosity, repetitiveness, and different visualizations. The main utilities used in the pipeline are:
- Multiple Sequence Alignment
- PCA on pairwise read distances
- Comparison of references and sample haplotypes
The result of a run is a directory with subdiretories for each region. In each region's directory, collected data on that region can be found. Aswell as PDF file, that inhabits all visualizations together for a quick overview of a region.
Usage of analysis.py
All paths passed to the program have to exist, except the result_dir
argument, which
will be created if necessary.
positional arguments:
conflicts Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41')
vcf_dir Directory with vcf files for all technologies.
result_dir Directory for subfolders for each region.
hg38_reads Path to hg38 PacBio read file.
hg38_ref Path to reference of hg38.
t2t_ref Path to reference of t2t.
optional arguments:
-h, --help show this help message and exit
--hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX
Index file of hg38 reads (in case they are not in the same folder)
--hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA
Illumina reads of hg38.
--additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES
Path to vcf file with extra haplotypes (like HGSVC)
--skip SKIP Use this to skip the first N conflicts in the conflict file/string.
--touch_only Use this to just 'touch' the result directories so that they are in the correct order.
--msa Perform MSA for each region. Increases the running time drastically.
Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed:
- takes a conflict graph file and one vcf file as arguments
- prints all graphs' starting and ending positions
- takes a conflict graph file and a chosen call method as arguments
- for the chosen method the occurences in each graph are counted
- the counts are put into a barplot that is saved to png file
- takes a conflict graph file and a chosen call method as arguments
- for the chosen method the positions of each SV are saved
- for each combination of SVs in a graph the break point distances are calculated
- all those distances are then put into a barplot that is saved as a png file
- takes a conflict file as input
- for each method we count in how many graphs they occur
- furthermore, for each method we count the graphs where it occurs at least twice
- this information is saved in a barplot (horizontal)
- takes a conflict file and a method
- for each graph one line gets printed where all genotypes of the SVs of method in that graph are listed
- takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs
- each combination is counted
- after that each combination is printed with its count
- if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1)
- takes two files of reads mapped to different references, a read name and the region in the first reference to look at
- we get the sequence of both references where the read is mapped to and also the read sequence itself
- all three sequences are saved into txt files and dotplot commands are suggested at the end of the script
- takes a region file and a BAM file (and some more options)
- for all reads that cover the region (+padding) make a MSA with clustalw
- reinterpret the MSA as new alignments of the reads and put them into a new BAM file
CONFLICTS="some_conflict_file.txt"
VCF_DIR="path/to/vcf/files/"
REGIONS="regions.txt"
POSITIONS="positions.txt"
GTS="genotypes.txt"
METHOD=svim
python multi_occurences.py $CONFLICTS > mult_occ.txt
grep $METHOD mult_occ.txt > $REGIONS
# check out genotype information in wanted regions
bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS"
python genotype_stats.py "$GTS" quality
python conflict_positions.py $REGIONS > $POSITIONS
python msa_on_conflicts.py $POSITIONS ...
- sv_type_stat.py
- trf_counter.sh
- vcfs_comp.py
- needleman_wunsch_omp.cpp
- liftover.py
- cmp_breakend.py
- supp_align_query.sh