Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
sv-conflict-analysis/README.md
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
121 lines (97 sloc)
5.37 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
# SV conflict analysis | |
This repo contains various utilities and a [pipeline](analysis.py) to inspect(pipeline) | |
conflicts of structural calls. The tool iterates over all given genomic regions and | |
collects data on zygosity, repetitiveness, and different visualizations. | |
The main utilities used in the pipeline are: | |
1. Multiple Sequence Alignment | |
2. PCA on pairwise read distances | |
3. Comparison of references and sample haplotypes | |
The result of a run is a directory with subdiretories for each region. In each region's | |
directory, collected data on that region can be found. Aswell as PDF file, that | |
inhabits all visualizations together for a quick overview of a region. | |
### Usage of [analysis.py](analysis.py) | |
All paths passed to the program have to exist, except the `result_dir` argument, which | |
will be created if necessary. | |
``` | |
positional arguments: | |
conflicts Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41') | |
vcf_dir Directory with vcf files for all technologies. | |
result_dir Directory for subfolders for each region. | |
hg38_reads Path to hg38 PacBio read file. | |
hg38_ref Path to reference of hg38. | |
t2t_ref Path to reference of t2t. | |
optional arguments: | |
-h, --help show this help message and exit | |
--hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX | |
Index file of hg38 reads (in case they are not in the same folder) | |
--hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA | |
Illumina reads of hg38. | |
--additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES | |
Path to vcf file with extra haplotypes (like HGSVC) | |
--skip SKIP Use this to skip the first N conflicts in the conflict file/string. | |
--touch_only Use this to just 'touch' the result directories so that they are in the correct order. | |
--msa Perform MSA for each region. Increases the running time drastically. | |
``` | |
### Prerequisities | |
Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed: | |
* [dotter](https://sonnhammer.sbc.su.se/download/software/dotter/dotter.ubuntu) | |
* [clustalw2](http://www.clustal.org/download/current/) | |
* [igv](https://software.broadinstitute.org/software/igv/download) | |
### Overview of the functionalities of the single utilities | |
[conflict\_positions.py](conflict_positions.py) | |
- takes a conflict graph file and one vcf file as arguments | |
- prints all graphs' starting and ending positions | |
[ambiguity\_scores.py](ambiguity_scores.py) | |
- takes a conflict graph file and a chosen call method as arguments | |
- for the chosen method the occurences in each graph are counted | |
- the counts are put into a barplot that is saved to png file | |
[amgiguity\_distances.py](amgiguity_distances.py) | |
- takes a conflict graph file and a chosen call method as arguments | |
- for the chosen method the positions of each SV are saved | |
- for each combination of SVs in a graph the break point distances are calculated | |
- all those distances are then put into a barplot that is saved as a png file | |
[multi\_occurences.py](multi_occurences.py) | |
- takes a conflict file as input | |
- for each method we count in how many graphs they occur | |
- furthermore, for each method we count the graphs where it occurs at least twice | |
- this information is saved in a barplot (horizontal) | |
[sv\_to\_gt.sh](sv_to_gt.sh) | |
- takes a conflict file and a method | |
- for each graph one line gets printed where all genotypes of the SVs of *method* in that graph are listed | |
[genotype\_stats.py](genotype_stats.py) | |
- takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs | |
- each combination is counted | |
- after that each combination is printed with its count | |
- if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1) | |
[reference\_comparison.py](reference_comparison.py) | |
- takes two files of reads mapped to different references, a read name and the region in the first reference to look at | |
- we get the sequence of both references where the read is mapped to and also the read sequence itself | |
- all three sequences are saved into txt files and dotplot commands are suggested at the end of the script | |
[msa\_on\_conflicts.py](msa_on_conflicts.py) | |
- takes a region file and a BAM file (and some more options) | |
- for all reads that cover the region (+padding) make a MSA with clustalw | |
- reinterpret the MSA as new alignments of the reads and put them into a new BAM file | |
#### An examplary workflow: | |
```sh | |
CONFLICTS="some_conflict_file.txt" | |
VCF_DIR="path/to/vcf/files/" | |
REGIONS="regions.txt" | |
POSITIONS="positions.txt" | |
GTS="genotypes.txt" | |
METHOD=svim | |
python multi_occurences.py $CONFLICTS > mult_occ.txt | |
grep $METHOD mult_occ.txt > $REGIONS | |
# check out genotype information in wanted regions | |
bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS" | |
python genotype_stats.py "$GTS" quality | |
python conflict_positions.py $REGIONS > $POSITIONS | |
python msa_on_conflicts.py $POSITIONS ... | |
``` | |
#### To Document | |
* sv_type_stat.py | |
* trf_counter.sh | |
* vcfs_comp.py | |
* needleman_wunsch_omp.cpp | |
* liftover.py | |
* cmp_breakend.py | |
* supp_align_query.sh |