README.md

# SV conflict analysis

This repo contains various utilities and a [pipeline](analysis.py) to inspect(pipeline)
conflicts of structural calls. The tool iterates over all given genomic regions and
collects data on zygosity, repetitiveness, and different visualizations.
The main utilities used in the pipeline are:
1. Multiple Sequence Alignment
2. PCA on pairwise read distances
3. Comparison of references and sample haplotypes

The result of a run is a directory with subdiretories for each region. In each region's
directory, collected data on that region can be found. Aswell as PDF file, that
inhabits all visualizations together for a quick overview of a region.

### Usage of [analysis.py](analysis.py)
All paths passed to the program have to exist, except the `result_dir` argument, which
will be created if necessary.
```
positional arguments:
  conflicts             Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41')
  vcf_dir               Directory with vcf files for all technologies.
  result_dir            Directory for subfolders for each region.
  hg38_reads            Path to hg38 PacBio read file.
  hg38_ref              Path to reference of hg38.
  t2t_ref               Path to reference of t2t.

optional arguments:
  -h, --help            show this help message and exit
  --hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX
                        Index file of hg38 reads (in case they are not in the same folder)
  --hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA
                        Illumina reads of hg38.
  --additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES
                        Path to vcf file with extra haplotypes (like HGSVC)
  --skip SKIP           Use this to skip the first N conflicts in the conflict file/string.
  --touch_only          Use this to just 'touch' the result directories so that they are in the correct order.
  --msa                 Perform MSA for each region. Increases the running time drastically.
```

### Prerequisities
Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed:
* [dotter](https://sonnhammer.sbc.su.se/download/software/dotter/dotter.ubuntu)
* [clustalw2](http://www.clustal.org/download/current/)
* [igv](https://software.broadinstitute.org/software/igv/download)

### Overview of the functionalities of the single utilities
[conflict\_positions.py](conflict_positions.py)
- takes a conflict graph file and one vcf file as arguments
- prints all graphs' starting and ending positions


[ambiguity\_scores.py](ambiguity_scores.py)
- takes a conflict graph file and a chosen call method as arguments
- for the chosen method the occurences in each graph are counted
- the counts are put into a barplot that is saved to png file


[amgiguity\_distances.py](amgiguity_distances.py)
- takes a conflict graph file and a chosen call method as arguments
- for the chosen method the positions of each SV are saved
- for each combination of SVs in a graph the break point distances are calculated
- all those distances are then put into a barplot that is saved as a png file


[multi\_occurences.py](multi_occurences.py)
- takes a conflict file as input
- for each method we count in how many graphs they occur
- furthermore, for each method we count the graphs where it occurs at least twice
- this information is saved in a barplot (horizontal)

[sv\_to\_gt.sh](sv_to_gt.sh)
- takes a conflict file and a method
- for each graph one line gets printed where all genotypes of the SVs of *method* in that graph are listed

[genotype\_stats.py](genotype_stats.py)
- takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs
- each combination is counted
- after that each combination is printed with its count
- if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1)


[reference\_comparison.py](reference_comparison.py)
- takes two files of reads mapped to different references, a read name and the region in the first reference to look at
- we get the sequence of both references where the read is mapped to and also the read sequence itself
- all three sequences are saved into txt files and dotplot commands are suggested at the end of the script

[msa\_on\_conflicts.py](msa_on_conflicts.py)
- takes a region file and a BAM file (and some more options)
- for all reads that cover the region (+padding) make a MSA with clustalw
- reinterpret the MSA as new alignments of the reads and put them into a new BAM file


#### An examplary workflow:
```sh
CONFLICTS="some_conflict_file.txt"
VCF_DIR="path/to/vcf/files/"
REGIONS="regions.txt"
POSITIONS="positions.txt"
GTS="genotypes.txt"
METHOD=svim

python multi_occurences.py $CONFLICTS > mult_occ.txt
grep $METHOD mult_occ.txt > $REGIONS

# check out genotype information in wanted regions
bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS"
python genotype_stats.py "$GTS" quality

python conflict_positions.py $REGIONS > $POSITIONS

python msa_on_conflicts.py $POSITIONS ...
```

#### To Document
* sv_type_stat.py
* trf_counter.sh
* vcfs_comp.py
* needleman_wunsch_omp.cpp
* liftover.py
* cmp_breakend.py
* supp_align_query.sh
	# SV conflict analysis

	This repo contains various utilities and a [pipeline](analysis.py) to inspect(pipeline)
	conflicts of structural calls. The tool iterates over all given genomic regions and
	collects data on zygosity, repetitiveness, and different visualizations.
	The main utilities used in the pipeline are:
	1. Multiple Sequence Alignment
	2. PCA on pairwise read distances
	3. Comparison of references and sample haplotypes

	The result of a run is a directory with subdiretories for each region. In each region's
	directory, collected data on that region can be found. Aswell as PDF file, that
	inhabits all visualizations together for a quick overview of a region.

	### Usage of [analysis.py](analysis.py)
	All paths passed to the program have to exist, except the `result_dir` argument, which
	will be created if necessary.
	```
	positional arguments:
	conflicts Conflict file (json), region file (csv:CHROM,START,END) or region string ('chr1:12-13;...;chr19:31-41')
	vcf_dir Directory with vcf files for all technologies.
	result_dir Directory for subfolders for each region.
	hg38_reads Path to hg38 PacBio read file.
	hg38_ref Path to reference of hg38.
	t2t_ref Path to reference of t2t.

	optional arguments:
	-h, --help show this help message and exit
	--hg38_reads_idx HG38_READS_IDX, -idx HG38_READS_IDX
	Index file of hg38 reads (in case they are not in the same folder)
	--hg38_reads_illumina HG38_READS_ILLUMINA, -ill HG38_READS_ILLUMINA
	Illumina reads of hg38.
	--additional_haplotypes ADDITIONAL_HAPLOTYPES, -ah ADDITIONAL_HAPLOTYPES
	Path to vcf file with extra haplotypes (like HGSVC)
	--skip SKIP Use this to skip the first N conflicts in the conflict file/string.
	--touch_only Use this to just 'touch' the result directories so that they are in the correct order.
	--msa Perform MSA for each region. Increases the running time drastically.
	```

	### Prerequisities
	Besides various python modules (biopython, emboss, numpy, scikit-learn, matplotlib, PyPDF4, pyliftover, pyfaidx, pysam, samplot), some command line tools need to be installed:
	* [dotter](https://sonnhammer.sbc.su.se/download/software/dotter/dotter.ubuntu)
	* [clustalw2](http://www.clustal.org/download/current/)
	* [igv](https://software.broadinstitute.org/software/igv/download)

	### Overview of the functionalities of the single utilities
	[conflict\_positions.py](conflict_positions.py)
	- takes a conflict graph file and one vcf file as arguments
	- prints all graphs' starting and ending positions


	[ambiguity\_scores.py](ambiguity_scores.py)
	- takes a conflict graph file and a chosen call method as arguments
	- for the chosen method the occurences in each graph are counted
	- the counts are put into a barplot that is saved to png file


	[amgiguity\_distances.py](amgiguity_distances.py)
	- takes a conflict graph file and a chosen call method as arguments
	- for the chosen method the positions of each SV are saved
	- for each combination of SVs in a graph the break point distances are calculated
	- all those distances are then put into a barplot that is saved as a png file


	[multi\_occurences.py](multi_occurences.py)
	- takes a conflict file as input
	- for each method we count in how many graphs they occur
	- furthermore, for each method we count the graphs where it occurs at least twice
	- this information is saved in a barplot (horizontal)

	[sv\_to\_gt.sh](sv_to_gt.sh)
	- takes a conflict file and a method
	- for each graph one line gets printed where all genotypes of the SVs of method in that graph are listed

	[genotype\_stats.py](genotype_stats.py)
	- takes a genotype file as input where each line represents the genotypes of a conflict graph's SVs
	- each combination is counted
	- after that each combination is printed with its count
	- if the genotype file contains quality scores for each SV, the distribution of scores is put into a boxplot with each boxplot representing a genotype (this only show the correlation of high quality SVs being called with GT 1/1)


	[reference\_comparison.py](reference_comparison.py)
	- takes two files of reads mapped to different references, a read name and the region in the first reference to look at
	- we get the sequence of both references where the read is mapped to and also the read sequence itself
	- all three sequences are saved into txt files and dotplot commands are suggested at the end of the script

	[msa\_on\_conflicts.py](msa_on_conflicts.py)
	- takes a region file and a BAM file (and some more options)
	- for all reads that cover the region (+padding) make a MSA with clustalw
	- reinterpret the MSA as new alignments of the reads and put them into a new BAM file


	#### An examplary workflow:
	```sh
	CONFLICTS="some_conflict_file.txt"
	VCF_DIR="path/to/vcf/files/"
	REGIONS="regions.txt"
	POSITIONS="positions.txt"
	GTS="genotypes.txt"
	METHOD=svim

	python multi_occurences.py $CONFLICTS > mult_occ.txt
	grep $METHOD mult_occ.txt > $REGIONS

	# check out genotype information in wanted regions
	bash sv_to_gt.sh $REGIONS $METHOD $VCF_DIR/${METHOD}.vcf.gz > "$GTS"
	python genotype_stats.py "$GTS" quality

	python conflict_positions.py $REGIONS > $POSITIONS

	python msa_on_conflicts.py $POSITIONS ...
	```

	#### To Document
	* sv_type_stat.py
	* trf_counter.sh
	* vcfs_comp.py
	* needleman_wunsch_omp.cpp
	* liftover.py
	* cmp_breakend.py
	* supp_align_query.sh