# SV conflict workflow

## Filetree
The project follows the usual layout of Snakemake workflows.

Root
- config
- resources
- results
- workflow

### Config
The [`config.yaml`](config/config.yaml) contains the global settings for running the
pipeline. The following settings are currently available:

```yaml
read_cut_padding: # int, padding left and right of the conflict region

# path to a conflict file
region_file: # str

# path to a vcf file with known SVs, e.g. from the HGSVC
alternative_haplotype_vcf: # str

# path to liftover directory (explained below)
liftover_root: # str, defaults to 'resources/references/liftover'

clustering:
  eps: 0.6 # float, the higher the bigger the clusters can be

# Settings for the pairwise alignment
pw_alignment:
  match: # positive integer
  gap: # negative integer
  mismatch: # negative integer
```
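
As a rough illustration of the value constraints noted in the comments above, the following Python sketch checks an already-loaded configuration dictionary. The function name `validate_config` is hypothetical and not part of the workflow; the checks only mirror the type and sign hints given in the YAML block.

```python
def validate_config(config):
    """Return a list of problems found in a loaded config dictionary.

    Hypothetical helper: the checks follow the comments in config.yaml,
    not the workflow's actual validation logic.
    """
    problems = []

    if not isinstance(config.get("read_cut_padding"), int):
        problems.append("read_cut_padding must be an int")

    for key in ("region_file", "alternative_haplotype_vcf"):
        if not isinstance(config.get(key), str):
            problems.append(f"{key} must be a str")

    # An empty YAML key loads as None, so fall back to an empty dict.
    clustering = config.get("clustering") or {}
    if not isinstance(clustering.get("eps"), (int, float)):
        problems.append("clustering.eps must be a float")

    alignment = config.get("pw_alignment") or {}
    match = alignment.get("match")
    if not (isinstance(match, int) and match > 0):
        problems.append("pw_alignment.match must be a positive integer")
    for key in ("gap", "mismatch"):
        value = alignment.get(key)
        if not (isinstance(value, int) and value < 0):
            problems.append(f"pw_alignment.{key} must be a negative integer")

    return problems
```

An empty list means that all documented settings have the expected type and sign. `liftover_root` is not checked because it has a default.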

**Region files** should be plain text files in which each line contains a single
region string (e.g. `chr12:456-876`).
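
A minimal Python sketch of how such region strings could be parsed is shown here. The names `Region`, `parse_region` and `parse_region_lines` are hypothetical, and the sketch assumes plain integer coordinates without thousands separators; the workflow itself may parse regions differently.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# Assumed format: <chromosome>:<start>-<end>, e.g. chr12:456-876
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Parse one region string into a Region tuple."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region string: {text!r}")
    region = Region(match["chrom"], int(match["start"]), int(match["end"]))
    if region.start > region.end:
        raise ValueError(f"start is greater than end: {text!r}")
    return region


def parse_region_lines(lines):
    """Parse the lines of a region file, skipping blank lines."""
    return [parse_region(line) for line in lines if line.strip()]
```

For a file on disk, `parse_region_lines(open(path))` would return one `Region` per non-empty line.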


### Resources
Put all resources that Snakemake is going to use into this directory.

#### References
In the `references` directory, put the FASTA file and its index for
each reference that you want to use. In addition, `references/liftover` should
contain a file called `liftover_config.json`, which defines the liftover
path from one reference to another.
For example, assuming hg38 is the main reference used in the workflow and
t2t v1.1 is another reference, we need:
```
references/
        hg38.fa
        hg38.fa.fai
        t2t.fa
        t2t.fa.fai
        liftover/
                hg38_to_t2t_v1.0.chain
                t2t_v1.0_to_t2t_v1.1.chain
                liftover_config.json
```

where `liftover_config.json` looks like:
```json
{
        "t2t":
        [
                "hg38_to_t2t_v1.0.chain",
                "t2t_v1.0_to_t2t_v1.1.chain"
        ]
}
```
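
The following Python sketch shows one way such a file could be read to obtain the ordered chain files for a target reference. The function name `load_liftover_chains` is hypothetical, and it assumes that the chain file names in the JSON are relative to the liftover directory, as in the tree above. Note that strict JSON parsers such as Python's `json` module do not accept a trailing comma after the last list element.

```python
import json
from pathlib import Path


def load_liftover_chains(liftover_root, target):
    """Return the chain files, in order, that lead to the target reference.

    Hypothetical helper: assumes liftover_config.json maps a reference name
    to a list of chain file names located in liftover_root.
    """
    root = Path(liftover_root)
    with open(root / "liftover_config.json") as handle:
        config = json.load(handle)
    if target not in config:
        raise KeyError(f"no liftover path defined for reference {target!r}")
    return [root / name for name in config[target]]
```

With the example above, `load_liftover_chains("resources/references/liftover", "t2t")` would yield the two chain files in the order in which they have to be applied.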

#### Samples
The `samples` subdirectory should contain one folder for each sample that
is going to be used in the workflow.
Each sample's folder in turn contains one folder per reference, which holds
the mapped reads (`.bam` files and their indexes) for the different technologies.
If the samples should not all be checked at the same regions, a region file
can be placed in each sample's folder. (TODO)
For example:
```
samples/
        Sample1/
                hg38/
                        PacBio.bam
                        PacBio.bam.bai
                        Illumina.bam
                        Illumina.bam.bai
                t2t/
                        PacBio.bam
                        PacBio.bam.bai
                        Illumina.bam
                        Illumina.bam.bai
                regions.txt
```
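
To make the layout concrete, this Python sketch collects all BAM files from a directory structured like the one above. The function name `find_bams` is hypothetical, and the sketch assumes that the technology name is the BAM file name without its extension; the workflow may discover its inputs differently.

```python
from pathlib import Path


def find_bams(samples_root):
    """Map (sample, reference, technology) to the corresponding BAM path.

    Hypothetical helper: assumes the layout
    <samples_root>/<sample>/<reference>/<technology>.bam
    """
    found = {}
    # "*.bam" does not match the ".bam.bai" index files.
    for bam in sorted(Path(samples_root).glob("*/*/*.bam")):
        sample = bam.parent.parent.name
        reference = bam.parent.name
        technology = bam.stem
        found[(sample, reference, technology)] = bam
    return found
```

For the example tree, the result would contain keys such as `("Sample1", "hg38", "PacBio")`.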

### Results
After the workflow has run, the `results` directory contains all results.
The structure looks like this:
```
results/
        Sample1/
                chr1_123_234/
                        results.pdf
                        ...
```

Each region's directory contains further files (distance matrices, screenshots,
FASTA files, etc.), but the most important one is `results.pdf`, which collects
all gathered information.
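
Judging from the example tree, a region such as `chr1:123-234` appears to map to a directory named `chr1_123_234`. Under that assumption, which is inferred from the example and not confirmed by the workflow code, the location of a report could be derived as in the following sketch. The function name `region_result_dir` is hypothetical.

```python
from pathlib import Path


def region_result_dir(results_root, sample, region):
    """Build the assumed result directory for a sample and a region string.

    Assumption: 'chr1:123-234' is stored under '<sample>/chr1_123_234'.
    """
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    return Path(results_root) / sample / f"{chrom}_{start}_{end}"


# Path to the summary report for one region of one sample.
report = region_result_dir("results", "Sample1", "chr1:123-234") / "results.pdf"
```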