The project is structured like other Snakemake workflows:
Root
- config
- resources
- results
- workflow
The config.yaml file contains the global configuration for running the pipeline. Currently, the following settings are available:
read_cut_padding: # int, padding left and right of the conflict region
# path to a conflict file
region_file: # str
# path to a vcf file with known SVs, e.g. from the HGSVC
alternative_haplotype_vcf: # str
# path to liftover directory (explained below)
liftover_root: # str, defaults to 'resources/references/liftover'
clustering:
    eps: 0.6 # float, the higher the bigger the clusters can be
# Settings for the pairwise alignment
pw_alignment:
    match: # positive integer
    gap: # negative integer
    mismatch: # negative integer
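The type constraints noted in the comments above can be checked before the workflow starts. The following is a minimal sketch, not part of the workflow itself: `validate_config` is a hypothetical helper, and it assumes the configuration has already been loaded into a plain dictionary (for example by Snakemake).

```python
def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(config: dict) -> list:
    """Return a list of problems found; an empty list means no problem was detected.

    Hypothetical helper: it only checks the constraints stated in the
    comments of config.yaml, nothing more.
    """
    problems = []

    if not _is_int(config.get("read_cut_padding")):
        problems.append("read_cut_padding must be an integer")

    for key in ("region_file", "alternative_haplotype_vcf"):
        if not isinstance(config.get(key), str):
            problems.append(f"{key} must be a path given as a string")

    eps = (config.get("clustering") or {}).get("eps")
    if isinstance(eps, bool) or not isinstance(eps, (int, float)):
        problems.append("clustering.eps must be a number")

    alignment = config.get("pw_alignment") or {}
    match, gap, mismatch = (alignment.get(k) for k in ("match", "gap", "mismatch"))
    if not _is_int(match) or match <= 0:
        problems.append("pw_alignment.match must be a positive integer")
    if not _is_int(gap) or gap >= 0:
        problems.append("pw_alignment.gap must be a negative integer")
    if not _is_int(mismatch) or mismatch >= 0:
        problems.append("pw_alignment.mismatch must be a negative integer")

    return problems
```

Calling `validate_config(config)` at the top of the Snakefile and aborting when the list is non-empty would be one way to use it.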
Region files are plain-text files in which each line contains exactly one region string (e.g. chr12:456-876).
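To illustrate the region string format, here is a small sketch of a parser. The names `Region`, `parse_region` and `parse_region_lines` are made up for this example, and skipping blank lines is an assumption rather than documented behaviour of the workflow.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


# <chromosome>:<start>-<end>, e.g. chr12:456-876
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text: str) -> Region:
    """Parse one region string such as 'chr12:456-876'."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region string: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region start is larger than its end: {text!r}")
    return Region(match["chrom"], start, end)


def parse_region_lines(lines) -> list:
    """Parse an iterable of lines, ignoring blank ones (an assumption)."""
    return [parse_region(line) for line in lines if line.strip()]
```

A region file can then be read with `with open(path) as handle: regions = parse_region_lines(handle)`.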
The resources directory holds all the resources that Snakemake is going to use.
In the references directory, put the FASTA files and their indexes for each reference that you want to use. In addition, references/liftover should include a file called liftover_config.json, which defines the liftover path from one reference to another.
For example, assuming hg38 is the main reference used in the workflow and t2t v1.1 is another reference, we need:
references/
    hg38.fa
    hg38.fa.fai
    t2t.fa
    t2t.fa.fai
    liftover/
        hg38_to_t2t_v1.0.chain
        t2t_v1.0_to_t2t_v1.1.chain
        liftover_config.json
where liftover_config.json looks like:
{
    "t2t": [
        "hg38_to_t2t_v1.0.chain",
        "t2t_v1.0_to_t2t_v1.1.chain"
    ]
}
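As a sketch of how such a liftover path could be read, the snippet below loads liftover_config.json and turns the entry for one target reference into an ordered list of chain file paths. The function names are hypothetical, and the snippet assumes that the chain files are meant to be applied in the order in which they are listed.

```python
import json
from pathlib import Path


def load_liftover_config(liftover_root) -> dict:
    """Read liftover_config.json from the liftover directory."""
    config_path = Path(liftover_root) / "liftover_config.json"
    with open(config_path) as handle:
        return json.load(handle)


def resolve_chain_files(liftover_config: dict, target: str, liftover_root) -> list:
    """Return the chain files leading to `target`, in listed order.

    Assumption: each key names a target reference and its value lists
    the chain files to apply one after another.
    """
    if target not in liftover_config:
        raise KeyError(f"no liftover path defined for reference {target!r}")
    return [Path(liftover_root) / name for name in liftover_config[target]]
```

Note that the standard `json` module rejects trailing commas, so the last entry of each list must not be followed by one.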
The samples subdirectory should contain a folder for each sample that is going to be used in the workflow. Each sample's folder in turn contains a folder for each reference, where the mapped reads (.bam files) for the different technologies reside.
If not all samples should be checked at the same regions, a region file
can be put into each sample's folder. (TODO!!!)
E.g.:
samples/
    Sample1/
        hg38/
            PacBio.bam
            PacBio.bam.bai
            Illumina.bam
            Illumina.bam.bai
        t2t/
            PacBio.bam
            PacBio.bam.bai
            Illumina.bam
            Illumina.bam.bai
        regions.txt
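The layout above can be walked programmatically. The following sketch collects all BAM files below a samples directory, keyed by sample, reference and technology. `discover_bams` is a hypothetical helper, and it assumes that the technology name is simply the BAM file name without its extension.

```python
from pathlib import Path


def discover_bams(samples_root) -> dict:
    """Map (sample, reference, technology) to the corresponding BAM path.

    Expects the layout <samples_root>/<sample>/<reference>/<technology>.bam.
    Index files (.bam.bai) and per-sample region files are not matched
    by the glob pattern and are therefore ignored.
    """
    found = {}
    for bam in sorted(Path(samples_root).glob("*/*/*.bam")):
        sample, reference = bam.parts[-3], bam.parts[-2]
        found[(sample, reference, bam.stem)] = bam
    return found
```

For the example tree, `discover_bams("resources/samples")` would contain keys such as `("Sample1", "hg38", "PacBio")`.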
The results directory will include all results after the workflow has run. The structure will look like the following:
results/
    Sample1/
        chr1_123_234/
            results.pdf
            ...
Each region's directory will contain more files (distance matrices, screenshots, FASTA files, etc.), but the most important one is results.pdf, which collects all gathered information.
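To locate the report for a given region, the region string has to be mapped to its results directory. The sketch below does this under the assumption that the directory name is the region string with ":" and "-" replaced by underscores, as the chr1_123_234 example suggests. Both `region_result_dir` and this naming rule are inferred from that single example, not taken from the workflow code.

```python
from pathlib import Path


def region_result_dir(results_root, sample: str, region: str) -> Path:
    """Build <results_root>/<sample>/<chrom>_<start>_<end> from a region string."""
    chrom, _, span = region.partition(":")
    start, _, end = span.partition("-")
    if not (chrom and start.isdigit() and end.isdigit()):
        raise ValueError(f"not a valid region string: {region!r}")
    return Path(results_root) / sample / f"{chrom}_{start}_{end}"


def region_report(results_root, sample: str, region: str) -> Path:
    """Path of the results.pdf for one sample and region."""
    return region_result_dir(results_root, sample, region) / "results.pdf"
```

For instance, `region_report("results", "Sample1", "chr1:123-234")` points at results/Sample1/chr1_123_234/results.pdf.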