SV conflict workflow

Filetree

The project is laid out like other Snakemake workflows.

Root

config
resources
results
workflow

Config

The config.yaml file holds the global settings for running the pipeline. The following settings are currently available:

read_cut_padding: # int, padding left and right of the conflict region

# path to a conflict file
region_file: # str

# path to a vcf file with known SVs, e.g. from the HGSVC
alternative_haplotype_vcf: # str

# path to liftover directory (explained below)
liftover_root: # str, defaults to 'resources/references/liftover'

clustering:
  eps: 0.6 # float, the higher the bigger the clusters can be

# Settings for the pairwise alignment
pw_alignment:
  match: # positive integer
  gap: # negative integer
  mismatch: # negative integer
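The type hints in the comments above can be checked before a run. The following is a minimal sketch, not part of the workflow itself: the function name validate_config is hypothetical, and it assumes the YAML file has already been loaded into a plain Python dict. It only checks what the comments state (integer padding, numeric eps, sign of the alignment scores).

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(config):
    """Return a list of human-readable problems; an empty list means OK.

    Hypothetical helper: `config` is assumed to be the dict obtained
    from loading config.yaml.
    """
    problems = []

    if not _is_int(config.get("read_cut_padding")):
        problems.append("read_cut_padding must be an integer")

    # `or {}` also covers a key that is present but left empty in the YAML.
    eps = (config.get("clustering") or {}).get("eps")
    if isinstance(eps, bool) or not isinstance(eps, (int, float)):
        problems.append("clustering.eps must be a number")

    # match must be positive, gap and mismatch must be negative.
    scores = config.get("pw_alignment") or {}
    expected_signs = {"match": 1, "gap": -1, "mismatch": -1}
    for key, sign in expected_signs.items():
        value = scores.get(key)
        if not _is_int(value) or value * sign <= 0:
            word = "positive" if sign > 0 else "negative"
            problems.append(f"pw_alignment.{key} must be a {word} integer")

    return problems
```

Returning a list instead of raising on the first error lets all problems in a config file be reported at once.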

Region files should be plain text files in which each line contains exactly one region string (e.g. chr12:456-876).
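As an illustration of the region format and of what read_cut_padding means, here is a small sketch that parses region lines and widens a region on both sides. The names Region, parse_region, read_region_lines and pad_region are hypothetical and not taken from the workflow; clamping the padded start at zero is an assumption.

```python
import re
from typing import NamedTuple

# A region string such as "chr12:456-876": contig, start and end.
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>\d+)-(?P<end>\d+)$")


class Region(NamedTuple):
    chrom: str
    start: int
    end: int


def parse_region(text):
    """Parse one region string into a Region; raise ValueError if malformed."""
    match = REGION_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a valid region string: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"region start is greater than end: {text!r}")
    return Region(match["chrom"], start, end)


def read_region_lines(lines):
    """Parse every non-empty line of a region file."""
    return [parse_region(line) for line in lines if line.strip()]


def pad_region(region, padding):
    """Extend a region by `padding` bases on both sides (start clamped at 0)."""
    return Region(region.chrom, max(0, region.start - padding), region.end + padding)
```

For example, pad_region(parse_region("chr12:456-876"), 100) yields the region chr12:356-976.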

Resources

This directory holds all input resources that Snakemake uses.

References

In the references directory, put the FASTA file and its index for each reference that you want to use. In addition, references/liftover must contain a file called liftover_config.json, which defines the liftover path from one reference to another. For example, assuming hg38 is the main reference used in the workflow and T2T v1.1 is another reference, we need:

references/
        hg38.fa
        hg38.fa.fai
        t2t.fa
        t2t.fa.fai
        liftover/
                hg38_to_t2t_v1.0.chain
                t2t_v1.0_to_t2t_v1.1.chain
                liftover_config.json

where liftover_config.json looks like:

{
        "t2t":
        [
                "hg38_to_t2t_v1.0.chain",
                "t2t_v1.0_to_t2t_v1.1.chain"
        ]
}
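To show how such a file might be consumed, the sketch below looks up the chain files for a target reference and returns their full paths. The function name chain_files_for is hypothetical, and it assumes the chain files are listed in the order in which they are applied. Note that json.load only accepts strictly valid JSON, so the file must not contain trailing commas.

```python
import json
from pathlib import Path


def chain_files_for(target, liftover_root="resources/references/liftover"):
    """Return the chain file paths for `target`, in the order listed.

    Hypothetical helper: `liftover_root` mirrors the liftover_root
    setting from config.yaml.
    """
    root = Path(liftover_root)
    with open(root / "liftover_config.json") as handle:
        liftover_config = json.load(handle)

    if target not in liftover_config:
        raise KeyError(f"no liftover path configured for reference {target!r}")

    # Chain file names are given relative to the liftover directory.
    return [root / name for name in liftover_config[target]]
```

With the example above, chain_files_for("t2t") would return the paths of the two chain files, first hg38 to T2T v1.0 and then T2T v1.0 to T2T v1.1.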

Samples

The samples subdirectory should contain one folder per sample used in the workflow. Each sample's folder in turn contains one folder per reference, which holds the mapped reads (.bam files plus their .bai indexes) for the different sequencing technologies. If not all samples should be checked at the same regions, a region file can be put into each sample's folder (TODO). For example:

samples/
        Sample1/
                hg38/
                        PacBio.bam
                        PacBio.bam.bai
                        Illumina.bam
                        Illumina.bam.bai
                t2t/
                        PacBio.bam
                        PacBio.bam.bai
                        Illumina.bam
                        Illumina.bam.bai
                regions.txt
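The layout above can be traversed with a few lines of pathlib. This is only a sketch under the assumption that the directory structure is exactly sample/reference/technology.bam; the function names discover_alignments and missing_indexes are hypothetical.

```python
from pathlib import Path


def discover_alignments(samples_root):
    """Map sample -> reference -> technology -> BAM path.

    Assumes the layout <samples_root>/<sample>/<reference>/<technology>.bam.
    """
    layout = {}
    # "*.bam" does not match "PacBio.bam.bai", so index files are skipped.
    for bam in sorted(Path(samples_root).glob("*/*/*.bam")):
        reference = bam.parent.name
        sample = bam.parent.parent.name
        layout.setdefault(sample, {}).setdefault(reference, {})[bam.stem] = bam
    return layout


def missing_indexes(layout):
    """Return all BAM files that have no .bai index next to them."""
    missing = []
    for references in layout.values():
        for technologies in references.values():
            for bam in technologies.values():
                if not bam.with_name(bam.name + ".bai").exists():
                    missing.append(bam)
    return missing
```

For the example tree, discover_alignments would report the technologies PacBio and Illumina for both hg38 and t2t under Sample1.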

Results

The results directory will include all results after the workflow has run. The structure will look like the following:

results/
        Sample1/
                chr1_123_234/
                        results.pdf
                        ...

Each region's directory contains further files (distance matrices, screenshots, FASTA files, etc.), but the most important one is results.pdf, which collects all gathered information.
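To locate the report for a given sample and region, the path can be derived from the region string. The sketch below assumes, based on the example tree, that the directory name is the region string with ":" and "-" replaced by underscores; the function names region_dir_name and report_path are hypothetical.

```python
from pathlib import Path


def region_dir_name(region):
    """Turn "chr1:123-234" into "chr1_123_234" (assumed naming scheme)."""
    chrom, colon, span = region.rpartition(":")
    start, dash, end = span.partition("-")
    if not (chrom and colon and start and dash and end):
        raise ValueError(f"not a valid region string: {region!r}")
    return f"{chrom}_{start}_{end}"


def report_path(sample, region, results_root="results"):
    """Return the expected location of results.pdf for one sample and region."""
    return Path(results_root) / sample / region_dir_name(region) / "results.pdf"
```

For example, report_path("Sample1", "chr1:123-234") points to results/Sample1/chr1_123_234/results.pdf.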

License

kriese/conflict-analysis-snakemake
