Code for reconstruction of shattered chromosomes
Prerequisites
- Snakemake installation
- Python3 installation
- Python-packages: seaborn, argparse, numpy, strawC (for reading hic files)
- R-installation
- Download files from UCSC genome browser with chrom.sizes and cytoBand and store them in the misc folder (as hg19.chrom.sizes and hg19.cytoband.txt), or the following script can do it:
./download_misc_folder.sh
- All code assumes reference genome hg19
The scripts have been tested on a Linux system using snakemake (version 6.8.0), and the python-packages seaborn (version 0.11.2), argparse (version 1.1), numpy (version 1.20.3), strawC (version 0.0.9). Beside the required packages, the scripts themselves need no installation.
Example
- Derivative chromosomes are reconstructed from a curated set of novel adjacencies (assuming reference genome hg19)
- Simple examples can be found in the examples folder with a Snakefile for Snakemake and can be executed with:
./call_snakemake.sh
- The runtime for creating the derivative chromosomes tsv files for the examples should be only a few seconds on a standard desktop computer.
The novel adjacency file
Novel adjacencies are stored in a simple tab-separated file format:
chrom1 pos1 strand1 chrom2 pos2 strand2 ID
1 100200290 + 2 50200300 - ID_1
2 50000301 - 1 100400295 - ID_2
The coordinates are 1-based. The novel adjacency file is the input for the reconstruction and was derived for the manuscript from manually curated SV calls. The strand indicates here the orientation of the connected fragments. For example strand1 = "+" and strand2 = "+" indicates a connection 3' to 5'.
Step 1 - simplify
- This step matches novel adjacencies with compatible strandedness, by setting their anchor points which are less than 50 bp separated from each other, directly next to each other
- In this step the smaller coordinate is shifted toward the larger coordinate
- By this, smaller deviations between the coordinates between otherwise fitting novel adjacencies are removed
- If more than two novel adjacencies are comptible to each other, the ambiguity has the be resolved manually in the novel adjacency file
- The output is a file with novel adjacencies with adjusted coordinates. In case a coordinate was shifted, a note is appended to the ID.
Step 2 - linearize
- In this step, fragments are created from the simplified (adapted) novel adjacency file
- The fragments are connected according to the novel adjacenicies
- The resulting graph is traversed such that the derivative chromosomes are extracted (or scaffolds in case of incomplete graphs)
Output format
- The output format is related to a BED file (chrom start end name score strand scaffold)
- start is 0-based, end is the non-inclusive end position of the fragment
- name (column 4) and score (column 5) are not used
- column 7 is the scaffold, in case the first and the last fragment of a scoffold are telomeric ends, the scaffold refers to a derivative chromosome
1 0 100200300 . . + 0
2 50000300 50200300 . . - 0
1 100200300 100400300 . . - 0
1 100400300 249250621 . . + 0
2 0 50000300 . . + 1
2 50200300 243199373 . . + 1
Step 3A - plotting
- A plot of the reconstruction graph can be generated
Step 3B - 2D annotation files
- Creating 2D annotation files for Juicebox in bedpe-like format
- The 2D annotation files can be used to inspect the plausibility of reconstructions or for identifying IDs of scaffold, which should be merged further
- The script is producing annotation files for:
- Fagments
- Scaffolds
- Derivative chromosomes, which are merged from multiple scaffolds
python3 ../../code/create_annotation_file_from_fragments.py --fragmentsIn sample.linearized.tsv \
--bedpeFrag sample.fragments.bedpe \
--bedpeScaff sample.scaffolds.bedpe \
--bedpeDerChrom sample.derChrom.bedpe \
--fragmentsOut sample.derChrom.tsv \
--mergeLists 1,2 6,8,11
- In the code example above, the scaffolds with ID 1,2 are aggregated to the same derivative chromosome (same for scaffolds 6,8,11)
- The format of the file written to --fragmentsOut is as before in Step 2, but with an additional 8th column, containing the ID of the derChrom
- In case the optional argument --mergeLists is not used, the derChrom-ID equals the scaffold-ID (i.e. no scaffolds were merged)