GitHub

Code for reconstruction of shattered chromosomes

Prerequisites

Snakemake installation
Python3 installation
Python-packages: seaborn, argparse, numpy, strawC (for reading hic files)
R-installation
Download files from UCSC genome browser with chrom.sizes and cytoBand and store them in the misc folder (as hg19.chrom.sizes and hg19.cytoband.txt), or the following script can do it:

./download_misc_folder.sh

All code assumes reference genome hg19

The scripts have been tested on a Linux system using snakemake (version 6.8.0), and the python-packages seaborn (version 0.11.2), argparse (version 1.1), numpy (version 1.20.3), strawC (version 0.0.9). Beside the required packages, the scripts themselves need no installation.

Example

Derivative chromosomes are reconstructed from a curated set of novel adjacencies (assuming reference genome hg19)
Simple examples can be found in the examples folder with a Snakefile for Snakemake and can be executed with:

./call_snakemake.sh

The runtime for creating the derivative chromosomes tsv files for the examples should be only a few seconds on a standard desktop computer.

The novel adjacency file

Novel adjacencies are stored in a simple tab-separated file format:

chrom1	pos1	strand1	chrom2	pos2	strand2	ID
1	100200290	+	2	50200300	-	ID_1
2	50000301	-	1	100400295	-	ID_2

The coordinates are 1-based. The novel adjacency file is the input for the reconstruction and was derived for the manuscript from manually curated SV calls. The strand indicates here the orientation of the connected fragments. For example strand1 = "+" and strand2 = "+" indicates a connection 3' to 5'.

Step 1 - simplify

This step matches novel adjacencies with compatible strandedness, by setting their anchor points which are less than 50 bp separated from each other, directly next to each other
In this step the smaller coordinate is shifted toward the larger coordinate
By this, smaller deviations between the coordinates between otherwise fitting novel adjacencies are removed
If more than two novel adjacencies are comptible to each other, the ambiguity has the be resolved manually in the novel adjacency file
The output is a file with novel adjacencies with adjusted coordinates. In case a coordinate was shifted, a note is appended to the ID.

Step 2 - linearize

In this step, fragments are created from the simplified (adapted) novel adjacency file
The fragments are connected according to the novel adjacenicies
The resulting graph is traversed such that the derivative chromosomes are extracted (or scaffolds in case of incomplete graphs)

Output format

The output format is related to a BED file (chrom start end name score strand scaffold)
start is 0-based, end is the non-inclusive end position of the fragment
name (column 4) and score (column 5) are not used
column 7 is the scaffold, in case the first and the last fragment of a scoffold are telomeric ends, the scaffold refers to a derivative chromosome

1	0	100200300	.	.	+	0
2	50000300	50200300	.	.	-	0
1	100200300	100400300	.	.	-	0
1	100400300	249250621	.	.	+	0
2	0	50000300	.	.	+	1
2	50200300	243199373	.	.	+	1

Step 3A - plotting

A plot of the reconstruction graph can be generated

Step 3B - 2D annotation files

Creating 2D annotation files for Juicebox in bedpe-like format
The 2D annotation files can be used to inspect the plausibility of reconstructions or for identifying IDs of scaffold, which should be merged further
The script is producing annotation files for:

Fagments
Scaffolds
Derivative chromosomes, which are merged from multiple scaffolds

python3 ../../code/create_annotation_file_from_fragments.py --fragmentsIn sample.linearized.tsv \
                                                            --bedpeFrag sample.fragments.bedpe \
                                                            --bedpeScaff sample.scaffolds.bedpe \
                                                            --bedpeDerChrom sample.derChrom.bedpe \
                                                            --fragmentsOut   sample.derChrom.tsv \
                                                            --mergeLists 1,2 6,8,11

In the code example above, the scaffolds with ID 1,2 are aggregated to the same derivative chromosome (same for scaffolds 6,8,11)
The format of the file written to --fragmentsOut is as before in Step 2, but with an additional 8th column, containing the ID of the derChrom
In case the optional argument --mergeLists is not used, the derChrom-ID equals the scaffold-ID (i.e. no scaffolds were merged)

README.md

Code for reconstruction of shattered chromosomes

Prerequisites

Example

The novel adjacency file

Step 1 - simplify

Step 2 - linearize

Output format

Step 3A - plotting

Step 3B - 2D annotation files

About

Releases

Languages

License

schoepfl/chromosome_reconstruction

Launching GitHub Desktop

Launching GitHub Desktop

Launching Xcode

Launching Visual Studio Code

Latest commit

Git stats

Files

README.md

Code for reconstruction of shattered chromosomes

Prerequisites

Example

The novel adjacency file

Step 1 - simplify

Step 2 - linearize

Output format

Step 3A - plotting

Step 3B - 2D annotation files

About

Resources

License

Stars

Watchers

Forks

Releases

Languages