Skip to content

schoepfl/chromosome_reconstruction

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

Code for reconstruction of shattered chromosomes

Prerequisites

  • Snakemake installation
  • Python3 installation
  • Python-packages: seaborn, argparse, numpy, strawC (for reading hic files)
  • R-installation
  • Download files from UCSC genome browser with chrom.sizes and cytoBand and store them in the misc folder (as hg19.chrom.sizes and hg19.cytoband.txt), or the following script can do it:
./download_misc_folder.sh
  • All code assumes reference genome hg19

The scripts have been tested on a Linux system using snakemake (version 6.8.0), and the python-packages seaborn (version 0.11.2), argparse (version 1.1), numpy (version 1.20.3), strawC (version 0.0.9). Beside the required packages, the scripts themselves need no installation.

Example

  • Derivative chromosomes are reconstructed from a curated set of novel adjacencies (assuming reference genome hg19)
  • Simple examples can be found in the examples folder with a Snakefile for Snakemake and can be executed with:
./call_snakemake.sh
  • The runtime for creating the derivative chromosomes tsv files for the examples should be only a few seconds on a standard desktop computer.

The novel adjacency file

Novel adjacencies are stored in a simple tab-separated file format:

chrom1	pos1	strand1	chrom2	pos2	strand2	ID
1	100200290	+	2	50200300	-	ID_1
2	50000301	-	1	100400295	-	ID_2

The coordinates are 1-based. The novel adjacency file is the input for the reconstruction and was derived for the manuscript from manually curated SV calls. The strand indicates here the orientation of the connected fragments. For example strand1 = "+" and strand2 = "+" indicates a connection 3' to 5'.

Step 1 - simplify

  • This step matches novel adjacencies with compatible strandedness, by setting their anchor points which are less than 50 bp separated from each other, directly next to each other
  • In this step the smaller coordinate is shifted toward the larger coordinate
  • By this, smaller deviations between the coordinates between otherwise fitting novel adjacencies are removed
  • If more than two novel adjacencies are comptible to each other, the ambiguity has the be resolved manually in the novel adjacency file
  • The output is a file with novel adjacencies with adjusted coordinates. In case a coordinate was shifted, a note is appended to the ID.

Step 2 - linearize

  • In this step, fragments are created from the simplified (adapted) novel adjacency file
  • The fragments are connected according to the novel adjacenicies
  • The resulting graph is traversed such that the derivative chromosomes are extracted (or scaffolds in case of incomplete graphs)

Output format

  • The output format is related to a BED file (chrom start end name score strand scaffold)
  • start is 0-based, end is the non-inclusive end position of the fragment
  • name (column 4) and score (column 5) are not used
  • column 7 is the scaffold, in case the first and the last fragment of a scoffold are telomeric ends, the scaffold refers to a derivative chromosome
1	0	100200300	.	.	+	0
2	50000300	50200300	.	.	-	0
1	100200300	100400300	.	.	-	0
1	100400300	249250621	.	.	+	0
2	0	50000300	.	.	+	1
2	50200300	243199373	.	.	+	1

Step 3A - plotting

  • A plot of the reconstruction graph can be generated

Step 3B - 2D annotation files

  • Creating 2D annotation files for Juicebox in bedpe-like format
  • The 2D annotation files can be used to inspect the plausibility of reconstructions or for identifying IDs of scaffold, which should be merged further
  • The script is producing annotation files for:
  1. Fagments
  2. Scaffolds
  3. Derivative chromosomes, which are merged from multiple scaffolds
python3 ../../code/create_annotation_file_from_fragments.py --fragmentsIn sample.linearized.tsv \
                                                            --bedpeFrag sample.fragments.bedpe \
                                                            --bedpeScaff sample.scaffolds.bedpe \
                                                            --bedpeDerChrom sample.derChrom.bedpe \
                                                            --fragmentsOut   sample.derChrom.tsv \
                                                            --mergeLists 1,2 6,8,11
  • In the code example above, the scaffolds with ID 1,2 are aggregated to the same derivative chromosome (same for scaffolds 6,8,11)
  • The format of the file written to --fragmentsOut is as before in Step 2, but with an additional 8th column, containing the ID of the derChrom
  • In case the optional argument --mergeLists is not used, the derChrom-ID equals the scaffold-ID (i.e. no scaffolds were merged)

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published