pebert/refdata

refdata

Collection of scripts and converters assembled as pipeline to process annotation data

Portal links

Glossary

ENCODE promoters (v3)

  • based on H3K4me3 / DNaseI
  • human: 107 cell types
  • mouse: 14 cell types
  • proximal: (DNase peak less than 2kb away from annotated TSS) AND (peak in top 10000 peaks, ranking based on all nearby expressed transcripts)
  • distal: higher ranked according to the above schema, but not TSS-proximal; could be unannotated TSS or transcribed enhancers

ENCODE enhancers (v3)

  • based on H3K27ac / DNaseI
  • human: 47 cell types
  • mouse: 14 cell types
  • distal: (region more than 2kb away from any TSS) AND (region in top 20000 based on (i) anchoring with DNase peak and (ii) ranking by H3K27ac signal); notably, a region in this set can contain several DNase peaks since the boundaries are defined based on H3K27ac peaks
  • proximal: higher ranked according to the above schema, but too close to a TSS; could be promoter or enhancer with promoter-like activity
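
The two rule sets above can be read as a small decision function. The following is a minimal sketch, not the ENCODE implementation: the function name, the 1-based rank, and the handling of a distance of exactly 2 kb (which the descriptions leave open) are assumptions made for illustration.

```python
def classify_region(kind, dist_to_tss_bp, rank):
    """Label a ranked region as 'proximal' or 'distal' (hypothetical helper).

    kind           -- 'promoter' (H3K4me3/DNaseI) or 'enhancer' (H3K27ac/DNaseI)
    dist_to_tss_bp -- distance to the nearest annotated TSS in base pairs
    rank           -- 1-based rank of the region within its cell type
    Returns None if the region falls outside the top-ranked set.
    """
    if kind == "promoter":
        if rank > 10000:
            return None
        # Assumption: exactly 2000 bp counts as distal ("less than 2kb" is proximal).
        return "proximal" if dist_to_tss_bp < 2000 else "distal"
    if kind == "enhancer":
        if rank > 20000:
            return None
        # Assumption: exactly 2000 bp counts as proximal ("more than 2kb" is distal).
        return "distal" if dist_to_tss_bp > 2000 else "proximal"
    raise ValueError(f"unknown kind: {kind!r}")
```

For example, `classify_region("enhancer", 500, 42)` yields `"proximal"`, i.e. a highly ranked H3K27ac region that sits too close to a TSS.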

Download log

bigWig/wiggle coordinates

  1. BigWig files created from bedGraph format use "0-start, half-open" coordinates
  2. bigWigs that represent variableStep and fixedStep data are generated from wiggle files that use "1-start, fully-closed" coordinates
  3. source: https://genome.ucsc.edu/goldenpath/help/wiggle.html (2017-06-28)
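
To make the difference between the two conventions concrete, here is a minimal sketch of the conversion. The function names are made up for this example; `span` corresponds to the wiggle `span` setting (default 1).

```python
def wiggle_to_bedgraph(start_1based, span=1):
    """Convert a wiggle position (1-start, fully-closed) into a
    bedGraph-style interval (0-start, half-open)."""
    if start_1based < 1 or span < 1:
        raise ValueError("wiggle positions and spans start at 1")
    start0 = start_1based - 1   # shift the start to 0-based
    end0 = start0 + span        # the half-open end is exclusive
    return start0, end0


def bedgraph_to_closed(start0, end0):
    """Convert a 0-start, half-open interval into 1-start, fully-closed."""
    if start0 < 0 or end0 <= start0:
        raise ValueError("expected 0 <= start < end")
    return start0 + 1, end0     # only the start changes


print(wiggle_to_bedgraph(101, span=5))   # (100, 105)
print(bedgraph_to_closed(100, 105))      # (101, 105)
```

In both systems the interval length is preserved: `end0 - start0` in half-open coordinates equals `end - start + 1` in fully-closed coordinates.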

phyloP

  • "In the phyloP plots, sites predicted to be conserved are assigned positive scores, while sites predicted to be fast-evolving are assigned negative scores."
  • source: table description on http://genome.ucsc.edu

OrthoDB flat files

odb9_OGs.tab.gz - Ortho DB orthologous groups

  1. OG unique id (not stable between releases)
  2. level tax_id on which the cluster was built
  3. OG name (the group's most common gene name)

odb9_OG2genes.tab.gz - OGs to genes correspondence

  1. OG unique id
  2. Ortho DB unique gene id

odb9_genes.tab.gz - genes with some info

odb9_genes.tab

  1. Ortho DB unique gene id (not stable between releases)
  2. organism tax id
  3. protein sequence id, as downloaded together with the sequence
  4. Uniprot id, evaluated by mapping
  5. ENSEMBL gene name, evaluated by mapping
  6. NCBI gid, evaluated by mapping
  7. description, evaluated by mapping
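
The column layouts listed above can be used to read the flat files and to join groups to genes. This is a sketch under assumptions: the files are taken to be tab-separated without a header line, the column names are invented labels for the numbered fields, and all identifiers in the demo data are placeholders rather than real OrthoDB records.

```python
import csv
import io

# Assumed labels for the numbered columns described above.
OGS_COLUMNS = ("og_id", "level_tax_id", "og_name")
OG2GENES_COLUMNS = ("og_id", "gene_id")
GENES_COLUMNS = ("gene_id", "tax_id", "protein_id", "uniprot_id",
                 "ensembl_gene", "ncbi_gid", "description")


def read_table(handle, columns):
    """Read a headerless tab-separated file into a list of dicts."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [dict(zip(columns, row)) for row in reader if row]


def genes_in_groups(og2genes_rows, genes_rows):
    """Map each OG id to the gene records assigned to it."""
    genes_by_id = {gene["gene_id"]: gene for gene in genes_rows}
    groups = {}
    for link in og2genes_rows:
        gene = genes_by_id.get(link["gene_id"])
        if gene is not None:
            groups.setdefault(link["og_id"], []).append(gene)
    return groups


# Placeholder data, not real OrthoDB content.
og2genes = read_table(io.StringIO("OG_A\tgene_1\nOG_A\tgene_2\n"),
                      OG2GENES_COLUMNS)
genes = read_table(io.StringIO(
    "gene_1\t9606\tprot_1\tuni_1\tens_1\t111\tdemo gene one\n"
    "gene_2\t10090\tprot_2\tuni_2\tens_2\t222\tdemo gene two\n"),
    GENES_COLUMNS)
groups = genes_in_groups(og2genes, genes)
```

For the actual downloads, the handle could come from `gzip.open(path, "rt")` instead of `io.StringIO`.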

Gene age annotation

Downloaded from the Marcotte lab as indicated above. UniProt IDs were extracted and mapped to Ensembl identifiers using UniProt's ID mapping service: http://www.uniprot.org/uploadlists

Ensembl Protein ID mapping

Manually downloaded ID conversion tables from http://may2012.archive.ensembl.org/biomart (corresponds to Ensembl 67).

UCSC Chainfiles

Chains and Nets

According to the description in the GenomeWiki (last modified status: 16 April 2015, 19:10), reciprocal best chains/nets are generated as follows:

The reciprocal best process uses that output: the query-referenced (but target-centric / target single-cov) net is turned back into component chains, and then those are netted to get single coverage in the query too; the two outputs of that netting are reciprocal-best in query and target coords. Reciprocal-best nets are symmetrical again.

Additionally, the following naming convention is used by UCSC for pre-computed chain/net files:

In chain and net lingo, the target is the reference genome sequence and the query is some other genome sequence. For example, if you are viewing Human-Mouse alignments in the Human genome browser, human is the target and mouse is the query.

The file names reflect the assembly conversion data contained within in the format <db1>To<Db2>.over.chain.gz. For example, a file named hg15ToHg16.over.chain.gz file contains the liftOver data needed to convert hg15 (Human Build 33) coordinates to hg16 (Human Build 34).

a net is single-coverage for target but not for query, unless it has been filtered to be single-coverage on both target and query. By convention we add "rbest" to the net filename in that case.
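
The liftOver naming convention quoted above (as in hg15ToHg16.over.chain.gz) can be decoded with a small heuristic. This is a sketch only: the function name and the regular expression are assumptions, and the pattern relies on the target assembly being written with a capitalized first letter after "To".

```python
import re

# Heuristic pattern: source name, literal "To", capitalized target name.
_LIFTOVER_NAME = re.compile(
    r"^(?P<src>[A-Za-z0-9]+?)To(?P<dst>[A-Z][A-Za-z0-9]*)\.over\.chain\.gz$"
)


def parse_liftover_name(filename):
    """Return (source_assembly, target_assembly) or None if no match."""
    match = _LIFTOVER_NAME.match(filename)
    if match is None:
        return None
    dst = match.group("dst")
    # Assumption: assembly names start with a lowercase letter (hg16, mm10).
    return match.group("src"), dst[0].lower() + dst[1:]


print(parse_liftover_name("hg15ToHg16.over.chain.gz"))  # ('hg15', 'hg16')
```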

The generic process to generate reciprocal best/symmetric single-coverage files is also documented in the Genome Wiki (last modified status: 12 January 2016, 19:00). That documentation is incomplete, imprecise, or not error-free; see the following paragraph.

The error report on the non-symmetric output can be found in this Google groups thread and includes the following statement by the UCSC support:

Chaining and netting are not simple operations. They may not be symmetrical operations, there may be some slight difference in each direction. I tried taking these results and running them around in another cycle to get comparable bed files in the same coordinate system to see what might be missing, but this led to even more missing bases. Evidently the cycle itself does something to cause bases to go missing.

Notably, the tool chainNet has to be executed with non-default parameters. The defaults are:

  1. -minSpace=N - minimum gap size to fill, default 25

  2. -minScore=N - minimum chain score to consider, default 2000.0

In the script provided by the UCSC support, these values are set to:

  1. -minSpace=1

  2. -minScore=0

Setting these parameters to non-default values dramatically increases the number of missing bases, for unclear reasons.
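
To keep the two parameter sets side by side, the call can be assembled programmatically. This is a sketch, not the script provided by UCSC support: the helper name and all file names are placeholders, and the positional argument order (input chain, target sizes, query sizes, target net, query net) should be checked against the usage message of the installed chainNet.

```python
def chainnet_command(in_chain, target_sizes, query_sizes,
                     target_net, query_net,
                     min_space=25, min_score=2000.0):
    """Build the argument list for a chainNet call (defaults as listed above)."""
    return [
        "chainNet",
        f"-minSpace={min_space}",
        f"-minScore={min_score}",
        in_chain, target_sizes, query_sizes, target_net, query_net,
    ]


# Placeholder file names; values as in the script from UCSC support.
cmd = chainnet_command("in.chain", "target.sizes", "query.sizes",
                       "target.net", "query.net",
                       min_space=1, min_score=0)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the paths point to real files.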

Chain scoring

Apparently, there is no precise or formal description of how chain scores are derived (the truth may be in the code). In this UCSC Google Groups thread, Angie (UCSC) states that

In a nutshell, the chain scoring scheme is somewhat complicated, but ballpark estimates of scores can be made from expected size of aligned blocks, percent identity, gap size and gap frequency. [...] Gaps in chains are penalized with a piecewise linear function that penalizes gap openings the most, with less harsh penalties as gaps are larger (we expect large gaps in cross-species alignments due to insertions, rearrangements etc.) [...] Another approach to making sense of chain scores is to look at chain score histograms, e.g. for all chains or for chains of a particular length and gap size extracted using chainFilter.
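
The histogram approach mentioned in the quote can be sketched as follows. The code reads the score from the header line of each chain (the second field of lines starting with "chain") and bins scores by order of magnitude. The function names, the log10 binning, and the example header lines are choices made for this illustration.

```python
import math
from collections import Counter


def chain_scores(lines):
    """Yield the score from every chain header line."""
    for line in lines:
        if line.startswith("chain "):
            yield float(line.split()[1])


def log10_histogram(scores):
    """Count scores per order of magnitude (key 3 means 1000 <= score < 10000)."""
    counts = Counter()
    for score in scores:
        if score > 0:
            counts[math.floor(math.log10(score))] += 1
    return dict(sorted(counts.items()))


# Made-up header lines in chain format; alignment blocks omitted.
example = [
    "chain 4900 chrY 58368225 + 25985403 25985638 chr5 151006098 - 43257292 43257528 1",
    "chain 2500 chr1 1000000 + 100 400 chr2 2000000 + 500 800 2",
    "chain 150000 chr1 1000000 + 5000 9000 chr2 2000000 + 7000 11000 3",
]
histogram = log10_histogram(chain_scores(example))
print(histogram)  # {3: 2, 5: 1}
```

For a real file, `lines` could be an open handle, for example from `gzip.open(path, "rt")`.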

Besides (manual) score filtering using the above-mentioned histogram method, Rachel (UCSC) mentions in this UCSC Google Groups thread that

Regarding filtering, if there are a lot of low scoring chains e.g. those with score < 5000 then often we filter these out using the minScore option for axtChain. Chains can also be filtered after they are made using the chainFilter program. chainPreNet does remove chains that do not have a chance of being netted. Then chainNet makes the alignments nets from chains using the highest scoring chains in the top level. Gaps are filled in with other chains at level 2 and then gaps in the level 2 chains can be filled in with chains in level 3 etc. In the net, chains are trimmed to fit into these sections that are not covered by a higher-scoring chain. We also use netFilter with the minGap option set to 12 before loading the net into the database. This restricts the nets to those with a gap size >= 12 bp.

However, there does not seem to be a generally good strategy, as stated by Angie (UCSC) in this UCSC Google Groups thread:

Picking a score threshold for chains is a tricky business... scores vary hugely with length as well as conservation. This scoring scheme allows us to recognize long chains in syntenic regions, but it also retains almost anything from blastz. That's why we also have the "net" tracks -- to keep the best chains and ignore most of the "fluff".

UCSC / CRG alignability files

Alignability

These tracks provide a measure of how often the sequence found at the particular location will align within the whole genome. Unlike measures of uniqueness, alignability will tolerate up to 2 mismatches. These tracks are in the form of signals ranging from 0 to 1 and have several configuration options.

Alignability

The CRG Alignability tracks display how uniquely k-mer sequences align to a region of the genome. To generate the data, the GEM-mappability program has been employed. The method is equivalent to mapping sliding windows of k-mers (where k has been set to 36, 40, 50, 75 or 100 nts to produce these tracks) back to the genome using the GEM mapper aligner (up to 2 mismatches were allowed in this case). For each window, a mappability score was computed (S = 1/(number of matches found in the genome): S=1 means one match in the genome, S=0.5 is two matches in the genome, and so on). The CRG Alignability tracks were generated independently of the ENCODE project, in the framework of the GEM (GEnome Multitool) project.
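
The score definition S = 1/(number of matches found in the genome) can be illustrated with a toy computation. This is a deliberately simplified sketch and not the GEM-mappability algorithm: it counts exact matches only (the tracks allow up to 2 mismatches), looks at a single strand, and uses a short made-up sequence.

```python
from collections import Counter


def exact_mappability(sequence, k):
    """Return S = 1 / (number of exact occurrences) for every k-mer window."""
    if k < 1 or k > len(sequence):
        raise ValueError("k must be between 1 and the sequence length")
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    counts = Counter(windows)
    return [1.0 / counts[window] for window in windows]


scores = exact_mappability("ACGTACGA", 3)
print(scores)  # [0.5, 1.0, 1.0, 1.0, 0.5, 1.0]
```

The 3-mer ACG occurs twice in the toy sequence, so both of its windows receive S = 0.5, while every unique window receives S = 1.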
