
Data Format

CLARION: generiC fiLe formAt foR quantItative cOmparisons of high throughput screeNs

CLARION is a data format developed for use with WIlsON. A CLARION file is a tab-delimited table preceded by a header and a metadata block that describes each data column. The format is based on the SummarizedExperiment format and supports all types of data that can be reduced to features (e.g. genes, transcripts, proteins, probes) with assigned numerical values (e.g. count, score, log2foldchange, zscore, pvalue). Most result tables derived from RNA-Seq, ChIP/ATAC-Seq, proteomics, microarrays, and many other analyses can thus be reformatted to become compatible without modifying the WIlsON code for each specific experiment.



The format consists of three blocks of data with distinct structures:

  • Header: Parameters concerning the global experiment
  • Metadata: Parameters describing the content of each data column
  • Data: Matrix of data columns bearing textual and numerical information per feature
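As a minimal sketch of how the three blocks could be told apart, the snippet below sorts lines by their leading character, using the line identifiers '!' (header) and '#' (metadata) defined in the sections that follow. The function name `split_blocks` and the example content are hypothetical and not part of WIlsON.

```python
def split_blocks(text):
    """Sort the lines of a CLARION file into header, metadata and data."""
    header, metadata, data = [], [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # assumption: blank lines carry no information
        if line.startswith("!"):
            header.append(line)
        elif line.startswith("#"):
            metadata.append(line)
        else:
            data.append(line)
    return header, metadata, data


# Hypothetical example file content
example = "\n".join([
    "!format = Clarion",
    "!version = 1.0",
    "#key\tlevel\ttype",
    "#id\tfeature\tunique_id",
    "#s1\tsample\tscore",
    "id\ts1",
    "ENSMUSG00000023944\t12.5",
])
header, metadata, data = split_blocks(example)
```

Everything that is neither a header nor a metadata line is treated as part of the data matrix, including its column headline.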


## Header:

  • Line identifier '!'
  • Syntax: name = value
  • Mandatory parameters are marked with *; (*) marks parameters that are mandatory only under the stated condition

### Parameters:

  • format: Name of the file format (must be Clarion)
  • version: Version of the file format (1.0)
  • experiment_id: Unique id to be used for the experiment
  • delimiter(*): In-field delimiter for multi-value fields (e.g. multiple kegg pathways). Mandatory for multi-value fields.
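The header rules above can be sketched as a small parser. This is an illustration only: the function name `parse_header`, the example values, and the tolerance for whitespace around '=' are assumptions, not WIlsON's actual implementation.

```python
def parse_header(lines):
    """Collect 'name = value' pairs from lines starting with '!'."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            continue
        name, sep, value = line[1:].partition("=")
        if not sep:
            raise ValueError(f"header line without '=': {line!r}")
        # assumption: whitespace around name and value is not significant
        header[name.strip()] = value.strip()
    # the specification requires the format name to be Clarion
    if header.get("format") != "Clarion":
        raise ValueError("format must be Clarion")
    return header


# Hypothetical example header
example_header = [
    "!format = Clarion",
    "!version = 1.0",
    "!experiment_id = demo_experiment",
    "!delimiter = |",
]
header_params = parse_header(example_header)
```

Here `header_params["delimiter"]` holds the in-field delimiter that would later be used to split multi-value fields.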


## Metadata:

  • Line identifier '#'
  • Mandatory columns are marked with *; (*) marks columns that are mandatory only under the stated condition

### Columns:

  • key*:
    • Reference to data matrix (column headline)
    • Must be unique
  • factor1 - factorN:
    • Denotes experimental factors (e.g. wildtype, mutant, time point) per sample and condition
    • One or more columns (factor1, factor2, ..., factorN)
    • Used for grouping
  • level*:
    • Classifies content of column
    • Must be one of:
      • sample: Data relating to a single sample
      • condition: Data relating to a single condition (combination of all samples; e.g. average count)
      • contrast: Data relating to a single contrast (pairwise comparison of conditions)
      • feature: Annotation relating to a feature (e.g. gene, transcript, probe, protein, ...)
  • type(*):
    • Mandatory for multi-value fields
    • Further classifies the content of a level
    • Must be one of the following, depending on the level:
    • For level = feature (values to be filtered for):
      • unique_id: Unique identifier (e.g. ENSMUSG00000023944)
      • name: Main feature name / symbol / label (e.g. Hsp90ab1)
      • category: Single value per field; categorical data (e.g. protein_coding)
      • array: Multiple delimited values per field; categorical data (e.g. Cholinergic synapse|Choline metabolism in cancer)
    • For level = sample, condition, or contrast (values to be plotted):
      • score: count, intensity, ...
      • ratio: foldchange, log2foldchange, ...
      • probability: pvalue, padj, ...
      • array: Multiple numeric values per field; e.g. coverage/windows, ...
    • Attention: if the type is not given, the first feature column is expected to hold a unique identifier!
  • label:
    • Optional label used as an alternative to the column name
    • Can be used for plotting
    • Should be unique
    • For level = contrast, the two conditions are delimited by '|' (condition1|condition2)
  • sub_label:
    • Optional, more detailed label that offers a logical subselection of a column in the interface
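The metadata rules above (unique keys, a fixed set of levels, and level-dependent types) could be checked roughly as follows. This sketch assumes that the first '#' line names the metadata columns and that metadata fields are tab-delimited; the function name `parse_metadata` and the example rows are hypothetical.

```python
LEVELS = {"sample", "condition", "contrast", "feature"}
FEATURE_TYPES = {"unique_id", "name", "category", "array"}
VALUE_TYPES = {"score", "ratio", "probability", "array"}


def parse_metadata(lines):
    """Turn '#'-prefixed, tab-delimited lines into a list of dicts."""
    rows = [line[1:].rstrip("\n").split("\t")
            for line in lines if line.startswith("#")]
    if not rows:
        return []
    # assumption: the first metadata line holds the column names
    columns, *records = rows
    entries = [dict(zip(columns, record)) for record in records]

    seen = set()
    for entry in entries:
        key = entry.get("key", "")
        if not key or key in seen:
            raise ValueError(f"missing or duplicate key: {key!r}")
        seen.add(key)
        level = entry.get("level", "")
        if level not in LEVELS:
            raise ValueError(f"invalid level for {key!r}: {level!r}")
        # the allowed types depend on the level
        allowed = FEATURE_TYPES if level == "feature" else VALUE_TYPES
        column_type = entry.get("type", "")
        if column_type and column_type not in allowed:
            raise ValueError(f"invalid type for {key!r}: {column_type!r}")
    return entries


# Hypothetical example metadata block
example_metadata = [
    "#key\tfactor1\tlevel\ttype\tlabel",
    "#id\t\tfeature\tunique_id\tGene ID",
    "#wt_1\twildtype\tsample\tscore\tWT rep 1",
    "#wt_vs_mut\t\tcontrast\tratio\twildtype|mutant",
]
entries = parse_metadata(example_metadata)
```

An empty type is accepted here because the type column is only conditionally mandatory.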


## Data:

  • Traditional tab-delimited data matrix
  • Minimum: one column with a unique id; one column with a numerical value
  • If types are missing, the first column is treated as unique_id
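A rough sketch of reading the data matrix is shown below. It picks the unique id column from the metadata when a type is given, falls back to the first column otherwise, and splits multi-value fields with the header's delimiter. The function name `read_data`, the dict-based metadata representation, and the example values are assumptions made for illustration.

```python
import csv
import io


def read_data(data_text, metadata, delimiter=None):
    """Read the tab-delimited data matrix into a dict keyed by unique id."""
    reader = csv.DictReader(io.StringIO(data_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    columns = reader.fieldnames or []
    if not columns:
        raise ValueError("data matrix has no column headline")

    # use the column typed as unique_id, else fall back to the first column
    typed = [m["key"] for m in metadata
             if m.get("level") == "feature" and m.get("type") == "unique_id"]
    id_column = typed[0] if typed else columns[0]
    array_columns = {m["key"] for m in metadata if m.get("type") == "array"}

    rows = {}
    for row in reader:
        for column in array_columns:
            if delimiter and row.get(column) is not None:
                row[column] = row[column].split(delimiter)
        feature_id = row[id_column]
        if feature_id in rows:
            raise ValueError(f"duplicate unique id: {feature_id!r}")
        rows[feature_id] = row
    return id_column, rows


# Hypothetical metadata and data matrix
example_meta = [
    {"key": "id", "level": "feature", "type": "unique_id"},
    {"key": "pathway", "level": "feature", "type": "array"},
    {"key": "s1", "level": "sample", "type": "score"},
]
example_data = (
    "id\tpathway\ts1\n"
    "ENSMUSG00000023944\tCholinergic synapse|Choline metabolism in cancer\t12.5\n"
)
id_column, rows = read_data(example_data, example_meta, delimiter="|")
```

Numerical values are left as strings in this sketch; converting them is a separate step that depends on the column type.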