
# Data Format

CLARION: generiC fiLe formAt foR quantItative cOmparisons of high throughput screeNs

CLARION is a data format developed specifically for use with WIlsON. It consists of a tab-delimited table preceded by a metadata header that describes the table's columns. It is based on the Summarized Experiment format and supports any data that can be reduced to features (e.g. genes, transcripts, proteins, probes) with assigned numerical values (e.g. count, score, log2foldchange, zscore, pvalue). Most result tables derived from RNA-Seq, ChIP/ATAC-Seq, proteomics, microarray, and many other analyses can therefore be reformatted to become compatible, without modifying the WIlsON code for each specific experiment.

The format consists of three blocks of data with distinct structures:

Header: Parameters concerning the global experiment
Metadata: Parameters describing the content of each data column
Data: Matrix of data columns bearing textual and numerical information per feature
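
As a rough illustration of the three blocks, the following Python sketch sorts the lines of a file by their line identifier. It assumes that every header line starts with '!', every metadata line starts with '#', and all remaining non-empty lines belong to the data matrix. The function name `split_blocks` and the example content are hypothetical and not part of WIlsON.

```python
def split_blocks(text):
    """Sort lines into header ('!'), metadata ('#') and data blocks."""
    blocks = {"header": [], "metadata": [], "data": []}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("!"):
            blocks["header"].append(line[1:])
        elif line.startswith("#"):
            blocks["metadata"].append(line[1:])
        else:
            blocks["data"].append(line)
    return blocks


# Made-up example content, for illustration only.
example = "\n".join([
    "!format=Clarion",
    "!version=1.0",
    "#key\tlevel\ttype",
    "#gene_id\tfeature\tunique_id",
    "#wt_1\tsample\tscore",
    "gene_id\twt_1",
    "gene_1\t120",
])
blocks = split_blocks(example)
print(len(blocks["header"]), len(blocks["metadata"]), len(blocks["data"]))
```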

## Header:

Line identifier '!'
Syntax: name = value
Mandatory parameters are marked with *; (*) marks parameters that are mandatory only in certain cases

### Parameters:

format: Name of the file format (must be Clarion)
version: Version of the file format (1.0)
experiment_id: Unique id to be used for the experiment
delimiter(*): In-field delimiter for multi-value fields (e.g. multiple kegg pathways). Mandatory for multi-value fields.
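
The header syntax can be sketched in Python as follows. This is a minimal example, assuming that whitespace around the name and the value is insignificant and that the format name is compared exactly. `parse_header` and the experiment id `demo_experiment` are hypothetical.

```python
def parse_header(lines):
    """Collect 'name = value' pairs from lines starting with '!'."""
    header = {}
    for line in lines:
        if not line.startswith("!"):
            continue  # not a header line
        name, sep, value = line[1:].partition("=")
        if not sep:
            raise ValueError(f"header line without '=': {line!r}")
        header[name.strip()] = value.strip()
    return header


# Made-up header block, for illustration only.
example = [
    "!format = Clarion",
    "!version = 1.0",
    "!experiment_id = demo_experiment",
    "!delimiter = |",
]
header = parse_header(example)
if header.get("format") != "Clarion":
    raise ValueError("format must be Clarion")
print(header)
```

Keeping the result in a plain dictionary makes it easy to look up the in-field delimiter later, when multi-value fields are split.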

## Metadata:

Line identifier '#'
Mandatory columns are marked with *; (*) marks columns that are mandatory only in certain cases

### Columns:

key*:

Reference to data matrix (column headline)
Must be unique

factor1 - factorN:

Denotes experimental factors (e.g. wildtype, mutant, time point) per sample and condition

One or more columns (factor1, factor2, ..., factorN)

Used for grouping

level*:

Classifies content of column
Must be one of:

sample: Data relating to a single sample
condition: Data relating to a single condition (combination of all samples; e.g. average count)
contrast: Data relating to a single contrast (pairwise comparison of conditions)
feature: Annotation relating to a feature (e.g. gene, transcript, probe, protein, ...)

type(*):

Further classifies the content within its level
Mandatory for multi-value fields
Must be one of the following, depending on the level:

For level = feature (values to be filtered for):

unique_id: Unique identifier (e.g. ENSMUSG00000023944)
name: Main feature name / symbol / label (e.g. Hsp90ab1)
category: Single value per field; categorical data (e.g. protein_coding)
array: Multiple delimited values per field; categorical data (e.g. Cholinergic synapse|Choline metabolism in cancer)

For level = sample, condition, or contrast (values to be plotted):

score: count, intensity, ...
ratio: foldchange, log2foldchange, ...
probability: pvalue, padj, ...
array: Multiple numeric values per field; e.g. coverage/windows, ...

Attention: if the type is not given, the first feature column is expected to hold a unique identifier!

label:

Optional label alternative to column name
Can be used for plotting
Should be unique
For level = contrast, the two conditions are delimited by '|' (condition1|condition2)

sub_label:

Optional, more detailed label that offers a logical subselection of a column in the interface
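
The metadata rules above can be checked with a short Python sketch. It assumes that the metadata block is tab-delimited and that its first '#' line names the metadata columns; neither detail is stated explicitly here, so treat both as assumptions. `parse_metadata` and the example entries are hypothetical.

```python
LEVELS = {"sample", "condition", "contrast", "feature"}
FEATURE_TYPES = {"unique_id", "name", "category", "array"}
VALUE_TYPES = {"score", "ratio", "probability", "array"}


def parse_metadata(lines):
    """Parse '#' lines into one dict per data column and validate them."""
    rows = [line[1:].split("\t") for line in lines if line.startswith("#")]
    if len(rows) < 2:
        raise ValueError("expected a column-name line and at least one entry")
    columns, records = rows[0], rows[1:]  # assumption: first line names columns
    metadata = [dict(zip(columns, record)) for record in records]

    keys = [entry.get("key") for entry in metadata]
    if len(keys) != len(set(keys)):
        raise ValueError("metadata keys must be unique")

    for entry in metadata:
        level = entry.get("level")
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level!r}")
        allowed = FEATURE_TYPES if level == "feature" else VALUE_TYPES
        kind = entry.get("type", "")
        if kind and kind not in allowed:
            raise ValueError(f"type {kind!r} not valid for level {level!r}")
        if level == "contrast" and entry.get("label"):
            # Contrast labels name both conditions, separated by '|'.
            entry["conditions"] = entry["label"].split("|")
    return metadata


# Made-up metadata block, for illustration only.
example = [
    "#key\tfactor1\tlevel\ttype\tlabel\tsub_label",
    "#gene_id\t\tfeature\tunique_id\tGene ID\t",
    "#wt_1\twildtype\tsample\tscore\tWT replicate 1\tcount",
    "#wt_vs_mut\t\tcontrast\tratio\twildtype|mutant\tlog2foldchange",
]
metadata = parse_metadata(example)
print([entry["key"] for entry in metadata])
```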

## Data:

Traditional tab-delimited data matrix
Minimum: one column with a unique id; one column with a numerical value
If types are missing, the first column will be treated as unique_id
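
A minimal Python sketch of reading the data matrix is shown below. It assumes that the first data line holds the column headlines referenced by the metadata keys, that every row is complete, and that columns of level sample, condition, or contrast contain values parseable as numbers. `parse_data`, the metadata dictionary layout, and the example rows are hypothetical.

```python
import csv


def parse_data(data_lines, metadata, delimiter="|"):
    """Read the tab-delimited matrix into a dict keyed by unique id."""
    reader = csv.DictReader(data_lines, delimiter="\t")
    columns = reader.fieldnames
    if not columns or len(columns) < 2:
        raise ValueError("need a unique id column and one numerical column")

    # Use the column typed unique_id, else fall back to the first column.
    id_column = next(
        (key for key, meta in metadata.items() if meta.get("type") == "unique_id"),
        columns[0],
    )

    rows = {}
    for record in reader:
        parsed = {}
        for key, value in record.items():
            meta = metadata.get(key, {})
            numeric = meta.get("level") in {"sample", "condition", "contrast"}
            if meta.get("type") == "array":
                # Multi-value field: split on the in-field delimiter.
                parts = value.split(delimiter)
                parsed[key] = [float(p) for p in parts] if numeric else parts
            else:
                parsed[key] = float(value) if numeric else value
        unique_id = parsed[id_column]
        if unique_id in rows:
            raise ValueError(f"duplicate unique id: {unique_id}")
        rows[unique_id] = parsed
    return rows


# Made-up metadata and data, for illustration only.
metadata = {
    "gene_id": {"level": "feature", "type": "unique_id"},
    "pathways": {"level": "feature", "type": "array"},
    "wt_1": {"level": "sample", "type": "score"},
}
data_lines = [
    "gene_id\tpathways\twt_1",
    "gene_1\tpathway_a|pathway_b\t120",
    "gene_2\tpathway_c\t35.5",
]
rows = parse_data(data_lines, metadata)
print(rows["gene_1"])
```

The delimiter argument would normally come from the `delimiter` header parameter; it defaults to '|' here only to keep the example short.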
