This repository contains the synthetic and biological sequences that were used to evaluate the ssHMM motif finder, which is available at https://github.molgen.mpg.de/heller/ssHMM.
Directory structure:
- clip-seq: This directory contains all 25 CLIP-Seq datasets from the paper. Each dataset consists of a positive set (binding sites of the protein in question) and two negative sets. The first negative set (negative_shuffle) was produced by moving the positive binding sites to random locations in the same or a different gene with `bedtools shuffle`. The second negative set (negative_clip) consists of the positive binding sites of all 24 other CLIP-Seq datasets. The subdirectories `fasta/` and `shapes/` additionally contain training and test sets for the positive dataset and the negative_clip dataset. The training and test sets of the positive dataset were generated by randomly splitting the positive dataset in a 90% to 10% ratio. The training and test sets of the negative_clip dataset were produced by randomly selecting sequences from the negative_clip dataset whose lengths were similar to those of the positive sequences, and then splitting the selected sequences in a 90% to 10% ratio.
    - bed: .bed files with genomic coordinates of the positive set and the negative_shuffle set
    - fasta: .fasta files with genomic sequences of the positive and both negative sets
    - shapes: secondary structures as predicted by RNAshapes of the positive and both negative sets
    - structures: secondary structures as predicted by RNAstructure of the positive set and the negative_shuffle set
- synthetic: This directory contains all 24 synthetic datasets that were generated.
    - hairpin: Synthetic datasets with motifs implanted into a hairpin context.
        - A_IC1_SF1_random: Synthetic sequences with information content 1.0, hairpin fraction 100%, and random background sequences
            - fasta: .fasta files with synthetic sequences containing an implanted motif
            - pwms: images and numerical representations of implanted motifs
            - shapes: secondary structures as predicted by RNAshapes
            - structures: secondary structures as predicted by RNAstructure
        - B_IC1_SF05_random
        - C_IC1_SF01_random
        - D_IC05_SF1_random
        - E_IC05_SF05_random
        - F_IC05_SF01_random
        - G_IC1_SF1_utr
        - H_IC1_SF05_utr
        - I_IC1_SF01_utr
        - K_IC05_SF1_utr
        - L_IC05_SF05_utr
        - M_IC05_SF01_utr
    - stem: Synthetic datasets with motifs implanted into a stem context.
        - A_IC1_SF1_random: Synthetic sequences with information content 1.0, stem fraction 100%, and random background sequences
            - fasta: .fasta files with synthetic sequences containing an implanted motif
            - pwms: images and numerical representations of implanted motifs
            - shapes: secondary structures as predicted by RNAshapes
            - structures: secondary structures as predicted by RNAstructure
        - B_IC1_SF05_random
        - C_IC1_SF01_random
        - D_IC05_SF1_random
        - E_IC05_SF05_random
        - F_IC05_SF01_random
        - G_IC1_SF1_utr
        - H_IC1_SF05_utr
        - I_IC1_SF01_utr
        - K_IC05_SF1_utr
        - L_IC05_SF05_utr
        - M_IC05_SF01_utr
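
The training/test procedure described for the clip-seq datasets can be sketched roughly as follows. This is an illustrative sketch only, not the script that produced the files in this repository: the function names, the `tolerance` parameter, and the fixed seeds are hypothetical, and the exact criterion for "similar length" is not specified here.

```python
import random


def split_train_test(records, train_fraction=0.9, seed=0):
    """Randomly split records into a training and a test list (90%/10% by default)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def sample_length_matched(positives, candidates, tolerance=5, seed=0):
    """For each positive sequence, draw one unused candidate of similar length.

    Assumption: "similar" means the lengths differ by at most `tolerance`
    nucleotides. Positives without a matching candidate are skipped.
    """
    pool = list(candidates)
    random.Random(seed).shuffle(pool)
    selected = []
    for positive in positives:
        for index, candidate in enumerate(pool):
            if abs(len(candidate) - len(positive)) <= tolerance:
                selected.append(pool.pop(index))
                break
    return selected
```

In this sketch, the negative_clip sequences would first be passed through `sample_length_matched` and the result then through `split_train_test`, mirroring the order given in the description above.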
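
The synthetic dataset directory names appear to encode their parameters. A small, hypothetical helper for decoding them might look like the sketch below. Only the reading of `IC1` as information content 1.0 and `SF1` as a structure fraction of 100% comes from the descriptions above; interpreting `05` as 0.5 and `01` as 0.1 is an assumption, and the background label (`random` or `utr`) is passed through unchanged.

```python
import re
from typing import NamedTuple


class SyntheticDataset(NamedTuple):
    label: str                  # leading letter, e.g. "A"
    information_content: float  # from the IC field
    structure_fraction: float   # from the SF field (hairpin or stem fraction)
    background: str             # "random" or "utr", kept as-is


_NAME = re.compile(r"^([A-Z])_IC(\d+)_SF(\d+)_(random|utr)$")


def _decode(digits: str) -> float:
    # Assumption: a leading zero marks a decimal fraction,
    # so "05" -> 0.5 and "01" -> 0.1, while "1" -> 1.0.
    if digits.startswith("0"):
        return float("0." + digits[1:])
    return float(digits)


def parse_dataset_name(name: str) -> SyntheticDataset:
    """Decode a directory name such as 'E_IC05_SF05_random'."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized dataset name: {name!r}")
    label, ic, sf, background = match.groups()
    return SyntheticDataset(label, _decode(ic), _decode(sf), background)
```

Under these assumptions, `parse_dataset_name("A_IC1_SF1_random")` yields an information content of 1.0 and a structure fraction of 1.0, which matches the description of dataset A.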