HELP_UPDATE

UPDATE DATABASES

Help file for updating DBs (MTI DBs, miRBase, UniProtKB) used for LimiTT.

Order of updating:
* MTI DBs
* miRBase
* run the "do_after_update.py" script
* UniProtKB

-----------
MTI DBs
-----------

All current DB files were downloaded, converted to tab-delimited text files and saved in the "files" folder.
During a LimiTT run, the script "db_call.py" reads these files and filters the matching information out of them.
When updating one of the DBs, adjust the file name within the "db_call.py" script. If the columns have changed,
either change the column numbers within the script or rearrange the new file to match the layout of the old file.
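
The dependency on column numbers can be illustrated with a minimal sketch. It is not taken from "db_call.py"; the
layout dictionary, the function name and the sample row are hypothetical and only show why a changed column order
requires either new indices or a rearranged file.

```python
import csv
import io

# Hypothetical layout: logical field name -> column index (counting from 0).
# If a new DB release moves a column, only these numbers need to change.
EXAMPLE_LAYOUT = {"mirna": 0, "gene": 2, "method": 3}


def select_columns(handle, layout):
    """Yield one dict per row that holds only the configured columns."""
    reader = csv.reader(handle, delimiter="\t")
    for row in reader:
        # Skip rows that are too short to contain every configured column.
        if len(row) <= max(layout.values()):
            continue
        yield {name: row[index] for name, index in layout.items()}


if __name__ == "__main__":
    # Made-up placeholder row, not real DB content.
    sample = io.StringIO("miR-example\tID0001\tGENE1\tqPCR\n")
    for record in select_columns(sample, EXAMPLE_LAYOUT):
        print(record)
```

Keeping the indices in one place means an update with moved columns is a one-line change instead of a search through
the filtering code.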

!!!!!!!!!!!!!!!!!
 AFTER updating one or more of the DBs (or miRBase), please run the script "do_after_update.py".
!!!!!!!!!!!!!!!!!

-- TarBase

    miRNA-mimat, ensgid, gene-name, reporter_gene, nothern_blot, western_blot, qPCR, proteomics, microarray, sequencing, degradome_seq, other

    Counting from 0, all columns except column 1 (ensgid) are used.

-- miRTarBase

   miRTarBase ID, miRNA, Species (miRNA), Target Gene, Target Gene (Entrez Gene ID), Species (Target Gene), Experiments, Support Type, References (PMID)

   Counting from 0, columns 1 (miRNA), 6 (Experiments) and 8 (References (PMID)) are used.
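
Before swapping in a new release, it can help to check whether the columns used are still at the expected positions.
The following is a minimal sketch, assuming the downloaded file still carries a tab-delimited header row with the
column names listed above; the function name is hypothetical and not part of the LimiTT scripts.

```python
# Column names and positions as listed in this help file for miRTarBase.
EXPECTED = {"miRNA": 1, "Experiments": 6, "References (PMID)": 8}


def moved_columns(header_line, expected):
    """Return {name: (expected_index, actual_index_or_None)} for mismatches."""
    header = [cell.strip() for cell in header_line.rstrip("\n").split("\t")]
    problems = {}
    for name, index in expected.items():
        # None marks a column name that is missing from the new header.
        actual = header.index(name) if name in header else None
        if actual != index:
            problems[name] = (index, actual)
    return problems
```

An empty result means the existing column numbers can stay as they are; any other result lists the columns whose
numbers need to be adjusted.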

-- miRecords

   Pubmed_id, Target gene_species_scientific, Target gene_name, Target gene_Refseq_acc, Target site_number, miRNA_species, miRNA_mature_ID, miRNA_regulation, Reporter_target gene/region, Reporter link element, Test_method_inter, Target gene mRNA_level, Original description, Mutation_target region, Post mutation_method, Original description_mutation_region, Target site_position, miRNA_regulation_site, Reporter_target site, Reporter link element, Test_method_inter_site, Original description_inter_site, Mutation_target site, Post mutation_method_site, Original description_mutation_site, Additional note

   Counting from 0, columns 0 (Pubmed_id), 2 (Target gene_name), 5 (miRNA_species) and 6 (miRNA_mature_ID) are used.

-- starBase

   The content of starBase can only be downloaded as separate files, one per organism and stringency (number of CLIP-Seq experiments supporting the MTI), resulting in 9 files with the following columns:
  
   name, geneName, targetScanSites, picTarSites, RNA22Sites, PITASites, miRandaSites, CancerNum

   The important columns are 0 (name) and 1 (geneName).

   Steps for updating starBase:

   * Download each file and append the stringency information as "_stringX" to the file name (e.g. starBase_Human_xycx_string1.xls).
   * Save all files in one folder that contains no other .xls files.
   * Start the script "starBase_toFile.py".
      This script concatenates all files into one text file and saves each entry once, with its highest stringency.
      The output columns are: nameMiRNA, NameGene, Stringency
      LimiTT automatically adds the experimental method, which is solely CLIP-Seq for starBase.
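
The merge step described above can be sketched as follows. This is not the code of "starBase_toFile.py"; it assumes
that the .xls downloads are tab-delimited text with one header row, and the function names are hypothetical.

```python
import csv
import glob
import os
import re

# Matches the "_stringX" suffix that was appended to each file name.
STRINGENCY_PATTERN = re.compile(r"_string(\d+)\.xls$")


def merge_starbase(folder):
    """Return {(mirna, gene): highest stringency} over all .xls files."""
    best = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.xls"))):
        match = STRINGENCY_PATTERN.search(os.path.basename(path))
        if match is None:
            raise ValueError("no _stringX suffix in file name: " + path)
        stringency = int(match.group(1))
        with open(path, newline="") as handle:
            reader = csv.reader(handle, delimiter="\t")
            next(reader, None)  # assumption: the first line is the header
            for row in reader:
                if len(row) < 2:
                    continue
                key = (row[0], row[1])  # column 0 = name, column 1 = geneName
                best[key] = max(stringency, best.get(key, 0))
    return best


def write_merged(best, out_path):
    """Write one line per entry: nameMiRNA, NameGene, Stringency."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for (mirna, gene), stringency in sorted(best.items()):
            writer.writerow([mirna, gene, stringency])
```

The sketch raises an error for any .xls file without a "_stringX" suffix, which is why the folder should contain no
other .xls files.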

--------
miRBase
--------

The content of miRBase is needed to convert the miRBase accession numbers used by TarBase into mature miRNA identifiers.
For that, the mature.fa file from http://www.mirbase.org/ftp.shtml was downloaded and converted to a Python dictionary (hash) with MIMAT accessions as keys and the corresponding miRNA identifiers as values.

* Download the mature.fa file
* Start the "miRBase_to_dict.py" script
* Run the script "do_after_update.py"
* Replace the old mimat_miRNA.dict file in the "files" folder with the new one
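
The conversion of mature.fa into a dictionary can be sketched as shown below. This is not the code of
"miRBase_to_dict.py"; it assumes FASTA header lines of the form ">identifier MIMAT-accession description", and it does
not cover how the dictionary is stored in the mimat_miRNA.dict file. The function name is hypothetical.

```python
def mimat_to_mirna(fasta_lines):
    """Map MIMAT accession -> mature miRNA identifier from FASTA headers."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not needed for the mapping
        fields = line[1:].split()
        if len(fields) < 2:
            continue  # header without an accession field
        mirna_id, accession = fields[0], fields[1]
        mapping[accession] = mirna_id
    return mapping


if __name__ == "__main__":
    example = [
        ">cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p\n",
        "UGAGGUAGUAGGUUGUAUAGUU\n",
    ]
    print(mimat_to_mirna(example))
```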

--------
UniProt
--------

The whole content of UniProtKB (Swiss-Prot and TrEMBL) was downloaded and converted to a shortened list per entry. Columns:

	Review status, Accession(s), Organism, Gene names, Protein names, EC number, Existence, GO-IDs, RefSeq, UniGene, Ensembl, GeneID, KEGG

Subsequently, this shortened file was used to create a database-like structure of Python dictionaries that enables the
mapping of
	- all possible gene symbols to UniProtKB entries and thus to UniProt accessions (UniProtAcc)
	- UniProtAcc to UniProtKB entries.

This "database-like structure" consists of three dictionaries (or dictionary-like structures) with the following key -> value mappings:
	- Gene symbol -> ID(s) of UniProtKB entry
	- UniProtAcc -> ID of UniProtKB entry
	- ID of UniProtKB entry -> shortened UniProtKB entry

Only entries whose gene symbols/synonyms or cross-references can be linked to target symbols of the MTI DBs were
used. The files (entry.shelve, id.dict, uni.dict) are saved in the "files" folder.
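
How the three mappings work together can be sketched as follows. The function names, the integer-based entry IDs and
the placeholder data are hypothetical; which of the three files holds which mapping is not shown here. The standard
"shelve" module is used only to illustrate a dictionary-like structure stored on disk.

```python
import os
import shelve
import tempfile


def build_indexes(entries):
    """entries: iterable of (accessions, gene_symbols, shortened_entry)."""
    symbol_to_ids, acc_to_id, id_to_entry = {}, {}, {}
    for number, (accessions, symbols, short) in enumerate(entries):
        entry_id = str(number)  # shelve keys must be strings
        id_to_entry[entry_id] = short
        for acc in accessions:
            acc_to_id[acc] = entry_id
        for symbol in symbols:
            # One gene symbol may point to several UniProtKB entries.
            symbol_to_ids.setdefault(symbol, []).append(entry_id)
    return symbol_to_ids, acc_to_id, id_to_entry


def entries_for_symbol(symbol, symbol_to_ids, entry_store):
    """Resolve a gene symbol to its shortened entries via the entry IDs."""
    return [entry_store[i] for i in symbol_to_ids.get(symbol, [])]


if __name__ == "__main__":
    # Made-up placeholder entries, not real UniProtKB content.
    demo = [
        (["ACC0001"], ["GENE1", "G1"], "reviewed\tACC0001\tGENE1 G1"),
        (["ACC0002"], ["GENE1"], "unreviewed\tACC0002\tGENE1"),
    ]
    symbol_to_ids, acc_to_id, id_to_entry = build_indexes(demo)
    with tempfile.TemporaryDirectory() as folder:
        with shelve.open(os.path.join(folder, "entries")) as store:
            store.update(id_to_entry)
            print(entries_for_symbol("GENE1", symbol_to_ids, store))
```

Storing the shortened entry only once and pointing to it by ID keeps the two lookup dictionaries small, even when many
symbols and accessions refer to the same entry.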

Steps for updating:

* If MTI DBs or miRBase were updated beforehand, make sure that "do_after_update.py" has finished.
* Download the zipped Swiss-Prot and TrEMBL .dat files from UniProt via FTP (.dat.gz files)
* Start the "unip_to_tab.py" script to create the shortened table
* Start the "uniProt_to_dict.py" script to create the database-like structured dictionaries
* Replace the old dictionary files in the "files" folder with the new ones
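
For orientation, reading accessions and gene names from a zipped .dat file can be sketched as below. This is not the
code of "unip_to_tab.py" and covers only two of the columns listed above. It assumes the UniProtKB flat-file layout
with two-letter line codes ("AC", "GN") and "//" as the entry terminator; the function names are hypothetical.

```python
import gzip


def iter_entries(lines):
    """Yield (accessions, gene_names) for each flat-file entry."""
    accessions, genes = [], []
    for line in lines:
        code, _, rest = line.rstrip("\n").partition("   ")
        if code == "AC":
            accessions.extend(a.strip() for a in rest.split(";") if a.strip())
        elif code == "GN":
            for part in rest.split(";"):
                part = part.strip()
                if part.startswith("Name="):
                    # Drop an optional evidence tag such as " {ECO:...}".
                    genes.append(part[len("Name="):].split(" {")[0])
        elif line.startswith("//"):
            # End of entry: hand over the collected values and start anew.
            yield accessions, genes
            accessions, genes = [], []


def read_dat_gz(path):
    """Stream entries from a .dat.gz file without unpacking it first."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_entries(handle)
```

Streaming the file entry by entry avoids loading the whole download into memory, which matters for the TrEMBL file in
particular.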