De-TE-ctor

Tool to detect potential transposable elements in a fasta file. Some classes of transposable elements (TEs) encode proteins. Without prior knowledge, these often look exactly like protein-coding genes. Some genome annotations include these TEs while others filter them out. This is problematic for comparative genomics, because it creates differences that stem from a technical bias rather than a valid biological trait (e.g. a simple gene count is not comparable between genomes annotated with and without TEs).

To solve this issue, these TEs should be detected and removed from all genomes before doing a comparative analysis (e.g. counting genes, constructing gene families, ...). De-TE-ctor is a small pipeline that quickly checks for putative TEs in the set of protein-coding genes of a genome and reports them so they can be removed.

Usage

Installation

Clone the package from the git repository

git clone https://github.molgen.mpg.de/proost/De-TE-ctor detector
cd detector

Create a virtual environment

virtualenv --python=python3 env
source env/bin/activate
pip install -r requirements.txt

Install the detector module

pip install --editable .

Commands

First, create a blast database from a set of known transposable element proteins. Collect sequences of such elements in fasta format and store them in one directory. Note that all files must have the extension .fasta. Alternatively, the files in ./TE_DB/ can be used. For more details on how these were prepared, read the file here.

Create the blast DB using the command below

detector build ./TE_DB/ known_te_db

This command will pick up all fasta-files in the TE_DB folder, concatenate them and build a blast database named known_te_db.
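
The build step described above can be pictured roughly as follows. This is a minimal sketch, not the code in detector.py: the function names `concatenate_fasta` and `makeblastdb_command` are hypothetical, and it assumes the database is built with `makeblastdb` from NCBI BLAST+ as a protein database.

```python
import subprocess
from pathlib import Path


def concatenate_fasta(fasta_dir, combined_path):
    """Concatenate every *.fasta file in fasta_dir into combined_path.

    Hypothetical helper; returns the list of files that were merged.
    """
    fasta_files = sorted(Path(fasta_dir).glob("*.fasta"))
    with open(combined_path, "w") as out:
        for fasta_file in fasta_files:
            text = fasta_file.read_text()
            out.write(text)
            # A file without a trailing newline would otherwise glue its
            # last sequence line to the next file's header.
            if text and not text.endswith("\n"):
                out.write("\n")
    return fasta_files


def makeblastdb_command(combined_path, db_name):
    """Build the argument list for makeblastdb (assumed: protein database)."""
    return ["makeblastdb", "-in", str(combined_path),
            "-dbtype", "prot", "-out", str(db_name)]


if __name__ == "__main__":
    # Assumes makeblastdb is installed and available on the PATH.
    concatenate_fasta("./TE_DB/", "known_te_db.fasta")
    subprocess.run(makeblastdb_command("known_te_db.fasta", "known_te_db"),
                   check=True)
```

Only files ending in .fasta are matched, which is why the extension matters when adding your own sequences.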

Next, a fasta-file (with protein sequences) should be blasted against the newly created database with known transposable element proteins.

detector blast species_peptides.fasta known_te_db species_blast_output

This command will blast a protein fasta-file species_peptides.fasta against known_te_db (created in the previous step). The output will be stored in species_blast_output.
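
For readers who want to see what such a search looks like when run directly, here is a hedged sketch of an equivalent `blastp` call. The helper name `blastp_command`, the e-value cutoff, the thread count and the tabular output format are assumptions made for illustration; the options detector actually passes may differ.

```python
import subprocess


def blastp_command(query_fasta, db_name, output_path, evalue=1e-5, threads=1):
    """Build a blastp argument list (hypothetical helper).

    evalue and threads are illustrative defaults, not detector's settings.
    """
    return [
        "blastp",
        "-query", str(query_fasta),
        "-db", str(db_name),
        "-out", str(output_path),
        "-outfmt", "6",              # tabular output, one hit per line
        "-evalue", str(evalue),
        "-num_threads", str(threads),
    ]


if __name__ == "__main__":
    # Assumes blastp (NCBI BLAST+) is installed and available on the PATH.
    cmd = blastp_command("species_peptides.fasta", "known_te_db",
                         "species_blast_output")
    subprocess.run(cmd, check=True)
```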

Next, sequences similar to known transposable element proteins can be extracted from the output using the analyze command below.

detector analyze species_blast_output species_putative_te.lst

The file species_putative_te.lst will be created if putative transposable element proteins are found.
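
The idea behind this step can be sketched as below. This is an illustration only: it assumes the blast output is in the standard 12-column tabular format (query ID in the first column, e-value in the eleventh), and both the function name `putative_te_ids` and the `max_evalue` threshold are hypothetical. The criteria detector really applies are not documented here.

```python
def putative_te_ids(blast_lines, max_evalue=1e-5):
    """Return sorted query IDs with at least one hit at or below max_evalue.

    Assumes tabular BLAST output with the default 12 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    hits = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # not a complete hit line
        query_id, evalue = fields[0], float(fields[10])
        if evalue <= max_evalue:
            hits.add(query_id)
    return sorted(hits)


if __name__ == "__main__":
    with open("species_blast_output") as handle:
        ids = putative_te_ids(handle)
    if ids:  # only write the list when something was found
        with open("species_putative_te.lst", "w") as out:
            out.write("\n".join(ids) + "\n")
```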

Finally, detector can be used to remove the putative TEs from a fasta file.

detector filter species_peptides.fasta species_putative_te.lst species_peptides.clean.fasta
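
Conceptually, the filter step drops every fasta record whose ID appears in the list. The sketch below shows one way to do that in plain Python. It assumes the list file holds one sequence ID per line and that a record's ID is the first whitespace-delimited token of its header; `filter_fasta` is a hypothetical name, not a function from detector.py.

```python
def filter_fasta(fasta_lines, ids_to_remove):
    """Yield the lines of all fasta records whose ID is not in ids_to_remove."""
    remove = set(ids_to_remove)
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # Assumed: the ID is the first token after '>' in the header.
            tokens = line[1:].split()
            seq_id = tokens[0] if tokens else ""
            keep = seq_id not in remove
        if keep:
            yield line


if __name__ == "__main__":
    with open("species_putative_te.lst") as handle:
        te_ids = [line.strip() for line in handle if line.strip()]
    with open("species_peptides.fasta") as src, \
            open("species_peptides.clean.fasta", "w") as dst:
        dst.writelines(filter_fasta(src, te_ids))
```

Because the function is a generator over lines, large fasta files are processed without loading them fully into memory.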

TODO

  • Add support for cluster