Tool to detect potential transposable elements in a fasta file. Some classes of transposable elements (TEs) encode proteins. Without prior knowledge, these often look exactly like protein-coding genes. Some genomes include these TEs in their annotation while others filter them out. For comparative genomics this is problematic, as it creates differences that stem from a technical bias rather than a valid biological trait (e.g. a simple gene count is not comparable between genomes with and without TEs included).
To solve this issue, these TEs should be detected and removed from all genomes prior to a comparative analysis (e.g. counting genes, constructing gene families, ...). De-TE-ctor is a small pipeline that quickly checks for putative TEs included in the set of protein-coding genes of a genome and reports them so they can be removed.
Clone the package from the git repository
git clone https://github.molgen.mpg.de/proost/De-TE-ctor detector
cd detector
Create a virtual environment
virtualenv --python=python3 env
source env/bin/activate
pip install -r requirements.txt
Install the detector module
pip install --editable .
First, you need to create a blast database from a set of known transposable element proteins. Collect sequences of such elements in fasta format and store them in one directory. Note that all files must have the extension .fasta. Alternatively, the files in ./TE_DB/ can be used. For more details on how these were prepared, read the file here.
Create the blast DB using the command below
detector build ./TE_DB/ known_te_db
This command will pick up all fasta-files in the TE_DB folder, concatenate them and build a blast database named known_te_db.
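To make the concatenation step concrete, here is a minimal Python sketch of what "pick up all fasta-files and concatenate them" could look like. It is an illustration only, not detector's actual implementation; the function name `concatenate_fasta` is hypothetical, and the blast database itself would still have to be built from the merged file afterwards.

```python
from pathlib import Path


def concatenate_fasta(directory, output_path):
    """Merge every *.fasta file in `directory` into `output_path`.

    Hypothetical helper for illustration; returns the number of files merged.
    """
    # Only files with the .fasta extension are picked up, as in the README.
    fasta_files = sorted(Path(directory).glob("*.fasta"))
    with open(output_path, "w") as out:
        for fasta in fasta_files:
            text = fasta.read_text()
            out.write(text)
            # A file without a trailing newline would otherwise glue the
            # next header onto its last sequence line.
            if text and not text.endswith("\n"):
                out.write("\n")
    return len(fasta_files)
```

Files with any other extension (for example .fa or .txt) are silently skipped in this sketch, which is why the extension requirement above matters.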
Next, a fasta-file (with protein sequences) should be blasted against the newly created database of known transposable element proteins.
detector blast species_peptides.fasta known_te_db species_blast_output
This command will blast the protein fasta-file species_peptides.fasta against known_te_db (created in the previous step). The output will be stored in species_blast_output.
Finally, from the output, sequences similar to known transposable element proteins can be extracted using the analyze command below.
detector analyze species_blast_output species_putative_te.lst
The file species_putative_te.lst will be created if putative transposable element proteins are found.
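As a rough idea of what such an analysis involves, the sketch below extracts query IDs with a significant hit from blast results. It rests on assumptions not stated in this README: that the blast output is in the standard 12-column tabular format (query ID in column 1, e-value in column 11), and that an e-value cutoff of 1e-5 is appropriate. The function name `putative_te_ids` and the cutoff are hypothetical; detector's own criteria may differ.

```python
def putative_te_ids(blast_lines, max_evalue=1e-5):
    """Return query IDs that have at least one hit at or below `max_evalue`.

    Assumes tab-separated blast output with 12 standard columns
    (an assumption; the README does not state the format).
    """
    hits = []
    seen = set()
    for line in blast_lines:
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        query_id = fields[0]
        evalue = float(fields[10])
        # Report each query once, in order of first appearance.
        if evalue <= max_evalue and query_id not in seen:
            seen.add(query_id)
            hits.append(query_id)
    return hits
```

Writing the returned IDs one per line would give a list file of the kind used in the filtering step.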
Once the list has been created, detector can be used to remove the putative TEs from the fasta file.
detector filter species_peptides.fasta species_putative_te.lst species_peptides.clean.fasta
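The filtering step can be sketched in a few lines of Python. This is an illustration under two assumptions, not detector's actual code: the list file holds one sequence ID per line, and a sequence's ID is the first whitespace-delimited token of its fasta header. The function name `filter_fasta` is hypothetical.

```python
def filter_fasta(fasta_lines, ids_to_remove):
    """Yield the lines of all fasta records whose ID is not in `ids_to_remove`.

    Assumes the record ID is the first token after '>' (an assumption).
    """
    ids_to_remove = set(ids_to_remove)
    keep = True
    for line in fasta_lines:
        if line.startswith(">"):
            # A new record starts: decide whether to keep it.
            header = line[1:].split()
            seq_id = header[0] if header else ""
            keep = seq_id not in ids_to_remove
        if keep:
            yield line
```

Because the function is a generator over lines, it can be fed an open file handle directly and its output written straight to the cleaned fasta file without loading everything into memory.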
- Add support for cluster