# Additional scripts
Scripts used to perform the analyses reported in the LSTrAP manuscript (Proost et al., *under preparation*) are found in
*./helper*.
## Obtain and prepare data
### get_sra_ip.py
Script to download runs from the [Sequence Read Archive](http://www.ncbi.nlm.nih.gov/sra). It requires the Aspera Connect
client to be installed, along with an OpenSSH key (which can be obtained from the Aspera Connect package).

    python3 get_sra_ip.py runs.list.txt ./output/directory /absolute/path/to/opensshkey

### sra_to_fastq.py
Script to convert sra files into fastq. Sratools is required.

    python3 sra_to_fastq.py /sra/files/directory /fastq/output/directory

## Running LSTrAP on transcriptome data
To use LSTrAP on a *de novo* assembled transcriptome, some pre-processing is required. Instead of a genome, provide a fasta
file containing only **coding** sequences (remove UTRs). The helper script fasta_to_gff.py can then generate a gff file
suited for LSTrAP.
### parse_gff.py
Script to remove splice variants from a GFF3 file; only the longest variant of each gene is retained.

    # print to STDOUT
    python3 parse_gff.py input.gff
    # write to file
    python3 parse_gff.py input.gff -o output.gff
    python3 parse_gff.py input.gff --output output.gff

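The filtering idea can be illustrated with a minimal sketch. This is not the code of parse_gff.py: it assumes that `mRNA` features carry `ID` and `Parent` attributes and that transcript length is measured as the summed length of its `CDS` features. The function names `parse_attributes` and `longest_transcripts` are hypothetical.

```python
from collections import defaultdict


def parse_attributes(field):
    """Turn a GFF3 attribute column ('ID=x;Parent=y') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def longest_transcripts(gff_lines):
    """Return {gene_id: transcript_id}, keeping the longest transcript per gene.

    Assumption: length = summed CDS length; ties go to the first transcript seen.
    """
    gene_of = {}                    # transcript id -> parent gene id
    cds_length = defaultdict(int)   # transcript id -> total CDS length

    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        attrs = parse_attributes(cols[8])
        if feature == "mRNA":
            gene_of[attrs["ID"]] = attrs["Parent"]
        elif feature == "CDS":
            # A CDS may list several parents, separated by commas
            for parent in attrs.get("Parent", "").split(","):
                cds_length[parent] += end - start + 1

    best = {}
    for transcript, gene in gene_of.items():
        current = best.get(gene)
        if current is None or cds_length[transcript] > cds_length[current]:
            best[gene] = transcript
    return best


example = [
    "##gff-version 3",
    "chr1\t.\tgene\t1\t900\t.\t+\t.\tID=g1",
    "chr1\t.\tmRNA\t1\t900\t.\t+\t.\tID=g1.1;Parent=g1",
    "chr1\t.\tCDS\t1\t300\t.\t+\t0\tID=c1;Parent=g1.1",
    "chr1\t.\tmRNA\t1\t900\t.\t+\t.\tID=g1.2;Parent=g1",
    "chr1\t.\tCDS\t1\t300\t.\t+\t0\tID=c2;Parent=g1.2",
    "chr1\t.\tCDS\t601\t900\t.\t+\t0\tID=c3;Parent=g1.2",
]
print(longest_transcripts(example))  # {'g1': 'g1.2'}
```

In the example, `g1.2` has 600 coding bases against 300 for `g1.1`, so only `g1.2` would be kept for gene `g1`.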
## Quality control
### htseq_count_stats.py, hisat2_stats.py and tophat_stats.py
These scripts will extract the statistics used to assess the quality of samples.

    python3 htseq_count_stats.py ./path/to/htseq/files > output.txt
    python3 tophat_stats.py ./path/to/tophat/output > output.txt
    python3 hisat2_stats.py ./path/to/hisat2/output > output.txt

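As an illustration of the kind of statistic involved, the sketch below computes the fraction of reads that htseq-count assigned to a gene. It relies on the htseq-count output format, in which counters whose names start with `__` (for example `__no_feature`) report reads that were not assigned to any gene. The function name `assigned_fraction` and the choice of statistic are assumptions; htseq_count_stats.py may report something different.

```python
def assigned_fraction(lines):
    """Return the share of reads counted towards genes in htseq-count output."""
    assigned = 0
    unassigned = 0
    for line in lines:
        if not line.strip():
            continue
        name, count = line.rstrip("\n").split("\t")
        if name.startswith("__"):
            # Special counters: __no_feature, __ambiguous, __not_aligned, ...
            unassigned += int(count)
        else:
            assigned += int(count)
    total = assigned + unassigned
    return assigned / total if total else 0.0


example = [
    "gene1\t60",
    "gene2\t20",
    "__no_feature\t15",
    "__ambiguous\t5",
]
print(assigned_fraction(example))  # 0.8
```

A low fraction for a sample would suggest that many reads could not be attributed to annotated genes.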
## Plots and Graphs
Scripts to generate images similar to those presented in the publication. Example data,
derived from the *Sorghum bicolor* case study, is included in the repository.
### plot_network.py
Script that plots the co-expression neighborhood of a specific gene. A PCC cutoff of 0.7 is applied by default,
but users can override this setting with the --cutoff parameter. Matplotlib and networkx are required for this
script.

    # To draw plot to screen using a PCC cutoff of >= 0.8
    python3 plot_network.py <PCC_TABLE> <GENE_ID> --cutoff 0.8
    # Save as png
    python3 plot_network.py <PCC_TABLE> <GENE_ID> --cutoff 0.8 --png output.png
    # Set png dpi (for publication)
    python3 plot_network.py <PCC_TABLE> <GENE_ID> --cutoff 0.8 --png output.png --dpi 900

![network example](images/plot_network.png "Example of plotted network")
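
To make the effect of the cutoff concrete, here is a minimal sketch of selecting a gene's neighborhood. It assumes the PCC table has already been read into `(gene_a, gene_b, pcc)` tuples (the file format is not described here) and that the neighborhood consists of direct partners only. The function `neighborhood` is hypothetical and uses plain Python rather than networkx.

```python
def neighborhood(edges, gene, cutoff=0.7):
    """Return {neighbor: pcc} for all partners of `gene` with pcc >= cutoff."""
    neighbors = {}
    for gene_a, gene_b, pcc in edges:
        if pcc < cutoff:
            continue
        # Edges are undirected, so the query gene may be on either side
        if gene_a == gene:
            neighbors[gene_b] = pcc
        elif gene_b == gene:
            neighbors[gene_a] = pcc
    return neighbors


edges = [
    ("A", "B", 0.92),
    ("A", "C", 0.75),
    ("D", "A", 0.81),
    ("B", "C", 0.95),
]
print(sorted(neighborhood(edges, "A")))              # ['B', 'C', 'D']
print(sorted(neighborhood(edges, "A", cutoff=0.8)))  # ['B', 'D']
```

Raising the cutoff from 0.7 to 0.8 removes gene `C` from the neighborhood of `A`, which is the behaviour the --cutoff parameter is described as controlling.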
### matrix_heatmap.py
Script to draw a sample distance heatmap (with hierarchical clustering) based
on a normalized expression matrix.

    # To draw plot to screen
    python3 matrix_heatmap.py ./data/sbi.expression.matrix.tpm.txt
    # Hide labels (useful for large sets)
    python3 matrix_heatmap.py ./data/sbi.expression.matrix.tpm.txt --hide_labels
    # Save as png
    python3 matrix_heatmap.py ./data/sbi.expression.matrix.tpm.txt --png output.png
    # Set png dpi (for publication)
    python3 matrix_heatmap.py ./data/sbi.expression.matrix.tpm.txt --png output.png --dpi 900

![matrix example](images/matrix.png "Sample distance heatmap (with hierarchical clustering)")
### pca_plot.py
Script to perform a PCA on any expression matrix.

    python3 pca_plot.py ./data/sbi.expression.matrix.tpm.txt

### pca_powerlaw.py
*This script and the required data are included to recreate results from the manuscript (Proost et al., under review)*

Script to perform a PCA on the *Sorghum bicolor* data (case study) and draw the node degree distribution. The
required data is included in the repository. Note that this script requires sklearn and seaborn.

    python3 pca_powerlaw.py ./data/sbi.expression.matrix.tpm.txt ./data/sbi_annotation.txt ./data/sbi.power_law.R07.txt

## Utilities
### merge_matrix.py
If samples for one (!) species were processed in two or more batches, this script can be used to merge the
expression matrices.

*Note that to obtain co-expression networks from the merged matrix, LSTrAP needs to be run on the merged expression
matrix, skipping all steps before the construction of the co-expression network.*

*Only merge raw matrices with raw, tpm with tpm and rpkm with rpkm!*

    python3 merge_matrix.py matrix_one.txt matrix_two.txt matrix_merged.txt

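The merge itself can be pictured with the following minimal sketch. It is not the code of merge_matrix.py and rests on several assumptions: matrices are tab-delimited, the first row holds sample names, the first column holds gene IDs, and only genes present in both matrices are kept. The functions `read_matrix` and `merge_matrices` and the `gene` header label are hypothetical.

```python
def read_matrix(lines):
    """Parse a tab-delimited matrix into (sample_names, {gene: [values]})."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return header[1:], {row[0]: row[1:] for row in body}


def merge_matrices(lines_one, lines_two):
    """Merge two matrices column-wise, keeping genes present in both."""
    samples_one, values_one = read_matrix(lines_one)
    samples_two, values_two = read_matrix(lines_two)

    # The same sample in both batches would produce duplicate columns
    duplicated = set(samples_one) & set(samples_two)
    if duplicated:
        raise ValueError(f"Samples present in both matrices: {sorted(duplicated)}")

    merged = ["\t".join(["gene"] + samples_one + samples_two)]
    for gene, values in values_one.items():
        if gene in values_two:
            merged.append("\t".join([gene] + values + values_two[gene]))
    return merged


matrix_one = ["gene\ts1\ts2", "g1\t1.0\t2.0", "g2\t3.0\t4.0"]
matrix_two = ["gene\ts3", "g1\t5.0", "g3\t6.0"]
for row in merge_matrices(matrix_one, matrix_two):
    print(row)
```

Values are kept as strings, so the sketch never rescales anything. This is also why mixing raw counts with tpm or rpkm values would silently produce a meaningless matrix.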