This repository contains a pipeline for performing polygenic risk score (PRS) computation using PRS-CS on an HPC cluster. The pipeline has been structured to run across multiple clusters with specific configurations for each cluster (e.g., Pirol and Raven). Below is an overview of the directory structure and instructions for setting up and running the pipeline. It is based on: https://github.com/getian107/PRScs (please cite PRScs when using this pipeline) Please contact Alicia Schowe (alicia_schowe@psych.mpg.de) if you run into any issues with this pipeline.
- 08 April 2025: restructured scripts for pirol and raven HPC
- 05 Mar 2024: fixed -sum flag in plink command
-
Prepare your files (i.e., format GWAS summary statistics, prepare your individual genotype file)
-
Clone this repository (one time)
git clone https://github.molgen.mpg.de/mpip/run_PRScs.git
- Create a conda environment with all the requirements (one time) This step requires that you already set up miniforge3 as described in the IT wiki.
bash step1_setup.sh
- Compute your PGS
bash step2_run_pipeline.sh CLUSTER PATH_TO_GENOTYPE PATH_TO_GWAS GWAS_N NAME_OUTDIR NAME_PGS REF_GENOME #executes the pipeline
Script | Description |
---|---|
step1_setup.sh |
Sets up environment and clones PRS-CS repo (this only needs to be run once) |
01_env/01_prscs_env.yml |
Conda environment setup |
step2_run_pipeline.sh |
Executes PRS-CS on pirol or raven by chromosome (see 02_run scripts), summarises slurm output files & add weights to one PGS using plink (03_post_processing) |
02_run/01_run_PRSCS_on_pirol.sh |
Runs pipeline on pirol, node c.pirol |
02_run/01_run_PRSCS_on_raven.sh |
Runs pipeline on raven, strongly reccommended if you intend to compute multiple PRS |
02_run/02_PRScs.sh |
Executes PRS-CS python scripts on pirol |
02_run/02_PRScs_raven.sh |
Executes PRS-CS python scripts on raven |
03_post_processing/01_slurm_reports |
Summarizes error messages & number of SNPs per chromosome (should be decreasing number from chr1 to chr22) |
03_post_processing/02_plink_sum_to_PRS.sh |
If all went well, this will use plink to compute your final PRS per individual |
How it works: step2_run_pipeline.sh will call upon scripts in 02_run which will execute PRScs.py as an array job, running parallel continuous shrinkage on each chromosome. The output of this first step will be 22 chromosome files with their respective weights after applying continuous shrinkage based on reference dataset and GWAS. After running PRS-CS, 03_processing/01_slurm_reports will summarise any error messages, number of SNPs per chromosome that went into the PGS and the shrinkage parameter. If all 22 jobs run successfully, 03_post_processing/02_plink_sum_to_PRS will generate the weighted effect sizes of across all chromosomes into a single score per individual. If you find that weights for any chromosome have not been generated, first check the slurm summary file. After fixing the issue (if it needs any fixing), you can just rerun step2_run_pipeline.sh (it will only re-run chromosome files that have not been generated yet or are empty). Once all chromosome files have been generated, post_processing plink will be run.
[I recommend checking https://github.com/getian107/PRScs on a regular basis for PRScs updates]
Prepare all GWAS summary statistics => including the header line, format into: SNP|A1|A2|BETA|SE or SNP|A1|A2|BETA|P (instead of BETA you can also supply OR)
- Note1: When preparing GWAS summary statistics, check for matching genome build (reference panel is based on hg37), ancestry and column of the effect allele!
- Note2: When using BETA/OR + P as the input, p-values smaller than 1e-323 are truncated, which may reduce the prediction accuracy for traits that have highly significant loci.
- Note3: the beta/OR column is only used for direction of the effect (+/-) so also z statistics can be used.
- in command line, go to the folder where you set everything up, then run:
bash step2_run_pipelin.sh CLUSTER PATH_TO_GENOTYPE PATH_TO_GWAS GWAS_N NAME_OUTDIR NAME_PGS REF_GENOME #executes the pipeline
- CLUSTER: the cluster you are currently logged in to; pirol or raven
- PATH_TO_GENOTYPE: path to your own genotyped data
- PATH_TO_GWAS: path to the formatted GWAS summary statistics
- GWAS_N = total GWAS sample size
- NAME_OUTDIR = name of the output directory where your output files should be stored (if the output directory does not exist yet, the script will create a new directory via mkdir)
- NAME_PGS = the prefix of all your output files
- REF_GENOME = reference genome, usually european is fine and you can leave the same path as used in the example.
- CONDA = path to your conda setup (required to activate the conda environment in the sbatch scripts)
bash step2_run_pipeline.sh \
pirol \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/GoDMC2/godmc_phase2/input_data_BeCOME/genotypes \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/CovSocial/PRSes/PRScs_pipeline/GWAS_sumstats/formatted/formatted_NEUno23and_18022020.sumstats \
120000 \
output \
NEU2 \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/tools/genoref/ldblk_1kg_eur \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/tools/miniforge3/etc/profile.d/conda.sh;
bash step2_run_pipeline.sh \
raven \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/GoDMC2/godmc_phase2/input_data_BeCOME/genotypes \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/CovSocial/PRSes/PRScs_pipeline/GWAS_sumstats/formatted/formatted_NEUno23and_18022020.sumstats \
120000 \
output \
NEU2 \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/tools/genoref/ldblk_1kg_eur \
/nexus/posix0/MPI-psych/g/dept-binder/mpsczamara/AliciaSchowe/tools/miniforge3/etc/profile.d/conda.sh;
NEU.txt
NEU_PRS.hh
NEU_PRS.log
NEU_PRS.profile
NEU_pst_eff_a1_b0.5_phiauto_chr15.txt
NEU_pst_eff_a1_b0.5_phiauto_chr11.txt
NEU_slurm_error_summary.txt
slurm/
- {NAME_PGS}.txt => weights of all chromosomes combined (part of 03_post_processing scripts)
- {NAME_PGS}_PRS.log => plink log file of the final score (here you can also check the total number of SNPs that went into the score)
- {NAME_PGS}_PRS.profile => Final weighted PGS for each individual.
- {NAME_PGS}_pst_eff_a1_b0.5_phiauto_chr1.txt => 22 files including the PRScs computed weights for each genetic variant per chromosome
- {NAME_PGS}_slurm_error_summary.txt => for each slurm file, this summary will show the chromosome, number of considered SNPs, and the shrinkage parameter.
- slurm/ folder with all individual slurm logs A brief sanity check of the data is to see whether the total number of SNPs that went into the score decreases from chr1 to chr22.
Scanning SLURM output files for summary and errors...
[PHENO_slurm-12345_19.out]
Chromosome: 19
SNPs: 15,430
Shrinkage parameter: 4.28e-06
[PHENO_slurm-12345_20.out]
Chromosome: 20
SNPs: Not found
Shrinkage parameter: Not found
--- Errors in PHENO_slurm-12345_20.out ---
Segmentation fault (core dumped)
Done. Summary saved to /path/to/output/PHENO_slurm_error_summary.txt
PGS were computed using PRS-CS (Ge et al., 2019) and PLINK 1.9 (Purcell et al., 2007).
PRS-CS was applied to infer posterior mean effects by chromosome for autosomal single nucleotide polymorphisms (SNPs)
that overlap between the given discovery genome-wide association study (GWAS)
and the European 1000 Genomes linkage disequilibrium (LD) panel.
To ensure convergence of the underlying Gibbs sampler algorithm,
we set Markov chain Monte Carlo (MCMC) iteration to 10,000 and the first 5,000 MCMC iterations as burn-in (Schultz et al., 2022).
For all other PRS-CS parameter the default settings were used. Using the PLINK 1.9 score function,
raw PGSs were then calculated for each participant as a sum of the risk allele count weighted by
the posterior means effects returned by PRS-CS.
Auton A, Abecasis GR, Altshuler DM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68-74. doi:10.1038/nature15393
Ge T, Chen CY, Ni Y, Feng YCA, Smoller JW. Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat Commun. 2019;10(1):1776. doi:10.1038/s41467-019-09718-5
Schultz LM, Merikangas AK, Ruparel K, et al. Stability of polygenic scores across discovery genome-wide association studies. Hum Genet Genomics Adv. 2022;3(2):100091. doi:10.1016/j.xhgg.2022.100091
Purcell S, Neale B, Todd-Brown K, et al. PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses. Am J Hum Genet. 2007;81(3):559-575.