# EPIC Preprocessing Pipeline
<br />
This pipeline provides a workflow to preprocess and conduct quality control analyses on raw DNA methylation (DNAm) data from EPIC arrays. The EPIC array is a DNAm profiling microarray manufactured by Illumina (San Diego, CA, USA) and is commonly used for epigenome-wide DNAm assessment.
The pipeline is semi-automated: user input is required at specific scripts, but the pipeline otherwise automatically generates and saves files and produces a visual output (as an HTML report) including feedback and a time stamp for documentation.
<br />
***Note: The following text gives you a quick introduction to help you process and understand the generated data. Further, we describe the content and goal of each script in the preprocessing pipeline. The pipeline is strongly based on the workflow described by Maksimovic et al. [1], the minfi R package [2,3], further code developed by L. Dieckmann, MSc, and other sources (see references below), and makes no claim to completeness nor accepts any responsibility for usage.***
<br />
## Background
<br />
**Definitions:**
The minfi package uses the data classes described below:
- **RGChannelSet:** raw data from IDAT files organized at the probe level. The RGChannelSet data has two channels: Red and Green.
***Note:*** A probe is not necessarily a single CpG locus; this depends on whether the CpG is measured with an Infinium type I or type II probe design.
- **MethylSet:** data organized by the CpG locus level, but not mapped to a genome. This data has two channels: ***Meth*** (methylated) and ***Unmeth*** (unmethylated).
- **RatioSet:** data organized by the CpG locus level, but not mapped to a genome. The data has at least one of two channels: Beta and/or M. It may also include a CN channel (copy number).
- **GenomicMethylSet:** like a MethylSet, but mapped to a genome.
- **GenomicRatioSet:** like a RatioSet, but mapped to the genome.
<br />
## Theory behind the workflow
<br />
The workflow contains a variety of steps, including quality control, CpG filtering, normalization, batch effect correction, and some data exploration and statistical testing. Here are some short explanations for people new to the field.
<br />
<br />
**Quality Control**
<br />
The workflow finds and removes low-quality or unreliable samples from the data. The pipeline does this by calculating detection p-values (detP), which indicate signal quality. ***minfi***, the R package used in this pipeline, computes detP values by comparing the total signal of the methylated and unmethylated channels (M + U) for each CpG to a background signal level, approximated from negative control probes. A small p-value indicates a reliable signal, while large p-values indicate poor-quality signals. We also provide the minfi qcReport, which generates further quality control plots. Poor samples are then excluded from further processing and analysis using a detP cutoff value set by the user.
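A minimal sketch of this step with minfi, assuming an `RGSet` object and a user-defined cutoff (the names `RGSet` and `detP_cutoff` are illustrative):

```r
library(minfi)

## Detection p-values per probe (rows) and sample (columns)
detP <- detectionP(RGSet)

## Flag samples whose mean detection p-value exceeds the user-defined cutoff
detP_cutoff <- 0.01
failed_samples <- colnames(detP)[colMeans(detP) > detP_cutoff]

## Keep only the samples that pass
RGSet_pass <- RGSet[, colMeans(detP) <= detP_cutoff]
```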
<br />
**Calculating M and beta values**
<br />
DNAm levels are converted into beta values (β = M/(M+U)) and M values (M-value = log2(M/U)). Beta values and M values are related through a logit transformation; M values have therefore been suggested to be better distributed for statistical analysis [5].
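A minimal sketch of how these values can be obtained from a minfi MethylSet (the object name `MSet` and the offsets are illustrative):

```r
library(minfi)

## Methylated and unmethylated intensities from a MethylSet
meth   <- getMeth(MSet)
unmeth <- getUnmeth(MSet)

## Beta values: proportion methylated (a small offset avoids division by zero)
betas <- meth / (meth + unmeth + 100)

## M values: log2 ratio of methylated to unmethylated signal
mvals <- log2((meth + 1) / (unmeth + 1))

## minfi convenience accessors return the same quantities
## (default offset settings may differ slightly)
betas_minfi <- getBeta(MSet)
mvals_minfi <- getM(MSet)
```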
<br />
**Sex Matches**
<br />
We look for sex mismatches by comparing the reported sex of each sample to the sex predicted from the DNAm data. The pipeline then excludes any samples where the reported sex does not match the predicted sex. In such cases we suggest contacting the study coordinator to confirm the reported sex (especially to exclude data management issues).
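A minimal sketch of a sex check with minfi's `getSex()` (the phenotype column `sex` is a hypothetical name):

```r
library(minfi)

## Sex prediction requires a genome-mapped object
gMSet <- mapToGenome(MSet)
predicted <- getSex(gMSet)$predictedSex   # "M" or "F"

## Compare with the reported sex from the sample sheet
reported   <- pData(gMSet)$sex
mismatches <- colnames(gMSet)[predicted != reported]
```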
<br />
**Normalization**
<br />
Data normalization should minimize unwanted technical variation. While different types of normalization are available [6], we use stratified quantile normalization followed by beta-mixture quantile normalization (BMIQ).
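A minimal sketch of this combination using minfi and wateRmelon (object names are illustrative; the pipeline's own wrapper may differ):

```r
library(minfi)
library(wateRmelon)

## Stratified quantile normalization (returns a GenomicRatioSet)
gRatioSet_quantile <- preprocessQuantile(RGSet_clean)

## BMIQ needs beta values plus the Infinium probe design (type I = 1, type II = 2)
betas_quantile <- getBeta(gRatioSet_quantile)
design <- ifelse(getProbeType(gRatioSet_quantile) == "I", 1, 2)

## Apply BMIQ per sample and keep the normalized betas
betas_bmiq <- apply(betas_quantile, 2,
                    function(b) BMIQ(b, design, plots = FALSE)$nbeta)
```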
<br />
**Filtering**
<br />
Poor-performing probes are filtered out of the data. We also remove probes located on the sex chromosomes (X and Y). We further remove probes known to have common SNPs at the CpG site and probes that are "cross-reactive", meaning that they map to multiple places in the genome [7,8].
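A minimal sketch of these filters with minfi (a vector of cross-reactive probe IDs from [7,8] is assumed under the hypothetical name `cross_reactive_ids`):

```r
library(minfi)

## Remove probes on the sex chromosomes
ann <- getAnnotation(gRatioSet)
gRatioSet_filt <- gRatioSet[!(ann$chr %in% c("chrX", "chrY")), ]

## Remove probes with common SNPs at the CpG or single-base-extension site
gRatioSet_filt <- dropLociWithSnps(gRatioSet_filt, snps = c("CpG", "SBE"), maf = 0)

## Remove cross-reactive probes listed in the published references [7,8]
gRatioSet_filt <- gRatioSet_filt[!(rownames(gRatioSet_filt) %in% cross_reactive_ids), ]
```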
<br />
**Batch Effects**
<br />
Batch correction is performed to avoid strong variation in the data due to technical effects. In short, a principal component analysis (PCA) is performed to capture variation in the data and represent it in a dimensionality-reduced manner. The explained variation is then tested for association with specific known technical variables. Whenever a strong batch effect is detected, a correction can be performed with ComBat from the sva R package [9].
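A minimal sketch of this idea (the matrix names and the phenotype column `slide` are illustrative):

```r
library(sva)

## PCA on the beta matrix (samples in rows) to summarize the main sources of variation
pca <- prcomp(t(betas_filtered), center = TRUE)
summary(pca)$importance[, 1:5]   # variance explained by the first PCs

## Test whether a PC is associated with a known technical variable, e.g. slide
anova(lm(pca$x[, 1] ~ as.factor(PhenoData$slide)))

## If a strong batch effect is found, correct it with ComBat (often on M-values)
mvals_combat <- ComBat(dat = mvals_filtered, batch = PhenoData$slide)
```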
<br />
**Pipeline Input**
<br />
The pipeline takes IDAT files (the raw output files from the scanner) as input data. These files contain red and green channel intensities. The pipeline also relies on a sample sheet (phenotype data) file that must be in csv format; the ***minfi R package*** reads this with the ***read.metharray.sheet()*** function. The pipeline initially reads the raw data into an RGChannelSet (which also contains information about control probes), holding the raw intensities in a green channel matrix and a red channel matrix. Together with the IlluminaMethylationManifest object (which contains the array design and describes how probes and color channels are paired to measure methylation), the RGChannelSet is processed into a MethylSet, which contains two matrices with the methylated and unmethylated values for each CpG. If you are using MethylationEPIC v2.0 microarrays, the pipeline uses R packages available in the following GitHub repositories [10,11].
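A minimal sketch of reading the input with minfi (the path `idat_dir` is a hypothetical placeholder; the commented EPIC v2 annotation values are an assumption based on [10,11]):

```r
library(minfi)

## Read the csv sample sheet describing the samples and their IDAT basenames
targets <- read.metharray.sheet(idat_dir)

## Read the raw red/green intensities into an RGChannelSet
RGSet <- read.metharray.exp(targets = targets)

## For MethylationEPIC v2.0 arrays, the manifest/annotation packages from [10,11]
## would be installed and set, e.g.:
## annotation(RGSet) <- c(array = "IlluminaHumanMethylationEPICv2",
##                        annotation = "20a1.hg38")
```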
<br />
**User must specify**
<br />
Flexibility in such pipelines is sometimes needed, since variables such as file locations and column names may differ between projects. Therefore, the user is asked to specify several fields, defined and described in detail in the scripts that call for them. The user must also modify several other fields in the scripts. These are all described under the header "instructions" at the beginning of the individual scripts, and fields that must be modified are labeled with the default "PLEASE FILL IN."
**Scripts that require specifications are:** 1, 5, 9, 10.
<br />
Briefly, the user must specify the path on the cluster to the cloned GitHub repository, the path to the phenotype data, the path to the IDAT files, and the names of several columns in the phenotype data (including slide, array, person ID, and sex). The user must also input a project name, array type, population ethnicity, number of samples, and detection p-value cutoff.
<br />
## Description of individual scripts
<br />
**Script 1**
Script 1 takes user definitions (of various paths, column names, and choices for parameters used in QC and preprocessing) and formats them into a dataframe called "userchoices". This dataframe is saved in the "data" folder. The script also creates the directories needed for the pipeline within the repository folder, including a project folder with a user-specified name. The project folder contains a "reports" folder and a "processed_data" folder; the "processed_data" folder contains a sub-folder called "final_data". Next, the script imports, formats, and saves the phenotype data in an RData and a csv file called "phenotype_data". The phenotype data must initially be imported in ***csv format***, must contain one sample per row, and must have one column per sample attribute. The sex data must start with F (female) / W (weiblich) or M (male / männlich). If your sex data is described in a different way (or includes other categories of sex), this should be adjusted.
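A minimal sketch of what such a setup step could look like (all names and values here are illustrative, not the pipeline's exact variables):

```r
## Collect user definitions in a dataframe and save it to the "data" folder
userchoices <- data.frame(
  project_name = "my_project",
  array_type   = "EPIC",
  detP_cutoff  = 0.01,
  idat_path    = "/path/to/idats",
  pheno_path   = "/path/to/phenotype_data.csv"
)
save(userchoices, file = file.path("data", "userchoices.RData"))

## Create the project directory structure
dir.create(file.path("my_project", "reports"), recursive = TRUE)
dir.create(file.path("my_project", "processed_data", "final_data"), recursive = TRUE)
```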
<br />
**Script 2**
Script 2 converts raw intensity signals to usable data. This includes conversion of the raw red and green intensity signals from the IDAT files into an "RGChannelSet"; conversion of methylated and unmethylated data into a "MethylSet"; creation of beta and M values in a "RatioSet"; storage of raw beta values in "Betas"; mapping of the RatioSet to genomic positions to obtain "gRatioSet"; saving of phenotype data in "PhenoData"; and creation of a "summary file" to track the number of samples and probes at each step.
Further information about the data classes mentioned above is available in the minfi User's Guide: https://bioconductor.org/packages/devel/bioc/vignettes/minfi/inst/doc/minfi.html
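A minimal sketch of this conversion chain with minfi (object names follow the description above):

```r
library(minfi)

MSet      <- preprocessRaw(RGSet)                              # MethylSet: Meth/Unmeth
RatioSet  <- ratioConvert(MSet, what = "both", keepCN = TRUE)  # beta, M and CN values
gRatioSet <- mapToGenome(RatioSet)                             # add genomic positions
Betas     <- getBeta(RatioSet)                                 # raw beta values
PhenoData <- pData(RGSet)                                      # phenotype data
```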
<br />
**Script 3**
Script 3 conducts the first level of quality control by analyzing the detection p-values of the samples. The script saves and reports the person IDs of samples with detection p-values greater than the user-defined threshold, and generates the minfi QC report (density plots of betas) and a distribution artifacts report (beta values across all probes) for the remaining samples. The distribution artifacts report should show two prominent peaks, since most probes have either very low methylation (around 0%) or very high methylation (around 100%). Samples deviating from this pattern (e.g. with additional or strongly shifted peaks) might contain artifacts and should be excluded from further analysis. The user is asked to record these sample names for use in Script 5.
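A minimal sketch of the reports generated at this stage (the phenotype column names `person_id` and `slide` are hypothetical):

```r
library(minfi)

## minfi QC report with density and control-probe plots for the retained samples
qcReport(RGSet_pass, sampNames = PhenoData$person_id,
         pdf = file.path("reports", "qcReport.pdf"))

## Beta density plot to check for distribution artifacts:
## expect two peaks near 0 and 1; note deviating samples for Script 5
densityPlot(MSet, sampGroups = PhenoData$slide)
```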
<br />
**Script 4**
Script 4 predicts sex with DNA methylation data; compares epigenetically predicted sex with phenotype sex information; and reports samples with mismatches. These samples will be excluded in script 5.
<br />
**Script 5**
Script 5 removes samples with sex mismatches and distribution artifacts and saves clean data files ("RGSet_clean" and "Betas_clean") to the processed data folder. The script also evaluates the different reasons for sample exclusion and informs the user if any of the samples have failed multiple tests. After visually evaluating the Script 3 output, the user is asked to enter the array IDs of all samples with distribution artifacts into the "exclusion_distribution_artifacts" vector.
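A minimal sketch of how such an exclusion could look (the IDs and the name `sex_mismatch_ids` are illustrative placeholders):

```r
## Array IDs of samples with distribution artifacts, filled in by the user
exclusion_distribution_artifacts <- c("ARRAY_ID_1", "ARRAY_ID_2")

## Combine all exclusion reasons and subset the data
exclude_all <- union(sex_mismatch_ids, exclusion_distribution_artifacts)
RGSet_clean <- RGSet[, !(colnames(RGSet) %in% exclude_all)]
Betas_clean <- Betas[, !(colnames(Betas) %in% exclude_all)]
```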
<br />
**Script 6**
Script 6 normalizes the data to minimize unwanted technical variation (saving "RGSet_clean_quantile", "Betas_clean_quantile", "Mset_clean_quantile", and "Betas_clean_quantile_bmiq"); visualizes beta densities before and after normalization; and creates reports containing beta density plots per sample for both raw and normalized data in the reports folder. We use stratified quantile normalization, followed by BMIQ (beta-mixture quantile normalization). The user can choose to modify this in the script and use a different normalization method instead.
<br />
**Script 7**
Script 7 removes (filters) poor-performing probes with unreliable signal from the normalized beta values and the normalized genomic ratio set. It also removes probes that are located on the sex chromosomes, affected by common SNPs, prone to cross-hybridization, or polymorphic.
<br />
**Script 8**
Script 8 identifies and removes outliers. In both filtered and unfiltered data, this script removes probes with zero variance (if present) prior to principal component analysis (PCA); runs a PCA to identify extreme outliers; removes any outliers from the beta data set; runs a PCA to determine whether there are any technical batch effects; provides reports of statistical tests and visualizations if any technical batch effects are present; and generates and saves a full report of ANOVAs of linear models checking for technical batches to the reports folder (not displayed).
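A minimal sketch of the zero-variance filter and PCA-based outlier check (object names and the 3-SD threshold are illustrative):

```r
library(matrixStats)

## Remove probes with zero variance before PCA
betas_var <- Betas_clean[rowVars(Betas_clean) > 0, ]

## PCA on samples; flag extreme outliers, e.g. beyond 3 SD on PC1 or PC2
pca <- prcomp(t(betas_var), center = TRUE)
z1  <- as.numeric(scale(pca$x[, 1]))
z2  <- as.numeric(scale(pca$x[, 2]))
outlier_samples <- rownames(pca$x)[abs(z1) > 3 | abs(z2) > 3]
```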
<br />
**Script 9 and 10**
Scripts 9 and 10 correct for up to three additional technical batch effects specified by the user in filtered (Script 9) and unfiltered (Script 10) data. The user must also specify the order of the corrections. The scripts then check how effective the correction was by displaying statistical measures and conducting a PCA post-correction. A final RGSet is saved to the final_data folder.
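A minimal sketch of correcting several batch variables in a user-defined order with ComBat (the variable names in `batch_vars` are illustrative):

```r
library(sva)

## Up to three batch variables, corrected in the order given by the user
batch_vars <- c("slide", "array", "plate")

m_corrected <- mvals_filtered
for (b in batch_vars) {
  m_corrected <- ComBat(dat = m_corrected, batch = PhenoData[[b]])
}
```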
<br />
## Author Contacts:
<br />
This workflow was prepared by **Natan Yusupov** and **Alexandra Halberstam**.
Special thanks to **Dr. Darina Czamara** for the scientific supervision and **Dr. Benno Pütz** for the insights regarding the code.
We can be reached at **natan_yusupov@psych.mpg.de** and **alexandrahalberstam@gmail.com**
Please do not hesitate to contact us with any questions. We would be more than happy to receive suggestions on how to improve this pipeline.
## References
<br />
1. Maksimovic, J., Phipson, B., & Oshlack, A. (2017). A cross-package Bioconductor workflow for analysing methylation array data. F1000Research. (https://doi.org/10.12688/f1000research.8839.3)
2. https://bioconductor.org/packages/devel/bioc/vignettes/minfi/inst/doc/minfi.html
3. Aryee MJ, Jaffe AE, Corrada-Bravo H, Ladd-Acosta C, Feinberg AP, Hansen KD, Irizarry RA (2014). “Minfi: A flexible and comprehensive Bioconductor package for the analysis of Infinium DNA Methylation microarrays.” Bioinformatics, 30(10), 1363–1369. doi:10.1093/bioinformatics/btu049.
4. https://www.illumina.com/products/by-type/microarray-kits/infinium-methylation-epic.html
5. Du, P., Zhang, X., Huang, CC. et al. Comparison of Beta-value and M-value methods for quantifying methylation levels by microarray analysis. BMC Bioinformatics 11, 587 (2010). https://doi.org/10.1186/1471-2105-11-587
6. Hicks, Stephanie C, and Rafael A Irizarry. 2015. “Quantro: A Data-Driven Approach to Guide the Choice of an Appropriate Normalization Method.” Genome Biology 16 (1): 1.
7. Chen, Yi-an, Mathieu Lemire, Sanaa Choufani, Darci T Butcher, Daria Grafodatskaya, Brent W Zanke, Steven Gallinger, Thomas J Hudson, and Rosanna Weksberg. 2013. “Discovery of cross-reactive probes and polymorphic CpGs in the Illumina Infinium HumanMethylation450 microarray.” Epigenetics : Official Journal of the DNA Methylation Society 8 (2): 203–9. https://doi.org/10.4161/epi.23470.
8. McCartney DL, Walker RM, Morris SW, McIntosh AM, Porteous DJ, Evans KL. Identification of polymorphic and off-target probe binding sites on the Illumina Infinium MethylationEPIC BeadChip. Genom Data. 2016 May 26;9:22-4. doi: 10.1016/j.gdata.2016.05.012. PMID: 27330998; PMCID: PMC4909830.
9. Leek JT, Johnson WE, Parker HS, Jaffe AE, Storey JD. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics. 2012;28:882–3.
10. https://github.com/jokergoo/IlluminaHumanMethylationEPICv2manifest, provided by Zuguang Gu (National Center for Tumor Diseases, Heidelberg, Germany)
11. https://github.com/jokergoo/IlluminaHumanMethylationEPICv2anno.20a1.hg38, provided by Zuguang Gu (National Center for Tumor Diseases, Heidelberg, Germany)
<br />
**Legal Disclaimer:** The authors assume no responsibility for the topicality, correctness, completeness, or quality of the code or information provided. Liability claims against the authors relating to material or immaterial damage caused by the use or misuse of any code or information provided, or by the use of incorrect or incomplete information, are excluded unless the authors acted with intent or gross negligence. All suggestions are non-binding. The authors reserve the right to change parts of the content or the entire content without prior notice, to add to or delete content, or to cease publication temporarily or permanently.