From 672ef7fe279c2b66da878a66bc77dbd96619b37f Mon Sep 17 00:00:00 2001
From: Natan Yusupov <natan_yusupov@psych.mpg.de>
Date: Wed, 11 Oct 2023 16:20:55 +0200
Subject: [PATCH] update README.md

---
 README.md | 41 ++++++++++++++++++++++++++---------------
 1 file changed, 26 insertions(+), 15 deletions(-)
diff --git a/README.md b/README.md
index 025fc34..342b38e 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,14 @@
 # EPIC Preprocessing Pipeline
 
 <br />
-This pipeline provides a workflow to preprocess and conduct quality control analyses on raw DNA methylation (DNAm) data from EPIC arrays. EPIC is a DNAm profiling microarray, manufactured by Illumina (Illumina, San Diego, CA, USA) and used for epigenome-wide DNAm assessment.  
+This pipeline provides a workflow to preprocess and conduct quality control analyses on raw DNA methylation (DNAm) data from EPIC arrays. EPIC is a DNAm profiling microarray manufactured by Illumina (San Diego, CA, USA) and commonly used for epigenome-wide DNAm assessment.  
 
 The pipeline is semi-automated, meaning that an input is required at specific scripts, but the pipeline is otherwise automated to generate and save files and produces a visualized output (as an html) including feedback and time-stamp for documentation.  
 <br />  
 ***Note: The following text should give you a quick introduction that would help you process and understand the generated data. Further, we describe the content and goal of each script in the preprocessing pipeline. The pipeline is strongly based on the workflow described by Maksimovic et al. [1], the minfi R package [2,3] and further code developed by L. Dieckmann, MSc, and other sources (see references below) and makes no claim to completeness or take any responsibility for usage.***  
 <br />  
 
-## Background:
+## Background
 <br />
 
 **Definitions:**
@@ -58,7 +58,7 @@ Poor performing probes are filtered out of the data. We also remove probes that
 Batch correction is performed to avoid strong varaition in the data due to technical effects. In short, a principal component analysis (PCA) is performed to capture variation in the data and represent it in a dimentionality-reduced manner. The expained variation is then tested for specific known technical variables. Whenever a strong batch effect is detected, a correction can be performed with ComBat of the sva R package [9].  
 <br />  
 
-**Pipeline Input:**
+**Pipeline Input**
 <br />  
 
 The pipeline takes IDAT files (rawest output files from the machine) as input data. These files contain red and green channels. The pipeline also relies on a sample sheet phenotype data file that must be in a csv format. The ***minfi R package*** can read this with the ***read.metharray.sheet()*** function. The pipeline initially reads in raw data to an RGChannelSet (contains information about control probes), which has the raw intensity in a green channel matrix and a red channel matrix. Along with the IlluminaMethylationManifest object (which contains the array design and describes how the probes and color channels are paired to measure methylation), the RGChannelSet will be processed into a Methylset, which will contain normalized data and two matrices with the methylated and unmethylated values for each CpG. If you are using MethylationEPIC v2.0 microarraysm the pipeline would use R packages available on following Github repositories [10,11].  
@@ -73,40 +73,51 @@ Flexibility in such pipelines is sometimes needed and variables such as location
 Briefly, the user must specify the path on the cluster to the cloned github repository, path to the phenotype data location, path to the idat files location, the name of several columns in their phenotype data (including slide, array, person id, and sex). The user must also input a project name, array type, population ethnicity, number of samples and detection p-value cutoff.  
 <br />  
 
-**Description of individual scripts:**
+## Description of individual scripts
 <br />  
-**Script 1**:  
+
+**Script 1**  
 Script 1 takes user definitions (of various paths, column names, and choices for parameters used in QC and preprocessing) and formats them into a  dataframe called "userchoices." This dataframe is saved in the "data" folder. It also creates directories needed for the pipeline within the repository folder. This includes a project folder with a user-specified name. The project folder contains a "reports" folder and a "processed_data" folder. The "processed_data" folder contains a sub-folder called "final_data." Next, it imports, formats, and saves phenotype data in a RData and csv file called "phenotype_data". Phenotype data must be initially imported in a ***csv format***, must contain one sample per row, and must have columns each with one attribute of the sample. The sex data must start with F (female) / W (weiblich) or M (male or männlich). If your data contains sex data that is described in a different way (or includes other categories of sex), this should be adjusted.  
 <br />  
-**Script 2:**  
+
+**Script 2**  
 Script 2 converts raw intensity signals to usable data. This includes a conversion of raw intensity red and green signals from IDAT files to a "RGChannelSet"; conversion of methylated and unmethylated data to "MethylSet"; creation of beta and M values in "RatioSet"; Storage of raw beta values in "Betas"; mapping of RatioSet with genomic position to get "gRatioSet", saving of phenotype data in "PhenoData"; and creation of a "summary file" to track number of samples and probes in each step.  
 Further information about the data classes mentioned above is available in the minfi User's Guide: https://bioconductor.org/packages/devel/bioc/vignettes/minfi/inst/doc/minfi.html  
 <br />  
-**Script 3:**  
+
+**Script 3**  
 Script 3 conducts the first level of quality control. Script 3 analyzes the detection p-values of samples. This script saves and reports person IDs of samples with detection p-values greater than user threshold; generates minfi QC report (density plots of betas) and distribution artifacts reports (shows beta values across all probes) for the remaining samples. The distribution artifacts report should have two prominent peaks, since most of probes are have either very low methylation (around 0%) or very high methylation (around 100%). Exceptions from this rule (e.g. further or more strongly deviating peaks) might be artifacts and should be excluded from further analysis. The user is asked to record these sample names for use in script 5.  
 <br />  
-**Script 4:**  
+
+**Script 4**  
 Script 4 predicts sex with DNA methylation data; compares epigenetically predicted sex with phenotype sex information; and reports samples with mismatches. These samples will be excluded in script 5.  
 <br />  
-**Script 5:**  
+
+**Script 5**  
 Script 5 removes samples with sex mismatches and distribution artifacts and saves clean data files ("RGSet_clean" and "Betas_clean") to the processed data folder. The script also evaluates the different reasons for sample exclusion, and informs the user if any of the samples have failed multiple tests. The user is asked to input all array ids of samples with positive distribution artifacts after visually evaluating script 3 output to the "exclusion_distribution_artifacts" vector.  
 <br />  
-**Script 6:**  
+
+**Script 6**  
 Script 6 normalizes the data to minimize unwanted technique-related variation (saves "RGSet_clean_quantile", "Betas_clean_quantile", "Mset_clean_quantile", and "Betas_clean_quantile_bmiq"); visualizes beta densities before and after normalization; and creates reports containing beta density plots per sample for both raw and normalized data in the reports folder. We use stratified quantile normalization, followed by BMIQ (beta-mixture quantile normalization). The user can choose to modify this in the script and use a different normalization method instead.  
 <br />  
-**Script 7:**  
+
+**Script 7**  
 Script 7 removes (filters) poor performing probes with unreliable signal from normalized beta values and the normalized genomic ratio set. It removes also probes on sex chromosomes, affected by common SNPs, vulnerable to cross-hybridizing, or are polymorphic.  
 <br />  
-**Script 8:**  
+
+**Script 8**  
 Script 8 identifies and removes outliers. In both filtered and unfiltered data, this script removes probes with 0 variance if present prior to principal component analysis (PCA); runs a PCA to identify extreme outliers; removes outliers if present from betas data set; runs a PCA to determine if there are any technical batch effects; provides reports of statistical tests and visualizations if any technical batch effects are present; generates and saves a full report of anova of linear models to check technical batches to the reports folder (not displayed).  
 <br />  
-**Script 9 and 10:**  
+
+**Scripts 9 and 10**  
 Script 9 and 10 correct for up to three additional technical batch effects specified by the user in filtered (script 9) and unfiltered (script 10) data. The user also must specify the order of the correction. The script also checks how effective the correction was by displaying statistical measures and conducting PCA post-correction. A final RGSet is saved to the final_data folder.  
 <br />  
 
 ## Author Contacts:  
+<br />  
+
 This workflow was prepared by **Natan Yusupov** and **Alexandra Halberstam**.  
-Special thanks to **Dr. Darina Czamara** for the scientific supervision and **Benno Pütz** for the insights regarding the code.  
+Special thanks to **Dr. Darina Czamara** for the scientific supervision and **Dr. Benno Pütz** for the insights regarding the code.  
 We can be reached at **natan_yusupov@psych.mpg.de** and **alexandrahalberstam@gmail.com**  
 Please do not hesitate to contact us with any questions. We would be more than happy to receive suggestions on how to improve this pipeline.  
 
@@ -126,7 +137,7 @@ Please do not hesitate to contact us with any questions. We would be more than h
 11. https://github.com/jokergoo/IlluminaHumanMethylationEPICv2anno.20a1.hg38 available by Zuguang Gu (National Center for Tumor Diseases, Heidelberg, Germany)
 <br />  
 
-Disclaimer: The authors assumes no responsibility for the topicality, correctness, completeness or quality of code or information provided. Liability claims against the author which relate to material or immaterial nature caused by the use or misuse of any code or information provided through the use of incorrect or incomplete information are excluded unless the author is not intentional or grossly negligent fault. All suggestions are non-binding. The author reserves the right to change parts of the content or the entire content without prior notice, add to, delete or cease publication temporarily or permanently.
+**Legal Disclaimer:** The authors assume no responsibility for the topicality, correctness, completeness or quality of the code or information provided. Liability claims against the authors relating to material or immaterial damage caused by the use or misuse of any code or information provided, or by the use of incorrect or incomplete information, are excluded unless the authors can be shown to have acted with intent or gross negligence. All suggestions are non-binding. The authors reserve the right to change, add to, or delete parts of the content or the entire content without prior notice, or to cease publication temporarily or permanently.
 
 ## License/Copyright
 <a rel=license href=http://creativecommons.org/licenses/by/4.0/><img alt=Creative Commons Lizenzvertrag style=border-width:0 src=https://i.creativecommons.org/l/by/4.0/88x31.png /></a><br />Dieses Werk ist lizenziert unter einer <a rel=license href=http://creativecommons.org/licenses/by/4.0/>Creative Commons Namensnennung 4.0 International Lizenz</a>.
\ No newline at end of file