Permalink
Cannot retrieve contributors at this time
Name already in use
A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
EPIC_Preprocessing_Pipeline/scripts/1_definitions_and_setup.Rmd
Go to fileThis commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
289 lines (257 sloc)
13.9 KB
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
--- | |
title: "Script 1: Define data properties/locations, create directory structure, prepare phenotype data" | |
date: '`r format(Sys.time(), "%d %B, %Y")`' | |
output: html_document | |
--- | |
```{r runtime start, include=FALSE} | |
start_time <- Sys.time() | |
``` | |
```{r setup, include=FALSE} | |
knitr::opts_chunk$set(echo = FALSE) | |
needed_packages <- c("dplyr", "knitr", "stringr", "rmarkdown") | |
installed_packages <- needed_packages %in% rownames(installed.packages()) | |
if (any(installed_packages == FALSE)) { | |
install.packages(needed_packages[!installed_packages], repos = "http://cran.us.r-project.org") | |
} | |
lapply(needed_packages, library, character.only = TRUE) | |
``` | |
# Description | |
This script is a preparation script for the EPIC preprocessing pipeline | |
## Instructions | |
This script asks the user to define or choose: | |
<br> | |
* personal path to the cloned github repository | |
* path to the phenotype data location | |
* path to the idat files location | |
* project name | |
* array type | |
* population ethnicity | |
* detection *p* value limit for samples (script 4 will exclude samples with *p* values higher than this value) | |
* extreme outliers definition (number of standard deviation of first and second PCA for script 10) | |
* the name of the following columns as they are written in your phenotype data: | |
+ slide (example: "201715340007") | |
+ array (example: "R01C01") | |
+ plate | |
+ well (example: "B02") | |
+ person id | |
+ sex | |
* additional batch variables (up to 3) | |
* optionally: | |
+ path for R packages | |
+ path to the genotype data location | |
+ prefix (project name) of the genotype files | |
<br> | |
**Notes/Definitions:** | |
<br> | |
**Personal path:** The path of the github repository after it is cloned to the user's folder on the cluster. | |
The path name should not include the name of the github repository | |
**Path to phenotype data:** The path to the data table containing phenotype information | |
The path name should include the name of data table itself. This file must be **.csv** | |
**Path to idat files:** The path to the raw files obtained from array | |
**Path to R package directories (optional):** The path to a directory for installing and loading R packages. | |
This is needed if installation of R packages to the default R library are not permitted due to the lack of admin rights. | |
This will enable R to install and load any needed packages without problems. | |
Please make sure you have the needed writing permissions in this specified directory | |
**Array type:** "v1" or "v2". This stands for the different versions of the EPIC array | |
**Population ethnicity:** Choose "AFR", "AMR", "ASN", "EUR" or "Global". Global | |
may be used if the population is very diverse. This information will be useful for removing | |
ethnic-specific hybridizing probes | |
**Detection p-value of samples:** The threshold for the *p* value of detection. This *p* value indicates the | |
signal quality in the initial Quality Control (QC) for samples (we typically use a | |
threshold of 0.05 and set this by default) | |
**Define extreme outliers:** We define outliers as outside a certain number of standard deviations from the mean | |
in the first and second principal component. You choose how many standard deviations | |
should define an extreme outlier (we typically use 3 and set this by default) | |
**Name of slide column:** The name of the column containing information | |
about the slide on which the sample was run | |
**Name of array column:** The name of the column containing information | |
about the array on which the sample was run | |
**Name of person id column:** The name of the column containing a identifier for each person | |
**Name of sex column:** The name of the column containing sex information. Sex data must | |
be listed as F, M, or W. If your sex data is described in a different way | |
(or includes other categories) please contact pipeline authors | |
**Additional batch variables:** The following technical batch variables will | |
be considered in a batch check in script 8: plate, slide, array, row and column. Please | |
provide additional possible batch variables if needed as a character vector (up to 3, | |
add more if necessary). If not, please leave this field as is. | |
+**Path to genotype files:** The path to the QC-ed genotype data in plink file format (.bed, .bim, .fam) | |
+**Genotype project name:** Prefix of the genotype files, denoting the project (e.g. my_study_2020) | |
<br> | |
## Directory structure | |
The script creates directories needed for the pipeline within the cloned epic preprocessing folder. This includes: | |
<br> | |
1. a `project folder with a name specified by the user` | |
2. `processed_data` and `reports` folder within the project folder | |
3. `final_data` folder within the processed data folder | |
<br> | |
## Phenotype data preparation | |
This script imports and formats phenotype data. Phenotype data must be in a **csv format**. The file | |
must contain one sample per row, with columns containing one attribute of the sample. | |
<br> | |
The following information is required: | |
<br> | |
1. **personid:** unique person id | |
2. **slide:** location of sample on slide | |
3. **array:** location of sample on array | |
4. **arrayid:** unique technical sample id (will be automatically created from slide and array) | |
5. **basename:** Path location of sample-specific IDAT files (will be automatically created from path to IDAT files, slide, and arrayid) | |
```{r, user definitions, include=FALSE} | |
# Please write your input information in the following | |
# We specified the character vectors need to be filled by user (within quotation marks) with **PLEASE FILL IN** | |
personal_path <- "PLEASE FILL IN" | |
pheno_path <- "PLEASE FILL IN" | |
idat_path <- "PLEASE FILL IN" | |
package_path <- "PLEASE FILL IN" | |
project_name <- "PLEASE FILL IN" | |
array_type <- "PLEASE FILL IN" | |
population_ethnicity <- "PLEASE FILL IN" | |
detP_sample_threshold <- "0.05" | |
outlier_threshold <- "3" | |
name_slide_column <- "PLEASE FILL IN" | |
name_array_column <- "PLEASE FILL IN" | |
name_plate_column <- "PLEASE FILL IN" | |
name_well_column <- "PLEASE FILL IN" | |
name_personid_column <- "PLEASE FILL IN" | |
name_sex_column <- "PLEASE FILL IN" | |
additional_batch_variable_1 <- "PLEASE FILL IN" | |
additional_batch_variable_2 <- "PLEASE FILL IN" | |
additional_batch_variable_3 <- "PLEASE FILL IN" | |
``` | |
```{r, save user choices, include=FALSE} | |
user_choices <- data.frame( | |
"personal_path" = personal_path, | |
"pheno_path" = pheno_path, | |
"idat_path" = idat_path, | |
"package_path" = package_path, | |
"project_name" = project_name, | |
"array_type" = array_type, | |
"population_ethnicity" = population_ethnicity, | |
"detP_sample_threshold" = as.numeric(detP_sample_threshold), | |
"outlier_threshold" = as.numeric(outlier_threshold), | |
"name_slide_column" = name_slide_column, | |
"name_array_column" = name_array_column, | |
"name_plate_column" = name_plate_column, | |
"name_well_column" = name_well_column, | |
"name_personid_column" = name_personid_column, | |
"name_sex_column" = name_sex_column, | |
"additional_batch_variable_1" = additional_batch_variable_1, | |
"additional_batch_variable_2" = additional_batch_variable_2, | |
"additional_batch_variable_3" = additional_batch_variable_3, | |
"genotype_path" = genotype_path, | |
"genotype_projectname" = genotype_projectname | |
) | |
saveRDS(user_choices, "../data/user_choices.rds") | |
ifelse(personal_path == "PLEASE FILL IN", | |
personal_path_message <- "✘ Please specify the path to the cloned repository", | |
personal_path_message <- "✔ Path to cloned repository was provided") | |
ifelse(pheno_path == "PLEASE FILL IN", | |
pheno_path_message <- "✘ Please specify the path to phenotype data", | |
pheno_path_message <- "✔ Path to phenotype data was provided") | |
ifelse(idat_path == "PLEASE FILL IN", | |
idat_path_message <- "✘ Please specify the path to idat files", | |
idat_path_message <- "✔ Path to idat files was provided") | |
ifelse(package_path == "PLEASE FILL IN", | |
package_path_message <- "✔ Package installation path was not specified and therefore default path will be used", | |
package_path_message <- paste0("✔ Package installation path was specified: ", package_path, ". all R packages will be installed here.")) | |
ifelse(project_name == "PLEASE FILL IN", | |
project_name_message <- "✘ Please specify project name", | |
project_name_message <- "✔ Project name was provided") | |
ifelse(array_type %in% c("v1", "v2"), | |
array_type_message <- "✔ Array type is OK", | |
array_type_message <- paste0("✘ This array type is not available: ", | |
array_type, ". Please use v1 or v2")) | |
ifelse(population_ethnicity %in% c("AFR", "AMR", "ASN", "EUR", "Global"), | |
population_ethnicity_message <- "✔ Population ethnicity is OK", | |
population_ethnicity_message <- paste0("✘ This population ethnicity is not available: ", | |
population_ethnicity, ". Please use AFR, AMR, ASN, EUR or Global")) | |
ifelse(detP_sample_threshold == "0.05" | detP_sample_threshold == "0.01", | |
detP_sample_threshold_message <- "✔ Detection p-value is OK", | |
detP_sample_threshold_message <- paste0("✘ This detection p-value is not available: ", detP_sample_threshold, ". Please use 0.05 or 0.01")) | |
ifelse(outlier_threshold == "3", | |
outlier_threshold_message <- "✔ Outlier threshold is set to standard: 3 standard deviations", | |
outlier_threshold_message <- paste0("✔ New outlier threshold was set to be: ", outlier_threshold, " standard deviations")) | |
ifelse(name_slide_column == "PLEASE FILL IN" | name_array_column == "PLEASE FILL IN" | |
| name_plate_column == "PLEASE FILL IN" | name_well_column == "PLEASE FILL IN" | |
| name_personid_column == "PLEASE FILL IN" | name_sex_column == "PLEASE FILL IN", | |
phenotype_names_message <- "✘ Please specify all needed column names of phenotype data", | |
phenotype_names_message <- "✔ Names of columns in phenotype data were provided in full") | |
additional_batch_variables <- c(additional_batch_variable_1, additional_batch_variable_2, additional_batch_variable_3) | |
additional_batch_variables <- additional_batch_variables[additional_batch_variables != "PLEASE FILL IN"] | |
ifelse(identical(additional_batch_variables, character(0)), | |
additional_batch_variables_message <- "✔ Additional batch variables were not specified and therefore standard variables will be included", | |
additional_batch_variables_message <- paste0("✔ The following additional batch variables were specified: ", additional_batch_variables, ". It will be included in following scripts")) | |
``` | |
```{r, create directories, include=FALSE} | |
dir.create(paste0(personal_path, "/", project_name, "/processed_data/final_data/"), | |
recursive = TRUE) | |
dir.create(paste0(personal_path, "/", project_name, "/reports/")) | |
``` | |
```{r, imoporting & formating & saving phenotype data,include=FALSE} | |
phenotype_data <- read.csv(pheno_path, header = TRUE, row.names = 1) | |
phenotype_data <- phenotype_data %>% | |
as.data.frame() %>% | |
rename(personid = name_personid_column, slide = name_slide_column, | |
array = name_array_column, plate = name_plate_column, | |
well = name_well_column, sex = name_sex_column) %>% | |
mutate(arrayid = paste0(slide, "_", array), | |
Basename = paste0(idat_path, "/", slide, "/", arrayid), | |
array = as.factor(array), | |
plate = as.factor(plate), | |
slide = as.factor(slide), | |
row = as.factor(str_extract(well, "[:alpha:]")), | |
column = as.factor(str_extract(well, "[:digit:]{2}")), | |
well = as.factor(well), | |
sex = as.factor(case_when(str_detect(sex, "^f|F|w|W") ~ "F", | |
str_detect(sex, "^m|M|h|H") ~ "M", | |
TRUE ~ "Error"))) # to exclude collisions in sex mismatches check later | |
saveRDS(phenotype_data, paste0(personal_path, "/", | |
project_name, "/processed_data/phenotype_data.rds")) | |
write.csv(phenotype_data, paste0(personal_path, "/", | |
project_name, "/processed_data/phenotype_data.csv"), | |
row.names=F, sep = ",") | |
``` | |
## Display user definitions | |
The following details were saved as user choices in data folder: | |
```{r, users definitions display} | |
user_choices %>% paged_table() | |
``` | |
## Check user definitions | |
```{r, users definitions check, results = 'asis'} | |
cat(personal_path_message, pheno_path_message, idat_path_message, package_path_message, | |
project_name_message, array_type_message, population_ethnicity_message, | |
detP_sample_threshold_message, outlier_threshold_message, phenotype_names_message, | |
additional_batch_variables_message, | |
sep = "<br>\n") | |
``` | |
## User report | |
```{r, info output for user 1, results = 'asis'} | |
cat("The folders listed above were created within the cloned epic_preprocessing folder", | |
"The phenotype data file was created as in the processed_data folder", | |
"Session info text file was created in the project folder", sep = "<br>\n") | |
``` | |
```{r, document session info into a text file, include=FALSE} | |
connection <- file(paste0(personal_path, "/", project_name, "/session_info_all.txt"), "w") | |
writeLines("######################################################################", connection) | |
writeLines("############################ Script 1: #############################", connection) | |
writeLines("######################################################################", connection) | |
sessioninfo <- sessionInfo() | |
writeLines("\nR Version:", connection) | |
writeLines(paste0(" ", sessioninfo$R.version$version.string), connection) | |
writeLines("\nAttached packages:", connection) | |
# nicely format packages (with version) | |
package_version <- unlist(lapply(sessioninfo$otherPkgs, function(x){paste0(" ",x$Package, " (", x$Version, ")")})) | |
names(package_version) <- NULL | |
for(i in 1:length(package_version)) { | |
writeLines(package_version[i], connection) | |
} | |
writeLines("\n", connection) | |
close(connection) | |
``` | |
```{r runtime end, include=FALSE} | |
end_time <- Sys.time() | |
run_time <- as.numeric(round((end_time - start_time), 1), units = "mins") | |
``` | |
## Completion time | |
```{r, completion time, results = 'asis'} | |
cat("The time for the script to run was:", run_time, "minutes", sep = " ") | |
``` |