scripts/1_definitions_and_setup.Rmd

---
title: "Script 1: Define data properties/locations, create directory structure, prepare phenotype data"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---

```{r runtime start, include=FALSE}
start_time <- Sys.time()
```

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
needed_packages <- c("dplyr", "knitr", "stringr", "rmarkdown")
installed_packages <- needed_packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
  install.packages(needed_packages[!installed_packages], repos = "http://cran.us.r-project.org")
}
lapply(needed_packages, library, character.only = TRUE)
```

# Description

This script is a preparation script for the EPIC preprocessing pipeline  

## Instructions

This script asks the user to define or choose:
<br>  

* personal path to the cloned github repository  
* path to the phenotype data location  
* path to the idat files location  
* project name  
* array type  
* population ethnicity  
* detection *p* value limit for samples (script 4 will exclude samples with *p* values higher than this value)  
* extreme outliers definition (number of standard deviation of first and second PCA for script 10)  
* the name of the following columns as they are written in your phenotype data: 
  + slide (example: "201715340007")  
  + array (example: "R01C01")  
  + plate  
  + well (example: "B02")  
  + person id  
  + sex  
* additional batch variables (up to 3)  
* optionally:
  + path for R packages
  + path to the genotype data location
  + prefix (project name) of the genotype files

<br>  
**Notes/Definitions:**  
<br>  
**Personal path:** The path of the github repository after it is cloned to the user's folder on the cluster. 
The path name should not include the name of the github repository  
**Path to phenotype data:** The path to the data table containing phenotype information  
The path name should include the name of data table itself. This file must be **.csv**  
**Path to idat files:** The path to the raw files obtained from array  
**Path to R package directories (optional):** The path to a directory for installing and loading R packages. 
This is needed if installation of R packages to the default R library are not permitted due to the lack of admin rights. 
This will enable R to install and load any needed packages without problems. 
Please make sure you have the needed writing permissions in this specified directory  
**Array type:** "v1" or "v2". This stands for the different versions of the EPIC array  
**Population ethnicity:** Choose "AFR", "AMR", "ASN", "EUR" or "Global". Global 
may be used if the population is very diverse. This information will be useful for removing 
ethnic-specific hybridizing probes  
**Detection p-value of samples:** The threshold for the *p* value of detection. This *p* value indicates the
signal quality in the initial Quality Control (QC) for samples (we typically use a 
threshold of 0.05 and set this by default)  
**Define extreme outliers:** We define outliers as outside a certain number of standard deviations from the mean 
in the first and second principal component. You choose how many standard deviations 
should define an extreme outlier (we typically use 3 and set this by default)  
**Name of slide column:** The name of the column containing information 
about the slide on which the sample was run  
**Name of array column:** The name of the column containing information 
about the array on which the sample was run  
**Name of person id column:** The name of the column containing a identifier for each person   
**Name of sex column:** The name of the column containing sex information. Sex data must
be listed as F, M, or W. If your sex data is described in a different way 
(or includes other categories) please contact pipeline authors  
**Additional batch variables:** The following technical batch variables will 
be considered in a batch check in script 8: plate, slide, array, row and column. Please 
provide additional possible batch variables if needed as a character vector (up to 3, 
add more if necessary). If not, please leave this field as is.  
+**Path to genotype files:** The path to the QC-ed genotype data in plink file format (.bed, .bim, .fam)  
+**Genotype project name:** Prefix of the genotype files, denoting the project (e.g. my_study_2020)  

<br>  

## Directory structure

The script creates directories needed for the pipeline within the cloned epic preprocessing folder. This includes:
<br>  
1. a `project folder with a name specified by the user`  
2. `processed_data` and `reports` folder within the project folder  
3. `final_data` folder within the processed data folder  
<br>  

## Phenotype data preparation

This script imports and formats phenotype data. Phenotype data must be in a **csv format**. The file
must contain one sample per row, with columns containing one attribute of the sample.  
<br>  
The following information is required:  
<br>  
1. **personid:** unique person id  
2. **slide:** location of sample on slide  
3. **array:** location of sample on array  
4. **arrayid:** unique technical sample id (will be automatically created from slide and array)  
5. **basename:** Path location of sample-specific IDAT files (will be automatically created from path to IDAT files, slide, and arrayid)  

```{r, user definitions, include=FALSE}
# Please write your input information in the following
# We specified the character vectors need to be filled by user (within quotation marks) with **PLEASE FILL IN**
personal_path <- "PLEASE FILL IN"
pheno_path <- "PLEASE FILL IN"
idat_path <- "PLEASE FILL IN"
package_path <- "PLEASE FILL IN"
project_name <- "PLEASE FILL IN"
array_type <- "PLEASE FILL IN"
population_ethnicity <- "PLEASE FILL IN"
detP_sample_threshold <- "0.05"
outlier_threshold <- "3"
name_slide_column <- "PLEASE FILL IN"
name_array_column <- "PLEASE FILL IN"
name_plate_column <- "PLEASE FILL IN"
name_well_column <- "PLEASE FILL IN"
name_personid_column <- "PLEASE FILL IN"
name_sex_column <- "PLEASE FILL IN"
additional_batch_variable_1 <- "PLEASE FILL IN"
additional_batch_variable_2 <- "PLEASE FILL IN"
additional_batch_variable_3 <- "PLEASE FILL IN"

```

```{r, save user choices, include=FALSE}
user_choices <- data.frame(
  "personal_path" = personal_path,
  "pheno_path" = pheno_path,
  "idat_path" = idat_path,
  "package_path" = package_path,
  "project_name" = project_name,
  "array_type" = array_type,
  "population_ethnicity" = population_ethnicity,
  "detP_sample_threshold" = as.numeric(detP_sample_threshold),
  "outlier_threshold" = as.numeric(outlier_threshold),
  "name_slide_column" = name_slide_column,
  "name_array_column" = name_array_column,
  "name_plate_column" = name_plate_column,
  "name_well_column" = name_well_column,
  "name_personid_column" = name_personid_column,
  "name_sex_column" = name_sex_column, 
  "additional_batch_variable_1" = additional_batch_variable_1,
  "additional_batch_variable_2" = additional_batch_variable_2,
  "additional_batch_variable_3" = additional_batch_variable_3,
  "genotype_path" = genotype_path,
  "genotype_projectname" = genotype_projectname
)
saveRDS(user_choices, "../data/user_choices.rds")

ifelse(personal_path == "PLEASE FILL IN", 
       personal_path_message <- "✘ Please specify the path to the cloned repository", 
       personal_path_message <- "✔ Path to cloned repository was provided")
ifelse(pheno_path == "PLEASE FILL IN", 
       pheno_path_message <- "✘ Please specify the path to phenotype data", 
       pheno_path_message <- "✔ Path to phenotype data was provided")
ifelse(idat_path == "PLEASE FILL IN", 
       idat_path_message <- "✘ Please specify the path to idat files", 
       idat_path_message <- "✔ Path to idat files was provided")
ifelse(package_path == "PLEASE FILL IN", 
       package_path_message <- "✔ Package installation path was not specified and therefore default path will be used", 
       package_path_message <- paste0("✔ Package installation path was specified: ", package_path, ". all R packages will be installed here."))
ifelse(project_name == "PLEASE FILL IN", 
       project_name_message <- "✘ Please specify project name", 
       project_name_message <- "✔ Project name was provided")
ifelse(array_type %in% c("v1", "v2"),
       array_type_message <- "✔ Array type is OK", 
       array_type_message <- paste0("✘ This array type is not available: ", 
                              array_type, ". Please use v1 or v2"))
ifelse(population_ethnicity %in% c("AFR", "AMR", "ASN", "EUR", "Global"),
       population_ethnicity_message <- "✔ Population ethnicity is OK", 
       population_ethnicity_message <- paste0("✘ This population ethnicity is not available: ", 
                    population_ethnicity, ". Please use AFR, AMR, ASN, EUR or Global"))
ifelse(detP_sample_threshold == "0.05" | detP_sample_threshold == "0.01", 
       detP_sample_threshold_message <- "✔ Detection p-value is OK", 
       detP_sample_threshold_message <- paste0("✘ This detection p-value is not available: ", detP_sample_threshold, ". Please use 0.05 or 0.01"))
ifelse(outlier_threshold == "3",
       outlier_threshold_message <- "✔ Outlier threshold is set to standard: 3 standard deviations", 
       outlier_threshold_message <- paste0("✔ New outlier threshold was set to be: ", outlier_threshold, " standard deviations"))
ifelse(name_slide_column == "PLEASE FILL IN" | name_array_column == "PLEASE FILL IN"
       | name_plate_column == "PLEASE FILL IN" | name_well_column == "PLEASE FILL IN"
       | name_personid_column == "PLEASE FILL IN" | name_sex_column == "PLEASE FILL IN",
       phenotype_names_message <- "✘ Please specify all needed column names of phenotype data", 
       phenotype_names_message <- "✔ Names of columns in phenotype data were provided in full")

additional_batch_variables <- c(additional_batch_variable_1, additional_batch_variable_2, additional_batch_variable_3)
additional_batch_variables <- additional_batch_variables[additional_batch_variables != "PLEASE FILL IN"]
ifelse(identical(additional_batch_variables, character(0)), 
       additional_batch_variables_message <- "✔ Additional batch variables were not specified and therefore standard variables will be included",
       additional_batch_variables_message <- paste0("✔ The following additional batch variables were specified: ", additional_batch_variables, ". It will be included in following scripts"))
```

```{r, create directories, include=FALSE}
dir.create(paste0(personal_path, "/", project_name, "/processed_data/final_data/"), 
           recursive = TRUE)
dir.create(paste0(personal_path, "/", project_name, "/reports/"))
```

```{r, imoporting & formating & saving phenotype data,include=FALSE}
phenotype_data <- read.csv(pheno_path, header = TRUE, row.names = 1)
phenotype_data <- phenotype_data %>%
  as.data.frame() %>%
  rename(personid = name_personid_column, slide = name_slide_column,
         array = name_array_column, plate = name_plate_column, 
         well = name_well_column, sex = name_sex_column) %>%
  mutate(arrayid = paste0(slide, "_", array),
         Basename = paste0(idat_path, "/", slide, "/", arrayid),
         array = as.factor(array),
         plate = as.factor(plate),
         slide = as.factor(slide),
         row = as.factor(str_extract(well, "[:alpha:]")),
         column = as.factor(str_extract(well, "[:digit:]{2}")),
         well = as.factor(well),
         sex = as.factor(case_when(str_detect(sex, "^f|F|w|W") ~ "F",
                                   str_detect(sex, "^m|M|h|H") ~ "M",
                                   TRUE ~ "Error"))) # to exclude collisions in sex mismatches check later

saveRDS(phenotype_data, paste0(personal_path, "/", 
                               project_name, "/processed_data/phenotype_data.rds"))
write.csv(phenotype_data, paste0(personal_path, "/", 
                                 project_name, "/processed_data/phenotype_data.csv"), 
          row.names=F, sep = ",")
```

## Display user definitions

The following details were saved as user choices in data folder:

```{r, users definitions display}
user_choices %>% paged_table()
```

## Check user definitions

```{r, users definitions check, results = 'asis'}
cat(personal_path_message, pheno_path_message, idat_path_message, package_path_message,
    project_name_message, array_type_message, population_ethnicity_message,
    detP_sample_threshold_message, outlier_threshold_message, phenotype_names_message, 
    additional_batch_variables_message,
    sep = "<br>\n")
```

## User report

```{r, info output for user 1, results = 'asis'}
cat("The folders listed above were created within the cloned epic_preprocessing folder",
    "The phenotype data file was created as in the processed_data folder", 
    "Session info text file was created in the project folder", sep = "<br>\n")
```

```{r, document session info into a text file, include=FALSE}
connection <- file(paste0(personal_path, "/", project_name, "/session_info_all.txt"), "w")
writeLines("######################################################################", connection)
writeLines("############################  Script 1:  #############################", connection)
writeLines("######################################################################", connection)
sessioninfo <- sessionInfo()
writeLines("\nR Version:", connection)
writeLines(paste0("   ", sessioninfo$R.version$version.string), connection)
writeLines("\nAttached packages:", connection)
# nicely format packages (with version)
package_version <- unlist(lapply(sessioninfo$otherPkgs, function(x){paste0("   ",x$Package, " (", x$Version, ")")}))
names(package_version) <- NULL
for(i in 1:length(package_version)) {
  writeLines(package_version[i], connection)
}
writeLines("\n", connection)
close(connection)
```

```{r runtime end, include=FALSE}
end_time <- Sys.time()
run_time <- as.numeric(round((end_time - start_time), 1), units = "mins")
```

## Completion time

```{r, completion time, results = 'asis'}
cat("The time for the script to run was:", run_time, "minutes", sep = " ")
```