Skip to content
Permalink
f1bd3ab1ae
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Go to file
 
 
Cannot retrieve contributors at this time
289 lines (257 sloc) 13.9 KB
---
title: "Script 1: Define data properties/locations, create directory structure, prepare phenotype data"
date: '`r format(Sys.time(), "%d %B, %Y")`'
output: html_document
---
```{r runtime start, include=FALSE}
start_time <- Sys.time()
```
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE)
needed_packages <- c("dplyr", "knitr", "stringr", "rmarkdown")
installed_packages <- needed_packages %in% rownames(installed.packages())
if (any(installed_packages == FALSE)) {
install.packages(needed_packages[!installed_packages], repos = "http://cran.us.r-project.org")
}
lapply(needed_packages, library, character.only = TRUE)
```
# Description
This script is a preparation script for the EPIC preprocessing pipeline
## Instructions
This script asks the user to define or choose:
<br>
* personal path to the cloned github repository
* path to the phenotype data location
* path to the idat files location
* project name
* array type
* population ethnicity
* detection *p* value limit for samples (script 4 will exclude samples with *p* values higher than this value)
* extreme outliers definition (number of standard deviation of first and second PCA for script 10)
* the name of the following columns as they are written in your phenotype data:
+ slide (example: "201715340007")
+ array (example: "R01C01")
+ plate
+ well (example: "B02")
+ person id
+ sex
* additional batch variables (up to 3)
* optionally:
+ path for R packages
+ path to the genotype data location
+ prefix (project name) of the genotype files
<br>
**Notes/Definitions:**
<br>
**Personal path:** The path of the github repository after it is cloned to the user's folder on the cluster.
The path name should not include the name of the github repository
**Path to phenotype data:** The path to the data table containing phenotype information
The path name should include the name of data table itself. This file must be **.csv**
**Path to idat files:** The path to the raw files obtained from array
**Path to R package directories (optional):** The path to a directory for installing and loading R packages.
This is needed if installation of R packages to the default R library are not permitted due to the lack of admin rights.
This will enable R to install and load any needed packages without problems.
Please make sure you have the needed writing permissions in this specified directory
**Array type:** "v1" or "v2". This stands for the different versions of the EPIC array
**Population ethnicity:** Choose "AFR", "AMR", "ASN", "EUR" or "Global". Global
may be used if the population is very diverse. This information will be useful for removing
ethnic-specific hybridizing probes
**Detection p-value of samples:** The threshold for the *p* value of detection. This *p* value indicates the
signal quality in the initial Quality Control (QC) for samples (we typically use a
threshold of 0.05 and set this by default)
**Define extreme outliers:** We define outliers as outside a certain number of standard deviations from the mean
in the first and second principal component. You choose how many standard deviations
should define an extreme outlier (we typically use 3 and set this by default)
**Name of slide column:** The name of the column containing information
about the slide on which the sample was run
**Name of array column:** The name of the column containing information
about the array on which the sample was run
**Name of person id column:** The name of the column containing a identifier for each person
**Name of sex column:** The name of the column containing sex information. Sex data must
be listed as F, M, or W. If your sex data is described in a different way
(or includes other categories) please contact pipeline authors
**Additional batch variables:** The following technical batch variables will
be considered in a batch check in script 8: plate, slide, array, row and column. Please
provide additional possible batch variables if needed as a character vector (up to 3,
add more if necessary). If not, please leave this field as is.
+**Path to genotype files:** The path to the QC-ed genotype data in plink file format (.bed, .bim, .fam)
+**Genotype project name:** Prefix of the genotype files, denoting the project (e.g. my_study_2020)
<br>
## Directory structure
The script creates directories needed for the pipeline within the cloned epic preprocessing folder. This includes:
<br>
1. a `project folder with a name specified by the user`
2. `processed_data` and `reports` folder within the project folder
3. `final_data` folder within the processed data folder
<br>
## Phenotype data preparation
This script imports and formats phenotype data. Phenotype data must be in a **csv format**. The file
must contain one sample per row, with columns containing one attribute of the sample.
<br>
The following information is required:
<br>
1. **personid:** unique person id
2. **slide:** location of sample on slide
3. **array:** location of sample on array
4. **arrayid:** unique technical sample id (will be automatically created from slide and array)
5. **basename:** Path location of sample-specific IDAT files (will be automatically created from path to IDAT files, slide, and arrayid)
```{r, user definitions, include=FALSE}
# Please write your input information in the following
# We specified the character vectors need to be filled by user (within quotation marks) with **PLEASE FILL IN**
personal_path <- "PLEASE FILL IN"
pheno_path <- "PLEASE FILL IN"
idat_path <- "PLEASE FILL IN"
package_path <- "PLEASE FILL IN"
project_name <- "PLEASE FILL IN"
array_type <- "PLEASE FILL IN"
population_ethnicity <- "PLEASE FILL IN"
detP_sample_threshold <- "0.05"
outlier_threshold <- "3"
name_slide_column <- "PLEASE FILL IN"
name_array_column <- "PLEASE FILL IN"
name_plate_column <- "PLEASE FILL IN"
name_well_column <- "PLEASE FILL IN"
name_personid_column <- "PLEASE FILL IN"
name_sex_column <- "PLEASE FILL IN"
additional_batch_variable_1 <- "PLEASE FILL IN"
additional_batch_variable_2 <- "PLEASE FILL IN"
additional_batch_variable_3 <- "PLEASE FILL IN"
```
```{r, save user choices, include=FALSE}
user_choices <- data.frame(
"personal_path" = personal_path,
"pheno_path" = pheno_path,
"idat_path" = idat_path,
"package_path" = package_path,
"project_name" = project_name,
"array_type" = array_type,
"population_ethnicity" = population_ethnicity,
"detP_sample_threshold" = as.numeric(detP_sample_threshold),
"outlier_threshold" = as.numeric(outlier_threshold),
"name_slide_column" = name_slide_column,
"name_array_column" = name_array_column,
"name_plate_column" = name_plate_column,
"name_well_column" = name_well_column,
"name_personid_column" = name_personid_column,
"name_sex_column" = name_sex_column,
"additional_batch_variable_1" = additional_batch_variable_1,
"additional_batch_variable_2" = additional_batch_variable_2,
"additional_batch_variable_3" = additional_batch_variable_3,
"genotype_path" = genotype_path,
"genotype_projectname" = genotype_projectname
)
saveRDS(user_choices, "../data/user_choices.rds")
ifelse(personal_path == "PLEASE FILL IN",
personal_path_message <- "✘ Please specify the path to the cloned repository",
personal_path_message <- "✔ Path to cloned repository was provided")
ifelse(pheno_path == "PLEASE FILL IN",
pheno_path_message <- "✘ Please specify the path to phenotype data",
pheno_path_message <- "✔ Path to phenotype data was provided")
ifelse(idat_path == "PLEASE FILL IN",
idat_path_message <- "✘ Please specify the path to idat files",
idat_path_message <- "✔ Path to idat files was provided")
ifelse(package_path == "PLEASE FILL IN",
package_path_message <- "✔ Package installation path was not specified and therefore default path will be used",
package_path_message <- paste0("✔ Package installation path was specified: ", package_path, ". all R packages will be installed here."))
ifelse(project_name == "PLEASE FILL IN",
project_name_message <- "✘ Please specify project name",
project_name_message <- "✔ Project name was provided")
ifelse(array_type %in% c("v1", "v2"),
array_type_message <- "✔ Array type is OK",
array_type_message <- paste0("✘ This array type is not available: ",
array_type, ". Please use v1 or v2"))
ifelse(population_ethnicity %in% c("AFR", "AMR", "ASN", "EUR", "Global"),
population_ethnicity_message <- "✔ Population ethnicity is OK",
population_ethnicity_message <- paste0("✘ This population ethnicity is not available: ",
population_ethnicity, ". Please use AFR, AMR, ASN, EUR or Global"))
ifelse(detP_sample_threshold == "0.05" | detP_sample_threshold == "0.01",
detP_sample_threshold_message <- "✔ Detection p-value is OK",
detP_sample_threshold_message <- paste0("✘ This detection p-value is not available: ", detP_sample_threshold, ". Please use 0.05 or 0.01"))
ifelse(outlier_threshold == "3",
outlier_threshold_message <- "✔ Outlier threshold is set to standard: 3 standard deviations",
outlier_threshold_message <- paste0("✔ New outlier threshold was set to be: ", outlier_threshold, " standard deviations"))
ifelse(name_slide_column == "PLEASE FILL IN" | name_array_column == "PLEASE FILL IN"
| name_plate_column == "PLEASE FILL IN" | name_well_column == "PLEASE FILL IN"
| name_personid_column == "PLEASE FILL IN" | name_sex_column == "PLEASE FILL IN",
phenotype_names_message <- "✘ Please specify all needed column names of phenotype data",
phenotype_names_message <- "✔ Names of columns in phenotype data were provided in full")
additional_batch_variables <- c(additional_batch_variable_1, additional_batch_variable_2, additional_batch_variable_3)
additional_batch_variables <- additional_batch_variables[additional_batch_variables != "PLEASE FILL IN"]
ifelse(identical(additional_batch_variables, character(0)),
additional_batch_variables_message <- "✔ Additional batch variables were not specified and therefore standard variables will be included",
additional_batch_variables_message <- paste0("✔ The following additional batch variables were specified: ", additional_batch_variables, ". It will be included in following scripts"))
```
```{r, create directories, include=FALSE}
dir.create(paste0(personal_path, "/", project_name, "/processed_data/final_data/"),
recursive = TRUE)
dir.create(paste0(personal_path, "/", project_name, "/reports/"))
```
```{r, imoporting & formating & saving phenotype data,include=FALSE}
phenotype_data <- read.csv(pheno_path, header = TRUE, row.names = 1)
phenotype_data <- phenotype_data %>%
as.data.frame() %>%
rename(personid = name_personid_column, slide = name_slide_column,
array = name_array_column, plate = name_plate_column,
well = name_well_column, sex = name_sex_column) %>%
mutate(arrayid = paste0(slide, "_", array),
Basename = paste0(idat_path, "/", slide, "/", arrayid),
array = as.factor(array),
plate = as.factor(plate),
slide = as.factor(slide),
row = as.factor(str_extract(well, "[:alpha:]")),
column = as.factor(str_extract(well, "[:digit:]{2}")),
well = as.factor(well),
sex = as.factor(case_when(str_detect(sex, "^f|F|w|W") ~ "F",
str_detect(sex, "^m|M|h|H") ~ "M",
TRUE ~ "Error"))) # to exclude collisions in sex mismatches check later
saveRDS(phenotype_data, paste0(personal_path, "/",
project_name, "/processed_data/phenotype_data.rds"))
write.csv(phenotype_data, paste0(personal_path, "/",
project_name, "/processed_data/phenotype_data.csv"),
row.names=F, sep = ",")
```
## Display user definitions
The following details were saved as user choices in data folder:
```{r, users definitions display}
user_choices %>% paged_table()
```
## Check user definitions
```{r, users definitions check, results = 'asis'}
cat(personal_path_message, pheno_path_message, idat_path_message, package_path_message,
project_name_message, array_type_message, population_ethnicity_message,
detP_sample_threshold_message, outlier_threshold_message, phenotype_names_message,
additional_batch_variables_message,
sep = "<br>\n")
```
## User report
```{r, info output for user 1, results = 'asis'}
cat("The folders listed above were created within the cloned epic_preprocessing folder",
"The phenotype data file was created as in the processed_data folder",
"Session info text file was created in the project folder", sep = "<br>\n")
```
```{r, document session info into a text file, include=FALSE}
connection <- file(paste0(personal_path, "/", project_name, "/session_info_all.txt"), "w")
writeLines("######################################################################", connection)
writeLines("############################ Script 1: #############################", connection)
writeLines("######################################################################", connection)
sessioninfo <- sessionInfo()
writeLines("\nR Version:", connection)
writeLines(paste0(" ", sessioninfo$R.version$version.string), connection)
writeLines("\nAttached packages:", connection)
# nicely format packages (with version)
package_version <- unlist(lapply(sessioninfo$otherPkgs, function(x){paste0(" ",x$Package, " (", x$Version, ")")}))
names(package_version) <- NULL
for(i in 1:length(package_version)) {
writeLines(package_version[i], connection)
}
writeLines("\n", connection)
close(connection)
```
```{r runtime end, include=FALSE}
end_time <- Sys.time()
run_time <- as.numeric(round((end_time - start_time), 1), units = "mins")
```
## Completion time
```{r, completion time, results = 'asis'}
cat("The time for the script to run was:", run_time, "minutes", sep = " ")
```