hit_selection_distance.Rmd

---
title: "Replicate Distance"
output:
  html_document: default
  pdf_document: default
  word_document: default
---
The goal is to do some Hit Selection. Selecting the compound that have effects, that are showing a phenotype.
Find the median replicate distance. The smaller the distance, the more correlated the replicates are.

Use different type of distance metric:

- Euclidean distance: usual distance between two vectors (L2 norm)
- Maximum distance: maximum distance between two components
- Manhattan distance: absolute distance between two vector (L1 norm)


## Data

The input data is a 7680 by 803 matrix.
There are 7680 different observations and 799 features (extracted with CellProfiler).
Each compound (1600 different) has 4 replicates. The negative control has 1280 replicates.
20 plates with 384 wells in each plate.

```{r setup, include=FALSE}
# all usefull libraries
library(magrittr)
library(dplyr)
library(ggplot2)
library(foreach)
library(doMC)
library(stringr)
library(tidyverse)
```

```{r import data with feature selection, message=FALSE, eval=FALSE}
profiles <- 
  readr::read_csv(file.path("..", "..", "input", "BBBC022_2013", "BBBC022_2013_sel_feat.csv")) # 7680x(nfeat+metadata)

variables <-
  names(profiles) %>% str_subset("^Cells_|^Cytoplasm_|^Nuclei_") # nfeat

metadata <-
  names(profiles) %>% str_subset("^Metadata_") # metadata

# Remove the negative control (DMSO) from the data
profiles %<>% 
  filter(!Metadata_broad_sample %in% "DMSO") # 6400xnfeat
```

```{r import data no feature selection, message=FALSE}
profiles <- 
  list.files("../../backend/BBBC022_2013/", 
          pattern = "*_normalized.csv",
          recursive = T,
          full.names = T) %>%
  map_df(read_csv)

dim(profiles)

variables <-
  names(profiles) %>% str_subset("^Cells_|^Cytoplasm_|^Nuclei_")

metadata <-
  names(profiles) %>% str_subset("^Metadata_")
  
profiles %<>%
  cytominer::select(
    sample = 
      profiles %>% 
      filter(Metadata_pert_type == "control"),
    variables = variables,
    operation = "variance_threshold"
  )

variables <-
  names(profiles) %>% str_subset("^Cells_|^Cytoplasm_|^Nuclei_")

# Remove the negative control (DMSO) from the data
profiles %<>% 
  filter(!Metadata_broad_sample %in% "DMSO") # 6400xnfeat

```

## Parameter
```{r parameter}
# method of distance matrix computation: "euclidean" (default), "maximum", "manhattan"
dist.method <- "euclidean"
# number of data to make the non replicate correlation
N <- 5000

# seed for the reproducibility
set.seed(42)

# number of CPU cores for parallelization
registerDoMC(7)
```

## Separation of data

Separation is made according to the compound that was added.
Image_Metadata_BROAD_ID = gives the ID of the compound that was added.

```{r separation of data}
# find the different compounds
IDs <- distinct(profiles, Metadata_broad_sample)
dim(IDs)
```

## Distance of the data

Calculate the distance of the replicate for each compound.

```{r distance}
# loop over all IDs
comp.dist.median <- foreach(i = 1:length(IDs$Image_Metadata_BROAD_ID), .combine=cbind) %dopar% {
  #filtering to choose only for one compound
  #comp <-
  #  filter(profiles, Metadata_broad_sample %in% IDs$Metadata_broad_sample[i])
  #comp <-
  #  comp[, variables] %>%
  #  as.matrix()

  #filtering to choose only for one compound
  comp <-
    filter(pf$data, Image_Metadata_BROAD_ID %in% IDs$Image_Metadata_BROAD_ID[i])
  comp <-
    comp[, pf$feat_cols] %>%
    as.matrix()
  
  # distance of the features
  comp.dist <- dist(comp, method = dist.method) %>% as.matrix()
  
  # median of the distances
  comp.dist.median <- median(comp.dist[lower.tri(comp.dist)],na.rm=TRUE)
}

hist(comp.dist.median,
     main="Histogram for Median Replicate Distance",
     xlab="Median Replicate Distance")

```


## Thresholding of poor replicate correlation

H0: median non replicate correlation.

The Null distribution is estimated by finding the median correlation of non replicates. 
Select randomly 4 replicates each coming for a different compound and calculate the median correlation. 
Repeat this N times to get a distribution.
Finally estimate a threshold (5th percentile) to filter out compounds with poor replicate correlation.

```{r non replicate distance parallel}
start.time <- Sys.time()

# set seed for reproducibility
set.seed(42)

# random sequence for reproducibility
a <- sample(1:10000, N, replace=F)

# loop over N times to get a distribution
random.replicate.dist.median <- foreach(i = 1:N, .combine=cbind) %dopar% {
  # set seed according to random sequence
  set.seed(a[i])
  
  # group by IDs
  # sample fixed number per group -> choose 4 replicates randomly from different group
  #random.replicate <- 
  #  profiles %>% 
  #  group_by(Metadata_broad_sample) %>% 
  #  sample_n(1, replace = FALSE) %>% 
  #  ungroup(random.replicate)
  #random.replicate <- sample_n(random.replicate, 4, replace = FALSE)
  
#comp <- random.replicate[,variables] %>% 
#    as.matrix()
  random.replicate <- 
    pf$data %>% 
    group_by(Image_Metadata_BROAD_ID) %>% 
    sample_n(1, replace = FALSE) %>% 
    ungroup(random.replicate)
  random.replicate <- sample_n(random.replicate, 4, replace = FALSE)
  
  
  comp <- random.replicate[,pf$feat_cols] %>% 
    as.matrix()
  
  
  # distance of the features
  random.replicate.dist <- dist(comp, method = dist.method)
  
  # median of the non replicate distance
  random.replicate.dist.median <- median(random.replicate.dist[lower.tri(random.replicate.dist)],na.rm=TRUE)
  
}

# histogram plot
hist(random.replicate.dist.median,
     main="Histogram for Non Replicate Median Correlation",
     xlab="Non Replicate Median Correlation")

# threshold to determine if can reject H0
thres <- quantile(random.replicate.dist.median, .05)
print(thres)

end.time <- Sys.time() 
time.taken <- end.time - start.time
time.taken
```

## Hit Selection

Select strong replicate correlation comparing with the 95th percentile of the Null Distribution.

```{r Hit Selection}
# find indices of replicate median correlation > threshold
inds <- which(comp.dist.median < thres)

# find values of the median that are hit selected
hit.select <- comp.dist.median[inds]

# find component that are hit selected
hit.select.IDs <- IDs$Metadata_broad_sample[inds]

# ratio of strong median replicate correlation
high.median.dist <- length(hit.select)/length(comp.dist.median)
print(high.median.dist)

```


## Results

|  Method   | Pearson | Spearman | Kendall | Euclidean | Maximum | Manhattan |
| --------- | ------- | -------- | ------- | --------- | ------- | --------- |
| N = 1000  | 0.5819  | -------- | ------- | --------- | ------- | --------- |
| N = 5000  | 0.5806  | 0.5519   | 0.5419  | 0.4606    | 0.3612  | 0.4875    |
| N = 10000 | 0.5850  | -------- | ------- | --------- | ------- | --------- |


- Difference between Pearson and Spearman correlation seem not to be very significant (no statistical test was performed).
- Distance method compared to correlation metric gives a lower ratio of hit selection (more or less 10% lower).

```{r plot}
type1 <- rep("Non-replicate Distance", length(random.replicate.dist.median))
type2 <- rep("Replicate Distance", length(comp.dist.median))
type <- c(type1, type2)
distance <- c(random.replicate.dist.median, comp.dist.median)
dat <- data.frame(distance = distance, type = type) 
ggplot(dat, aes(x=distance, fill=type, y=..density../sum(..density..))) +
  geom_histogram(binwidth=5, alpha=.4, position="identity")	+ 
  xlim(0, 400) + 
  geom_vline(xintercept=thres, colour = "red", alpha = 0.4) +
  ylab("density")

```