Description
I am facing some issues parallelizing processes with furrr::future_walk.
This is the setting I am having issues with:
rm(list=ls(all=TRUE))
require(future)
require(furrr)
require(dplyr)
require(readr)
require(parallel)
set.seed(123)
# fake data
my_list <- replicate(1000000, rnorm(1000), simplify = FALSE)
# function to parallelize
f_to_parallelize <- function(x) {
  y <- sum(x)
  return(y)
}
# plans to test
plan(sequential)
#plan(multisession, workers=2)
#plan(multisession, workers=6)
#plan(multisession, workers=15)
l <- future_walk(my_list, f_to_parallelize)
When I profile memory and time for these 4 plans this is what I get:
I launched 4 different jobs from RStudio Server, one per plan. In a separate job, I profiled the total memory used by all processes running under my user to get the data for the graph.
This is the output of sessionInfo() for the parallelization jobs:
R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] readr_2.1.2 dplyr_1.1.0 furrr_0.2.3 future_1.24.0
loaded via a namespace (and not attached):
[1] rstudioapi_0.13 parallelly_1.30.0 magrittr_2.0.2 hms_1.1.1
[5] tidyselect_1.2.0 R6_2.5.1 rlang_1.1.1 fansi_1.0.2
[9] globals_0.14.0 tools_4.2.2 utf8_1.2.2 cli_3.6.0
[13] ellipsis_0.3.2 digest_0.6.29 tibble_3.1.6 lifecycle_1.0.3
[17] crayon_1.5.0 tzdb_0.2.0 purrr_1.0.1 vctrs_0.5.2
[21] codetools_0.2-18 glue_1.6.2 compiler_4.2.2 pillar_1.7.0
[25] generics_0.1.2 listenv_0.8.0 pkgconfig_2.0.3
Is this behavior normal? I did not expect the steep increase in memory for all the plans, nor the increase in run time as I increase the number of workers.
I also tested Sys.sleep(1) in parallel, and there I got the result I expected: run time decreases as I increase the number of workers.
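For reference, the sleep-based check can be sketched roughly as below. This is a minimal sketch, not the exact script I ran: the helper name time_sleep_test and the choice of 6 tasks are placeholders made up for illustration.

```r
library(future)
library(furrr)

# Hypothetical helper (name and n_tasks are illustrative, not from my real script):
# time n_tasks one-second sleeps under a given number of workers.
time_sleep_test <- function(workers, n_tasks = 6) {
  if (workers == 1) {
    plan(sequential)
  } else {
    plan(multisession, workers = workers)
  }
  # Always restore the sequential plan so background workers are released
  on.exit(plan(sequential), add = TRUE)

  elapsed <- system.time(
    future_walk(seq_len(n_tasks), function(i) Sys.sleep(1))
  )[["elapsed"]]
  elapsed
}

# Elapsed seconds for 1 worker (sequential) and 2 workers
timings <- sapply(c(1, 2), time_sleep_test)
print(timings)
```

With a task that only sleeps, almost no data is shipped to the workers, so the elapsed time should mostly reflect how the tasks are split across workers plus the worker start-up overhead.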
What I am trying to parallelize is far more complex than this: a series of nested wrapper functions that train some time series models, run inference, write the results to a CSV file, and return nothing.
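The shape of that real workload is roughly as follows. This is only a sketch under assumptions: fit_and_write, the output directory, and the random data are hypothetical stand-ins for the actual training and inference code, which is much heavier.

```r
library(future)
library(furrr)

plan(multisession, workers = 2)

# Assumed output location for illustration only
out_dir <- file.path(tempdir(), "model_outputs")
dir.create(out_dir, showWarnings = FALSE)

# Hypothetical stand-in for the real training/inference step:
# it writes one CSV per task and returns nothing.
fit_and_write <- function(id, out_dir) {
  series <- rnorm(100)                                  # placeholder for a time series
  result <- data.frame(id = id, mean = mean(series))    # placeholder for model output
  write.csv(result,
            file.path(out_dir, paste0("result_", id, ".csv")),
            row.names = FALSE)
  invisible(NULL)
}

# future_walk is used purely for the side effect (the CSV files);
# seed = TRUE gives each task a parallel-safe random number stream.
future_walk(1:4, fit_and_write, out_dir = out_dir,
            .options = furrr_options(seed = TRUE))

written <- list.files(out_dir, pattern = "\\.csv$")
print(written)

plan(sequential)
```

Since nothing is returned, the only data moving between processes should be the inputs sent to each worker.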
I feel like I am missing something very simple, yet I cannot wrap my head around it. What concerns me the most is the memory increase, since the real function will be very memory intensive.
