DavisVaughan (Member) commented Nov 6, 2025

Closes #1675, 3 years later!

If you find yourself questioning whether we need all 3 options for missing = NULL / FALSE / TRUE, the answer is yes:

| vctrs | Data frame | Vector |
| --- | --- | --- |
| vec_pall(missing = NULL) | | when_all(na_rm = FALSE) |
| vec_pall(missing = FALSE) | filter() / filter_out() | |
| vec_pall(missing = TRUE) | | when_all(na_rm = TRUE) |
| vec_pany(missing = NULL) | | when_any(na_rm = FALSE) |
| vec_pany(missing = FALSE) | | when_any(na_rm = TRUE) |
| vec_pany(missing = TRUE) | | |
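
To make the three options concrete, here is a minimal base R sketch of the semantics the table implies. The helpers `replace_missing()`, `pall_sketch()`, and `pany_sketch()` are hypothetical names invented for illustration, and the reading of `missing = NULL` as "propagate NA like `&` and `|`" is an assumption based on the benchmarks in this PR comparing equal to base R. This is not the vctrs implementation.

```r
# Hypothetical helper: optionally replace NA with TRUE or FALSE up front.
# `missing = NULL` leaves NA in place so that it propagates.
replace_missing <- function(x, missing) {
  if (!is.null(missing)) {
    x[is.na(x)] <- missing
  }
  x
}

# Hypothetical stand-ins for the parallel "all" and "any" operations.
pall_sketch <- function(xs, missing = NULL) {
  xs <- lapply(xs, replace_missing, missing = missing)
  Reduce(`&`, xs)
}

pany_sketch <- function(xs, missing = NULL) {
  xs <- lapply(xs, replace_missing, missing = missing)
  Reduce(`|`, xs)
}

x <- c(TRUE, TRUE, NA, FALSE)
y <- c(TRUE, NA, NA, NA)

pall_sketch(list(x, y))                  # TRUE    NA    NA FALSE
pall_sketch(list(x, y), missing = FALSE) # TRUE FALSE FALSE FALSE
pall_sketch(list(x, y), missing = TRUE)  # TRUE  TRUE  TRUE FALSE

pany_sketch(list(x, y))                  # TRUE  TRUE    NA    NA
pany_sketch(list(x, y), missing = FALSE) # TRUE  TRUE FALSE FALSE
pany_sketch(list(x, y), missing = TRUE)  # TRUE  TRUE  TRUE  TRUE
```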

We could have gotten away with na_rm in vec_pall() and vec_pany() except for the fact that filter() and filter_out() need the vec_pall(.missing = FALSE) case, which would be lost if we simplified to the binary na_rm.
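
A small base R illustration of why a binary na_rm cannot express the filter() case. It uses base `all()` on a single "row" of conditions as a stand-in for the parallel operation, which is an assumption made to keep the example self-contained.

```r
# One row evaluated against two filter conditions: one TRUE, one missing.
row <- c(TRUE, NA)

# The two results a binary `na_rm` can produce:
all(row)               # NA   (like na_rm = FALSE)
all(row, na.rm = TRUE) # TRUE (like na_rm = TRUE)

# What filter() needs: a missing condition means "do not keep the row",
# so NA has to count as FALSE before combining.
row_as_filter <- row
row_as_filter[is.na(row_as_filter)] <- FALSE
all(row_as_filter)     # FALSE
```

Neither na_rm setting yields FALSE here, which is the case the third option covers.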

I think the important thing is that the user-facing when_all() and when_any() get the simple na_rm, while vctrs gets the more flexible and holistic .missing argument, at the cost of more mental overhead.


vec_pall() and vec_pany() are very fast due to a rather clever NA propagation algorithm that uses C-level arithmetic rather than if/else branching, which makes us immune to bad branch prediction. They are much faster than base R, enough so that it might be interesting to see if the base R maintainers would take a patch.
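
To give a feel for how three-valued logic can be computed with arithmetic instead of if/else, here is a small R sketch. It is an illustration only and makes no claim about the actual C algorithm in this PR: the 0/1/2 encoding and the helper names `encode()`, `decode()`, `and_sketch()`, and `or_sketch()` are assumptions invented for this example. It relies on the standard fact that, with FALSE < NA < TRUE, three-valued AND is a minimum and OR is a maximum.

```r
# Illustrative encoding (an assumption, not what vctrs does internally):
# FALSE -> 0, NA -> 1, TRUE -> 2.
encode <- function(x) {
  code <- 2L * as.integer(x) # FALSE -> 0, TRUE -> 2, NA stays NA
  code[is.na(code)] <- 1L    # NA -> 1
  code
}

# Map the codes back to logical values with a lookup table.
decode <- function(code) {
  c(FALSE, NA, TRUE)[code + 1L]
}

# With this ordering, AND is the elementwise minimum and OR is the maximum.
and_sketch <- function(x, y) decode(pmin(encode(x), encode(y)))
or_sketch <- function(x, y) decode(pmax(encode(x), encode(y)))

# Check against base R on all nine combinations of TRUE, FALSE, and NA.
x <- rep(c(TRUE, FALSE, NA), each = 3)
y <- rep(c(TRUE, FALSE, NA), times = 3)

identical(and_sketch(x, y), x & y)
identical(or_sketch(x, y), x | y)
```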

Here's the branching code in base R:
https://github.com/wch/r-source/blob/3e507c3364b779e42bc06a6bb28867ec4a3a082e/src/main/logic.c#L361-L367

Here are some benchmarks with an equal distribution of TRUE, FALSE, and NA values (so, bad for branch prediction, which affects base R but not us).

(Ignore the list_* names, this is before I switched back to vec_* names)

library(vctrs)

set.seed(123)

x <- sample(c(TRUE, FALSE, NA), size = 1e8, replace = TRUE)
y <- sample(c(TRUE, FALSE, NA), size = 1e8, replace = TRUE)
z <- sample(c(TRUE, FALSE, NA), size = 1e8, replace = TRUE)

bench::mark(
  x | y,
  list_pany(list(x, y)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x | y                    517ms  519.4ms      1.92     381MB     1.28
#> 2 list_pany(list(x, y))   42.1ms   43.1ms     23.2      381MB     5.79

bench::mark(
  x & y,
  list_pall(list(x, y)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x & y                  463.3ms  467.2ms      2.12     381MB    0.531
#> 2 list_pall(list(x, y))   40.9ms   41.4ms     23.2      381MB    5.79

bench::mark(
  x | y | z,
  list_pany(list(x, y, z)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x | y | z                   1.01s    1.02s     0.974     763MB    0.974
#> 2 list_pany(list(x, y, z))  61.94ms  67.52ms    14.4       381MB    3.60

bench::mark(
  x & y & z,
  list_pall(list(x, y, z)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x & y & z                   781ms  784.3ms      1.27     763MB    0.849
#> 2 list_pall(list(x, y, z))     58ms   58.6ms     17.0      381MB    4.26

bench::mark(
  list_pall(list(x, y, z), missing = FALSE),
  list_pall(list(x, y, z), missing = TRUE),
  list_pall(list(x, y, z), missing = NULL),
  list_pany(list(x, y, z), missing = FALSE),
  list_pany(list(x, y, z), missing = TRUE),
  list_pany(list(x, y, z), missing = NULL),
  check = FALSE,
  iterations = 50
)
#> # A tibble: 6 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 list_pall(list(x, y, z), missing =… 56.2ms 57.3ms      17.4     381MB     3.31
#> 2 list_pall(list(x, y, z), missing =… 56.3ms 57.2ms      17.3     381MB     1.92
#> 3 list_pall(list(x, y, z), missing =… 57.3ms 58.3ms      17.0     381MB     2.32
#> 4 list_pany(list(x, y, z), missing =… 56.3ms 58.3ms      16.7     381MB     2.28
#> 5 list_pany(list(x, y, z), missing =…   56ms 58.2ms      16.8     381MB     1.87
#> 6 list_pany(list(x, y, z), missing =… 61.1ms 62.7ms      15.8     381MB     2.15

Base R gets faster if you remove the "jumpiness" in the inputs, i.e. if you remove all NAs and heavily skew towards TRUE (1000:1). It's still slower than us, but not by as much.

library(vctrs)

set.seed(123)

x <- sample(c(rep(TRUE, 1000), FALSE), size = 1e8, replace = TRUE)
y <- sample(c(rep(TRUE, 1000), FALSE), size = 1e8, replace = TRUE)
z <- sample(c(rep(TRUE, 1000), FALSE), size = 1e8, replace = TRUE)

bench::mark(
  x | y,
  list_pany(list(x, y)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x | y                  122.9ms  124.5ms      7.82     381MB     5.21
#> 2 list_pany(list(x, y))   41.9ms   42.6ms     23.1      381MB     5.78

bench::mark(
  x & y,
  list_pall(list(x, y)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                 min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>            <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x & y                    156ms  161.6ms      6.14     381MB     1.53
#> 2 list_pall(list(x, y))   40.6ms   41.2ms     24.1      381MB     6.03

bench::mark(
  x | y | z,
  list_pany(list(x, y, z)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x | y | z                 236.1ms  237.8ms      4.13     763MB     4.13
#> 2 list_pany(list(x, y, z))   60.9ms   62.4ms     15.9      381MB     3.98

bench::mark(
  x & y & z,
  list_pall(list(x, y, z)),
  iterations = 10
)
#> # A tibble: 2 × 6
#>   expression                    min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>               <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 x & y & z                 312.4ms    315ms      3.15     763MB     2.10
#> 2 list_pall(list(x, y, z))   58.1ms     59ms     16.9      381MB     4.23

bench::mark(
  list_pall(list(x, y, z), missing = FALSE),
  list_pall(list(x, y, z), missing = TRUE),
  list_pall(list(x, y, z), missing = NULL),
  list_pany(list(x, y, z), missing = FALSE),
  list_pany(list(x, y, z), missing = TRUE),
  list_pany(list(x, y, z), missing = NULL),
  check = FALSE,
  iterations = 50
)
#> # A tibble: 6 × 6
#>   expression                             min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>                          <bch:> <bch:>     <dbl> <bch:byt>    <dbl>
#> 1 list_pall(list(x, y, z), missing =… 56.6ms 58.3ms      17.1     381MB     3.26
#> 2 list_pall(list(x, y, z), missing =… 56.6ms 57.9ms      17.0     381MB     1.89
#> 3 list_pall(list(x, y, z), missing =… 58.2ms 59.8ms      16.2     381MB     2.21
#> 4 list_pany(list(x, y, z), missing =… 56.7ms 58.3ms      17.0     381MB     2.32
#> 5 list_pany(list(x, y, z), missing =… 56.6ms 58.3ms      17.1     381MB     1.90
#> 6 list_pany(list(x, y, z), missing =… 61.7ms 63.1ms      15.7     381MB     2.14
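
For a quick side-by-side, the following sketch computes the base R to vctrs speedup from the median timings reported in the two sets of benchmarks above. The numbers are copied by hand from the output (with 1.02s written as 1020 ms); the data frame and its column names are made up for this summary.

```r
# Median timings in milliseconds, copied from the bench::mark() output above.
medians <- data.frame(
  expr = c("x | y", "x & y", "x | y | z", "x & y & z"),
  base_mixed = c(519.4, 467.2, 1020, 784.3),
  vctrs_mixed = c(43.1, 41.4, 67.52, 58.6),
  base_skewed = c(124.5, 161.6, 237.8, 315),
  vctrs_skewed = c(42.6, 41.2, 62.4, 59)
)

# Ratio of base R median time to vctrs median time, per expression.
medians$speedup_mixed <- round(medians$base_mixed / medians$vctrs_mixed, 1)
medians$speedup_skewed <- round(medians$base_skewed / medians$vctrs_skewed, 1)

medians[c("expr", "speedup_mixed", "speedup_skewed")]
```

On these medians the speedup is roughly 11x to 15x for the evenly mixed inputs and roughly 3x to 5x for the skewed inputs, while the vctrs timings themselves barely change between the two.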

DavisVaughan changed the title from "Implement list_pall() and list_pany()" to "Implement vec_pall() and vec_pany()" on Nov 12, 2025
DavisVaughan requested a review from lionel- on November 12, 2025 18:18