`has_no_duplicates` is much slower than assert_that equivalent in a dataset with many variables

Issue #31 new
Nicholas Marantz created an issue

assertive::has_no_duplicates runs much slower than the assertthat equivalent on a dataset with many variables, but not on one with few. In the reprex below, the median runtime on df (500,000 observations of 2 variables) is 1.02s for assertthat and 1.16s for assertive, so assertthat is only marginally faster. On df2 (500,000 observations of 50 variables), the median runtime is 2.52s for assertthat and 38.4s for assertive, so assertthat is over 15 times faster. Both benchmarks check the same two columns, id1 and id2.

library(tidyverse)
library(assertthat)
library(assertive)

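# Generate n random ID strings: 10 letters, a 4-digit number, then 1 letter.
# Note: this definition masks stats::var for the rest of the session.
var <- function(n = 5000) {
  a <- do.call(paste0, replicate(10, sample(LETTERS, n, TRUE), FALSE))
  paste0(a, sprintf("%04d", sample(9999, n, TRUE)), sample(LETTERS, n, TRUE))
}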

df <- tibble(id1 = var(500000), id2 = var(500000))
assertthat_small <- bench::mark(assertthat::assert_that(anyDuplicated(df[, c("id1", "id2")]) == 0))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
assertive_small <- bench::mark(df %>% assertive::has_no_duplicates(c("id1", "id2")))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
assertthat_small$median
#> [1] 1.02s
assertive_small$median
#> [1] 1.16s


# 50 ID columns (id1 ... id50), each with 500,000 random values
df2 <- replicate(50, var(500000), simplify = FALSE) %>%
  set_names(paste0("id", 1:50)) %>%
  as_tibble()
assertthat_big <- bench::mark(assertthat::assert_that(anyDuplicated(df2[, c("id1", "id2")]) == 0))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
assertive_big <- bench::mark(df2 %>% assertive::has_no_duplicates(c("id1", "id2")))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
assertthat_big$median
#> [1] 2.52s
assertive_big$median
#> [1] 38.4s
Created on 2021-06-04 by the reprex package (v2.0.0)
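
One way to narrow this down might be to check, in base R only, how the cost of a duplicate check changes when it runs over every column rather than just the two key columns. The sketch below is a hypothetical diagnostic and not part of the original reprex: `make_ids`, `wide`, `keys`, and the smaller row count are assumptions chosen so it runs quickly. It does not establish what assertive does internally; it only gives a baseline to compare the timings above against.

```r
# Hypothetical diagnostic (not from the original report): base R only.
set.seed(1)
n <- 20000  # smaller than the reprex so the sketch runs quickly

# Illustrative ID generator: one letter followed by a 6-digit number.
make_ids <- function(n) {
  paste0(sample(LETTERS, n, TRUE), sprintf("%06d", sample(999999, n, TRUE)))
}

# Build a 50-column data frame with columns id1 ... id50.
cols <- setNames(replicate(50, make_ids(n), simplify = FALSE),
                 paste0("id", 1:50))
wide <- as.data.frame(cols, stringsAsFactors = FALSE)

keys <- c("id1", "id2")

# Duplicate check restricted to the two key columns.
t_subset <- system.time(dup_subset <- anyDuplicated(wide[, keys]))[["elapsed"]]

# Duplicate check across all 50 columns (whole rows).
t_full <- system.time(dup_full <- anyDuplicated(wide))[["elapsed"]]

cat("key columns only:", t_subset, "s\n")
cat("all columns:     ", t_full, "s\n")
```

If the all-columns timing grows with the column count in roughly the way the assertive timings do, that would be worth mentioning in the report; if not, the slowdown probably lies elsewhere.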
