
theroggy
Member

@theroggy theroggy commented Sep 13, 2025

In read_dataframe without Arrow, the number of rows in the result was counted first, and only then was the full data read.

Especially when using a filter, counting the rows can take significant time.

This PR skips the row count before reading to improve performance. Results with the New Zealand building outlines GeoPackage (3.3 million rows) as test file:

  • If the filter reduces the number of rows a lot, counting the rows can take about as long as the subsequent read of the data, so in this case the total time roughly halves.
    • e.g. reading the test file with where="ST_NPOINTS(st_buffer(geom, 10)) > 2000" (returning 9 rows) took 82 s, now 45 s.
  • When reading the entire file without a filter, both implementations take 55-60 seconds on my Windows laptop (plugged in), and the average timings of the new implementation are the same as before.
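
To illustrate why skipping the count helps most with expensive filters, here is a minimal pure-Python sketch of the two read strategies. It is not the code in this PR: CountingSource, read_with_count_first and read_single_pass are hypothetical names, and the source is a toy stand-in for a filtered layer that simply records how often the filter is evaluated.

```python
from typing import Callable, Iterable, Iterator, List


class CountingSource:
    """Hypothetical stand-in for a filtered layer.

    Every iteration re-applies the filter to all rows, and `evaluations`
    records how many times the (potentially expensive) filter ran.
    """

    def __init__(self, rows: Iterable[int], predicate: Callable[[int], bool]):
        self.rows = list(rows)
        self.predicate = predicate
        self.evaluations = 0

    def __iter__(self) -> Iterator[int]:
        for row in self.rows:
            self.evaluations += 1
            if self.predicate(row):
                yield row


def read_with_count_first(source: CountingSource) -> List[int]:
    # Pass 1: count the matching rows so the result can be sized up front.
    n = sum(1 for _ in source)
    result: List[int] = [0] * n
    # Pass 2: iterate again and fill the preallocated result.
    for i, row in enumerate(source):
        result[i] = row
    return result


def read_single_pass(source: CountingSource) -> List[int]:
    # Single pass: grow the result as rows arrive, no count needed.
    result: List[int] = []
    for row in source:
        result.append(row)
    return result


def selective(row: int) -> bool:
    # Toy filter that keeps only 1 row in 100.
    return row % 100 == 0


old = CountingSource(range(1000), selective)
new = CountingSource(range(1000), selective)

rows_old = read_with_count_first(old)
rows_new = read_single_pass(new)

print(rows_old == rows_new)   # same rows either way
print(old.evaluations)        # filter ran twice per input row
print(new.evaluations)        # filter ran once per input row
```

In this toy model the filter cost dominates, so the single pass does half the filter work. Without a filter, counting is cheap relative to reading, which is consistent with the unchanged timings for the full-file read.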

@theroggy theroggy marked this pull request as ready for review September 13, 2025 15:49
@theroggy theroggy marked this pull request as draft September 13, 2025 15:50
@theroggy theroggy marked this pull request as ready for review September 13, 2025 20:30
@theroggy theroggy modified the milestones: 0.11.0, 0.12.0 Sep 14, 2025