
Low cardinality auto indexing #142

Open
EricSoroos opened this issue Jun 10, 2024 · 5 comments

Comments

@EricSoroos
Contributor

Describe the bug

Auto index generation on cardinality is backwards.

https://github.com/dathere/datapusher-plus/blob/master/datapusher/jobs.py#L1728

    # if a column's cardinality <= AUTO_INDEX_THRESHOLD, create an index for that column

Low-cardinality indexes on PostgreSQL aren't useful, because the planner will almost never choose them.

The planner will choose a table scan instead of an index scan if it is likely that many/most pages will be read. With a low-cardinality index, e.g. on a bool, the planner is going to assume that half the rows will be hit, and therefore essentially all pages. Even if the distribution is really skewed, the planner will generally not choose that index. (If you do have a known skewed distribution, you can create a partial/multicolumn index where the condition is the rare case, but that's not really applicable for a general-purpose tool. For illustration, see the sketch below.)
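
A minimal sketch of the partial-index idea, not from the DP+ codebase; the helper and all table/column/value names are hypothetical:

    # Sketch: a partial index covering only the rare case of a skewed column,
    # so the planner can actually use it for the selective predicate.
    from psycopg2 import sql  # assumes a psycopg2 connection is available

    def create_rare_case_index(conn, table, column, rare_value):
        """CREATE INDEX ... WHERE column = rare_value, covering only the rare rows."""
        stmt = sql.SQL(
            "CREATE INDEX IF NOT EXISTS {idx} ON {tbl} ({col}) WHERE {col} = {val}"
        ).format(
            idx=sql.Identifier(f"{table}_{column}_rare_idx"),
            tbl=sql.Identifier(table),
            col=sql.Identifier(column),
            val=sql.Literal(rare_value),
        )
        with conn.cursor() as cur:
            cur.execute(stmt)
        conn.commit()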

This condition should be reversed, and the default threshold set to at least ~10.
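
For concreteness, the reversed check could look like this (a rough sketch, not a patch against jobs.py; the function name is made up):

    # Only index columns whose cardinality is at or above the threshold, so
    # the planner has a realistic chance of choosing the index over a seq scan.
    AUTO_INDEX_THRESHOLD = 10  # suggested default from this issue

    def should_auto_index(cardinality: int) -> bool:
        return cardinality >= AUTO_INDEX_THRESHOLD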

@jqnatividad
Contributor

I went for that heuristic because I was optimizing for accelerated aggregations (e.g. a Borough column with only 5 values: Manhattan, Bronx, Brooklyn, Queens, Staten Island) and for filtering with the DataTables view Searchbuilder, where you have an interactive filter builder and response time is important.

Maybe I can tweak the heuristic so that if the value is negative and less than -1, the threshold is interpreted as "greater than the absolute value"?

Alternatively, I could add another set of AutoIndex settings to the dotenv template that eval()s an expression, so DP+ maintainers can specify their own auto-index heuristics?
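
Roughly, the negative-value interpretation would be something like this (hypothetical helper, not in DP+):

    # A positive threshold keeps the current "cardinality <= threshold" rule;
    # a value below -1 flips it to "cardinality >= abs(threshold)".
    def should_auto_index(cardinality: int, threshold: int) -> bool:
        if threshold < -1:
            return cardinality >= abs(threshold)
        return cardinality <= threshold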

@EricSoroos
Contributor Author

It's possible what you want is 5 and up -- that's getting to the point where the planner might choose to use the index, and that's something an EXPLAIN ANALYZE would tell you. Right now I'm seeing it create indexes when there's a single text value, which is truly useless. (And in my case, it's a 13 MB index + a 38 MB text index.)

@EricSoroos
Contributor Author

As a side note -- I'm not sure it's worth creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added later.

@jqnatividad
Contributor

It's possible what you want is 5 and up -- that's getting to the point where the planner might choose to use the index, and that's something an EXPLAIN ANALYZE would tell you. Right now I'm seeing it create indexes when there's a single text value, which is truly useless. (And in my case, it's a 13 MB index + a 38 MB text index.)

Yes. For that example, I would set the threshold to 5 in the dotenv template. I'll change the default from 3 to 10 as you suggested, since the current setting is too low. I'll also add a minimum cardinality setting with a default of 3.
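
In other words, something along these lines (setting names are placeholders, not the final dotenv keys):

    # Index a column only when its cardinality falls between the minimum
    # and the threshold.
    AUTO_INDEX_THRESHOLD = 10  # raised default, per this thread
    AUTO_INDEX_MINIMUM = 3     # new minimum, per this thread

    def should_auto_index(cardinality: int) -> bool:
        return AUTO_INDEX_MINIMUM <= cardinality <= AUTO_INDEX_THRESHOLD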

@jqnatividad
Contributor

As a side note -- I'm not sure it's worth creating unique indexes when doing the loads from datapusher, unless you're specifically trying to constrain additional entries that are added later.

Yes. That was what I was going for - to prevent erroneous upserts.
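
For context, a unique index is what gives an upsert something to target. A sketch with hypothetical table/column names, not the DP+ load path:

    # Without a unique index/constraint on the key column, ON CONFLICT has
    # no arbiter and duplicates would be inserted instead of updated.
    from psycopg2 import sql

    def build_upsert(table, key_col, val_col):
        return sql.SQL(
            "INSERT INTO {tbl} ({key}, {val}) VALUES (%s, %s) "
            "ON CONFLICT ({key}) DO UPDATE SET {val} = EXCLUDED.{val}"
        ).format(
            tbl=sql.Identifier(table),
            key=sql.Identifier(key_col),
            val=sql.Identifier(val_col),
        )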
