Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support WHERE x IN (list) clauses with multiple items in list #352

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

gfieremans
Copy link

@gfieremans gfieremans commented Apr 7, 2023

Summary

Up to now, the constraintargs for a WHERE IN clause with more than one element in a list comes in as None in the VTCursor.Filter() method. With the use of the VTTable.BestIndexObject() and using apsw 3.41.0+, it is possible to get this list passed in as a set to the VTCursor.Filter() method. This PR serializes that set to a string.

Testing instructions

  • You will probably need to change the setup.cfg to set apsw>=3.41.0
  • set supports_requested_columns = True and supports_in_statements = True in your adapter's class derived from the Adapter base class..
  • make sure to properly handle the deserialization of a stringified set. A WHERE x IN (item) clause gets serialized to the string ==item, while the same clause with one more item WHERE x IN (item1, item2) gets serialized to the string IN("item1", "item2").

src/shillelagh/adapters/base.py Outdated Show resolved Hide resolved
src/shillelagh/adapters/base.py Outdated Show resolved Hide resolved
src/shillelagh/adapters/base.py Outdated Show resolved Hide resolved
src/shillelagh/backends/apsw/vt.py Outdated Show resolved Hide resolved
src/shillelagh/backends/apsw/vt.py Outdated Show resolved Hide resolved
if isinstance(constraint, set):
# A WHERE x IN (list) clause sends in constraintargs as a set when N>1:
# {item1,...,itemN}. We serialize this set to a string '["item1",...,"itemN"]'.
constraint = json.dumps(list(constraint))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but why do you have to serialize this at all here?

Copy link
Author

@gfieremans gfieremans May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because these sets belong to an "Equals" clause and the current logic translates that for a string to '==string' and to leverage that principle the set needs to be converted to somethign like '==["string1", "string2"]'

Copy link
Collaborator

@cancan101 cancan101 May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the code should look something like this:

if isinstance(constraint, set):
        constraint = [type_map[column_type.type]().parse(c) for c in constraint]
        value = [column_type.format(c) for c in constraint]
        # operator = IN (not sure if eq is passed in here or if IN is set)
else:
        constraint = type_map[column_type.type]().parse(constraint)
        value = column_type.format(constraint)

all_bounds[column_name].add((operator, value))

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yes, you will need to set the operator explicitly here to Operator.IN and create that operator as otherwise it will be EQ

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK - let me try that.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all_bounds[column_name].add((operator, value)) which value is a list so (operator, value) is unhashable

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So change this line:

value = [column_type.format(c) for c in constraint]

To

value = tuple(column_type.format(c) for c in constraint)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that should work, then we got adding new type of operator left, waiting for the update and merge ^^

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with tuple() and added the Operator.IN

@cancan101
Copy link
Collaborator

this also needs some tests

src/shillelagh/backends/apsw/vt.py Outdated Show resolved Hide resolved
if isinstance(constraint, set):
# A WHERE x IN (list) clause sends in constraintargs as a set when N>1:
# {item1,...,itemN}. We serialize this set to a string '["item1",...,"itemN"]'.
constraint = json.dumps(list(constraint))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok yes, you will need to set the operator explicitly here to Operator.IN and create that operator as otherwise it will be EQ

@cwegener
Copy link
Collaborator

I think the failing integration tests look unrelated to your work? And they were probably fixed here -> 48bb76a

You should just run pylint, black and mypy tests locally though and just address all those style and formatting checks.

@gfieremans
Copy link
Author

gfieremans commented May 26, 2023

Yes, thanks, I just figured that out too that the pre-commit is changing files. I think there still is one issue related to my PR though that I'm trying to track down. I seem to be changing the signature of requested_columns of VTTable.

@cwegener
Copy link
Collaborator

For pre-commit, just make sure it's installed on your local machine https://pre-commit.com/#install
And then run pre-commit install once off in your local shillelagh repo, and then all those local hooks will run before git commit runs (unless you use a different Git client ..... i.e. the hooks probably won't work with something like Eclipse Egit ...)

@cwegener
Copy link
Collaborator

. I think there still is one issue related to my PR though that I'm trying to track down. I seem to be changing the signature of requested_columns of VTTable

You should just be able to merge the latest main branch, or rebase onto main to incorporate the changes related to the API change. 14ed1f6

@gfieremans
Copy link
Author

This particular error in my new test, I cannot figure out:

FAILED tests/backends/apsw/vt_test.py::test_virtual_best_index_object_with_in_statement - AttributeError: 'VTTable' object has no attribute 'requested_columns'

@cwegener
Copy link
Collaborator

This particular error in my new test, I cannot figure out:

FAILED tests/backends/apsw/vt_test.py::test_virtual_best_index_object_with_in_statement - AttributeError: 'VTTable' object has no attribute 'requested_columns'

The test runs successfully for my on Linux Python 3.9 .... 🤷 Something funny with the Github actions set up maybe ... I'm not a GH actions expert. 😉

src/shillelagh/backends/apsw/vt.py Outdated Show resolved Hide resolved

all_bounds[column_name].add((operator, value))
if isinstance(constraint, set) and operator == Operator.EQ:
# See also https://rogerbinns.github.io/apsw/vtable.html#apsw.VTCursor.Filter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit but I would write this as:

if isinstance(constraint, set) and operator is Operator.EQ:
  # See also https://rogerbinns.github.io/apsw/vtable.html#apsw.VTCursor.Filter
  constraint = [type_map[column_type.type]().parse(c) for c in constraint]
  value = tuple(column_type.format(c) for c in constraint)
  operator = Operator.IN
else:
  constraint = type_map[column_type.type]().parse(constraint)
  value = column_type.format(constraint)

 all_bounds[column_name].add((operator, value))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That causes warnings because in the 'if' branch value is a tuple and in the 'else' branch it is a string.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can solve that by forward declaring the type

value: Tuple | string # update type as needed
if isinstance(constraint, set) and operator is Operator.EQ:
  # See also https://rogerbinns.github.io/apsw/vtable.html#apsw.VTCursor.Filter
  constraint = [type_map[column_type.type]().parse(c) for c in constraint]
  value = tuple(column_type.format(c) for c in constraint)
  operator = Operator.IN
else:
  constraint = type_map[column_type.type]().parse(constraint)
  value = column_type.format(constraint)

 all_bounds[column_name].add((operator, value))

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is Any OK? I'd rather not get into a type cast war trying to list all possibilities

value: Any

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any is fine

src/shillelagh/filters.py Outdated Show resolved Hide resolved
return Impossible()

# we only accept tuples
value = values.pop()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why pop here?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copied the concept from all the other filters. I'll remove the check for 'tuple' below and then normalize it with all the others.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@betodealmeida can comment on the pop and the rest of this code.


# we only accept tuples
value = values.pop()
if not isinstance(value, tuple):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably be looser here with the types accepted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These come from https://rogerbinns.github.io/apsw/vtable.html#apsw.VTCursor.Filte. Only set is mentioned there I believe.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but that is tuple. Where does this get called from any ways?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tuple? That's what we chose in

value = tuple(column_type.format(c) for c in constraint)

But I'm OK removing this constraint.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not totally clear where this code gets called. I remember it being cases where the sql engine doesn't handle filtering.

src/shillelagh/filters.py Outdated Show resolved Hide resolved
Copy link
Owner

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, this slid under my radar. Taking a look this weekend.

# if true, a set from an IN statement will be assigned the new IN operator and
# passed in as a tuple to bounds in ``get_rows`` and ``get_data``. This requires
# supports_requested_columns to be set to true as well.
supports_in_statements = False
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we need explicit support at the adapter level. I think we might be able to just add the IN adapter, and if a given column doesn't support it we could replace it with a bunch of Equal operators.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants