Add "needle in the haystack" queries #8

Open
valyala opened this issue Feb 13, 2025 · 1 comment

Comments

@valyala
Contributor

valyala commented Feb 13, 2025

The JSONBench dataset contains fields with a large number of unique values (high-cardinality fields):

  • did (aka user_id)
  • commit.cid (aka commit_id)
  • commit.record.subject.cid

Sometimes you need to find all rows that contain a particular, rarely seen value of some field, for example all rows generated by a given user. The following query does this for JSONBench data:

SELECT count(*) FROM bluesky WHERE data.did = 'did:plc:stwikwzlk2mepaagokthylry'

Another practical query selects the row for a given commit_id:

SELECT * FROM bluesky WHERE data.commit.cid = 'bafyreielfqkpggsdqwtbtg5tyh7iqytp64paevfjbeufnw6kc7sgmjemhm'
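
To make the semantics of these two lookups concrete, here is a minimal Python sketch over a few made-up rows. The sample values and the helper names (`get_path`, `scan`, `build_index`) are hypothetical and are not part of JSONBench or of any benchmarked system. It only illustrates the difference between answering such a query with a full scan and answering it through a value-to-rows index; it makes no claim about how any particular database executes the query.

```python
import json
from collections import defaultdict

# Hypothetical sample rows shaped like JSONBench events (values are made up).
ROWS = [
    '{"did": "did:plc:aaa", "commit": {"cid": "cid-1"}}',
    '{"did": "did:plc:bbb", "commit": {"cid": "cid-2"}}',
    '{"did": "did:plc:aaa", "commit": {"cid": "cid-3"}}',
]


def get_path(doc, path):
    """Follow a dotted path such as 'commit.cid'; return None if a key is missing."""
    for key in path.split("."):
        if not isinstance(doc, dict) or key not in doc:
            return None
        doc = doc[key]
    return doc


def scan(rows, path, needle):
    """Full scan: parse every row and keep those whose field equals the needle."""
    return [doc for doc in map(json.loads, rows) if get_path(doc, path) == needle]


def build_index(rows, path):
    """Map each field value to the positions of the rows that contain it."""
    index = defaultdict(list)
    for pos, raw in enumerate(rows):
        value = get_path(json.loads(raw), path)
        if value is not None:
            index[value].append(pos)
    return index


# Analogue of: SELECT count(*) FROM bluesky WHERE data.did = '...'
count_by_user = len(scan(ROWS, "did", "did:plc:aaa"))

# Analogue of: SELECT * FROM bluesky WHERE data.commit.cid = '...'
row_by_commit = scan(ROWS, "commit.cid", "cid-2")

# With an index, the same lookup touches only the matching row positions.
did_index = build_index(ROWS, "did")
positions_for_user = did_index["did:plc:aaa"]
```

The scan reads every row no matter how rare the value is, while the index lookup cost depends only on the number of matches, which is why this kind of query stresses a system differently from aggregation queries.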
@alexey-milovidov
Member

This will be a good addition, although it is outside the main direction of the benchmark (data analytics). Let's evaluate the systems on this query and see how they compare.
