[Feature Request]: Properly incorporate bulk multimap side input reads into caching. #34149

Open
17 tasks
robertwb opened this issue Mar 3, 2025 · 0 comments
robertwb commented Mar 3, 2025

What would you like to happen?

Currently the bulk read path simply stores the first 100 values, which optimizes for avoiding point lookups on very small maps. The proper approach would probably be to issue the state request itself from MultimapSideInput.get() in the bulk side input read block, iterate over the returned key-value-iterables, and add each key and its corresponding (weighted) value iterable to the cache one at a time.
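
As a rough illustration of the per-key approach described above, here is a minimal Python sketch. It is a toy model and not Beam code: `WeightedLruCache`, `bulk_read_into_cache`, and `fetch_all_entries` are hypothetical names, the bulk state request is replaced by a plain callable, and weight is approximated by element count rather than encoded size.

```python
from collections import OrderedDict


class WeightedLruCache:
    """Toy LRU cache bounded by total weight (hypothetical, not Beam's cache)."""

    def __init__(self, max_weight):
        self.max_weight = max_weight
        self.total_weight = 0
        self._entries = OrderedDict()  # key -> (values, weight)

    def put(self, key, values, weight):
        # Replace any existing entry so its weight is not counted twice.
        if key in self._entries:
            self.total_weight -= self._entries.pop(key)[1]
        self._entries[key] = (values, weight)
        self.total_weight += weight
        # Evict least recently used entries until the cache fits again.
        while self.total_weight > self.max_weight and self._entries:
            _, (_, evicted_weight) = self._entries.popitem(last=False)
            self.total_weight -= evicted_weight

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key][0]


def bulk_read_into_cache(fetch_all_entries, cache):
    """Run one bulk request and cache each key's values as its own entry.

    fetch_all_entries stands in for the bulk state request (an assumption);
    it returns an iterable of (key, iterable_of_values) pairs.
    """
    keys_seen = []
    for key, values in fetch_all_entries():
        materialized = list(values)
        # Assumed weighting: element count, with a floor of 1 for empty lists.
        cache.put(key, materialized, weight=max(1, len(materialized)))
        keys_seen.append(key)
    return keys_seen
```

Because every key is its own entry, the cache can evict cold keys individually while keeping hot ones, which is what makes the weighting per value iterable meaningful.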

We would probably also want to store some state indicating whether the bulk read was already attempted. If the returned values covered the entire map, we would also want to store the set of keys (or at least a Bloom filter), so that we can return the empty iterable immediately for keys we have discovered are not actually in the map, as distinct from keys that were merely evicted from the cache.
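
The extra state described above could look something like the following Python sketch. All names here (`MultimapSideInputView`, `bulk_fetch`, `point_fetch`, `evict`) are hypothetical, the cache is a plain dict standing in for an evicting cache, and `bulk_fetch` is assumed to report whether its result covered the whole map.

```python
class MultimapSideInputView:
    """Toy model of the lookup path (hypothetical names, not the Beam API)."""

    def __init__(self, bulk_fetch, point_fetch):
        self._bulk_fetch = bulk_fetch    # () -> (dict of key -> values, is_complete)
        self._point_fetch = point_fetch  # key -> list of values
        self._cache = {}                 # stands in for an evicting cache
        self._bulk_read_attempted = False
        self._known_keys = None          # set only if the bulk read covered the whole map
        self.point_lookups = 0           # counter for illustration only

    def get(self, key):
        # Attempt the bulk read at most once.
        if not self._bulk_read_attempted:
            self._bulk_read_attempted = True
            entries, is_complete = self._bulk_fetch()
            self._cache.update(entries)
            if is_complete:
                self._known_keys = set(entries)
        if key in self._cache:
            return self._cache[key]
        # A complete key set lets us tell "absent" apart from "evicted".
        if self._known_keys is not None and key not in self._known_keys:
            return []
        # Key may exist but is not cached: fall back to a point lookup.
        self.point_lookups += 1
        values = self._point_fetch(key)
        self._cache[key] = values
        return values

    def evict(self, key):
        """Simulate cache eviction of a single key."""
        self._cache.pop(key, None)
```

If the exact key set is too large to keep, a Bloom filter could take the place of `_known_keys`. Its false positives would only cause some unnecessary point lookups; it would never report a present key as absent.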

Alternatively, we could store the map (possibly of the first page alone) in cache as a single entry.
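
For comparison, the single-entry alternative might be sketched as below. This is again a hypothetical Python model: `CachedPage`, `lookup`, and the `side_input_id` cache key are illustrative names, and the cache is any mapping with a `get` method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedPage:
    """The map (or just its first page) stored as one cache entry."""
    entries: dict      # key -> list of values
    is_complete: bool  # True if entries holds the entire map

    @property
    def weight(self):
        # Assumed weighting: total element count, at least 1 per key.
        return sum(max(1, len(v)) for v in self.entries.values())


def lookup(cache, side_input_id, key, point_fetch):
    """Answer from the cached page when possible, else do a point lookup."""
    page = cache.get(side_input_id)
    if page is not None:
        if key in page.entries:
            return page.entries[key]
        if page.is_complete:
            return []  # the whole map is cached, so the key is absent
    return point_fetch(key)
```

The trade-off is granularity: the page is weighted and evicted as a whole, so no separate key set is needed, but one large map can push out many smaller entries or be dropped all at once.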

Issue Priority

Priority: 3 (nice-to-have improvement)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner