Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reservoir sampling improvements #647

Open
jmalkin opened this issue Feb 3, 2025 · 0 comments
Open

Reservoir sampling improvements #647

jmalkin opened this issue Feb 3, 2025 · 0 comments

Comments

@jmalkin
Copy link
Contributor

jmalkin commented Feb 3, 2025

We can make a number of improvements to reservoir sampling:

  • We can go from 2 random draws per accepted sample to 1 by picking a random long from 0 to n and accepting if it's less than k -- which also provides the location to use. This would get around the current max size bit limit (which is unlikely to be an issue, but still) by removing the use of a double for a value in [0, 1). But that'd also require a new serialization version since n is currently stored with fewer than 64 bits.
  • After discussing with a stats person about a year ago, we believe we could even switch to a random draw from the distribution for the next item, dropping the random draws to O(accepted items) rather than O(n). As long as merges happen independent of that random draw (so no adversary able to inspect it) then we should be able to merge and re-draw for the merged sample.
  • Merge can be improved to give uniform second-order probabilities. This is the exciting one for me. But there's a caveat.

Proposed merge, with python pseudocode:

# Merge two random samples. Samples are lists s1 and s2. The numbers n1 and n2
# are the sizes of the sets from which s1 and s2 were sampled.  Value s is the specified
# size for the merged sample. Requires s <= len(s1) and s <= len(s2). 
def merge(s1, s2, n1, n2, s): 
  # Find number to draw from s1. (The others will be from s2.)
  t = 0 # Number of observations to draw from s1.

  for i in range(s):
    j = random.randint(1, n1 + n2)
    
    if j <= n1:
      t += 1
      n1 -= 1
    else: n2 -= 1

  # Draw the samples.
  a = random.sample(s1, t) # Sample without replacement.
  b = random.sample(s2, s - t) # Sample without replacement.

  return a + b # Make one list with the elements of a and b.

The random sample without replacement is important here, which is the source of the caveat: You can almost use the first t and s-t samples from s1 and s2, respectively, except that there's a position bias from the initial reservoir fill. For items 0 to k-1, they can only ever appear at that specific index. We don't start placing randomly until after the reservoir is filled.

The options I can think of for this:

  1. Randomly permute items when the reservoir fills. If it hasn't filled at merge, you're just inserting as new items of weight 1 anyway. Doesn't help with already-serialized samples.
  2. Randomly permute the input sample at merge (a modified Fisher-Yates shuffle since you can stop after the first t or s-t items). But this is an undesirable side-effect for merging, even if it doesn't impact the probabilities.
  3. Copy the array and permute that. Uses more space, and potentially a lot more for a large reservoir, but avoids side-effects. Probably permute an array of integer indices to avoid creating an array of arbitrary objects.

Maybe we can mix 1 and 3? For new samples, randomly permute upon fill and track that with a flag (which would need to be serialized going forward, of course). If a sample needs to be merged and doesn't have the flag, then copy and permute the array.

This should probably have been 3 separate issues. But wanted to document the ideas before I forget yet again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant