-
Notifications
You must be signed in to change notification settings - Fork 1
Membership Testing False Positive Rate: Count Min Sketch
Each Count-Min sketch is of size 4096 bits. We vary the depth and width of each sketch, while maintaining its size at 4096 bits, to see how the false positive rate changes. (Assuming we store each count in the count table as an integer, the size of a Count-Min sketch with depth d and width w is 32 * dw bits.)
Let's also plot the expected false positive rate as a black line on top of the simulated one.
And here's a zoomed in version:
Here's a derivation of the expected false positive rate, i.e., the probability that a new element is claimed as already being in the set. Let d be the depth of the Count-Min sketch (i.e., the number of hashes used), let w be the width of the sketch (so that the counts table has size d*w), and let there be n distinct elements seen so far (note that, in practice, we usually do not know the exact value of n when running a CMS).
On inserting a new element x, the probability we erroneously claim it as an existing element is:
P(false positive)
= P(each of the d hashes maps x to a non-zero slot)
= P(hash h_0 maps x to a non-zero slot)^d [since these are identical distributions]
= [1 - P(no previous element has been mapped by h_0 to the same slot as x)]^d
= [1 - (1 - 1/w)^n]^d


