-
Notifications
You must be signed in to change notification settings - Fork 1
Membership Testing False Positive Rate: HyperLogLog
Edwin Chen edited this page Dec 17, 2017
·
7 revisions
Here's a plot of the false positive rate of HyperLogLog on simulated data (in blue), with the expected false positive rate overlaid (in black):
Let b be the number of binary address bits used to map elements to buckets, so that B = 2^b is the number of buckets, and let n be the number of distinct elements inserted into the HyperLogLog so far. Given a new element x, we compute a hash h(x), use the first b bits of h(x) to map x into a bucket, and let rho(x) be the position of the first "1" in the binary representation of the remaining bits of h(x).
- If the value stored in
x's bucket is already greater than or equal torho(x), we say that the HyperLogLog already containsx(and are wrong with some false positive probability). - Otherwise, we replace the value in the bucket with
rho(x), and say that this is our first encounter withx(and we are always correct).
The false positive rate for checking set containment of an element x is:
P(false positive)
= P(some element e, with rho(e) >= rho(x), was hashed into the same bucket as x)
= SUM_k P[rho(x) = k] * P(some element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)
= SUM_k P[rho(x) = k] * [1 - P(no element e, with rho(e) >= k, was hashed into the same bucket as x | rho(x) = k)]
= SUM_k P[rho(x) = k] * [1 - {1 - P(e has rho(e) >= k and e was hashed into the same bucket as x | rho(x) = k)}^n]
Now
P[rho(z) = k] = 1 / 2^kP[rho(z) >= k] = 1 / 2^(k-1)
So
P(false positive)
= SUM_k (1 / 2^k) * [1 - {1 - 1 / (B*2^(k-1))}^n]
~ SUM_k (1 / 2^k) * [1 - e^{-n / (B * 2^(k-1))}]
