-
Notifications
You must be signed in to change notification settings - Fork 345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Generalized similarity calculation for minhasher #654
base: develop
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the PR! Let's see if we can make it binary compatible. Also, can you rebase on develop? It looks like we have a bunch of stray tag PRs in here.
Thanks again!
* Generalized Jaccard similarity estimation for multiple sets (size of intersection / size of union). | ||
* Jsim(S1..Sn) = P(hmin1 == hmin2 == ... == hminn) / numHashes | ||
*/ | ||
def similarityMulti(sigs: MinHashSignature*): Double = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you put this on the companion object? that should preserve binary compatibility and we can release it in the 0.13 branch.
* Decode multiple signatures into hash values and combine them using given fn. | ||
* This version is used by the new similarityMulti. | ||
*/ | ||
protected def buildArrayMulti(buffers: Seq[Array[Byte]])(fn: Seq[H] => H): Array[Byte] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you implement this as a private method on the companion that matches the case class we have, I think we can make this binary compatible.
private def buildArrayMulti[H](h: MinHasher[H], hasherbuffers: Seq[Array[Byte]])(fn: Seq[H] => H): Array[Byte] =
h match {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for reviewing. Actually, buildArrayMulti is an abstract method overridden in the concrete classes and it calls other buildArray virtual methods. If I move it to companion object, I'll have to move all of them to companion object. Does it make sense to extend MinHasher32 and implement a new MinHasher32MultiSig with the new methods (same for 16bit version)? This would avoid binary compatibility issue and I can still access the protected methods and vals.
class MinHasher32MultiSig(numHashes: Int, numBands: Int)(implicit n: Numeric[Int])
extends MinHasher32(numHashes, numBands) {
protected def buildArrayMulti(buffers: Seq[Array[Byte]])(fn: Vector[Int] => Int): Array[Byte] = {
val intBuffers = buffers.map(b => ByteBuffer.wrap(b).asIntBuffer)
buildArray(fn(intBuffers.map(_.get).toVector))
}
def similarityMulti(sigs: MinHashSignature*): Double = {
buildArrayMulti(sigs.map(_.bytes))(vals => if (vals.forall(_ =? vals.head)) n.one else n.zero)
.map(_.toDouble)
.sum / numHashes
}
}
026ea21
to
2ed56b1
Compare
Codecov Report
@@ Coverage Diff @@
## develop #654 +/- ##
===========================================
- Coverage 83.14% 82.67% -0.48%
===========================================
Files 109 109
Lines 5222 5235 +13
Branches 317 475 +158
===========================================
- Hits 4342 4328 -14
- Misses 880 907 +27
Continue to review full report at Codecov.
|
I just pushed a new commit which uses a technique close to your suggestion. I added an implicit wrapper class to the companion object and implemented similarityMulti. However, I still had to add companion objects which deal with 32 and 16 bit case specific additional concrete implementations. This way, similarityMulti() is just an extension method , and client can call it on any minhasher. No need to change the abstract or the derived classes and binary compatibility is preserved. |
Hi,
I needed a generalized version of the similarity calculation in the minhasher. The current version only supports similarity(A, B). So I've implemented similarityMulti(sigs*) and currently using it to calculate intersection size of multiple sets based on hyperloglog unions and minhasher similarity estimation.
The generalized Jaccard similarity estimation for multiple sets is Jsim(S1..Sn) = |S1 n S2 n .. Sn| / |S1 u S2 u .. Sn|. Estimation based on minhash is calculated using; Jsim(S1..Sn) = P(hmin1 == hmin2 == ... == hminn) / numHashes.
Using this approach, it's possible to calculate estimates of intersections of multiple sets as follows;
intersection(set1..setN) = unionSize(set1..setN) * Jsim(set1..setN)
unionSize estimation comes from hyperloglog, and Jsim(set1..setN) is now implemented by similarityMulti() in the MinHasher.
PS: The existing hyperloglog intersection has a higher error rate than this approach and because of the recursive implementation, the runtime complexity of O(N!) where N is the number of sets to intersect. The above approach has a linear runtime complexity and better error rate. I'm planning to submit another pull request for the new intersection calculation soon.