Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalized similarity calculation for minhasher #654

Open
wants to merge 2 commits into
base: develop
Choose a base branch
from

Conversation

ibenian
Copy link

@ibenian ibenian commented Feb 25, 2018

Hi,
I needed a generalized version of the similarity calculation in the minhasher. The current version only supports similarity(A, B). So I've implemented similarityMulti(sigs*) and currently using it to calculate intersection size of multiple sets based on hyperloglog unions and minhasher similarity estimation.
The generalized Jaccard similarity estimation for multiple sets is Jsim(S1..Sn) = |S1 n S2 n .. Sn| / |S1 u S2 u .. Sn|. Estimation based on minhash is calculated using; Jsim(S1..Sn) = P(hmin1 == hmin2 == ... == hminn) / numHashes.
Using this approach, it's possible to calculate estimates of intersections of multiple sets as follows;
intersection(set1..setN) = unionSize(set1..setN) * Jsim(set1..setN)
unionSize estimation comes from hyperloglog, and Jsim(set1..setN) is now implemented by similarityMulti() in the MinHasher.
PS: The existing hyperloglog intersection has a higher error rate than this approach and because of the recursive implementation, the runtime complexity of O(N!) where N is the number of sets to intersect. The above approach has a linear runtime complexity and better error rate. I'm planning to submit another pull request for the new intersection calculation soon.

@CLAassistant
Copy link

CLAassistant commented Feb 25, 2018

CLA assistant check
All committers have signed the CLA.

Copy link
Collaborator

@johnynek johnynek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR! Let's see if we can make it binary compatible. Also, can you rebase on develop? It looks like we have a bunch of stray tag PRs in here.

Thanks again!

* Generalized Jaccard similarity estimation for multiple sets (size of intersection / size of union).
* Jsim(S1..Sn) = P(hmin1 == hmin2 == ... == hminn) / numHashes
*/
def similarityMulti(sigs: MinHashSignature*): Double = {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you put this on the companion object? that should preserve binary compatibility and we can release it in the 0.13 branch.

* Decode multiple signatures into hash values and combine them using given fn.
* This version is used by the new similarityMulti.
*/
protected def buildArrayMulti(buffers: Seq[Array[Byte]])(fn: Seq[H] => H): Array[Byte]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you implement this as a private method on the companion that matches the case class we have, I think we can make this binary compatible.

private def buildArrayMulti[H](h: MinHasher[H], hasherbuffers: Seq[Array[Byte]])(fn: Seq[H] => H): Array[Byte] =
  h match {

Copy link
Author

@ibenian ibenian Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. Actually, buildArrayMulti is an abstract method overridden in the concrete classes and it calls other buildArray virtual methods. If I move it to companion object, I'll have to move all of them to companion object. Does it make sense to extend MinHasher32 and implement a new MinHasher32MultiSig with the new methods (same for 16bit version)? This would avoid binary compatibility issue and I can still access the protected methods and vals.

class MinHasher32MultiSig(numHashes: Int, numBands: Int)(implicit n: Numeric[Int])
  extends MinHasher32(numHashes, numBands) {
  protected def buildArrayMulti(buffers: Seq[Array[Byte]])(fn: Vector[Int] => Int): Array[Byte] = {
    val intBuffers = buffers.map(b => ByteBuffer.wrap(b).asIntBuffer)
    buildArray(fn(intBuffers.map(_.get).toVector))
  }

  def similarityMulti(sigs: MinHashSignature*): Double = {
    buildArrayMulti(sigs.map(_.bytes))(vals => if (vals.forall(_ =? vals.head)) n.one else n.zero)
      .map(_.toDouble)
      .sum / numHashes
  }
}

@ibenian ibenian force-pushed the minhash_multiset_similarity branch from 026ea21 to 2ed56b1 Compare March 3, 2018 20:34
@codecov-io
Copy link

Codecov Report

Merging #654 into develop will decrease coverage by 0.47%.
The diff coverage is 53.84%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #654      +/-   ##
===========================================
- Coverage    83.14%   82.67%   -0.48%     
===========================================
  Files          109      109              
  Lines         5222     5235      +13     
  Branches       317      475     +158     
===========================================
- Hits          4342     4328      -14     
- Misses         880      907      +27
Impacted Files Coverage Δ
...rc/main/scala/com/twitter/algebird/MinHasher.scala 66.21% <53.84%> (-2.64%) ⬇️
...src/main/scala/com/twitter/algebird/Interval.scala 74.78% <0%> (-9.57%) ⬇️
...in/scala/com/twitter/algebird/scalacheck/Gen.scala 91.66% <0%> (-8.34%) ⬇️
...ala/com/twitter/algebird/ApproximateProperty.scala 76.08% <0%> (-4.35%) ⬇️
.../main/scala/com/twitter/algebird/BloomFilter.scala 92.27% <0%> (-3.01%) ⬇️
.../main/scala/com/twitter/algebird/Applicative.scala 64.51% <0%> (+3.22%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 52e53d2...2ed56b1. Read the comment docs.

@ibenian
Copy link
Author

ibenian commented Mar 3, 2018

I just pushed a new commit which uses a technique close to your suggestion. I added an implicit wrapper class to the companion object and implemented similarityMulti. However, I still had to add companion objects which deal with 32 and 16 bit case specific additional concrete implementations. This way, similarityMulti() is just an extension method , and client can call it on any minhasher. No need to change the abstract or the derived classes and binary compatibility is preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants