Is Euclidean distance not supported? #33

Open
bwlim opened this issue Jun 26, 2014 · 4 comments
Comments

@bwlim
bwlim commented Jun 26, 2014

I'm very happy to see an open source vector database!
Simbase is great for me, thanks :D

I have a question (or maybe a new feature request).
The supported similarity (score) functions are "cosinesq" and "jensenshannon".
The cosine similarity function does not take vector magnitude into account.
But in my application, vector magnitude is meaningful for similar-vector search.
I would like a similarity function based on Euclidean distance to be supported as well :D
Could you give me some guidance? Thanks for your great vector DB :D
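
To illustrate the point about magnitude, here is a minimal standalone sketch. It is not Simbase code: the class `MagnitudeDemo` and its methods are hypothetical names used only for this example. It compares a vector with a scaled copy of itself, where cosine similarity stays at about 1 while the Euclidean distance grows with the scale factor.

```java
// Hypothetical demo class, not part of Simbase.
class MagnitudeDemo {
    // Cosine similarity: depends only on the angle between a and b.
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Euclidean distance: sensitive to both direction and magnitude.
    static double euclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] v = {1, 2, 3};
        double[] w = {2, 4, 6}; // same direction as v, twice the magnitude
        System.out.println("cosine    = " + cosine(v, w));    // approximately 1
        System.out.println("euclidean = " + euclidean(v, w)); // sqrt(14), clearly non-zero
    }
}
```

Under a cosine-based score, `v` and `w` look identical; under Euclidean distance they do not, which is the behaviour asked for here.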

@mountain
Member

Thanks for your interest in our project.

It is possible to support Euclidean distance. Please take a look at the "score" package:

The implementation of a score function has two suites of APIs: one for dense vector sets and one for sparse vector sets. The rest of the API consists of event hooks.

If you could implement this feature yourself, that would be very welcome. Otherwise we can take it on, but it would not be done until late next week.

Thanks again.

@mountain
Member

Here is a quick implementation, without verification or tests. Please check changeset 099ecf1 and help us review it. If there is no problem, I will close the issue tomorrow.

And @bwlim please give us feedback on this issue. Thanks!

@bwlim
Author

bwlim commented Jun 30, 2014

Supporting Manhattan distance also seems very good, thanks!

But I couldn't fully understand the integer vector score function, because I haven't fully read and understood the simbase code:

    @Override
    public float score(String srcVKey, int srcId, int[] source, int srclen, String tgtVKey, int tgtId, int[] target,
            int tgtlen) {

I'm just in the planning phase of a new service, so I cannot test the simbase code right now.
I don't have a working system or test data yet (this is a hobby project with my wife :D).
I will test simbase later. I'm sorry.

@mountain
Member

Hi @bwlim,

The integer vector API is for sparse vectors. Sparsity is very common in high-dimensional data, and in that scenario a dense storage format is very inefficient, so we introduced a sparse storage format.

For example, given a 1024-dimensional base, the two formats look like this:

  • dense storage format: cmp1, cmp2, ..., cmp1024
  • sparse storage format: idx1, cmp1, idx2, cmp2, ... (where cmpi is a non-zero component and idxi is the index of that component)
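
As a rough illustration of the two formats above, the following sketch converts a dense integer vector into interleaved (index, component) pairs and computes a Euclidean distance directly on the sparse form. It is not the code from changeset 099ecf1 and not the Simbase API: the class `SparseEuclideanSketch` and its methods are hypothetical. It also assumes that indices are 0-based and sorted in ascending order, and that `srclen` and `tgtlen` count the used `int` slots of each array; none of this is confirmed in the thread.

```java
// Hypothetical sketch, not part of Simbase.
class SparseEuclideanSketch {
    // Dense -> sparse: keep only non-zero components as (index, component) pairs.
    static int[] toSparse(int[] dense) {
        int nonZero = 0;
        for (int c : dense) {
            if (c != 0) nonZero++;
        }
        int[] sparse = new int[2 * nonZero];
        int pos = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0) {
                sparse[pos++] = i;        // idx
                sparse[pos++] = dense[i]; // cmp
            }
        }
        return sparse;
    }

    // Euclidean distance between two sparse vectors.
    // Assumes indices are sorted ascending; a missing index means the component is 0.
    static double euclidean(int[] source, int srclen, int[] target, int tgtlen) {
        double sum = 0;
        int s = 0, t = 0;
        while (s < srclen && t < tgtlen) {
            int si = source[s], ti = target[t];
            if (si == ti) {            // index present in both vectors
                double d = (double) source[s + 1] - target[t + 1];
                sum += d * d;
                s += 2;
                t += 2;
            } else if (si < ti) {      // index present only in source
                double v = source[s + 1];
                sum += v * v;
                s += 2;
            } else {                   // index present only in target
                double v = target[t + 1];
                sum += v * v;
                t += 2;
            }
        }
        for (; s < srclen; s += 2) {   // leftover source components
            double v = source[s + 1];
            sum += v * v;
        }
        for (; t < tgtlen; t += 2) {   // leftover target components
            double v = target[t + 1];
            sum += v * v;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        int[] a = toSparse(new int[] {0, 3, 0, 0, 4}); // {1, 3, 4, 4}
        int[] b = toSparse(new int[] {0, 0, 0, 0, 4}); // {4, 4}
        System.out.println(euclidean(a, a.length, b, b.length)); // prints 3.0
    }
}
```

The merge-style walk touches only stored components, so the cost depends on the number of non-zero entries rather than on the full dimension of the base.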
