I believe the pre-tokenizer used by the Solr client code is a bug because it produces tokens that cause misalignment with Solr, in particular around text that has a domain name like foo.com. I submitted a report to the mailing list and mentioned it in the JanusGraph Discord. @li-boxuan suggested opening a PR after a few days of no responses. Below is a repeat of what I posted on the mailing list, included here as a GitHub issue for archival purposes.
I've been working on transitioning my JanusGraph deployment's mixed index backend from ElasticSearch to Solr in order to have Solr's shard splitting functionality as my index grows.
However, one of my test cases started to fail after making the switch. The test inserts a text document into a property (let's call the property "text_document_property"). That document contains a URL along the lines of "https://foo.com/abc123".
My test case would then test the retrieval of this vertex with a query along the lines of "g.V().has('vertex_label', 'text_document_property', textContains('https://foo.com/abc123'))".
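For context, here's a minimal sketch of the failing round trip (the properties file path, label, and property names are placeholders, and the real test is more involved):

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.janusgraph.core.JanusGraph;
import org.janusgraph.core.JanusGraphFactory;
import static org.janusgraph.core.attribute.Text.textContains;

public class TextContainsRepro {
    public static void main(String[] args) {
        // Placeholder config pointing at a Solr-backed mixed index.
        JanusGraph graph = JanusGraphFactory.open("conf/janusgraph-cql-solr.properties");
        GraphTraversalSource g = graph.traversal();

        // Insert a text document that contains a URL.
        g.addV("vertex_label")
         .property("text_document_property", "See https://foo.com/abc123 for details")
         .iterate();
        graph.tx().commit();

        // Passed with an ElasticSearch mixed index; fails once the backend is Solr.
        boolean found = g.V()
                .has("vertex_label", "text_document_property", textContains("https://foo.com/abc123"))
                .hasNext();
        System.out.println("found = " + found);

        graph.close();
    }
}
```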
When my mixed index backend was ElasticSearch, this passed.
I think that's because the criteria is sent to ES "as-is" from JanusGraph, and ElasticSearch does the tokenizing on its side. That's based on the ES client code that gets invoked here and then ultimately converted into ES constructs here.
For Solr, however, it looks like JanusGraph does the tokenization itself, here.
When Text.tokenize is given "https://foo.com/abc123" it produces the tokens "https", "foo", "com", "abc123".
But the default schema.xml provided by JanusGraph, if I'm reading it correctly, dictates usage of solr.StandardTokenizerFactory. According to Solr's docs on the Standard Tokenizer:
This tokenizer splits the text field into tokens, treating whitespace and punctuation as delimiters. Delimiter characters are discarded, with the following exceptions:
Periods (dots) that are not followed by whitespace are kept as part of the token, including Internet domain names.
The "@" character is among the set of token-splitting punctuation, so email addresses are not preserved as single tokens.
Consequently, I believe this is where an incongruence is introduced. When I submit my document, the Solr-side tokenizer will produce a token stream of "https", "foo.com", "abc123" due to the first exception listed above. I've at least verified this in the Solr admin console when submitting a query.
text_document_property_t:(https foo.com abc123) works as desired, and I can see it returning the expected response in the Solr web console. However, JanusGraph's Text.tokenize produces "https", "foo", "com", "abc123", which under the covers becomes the Solr query text_document_property_t:(+https +foo +com +abc123) (the +'s coming from here, to convey to Solr that all tokens must be present). It's readily apparent that the separate "+com" token will cause a failure to match.
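Roughly speaking, the client-side path amounts to something like the following (an illustration of the shape of the query only, not the actual SolrIndex code):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.janusgraph.core.attribute.Text;

public class ContainsQueryShape {
    public static void main(String[] args) {
        // Pre-tokenize on the JanusGraph side, then require every token in Solr.
        List<String> tokens = Text.tokenize("https://foo.com/abc123");
        String solrQuery = "text_document_property_t:("
                + tokens.stream().map(t -> "+" + t).collect(Collectors.joining(" "))
                + ")";
        System.out.println(solrQuery);
        // -> text_document_property_t:(+https +foo +com +abc123)
        // The lone "+com" term is what prevents the match against the indexed "foo.com".
    }
}
```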
If I'm understanding the schema.xml correctly, however, we're already configuring Solr with a designated tokenizer for queries, so I don't think we need to do this client-side tokenizing at all. For what it's worth, running the query text_document_property_t:(https://foo.com/abc123) in Solr's web console (with the : and / characters escaped; in JanusGraph I believe that escaping would be handled here) worked as expected, so presumably Solr is tokenizing on its backend.
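For reference, SolrJ ships a helper that performs this kind of escaping (whether this is exactly what JanusGraph uses for the other predicates is an assumption on my part):

```java
import org.apache.solr.client.solrj.util.ClientUtils;

public class EscapeExample {
    public static void main(String[] args) {
        String raw = "https://foo.com/abc123";
        // Escapes Lucene/Solr query metacharacters such as ':' and '/'.
        String escaped = ClientUtils.escapeQueryChars(raw);
        System.out.println("text_document_property_t:(" + escaped + ")");
        // Prints something like: text_document_property_t:(https\:\/\/foo.com\/abc123)
    }
}
```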
As further evidence of this configured Solr-side tokenizer, JanusGraph appears to do this client-side tokenizing only for the textContains clause and not for the other text predicates. Accordingly, executing "g.V().has('vertex_label', 'text_document_property', textContainsPrefix('foo.com'))" appears to work, since it doesn't attempt to pre-tokenize the submission before sending it to Solr.
It appears the use of Text.tokenize was introduced back in 2014; prior to then, the lower case of the string was used as given (comment). Solr 4 was released in 2012, with ongoing releases into 2014, and Solr 4 broke with the original tokenizer (now called the Classic Tokenizer), so I'm wondering if this incongruence went unnoticed at the time because JanusGraph (Titan back then) was still primarily targeting Solr 3.
Happy to port this into a GitHub issue if there's agreement it's not working as intended. On the surface it seems like it could be a simple code change, if all it takes is removing the special casing around textContains, folding it into the same handling logic as the other Text predicates, and letting the Solr-side tokenizer do the work.
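To sketch that direction (purely illustrative; the class and method names here are made up and this is not the actual SolrIndex code), textContains would hand the escaped value to Solr and let the query-time analyzer configured in schema.xml do the tokenizing:

```java
import org.apache.solr.client.solrj.util.ClientUtils;

public class ProposedContainsHandling {
    // Hypothetical helper: no client-side Text.tokenize; escape the raw value
    // and let Solr's query-time analyzer split it, just like the other predicates.
    static String buildContainsClause(String solrField, String value) {
        return solrField + ":(" + ClientUtils.escapeQueryChars(value) + ")";
    }

    public static void main(String[] args) {
        System.out.println(buildContainsClause("text_document_property_t", "https://foo.com/abc123"));
        // -> text_document_property_t:(https\:\/\/foo.com\/abc123)
    }
}
```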
Closes JanusGraph#4164
Added new test document and updated assertions
Use tokenizer that mimics Solr's standardized tokenizer
Geo predicate tweaking
Implemented TextContainsPhrase for Solr
Signed-off-by: Allan Clements <[email protected]>