Fill positions in NgramTokenizer #2471

Open
wants to merge 2 commits into main
Conversation

darashi commented Aug 6, 2024

I am looking into ways to make searches work for queries that include non-ASCII characters. I want to find all documents that contain the query string as a substring, regardless of where the word delimiters fall.

For this purpose, I tried combining NgramTokenizer with PhraseQuery, but it did not work as I expected: the order of the tokens seems to be ignored. I looked into the cause and found that the position was changed to all zeros in commit e75bb1d.

To be honest, I don't fully understand the reason for that change, but if it was not required for consistency with the spec or something similar, why not fill in the position value?

So I made a patch that fills position with the offset at which the token starts, expressed in characters (not bytes).

There may be a more appropriate way to do this, and there may be something I've missed, but I would appreciate it if you could take a look.
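
To sketch the idea in isolation (an illustration with a hypothetical helper, not the actual patch): the character offset can be recovered by counting the characters that precede the gram's starting byte.

// sketch.rs (illustration only)

/// Hypothetical helper: convert a byte offset into a character offset,
/// which is the value the patch proposes to store in the token's position.
fn char_position(text: &str, byte_offset: usize) -> usize {
    text[..byte_offset].chars().count()
}

fn main() {
    let text = "本日は晴天なり";
    // "日" starts at byte 3 (each of these characters takes 3 bytes in UTF-8),
    // but it is the second character, i.e. character offset 1.
    assert_eq!(char_position(text, 3), 1);
}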

As a supplement, I've used the following code to check the overall behavior:

// main.rs

use tantivy::collector::TopDocs;
use tantivy::query::QueryParser;
use tantivy::schema::*;
use tantivy::tokenizer::NgramTokenizer;
use tantivy::{doc, Index, IndexWriter, ReloadPolicy};

fn main() -> tantivy::Result<()> {
    let mut schema_builder = SchemaBuilder::default();

    let text_options = TextOptions::default()
        .set_stored()
        .set_indexing_options(
            TextFieldIndexing::default()
                .set_index_option(IndexRecordOption::WithFreqsAndPositions)
                .set_tokenizer("ngram"),
        );

    schema_builder.add_text_field("text", text_options);
    let schema = schema_builder.build();

    let index = Index::create_in_ram(schema.clone());
    let mut index_writer: IndexWriter = index.writer(50_000_000)?;

    let tokenizer = NgramTokenizer::new(1, 1, false)?;
    index.tokenizers().register("ngram", tokenizer.clone());

    let text = schema.get_field("text").unwrap();

    // This is the only document in the index
    index_writer.add_document(doc!(
        text => "本日は晴天なり", // 本日は晴天なり: Today is a fine day
    ))?;
    index_writer.commit()?;

    let reader = index
        .reader_builder()
        .reload_policy(ReloadPolicy::OnCommitWithDelay)
        .try_into()?;
    let searcher = reader.searcher();

    let query_parser = QueryParser::for_index(&index, vec![text]);

    // ✅ Work as expected
    let query = query_parser.parse_query(r#""本日は""#)?; // 本日は: Today is
    println!("Query: {:?}", query);
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
        println!("Matched (as expected): {}", retrieved_doc.to_json(&schema));
    }

    // ❌ Should not return any results
    println!();
    let query = query_parser.parse_query(r#""日本は""#)?; // 日本は: Japan is
    println!("Query: {:?}", query);
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    for (_score, doc_address) in top_docs {
        let retrieved_doc: TantivyDocument = searcher.doc(doc_address)?;
        println!(
            "Matched (but not expeced): {}",
            retrieved_doc.to_json(&schema)
        );
    }

    Ok(())
}

PSeitz (Contributor) commented Aug 29, 2024

@darashi I think removing the hard-coded 0 in position seems good, but I'm not sure what would be best in terms of the spec: position is expressed in number of tokens, and the number of tokens is ambiguous in the case of NgramTokenizer.

    /// Position, expressed in number of tokens.
    pub position: usize,

It would be good to know how Lucene handles this; the docs here are a little unclear to me. Instead of the position increment, I'd be interested in the absolute position that is emitted.
https://lucene.apache.org/core/9_11_1/analysis/common/org/apache/lucene/analysis/ngram/NGramTokenizer.html

What do you think? Can you check what tokens Lucene emits?

darashi (Author) commented Aug 29, 2024

Thank you for your comment! The offset of an n-gram token varies depending on N, so I agree that it is not simple to determine the position uniquely, especially when multiple values of N are mixed together. Even so, I think it would be reasonable to express the position in terms of the number of Unicode characters.

I'm a novice with Java, but I managed to get Lucene's NGramTokenizer running.

For hεllo (which contains a non-ASCII character) with minGram=2 and maxGram=3:

NGramTokenizer@6576fe71 term=hε,bytes=[68 ce b5],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=hεl,bytes=[68 ce b5 6c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=εl,bytes=[ce b5 6c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=εll,bytes=[ce b5 6c 6c],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ll,bytes=[6c 6c],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=llo,bytes=[6c 6c 6f],startOffset=2,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=lo,bytes=[6c 6f],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=,bytes=[],startOffset=5,endOffset=5,positionIncrement=0,positionLength=1,type=word,termFrequency=1

and for hello with minGram=2 and maxGram=3:

NGramTokenizer@6576fe71 term=he,bytes=[68 65],startOffset=0,endOffset=2,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=hel,bytes=[68 65 6c],startOffset=0,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=el,bytes=[65 6c],startOffset=1,endOffset=3,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ell,bytes=[65 6c 6c],startOffset=1,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=ll,bytes=[6c 6c],startOffset=2,endOffset=4,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=llo,bytes=[6c 6c 6f],startOffset=2,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=lo,bytes=[6c 6f],startOffset=3,endOffset=5,positionIncrement=1,positionLength=1,type=word,termFrequency=1
NGramTokenizer@6576fe71 term=,bytes=[],startOffset=5,endOffset=5,positionIncrement=0,positionLength=1,type=word,termFrequency=1

If I'm not mistaken, startOffset indicates the absolute offset in characters. Regardless of the presence of non-ASCII characters, the same values are returned in both cases, namely [0, 0, 1, 1, 2, 2, 3, 5].

I used the code in https://github.com/darashi/hellolucene to obtain the output above.

PSeitz (Contributor) commented Sep 6, 2024

Thanks for checking. Unfortunately, they only emit positionIncrement, so it's still unclear what they use for position as an absolute value.

> If I'm not mistaken, startOffset indicates the absolute offset in characters. Regardless of the presence of non-ASCII characters, the same values are returned in both cases, namely [0, 0, 1, 1, 2, 2, 3, 5].

In tantivy we use byte offsets; maybe they use something else?
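
For comparison (a quick illustrative snippet, not part of this PR), byte and character offsets already diverge for hεllo right after the ε:

// offsets.rs (illustration only)

fn main() {
    let text = "hεllo";
    // char_indices() yields byte offsets; enumerate() adds character offsets.
    for (char_idx, (byte_idx, ch)) in text.char_indices().enumerate() {
        println!("char {char_idx} starts at byte {byte_idx}: {ch}");
    }
    // The third character 'l' is at character offset 2 but byte offset 3,
    // because 'ε' occupies two bytes in UTF-8.
}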

Can you add tests for the ngram tokenizer with different settings, combined with phrase queries?
By the way, I think the regular PhraseQuery is quite wasteful here, as it will match against all tokens.
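
To make the last point concrete (an illustrative sketch reusing the text field from the example above; not code from this PR): with 1-grams, a phrase such as 日本は parses into one term per character, so the phrase scorer has to walk and intersect the position lists of every single gram.

// Illustration only: building the equivalent phrase query by hand.
use tantivy::query::PhraseQuery;
use tantivy::schema::Field;
use tantivy::Term;

fn phrase_over_unigrams(text_field: Field) -> PhraseQuery {
    let terms = vec![
        Term::from_field_text(text_field, "日"),
        Term::from_field_text(text_field, "本"),
        Term::from_field_text(text_field, "は"),
    ];
    // One term (and one position list) per gram is evaluated by the scorer.
    PhraseQuery::new(terms)
}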
