
Make BM25 parameters (k1, b) configurable #2924

@kkkqkx123

Is your feature request related to a problem? Please describe.

Currently, the BM25 scoring parameters k1 and b are hardcoded as module-level constants (K1 = 1.2, B = 0.75) in src/query/bm25.rs. This prevents users from tuning these parameters for domain-specific relevance optimization.

While (k1=1.2, b=0.75) works well as a general default (matching Lucene/Elasticsearch), different applications may require different values:

  • Short vs. long documents: Fields with significantly different average lengths (e.g., titles vs. full-text articles) may benefit from different b values (0.3-0.9 range)
  • Domain-specific corpora: Legal/medical documents or code repositories often have term frequency distributions that differ from general text, benefiting from adjusted k1 (1.0-2.0 range)
  • A/B testing: Production systems need to experiment with parameter values to optimize search relevance metrics (NDCG, MRR, etc.)
  • Code search optimization: In our code-context-engine project, we maintain a Tantivy fork solely to adjust these parameters, which creates a maintenance burden
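
To make the effect of the two parameters concrete, here is a minimal sketch of BM25 term scoring in its commonly written form (Lucene-style idf). It is not copied from src/query/bm25.rs; the struct and function names are illustrative assumptions. k1 controls how quickly repeated occurrences of a term saturate, and b controls how strongly the document length is normalized against the average.

```rust
/// Hypothetical parameter holder, mirroring the struct proposed in this issue.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bm25Params {
    pub k1: f32,
    pub b: f32,
}

/// Inverse document frequency in the Lucene-style form:
/// ln(1 + (N - df + 0.5) / (df + 0.5)).
pub fn idf(doc_count: u64, doc_freq: u64) -> f32 {
    let n = doc_count as f32;
    let df = doc_freq as f32;
    (1.0 + (n - df + 0.5) / (df + 0.5)).ln()
}

/// Length-normalized term-frequency factor.
/// With b = 0 the document length is ignored; with b = 1 it is fully normalized.
pub fn tf_factor(params: Bm25Params, term_freq: f32, doc_len: f32, avg_doc_len: f32) -> f32 {
    let norm = params.k1 * (1.0 - params.b + params.b * doc_len / avg_doc_len);
    term_freq / (term_freq + norm)
}

/// BM25 score of one term in one document.
pub fn bm25_score(
    params: Bm25Params,
    doc_count: u64,
    doc_freq: u64,
    term_freq: f32,
    doc_len: f32,
    avg_doc_len: f32,
) -> f32 {
    idf(doc_count, doc_freq) * (1.0 + params.k1) * tf_factor(params, term_freq, doc_len, avg_doc_len)
}
```

With b = 0.75, a document twice the average length scores lower than one half the average length for the same term frequency. Setting b = 0 removes that difference entirely, which is why short fields such as titles may prefer a smaller b.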

Describe the solution you'd like

Add a Bm25Params struct to IndexSettings, allowing per-index BM25 configuration that persists across restarts.

Proposed API:

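use tantivy::{Index, IndexSettings};
use tantivy::query::Bm25Params; // proposed type, does not exist yet

let bm25_params = Bm25Params { k1: 1.5, b: 0.6 };
let settings = IndexSettings {
    bm25_params: Some(bm25_params), // proposed field
    ..Default::default()
};
// Index::create_in_dir takes only (path, schema), so the settings
// go through the existing builder.
let index = Index::builder()
    .schema(schema)
    .settings(settings)
    .create_in_dir(path)?;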

Users who don't specify custom parameters continue to use the default (k1=1.2, b=0.75).

Implementation approach:

  1. Add Bm25Params { k1: f32, b: f32 } with Default, Serialize, and Deserialize
  2. Add bm25_params: Option<Bm25Params> to IndexSettings with #[serde(default)]
  3. Pass the parameters through the query execution chain to Bm25Weight construction
  4. When bm25_params is None, fall back to default constants
  5. Update call sites in TermQuery, PhraseQuery, BooleanQuery, and BlockWand
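
Steps 1, 2, and 4 can be sketched as follows. This is a hedged illustration, not the final design: IndexSettingsSketch is a stand-in for the real IndexSettings and holds only the proposed field, and effective_bm25_params is a hypothetical helper name. In the actual change, Bm25Params would also derive serde's Serialize and Deserialize, and the field would carry #[serde(default)]; both are left out here so the sketch builds without external crates.

```rust
/// Proposed parameter struct (step 1). The serde derives are omitted in this sketch.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Bm25Params {
    pub k1: f32,
    pub b: f32,
}

impl Default for Bm25Params {
    /// Same values as the current hardcoded constants K1 and B.
    fn default() -> Self {
        Bm25Params { k1: 1.2, b: 0.75 }
    }
}

/// Stand-in for IndexSettings showing only the new optional field (step 2).
#[derive(Clone, Debug, Default, PartialEq)]
pub struct IndexSettingsSketch {
    pub bm25_params: Option<Bm25Params>,
}

impl IndexSettingsSketch {
    /// Hypothetical helper for step 4: fall back to the defaults when unset.
    pub fn effective_bm25_params(&self) -> Bm25Params {
        self.bm25_params.unwrap_or_default()
    }
}
```

Keeping the field as an Option means settings written before this change, which have no bm25_params entry, resolve to the same values as today.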

This approach provides:

  • Backward compatibility (old meta.json loads successfully)
  • Persistence across restarts (stored in meta.json)
  • Per-index configurability
  • No index format version bump required

[Optional] describe alternatives you've considered

Other approaches (environment variables, compile-time features, maintaining a fork) either cannot provide per-index runtime configurability or create an unacceptable maintenance burden. The proposed approach follows Tantivy's existing patterns (similar to TokenizerManager).

Additional context

  • Constants are defined in src/query/bm25.rs lines 8-9
  • This would make Tantivy more competitive with Elasticsearch/Lucene for production use cases
  • I'm willing to implement this and would appreciate feedback on the API design before starting
