xhluca commented Nov 26, 2025

Discussion: #156

…CLI integration, examples, and documentation. (#144)
…mba backend (#149)

* fix index_nq example

* add np_csc to replace scipy

* updates

* Enhance BM25 class to support CSC matrix construction with optional scipy backend

* Remove scipy from install_requires and ensure it's listed under extras_require for indexing

* add auto compiling for numba

* Refactor CSC construction methods to remove dtype parameter. Update BM25 class to ensure data type conversion occurs within the method. This change simplifies the interface and improves consistency across CSC implementations.

* Update README.md to reflect changes in BM25S implementation, highlighting the transition from Scipy to Numpy for performance improvements and clarifying disk usage details for various environments.

* Update GitHub Actions workflow to trigger tests on both main and dev branches

* Refactor GitHub Actions workflow to streamline core test execution and reintroduce coverage reporting in a separate step.

* Reorder steps in GitHub Actions workflow to install Numba before running core tests, ensuring compatibility with updated version requirements.

* Comment out caching of Python dependencies in GitHub Actions workflow to simplify the process and avoid potential issues with dependency management.

* Reintroduce caching of Python dependencies in GitHub Actions workflow and update core test execution to include coverage with faulthandler for improved error reporting.

* Reorganize GitHub Actions workflow to run Numba tests after core tests, ensuring proper execution order and maintaining coverage reporting.

* Update GitHub Actions workflow to install Numba before running core tests and remove faulthandler from coverage command for cleaner execution.

* Enhance GitHub Actions workflow by setting global environment variables to prevent threading issues, adding concurrency handling for core tests, and separating numba tests to avoid coverage interference.

* Update GitHub Actions workflow to use --timid option in coverage command for improved tracing and avoid C-extension crashes during core tests.

* Update GitHub Actions workflow to utilize faulthandler for improved error reporting during core and numba tests execution.

* Refactor GitHub Actions workflow to disable Numba JIT compilation and set threading environment variables, improving stability during core tests execution.

* Refactor core test execution in GitHub Actions workflow by removing Numba JIT compilation environment variable and uninstalling Numba, enhancing stability and compatibility during tests.

* Update GitHub Actions workflow to uninstall Numba with the -y flag, streamlining the test execution process and ensuring a clean environment for core tests.

* Refactor GitHub Actions workflow to use unittest discovery for Numba tests, simplifying test execution and improving maintainability.

* Refactor GitHub Actions workflow to add a dedicated Numba test job, streamline dependency installation, and ensure proper execution order for Numba tests, enhancing test reliability and maintainability.

* Enhance GitHub Actions workflow by adding environment variables to disable TQDM monitor thread and Intel SVML, and to control AVX usage, improving stability during Numba tests execution.

* Update GitHub Actions workflow to rename TQDM_DISABLE environment variable to DISABLE_TQDM for consistency and clarity in test execution settings.

* Refactor tqdm integration across multiple modules to allow disabling via DISABLE_TQDM environment variable, improving flexibility and compatibility in environments without tqdm support.

* Refactor _faketqdm function across multiple modules to simplify its signature, enhancing code clarity and consistency in handling iterable arguments.

* Change Numba parallelization setting in _retrieve_internal_jitted_parallel function to improve performance consistency.

* Remove warmup call for numba scorer and change parallelization setting in _retrieve_internal_jitted_parallel function to enhance performance and maintainability.

* Enable warmup call for numba scorer and comment out warmup scoring logic in BM25 class to improve clarity and maintainability.

* Refactor warmup scoring logic in BM25 class to restore functionality and improve clarity by removing commented-out code.

* Refactor BM25 compile method to conditionally activate Numba and warmup processes based on parameters, improving flexibility and clarity in JIT compilation handling.

* Update BM25 class to ensure dtype is consistently wrapped in np.dtype and modify test cases to include auto_compile parameter, enhancing compatibility and flexibility in Numba backend operations.

* Add auto_compile parameter to BM25 class for improved JIT compilation

* Update BM25 compile method to disable warmup by default, enhancing JIT compilation behavior and improving performance consistency.
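
For readers unfamiliar with the CSC layout that the `np_csc` commits above build without SciPy, here is a minimal, dependency-free sketch of converting (row, column, value) triplets into the three CSC arrays. It is not the code from this PR: the function name `build_csc` is hypothetical, plain lists stand in for NumPy arrays, and the mapping of rows to document IDs and columns to token IDs is an assumption.

```python
def build_csc(rows, cols, data, n_cols):
    """Convert COO triplets into CSC arrays: (data, indices, indptr).

    Hypothetical helper for illustration only.
    rows[i], cols[i], data[i] describe one non-zero entry.
    """
    # 1. Count how many non-zero entries fall in each column.
    counts = [0] * n_cols
    for c in cols:
        counts[c] += 1

    # 2. Prefix-sum the counts: column c occupies indptr[c]:indptr[c + 1].
    indptr = [0] * (n_cols + 1)
    for c in range(n_cols):
        indptr[c + 1] = indptr[c] + counts[c]

    # 3. Scatter each entry into the next free slot of its column.
    next_slot = indptr[:-1]  # slicing copies, so indptr is not mutated
    out_data = [0.0] * len(data)
    out_indices = [0] * len(data)
    for r, c, v in zip(rows, cols, data):
        pos = next_slot[c]
        out_data[pos] = v
        out_indices[pos] = r
        next_slot[c] += 1

    return out_data, out_indices, indptr


# Three documents (rows), three tokens (columns), four non-zero scores.
csc_data, csc_indices, csc_indptr = build_csc(
    rows=[0, 0, 1, 2], cols=[0, 2, 2, 1], data=[1.0, 2.0, 3.0, 4.0], n_cols=3
)
```

With this layout, all scores for one token sit in a contiguous slice, which is what makes per-query-token lookups cheap.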

xhluca and others added 7 commits November 26, 2025 09:59

* improve readme

* docs: Update example index path and query in README and verification script.

* sync (#152)

* fix index_nq example (#146)

* Remove scipy, add indexing via numpy/numba, add auto-compiling for numba backend (#149)

* sync

* feat: Implement high-level BM25 search API with indexing and searching capabilities

* feat: Update BM25 class to support parameter overriding and enhance BM25Search initialization

* refactor: Enhance BM25Search class with type hints and improve search method to return structured results for multiple queries

* Add dummy data for testing

* feat: Enhance BM25Search class with improved query handling and document loading capabilities

- Updated the search method to handle empty queries and ensure k does not exceed the corpus size.
- Refactored the index method to allow for dynamic creation of empty tokens based on the tokenized vocabulary.
- Implemented comprehensive document loading functionality for TXT, CSV, JSON, and JSONL formats in the load function.
- Added a new example script to demonstrate loading and querying capabilities with various document formats.
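
As a rough illustration of the multi-format loading described above, the sketch below dispatches on file extension using only the standard library. It is not the `load` function from this PR: the name `load_documents`, the `text_column` parameter, and the assumption that each CSV/JSON/JSONL record stores its text under a single key are all hypothetical.

```python
import csv
import json
from pathlib import Path


def load_documents(path, text_column="text"):
    """Return a list of document strings from a TXT, CSV, JSON, or JSONL file.

    Hypothetical helper; assumes one document per line (TXT), per row (CSV),
    per JSON line (JSONL), or per element of a top-level list (JSON).
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".txt":
        with path.open(encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    if suffix == ".csv":
        with path.open(newline="", encoding="utf-8") as f:
            return [row[text_column] for row in csv.DictReader(f)]

    if suffix == ".jsonl":
        with path.open(encoding="utf-8") as f:
            return [json.loads(line)[text_column] for line in f if line.strip()]

    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            return [item[text_column] for item in json.load(f)]

    raise ValueError(f"Unsupported file format: {suffix!r}")
```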

* feat: Add high-level API for document loading and searching

- Introduced a new section in the README to explain the high-level API for quickly searching local files.
- Added a new test suite for the high-level API, covering various document formats (TXT, CSV, JSON, JSONL) and ensuring robust functionality.
- Implemented tests for loading documents and searching with different query scenarios, including handling empty queries and larger k values than the corpus size.
- Added a new job for high-level tests in the GitHub Actions workflow, which installs dependencies and runs tests located in the `tests/high_level` directory.
- Introduced a new test file `test_high_level.py` that implements comprehensive unit tests for the high-level BM25 API, covering various document formats (TXT, CSV, JSON, JSONL) and ensuring robust functionality.
- Cleaned up the GitHub Actions workflow by removing unnecessary blank lines and correcting the command for running high-level tests.
- Enhanced the BM25Search class by adding the `allow_empty` parameter to improve query handling and prevent issues with empty queries.
- Added support for passing additional keyword arguments (**kwargs) to the BM25 class's load method, allowing for more flexible parameter overrides during index loading.
- Updated documentation to reflect the new **kwargs functionality, improving clarity on how to customize loaded parameters.
- Added installation of the `numba` package to the workflow to support new functionalities.
- Updated the command for running high-level tests to include the `-X faulthandler` option, improving error reporting during test execution.
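
The two edge cases the tests above exercise (an empty query, and a `k` larger than the corpus) can be sketched as follows. This is only an illustration of the guard logic, not the `BM25Search` implementation: `safe_top_k` is a hypothetical name, and it takes precomputed scores rather than calling any bm25s function.

```python
def safe_top_k(scores, query, k):
    """Return up to k (doc_id, score) pairs, best first.

    Hypothetical helper: `scores` holds one precomputed score per document.
    """
    # An empty or whitespace-only query has no tokens to match against.
    if not query.strip():
        return []

    # Never ask for more results than there are documents.
    k = min(k, len(scores))

    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```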

* Update GitHub Actions workflow for high-level tests

- Set environment variables to control threading for OpenMP, MKL, and OpenBLAS, enhancing performance stability during test execution.
- Upgraded `numpy` to the latest version and ensured `numba` is installed with a minimum version requirement, improving compatibility with high-level dependencies.
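
For local reproduction of the CI setup, the same kind of thread limiting can be applied from Python before NumPy or Numba are imported. The sketch below assumes the conventional variable names `OMP_NUM_THREADS`, `MKL_NUM_THREADS`, and `OPENBLAS_NUM_THREADS` and a limit of one thread; the exact names and values used in the workflow are not shown in this PR description.

```python
import os

# Conventional variable names read by OpenMP, MKL, and OpenBLAS (assumed here).
THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def limit_native_threads(n=1, env=os.environ):
    """Set each thread-count variable to n unless it is already defined.

    Must run before importing libraries that read these variables at load time.
    Returns the effective settings.
    """
    for name in THREAD_VARS:
        env.setdefault(name, str(n))
    return {name: env[name] for name in THREAD_VARS}
```

Calling `limit_native_threads()` at the very top of a test entry point keeps any value the user has already exported.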

* Update BM25Search to disable warmup during compilation to prevent segfaults in CI environments

* Refactor MCP integration in CLI and improve error handling

- Updated the BM25S CLI to import the MCP server dynamically, enhancing modularity.
- Improved error messaging for missing MCP package, providing clearer installation instructions for users.
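
A common pattern for this kind of optional dependency is to import the module only when the feature is requested and to attach an installation hint to the error. The sketch below shows that pattern in general terms; `load_optional` is a hypothetical helper, and the module name and hint are placeholders rather than the actual names used by the CLI.

```python
import importlib


def load_optional(module_name, install_hint):
    """Import a module on demand, raising a readable error if it is missing.

    Hypothetical helper; `install_hint` is shown to the user verbatim.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The optional dependency {module_name!r} is not installed. "
            f"To enable this feature, run: {install_hint}"
        ) from exc
```

Deferring the import this way means users who never touch the optional feature do not pay for it at startup or need the extra package.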

* Add command-line interface for indexing and searching documents

- Introduced a terminal-based CLI for `bm25s` to facilitate quick indexing and searching of documents from CSV, TXT, JSON, and JSONL files.
- Implemented subcommands for `index` and `search`, allowing users to create an index and perform queries directly from the command line.
- Enhanced README documentation with examples for using the CLI, improving user guidance for indexing and searching workflows.
- Added comprehensive unit tests to validate CLI functionality and ensure robust error handling for various input scenarios.
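
The `index` and `search` subcommand layout can be sketched with `argparse` as below. Only the two subcommand names come from this PR; the program name, argument names, flags, and defaults are assumptions made for illustration and may differ from the real CLI.

```python
import argparse


def build_parser():
    """Build a parser with `index` and `search` subcommands (illustrative only)."""
    parser = argparse.ArgumentParser(prog="bm25s")
    subparsers = parser.add_subparsers(dest="command", required=True)

    # `index`: read documents from a file and write an index directory.
    index = subparsers.add_parser("index", help="create an index from a file")
    index.add_argument("documents", help="path to a CSV, TXT, JSON, or JSONL file")
    index.add_argument("-o", "--output", default="bm25s_index", help="index directory")

    # `search`: query an existing index.
    search = subparsers.add_parser("search", help="query an existing index")
    search.add_argument("query", help="query string")
    search.add_argument("-i", "--index", default="bm25s_index", help="index directory")
    search.add_argument("-k", type=int, default=10, help="number of results")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```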

* Add save functionality for search results in JSON format

- Introduced a new command-line argument `--save` to allow users to save search results to a specified JSON file.
- Updated the README to include examples of using the new save feature, enhancing user documentation.
- Added unit tests to verify the correct functionality of saving search results, ensuring robust implementation and validation of output structure.
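
A minimal sketch of what saving results to JSON could look like is shown below. The output structure (a `query` field plus a ranked `results` list) and the function name `save_results` are assumptions; the PR description does not specify the actual schema written by `--save`.

```python
import json


def save_results(path, query, hits):
    """Write search hits to a JSON file and return the written payload.

    Hypothetical helper: `hits` is a list of (doc_id, score) pairs, best first.
    """
    payload = {
        "query": query,
        "results": [
            {"rank": rank, "doc_id": doc_id, "score": score}
            for rank, (doc_id, score) in enumerate(hits, start=1)
        ],
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
    return payload
```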

* Enhance CLI functionality with user directory support

- Added the ability to save indices to a central user directory (`~/.bm25s/indices/`) using the `-u` flag, improving organization and accessibility of user-created indices.
- Updated command-line arguments for indexing and searching to include shorthand options, enhancing usability.
- Introduced an interactive index picker for selecting indices from the user directory, requiring the `rich` package for a better user experience.
- Enhanced README documentation with examples for the new user directory features, providing clearer guidance for users.
- Added unit tests to validate the new user directory functionality and ensure robust error handling.
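
Resolving an index name against the central user directory might look like the sketch below. The `~/.bm25s/indices/` location comes from this PR; the function names, the `home` override (added so the logic can be tested without touching the real home directory), and the behavior when the flag is not set are assumptions.

```python
from pathlib import Path


def resolve_index_path(name, use_user_dir=False, home=None):
    """Return the directory for an index, optionally under ~/.bm25s/indices/.

    Hypothetical helper; `use_user_dir` mirrors the effect of the `-u` flag.
    """
    if not use_user_dir:
        return Path(name)
    base = Path(home) if home is not None else Path.home()
    return base / ".bm25s" / "indices" / name


def list_user_indices(home=None):
    """Return the sorted names of index directories in the user directory."""
    base = Path(home) if home is not None else Path.home()
    root = base / ".bm25s" / "indices"
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

A listing function like `list_user_indices` is the kind of input an interactive picker would need.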