diff --git a/README.md b/README.md
index 58a2c64..4044435 100644
--- a/README.md
+++ b/README.md
@@ -169,6 +169,36 @@ for i in range(results.shape[1]):
     print(f"Rank {i+1}: {results[0, i]}")
 ```
 
+### Corpus Formats
+
+The `corpus` parameter in `bm25s` is flexible and supports multiple formats:
+
+**1. List of strings** (simplest format, shown in quickstart):
+```python
+corpus = [
+    "a cat is a feline and likes to purr",
+    "a dog is the human's best friend and loves to play",
+]
+retriever.save("index_dir", corpus=corpus)
+```
+When saved, strings are automatically converted to `{"id": , "text": }` format.
+
+**2. List of dictionaries** (for documents with metadata):
+```python
+corpus = [
+    {"text": "a cat is a feline", "title": "About Cats", "author": "John"},
+    {"text": "a dog is a friend", "title": "About Dogs", "author": "Jane"},
+]
+retriever.save("index_dir", corpus=corpus)
+```
+Dictionaries can have any keys you want - they are saved as-is in JSON format.
+
+**Important notes:**
+- The corpus you pass to `save()` or `BM25(corpus=...)` is for **saving/loading purposes only**
+- For **indexing**, you must tokenize your text first using `bm25s.tokenize()`
+- The corpus items can be retrieved later using `retrieve(..., corpus=corpus)` to return documents instead of indices
+- For more details, see [`examples/index_with_metadata.py`](examples/index_with_metadata.py)
+
 ### Memory Efficient Retrieval
 
 `bm25s` is designed to be memory efficient. You can use the `mmap` option to load the BM25 index as a memory-mapped file, which allows you to load the index without loading the full index into memory. This is useful when you have a large index and want to save memory:
@@ -441,6 +471,51 @@ Similarly, for MSMARCO (8M+ documents, 300M+ tokens), we show the following resu
 | Memory-mapped | 1.24 | 90.41 | 1.14 | 4.88 |
 | Mmap+Reload | 1.17 | 97.89 | 1.14 | 1.38 |
 
+## API Reference and Documentation
+
+### Where is the API documentation?
+
+The primary API documentation is available through:
+1. **Source code docstrings**: The main API is documented in `bm25s/__init__.py` with detailed docstrings for all methods
+2. **Examples directory**: See [`examples/`](examples/) for practical usage examples
+3. **Homepage**: Visit [bm25s.github.io](https://bm25s.github.io) for additional resources
+
+### Quick API Overview
+
+The main methods you'll use are:
+
+- **`bm25s.tokenize(texts, ...)`**: Tokenize corpus or queries
+  ```python
+  tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
+  ```
+
+- **`BM25()`**: Initialize a BM25 model
+  ```python
+  retriever = bm25s.BM25(k1=1.5, b=0.75, method="lucene")
+  ```
+
+- **`retriever.index(tokens)`**: Index your tokenized corpus
+  ```python
+  retriever.index(corpus_tokens)
+  ```
+
+- **`retriever.retrieve(query_tokens, k=10, corpus=None)`**: Search and retrieve top-k results
+  ```python
+  results, scores = retriever.retrieve(query_tokens, k=10)
+  # Or return actual documents:
+  results, scores = retriever.retrieve(query_tokens, k=10, corpus=corpus)
+  ```
+
+- **`retriever.save(dir, corpus=None)`**: Save index and optionally corpus
+  ```python
+  retriever.save("my_index", corpus=corpus)
+  ```
+
+- **`BM25.load(dir, load_corpus=False)`**: Load saved index
+  ```python
+  retriever = bm25s.BM25.load("my_index", load_corpus=True)
+  ```
+
 ## Acknowledgement
 
 * The central idea behind the scoring mechanism in this library is originally from [bm25_pt](https://github.com/jxmorris12/bm25_pt), which was a major inspiration to this project.
diff --git a/bm25s/__init__.py b/bm25s/__init__.py
index d55b051..81e91a3 100644
--- a/bm25s/__init__.py
+++ b/bm25s/__init__.py
@@ -169,10 +169,12 @@ def __init__(
         int_dtype : str
             The data type of the indices in the BM25 scores.
 
-        corpus : Iterable[Dict]
+        corpus : Iterable[str] or Iterable[Dict]
             The corpus of documents. This is optional and is used for saving the corpus
-            to the snapshot. We expect the corpus to be a list of dictionaries, where each
-            dictionary represents a document.
+            to the snapshot. The corpus can be:
+            - A list of strings (e.g., ["text1", "text2"])
+            - A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+            When saved, strings are automatically converted to {"id": , "text": } format.
 
         backend : str
             The backend used during retrieval. By default, it uses the numpy backend, which
@@ -633,11 +635,16 @@ def retrieve(
             List of list of tokens for each query. If a Tokenized object is provided,
             it will be converted to a list of list of tokens.
 
-        corpus : List[str] or np.ndarray
+        corpus : List[str] or List[Dict] or List[Any] or np.ndarray
             List of "documents" or a numpy array of documents. If provided, the function
-            will return the documents instead of the indices. You do not have to provide
-            the original documents (for example, you can provide the unique IDs of the
-            documents here and then retrieve the actual documents from another source).
+            will return the documents instead of the indices. The corpus can be:
+            - A list of strings (the original documents)
+            - A list of dictionaries (documents with metadata)
+            - A list of any other type (e.g., document IDs)
+            - A numpy array
+            You do not have to provide the original documents (for example, you can provide
+            the unique IDs of the documents here and then retrieve the actual documents from
+            another source).
 
         k : int
             Number of documents to retrieve for each query.
@@ -895,8 +902,13 @@ def save(
         save_dir : str
             The directory where the BM25S index will be saved.
 
-        corpus : List[Dict]
-            The corpus of documents. If provided, it will be saved to the `corpus` file.
+        corpus : Iterable[str] or Iterable[Dict]
+            The corpus of documents. If provided, it will be saved to the file specified by
+            the `corpus_name` parameter (default: "corpus.jsonl").
+            The corpus can be:
+            - A list of strings (e.g., ["text1", "text2"])
+            - A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+            When saved, strings are automatically converted to {"id": , "text": } format.
 
         corpus_name : str
             The name of the file that will contain the corpus.
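
The string-to-dictionary conversion described in the updated `save()` docstring can be illustrated without the library. The following is a minimal sketch, not the code from `bm25s/__init__.py`: `normalize_corpus` and `to_jsonl` are hypothetical helper names, and the sketch assumes that `"id"` is the document's zero-based position in the corpus and `"text"` is the original string, which the diff itself does not spell out.

```python
import json
from typing import Any, Dict, Iterable, List, Union

Document = Union[str, Dict[str, Any]]


def normalize_corpus(corpus: Iterable[Document]) -> List[Dict[str, Any]]:
    """Hypothetical helper: turn a mixed corpus into a list of dictionaries."""
    records: List[Dict[str, Any]] = []
    for position, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Assumption: "id" is the zero-based position of the string.
            records.append({"id": position, "text": doc})
        elif isinstance(doc, dict):
            # Dictionaries are kept as-is, including any metadata keys.
            records.append(doc)
        else:
            raise TypeError(f"unsupported corpus item type: {type(doc).__name__}")
    return records


def to_jsonl(records: Iterable[Dict[str, Any]]) -> str:
    """Hypothetical helper: serialize one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


records = normalize_corpus(
    [
        "a cat is a feline",
        {"text": "a dog is a friend", "title": "About Dogs"},
    ]
)
print(to_jsonl(records))
```

Raising on unsupported item types is a choice made for this sketch only; the docstrings in the diff do not say how `save()` treats items that are neither strings nor dictionaries.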