75 changes: 75 additions & 0 deletions README.md
@@ -169,6 +169,36 @@ for i in range(results.shape[1]):
    print(f"Rank {i+1}: {results[0, i]}")
```

### Corpus Formats

The `corpus` parameter in `bm25s` is flexible and supports multiple formats:

**1. List of strings** (simplest format, shown in quickstart):
```python
corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]
# `retriever` is the indexed BM25 object from the quickstart
retriever.save("index_dir", corpus=corpus)
```
When saved, strings are automatically converted to `{"id": <index>, "text": <string>}` format.
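
As a rough illustration of that conversion, the sketch below wraps each string in a record and leaves dictionaries untouched. `normalize_corpus` is a hypothetical helper written for this example, not a `bm25s` function, and the library's own implementation may differ in detail.

```python
from typing import Any, Dict, Iterable, List, Union


def normalize_corpus(
    corpus: Iterable[Union[str, Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Hypothetical helper: wrap plain strings as {"id", "text"} records."""
    records = []
    for index, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Strings get their position in the corpus as the id.
            records.append({"id": index, "text": doc})
        else:
            # Dictionaries are passed through unchanged.
            records.append(doc)
    return records


corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]
records = normalize_corpus(corpus)
print(records[0])
```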

**2. List of dictionaries** (for documents with metadata):
```python
corpus = [
    {"text": "a cat is a feline", "title": "About Cats", "author": "John"},
    {"text": "a dog is a friend", "title": "About Dogs", "author": "Jane"},
]
retriever.save("index_dir", corpus=corpus)
```
Dictionaries can contain any keys you want. Each one is saved as-is in JSON format.
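
To make "saved as-is" concrete, here is a minimal standard-library sketch of a JSON Lines round trip, assuming one JSON object per line as the `.jsonl` extension of the default corpus file suggests. `write_jsonl` and `read_jsonl` are hypothetical helpers for illustration only; they are not how `bm25s` is guaranteed to read or write its files.

```python
import json
import os
import tempfile


def write_jsonl(path, records):
    """Hypothetical helper: write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Hypothetical helper: read the records back in order."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


corpus = [
    {"text": "a cat is a feline", "title": "About Cats", "author": "John"},
    {"text": "a dog is a friend", "title": "About Dogs", "author": "Jane"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.jsonl")
    write_jsonl(path, corpus)
    restored = read_jsonl(path)

# Every key survives the round trip, including the metadata fields.
print(restored == corpus)
```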

**Important notes:**
- The corpus you pass to `save()` or `BM25(corpus=...)` is for **saving/loading purposes only**; it is not what gets indexed.
- For **indexing**, you must tokenize your text first using `bm25s.tokenize()`, then pass the tokens to `retriever.index()`.
- To get documents back instead of indices, pass the corpus to `retrieve(..., corpus=corpus)`.
- For more details, see [`examples/index_with_metadata.py`](examples/index_with_metadata.py).
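
The notes above separate two roles: tokens are what gets indexed, while the corpus is only a lookup table for turning result indices back into documents. The dependency-free sketch below shows that split conceptually. The whitespace split is a stand-in for `bm25s.tokenize()`, and `result_indices` is a made-up stand-in for retrieval output, so nothing here calls the library.

```python
corpus = [
    {"text": "a cat is a feline", "title": "About Cats"},
    {"text": "a dog is a friend", "title": "About Dogs"},
]

# Only the text field is tokenized for indexing. A whitespace split stands in
# for bm25s.tokenize() so the sketch runs without dependencies.
corpus_tokens = [doc["text"].split() for doc in corpus]

# Assumed retrieval output: one row of document indices per query.
result_indices = [[1, 0]]

# Passing a corpus to retrieval amounts to this lookup: each index is
# replaced by the corpus item at that position, metadata included.
result_docs = [[corpus[i] for i in row] for row in result_indices]
print(result_docs[0][0]["title"])
```

Because the lookup is purely positional, the corpus must be in the same order as the tokenized texts that were indexed.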

### Memory Efficient Retrieval

`bm25s` is designed to be memory efficient. The `mmap` option loads the BM25 index as a memory-mapped file, so the index can be used without reading all of it into memory. This is useful when you have a large index and want to save memory:
@@ -441,6 +471,51 @@ Similarly, for MSMARCO (8M+ documents, 300M+ tokens), we show the following resu
| Memory-mapped | 1.24 | 90.41 | 1.14 | 4.88 |
| Mmap+Reload | 1.17 | 97.89 | 1.14 | 1.38 |

## API Reference and Documentation

### Where is the API documentation?

The primary API documentation is available through:
1. **Source code docstrings**: The main API is documented in `bm25s/__init__.py` with detailed docstrings for all methods
2. **Examples directory**: See [`examples/`](examples/) for practical usage examples
3. **Homepage**: Visit [bm25s.github.io](https://bm25s.github.io) for additional resources

### Quick API Overview

The main methods you'll use are:

- **`bm25s.tokenize(texts, ...)`**: Tokenize a corpus or a set of queries
```python
# `stemmer` is optional and must be created beforehand
tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
```

- **`BM25()`**: Initialize a BM25 model
```python
retriever = bm25s.BM25(k1=1.5, b=0.75, method="lucene")
```

- **`retriever.index(tokens)`**: Index your tokenized corpus
```python
retriever.index(corpus_tokens)
```

- **`retriever.retrieve(query_tokens, k=10, corpus=None)`**: Search and retrieve top-k results
```python
results, scores = retriever.retrieve(query_tokens, k=10)
# Or return actual documents:
results, scores = retriever.retrieve(query_tokens, k=10, corpus=corpus)
```

- **`retriever.save(dir, corpus=None)`**: Save the index and, optionally, the corpus
```python
retriever.save("my_index", corpus=corpus)
```

- **`BM25.load(dir, load_corpus=False)`**: Load a saved index
```python
retriever = bm25s.BM25.load("my_index", load_corpus=True)
```
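
For readers wondering what `k1`, `b`, and `method="lucene"` control, the following is a plain-Python sketch of the textbook Lucene-style BM25 formula followed by a top-k selection. It is an assumption-laden illustration: `bm25_scores` and `top_k` are hypothetical names, and `bm25s` itself computes scores with its own sparse-matrix implementation rather than with a loop like this.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every document against one query (Lucene-style BM25 sketch)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        # b controls how strongly long documents are penalized.
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_tokens:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # k1 controls how quickly repeated terms stop adding score.
            score += idf * tf / (tf + length_norm)
        scores.append(score)
    return scores


def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


corpus_tokens = [
    ["cat", "feline", "likes", "purr"],
    ["dog", "human", "best", "friend", "loves", "play"],
]
scores = bm25_scores(["cat"], corpus_tokens)
print(top_k(scores, k=1))
```

The returned indices play the same role as the first value returned by `retriever.retrieve(...)` when no corpus is passed.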

## Acknowledgement

* The central idea behind the scoring mechanism in this library is originally from [bm25_pt](https://github.com/jxmorris12/bm25_pt), which was a major inspiration to this project.
30 changes: 21 additions & 9 deletions bm25s/__init__.py
@@ -169,10 +169,12 @@ def __init__(
int_dtype : str
The data type of the indices in the BM25 scores.

-corpus : Iterable[Dict]
+corpus : Iterable[str] or Iterable[Dict]
The corpus of documents. This is optional and is used for saving the corpus
-to the snapshot. We expect the corpus to be a list of dictionaries, where each
-dictionary represents a document.
+to the snapshot. The corpus can be:
+- A list of strings (e.g., ["text1", "text2"])
+- A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+When saved, strings are automatically converted to {"id": <index>, "text": <string>} format.

backend : str
The backend used during retrieval. By default, it uses the numpy backend, which
@@ -633,11 +635,16 @@ def retrieve(
List of list of tokens for each query. If a Tokenized object is provided,
it will be converted to a list of list of tokens.

-corpus : List[str] or np.ndarray
+corpus : List[str] or List[Dict] or List[Any] or np.ndarray
List of "documents" or a numpy array of documents. If provided, the function
-will return the documents instead of the indices. You do not have to provide
-the original documents (for example, you can provide the unique IDs of the
-documents here and then retrieve the actual documents from another source).
+will return the documents instead of the indices. The corpus can be:
+- A list of strings (the original documents)
+- A list of dictionaries (documents with metadata)
+- A list of any other type (e.g., document IDs)
+- A numpy array
+You do not have to provide the original documents (for example, you can provide
+the unique IDs of the documents here and then retrieve the actual documents from
+another source).

k : int
Number of documents to retrieve for each query.
@@ -895,8 +902,13 @@ def save(
save_dir : str
The directory where the BM25S index will be saved.

-corpus : List[Dict]
-The corpus of documents. If provided, it will be saved to the `corpus` file.
+corpus : Iterable[str] or Iterable[Dict]
+The corpus of documents. If provided, it will be saved to the file specified by
+the `corpus_name` parameter (default: "corpus.jsonl").
+The corpus can be:
+- A list of strings (e.g., ["text1", "text2"])
+- A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+When saved, strings are automatically converted to {"id": <index>, "text": <string>} format.

corpus_name : str
The name of the file that will contain the corpus.