xhluca · Copilot · Dec 20, 2025
diff --git a/README.md b/README.md
@@ -169,6 +169,36 @@ for i in range(results.shape[1]):
     print(f"Rank {i+1}: {results[0, i]}")
 ```
 
+### Corpus Formats
+
+The `corpus` parameter in `bm25s` is flexible and supports multiple formats:
+
+**1. List of strings** (simplest format, shown in quickstart):
+```python
+corpus = [
+    "a cat is a feline and likes to purr",
+    "a dog is the human's best friend and loves to play",
+]
+retriever.save("index_dir", corpus=corpus)
+```
+When saved, strings are automatically converted to `{"id": <index>, "text": <string>}` format.
+
+**2. List of dictionaries** (for documents with metadata):
+```python
+corpus = [
+    {"text": "a cat is a feline", "title": "About Cats", "author": "John"},
+    {"text": "a dog is a friend", "title": "About Dogs", "author": "Jane"},
+]
+retriever.save("index_dir", corpus=corpus)
+```
+Dictionaries can have any keys you want; each one is saved as-is as a JSON object.
+
+**Important notes:**
+- The corpus you pass to `save()` or `BM25(corpus=...)` is only stored with the index for **saving/loading**; it is not used to build the index
+- For **indexing**, you must tokenize your text first using `bm25s.tokenize()`
+- Pass the corpus to `retrieve(..., corpus=corpus)` to get documents back instead of indices
+- For more details, see [`examples/index_with_metadata.py`](examples/index_with_metadata.py)
+
 ### Memory Efficient Retrieval
 
 `bm25s` is designed to be memory efficient. You can use the `mmap` option to load the BM25 index as a memory-mapped file, which allows you to load the index without loading the full index into memory. This is useful when you have a large index and want to save memory:
@@ -441,6 +471,51 @@ Similarly, for MSMARCO (8M+ documents, 300M+ tokens), we show the following resu
 | Memory-mapped | 1.24           | 90.41         | 1.14                | 4.88                   |
 | Mmap+Reload   | 1.17           | 97.89         | 1.14                | 1.38                   |
 
+## API Reference and Documentation
+
+### Where is the API documentation?
+
+The primary API documentation is available through:
+1. **Source code docstrings**: The main API is documented in `bm25s/__init__.py` with detailed docstrings for all methods
+2. **Examples directory**: See [`examples/`](examples/) for practical usage examples
+3. **Homepage**: Visit [bm25s.github.io](https://bm25s.github.io) for additional resources
+
+### Quick API Overview
+
+The main methods you'll use are:
+
+- **`bm25s.tokenize(texts, ...)`**: Tokenize corpus or queries
+  ```python
+  tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)
+  ```
+
+- **`BM25()`**: Initialize a BM25 model
+  ```python
+  retriever = bm25s.BM25(k1=1.5, b=0.75, method="lucene")
+  ```
+
+- **`retriever.index(tokens)`**: Index your tokenized corpus
+  ```python
+  retriever.index(corpus_tokens)
+  ```
+
+- **`retriever.retrieve(query_tokens, k=10, corpus=None)`**: Search and retrieve top-k results
+  ```python
+  results, scores = retriever.retrieve(query_tokens, k=10)
+  # Or return actual documents:
+  results, scores = retriever.retrieve(query_tokens, k=10, corpus=corpus)
+  ```
+
+- **`retriever.save(save_dir, corpus=None)`**: Save index and optionally corpus
+  ```python
+  retriever.save("my_index", corpus=corpus)
+  ```
+
+- **`BM25.load(dir, load_corpus=False)`**: Load saved index
+  ```python
+  retriever = bm25s.BM25.load("my_index", load_corpus=True)
+  ```
+
 ## Acknowledgement
 
 * The central idea behind the scoring mechanism in this library is originally from [bm25_pt](https://github.com/jxmorris12/bm25_pt), which was a major inspiration to this project.
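
For reference, the string-to-dictionary conversion described in the README and docstring changes can be sketched in plain Python. This is a minimal illustration, not the code `bm25s` runs: `normalize_corpus` and `to_jsonl` are hypothetical helpers introduced here, and the only behavior taken from this PR is that strings become `{"id": <index>, "text": <string>}` while dictionaries are kept as-is. Rejecting other item types is an assumption of the sketch.

```python
import json


def normalize_corpus(corpus):
    """Hypothetical helper: mimic the documented conversion (not bm25s's own code)."""
    records = []
    for idx, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Strings are wrapped with their position as the id.
            records.append({"id": idx, "text": doc})
        elif isinstance(doc, dict):
            # Dictionaries are kept unchanged, whatever keys they carry.
            records.append(doc)
        else:
            # Assumption of this sketch: anything else is rejected.
            raise TypeError(f"unsupported corpus item at index {idx}: {type(doc).__name__}")
    return records


def to_jsonl(records):
    """Hypothetical helper: serialize records as one JSON object per line."""
    return "\n".join(json.dumps(record) for record in records)


string_corpus = [
    "a cat is a feline and likes to purr",
    "a dog is the human's best friend and loves to play",
]
dict_corpus = [
    {"text": "a cat is a feline", "title": "About Cats", "author": "John"},
    {"text": "a dog is a friend", "title": "About Dogs", "author": "Jane"},
]

string_records = normalize_corpus(string_corpus)
dict_records = normalize_corpus(dict_corpus)
jsonl_text = to_jsonl(string_records)
```

Each line of `jsonl_text` parses back into one record with `json.loads`, which is the shape a `corpus.jsonl` file would be expected to have under the description above.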

diff --git a/bm25s/__init__.py b/bm25s/__init__.py
@@ -169,10 +169,12 @@ def __init__(
         int_dtype : str
             The data type of the indices in the BM25 scores.
 
-        corpus : Iterable[Dict]
+        corpus : Iterable[str] or Iterable[Dict]
             The corpus of documents. This is optional and is used for saving the corpus
-            to the snapshot. We expect the corpus to be a list of dictionaries, where each
-            dictionary represents a document.
+            to the snapshot. The corpus can be:
+            - A list of strings (e.g., ["text1", "text2"])
+            - A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+            When saved, strings are automatically converted to {"id": <index>, "text": <string>} format.
 
         backend : str
             The backend used during retrieval. By default, it uses the numpy backend, which
@@ -633,11 +635,16 @@ def retrieve(
             List of list of tokens for each query. If a Tokenized object is provided,
             it will be converted to a list of list of tokens.
 
-        corpus : List[str] or np.ndarray
+        corpus : List[str] or List[Dict] or List[Any] or np.ndarray
             List of "documents" or a numpy array of documents. If provided, the function
-            will return the documents instead of the indices. You do not have to provide
-            the original documents (for example, you can provide the unique IDs of the
-            documents here and then retrieve the actual documents from another source).
+            will return the documents instead of the indices. The corpus can be:
+            - A list of strings (the original documents)
+            - A list of dictionaries (documents with metadata)
+            - A list of any other type (e.g., document IDs)
+            - A numpy array
+            You do not have to provide the original documents (for example, you can provide
+            the unique IDs of the documents here and then retrieve the actual documents from
+            another source).
 
         k : int
             Number of documents to retrieve for each query.
@@ -895,8 +902,13 @@ def save(
         save_dir : str
             The directory where the BM25S index will be saved.
 
-        corpus : List[Dict]
-            The corpus of documents. If provided, it will be saved to the `corpus` file.
+        corpus : Iterable[str] or Iterable[Dict]
+            The corpus of documents. If provided, it will be saved to the file specified by
+            the `corpus_name` parameter (default: "corpus.jsonl").
+            The corpus can be:
+            - A list of strings (e.g., ["text1", "text2"])
+            - A list of dictionaries (e.g., [{"text": "...", "metadata": {...}}, ...])
+            When saved, strings are automatically converted to {"id": <index>, "text": <string>} format.
 
         corpus_name : str
             The name of the file that will contain the corpus.