Clarify corpus parameter accepts two formats (strings and dicts only) #159

Copilot · 2025-12-20T20:49:07Z

The documentation inconsistently described the corpus parameter format: the quickstart showed List[str], while docstrings stated List[Dict]. Both formats are valid—the implementation fully supports strings and dictionaries.

Changes

- Updated docstrings in BM25.__init__(), BM25.save(), and BM25.retrieve() to document only the two supported corpus formats
- Added "Corpus Formats" section to README with examples of the two supported patterns:
  - List[str] - strings auto-convert to {"id": i, "text": str} on save
  - List[Dict] - dictionaries with arbitrary keys saved as-is
- Added "API Reference and Documentation" section to README clarifying documentation locations (source docstrings, examples directory, homepage)
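
The string-to-dictionary conversion listed above can be sketched as follows. This is an illustrative helper, not the library's code: `normalize_corpus` is a hypothetical name, and raising `TypeError` for other entry types is an assumption made here for clarity (the PR only states that other formats are unsupported).

```python
from typing import Dict, List, Union


def normalize_corpus(corpus: List[Union[str, Dict]]) -> List[Dict]:
    """Hypothetical helper mirroring the documented behavior on save."""
    normalized = []
    for i, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Strings are wrapped in a dict, with their position as the id.
            normalized.append({"id": i, "text": doc})
        elif isinstance(doc, dict):
            # Dictionaries are kept unchanged, whatever their keys.
            normalized.append(doc)
        else:
            # Assumption: reject any other entry type explicitly.
            raise TypeError(
                f"unsupported corpus entry at index {i}: {type(doc).__name__}"
            )
    return normalized


docs = normalize_corpus(["a cat is a feline", "a dog is a friend"])
print(docs[0])  # {'id': 0, 'text': 'a cat is a feline'}
```

Rejecting unsupported entries up front is one possible design choice; it surfaces a format problem at save time rather than later during retrieval.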

Note: Other formats (tuples, lists of lists) are not documented as supported. While the implementation may accept them during saving, they either convert to other types during JSON serialization (tuples → lists) or cause errors during retrieval operations (lists of lists cause reshape errors).
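
The tuple-to-list conversion mentioned in the note is standard behavior of Python's `json` module and can be checked in isolation. The sketch below uses only the standard library; that the library serializes the corpus in exactly this way is an assumption.

```python
import json

# A corpus entry given as a tuple (not a supported format).
corpus = [("doc1", "a cat is a feline")]

# JSON has no tuple type, so the round trip returns a list instead.
restored = json.loads(json.dumps(corpus))

print(restored)                    # [['doc1', 'a cat is a feline']]
print(type(restored[0]).__name__)  # list
```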

Example

Both formats work correctly:

# Simple strings (quickstart pattern)
corpus = ["a cat is a feline", "a dog is a friend"]
retriever.save("index", corpus=corpus)

# Dictionaries with metadata
corpus = [
    {"text": "a cat is a feline", "title": "Cats", "id": "doc1"},
    {"text": "a dog is a friend", "title": "Dogs", "id": "doc2"}
]
retriever.save("index", corpus=corpus)  # Dictionaries preserved as-is

Original prompt

This section details on the original issue you should resolve

<issue_title>Format of corpus and API reference</issue_title>
<issue_description>The documentation says that a corpus is a list of dictionaries, while the quickstart defines a corpus as a list of strings. A DeepLearning.AI tutorial also uses a list of strings.

Who is right?

What format the dictionary is in? What are the keys?

Is the documentation compiled into an API reference somewhere?</issue_description>

<agent_instructions>Can you answer this question?</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes Format of corpus and API reference #158

Co-authored-by: xhluca <[email protected]>

xhluca · 2025-12-20T20:57:11Z

@copilot the list of tuple or iterable of tuple is incorrect and not accepted. also

Co-authored-by: xhluca <[email protected]>

Copilot · 2025-12-20T21:03:20Z

@copilot the list of tuple or iterable of tuple is incorrect and not accepted. also

Fixed in 5a98693. Removed all references to tuples from:

- BM25.__init__() docstring
- BM25.save() docstring
- README "Corpus Formats" section

The documentation now correctly states that corpus accepts strings, dictionaries, or lists (not tuples).

xhluca · 2025-12-22T20:20:28Z

@copilot list of list is not accepted either

Co-authored-by: xhluca <[email protected]>

Copilot · 2025-12-22T20:24:20Z

@copilot list of list is not accepted either

Fixed in 2d27401. Removed all references to "list of lists" from:

- BM25.__init__() docstring (removed Iterable[List] and description)
- BM25.save() docstring (removed Iterable[List] and description)
- README "Corpus Formats" section (removed entire section 3 with list of lists example)

The documentation now correctly states that corpus accepts only strings or dictionaries.

Initial plan

d72616c

Copilot AI assigned Copilot and xhluca Dec 20, 2025

Copilot started work on behalf of xhluca December 20, 2025 20:49

Copilot AI and others added 3 commits December 20, 2025 20:53

Fix corpus format documentation in docstrings

20c2a28

Co-authored-by: xhluca <[email protected]>

Add comprehensive corpus format documentation to README

bbef375

Co-authored-by: xhluca <[email protected]>

Address code review feedback - improve documentation clarity

b1120c6

Co-authored-by: xhluca <[email protected]>

Copilot AI changed the title ~~[WIP] Fix format discrepancies in corpus and API reference~~ Clarify corpus parameter accepts multiple formats (strings, dicts, lists) Dec 20, 2025

Copilot AI requested a review from xhluca December 20, 2025 20:59

Copilot finished work on behalf of xhluca December 20, 2025 20:59

Copilot started work on behalf of xhluca December 20, 2025 21:00

Remove tuple references from corpus documentation

5a98693

Co-authored-by: xhluca <[email protected]>

Copilot finished work on behalf of xhluca December 20, 2025 21:04

xhluca marked this pull request as ready for review December 22, 2025 20:18

Copilot started work on behalf of xhluca December 22, 2025 20:21

Remove list of lists from corpus documentation

2d27401

Co-authored-by: xhluca <[email protected]>

Copilot AI changed the title ~~Clarify corpus parameter accepts multiple formats (strings, dicts, lists)~~ Clarify corpus parameter accepts two formats (strings and dicts only) Dec 22, 2025

Copilot finished work on behalf of xhluca December 22, 2025 20:26
