Contributor

Copilot AI commented Dec 20, 2025

The documentation described the corpus parameter format inconsistently: the quickstart showed List[str], while the docstrings stated List[Dict]. Both formats are valid; the implementation supports both strings and dictionaries.

Changes

  • Updated docstrings in BM25.__init__(), BM25.save(), and BM25.retrieve() to document only the two supported corpus formats
  • Added "Corpus Formats" section to README with examples of the two supported patterns:
    • List[str] - strings auto-convert to {"id": i, "text": str} on save
    • List[Dict] - dictionaries with arbitrary keys saved as-is
  • Added "API Reference and Documentation" section to README clarifying documentation locations (source docstrings, examples directory, homepage)
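The string-to-dictionary conversion listed under "Corpus Formats" can be sketched in plain Python. This is an illustration only, not the library's code: `normalize_corpus` is a hypothetical helper, and the zero-based positional id is an assumption read from the `{"id": i, "text": str}` wording above.

```python
from typing import Dict, List, Union

Document = Union[str, Dict]


def normalize_corpus(corpus: List[Document]) -> List[Dict]:
    """Hypothetical helper mirroring the documented save-time behavior."""
    normalized = []
    for i, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Strings are wrapped with a positional id (assumed zero-based).
            normalized.append({"id": i, "text": doc})
        else:
            # Dictionaries are kept as-is, whatever keys they carry.
            normalized.append(doc)
    return normalized


print(normalize_corpus(["a cat is a feline", "a dog is a friend"]))
```

Under these assumptions, a dictionary passes through untouched, so any `id` or `title` key the caller supplies is preserved.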

Note: Other formats (tuples, lists of lists) are not documented as supported. While the implementation may accept them during saving, they either convert to other types during JSON serialization (tuples → lists) or cause errors during retrieval operations (lists of lists cause reshape errors).
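The tuple-to-list conversion mentioned in the note is ordinary JSON behavior and can be reproduced with the standard library alone. The sketch below uses Python's built-in `json` module as a stand-in; which JSON backend the library actually uses is not stated here, so treat that as an assumption.

```python
import json

# A corpus entry written as a tuple (not a documented format).
corpus = [("doc1", "a cat is a feline")]

# JSON has no tuple type, so a round trip returns a list instead.
restored = json.loads(json.dumps(corpus))

print(type(corpus[0]).__name__)    # tuple
print(type(restored[0]).__name__)  # list
```

Because the restored entry no longer has the type that was saved, code that relies on the original type after loading would see different data than it stored.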

Example

Both formats work correctly:

# `retriever` is assumed to be an already-indexed BM25 instance
# Simple strings (quickstart pattern)
corpus = ["a cat is a feline", "a dog is a friend"]
retriever.save("index", corpus=corpus)

# Dictionaries with metadata
corpus = [
    {"text": "a cat is a feline", "title": "Cats", "id": "doc1"},
    {"text": "a dog is a friend", "title": "Dogs", "id": "doc2"}
]
retriever.save("index", corpus=corpus)  # Dictionaries preserved as-is

Original prompt

This section details the original issue you should resolve

<issue_title>Format of corpus and API reference</issue_title>
<issue_description>The documentation says that a corpus is a list of dictionaries, while the quickstart defines a corpus as a list of strings. A DeepLearning.AI tutorial also uses a list of strings.

Who is right?

What format is the dictionary in? What are the keys?

Is the documentation compiled into an API reference somewhere?</issue_description>

<agent_instructions>Can you answer this question?</agent_instructions>

Comments on the Issue (you are @copilot in this section)



Owner

xhluca commented Dec 20, 2025

@copilot the list of tuple or iterable of tuple is incorrect and not accepted. also

Copilot AI changed the title from "[WIP] Fix format discrepancies in corpus and API reference" to "Clarify corpus parameter accepts multiple formats (strings, dicts, lists)" on Dec 20, 2025
Copilot AI requested a review from xhluca December 20, 2025 20:59
Contributor Author

Copilot AI commented Dec 20, 2025

> @copilot the list of tuple or iterable of tuple is incorrect and not accepted. also

Fixed in 5a98693. Removed all references to tuples from:

  • BM25.__init__() docstring
  • BM25.save() docstring
  • README "Corpus Formats" section

The documentation now correctly states that corpus accepts strings, dictionaries, or lists (not tuples).

xhluca marked this pull request as ready for review December 22, 2025 20:18
Owner

xhluca commented Dec 22, 2025

@copilot list of list is not accepted either

Contributor Author

Copilot AI commented Dec 22, 2025

> @copilot list of list is not accepted either

Fixed in 2d27401. Removed all references to "list of lists" from:

  • BM25.__init__() docstring (removed Iterable[List] and description)
  • BM25.save() docstring (removed Iterable[List] and description)
  • README "Corpus Formats" section (removed entire section 3 with list of lists example)

The documentation now correctly states that corpus accepts only strings or dictionaries.
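A caller who wants to catch unsupported entries before saving could run a check along these lines. This is a minimal sketch, not part of the library: `check_corpus_format` is a hypothetical helper that simply enforces the "strings or dictionaries only" rule stated above.

```python
from typing import Any, Iterable


def check_corpus_format(corpus: Iterable[Any]) -> None:
    """Hypothetical pre-save check: every entry must be a str or a dict."""
    for i, doc in enumerate(corpus):
        if not isinstance(doc, (str, dict)):
            # Tuples and nested lists land here, matching the review feedback.
            raise TypeError(
                f"corpus[{i}] is {type(doc).__name__}; expected str or dict"
            )


# Both documented formats pass silently.
check_corpus_format(["a cat is a feline", "a dog is a friend"])
check_corpus_format([{"text": "a cat is a feline", "id": "doc1"}])

# A list of lists is rejected up front.
try:
    check_corpus_format([["a", "cat"], ["a", "dog"]])
except TypeError as err:
    print(err)
```

Raising early keeps the failure at the call site instead of surfacing later as a reshape error during retrieval.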

Copilot AI changed the title from "Clarify corpus parameter accepts multiple formats (strings, dicts, lists)" to "Clarify corpus parameter accepts two formats (strings and dicts only)" on Dec 22, 2025
Development

Successfully merging this pull request may close these issues.

Format of corpus and API reference