Clarify corpus parameter accepts two formats (strings and dicts only) #159
base: main
Co-authored-by: xhluca <[email protected]>
@copilot the list of tuples or iterable of tuples is incorrect and not accepted. also
Fixed in 5a98693. Removed all references to tuples from:
The documentation now correctly states that corpus accepts strings, dictionaries, or lists (not tuples).
@copilot list of lists is not accepted either
Fixed in 2d27401. Removed all references to "list of lists" from:
The documentation now correctly states that corpus accepts only strings or dictionaries.
The documentation inconsistently described the corpus parameter format: the quickstart showed List[str], while docstrings stated List[Dict]. Both formats are valid; the implementation fully supports strings and dictionaries.
Changes
BM25.__init__(), BM25.save(), and BM25.retrieve() now document only the two supported corpus formats:
- List[str] - strings auto-convert to {"id": i, "text": str} on save
- List[Dict] - dictionaries with arbitrary keys saved as-is
Note: Other formats (tuples, lists of lists) are not documented as supported. While the implementation may accept them during saving, they either convert to other types during JSON serialization (tuples → lists) or cause errors during retrieval operations (lists of lists cause reshape errors).
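The tuple-to-list conversion mentioned in the note can be reproduced with Python's standard json module alone. This is only an illustration of that JSON behavior, assuming the corpus is written with a JSON encoder that behaves like the stdlib one; it does not call the library, and the sample documents are made up.

```python
import json

# Made-up sample corpus whose entries are tuples (not a documented format).
corpus_with_tuples = [("doc-0", "a cat is a feline"), ("doc-1", "a dog is a canine")]

# JSON has no tuple type, so json.dumps writes each tuple as an array,
# and json.loads reads each array back as a Python list.
restored = json.loads(json.dumps(corpus_with_tuples))

print(type(corpus_with_tuples[0]).__name__)  # tuple
print(type(restored[0]).__name__)            # list
```

After a save and reload round trip, a caller that expected tuples gets lists back, which is one reason to stay with strings or dictionaries.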
Example
Both formats work correctly:
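A minimal sketch of the two formats and of the conversion described under Changes. The helper to_saved_records is hypothetical and is not the library's code; it only mirrors the stated rule that strings become {"id": i, "text": str} on save while dictionaries are kept as-is. Rejecting every other type is an assumption made here to match the documented formats (the note above says the real implementation may accept them at save time).

```python
from typing import Any, Dict, List, Union

Document = Union[str, Dict[str, Any]]


def to_saved_records(corpus: List[Document]) -> List[Dict[str, Any]]:
    """Hypothetical helper: mirror the documented save-time handling of a corpus."""
    records: List[Dict[str, Any]] = []
    for i, doc in enumerate(corpus):
        if isinstance(doc, str):
            # Strings are wrapped in a dict keyed by position and text.
            records.append({"id": i, "text": doc})
        elif isinstance(doc, dict):
            # Dictionaries keep whatever keys the caller chose.
            records.append(doc)
        else:
            # Assumption of this sketch: anything else is treated as unsupported.
            raise TypeError(
                f"unsupported corpus entry at index {i}: {type(doc).__name__}"
            )
    return records


# Format 1: a list of strings.
corpus_strings = ["a cat is a feline", "a dog is a canine"]

# Format 2: a list of dictionaries with arbitrary keys (keys here are made up).
corpus_dicts = [
    {"title": "cats", "body": "a cat is a feline"},
    {"title": "dogs", "body": "a dog is a canine"},
]

print(to_saved_records(corpus_strings))
print(to_saved_records(corpus_dicts))
```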
<issue_title>Format of corpus and API reference</issue_title>
><issue_description>The documentation says that a corpus is a list of dictionaries, while the quickstart defines a corpus as a list of strings. A DeepLearning.AI tutorial also uses a list of strings.
>
> Who is right?
>
> What format is the dictionary in? What are the keys?
>
> Is the documentation compiled into an API reference somewhere?</issue_description>
>
><agent_instructions>Can you answer this question?</agent_instructions>
>
> ## Comments on the Issue (you are @copilot in this section)
>
>
>