Fix minor benchmark script bugs #1822
base: main
Conversation
cc: @albertvillanova / @aymeric-roucher please review when free
examples/smolagents_benchmark/run.py (outdated diff):

```python
    )
elif action_type == "tool-calling":
    agent = ToolCallingAgent(
        tools=[GoogleSearchTool(provider="serper"), VisitWebpageTool(), PythonInterpreterTool()],
```
Should the default tools be changed so that there is an apples-to-apples comparison between the `ToolCallingAgent` and the `CodeAgent`? Maybe pass the `additional_authorized_imports` below to the `PythonInterpreterTool` initialization.
Yes indeed, we should pass the same `additional_authorized_imports` to `PythonInterpreterTool` to make it really an even comparison!
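A sketch of what the suggested change could look like (assuming `PythonInterpreterTool` accepts an `authorized_imports` argument, that `additional_authorized_imports` is the same list the `CodeAgent` branch uses, and that `model` is a placeholder for the benchmark model):

```python
# Give the ToolCallingAgent's Python tool the same import allow-list as the
# CodeAgent, so the two agents are compared on equal footing
agent = ToolCallingAgent(
    tools=[
        GoogleSearchTool(provider="serper"),
        VisitWebpageTool(),
        PythonInterpreterTool(authorized_imports=additional_authorized_imports),
    ],
    model=model,
)
```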
I'll let you do the change before merging.
Done
Should we remove the `max_tokens` from the `InferenceClientModel` initialization? When I ran the script, I saw some errors indicating `max_tokens` + input tokens was greater than some limit. I didn't save the error log.
I think this value of `max_tokens` is good; the issue must be more about the model/inference choice that you use. As a rule of thumb, any inference service serving under 32k tokens of context length will often run into its limits in agentic mode (just because each step adds roughly ~2k new tokens, though this varies wildly depending on the step).
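As a back-of-the-envelope check of that rule of thumb (all numbers illustrative):

```python
# Rough token-budget estimate for an agentic run (values assumed)
context_length = 32_000   # context window of the inference service
initial_prompt = 2_000    # system prompt + task tokens
tokens_per_step = 2_000   # approximate growth per agent step

max_steps = (context_length - initial_prompt) // tokens_per_step
print(max_steps)  # -> 15: a sub-32k service is exhausted after ~15 steps
```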
Thank you @suryabdev
Thanks for the review @aymeric-roucher, I've made the changes. Please trigger the PR checks when you are free
This makes sense, thanks for the explanation!
Found the following minor bugs when running the benchmark script:
'ChatMessage' object is not iterable
There is an error while running the benchmark script: all the answers are "'ChatMessage' object is not iterable", so the entries in the output files contain that error string instead of real answers. Similar to #1763, the line creating the error has to be updated from `dict(message)` to `message.dict()`:

smolagents/examples/smolagents_benchmark/run.py, line 161 in 1904ddd

After that, the output files have the expected answer.
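A minimal sketch of the fix (the surrounding loop and variable names are assumed, not copied from run.py):

```python
# Before: ChatMessage is not an iterable of key/value pairs, so this raises
#   TypeError: 'ChatMessage' object is not iterable
# serialized = [dict(message) for message in messages]

# After: use the .dict() serializer that ChatMessage provides
serialized = [message.dict() for message in messages]
```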
ToolCallingAgent unexpected keyword argument 'additional_authorized_imports'
`additional_authorized_imports` has to be removed from the `ToolCallingAgent` initialization, roughly as below.
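A sketch of the offending call and the fix (`tools` and `model` stand in for the script's actual arguments):

```python
# Before: raises TypeError: ToolCallingAgent.__init__() got an unexpected
#   keyword argument 'additional_authorized_imports'
# agent = ToolCallingAgent(tools=tools, model=model,
#                          additional_authorized_imports=imports)

# After: drop the kwarg; the import allow-list belongs to the Python tool
# (see the PythonInterpreterTool sketch above), not to ToolCallingAgent
agent = ToolCallingAgent(tools=tools, model=model)
```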
Remove default InferenceClient provider
The default provider `hf-inference` does not support all models; I faced an issue with `Qwen/Qwen3-Next-80B-A3B-Thinking`. Removing the default provider and letting the API pick the provider is a good default behaviour.
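A sketch of the change (the exact `InferenceClientModel` kwargs are assumed):

```python
# Before: pinning provider="hf-inference" fails for models it doesn't serve
# model = InferenceClientModel(model_id="Qwen/Qwen3-Next-80B-A3B-Thinking",
#                              provider="hf-inference")

# After: omit `provider` and let the Hugging Face API route the request to a
# provider that actually serves the model
model = InferenceClientModel(model_id="Qwen/Qwen3-Next-80B-A3B-Thinking")
```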
Datetime import issue
When running the `score.ipynb` notebook, I was facing an issue with the datetime line `datetime.date.today().isoformat()`. Changing the import from `from datetime import datetime` to `import datetime` fixed the issue.
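For reference, a self-contained reproduction of the failure and the fix:

```python
# Fails: after `from datetime import datetime`, the name `datetime` is the
# class, so `datetime.date` is its instance method and .today() on it raises
#   AttributeError: 'function' object has no attribute 'today'
# from datetime import datetime
# datetime.date.today().isoformat()

# Works: import the module, so `datetime.date` is the date class
import datetime

print(datetime.date.today().isoformat())  # e.g. '2025-06-01'
```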