Fix minor benchmark script bugs #1822
base: main
Conversation
cc: @albertvillanova / @aymeric-roucher please review when free
examples/smolagents_benchmark/run.py (outdated diff):

```python
    )
elif action_type == "tool-calling":
    agent = ToolCallingAgent(
        tools=[GoogleSearchTool(provider="serper"), VisitWebpageTool(), PythonInterpreterTool()],
```
Should the default tools be changed so that there is an apples-to-apples comparison between the `ToolCallingAgent` and the `CodeAgent`? Maybe pass the `additional_authorized_imports` below to the `PythonInterpreterTool` initialization.
Yes indeed, we should pass the same `additional_authorized_imports` to `PythonInterpreterTool` to make it really an even comparison!
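A sketch of what the suggested change could look like (assuming `PythonInterpreterTool` accepts an `authorized_imports` argument, that `additional_authorized_imports` is the same list the `CodeAgent` branch uses, and that `model` is a placeholder for the benchmark model):

```python
# Give the ToolCallingAgent's Python tool the same import allow-list as the
# CodeAgent, so the two agents are compared on equal footing
agent = ToolCallingAgent(
    tools=[
        GoogleSearchTool(provider="serper"),
        VisitWebpageTool(),
        PythonInterpreterTool(authorized_imports=additional_authorized_imports),
    ],
    model=model,
)
```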
I'll let you do the change before merging.
Done
Should we remove the `max_tokens` from the `InferenceClientModel` initialization? When I ran the script, I saw some errors indicating `max_tokens` + input tokens was greater than some limit. I didn't save the error log.
I think this value of `max_tokens` is good; the issue must be more about the model/inference choice that you use. As a rule of thumb, any inference service serving under 32k tokens of context length will often run into its limits in agentic mode (just because each step adds roughly ~2k new tokens, though this varies wildly depending on the step).
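As a back-of-the-envelope check of that rule of thumb (all numbers illustrative):

```python
# Rough token-budget estimate for an agentic run (values assumed)
context_length = 32_000   # context window of the inference service
initial_prompt = 2_000    # system prompt + task tokens
tokens_per_step = 2_000   # approximate growth per agent step

max_steps = (context_length - initial_prompt) // tokens_per_step
print(max_steps)  # -> 15: a sub-32k service is exhausted after ~15 steps
```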
Thank you @suryabdev
Thanks for the review @aymeric-roucher, I've made the changes. Please trigger the PR checks when you are free
This makes sense, thanks for the explanation!
Found the following minor bugs when running the benchmark script:
'ChatMessage' object is not iterable
There is an error while running the benchmark script: all the answers are "'ChatMessage' object is not iterable", so the entries in the output files contain that error string instead of real answers. Similar to #1763, the line creating the error has to be updated from `dict(message)` to `message.dict()`:

smolagents/examples/smolagents_benchmark/run.py, line 161 in 1904ddd

After that, the output files have the expected answer.
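A minimal sketch of the fix (the surrounding loop and variable names are assumed, not copied from run.py):

```python
# Before: ChatMessage is not an iterable of key/value pairs, so this raises
#   TypeError: 'ChatMessage' object is not iterable
# serialized = [dict(message) for message in messages]

# After: use the .dict() serializer that ChatMessage provides
serialized = [message.dict() for message in messages]
```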
ToolCallingAgent unexpected keyword argument 'additional_authorized_imports'
`additional_authorized_imports` has to be removed from the `ToolCallingAgent` initialization, roughly as below.
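A sketch of the offending call and the fix (`tools` and `model` stand in for the script's actual arguments):

```python
# Before: raises TypeError: ToolCallingAgent.__init__() got an unexpected
#   keyword argument 'additional_authorized_imports'
# agent = ToolCallingAgent(tools=tools, model=model,
#                          additional_authorized_imports=imports)

# After: drop the kwarg; the import allow-list belongs to the Python tool
# (see the PythonInterpreterTool sketch above), not to ToolCallingAgent
agent = ToolCallingAgent(tools=tools, model=model)
```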
Remove default InferenceClient provider
The default provider `hf-inference` does not support all models; I faced an issue with `Qwen/Qwen3-Next-80B-A3B-Thinking`. Removing the default provider and letting the API pick the provider is a good default behaviour.
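A sketch of the change (the exact `InferenceClientModel` kwargs are assumed):

```python
# Before: pinning provider="hf-inference" fails for models it doesn't serve
# model = InferenceClientModel(model_id="Qwen/Qwen3-Next-80B-A3B-Thinking",
#                              provider="hf-inference")

# After: omit `provider` and let the Hugging Face API route the request to a
# provider that actually serves the model
model = InferenceClientModel(model_id="Qwen/Qwen3-Next-80B-A3B-Thinking")
```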
Datetime import issue
When running the `score.ipynb` notebook, I was facing an issue with the datetime line `datetime.date.today().isoformat()`. Changing the import from `from datetime import datetime` to `import datetime` fixed the issue.
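For reference, a self-contained reproduction of the failure and the fix:

```python
# Fails: after `from datetime import datetime`, the name `datetime` is the
# class, so `datetime.date` is its instance method and .today() on it raises
#   AttributeError: 'function' object has no attribute 'today'
# from datetime import datetime
# datetime.date.today().isoformat()

# Works: import the module, so `datetime.date` is the date class
import datetime

print(datetime.date.today().isoformat())  # e.g. '2025-06-01'
```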