
Conversation


@suryabdev (Contributor) commented Oct 15, 2025

I found the following minor bugs when running the benchmark script.

'ChatMessage' object is not iterable

There is an error when running the benchmark script:

python3 ./run.py --model-id Qwen/Qwen2.5-Coder-32B-Instruct --provider together

All the answers are "'ChatMessage' object is not iterable". The entries in the output files look like:

{"model_id": "Qwen/Qwen2.5-Coder-32B-Instruct", "agent_action_type": "code", "question": "What year was the municipality of Ramiriqu\u00ed, Boyac\u00e1, Colombia, founded? Answer with only the final number.", "original_question": "What year was the municipality of Ramiriqu\u00ed, Boyac\u00e1, Colombia, founded?", "answer": "'ChatMessage' object is not iterable", "true_answer": "1541", "source": "SimpleQA", "intermediate_steps": [], "start_time": 1760503458.6643338, "end_time": "2025-10-15 04:44:22", "token_counts": {"input": 0, "output": 0}}

Similar to #1763, the line causing the error has to be updated from dict(message) to message.dict():

intermediate_steps = [dict(message) for message in agent.write_memory_to_messages()]
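The failure mode can be sketched with a hypothetical stand-in class (not the actual smolagents ChatMessage): calling dict() on a plain, non-iterable object raises TypeError, while a serialization method such as .dict() returns the mapping directly.

```python
from dataclasses import dataclass, asdict


@dataclass
class FakeChatMessage:
    """Hypothetical stand-in for smolagents' ChatMessage, for illustration only."""

    role: str
    content: str

    def dict(self):
        # Return a plain-dict representation of the message.
        return asdict(self)


message = FakeChatMessage(role="assistant", content="120")

try:
    dict(message)  # fails: the object is not iterable and not a mapping
except TypeError as exc:
    print(exc)  # "'FakeChatMessage' object is not iterable"

print(message.dict())  # {'role': 'assistant', 'content': '120'}
```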

After that, the output files contain the expected answer:

{"model_id": "Qwen/Qwen2.5-Coder-32B-Instruct", "agent_action_type": "code", "question": "What is the counter strength value for the Fume Sword in Dark Souls II? Answer with only the final number.", "original_question": "What is the counter strength value for the Fume Sword in Dark Souls II?", "answer": "120", "true_answer": "120", "source": "SimpleQA", "intermediate_steps": ..., "start_time": 1760507832.4037542, "end_time": "2025-10-15 05:57:17", "token_counts": {"input_tokens": 5341, "output_tokens": 113, "total_tokens": 5454}}

ToolCallingAgent unexpected keyword argument 'additional_authorized_imports'

additional_authorized_imports has to be removed from the ToolCallingAgent initialization, since ToolCallingAgent does not accept that keyword argument.

Remove default InferenceClient provider

The default provider hf-inference does not support all models. I faced an issue with Qwen/Qwen3-Next-80B-A3B-Thinking:

Error in generating model output:
404 Client Error: Not Found for url: https://router.huggingface.co

Removing the default provider and letting the API pick the provider is a better default behaviour.

Datetime import issue

When running the score.ipynb notebook, I faced an issue with the datetime line datetime.date.today().isoformat():

AttributeError: 'method_descriptor' object has no attribute 'today'

Changing the import from from datetime import datetime to import datetime fixed the issue.

@suryabdev
Contributor Author

cc: @albertvillanova / @aymeric-roucher, please review when you are free.

@suryabdev suryabdev changed the title Benchmark script bug: 'ChatMessage' object is not iterable Fix minor benchmark script bugs Oct 15, 2025
)
elif action_type == "tool-calling":
agent = ToolCallingAgent(
tools=[GoogleSearchTool(provider="serper"), VisitWebpageTool(), PythonInterpreterTool()],
Contributor Author

Should the default tools be changed so that there is an apples-to-apples comparison between the ToolCallingAgent and CodeAgent?
Maybe pass the below additional_authorized_imports to the PythonInterpreterTool initialization.

Collaborator

Yes indeed, we should pass the same additional_authorized_imports to PythonInterpreterTool to make it really an even comparison!

Collaborator

I'll let you do the change before merging.

Contributor Author

Done

@suryabdev
Contributor Author

Should we remove the max_tokens from the InferenceClientModel initialization? When I ran the script, I saw some errors indicating max_tokens + input_tokens was greater than some limit. I didn't save the error log.

model = InferenceClientModel(model_id=args.model_id, provider=args.provider, max_tokens=8192)

@aymeric-roucher
Collaborator

Should we remove the max_tokens from the InferenceClientModel initialization. When I ran the script, I saw some errors indicating max_tokens + input_tokens was greater than some limit. Didn't save the error log

model = InferenceClientModel(model_id=args.model_id, provider=args.provider, max_tokens=8192)

I think this value of max_tokens is good; the issue must be more with the model/inference choice that you use. As a rule of thumb, any inference service serving under 32k tokens of context length will often run into its limits in agentic mode (just because each step adds roughly ~2k new tokens, though this varies wildly depending on the step).

Collaborator

@aymeric-roucher left a comment

Thank you @suryabdev

@suryabdev
Contributor Author

Thanks for the review @aymeric-roucher, I've made the changes. Please trigger the PR checks when you are free

under 32k tokens of context length will often run into its limits in agentic mode

This makes sense, thanks for the explanation!
