👉 New to ML Agents? Check out the BEGINNER_GUIDE.md for a step-by-step walkthrough!
- ML Agents Community Program - Main hub for Cohere Labs' community-driven initiative on open-source agent research, focusing on agentic frameworks, applications, evaluations, and benchmarks
- Project Documentation - Detailed specifications and roadmap for the ZeroHPO (Zero-shot Hyperparameter Optimization) project for agentic tasks
- Project Tracker - Community project tracking, task assignments, and progress monitoring
- Discord Community - Join the #ml-agents channel for discussions, meetings, and collaboration with the community
This project investigates how different reasoning approaches affect AI model performance across a variety of tasks. It provides a comprehensive framework for comparing multiple reasoning techniques across different language models.
🎉 Phase 15 Complete: The platform now includes structured answer extraction with the Instructor library, providing clean separation of reasoning text from final answers across all reasoning approaches, with provider-aware optimization!
- Universal Benefit: Do all tasks benefit from reasoning?
- Model Variability: Do different models show varying benefits from reasoning?
- Approach Comparison: How do different reasoning approaches (CoT, PoT, etc.) compare?
- Task-Approach Fit: Do certain tasks benefit more from specific reasoning methods?
- Cost-Benefit Analysis: What is the tradeoff for each approach and task?
- Predictive Reasoning: Can we predict the need for reasoning based on the input prompt alone?
The platform currently supports 8 production-ready reasoning approaches with structured answer extraction:
- None - Baseline direct prompting without reasoning
- Chain-of-Thought (CoT) - Step-by-step reasoning process
- Program-of-Thought (PoT) - Code-based problem solving
- Reasoning-as-Planning - Strategic planning with goal decomposition
- Reflection - Self-evaluation and iterative improvement
- Chain-of-Verification - Systematic verification with follow-up questions
- Skeleton-of-Thought - Hierarchical outline-first reasoning
- Tree-of-Thought - Multiple reasoning path exploration and synthesis
Additional approaches planned: Graph-of-Thought, ReWOO, Buffer-of-Thoughts (Phase 6)
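Conceptually, each approach wraps the same underlying question in a different prompting strategy before it is sent to the model. The sketch below illustrates the idea for the None baseline and Chain-of-Thought; it is a simplified illustration with hypothetical class names, not the project's actual base class in src/reasoning/base.py.

```python
# Simplified illustration of how reasoning approaches differ at the prompt level.
# NOT the project's implementation (see src/reasoning/base.py); names are hypothetical.
from abc import ABC, abstractmethod


class ReasoningApproach(ABC):
    @abstractmethod
    def build_prompt(self, question: str) -> str:
        """Wrap the raw question in an approach-specific prompt."""


class NoneBaseline(ReasoningApproach):
    def build_prompt(self, question: str) -> str:
        return question  # direct prompting, no added reasoning scaffold


class ChainOfThought(ReasoningApproach):
    def build_prompt(self, question: str) -> str:
        return f"{question}\n\nLet's think step by step."


print(ChainOfThought().build_prompt("What is 17 + 25?"))
```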
All reasoning approaches now include clean answer extraction that:
- Separates reasoning from answers: Preserves full reasoning traces while extracting clean final answers
- Removes common prefixes: Converts "The answer is 42" → "42" automatically
- Provider-optimized: Uses ANTHROPIC_TOOLS, TOOLS, or JSON modes based on provider capabilities
- Reliable fallback: TOOLS → JSON fallback ensures compatibility across all providers
- Type-safe extraction: Pydantic models with validation and confidence scoring
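Under the hood this pattern relies on Instructor's structured outputs with a Pydantic response model. The snippet below is a minimal sketch of that pattern only; the field names, example model, and client setup are illustrative assumptions rather than the project's exact code.

```python
# Minimal sketch of structured answer extraction with Instructor + Pydantic.
# Field names, the example model, and the client setup are assumptions,
# not the project's exact implementation.
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field


class ExtractedAnswer(BaseModel):
    """Clean final answer, separated from the full reasoning trace."""
    final_answer: str = Field(description="Answer only, without prefixes like 'The answer is'")
    confidence: float = Field(ge=0.0, le=1.0, description="Extraction confidence score")


# TOOLS mode where the provider supports it; Instructor also offers JSON mode as a fallback.
client = instructor.from_openai(OpenAI(), mode=instructor.Mode.TOOLS)

result = client.chat.completions.create(
    model="gpt-4o-mini",  # hypothetical choice for this example
    response_model=ExtractedAnswer,
    messages=[{"role": "user", "content": "Reasoning: 6 * 7 = 42. The answer is 42."}],
)
print(result.final_answer)  # expected: "42"
```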
- Python 3.9+
- uv (for virtual environment management)
- API keys for at least one provider (Anthropic, Cohere, or OpenRouter)
Install the latest stable version from PyPI:
# Install globally
pip install ml-agents-reasoning
# Or install with development dependencies
pip install ml-agents-reasoning[dev]
# Verify installation
ml-agents --version
ml-agents --help

With uv (fastest):
# Install with uv
uv tool install ml-agents-reasoning
# Run without installing (recommended for trying out)
uvx ml-agents-reasoning eval run LOCAL_TEST ChainOfThought --samples 10
# Add to project dependencies
uv add ml-agents-reasoning

For contributors or advanced users:
# Clone and install in development mode
git clone https://github.com/thompsonson/c4ai-ml-agents
cd c4ai-ml-agents
pip install -e .[dev]
# Or with uv (recommended)
uv sync --all-extras

After installation, configure your API keys:
# Create configuration file
cp .env.example .env
# Edit .env with your actual API keys
# Or set environment variables directly
export ANTHROPIC_API_KEY="your-key-here"
export OPENROUTER_API_KEY="your-key-here"

The ML Agents CLI includes two types of commands:
- Stable Commands (✅ Production Ready): `setup`, `db`, `preprocess` - Well-tested, stable API, suitable for production use
- Pre-Alpha Commands (⚠️ Experimental): `eval`, `results` - Experimental features that may be unstable or have breaking changes
For production use or getting started, we recommend using only the stable commands first.
Once installed, you can use the ML Agents CLI:
# Validate your environment
ml-agents setup validate-env
# List available reasoning approaches
ml-agents setup list-approaches
# Discover available datasets (⚠️ PRE-ALPHA)
ml-agents eval list
# Get dataset information (⚠️ PRE-ALPHA)
ml-agents eval info LOCAL_TEST
# Run a simple experiment (⚠️ PRE-ALPHA)
ml-agents eval run LOCAL_TEST ChainOfThought --samples 10
# Run with repository benchmark (⚠️ PRE-ALPHA)
ml-agents eval run BENCHMARK-01-GPQA.csv TreeOfThought --samples 50
# Compare multiple approaches (⚠️ PRE-ALPHA)
ml-agents eval compare --config examples/configs/comparison_study.yaml

To use the original Jupyter notebook interface:
jupyter notebook Reasoning_LLM.ipynb

- Anthropic: Claude Opus 4, Claude Sonnet 4, Claude 3.5 Haiku
- Cohere: Command R+, Command R, Command Light
- OpenRouter: GPT-5, GPT-5 Mini, GPT OSS-120B, Gemini 2.5 Flash Lite
- Temperature: 0.0 - 2.0 (controls randomness)
- Max Tokens: 64 - 4096 (output length limit)
- Top P: 0.0 - 1.0 (nucleus sampling parameter)
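To make the three parameters above concrete, here is what they control on a raw API call. The sketch uses a generic OpenAI-compatible client pointed at OpenRouter purely for illustration; it is not the project's internal wrapper in src/utils/api_clients.py.

```python
# Illustration of temperature / max_tokens / top_p on a plain OpenAI-compatible call
# routed through OpenRouter. This is NOT the project's api_clients.py wrapper.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "What is 17 + 25? Think step by step."}],
    temperature=0.3,  # 0.0-2.0: lower = more deterministic output
    max_tokens=512,   # 64-4096: hard cap on generated tokens
    top_p=1.0,        # 0.0-1.0: nucleus sampling cutoff
)
print(response.choices[0].message.content)
```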
The platform includes SQLite database persistence for all experiment results and supports Claude Code MCP server integration for direct database access during conversations.
- Real-time persistence: All experiment results are automatically saved to `ml_agents_results.db`
- Read-only MCP access: Query the database directly from Claude Code conversations
- Rich export formats: CSV, JSON, and Excel with advanced formatting
- Advanced analytics: Approach comparisons, failure analysis, and cost tracking
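Because results live in a plain SQLite file, you can also inspect them directly with any SQLite client. The snippet below is only a sketch: the table and column names it assumes are illustrative, so check the actual schema first (for example via `ml-agents db stats`).

```python
# Sketch: ad-hoc inspection of the results database with the standard library.
# WARNING: the table/column names below ("runs", "approach", "is_correct") are
# assumptions for illustration -- inspect the real schema before querying.
import sqlite3

conn = sqlite3.connect("ml_agents_results.db")
conn.row_factory = sqlite3.Row

# List the tables that actually exist before assuming any names.
tables = [r["name"] for r in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")]
print("tables:", tables)

# Hypothetical accuracy-per-approach query, assuming a 'runs' table exists.
for row in conn.execute(
    "SELECT approach, COUNT(*) AS n, AVG(is_correct) AS accuracy "
    "FROM runs GROUP BY approach ORDER BY accuracy DESC"
):
    print(f"{row['approach']:<25} n={row['n']:<5} accuracy={row['accuracy']:.2%}")

conn.close()
```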
# Database management (Stable Commands)
ml-agents db init --db-path ./results.db # Initialize database
ml-agents db backup --source ./results.db # Create backup
ml-agents db stats --db-path ./results.db # Show statistics
ml-agents db migrate --db-path ./results.db # Migrate database schema
# Export and analysis (⚠️ PRE-ALPHA)
ml-agents results export EXPERIMENT_ID --format excel # Export to Excel
ml-agents results compare "exp1,exp2,exp3" # Compare experiments
ml-agents results analyze EXPERIMENT_ID --type accuracy # Generate reports
ml-agents results list --status completed                # List experiments

Explore available datasets before running experiments:
# List all available datasets
ml-agents eval list
# Get detailed information about a dataset
ml-agents eval info LOCAL_TEST
ml-agents eval info BENCHMARK-01-GPQA.csv
# List datasets from custom repository
ml-agents eval list --repo your-org/your-benchmarks

Run one reasoning approach on a dataset:
# Basic usage with LOCAL_TEST
ml-agents eval run LOCAL_TEST ChainOfThought --samples 10
# Use repository benchmark
ml-agents eval run BENCHMARK-01-GPQA.csv TreeOfThought --samples 50
# With specific model provider
ml-agents eval run LOCAL_TEST Reflection --provider anthropic --model claude-3-5-haiku-20241022
# With advanced reasoning settings
ml-agents eval run LOCAL_TEST ChainOfVerification --multi-step-verification --max-reasoning-calls 5
# With custom repository
ml-agents eval run my-dataset.csv ChainOfThought --repo your-org/benchmarks --samples 100

Compare multiple approaches using configuration files:
# Basic comparison with config file
ml-agents eval compare --config examples/configs/comparison_study.yaml
# Override config settings
ml-agents eval compare --config examples/configs/comparison_study.yaml --samples 200 --parallel

Note: The compare command uses YAML configuration files to specify multiple approaches. See the Configuration Files section below for details.
For complex experiments, use YAML configuration files:
# Run single experiment with config (⚠️ PRE-ALPHA)
ml-agents eval run LOCAL_TEST ChainOfThought --config examples/configs/single_experiment.yaml
# Run comparison with config (⚠️ PRE-ALPHA)
ml-agents eval compare --config examples/configs/comparison_study.yaml
# Override config parameters (⚠️ PRE-ALPHA)
ml-agents eval compare --config examples/configs/comparison_study.yaml --samples 200 --parallel

Example configuration (config.yaml):
experiment:
  name: "reasoning_comparison_study"
  sample_count: 100
  output_dir: "./results"

model:
  provider: "openrouter"
  name: "openai/gpt-oss-120b"
  temperature: 0.3
  max_tokens: 512

reasoning:
  approaches:
    - ChainOfThought
    - ReasoningAsPlanning
    - TreeOfThought
  multi_step_verification: true
  max_reasoning_calls: 5

execution:
  parallel: true
  max_workers: 4
  save_checkpoints: true

Resume interrupted experiments:
# List available checkpoints
ml-agents eval checkpoints
# Resume from specific checkpoint
ml-agents eval resume checkpoint_exp_20250818_123456.json

# Set reasoning limits to control costs
ml-agents eval run LOCAL_TEST ChainOfVerification --max-reasoning-calls 3 --samples 50
# Monitor costs with verbose output
ml-agents eval compare --config examples/configs/comparison_study.yaml --samples 100 --verbose

# Enable multi-step reflection
ml-agents eval run LOCAL_TEST Reflection --multi-step-reflection --max-reflection-iterations 3
# Enable multi-step verification
ml-agents eval run LOCAL_TEST ChainOfVerification --multi-step-verification --max-reasoning-calls 5

# Parallel execution with config file (approaches defined in YAML)
ml-agents eval compare --config examples/configs/comparison_study.yaml --parallel --max-workers 2
# Override config for large experiments
ml-agents eval compare --config examples/configs/comparison_study.yaml --samples 500 --parallel --max-workers 8

Results are organized by dataset with full preprocessing-evaluation traceability:
./outputs/
├── {dataset_name}/
│   ├── preprocessing/
│   │   ├── {timestamp}/
│   │   │   ├── analysis.json            # Dataset schema analysis
│   │   │   ├── rules.json               # Transformation rules
│   │   │   ├── processed.json           # Standardized dataset
│   │   │   └── metadata.json            # Preprocessing metadata
│   │   └── latest → {most_recent}/      # Symlink to latest preprocessing
│   └── eval/
│       ├── {exp_timestamp}/
│       │   ├── experiment_config.json   # Experiment configuration
│       │   ├── experiment_results.csv   # Detailed results per approach
│       │   ├── experiment_summary.json  # Performance summary
│       │   └── experiment_errors.json   # Any processing errors
│       └── latest → {most_recent}/      # Symlink to latest experiment
Each result file contains:
- Input prompts and model responses
- Complete reasoning traces
- Performance metrics (accuracy, time, cost)
- Configuration details
- Error information
- Preprocessing lineage: Complete traceability to preprocessing rules used
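For quick programmatic post-processing, the `latest` symlink makes the newest run easy to find. A minimal sketch follows (the path layout matches the tree above; the dataset folder name is hypothetical, and no assumptions are made about the keys inside the summary JSON):

```python
# Sketch: load the most recent evaluation summary for a dataset via the `latest` symlink.
import json
from pathlib import Path

dataset = "LOCAL_TEST"  # hypothetical dataset folder name
summary_path = Path("outputs") / dataset / "eval" / "latest" / "experiment_summary.json"

with summary_path.open() as f:
    summary = json.load(f)

# Print the top-level keys rather than assuming their names.
print(sorted(summary.keys()))
```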
# Validate your setup first
ml-agents setup validate-env
ml-agents setup list-approaches
ml-agents setup version

# Initialize and manage experiment database
ml-agents db init
ml-agents db stats
ml-agents db backup --source ./ml_agents_results.db

# Preprocess datasets for evaluation
ml-agents preprocess list
ml-agents preprocess inspect MilaWang/SpatialEval --samples 100
ml-agents preprocess batch --max 5

# 1. Preprocess a custom dataset (creates organized folder structure)
ml-agents preprocess inspect MilaWang/SpatialEval --config tqa
ml-agents preprocess generate-rules MilaWang/SpatialEval --config tqa
ml-agents preprocess transform MilaWang/SpatialEval rules.json --config tqa
# 2. Run evaluation with preprocessed data (auto-detects latest preprocessing) (⚠️ PRE-ALPHA)
ml-agents eval run MilaWang_SpatialEval_tqa ChainOfThought --samples 50
# 3. Compare approaches on same preprocessed dataset (⚠️ PRE-ALPHA)
ml-agents eval run MilaWang_SpatialEval_tqa TreeOfThought --samples 50
ml-agents eval run MilaWang_SpatialEval_tqa Reflection --samples 50
# 4. View organized results
ml-agents results list

# Test with small sample size
ml-agents eval run LOCAL_TEST ChainOfThought --samples 5 --verbose

# Comprehensive comparison study
ml-agents eval compare \
--approaches "None,ChainOfThought,ReasoningAsPlanning,TreeOfThought,Reflection" \
--samples 200 \
--parallel \
--max-workers 4 \
--multi-step-verification \
  --output "./studies/comprehensive_study"

Stable commands (✅ production ready):

| Command | Description |
|---|---|
| `ml-agents setup validate-env` | Check environment setup |
| `ml-agents setup list-approaches` | Show available reasoning methods |
| `ml-agents setup version` | Show version information |
| `ml-agents db init` | Initialize experiment database |
| `ml-agents db backup` | Create database backup |
| `ml-agents db stats` | Show database statistics |
| `ml-agents db migrate` | Migrate database schema |
| `ml-agents preprocess list` | List unprocessed datasets |
| `ml-agents preprocess inspect` | Inspect dataset schema |
| `ml-agents preprocess batch` | Batch process datasets |
Pre-alpha commands (⚠️ experimental):

| Command | Description |
|---|---|
| `ml-agents eval run` | Single reasoning experiment |
| `ml-agents eval compare` | Multi-approach comparison |
| `ml-agents eval resume` | Resume from checkpoint |
| `ml-agents eval checkpoints` | Show available checkpoints |
| `ml-agents results export` | Export experiment results |
| `ml-agents results compare` | Compare experiments |
| `ml-agents results analyze` | Analyze experiment patterns |
For detailed help on any command:
ml-agents setup --help
ml-agents eval run --help
ml-agents db --help

For users who prefer the notebook interface:
- Setup: Ensure dependencies are installed via `./setup.sh`
- Configuration: Use the interactive widgets to select models and approaches
- Data: Defaults to the "bbeh-eval" dataset; customizable
- Execute: Run the experiment cells to process your dataset
- Results: Tables and CSV files named `{model}_{approach}_{timestamp}.csv`
Your dataset should include:
- `input` column: The question/problem to solve
- `answer` column (optional): Expected output for evaluation
- `task` column (optional): Task category for analysis
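As a concrete illustration, the sketch below builds a tiny dataset in that shape with pandas; the file name and example rows are hypothetical.

```python
# Hypothetical two-row dataset in the expected column format.
import pandas as pd

df = pd.DataFrame(
    {
        "input": ["What is 17 + 25?", "Name the capital of France."],
        "answer": ["42", "Paris"],            # optional: expected output for evaluation
        "task": ["arithmetic", "geography"],  # optional: task category for analysis
    }
)
df.to_csv("my_dataset.csv", index=False)
print(df)
```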
The notebook generates CSV files containing:
- Input prompts
- Model outputs
- Full reasoning traces
- Execution time
- Cost estimates
- Configuration details
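If you want to aggregate those CSVs outside the notebook, something like the following works; it is a sketch that only relies on the file-naming convention above, and it does not assume specific column names.

```python
# Sketch: collect and concatenate notebook result CSVs named {model}_{approach}_{timestamp}.csv.
import glob
import pandas as pd

files = glob.glob("*_ChainOfThought_*.csv")  # approach name here is just an example
df = pd.concat([pd.read_csv(f) for f in files])
print(df.head())
print("rows:", len(df))
```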
ml-agents/
├── src/                           # Main source code
│   ├── cli/                       # CLI interface (Phase 5)
│   │   ├── main.py                # CLI entry point
│   │   ├── commands.py            # Run/compare commands
│   │   ├── config_loader.py       # Configuration management
│   │   ├── display.py             # Rich output formatting
│   │   └── validators.py          # Input validation
│   ├── core/                      # Core experiment logic
│   │   ├── experiment_runner.py   # Experiment orchestration
│   │   ├── dataset_loader.py      # Dataset loading
│   │   └── reasoning_inference.py # Inference engine
│   ├── reasoning/                 # Reasoning approaches
│   │   ├── base.py                # Base reasoning class
│   │   ├── chain_of_thought.py    # CoT implementation
│   │   ├── tree_of_thought.py     # ToT implementation
│   │   └── ...                    # Other approaches
│   └── utils/                     # Utilities
│       ├── api_clients.py         # API wrappers
│       ├── rate_limiter.py        # Rate limiting
│       └── logging_config.py      # Logging setup
├── examples/                      # Usage examples
│   ├── configs/                   # Configuration templates
│   ├── scripts/                   # Batch processing scripts
│   └── README.md                  # Examples documentation
├── tests/                         # Test suite
├── outputs/                       # Organized experiment and preprocessing results
│   └── {dataset_name}/            # Dataset-centric organization
│       ├── preprocessing/         # Preprocessing runs with timestamps
│       └── eval/                  # Evaluation runs with timestamps
├── Reasoning_LLM.ipynb            # Original Jupyter notebook
├── config.py                      # Environment configuration
├── requirements.txt               # Python dependencies
├── setup.sh                       # Automated setup script
├── Makefile                       # Development commands
└── README.md                      # This file
- Start Small: Begin with `--samples 10` to test approaches quickly
- Use Baselines: Always include the `None` approach for comparison
- Cost Control: Monitor costs with `--verbose` and set `--max-reasoning-calls`
- Parallel Processing: Use `--parallel` for faster comparison studies
- Reproducibility: Save configuration files and use checkpoints
- Temperature Settings: Lower values (0.1-0.3) for consistent, cost-effective results
- Token Limits: Set appropriate `--max-tokens` based on your task complexity
- Sample Sizing: Use smaller samples for initial exploration
- Provider Selection: Compare costs across different providers
- Multi-step Limits: Control `--max-reasoning-calls` for approaches like Chain-of-Verification
- Parallel Execution: Use `--parallel --max-workers N` for comparison studies
- Checkpoint Usage: Enable checkpoints for long-running experiments
- Rate Limiting: Adjust `--max-workers` based on provider rate limits
- Batch Processing: Use configuration files and scripts for multiple experiments
# Check environment
ml-agents setup validate-env
# Fix dependency issues
make clean && make install-dev
# Verify imports
make debug-imports

# Check .env file exists and has keys
cat .env
# Validate specific provider
ml-agents setup validate-env

Error messages will guide you to set missing keys:
export OPENROUTER_API_KEY="your_key_here"
export ANTHROPIC_API_KEY="your_key_here"

If you encounter rate limits:
# Reduce parallel workers for comparison experiments
ml-agents eval compare --config examples/configs/comparison_study.yaml --max-workers 1
# Disable parallel processing for single experiments
ml-agents eval run LOCAL_TEST ChainOfThought --samples 50 --parallel false

For large experiments:
# Reduce sample size for comparison experiments
ml-agents eval compare --config examples/configs/comparison_study.yaml --samples 50
# Disable parallel processing for comparison experiments
ml-agents eval compare --config examples/configs/comparison_study.yaml --parallel false

The warning about NumPy 1.x vs 2.x is cosmetic and doesn't affect functionality:
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.3.2...
This is a known PyTorch compatibility issue and can be ignored.
# Reinstall the package
make install-dev
# Check entry point
which ml-agents

# Activate virtual environment
source .venv/bin/activate
# Test imports
make debug-imports

For configuration errors, check:
- YAML/JSON syntax is valid
- All required fields are present
- Approach names match available options (`ml-agents setup list-approaches`)
- Provider/model combinations are supported
- Command Help: Use `--help` with any command:
  ml-agents --help
  ml-agents eval run --help
  ml-agents eval compare --help
- Verbose Output: Add `--verbose` to see detailed execution logs:
  ml-agents eval run LOCAL_TEST ChainOfThought --samples 5 --verbose
- Check Status: Validate your setup:
  ml-agents setup validate-env
  ml-agents setup list-approaches
  make validate-env
- Community Support: Join the Discord #ml-agents channel for help
For developers working on the codebase:
# Run test suite
make test
# Check code quality
make lint
# Type checking
make type-check
# Full development check
make check

For developers using Claude Code, enable direct database queries in conversations:
# Configure MCP server (one-time setup)
make configure-mcp
# Or run the script directly
./scripts/install-sqlite-mcp-server.sh

Available MCP Tools:
- `read_query`: Execute validated SELECT queries
- `list_tables`: Show all database tables
- `describe_table`: Show table schemas
Note: The installed server may not appear in `claude mcp list` due to a known bug. Use `claude mcp get sqlite-read-only` to verify installation.
Feel free to extend the notebook with:
- Additional reasoning approaches
- New evaluation metrics
- Support for more models/providers
- Performance optimizations
This project is licensed under the Creative Commons Attribution 4.0 International License. This means:
- ✅ Share - Copy and redistribute the material in any medium or format
- ✅ Adapt - Remix, transform, and build upon the material for any purpose, even commercially
- ✅ Attribution - You must give appropriate credit, provide a link to the license, and indicate if changes were made
This license is chosen because:
- Open Science: Aligns with Cohere Labs' open science mission
- Maximum Impact: Allows both academic and commercial use, accelerating AI research
- Community Growth: Enables derivatives while ensuring original work is credited
- Simplicity: Easy to understand and implement
Note: For the code components specifically, you may want to consider dual-licensing with MIT or Apache 2.0 for better software compatibility.
- CC BY-SA 4.0: Adds "ShareAlike" requirement - derivatives must use same license (more restrictive but ensures openness)
- CC BY-NC 4.0: Adds "NonCommercial" restriction - prevents commercial use (limits industry collaboration)
- CC0: Public domain dedication - no attribution required (maximum freedom but no credit requirement)
