diff --git a/skills/hugging-face-evaluation/SKILL.md b/skills/hugging-face-evaluation/SKILL.md index 5bdc03c8..3034a11a 100644 --- a/skills/hugging-face-evaluation/SKILL.md +++ b/skills/hugging-face-evaluation/SKILL.md @@ -1,651 +1,207 @@ --- name: hugging-face-evaluation -description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format. +description: Run evaluations for Hugging Face Hub models using inspect-ai and lighteval on local hardware. Use for backend selection, local GPU evals, and choosing between vLLM / Transformers / accelerate. Not for HF Jobs orchestration, model-card PRs, .eval_results publication, or community-evals automation. --- # Overview -This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data: -- Extracting existing evaluation tables from README content -- Importing benchmark scores from Artificial Analysis -- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai) -## Integration with HF Ecosystem -- **Model Cards**: Updates model-index metadata for leaderboard integration -- **Artificial Analysis**: Direct API integration for benchmark imports -- **Papers with Code**: Compatible with their model-index specification -- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration -- **vLLM**: Efficient GPU inference for custom model evaluation -- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends -- **inspect-ai**: UK AI Safety Institute's evaluation framework +This skill is for **running evaluations against models on the Hugging Face Hub on local hardware**. 
-# Version -1.3.0 +It covers: +- `inspect-ai` with local inference +- `lighteval` with local inference +- choosing between `vllm`, Hugging Face Transformers, and `accelerate` +- smoke tests, task selection, and backend fallback strategy -# Dependencies +It does **not** cover: +- Hugging Face Jobs orchestration +- model-card or `model-index` edits +- README table extraction +- Artificial Analysis imports +- `.eval_results` generation or publishing +- PR creation or community-evals automation -## Core Dependencies -- huggingface_hub>=0.26.0 -- markdown-it-py>=3.0.0 -- python-dotenv>=1.2.1 -- pyyaml>=6.0.3 -- requests>=2.32.5 -- re (built-in) +If the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill. -## Inference Provider Evaluation -- inspect-ai>=0.3.0 -- inspect-evals -- openai +If the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`. -## vLLM Custom Model Evaluation (GPU required) -- lighteval[accelerate,vllm]>=0.6.0 -- vllm>=0.4.0 -- torch>=2.0.0 -- transformers>=4.40.0 -- accelerate>=0.30.0 +> All paths below are relative to the directory containing this `SKILL.md`. -Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`. 
+# When To Use Which Script -# IMPORTANT: Using This Skill +| Use case | Script | +|---|---| +| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` | +| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` | +| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` | +| Extra command patterns | `examples/USAGE_EXAMPLES.md` | -## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones +# Prerequisites -**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:** +- Prefer `uv run` for local execution. +- Set `HF_TOKEN` for gated/private models. +- For local GPU runs, verify GPU access before starting: ```bash -uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name" +uv --version +printenv HF_TOKEN >/dev/null +nvidia-smi ``` -**If open PRs exist:** -1. **DO NOT create a new PR** - this creates duplicate work for maintainers -2. **Warn the user** that open PRs already exist -3. **Show the user** the existing PR URLs so they can review them -4. Only proceed if the user explicitly confirms they want to create another PR +If `nvidia-smi` is unavailable, either: +- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or +- hand off to the `hugging-face-jobs` skill if the user wants remote compute. -This prevents spamming model repositories with duplicate evaluation PRs. +# Core Workflow ---- - -> **All paths are relative to the directory containing this SKILL.md -file.** -> Before running any script, first `cd` to that directory or use the full -path. 
- - -**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`: -```bash -uv run scripts/evaluation_manager.py --help -uv run scripts/evaluation_manager.py inspect-tables --help -uv run scripts/evaluation_manager.py extract-readme --help -``` -Key workflow (matches CLI help): - -1) `get-prs` → check for existing open PRs first -2) `inspect-tables` → find table numbers/columns -3) `extract-readme --table N` → prints YAML by default -4) add `--apply` (push) or `--create-pr` to write changes - -# Core Capabilities - -## 1. Inspect and Extract Evaluation Tables from README -- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows -- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples) -- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist) -- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models) -- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text. -- **YAML Generation**: Convert selected table to model-index YAML format -- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`) - -## 2. Import from Artificial Analysis -- **API Integration**: Fetch benchmark scores directly from Artificial Analysis -- **Automatic Formatting**: Convert API responses to model-index format -- **Metadata Preservation**: Maintain source attribution and URLs -- **PR Creation**: Automatically create pull requests with evaluation updates - -## 3. 
Model-Index Management -- **YAML Generation**: Create properly formatted model-index entries -- **Merge Support**: Add evaluations to existing model cards without overwriting -- **Validation**: Ensure compliance with Papers with Code specification -- **Batch Operations**: Process multiple models efficiently - -## 4. Run Evaluations on HF Jobs (Inference Providers) -- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library -- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure -- **Zero-Config**: No Dockerfiles or Space management required -- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job -- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI - -## 5. Run Custom Model Evaluations with vLLM (NEW) - -⚠️ **Important:** This approach is only possible on devices with `uv` installed and sufficient GPU memory. -**Benefits:** No need to use `hf_jobs()` MCP tool, can run scripts directly in terminal -**When to use:** User working in local device directly when GPU is available - -### Before running the script - -- check the script path -- check uv is installed -- check gpu is available with `nvidia-smi` - -### Running the script - -```bash -uv run scripts/train_sft_example.py -``` -### Features +1. Choose the evaluation framework. + - Use `inspect-ai` when you want explicit task control and inspect-native flows. + - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks. +2. Choose the inference backend. + - Prefer `vllm` for throughput on supported architectures. + - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks. +3. Start with a smoke test. + - `inspect-ai`: add `--limit 10` or similar. + - `lighteval`: add `--max-samples 10`. +4. Scale up only after the smoke test passes. +5. 
If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args. -- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods) -- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks -- **inspect-ai Framework**: UK AI Safety Institute's evaluation library -- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure - -# Usage Instructions - -The skill includes Python scripts in `scripts/` to perform operations. - -### Prerequisites -- Preferred: use `uv run` (PEP 723 header auto-installs deps) -- Optional manual fallback: `uv pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests` -- Set `HF_TOKEN` environment variable with Write-access token -- For Artificial Analysis: Set `AA_API_KEY` environment variable -- `.env` is loaded automatically if `python-dotenv` is installed - -### Method 1: Extract from README (CLI workflow) - -Recommended flow (matches `--help`): -```bash -# 1) Inspect tables to get table numbers and column hints -uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model" - -# 2) Extract a specific table (prints YAML by default) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - [--model-column-index ] \ - [--model-name-override ""] # use exact header text if you can't use the index - -# 3) Apply changes (push or PR) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - --apply # push directly -# or -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - --create-pr # open a PR -``` - -Validation checklist: -- YAML is printed by default; compare against the README table before applying. -- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact. 
-- For transposed tables (models as rows), ensure only one row is extracted. - -### Method 2: Import from Artificial Analysis - -Fetch benchmark scores from Artificial Analysis API and add them to a model card. - -**Basic Usage:** -```bash -AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" -``` - -**With Environment File:** -```bash -# Create .env file -echo "AA_API_KEY=your-api-key" >> .env -echo "HF_TOKEN=your-hf-token" >> .env - -# Run import -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" -``` - -**Create Pull Request:** -```bash -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" \ - --create-pr -``` +# Quick Start -### Method 3: Run Evaluation Job +## Option A: inspect-ai with local inference providers path -Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI. +Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead. 
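The smoke-test-then-scale pattern from the core workflow can be sketched for this path as a tiny helper that assembles the command line. The script path and the `--model`/`--task`/`--limit` flags are the ones documented in this skill; the helper itself is illustrative, not part of the scripts:

```shell
# Sketch: build the inspect_eval_uv.py command, optionally capped for a smoke test.
build_inspect_cmd() {
  local model="$1" task="$2" limit="${3:-}"
  local cmd="uv run scripts/inspect_eval_uv.py --model $model --task $task"
  [ -n "$limit" ] && cmd="$cmd --limit $limit"
  printf '%s\n' "$cmd"
}

# Smoke test first; run the full command only once this one passes.
build_inspect_cmd meta-llama/Llama-3.2-1B mmlu 10
# prints: uv run scripts/inspect_eval_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10
build_inspect_cmd meta-llama/Llama-3.2-1B mmlu
```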
-**Direct CLI Usage:** ```bash -HF_TOKEN=$HF_TOKEN \ -hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \ - --flavor cpu-basic \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "meta-llama/Llama-2-7b-hf" \ - --task "mmlu" -``` - -**GPU Example (A10G):** -```bash -HF_TOKEN=$HF_TOKEN \ -hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \ - --flavor a10g-small \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "meta-llama/Llama-2-7b-hf" \ - --task "gsm8k" -``` - -**Python Helper (optional):** -```bash -uv run scripts/run_eval_job.py \ - --model "meta-llama/Llama-2-7b-hf" \ - --task "mmlu" \ - --hardware "t4-small" -``` - -### Method 4: Run Custom Model Evaluation with vLLM - -Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware. - -#### When to Use vLLM Evaluation (vs Inference Providers) - -| Feature | vLLM Scripts | Inference Provider Scripts | -|---------|-------------|---------------------------| -| Model access | Any HF model | Models with API endpoints | -| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure | -| Cost | HF Jobs compute cost | API usage fees | -| Speed | vLLM optimized | Depends on provider | -| Offline | Yes (after download) | No | - -#### Option A: lighteval with vLLM Backend - -lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks. 
- -**Standalone (local GPU):** -```bash -# Run MMLU 5-shot with vLLM -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" - -# Run multiple tasks -uv run scripts/lighteval_vllm_uv.py \ +uv run scripts/inspect_eval_uv.py \ --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" - -# Use accelerate backend instead of vLLM -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" \ - --backend accelerate - -# Chat/instruction-tuned models -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B-Instruct \ - --tasks "leaderboard|mmlu|5" \ - --use-chat-template -``` - -**Via HF Jobs:** -```bash -hf jobs uv run scripts/lighteval_vllm_uv.py \ - --flavor a10g-small \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" + --task mmlu \ + --limit 20 ``` -**lighteval Task Format:** -Tasks use the format `suite|task|num_fewshot`: -- `leaderboard|mmlu|5` - MMLU with 5-shot -- `leaderboard|gsm8k|5` - GSM8K with 5-shot -- `lighteval|hellaswag|0` - HellaSwag zero-shot -- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot - -**Finding Available Tasks:** -The complete list of available lighteval tasks can be found at: -https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt - -This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include: -- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.) -- `lighteval` - Additional lighteval tasks -- `bigbench` - BigBench tasks -- `original` - Original benchmark tasks - -To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. 
For example: -- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot) -- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0` -- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0` +Use this path when: +- you want a quick local smoke test +- you do not need direct GPU control +- the task already exists in `inspect-evals` -Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"` +## Option B: inspect-ai on Local GPU -#### Option B: inspect-ai with vLLM Backend +Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures. -inspect-ai is the UK AI Safety Institute's evaluation framework. +Local GPU: -**Standalone (local GPU):** ```bash -# Run MMLU with vLLM uv run scripts/inspect_vllm_uv.py \ --model meta-llama/Llama-3.2-1B \ - --task mmlu - -# Use HuggingFace Transformers backend -uv run scripts/inspect_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --task mmlu \ - --backend hf - -# Multi-GPU with tensor parallelism -uv run scripts/inspect_vllm_uv.py \ - --model meta-llama/Llama-3.2-70B \ - --task mmlu \ - --tensor-parallel-size 4 -``` - -**Via HF Jobs:** -```bash -hf jobs uv run scripts/inspect_vllm_uv.py \ - --flavor a10g-small \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model meta-llama/Llama-3.2-1B \ - --task mmlu + --task gsm8k \ + --limit 20 ``` -**Available inspect-ai Tasks:** -- `mmlu` - Massive Multitask Language Understanding -- `gsm8k` - Grade School Math -- `hellaswag` - Common sense reasoning -- `arc_challenge` - AI2 Reasoning Challenge -- `truthfulqa` - TruthfulQA benchmark -- `winogrande` - Winograd Schema Challenge -- `humaneval` - Code generation - -#### Option C: Python Helper Script - -The helper script auto-selects hardware and simplifies job submission: +Transformers fallback: ```bash -# Auto-detect hardware based on model 
size -uv run scripts/run_vllm_eval_job.py \ - --model meta-llama/Llama-3.2-1B \ - --task "leaderboard|mmlu|5" \ - --framework lighteval - -# Explicit hardware selection -uv run scripts/run_vllm_eval_job.py \ - --model meta-llama/Llama-3.2-70B \ - --task mmlu \ - --framework inspect \ - --hardware a100-large \ - --tensor-parallel-size 4 - -# Use HF Transformers backend -uv run scripts/run_vllm_eval_job.py \ +uv run scripts/inspect_vllm_uv.py \ --model microsoft/phi-2 \ --task mmlu \ - --framework inspect \ - --backend hf + --backend hf \ + --trust-remote-code \ + --limit 20 ``` -**Hardware Recommendations:** -| Model Size | Recommended Hardware | -|------------|---------------------| -| < 3B params | `t4-small` | -| 3B - 13B | `a10g-small` | -| 13B - 34B | `a10g-large` | -| 34B+ | `a100-large` | +## Option C: lighteval on Local GPU -### Commands Reference - -**Top-level help and version:** -```bash -uv run scripts/evaluation_manager.py --help -uv run scripts/evaluation_manager.py --version -``` +Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks. -**Inspect Tables (start here):** -```bash -uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name" -``` +Local GPU: -**Extract from README:** ```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --table N \ - [--model-column-index N] \ - [--model-name-override "Exact Column Header or Model Name"] \ - [--task-type "text-generation"] \ - [--dataset-name "Custom Benchmarks"] \ - [--apply | --create-pr] -``` - -**Import from Artificial Analysis:** -```bash -AA_API_KEY=... 
uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "creator-name" \ - --model-name "model-slug" \ - --repo-id "username/model-name" \ - [--create-pr] -``` - -**View / Validate:** -```bash -uv run scripts/evaluation_manager.py show --repo-id "username/model-name" -uv run scripts/evaluation_manager.py validate --repo-id "username/model-name" -``` - -**Check Open PRs (ALWAYS run before --create-pr):** -```bash -uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name" -``` -Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL. - -**Run Evaluation Job (Inference Providers):** -```bash -hf jobs uv run scripts/inspect_eval_uv.py \ - --flavor "cpu-basic|t4-small|..." \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --task "task-name" -``` - -or use the Python helper: - -```bash -uv run scripts/run_eval_job.py \ - --model "model-id" \ - --task "task-name" \ - --hardware "cpu-basic|t4-small|..." -``` - -**Run vLLM Evaluation (Custom Models):** -```bash -# lighteval with vLLM -hf jobs uv run scripts/lighteval_vllm_uv.py \ - --flavor "a10g-small" \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --tasks "leaderboard|mmlu|5" - -# inspect-ai with vLLM -hf jobs uv run scripts/inspect_vllm_uv.py \ - --flavor "a10g-small" \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --task "mmlu" - -# Helper script (auto hardware selection) -uv run scripts/run_vllm_eval_job.py \ - --model "model-id" \ - --task "leaderboard|mmlu|5" \ - --framework lighteval -``` - -### Model-Index Format - -The generated model-index follows this structure: - -```yaml -model-index: - - name: Model Name - results: - - task: - type: text-generation - dataset: - name: Benchmark Dataset - type: benchmark_type - metrics: - - name: MMLU - type: mmlu - value: 85.2 - - name: HumanEval - type: humaneval - value: 72.5 - source: - name: Source Name - url: https://source-url.com -``` - -WARNING: Do not 
use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. - -### Error Handling -- **Table Not Found**: Script will report if no evaluation tables are detected -- **Invalid Format**: Clear error messages for malformed tables -- **API Errors**: Retry logic for transient Artificial Analysis API failures -- **Token Issues**: Validation before attempting updates -- **Merge Conflicts**: Preserves existing model-index entries when adding new ones -- **Space Creation**: Handles naming conflicts and hardware request failures gracefully - -### Best Practices - -1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates -2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command -3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow -4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr` -5. **Verify extracted values**: Compare YAML output against the README table manually -6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist -7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output -8. **Create PRs for Others**: Use `--create-pr` when updating models you don't own -9. **One model per repo**: Only add the main model's results to model-index -10. 
**No markdown in YAML names**: The model name field in YAML should be plain text - -### Model Name Matching - -When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**: - -- Removes markdown formatting (bold `**`, links `[]()` ) -- Normalizes names (lowercase, replace `-` and `_` with spaces) -- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)` -- Only extracts if tokens match exactly (handles different word orders and separators) -- Fails if no exact match found (rather than guessing from similar names) - -**For column-based tables** (benchmarks as rows, models as columns): -- Finds the column header matching the model name -- Extracts scores from that column only - -**For transposed tables** (models as rows, benchmarks as columns): -- Finds the row in the first column matching the model name -- Extracts all benchmark scores from that row only - -This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints. 
- -### Common Patterns - -**Update Your Own Model:** -```bash -# Extract from README and push directly -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --task-type "text-generation" +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \ + --max-samples 20 \ + --use-chat-template ``` -**Update Someone Else's Model (Full Workflow):** -```bash -# Step 1: ALWAYS check for existing PRs first -uv run scripts/evaluation_manager.py get-prs \ - --repo-id "other-username/their-model" - -# Step 2: If NO open PRs exist, proceed with creating one -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "other-username/their-model" \ - --create-pr - -# If open PRs DO exist: -# - Warn the user about existing PRs -# - Show them the PR URLs -# - Do NOT create a new PR unless user explicitly confirms -``` +`accelerate` fallback: -**Import Fresh Benchmarks:** ```bash -# Step 1: Check for existing PRs -uv run scripts/evaluation_manager.py get-prs \ - --repo-id "anthropic/claude-sonnet-4" - -# Step 2: If no PRs, import from Artificial Analysis -AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "anthropic/claude-sonnet-4" \ - --create-pr -``` - -### Troubleshooting - -**Issue**: "No evaluation tables found in README" -- **Solution**: Check if README contains markdown tables with numeric scores - -**Issue**: "Could not find model 'X' in transposed table" -- **Solution**: The script will display available models. 
Use `--model-name-override` with the exact name from the list -- **Example**: `--model-name-override "**Olmo 3-32B**"` - -**Issue**: "AA_API_KEY not set" -- **Solution**: Set environment variable or add to .env file - -**Issue**: "Token does not have write access" -- **Solution**: Ensure HF_TOKEN has write permissions for the repository - -**Issue**: "Model not found in Artificial Analysis" -- **Solution**: Verify creator-slug and model-name match API values - -**Issue**: "Payment required for hardware" -- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware - -**Issue**: "vLLM out of memory" or CUDA OOM -- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU - -**Issue**: "Model architecture not supported by vLLM" -- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers - -**Issue**: "Trust remote code required" -- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen) - -**Issue**: "Chat template not found" -- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template - -### Integration Examples - -**Python Script Integration:** -```python -import subprocess -import os - -def update_model_evaluations(repo_id, readme_content): - """Update model card with evaluations from README.""" - result = subprocess.run([ - "python", "scripts/evaluation_manager.py", - "extract-readme", - "--repo-id", repo_id, - "--create-pr" - ], capture_output=True, text=True) - - if result.returncode == 0: - print(f"Successfully updated {repo_id}") - else: - print(f"Error: {result.stderr}") -``` +uv run scripts/lighteval_vllm_uv.py \ + --model microsoft/phi-2 \ + --tasks "leaderboard|mmlu|5" \ + --backend accelerate \ + --trust-remote-code \ + --max-samples 20 +``` + +# Remote Execution Boundary + +This skill intentionally stops at **local 
execution and backend selection**. + +If the user wants to: +- run these scripts on Hugging Face Jobs +- pick remote hardware +- pass secrets to remote jobs +- schedule recurring runs +- inspect / cancel / monitor jobs + +then switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments. + +# Task Selection + +`inspect-ai` examples: +- `mmlu` +- `gsm8k` +- `hellaswag` +- `arc_challenge` +- `truthfulqa` +- `winogrande` +- `humaneval` + +`lighteval` task strings use `suite|task|num_fewshot`: +- `leaderboard|mmlu|5` +- `leaderboard|gsm8k|5` +- `leaderboard|arc_challenge|25` +- `lighteval|hellaswag|0` + +Multiple `lighteval` tasks can be comma-separated in `--tasks`. + +# Backend Selection + +- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures. +- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model. +- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models. +- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback. +- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control. + +# Hardware Guidance + +| Model size | Suggested local hardware | +|---|---| +| `< 3B` | consumer GPU / Apple Silicon / small dev GPU | +| `3B - 13B` | stronger local GPU | +| `13B+` | high-memory local GPU or hand off to `hugging-face-jobs` | + +For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`. 
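The backend fallback strategy above can be sketched as a small wrapper that retries with the Transformers backend when the `vllm` attempt fails. This assumes the script exits non-zero on an unsupported architecture; the wrapper is illustrative, while the flags are the ones documented in this skill:

```shell
# Sketch: try the vllm backend first, then fall back to --backend hf on failure.
run_with_fallback() {
  local model="$1" task="$2"
  if uv run scripts/inspect_vllm_uv.py \
       --model "$model" --task "$task" --backend vllm --limit 10; then
    echo "vllm backend succeeded"
  else
    echo "vllm backend failed; retrying with --backend hf" >&2
    uv run scripts/inspect_vllm_uv.py \
      --model "$model" --task "$task" --backend hf --limit 10
  fi
}

# Usage: run_with_fallback meta-llama/Llama-3.2-1B gsm8k
```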
+ +# Troubleshooting + +- CUDA or vLLM OOM: + - reduce `--batch-size` + - reduce `--gpu-memory-utilization` + - switch to a smaller model for the smoke test + - if necessary, hand off to `hugging-face-jobs` +- Model unsupported by `vllm`: + - switch to `--backend hf` for `inspect-ai` + - switch to `--backend accelerate` for `lighteval` +- Gated/private repo access fails: + - verify `HF_TOKEN` +- Custom model code required: + - add `--trust-remote-code` + +# Examples + +See: +- `examples/USAGE_EXAMPLES.md` for local command patterns +- `scripts/inspect_eval_uv.py` +- `scripts/inspect_vllm_uv.py` +- `scripts/lighteval_vllm_uv.py` diff --git a/skills/hugging-face-evaluation/examples/.env.example b/skills/hugging-face-evaluation/examples/.env.example index 3d814a3c..26d9b9b4 100644 --- a/skills/hugging-face-evaluation/examples/.env.example +++ b/skills/hugging-face-evaluation/examples/.env.example @@ -1,7 +1,3 @@ -# Hugging Face Token (required for all operations) +# Hugging Face Token (required for gated/private models) # Get your token at: https://huggingface.co/settings/tokens HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - -# Artificial Analysis API Key (required for import-aa command) -# Get your key at: https://artificialanalysis.ai/ -AA_API_KEY=aa_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx diff --git a/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md b/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md index b5cbb708..64c24334 100644 --- a/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +++ b/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md @@ -1,378 +1,101 @@ # Usage Examples -This document provides practical examples for both methods of adding evaluations to HuggingFace model cards. +This document provides practical examples for **running evaluations locally** against Hugging Face Hub models. -## Table of Contents -1. [Setup](#setup) -2. [Method 1: Extract from README](#method-1-extract-from-readme) -3. 
[Method 2: Import from Artificial Analysis](#method-2-import-from-artificial-analysis) -4. [Standalone vs Integrated](#standalone-vs-integrated) -5. [Common Workflows](#common-workflows) +## What this skill covers -## Setup - -### Initial Configuration - -```bash -# Navigate to skill directory -cd hf_evaluation_skill - - -# Configure environment variables -cp examples/.env.example .env -# Edit .env with your tokens -``` - -Your `.env` file should contain: -```env -HF_TOKEN=hf_your_write_token_here -AA_API_KEY=aa_your_api_key_here # Optional for AA imports -``` - -### Verify Installation - -```bash -uv run scripts/test_extraction.py -``` - -## Method 1: Extract from README - -Extract evaluation tables from your model's existing README. - -### Basic Extraction - -```bash -# Preview what will be extracted (dry run) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "meta-llama/Llama-3.3-70B-Instruct" \ - --dry-run -``` - -### Apply Extraction to Your Model - -```bash -# Extract and update model card directly -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model-7b" -``` - -### Custom Task and Dataset Names - -```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model-7b" \ - --task-type "text-generation" \ - --dataset-name "Standard Benchmarks" \ - --dataset-type "llm_benchmarks" -``` - -### Create Pull Request (for models you don't own) - -```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "organization/community-model" \ - --create-pr -``` - -### Example README Format - -Your model README should contain tables like: - -```markdown -## Evaluation Results - -| Benchmark | Score | -|---------------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -``` - -## Method 2: Import from Artificial Analysis - -Fetch benchmark scores directly from Artificial Analysis API. 
- -### Integrated Approach (Recommended) - -```bash -# Import scores for Claude Sonnet 4.5 -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "your-username/claude-mirror" -``` - -### With Pull Request - -```bash -# Create PR instead of direct commit -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "openai" \ - --model-name "gpt-4" \ - --repo-id "your-username/gpt-4-mirror" \ - --create-pr -``` - -### Standalone Script - -For simple, one-off imports, use the standalone script: - -```bash -# Navigate to examples directory -cd examples +- `inspect-ai` local runs +- `inspect-ai` with `vllm` or Transformers backends +- `lighteval` local runs with `vllm` or `accelerate` +- smoke tests and backend fallback patterns -# Run standalone script -AA_API_KEY="your-key" HF_TOKEN="your-token" \ -uv run artificial_analysis_to_hub.py \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "your-username/your-repo" -``` - -### Finding Creator Slug and Model Name - -1. Visit [Artificial Analysis](https://artificialanalysis.ai/) -2. Navigate to the model you want to import -3. The URL format is: `https://artificialanalysis.ai/models/{creator-slug}/{model-name}` -4. Or check their [API documentation](https://artificialanalysis.ai/api) - -Common examples: -- Anthropic: `--creator-slug "anthropic" --model-name "claude-sonnet-4"` -- OpenAI: `--creator-slug "openai" --model-name "gpt-4-turbo"` -- Meta: `--creator-slug "meta" --model-name "llama-3-70b"` - -## Standalone vs Integrated - -### Standalone Script Features -- ✓ Simple, single-purpose -- ✓ Can run via `uv run` from URL -- ✓ Minimal dependencies -- ✗ No README extraction -- ✗ No validation -- ✗ No dry-run mode - -**Use when:** You only need AA imports and want a simple script. 
+## What this skill does NOT cover -### Integrated Script Features -- ✓ Both README extraction AND AA import -- ✓ Validation and show commands -- ✓ Dry-run preview mode -- ✓ Better error handling -- ✓ Merge with existing evaluations -- ✓ More flexible options - -**Use when:** You want full evaluation management capabilities. - -## Common Workflows - -### Workflow 1: New Model with README Tables - -You've just created a model with evaluation tables in the README. - -```bash -# Step 1: Preview extraction -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/new-model-7b" \ - --dry-run +- `model-index` +- `.eval_results` +- community eval publication workflows +- model-card PR creation +- Hugging Face Jobs orchestration -# Step 2: Apply if it looks good -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/new-model-7b" +If you want to run these same scripts remotely, use the `hugging-face-jobs` skill and pass one of the scripts in `scripts/`. -# Step 3: Validate -uv run scripts/evaluation_manager.py validate \ - --repo-id "your-username/new-model-7b" - -# Step 4: View results -uv run scripts/evaluation_manager.py show \ - --repo-id "your-username/new-model-7b" -``` - -### Workflow 2: Model Benchmarked on AA - -Your model appears on Artificial Analysis with fresh benchmarks. - -```bash -# Import scores and create PR for review -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "your-org" \ - --model-name "your-model" \ - --repo-id "your-org/your-model-hf" \ - --create-pr -``` - -### Workflow 3: Combine Both Methods - -You have README tables AND AA scores. 
+## Setup ```bash -# Step 1: Extract from README -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/hybrid-model" - -# Step 2: Import from AA (will merge with existing) -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "your-org" \ - --model-name "hybrid-model" \ - --repo-id "your-username/hybrid-model" - -# Step 3: View combined results -uv run scripts/evaluation_manager.py show \ - --repo-id "your-username/hybrid-model" +cd skills/hugging-face-evaluation +export HF_TOKEN=hf_xxx +uv --version ``` -### Workflow 4: Contributing to Community Models - -Help improve community models by adding missing evaluations. +For local GPU runs: ```bash -# Find a model with evaluations in README but no model-index -# Example: community/awesome-7b - -# Create PR with extracted evaluations -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "community/awesome-7b" \ - --create-pr - -# GitHub will notify the repository owner -# They can review and merge your PR +nvidia-smi ``` -### Workflow 5: Batch Processing +## inspect-ai examples -Update multiple models at once. +### Quick smoke test ```bash -# Create a list of repos -cat > models.txt << EOF -your-org/model-1-7b -your-org/model-2-13b -your-org/model-3-70b -EOF - -# Process each -while read repo_id; do - echo "Processing $repo_id..." - uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "$repo_id" -done < models.txt +uv run scripts/inspect_eval_uv.py \ + --model meta-llama/Llama-3.2-1B \ + --task mmlu \ + --limit 10 ``` -### Workflow 6: Automated Updates (CI/CD) - -Set up automatic evaluation updates using GitHub Actions. 
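+
+To browse the results of a completed run, `inspect-ai` bundles a local log viewer. A minimal sketch, assuming the run wrote its logs to inspect-ai's default `./logs` directory (if these scripts log elsewhere, point `--log-dir` at that directory instead):
+
+```bash
+# Hedged sketch: `inspect view` is inspect-ai's bundled log viewer;
+# ./logs is assumed here because it is inspect-ai's default output location.
+uv run --with inspect-ai inspect view --log-dir ./logs
+```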
-
-```yaml
-# .github/workflows/update-evals.yml
-name: Update Evaluations Weekly
-on:
-  schedule:
-    - cron: '0 0 * * 0' # Every Sunday
-  workflow_dispatch: # Manual trigger
-
-jobs:
-  update:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-
-      - name: Set up uv
-        uses: astral-sh/setup-uv@v5
-
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.13'
-
-      - name: Update from Artificial Analysis
-        env:
-          AA_API_KEY: ${{ secrets.AA_API_KEY }}
-          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        run: |
-          uv run scripts/evaluation_manager.py import-aa \
-            --creator-slug "${{ vars.AA_CREATOR_SLUG }}" \
-            --model-name "${{ vars.AA_MODEL_NAME }}" \
-            --repo-id "${{ github.repository }}" \
-            --create-pr
-```
-
-## Verification and Validation
-
-### Check Current Evaluations
+### Local GPU with vLLM

```bash
-uv run scripts/evaluation_manager.py show \
-  --repo-id "your-username/your-model"
+uv run scripts/inspect_vllm_uv.py \
+  --model meta-llama/Llama-3.1-8B-Instruct \
+  --task gsm8k \
+  --limit 20
```

-### Validate Format
+### Transformers fallback

```bash
-uv run scripts/evaluation_manager.py validate \
-  --repo-id "your-username/your-model"
+uv run scripts/inspect_vllm_uv.py \
+  --model microsoft/phi-2 \
+  --task mmlu \
+  --backend hf \
+  --trust-remote-code \
+  --limit 20
```

-### View in HuggingFace UI
+## lighteval examples

-After updating, visit:
-```
-https://huggingface.co/your-username/your-model
-```
-
-The evaluation widget should display your scores automatically.
- -## Troubleshooting Examples - -### Problem: No tables found +### Single task ```bash -# Check what tables exist in your README -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --dry-run - -# If no output, ensure your README has markdown tables with numeric scores +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5" \ + --max-samples 20 ``` -### Problem: AA model not found +### Multiple tasks ```bash -# Verify the creator and model slugs -# Check the AA website URL or API directly -curl -H "x-api-key: $AA_API_KEY" \ - https://artificialanalysis.ai/api/v2/data/llms/models | jq +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \ + --max-samples 20 \ + --use-chat-template ``` -### Problem: Token permission error +### accelerate fallback ```bash -# Verify your token has write access -# Generate a new token at: https://huggingface.co/settings/tokens -# Ensure "Write" scope is enabled +uv run scripts/lighteval_vllm_uv.py \ + --model microsoft/phi-2 \ + --tasks "leaderboard|mmlu|5" \ + --backend accelerate \ + --trust-remote-code \ + --max-samples 20 ``` -## Tips and Best Practices - -1. **Always dry-run first**: Use `--dry-run` to preview changes -2. **Use PRs for others' repos**: Always use `--create-pr` for repositories you don't own -3. **Validate after updates**: Run `validate` to ensure proper formatting -4. **Keep evaluations current**: Set up automated updates for AA scores -5. **Document sources**: The tool automatically adds source attribution -6. 
**Check the UI**: Always verify the evaluation widget displays correctly - -## Getting Help - -```bash -# General help -uv run scripts/evaluation_manager.py --help - -# Command-specific help -uv run scripts/evaluation_manager.py extract-readme --help -uv run scripts/evaluation_manager.py import-aa --help -``` +## Hand-off to Hugging Face Jobs -For issues or questions, consult: -- `../SKILL.md` - Complete documentation -- `../README.md` - Troubleshooting guide -- `../QUICKSTART.md` - Quick start guide +When local hardware is not enough, switch to the `hugging-face-jobs` skill and run one of these scripts remotely. Keep the script path and args; move the orchestration there. diff --git a/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py b/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py deleted file mode 100644 index 15216dd0..00000000 --- a/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +++ /dev/null @@ -1,141 +0,0 @@ -# /// script -# requires-python = ">=3.13" -# dependencies = [ -# "huggingface-hub>=1.1.4", -# "python-dotenv>=1.2.1", -# "pyyaml>=6.0.3", -# "requests>=2.32.5", -# ] -# /// - -""" -Add Artificial Analysis evaluations to a Hugging Face model card. - -NOTE: This is a standalone reference script. 
For integrated functionality -with additional features (README extraction, validation, etc.), use: - ../scripts/evaluation_manager.py import-aa [options] - -STANDALONE USAGE: -AA_API_KEY="" HF_TOKEN="" \ -uv run artificial_analysis_to_hub.py \ ---creator-slug \ ---model-name \ ---repo-id - -INTEGRATED USAGE (Recommended): -uv run ../scripts/evaluation_manager.py import-aa \ ---creator-slug \ ---model-name \ ---repo-id \ -[--create-pr] -""" - -import argparse -import os - -import requests -import dotenv -from huggingface_hub import ModelCard - -dotenv.load_dotenv() - -API_KEY = os.getenv("AA_API_KEY") -HF_TOKEN = os.getenv("HF_TOKEN") -URL = "https://artificialanalysis.ai/api/v2/data/llms/models" -HEADERS = {"x-api-key": API_KEY} - -if not API_KEY: - raise ValueError("AA_API_KEY is not set") -if not HF_TOKEN: - raise ValueError("HF_TOKEN is not set") - - -def get_model_evaluations_data(creator_slug, model_name): - response = requests.get(URL, headers=HEADERS) - response_data = response.json()["data"] - for model in response_data: - if ( - model["model_creator"]["slug"] == creator_slug - and model["slug"] == model_name - ): - return model - raise ValueError(f"Model {model_name} not found") - - -def aa_evaluations_to_model_index( - model, - dataset_name="Artificial Analysis Benchmarks", - dataset_type="artificial_analysis", - task_type="evaluation", -): - if not model: - raise ValueError("Model data is required") - - model_name = model.get("name", model.get("slug", "unknown-model")) - evaluations = model.get("evaluations", {}) - - metrics = [] - for key, value in evaluations.items(): - metrics.append( - { - "name": key.replace("_", " ").title(), - "type": key, - "value": value, - } - ) - - model_index = [ - { - "name": model_name, - "results": [ - { - "task": {"type": task_type}, - "dataset": {"name": dataset_name, "type": dataset_type}, - "metrics": metrics, - "source": { - "name": "Artificial Analysis API", - "url": "https://artificialanalysis.ai", - }, - } - ], - } 
- ] - - return model_index - - -def main(): - parser = argparse.ArgumentParser() - parser.add_argument("--creator-slug", type=str, required=True) - parser.add_argument("--model-name", type=str, required=True) - parser.add_argument("--repo-id", type=str, required=True) - args = parser.parse_args() - - aa_evaluations_data = get_model_evaluations_data( - creator_slug=args.creator_slug, model_name=args.model_name - ) - - model_index = aa_evaluations_to_model_index(model=aa_evaluations_data) - - card = ModelCard.load(args.repo_id) - card.data["model-index"] = model_index - - commit_message = ( - f"Add Artificial Analysis evaluations for {args.model_name}" - ) - commit_description = ( - f"This commit adds the Artificial Analysis evaluations for the {args.model_name} model to this repository. " - "To see the scores, visit the [Artificial Analysis](https://artificialanalysis.ai) website." - ) - - card.push_to_hub( - args.repo_id, - token=HF_TOKEN, - commit_message=commit_message, - commit_description=commit_description, - create_pr=True, - ) - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/examples/example_readme_tables.md b/skills/hugging-face-evaluation/examples/example_readme_tables.md deleted file mode 100644 index c996338f..00000000 --- a/skills/hugging-face-evaluation/examples/example_readme_tables.md +++ /dev/null @@ -1,135 +0,0 @@ -# Example Evaluation Table Formats - -This file shows various formats of evaluation tables that can be extracted from model README files. 
- -## Format 1: Benchmarks as Rows (Most Common) - -```markdown -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -``` - -## Format 2: Multiple Metric Columns - -```markdown -| Benchmark | Accuracy | F1 Score | -|-----------|----------|----------| -| MMLU | 85.2 | 0.84 | -| GSM8K | 91.3 | 0.91 | -| DROP | 78.5 | 0.77 | -``` - -## Format 3: Benchmarks as Columns - -```markdown -| MMLU | HumanEval | GSM8K | HellaSwag | -|------|-----------|-------|-----------| -| 85.2 | 72.5 | 91.3 | 88.9 | -``` - -## Format 4: Percentage Values - -```markdown -| Benchmark | Score | -|---------------|----------| -| MMLU | 85.2% | -| HumanEval | 72.5% | -| GSM8K | 91.3% | -| TruthfulQA | 68.7% | -``` - -## Format 5: Mixed Format with Categories - -```markdown -### Reasoning - -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| BBH | 82.4 | -| GPQA | 71.3 | - -### Coding - -| Benchmark | Score | -|-----------|-------| -| HumanEval | 72.5 | -| MBPP | 78.9 | - -### Math - -| Benchmark | Score | -|-----------|-------| -| GSM8K | 91.3 | -| MATH | 65.8 | -``` - -## Format 6: With Additional Columns - -```markdown -| Benchmark | Score | Rank | Notes | -|-----------|-------|------|--------------------| -| MMLU | 85.2 | #5 | 5-shot | -| HumanEval | 72.5 | #8 | pass@1 | -| GSM8K | 91.3 | #3 | 8-shot, maj@1 | -``` - -## How the Extractor Works - -The script will: -1. Find all markdown tables in the README -2. Identify which tables contain evaluation results -3. Parse the table structure (rows vs columns) -4. Extract numeric values as scores -5. Convert to model-index YAML format - -## Tips for README Authors - -To ensure your evaluation tables are properly extracted: - -1. **Use clear headers**: Include "Benchmark", "Score", or similar terms -2. **Keep it simple**: Stick to benchmark name + score columns -3. **Use standard formats**: Follow markdown table syntax -4. 
**Include numeric values**: Ensure scores are parseable numbers -5. **Be consistent**: Use the same format across multiple tables - -## Example Complete README Section - -```markdown -# Model Card for MyModel-7B - -## Evaluation Results - -Our model was evaluated on several standard benchmarks: - -| Benchmark | Score | -|---------------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -| ARC-Challenge | 81.7 | -| TruthfulQA | 68.7 | - -### Detailed Results - -For more detailed results and methodology, see our [paper](link). -``` - -## Running the Extractor - -```bash -# Extract from this example -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --dry-run - -# Apply to your model card -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --task-type "text-generation" -``` diff --git a/skills/hugging-face-evaluation/examples/metric_mapping.json b/skills/hugging-face-evaluation/examples/metric_mapping.json deleted file mode 100644 index 121d7592..00000000 --- a/skills/hugging-face-evaluation/examples/metric_mapping.json +++ /dev/null @@ -1,50 +0,0 @@ -{ - "MMLU": { - "type": "mmlu", - "name": "Massive Multitask Language Understanding" - }, - "HumanEval": { - "type": "humaneval", - "name": "Code Generation (HumanEval)" - }, - "GSM8K": { - "type": "gsm8k", - "name": "Grade School Math" - }, - "HellaSwag": { - "type": "hellaswag", - "name": "HellaSwag Common Sense" - }, - "ARC-C": { - "type": "arc_challenge", - "name": "ARC Challenge" - }, - "ARC-E": { - "type": "arc_easy", - "name": "ARC Easy" - }, - "Winogrande": { - "type": "winogrande", - "name": "Winogrande" - }, - "TruthfulQA": { - "type": "truthfulqa", - "name": "TruthfulQA" - }, - "GPQA": { - "type": "gpqa", - "name": "Graduate-Level Google-Proof Q&A" - }, - "DROP": { - "type": "drop", - "name": "Discrete Reasoning Over Paragraphs" - }, - "BBH": { - "type": "bbh", - "name": "Big Bench 
Hard" - }, - "MATH": { - "type": "math", - "name": "MATH Dataset" - } -} diff --git a/skills/hugging-face-evaluation/scripts/evaluation_manager.py b/skills/hugging-face-evaluation/scripts/evaluation_manager.py deleted file mode 100644 index 8dcfa901..00000000 --- a/skills/hugging-face-evaluation/scripts/evaluation_manager.py +++ /dev/null @@ -1,1374 +0,0 @@ -# /// script -# requires-python = ">=3.13" -# dependencies = [ -# "huggingface-hub>=1.1.4", -# "markdown-it-py>=3.0.0", -# "python-dotenv>=1.2.1", -# "pyyaml>=6.0.3", -# "requests>=2.32.5", -# ] -# /// - -""" -Manage evaluation results in Hugging Face model cards. - -This script provides two methods: -1. Extract evaluation tables from model README files -2. Import evaluation scores from Artificial Analysis API - -Both methods update the model-index metadata in model cards. -""" - -import argparse -import os -import re -from textwrap import dedent -from typing import Any, Dict, List, Optional, Tuple - - -def load_env() -> None: - """Load .env if python-dotenv is available; keep help usable without it.""" - try: - import dotenv # type: ignore - except ModuleNotFoundError: - return - dotenv.load_dotenv() - - -def require_markdown_it(): - try: - from markdown_it import MarkdownIt # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "markdown-it-py is required for table parsing. " - "Run with `uv run ...` or install with `uv pip install markdown-it-py`." - ) from exc - return MarkdownIt - - -def require_model_card(): - try: - from huggingface_hub import ModelCard # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "huggingface-hub is required for model card operations. " - "Run with `uv run ...` or install with `uv pip install huggingface-hub`." 
- ) from exc - return ModelCard - - -def require_requests(): - try: - import requests # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "requests is required for Artificial Analysis import. " - "Run with `uv run ...` or install with `uv pip install requests`." - ) from exc - return requests - - -def require_yaml(): - try: - import yaml # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "PyYAML is required for YAML output. " - "Run with `uv run ...` or install with `uv pip install pyyaml`." - ) from exc - return yaml - - -# ============================================================================ -# Method 1: Extract Evaluations from README -# ============================================================================ - - -def extract_tables_from_markdown(markdown_content: str) -> List[str]: - """Extract all markdown tables from content.""" - # Pattern to match markdown tables - table_pattern = r"(\|[^\n]+\|(?:\r?\n\|[^\n]+\|)+)" - tables = re.findall(table_pattern, markdown_content) - return tables - - -def parse_markdown_table(table_str: str) -> Tuple[List[str], List[List[str]]]: - """ - Parse a markdown table string into headers and rows. 
- - Returns: - Tuple of (headers, data_rows) - """ - lines = [line.strip() for line in table_str.strip().split("\n")] - - # Remove separator line (the one with dashes) - lines = [line for line in lines if not re.match(r"^\|[\s\-:]+\|$", line)] - - if len(lines) < 2: - return [], [] - - # Parse header - header = [cell.strip() for cell in lines[0].split("|")[1:-1]] - - # Parse data rows - data_rows = [] - for line in lines[1:]: - cells = [cell.strip() for cell in line.split("|")[1:-1]] - if cells: - data_rows.append(cells) - - return header, data_rows - - -def is_evaluation_table(header: List[str], rows: List[List[str]]) -> bool: - """Determine if a table contains evaluation results.""" - if not header or not rows: - return False - - # Check if first column looks like benchmark names - benchmark_keywords = [ - "benchmark", "task", "dataset", "eval", "test", "metric", - "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande", - "truthfulqa", "boolq", "piqa", "siqa" - ] - - first_col = header[0].lower() - has_benchmark_header = any(keyword in first_col for keyword in benchmark_keywords) - - # Check if there are numeric values in the table - has_numeric_values = False - for row in rows: - for cell in row: - try: - float(cell.replace("%", "").replace(",", "")) - has_numeric_values = True - break - except ValueError: - continue - if has_numeric_values: - break - - return has_benchmark_header or has_numeric_values - - -def normalize_model_name(name: str) -> tuple[set[str], str]: - """ - Normalize a model name for matching. 
- - Args: - name: Model name to normalize - - Returns: - Tuple of (token_set, normalized_string) - """ - # Remove markdown formatting - cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', name) # Remove markdown links - cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned) # Remove bold - cleaned = cleaned.strip() - - # Normalize and tokenize - normalized = cleaned.lower().replace("-", " ").replace("_", " ") - tokens = set(normalized.split()) - - return tokens, normalized - - -def find_main_model_column(header: List[str], model_name: str) -> Optional[int]: - """ - Identify the column index that corresponds to the main model. - - Only returns a column if there's an exact normalized match with the model name. - This prevents extracting scores from training checkpoints or similar models. - - Args: - header: Table column headers - model_name: Model name from repo_id (e.g., "OLMo-3-32B-Think") - - Returns: - Column index of the main model, or None if no exact match found - """ - if not header or not model_name: - return None - - # Normalize model name and extract tokens - model_tokens, _ = normalize_model_name(model_name) - - # Find exact matches only - for i, col_name in enumerate(header): - if not col_name: - continue - - # Skip first column (benchmark names) - if i == 0: - continue - - col_tokens, _ = normalize_model_name(col_name) - - # Check for exact token match - if model_tokens == col_tokens: - return i - - # No exact match found - return None - - -def find_main_model_row( - rows: List[List[str]], model_name: str -) -> tuple[Optional[int], List[str]]: - """ - Identify the row index that corresponds to the main model in a transposed table. - - In transposed tables, each row represents a different model, with the first - column containing the model name. 
- - Args: - rows: Table data rows - model_name: Model name from repo_id (e.g., "OLMo-3-32B") - - Returns: - Tuple of (row_index, available_models) - - row_index: Index of the main model, or None if no exact match found - - available_models: List of all model names found in the table - """ - if not rows or not model_name: - return None, [] - - model_tokens, _ = normalize_model_name(model_name) - available_models = [] - - for i, row in enumerate(rows): - if not row or not row[0]: - continue - - row_name = row[0].strip() - - # Skip separator/header rows - if not row_name or row_name.startswith('---'): - continue - - row_tokens, _ = normalize_model_name(row_name) - - # Collect all non-empty model names - if row_tokens: - available_models.append(row_name) - - # Check for exact token match - if model_tokens == row_tokens: - return i, available_models - - return None, available_models - - -def is_transposed_table(header: List[str], rows: List[List[str]]) -> bool: - """ - Determine if a table is transposed (models as rows, benchmarks as columns). 
- - A table is considered transposed if: - - The first column contains model-like names (not benchmark names) - - Most other columns contain numeric values - - Header row contains benchmark-like names - - Args: - header: Table column headers - rows: Table data rows - - Returns: - True if table appears to be transposed, False otherwise - """ - if not header or not rows or len(header) < 3: - return False - - # Check if first column header suggests model names - first_col = header[0].lower() - model_indicators = ["model", "system", "llm", "name"] - has_model_header = any(indicator in first_col for indicator in model_indicators) - - # Check if remaining headers look like benchmarks - benchmark_keywords = [ - "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande", - "eval", "score", "benchmark", "test", "math", "code", "mbpp", - "truthfulqa", "boolq", "piqa", "siqa", "drop", "squad" - ] - - benchmark_header_count = 0 - for col_name in header[1:]: - col_lower = col_name.lower() - if any(keyword in col_lower for keyword in benchmark_keywords): - benchmark_header_count += 1 - - has_benchmark_headers = benchmark_header_count >= 2 - - # Check if data rows have numeric values in most columns (except first) - numeric_count = 0 - total_cells = 0 - - for row in rows[:5]: # Check first 5 rows - for cell in row[1:]: # Skip first column - total_cells += 1 - try: - float(cell.replace("%", "").replace(",", "").strip()) - numeric_count += 1 - except (ValueError, AttributeError): - continue - - has_numeric_data = total_cells > 0 and (numeric_count / total_cells) > 0.5 - - return (has_model_header or has_benchmark_headers) and has_numeric_data - - -def extract_metrics_from_table( - header: List[str], - rows: List[List[str]], - table_format: str = "auto", - model_name: Optional[str] = None, - model_column_index: Optional[int] = None -) -> List[Dict[str, Any]]: - """ - Extract metrics from parsed table data. 
- - Args: - header: Table column headers - rows: Table data rows - table_format: "rows" (benchmarks as rows), "columns" (benchmarks as columns), - "transposed" (models as rows, benchmarks as columns), or "auto" - model_name: Optional model name to identify the correct column/row - - Returns: - List of metric dictionaries with name, type, and value - """ - metrics = [] - - if table_format == "auto": - # First check if it's a transposed table (models as rows) - if is_transposed_table(header, rows): - table_format = "transposed" - else: - # Check if first column header is empty/generic (indicates benchmarks in rows) - first_header = header[0].lower().strip() if header else "" - is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] - - if is_first_col_benchmarks: - table_format = "rows" - else: - # Heuristic: if first row has mostly numeric values, benchmarks are columns - try: - numeric_count = sum( - 1 for cell in rows[0] if cell and - re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) - ) - table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" - except (IndexError, ValueError): - table_format = "rows" - - if table_format == "rows": - # Benchmarks are in rows, scores in columns - # Try to identify the main model column if model_name is provided - target_column = model_column_index - if target_column is None and model_name: - target_column = find_main_model_column(header, model_name) - - for row in rows: - if not row: - continue - - benchmark_name = row[0].strip() - if not benchmark_name: - continue - - # If we identified a specific column, use it; otherwise use first numeric value - if target_column is not None and target_column < len(row): - try: - value_str = row[target_column].replace("%", "").replace(",", "").strip() - if value_str: - value = float(value_str) - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) 
- except (ValueError, IndexError): - pass - else: - # Extract numeric values from remaining columns (original behavior) - for i, cell in enumerate(row[1:], start=1): - try: - # Remove common suffixes and convert to float - value_str = cell.replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - # Determine metric name - metric_name = benchmark_name - if len(header) > i and header[i].lower() not in ["score", "value", "result"]: - metric_name = f"{benchmark_name} ({header[i]})" - - metrics.append({ - "name": metric_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) - break # Only take first numeric value per row - except (ValueError, IndexError): - continue - - elif table_format == "transposed": - # Models are in rows (first column), benchmarks are in columns (header) - # Find the row that matches the target model - if not model_name: - print("Warning: model_name required for transposed table format") - return metrics - - target_row_idx, available_models = find_main_model_row(rows, model_name) - - if target_row_idx is None: - print(f"\n⚠ Could not find model '{model_name}' in transposed table") - if available_models: - print("\nAvailable models in table:") - for i, model in enumerate(available_models, 1): - print(f" {i}. 
{model}") - print("\nPlease select the correct model name from the list above.") - print("You can specify it using the --model-name-override flag:") - print(f' --model-name-override "{available_models[0]}"') - return metrics - - target_row = rows[target_row_idx] - - # Extract metrics from each column (skip first column which is model name) - for i in range(1, len(header)): - benchmark_name = header[i].strip() - if not benchmark_name or i >= len(target_row): - continue - - try: - value_str = target_row[i].replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_").replace("-", "_"), - "value": value - }) - except (ValueError, AttributeError): - continue - - else: # table_format == "columns" - # Benchmarks are in columns - if not rows: - return metrics - - # Use first data row for values - data_row = rows[0] - - for i, benchmark_name in enumerate(header): - if not benchmark_name or i >= len(data_row): - continue - - try: - value_str = data_row[i].replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) - except ValueError: - continue - - return metrics - - -def extract_evaluations_from_readme( - repo_id: str, - task_type: str = "text-generation", - dataset_name: str = "Benchmarks", - dataset_type: str = "benchmark", - model_name_override: Optional[str] = None, - table_index: Optional[int] = None, - model_column_index: Optional[int] = None -) -> Optional[List[Dict[str, Any]]]: - """ - Extract evaluation results from a model's README. 
- - Args: - repo_id: Hugging Face model repository ID - task_type: Task type for model-index (e.g., "text-generation") - dataset_name: Name for the benchmark dataset - dataset_type: Type identifier for the dataset - model_name_override: Override model name for matching (column header for comparison tables) - table_index: 1-indexed table number from inspect-tables output - - Returns: - Model-index formatted results or None if no evaluations found - """ - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - readme_content = card.content - - if not readme_content: - print(f"No README content found for {repo_id}") - return None - - # Extract model name from repo_id or use override - if model_name_override: - model_name = model_name_override - print(f"Using model name override: '{model_name}'") - else: - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - # Use markdown-it parser for accurate table extraction - all_tables = extract_tables_with_parser(readme_content) - - if not all_tables: - print(f"No tables found in README for {repo_id}") - return None - - # If table_index specified, use that specific table - if table_index is not None: - if table_index < 1 or table_index > len(all_tables): - print(f"Invalid table index {table_index}. 
Found {len(all_tables)} tables.") - print("Run inspect-tables to see available tables.") - return None - tables_to_process = [all_tables[table_index - 1]] - else: - # Filter to evaluation tables only - eval_tables = [] - for table in all_tables: - header = table.get("headers", []) - rows = table.get("rows", []) - if is_evaluation_table(header, rows): - eval_tables.append(table) - - if len(eval_tables) > 1: - print(f"\n⚠ Found {len(eval_tables)} evaluation tables.") - print("Run inspect-tables first, then use --table to select one:") - print(f' uv run scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"') - return None - elif len(eval_tables) == 0: - print(f"No evaluation tables found in README for {repo_id}") - return None - - tables_to_process = eval_tables - - # Extract metrics from selected table(s) - all_metrics = [] - for table in tables_to_process: - header = table.get("headers", []) - rows = table.get("rows", []) - metrics = extract_metrics_from_table( - header, - rows, - model_name=model_name, - model_column_index=model_column_index - ) - all_metrics.extend(metrics) - - if not all_metrics: - print(f"No metrics extracted from table") - return None - - # Build model-index structure - display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - results = [{ - "task": {"type": task_type}, - "dataset": { - "name": dataset_name, - "type": dataset_type - }, - "metrics": all_metrics, - "source": { - "name": "Model README", - "url": f"https://huggingface.co/{repo_id}" - } - }] - - return results - - except Exception as e: - print(f"Error extracting evaluations from README: {e}") - return None - - -# ============================================================================ -# Table Inspection (using markdown-it-py for accurate parsing) -# ============================================================================ - - -def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, Any]]: - """ - Extract tables from markdown using 
markdown-it-py parser. - Uses GFM (GitHub Flavored Markdown) which includes table support. - """ - MarkdownIt = require_markdown_it() - # Disable linkify to avoid optional dependency errors; not needed for table parsing. - md = MarkdownIt("gfm-like", {"linkify": False}) - tokens = md.parse(markdown_content) - - tables = [] - i = 0 - while i < len(tokens): - token = tokens[i] - - if token.type == "table_open": - table_data = {"headers": [], "rows": []} - current_row = [] - in_header = False - - i += 1 - while i < len(tokens) and tokens[i].type != "table_close": - t = tokens[i] - if t.type == "thead_open": - in_header = True - elif t.type == "thead_close": - in_header = False - elif t.type == "tr_open": - current_row = [] - elif t.type == "tr_close": - if in_header: - table_data["headers"] = current_row - else: - table_data["rows"].append(current_row) - current_row = [] - elif t.type == "inline": - current_row.append(t.content.strip()) - i += 1 - - if table_data["headers"] or table_data["rows"]: - tables.append(table_data) - - i += 1 - - return tables - - -def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]: - """Analyze a table to detect its format and identify model columns.""" - headers = table.get("headers", []) - rows = table.get("rows", []) - - if not headers or not rows: - return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []} - - first_header = headers[0].lower() if headers else "" - is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] - - # Check for numeric columns - numeric_columns = [] - for col_idx in range(1, len(headers)): - numeric_count = 0 - for row in rows[:5]: - if col_idx < len(row): - try: - val = re.sub(r'\s*\([^)]*\)', '', row[col_idx]) - float(val.replace("%", "").replace(",", "").strip()) - numeric_count += 1 - except (ValueError, AttributeError): - pass - if numeric_count > len(rows[:5]) / 2: - 
numeric_columns.append(col_idx) - - # Determine format - if is_first_col_benchmarks and len(numeric_columns) > 1: - format_type = "comparison" - elif is_first_col_benchmarks and len(numeric_columns) == 1: - format_type = "simple" - elif len(numeric_columns) > len(headers) / 2: - format_type = "transposed" - else: - format_type = "unknown" - - # Find model columns - model_columns = [] - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - model_tokens, _ = normalize_model_name(model_name) - - for idx, header in enumerate(headers): - if idx == 0 and is_first_col_benchmarks: - continue - if header: - header_tokens, _ = normalize_model_name(header) - is_match = model_tokens == header_tokens - is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens) - model_columns.append({ - "index": idx, - "header": header, - "is_exact_match": is_match, - "is_partial_match": is_partial and not is_match - }) - - return { - "format": format_type, - "columns": headers, - "model_columns": model_columns, - "row_count": len(rows), - "sample_rows": [row[0] for row in rows[:5] if row] - } - - -def inspect_tables(repo_id: str) -> None: - """Inspect and display all evaluation tables in a model's README.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - readme_content = card.content - - if not readme_content: - print(f"No README content found for {repo_id}") - return - - tables = extract_tables_with_parser(readme_content) - - if not tables: - print(f"No tables found in README for {repo_id}") - return - - print(f"\n{'='*70}") - print(f"Tables found in README for: {repo_id}") - print(f"{'='*70}") - - eval_table_count = 0 - for table in tables: - analysis = detect_table_format(table, repo_id) - - if analysis["format"] == "unknown" and not analysis.get("sample_rows"): - continue - - eval_table_count += 1 - print(f"\n## Table {eval_table_count}") - print(f" 
Format: {analysis['format']}") - print(f" Rows: {analysis['row_count']}") - - print(f"\n Columns ({len(analysis['columns'])}):") - for col_info in analysis.get("model_columns", []): - idx = col_info["index"] - header = col_info["header"] - if col_info["is_exact_match"]: - print(f" [{idx}] {header} ✓ EXACT MATCH") - elif col_info["is_partial_match"]: - print(f" [{idx}] {header} ~ partial match") - else: - print(f" [{idx}] {header}") - - if analysis.get("sample_rows"): - print(f"\n Sample rows (first column):") - for row_val in analysis["sample_rows"][:5]: - print(f" - {row_val}") - - if eval_table_count == 0: - print("\nNo evaluation tables detected.") - else: - print("\nSuggested next step:") - print(f' uv run scripts/evaluation_manager.py extract-readme --repo-id "{repo_id}" --table [--model-column-index ]') - - print(f"\n{'='*70}\n") - - except Exception as e: - print(f"Error inspecting tables: {e}") - - -# ============================================================================ -# Pull Request Management -# ============================================================================ - - -def get_open_prs(repo_id: str) -> List[Dict[str, Any]]: - """ - Fetch open pull requests for a Hugging Face model repository. 
- - Args: - repo_id: Hugging Face model repository ID (e.g., "allenai/Olmo-3-32B-Think") - - Returns: - List of open PR dictionaries with num, title, author, and createdAt - """ - requests = require_requests() - url = f"https://huggingface.co/api/models/{repo_id}/discussions" - - try: - response = requests.get(url, timeout=30, allow_redirects=True) - response.raise_for_status() - - data = response.json() - discussions = data.get("discussions", []) - - open_prs = [ - { - "num": d["num"], - "title": d["title"], - "author": d["author"]["name"], - "createdAt": d.get("createdAt", "unknown"), - } - for d in discussions - if d.get("status") == "open" and d.get("isPullRequest") - ] - - return open_prs - - except requests.RequestException as e: - print(f"Error fetching PRs from Hugging Face: {e}") - return [] - - -def list_open_prs(repo_id: str) -> None: - """Display open pull requests for a model repository.""" - prs = get_open_prs(repo_id) - - print(f"\n{'='*70}") - print(f"Open Pull Requests for: {repo_id}") - print(f"{'='*70}") - - if not prs: - print("\nNo open pull requests found.") - else: - print(f"\nFound {len(prs)} open PR(s):\n") - for pr in prs: - print(f" PR #{pr['num']} - {pr['title']}") - print(f" Author: {pr['author']}") - print(f" Created: {pr['createdAt']}") - print(f" URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}") - print() - - print(f"{'='*70}\n") - - -# ============================================================================ -# Method 2: Import from Artificial Analysis -# ============================================================================ - - -def get_aa_model_data(creator_slug: str, model_name: str) -> Optional[Dict[str, Any]]: - """ - Fetch model evaluation data from Artificial Analysis API. 
- - Args: - creator_slug: Creator identifier (e.g., "anthropic", "openai") - model_name: Model slug/identifier - - Returns: - Model data dictionary or None if not found - """ - load_env() - AA_API_KEY = os.getenv("AA_API_KEY") - if not AA_API_KEY: - raise ValueError("AA_API_KEY environment variable is not set") - - url = "https://artificialanalysis.ai/api/v2/data/llms/models" - headers = {"x-api-key": AA_API_KEY} - - requests = require_requests() - - try: - response = requests.get(url, headers=headers, timeout=30) - response.raise_for_status() - - data = response.json().get("data", []) - - for model in data: - creator = model.get("model_creator", {}) - if creator.get("slug") == creator_slug and model.get("slug") == model_name: - return model - - print(f"Model {creator_slug}/{model_name} not found in Artificial Analysis") - return None - - except requests.RequestException as e: - print(f"Error fetching data from Artificial Analysis: {e}") - return None - - -def aa_data_to_model_index( - model_data: Dict[str, Any], - dataset_name: str = "Artificial Analysis Benchmarks", - dataset_type: str = "artificial_analysis", - task_type: str = "evaluation" -) -> List[Dict[str, Any]]: - """ - Convert Artificial Analysis model data to model-index format. 
- - Args: - model_data: Raw model data from AA API - dataset_name: Dataset name for model-index - dataset_type: Dataset type identifier - task_type: Task type for model-index - - Returns: - Model-index formatted results - """ - model_name = model_data.get("name", model_data.get("slug", "unknown-model")) - evaluations = model_data.get("evaluations", {}) - - if not evaluations: - print(f"No evaluations found for model {model_name}") - return [] - - metrics = [] - for key, value in evaluations.items(): - if value is not None: - metrics.append({ - "name": key.replace("_", " ").title(), - "type": key, - "value": value - }) - - results = [{ - "task": {"type": task_type}, - "dataset": { - "name": dataset_name, - "type": dataset_type - }, - "metrics": metrics, - "source": { - "name": "Artificial Analysis API", - "url": "https://artificialanalysis.ai" - } - }] - - return results - - -def import_aa_evaluations( - creator_slug: str, - model_name: str, - repo_id: str -) -> Optional[List[Dict[str, Any]]]: - """ - Import evaluation results from Artificial Analysis for a model. - - Args: - creator_slug: Creator identifier in AA - model_name: Model identifier in AA - repo_id: Hugging Face repository ID to update - - Returns: - Model-index formatted results or None if import fails - """ - model_data = get_aa_model_data(creator_slug, model_name) - - if not model_data: - return None - - results = aa_data_to_model_index(model_data) - return results - - -# ============================================================================ -# Model Card Update Functions -# ============================================================================ - - -def update_model_card_with_evaluations( - repo_id: str, - results: List[Dict[str, Any]], - create_pr: bool = False, - commit_message: Optional[str] = None -) -> bool: - """ - Update a model card with evaluation results. 
- - Args: - repo_id: Hugging Face repository ID - results: Model-index formatted results - create_pr: Whether to create a PR instead of direct push - commit_message: Custom commit message - - Returns: - True if successful, False otherwise - """ - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - if not hf_token: - raise ValueError("HF_TOKEN environment variable is not set") - - # Load existing card - card = ModelCard.load(repo_id, token=hf_token) - - # Get model name - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - # Create or update model-index - model_index = [{ - "name": model_name, - "results": results - }] - - # Merge with existing model-index if present - if "model-index" in card.data: - existing = card.data["model-index"] - if isinstance(existing, list) and existing: - # Keep existing name if present - if "name" in existing[0]: - model_index[0]["name"] = existing[0]["name"] - - # Merge results - existing_results = existing[0].get("results", []) - model_index[0]["results"].extend(existing_results) - - card.data["model-index"] = model_index - - # Prepare commit message - if not commit_message: - commit_message = f"Add evaluation results to {model_name}" - - commit_description = ( - "This commit adds structured evaluation results to the model card. " - "The results are formatted using the model-index specification and " - "will be displayed in the model card's evaluation widget." 
- ) - - # Push update - card.push_to_hub( - repo_id, - token=hf_token, - commit_message=commit_message, - commit_description=commit_description, - create_pr=create_pr - ) - - action = "Pull request created" if create_pr else "Model card updated" - print(f"✓ {action} successfully for {repo_id}") - return True - - except Exception as e: - print(f"Error updating model card: {e}") - return False - - -def show_evaluations(repo_id: str) -> None: - """Display current evaluations in a model card.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - - if "model-index" not in card.data: - print(f"No model-index found in {repo_id}") - return - - model_index = card.data["model-index"] - - print(f"\nEvaluations for {repo_id}:") - print("=" * 60) - - for model_entry in model_index: - model_name = model_entry.get("name", "Unknown") - print(f"\nModel: {model_name}") - - results = model_entry.get("results", []) - for i, result in enumerate(results, 1): - print(f"\n Result Set {i}:") - - task = result.get("task", {}) - print(f" Task: {task.get('type', 'unknown')}") - - dataset = result.get("dataset", {}) - print(f" Dataset: {dataset.get('name', 'unknown')}") - - metrics = result.get("metrics", []) - print(f" Metrics ({len(metrics)}):") - for metric in metrics: - name = metric.get("name", "Unknown") - value = metric.get("value", "N/A") - print(f" - {name}: {value}") - - source = result.get("source", {}) - if source: - print(f" Source: {source.get('name', 'Unknown')}") - - print("\n" + "=" * 60) - - except Exception as e: - print(f"Error showing evaluations: {e}") - - -def validate_model_index(repo_id: str) -> bool: - """Validate model-index format in a model card.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - - if "model-index" not in card.data: - print(f"✗ No model-index found in {repo_id}") - 
return False - - model_index = card.data["model-index"] - - if not isinstance(model_index, list): - print("✗ model-index must be a list") - return False - - for i, entry in enumerate(model_index): - if "name" not in entry: - print(f"✗ Entry {i} missing 'name' field") - return False - - if "results" not in entry: - print(f"✗ Entry {i} missing 'results' field") - return False - - for j, result in enumerate(entry["results"]): - if "task" not in result: - print(f"✗ Result {j} in entry {i} missing 'task' field") - return False - - if "dataset" not in result: - print(f"✗ Result {j} in entry {i} missing 'dataset' field") - return False - - if "metrics" not in result: - print(f"✗ Result {j} in entry {i} missing 'metrics' field") - return False - - print(f"✓ Model-index format is valid for {repo_id}") - return True - - except Exception as e: - print(f"Error validating model-index: {e}") - return False - - -# ============================================================================ -# CLI Interface -# ============================================================================ - - -def main(): - parser = argparse.ArgumentParser( - description=( - "Manage evaluation results in Hugging Face model cards.\n\n" - "Use standard Python or `uv run scripts/evaluation_manager.py ...` " - "to auto-resolve dependencies from the PEP 723 header." - ), - formatter_class=argparse.RawTextHelpFormatter, - epilog=dedent( - """\ - Typical workflows: - - Inspect tables first: - uv run scripts/evaluation_manager.py inspect-tables --repo-id - - Extract from README (prints YAML by default): - uv run scripts/evaluation_manager.py extract-readme --repo-id --table N - - Apply changes: - uv run scripts/evaluation_manager.py extract-readme --repo-id --table N --apply - - Import from Artificial Analysis: - AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug org --model-name slug --repo-id - - Tips: - - YAML is printed by default; use --apply or --create-pr to write changes. 
- - Set HF_TOKEN (and AA_API_KEY for import-aa); .env is loaded automatically if python-dotenv is installed. - - When multiple tables exist, run inspect-tables then select with --table N. - - To apply changes (push or PR), rerun extract-readme with --apply or --create-pr. - """ - ), - ) - parser.add_argument("--version", action="version", version="evaluation_manager 1.2.0") - - subparsers = parser.add_subparsers(dest="command", help="Command to execute") - - # Extract from README command - extract_parser = subparsers.add_parser( - "extract-readme", - help="Extract evaluation tables from model README", - formatter_class=argparse.RawTextHelpFormatter, - description="Parse README tables into model-index YAML. Default behavior prints YAML; use --apply/--create-pr to write changes.", - epilog=dedent( - """\ - Examples: - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-column-index 3 - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-name-override \"**Model 7B**\" # exact header text - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --create-pr - - Apply changes: - - Default: prints YAML to stdout (no writes). - - Add --apply to push directly, or --create-pr to open a PR. - Model selection: - - Preferred: --model-column-index
- - If using --model-name-override, copy the column header text exactly. - """ - ), - ) - extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)") - extract_parser.add_argument("--model-column-index", type=int, help="Preferred: column index from inspect-tables output (exact selection)") - extract_parser.add_argument("--model-name-override", type=str, help="Exact column header/model name for comparison/transpose tables (when index is not used)") - extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Sets model-index task.type (e.g., text-generation, summarization)") - extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name") - extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type") - extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - extract_parser.add_argument("--apply", action="store_true", help="Apply changes (default is to print YAML only)") - extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating (default)") - - # Import from AA command - aa_parser = subparsers.add_parser( - "import-aa", - help="Import evaluation scores from Artificial Analysis", - formatter_class=argparse.RawTextHelpFormatter, - description="Fetch scores from Artificial Analysis API and write them into model-index.", - epilog=dedent( - """\ - Examples: - AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug anthropic --model-name claude-sonnet-4 --repo-id username/model - uv run scripts/evaluation_manager.py import-aa --creator-slug openai --model-name gpt-4o --repo-id username/model --create-pr - - Requires: AA_API_KEY in env (or .env if python-dotenv installed). 
- """ - ), - ) - aa_parser.add_argument("--creator-slug", type=str, required=True, help="AA creator slug") - aa_parser.add_argument("--model-name", type=str, required=True, help="AA model name") - aa_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - aa_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - - # Show evaluations command - show_parser = subparsers.add_parser( - "show", - help="Display current evaluations in model card", - formatter_class=argparse.RawTextHelpFormatter, - description="Print model-index content from the model card (requires HF_TOKEN for private repos).", - ) - show_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Validate command - validate_parser = subparsers.add_parser( - "validate", - help="Validate model-index format", - formatter_class=argparse.RawTextHelpFormatter, - description="Schema sanity check for model-index section of the card.", - ) - validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Inspect tables command - inspect_parser = subparsers.add_parser( - "inspect-tables", - help="Inspect tables in README → outputs suggested extract-readme command", - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Workflow: - 1. inspect-tables → see table structure, columns, and table numbers - 2. extract-readme → run with --table N (from step 1); YAML prints by default - 3. apply changes → rerun extract-readme with --apply or --create-pr - -Reminder: - - Preferred: use --model-column-index . If needed, use --model-name-override with the exact column header text. 
-""" - ) - inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Get PRs command - prs_parser = subparsers.add_parser( - "get-prs", - help="List open pull requests for a model repository", - formatter_class=argparse.RawTextHelpFormatter, - description="Check for existing open PRs before creating new ones to avoid duplicates.", - epilog=dedent( - """\ - Examples: - uv run scripts/evaluation_manager.py get-prs --repo-id "allenai/Olmo-3-32B-Think" - - IMPORTANT: Always run this before using --create-pr to avoid duplicate PRs. - """ - ), - ) - prs_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - args = parser.parse_args() - - if not args.command: - parser.print_help() - return - - try: - # Execute command - if args.command == "extract-readme": - results = extract_evaluations_from_readme( - repo_id=args.repo_id, - task_type=args.task_type, - dataset_name=args.dataset_name, - dataset_type=args.dataset_type, - model_name_override=args.model_name_override, - table_index=args.table, - model_column_index=args.model_column_index - ) - - if not results: - print("No evaluations extracted") - return - - apply_changes = args.apply or args.create_pr - - # Default behavior: print YAML (dry-run) - yaml = require_yaml() - print("\nExtracted evaluations (YAML):") - print( - yaml.dump( - {"model-index": [{"name": args.repo_id.split('/')[-1], "results": results}]}, - sort_keys=False - ) - ) - - if apply_changes: - if args.model_name_override and args.model_column_index is not None: - print("Note: --model-column-index takes precedence over --model-name-override.") - update_model_card_with_evaluations( - repo_id=args.repo_id, - results=results, - create_pr=args.create_pr, - commit_message="Extract evaluation results from README" - ) - - elif args.command == "import-aa": - results = import_aa_evaluations( - creator_slug=args.creator_slug, - model_name=args.model_name, - repo_id=args.repo_id - ) - - if not 
results: - print("No evaluations imported") - return - - update_model_card_with_evaluations( - repo_id=args.repo_id, - results=results, - create_pr=args.create_pr, - commit_message=f"Add Artificial Analysis evaluations for {args.model_name}" - ) - - elif args.command == "show": - show_evaluations(args.repo_id) - - elif args.command == "validate": - validate_model_index(args.repo_id) - - elif args.command == "inspect-tables": - inspect_tables(args.repo_id) - - elif args.command == "get-prs": - list_open_prs(args.repo_id) - except ModuleNotFoundError as exc: - # Surface dependency hints cleanly when user only needs help output - print(exc) - except Exception as exc: - print(f"Error: {exc}") - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py b/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py index e52fdfb1..d398bc60 100644 --- a/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +++ b/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py @@ -8,7 +8,7 @@ # /// """ -Entry point script for running inspect-ai evaluations via `hf jobs uv run`. +Entry point script for running inspect-ai evaluations against Hugging Face inference providers. """ from __future__ import annotations diff --git a/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py b/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py index 1bb73060..f1454c5a 100644 --- a/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +++ b/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py @@ -16,13 +16,7 @@ separate from inference provider scripts (which use external APIs). 
Usage (standalone): - python inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu" - -Usage (via HF Jobs): - hf jobs uv run inspect_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model "meta-llama/Llama-3.2-1B" --task "mmlu" + uv run scripts/inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu" Model backends: - vllm: Fast inference with vLLM (recommended for large models) @@ -187,16 +181,16 @@ def main() -> None: epilog=""" Examples: # Run MMLU with vLLM backend - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu # Run with HuggingFace Transformers backend - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf # Run with limited samples for testing - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10 + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10 # Run on multiple GPUs with tensor parallelism - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4 + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4 Available tasks (from inspect-evals): - mmlu: Massive Multitask Language Understanding @@ -207,11 +201,6 @@ def main() -> None: - winogrande: Winograd Schema Challenge - humaneval: Code generation (HumanEval) -Via HF Jobs: - hf jobs uv run inspect_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model meta-llama/Llama-3.2-1B --task mmlu """, ) diff --git a/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py b/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py index 38798003..91ba83b3 100644 --- a/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +++ 
b/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py @@ -10,19 +10,14 @@ # /// """ -Entry point script for running lighteval evaluations with vLLM backend via `hf jobs uv run`. +Entry point script for running lighteval evaluations with local GPU backends. -This script runs evaluations using vLLM for efficient GPU inference on custom HuggingFace models. -It is separate from inference provider scripts and evaluates models directly on the hardware. +This script runs evaluations using vLLM or accelerate on custom HuggingFace models. +It is separate from inference provider scripts and evaluates models directly on local hardware. Usage (standalone): - python lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" + uv run scripts/lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" -Usage (via HF Jobs): - hf jobs uv run lighteval_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" """ from __future__ import annotations @@ -181,16 +176,16 @@ def main() -> None: epilog=""" Examples: # Run MMLU evaluation with vLLM - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" # Run with accelerate backend instead of vLLM - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate # Run with chat template for instruction-tuned models - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template # Run with limited samples for testing - python 
lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10 + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10 Task format: Tasks use the format: "suite|task|num_fewshot" @@ -300,4 +295,3 @@ def main() -> None: if __name__ == "__main__": main() - diff --git a/skills/hugging-face-evaluation/scripts/run_eval_job.py b/skills/hugging-face-evaluation/scripts/run_eval_job.py deleted file mode 100644 index 1ba45860..00000000 --- a/skills/hugging-face-evaluation/scripts/run_eval_job.py +++ /dev/null @@ -1,98 +0,0 @@ -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "huggingface-hub>=0.26.0", -# "python-dotenv>=1.2.1", -# ] -# /// - -""" -Submit evaluation jobs using the `hf jobs uv run` CLI. - -This wrapper constructs the appropriate command to execute the local -`inspect_eval_uv.py` script on Hugging Face Jobs with the requested hardware. -""" - -import argparse -import os -import subprocess -import sys -from pathlib import Path -from typing import Optional - -from huggingface_hub import get_token -from dotenv import load_dotenv - -load_dotenv() - - -SCRIPT_PATH = Path(__file__).with_name("inspect_eval_uv.py").resolve() - - -def create_eval_job( - model_id: str, - task: str, - hardware: str = "cpu-basic", - hf_token: Optional[str] = None, - limit: Optional[int] = None, -) -> None: - """ - Submit an evaluation job using the Hugging Face Jobs CLI. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not SCRIPT_PATH.exists(): - raise FileNotFoundError(f"Script not found at {SCRIPT_PATH}") - - print(f"Preparing evaluation job for {model_id} on task {task} (hardware: {hardware})") - - cmd = [ - "hf", - "jobs", - "uv", - "run", - str(SCRIPT_PATH), - "--flavor", - hardware, - "--secrets", - f"HF_TOKEN={token}", - "--", - "--model", - model_id, - "--task", - task, - ] - - if limit: - cmd.extend(["--limit", str(limit)]) - - print("Executing:", " ".join(cmd)) - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def main() -> None: - parser = argparse.ArgumentParser(description="Run inspect-ai evaluations on Hugging Face Jobs") - parser.add_argument("--model", required=True, help="Model ID (e.g. Qwen/Qwen3-0.6B)") - parser.add_argument("--task", required=True, help="Inspect task (e.g. mmlu, gsm8k)") - parser.add_argument("--hardware", default="cpu-basic", help="Hardware flavor (e.g. t4-small, a10g-small)") - parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate") - - args = parser.parse_args() - - create_eval_job( - model_id=args.model, - task=args.task, - hardware=args.hardware, - limit=args.limit, - ) - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py b/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py deleted file mode 100644 index 97ef7271..00000000 --- a/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +++ /dev/null @@ -1,331 +0,0 @@ -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "huggingface-hub>=0.26.0", -# "python-dotenv>=1.2.1", -# ] -# /// - -""" -Submit vLLM-based evaluation jobs using the `hf jobs uv run` CLI. - -This wrapper constructs the appropriate command to execute vLLM evaluation scripts -(lighteval or inspect-ai) on Hugging Face Jobs with GPU hardware. 
- -Unlike run_eval_job.py (which uses inference providers/APIs), this script runs -models directly on the job's GPU using vLLM or HuggingFace Transformers. - -Usage: - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task mmlu \\ - --framework lighteval \\ - --hardware a10g-small -""" - -from __future__ import annotations - -import argparse -import os -import subprocess -import sys -from pathlib import Path -from typing import Optional - -from huggingface_hub import get_token -from dotenv import load_dotenv - -load_dotenv() - -# Script paths for different evaluation frameworks -SCRIPT_DIR = Path(__file__).parent.resolve() -LIGHTEVAL_SCRIPT = SCRIPT_DIR / "lighteval_vllm_uv.py" -INSPECT_SCRIPT = SCRIPT_DIR / "inspect_vllm_uv.py" - -# Hardware flavor recommendations for different model sizes -HARDWARE_RECOMMENDATIONS = { - "small": "t4-small", # < 3B parameters - "medium": "a10g-small", # 3B - 13B parameters - "large": "a10g-large", # 13B - 34B parameters - "xlarge": "a100-large", # 34B+ parameters -} - - -def estimate_hardware(model_id: str) -> str: - """ - Estimate appropriate hardware based on model ID naming conventions. - - Returns a hardware flavor recommendation. 
- """ - model_lower = model_id.lower() - - # Check for explicit size indicators in model name - if any(x in model_lower for x in ["70b", "72b", "65b"]): - return "a100-large" - elif any(x in model_lower for x in ["34b", "33b", "32b", "30b"]): - return "a10g-large" - elif any(x in model_lower for x in ["13b", "14b", "7b", "8b"]): - return "a10g-small" - elif any(x in model_lower for x in ["3b", "2b", "1b", "0.5b", "small", "mini"]): - return "t4-small" - - # Default to medium hardware - return "a10g-small" - - -def create_lighteval_job( - model_id: str, - tasks: str, - hardware: str, - hf_token: Optional[str] = None, - max_samples: Optional[int] = None, - backend: str = "vllm", - batch_size: int = 1, - tensor_parallel_size: int = 1, - trust_remote_code: bool = False, - use_chat_template: bool = False, -) -> None: - """ - Submit a lighteval evaluation job on HuggingFace Jobs. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not LIGHTEVAL_SCRIPT.exists(): - raise FileNotFoundError(f"Script not found at {LIGHTEVAL_SCRIPT}") - - print(f"Preparing lighteval job for {model_id}") - print(f" Tasks: {tasks}") - print(f" Backend: {backend}") - print(f" Hardware: {hardware}") - - cmd = [ - "hf", "jobs", "uv", "run", - str(LIGHTEVAL_SCRIPT), - "--flavor", hardware, - "--secrets", f"HF_TOKEN={token}", - "--", - "--model", model_id, - "--tasks", tasks, - "--backend", backend, - "--batch-size", str(batch_size), - "--tensor-parallel-size", str(tensor_parallel_size), - ] - - if max_samples: - cmd.extend(["--max-samples", str(max_samples)]) - - if trust_remote_code: - cmd.append("--trust-remote-code") - - if use_chat_template: - cmd.append("--use-chat-template") - - print(f"\nExecuting: {' '.join(cmd)}") - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def create_inspect_job( - model_id: str, - task: str, - hardware: str, - hf_token: Optional[str] = None, - limit: Optional[int] = None, - backend: str = "vllm", - tensor_parallel_size: int = 1, - trust_remote_code: bool = False, -) -> None: - """ - Submit an inspect-ai evaluation job on HuggingFace Jobs. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not INSPECT_SCRIPT.exists(): - raise FileNotFoundError(f"Script not found at {INSPECT_SCRIPT}") - - print(f"Preparing inspect-ai job for {model_id}") - print(f" Task: {task}") - print(f" Backend: {backend}") - print(f" Hardware: {hardware}") - - cmd = [ - "hf", "jobs", "uv", "run", - str(INSPECT_SCRIPT), - "--flavor", hardware, - "--secrets", f"HF_TOKEN={token}", - "--", - "--model", model_id, - "--task", task, - "--backend", backend, - "--tensor-parallel-size", str(tensor_parallel_size), - ] - - if limit: - cmd.extend(["--limit", str(limit)]) - - if trust_remote_code: - cmd.append("--trust-remote-code") - - print(f"\nExecuting: {' '.join(cmd)}") - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def main() -> None: - parser = argparse.ArgumentParser( - description="Submit vLLM-based evaluation jobs to HuggingFace Jobs", - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Examples: - # Run lighteval with vLLM on A10G GPU - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task "leaderboard|mmlu|5" \\ - --framework lighteval \\ - --hardware a10g-small - - # Run inspect-ai on larger model with multi-GPU - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-70B \\ - --task mmlu \\ - --framework inspect \\ - --hardware a100-large \\ - --tensor-parallel-size 4 - - # Auto-detect hardware based on model size - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task mmlu \\ - --framework inspect - - # Run with HF Transformers backend (instead of vLLM) - python run_vllm_eval_job.py \\ - --model microsoft/phi-2 \\ - --task mmlu \\ - --framework inspect \\ - --backend hf - -Hardware flavors: - - t4-small: T4 GPU, good for models < 3B - - a10g-small: A10G GPU, good for models 3B-13B - - a10g-large: A10G GPU, good for models 13B-34B - - a100-large: A100 
GPU, good for models 34B+ - -Frameworks: - - lighteval: HuggingFace's lighteval library - - inspect: UK AI Safety's inspect-ai library - -Task formats: - - lighteval: "suite|task|num_fewshot" (e.g., "leaderboard|mmlu|5") - - inspect: task name (e.g., "mmlu", "gsm8k") - """, - ) - - parser.add_argument( - "--model", - required=True, - help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)", - ) - parser.add_argument( - "--task", - required=True, - help="Evaluation task (format depends on framework)", - ) - parser.add_argument( - "--framework", - choices=["lighteval", "inspect"], - default="lighteval", - help="Evaluation framework to use (default: lighteval)", - ) - parser.add_argument( - "--hardware", - default=None, - help="Hardware flavor (auto-detected if not specified)", - ) - parser.add_argument( - "--backend", - choices=["vllm", "hf", "accelerate"], - default="vllm", - help="Model backend (default: vllm)", - ) - parser.add_argument( - "--limit", - "--max-samples", - type=int, - default=None, - dest="limit", - help="Limit number of samples to evaluate", - ) - parser.add_argument( - "--batch-size", - type=int, - default=1, - help="Batch size for evaluation (lighteval only)", - ) - parser.add_argument( - "--tensor-parallel-size", - type=int, - default=1, - help="Number of GPUs for tensor parallelism", - ) - parser.add_argument( - "--trust-remote-code", - action="store_true", - help="Allow executing remote code from model repository", - ) - parser.add_argument( - "--use-chat-template", - action="store_true", - help="Apply chat template (lighteval only)", - ) - - args = parser.parse_args() - - # Auto-detect hardware if not specified - hardware = args.hardware or estimate_hardware(args.model) - print(f"Using hardware: {hardware}") - - # Map backend names between frameworks - backend = args.backend - if args.framework == "lighteval" and backend == "hf": - backend = "accelerate" # lighteval uses "accelerate" for HF backend - - if args.framework == "lighteval": - 
create_lighteval_job( - model_id=args.model, - tasks=args.task, - hardware=hardware, - max_samples=args.limit, - backend=backend, - batch_size=args.batch_size, - tensor_parallel_size=args.tensor_parallel_size, - trust_remote_code=args.trust_remote_code, - use_chat_template=args.use_chat_template, - ) - else: - create_inspect_job( - model_id=args.model, - task=args.task, - hardware=hardware, - limit=args.limit, - backend=backend if backend != "accelerate" else "hf", - tensor_parallel_size=args.tensor_parallel_size, - trust_remote_code=args.trust_remote_code, - ) - - -if __name__ == "__main__": - main() - diff --git a/skills/hugging-face-evaluation/scripts/test_extraction.py b/skills/hugging-face-evaluation/scripts/test_extraction.py deleted file mode 100755 index 4c97e055..00000000 --- a/skills/hugging-face-evaluation/scripts/test_extraction.py +++ /dev/null @@ -1,206 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "pyyaml", -# ] -# /// -""" -Test script for evaluation extraction functionality. - -This script demonstrates the table extraction capabilities without -requiring HF tokens or making actual API calls. - -Note: This script imports from evaluation_manager.py (same directory). 
-Run from the scripts/ directory: cd scripts && uv run test_extraction.py -""" - -import yaml - -from evaluation_manager import ( - extract_tables_from_markdown, - parse_markdown_table, - is_evaluation_table, - extract_metrics_from_table -) - -# Sample README content with various table formats -SAMPLE_README = """ -# My Awesome Model - -## Evaluation Results - -Here are the benchmark results: - -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | - -### Detailed Breakdown - -| Category | MMLU | GSM8K | HumanEval | -|---------------|-------|-------|-----------| -| Performance | 85.2 | 91.3 | 72.5 | - -## Other Information - -This is not an evaluation table: - -| Feature | Value | -|---------|-------| -| Size | 7B | -| Type | Chat | - -## More Results - -| Benchmark | Accuracy | F1 Score | -|---------------|----------|----------| -| HellaSwag | 88.9 | 0.87 | -| TruthfulQA | 68.7 | 0.65 | -""" - - -def test_table_extraction(): - """Test markdown table extraction.""" - print("=" * 60) - print("TEST 1: Table Extraction") - print("=" * 60) - - tables = extract_tables_from_markdown(SAMPLE_README) - print(f"Found {len(tables)} tables in the sample README\n") - - for i, table in enumerate(tables, 1): - print(f"Table {i}:") - print(table[:100] + "..." if len(table) > 100 else table) - print() - - return tables - - -def test_table_parsing(tables): - """Test table parsing.""" - print("\n" + "=" * 60) - print("TEST 2: Table Parsing") - print("=" * 60) - - parsed_tables = [] - for i, table in enumerate(tables, 1): - print(f"\nParsing Table {i}:") - header, rows = parse_markdown_table(table) - - print(f" Header: {header}") - print(f" Rows: {len(rows)}") - for j, row in enumerate(rows[:3], 1): # Show first 3 rows - print(f" Row {j}: {row}") - if len(rows) > 3: - print(f" ... 
and {len(rows) - 3} more rows") - - parsed_tables.append((header, rows)) - - return parsed_tables - - -def test_evaluation_detection(parsed_tables): - """Test evaluation table detection.""" - print("\n" + "=" * 60) - print("TEST 3: Evaluation Table Detection") - print("=" * 60) - - eval_tables = [] - for i, (header, rows) in enumerate(parsed_tables, 1): - is_eval = is_evaluation_table(header, rows) - status = "✓ IS" if is_eval else "✗ NOT" - print(f"\nTable {i}: {status} an evaluation table") - print(f" Header: {header}") - - if is_eval: - eval_tables.append((header, rows)) - - print(f"\nFound {len(eval_tables)} evaluation tables") - return eval_tables - - -def test_metric_extraction(eval_tables): - """Test metric extraction.""" - print("\n" + "=" * 60) - print("TEST 4: Metric Extraction") - print("=" * 60) - - all_metrics = [] - for i, (header, rows) in enumerate(eval_tables, 1): - print(f"\nExtracting metrics from table {i}:") - metrics = extract_metrics_from_table(header, rows, table_format="auto") - - print(f" Extracted {len(metrics)} metrics:") - for metric in metrics: - print(f" - {metric['name']}: {metric['value']} (type: {metric['type']})") - - all_metrics.extend(metrics) - - return all_metrics - - -def test_model_index_format(metrics): - """Test model-index format generation.""" - print("\n" + "=" * 60) - print("TEST 5: Model-Index Format") - print("=" * 60) - - model_index = { - "model-index": [ - { - "name": "test-model", - "results": [ - { - "task": {"type": "text-generation"}, - "dataset": { - "name": "Benchmarks", - "type": "benchmark" - }, - "metrics": metrics, - "source": { - "name": "Model README", - "url": "https://huggingface.co/test/model" - } - } - ] - } - ] - } - - print("\nGenerated model-index structure:") - print(yaml.dump(model_index, sort_keys=False, default_flow_style=False)) - - -def main(): - """Run all tests.""" - print("\n" + "=" * 60) - print("EVALUATION EXTRACTION TEST SUITE") - print("=" * 60) - print("\nThis test demonstrates the 
table extraction capabilities") - print("without requiring API access or tokens.\n") - - # Run tests - tables = test_table_extraction() - parsed_tables = test_table_parsing(tables) - eval_tables = test_evaluation_detection(parsed_tables) - metrics = test_metric_extraction(eval_tables) - test_model_index_format(metrics) - - # Summary - print("\n" + "=" * 60) - print("TEST SUMMARY") - print("=" * 60) - print(f"✓ Found {len(tables)} total tables") - print(f"✓ Identified {len(eval_tables)} evaluation tables") - print(f"✓ Extracted {len(metrics)} metrics") - print("✓ Generated model-index format successfully") - print("\n" + "=" * 60) - print("All tests completed! The extraction logic is working correctly.") - print("=" * 60 + "\n") - - -if __name__ == "__main__": - main()