diff --git a/skills/hugging-face-evaluation/SKILL.md b/skills/hugging-face-evaluation/SKILL.md index 5bdc03c8..3034a11a 100644 --- a/skills/hugging-face-evaluation/SKILL.md +++ b/skills/hugging-face-evaluation/SKILL.md @@ -1,651 +1,207 @@ --- name: hugging-face-evaluation -description: Add and manage evaluation results in Hugging Face model cards. Supports extracting eval tables from README content, importing scores from Artificial Analysis API, and running custom model evaluations with vLLM/lighteval. Works with the model-index metadata format. +description: Run evaluations for Hugging Face Hub models using inspect-ai and lighteval on local hardware. Use for backend selection, local GPU evals, and choosing between vLLM / Transformers / accelerate. Not for HF Jobs orchestration, model-card PRs, .eval_results publication, or community-evals automation. --- # Overview -This skill provides tools to add structured evaluation results to Hugging Face model cards. It supports multiple methods for adding evaluation data: -- Extracting existing evaluation tables from README content -- Importing benchmark scores from Artificial Analysis -- Running custom model evaluations with vLLM or accelerate backends (lighteval/inspect-ai) -## Integration with HF Ecosystem -- **Model Cards**: Updates model-index metadata for leaderboard integration -- **Artificial Analysis**: Direct API integration for benchmark imports -- **Papers with Code**: Compatible with their model-index specification -- **Jobs**: Run evaluations directly on Hugging Face Jobs with `uv` integration -- **vLLM**: Efficient GPU inference for custom model evaluation -- **lighteval**: HuggingFace's evaluation library with vLLM/accelerate backends -- **inspect-ai**: UK AI Safety Institute's evaluation framework +This skill is for **running evaluations against models on the Hugging Face Hub on local hardware**. 
-# Version -1.3.0 +It covers: +- `inspect-ai` with local inference +- `lighteval` with local inference +- choosing between `vllm`, Hugging Face Transformers, and `accelerate` +- smoke tests, task selection, and backend fallback strategy -# Dependencies +It does **not** cover: +- Hugging Face Jobs orchestration +- model-card or `model-index` edits +- README table extraction +- Artificial Analysis imports +- `.eval_results` generation or publishing +- PR creation or community-evals automation -## Core Dependencies -- huggingface_hub>=0.26.0 -- markdown-it-py>=3.0.0 -- python-dotenv>=1.2.1 -- pyyaml>=6.0.3 -- requests>=2.32.5 -- re (built-in) +If the user wants to **run the same eval remotely on Hugging Face Jobs**, hand off to the `hugging-face-jobs` skill and pass it one of the local scripts in this skill. -## Inference Provider Evaluation -- inspect-ai>=0.3.0 -- inspect-evals -- openai +If the user wants to **publish results into the community evals workflow**, stop after generating the evaluation run and hand off that publishing step to `~/code/community-evals`. -## vLLM Custom Model Evaluation (GPU required) -- lighteval[accelerate,vllm]>=0.6.0 -- vllm>=0.4.0 -- torch>=2.0.0 -- transformers>=4.40.0 -- accelerate>=0.30.0 +> All paths below are relative to the directory containing this `SKILL.md`. -Note: vLLM dependencies are installed automatically via PEP 723 script headers when using `uv run`. 
+# When To Use Which Script -# IMPORTANT: Using This Skill +| Use case | Script | +|---|---| +| Local `inspect-ai` eval on a Hub model via inference providers | `scripts/inspect_eval_uv.py` | +| Local GPU eval with `inspect-ai` using `vllm` or Transformers | `scripts/inspect_vllm_uv.py` | +| Local GPU eval with `lighteval` using `vllm` or `accelerate` | `scripts/lighteval_vllm_uv.py` | +| Extra command patterns | `examples/USAGE_EXAMPLES.md` | -## ⚠️ CRITICAL: Check for Existing PRs Before Creating New Ones +# Prerequisites -**Before creating ANY pull request with `--create-pr`, you MUST check for existing open PRs:** +- Prefer `uv run` for local execution. +- Set `HF_TOKEN` for gated/private models. +- For local GPU runs, verify GPU access before starting: ```bash -uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name" +uv --version +printenv HF_TOKEN >/dev/null +nvidia-smi ``` -**If open PRs exist:** -1. **DO NOT create a new PR** - this creates duplicate work for maintainers -2. **Warn the user** that open PRs already exist -3. **Show the user** the existing PR URLs so they can review them -4. Only proceed if the user explicitly confirms they want to create another PR +If `nvidia-smi` is unavailable, either: +- use `scripts/inspect_eval_uv.py` for lighter provider-backed evaluation, or +- hand off to the `hugging-face-jobs` skill if the user wants remote compute. -This prevents spamming model repositories with duplicate evaluation PRs. +# Core Workflow ---- - -> **All paths are relative to the directory containing this SKILL.md -file.** -> Before running any script, first `cd` to that directory or use the full -path. 
- - -**Use `--help` for the latest workflow guidance.** Works with plain Python or `uv run`: -```bash -uv run scripts/evaluation_manager.py --help -uv run scripts/evaluation_manager.py inspect-tables --help -uv run scripts/evaluation_manager.py extract-readme --help -``` -Key workflow (matches CLI help): - -1) `get-prs` → check for existing open PRs first -2) `inspect-tables` → find table numbers/columns -3) `extract-readme --table N` → prints YAML by default -4) add `--apply` (push) or `--create-pr` to write changes - -# Core Capabilities - -## 1. Inspect and Extract Evaluation Tables from README -- **Inspect Tables**: Use `inspect-tables` to see all tables in a README with structure, columns, and sample rows -- **Parse Markdown Tables**: Accurate parsing using markdown-it-py (ignores code blocks and examples) -- **Table Selection**: Use `--table N` to extract from a specific table (required when multiple tables exist) -- **Format Detection**: Recognize common formats (benchmarks as rows, columns, or comparison tables with multiple models) -- **Column Matching**: Automatically identify model columns/rows; prefer `--model-column-index` (index from inspect output). Use `--model-name-override` only with exact column header text. -- **YAML Generation**: Convert selected table to model-index YAML format -- **Task Typing**: `--task-type` sets the `task.type` field in model-index output (e.g., `text-generation`, `summarization`) - -## 2. Import from Artificial Analysis -- **API Integration**: Fetch benchmark scores directly from Artificial Analysis -- **Automatic Formatting**: Convert API responses to model-index format -- **Metadata Preservation**: Maintain source attribution and URLs -- **PR Creation**: Automatically create pull requests with evaluation updates - -## 3. 
Model-Index Management -- **YAML Generation**: Create properly formatted model-index entries -- **Merge Support**: Add evaluations to existing model cards without overwriting -- **Validation**: Ensure compliance with Papers with Code specification -- **Batch Operations**: Process multiple models efficiently - -## 4. Run Evaluations on HF Jobs (Inference Providers) -- **Inspect-AI Integration**: Run standard evaluations using the `inspect-ai` library -- **UV Integration**: Seamlessly run Python scripts with ephemeral dependencies on HF infrastructure -- **Zero-Config**: No Dockerfiles or Space management required -- **Hardware Selection**: Configure CPU or GPU hardware for the evaluation job -- **Secure Execution**: Handles API tokens safely via secrets passed through the CLI - -## 5. Run Custom Model Evaluations with vLLM (NEW) - -⚠️ **Important:** This approach is only possible on devices with `uv` installed and sufficient GPU memory. -**Benefits:** No need to use `hf_jobs()` MCP tool, can run scripts directly in terminal -**When to use:** User working in local device directly when GPU is available - -### Before running the script - -- check the script path -- check uv is installed -- check gpu is available with `nvidia-smi` - -### Running the script - -```bash -uv run scripts/train_sft_example.py -``` -### Features +1. Choose the evaluation framework. + - Use `inspect-ai` when you want explicit task control and inspect-native flows. + - Use `lighteval` when the benchmark is naturally expressed as a lighteval task string, especially leaderboard-style tasks. +2. Choose the inference backend. + - Prefer `vllm` for throughput on supported architectures. + - Use Hugging Face Transformers (`--backend hf`) or `accelerate` as compatibility fallbacks. +3. Start with a smoke test. + - `inspect-ai`: add `--limit 10` or similar. + - `lighteval`: add `--max-samples 10`. +4. Scale up only after the smoke test passes. +5. 
If the user wants remote execution, hand off to `hugging-face-jobs` with the same script + args. -- **vLLM Backend**: High-performance GPU inference (5-10x faster than standard HF methods) -- **lighteval Framework**: HuggingFace's evaluation library with Open LLM Leaderboard tasks -- **inspect-ai Framework**: UK AI Safety Institute's evaluation library -- **Standalone or Jobs**: Run locally or submit to HF Jobs infrastructure - -# Usage Instructions - -The skill includes Python scripts in `scripts/` to perform operations. - -### Prerequisites -- Preferred: use `uv run` (PEP 723 header auto-installs deps) -- Optional manual fallback: `uv pip install huggingface-hub markdown-it-py python-dotenv pyyaml requests` -- Set `HF_TOKEN` environment variable with Write-access token -- For Artificial Analysis: Set `AA_API_KEY` environment variable -- `.env` is loaded automatically if `python-dotenv` is installed - -### Method 1: Extract from README (CLI workflow) - -Recommended flow (matches `--help`): -```bash -# 1) Inspect tables to get table numbers and column hints -uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model" - -# 2) Extract a specific table (prints YAML by default) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - [--model-column-index ] \ - [--model-name-override ""] # use exact header text if you can't use the index - -# 3) Apply changes (push or PR) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - --apply # push directly -# or -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model" \ - --table 1 \ - --create-pr # open a PR -``` - -Validation checklist: -- YAML is printed by default; compare against the README table before applying. -- Prefer `--model-column-index`; if using `--model-name-override`, the column header text must be exact. 
-- For transposed tables (models as rows), ensure only one row is extracted. - -### Method 2: Import from Artificial Analysis - -Fetch benchmark scores from Artificial Analysis API and add them to a model card. - -**Basic Usage:** -```bash -AA_API_KEY="your-api-key" uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" -``` - -**With Environment File:** -```bash -# Create .env file -echo "AA_API_KEY=your-api-key" >> .env -echo "HF_TOKEN=your-hf-token" >> .env - -# Run import -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" -``` - -**Create Pull Request:** -```bash -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "username/model-name" \ - --create-pr -``` +# Quick Start -### Method 3: Run Evaluation Job +## Option A: inspect-ai with local inference providers path -Submit an evaluation job on Hugging Face infrastructure using the `hf jobs uv run` CLI. +Best when the model is already supported by Hugging Face Inference Providers and you want the lowest local setup overhead. 
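The smoke-test-then-scale pattern from the core workflow can be sketched for this path as a tiny helper that assembles the command line. The script path and the `--model`/`--task`/`--limit` flags are the ones documented in this skill; the helper itself is illustrative, not part of the scripts:

```shell
# Sketch: build the inspect_eval_uv.py command, optionally capped for a smoke test.
build_inspect_cmd() {
  local model="$1" task="$2" limit="${3:-}"
  local cmd="uv run scripts/inspect_eval_uv.py --model $model --task $task"
  [ -n "$limit" ] && cmd="$cmd --limit $limit"
  printf '%s\n' "$cmd"
}

# Smoke test first; run the full command only once this one passes.
build_inspect_cmd meta-llama/Llama-3.2-1B mmlu 10
# prints: uv run scripts/inspect_eval_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10
build_inspect_cmd meta-llama/Llama-3.2-1B mmlu
```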
-**Direct CLI Usage:** ```bash -HF_TOKEN=$HF_TOKEN \ -hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \ - --flavor cpu-basic \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "meta-llama/Llama-2-7b-hf" \ - --task "mmlu" -``` - -**GPU Example (A10G):** -```bash -HF_TOKEN=$HF_TOKEN \ -hf jobs uv run hf-evaluation/scripts/inspect_eval_uv.py \ - --flavor a10g-small \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "meta-llama/Llama-2-7b-hf" \ - --task "gsm8k" -``` - -**Python Helper (optional):** -```bash -uv run scripts/run_eval_job.py \ - --model "meta-llama/Llama-2-7b-hf" \ - --task "mmlu" \ - --hardware "t4-small" -``` - -### Method 4: Run Custom Model Evaluation with vLLM - -Evaluate custom HuggingFace models directly on GPU using vLLM or accelerate backends. These scripts are **separate from inference provider scripts** and run models locally on the job's hardware. - -#### When to Use vLLM Evaluation (vs Inference Providers) - -| Feature | vLLM Scripts | Inference Provider Scripts | -|---------|-------------|---------------------------| -| Model access | Any HF model | Models with API endpoints | -| Hardware | Your GPU (or HF Jobs GPU) | Provider's infrastructure | -| Cost | HF Jobs compute cost | API usage fees | -| Speed | vLLM optimized | Depends on provider | -| Offline | Yes (after download) | No | - -#### Option A: lighteval with vLLM Backend - -lighteval is HuggingFace's evaluation library, supporting Open LLM Leaderboard tasks. 
- -**Standalone (local GPU):** -```bash -# Run MMLU 5-shot with vLLM -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" - -# Run multiple tasks -uv run scripts/lighteval_vllm_uv.py \ +uv run scripts/inspect_eval_uv.py \ --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" - -# Use accelerate backend instead of vLLM -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" \ - --backend accelerate - -# Chat/instruction-tuned models -uv run scripts/lighteval_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B-Instruct \ - --tasks "leaderboard|mmlu|5" \ - --use-chat-template -``` - -**Via HF Jobs:** -```bash -hf jobs uv run scripts/lighteval_vllm_uv.py \ - --flavor a10g-small \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model meta-llama/Llama-3.2-1B \ - --tasks "leaderboard|mmlu|5" + --task mmlu \ + --limit 20 ``` -**lighteval Task Format:** -Tasks use the format `suite|task|num_fewshot`: -- `leaderboard|mmlu|5` - MMLU with 5-shot -- `leaderboard|gsm8k|5` - GSM8K with 5-shot -- `lighteval|hellaswag|0` - HellaSwag zero-shot -- `leaderboard|arc_challenge|25` - ARC-Challenge with 25-shot - -**Finding Available Tasks:** -The complete list of available lighteval tasks can be found at: -https://github.com/huggingface/lighteval/blob/main/examples/tasks/all_tasks.txt - -This file contains all supported tasks in the format `suite|task|num_fewshot|0` (the trailing `0` is a version flag and can be ignored). Common suites include: -- `leaderboard` - Open LLM Leaderboard tasks (MMLU, GSM8K, ARC, HellaSwag, etc.) -- `lighteval` - Additional lighteval tasks -- `bigbench` - BigBench tasks -- `original` - Original benchmark tasks - -To use a task from the list, extract the `suite|task|num_fewshot` portion (without the trailing `0`) and pass it to the `--tasks` parameter. 
For example: -- From file: `leaderboard|mmlu|0` → Use: `leaderboard|mmlu|0` (or change to `5` for 5-shot) -- From file: `bigbench|abstract_narrative_understanding|0` → Use: `bigbench|abstract_narrative_understanding|0` -- From file: `lighteval|wmt14:hi-en|0` → Use: `lighteval|wmt14:hi-en|0` +Use this path when: +- you want a quick local smoke test +- you do not need direct GPU control +- the task already exists in `inspect-evals` -Multiple tasks can be specified as comma-separated values: `--tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5"` +## Option B: inspect-ai on Local GPU -#### Option B: inspect-ai with vLLM Backend +Best when you need to load the Hub model directly, use `vllm`, or fall back to Transformers for unsupported architectures. -inspect-ai is the UK AI Safety Institute's evaluation framework. +Local GPU: -**Standalone (local GPU):** ```bash -# Run MMLU with vLLM uv run scripts/inspect_vllm_uv.py \ --model meta-llama/Llama-3.2-1B \ - --task mmlu - -# Use HuggingFace Transformers backend -uv run scripts/inspect_vllm_uv.py \ - --model meta-llama/Llama-3.2-1B \ - --task mmlu \ - --backend hf - -# Multi-GPU with tensor parallelism -uv run scripts/inspect_vllm_uv.py \ - --model meta-llama/Llama-3.2-70B \ - --task mmlu \ - --tensor-parallel-size 4 -``` - -**Via HF Jobs:** -```bash -hf jobs uv run scripts/inspect_vllm_uv.py \ - --flavor a10g-small \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model meta-llama/Llama-3.2-1B \ - --task mmlu + --task gsm8k \ + --limit 20 ``` -**Available inspect-ai Tasks:** -- `mmlu` - Massive Multitask Language Understanding -- `gsm8k` - Grade School Math -- `hellaswag` - Common sense reasoning -- `arc_challenge` - AI2 Reasoning Challenge -- `truthfulqa` - TruthfulQA benchmark -- `winogrande` - Winograd Schema Challenge -- `humaneval` - Code generation - -#### Option C: Python Helper Script - -The helper script auto-selects hardware and simplifies job submission: +Transformers fallback: ```bash -# Auto-detect hardware based on model 
size -uv run scripts/run_vllm_eval_job.py \ - --model meta-llama/Llama-3.2-1B \ - --task "leaderboard|mmlu|5" \ - --framework lighteval - -# Explicit hardware selection -uv run scripts/run_vllm_eval_job.py \ - --model meta-llama/Llama-3.2-70B \ - --task mmlu \ - --framework inspect \ - --hardware a100-large \ - --tensor-parallel-size 4 - -# Use HF Transformers backend -uv run scripts/run_vllm_eval_job.py \ +uv run scripts/inspect_vllm_uv.py \ --model microsoft/phi-2 \ --task mmlu \ - --framework inspect \ - --backend hf + --backend hf \ + --trust-remote-code \ + --limit 20 ``` -**Hardware Recommendations:** -| Model Size | Recommended Hardware | -|------------|---------------------| -| < 3B params | `t4-small` | -| 3B - 13B | `a10g-small` | -| 13B - 34B | `a10g-large` | -| 34B+ | `a100-large` | +## Option C: lighteval on Local GPU -### Commands Reference - -**Top-level help and version:** -```bash -uv run scripts/evaluation_manager.py --help -uv run scripts/evaluation_manager.py --version -``` +Best when the task is naturally expressed as a `lighteval` task string, especially Open LLM Leaderboard style benchmarks. -**Inspect Tables (start here):** -```bash -uv run scripts/evaluation_manager.py inspect-tables --repo-id "username/model-name" -``` +Local GPU: -**Extract from README:** ```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "username/model-name" \ - --table N \ - [--model-column-index N] \ - [--model-name-override "Exact Column Header or Model Name"] \ - [--task-type "text-generation"] \ - [--dataset-name "Custom Benchmarks"] \ - [--apply | --create-pr] -``` - -**Import from Artificial Analysis:** -```bash -AA_API_KEY=... 
uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "creator-name" \ - --model-name "model-slug" \ - --repo-id "username/model-name" \ - [--create-pr] -``` - -**View / Validate:** -```bash -uv run scripts/evaluation_manager.py show --repo-id "username/model-name" -uv run scripts/evaluation_manager.py validate --repo-id "username/model-name" -``` - -**Check Open PRs (ALWAYS run before --create-pr):** -```bash -uv run scripts/evaluation_manager.py get-prs --repo-id "username/model-name" -``` -Lists all open pull requests for the model repository. Shows PR number, title, author, date, and URL. - -**Run Evaluation Job (Inference Providers):** -```bash -hf jobs uv run scripts/inspect_eval_uv.py \ - --flavor "cpu-basic|t4-small|..." \ - --secret HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --task "task-name" -``` - -or use the Python helper: - -```bash -uv run scripts/run_eval_job.py \ - --model "model-id" \ - --task "task-name" \ - --hardware "cpu-basic|t4-small|..." -``` - -**Run vLLM Evaluation (Custom Models):** -```bash -# lighteval with vLLM -hf jobs uv run scripts/lighteval_vllm_uv.py \ - --flavor "a10g-small" \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --tasks "leaderboard|mmlu|5" - -# inspect-ai with vLLM -hf jobs uv run scripts/inspect_vllm_uv.py \ - --flavor "a10g-small" \ - --secrets HF_TOKEN=$HF_TOKEN \ - -- --model "model-id" \ - --task "mmlu" - -# Helper script (auto hardware selection) -uv run scripts/run_vllm_eval_job.py \ - --model "model-id" \ - --task "leaderboard|mmlu|5" \ - --framework lighteval -``` - -### Model-Index Format - -The generated model-index follows this structure: - -```yaml -model-index: - - name: Model Name - results: - - task: - type: text-generation - dataset: - name: Benchmark Dataset - type: benchmark_type - metrics: - - name: MMLU - type: mmlu - value: 85.2 - - name: HumanEval - type: humaneval - value: 72.5 - source: - name: Source Name - url: https://source-url.com -``` - -WARNING: Do not 
use markdown formatting in the model name. Use the exact name from the table. Only use urls in the source.url field. - -### Error Handling -- **Table Not Found**: Script will report if no evaluation tables are detected -- **Invalid Format**: Clear error messages for malformed tables -- **API Errors**: Retry logic for transient Artificial Analysis API failures -- **Token Issues**: Validation before attempting updates -- **Merge Conflicts**: Preserves existing model-index entries when adding new ones -- **Space Creation**: Handles naming conflicts and hardware request failures gracefully - -### Best Practices - -1. **Check for existing PRs first**: Run `get-prs` before creating any new PR to avoid duplicates -2. **Always start with `inspect-tables`**: See table structure and get the correct extraction command -3. **Use `--help` for guidance**: Run `inspect-tables --help` to see the complete workflow -4. **Preview first**: Default behavior prints YAML; review it before using `--apply` or `--create-pr` -5. **Verify extracted values**: Compare YAML output against the README table manually -6. **Use `--table N` for multi-table READMEs**: Required when multiple evaluation tables exist -7. **Use `--model-name-override` for comparison tables**: Copy the exact column header from `inspect-tables` output -8. **Create PRs for Others**: Use `--create-pr` when updating models you don't own -9. **One model per repo**: Only add the main model's results to model-index -10. 
**No markdown in YAML names**: The model name field in YAML should be plain text - -### Model Name Matching - -When extracting evaluation tables with multiple models (either as columns or rows), the script uses **exact normalized token matching**: - -- Removes markdown formatting (bold `**`, links `[]()` ) -- Normalizes names (lowercase, replace `-` and `_` with spaces) -- Compares token sets: `"OLMo-3-32B"` → `{"olmo", "3", "32b"}` matches `"**Olmo 3 32B**"` or `"[Olmo-3-32B](...)` -- Only extracts if tokens match exactly (handles different word orders and separators) -- Fails if no exact match found (rather than guessing from similar names) - -**For column-based tables** (benchmarks as rows, models as columns): -- Finds the column header matching the model name -- Extracts scores from that column only - -**For transposed tables** (models as rows, benchmarks as columns): -- Finds the row in the first column matching the model name -- Extracts all benchmark scores from that row only - -This ensures only the correct model's scores are extracted, never unrelated models or training checkpoints. 
- -### Common Patterns - -**Update Your Own Model:** -```bash -# Extract from README and push directly -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --task-type "text-generation" +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \ + --max-samples 20 \ + --use-chat-template ``` -**Update Someone Else's Model (Full Workflow):** -```bash -# Step 1: ALWAYS check for existing PRs first -uv run scripts/evaluation_manager.py get-prs \ - --repo-id "other-username/their-model" - -# Step 2: If NO open PRs exist, proceed with creating one -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "other-username/their-model" \ - --create-pr - -# If open PRs DO exist: -# - Warn the user about existing PRs -# - Show them the PR URLs -# - Do NOT create a new PR unless user explicitly confirms -``` +`accelerate` fallback: -**Import Fresh Benchmarks:** ```bash -# Step 1: Check for existing PRs -uv run scripts/evaluation_manager.py get-prs \ - --repo-id "anthropic/claude-sonnet-4" - -# Step 2: If no PRs, import from Artificial Analysis -AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "anthropic/claude-sonnet-4" \ - --create-pr -``` - -### Troubleshooting - -**Issue**: "No evaluation tables found in README" -- **Solution**: Check if README contains markdown tables with numeric scores - -**Issue**: "Could not find model 'X' in transposed table" -- **Solution**: The script will display available models. 
Use `--model-name-override` with the exact name from the list -- **Example**: `--model-name-override "**Olmo 3-32B**"` - -**Issue**: "AA_API_KEY not set" -- **Solution**: Set environment variable or add to .env file - -**Issue**: "Token does not have write access" -- **Solution**: Ensure HF_TOKEN has write permissions for the repository - -**Issue**: "Model not found in Artificial Analysis" -- **Solution**: Verify creator-slug and model-name match API values - -**Issue**: "Payment required for hardware" -- **Solution**: Add a payment method to your Hugging Face account to use non-CPU hardware - -**Issue**: "vLLM out of memory" or CUDA OOM -- **Solution**: Use a larger hardware flavor, reduce `--gpu-memory-utilization`, or use `--tensor-parallel-size` for multi-GPU - -**Issue**: "Model architecture not supported by vLLM" -- **Solution**: Use `--backend hf` (inspect-ai) or `--backend accelerate` (lighteval) for HuggingFace Transformers - -**Issue**: "Trust remote code required" -- **Solution**: Add `--trust-remote-code` flag for models with custom code (e.g., Phi-2, Qwen) - -**Issue**: "Chat template not found" -- **Solution**: Only use `--use-chat-template` for instruction-tuned models that include a chat template - -### Integration Examples - -**Python Script Integration:** -```python -import subprocess -import os - -def update_model_evaluations(repo_id, readme_content): - """Update model card with evaluations from README.""" - result = subprocess.run([ - "python", "scripts/evaluation_manager.py", - "extract-readme", - "--repo-id", repo_id, - "--create-pr" - ], capture_output=True, text=True) - - if result.returncode == 0: - print(f"Successfully updated {repo_id}") - else: - print(f"Error: {result.stderr}") -``` +uv run scripts/lighteval_vllm_uv.py \ + --model microsoft/phi-2 \ + --tasks "leaderboard|mmlu|5" \ + --backend accelerate \ + --trust-remote-code \ + --max-samples 20 +``` + +# Remote Execution Boundary + +This skill intentionally stops at **local 
execution and backend selection**. + +If the user wants to: +- run these scripts on Hugging Face Jobs +- pick remote hardware +- pass secrets to remote jobs +- schedule recurring runs +- inspect / cancel / monitor jobs + +then switch to the **`hugging-face-jobs`** skill and pass it one of these scripts plus the chosen arguments. + +# Task Selection + +`inspect-ai` examples: +- `mmlu` +- `gsm8k` +- `hellaswag` +- `arc_challenge` +- `truthfulqa` +- `winogrande` +- `humaneval` + +`lighteval` task strings use `suite|task|num_fewshot`: +- `leaderboard|mmlu|5` +- `leaderboard|gsm8k|5` +- `leaderboard|arc_challenge|25` +- `lighteval|hellaswag|0` + +Multiple `lighteval` tasks can be comma-separated in `--tasks`. + +# Backend Selection + +- Prefer `inspect_vllm_uv.py --backend vllm` for fast GPU inference on supported architectures. +- Use `inspect_vllm_uv.py --backend hf` when `vllm` does not support the model. +- Prefer `lighteval_vllm_uv.py --backend vllm` for throughput on supported models. +- Use `lighteval_vllm_uv.py --backend accelerate` as the compatibility fallback. +- Use `inspect_eval_uv.py` when Inference Providers already cover the model and you do not need direct GPU control. + +# Hardware Guidance + +| Model size | Suggested local hardware | +|---|---| +| `< 3B` | consumer GPU / Apple Silicon / small dev GPU | +| `3B - 13B` | stronger local GPU | +| `13B+` | high-memory local GPU or hand off to `hugging-face-jobs` | + +For smoke tests, prefer cheaper local runs plus `--limit` or `--max-samples`. 
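The backend fallback strategy above can be sketched as a small wrapper that retries with the Transformers backend when the `vllm` attempt fails. This assumes the script exits non-zero on an unsupported architecture; the wrapper is illustrative, while the flags are the ones documented in this skill:

```shell
# Sketch: try the vllm backend first, then fall back to --backend hf on failure.
run_with_fallback() {
  local model="$1" task="$2"
  if uv run scripts/inspect_vllm_uv.py \
       --model "$model" --task "$task" --backend vllm --limit 10; then
    echo "vllm backend succeeded"
  else
    echo "vllm backend failed; retrying with --backend hf" >&2
    uv run scripts/inspect_vllm_uv.py \
      --model "$model" --task "$task" --backend hf --limit 10
  fi
}

# Usage: run_with_fallback meta-llama/Llama-3.2-1B gsm8k
```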
+ +# Troubleshooting + +- CUDA or vLLM OOM: + - reduce `--batch-size` + - reduce `--gpu-memory-utilization` + - switch to a smaller model for the smoke test + - if necessary, hand off to `hugging-face-jobs` +- Model unsupported by `vllm`: + - switch to `--backend hf` for `inspect-ai` + - switch to `--backend accelerate` for `lighteval` +- Gated/private repo access fails: + - verify `HF_TOKEN` +- Custom model code required: + - add `--trust-remote-code` + +# Examples + +See: +- `examples/USAGE_EXAMPLES.md` for local command patterns +- `scripts/inspect_eval_uv.py` +- `scripts/inspect_vllm_uv.py` +- `scripts/lighteval_vllm_uv.py` diff --git a/skills/hugging-face-evaluation/examples/.env.example b/skills/hugging-face-evaluation/examples/.env.example index 3d814a3c..26d9b9b4 100644 --- a/skills/hugging-face-evaluation/examples/.env.example +++ b/skills/hugging-face-evaluation/examples/.env.example @@ -1,7 +1,3 @@ -# Hugging Face Token (required for all operations) +# Hugging Face Token (required for gated/private models) # Get your token at: https://huggingface.co/settings/tokens HF_TOKEN=hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx - -# Artificial Analysis API Key (required for import-aa command) -# Get your key at: https://artificialanalysis.ai/ -AA_API_KEY=aa_xxxxxxxxxxxxxxxxxxxxxxxxxxxxx diff --git a/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md b/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md index b5cbb708..64c24334 100644 --- a/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md +++ b/skills/hugging-face-evaluation/examples/USAGE_EXAMPLES.md @@ -1,378 +1,101 @@ # Usage Examples -This document provides practical examples for both methods of adding evaluations to HuggingFace model cards. +This document provides practical examples for **running evaluations locally** against Hugging Face Hub models. -## Table of Contents -1. [Setup](#setup) -2. [Method 1: Extract from README](#method-1-extract-from-readme) -3. 
[Method 2: Import from Artificial Analysis](#method-2-import-from-artificial-analysis) -4. [Standalone vs Integrated](#standalone-vs-integrated) -5. [Common Workflows](#common-workflows) +## What this skill covers -## Setup - -### Initial Configuration - -```bash -# Navigate to skill directory -cd hf_evaluation_skill - - -# Configure environment variables -cp examples/.env.example .env -# Edit .env with your tokens -``` - -Your `.env` file should contain: -```env -HF_TOKEN=hf_your_write_token_here -AA_API_KEY=aa_your_api_key_here # Optional for AA imports -``` - -### Verify Installation - -```bash -uv run scripts/test_extraction.py -``` - -## Method 1: Extract from README - -Extract evaluation tables from your model's existing README. - -### Basic Extraction - -```bash -# Preview what will be extracted (dry run) -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "meta-llama/Llama-3.3-70B-Instruct" \ - --dry-run -``` - -### Apply Extraction to Your Model - -```bash -# Extract and update model card directly -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model-7b" -``` - -### Custom Task and Dataset Names - -```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model-7b" \ - --task-type "text-generation" \ - --dataset-name "Standard Benchmarks" \ - --dataset-type "llm_benchmarks" -``` - -### Create Pull Request (for models you don't own) - -```bash -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "organization/community-model" \ - --create-pr -``` - -### Example README Format - -Your model README should contain tables like: - -```markdown -## Evaluation Results - -| Benchmark | Score | -|---------------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -``` - -## Method 2: Import from Artificial Analysis - -Fetch benchmark scores directly from Artificial Analysis API. 
- -### Integrated Approach (Recommended) - -```bash -# Import scores for Claude Sonnet 4.5 -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "your-username/claude-mirror" -``` - -### With Pull Request - -```bash -# Create PR instead of direct commit -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "openai" \ - --model-name "gpt-4" \ - --repo-id "your-username/gpt-4-mirror" \ - --create-pr -``` - -### Standalone Script - -For simple, one-off imports, use the standalone script: - -```bash -# Navigate to examples directory -cd examples +- `inspect-ai` local runs +- `inspect-ai` with `vllm` or Transformers backends +- `lighteval` local runs with `vllm` or `accelerate` +- smoke tests and backend fallback patterns -# Run standalone script -AA_API_KEY="your-key" HF_TOKEN="your-token" \ -uv run artificial_analysis_to_hub.py \ - --creator-slug "anthropic" \ - --model-name "claude-sonnet-4" \ - --repo-id "your-username/your-repo" -``` - -### Finding Creator Slug and Model Name - -1. Visit [Artificial Analysis](https://artificialanalysis.ai/) -2. Navigate to the model you want to import -3. The URL format is: `https://artificialanalysis.ai/models/{creator-slug}/{model-name}` -4. Or check their [API documentation](https://artificialanalysis.ai/api) - -Common examples: -- Anthropic: `--creator-slug "anthropic" --model-name "claude-sonnet-4"` -- OpenAI: `--creator-slug "openai" --model-name "gpt-4-turbo"` -- Meta: `--creator-slug "meta" --model-name "llama-3-70b"` - -## Standalone vs Integrated - -### Standalone Script Features -- ✓ Simple, single-purpose -- ✓ Can run via `uv run` from URL -- ✓ Minimal dependencies -- ✗ No README extraction -- ✗ No validation -- ✗ No dry-run mode - -**Use when:** You only need AA imports and want a simple script. 
+## What this skill does NOT cover -### Integrated Script Features -- ✓ Both README extraction AND AA import -- ✓ Validation and show commands -- ✓ Dry-run preview mode -- ✓ Better error handling -- ✓ Merge with existing evaluations -- ✓ More flexible options - -**Use when:** You want full evaluation management capabilities. - -## Common Workflows - -### Workflow 1: New Model with README Tables - -You've just created a model with evaluation tables in the README. - -```bash -# Step 1: Preview extraction -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/new-model-7b" \ - --dry-run +- `model-index` +- `.eval_results` +- community eval publication workflows +- model-card PR creation +- Hugging Face Jobs orchestration -# Step 2: Apply if it looks good -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/new-model-7b" +If you want to run these same scripts remotely, use the `hugging-face-jobs` skill and pass one of the scripts in `scripts/`. -# Step 3: Validate -uv run scripts/evaluation_manager.py validate \ - --repo-id "your-username/new-model-7b" - -# Step 4: View results -uv run scripts/evaluation_manager.py show \ - --repo-id "your-username/new-model-7b" -``` - -### Workflow 2: Model Benchmarked on AA - -Your model appears on Artificial Analysis with fresh benchmarks. - -```bash -# Import scores and create PR for review -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "your-org" \ - --model-name "your-model" \ - --repo-id "your-org/your-model-hf" \ - --create-pr -``` - -### Workflow 3: Combine Both Methods - -You have README tables AND AA scores. 
+## Setup ```bash -# Step 1: Extract from README -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/hybrid-model" - -# Step 2: Import from AA (will merge with existing) -uv run scripts/evaluation_manager.py import-aa \ - --creator-slug "your-org" \ - --model-name "hybrid-model" \ - --repo-id "your-username/hybrid-model" - -# Step 3: View combined results -uv run scripts/evaluation_manager.py show \ - --repo-id "your-username/hybrid-model" +cd skills/hugging-face-evaluation +export HF_TOKEN=hf_xxx +uv --version ``` -### Workflow 4: Contributing to Community Models - -Help improve community models by adding missing evaluations. +For local GPU runs: ```bash -# Find a model with evaluations in README but no model-index -# Example: community/awesome-7b - -# Create PR with extracted evaluations -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "community/awesome-7b" \ - --create-pr - -# GitHub will notify the repository owner -# They can review and merge your PR +nvidia-smi ``` -### Workflow 5: Batch Processing +## inspect-ai examples -Update multiple models at once. +### Quick smoke test ```bash -# Create a list of repos -cat > models.txt << EOF -your-org/model-1-7b -your-org/model-2-13b -your-org/model-3-70b -EOF - -# Process each -while read repo_id; do - echo "Processing $repo_id..." - uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "$repo_id" -done < models.txt +uv run scripts/inspect_eval_uv.py \ + --model meta-llama/Llama-3.2-1B \ + --task mmlu \ + --limit 10 ``` -### Workflow 6: Automated Updates (CI/CD) - -Set up automatic evaluation updates using GitHub Actions. 
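+
+To browse the results of a completed run, `inspect-ai` bundles a local log viewer. A minimal sketch, assuming the run wrote its logs to inspect-ai's default `./logs` directory (if these scripts log elsewhere, point `--log-dir` at that directory instead):
+
+```bash
+# Hedged sketch: `inspect view` is inspect-ai's bundled log viewer;
+# ./logs is assumed here because it is inspect-ai's default output location.
+uv run --with inspect-ai inspect view --log-dir ./logs
+```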
-
-```yaml
-# .github/workflows/update-evals.yml
-name: Update Evaluations Weekly
-on:
-  schedule:
-    - cron: '0 0 * * 0' # Every Sunday
-  workflow_dispatch: # Manual trigger
-
-jobs:
-  update:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v4
-
-      - name: Set up uv
-        uses: astral-sh/setup-uv@v5
-
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.13'
-
-      - name: Update from Artificial Analysis
-        env:
-          AA_API_KEY: ${{ secrets.AA_API_KEY }}
-          HF_TOKEN: ${{ secrets.HF_TOKEN }}
-        run: |
-          uv run scripts/evaluation_manager.py import-aa \
-            --creator-slug "${{ vars.AA_CREATOR_SLUG }}" \
-            --model-name "${{ vars.AA_MODEL_NAME }}" \
-            --repo-id "${{ github.repository }}" \
-            --create-pr
-```
-
-## Verification and Validation
-
-### Check Current Evaluations
+### Local GPU with vLLM

```bash
-uv run scripts/evaluation_manager.py show \
-  --repo-id "your-username/your-model"
+uv run scripts/inspect_vllm_uv.py \
+  --model meta-llama/Llama-3.1-8B-Instruct \
+  --task gsm8k \
+  --limit 20
```

-### Validate Format
+### Transformers fallback

```bash
-uv run scripts/evaluation_manager.py validate \
-  --repo-id "your-username/your-model"
+uv run scripts/inspect_vllm_uv.py \
+  --model microsoft/phi-2 \
+  --task mmlu \
+  --backend hf \
+  --trust-remote-code \
+  --limit 20
```

-### View in HuggingFace UI
+## lighteval examples

-After updating, visit:
-```
-https://huggingface.co/your-username/your-model
-```
-
-The evaluation widget should display your scores automatically.
- -## Troubleshooting Examples - -### Problem: No tables found +### Single task ```bash -# Check what tables exist in your README -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --dry-run - -# If no output, ensure your README has markdown tables with numeric scores +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5" \ + --max-samples 20 ``` -### Problem: AA model not found +### Multiple tasks ```bash -# Verify the creator and model slugs -# Check the AA website URL or API directly -curl -H "x-api-key: $AA_API_KEY" \ - https://artificialanalysis.ai/api/v2/data/llms/models | jq +uv run scripts/lighteval_vllm_uv.py \ + --model meta-llama/Llama-3.2-3B-Instruct \ + --tasks "leaderboard|mmlu|5,leaderboard|gsm8k|5" \ + --max-samples 20 \ + --use-chat-template ``` -### Problem: Token permission error +### accelerate fallback ```bash -# Verify your token has write access -# Generate a new token at: https://huggingface.co/settings/tokens -# Ensure "Write" scope is enabled +uv run scripts/lighteval_vllm_uv.py \ + --model microsoft/phi-2 \ + --tasks "leaderboard|mmlu|5" \ + --backend accelerate \ + --trust-remote-code \ + --max-samples 20 ``` -## Tips and Best Practices - -1. **Always dry-run first**: Use `--dry-run` to preview changes -2. **Use PRs for others' repos**: Always use `--create-pr` for repositories you don't own -3. **Validate after updates**: Run `validate` to ensure proper formatting -4. **Keep evaluations current**: Set up automated updates for AA scores -5. **Document sources**: The tool automatically adds source attribution -6. 
**Check the UI**: Always verify the evaluation widget displays correctly - -## Getting Help - -```bash -# General help -uv run scripts/evaluation_manager.py --help - -# Command-specific help -uv run scripts/evaluation_manager.py extract-readme --help -uv run scripts/evaluation_manager.py import-aa --help -``` +## Hand-off to Hugging Face Jobs -For issues or questions, consult: -- `../SKILL.md` - Complete documentation -- `../README.md` - Troubleshooting guide -- `../QUICKSTART.md` - Quick start guide +When local hardware is not enough, switch to the `hugging-face-jobs` skill and run one of these scripts remotely. Keep the script path and args; move the orchestration there. diff --git a/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py b/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py deleted file mode 100644 index 15216dd0..00000000 --- a/skills/hugging-face-evaluation/examples/artificial_analysis_to_hub.py +++ /dev/null @@ -1,141 +0,0 @@ -# /// script -# requires-python = ">=3.13" -# dependencies = [ -# "huggingface-hub>=1.1.4", -# "python-dotenv>=1.2.1", -# "pyyaml>=6.0.3", -# "requests>=2.32.5", -# ] -# /// - -""" -Add Artificial Analysis evaluations to a Hugging Face model card. - -NOTE: This is a standalone reference script. 
For integrated functionality -with additional features (README extraction, validation, etc.), use: - ../scripts/evaluation_manager.py import-aa [options] - -STANDALONE USAGE: -AA_API_KEY="" HF_TOKEN="" \ -uv run artificial_analysis_to_hub.py \ ---creator-slug \ ---model-name \ ---repo-id - -INTEGRATED USAGE (Recommended): -uv run ../scripts/evaluation_manager.py import-aa \ ---creator-slug \ ---model-name \ ---repo-id \ -[--create-pr] -""" - -import argparse -import os - -import requests -import dotenv -from huggingface_hub import ModelCard - -dotenv.load_dotenv() - -API_KEY = os.getenv("AA_API_KEY") -HF_TOKEN = os.getenv("HF_TOKEN") -URL = "https://artificialanalysis.ai/api/v2/data/llms/models" -HEADERS = {"x-api-key": API_KEY} - -if not API_KEY: - raise ValueError("AA_API_KEY is not set") -if not HF_TOKEN: - raise ValueError("HF_TOKEN is not set") - - -def get_model_evaluations_data(creator_slug, model_name): - response = requests.get(URL, headers=HEADERS) - response_data = response.json()["data"] - for model in response_data: - if ( - model["model_creator"]["slug"] == creator_slug - and model["slug"] == model_name - ): - return model - raise ValueError(f"Model {model_name} not found") - - -def aa_evaluations_to_model_index( - model, - dataset_name="Artificial Analysis Benchmarks", - dataset_type="artificial_analysis", - task_type="evaluation", -): - if not model: - raise ValueError("Model data is required") - - model_name = model.get("name", model.get("slug", "unknown-model")) - evaluations = model.get("evaluations", {}) - - metrics = [] - for key, value in evaluations.items(): - metrics.append( - { - "name": key.replace("_", " ").title(), - "type": key, - "value": value, - } - ) - - model_index = [ - { - "name": model_name, - "results": [ - { - "task": {"type": task_type}, - "dataset": {"name": dataset_name, "type": dataset_type}, - "metrics": metrics, - "source": { - "name": "Artificial Analysis API", - "url": "https://artificialanalysis.ai", - }, - } - ], - } 
- ] - - return model_index - - -def main(): - parser = argparse.ArgumentParser() - parser.add_argument("--creator-slug", type=str, required=True) - parser.add_argument("--model-name", type=str, required=True) - parser.add_argument("--repo-id", type=str, required=True) - args = parser.parse_args() - - aa_evaluations_data = get_model_evaluations_data( - creator_slug=args.creator_slug, model_name=args.model_name - ) - - model_index = aa_evaluations_to_model_index(model=aa_evaluations_data) - - card = ModelCard.load(args.repo_id) - card.data["model-index"] = model_index - - commit_message = ( - f"Add Artificial Analysis evaluations for {args.model_name}" - ) - commit_description = ( - f"This commit adds the Artificial Analysis evaluations for the {args.model_name} model to this repository. " - "To see the scores, visit the [Artificial Analysis](https://artificialanalysis.ai) website." - ) - - card.push_to_hub( - args.repo_id, - token=HF_TOKEN, - commit_message=commit_message, - commit_description=commit_description, - create_pr=True, - ) - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/examples/example_readme_tables.md b/skills/hugging-face-evaluation/examples/example_readme_tables.md deleted file mode 100644 index c996338f..00000000 --- a/skills/hugging-face-evaluation/examples/example_readme_tables.md +++ /dev/null @@ -1,135 +0,0 @@ -# Example Evaluation Table Formats - -This file shows various formats of evaluation tables that can be extracted from model README files. 
- -## Format 1: Benchmarks as Rows (Most Common) - -```markdown -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -``` - -## Format 2: Multiple Metric Columns - -```markdown -| Benchmark | Accuracy | F1 Score | -|-----------|----------|----------| -| MMLU | 85.2 | 0.84 | -| GSM8K | 91.3 | 0.91 | -| DROP | 78.5 | 0.77 | -``` - -## Format 3: Benchmarks as Columns - -```markdown -| MMLU | HumanEval | GSM8K | HellaSwag | -|------|-----------|-------|-----------| -| 85.2 | 72.5 | 91.3 | 88.9 | -``` - -## Format 4: Percentage Values - -```markdown -| Benchmark | Score | -|---------------|----------| -| MMLU | 85.2% | -| HumanEval | 72.5% | -| GSM8K | 91.3% | -| TruthfulQA | 68.7% | -``` - -## Format 5: Mixed Format with Categories - -```markdown -### Reasoning - -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| BBH | 82.4 | -| GPQA | 71.3 | - -### Coding - -| Benchmark | Score | -|-----------|-------| -| HumanEval | 72.5 | -| MBPP | 78.9 | - -### Math - -| Benchmark | Score | -|-----------|-------| -| GSM8K | 91.3 | -| MATH | 65.8 | -``` - -## Format 6: With Additional Columns - -```markdown -| Benchmark | Score | Rank | Notes | -|-----------|-------|------|--------------------| -| MMLU | 85.2 | #5 | 5-shot | -| HumanEval | 72.5 | #8 | pass@1 | -| GSM8K | 91.3 | #3 | 8-shot, maj@1 | -``` - -## How the Extractor Works - -The script will: -1. Find all markdown tables in the README -2. Identify which tables contain evaluation results -3. Parse the table structure (rows vs columns) -4. Extract numeric values as scores -5. Convert to model-index YAML format - -## Tips for README Authors - -To ensure your evaluation tables are properly extracted: - -1. **Use clear headers**: Include "Benchmark", "Score", or similar terms -2. **Keep it simple**: Stick to benchmark name + score columns -3. **Use standard formats**: Follow markdown table syntax -4. 
**Include numeric values**: Ensure scores are parseable numbers -5. **Be consistent**: Use the same format across multiple tables - -## Example Complete README Section - -```markdown -# Model Card for MyModel-7B - -## Evaluation Results - -Our model was evaluated on several standard benchmarks: - -| Benchmark | Score | -|---------------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | -| HellaSwag | 88.9 | -| ARC-Challenge | 81.7 | -| TruthfulQA | 68.7 | - -### Detailed Results - -For more detailed results and methodology, see our [paper](link). -``` - -## Running the Extractor - -```bash -# Extract from this example -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --dry-run - -# Apply to your model card -uv run scripts/evaluation_manager.py extract-readme \ - --repo-id "your-username/your-model" \ - --task-type "text-generation" -``` diff --git a/skills/hugging-face-evaluation/examples/metric_mapping.json b/skills/hugging-face-evaluation/examples/metric_mapping.json deleted file mode 100644 index 121d7592..00000000 --- a/skills/hugging-face-evaluation/examples/metric_mapping.json +++ /dev/null @@ -1,50 +0,0 @@ -{ - "MMLU": { - "type": "mmlu", - "name": "Massive Multitask Language Understanding" - }, - "HumanEval": { - "type": "humaneval", - "name": "Code Generation (HumanEval)" - }, - "GSM8K": { - "type": "gsm8k", - "name": "Grade School Math" - }, - "HellaSwag": { - "type": "hellaswag", - "name": "HellaSwag Common Sense" - }, - "ARC-C": { - "type": "arc_challenge", - "name": "ARC Challenge" - }, - "ARC-E": { - "type": "arc_easy", - "name": "ARC Easy" - }, - "Winogrande": { - "type": "winogrande", - "name": "Winogrande" - }, - "TruthfulQA": { - "type": "truthfulqa", - "name": "TruthfulQA" - }, - "GPQA": { - "type": "gpqa", - "name": "Graduate-Level Google-Proof Q&A" - }, - "DROP": { - "type": "drop", - "name": "Discrete Reasoning Over Paragraphs" - }, - "BBH": { - "type": "bbh", - "name": "Big Bench 
Hard" - }, - "MATH": { - "type": "math", - "name": "MATH Dataset" - } -} diff --git a/skills/hugging-face-evaluation/scripts/evaluation_manager.py b/skills/hugging-face-evaluation/scripts/evaluation_manager.py deleted file mode 100644 index 8dcfa901..00000000 --- a/skills/hugging-face-evaluation/scripts/evaluation_manager.py +++ /dev/null @@ -1,1374 +0,0 @@ -# /// script -# requires-python = ">=3.13" -# dependencies = [ -# "huggingface-hub>=1.1.4", -# "markdown-it-py>=3.0.0", -# "python-dotenv>=1.2.1", -# "pyyaml>=6.0.3", -# "requests>=2.32.5", -# ] -# /// - -""" -Manage evaluation results in Hugging Face model cards. - -This script provides two methods: -1. Extract evaluation tables from model README files -2. Import evaluation scores from Artificial Analysis API - -Both methods update the model-index metadata in model cards. -""" - -import argparse -import os -import re -from textwrap import dedent -from typing import Any, Dict, List, Optional, Tuple - - -def load_env() -> None: - """Load .env if python-dotenv is available; keep help usable without it.""" - try: - import dotenv # type: ignore - except ModuleNotFoundError: - return - dotenv.load_dotenv() - - -def require_markdown_it(): - try: - from markdown_it import MarkdownIt # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "markdown-it-py is required for table parsing. " - "Run with `uv run ...` or install with `uv pip install markdown-it-py`." - ) from exc - return MarkdownIt - - -def require_model_card(): - try: - from huggingface_hub import ModelCard # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "huggingface-hub is required for model card operations. " - "Run with `uv run ...` or install with `uv pip install huggingface-hub`." 
- ) from exc - return ModelCard - - -def require_requests(): - try: - import requests # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "requests is required for Artificial Analysis import. " - "Run with `uv run ...` or install with `uv pip install requests`." - ) from exc - return requests - - -def require_yaml(): - try: - import yaml # type: ignore - except ModuleNotFoundError as exc: - raise ModuleNotFoundError( - "PyYAML is required for YAML output. " - "Run with `uv run ...` or install with `uv pip install pyyaml`." - ) from exc - return yaml - - -# ============================================================================ -# Method 1: Extract Evaluations from README -# ============================================================================ - - -def extract_tables_from_markdown(markdown_content: str) -> List[str]: - """Extract all markdown tables from content.""" - # Pattern to match markdown tables - table_pattern = r"(\|[^\n]+\|(?:\r?\n\|[^\n]+\|)+)" - tables = re.findall(table_pattern, markdown_content) - return tables - - -def parse_markdown_table(table_str: str) -> Tuple[List[str], List[List[str]]]: - """ - Parse a markdown table string into headers and rows. 
- - Returns: - Tuple of (headers, data_rows) - """ - lines = [line.strip() for line in table_str.strip().split("\n")] - - # Remove separator line (the one with dashes) - lines = [line for line in lines if not re.match(r"^\|[\s\-:]+\|$", line)] - - if len(lines) < 2: - return [], [] - - # Parse header - header = [cell.strip() for cell in lines[0].split("|")[1:-1]] - - # Parse data rows - data_rows = [] - for line in lines[1:]: - cells = [cell.strip() for cell in line.split("|")[1:-1]] - if cells: - data_rows.append(cells) - - return header, data_rows - - -def is_evaluation_table(header: List[str], rows: List[List[str]]) -> bool: - """Determine if a table contains evaluation results.""" - if not header or not rows: - return False - - # Check if first column looks like benchmark names - benchmark_keywords = [ - "benchmark", "task", "dataset", "eval", "test", "metric", - "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande", - "truthfulqa", "boolq", "piqa", "siqa" - ] - - first_col = header[0].lower() - has_benchmark_header = any(keyword in first_col for keyword in benchmark_keywords) - - # Check if there are numeric values in the table - has_numeric_values = False - for row in rows: - for cell in row: - try: - float(cell.replace("%", "").replace(",", "")) - has_numeric_values = True - break - except ValueError: - continue - if has_numeric_values: - break - - return has_benchmark_header or has_numeric_values - - -def normalize_model_name(name: str) -> tuple[set[str], str]: - """ - Normalize a model name for matching. 
- - Args: - name: Model name to normalize - - Returns: - Tuple of (token_set, normalized_string) - """ - # Remove markdown formatting - cleaned = re.sub(r'\[([^\]]+)\]\([^\)]+\)', r'\1', name) # Remove markdown links - cleaned = re.sub(r'\*\*([^\*]+)\*\*', r'\1', cleaned) # Remove bold - cleaned = cleaned.strip() - - # Normalize and tokenize - normalized = cleaned.lower().replace("-", " ").replace("_", " ") - tokens = set(normalized.split()) - - return tokens, normalized - - -def find_main_model_column(header: List[str], model_name: str) -> Optional[int]: - """ - Identify the column index that corresponds to the main model. - - Only returns a column if there's an exact normalized match with the model name. - This prevents extracting scores from training checkpoints or similar models. - - Args: - header: Table column headers - model_name: Model name from repo_id (e.g., "OLMo-3-32B-Think") - - Returns: - Column index of the main model, or None if no exact match found - """ - if not header or not model_name: - return None - - # Normalize model name and extract tokens - model_tokens, _ = normalize_model_name(model_name) - - # Find exact matches only - for i, col_name in enumerate(header): - if not col_name: - continue - - # Skip first column (benchmark names) - if i == 0: - continue - - col_tokens, _ = normalize_model_name(col_name) - - # Check for exact token match - if model_tokens == col_tokens: - return i - - # No exact match found - return None - - -def find_main_model_row( - rows: List[List[str]], model_name: str -) -> tuple[Optional[int], List[str]]: - """ - Identify the row index that corresponds to the main model in a transposed table. - - In transposed tables, each row represents a different model, with the first - column containing the model name. 
- - Args: - rows: Table data rows - model_name: Model name from repo_id (e.g., "OLMo-3-32B") - - Returns: - Tuple of (row_index, available_models) - - row_index: Index of the main model, or None if no exact match found - - available_models: List of all model names found in the table - """ - if not rows or not model_name: - return None, [] - - model_tokens, _ = normalize_model_name(model_name) - available_models = [] - - for i, row in enumerate(rows): - if not row or not row[0]: - continue - - row_name = row[0].strip() - - # Skip separator/header rows - if not row_name or row_name.startswith('---'): - continue - - row_tokens, _ = normalize_model_name(row_name) - - # Collect all non-empty model names - if row_tokens: - available_models.append(row_name) - - # Check for exact token match - if model_tokens == row_tokens: - return i, available_models - - return None, available_models - - -def is_transposed_table(header: List[str], rows: List[List[str]]) -> bool: - """ - Determine if a table is transposed (models as rows, benchmarks as columns). 
- - A table is considered transposed if: - - The first column contains model-like names (not benchmark names) - - Most other columns contain numeric values - - Header row contains benchmark-like names - - Args: - header: Table column headers - rows: Table data rows - - Returns: - True if table appears to be transposed, False otherwise - """ - if not header or not rows or len(header) < 3: - return False - - # Check if first column header suggests model names - first_col = header[0].lower() - model_indicators = ["model", "system", "llm", "name"] - has_model_header = any(indicator in first_col for indicator in model_indicators) - - # Check if remaining headers look like benchmarks - benchmark_keywords = [ - "mmlu", "humaneval", "gsm", "hellaswag", "arc", "winogrande", - "eval", "score", "benchmark", "test", "math", "code", "mbpp", - "truthfulqa", "boolq", "piqa", "siqa", "drop", "squad" - ] - - benchmark_header_count = 0 - for col_name in header[1:]: - col_lower = col_name.lower() - if any(keyword in col_lower for keyword in benchmark_keywords): - benchmark_header_count += 1 - - has_benchmark_headers = benchmark_header_count >= 2 - - # Check if data rows have numeric values in most columns (except first) - numeric_count = 0 - total_cells = 0 - - for row in rows[:5]: # Check first 5 rows - for cell in row[1:]: # Skip first column - total_cells += 1 - try: - float(cell.replace("%", "").replace(",", "").strip()) - numeric_count += 1 - except (ValueError, AttributeError): - continue - - has_numeric_data = total_cells > 0 and (numeric_count / total_cells) > 0.5 - - return (has_model_header or has_benchmark_headers) and has_numeric_data - - -def extract_metrics_from_table( - header: List[str], - rows: List[List[str]], - table_format: str = "auto", - model_name: Optional[str] = None, - model_column_index: Optional[int] = None -) -> List[Dict[str, Any]]: - """ - Extract metrics from parsed table data. 
- - Args: - header: Table column headers - rows: Table data rows - table_format: "rows" (benchmarks as rows), "columns" (benchmarks as columns), - "transposed" (models as rows, benchmarks as columns), or "auto" - model_name: Optional model name to identify the correct column/row - - Returns: - List of metric dictionaries with name, type, and value - """ - metrics = [] - - if table_format == "auto": - # First check if it's a transposed table (models as rows) - if is_transposed_table(header, rows): - table_format = "transposed" - else: - # Check if first column header is empty/generic (indicates benchmarks in rows) - first_header = header[0].lower().strip() if header else "" - is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] - - if is_first_col_benchmarks: - table_format = "rows" - else: - # Heuristic: if first row has mostly numeric values, benchmarks are columns - try: - numeric_count = sum( - 1 for cell in rows[0] if cell and - re.match(r"^\d+\.?\d*%?$", cell.replace(",", "").strip()) - ) - table_format = "columns" if numeric_count > len(rows[0]) / 2 else "rows" - except (IndexError, ValueError): - table_format = "rows" - - if table_format == "rows": - # Benchmarks are in rows, scores in columns - # Try to identify the main model column if model_name is provided - target_column = model_column_index - if target_column is None and model_name: - target_column = find_main_model_column(header, model_name) - - for row in rows: - if not row: - continue - - benchmark_name = row[0].strip() - if not benchmark_name: - continue - - # If we identified a specific column, use it; otherwise use first numeric value - if target_column is not None and target_column < len(row): - try: - value_str = row[target_column].replace("%", "").replace(",", "").strip() - if value_str: - value = float(value_str) - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) 
- except (ValueError, IndexError): - pass - else: - # Extract numeric values from remaining columns (original behavior) - for i, cell in enumerate(row[1:], start=1): - try: - # Remove common suffixes and convert to float - value_str = cell.replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - # Determine metric name - metric_name = benchmark_name - if len(header) > i and header[i].lower() not in ["score", "value", "result"]: - metric_name = f"{benchmark_name} ({header[i]})" - - metrics.append({ - "name": metric_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) - break # Only take first numeric value per row - except (ValueError, IndexError): - continue - - elif table_format == "transposed": - # Models are in rows (first column), benchmarks are in columns (header) - # Find the row that matches the target model - if not model_name: - print("Warning: model_name required for transposed table format") - return metrics - - target_row_idx, available_models = find_main_model_row(rows, model_name) - - if target_row_idx is None: - print(f"\n⚠ Could not find model '{model_name}' in transposed table") - if available_models: - print("\nAvailable models in table:") - for i, model in enumerate(available_models, 1): - print(f" {i}. 
{model}") - print("\nPlease select the correct model name from the list above.") - print("You can specify it using the --model-name-override flag:") - print(f' --model-name-override "{available_models[0]}"') - return metrics - - target_row = rows[target_row_idx] - - # Extract metrics from each column (skip first column which is model name) - for i in range(1, len(header)): - benchmark_name = header[i].strip() - if not benchmark_name or i >= len(target_row): - continue - - try: - value_str = target_row[i].replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_").replace("-", "_"), - "value": value - }) - except (ValueError, AttributeError): - continue - - else: # table_format == "columns" - # Benchmarks are in columns - if not rows: - return metrics - - # Use first data row for values - data_row = rows[0] - - for i, benchmark_name in enumerate(header): - if not benchmark_name or i >= len(data_row): - continue - - try: - value_str = data_row[i].replace("%", "").replace(",", "").strip() - if not value_str: - continue - - value = float(value_str) - - metrics.append({ - "name": benchmark_name, - "type": benchmark_name.lower().replace(" ", "_"), - "value": value - }) - except ValueError: - continue - - return metrics - - -def extract_evaluations_from_readme( - repo_id: str, - task_type: str = "text-generation", - dataset_name: str = "Benchmarks", - dataset_type: str = "benchmark", - model_name_override: Optional[str] = None, - table_index: Optional[int] = None, - model_column_index: Optional[int] = None -) -> Optional[List[Dict[str, Any]]]: - """ - Extract evaluation results from a model's README. 
- - Args: - repo_id: Hugging Face model repository ID - task_type: Task type for model-index (e.g., "text-generation") - dataset_name: Name for the benchmark dataset - dataset_type: Type identifier for the dataset - model_name_override: Override model name for matching (column header for comparison tables) - table_index: 1-indexed table number from inspect-tables output - - Returns: - Model-index formatted results or None if no evaluations found - """ - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - readme_content = card.content - - if not readme_content: - print(f"No README content found for {repo_id}") - return None - - # Extract model name from repo_id or use override - if model_name_override: - model_name = model_name_override - print(f"Using model name override: '{model_name}'") - else: - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - # Use markdown-it parser for accurate table extraction - all_tables = extract_tables_with_parser(readme_content) - - if not all_tables: - print(f"No tables found in README for {repo_id}") - return None - - # If table_index specified, use that specific table - if table_index is not None: - if table_index < 1 or table_index > len(all_tables): - print(f"Invalid table index {table_index}. 
Found {len(all_tables)} tables.") - print("Run inspect-tables to see available tables.") - return None - tables_to_process = [all_tables[table_index - 1]] - else: - # Filter to evaluation tables only - eval_tables = [] - for table in all_tables: - header = table.get("headers", []) - rows = table.get("rows", []) - if is_evaluation_table(header, rows): - eval_tables.append(table) - - if len(eval_tables) > 1: - print(f"\n⚠ Found {len(eval_tables)} evaluation tables.") - print("Run inspect-tables first, then use --table to select one:") - print(f' uv run scripts/evaluation_manager.py inspect-tables --repo-id "{repo_id}"') - return None - elif len(eval_tables) == 0: - print(f"No evaluation tables found in README for {repo_id}") - return None - - tables_to_process = eval_tables - - # Extract metrics from selected table(s) - all_metrics = [] - for table in tables_to_process: - header = table.get("headers", []) - rows = table.get("rows", []) - metrics = extract_metrics_from_table( - header, - rows, - model_name=model_name, - model_column_index=model_column_index - ) - all_metrics.extend(metrics) - - if not all_metrics: - print(f"No metrics extracted from table") - return None - - # Build model-index structure - display_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - results = [{ - "task": {"type": task_type}, - "dataset": { - "name": dataset_name, - "type": dataset_type - }, - "metrics": all_metrics, - "source": { - "name": "Model README", - "url": f"https://huggingface.co/{repo_id}" - } - }] - - return results - - except Exception as e: - print(f"Error extracting evaluations from README: {e}") - return None - - -# ============================================================================ -# Table Inspection (using markdown-it-py for accurate parsing) -# ============================================================================ - - -def extract_tables_with_parser(markdown_content: str) -> List[Dict[str, Any]]: - """ - Extract tables from markdown using 
markdown-it-py parser. - Uses GFM (GitHub Flavored Markdown) which includes table support. - """ - MarkdownIt = require_markdown_it() - # Disable linkify to avoid optional dependency errors; not needed for table parsing. - md = MarkdownIt("gfm-like", {"linkify": False}) - tokens = md.parse(markdown_content) - - tables = [] - i = 0 - while i < len(tokens): - token = tokens[i] - - if token.type == "table_open": - table_data = {"headers": [], "rows": []} - current_row = [] - in_header = False - - i += 1 - while i < len(tokens) and tokens[i].type != "table_close": - t = tokens[i] - if t.type == "thead_open": - in_header = True - elif t.type == "thead_close": - in_header = False - elif t.type == "tr_open": - current_row = [] - elif t.type == "tr_close": - if in_header: - table_data["headers"] = current_row - else: - table_data["rows"].append(current_row) - current_row = [] - elif t.type == "inline": - current_row.append(t.content.strip()) - i += 1 - - if table_data["headers"] or table_data["rows"]: - tables.append(table_data) - - i += 1 - - return tables - - -def detect_table_format(table: Dict[str, Any], repo_id: str) -> Dict[str, Any]: - """Analyze a table to detect its format and identify model columns.""" - headers = table.get("headers", []) - rows = table.get("rows", []) - - if not headers or not rows: - return {"format": "unknown", "columns": headers, "model_columns": [], "row_count": 0, "sample_rows": []} - - first_header = headers[0].lower() if headers else "" - is_first_col_benchmarks = not first_header or first_header in ["", "benchmark", "task", "dataset", "metric", "eval"] - - # Check for numeric columns - numeric_columns = [] - for col_idx in range(1, len(headers)): - numeric_count = 0 - for row in rows[:5]: - if col_idx < len(row): - try: - val = re.sub(r'\s*\([^)]*\)', '', row[col_idx]) - float(val.replace("%", "").replace(",", "").strip()) - numeric_count += 1 - except (ValueError, AttributeError): - pass - if numeric_count > len(rows[:5]) / 2: - 
numeric_columns.append(col_idx) - - # Determine format - if is_first_col_benchmarks and len(numeric_columns) > 1: - format_type = "comparison" - elif is_first_col_benchmarks and len(numeric_columns) == 1: - format_type = "simple" - elif len(numeric_columns) > len(headers) / 2: - format_type = "transposed" - else: - format_type = "unknown" - - # Find model columns - model_columns = [] - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - model_tokens, _ = normalize_model_name(model_name) - - for idx, header in enumerate(headers): - if idx == 0 and is_first_col_benchmarks: - continue - if header: - header_tokens, _ = normalize_model_name(header) - is_match = model_tokens == header_tokens - is_partial = model_tokens.issubset(header_tokens) or header_tokens.issubset(model_tokens) - model_columns.append({ - "index": idx, - "header": header, - "is_exact_match": is_match, - "is_partial_match": is_partial and not is_match - }) - - return { - "format": format_type, - "columns": headers, - "model_columns": model_columns, - "row_count": len(rows), - "sample_rows": [row[0] for row in rows[:5] if row] - } - - -def inspect_tables(repo_id: str) -> None: - """Inspect and display all evaluation tables in a model's README.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - readme_content = card.content - - if not readme_content: - print(f"No README content found for {repo_id}") - return - - tables = extract_tables_with_parser(readme_content) - - if not tables: - print(f"No tables found in README for {repo_id}") - return - - print(f"\n{'='*70}") - print(f"Tables found in README for: {repo_id}") - print(f"{'='*70}") - - eval_table_count = 0 - for table in tables: - analysis = detect_table_format(table, repo_id) - - if analysis["format"] == "unknown" and not analysis.get("sample_rows"): - continue - - eval_table_count += 1 - print(f"\n## Table {eval_table_count}") - print(f" 
Format: {analysis['format']}") - print(f" Rows: {analysis['row_count']}") - - print(f"\n Columns ({len(analysis['columns'])}):") - for col_info in analysis.get("model_columns", []): - idx = col_info["index"] - header = col_info["header"] - if col_info["is_exact_match"]: - print(f" [{idx}] {header} ✓ EXACT MATCH") - elif col_info["is_partial_match"]: - print(f" [{idx}] {header} ~ partial match") - else: - print(f" [{idx}] {header}") - - if analysis.get("sample_rows"): - print(f"\n Sample rows (first column):") - for row_val in analysis["sample_rows"][:5]: - print(f" - {row_val}") - - if eval_table_count == 0: - print("\nNo evaluation tables detected.") - else: - print("\nSuggested next step:") - print(f' uv run scripts/evaluation_manager.py extract-readme --repo-id "{repo_id}" --table [--model-column-index ]') - - print(f"\n{'='*70}\n") - - except Exception as e: - print(f"Error inspecting tables: {e}") - - -# ============================================================================ -# Pull Request Management -# ============================================================================ - - -def get_open_prs(repo_id: str) -> List[Dict[str, Any]]: - """ - Fetch open pull requests for a Hugging Face model repository. 
- - Args: - repo_id: Hugging Face model repository ID (e.g., "allenai/Olmo-3-32B-Think") - - Returns: - List of open PR dictionaries with num, title, author, and createdAt - """ - requests = require_requests() - url = f"https://huggingface.co/api/models/{repo_id}/discussions" - - try: - response = requests.get(url, timeout=30, allow_redirects=True) - response.raise_for_status() - - data = response.json() - discussions = data.get("discussions", []) - - open_prs = [ - { - "num": d["num"], - "title": d["title"], - "author": d["author"]["name"], - "createdAt": d.get("createdAt", "unknown"), - } - for d in discussions - if d.get("status") == "open" and d.get("isPullRequest") - ] - - return open_prs - - except requests.RequestException as e: - print(f"Error fetching PRs from Hugging Face: {e}") - return [] - - -def list_open_prs(repo_id: str) -> None: - """Display open pull requests for a model repository.""" - prs = get_open_prs(repo_id) - - print(f"\n{'='*70}") - print(f"Open Pull Requests for: {repo_id}") - print(f"{'='*70}") - - if not prs: - print("\nNo open pull requests found.") - else: - print(f"\nFound {len(prs)} open PR(s):\n") - for pr in prs: - print(f" PR #{pr['num']} - {pr['title']}") - print(f" Author: {pr['author']}") - print(f" Created: {pr['createdAt']}") - print(f" URL: https://huggingface.co/{repo_id}/discussions/{pr['num']}") - print() - - print(f"{'='*70}\n") - - -# ============================================================================ -# Method 2: Import from Artificial Analysis -# ============================================================================ - - -def get_aa_model_data(creator_slug: str, model_name: str) -> Optional[Dict[str, Any]]: - """ - Fetch model evaluation data from Artificial Analysis API. 
- - Args: - creator_slug: Creator identifier (e.g., "anthropic", "openai") - model_name: Model slug/identifier - - Returns: - Model data dictionary or None if not found - """ - load_env() - AA_API_KEY = os.getenv("AA_API_KEY") - if not AA_API_KEY: - raise ValueError("AA_API_KEY environment variable is not set") - - url = "https://artificialanalysis.ai/api/v2/data/llms/models" - headers = {"x-api-key": AA_API_KEY} - - requests = require_requests() - - try: - response = requests.get(url, headers=headers, timeout=30) - response.raise_for_status() - - data = response.json().get("data", []) - - for model in data: - creator = model.get("model_creator", {}) - if creator.get("slug") == creator_slug and model.get("slug") == model_name: - return model - - print(f"Model {creator_slug}/{model_name} not found in Artificial Analysis") - return None - - except requests.RequestException as e: - print(f"Error fetching data from Artificial Analysis: {e}") - return None - - -def aa_data_to_model_index( - model_data: Dict[str, Any], - dataset_name: str = "Artificial Analysis Benchmarks", - dataset_type: str = "artificial_analysis", - task_type: str = "evaluation" -) -> List[Dict[str, Any]]: - """ - Convert Artificial Analysis model data to model-index format. 
- - Args: - model_data: Raw model data from AA API - dataset_name: Dataset name for model-index - dataset_type: Dataset type identifier - task_type: Task type for model-index - - Returns: - Model-index formatted results - """ - model_name = model_data.get("name", model_data.get("slug", "unknown-model")) - evaluations = model_data.get("evaluations", {}) - - if not evaluations: - print(f"No evaluations found for model {model_name}") - return [] - - metrics = [] - for key, value in evaluations.items(): - if value is not None: - metrics.append({ - "name": key.replace("_", " ").title(), - "type": key, - "value": value - }) - - results = [{ - "task": {"type": task_type}, - "dataset": { - "name": dataset_name, - "type": dataset_type - }, - "metrics": metrics, - "source": { - "name": "Artificial Analysis API", - "url": "https://artificialanalysis.ai" - } - }] - - return results - - -def import_aa_evaluations( - creator_slug: str, - model_name: str, - repo_id: str -) -> Optional[List[Dict[str, Any]]]: - """ - Import evaluation results from Artificial Analysis for a model. - - Args: - creator_slug: Creator identifier in AA - model_name: Model identifier in AA - repo_id: Hugging Face repository ID to update - - Returns: - Model-index formatted results or None if import fails - """ - model_data = get_aa_model_data(creator_slug, model_name) - - if not model_data: - return None - - results = aa_data_to_model_index(model_data) - return results - - -# ============================================================================ -# Model Card Update Functions -# ============================================================================ - - -def update_model_card_with_evaluations( - repo_id: str, - results: List[Dict[str, Any]], - create_pr: bool = False, - commit_message: Optional[str] = None -) -> bool: - """ - Update a model card with evaluation results. 
- - Args: - repo_id: Hugging Face repository ID - results: Model-index formatted results - create_pr: Whether to create a PR instead of direct push - commit_message: Custom commit message - - Returns: - True if successful, False otherwise - """ - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - if not hf_token: - raise ValueError("HF_TOKEN environment variable is not set") - - # Load existing card - card = ModelCard.load(repo_id, token=hf_token) - - # Get model name - model_name = repo_id.split("/")[-1] if "/" in repo_id else repo_id - - # Create or update model-index - model_index = [{ - "name": model_name, - "results": results - }] - - # Merge with existing model-index if present - if "model-index" in card.data: - existing = card.data["model-index"] - if isinstance(existing, list) and existing: - # Keep existing name if present - if "name" in existing[0]: - model_index[0]["name"] = existing[0]["name"] - - # Merge results - existing_results = existing[0].get("results", []) - model_index[0]["results"].extend(existing_results) - - card.data["model-index"] = model_index - - # Prepare commit message - if not commit_message: - commit_message = f"Add evaluation results to {model_name}" - - commit_description = ( - "This commit adds structured evaluation results to the model card. " - "The results are formatted using the model-index specification and " - "will be displayed in the model card's evaluation widget." 
- ) - - # Push update - card.push_to_hub( - repo_id, - token=hf_token, - commit_message=commit_message, - commit_description=commit_description, - create_pr=create_pr - ) - - action = "Pull request created" if create_pr else "Model card updated" - print(f"✓ {action} successfully for {repo_id}") - return True - - except Exception as e: - print(f"Error updating model card: {e}") - return False - - -def show_evaluations(repo_id: str) -> None: - """Display current evaluations in a model card.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - - if "model-index" not in card.data: - print(f"No model-index found in {repo_id}") - return - - model_index = card.data["model-index"] - - print(f"\nEvaluations for {repo_id}:") - print("=" * 60) - - for model_entry in model_index: - model_name = model_entry.get("name", "Unknown") - print(f"\nModel: {model_name}") - - results = model_entry.get("results", []) - for i, result in enumerate(results, 1): - print(f"\n Result Set {i}:") - - task = result.get("task", {}) - print(f" Task: {task.get('type', 'unknown')}") - - dataset = result.get("dataset", {}) - print(f" Dataset: {dataset.get('name', 'unknown')}") - - metrics = result.get("metrics", []) - print(f" Metrics ({len(metrics)}):") - for metric in metrics: - name = metric.get("name", "Unknown") - value = metric.get("value", "N/A") - print(f" - {name}: {value}") - - source = result.get("source", {}) - if source: - print(f" Source: {source.get('name', 'Unknown')}") - - print("\n" + "=" * 60) - - except Exception as e: - print(f"Error showing evaluations: {e}") - - -def validate_model_index(repo_id: str) -> bool: - """Validate model-index format in a model card.""" - try: - load_env() - ModelCard = require_model_card() - hf_token = os.getenv("HF_TOKEN") - card = ModelCard.load(repo_id, token=hf_token) - - if "model-index" not in card.data: - print(f"✗ No model-index found in {repo_id}") - 
return False - - model_index = card.data["model-index"] - - if not isinstance(model_index, list): - print("✗ model-index must be a list") - return False - - for i, entry in enumerate(model_index): - if "name" not in entry: - print(f"✗ Entry {i} missing 'name' field") - return False - - if "results" not in entry: - print(f"✗ Entry {i} missing 'results' field") - return False - - for j, result in enumerate(entry["results"]): - if "task" not in result: - print(f"✗ Result {j} in entry {i} missing 'task' field") - return False - - if "dataset" not in result: - print(f"✗ Result {j} in entry {i} missing 'dataset' field") - return False - - if "metrics" not in result: - print(f"✗ Result {j} in entry {i} missing 'metrics' field") - return False - - print(f"✓ Model-index format is valid for {repo_id}") - return True - - except Exception as e: - print(f"Error validating model-index: {e}") - return False - - -# ============================================================================ -# CLI Interface -# ============================================================================ - - -def main(): - parser = argparse.ArgumentParser( - description=( - "Manage evaluation results in Hugging Face model cards.\n\n" - "Use standard Python or `uv run scripts/evaluation_manager.py ...` " - "to auto-resolve dependencies from the PEP 723 header." - ), - formatter_class=argparse.RawTextHelpFormatter, - epilog=dedent( - """\ - Typical workflows: - - Inspect tables first: - uv run scripts/evaluation_manager.py inspect-tables --repo-id - - Extract from README (prints YAML by default): - uv run scripts/evaluation_manager.py extract-readme --repo-id --table N - - Apply changes: - uv run scripts/evaluation_manager.py extract-readme --repo-id --table N --apply - - Import from Artificial Analysis: - AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug org --model-name slug --repo-id - - Tips: - - YAML is printed by default; use --apply or --create-pr to write changes. 
- - Set HF_TOKEN (and AA_API_KEY for import-aa); .env is loaded automatically if python-dotenv is installed. - - When multiple tables exist, run inspect-tables then select with --table N. - - To apply changes (push or PR), rerun extract-readme with --apply or --create-pr. - """ - ), - ) - parser.add_argument("--version", action="version", version="evaluation_manager 1.2.0") - - subparsers = parser.add_subparsers(dest="command", help="Command to execute") - - # Extract from README command - extract_parser = subparsers.add_parser( - "extract-readme", - help="Extract evaluation tables from model README", - formatter_class=argparse.RawTextHelpFormatter, - description="Parse README tables into model-index YAML. Default behavior prints YAML; use --apply/--create-pr to write changes.", - epilog=dedent( - """\ - Examples: - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-column-index 3 - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --model-name-override \"**Model 7B**\" # exact header text - uv run scripts/evaluation_manager.py extract-readme --repo-id username/model --table 2 --create-pr - - Apply changes: - - Default: prints YAML to stdout (no writes). - - Add --apply to push directly, or --create-pr to open a PR. - Model selection: - - Preferred: --model-column-index
- - If using --model-name-override, copy the column header text exactly. - """ - ), - ) - extract_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - extract_parser.add_argument("--table", type=int, help="Table number (1-indexed, from inspect-tables output)") - extract_parser.add_argument("--model-column-index", type=int, help="Preferred: column index from inspect-tables output (exact selection)") - extract_parser.add_argument("--model-name-override", type=str, help="Exact column header/model name for comparison/transpose tables (when index is not used)") - extract_parser.add_argument("--task-type", type=str, default="text-generation", help="Sets model-index task.type (e.g., text-generation, summarization)") - extract_parser.add_argument("--dataset-name", type=str, default="Benchmarks", help="Dataset name") - extract_parser.add_argument("--dataset-type", type=str, default="benchmark", help="Dataset type") - extract_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - extract_parser.add_argument("--apply", action="store_true", help="Apply changes (default is to print YAML only)") - extract_parser.add_argument("--dry-run", action="store_true", help="Preview YAML without updating (default)") - - # Import from AA command - aa_parser = subparsers.add_parser( - "import-aa", - help="Import evaluation scores from Artificial Analysis", - formatter_class=argparse.RawTextHelpFormatter, - description="Fetch scores from Artificial Analysis API and write them into model-index.", - epilog=dedent( - """\ - Examples: - AA_API_KEY=... uv run scripts/evaluation_manager.py import-aa --creator-slug anthropic --model-name claude-sonnet-4 --repo-id username/model - uv run scripts/evaluation_manager.py import-aa --creator-slug openai --model-name gpt-4o --repo-id username/model --create-pr - - Requires: AA_API_KEY in env (or .env if python-dotenv installed). 
- """ - ), - ) - aa_parser.add_argument("--creator-slug", type=str, required=True, help="AA creator slug") - aa_parser.add_argument("--model-name", type=str, required=True, help="AA model name") - aa_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - aa_parser.add_argument("--create-pr", action="store_true", help="Create PR instead of direct push") - - # Show evaluations command - show_parser = subparsers.add_parser( - "show", - help="Display current evaluations in model card", - formatter_class=argparse.RawTextHelpFormatter, - description="Print model-index content from the model card (requires HF_TOKEN for private repos).", - ) - show_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Validate command - validate_parser = subparsers.add_parser( - "validate", - help="Validate model-index format", - formatter_class=argparse.RawTextHelpFormatter, - description="Schema sanity check for model-index section of the card.", - ) - validate_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Inspect tables command - inspect_parser = subparsers.add_parser( - "inspect-tables", - help="Inspect tables in README → outputs suggested extract-readme command", - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Workflow: - 1. inspect-tables → see table structure, columns, and table numbers - 2. extract-readme → run with --table N (from step 1); YAML prints by default - 3. apply changes → rerun extract-readme with --apply or --create-pr - -Reminder: - - Preferred: use --model-column-index . If needed, use --model-name-override with the exact column header text. 
-""" - ) - inspect_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - # Get PRs command - prs_parser = subparsers.add_parser( - "get-prs", - help="List open pull requests for a model repository", - formatter_class=argparse.RawTextHelpFormatter, - description="Check for existing open PRs before creating new ones to avoid duplicates.", - epilog=dedent( - """\ - Examples: - uv run scripts/evaluation_manager.py get-prs --repo-id "allenai/Olmo-3-32B-Think" - - IMPORTANT: Always run this before using --create-pr to avoid duplicate PRs. - """ - ), - ) - prs_parser.add_argument("--repo-id", type=str, required=True, help="HF repository ID") - - args = parser.parse_args() - - if not args.command: - parser.print_help() - return - - try: - # Execute command - if args.command == "extract-readme": - results = extract_evaluations_from_readme( - repo_id=args.repo_id, - task_type=args.task_type, - dataset_name=args.dataset_name, - dataset_type=args.dataset_type, - model_name_override=args.model_name_override, - table_index=args.table, - model_column_index=args.model_column_index - ) - - if not results: - print("No evaluations extracted") - return - - apply_changes = args.apply or args.create_pr - - # Default behavior: print YAML (dry-run) - yaml = require_yaml() - print("\nExtracted evaluations (YAML):") - print( - yaml.dump( - {"model-index": [{"name": args.repo_id.split('/')[-1], "results": results}]}, - sort_keys=False - ) - ) - - if apply_changes: - if args.model_name_override and args.model_column_index is not None: - print("Note: --model-column-index takes precedence over --model-name-override.") - update_model_card_with_evaluations( - repo_id=args.repo_id, - results=results, - create_pr=args.create_pr, - commit_message="Extract evaluation results from README" - ) - - elif args.command == "import-aa": - results = import_aa_evaluations( - creator_slug=args.creator_slug, - model_name=args.model_name, - repo_id=args.repo_id - ) - - if not 
results: - print("No evaluations imported") - return - - update_model_card_with_evaluations( - repo_id=args.repo_id, - results=results, - create_pr=args.create_pr, - commit_message=f"Add Artificial Analysis evaluations for {args.model_name}" - ) - - elif args.command == "show": - show_evaluations(args.repo_id) - - elif args.command == "validate": - validate_model_index(args.repo_id) - - elif args.command == "inspect-tables": - inspect_tables(args.repo_id) - - elif args.command == "get-prs": - list_open_prs(args.repo_id) - except ModuleNotFoundError as exc: - # Surface dependency hints cleanly when user only needs help output - print(exc) - except Exception as exc: - print(f"Error: {exc}") - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py b/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py index e52fdfb1..d398bc60 100644 --- a/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py +++ b/skills/hugging-face-evaluation/scripts/inspect_eval_uv.py @@ -8,7 +8,7 @@ # /// """ -Entry point script for running inspect-ai evaluations via `hf jobs uv run`. +Entry point script for running inspect-ai evaluations against Hugging Face inference providers. """ from __future__ import annotations diff --git a/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py b/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py index 1bb73060..f1454c5a 100644 --- a/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py +++ b/skills/hugging-face-evaluation/scripts/inspect_vllm_uv.py @@ -16,13 +16,7 @@ separate from inference provider scripts (which use external APIs). 
Usage (standalone): - python inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu" - -Usage (via HF Jobs): - hf jobs uv run inspect_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model "meta-llama/Llama-3.2-1B" --task "mmlu" + uv run scripts/inspect_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --task "mmlu" Model backends: - vllm: Fast inference with vLLM (recommended for large models) @@ -187,16 +181,16 @@ def main() -> None: epilog=""" Examples: # Run MMLU with vLLM backend - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu # Run with HuggingFace Transformers backend - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --backend hf # Run with limited samples for testing - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10 + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-1B --task mmlu --limit 10 # Run on multiple GPUs with tensor parallelism - python inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4 + uv run scripts/inspect_vllm_uv.py --model meta-llama/Llama-3.2-70B --task mmlu --tensor-parallel-size 4 Available tasks (from inspect-evals): - mmlu: Massive Multitask Language Understanding @@ -207,11 +201,6 @@ def main() -> None: - winogrande: Winograd Schema Challenge - humaneval: Code generation (HumanEval) -Via HF Jobs: - hf jobs uv run inspect_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model meta-llama/Llama-3.2-1B --task mmlu """, ) diff --git a/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py b/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py index 38798003..91ba83b3 100644 --- a/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py +++ 
b/skills/hugging-face-evaluation/scripts/lighteval_vllm_uv.py @@ -10,19 +10,14 @@ # /// """ -Entry point script for running lighteval evaluations with vLLM backend via `hf jobs uv run`. +Entry point script for running lighteval evaluations with local GPU backends. -This script runs evaluations using vLLM for efficient GPU inference on custom HuggingFace models. -It is separate from inference provider scripts and evaluates models directly on the hardware. +This script runs evaluations using vLLM or accelerate on custom HuggingFace models. +It is separate from inference provider scripts and evaluates models directly on local hardware. Usage (standalone): - python lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" + uv run scripts/lighteval_vllm_uv.py --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" -Usage (via HF Jobs): - hf jobs uv run lighteval_vllm_uv.py \\ - --flavor a10g-small \\ - --secret HF_TOKEN=$HF_TOKEN \\ - -- --model "meta-llama/Llama-3.2-1B" --tasks "leaderboard|mmlu|5" """ from __future__ import annotations @@ -181,16 +176,16 @@ def main() -> None: epilog=""" Examples: # Run MMLU evaluation with vLLM - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" # Run with accelerate backend instead of vLLM - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --backend accelerate # Run with chat template for instruction-tuned models - python lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B-Instruct --tasks "leaderboard|mmlu|5" --use-chat-template # Run with limited samples for testing - python 
lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10 + uv run scripts/lighteval_vllm_uv.py --model meta-llama/Llama-3.2-1B --tasks "leaderboard|mmlu|5" --max-samples 10 Task format: Tasks use the format: "suite|task|num_fewshot" @@ -300,4 +295,3 @@ def main() -> None: if __name__ == "__main__": main() - diff --git a/skills/hugging-face-evaluation/scripts/run_eval_job.py b/skills/hugging-face-evaluation/scripts/run_eval_job.py deleted file mode 100644 index 1ba45860..00000000 --- a/skills/hugging-face-evaluation/scripts/run_eval_job.py +++ /dev/null @@ -1,98 +0,0 @@ -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "huggingface-hub>=0.26.0", -# "python-dotenv>=1.2.1", -# ] -# /// - -""" -Submit evaluation jobs using the `hf jobs uv run` CLI. - -This wrapper constructs the appropriate command to execute the local -`inspect_eval_uv.py` script on Hugging Face Jobs with the requested hardware. -""" - -import argparse -import os -import subprocess -import sys -from pathlib import Path -from typing import Optional - -from huggingface_hub import get_token -from dotenv import load_dotenv - -load_dotenv() - - -SCRIPT_PATH = Path(__file__).with_name("inspect_eval_uv.py").resolve() - - -def create_eval_job( - model_id: str, - task: str, - hardware: str = "cpu-basic", - hf_token: Optional[str] = None, - limit: Optional[int] = None, -) -> None: - """ - Submit an evaluation job using the Hugging Face Jobs CLI. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not SCRIPT_PATH.exists(): - raise FileNotFoundError(f"Script not found at {SCRIPT_PATH}") - - print(f"Preparing evaluation job for {model_id} on task {task} (hardware: {hardware})") - - cmd = [ - "hf", - "jobs", - "uv", - "run", - str(SCRIPT_PATH), - "--flavor", - hardware, - "--secrets", - f"HF_TOKEN={token}", - "--", - "--model", - model_id, - "--task", - task, - ] - - if limit: - cmd.extend(["--limit", str(limit)]) - - print("Executing:", " ".join(cmd)) - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def main() -> None: - parser = argparse.ArgumentParser(description="Run inspect-ai evaluations on Hugging Face Jobs") - parser.add_argument("--model", required=True, help="Model ID (e.g. Qwen/Qwen3-0.6B)") - parser.add_argument("--task", required=True, help="Inspect task (e.g. mmlu, gsm8k)") - parser.add_argument("--hardware", default="cpu-basic", help="Hardware flavor (e.g. t4-small, a10g-small)") - parser.add_argument("--limit", type=int, default=None, help="Limit number of samples to evaluate") - - args = parser.parse_args() - - create_eval_job( - model_id=args.model, - task=args.task, - hardware=args.hardware, - limit=args.limit, - ) - - -if __name__ == "__main__": - main() diff --git a/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py b/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py deleted file mode 100644 index 97ef7271..00000000 --- a/skills/hugging-face-evaluation/scripts/run_vllm_eval_job.py +++ /dev/null @@ -1,331 +0,0 @@ -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "huggingface-hub>=0.26.0", -# "python-dotenv>=1.2.1", -# ] -# /// - -""" -Submit vLLM-based evaluation jobs using the `hf jobs uv run` CLI. - -This wrapper constructs the appropriate command to execute vLLM evaluation scripts -(lighteval or inspect-ai) on Hugging Face Jobs with GPU hardware. 
- -Unlike run_eval_job.py (which uses inference providers/APIs), this script runs -models directly on the job's GPU using vLLM or HuggingFace Transformers. - -Usage: - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task mmlu \\ - --framework lighteval \\ - --hardware a10g-small -""" - -from __future__ import annotations - -import argparse -import os -import subprocess -import sys -from pathlib import Path -from typing import Optional - -from huggingface_hub import get_token -from dotenv import load_dotenv - -load_dotenv() - -# Script paths for different evaluation frameworks -SCRIPT_DIR = Path(__file__).parent.resolve() -LIGHTEVAL_SCRIPT = SCRIPT_DIR / "lighteval_vllm_uv.py" -INSPECT_SCRIPT = SCRIPT_DIR / "inspect_vllm_uv.py" - -# Hardware flavor recommendations for different model sizes -HARDWARE_RECOMMENDATIONS = { - "small": "t4-small", # < 3B parameters - "medium": "a10g-small", # 3B - 13B parameters - "large": "a10g-large", # 13B - 34B parameters - "xlarge": "a100-large", # 34B+ parameters -} - - -def estimate_hardware(model_id: str) -> str: - """ - Estimate appropriate hardware based on model ID naming conventions. - - Returns a hardware flavor recommendation. 
- """ - model_lower = model_id.lower() - - # Check for explicit size indicators in model name - if any(x in model_lower for x in ["70b", "72b", "65b"]): - return "a100-large" - elif any(x in model_lower for x in ["34b", "33b", "32b", "30b"]): - return "a10g-large" - elif any(x in model_lower for x in ["13b", "14b", "7b", "8b"]): - return "a10g-small" - elif any(x in model_lower for x in ["3b", "2b", "1b", "0.5b", "small", "mini"]): - return "t4-small" - - # Default to medium hardware - return "a10g-small" - - -def create_lighteval_job( - model_id: str, - tasks: str, - hardware: str, - hf_token: Optional[str] = None, - max_samples: Optional[int] = None, - backend: str = "vllm", - batch_size: int = 1, - tensor_parallel_size: int = 1, - trust_remote_code: bool = False, - use_chat_template: bool = False, -) -> None: - """ - Submit a lighteval evaluation job on HuggingFace Jobs. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not LIGHTEVAL_SCRIPT.exists(): - raise FileNotFoundError(f"Script not found at {LIGHTEVAL_SCRIPT}") - - print(f"Preparing lighteval job for {model_id}") - print(f" Tasks: {tasks}") - print(f" Backend: {backend}") - print(f" Hardware: {hardware}") - - cmd = [ - "hf", "jobs", "uv", "run", - str(LIGHTEVAL_SCRIPT), - "--flavor", hardware, - "--secrets", f"HF_TOKEN={token}", - "--", - "--model", model_id, - "--tasks", tasks, - "--backend", backend, - "--batch-size", str(batch_size), - "--tensor-parallel-size", str(tensor_parallel_size), - ] - - if max_samples: - cmd.extend(["--max-samples", str(max_samples)]) - - if trust_remote_code: - cmd.append("--trust-remote-code") - - if use_chat_template: - cmd.append("--use-chat-template") - - print(f"\nExecuting: {' '.join(cmd)}") - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def create_inspect_job( - model_id: str, - task: str, - hardware: str, - hf_token: Optional[str] = None, - limit: Optional[int] = None, - backend: str = "vllm", - tensor_parallel_size: int = 1, - trust_remote_code: bool = False, -) -> None: - """ - Submit an inspect-ai evaluation job on HuggingFace Jobs. - """ - token = hf_token or os.getenv("HF_TOKEN") or get_token() - if not token: - raise ValueError("HF_TOKEN is required. 
Set it in environment or pass as argument.") - - if not INSPECT_SCRIPT.exists(): - raise FileNotFoundError(f"Script not found at {INSPECT_SCRIPT}") - - print(f"Preparing inspect-ai job for {model_id}") - print(f" Task: {task}") - print(f" Backend: {backend}") - print(f" Hardware: {hardware}") - - cmd = [ - "hf", "jobs", "uv", "run", - str(INSPECT_SCRIPT), - "--flavor", hardware, - "--secrets", f"HF_TOKEN={token}", - "--", - "--model", model_id, - "--task", task, - "--backend", backend, - "--tensor-parallel-size", str(tensor_parallel_size), - ] - - if limit: - cmd.extend(["--limit", str(limit)]) - - if trust_remote_code: - cmd.append("--trust-remote-code") - - print(f"\nExecuting: {' '.join(cmd)}") - - try: - subprocess.run(cmd, check=True) - except subprocess.CalledProcessError as exc: - print("hf jobs command failed", file=sys.stderr) - raise - - -def main() -> None: - parser = argparse.ArgumentParser( - description="Submit vLLM-based evaluation jobs to HuggingFace Jobs", - formatter_class=argparse.RawDescriptionHelpFormatter, - epilog=""" -Examples: - # Run lighteval with vLLM on A10G GPU - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task "leaderboard|mmlu|5" \\ - --framework lighteval \\ - --hardware a10g-small - - # Run inspect-ai on larger model with multi-GPU - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-70B \\ - --task mmlu \\ - --framework inspect \\ - --hardware a100-large \\ - --tensor-parallel-size 4 - - # Auto-detect hardware based on model size - python run_vllm_eval_job.py \\ - --model meta-llama/Llama-3.2-1B \\ - --task mmlu \\ - --framework inspect - - # Run with HF Transformers backend (instead of vLLM) - python run_vllm_eval_job.py \\ - --model microsoft/phi-2 \\ - --task mmlu \\ - --framework inspect \\ - --backend hf - -Hardware flavors: - - t4-small: T4 GPU, good for models < 3B - - a10g-small: A10G GPU, good for models 3B-13B - - a10g-large: A10G GPU, good for models 13B-34B - - a100-large: A100 
GPU, good for models 34B+ - -Frameworks: - - lighteval: HuggingFace's lighteval library - - inspect: UK AI Safety's inspect-ai library - -Task formats: - - lighteval: "suite|task|num_fewshot" (e.g., "leaderboard|mmlu|5") - - inspect: task name (e.g., "mmlu", "gsm8k") - """, - ) - - parser.add_argument( - "--model", - required=True, - help="HuggingFace model ID (e.g., meta-llama/Llama-3.2-1B)", - ) - parser.add_argument( - "--task", - required=True, - help="Evaluation task (format depends on framework)", - ) - parser.add_argument( - "--framework", - choices=["lighteval", "inspect"], - default="lighteval", - help="Evaluation framework to use (default: lighteval)", - ) - parser.add_argument( - "--hardware", - default=None, - help="Hardware flavor (auto-detected if not specified)", - ) - parser.add_argument( - "--backend", - choices=["vllm", "hf", "accelerate"], - default="vllm", - help="Model backend (default: vllm)", - ) - parser.add_argument( - "--limit", - "--max-samples", - type=int, - default=None, - dest="limit", - help="Limit number of samples to evaluate", - ) - parser.add_argument( - "--batch-size", - type=int, - default=1, - help="Batch size for evaluation (lighteval only)", - ) - parser.add_argument( - "--tensor-parallel-size", - type=int, - default=1, - help="Number of GPUs for tensor parallelism", - ) - parser.add_argument( - "--trust-remote-code", - action="store_true", - help="Allow executing remote code from model repository", - ) - parser.add_argument( - "--use-chat-template", - action="store_true", - help="Apply chat template (lighteval only)", - ) - - args = parser.parse_args() - - # Auto-detect hardware if not specified - hardware = args.hardware or estimate_hardware(args.model) - print(f"Using hardware: {hardware}") - - # Map backend names between frameworks - backend = args.backend - if args.framework == "lighteval" and backend == "hf": - backend = "accelerate" # lighteval uses "accelerate" for HF backend - - if args.framework == "lighteval": - 
create_lighteval_job( - model_id=args.model, - tasks=args.task, - hardware=hardware, - max_samples=args.limit, - backend=backend, - batch_size=args.batch_size, - tensor_parallel_size=args.tensor_parallel_size, - trust_remote_code=args.trust_remote_code, - use_chat_template=args.use_chat_template, - ) - else: - create_inspect_job( - model_id=args.model, - task=args.task, - hardware=hardware, - limit=args.limit, - backend=backend if backend != "accelerate" else "hf", - tensor_parallel_size=args.tensor_parallel_size, - trust_remote_code=args.trust_remote_code, - ) - - -if __name__ == "__main__": - main() - diff --git a/skills/hugging-face-evaluation/scripts/test_extraction.py b/skills/hugging-face-evaluation/scripts/test_extraction.py deleted file mode 100755 index 4c97e055..00000000 --- a/skills/hugging-face-evaluation/scripts/test_extraction.py +++ /dev/null @@ -1,206 +0,0 @@ -#!/usr/bin/env python3 -# /// script -# requires-python = ">=3.10" -# dependencies = [ -# "pyyaml", -# ] -# /// -""" -Test script for evaluation extraction functionality. - -This script demonstrates the table extraction capabilities without -requiring HF tokens or making actual API calls. - -Note: This script imports from evaluation_manager.py (same directory). 
-Run from the scripts/ directory: cd scripts && uv run test_extraction.py -""" - -import yaml - -from evaluation_manager import ( - extract_tables_from_markdown, - parse_markdown_table, - is_evaluation_table, - extract_metrics_from_table -) - -# Sample README content with various table formats -SAMPLE_README = """ -# My Awesome Model - -## Evaluation Results - -Here are the benchmark results: - -| Benchmark | Score | -|-----------|-------| -| MMLU | 85.2 | -| HumanEval | 72.5 | -| GSM8K | 91.3 | - -### Detailed Breakdown - -| Category | MMLU | GSM8K | HumanEval | -|---------------|-------|-------|-----------| -| Performance | 85.2 | 91.3 | 72.5 | - -## Other Information - -This is not an evaluation table: - -| Feature | Value | -|---------|-------| -| Size | 7B | -| Type | Chat | - -## More Results - -| Benchmark | Accuracy | F1 Score | -|---------------|----------|----------| -| HellaSwag | 88.9 | 0.87 | -| TruthfulQA | 68.7 | 0.65 | -""" - - -def test_table_extraction(): - """Test markdown table extraction.""" - print("=" * 60) - print("TEST 1: Table Extraction") - print("=" * 60) - - tables = extract_tables_from_markdown(SAMPLE_README) - print(f"Found {len(tables)} tables in the sample README\n") - - for i, table in enumerate(tables, 1): - print(f"Table {i}:") - print(table[:100] + "..." if len(table) > 100 else table) - print() - - return tables - - -def test_table_parsing(tables): - """Test table parsing.""" - print("\n" + "=" * 60) - print("TEST 2: Table Parsing") - print("=" * 60) - - parsed_tables = [] - for i, table in enumerate(tables, 1): - print(f"\nParsing Table {i}:") - header, rows = parse_markdown_table(table) - - print(f" Header: {header}") - print(f" Rows: {len(rows)}") - for j, row in enumerate(rows[:3], 1): # Show first 3 rows - print(f" Row {j}: {row}") - if len(rows) > 3: - print(f" ... 
and {len(rows) - 3} more rows") - - parsed_tables.append((header, rows)) - - return parsed_tables - - -def test_evaluation_detection(parsed_tables): - """Test evaluation table detection.""" - print("\n" + "=" * 60) - print("TEST 3: Evaluation Table Detection") - print("=" * 60) - - eval_tables = [] - for i, (header, rows) in enumerate(parsed_tables, 1): - is_eval = is_evaluation_table(header, rows) - status = "✓ IS" if is_eval else "✗ NOT" - print(f"\nTable {i}: {status} an evaluation table") - print(f" Header: {header}") - - if is_eval: - eval_tables.append((header, rows)) - - print(f"\nFound {len(eval_tables)} evaluation tables") - return eval_tables - - -def test_metric_extraction(eval_tables): - """Test metric extraction.""" - print("\n" + "=" * 60) - print("TEST 4: Metric Extraction") - print("=" * 60) - - all_metrics = [] - for i, (header, rows) in enumerate(eval_tables, 1): - print(f"\nExtracting metrics from table {i}:") - metrics = extract_metrics_from_table(header, rows, table_format="auto") - - print(f" Extracted {len(metrics)} metrics:") - for metric in metrics: - print(f" - {metric['name']}: {metric['value']} (type: {metric['type']})") - - all_metrics.extend(metrics) - - return all_metrics - - -def test_model_index_format(metrics): - """Test model-index format generation.""" - print("\n" + "=" * 60) - print("TEST 5: Model-Index Format") - print("=" * 60) - - model_index = { - "model-index": [ - { - "name": "test-model", - "results": [ - { - "task": {"type": "text-generation"}, - "dataset": { - "name": "Benchmarks", - "type": "benchmark" - }, - "metrics": metrics, - "source": { - "name": "Model README", - "url": "https://huggingface.co/test/model" - } - } - ] - } - ] - } - - print("\nGenerated model-index structure:") - print(yaml.dump(model_index, sort_keys=False, default_flow_style=False)) - - -def main(): - """Run all tests.""" - print("\n" + "=" * 60) - print("EVALUATION EXTRACTION TEST SUITE") - print("=" * 60) - print("\nThis test demonstrates the 
table extraction capabilities") - print("without requiring API access or tokens.\n") - - # Run tests - tables = test_table_extraction() - parsed_tables = test_table_parsing(tables) - eval_tables = test_evaluation_detection(parsed_tables) - metrics = test_metric_extraction(eval_tables) - test_model_index_format(metrics) - - # Summary - print("\n" + "=" * 60) - print("TEST SUMMARY") - print("=" * 60) - print(f"✓ Found {len(tables)} total tables") - print(f"✓ Identified {len(eval_tables)} evaluation tables") - print(f"✓ Extracted {len(metrics)} metrics") - print("✓ Generated model-index format successfully") - print("\n" + "=" * 60) - print("All tests completed! The extraction logic is working correctly.") - print("=" * 60 + "\n") - - -if __name__ == "__main__": - main()