This repository reproduces the Eye-Q evaluation setup from the paper *Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning*.
The dataset lives on Hugging Face: llm-lab/Eye-Q.
Eye-Q contains 1,343 cue-implicit visual word puzzles across four subsets:
| subset | code | # puzzles |
|---|---|---|
| English | en | 300 |
| Persian | pe | 671 |
| Arabic | ar | 50 |
| Cross-lingual (English cues → Persian answer) | cross | 322 |
Each row contains an image and a single target word / short phrase (answer).
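For a quick look at the data, something like the sketch below works. It is a minimal example, not part of this repo: the column names `id`, `image`, and `answer` are assumptions, so check the dataset card if they differ.

```python
from pathlib import Path

from datasets import load_dataset

# Load the default config / train split (the same defaults the runner uses).
ds = load_dataset("llm-lab/Eye-Q", "default", split="train")

# Cache one decoded image locally as .jpg, the way the runner does for API clients.
cache_dir = Path("image_cache")
cache_dir.mkdir(exist_ok=True)

row = ds[0]
img_path = cache_dir / f"{row['id']}.jpg"    # "id" column name is an assumption
row["image"].convert("RGB").save(img_path)   # "image" is a PIL image via the datasets Image feature
print(img_path, row["answer"])
```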
- Loads Eye-Q directly from the Hugging Face Hub.
- Caches images locally (so API clients can send file paths).
- Runs one of the paper’s prompt variants:
  - Basic (answer-length hint)
  - Few-Shot CoT (3 demonstrations, each includes the demo image + derivation + answer)
  - Iterative Refinement (up to N attempts with feedback)
  - Partial Character Reveal (deterministic 25% reveal pattern)
- Writes results as JSONL (one record per sample) so runs are restart-safe.
- Computes exact-match accuracy per subset.
Note: The three Few-Shot CoT demonstration items are held out from evaluation in all runs (to avoid any protocol leakage).
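The hold-out itself is just an id filter. A sketch is below; the name `DEMO_IDS` and the example ids are hypothetical (the real demonstration ids live in `eyeq_benchmark/derivations.py`), and the `id` column name is an assumption.

```python
from datasets import load_dataset

# Hypothetical: ids of the three Few-Shot CoT demonstration items (illustrative values only).
DEMO_IDS = {"en_001", "pe_002", "cross_003"}

ds = load_dataset("llm-lab/Eye-Q", "default", split="train")
eval_ds = ds.filter(lambda row: row["id"] not in DEMO_IDS)   # exclude demos from evaluation
```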
```bash
pip install -r requirements.txt
python main.py --models openai --prompt-variant basic
python scripts/calculate_accuracy.py results_cache.jsonl
```

If the dataset repo is private for you, log in once:

```bash
huggingface-cli login
```

Set the key(s) for the model(s) you want to run:
- `OPENAI_API_KEY` (optional: `OPENAI_BASE_URL`, `OPENAI_MODEL`)
- `GOOGLE_API_KEY` (optional: `GOOGLE_BASE_URL`, `GOOGLE_MODEL`)
- `OPENROUTER_API_KEY` (optional: `GROK_MODEL`, `QWEN_MODEL`)
- `LLAMA_API_KEY` or `AVALAI_API_KEY` (optional: `LLAMA_BASE_URL` / `AVALAI_BASE_URL`, `LLAMA_MODEL`)
Windows PowerShell:

```powershell
$env:OPENAI_API_KEY = "..."
```

macOS/Linux (bash/zsh):

```bash
export OPENAI_API_KEY="..."
```

Main CLI options:

- `--repo-id` (default: `llm-lab/Eye-Q`)
- `--config` (default: `default`)
- `--split` (default: `train`)
- `--languages` (comma-separated: `en,pe,cross,ar`)
- `--models` (comma-separated: `openai,google,grok,llama,qwen`)
- `--temperature` (passed to the API client)
- `--max-samples` (debug: cap the number of evaluated samples)
- `--max-workers` (parallelism)
- `--cache-file` (JSONL results file)
- `--image-cache-dir` (where decoded images are stored as `.jpg`)
All commands below load from the Hub and append results to results_cache.jsonl.
Basic: adds a character-count hint (excluding spaces).
```bash
python main.py \
  --models openai \
  --languages en,pe,cross,ar \
  --prompt-variant basic
```
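The hint itself is simple. A sketch of the idea follows; the real template lives in `eyeq_benchmark/hints.py`, so the function name and wording here are assumptions.

```python
def answer_length_hint(answer: str) -> str:
    """Character-count hint for the Basic variant, excluding spaces (assumed wording)."""
    n = len(answer.replace(" ", ""))
    return f"Hint: the answer has {n} characters (spaces are not counted)."

print(answer_length_hint("ice cream"))   # -> "Hint: the answer has 8 characters (spaces are not counted)."
```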
Few-Shot CoT: demonstrations are defined in `eyeq_benchmark/derivations.py`. They are selected by HF id and include the demo image.

```bash
python main.py \
  --models openai \
  --languages en,pe,cross,ar \
  --prompt-variant few_shot_cot \
  --num-examples 3
```
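Purely for orientation, a demonstration entry has roughly this shape. The field names and values below are illustrative assumptions, not the actual contents of `derivations.py`.

```python
# Hypothetical shape of one Few-Shot CoT demonstration. Each demo is selected by its
# HF id, and its cached image is attached to the prompt alongside the derivation text.
DEMO = {
    "id": "en_042",                           # illustrative HF id
    "image_path": "image_cache/en_042.jpg",   # cached demo image sent to the model
    "derivation": "The image shows an eye next to a ball, so eye + ball -> 'eyeball'.",
    "answer": "eyeball",
}
```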
Iterative Refinement: runs the Basic prompt first; if the answer is incorrect, appends feedback and retries.

```bash
python main.py \
  --models openai \
  --languages en,pe,cross,ar \
  --prompt-variant iterative_refinement \
  --num-pass 3
```
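Conceptually, the loop looks like the sketch below. `query_model` is a placeholder for the repo's API client, and the feedback wording is an assumption.

```python
def iterative_refinement(query_model, base_prompt: str, ground_truth: str, num_pass: int = 3) -> list[str]:
    """Ask, check, and retry with feedback up to num_pass times.
    query_model is a placeholder callable: prompt (str) -> model answer (str)."""
    attempts: list[str] = []
    prompt = base_prompt
    for _ in range(num_pass):
        answer = query_model(prompt)
        attempts.append(answer)
        if answer.strip().lower() == ground_truth.strip().lower():   # real runs use eval.py's normalization
            break
        # Feedback wording is an assumption; the repo's template may differ.
        prompt = base_prompt + f"\nYour previous guess '{answer}' was incorrect. Try again."
    return attempts
```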
Partial Character Reveal: reveals a deterministic 25% of non-space characters (stable per `(language, id)`), masking the rest with `_`.

```bash
python main.py \
  --models openai \
  --languages en,pe,cross,ar \
  --prompt-variant partial_character_reveal
```
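A minimal sketch of the reveal logic, assuming an MD5-seeded RNG over `(language, id)`; the actual implementation is in `eyeq_benchmark/hints.py` and may differ in details.

```python
import hashlib
import random

def partial_reveal(answer: str, language: str, sample_id: str, fraction: float = 0.25) -> str:
    """Reveal a deterministic fraction (25% by default) of non-space characters
    and mask the rest with '_'. Seeding from (language, id) keeps the pattern
    identical across runs."""
    seed = int(hashlib.md5(f"{language}:{sample_id}".encode("utf-8")).hexdigest(), 16)
    rng = random.Random(seed)

    positions = [i for i, ch in enumerate(answer) if ch != " "]
    k = max(1, int(len(positions) * fraction))
    revealed = set(rng.sample(positions, k))

    return "".join(ch if (ch == " " or i in revealed) else "_" for i, ch in enumerate(answer))

print(partial_reveal("ice cream", "en", "en_007"))   # same (language, id) -> same mask every run
```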
```bash
python scripts/calculate_accuracy.py results_cache.jsonl
```

The summary groups by `(model, prompt_variant)` and prints exact-match accuracy per subset.

Results are appended as one JSON object per line (JSONL). Fields include:

- `id`, `language`
- `model_name`
- `prompt_variant`
- `ground_truth`, `model_ans`, `solved`
- `attempts` (all raw model responses + parsed final answers)
This makes runs restart-safe: if you re-run the same command, already-computed rows are skipped.
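The skip logic amounts to a key lookup over the existing JSONL. A sketch: the helper name is an assumption, while the field names are the ones listed above.

```python
import json
from pathlib import Path

def completed_keys(cache_file: str = "results_cache.jsonl") -> set[tuple]:
    """Collect (id, language, model_name, prompt_variant) keys already in the cache,
    so a re-run can skip rows that were computed before."""
    keys: set[tuple] = set()
    path = Path(cache_file)
    if not path.exists():
        return keys
    with path.open(encoding="utf-8") as f:
        for line in f:
            if line.strip():
                rec = json.loads(line)
                keys.add((rec["id"], rec["language"], rec["model_name"], rec["prompt_variant"]))
    return keys
```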
- Few-Shot CoT uses fixed, hand-written derivations in `eyeq_benchmark/derivations.py`.
- Partial Character Reveal uses a stable hash-based seed, so the same `(lang, id)` always yields the same reveal pattern.
- String matching uses lightweight normalization (whitespace/punctuation normalization and Arabic/Persian diacritics removal) in `eyeq_benchmark/eval.py`; see the sketch below.
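The matching step is roughly equivalent to the sketch below; the exact punctuation set and diacritic ranges used by `eyeq_benchmark/eval.py` may differ.

```python
import re
import string

# Arabic/Persian combining diacritics (tashkeel) plus tatweel; an approximation of
# what eval.py strips before comparison.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")

def normalize(text: str) -> str:
    """Lightweight normalization: drop diacritics and ASCII punctuation,
    lowercase, and collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)
```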
If you want to manually combine the low-level switches (for ablations), use `custom`:

```bash
python main.py \
  --prompt-variant custom \
  --use-context \
  --hint-type partial_character_reveal \
  --pass-at \
  --num-pass 3
```

Project layout:

- `main.py` – experiment runner (HF dataset → prompts → model API → JSONL)
- `eyeq_benchmark/data.py` – HF loading + image caching
- `eyeq_benchmark/prompts.py` – base prompt + per-subset language rules
- `eyeq_benchmark/derivations.py` – fixed Few-Shot CoT demonstrations
- `eyeq_benchmark/hints.py` – answer-length + partial reveal hints
- `eyeq_benchmark/eval.py` – answer normalization + JSON parsing
- `scripts/calculate_accuracy.py` – accuracy report from JSONL
If you use Eye-Q, please cite the accompanying paper.