Huimin Ren1*, Yan Liang2*, Baiqiao Su2, Chaobo Sun1†,
Hengtong Lu1, Kaike Zhang1, Chen Wei1
1Li Auto Inc., 2Beijing University of Posts and Telecommunications
- [2025-11] We released the code and bilingual datasets.
- [2025-11] LexInstructEval has been accepted to AAAI 2026!
Evaluating the ability of Large Language Models (LLMs) to follow complex, fine-grained lexical instructions remains a significant challenge. Existing methods rely either on costly human evaluation or on "LLM-as-a-judge" systems, which suffer from inherent biases and unreliability.
We introduce LexInstructEval, a new benchmark and evaluation framework designed for fine-grained lexical instruction following.
- Formal Grammar: Built upon a canonical `<Procedure, Relation, Value>` triplet.
- Bilingual: Contains both English and Chinese datasets (~2.5k instructions).
- Objective Verification: Features a transparent, programmatic verification engine that achieves 97% consistency with expert human annotators, eliminating the need for LLM judges.
- Low Cost & Fast: Purely rule-based verification; no API costs or slow model inference required for evaluation.
- High Granularity: Tests constraints from the paragraph level down to specific characters.
- Explainable: Provides detailed feedback on exactly which rule failed (e.g., "The second sentence did not end with 'future'"); see the sketch after this feature list.
- Multi-Metric: Supports both Strict Accuracy (exact compliance) and Loose Accuracy (robust to formatting noise).
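
The verification engine itself ships with the repository as `eval.py`. Purely as a toy illustration (not the actual implementation), the snippet below shows how a single `<Procedure, Relation, Value>` rule can be checked programmatically and yield explainable feedback; the procedure and relation names here are hypothetical.

```python
import re

def check_rule(text, procedure, relation, value):
    """Toy check for one <Procedure, Relation, Value> rule.

    Not the benchmark's eval.py logic; the procedure/relation names are
    hypothetical and only mirror the triplet idea described above.
    """
    if procedure == "second_sentence_of_final_paragraph":
        paragraphs = [p for p in text.split("\n\n") if p.strip()]
        if not paragraphs:
            return False, "Empty response."
        sentences = re.split(r"(?<=[.!?])\s+", paragraphs[-1].strip())
        target = sentences[1] if len(sentences) > 1 else ""
        if relation == "contains":
            if value in target:
                return True, "ok"
            return False, f"The second sentence of the final paragraph does not contain '{value}'."
    return False, f"Unsupported rule: <{procedure}, {relation}, {value}>"

ok, feedback = check_rule(
    "Intro paragraph.\n\nWe begin here. We look towards the future.",
    "second_sentence_of_final_paragraph",
    "contains",
    "future",
)
print(ok, feedback)  # -> True ok
```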
The dataset is located in the `data/` directory. Each entry contains a detailed instruction and the corresponding formal rules for verification.
| Dataset | Language | Difficulty Levels | Count |
|---|---|---|---|
| `lex_instruct_en.jsonl` | English | Easy, Medium, Hard | 1,243 |
| `lex_instruct_zh.jsonl` | Chinese | Easy, Medium, Hard | 1,232 |
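
Each line of these files is a standalone JSON object, so they can be loaded with the standard `json` module; the field names below follow the format example later in this README:

```python
import json

# Load the English split; every line is one JSON object containing at least
# an "id", the natural-language "instruction", and the formal "constraints".
with open("data/lex_instruct_en.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

print(len(examples))
print(examples[0]["instruction"])
```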
Clone this repository and install the required packages:
```bash
git clone https://github.com/huiminren/LexInstructEval.git
cd LexInstructEval
```

To evaluate your model, you need to generate responses for the instructions provided in the `data/` directory.
- Input: Read the `.jsonl` files (e.g., `data/lex_instruct_en.jsonl`).
- Generate: Feed the `instruction` field to your LLM.
- Output: Save the model's response into a new `answer` field in the JSON object (see the sketch after the format example below).
Format Example:
```json
{
  "id": 1,
  "instruction": "The second sentence of the final paragraph must contain the word 'future'.",
  "constraints": [...],
  "answer": "Here is the generated text... We look towards the future."
}
```
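
A minimal sketch of the generation step is shown below; the `generate` function is a placeholder for your own model or inference API, and everything else just reads the instructions and writes the `answer` field back out in the same JSONL format.

```python
import json

def generate(prompt: str) -> str:
    # Placeholder: replace with a call to your own model or inference API.
    raise NotImplementedError

with open("data/lex_instruct_en.jsonl", encoding="utf-8") as fin, \
     open("data/your_model_output.jsonl", "w", encoding="utf-8") as fout:
    for line in fin:
        if not line.strip():
            continue
        item = json.loads(line)
        item["answer"] = generate(item["instruction"])  # add the new field
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```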
Run the automated verification script `eval.py`:

```bash
# Evaluate your generated file
python eval.py --input_file data/your_model_output.jsonl
```

(Note: Please ensure the `input_file` path points to the file containing your model's answers.)
The script will compute and display:
- Strict Accuracy: Percentage of responses passing verification directly.
- Loose Accuracy: Percentage passing after minor formatting normalization (an illustrative sketch of such normalization follows).
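
The exact normalization rules live in `eval.py`; purely as an illustration of what "formatting noise" means, a loose pass might strip Markdown markers and collapse whitespace before the constraints are re-checked:

```python
import re

def loose_normalize(text: str) -> str:
    # Illustrative only: the real normalization is defined in eval.py.
    text = re.sub(r"[*_`#>]", "", text)   # drop common Markdown markers
    text = re.sub(r"[ \t]+", " ", text)   # collapse runs of spaces and tabs
    return text.strip()
```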
Results on LexInstructEval (averaged over 4 runs). See the paper for full details.
| Model | English (Strict) | Chinese (Strict) | Overall (Strict) |
|---|---|---|---|
| GPT-o3-2025-04-16 | 63.5% | 76.9% | 70.2% |
| GPT-4o-2024-11-20 | 26.6% | 29.1% | 27.8% |
| Gemini-2.5-Pro | 49.7% | 52.0% | 50.9% |
| DeepSeek-R1-0528 | 33.4% | 44.0% | 38.7% |
If you find this repo useful, please cite our AAAI 2026 paper:
```bibtex
@article{ren2025lexinstructeval,
  title={LexInstructEval: Lexical Instruction Following Evaluation for Large Language Models},
  author={Ren, Huimin and Liang, Yan and Su, Baiqiao and Sun, Chaobo and Lu, Hengtong and Zhang, Kaike and Wei, Chen},
  journal={arXiv preprint arXiv:2511.17561},
  year={2025},
  note={Accepted by AAAI 2026}
}
```

For any questions or feedback, please file an issue or contact:
- Huimin Ren: [email protected]
