Evaluate student SQL submissions with the assistance of a generative large language model (LLM). This project augments traditional auto-grading with schema-aware prompting and natural-language reasoning to assess correctness, categorize mistakes, and surface rich feedback and analytics.
- Schema-aware grading: the LLM reads a database schema description to contextualize evaluations.
- Automated scoring: classify submissions as correct/incorrect and optionally assign partial credit.
- Error categorization: map common failure modes (e.g., wrong join, aggregation mistakes) to interpretable labels.
- Analytics: generate reports and plots (e.g., confusion matrices, classification metrics) to monitor performance.
- Reproducible pipeline: version inputs, cache model outputs, and export results for downstream analysis.
Ingest assessment context
- Database schema description (e.g., a PDF or text file).
- A dataset of SQL submissions stored in an SQLite database (see the ingestion sketch below).
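A minimal ingestion sketch, assuming the schema arrives as a PDF named `schema.pdf` and the submissions live in a `submissions` table (columns `student_id`, `task_id`, `query_text`) inside `submissions.db`; all of these names are illustrative placeholders rather than the project's actual layout:

```python
import sqlite3

import pdfplumber

# Illustrative file and table names -- substitute your own assessment data.
SCHEMA_PDF = "schema.pdf"
SUBMISSIONS_DB = "submissions.db"


def load_schema_text(pdf_path: str) -> str:
    """Extract the schema description from a PDF (plain-text schemas can be read directly)."""
    with pdfplumber.open(pdf_path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)


def load_submissions(db_path: str) -> list[dict]:
    """Read student submissions from SQLite (assumed table: submissions)."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row  # access columns by name
    rows = conn.execute(
        "SELECT student_id, task_id, query_text FROM submissions"
    ).fetchall()
    conn.close()
    return [dict(row) for row in rows]
```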
Construct prompts
- Combine the schema summary, the task, and each submitted query into a grading prompt, as sketched below.
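A sketch of prompt construction; the wording and the requested output format are assumptions, not the project's exact rubric:

```python
def build_grading_prompt(schema_text: str, task: str, student_query: str) -> str:
    """Combine schema, task, and submission into one grading prompt (illustrative template)."""
    return (
        "You are grading a student's SQL query against the database schema below.\n\n"
        f"Schema:\n{schema_text}\n\n"
        f"Task:\n{task}\n\n"
        f"Student query:\n{student_query}\n\n"
        "Reply with a label (correct/incorrect), an error category if incorrect, "
        "and a brief rationale."
    )
```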
LLM-assisted evaluation
- Query a configurable OpenAI-compatible API to obtain judgments (labels, rationales, and optional feedback); a call sketch follows below.
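A sketch of the API call, assuming the v1+ `openai` client; the `LLM_BASE_URL` environment variable and the default model name are placeholders for whatever OpenAI-compatible endpoint you configure:

```python
import os

from openai import OpenAI

# Point the client at any OpenAI-compatible endpoint; both values are assumptions.
client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1"),
    api_key=os.environ["OPENAI_API_KEY"],
)


def grade_with_llm(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Return the raw judgment text (label, rationale, optional feedback) for one submission."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep judgments as stable as possible across runs
    )
    return response.choices[0].message.content
```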
Aggregation and metrics
- Compare LLM predictions with ground truth (if available), compute classification metrics, and visualize results (see the example below).
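A metrics sketch using scikit-learn; the label names and the toy `y_true`/`y_pred` lists stand in for the parsed ground truth and the LLM's predictions:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholder labels -- in practice these come from the ground-truth data
# and from parsing the LLM's judgments.
y_true = ["correct", "incorrect", "incorrect", "correct"]
y_pred = ["correct", "incorrect", "correct", "correct"]

labels = ["correct", "incorrect"]
print(classification_report(y_true, y_pred, labels=labels))
print(confusion_matrix(y_true, y_pred, labels=labels))
```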
Reporting
- Export predictions, rationales, and evaluation artifacts for audit and iterative improvement, as in the sketch below.
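A minimal export sketch with pandas; the record fields and the output filename are assumptions:

```python
import pandas as pd


def export_results(results: list[dict], out_path: str = "predictions.csv") -> None:
    """Write one row per graded submission (e.g., student_id, task_id, label, rationale)."""
    pd.DataFrame(results).to_csv(out_path, index=False)
```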
Requirements
- Python 3.9
- A virtual environment managed with virtualenv
- Dependencies:
  - `sqlite3` for database interaction
  - `tqdm` for progress bars
  - `pandas` for data manipulation
  - `pdfplumber` for PDF parsing (if you are using the PDF database schema descriptions)
  - `openai` Python package for interacting with OpenAI's API
  - `scikit-learn` for classification metrics and confusion matrices