Paper: Toward Reliable LLM-Integrated Web Architectures for Teacher-Aligned Automatic Student Grading
Venue: ICWE 2026 — International Conference on Web Engineering
This repository is the online appendix / artifact package for the paper. It contains the analysis scripts, raw and consolidated outputs, and reproducibility artifacts.
ICWE26-Appendix/
├── config/
│ └── incremental_in_context_eval.json
├── data/
│ ├── manual_eval/raw/manual_eval.csv
│ ├── manual_eval.csv
│ └── config/*.yaml
├── scripts/
│ ├── 01_manual_coverage_analysis.py
│ ├── ...
│ ├── 17_selective_automation_policy.py
│ └── run_all.py
├── outputs/
│ ├── incremental_in_context_eval/
│ │ ├── runs/...
│ │ └── analysis/...
│ ├── incremental_in_context_eval_consolidated/
│ │ ├── runs/...
│ │ └── analysis/...
│ ├── online_appendix/
│ ├── openrouter_activity_2026-02-15.csv
│ ├── openrouter_activity_2026-02-15__cost_summary.json
│ ├── openrouter_activity_2026-02-15__cost_summary_by_model.csv
│ └── openrouter_usage_summary_full_runs_2026-02-15.csv
└── requirements.txt
- Replication package with prompts/scripts/artifacts (main.tex:76)
  Covered by:
  - scripts/ (all experiment and analysis scripts)
  - config/incremental_in_context_eval.json
  - outputs/
- Full provider model identifiers + evaluation timestamp + complete API call log (main.tex:249)
  Covered by:
  - Model identifiers and run metadata:
    - outputs/incremental_in_context_eval/runs/*/99_meta/run_meta.json
    - outputs/incremental_in_context_eval/runs/*/99_meta/model_run__*.json
  - API call logs:
    - outputs/incremental_in_context_eval/runs/*/02_predictions/*/api_call_usage_log.csv
    - outputs/openrouter_activity_2026-02-15.csv
    - outputs/openrouter_usage_summary_full_runs_2026-02-15.csv
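To get a quick sense of how many calls each per-run log records, the logs can be globbed and their rows counted. The following is a minimal sketch, not a script from the repository: `count_api_log_rows` is a hypothetical helper, and it assumes only that each `api_call_usage_log.csv` has a single header row (no column names are assumed).

```python
import csv
from pathlib import Path


def count_api_log_rows(runs_dir):
    """Return {log path: number of data rows} for each per-run API call log."""
    counts = {}
    # Mirrors the documented layout: runs/*/02_predictions/*/api_call_usage_log.csv
    pattern = "*/02_predictions/*/api_call_usage_log.csv"
    for log_path in sorted(Path(runs_dir).glob(pattern)):
        with log_path.open(newline="", encoding="utf-8") as handle:
            reader = csv.reader(handle)
            next(reader, None)  # skip the header row (assumed to be present)
            counts[str(log_path)] = sum(1 for _ in reader)
    return counts
```

Calling `count_api_log_rows("outputs/incremental_in_context_eval/runs")` from the repository root would return one entry per model run that has a log.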
- Full reliability diagrams and Brier score breakdowns (main.tex:403)
  Covered by:
  - Reliability diagrams:
    - outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_grid.pdf
    - outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_grid.png
  - Reliability bins:
    - outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_bins.csv
  - Brier score per model:
    - outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__overview.csv (column brier_score)
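For readers unfamiliar with the quantities in these files, here is a minimal sketch of the standard Brier score and of equal-width reliability binning. It is not taken from the repository's scripts: the function names are hypothetical, and it assumes confidences are probabilities in [0, 1] and outcomes are 1 (prediction matched the teacher) or 0. The binning scheme used for the paper may differ.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(confidences, outcomes)) / len(confidences)


def reliability_bins(confidences, outcomes, n_bins=10):
    """Group predictions into equal-width confidence bins.

    Returns (bin_lower, bin_upper, count, mean_confidence, accuracy)
    for every non-empty bin.
    """
    buckets = [[] for _ in range(n_bins)]
    for p, y in zip(confidences, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        buckets[index].append((p, y))
    rows = []
    for index, bucket in enumerate(buckets):
        if not bucket:
            continue
        count = len(bucket)
        mean_confidence = sum(p for p, _ in bucket) / count
        accuracy = sum(y for _, y in bucket) / count
        rows.append((index / n_bins, (index + 1) / n_bins, count, mean_confidence, accuracy))
    return rows
```

A well-calibrated model has `mean_confidence` close to `accuracy` in every bin; the reliability diagram plots one against the other.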
- Full coverage-accuracy sweep data for confidence routing (main.tex:415)
  Covered by:
  - outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy_sweep__parsable_only.csv
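As an illustration of what a coverage-accuracy sweep contains, the sketch below computes, for each confidence threshold, the share of predictions that would be automated (coverage) and the accuracy among them. This is a generic formulation under stated assumptions, not the logic of `17_selective_automation_policy.py`: the function name and dictionary keys are hypothetical, and a prediction counts as automated when its confidence is at or above the threshold.

```python
def coverage_accuracy_sweep(confidences, correct, thresholds):
    """For each threshold, report coverage and accuracy of the automated subset."""
    total = len(confidences)
    rows = []
    for threshold in thresholds:
        # Keep only predictions confident enough to be handled automatically.
        kept = [c for p, c in zip(confidences, correct) if p >= threshold]
        coverage = len(kept) / total if total else 0.0
        # Accuracy is undefined (None) when nothing passes the threshold.
        accuracy = sum(kept) / len(kept) if kept else None
        rows.append({"threshold": threshold, "coverage": coverage, "accuracy": accuracy})
    return rows
```

Raising the threshold typically trades coverage for accuracy; predictions below the threshold would be routed to the teacher.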
- Full log-curve fit statistics (main.tex:582)
  Covered by:
  - outputs/incremental_in_context_eval_consolidated/analysis/learning_curve_fit__parsable_only.csv
  - outputs/incremental_in_context_eval_consolidated/analysis/learning_curve_fit__parsable_only.md
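A log-curve fit of this kind can be sketched as ordinary least squares on the logarithm of the number of examples. The functional form `accuracy = a + b * ln(n)`, the meaning of `n` (number of in-context examples), and the name `fit_log_curve` are assumptions made for illustration; the exact model and statistics reported by `16_learning_curve_fit.py` may differ.

```python
import math


def fit_log_curve(ns, accuracies):
    """Least-squares fit of accuracy = a + b * ln(n); returns (a, b, r_squared)."""
    if len(ns) != len(accuracies) or len(set(ns)) < 2:
        raise ValueError("need matching sequences with at least two distinct n values")
    xs = [math.log(n) for n in ns]  # requires every n > 0
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, accuracies))
    ss_tot = sum((y - mean_y) ** 2 for y in accuracies)
    # R^2 is undefined when all accuracies are identical.
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return a, b, r_squared
```

A positive `b` means accuracy keeps improving with more examples, but with diminishing returns per added example.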
- Full threshold sweep data (FPR-budget discussion) (main.tex:608)
  Covered by:
  - outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy_sweep__parsable_only.csv
  - outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy__parsable_only.csv
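One way to read an FPR budget is as a cap on the share of incorrect predictions that slip through automation. The sketch below picks the lowest threshold that stays within such a cap. It is a hypothetical illustration: the FPR definition used here (incorrect predictions with confidence at or above the threshold, divided by all incorrect predictions) is an assumption and may not match the definition in the paper, and the function name is invented.

```python
def lowest_threshold_within_fpr_budget(confidences, correct, thresholds, fpr_budget):
    """Return (threshold, fpr) for the lowest threshold with fpr <= fpr_budget, else None."""
    negatives = sum(1 for c in correct if not c)  # predictions that disagree with the teacher
    for threshold in sorted(thresholds):
        false_positives = sum(
            1 for p, c in zip(confidences, correct) if p >= threshold and not c
        )
        fpr = false_positives / negatives if negatives else 0.0
        if fpr <= fpr_budget:
            # FPR cannot increase as the threshold rises, so the first hit
            # is also the threshold that automates the most predictions.
            return threshold, fpr
    return None
```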
- Per-question detail referenced in discussion (e.g., r1q17) (main.tex:546)
  Covered by:
  - outputs/incremental_in_context_eval_consolidated/analysis/per_question_accuracy__n40__parsable_only.csv
  - outputs/online_appendix/google_sheets_import__kfold_question_metrics_long.csv

- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_key_claims.json
- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table1_teacher_alignment.csv
- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table2_confidence_calibration.csv
- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table3_feedback_thresholds.csv
- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table4_cost.csv
- outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/main_tex_check_report.json
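A quick sanity check on a fresh clone is to confirm that all six snapshot files listed above are present. The following is a minimal sketch; `missing_snapshot_files` is a hypothetical helper, not part of the repository, and it checks only for existence, not content.

```python
from pathlib import Path

# File names copied from the snapshot listing above.
SNAPSHOT_FILES = [
    "paper_key_claims.json",
    "paper_table1_teacher_alignment.csv",
    "paper_table2_confidence_calibration.csv",
    "paper_table3_feedback_thresholds.csv",
    "paper_table4_cost.csv",
    "main_tex_check_report.json",
]


def missing_snapshot_files(snapshot_dir):
    """Return the expected snapshot file names that are absent from snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in SNAPSHOT_FILES if not (root / name).is_file()]
```

An empty list from `missing_snapshot_files("outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot")` means the snapshot is complete.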
pip install -r requirements.txt
Use the consolidated outputs as the base directory:
BASE=outputs/incremental_in_context_eval_consolidated
Run the scripts that regenerate paper analyses/tables/figures from the included prediction CSVs:
python scripts/08_plot_accuracy_overlay_full_runs.py --base-output-dir "$BASE" --parsable-only
python scripts/09_analyze_confidence_full_runs.py --base-output-dir "$BASE" --parsable-only
python scripts/16_learning_curve_fit.py --base-output-dir "$BASE"
python scripts/17_selective_automation_policy.py --base-output-dir "$BASE"
python scripts/12_generate_paper_metrics_snapshot.py --base-output-dir "$BASE" --out-dir "$BASE/analysis/paper_repro_snapshot"
python scripts/15_cost_pareto.py --base-output-dir "$BASE"
python scripts/13_significance_testing.py --base-output-dir "$BASE"
python scripts/14_per_question_variability.py --base-output-dir "$BASE"

scripts/04_incremental_in_context_eval.py
- Requires live OpenRouter access (OPENROUTER_API_KEY)
- Re-runs model inference (not needed to regenerate paper analytics from bundled outputs)
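Since script 04 needs live API access, an offline regeneration pass can skip it and chain only the eight analysis commands listed above, in the same order and with the same flags. The following is a minimal sketch of such a driver. `build_commands` and `run_commands` are hypothetical names; this is not `scripts/run_all.py`, and it assumes it is launched from the repository root.

```python
import subprocess
import sys


def build_commands(base, python=sys.executable):
    """Build the analysis commands in the documented order (script 04 is left out)."""

    def command(script, *extra):
        return [python, f"scripts/{script}", "--base-output-dir", base, *extra]

    return [
        command("08_plot_accuracy_overlay_full_runs.py", "--parsable-only"),
        command("09_analyze_confidence_full_runs.py", "--parsable-only"),
        command("16_learning_curve_fit.py"),
        command("17_selective_automation_policy.py"),
        command(
            "12_generate_paper_metrics_snapshot.py",
            "--out-dir",
            f"{base}/analysis/paper_repro_snapshot",
        ),
        command("15_cost_pareto.py"),
        command("13_significance_testing.py"),
        command("14_per_question_variability.py"),
    ]


def run_commands(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the chain at the first failing script.
            subprocess.run(command, check=True)
```

For example, `run_commands(build_commands("outputs/incremental_in_context_eval_consolidated"), dry_run=False)` would execute the full chain, while the default dry run only prints it.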
- scripts/run_all.py currently runs only scripts 01 and 02.
- The canonical paper-facing outputs are under:
  outputs/incremental_in_context_eval_consolidated/analysis/
- Additional raw run traces and API logs are included under:
  outputs/incremental_in_context_eval/runs/