WSE-research/ICWE26-Appendix-Reliable-LLM-Integrated-Web-Architectures-for-Teacher-Aligned-Grading

Replication Package — ICWE 2026

Paper: Toward Reliable LLM-Integrated Web Architectures for Teacher-Aligned Automatic Student Grading
Venue: ICWE 2026 — International Conference on Web Engineering

This repository is the online appendix / artifact package for the paper. It contains the analysis scripts, raw and consolidated outputs, and reproducibility artifacts.

Repository Structure

ICWE26-Appendix/
├── config/
│   └── incremental_in_context_eval.json
├── data/
│   ├── manual_eval/raw/manual_eval.csv
│   ├── manual_eval.csv
│   └── config/*.yaml
├── scripts/
│   ├── 01_manual_coverage_analysis.py
│   ├── ...
│   ├── 17_selective_automation_policy.py
│   └── run_all.py
├── outputs/
│   ├── incremental_in_context_eval/
│   │   ├── runs/...
│   │   └── analysis/...
│   ├── incremental_in_context_eval_consolidated/
│   │   ├── runs/...
│   │   └── analysis/...
│   ├── online_appendix/
│   ├── openrouter_activity_2026-02-15.csv
│   ├── openrouter_activity_2026-02-15__cost_summary.json
│   ├── openrouter_activity_2026-02-15__cost_summary_by_model.csv
│   └── openrouter_usage_summary_full_runs_2026-02-15.csv
└── requirements.txt

Paper Promise Coverage (main.tex -> appendix files)

  1. Replication package with prompts/scripts/artifacts (main.tex:76)
    Covered by:

    • scripts/ (all experiment and analysis scripts)
    • config/incremental_in_context_eval.json
    • outputs/
  2. Full provider model identifiers + evaluation timestamp + complete API call log (main.tex:249)
    Covered by:

    • Model identifiers and run metadata:
      • outputs/incremental_in_context_eval/runs/*/99_meta/run_meta.json
      • outputs/incremental_in_context_eval/runs/*/99_meta/model_run__*.json
    • API call logs:
      • outputs/incremental_in_context_eval/runs/*/02_predictions/*/api_call_usage_log.csv
      • outputs/openrouter_activity_2026-02-15.csv
      • outputs/openrouter_usage_summary_full_runs_2026-02-15.csv
  3. Full reliability diagrams and Brier score breakdowns (main.tex:403)
    Covered by:

    • Reliability diagrams:
      • outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_grid.pdf
      • outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_grid.png
    • Reliability bins:
      • outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__reliability_bins.csv
    • Brier score per model:
      • outputs/incremental_in_context_eval_consolidated/analysis/model_confidence_analysis_full_runs__parsable_only__overview.csv (column brier_score)
  4. Full coverage-accuracy sweep data for confidence routing (main.tex:415)
    Covered by:

    • outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy_sweep__parsable_only.csv
  5. Full log-curve fit statistics (main.tex:582)
    Covered by:

    • outputs/incremental_in_context_eval_consolidated/analysis/learning_curve_fit__parsable_only.csv
    • outputs/incremental_in_context_eval_consolidated/analysis/learning_curve_fit__parsable_only.md
  6. Full threshold sweep data (FPR-budget discussion) (main.tex:608)
    Covered by:

    • outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy_sweep__parsable_only.csv
    • outputs/incremental_in_context_eval_consolidated/analysis/selective_automation_policy__parsable_only.csv
  7. Per-question detail referenced in discussion (e.g., r1q17) (main.tex:546)
    Covered by:

    • outputs/incremental_in_context_eval_consolidated/analysis/per_question_accuracy__n40__parsable_only.csv
    • outputs/online_appendix/google_sheets_import__kfold_question_metrics_long.csv
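
For readers unfamiliar with the quantities named in items 3 and 4, the following is a minimal sketch of how a Brier score and a coverage-accuracy sweep are conventionally computed from per-prediction confidences and correctness flags. It is not taken from the repository's scripts: the function names and the toy data are assumptions for illustration, and the column layout of the CSVs listed above may differ.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared difference between each confidence and its 0/1 outcome."""
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((c - float(y)) ** 2 for c, y in zip(confidences, correct))
    return total / len(confidences)


def coverage_accuracy_sweep(
    confidences: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold t, keep predictions with confidence >= t and
    return (t, coverage, accuracy on the kept predictions)."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    rows = []
    for t in thresholds:
        kept = [y for c, y in zip(confidences, correct) if c >= t]
        coverage = len(kept) / n
        # Accuracy is undefined when nothing passes the threshold.
        accuracy = sum(kept) / len(kept) if kept else float("nan")
        rows.append((t, coverage, accuracy))
    return rows


# Toy data (hypothetical, not from the paper).
conf = [0.9, 0.8, 0.6, 0.55]
hit = [True, True, False, True]
sweep = coverage_accuracy_sweep(conf, hit, thresholds=[0.5, 0.7])
```

Raising the threshold trades coverage (the share of answers graded automatically) for accuracy on the retained answers. This is the trade-off the sweep files tabulate.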

Key Reproducibility Outputs

  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_key_claims.json
  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table1_teacher_alignment.csv
  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table2_confidence_calibration.csv
  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table3_feedback_thresholds.csv
  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/paper_table4_cost.csv
  • outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot/main_tex_check_report.json
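
Before comparing numbers against the paper, it can help to confirm that the snapshot files listed above are actually present in a local checkout. The sketch below assumes it is run from the repository root; the helper name `missing_files` is hypothetical and not part of the package.

```python
from pathlib import Path
from typing import Iterable, List

# File names copied from the list above.
SNAPSHOT_FILES = [
    "paper_key_claims.json",
    "paper_table1_teacher_alignment.csv",
    "paper_table2_confidence_calibration.csv",
    "paper_table3_feedback_thresholds.csv",
    "paper_table4_cost.csv",
    "main_tex_check_report.json",
]

SNAPSHOT_DIR = Path(
    "outputs/incremental_in_context_eval_consolidated/analysis/paper_repro_snapshot"
)


def missing_files(snapshot_dir: Path, names: Iterable[str] = SNAPSHOT_FILES) -> List[str]:
    """Return the expected file names that do not exist under snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in names if not (root / name).is_file()]
```

Calling `missing_files(SNAPSHOT_DIR)` returns an empty list when all six files are in place.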

Setup

pip install -r requirements.txt

Reproducing Analysis from Included Prediction Data

Use the consolidated outputs directory as the base:

BASE=outputs/incremental_in_context_eval_consolidated

Run the scripts that regenerate the paper's analyses, tables, and figures from the included prediction CSVs:

python scripts/08_plot_accuracy_overlay_full_runs.py --base-output-dir "$BASE" --parsable-only
python scripts/09_analyze_confidence_full_runs.py --base-output-dir "$BASE" --parsable-only
python scripts/16_learning_curve_fit.py --base-output-dir "$BASE"
python scripts/17_selective_automation_policy.py --base-output-dir "$BASE"
python scripts/12_generate_paper_metrics_snapshot.py --base-output-dir "$BASE" --out-dir "$BASE/analysis/paper_repro_snapshot"
python scripts/15_cost_pareto.py --base-output-dir "$BASE"
python scripts/13_significance_testing.py --base-output-dir "$BASE"
python scripts/14_per_question_variability.py --base-output-dir "$BASE"
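
Because scripts/run_all.py covers only scripts 01 and 02 (see Notes), the commands above can also be driven from a small wrapper. The following is a sketch under that assumption, not a file shipped with the package. It reproduces the same scripts, order, and flags as listed, and the names `STEPS`, `build_commands`, and `run_steps` are hypothetical.

```python
import subprocess
import sys
from typing import List

BASE = "outputs/incremental_in_context_eval_consolidated"

# (script, extra arguments) in the order given above; "{base}" is filled in later.
STEPS = [
    ("scripts/08_plot_accuracy_overlay_full_runs.py", ["--parsable-only"]),
    ("scripts/09_analyze_confidence_full_runs.py", ["--parsable-only"]),
    ("scripts/16_learning_curve_fit.py", []),
    ("scripts/17_selective_automation_policy.py", []),
    ("scripts/12_generate_paper_metrics_snapshot.py",
     ["--out-dir", "{base}/analysis/paper_repro_snapshot"]),
    ("scripts/15_cost_pareto.py", []),
    ("scripts/13_significance_testing.py", []),
    ("scripts/14_per_question_variability.py", []),
]


def build_commands(base: str = BASE, python: str = sys.executable) -> List[List[str]]:
    """Build one argv list per step, mirroring the shell commands above."""
    commands = []
    for script, extra in STEPS:
        args = [arg.replace("{base}", base) for arg in extra]
        commands.append([python, script, "--base-output-dir", base, *args])
    return commands


def run_steps(base: str = BASE) -> None:
    """Run every step in order and stop at the first failure."""
    for cmd in build_commands(base):
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Call `run_steps()` from the repository root. With `check=True`, a failing script raises `subprocess.CalledProcessError`, so later steps never run on incomplete inputs.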

Scripts with External-Service Prerequisites

  1. scripts/04_incremental_in_context_eval.py
    • Requires live OpenRouter access (OPENROUTER_API_KEY).
    • Re-runs model inference; this is not needed to regenerate the paper analytics from the bundled outputs.
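
Since script 04 depends on the OPENROUTER_API_KEY environment variable, a quick pre-flight check avoids starting a long inference run without credentials. This is a minimal sketch; the helper name is hypothetical, and how the script itself reads the key is not specified here.

```python
import os
from typing import Mapping, Optional


def openrouter_key_available(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if OPENROUTER_API_KEY is set to a non-blank value."""
    env = os.environ if env is None else env
    return bool(env.get("OPENROUTER_API_KEY", "").strip())
```

A caller might print a short message and exit when `openrouter_key_available()` returns False, before invoking scripts/04_incremental_in_context_eval.py.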

Notes

  • scripts/run_all.py currently runs only scripts 01 and 02.
  • The canonical paper-facing outputs are under:
    • outputs/incremental_in_context_eval_consolidated/analysis/
  • Additional raw run traces and API logs are included under:
    • outputs/incremental_in_context_eval/runs/

About

Replication package for the ICWE 2026 paper 'Toward Reliable LLM-Integrated Web Architectures for Teacher-Aligned Automatic Student Grading' (scripts, configs, prediction data, analysis outputs).
