Add quality estimation module #793

mshannon-sil · 2025-08-13T16:18:53Z

This module adds support for estimating the quality of drafts in terms of confidence, chrF3, and usability.

It requires a diff_predictions file to establish a correlation between confidence and chrF3, as well as confidence files from which to project chrF3 and ultimately usability. There are a few ways to specify the confidence files: provide a list of file paths (relative to the experiment directory), or specify just the book IDs. The directory to search for confidence files (also relative to the experiment directory) can be changed as well, e.g. infer/5000/source. See the argument descriptions for more details. If a usability_parameters.tsv file is present, the module uses the parameters in that file for the chrF3-to-usability distribution rather than the default parameters that we gathered from usability surveys.

It outputs a '*.projected_chrf3.tsv' file for each confidence file, listing the vref, confidence, and projected_chrf3 for each verse. It also outputs a single usability_chapters.tsv file covering chapter usability for all chapter refs across all the confidence files, and a single usability_books.tsv file for book usability.

Currently it only supports chrF3 as a projected scorer and assumes confidence files were generated from a .SFM file rather than a .txt file so that vref information can be retrieved. Future work could be done to expand functionality as the need arises.


Copilot

Pull Request Overview

This PR introduces a quality estimation module for evaluating NMT model draft quality using confidence scores, chrF3 projections, and usability metrics. The module establishes correlations between confidence and chrF3 scores from diff predictions files to project quality metrics for confidence files.

Adds quality_estimation.py module with functions to project chrF3 scores from confidence data and compute usability proportions
Implements command-line interface supporting multiple confidence file selection methods (file paths or book IDs)
Generates projected chrF3 output files and aggregated usability metrics at chapter and book levels

Comments suppressed due to low confidence (1)

silnlp/nmt/quality_estimation.py:94

The regex pattern is missing the end-of-line anchor '$'. Without it, the pattern could match partial lines, potentially causing incorrect parsing of verse references.

            match = re.match(r"^([0-9A-Z][A-Z]{2}) (\d+):(\d+)(/.*)?", line)
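For anyone weighing this suggestion, here is a small sketch of how the quoted pattern behaves with and without a trailing '$'. The name ANCHORED and the helper parse_vref are hypothetical and not part of the PR. Note that anchoring also rejects any line with extra content after the verse reference that does not start with '/', so whether it is the right fix depends on the confidence file layout, which is not shown in this thread.

```python
import re
from typing import Optional, Pattern, Tuple

# Pattern as quoted in the review comment (no end-of-line anchor).
UNANCHORED = re.compile(r"^([0-9A-Z][A-Z]{2}) (\d+):(\d+)(/.*)?")
# Hypothetical variant with the suggested '$' anchor added.
ANCHORED = re.compile(r"^([0-9A-Z][A-Z]{2}) (\d+):(\d+)(/.*)?$")


def parse_vref(pattern: Pattern[str], line: str) -> Optional[Tuple[str, str, str]]:
    """Return (book, chapter, verse) if the pattern matches the start of the line, else None."""
    match = pattern.match(line)
    if match is None:
        return None
    book, chapter, verse = match.group(1), match.group(2), match.group(3)
    return book, chapter, verse


# The unanchored pattern accepts a line with trailing junk after the verse number;
# the anchored pattern rejects it.
print(parse_vref(UNANCHORED, "JHN 3:16abc"))
print(parse_vref(ANCHORED, "JHN 3:16abc"))
```

Using re.fullmatch with the unanchored pattern would be another way to require that the whole line is consumed.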


silnlp/nmt/quality_estimation.py

benjaminking

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @mshannon-sil)

silnlp/nmt/quality_estimation.py line 26 at r3 (raw file):

def estimate_quality(diff_predictions_file: Path, confidence_files: List[Path]):

Can you add a -> None type hint?

silnlp/nmt/quality_estimation.py line 40 at r3 (raw file):

            f"in {diff_predictions_file} do not match."
        )
    slope, intercept = linregress(confidence_scores, chrf3_scores)[:2]

Can you refactor this so that we only do the regression once? Linear regression isn't very expensive, but we may want to support other types of regression that require numerical or iterative methods someday.

silnlp/nmt/quality_estimation.py line 205 at r3 (raw file):

        nargs="*",
        help="Relative paths for the confidence files to process (relative to experiment folder or --dir if specified) "
        + "e.g. 'infer/5000/source/631JHN.SFM.confidences.tsv' or '631JHN.SFM.confidences.tsv --dir infer/5000/source'",

Very minor, but the abbreviation for 1 John is 1JN.

silnlp/nmt/quality_estimation.py line 208 at r3 (raw file):

    )
    parser.add_argument(
        "--confidence-dir",

These two arguments also use different cases (snake vs. kebab)
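As a sketch of what a consistent spelling could look like: argparse converts hyphens in a long option name to underscores when it builds the attribute name, so kebab-case flags on the command line can still be read as snake_case attributes in code. The flag --confidence-dir is taken from the snippet above; --diff-predictions-file-name is an assumed kebab-case spelling inferred from args.diff_predictions_file_name, and build_parser is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that spells every long option in kebab-case."""
    parser = argparse.ArgumentParser(description="Illustration of consistent flag naming")
    parser.add_argument(
        "--confidence-dir",
        default=None,
        help="Directory to search for confidence files (relative to the experiment folder)",
    )
    parser.add_argument(
        "--diff-predictions-file-name",  # assumed spelling, not confirmed by the PR
        default=None,
        help="Name of the diff predictions file",
    )
    return parser


args = build_parser().parse_args(
    ["--confidence-dir", "infer/5000/source", "--diff-predictions-file-name", "diff_predictions"]
)
# Hyphens in the flags become underscores in the attribute names.
print(args.confidence_dir, args.diff_predictions_file_name)
```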

silnlp/nmt/quality_estimation.py line 247 at r3 (raw file):

        confidence_files = []
        for book_id in args.books:
            confidence_files.extend(confidence_dir.glob(f"[0-9]*{book_id}.*.confidences.tsv"))

I believe this could result in unexpected behavior when we are either producing multiple drafts or applying postprocessing, since there could be multiple files that match this glob for each book.
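To make the concern concrete, here is a rough sketch that applies the same glob pattern to a list of file names and reports the books that match more than one file. It uses fnmatch.fnmatchcase as a stand-in for Path.glob, so matching on a real filesystem may differ slightly. The file names in the usage lines are invented for illustration, and match_confidence_files and find_ambiguous_books are hypothetical helpers.

```python
from fnmatch import fnmatchcase
from typing import Dict, Iterable, List


def match_confidence_files(filenames: Iterable[str], book_id: str) -> List[str]:
    """Return the names matching the glob pattern used in the PR for one book ID."""
    pattern = f"[0-9]*{book_id}.*.confidences.tsv"
    return sorted(name for name in filenames if fnmatchcase(name, pattern))


def find_ambiguous_books(filenames: Iterable[str], book_ids: Iterable[str]) -> Dict[str, List[str]]:
    """Map each book ID to its matches, keeping only books with more than one match."""
    names = list(filenames)
    ambiguous: Dict[str, List[str]] = {}
    for book_id in book_ids:
        matches = match_confidence_files(names, book_id)
        if len(matches) > 1:
            ambiguous[book_id] = matches
    return ambiguous


# Invented file names: two candidate files for JHN, one for MAT.
example_names = [
    "44JHN.SFM.confidences.tsv",
    "44JHN.1.SFM.confidences.tsv",
    "41MAT.SFM.confidences.tsv",
]
print(find_ambiguous_books(example_names, ["JHN", "MAT"]))
```

A check along these lines could either raise an error or ask the user to pass explicit file paths when a book ID is ambiguous.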

silnlp/nmt/quality_estimation.py line 249 at r3 (raw file):

            confidence_files.extend(confidence_dir.glob(f"[0-9]*{book_id}.*.confidences.tsv"))

    estimate_quality(exp_dir / args.diff_predictions_file_name, confidence_files)

I can imagine the diff predictions file being in a different experiment folder than the confidence files, since it will usually have different training data (holding out a random 100/250 verses)

…ug fixes

mshannon-sil

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @benjaminking)

silnlp/nmt/quality_estimation.py line 26 at r3 (raw file):