Skip to content

Conversation

mshannon-sil
Copy link
Collaborator

@mshannon-sil mshannon-sil commented Aug 13, 2025

This module adds support for estimating the quality of drafts with regard to confidence, chrf3, and usability.

It requires a diff_predictions file to establish a correlation between confidence and chrF3, as well as confidence files to project chrF3 and ultimately usability from. The user has a few options to specify confidence files. They can provide a list of file paths (relative to the experiment directory) or they can specify just the book ids. They can also change the directory to look for confidence files in (relative to the experiment directory) e.g. /infer/5000/source. See the argument descriptions for more details. Also, if a usability_parameters.tsv file is present, it will use the parameters in that file for the chrf3 - usability distribution rather than the default parameters that we gathered from usability surveys.

It outputs '*.projected_chrf3.tsv' files corresponding to each confidence file, listing the vref, confidence, and projected_chrf3 for each verse. It then also outputs a single usability_chapters.tsv file covering chapter usability for all chapter refs in all the confidence files, as well as a single usability_books.tsv for book usability.

Currently it only supports chrF3 as a projected scorer and assumes confidence files were generated from a .SFM file rather than a .txt file so that vref information can be retrieved. Future work could be done to expand functionality as the need arises.


This change is Reviewable

@mshannon-sil mshannon-sil requested a review from Copilot August 13, 2025 16:18
@mshannon-sil mshannon-sil self-assigned this Aug 13, 2025
Copilot

This comment was marked as outdated.

@mshannon-sil mshannon-sil requested a review from Copilot August 13, 2025 19:47
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a quality estimation module for evaluating NMT model draft quality using confidence scores, chrF3 projections, and usability metrics. The module establishes correlations between confidence and chrF3 scores from diff predictions files to project quality metrics for confidence files.

  • Adds quality_estimation.py module with functions to project chrF3 scores from confidence data and compute usability proportions
  • Implements command-line interface supporting multiple confidence file selection methods (file paths or book IDs)
  • Generates projected chrF3 output files and aggregated usability metrics at chapter and book levels
Comments suppressed due to low confidence (1)

silnlp/nmt/quality_estimation.py:94

  • The regex pattern is missing the end-of-line anchor '$'. Without it, the pattern could match partial lines, potentially causing incorrect parsing of verse references.
            match = re.match(r"^([0-9A-Z][A-Z]{2}) (\d+):(\d+)(/.*)?", line)

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link
Collaborator

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 6 unresolved discussions (waiting on @mshannon-sil)


silnlp/nmt/quality_estimation.py line 26 at r3 (raw file):

def estimate_quality(diff_predictions_file: Path, confidence_files: List[Path]):

Can you add a -> None type hint?


silnlp/nmt/quality_estimation.py line 40 at r3 (raw file):

            f"in {diff_predictions_file} do not match."
        )
    slope, intercept = linregress(confidence_scores, chrf3_scores)[:2]

Can you refactor this so that we only do the regression once? Linear regression isn't very expensive, but we may want to support other types of regression that require numerical or iterative methods someday.


silnlp/nmt/quality_estimation.py line 205 at r3 (raw file):

        nargs="*",
        help="Relative paths for the confidence files to process (relative to experiment folder or --dir if specified) "
        + "e.g. 'infer/5000/source/631JHN.SFM.confidences.tsv' or '631JHN.SFM.confidences.tsv --dir infer/5000/source'",

Very minor, but the abbreviation for 1 John is 1JN.


silnlp/nmt/quality_estimation.py line 208 at r3 (raw file):

    )
    parser.add_argument(
        "--confidence-dir",

These two arguments also use different cases (snake vs. kebab)


silnlp/nmt/quality_estimation.py line 247 at r3 (raw file):

        confidence_files = []
        for book_id in args.books:
            confidence_files.extend(confidence_dir.glob(f"[0-9]*{book_id}.*.confidences.tsv"))

I believe this could result in unexpected behavior when we are either producing multiple drafts or applying postprocessing, since there could be multiple files that match this glob for each book.


silnlp/nmt/quality_estimation.py line 249 at r3 (raw file):

            confidence_files.extend(confidence_dir.glob(f"[0-9]*{book_id}.*.confidences.tsv"))

    estimate_quality(exp_dir / args.diff_predictions_file_name, confidence_files)

I can imagine the diff predictions file being in a different experiment folder than the confidence files, since it will usually have different training data (holding out a random 100/250 verses)

Copy link
Collaborator Author

@mshannon-sil mshannon-sil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 1 files reviewed, 6 unresolved discussions (waiting on @benjaminking)


silnlp/nmt/quality_estimation.py line 26 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

Can you add a -> None type hint?

Done.


silnlp/nmt/quality_estimation.py line 40 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

Can you refactor this so that we only do the regression once? Linear regression isn't very expensive, but we may want to support other types of regression that require numerical or iterative methods someday.

Done.


silnlp/nmt/quality_estimation.py line 205 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

Very minor, but the abbreviation for 1 John is 1JN.

Done.


silnlp/nmt/quality_estimation.py line 208 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

These two arguments also use different cases (snake vs. kebab)

I think this is the intended casing for both arguments, since after parsing they should both be snake case. For positional arguments, the convention is to use snake case since the variable name is taken as is by the arg parser. For optional flags, the convention is to use kebab case so it's consistent with the -- in front, and arg parser automatically converts it to snake case.


silnlp/nmt/quality_estimation.py line 247 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

I believe this could result in unexpected behavior when we are either producing multiple drafts or applying postprocessing, since there could be multiple files that match this glob for each book.

Would postprocessing files be included here? I didn't think any files with a postprocessing suffix also had a .confidences.tsv suffix.

For the multiple drafts case, I've added a --draft-index arg that can be used when --books is specified.


silnlp/nmt/quality_estimation.py line 249 at r3 (raw file):

Previously, benjaminking (Ben King) wrote…

I can imagine the diff predictions file being in a different experiment folder than the confidence files, since it will usually have different training data (holding out a random 100/250 verses)

Done.

Copy link
Collaborator

@benjaminking benjaminking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@benjaminking reviewed 1 of 1 files at r4, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @mshannon-sil)


silnlp/nmt/quality_estimation.py line 247 at r3 (raw file):

Previously, mshannon-sil wrote…

Would postprocessing files be included here? I didn't think any files with a postprocessing suffix also had a .confidences.tsv suffix.

For the multiple drafts case, I've added a --draft-index arg that can be used when --books is specified.

You're right. Good catch.

@mshannon-sil mshannon-sil merged commit f4ff3b4 into master Aug 25, 2025
1 check was pending
@mshannon-sil mshannon-sil deleted the #784_qe branch August 25, 2025 16:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants