Skip to content

Conversation

@eecavanna
Copy link
Collaborator

@eecavanna eecavanna commented Aug 27, 2025

On this branch, I added real-time referential checking to the apply_metadata_in Dagster graph.

Details

Specifically, I:

  • Added the check_inter_document_references=True kwarg to a call to validate_json() so that the apply_metadata_in Dagster graph performs referential integrity checking (in addition to the schema validation it already performed).
  • Updated the existing test_apply_metadata_in_functional_annotation_agg test so that the document referenced by the documents the test inserts via the Dagster op, exists. That way, the insertion does not fail due to the newly-introduced referential integrity check.
  • Added a new test, named test_apply_metadata_in_checks_referential_integrity, which demonstrates that, when a document being inserted via the Dagster op includes a broken reference, the document does not get inserted.
  • Added some TODOs about documenting undocumented functions.

The generically-named perform_mongo_updates function is only called by the apply_metadata_in graph.

Related issue(s)

Fixes #969

Related subsystem(s)

  • Runtime API (except the Minter)
  • Minter
  • Dagster
  • Project documentation (in the docs directory)
  • Translators (metadata ingest pipelines)
  • MongoDB migrations
  • Other

Testing

  • I tested these changes (explain below)
  • I did not test these changes

I tested these changes by...

  • Adding a test that demonstrates the new functionality
  • Ensuring all tests pass locally and on GHA

Documentation

  • I have not checked for relevant documentation yet (e.g. in the docs directory)
  • I have updated all relevant documentation so it will remain accurate
  • Other (explain below)

Maintainability

  • Every Python function I defined includes a docstring (test functions are exempt from this)
  • Every Python function parameter I introduced includes a type hint (e.g. study_id: str)
  • All "to do" or "fix me" Python comments I added begin with either # TODO or # FIXME
  • I used black to format all the Python files I created/modified
  • The PR title is in the imperative mood (e.g. "Do X") and not the declarative mood (e.g. "Does X" or "Did X")

@eecavanna eecavanna self-assigned this Aug 27, 2025
@eecavanna eecavanna linked an issue Aug 27, 2025 that may be closed by this pull request
@eecavanna eecavanna marked this pull request as ready for review August 27, 2025 18:58
@eecavanna
Copy link
Collaborator Author

eecavanna commented Aug 27, 2025

I'm still a little confused about how Dagster things get defined. I'm trying to determine whether perform_mongo_updates is ever used to do any kinds of MongoDB operations besides insertions of documents (other kinds include updates and deletions of documents).

I see that perform_mongo_updates becomes an op (because it's decorated with @op(...)). The only place it is "called" is from within the apply_metadata_in graph. However, the string "apply_metadata_in" is also used as a dictionary key within the _ensure_job__metadata_in function and the claim_and_run_metadata_in_jobs function (the latter is decorated with @sensor(...).

@dwinston
Copy link
Collaborator

I'm still a little confused about how Dagster things get defined. I'm trying to determine whether perform_mongo_updates is ever used to do any kinds of MongoDB operations besides insertions of documents (other kinds include updates and deletions of documents).

nmdc_runtime.site.ops.perform_mongo_updates calls nmdc_runtime.site.resources.MongoDB.add_docs, which bulk_writes ReplaceOne or InsertOne operations. Thus, it inserts or replaces (the latter with upsert=True). It does not delete.

I see that perform_mongo_updates becomes an op (because it's decorated with @op(...)). The only place it is "called" is from within the apply_metadata_in graph. However, the string "apply_metadata_in" is also used as a dictionary key within the _ensure_job__metadata_in function and the claim_and_run_metadata_in_jobs function (the latter is decorated with @sensor(...).

The string "metadata-in-1.0.0" represents (or was intended to represent) a specific (versioned) NMDC workflow, like something run on NERSC. "apply_metadata_in" represents a dagster job. The claim_and_run_metadata_in_jobs dagster sensor indeed yields a (dagster) RunRequest for apply_metadata_in.to_job, which wraps the dagster apply_metadata_in "graph of ops" as a configured "job". Dagster has its own terminology that uses the words "graph", "op", and "job" (and "sensor").

@eecavanna
Copy link
Collaborator Author

eecavanna commented Aug 30, 2025

Thanks, @dwinston. Because the function can do updates, I don't think the solution I've implemented in this PR will suffice. It covers insertions, but not updates.

For example: if the caller wants to update an existing document (matched by its id) so it has a different type, the solution I implemented here would not check references that were already pointing to that document (from documents not involved in the operation) to confirm they are still schema-compliant references.

There is a function that can be used to check referential integrity in this case (it uses a MongoDB transaction to do the writes, check references, and then roll back the writes), but it it not what I did here and may complicate this function enough that we defer adding referential integrity checking to this function until after Q4 ends (which was always an option for the three referential integrity-related tickets on the squad board).

I'll think about this more next week, and we can discuss it at the next squad meeting.

@dwinston dwinston marked this pull request as draft September 17, 2025 11:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add referential integrity check to apply_metadata_in Dagster job

4 participants