feature: Layer files are deduplicated on creation based on the science data #268

Merged
alastairtree merged 14 commits into main from feat/layer-deduplication-at-save-time
May 1, 2026

@alastairtree
Collaborator

…librate with database

When running SetQualityAndNaN calibration twice with SaveMode.LocalAndDatabase,
the second run produced identical data CSV content. DBIndexedDatastoreFileManager
found a hash match at v001 in the DB and redirected the handler back to v001,
while the layer JSON had already been written referencing data_filename=v002.csv.
Fixed by allowing a layer file's version to be locked.

Bumps version to 5.1.1.

Previously, calibration layers were pre-versioned before MATLAB was
called, which was fragile and required callers to predict the next
available slot upfront.

Now layers are always generated at v001 and deduplication/versioning
happens in the datastore managers. Two layers are considered identical
when their companion CSV hash and content date match.
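
A minimal sketch of the save-time decision described above, assuming identity is the pair (companion CSV hash, content date). `LayerIdentity` and `assign_version` are hypothetical names for illustration, not the datastore managers' actual API:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class LayerIdentity:
    """Hypothetical value object: what makes two layers 'the same'."""

    csv_hash: str
    content_date: date


def assign_version(
    incoming: LayerIdentity, stored: dict[int, LayerIdentity]
) -> tuple[int, bool]:
    """Return (version, is_duplicate) for a layer that arrives at v001."""
    # Reuse the existing slot when an identical layer is already stored.
    for version, identity in sorted(stored.items()):
        if identity == incoming:
            return version, True
    # Otherwise take the next slot after the highest stored version.
    return max(stored, default=0) + 1, False


stored = {1: LayerIdentity("aaa", date(2026, 4, 1))}
same = assign_version(LayerIdentity("aaa", date(2026, 4, 1)), stored)
changed = assign_version(LayerIdentity("bbb", date(2026, 4, 1)), stored)
print(same, changed)
```

The caller never has to predict a version: it always hands over a v001 layer and the manager decides whether to reuse a slot or allocate a new one.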

Key changes:
- Add content-identity protocol to IFilePathHandler (get_content_identity,
  get_stored_content_identity, prepare_for_version, get_storage_meta,
  is_version_blocked_by_sibling)
- CalibrationLayerPathHandler overrides these: JSON identity is the
  companion CSV hash; prepare_for_version rewrites data_filename in JSON
  when the assigned version differs from v001; is_version_blocked_by_sibling
  ensures JSON and CSV always land on the same version slot
- DatastoreFileManager and DBIndexedDatastoreFileManager updated to use
  the new protocol for dedup and sibling-aware versioning
- Remove upfront set_layer_to_next_viable_version from CalibrationJob and
  the raise_if_resequenced guard from SetQualityAndNaNCalibration
- Remove version-locking from VersionedPathHandler
- Tests updated throughout to reflect that MATLAB always receives v001
  paths and version bumping is verified via datastore output
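
To illustrate the two sibling-aware hooks in the list above, here is a rough sketch. The method names come from this PR, but the signatures, the `_vNNN` filename scheme, the `metadata.data_filename` JSON layout, and the exact blocking rule are all assumptions made for the example:

```python
import json
import re
import tempfile
from pathlib import Path

# Assumed naming scheme: a three-digit _vNNN suffix right before the extension.
VERSION_RE = re.compile(r"_v(\d{3})(?=\.\w+$)")


def with_version(filename: str, version: int) -> str:
    """Swap the _vNNN suffix in a filename."""
    return VERSION_RE.sub(f"_v{version:03d}", filename)


def prepare_for_version(json_path: Path, version: int) -> None:
    """Rewrite data_filename so the JSON points at the CSV in the same slot."""
    if version == 1:
        return  # layers are generated at v001, so nothing to rewrite
    layer = json.loads(json_path.read_text())
    meta = layer["metadata"]  # assumed location of data_filename
    meta["data_filename"] = with_version(meta["data_filename"], version)
    json_path.write_text(json.dumps(layer, indent=2))


def is_version_blocked_by_sibling(folder: Path, sibling_v001: str, version: int) -> bool:
    """Assumed rule: a slot is unusable if the sibling file already occupies it."""
    return (folder / with_version(sibling_v001, version)).exists()


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    json_path = folder / "layer_v001.json"
    json_path.write_text(
        json.dumps({"metadata": {"data_filename": "layer-data_v001.csv"}})
    )
    (folder / "layer-data_v002.csv").touch()  # an orphan CSV occupies v002

    version = 2
    while is_version_blocked_by_sibling(folder, "layer-data_v001.csv", version):
        version += 1
    prepare_for_version(json_path, version)
    rewritten = json.loads(json_path.read_text())["metadata"]["data_filename"]
    print(version, rewritten)
```

Because the JSON skips any slot its CSV sibling cannot also use, the pair cannot end up split across two versions, which is the failure mode the original bug report describes.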
Comment thread src/imap_mag/io/file/CalibrationLayerPathHandler.py
@alastairtree alastairtree force-pushed the feat/layer-deduplication-at-save-time branch from e51c648 to ac9b987 Compare April 28, 2026 17:31
@github-actions
Contributor

github-actions Bot commented Apr 28, 2026

Coverage Report (3.14)

File Coverage
All files 79%
imap_db/main.py 26%
imap_db/model.py 82%
imap_db/migrations/versions/2026_05_01-52c7b098641d_migrate_layer_files.py 64%
imap_mag/__init__.py 88%
imap_mag/appLogging.py 25%
imap_mag/check/IALiRTAnomaly.py 98%
imap_mag/cli/apply.py 84%
imap_mag/cli/calibrate.py 80%
imap_mag/cli/cliUtils.py 73%
imap_mag/cli/ialirtUtils.py 91%
imap_mag/cli/process.py 90%
imap_mag/cli/fetch/DownloadDateManager.py 92%
imap_mag/cli/fetch/binary.py 79%
imap_mag/cli/fetch/ialirt.py 76%
imap_mag/cli/fetch/science.py 59%
imap_mag/cli/fetch/spice.py 77%
imap_mag/cli/fetch/spin_table.py 42%
imap_mag/cli/plot/plot_ialirt.py 83%
imap_mag/client/IALiRTApiClient.py 83%
imap_mag/client/SDCDataAccess.py 89%
imap_mag/client/WebPODA.py 83%
imap_mag/config/CalibrationConfig.py 87%
imap_mag/config/NestedAliasEnvSettingsSource.py 94%
imap_mag/data_pipelines/DownloadLoPivotCsvFilesStage.py 84%
imap_mag/data_pipelines/DownloadSpinTableFilesStage.py 70%
imap_mag/data_pipelines/GetProcessingDatesStage.py 95%
imap_mag/data_pipelines/Pipeline.py 90%
imap_mag/data_pipelines/Record.py 88%
imap_mag/data_pipelines/Result.py 96%
imap_mag/data_pipelines/Stages.py 95%
imap_mag/db/Database.py 80%
imap_mag/download/FetchIALiRT.py 97%
imap_mag/download/FetchScience.py 71%
imap_mag/io/DBIndexedDatastoreFileManager.py 93%
imap_mag/io/DatastoreFileManager.py 87%
imap_mag/io/FileFinder.py 82%
imap_mag/io/FilePathHandlerSelector.py 97%
imap_mag/io/file/AncillaryPathHandler.py 83%
imap_mag/io/file/CalibrationLayerPathHandler.py 92%
imap_mag/io/file/IALiRTHKPathHandler.py 85%
imap_mag/io/file/PartitionedPathHandler.py 96%
imap_mag/io/file/QuicklookPathHandler.py 86%
imap_mag/io/file/SPICEPathHandler.py 68%
imap_mag/io/file/SpinTablePathHandler.py 81%
imap_mag/io/file/StandardSPDFPathHandler.py 69%
imap_mag/io/file/VersionedPathHandler.py 69%
imap_mag/plot/plot_ialirt_files.py 76%
imap_mag/process/HKProcessSettings.py 83%
imap_mag/process/HKProcessor.py 92%
imap_mag/process/get_packet_definition_folder.py 83%
imap_mag/process/metakernel.py 88%
imap_mag/util/CCSDSBinaryPacketFile.py 54%
imap_mag/util/DatetimeProvider.py 65%
imap_mag/util/HKPacket.py 87%
imap_mag/util/Humaniser.py 81%
imap_mag/util/TimeConversion.py 79%
mag_toolkit/CDFLoader.py 0%
mag_toolkit/calibration/CalibrationApplicator.py 76%
mag_toolkit/calibration/CalibrationDefinitions.py 86%
mag_toolkit/calibration/CalibrationLayer.py 73%
mag_toolkit/calibration/CalibrationMatrix.py 94%
mag_toolkit/calibration/Layer.py 76%
mag_toolkit/calibration/MatlabWrapper.py 79%
mag_toolkit/calibration/ScienceLayer.py 58%
mag_toolkit/calibration/calibrators/CalibrationJob.py 72%
mag_toolkit/calibration/calibrators/EmptyCalibration.py 72%
mag_toolkit/calibration/calibrators/GradiometerCalibration.py 71%
mag_toolkit/calibration/calibrators/SetQualityAndNaNCalibration.py 85%
prefect_server/checkIALiRT.py 92%
prefect_server/datastoreCleanupFlow.py 90%
prefect_server/durationUtils.py 90%
prefect_server/performCalibration.py 70%
prefect_server/pollHK.py 89%
prefect_server/pollIALiRT.py 77%
prefect_server/pollLoPivotPlatform.py 66%
prefect_server/pollScience.py 67%
prefect_server/pollSpice.py 75%
prefect_server/pollSpinTable.py 0%
prefect_server/postgresUploadFlow.py 65%
prefect_server/prefectUtils.py 54%
prefect_server/serverConfig.py 0%
prefect_server/uploadSharedDocsFlow.py 95%
prefect_server/workflow.py 0%

Minimum allowed coverage is 80%

Generated by 🐒 cobertura-action against ee0cc09

For calibration layer JSON files the content identity for deduplication
is now the companion CSV data hash, not the JSON file hash. This means
re-versioned JSONs (which only differ in their data_filename reference)
correctly deduplicate against an existing record.

Changes:
- CalibrationLayerPathHandler.get_content_identity: use raw JSON parsing
  to read data_hash from metadata; fallback to companion CSV hash; final
  fallback to JSON file hash when neither is available (e.g. empty stub
  files in the datastore)
- CalibrationLayerPathHandler.prepare_for_version: switch to raw JSON
  parsing so it works on minimal test JSONs and files without a
  co-located companion CSV
- DBIndexedDatastoreFileManager: dedup comparison now also checks
  file_meta["data_file_hash"] so that records written with the CSV hash
  stored in metadata are found; stores data_file_hash in file_meta
  whenever content identity differs from the raw file hash
- Tests: fix CSV content and hash computation to survive the pandas
  round-trip; consolidate duplicate imports
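
The three-step fallback in `get_content_identity` as this commit describes it could look roughly like the following. This is a sketch only: SHA-256 and the `metadata.data_hash` / `metadata.data_filename` layout are assumptions, and `read_metadata` and `file_hash` are hypothetical helpers:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    # SHA-256 is an assumption; the PR does not name the hash algorithm.
    return hashlib.sha256(path.read_bytes()).hexdigest()


def read_metadata(json_path: Path) -> dict:
    """Raw JSON parsing that tolerates empty or minimal files."""
    try:
        doc = json.loads(json_path.read_text())
    except json.JSONDecodeError:
        return {}
    metadata = doc.get("metadata") if isinstance(doc, dict) else None
    return metadata if isinstance(metadata, dict) else {}


def get_content_identity(json_path: Path) -> str:
    metadata = read_metadata(json_path)
    # 1. Prefer the data_hash recorded in the layer metadata.
    if metadata.get("data_hash"):
        return metadata["data_hash"]
    # 2. Fall back to hashing the companion CSV, if it sits next to the JSON.
    data_filename = metadata.get("data_filename")
    if data_filename and (json_path.parent / data_filename).is_file():
        return file_hash(json_path.parent / data_filename)
    # 3. Last resort: hash the JSON file itself (e.g. an empty stub).
    return file_hash(json_path)


with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "data_v001.csv").write_text("time,x\n0,1.0\n")
    csv_hash = file_hash(folder / "data_v001.csv")

    with_hash = folder / "a_v001.json"
    with_hash.write_text(json.dumps({"metadata": {"data_hash": "abc123"}}))
    csv_only = folder / "b_v001.json"
    csv_only.write_text(json.dumps({"metadata": {"data_filename": "data_v001.csv"}}))
    stub = folder / "c_v001.json"
    stub.touch()

    identities = [get_content_identity(p) for p in (with_hash, csv_only, stub)]
```

Since the identity ignores `data_filename` whenever a data hash is available, a JSON that was re-versioned to point at `v002.csv` still resolves to the same identity as its `v001` original.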
… identity

The companion CSV hash for calibration layers is now stored directly in
file.hash (via get_content_identity), making the separate data_file_hash
field in file_meta redundant. Pre-existing DB records have hash=csv_hash
so the dedup comparison reduces to the standard f.hash == identity_hash
check.
…r pairs

All tests now create genuine CalibrationLayer JSON+CSV pairs (with
computed data_hash) via a shared write_calibration_layer_pair helper
in tests/util/miscellaneous.py. Removed _generate_layer_json, empty
touch() stubs, and raw CSV strings.  Tests that need a half-written
datastore state (only-JSON, only-CSV) create a real pair then delete
the unwanted file.
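
For readers unfamiliar with the pattern, a helper of this kind might look like the sketch below. It is not the real `write_calibration_layer_pair`; the name `write_layer_pair`, the filename scheme, the JSON layout, and the use of SHA-256 are assumptions for illustration:

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_layer_pair(
    folder: Path, stem: str, rows: list[str], version: int = 1
) -> tuple[Path, Path]:
    """Write a CSV and a JSON that references it by name and by hash."""
    csv_path = folder / f"{stem}-data_v{version:03d}.csv"
    json_path = folder / f"{stem}_v{version:03d}.json"
    csv_path.write_text("\n".join(rows) + "\n")
    # Hash the bytes actually on disk so the JSON always matches the CSV.
    data_hash = hashlib.sha256(csv_path.read_bytes()).hexdigest()
    layer = {"metadata": {"data_filename": csv_path.name, "data_hash": data_hash}}
    json_path.write_text(json.dumps(layer, indent=2))
    return json_path, csv_path


# A half-written "only-JSON" datastore state: create a real pair, then delete the CSV.
with tempfile.TemporaryDirectory() as tmp:
    json_path, csv_path = write_layer_pair(
        Path(tmp), "layer", ["time,x,y,z", "0,1.0,2.0,3.0"]
    )
    csv_path.unlink()
    only_json_state = sorted(p.name for p in Path(tmp).iterdir())
    print(only_json_state)
```

Starting from a genuine pair means the surviving file still carries a valid `data_hash`, so the test exercises the same identity logic as production data rather than an empty stub.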
…handler

_companion_csv_path and get_content_identity now use
CalibrationLayer.from_file(load_contents=False) to read data_filename
and data_hash rather than raw json.loads, keeping the handler strongly
typed and consistent with the rest of the calibration layer API.
@alastairtree alastairtree marked this pull request as ready for review May 1, 2026 08:54
@alastairtree alastairtree requested a review from mhairifin May 1, 2026 08:54
@alastairtree
Collaborator Author

@mhairifin this is a second attempt to replace #265

@alastairtree
Collaborator Author

I have now tested this: it migrates all the layer files to the new format, and they are now deduped on save as expected. @mhairifin I am merging, but please do a review and I will pick up any fixes later.

@alastairtree alastairtree changed the title from "fix: layer data file version redirected to older version on second cal" to "feature: Layer files are deduplicated on creation based on the science data" May 1, 2026
@alastairtree alastairtree merged commit 96ea29c into main May 1, 2026
14 checks passed
@alastairtree alastairtree deleted the feat/layer-deduplication-at-save-time branch May 1, 2026 12:15