Releases: cedadev/padocc

Version 1.2 Allocations and Banding

12 Apr 13:24
e19c785
Pre-release

Updates for version 1.2:

  • Pipeline to Aggregate Data for Optimised Cloud Capabilities (padocc) - the new official name for the pipeline.

Assessor (addition 1.2.a)

  • Two new modes added! (match and status_log)
  • Added new display options! (allocations and bands now displayed)
  • Bug fixes:
    • merge_old_new - fixed an issue with indexing different types of lists. (1.2.a1)
    • cleanup - now able to delete allocation directories. (1.2.a2)
    • progress - now able to match multiple error types. (1.2.a3)

Allocation (addition 1.2.b)

  • Added allocations for compute processes, with time estimates made using binpacking - requires a specific flag to enable (see the sketch after this list).
  • Added general-purpose bands for rerunning different processes - past runs are inspected and extra time is added for previously failed jobs.
    • Default time values are used for each phase, unless --allow-band-increase is enabled, in which case previous runs are considered.
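A minimal sketch of how binpacking-based time allocation could look, using the binpacking package from PyPI; the project codes, time estimates and number of allocations below are illustrative, not taken from the pipeline:

```python
# Illustrative sketch: group datasets into job allocations by estimated
# processing time using the binpacking package (pip install binpacking).
import binpacking

# Hypothetical estimated compute time (minutes) per project code,
# e.g. derived from the scan phase.
estimates = {
    "proj_code_A": 42,
    "proj_code_B": 7,
    "proj_code_C": 90,
    "proj_code_D": 15,
    "proj_code_E": 33,
}

# Pack into a fixed number of allocations so each array task receives a
# roughly equal total workload.
allocations = binpacking.to_constant_bin_number(estimates, 2)

for i, alloc in enumerate(allocations):
    total = sum(alloc.values())
    print(f"Allocation {i}: {sorted(alloc)} (~{total} min)")
```

Each resulting allocation can then be written out as a project-code list for the deployment step.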

Documentation (addition 1.2.c)

  • Added developer's guide for adding new features!
  • Updated flag listings for all tool scripts.

Group Run (addition 1.2.d)

  • Added default times for different phases
  • Added a deployment function for launching multiple array jobs from within a single call! Allocations and bands can now be deployed (there is currently no limit to the number of array jobs that can be deployed simultaneously) - see the sketch after this list.
  • Added a pre-deployment input prompt to check deployments are as expected.
  • Minor bug fixes
    • Verbose flag now carries over to subprocesses (1.2.d1)
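A hedged sketch of what deploying several array jobs from one call might look like; the worker script name, file layout, prompt and sbatch options are assumptions, not padocc's actual interface:

```python
# Illustrative sketch: submit one SLURM array job per allocation from a
# single deployment call, with a pre-deployment confirmation prompt.
import subprocess

def deploy_allocations(allocation_dirs, time_limit="01:00:00", memory="2G"):
    """Submit an sbatch array job for each allocation directory (hypothetical)."""
    # Pre-deployment check, mirroring the new input requirement.
    if input(f"Deploy {len(allocation_dirs)} array jobs? (y/n) ") != "y":
        return []

    job_ids = []
    for alloc_dir in allocation_dirs:
        # One array task per project code listed in the allocation.
        with open(f"{alloc_dir}/proj_codes.txt") as f:
            n_tasks = sum(1 for _ in f)
        cmd = [
            "sbatch",
            f"--array=0-{n_tasks - 1}",
            f"--time={time_limit}",
            f"--mem={memory}",
            "run_phase.sh",   # hypothetical worker script
            alloc_dir,
        ]
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        job_ids.append(result.stdout.strip())
    return job_ids
```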

Compute (addition 1.2.e)

  • Added Zarr processor!
  • Rearranged all processors with new names and class inheritance.
  • Added a ProjectProcessor parent class from which KerchunkDSProcessor now inherits!
  • New functions for checking variable shapes and determining behaviour, which helps optimise processes within the pipeline (see the sketch below).
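As an illustration of the kind of shape checking involved, the sketch below separates variables that carry the aggregation dimension from those that do not; the function name and xarray-based approach are assumptions rather than the pipeline's actual code:

```python
# Illustrative sketch: compare variable dimensions across a sample of source
# files to decide which variables are identical across the aggregation
# dimension and which need concatenating.
import xarray as xr

def classify_variables(sample_files, concat_dim="time"):
    identical, concat = set(), set()
    for path in sample_files:
        with xr.open_dataset(path) as ds:
            for name, var in ds.variables.items():
                if concat_dim in var.dims:
                    concat.add(name)
                else:
                    identical.add(name)
    # Variables never carrying the concat dimension can be treated as
    # identical_dims and skipped when combining references.
    return sorted(identical - concat), sorted(concat)
```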

Errors (addition 1.2.f)

  • Added NaNComparisonError - for consistent issues with comparing arrays (1.2.f1)
  • Added RemoteProtocolError - if the remote protocol cannot be handled properly (1.2.f2)
  • Added SourceNotFoundError - for resources that failed to open (1.2.f3)
  • Added ArchiveConnectionError - raised when fsspec's ReferenceNotReachable persists across multiple tries (1.2.f4)
  • Added KerchunkDecodeError - issue opening Kerchunk file (normally time decode related) (1.2.f5)
  • Added FullsetRequiredError - raised instead of risking a timeout in validation (1.2.f6)
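Only the class names above come from the release; the definitions below are a sketch of how such exceptions might be declared, with illustrative docstrings:

```python
# Sketch of the new error classes; messages are illustrative, only the
# class names are taken from the release notes.
class NaNComparisonError(Exception):
    """Raised when NaN masks differ between compared arrays."""

class RemoteProtocolError(Exception):
    """Raised when the remote protocol cannot be handled properly."""

class SourceNotFoundError(Exception):
    """Raised when a source resource fails to open."""

class ArchiveConnectionError(Exception):
    """Raised after repeated fsspec ReferenceNotReachable failures."""

class KerchunkDecodeError(Exception):
    """Raised when a Kerchunk file cannot be opened (often time decoding)."""

class FullsetRequiredError(Exception):
    """Raised instead of risking a timeout during subset validation."""
```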

Index_Cat (addition 1.2.g)

  • Initialised script for later use in pushing Kerchunk records to an index.

Ingest (addition 1.2.h)

  • Initialised script with some basic functions to use when ingesting data into the CEDA archive; also checks that download links have been added properly.

Init (addition 1.2.i)

  • Updated docstrings

Logs (addition 1.2.j)

  • Added log_status fetch function
  • Updated init_logger to ensure a file handler exists (see the sketch below).
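A minimal sketch of a logger initialiser that guarantees a file handler is attached; the signature, log path and format string are assumptions:

```python
# Illustrative sketch of a logger initialiser that only attaches a file
# handler if one is not already present, so repeated calls (e.g. multiple
# processes in one job) do not duplicate output.
import logging

def init_logger(name, logfile, verbose=0):
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)

    has_filehandler = any(
        isinstance(h, logging.FileHandler) for h in logger.handlers
    )
    if not has_filehandler:
        fh = logging.FileHandler(logfile)
        fh.setFormatter(logging.Formatter("%(asctime)s [%(levelname)s] %(message)s"))
        logger.addHandler(fh)
    return logger
```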

Scan (addition 1.2.k)

  • Removed unused function eval_sizes
  • Altered scan setup to use instances of processor classes.
  • Added new detail-cfg attributes!
  • Added override_type for specifying Zarr as an output type.

Utils (addition 1.2.l)

  • Added new switches to the BypassSwitch class (fasttrack, skip link addition).
  • Reconfigured the remote protocol option in open_kerchunk.
  • Added a function specifically for fetching the blacklist.
  • Added get/set_last_run routines for band increases if jobs time out.
  • Added find_divisor and find_closest routines for use in allocations.
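The routine names come from the release notes; the behaviour sketched below is an assumption about what such allocation helpers typically do:

```python
# Illustrative sketch of possible allocation helpers; only the names come
# from the release notes, the behaviour shown here is assumed.
def find_divisor(total, max_size):
    """Find the largest divisor of `total` that is no bigger than `max_size`."""
    for d in range(max_size, 0, -1):
        if total % d == 0:
            return d
    return 1

def find_closest(values, target):
    """Return the element of `values` closest to `target`."""
    return min(values, key=lambda v: abs(v - target))

print(find_divisor(120, 32))           # 30
print(find_closest([10, 20, 40], 27))  # 20
```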

Validate (addition 1.2.m)

  • Integration of new errors
  • Multiple attempts at fetching Kerchunk/Xarray data, with different options if required (see the sketch below).
  • Added array flattening at the point of checking NaN values; the flattened arrays are then reused throughout, and combined with the new error codes for unreachable chunks this means that once the data has been fetched successfully it can be kept and used for all tests.
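A hedged sketch of the retry-and-flatten pattern described above, using fsspec reference loading via xarray's zarr engine; the option sets and helper names are illustrative, not the pipeline's own functions:

```python
# Illustrative sketch: retry opening a Kerchunk reference with different
# storage options, then flatten arrays once so NaN masks and value checks
# reuse the same fetched data.
import numpy as np
import xarray as xr

def open_with_retries(ref_file, option_sets):
    """Try each set of storage options in turn until one succeeds."""
    last_err = None
    for opts in option_sets:
        try:
            return xr.open_dataset(
                "reference://",
                engine="zarr",
                backend_kwargs={
                    "consolidated": False,
                    "storage_options": {"fo": ref_file, **opts},
                },
            )
        except Exception as err:
            last_err = err
    raise last_err

def compare_variable(kerchunk_var, source_var):
    k = np.asarray(kerchunk_var).flatten()
    s = np.asarray(source_var).flatten()
    # Compare NaN masks first, then values where both are finite.
    if not np.array_equal(np.isnan(k), np.isnan(s)):
        raise ValueError("NaN masks differ between Kerchunk and source data")
    valid = ~np.isnan(k)
    return np.allclose(k[valid], s[valid])
```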

Notebooks (addition 1.2.n)

  • Renamed simple scan notebook.
  • Initialised pipeline test notebook.

Single Run (addition 1.2.o)

  • Reconfigured how allocations/bands/subsets work for single/multiple processes.
  • Reconfigured logger creation when dealing with multiple processes in a single job.
  • Added override_type flag for compute phase.

Version 1.1 PPC Error Tracking

08 Mar 15:58
80232c3
Pre-release

New Software Features:

  • Per Project Code (PPC) Error tracking
  • Individual log files for each dataset, updated automatically via file-handler updates built into the pipeline, in preparation for upcoming job allocation improvements.
  • Scanning improvements and identification of types of dimensions.
  • Support for virtual dimension additions (file_number) - see the sketch after this list.
  • BypassSwitch option and default changes.
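A sketch of what adding a virtual file_number dimension can look like in plain xarray; the pipeline itself builds Kerchunk references rather than loading data this way, so this is illustrative only:

```python
# Illustrative sketch: attach the index of each source file as a new
# virtual dimension (file_number) before concatenating.
import xarray as xr

def concat_with_file_number(paths):
    datasets = []
    for i, path in enumerate(paths):
        ds = xr.open_dataset(path)
        # Add the source-file index as a coordinate along a new dimension.
        ds = ds.expand_dims(file_number=[i])
        datasets.append(ds)
    return xr.concat(datasets, dim="file_number")
```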

New Documentation:

  • Documentation Updates
  • Example CCI Water Vapour files and tutorial
  • Kerchunk Powerpoints

v1.0.2

16 Feb 12:27
d1f5a07
Pre-release

Version 1.0.2

  • Major documentation overhaul
  • Added BypassSwitch for better control of switch options
  • Added features to Assessor:
  1. Blacklist and proj_code list concatenation with existing files.
  2. SLURM error type recognition and better labelling.

Alpha 1.0.1

13 Feb 16:22
Pre-release

Version 1.0.1

  • errors.py - contains all custom error classes for Kerchunk pipeline
  • logs.py - contains logger content and other utils
  • setup.py - installs pipeline scripts into environment
  • validate.py - thorough validation added (try subset, then try full fileset)

Bug Fixes:

  • DAP link in cached files: the pipeline now writes cache files BEFORE concatenation to ensure no linkage.
  • Dimension loading issue: a known issue when remote_protocol is set - there is now greater distinction between pre-concatenation and post-concatenation Kerchunk files.

Features added:

  • Memory flag: Assign specific memory to parallel job arrays
  • In-place validation: Pipeline allows rechecking of 'complete' datasets.
  • Multi-dimension concatenation support
  • identical_dims identification within the compute phase.
  • Added custom error classes for edge cases.

Alpha 1.0

17 Jan 11:08
b76eb7d
Pre-release

First release (alpha) of the CEDA Kerchunk Pipeline. Includes the Init, Scan, Compute and Validate phases, but no Catalog/Ingest control solution. Kerchunk files created by the pipeline are fully verified, so results obtained via the Kerchunk method will be the same as those read directly from the NetCDF sources.