zarr-python's consolidated metadata implementation violates the spec #371

@d-v-b

Previously, additional fields in metadata were allowed as long as they were JSON objects containing the key-value pair must_understand: false. Zarr-python's consolidated metadata implementation complied with this requirement (see script below).

The recent redefinition of extra fields in metadata documents added the requirement that such extra fields have a name key whose value is a string. zarr-python's consolidated metadata does not contain a name key, and so it is out of spec.

As consolidated metadata is used heavily by xarray users, this is a very high-impact change. The recent spec refactor has thus made many (most?) zarr v3 xarray datasets technically out of spec.

# /// script
# dependencies = [
#   "zarr==3.1.0",
# ]
# ///

import zarr
from pprint import pprint
import json

store = {}
zarr.create_group(store)
consolidated = zarr.consolidate_metadata(store)
pprint(json.loads(store["zarr.json"].to_bytes()))
"""
{'attributes': {},
 'consolidated_metadata': {'kind': 'inline',
                           'metadata': {},
                           'must_understand': False},
 'node_type': 'group',
 'zarr_format': 3}
"""
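
To make the mismatch concrete, here is a minimal sketch that checks the consolidated_metadata value printed above against the two rules as this issue describes them. The helper names `allowed_before` and `allowed_after` are hypothetical (they are not part of zarr-python), and both rules are paraphrased from this issue rather than quoted from the spec text.

```python
# Hypothetical helpers; the rules are paraphrased from this issue,
# not copied from the spec text.

def allowed_before(field):
    """Old rule: a JSON object carrying must_understand: false."""
    return isinstance(field, dict) and field.get("must_understand") is False


def allowed_after(field):
    """New rule as described in this issue: the object needs a string `name`."""
    return isinstance(field, dict) and isinstance(field.get("name"), str)


# The value zarr-python 3.1.0 wrote in the script output above.
consolidated_metadata = {"kind": "inline", "metadata": {}, "must_understand": False}

print(allowed_before(consolidated_metadata))  # True
print(allowed_after(consolidated_metadata))   # False
```

The same object passes the old check and fails the new one only because it has no name key.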

I think we should treat this as a regression in the spec. A fix could be:

  • clarify that readers may ignore any additional field that is a JSON object containing the key-value pair must_understand: false, no matter what other keys that object has.
  • remove the requirement that top-level extra fields in array / group metadata objects have a name field if they are JSON objects.

Without these two changes, or changes that achieve the same effect, a large volume of zarr data is out of spec, and we need to fix that.
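
For illustration, a reader following the proposed rule could sort extra fields roughly as sketched below. This is an assumption-laden sketch, not zarr-python's actual behavior: `split_extra_fields` and `CORE_GROUP_KEYS` are hypothetical names, the set of core group keys is assumed from the script output above, and `some_extension` is a made-up field.

```python
# Assumption: these are the core top-level keys of a v3 group metadata document.
CORE_GROUP_KEYS = {"zarr_format", "node_type", "attributes"}


def split_extra_fields(doc, core_keys=CORE_GROUP_KEYS):
    """Return (ignorable, blocking) lists of extra top-level field names.

    Proposed rule: an extra field is ignorable if it is a JSON object with
    must_understand: false, regardless of its other keys (no `name` required).
    """
    ignorable, blocking = [], []
    for key, value in doc.items():
        if key in core_keys:
            continue
        if isinstance(value, dict) and value.get("must_understand") is False:
            ignorable.append(key)
        else:
            blocking.append(key)
    return ignorable, blocking


doc = {
    "attributes": {},
    "consolidated_metadata": {"kind": "inline", "metadata": {}, "must_understand": False},
    "some_extension": {"name": "example"},  # hypothetical field without must_understand: false
    "node_type": "group",
    "zarr_format": 3,
}

print(split_extra_fields(doc))  # (['consolidated_metadata'], ['some_extension'])
```

Under this rule, existing consolidated metadata stays readable by implementations that do not understand it, while fields that do not opt out via must_understand: false still have to be understood.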
