JSON not properly decoded by backends #415

Open
@TomNicholas

Kerchunk doesn't properly decode the JSON for zarr array-level attributes, instead leaving dictionaries as long strings. For example:

import xarray as xr
import kerchunk.hdf

# create example netCDF4 file
xr.tutorial.open_dataset('air_temperature').to_netcdf('air.nc')

# translate it into a kerchunk reference dictionary
kerchunk.hdf.SingleHdf5ToZarr('air.nc', inline_threshold=300).translate()
{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  '.zattrs': '{"Conventions":"COARDS","description":"Data is from NMC initialized reanalysis\\n(4x\\/day).  These are the 0.9950 sigma level values.","platform":"Model","references":"http:\\/\\/www.esrl.noaa.gov\\/psd\\/data\\/gridded\\/data.ncep.reanalysis.html","title":"4x daily NMC reanalysis (1948)"}',
  'air/.zarray': '{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}',
  'air/.zattrs': '{"GRIB_id":11,"GRIB_name":"TMP","_ARRAY_DIMENSIONS":["time","lat","lon"],"actual_range":[185.16000366210938,322.1000061035156],"dataset":"NMC Reanalysis","level_desc":"Surface","long_name":"4xDaily Air temperature at sigma level 995","parent_stat":"Other","precision":2,"scale_factor":0.01,"statistic":"Individual Obs","units":"degK","var_desc":"Air temperature"}',
  'air/0.0.0': ['air.nc', 15419, 7738000],
  'lat/.zarray': '{"chunks":[25],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[25],"zarr_format":2}',
  'lat/.zattrs': '{"_ARRAY_DIMENSIONS":["lat"],"axis":"Y","long_name":"Latitude","standard_name":"latitude","units":"degrees_north"}',
  'lat/0': ['air.nc', 5179, 100],
  'lon/.zarray': '{"chunks":[53],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[53],"zarr_format":2}',
  'lon/.zattrs': '{"_ARRAY_DIMENSIONS":["lon"],"axis":"X","long_name":"Longitude","standard_name":"longitude","units":"degrees_east"}',
  'lon/0': ['air.nc', 5279, 212],
  'time/.zarray': '{"chunks":[2920],"compressor":null,"dtype":"<f4","fill_value":"NaN","filters":null,"order":"C","shape":[2920],"zarr_format":2}',
  'time/.zattrs': '{"_ARRAY_DIMENSIONS":["time"],"calendar":"standard","long_name":"Time","standard_name":"time","units":"hours since 1800-01-01"}',
  'time/0': ['air.nc', 7757515, 11680]}}

Notice that this is only partially decoded: the top two levels are nested Python dictionaries, but below that the various zarr metadata entries (`.zgroup`, `.zarray`, `.zattrs`) are stored as long JSON strings, e.g.:

'{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}'

This seems silly. Why not decode the whole thing properly at the beginning, so that it can always be treated as a nested Python dictionary? (Or, even better, use a dedicated abstraction like the one suggested in #375.)
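
To make the request concrete, here is a minimal sketch of what full decoding could look like when done as a post-processing step on the `refs` dictionary. `decode_refs` is a hypothetical helper written for this illustration and is not part of kerchunk. It assumes that only keys ending in `.zgroup`, `.zarray` or `.zattrs` hold JSON, since other string values may be inlined chunk data.

```python
import json

# Key names assumed to hold JSON metadata; anything else is left untouched,
# because string values under chunk keys may be inlined data rather than JSON.
METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


def decode_refs(refs):
    """Return a copy of `refs` with JSON metadata strings parsed into dicts.

    Hypothetical helper for illustration only; not a kerchunk function.
    """
    decoded = {}
    for key, value in refs.items():
        name = key.rsplit("/", 1)[-1]  # last path component, e.g. ".zarray"
        if isinstance(value, str) and name in METADATA_NAMES:
            decoded[key] = json.loads(value)
        else:
            decoded[key] = value
    return decoded


# A small subset of the output shown above.
refs = {
    ".zgroup": '{"zarr_format":2}',
    "air/.zarray": '{"chunks":[2920,25,53],"compressor":null,"dtype":"<i2","fill_value":null,"filters":null,"order":"C","shape":[2920,25,53],"zarr_format":2}',
    "air/0.0.0": ["air.nc", 15419, 7738000],
}

decoded = decode_refs(refs)
print(decoded["air/.zarray"]["chunks"])
```

With this, `decoded["air/.zarray"]` is an ordinary dictionary, while chunk references such as `air/0.0.0` pass through unchanged.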
