BUG: read_hdf converting literal 'nan' string in Index to NaN (GH-9604)#65603

Open
jbrockmendel wants to merge 1 commit into pandas-dev:main from jbrockmendel:bug-9604

@jbrockmendel
Member

Summary

  • closes BUG: Index not respecting nan_rep in HDF5 serialiazation #9604
  • _unconvert_string_array used to silently default nan_rep=None to "nan", replacing every literal "nan" in the data with np.nan. The two index read paths (fixed-format _unconvert_index and table-format _get_converter) always called it with nan_rep=None, but the corresponding write path for indices never substitutes a NaN sentinel — so the substitution was an unconditional one-way data corruption for string indices.
  • Make nan_rep=None mean "no substitution" in _unconvert_string_array, and have DataCol.convert pass "nan" explicitly when the persisted attr is missing, so old files where nan_rep wasn't written still read correctly on the data-column path (which is symmetric).
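
To make the change in semantics concrete, here is a minimal pure-Python sketch of the old and new read behaviour described above. It is a simplified model, not the pandas implementation: `unconvert_old`, `unconvert_new`, and `read_data_column` are hypothetical stand-ins for `_unconvert_string_array` before and after this PR and for the fallback in `DataCol.convert`, and plain lists stand in for the stored arrays.

```python
import math


def unconvert_old(values, nan_rep=None):
    # Hypothetical model of the old behaviour: None silently becomes "nan".
    if nan_rep is None:
        nan_rep = "nan"
    return [math.nan if v == nan_rep else v for v in values]


def unconvert_new(values, nan_rep=None):
    # Hypothetical model of the new behaviour: None means "no substitution".
    if nan_rep is None:
        return list(values)
    return [math.nan if v == nan_rep else v for v in values]


def read_data_column(values, stored_nan_rep=None):
    # Data-column path: if the persisted attribute is missing (old files),
    # pass the legacy default "nan" explicitly.
    nan_rep = "nan" if stored_nan_rep is None else stored_nan_rep
    return unconvert_new(values, nan_rep=nan_rep)


# A string index is written without any NaN sentinel, so "nan" here is data.
index_on_disk = ["a", "nan", "b"]

old_index = unconvert_old(index_on_disk)  # literal "nan" is turned into NaN
new_index = unconvert_new(index_on_disk)  # literal "nan" is preserved

# An old file's data column with no stored nan_rep still maps "nan" to NaN.
legacy_col = read_data_column(["x", "nan"])
```

In this model the index read path calls `unconvert_new` with no `nan_rep`, which matches the index write path never substituting a sentinel, while the data-column path keeps its legacy default.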

Test plan

  • New regression tests in pandas/tests/io/pytables/test_round_trip.py:
    • test_string_nan_in_index_fixed — literal "nan" in a Series index, fixed format
    • test_string_nan_in_index_table — same, table format with a custom nan_rep
    • test_string_nan_in_dataframe_index — DataFrame index, both formats
    • test_string_column_literal_nan_and_real_nan — companion test that pins down the symmetric data-column path: a custom nan_rep round-trips both a literal "nan" and an actual NaN
  • Full pandas/tests/io/pytables/ suite passes locally (565 passed, 1 unrelated skip)
  • pre-commit clean
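
The property that the companion data-column test pins down can be sketched without HDF5 at all. The following is an illustrative model only, assuming a hypothetical sentinel `"_missing_"` and hypothetical helpers `write_column` and `read_column`; the real tests go through the pandas HDF5 reader and writer rather than these functions.

```python
import math


def write_column(values, nan_rep):
    # Hypothetical writer: replace real NaN with the sentinel string.
    return [
        nan_rep if isinstance(v, float) and math.isnan(v) else v
        for v in values
    ]


def read_column(stored, nan_rep):
    # Hypothetical reader: map the sentinel string back to NaN.
    return [math.nan if v == nan_rep else v for v in stored]


# One literal "nan" string, one real NaN, one ordinary value.
original = ["nan", math.nan, "x"]

stored = write_column(original, nan_rep="_missing_")
restored = read_column(stored, nan_rep="_missing_")
```

Because the writer and reader agree on the sentinel, the literal "nan" and the real NaN stay distinguishable after the round trip. The index path has no such writer-side step, which is why substituting on read was lossy there.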

The unconvert path for string indices applied a NaN-sentinel substitution
the writer never performed, so a literal "nan" in a string Index was
silently replaced with float NaN on read. Make _unconvert_string_array
treat nan_rep=None as "no substitution" and preserve the legacy default
for the data-column path explicitly.