BUG: read_hdf converting literal 'nan' string in Index to NaN (GH-9604) #65603
Open
jbrockmendel wants to merge 1 commit into
The unconvert path for string indices applied a NaN-sentinel substitution the writer never performed, so a literal "nan" in a string Index was silently replaced with float NaN on read. Make _unconvert_string_array treat nan_rep=None as "no substitution" and preserve the legacy default for the data-column path explicitly.
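The intended semantics can be illustrated with a small stand-alone sketch. This is not the pandas implementation: `unconvert_strings` is a hypothetical helper written only to show the difference between `nan_rep=None` (no substitution, as on the index path) and an explicit sentinel (as on the data-column path).

```python
import numpy as np


def unconvert_strings(values, nan_rep=None):
    """Hypothetical stand-in for the read-side string conversion.

    nan_rep=None means "no substitution"; any other value is treated
    as the sentinel that the writer used for missing entries.
    """
    out = np.asarray(values, dtype=object).copy()
    if nan_rep is not None:
        # Only substitute when the caller names a sentinel explicitly.
        mask = np.array([v == nan_rep for v in out], dtype=bool)
        out[mask] = np.nan
    return out


# Index path: the writer never substituted a sentinel, so nothing is replaced.
index_labels = unconvert_strings(["a", "nan", "b"])

# Data-column path: the legacy default "nan" is passed explicitly.
column_values = unconvert_strings(["a", "nan", "b"], nan_rep="nan")
```

Under this sketch the index labels keep the literal string "nan", while the column values carry a float NaN in the middle position.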
Summary
- _unconvert_string_array used to silently default nan_rep=None to "nan", replacing every literal "nan" in the data with np.nan. The two index read paths (fixed-format _unconvert_index and table-format _get_converter) always called it with nan_rep=None, but the corresponding write path for indices never substitutes a NaN sentinel, so the substitution was an unconditional one-way data corruption for string indices.
- Make nan_rep=None mean "no substitution" in _unconvert_string_array, and have DataCol.convert pass "nan" explicitly when the persisted attr is missing, so old files where nan_rep wasn't written still read correctly on the data-column path (which is symmetric).

Test plan
- pandas/tests/io/pytables/test_round_trip.py:
  - test_string_nan_in_index_fixed: literal "nan" in a Series index, fixed format
  - test_string_nan_in_index_table: same, table format with a custom nan_rep
  - test_string_nan_in_dataframe_index: DataFrame index, both formats
  - test_string_column_literal_nan_and_real_nan: companion test that pins down the symmetric data-column path: a custom nan_rep round-trips both a literal "nan" and an actual NaN
- pandas/tests/io/pytables/ suite passes locally (565 passed, 1 unrelated skip)
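For reviewers who want to try the index case by hand, a round trip along the following lines may help. It is a rough sketch rather than the test code in this PR: `roundtrip_index_labels` is a hypothetical helper, it needs the optional PyTables dependency, and on a pandas build without this change the middle label is expected to come back as NaN rather than the string "nan".

```python
import os
import tempfile

import pandas as pd


def roundtrip_index_labels(fmt="fixed", key="ser"):
    """Write a Series with a literal "nan" index label and read it back.

    Hypothetical helper for manual checking; requires PyTables.
    Returns the index labels of the Series read from disk.
    """
    ser = pd.Series([1, 2, 3], index=["a", "nan", "b"])
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "roundtrip.h5")
        ser.to_hdf(path, key=key, format=fmt)
        result = pd.read_hdf(path, key)
    return list(result.index)
```

Calling `roundtrip_index_labels("fixed")` and `roundtrip_index_labels("table")` and inspecting the middle label shows whether the literal string survived the round trip.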