Reset index before returning dataframe #805
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
@@ -110,7 +110,9 @@ def _read_file_content(self, encoding_format: str, file: Path) -> pd.DataFrame:
return pd.read_json(file, lines=True)
elif encoding_format == EncodingFormat.PARQUET:
try:
- return pd.read_parquet(file)
+ df = pd.read_parquet(file)
+ df.reset_index(inplace=True)
Either:
df.reset_index(inplace=True)
or:
df = df.reset_index()
because we use this style a lot.
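To illustrate the two styles suggested above, here is a minimal sketch. The helper names and the sample data are hypothetical and not taken from mlcroissant; the sketch only shows that both forms give the same frame, and that combining them (assigning the result of an in-place call) yields None.

```python
import pandas as pd


def make_indexed_frame() -> pd.DataFrame:
    # Hypothetical sample data with "id" used as the index.
    return pd.DataFrame({"id": [10, 11], "text": ["a", "b"]}).set_index("id")


def reset_in_place(df: pd.DataFrame) -> pd.DataFrame:
    # Style 1: mutate the frame; the call itself returns None.
    df.reset_index(inplace=True)
    return df


def reset_by_reassignment(df: pd.DataFrame) -> pd.DataFrame:
    # Style 2: reset_index() returns a new frame that we rebind.
    return df.reset_index()


in_place = reset_in_place(make_indexed_frame())
reassigned = reset_by_reassignment(make_indexed_frame())

# Mixing the two styles loses the frame: inplace=True returns None.
mixed = make_indexed_frame().reset_index(inplace=True)
```

Either style works on its own; the thing to avoid is assigning the return value of the in-place call.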
@@ -110,7 +110,9 @@ def _read_file_content(self, encoding_format: str, file: Path) -> pd.DataFrame:
return pd.read_json(file, lines=True)
elif encoding_format == EncodingFormat.PARQUET:
try:
- return pd.read_parquet(file)
+ df = pd.read_parquet(file)
Maybe you can add a comment saying something like: "Sometimes the author already set an index in the Parquet file, so we reset it to always have the same format."
Done, thanks! Also updated the PR description with more context.
Thanks!
Sometimes, when processing Parquet-based datasets with pandas, the index column is loaded as the dataframe's index, even though it should be a regular column. The column is then missing from the dataframe's columns, which results in an error at the column lookup step in the ReadFields operation:
croissant/python/mlcroissant/mlcroissant/_src/operation_graph/operations/field.py
Line 219 in f2d0cfd
Example dataset where this happens: https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
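The failure described above can be sketched without reading a real file. This is an illustration under stated assumptions: the frame is built by hand to mimic what pd.read_parquet may return when the Parquet file stores a pandas index, and the column names "id" and "passage" are hypothetical rather than taken from the example dataset.

```python
import pandas as pd


def make_frame_like_parquet_with_index() -> pd.DataFrame:
    # Assumption: stands in for pd.read_parquet(file) on a file whose
    # author saved "id" as the index. Data and names are hypothetical.
    df = pd.DataFrame({"id": [0, 1, 2], "passage": ["a", "b", "c"]})
    return df.set_index("id")


def has_column(df: pd.DataFrame, name: str) -> bool:
    # A column lookup only sees regular columns, not the index.
    return name in df.columns


df = make_frame_like_parquet_with_index()
found_before = has_column(df, "id")

# Looking up the index name as a column raises KeyError.
try:
    df["id"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Resetting the index turns "id" back into a regular column.
df = df.reset_index()
found_after = has_column(df, "id")
```

One thing to keep in mind with this approach: when the index has no name, reset_index() adds a column called "index" instead, so frames that never had a meaningful index gain an extra column.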