Reset index before returning dataframe #805

ccl-core · 2025-02-11T14:33:17Z

When pandas reads some Parquet-based datasets, a column that the file stores as the index becomes the dataframe's index instead of a regular column. The column lookup step in the ReadFields operation then fails, because the assertion only checks the dataframe's columns:

croissant/python/mlcroissant/mlcroissant/_src/operation_graph/operations/field.py

Line 219 in f2d0cfd

assert column in df, (

Example dataset where this happens: https://huggingface.co/datasets/rag-datasets/rag-mini-wikipedia
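
A minimal sketch of the failure mode described above, for illustration only. It does not read the linked dataset; the column names `id` and `question` are hypothetical, and `set_index` stands in for what `pd.read_parquet` returns when the file records a column as the pandas index.

```python
import pandas as pd

# Hypothetical data: "id" plays the role of a column that the dataset
# author stored as the index in the Parquet file.
df = pd.DataFrame({"id": [10, 11], "question": ["a?", "b?"]}).set_index("id")

# `column in df` only checks column labels, so a column that lives in
# the index is not found. This is the kind of check that fails.
id_found_before = "id" in df

# reset_index() moves the named index back into a regular column.
restored = df.reset_index()
id_found_after = "id" in restored
columns_after = list(restored.columns)
```

After the reset, `id` is an ordinary column again and the lookup succeeds.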

github-actions · 2025-02-11T14:33:35Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

marcenacp · 2025-02-11T14:46:35Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py

@@ -110,7 +110,9 @@ def _read_file_content(self, encoding_format: str, file: Path) -> pd.DataFrame:
                return pd.read_json(file, lines=True)
            elif encoding_format == EncodingFormat.PARQUET:
                try:
-                    return pd.read_parquet(file)
+                    df = pd.read_parquet(file)
+                    df.reset_index(inplace=True) 


Either:

df.reset_index(inplace=True)

or

df = df.reset_index()

because we use this style a lot
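
For reference, a small sketch comparing the two styles mentioned in this comment. The frame contents and the helper `make_frame` are hypothetical. Both forms produce the same dataframe; the in-place call returns `None`, so its result should not be assigned.

```python
import pandas as pd


def make_frame() -> pd.DataFrame:
    # Hypothetical frame with a named index, as a Parquet read might produce.
    return pd.DataFrame({"value": [1, 2]}, index=pd.Index(["a", "b"], name="key"))


# Style 1: mutate the existing object; the call itself returns None.
df_inplace = make_frame()
result = df_inplace.reset_index(inplace=True)

# Style 2: rebind the name to the new dataframe returned by reset_index().
df_assigned = make_frame()
df_assigned = df_assigned.reset_index()

# Both styles end with "key" as a regular column and a default integer index.
same = df_inplace.equals(df_assigned)
```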

marcenacp · 2025-02-11T14:46:48Z

python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py

@@ -110,7 +110,9 @@ def _read_file_content(self, encoding_format: str, file: Path) -> pd.DataFrame:
                return pd.read_json(file, lines=True)
            elif encoding_format == EncodingFormat.PARQUET:
                try:
-                    return pd.read_parquet(file)
+                    df = pd.read_parquet(file)


Maybe you can add a comment saying something like: "Sometimes the author already set an index in Parquet, so we want to reset it to always have the same format."

Done, thanks! Also updated the PR description with more context.

marcenacp

Thanks!

Reset index before returning dataframe

57d345b

ccl-core requested a review from a team as a code owner February 11, 2025 14:33

marcenacp reviewed Feb 11, 2025


marcenacp approved these changes Feb 11, 2025


ccl-core added 2 commits February 11, 2025 21:11

Add comment and fix format.

7356438

Fix isort

8c403df

ccl-core merged commit 5f71dbe into main Feb 11, 2025
11 of 12 checks passed

github-actions bot locked and limited conversation to collaborators Feb 11, 2025
