Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use narwhals to support Polars, cuDF, Modin, etc. #388

Merged
merged 40 commits into from
Oct 25, 2024
Merged

Conversation

MarcAntoineSchmidtQC
Copy link
Member

Checklist

  • Added a CHANGELOG.rst entry

pixi.lock Outdated
@@ -368,7 +371,7 @@ environments:
- conda: https://conda.anaconda.org/conda-forge/linux-64/binutils_impl_linux-64-2.40-ha1999f0_7.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/binutils_linux-64-2.40-hb3c18ed_1.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/bzip2-1.0.8-h4bc722e_7.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/c-ares-1.33.1-heb4867d_0.conda
- conda: https://conda.anaconda.org/conda-forge/linux-64/c-ares-1.32.3-h4bc722e_0.conda
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why it's asking to downgrade here

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC marked this pull request as ready for review September 12, 2024 18:33
@stanmart
Copy link
Collaborator

stanmart commented Sep 12, 2024

I think this will raise an exception if neither pandas nor polars is installed (which is now a posibility):

else:
indices, categories = pd.factorize(cat_vec, sort=True)

We could replace it with some numpy-based solution, which is probably not very efficient, but that's probably fine for now. E.g.,

    else:
        categories, indices = np.unique(cat_vec.to_numpy(), return_inverse=True)

@stanmart
Copy link
Collaborator

Same here, but the solution might be less straightforward for the non-pandas-or-polars case:

return pd.Categorical.from_codes(self.indices, categories=self.categories)

Maybe we could say that the property is deprecated and is only there for backwards compatibility, so we are not implementing it for non-pandas (or polars) input?

@stanmart
Copy link
Collaborator

I don't think we need this: anymore

try:
import polars as pl
except ImportError:
pl = None # type: ignore

Copy link
Collaborator

@stanmart stanmart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! All that remains is fixing these couple of issues in categorical_matrix.py.

It might also be nice to have a test for from_df which is not polars or pandas. Pyarrow is test dependency because of polars already; would it be okay if I added a test for pyarrow dataframes?

@stanmart
Copy link
Collaborator

I made the following updates to fix the issues above:

Deconstructing the cat argument of CatagoricalMatrix.__init__:

  • When the constructor of CategoricalMatrix gets a pandas or polars input, it uses @lbittarello's solution
  • In any other case, it uses a pure numpy fallback. It is necessary as we cannot assume that pandas will be available. It is not super efficient when the input is already a (e.g. pyarrow) categorical, as we convert it into a string vector, then reconstruct the incdices/categories representation again. However, narwhals does not have a way to access the underlying representation of a categorical, and I'm not too keen on adding more custom cases besides pandas and polars.

The cat property

  • When the CategoricalMatrix was created based on a series that narwhals supports, use narwhals.new_series to create a categorical series of the same type. Special case pandas for performance reasons and differing behavior, as before.
  • When the CategoricalMatrix is created from a list, numpy array or similar, just return a dictionary with indices and categories. We cannot return a categorical as it is not guaranteed that any dataframe library is available.
  • As @lbittarello mentioned in the discussion of from_polars #370, cat is deprecated anyways, so it is just a temporary solution, and the extra complexity can be removed when we release the next major version.

@stanmart
Copy link
Collaborator

@MarcAntoineSchmidtQC, if we don't want to add new functionality to the deprecated cat property then let me know and I'll remove that part. I think the more important addition is handling non-pandas-or-polars categoricals when constructing the CategoricalMatrix, which is an orthogonal issue.

@stanmart
Copy link
Collaborator

Perhaps one final question: all three matrix types have an unpack method that returns the container storing the data. For CategoricalMatrix it used to be a pandas.Categorical, so it simply returned .cat(). We should probably either deprecate this function for the CategoricalMatrix class or show a FutureWarning and from the next breaking release return (indices, categories) instead.

Copy link

@MarcoGorelli MarcoGorelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, amazing, seeing how far you got with Narwhals independently here really warms my heart 🤗

If you're happy with this, may I suggest to use import narwhals.stable.v1 as nw instead of import narwhals as nw? This will future-proof you against breaking changes (if we need to make them), see https://narwhals-dev.github.io/narwhals/backcompat/ for an explanation

Well done here, and please always feel free to ping us from Narwhals if you have any questions / requests / comments 🙏

@stanmart
Copy link
Collaborator

Thanks so much for the review and the fantastic library @MarcoGorelli! 🙏 Using the v1 API is a great idea. The only thing I ran into is that there does not seem to be a narwhals.stable.v1.dtypes.NumericType counterpart to narwhals.dtypes.NumericType. Is there an easy way to refer to any numeric type under the v1 API?

@MarcoGorelli
Copy link

thanks! yup, .is_numeric should work:

In [10]: import narwhals.stable.v1 as nw

In [11]: nw.from_native(pl.Series([1,2,3]), series_only=True).dtype.is_numeric()
Out[11]: True

In [12]: nw.from_native(pl.Series(['foo', 'bar']), series_only=True).dtype.is_numeric()
Out[12]: False

@MarcAntoineSchmidtQC MarcAntoineSchmidtQC merged commit 42fda2c into main Oct 25, 2024
24 checks passed
@MarcAntoineSchmidtQC MarcAntoineSchmidtQC deleted the narwhals branch October 25, 2024 21:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants