Equivalent to unite in dplyr #1445

Open
fkgruber opened this issue Feb 18, 2025 · 1 comment

Comments

@fkgruber

In the tidyverse, tidyr's unite combines several columns into a single new column. This is useful, for example, when you want to build an ID from multiple columns. We would probably also want a counterpart to tidyr's separate.

Here is an initial proposal. I also added an option to round numeric values to a given number of significant digits. Otherwise, floating-point noise can produce different IDs for values that should match.

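import janitor  # noqa: F401 (registers DataFrame.select)
from pandas_flavor import register_dataframe_method

def signif(x, digits=2):
    """Round numeric values to significant digits."""
    try:
        return float(f"{x:.{digits}g}") if isinstance(x, (int, float)) else x
    except (TypeError, ValueError):
        return x  # Return as-is if formatting or conversion fails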

@register_dataframe_method
def unite(df, prefix, new_column_name, sep="_", digits=4):
    """
    Combines all columns with a given prefix into a single column without removing the originals.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    prefix (str): The prefix to filter columns.
    new_column_name (str): The name of the new combined column.
    sep (str): Separator for concatenating values.
    digits (int): Number of significant digits for numeric values.

    Returns:
    pd.DataFrame: DataFrame with the new combined column.
    """
    df2 = df.copy()

    # Select columns with the given prefix using pyjanitor's select method
    config_cols = df2.select(columns=[f"{prefix}*"])

    # Round numeric values; signif passes non-numeric values through unchanged.
    # DataFrame.map requires pandas >= 2.1 (use applymap on older versions).
    config_cols = config_cols.map(lambda x: signif(x, digits))

    # Create the new combined column
    df2[new_column_name] = config_cols.astype(str).agg(sep.join, axis=1)

    return df2
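
To illustrate why the rounding step matters, here is a minimal pure-Python sketch. The helpers `round_signif` and `make_id` are hypothetical names used only for this illustration and are not part of pyjanitor. It shows two values that should be equal producing different IDs until they are rounded to a few significant digits.

```python
def round_signif(x, digits=4):
    """Round a number to `digits` significant digits; pass other values through."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(x, (int, float)) and not isinstance(x, bool):
        return float(f"{x:.{digits}g}")
    return x


def make_id(values, sep="_", digits=None):
    """Join values into one ID string, optionally rounding numbers first."""
    if digits is not None:
        values = [round_signif(v, digits) for v in values]
    return sep.join(str(v) for v in values)


a = 0.1 + 0.2  # floating-point arithmetic gives 0.30000000000000004
b = 0.3

raw_ids = (make_id(["run", a]), make_id(["run", b]))
rounded_ids = (make_id(["run", a], digits=4), make_id(["run", b], digits=4))

print(raw_ids)      # ('run_0.30000000000000004', 'run_0.3')
print(rounded_ids)  # ('run_0.3', 'run_0.3')
```

The trade-off is that values differing only beyond the chosen number of digits collapse to the same ID, so `digits` should be picked with the data's real resolution in mind.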

Example run:

import pandas as pd
df = pd.DataFrame({
    "config_a": [1.234567, 2.345678, 3.456789],
    "config_b": ["B1", "B2", "B3"],
    "config_c": [100.567, 200.678, 300.789],
    "other_col": [1, 2, 3]
})

# Use the custom pandas method via pandas_flavor
df = df.unite(prefix="config", new_column_name="id")

print(df)


@samukweku
Collaborator

@fkgruber concatenate_columns already does this:

df.concatenate_columns(column_names=['config_a','config_b','config_c'], new_column_name='id', sep='_')
   config_a config_b  config_c  other_col                   id
0  1.234567       B1   100.567          1  1.234567_B1_100.567
1  2.345678       B2   200.678          2  2.345678_B2_200.678
2  3.456789       B3   300.789          3  3.456789_B3_300.789

The function needs to be updated though to avoid mutation; we can even add select syntax support for column_names.
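
As a rough sketch of what those two changes could look like, the snippet below uses plain pandas and the standard-library `fnmatch` module. The function name `unite_columns` and its parameters are hypothetical, and this is not how pyjanitor implements `concatenate_columns` or its select syntax. It only shows one way to avoid mutating the input and to accept glob-style patterns.

```python
from fnmatch import fnmatchcase

import pandas as pd


def unite_columns(df, patterns, new_column_name, sep="_"):
    """Hypothetical helper: join matching columns into one string column.

    `patterns` is a glob string or a list of glob strings (e.g. "config_*").
    The input DataFrame is left untouched; a new DataFrame is returned.
    """
    if isinstance(patterns, str):
        patterns = [patterns]
    # Keep the original column order while filtering by pattern.
    selected = [
        col for col in df.columns
        if any(fnmatchcase(str(col), pattern) for pattern in patterns)
    ]
    if not selected:
        raise ValueError(f"No columns match {patterns!r}")
    combined = df[selected].astype(str).agg(sep.join, axis=1)
    # assign returns a new DataFrame instead of modifying df in place.
    return df.assign(**{new_column_name: combined})


df = pd.DataFrame({
    "config_a": [1.234567, 2.345678, 3.456789],
    "config_b": ["B1", "B2", "B3"],
    "config_c": [100.567, 200.678, 300.789],
    "other_col": [1, 2, 3],
})

out = unite_columns(df, "config_*", "id")
print(out)
print("id" in df.columns)  # False: the original frame is unchanged
```

Using `assign` keeps the function free of side effects, which makes it safe to use in method chains.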

There is an outstanding PR that was never resolved, if you want to take a look at it: #1164
