Equivalent to unite in dplyr #1445

fkgruber · 2025-02-18T14:03:41Z

In the tidyverse you can use unite (from tidyr) to combine several columns into a new column. This is useful, for example, when you want to build an ID out of multiple columns. We probably also want a separate function.

Here is an initial proposal. I also added an option to reduce the number of significant digits for numeric values; otherwise, differences in floating-point precision can produce different IDs for values that should match.

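import janitor  # noqa: F401 (registers DataFrame.select)
from pandas_flavor import register_dataframe_method


def signif(x, digits=2):
    """Round numeric values to significant digits."""
    try:
        return float(f"{x:.{digits}g}") if isinstance(x, (int, float)) else x
    except (ValueError, TypeError):
        return x  # Return as-is if conversion fails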

@register_dataframe_method
def unite(df, prefix, new_column_name, sep="_", digits=4):
    """
    Combines all columns with a given prefix into a single column without removing the originals.

    Parameters:
    df (pd.DataFrame): The input DataFrame.
    prefix (str): The prefix to filter columns.
    new_column_name (str): The name of the new combined column.
    sep (str): Separator for concatenating values.
    digits (int): Number of significant digits for numeric values.

    Returns:
    pd.DataFrame: DataFrame with the new combined column.
    """
    df2 = df.copy()

    # Select columns with the given prefix using pyjanitor's select method
    config_cols = df2.select(columns=[f"{prefix}*"])

    # Apply rounding element-wise; signif leaves non-numeric values unchanged.
    # DataFrame.map requires pandas >= 2.1 (use applymap on older versions).
    config_cols = config_cols.map(lambda x: signif(x, digits))

    # Create the new combined column
    df2[new_column_name] = config_cols.astype(str).agg(sep.join, axis=1)

    return df2

Example run:

import pandas as pd
df = pd.DataFrame({
    "config_a": [1.234567, 2.345678, 3.456789],
    "config_b": ["B1", "B2", "B3"],
    "config_c": [100.567, 200.678, 300.789],
    "other_col": [1, 2, 3]
})

# Use the custom pandas method via pandas_flavor
df = df.unite(prefix="config", new_column_name="id")

print(df)
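
To illustrate why the rounding option matters, here is a minimal standalone sketch. It does not use the `unite` method above; `round_sig` is a hypothetical helper that assumes the same formatting trick as `signif`. It shows two floats that should be equal yielding different string IDs until they are rounded.

```python
def round_sig(x, digits=4):
    """Hypothetical helper: round a number to `digits` significant digits."""
    return float(f"{x:.{digits}g}")

# Two values that are mathematically equal but differ in their last bits.
a = 0.1 + 0.2   # stored as 0.30000000000000004
b = 0.3

# IDs built from the raw values do not match.
raw_ids = (f"{a}_B1", f"{b}_B1")

# IDs built from the rounded values do match.
rounded_ids = (f"{round_sig(a)}_B1", f"{round_sig(b)}_B1")

print(raw_ids[0] == raw_ids[1])          # False
print(rounded_ids[0] == rounded_ids[1])  # True
```

The number of digits is a trade-off: too few and genuinely different configurations collapse into the same ID, too many and floating-point noise leaks back in.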


samukweku · 2025-02-19T23:07:53Z

@fkgruber there is already concatenate_columns, which does this:

df.concatenate_columns(column_names=['config_a','config_b','config_c'], new_column_name='id', sep='_')
   config_a config_b  config_c  other_col                   id
0  1.234567       B1   100.567          1  1.234567_B1_100.567
1  2.345678       B2   200.678          2  2.345678_B2_200.678
2  3.456789       B3   300.789          3  3.456789_B3_300.789

The function needs to be updated though to avoid mutation; we can even add select syntax support for column_names.

There is an outstanding PR that was never resolved, if you want to take a look at it: #1164
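
For reference, the non-mutating behaviour discussed above could look roughly like the following in plain pandas. This is only a sketch under assumptions: `concat_columns_copy` is a hypothetical name, not pyjanitor's API or the approach taken in the PR, and the prefix filter is a simple stand-in for select syntax.

```python
import pandas as pd

def concat_columns_copy(df, column_names, new_column_name, sep="_"):
    """Hypothetical helper: join columns as strings and return a new DataFrame."""
    joined = df[column_names].astype(str).agg(sep.join, axis=1)
    # assign() returns a new DataFrame, so the caller's frame is left untouched.
    return df.assign(**{new_column_name: joined})

df = pd.DataFrame({
    "config_a": [1.234567, 2.345678],
    "config_b": ["B1", "B2"],
    "other_col": [1, 2],
})

# Prefix-based column selection with plain pandas.
cols = [c for c in df.columns if c.startswith("config")]
out = concat_columns_copy(df, cols, "id")

print("id" in df.columns)   # False: the input was not mutated
print(out["id"].tolist())   # ['1.234567_B1', '2.345678_B2']
```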
