DAU Undersampling (Density-Aware Undersampling)

DAU (Density-Aware Undersampling) is a Python package for handling imbalanced datasets. It shrinks the majority class while trying to preserve its most informative samples.

Instead of dropping majority samples at random, DAU treats them according to local density:

Sparse points (outliers / rare cases) → retained fully
Dense clusters → represented by a few points (using DBSCAN)
Noise points (not assigned to any cluster) → kept separately

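Because rare cases survive and only redundant points in dense regions are thinned out, the reduced dataset is intended to retain more useful structure than random undersampling, which can translate into better model performance.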
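The density-based selection described above can be sketched roughly as follows. This is not the package's source code, only an illustration built on scikit-learn. The function name `density_aware_undersample_sketch` is hypothetical, and three details are assumptions: density is measured as the mean distance to the k nearest neighbours, the sparsest `percentile` percent of points count as "sparse", and each DBSCAN cluster is represented by the member closest to its centroid.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors


def density_aware_undersample_sketch(X_major, n_neighbors=3, min_samples=5,
                                     eps=0.05, percentile=25):
    """Return sorted row indices of majority-class points to keep (illustrative)."""
    X_major = np.asarray(X_major, dtype=float)

    # Mean distance to the k nearest neighbours: a large value means a sparse region.
    # We ask for n_neighbors + 1 because each point is returned as its own neighbour.
    nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X_major)
    distances, _ = nn.kneighbors(X_major)
    mean_dist = distances[:, 1:].mean(axis=1)

    # ASSUMPTION: the sparsest `percentile` percent of points are kept in full.
    threshold = np.percentile(mean_dist, 100 - percentile)
    sparse_mask = mean_dist >= threshold
    sparse_idx = np.flatnonzero(sparse_mask)
    dense_idx = np.flatnonzero(~sparse_mask)

    keep = list(sparse_idx)
    if dense_idx.size:
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_major[dense_idx])
        # Points DBSCAN labels as noise (-1) are kept as they are.
        keep.extend(dense_idx[labels == -1])
        # ASSUMPTION: one representative per cluster, the member nearest the centroid.
        for label in np.unique(labels[labels != -1]):
            members = dense_idx[labels == label]
            centroid = X_major[members].mean(axis=0)
            nearest = np.linalg.norm(X_major[members] - centroid, axis=1).argmin()
            keep.append(members[nearest])
    return np.sort(np.array(keep, dtype=int))


# Toy majority class: one tight blob plus a handful of scattered points.
rng = np.random.default_rng(0)
blob = rng.normal(0.0, 0.01, size=(200, 2))
spread = rng.uniform(-5.0, 5.0, size=(20, 2))
X_major = np.vstack([blob, spread])

kept = density_aware_undersample_sketch(X_major, eps=0.05)
print(len(X_major), "->", len(kept), "majority samples kept")
```

In this toy setup the scattered points fall on the sparse side of the threshold and are retained, while the tight blob is thinned out. The real package may differ in any of the assumed details.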

Installation

From PyPI:

pip install dau-undersampler

(Optionally, for testing on TestPyPI):

pip install -i https://test.pypi.org/simple/ dau-undersampling

⚡ Quickstart

import pandas as pd
from sklearn.datasets import make_classification
from dau_undersampling import DAU

# 1. Create an imbalanced dataset
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)

X = pd.DataFrame(X)
y = pd.Series(y)

# 2. Apply DAU undersampling
dau = DAU(n_neighbors=5, min_samples=3, eps=0.5, percentile=25)
X_resampled, y_resampled = dau.fit_transform(X, y)

# value_counts() reports samples per class, not the array shape
print("Original class counts:", y.value_counts().to_dict())
print("Resampled class counts:", y_resampled.value_counts().to_dict())

🛠 Usage & Parameters

Class: `DAU`

DAU(n_neighbors=3, min_samples=5, eps=0.05, percentile=25)

Parameters:

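n_neighbors (int, default=3): Number of neighbors used for the KNN distance calculation.
min_samples (int, default=5): Minimum number of neighboring samples DBSCAN requires to treat a point as a core point of a cluster.
eps (float, default=0.05): Maximum distance between two samples for DBSCAN to treat them as neighbors. It is expressed in feature units, so it depends on how the features are scaled.
percentile (int, default=25): Percentile threshold used to split sparse from dense points.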
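Since `eps` is a distance in feature space, a value that works for one dataset may label almost everything as noise on another. The following sketch uses scikit-learn's `DBSCAN` directly (not DAU itself) to show how the number of clusters and noise points changes with `eps` on the majority class of the Quickstart data. The helper `dbscan_summary` and the chosen `eps` grid are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Same synthetic data as the Quickstart; keep only the majority class (label 0).
X, y = make_classification(
    n_samples=1000, n_features=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)
X_major = StandardScaler().fit_transform(X[y == 0])


def dbscan_summary(X, eps, min_samples=5):
    """Return (number of clusters, number of noise points) for one eps value."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


# Hypothetical grid of eps values, from very tight to very loose.
summary = {eps: dbscan_summary(X_major, eps) for eps in (0.05, 0.5, 2.0, 4.0)}
for eps, (n_clusters, n_noise) in summary.items():
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

A quick scan like this, run on your own scaled features, is one way to pick a starting value for `eps` before passing it to `DAU`.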

Method: `fit_transform(X, y)`

Performs density-aware undersampling.

Arguments:

X: pd.DataFrame → feature matrix; the Quickstart passes the full dataset, with both classes included.
y: pd.Series → labels (binary classification).

Returns:

X_resampled: Reduced features after undersampling.
y_resampled: Reduced labels aligned with features.

Example in Pipeline

You can also integrate DAU into an ML pipeline (with imblearn):

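from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from dau_undersampling import DAU

# Note: imblearn's Pipeline only resamples through steps that expose fit_resample;
# check that your installed DAU version provides it before relying on this.
pipeline = Pipeline([
    ('undersample', DAU(n_neighbors=7, min_samples=5, eps=0.4, percentile=30)),
    ('clf', LogisticRegression())
])

# X, y as created in the Quickstart
pipeline.fit(X, y)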
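An imblearn pipeline applies resampling steps only while fitting, never at prediction time, so the test data keeps its original class balance. The sketch below does the same thing by hand. The function `random_undersample` is a hypothetical stand-in for a resampler (plain random undersampling, not DAU) so that the example runs with scikit-learn alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def random_undersample(X, y, random_state=0):
    """Stand-in resampler (NOT DAU): shrink every class to the minority class size."""
    rng = np.random.default_rng(random_state)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    return X[keep], y[keep]


X, y = make_classification(
    n_samples=1000, n_features=10,
    n_classes=2, weights=[0.9, 0.1],
    random_state=42
)

# Split first, so the test set is never resampled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Resample the training split only, then fit the classifier on the result.
X_res, y_res = random_undersample(X_train, y_train)
clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)

score = balanced_accuracy_score(y_test, clf.predict(X_test))
print("Training class counts after resampling:", np.bincount(y_res).tolist())
print("Balanced accuracy on the untouched test set:", round(score, 3))
```

Swapping the stand-in for `DAU(...).fit_transform(X_train, y_train)` (with pandas inputs, as in the Quickstart) should follow the same pattern.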

Why DAU vs Other Methods?

| Method | Behavior |
| --- | --- |
| Random undersampling | Drops samples randomly (risk of losing rare but important cases). |
| NearMiss / Tomek Links | Works with distances but may remove outliers or boundary points. |
| DAU (this package) | Preserves outliers and keeps 1 representative per dense cluster (balanced). |

Contributing

Fork this repo
Create a new branch (git checkout -b feature-xyz)
Commit changes (git commit -m "Added xyz")
Push (git push origin feature-xyz)
Open a Pull Request

License

This project is licensed under the MIT License – see the LICENSE file for details.
