Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDBSCAN-BC #88

Open
wants to merge 9 commits into
base: main
Choose a base branch
from
Open

HDBSCAN-BC #88

wants to merge 9 commits into from

Conversation

azizkayumov
Copy link
Contributor

Closes #87

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @azizkayumov, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

Summary of Changes

This pull request addresses issue #87 and introduces a modification to the fit method across several clustering algorithms (Dbscan, HDbscan, and Optics) to accept an optional parameter partial_labels. This change enables semi-supervised clustering by allowing users to provide initial labels, influencing the clustering process. The benchmarks and examples are also updated to reflect this change.

Highlights

  • Semi-Supervised Clustering: The fit method in Dbscan, HDbscan, and Optics now accepts an optional partial_labels parameter, enabling semi-supervised clustering.
  • HDBSCAN BCubed Metric: Introduces BCubed metric calculation within HDBSCAN to evaluate clustering performance against partial labels.
  • Code Modernization: Removes deprecated code and updates dependencies in HDBSCAN implementation.
  • Test Coverage: Adds a new test case for HDBSCAN to validate the behavior with partial labels.

Changelog

Click here to see the changelog
  • benches/tests/dbscan.rs
    • Updated benchmark tests to pass None as the optional parameter to the fit method.
  • benches/tests/hdbscan.rs
    • Updated benchmark tests to pass None as the optional parameter to the fit method.
  • benches/tests/optics.rs
    • Updated benchmark tests to pass None as the optional parameter to the fit method.
  • examples/hdbscan.rs
    • Updated example to pass None as the optional parameter to the fit method.
  • src/dbscan.rs
    • Modified the Fit trait implementation to accept an optional parameter.
    • Updated tests to pass None as the optional parameter to the fit method.
  • src/hdbscan.rs
    • Modified the Fit trait implementation to accept an optional parameter.
    • Added get_bcubed function to calculate BCubed metric.
    • Modified find_clusters to incorporate BCubed metric for cluster selection.
    • Removed unused imports and deprecated code.
    • Added a new test case partial_labels to validate the behavior with partial labels.
  • src/lib.rs
    • Modified the Fit trait definition to include a parameter for partial labels.
  • src/mst.rs
    • Added new module mst with helper functions for HDBSCAN algorithm
  • src/optics.rs
    • Modified the Fit trait implementation to accept an optional parameter.
    • Updated tests to pass None as the optional parameter to the fit method.
  • src/union_find.rs
    • Added new module union_find with helper functions for HDBSCAN algorithm
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.


Trivia time!

What does HDBSCAN stand for?

Click here for the answer
HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces an optional parameter for partial labeling to the HDBSCAN algorithm, enhancing its functionality. The changes are well-structured and include necessary modifications across the codebase. However, there are some areas that could benefit from further review and refinement.

Summary of Findings

  • Code Duplication: The mst module contains code that was previously in hdbscan.rs. Consider whether the code in mst.rs can be refactored to avoid duplication.
  • Unused Imports: Several unused imports were removed from src/hdbscan.rs. Ensure that all removed imports are indeed unnecessary and won't impact other functionalities.
  • Test Coverage: The pull request adds a new test case partial_labels in src/hdbscan.rs. Ensure that the new functionality is adequately covered with comprehensive test cases, including edge cases and boundary conditions.

Merge Readiness

The pull request introduces valuable functionality by adding partial labeling to the HDBSCAN algorithm. However, before merging, it's crucial to address the identified issues, particularly the code duplication and test coverage. I am unable to directly approve this pull request, and recommend that others review and approve this code before merging. I recommend that this pull request not be merged until the identified issues are addressed.

@azizkayumov
Copy link
Contributor Author

@gemini-code-assist please review again and be more specific about the issues you mentioned above.

Copy link

OK. I have reviewed the code again, focusing on providing more specific feedback.

Here's a summary of the changes and my comments:

  • .github/workflows/ci.yml: Updated the Rust version to 1.81.0. No specific comments.
  • benches/tests/*.rs: Added None as a second argument to the fit method calls. See comment on src/dbscan.rs for details.
  • examples/hdbscan.rs: Added None as a second argument to the fit method call. See comment on src/dbscan.rs for details.
  • src/dbscan.rs: Modified the Fit trait implementation to accept an optional parameter. See comment below for details.
  • src/hdbscan.rs: Substantial changes to the hdbscan.rs file, including the addition of a partial_labels parameter to the fit method, and the addition of mst.rs and union_find.rs files. See comments below for details.
  • src/optics.rs: Added None as a second argument to the fit method calls. See comment on src/dbscan.rs for details.

FILE: src/dbscan.rs

  1. Fit Trait Implementation: The Fit trait implementation has been modified to accept an optional parameter _: Option<&HashMap<usize, Vec<usize>>>. This change seems to be related to incorporating partial labels into the clustering process. It would be helpful to have a comment explaining the purpose of this parameter and how it's used (or not used) within the dbscan implementation.

FILE: src/hdbscan.rs

  1. fit method: The fit method now accepts a partial_labels parameter. This is a significant change that introduces semi-supervised learning capabilities. The implementation includes a get_bcubed function to calculate BCubed scores based on these labels, influencing cluster formation. It would be helpful to have a comment explaining the purpose of this parameter and how it's used within the hdbscan implementation.
  2. Dependencies: Several dependencies (e.g., std::cmp::Ordering, rayon::prelude::*, succinct::*) have been removed, and new modules (mst and union_find) have been introduced. This suggests a refactoring of the code to improve modularity and potentially remove unnecessary dependencies. It would be helpful to have a high level comment on the refactoring that was performed.
  3. New Modules: The addition of mst.rs and union_find.rs indicates a move towards encapsulating minimum spanning tree and union-find functionalities into separate modules. This improves code organization and reusability. No specific comments.

Copy link

codecov bot commented Mar 22, 2025

Codecov Report

Attention: Patch coverage is 88.62559% with 48 lines in your changes missing coverage. Please review.

Project coverage is 83.75%. Comparing base (28b7552) to head (9f5f258).
Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
src/mst.rs 85.98% 45 Missing ⚠️
src/hdbscan.rs 94.44% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #88      +/-   ##
==========================================
+ Coverage   82.21%   83.75%   +1.53%     
==========================================
  Files           4        6       +2     
  Lines         669      720      +51     
==========================================
+ Hits          550      603      +53     
+ Misses        119      117       -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@azizkayumov
Copy link
Contributor Author

@gemini-code-assist , here is a summary of changes based on your feedback:
FILE: src/dbscan.rs:

  1. Added a docs comment noting that the optional parameter _params is not used, but kept for consistency.
    FILE: src/hdbscan.rs:
  2. Added a docs section to note that the optional parameter is used as partial labels, also noted the algorithms being used if partial_labels are provided by the user.
    2 and 3 are the same: refactored mst and union_find related code into their own modules so that hdbscan is easier to manage.
    Can you review again?

Copy link

Thanks for the update, @azizkayumov. I've reviewed the changes again.

It looks like you've addressed the previous comments by adding documentation for the optional parameters in src/dbscan.rs and src/hdbscan.rs, and refactored the mst and union_find related code into their own modules.

I noticed that the Fit trait in src/lib.rs was updated to include a Params type. This is great, but it looks like the Fit implementations in Dbscan and Optics are not fully taking advantage of this new Params type, as they are simply passing None to the fit method. This is acceptable, but it might be good to consider how these algorithms could leverage the Params type in the future. For example, the Params type could be used to pass in initial cluster assignments, or to configure other aspects of the algorithm.

Overall, the code looks good and the changes are well-documented. I'm happy with the changes that were made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement semi-supervised clustering with HDBSCAN
1 participant