[Feature]: Kernel Density Estimation #193

Open
humphreylee opened this issue Sep 22, 2023 · 15 comments

@humphreylee

Thanks for sharing the good work. Is there any implementation for kernel density estimation available (univariate, bivariate or multivariate)? Thanks.

@wangjiawen2013

I think this would be very helpful! Here is a repo for kernel density estimation: https://github.com/seatonullberg/kernel-density-estimation. It would be great to incorporate it into statrs.

@YeungOnion
Contributor

I'm unsure how well this would fit into the existing tooling in statrs. While we do have the Empirical distribution, introducing hyperparameters for data-driven distributions starts to move into a broader realm of data-driven statistics.

@henryjac what would you say? We're amidst defining a new direction, but I think this would fit better in a different crate.

@henryjac
Contributor

henryjac commented Mar 9, 2024

I would not be against adding support for more data-driven distributions and related functionality. With this being a statistics crate, most features involving any kind of statistics fit quite well.

We even already have functionality for statistics in the statistics module, so expanding that with KDE etc. makes sense to me.

@YeungOnion
Contributor

Well, I would be happy to support data-driven distributions as long as we look ahead a little at what we'll choose (for the near future) as out of scope. Perhaps I'd have done better to say that I'm averse to looking at it now, since we have some short-term priorities.

Overall, I think it just takes some discussion to clearly choose scope, starting with an upper bound (things that will certainly be out of scope). What can you say won't be in scope (for the near future)?


But all cards on the table, I actually think KDE would be a good candidate compared to some other data-driven distributions, the reasons being (see the sketch after this list):

  • its distribution function is similar to the named random variables in statrs::distribution in that it is
    • invariant under permutation of the data
    • deterministic and closed form in terms of the specified data and kernel (plus possible hyperparameters)
  • I'd expect real-world use cases where the data volume is below the scale of memory (i.e. not developing an API to support larger-than-memory data beyond considering IntoIterator)
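
To make the closed-form point concrete, a minimal hypothetical 1D evaluation could look like the following (illustrative names only, not a proposed statrs API); it computes f̂(x) = (1/(nh)) Σᵢ K((x − xᵢ)/h):

use std::f64::consts::PI;

/// Standard normal kernel K(u).
fn gaussian_kernel(u: f64) -> f64 {
    (-0.5 * u * u).exp() / (2.0 * PI).sqrt()
}

/// Naive 1D KDE: deterministic and closed form in the data, kernel, and bandwidth `h`.
fn kde_1d(data: &[f64], x: f64, h: f64) -> f64 {
    let n = data.len() as f64;
    data.iter().map(|xi| gaussian_kernel((x - xi) / h)).sum::<f64>() / (n * h)
}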

@humphreylee
Author

Don't mean to push things. I fully appreciate the time you're putting in to make this happen. Just curious, is there any hope of having this feature?

@YeungOnion
Contributor

My first thoughts on returning to this are that it would be great! It would really make statrs a fuller statistical suite. As far as statistics go, we mostly have the parametric material, without estimators or regression.

I don't think I could personally implement anything with practical performance beyond 1D, and I won't prioritize authoring it right now, but perhaps we could start with a simple implementation to set the API and define how this estimator (and, in principle, future ones) fits into the module structure we have, or a different one that better suits future features.

As a would-be user, do you have any example workflows/use cases we could start the design around?

@humphreylee
Author

There is a Rust implementation for 1D from this crate, while a 2D version is in progress. But my main interest is to build contour plots.

@YeungOnion
Contributor

I can start by writing out the set of traits, perhaps with an example that would compile but not run, though I might want to restructure modules. I'll refer to this issue when I put in the PR. I likely won't start until next week, so I'm aiming for 2/17 on this.

@YeungOnion
Contributor

YeungOnion commented Feb 19, 2025

While I've also been looking at some of the spatial search data structures, I've had a back and forth with LLMs on API design: multiple conversations with ChatGPT and some others with Augment Code accessing statrs. Now I want to get some user feedback on what I'm thinking of for the public API.

definitions

use std::marker::PhantomData; // used by KernelDensityEstimator below

/// A spatial index that returns the nearest neighbor to a given query key.
///
/// This trait is implemented by data structures (e.g. k-d trees or ball trees) that provide
/// efficient lookup of the single element closest to a specified key.
pub trait NearestNeighbor<K> {
    /// Returns the keys of the `k` nearest neighbors to the query key.
    fn k_nearest_neighbors(&self, k: usize, key: &K) -> Vec<K>;

    /// Returns the key of the single nearest neighbor, if any.
    fn nearest_neighbor(&self, key: &K) -> Option<K> {
        self.k_nearest_neighbors(1, key).into_iter().next()
    }
}

/// A spatial index that retrieves all elements with keys within a specified range.
///
/// Data structures implementing this trait (such as k-D trees) can efficiently return all
/// elements whose keys fall between a lower and an upper bound, or a set of hyperplanes.
///
/// Note for implementers: at the moment, we have only considered flat spaces.
pub trait RangeQueryable<K, V> {
    /// Returns a vector of values whose keys lie within the given range.
    fn range_query(&self, min: &K, max: &K) -> Vec<V>;
}

/// A spatial index that retrieves all elements within a specified radius from a query key.
///
/// The radius parameter is generic (and must implement `PartialOrd`) to allow for different
/// distance metrics (such as L1 or L2 norms).
pub trait RadiusQueryable<K, V, R: PartialOrd> {
    /// Returns a vector of values that are within the specified radius of the query key.
    fn radius_query(&self, key: &K, radius: R) -> Vec<V>;
}

/// A kernel function (those used in kernel methods) that computes a weight given a distance and bandwidth.
///
/// Kernel functions are used in kernel methods like KDE to weight the contribution
/// of each data point based on its distance from the query point.
pub trait Kernel {
    /// Evaluates the kernel function at the specified distance `d` using the provided bandwidth `h`.
    fn evaluate(&self, d: f64, h: f64) -> f64;
    /// A future possibility: kernels defined on a product space would be useful for adaptive KDE.
    /// Such a method would use the spatial index type `K` and a generalized bandwidth `B` dependent on the evaluation points from `K`.
    // fn evaluate_pairwise(&self, x: &K, y: &K, h: &B)
}

/// A metric space that defines how to compute the distance between two keys.
///
/// This trait is useful for algorithms (e.g. KDE) that require a distance metric to weigh contributions.
pub trait MetricSpace<K> {
    /// Returns the distance between two keys as an `f64`.
    fn distance(a: &K, b: &K) -> f64;
}

/// A kernel density estimator that computes density estimates from a dataset.
///
/// This estimator is parameterized by a data backend, a kernel function, and a neighbor selection
/// callback. Bandwidth is specified at estimation time.
///
/// # Type Parameters
///
/// * `D` - The data backend (e.g., a k-d tree) that allows efficient queries.
/// * `K` - The spatial index (key) type of the data.
/// * `Kern` - The kernel used.
/// * `F` - A function that, given a reference to the data, a query key, and a bandwidth,
///         returns an iterator over `(key, distance)` pairs.
///         I'm doing it this way as I'm presently putting off the trait design until I learn from writing these out.
pub struct KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    /// The data backend (e.g., a KdTree, BallTree, etc.).
    data: D,
    /// Marker for the `K` and `Kern` parameters, which appear only in bounds.
    _marker: PhantomData<(K, Kern)>,
    /// A callback function to retrieve neighbors and their distances.
    neighbor_selector: F,
}

impl<D, K, Kern, F> KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    pub fn new(/* ... */) -> Self { /* ... */ }
    pub fn estimate(&self, query: &K, bandwidth: f64) -> f64 { /* ... */ }
    pub fn take(self) -> D { /* ... */ }
}
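
To sanity-check how these pieces compose, here is a hypothetical brute-force backend against the traits above (an illustration of the trait relationships, not a proposed implementation; names are made up):

/// Illustration only: a naive Vec-backed "index" over 2D points.
pub struct BruteForce2D {
    points: Vec<(f64, f64)>,
}

/// Euclidean distance between 2D tuples.
pub struct Euclid2D;

impl MetricSpace<(f64, f64)> for Euclid2D {
    fn distance(a: &(f64, f64), b: &(f64, f64)) -> f64 {
        ((a.0 - b.0).powi(2) + (a.1 - b.1).powi(2)).sqrt()
    }
}

impl RadiusQueryable<(f64, f64), (f64, f64), f64> for BruteForce2D {
    /// O(n) scan; a k-d tree or ball tree would prune most of this work.
    fn radius_query(&self, key: &(f64, f64), radius: f64) -> Vec<(f64, f64)> {
        self.points
            .iter()
            .copied()
            .filter(|p| Euclid2D::distance(p, key) <= radius)
            .collect()
    }
}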

We implement the typical box, Epanechnikov, and Gaussian kernels, e.g. as sketched below.
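
A sketch of what those could look like against the `Kernel` trait above, assuming 1D normalization constants (multivariate constants would differ, and these type names are placeholders):

pub struct BoxKernel;
pub struct Epanechnikov;
pub struct Gaussian;

impl Kernel for BoxKernel {
    /// Uniform on |d/h| <= 1, normalized for 1D.
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        let u = d / h;
        if u.abs() <= 1.0 { 0.5 / h } else { 0.0 }
    }
}

impl Kernel for Epanechnikov {
    /// (3/4)(1 - u^2) on |u| <= 1, normalized for 1D.
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        let u = d / h;
        if u.abs() <= 1.0 { 0.75 * (1.0 - u * u) / h } else { 0.0 }
    }
}

impl Kernel for Gaussian {
    /// Standard normal density in u = d/h, normalized for 1D.
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        let u = d / h;
        (-0.5 * u * u).exp() / (h * (2.0 * std::f64::consts::PI).sqrt())
    }
}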

And the backing data structures for spatial search will be some subset of {ball trees, k-d trees, M-trees, R-trees}. I'm looking at dependencies that might support some of these with the right amount of work, but I may also start with a heap-allocated binary k-d tree just to get something up.
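
For reference, that heap-allocated binary k-d tree starting point could be as simple as this shape (illustrative only):

/// One node per point; children are heap-allocated, splitting on `axis`.
struct KdNode<K> {
    point: K,
    axis: usize,
    left: Option<Box<KdNode<K>>>,
    right: Option<Box<KdNode<K>>>,
}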

EDIT: I can't believe I forgot to mention it, but I'm also looking at batch queries, since those are really useful for lower-dimensional tools like visualization.
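
A naive batch query is just a map over single-point estimates; the interesting work would be sharing traversal state across nearby queries. As a hypothetical baseline shape against the sketch above:

impl<D, K, Kern, F> KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    /// Baseline batch evaluation; a smarter version could reuse spatial-search work
    /// between queries that are close together.
    pub fn estimate_batch(&self, queries: &[K], bandwidth: f64) -> Vec<f64> {
        queries.iter().map(|q| self.estimate(q, bandwidth)).collect()
    }
}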

@humphreylee
Author

Thanks for working on the code. It is not my area of expertise to review the details of the code, but generally, the structure does follow (my) common references -> Internet book, Another Rust implementation, Wikipedia. BTW, this is a multivariate KDE implementation, right?

@YeungOnion
Contributor

Definitely multivariate!
Mostly, I'm just wondering if you could imagine using this code.

Below is an example; maybe it would be interesting to see where the usage lies on the predictable-to-unexpected spectrum.

use nalgebra::Vector2;
use contour_isobands::{ContourBuilder, Band};
use std::error::Error;

// Assume these are defined in statrs
use statrs::tree::KdTree;
use statrs::estimator::{
    KernelDensityEstimator,
    kernel::{Kernel, Gaussian, Epanechnikov},
};

/// A stub function that simulates building a k-D tree.
/// In practice, this tree would be built from your dataset.
fn build_sample_kd_tree() -> KdTree {
    // For this example, assume KdTree can be built from a vector of 2D points.
    // Here we just create a dummy tree.
    let points = vec![
        Vector2::new(0.1, 0.2),
        Vector2::new(0.4, 0.5),
        Vector2::new(0.7, 0.8),
        Vector2::new(0.3, 0.9),
        // ... more points
    ];
    KdTree::from_iter(points.iter())
}

fn main() -> Result<(), Box<dyn Error>> {
    let kd_tree = build_sample_kd_tree();

    // Define a neighbor selector closure.
    // Given the tree, a query point, and a bandwidth, it returns an iterator
    // over (point, distance) pairs for all neighbors within the given bandwidth.
    let neighbor_selector = |data: &KDTree2D, query: &Vector2<f64>, bandwidth: f64| -> Box<Vec<(Vector2<f64>, f64)>> {
        // `radius_query` returns all points within `bandwidth` of `query`.
        let neighbors = data.radius_query(query, bandwidth);
        Box::new(neighbors.into_iter().map(|pt| {
                let d = (query - pt).norm();
                (pt, d)
            })
            .collect())
    };

    let bandwidth = 0.1; // Example bandwidth

    // Create two KDE estimators using different kernels.
    // We clone the tree for the first estimator.
    let kde_gaussian = KernelDensityEstimator::<KdTree, Vector2<f64>, Gaussian, _>::new(
        kd_tree.clone(),
        Box::new(Gaussian),
        bandwidth,
        neighbor_selector,
    );

    let kde_epanechnikov = KernelDensityEstimator::<KdTree, Vector2<f64>, Epanechnikov, _>::new(
        kd_tree,
        Box::new(Epanechnikov),
        bandwidth,
        neighbor_selector,
    );

    // Evaluate the density on a mesh of points in [0, 1] x [0, 1].
    let grid_size = 50; // 50 x 50 grid
    let mut densities_gaussian = Vec::with_capacity(grid_size * grid_size);
    let mut densities_epanechnikov = Vec::with_capacity(grid_size * grid_size);

    for i in 0..grid_size {
        for j in 0..grid_size {
            let x = i as f64 / (grid_size - 1) as f64;
            let y = j as f64 / (grid_size - 1) as f64;
            let query = Vector2::new(x, y);
            let density_g = kde_gaussian.estimate(&query, bandwidth);
            let density_e = kde_epanechnikov.estimate(&query, bandwidth);
            densities_gaussian.push(density_g);
            densities_epanechnikov.push(density_e);
        }
    }

    // Define contour intervals.
    let intervals = vec![0.0, 0.1, 0.2, 0.5, 1.0];

    // Generate contours for the Gaussian KDE estimate.
    let bands_gaussian: Vec<Band> = ContourBuilder::new(grid_size, grid_size)
        .use_quad_tree(true)
        .contours(&densities_gaussian, &intervals)?;
    
    // Generate contours for the Epanechnikov KDE estimate.
    let bands_epanechnikov: Vec<Band> = ContourBuilder::new(grid_size, grid_size)
        .use_quad_tree(true)
        .contours(&densities_epanechnikov, &intervals)?;
    
    // In practice, these contour bands could be plotted using a library such as `contour-isobands`.
    println!("Generated {} Gaussian contours and {} Epanechnikov contours.",
             bands_gaussian.len(), bands_epanechnikov.len());

    Ok(())
}

I'm still considering dependencies. It seems the crates based on the best JS libraries are mirrors and are 2D only (it would be nice to rely on the nalgebra API to implement for specific static dimensionality, but that's for later), so we'll likely go with rstar and kdtree-rs: they're still performant but better suited to dynamic cases, which means a reasonably smart batch query could be a little tricky to get into the PR.

@humphreylee
Author

humphreylee commented Feb 21, 2025

Looks reasonably logical.

let neighbor_selector = |data: &KDTree2D, query: &Vector2, bandwidth: f64| -> Box<Vec<(Vector2, f64)>> {
    // radius_query returns all points within bandwidth of query.
    let neighbors = data.radius_query(query, bandwidth);
    Box::new(neighbors.into_iter().map(|pt| {
            let d = (query - pt).norm();
            (pt, d)
        })
        .collect())
};

I don't really get this part, particularly the nearest_neighbor or radius_query. Can you share a reference for this?

with rstar and

I guess you meant https://github.com/georust/rstar, right?

Is the k-d tree required only in your example for demo/simulation purposes? I mean, in other/simpler usage it is not required, right?

@YeungOnion
Contributor

// the below "KDTree2D" type is from my insufficently judicious review of GPT generated code near my bedtime.
// should be KdTree which would `impl RadiusQueryable` so it can be used as a field in KernelDensityEstimator
// further, it's likely more readable to call `.collect` on the iterator instead of through the smart pointer
let neighbor_selector = |data: &KDTree2D, query: &Vector2, bandwidth: f64| -> Box<Vec<(Vector2, f64)>> {
// radius_query returns all points within bandwidth of query.
let neighbors = data.radius_query(query, bandwidth);
Box::new(neighbors.into_iter().map(|pt| {
let d = (query - pt).norm();
(pt, d)
})
.collect())
};

Since KDE and KNN both have the notion of spatial searching (given a query, find a subset of the data in a spatial context), I'm bucketing them together. While we don't explicitly use the nearest_neighbor query, we do wish to find the set of data neighboring the query point, and that's the `<KdTree as RadiusQueryable>::radius_query(&data, query, bandwidth)` call.

While this would be more ergonomic to define as a trait-driven API instead of passing a closure as a field, I'm not yet sure how to group the range query and the radius query to match both needs.
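
One hypothetical shape for that trait-driven alternative, purely to record the idea (not a settled design):

/// Replaces the `F` closure field: a strategy for finding neighbors and their distances.
pub trait NeighborSelector<D, K> {
    fn neighbors<'a>(
        &self,
        data: &'a D,
        query: &K,
        bandwidth: f64,
    ) -> Box<dyn Iterator<Item = (K, f64)> + 'a>;
}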

But most importantly, at the moment it would have to look like this: we need a data structure designed for spatial search for it to be efficient. However, it would be valuable to expose some of the most common neighbor_selector candidate closures before we tie down the API, likely one per data structure-query pair.


I did mean to link to georust/rstar

@humphreylee
Author

humphreylee commented Feb 24, 2025

Maybe later, could we add a histogram to provide a more complete KDE story, since the histogram is the simplest method of estimating a density? I hope this is not too much to ask, considering this is volunteer work. I really appreciate it!

I'm suggesting this because I see you've already thought ahead with the algorithm, including spatial searching, etc.

@humphreylee
Author

humphreylee commented Feb 25, 2025

On second thought, a histogram implementation doesn't look simple -> 1, 2, 3 are non-trivial.
