[Feature]: Kernel Density Estimation #193
I think this would be very helpful! Here is a repo for kernel density estimation: https://github.com/seatonullberg/kernel-density-estimation |
I'm unsure how well this would fit into the existing tooling in statrs. While we do have the Empirical distribution, introducing hyperparameters for data-driven distributions starts to move into a broader, data-driven realm of statistics. @henryjac, what would you say? We're amidst defining a new direction, but I think this would fit better in a different crate. |
I would not be against adding more data-driven distributions and related functionality. With this being a statistics crate, most features regarding any kind of statistics fit quite well. We even already have functionality in the statistics module, so expanding that with KDE etc. makes sense to me. |
Well, I would be happy to support data-driven distributions as long as we look ahead a little bit at what we'll choose (for the near future) as out of scope. Perhaps I'd have done better had I said that I'm averse to looking at it now, since we've got some short-term priorities. Overall, I think it just takes some discussion on clearly choosing scope, starting with an upper bound (things that will certainly be out of scope). What can you say won't be in scope for the near future? But all cards on the table, I actually think KDE would be a good candidate compared to some other data-driven distributions, reasons being:
|
I don't mean to push things, and I fully appreciate the time you put in to make things happen. Just curious: is there any hope of having this feature? |
My first thoughts returning to this are that it would be great! I don't think I could personally implement anything with practical performance beyond 1D, and I won't prioritize authoring it right now, but perhaps we could start with a simple implementation to set the API and define how this estimator (and, in principle, future ones) fits into the module structure we have, or a different one that better suits future features. As a would-be user, do you have any example workflows/use cases we could start the design around? |
There is a Rust implementation for 1D in this crate, while a 2D one is in progress. But my main interest is building contour plots. |
I can start by writing out the set of traits, perhaps with an example that won't run but would compile, though I might want to restructure the modules. I'll refer to this issue when I put in the PR. I likely won't start until next week, so I'm aiming for 2/17 on this. |
While I've also been looking at some of the spatial search data structures, I've had a back and forth with LLMs for API design: multiple conversations with ChatGPT, and some others with Augment Code accessing statrs. So now I'm wanting to get some user feedback on what I'm thinking of for the public API.

Definitions:

```rust
use std::marker::PhantomData;

/// A spatial index that returns the nearest neighbor to a given query key.
///
/// This trait is implemented by data structures (e.g. k-D trees or ball trees) that provide
/// efficient lookup of the single element closest to a specified key.
pub trait NearestNeighbor<K> {
    /// Returns the keys of the `k` nearest neighbors to the query key.
    fn k_nearest_neighbors(&self, k: usize, key: &K) -> Vec<K>;
    /// Returns the single nearest neighbor, if the index is non-empty.
    fn nearest_neighbor(&self, key: &K) -> Option<K> {
        self.k_nearest_neighbors(1, key).into_iter().next()
    }
}
/// A spatial index that retrieves all elements with keys within a specified range.
///
/// Data structures implementing this trait (such as k-D trees) can efficiently return all
/// elements whose keys fall between a lower and an upper bound, or a set of hyperplanes.
///
/// Note for implementers: at the moment, we have only considered flat spaces.
pub trait RangeQueryable<K, V> {
/// Returns a vector of values whose keys lie within the given range.
fn range_query(&self, min: &K, max: &K) -> Vec<V>;
}
/// A spatial index that retrieves all elements within a specified radius from a query key.
///
/// The radius parameter is generic (and must implement `PartialOrd`) to allow for different
/// distance metrics (such as L1 or L2 norms).
pub trait RadiusQueryable<K, V, R: PartialOrd> {
/// Returns a vector of values that are within the specified radius of the query key.
fn radius_query(&self, key: &K, radius: R) -> Vec<V>;
}
/// A kernel function (as used in kernel methods) that computes a weight given a distance and a bandwidth.
///
/// Kernel functions are used in kernel methods like KDE to weight the contribution
/// of each data point based on its distance from the query point.
pub trait Kernel {
/// Evaluates the kernel function at the specified distance `d` using the provided bandwidth `h`.
fn evaluate(&self, d: f64, h: f64) -> f64;
    /// A future possibility: kernels defined on a product space would be useful for adaptive KDE.
    /// Such a method would use the spatial index type `K` and a generalized bandwidth `B`,
    /// dependent on the evaluation points from `K`.
    // fn evaluate_pairwise(&self, x: &K, y: &K, h: &B)
}
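
// Illustration only (not part of the proposed API): the "typical box,
// Epanechnikov, and Gaussian kernels" mentioned below could implement this
// trait roughly as follows. The normalizations shown are the univariate
// ones; multivariate normalization is an open design question here.
pub struct Gaussian;
impl Kernel for Gaussian {
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        let u = d / h;
        (-0.5 * u * u).exp() / (h * (2.0 * std::f64::consts::PI).sqrt())
    }
}
pub struct Epanechnikov;
impl Kernel for Epanechnikov {
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        let u = d / h;
        if u.abs() <= 1.0 { 0.75 * (1.0 - u * u) / h } else { 0.0 }
    }
}
pub struct BoxKernel; // named to avoid clashing with `std::boxed::Box`
impl Kernel for BoxKernel {
    fn evaluate(&self, d: f64, h: f64) -> f64 {
        if (d / h).abs() <= 1.0 { 0.5 / h } else { 0.0 }
    }
}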
/// A metric space that defines how to compute the distance between two keys.
///
/// This trait is useful for algorithms (e.g. KDE) that require a distance metric to weigh contributions.
pub trait MetricSpace<K> {
/// Returns the distance between two keys as an `f64`.
fn distance(a: &K, b: &K) -> f64;
}
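
// Illustration only: an L2 (Euclidean) metric for nalgebra 2-vectors, assuming
// a nalgebra-backed key type (the key-type choice is not settled yet).
pub struct Euclidean;
impl MetricSpace<nalgebra::Vector2<f64>> for Euclidean {
    fn distance(a: &nalgebra::Vector2<f64>, b: &nalgebra::Vector2<f64>) -> f64 {
        (a - b).norm()
    }
}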
/// A kernel density estimator that computes density estimates from a dataset.
///
/// This estimator is parameterized by a data backend, a kernel function, and a neighbor selection
/// callback. Bandwidth is specified at estimation time.
///
/// # Type Parameters
///
/// * `D` - The data backend (e.g., a k-d tree) that allows efficient queries.
/// * `K` - The key type of the spatial index (i.e., the type of the data points).
/// * `Kern` - The kernel used.
/// * `F` - A function that, given a reference to the data, a query key, and a bandwidth,
///   returns an iterator over `(value, distance)` pairs.
///
/// I'm doing it this way as I'm presently putting off the trait design until I learn from writing these out.
pub struct KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    /// The data backend (e.g., a KDTree, BallTree, etc.).
    data: D,
    /// The kernel is a zero-sized type parameter; no instance is stored.
    _kern: PhantomData<Kern>,
    /// `K` appears only in the `where` clause, which doesn't count as a use,
    /// so mark it here as well to satisfy the compiler.
    _key: PhantomData<K>,
    /// A callback function to retrieve neighbors and their distances.
    neighbor_selector: F,
}
impl<D, K, Kern, F> KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    pub fn new(...) -> Self {
        Self {...}
    }

    pub fn estimate(&self, query: &K, bandwidth: f64) -> f64 {...}
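
    // (Intended semantics, as I currently picture them: `estimate` is the
    //  usual KDE average of kernel weights over the selected neighbors,
    //      f_hat(q) = (1/n) * sum_i kernel.evaluate(d(q, x_i), bandwidth),
    //  where `neighbor_selector` supplies the (x_i, distance) pairs.)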
    pub fn take(self) -> D {...}
}
```

We implement the typical box, Epanechnikov, and Gaussian kernels. The backing data structures for spatial search will be some subset of {ball trees, k-d trees, M-trees, R-trees}. I'm looking at dependencies that might support some of these with the right amount of work, but I may also start with a heap-allocated binary k-d tree just to get something up. EDIT: I can't believe I forgot to mention it, but I'm also looking at batch queries, since those are really useful for the lower-dimensional tools like visualization. |
Thanks for working on the code. It is not my area of expertise to review the details of the code, but generally the structure does follow my common references -> Internet book, another Rust implementation, Wikipedia. BTW, this is a multivariate KDE implementation, right? |
Definitely multivariate! Below is an example; maybe it would be interesting to see where the usage lies on the predictable-to-unexpected spectrum.

```rust
use nalgebra::Vector2;
use contour_isobands::{ContourBuilder, Band};
use std::error::Error;
// Assume these are defined in statrs
use statrs::tree::KdTree;
use statrs::estimator::{
KernelDensityEstimator,
    kernel::{Kernel, Gaussian, Epanechnikov},
};
/// A stub function that simulates building a k-D tree.
/// In practice, this tree would be built from your dataset.
fn build_sample_kd_tree() -> KdTree {
    // For this example, assume KdTree can be built from a collection of 2D points.
// Here we just create a dummy tree.
let points = vec![
Vector2::new(0.1, 0.2),
Vector2::new(0.4, 0.5),
Vector2::new(0.7, 0.8),
Vector2::new(0.3, 0.9),
// ... more points
];
    KdTree::from_iter(points)
}
fn main() -> Result<(), Box<dyn Error>> {
let kd_tree = build_sample_kd_tree();
// Define a neighbor selector closure.
// Given the tree, a query point, and a bandwidth, it returns an iterator
// over (point, distance) pairs for all neighbors within the given bandwidth.
    let neighbor_selector = |data: &KdTree, query: &Vector2<f64>, bandwidth: f64|
        -> Box<dyn Iterator<Item = (Vector2<f64>, f64)>> {
        // `radius_query` returns all points within `bandwidth` of `query`.
        let neighbors = data.radius_query(query, bandwidth);
        let query = *query; // copy, so the returned iterator doesn't borrow the argument
        Box::new(neighbors.into_iter().map(move |pt| {
            let d = (query - pt).norm();
            (pt, d)
        }))
    };
let bandwidth = 0.1; // Example bandwidth
    // Create two KDE estimators using different kernels. The kernel is chosen
    // via the type parameter (it is zero-sized, so nothing is passed at runtime),
    // and the bandwidth is supplied at estimation time.
    // We clone the tree for the first estimator; the non-capturing closure is
    // `Copy`, so it can be handed to both.
    let kde_gaussian = KernelDensityEstimator::<KdTree, Vector2<f64>, Gaussian, _>::new(
        kd_tree.clone(),
        neighbor_selector,
    );
    let kde_epanechnikov = KernelDensityEstimator::<KdTree, Vector2<f64>, Epanechnikov, _>::new(
        kd_tree,
        neighbor_selector,
    );
// Evaluate the density on a mesh of points in [0, 1] x [0, 1].
let grid_size = 50; // 50 x 50 grid
let mut densities_gaussian = Vec::with_capacity(grid_size * grid_size);
let mut densities_epanechnikov = Vec::with_capacity(grid_size * grid_size);
for i in 0..grid_size {
for j in 0..grid_size {
let x = i as f64 / (grid_size - 1) as f64;
let y = j as f64 / (grid_size - 1) as f64;
let query = Vector2::new(x, y);
let density_g = kde_gaussian.estimate(&query, bandwidth);
let density_e = kde_epanechnikov.estimate(&query, bandwidth);
densities_gaussian.push(density_g);
densities_epanechnikov.push(density_e);
}
}
// Define contour intervals.
let intervals = vec![0.0, 0.1, 0.2, 0.5, 1.0];
// Generate contours for the Gaussian KDE estimate.
let bands_gaussian: Vec<Band> = ContourBuilder::new(grid_size, grid_size)
.use_quad_tree(true)
.contours(&densities_gaussian, &intervals)?;
// Generate contours for the Epanechnikov KDE estimate.
let bands_epanechnikov: Vec<Band> = ContourBuilder::new(grid_size, grid_size)
.use_quad_tree(true)
.contours(&densities_epanechnikov, &intervals)?;
// In practice, these contour bands could be plotted using a library such as `contour-isobands`.
println!("Generated {} Gaussian contours and {} Epanechnikov contours.",
bands_gaussian.len(), bands_epanechnikov.len());
Ok(())
}
```

I'm still considering dependencies, but it seems like the crates based on the best JS libraries are mirrors and 2D-only (it would be nice to rely on the nalgebra API to implement for specific static dimensionality, but that's for later). So we'll likely go with rstar and kdtree-rs: they're still performant, but geared toward dynamic cases, so a reasonably smart batch query could be a little tricky to get into the PR.
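For the batch queries I mentioned, the naive version is just a loop; a smarter one would share tree traversals. A minimal sketch under the API above (the method name `estimate_batch` is an assumption, not settled):

```rust
impl<D, K, Kern, F> KernelDensityEstimator<D, K, Kern, F>
where
    F: Fn(&D, &K, f64) -> Box<dyn Iterator<Item = (K, f64)>>,
    Kern: Kernel,
{
    /// Naive batch estimate: one spatial query per point. A smarter version
    /// would sort/bucket the queries to share traversals of the backing tree.
    pub fn estimate_batch(&self, queries: &[K], bandwidth: f64) -> Vec<f64> {
        queries.iter().map(|q| self.estimate(q, bandwidth)).collect()
    }
}
``` |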
Looks reasonably logical.
I don't really get this part, particularly the `nearest_neighbor` or `radius_query`. Can you share a reference for this?
I guess you meant https://github.com/georust/rstar, right? Is the k-d tree required only in your example for demo/simulation purposes? I mean, in other/simpler usage, it is not required, right? |
Since KDE and KNN both involve spatial searching (given a query, find a subset of the data in a spatial context), I'm bucketing them together. While this would be more ergonomic to define as a trait-driven API instead of passing a closure as a field, I'm not yet sure how to group the range query and the radius query to match both needs. But most importantly, at the moment it would have to look like this: we need a data structure designed for spatial search for it to be efficient. However, it would be valuable to expose some of the most common queries. And yes, I did mean to link to georust/rstar.
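To make the trait-driven alternative concrete, here is a rough sketch (assuming the `RadiusQueryable` and `MetricSpace` traits from my earlier comment; the adapter name is hypothetical) of how the selector could be derived generically from a spatial-search backend instead of being hand-written:

```rust
/// Hypothetical adapter: turn any radius-queryable backend into the
/// `(value, distance)` pairs the estimator's neighbor selector must produce.
fn radius_neighbors<D, K, M>(data: &D, query: &K, bandwidth: f64) -> Vec<(K, f64)>
where
    D: RadiusQueryable<K, K, f64>,
    M: MetricSpace<K>,
{
    data.radius_query(query, bandwidth)
        .into_iter()
        .map(|pt| {
            let d = M::distance(query, &pt);
            (pt, d)
        })
        .collect()
}
``` |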
Maybe later, could we add histograms to complement KDE, since a histogram is the simplest method to estimate a density? I hope this is not too much to ask, considering this is volunteer work; I really appreciate it! I'm suggesting this because I see you've already thought ahead with the algorithms, including spatial searching, etc. |
On second thought, a histogram implementation doesn't look simple -> 1, 2, 3 are non-trivial.
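(To be fair, the basic fixed-bin version is short; the non-trivial parts are exactly the bin-count and edge choices. A naive sketch, assuming equal-width bins on a known range:)

```rust
/// Naive fixed-width histogram density on [lo, hi): counts normalized by
/// n * bin_width, so the estimate integrates to one over the range. Choosing
/// the number of bins and the edges is the hard part, and is ignored here.
fn histogram_density(data: &[f64], lo: f64, hi: f64, bins: usize) -> Vec<f64> {
    let width = (hi - lo) / bins as f64;
    let mut counts = vec![0usize; bins];
    for &x in data {
        if x >= lo && x < hi {
            counts[((x - lo) / width) as usize] += 1;
        }
    }
    let n = data.len() as f64;
    counts.into_iter().map(|c| c as f64 / (n * width)).collect()
}
``` |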
Thanks for sharing the good work. Is there any implementation of kernel density estimation available (univariate, bivariate, or multivariate)? Thanks.