Skip to content

Binary segmentation

Toby Dylan Hocking edited this page Mar 10, 2021 · 13 revisions

Background

Binary segmentation is an extremely fast heuristic algorithm for computing changepoint models in sequence data sets (e.g., time series, genomics, etc). Its asymptotic complexity for N data and K segments is

  • O( N log K ) in the typical/average case, which occurs when each split results in two segments of similar sizes (e.g., splitting 100 data points into 50 and 50).
  • O( N K ) in the worst case, which occurs when each split results in two segments of very different sizes (e.g., splitting 100 data points into 99 and 1).

For a more detailed theoretical description of the algorithm see Truong et al., arXiv:1801.00718.

Related work

  • changepoint::cpt.mean(method=”BinSeg”) uses C code that can compute binary segmentation for various statistical models (Normal change in mean with constant variance, Normal change in mean and variance, Poisson change in mean and variance, etc). It supports input of min segment length constraints. Its quadratic O(K^2) output size (K x K changepoint matrix) can be prohibitive for large K. It outputs segment-specific model parameters (e.g., segment means) for only the selected model (but does not support output of segment-specific parameters for all models from 1 to K segments).
  • binsegRcpp::binseg_normal uses C++ code that computes binary segmentation for only one statistical model (Normal change in mean with constant variance). It does not support input of min segment length constraints. Its linear O(K) output size (data table with K rows) works even for very large K. Using this output the coef method supports computing segment-specific parameters (e.g., segment means) for any models from 1 to K segments. In addition it uses a std::multiset from the C++ Standard Template Library for efficient storage/lookup of split candidates, and it uses an object-oriented design with various classes that help avoid repetition.
  • gfpop::gfpop uses C++ code to compute optimal segmentation for various statistical models. It is optimal in the sense of the distribution-specific loss function – it is guaranteed to compute changepoints which result in the minimum loss (whereas binary segmentation may not). The supported distributions are:
    • “mean” Normal change in mean with constant variance.
    • “variance” Normal change in variance with constant mean.
    • “poisson” Poisson change in mean and variance.
    • “exp” Exponential change in mean and variance.
    • “negbin” Negative binomial change in mean with constant variance.

Overall we have the following table which compares the features of the binary segmentation packages:

Feature changepoint binsegRcpp
distributions many one
min seg length yes no
output size quadratic linear
output segment params one model all models

So changepoint has some advantages, and binsegRcpp has other strengths.

Coding project: new BinarySegmentation package

The goal is to code a new BinarySegmentation package that combines the useful features which are currently only available in either binsegRcpp or changepoint. It should

  • support observation-specific positive real weights for the cost function.
  • support input of min segment length constraints.
  • support at least the same distributions as gfpop, ideally all of the distributions in changepoint as well.
  • use C++ std::multiset for efficient storage/lookup of split candidates.
  • use an object-oriented design with various classes that help avoid repetition.
  • for max number of segments = K, output data table with K rows (one per model size), which includes segment-specific model parameters (e.g., segment means), with a coef method for converting that output to a data table with one row per segment.

The source code for computing the distribution-specific loss function in the changepoint package is in src/cost_general_functions.c. The table below shows the mapping from gfpop distributions types and changepoint cost function names:

gfpop type changepoint cost
mean mean_norm
variance var_norm
poisson meanvar_poisson
exp meanvar_exp
negbin NA

Expected impact

This new fast and fully-featured implementation of binary segmentation will be very useful for useRs with sequential data and changepoint detection problems.

Mentors

Please get in touch after completing at least one of the tests below.

Tests

Do one or several — doing more hard tests makes you more likely to be selected.

  • Easy: download data(neuroblastoma, package=”neuroblastoma”) and plot profile.id=4, chromosome=2 (y=logratio, x=position). Use either changepoint or binsegRcpp to compute binary segmentation models up to 5 segments (change in mean, normal distribution), then plot those segment means and changepoints on top of the data. Try using a ggplot with facet_grid(segments ~ .) to show each number of segments in a different panel from top to bottom, as shown in my Supervised changepoint detection tutorial.
  • Medium: Use the other package (binsegRcpp or changepoint) to compute models up to 5 segments for the same data set. Are the changepoints the same? Are the loss values the same? (make ggplots that answer these two questions)
  • Hard 1: Begin by deriving the math/equation for the Negative Binomial loss function, based on that wikipedia article (and/or implementation in gfpop C++ code). Then implement that loss function in C code using the same arguments as in changepoint/src/cost_general_functions.c.
  • Hard 2: Fork binsegRcpp and modify src/binseg_normal.cpp so that it computes change in mean and variance (instead of change of mean with constant variance). You can refer to the existing implementation of this loss function in changepoint/src/cost_general_functions.c (meanvar_norm). Use your new implementation/package to compute changepoints for the same neuroblastoma data set, and verify that the result is the same as the result of changepoint::cpt.meanvar.

Solutions of tests

Students, please post a link to your test results here.

STUDENT NAME/GITHUB TEST RESULTS LINK
Avinal Kumar avinal/rstats-test
Diego Urgell web page
Clone this wiki locally