
PeakSegPipeline on CRAN

Toby Dylan Hocking edited this page Jan 25, 2018 · 3 revisions

Background

There are many R packages for ChIP-seq data analysis, which is important in genomics. One such package is PeakSegPipeline, which implements state-of-the-art machine learning algorithms for highly accurate peak detection. PeakSegPipeline currently works on Ubuntu, and the goal of this project is to port it to Windows and Mac so that it is ready for a CRAN submission.

Related work

There are several packages for ChIP-seq data analysis on BioC, such as chipseq, mosaics, and ChIPseeker (full list here). These packages are based on unsupervised statistical models, whereas PeakSegPipeline is based on supervised machine learning.

The other PeakSeg packages include

  • PeakSegDP: R package with C code implementing an O(K N^2) time heuristic algorithm for up-down constrained changepoint detection in a single sample (K segments, N data points). This package is not used by PeakSegPipeline.
  • PeakSegOptimal: R package with C++ code implementing an O(N log N) time optimal algorithm for up-down constrained changepoint detection in a single sample. This package is not used by PeakSegPipeline, because PeakSegOptimal::PeakSegFPOP uses O(N log N) memory, which would cause memory swapping for large genomic data sets. Instead, PeakSegPipeline::PeakSegFPOP_disk provides an O(N log N) time, O(log N) memory, O(N log N) disk algorithm for computing the same model. Its reduced memory requirements mean that it will not swap, so it is suitable for very large genomic data sets.
  • PeakSegJoint: R package with C code implementing an O(N S log S) time heuristic algorithm for finding the most likely common single peak in a genomic region in S samples (with N data points per sample). This package is used by PeakSegPipeline to predict which samples are up or down in each genomic region.
  • PeakError: R package with C code that computes the number of incorrect labels for a given peak model and labeled data set. This package is used by PeakSegPipeline to compute the target interval of penalty values that result in the minimal number of incorrect labels. The target interval is the output in the supervised learning problem, which aims to learn a function that predicts a penalty (and thus the number of peaks) for each sample or genomic region.
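
To make the idea of a target interval concrete, here is a minimal sketch in base R. It does not use the PeakError or PeakSegPipeline API: the data frame, its column names, and the target_interval function are all hypothetical, and the error counts are made up for illustration. It assumes that each row covers a range of penalty values over which the peak model (and so its number of incorrect labels) is constant, and that the rows with minimal error are adjacent.

```r
# Hypothetical table: one row per range of penalty values that yields the
# same peak model. The error counts are invented for this illustration.
models <- data.frame(
  min.penalty = c(0, 10, 100, 1000),
  max.penalty = c(10, 100, 1000, Inf),
  peaks = c(12, 5, 3, 0),
  errors = c(4, 1, 1, 3))

# Hypothetical helper: return the range of penalties with the fewest
# incorrect labels. Assumes the minimal-error rows are contiguous.
target_interval <- function(models) {
  is.min <- models$errors == min(models$errors)
  c(min = min(models$min.penalty[is.min]),
    max = max(models$max.penalty[is.min]))
}

target <- target_interval(models)
print(target)
```

In this toy example, any penalty between 10 and 1000 gives the minimal number of incorrect labels, so a learned function would be considered correct if it predicts a penalty inside that interval.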

Coding project: PeakSegPipeline on CRAN

Detailed project goals:

  • set up AppVeyor CI to test on Windows and Mac.
  • create and document an installation procedure for Windows and Mac.
  • identify GNU/Linux-specific code, e.g. fread("tail -1 coverage.bedGraph"), and modify it so that it works on Windows and Mac.
  • use R-hub to check on Mac and Win-Builder to check on Windows.
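
As an illustration of the third goal, a call such as fread("tail -1 coverage.bedGraph") depends on the tail shell command, which is not available by default on Windows. The sketch below shows one possible portable replacement using only base R. The read_last_line function and its chunk.lines parameter are hypothetical names, not part of PeakSegPipeline, and the four-column tab-separated bedGraph layout is assumed.

```r
# Hypothetical helper (not part of PeakSegPipeline): return the last line
# of a text file without calling any shell command. The file is read in
# chunks so that memory use stays bounded for large files.
read_last_line <- function(path, chunk.lines = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  last <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk.lines, warn = FALSE)
    if (length(chunk) == 0) break
    last <- chunk[length(chunk)]
  }
  last
}

# Usage on a small temporary file with made-up coverage data.
tmp <- tempfile(fileext = ".bedGraph")
writeLines(c("chr1\t0\t10\t0", "chr1\t10\t25\t3", "chr1\t25\t40\t1"), tmp)
last.line <- read_last_line(tmp)
fields <- strsplit(last.line, "\t", fixed = TRUE)[[1]]
last.end <- as.integer(fields[3])
print(last.end)
```

This version scans the whole file, so it is slower than tail on very large files. It trades speed for portability, and a faster approach would need to be checked carefully on each target system.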

The ideal student will also write tests, documentation, vignettes and a blog.

Expected impact

This GSOC project will make the PeakSegPipeline package more portable, so that this important code can be used on a wider variety of systems, and installed easily from CRAN.

Mentors

Please get in touch with Toby Dylan Hocking <[email protected]> and Guillem Rigaill <[email protected]> after completing at least one of the tests below.

Tests

Easy: download the PeakSegPipeline package and run test-pipeline-demo.R on your own computer.

Medium: demonstrate that you know how to use Docker images, which you will need during GSOC to test PeakSegPipeline on different systems. Inside a Windows Docker image, try installing PeakSegPipeline. What happens?

Hard: look through the source code and explain which parts you think will need to be changed so that the code works on Windows and Mac.

Solutions of tests

  • Students, please post a link to your test results here.