SAGA sparse linear models
R has excellent support for regularized linear models, which are important machine learning models. For example, the glmnet package implements a coordinate descent algorithm for elastic net regularized generalized linear models. However, there is not yet an R implementation of the fast new SAGA algorithm (NIPS'14 paper).
Importantly, Fig 2 of the SAG journal paper gives evidence that SAG is significantly faster than coordinate descent on a wide range of L2-regularized problems. Since SAGA iterations are similar to SAG iterations, we expect similar speedups for SAGA over coordinate descent on L1-regularized problems.
The goal of this project is to provide an R package implementing SAGA for solving L1-regularized generalized linear models.
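For reference, SAGA minimizes an average loss plus a possibly non-smooth penalty, (1/n) Σᵢ fᵢ(w) + h(w), by keeping a table of past gradient points φᵢ. At each iteration it samples j uniformly and applies the update below (as given in the NIPS'14 paper), after which φⱼ is overwritten with the current w:

$$ w \leftarrow \operatorname{prox}_{\gamma h}\!\left( w - \gamma\left[ f_j'(w) - f_j'(\phi_j) + \frac{1}{n}\sum_{i=1}^{n} f_i'(\phi_i) \right] \right) $$

Here γ is the step size, and for the L1 penalty h the proximal operator is soft-thresholding, which is what produces sparse solutions.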
Scikit-learn in Python implements the SAGA algorithm as a solver option for its LogisticRegression estimator, and we could probably port or adapt that code.
The SAG GSOC’15 project implemented a similar algorithm (SAG) for solving L2-regularized problems.
- Port the SAGA code from Python to an R package.
- Implement all loss families from glmnet (Poisson, binomial, etc).
- Write a vignette with speed/optimality comparisons against glmnet.
- Put it on CRAN.
Sparse generalized linear models are widely used in the Stats/ML/R community, and this package would provide a fast new algorithm for computing them.
Students, please contact mentors below after completing at least one of the tests below.
- Toby Hocking <[email protected]> is a machine learning researcher who mentored the previous SAG R-GSOC student.
- Michael Weylandt has experience with R packages and convex optimization algorithms.
Students, please do one or more of the following tests before contacting the mentors above.
- Easy: use glmnet::cv.glmnet to compute an L1-regularized logistic regression model of the spam data in library(ElemStatLearn). Which features are selected for the prediction function? What are the test error and test AUC of the learned model? Is it significantly better than the trivial model that predicts the most frequent class in the training data? Answer these questions using K-fold cross-validation on the spam data (a sketch of one possible approach appears after this list).
- Medium: compute timings of glmnet (with alpha=0, i.e. an L2 penalty) and bigoptim for data sets (feature matrices) with varying numbers of rows and columns. Compute timings using microbenchmark::microbenchmark and plot timings versus number of rows and columns: which is faster? Which is more accurate, in terms of objective function values? (The objective function is the logistic loss plus the L2 penalty; smaller is better.) A sketch appears after this list.
- Medium-Hard: benchmark Scikit-learn's SAGA solver for L1-regularized logistic regression against R's glmnet. For a given objective function value (logistic loss plus L1 penalty), which computes a solution faster? Or, for a given time budget, which computes a more accurate solution (lower objective function value)? Even better, plot runtime and objective function values (maybe also the difference in objective between the two algorithms) as a function of data set size (rows and columns) to show which scales better to larger data sets. A sketch of one possible setup appears after this list.
- Hard: write an R package with a simple C++ function that computes the L1-regularized logistic regression objective function, using either the classic .C interface or the newer Rcpp interface. Also write a vectorized R function that computes the same thing. Use testthat::expect_equal to write a test showing that your C++ code and R code compute the same values (a sketch appears after this list).
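For the Easy test, here is a minimal sketch of one possible approach, assuming the ElemStatLearn, glmnet, and ROCR packages are installed (the ROCR call for AUC is one of several options; all variable names are illustrative):

```r
library(glmnet)
data(spam, package = "ElemStatLearn")
X <- as.matrix(spam[, -58])  # 57 numeric features
y <- spam$spam               # factor with levels "email"/"spam"

set.seed(1)
fold <- sample(rep(1:5, length.out = nrow(X)))  # 5-fold CV assignment
err <- base.err <- auc <- numeric(5)
for (k in 1:5) {
  is.test <- fold == k
  fit <- cv.glmnet(X[!is.test, ], y[!is.test], family = "binomial")
  prob <- predict(fit, X[is.test, ], s = "lambda.min", type = "response")
  pred <- ifelse(prob > 0.5, "spam", "email")
  err[k] <- mean(pred != y[is.test])
  ## Trivial baseline: always predict the most frequent training class.
  base.err[k] <- mean(y[is.test] != names(which.max(table(y[!is.test]))))
  rocr <- ROCR::prediction(as.numeric(prob), y[is.test])
  auc[k] <- ROCR::performance(rocr, "auc")@y.values[[1]]
}
## Features selected at lambda.min when fitting on all of the data.
b <- as.matrix(coef(cv.glmnet(X, y, family = "binomial"), s = "lambda.min"))
selected <- rownames(b)[b != 0]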
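For the Medium test, a minimal timing sketch follows. The bigoptim::sag_fit call is an assumption based on the SAG GSOC'15 package; consult its documentation for the exact signature, and note the two packages may parameterize the L2 penalty slightly differently.

```r
library(glmnet)
library(microbenchmark)

## L2-regularized logistic objective (smaller is better), labels in {-1, 1}.
l2.obj <- function(w, X, y, lambda) {
  mean(log(1 + exp(-y * as.numeric(X %*% w)))) + lambda * sum(w^2) / 2
}

one.size <- function(n, p, lambda = 0.1) {
  X <- matrix(rnorm(n * p), n, p)
  y <- sample(c(-1, 1), n, replace = TRUE)
  mb <- microbenchmark(
    glmnet = glmnet(X, factor(y), family = "binomial",
                    alpha = 0, lambda = lambda),
    bigoptim = bigoptim::sag_fit(X, y, lambda = lambda),  # assumed interface
    times = 10)
  data.frame(n, p, mb)
}

## Vary rows with columns fixed (and vice versa), then plot time versus
## size on log-log axes, one curve per solver.
timings <- do.call(rbind, lapply(c(100, 1000, 10000), one.size, p = 100))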
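For the Medium-Hard test, one possible setup is to call scikit-learn from R via the reticulate package, so both solvers can be scored with the same objective function. The C = 1/(n*lambda) conversion accounts for the differing penalty parameterizations (scikit-learn minimizes C times the total logistic loss plus the L1 norm, while glmnet minimizes the mean logistic loss plus lambda times the L1 norm). This is a hedged sketch, not a definitive benchmark harness:

```r
library(reticulate)
skl <- import("sklearn.linear_model")

set.seed(1)
n <- 1000; p <- 100; lambda <- 0.01
X <- matrix(rnorm(n * p), n, p)
y <- sample(c(-1, 1), n, replace = TRUE)

## scikit-learn SAGA solver; intercept disabled for a fair comparison.
clf <- skl$LogisticRegression(penalty = "l1", solver = "saga",
                              C = 1 / (n * lambda),
                              fit_intercept = FALSE, max_iter = 1000L)
time.py <- system.time(clf$fit(X, y))
w.py <- as.numeric(clf$coef_)

time.R <- system.time(
  fit <- glmnet::glmnet(X, factor(y), family = "binomial",
                        alpha = 1, lambda = lambda, intercept = FALSE))
w.R <- as.matrix(coef(fit))[-1, 1]  # drop the (zero) intercept row

## Common objective: mean logistic loss + lambda * L1 penalty.
l1.obj <- function(w) {
  mean(log(1 + exp(-y * as.numeric(X %*% w)))) + lambda * sum(abs(w))
}
rbind(seconds = c(python = time.py[["elapsed"]], R = time.R[["elapsed"]]),
      objective = c(python = l1.obj(w.py), R = l1.obj(w.R)))
```

Repeating this over a grid of n and p values gives the runtime and objective curves described above.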
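For the Hard test, the following self-contained sketch demonstrates the core idea via Rcpp::cppFunction; in a real package the C++ code would live in src/ and be exported with Rcpp attributes. Function names are illustrative.

```r
## C++ implementation of the L1-regularized logistic regression objective.
Rcpp::cppFunction('
double l1LogisticObjC(NumericMatrix X, NumericVector y,
                      NumericVector w, double lambda) {
  int n = X.nrow(), p = X.ncol();
  double loss = 0.0;
  for (int i = 0; i < n; i++) {
    double xw = 0.0;
    for (int j = 0; j < p; j++) xw += X(i, j) * w[j];
    loss += log(1.0 + exp(-y[i] * xw));  // logistic loss, labels in {-1, 1}
  }
  double penalty = 0.0;
  for (int j = 0; j < p; j++) penalty += fabs(w[j]);
  return loss / n + lambda * penalty;
}')

## Vectorized R implementation of the same objective.
l1.logistic.obj.R <- function(X, y, w, lambda) {
  mean(log(1 + exp(-y * as.numeric(X %*% w)))) + lambda * sum(abs(w))
}

## Test that the two implementations agree.
set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)
y <- sample(c(-1, 1), 50, replace = TRUE)
w <- rnorm(5)
testthat::expect_equal(l1LogisticObjC(X, y, w, 0.1),
                       l1.logistic.obj.R(X, y, w, 0.1))
```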
Students, please post a link to your test results here.
- Johan Larsson
- Luofeng Liao