Critical efficiency improvements of mcmcse
The mcmcse package is the leading package for estimating Monte Carlo standard errors for Markov chain Monte Carlo. Since 2012, it has expanded to include multivariate output analysis methods and the reliable calculation of effective sample size. Functions are often called on massive matrices with millions of rows and thousands of columns, which creates efficiency bottlenecks. The primary goal of the project is to systematically identify and clear these bottlenecks via detailed benchmarking and testing.
Most of the heavy code is written in C++ using Rcpp. A CRAN hosted version of the package is here and a GitHub development version of the package is here.
There are a few other packages in R that do univariate effective sample size calculations, the most popular of which is `coda`. However, `coda` does not use consistent estimators of the variance, and its variance estimates are known to be liberal. In addition, we know of no other package that does multivariate effective sample size calculations.
The primary tasks for the student are:
- Improve the `batchSize()` function: for large matrices, this function is a prime bottleneck and is integral to the smooth running of most other functions in the package. The goal is to identify the bottlenecks and see whether the function can be moved to Rcpp. This may be challenging, since the function already uses the fast `ar` implementations.
- Develop a new `batchSize()`: the student will be responsible for coding the batch size calculations of Andrews (1991) and comparing the implementation with the current function. This will require statistical knowledge and background reading in addition to coding aptitude.
- Create an `mcse` object class: `mcse.multi` returns a list that stores essential information about the Monte Carlo standard errors in the Markov chain, along with the batch size used, the methods employed, etc. Since output from `mcse.multi` can be used by other external packages, it is essential to make this list identifiable with a class.
- `Rcpp` contains some useful sugar functions for large matrices that should be integrated into the current code. This may require changing from `RcppArmadillo` to `Rcpp` in certain situations. Detailed efficiency tests must be run to confirm computational gains in various setups.
- Calculating determinants and eigenvalues of unstable matrices produces some numerical instabilities. A series of tests needs to be designed to verify whether the package is robust to these instabilities. A robust understanding of matrix linear algebra will be useful.
- Integration with `roxygen2` must be implemented as part of the project. This will lead to cleaner documentation.
- There are some corner cases where the package currently returns error or warning messages. These cases have been identified and, as part of the project, will need to be handled by the `mcse.multi` function.
- On completion of the above tasks, a detailed user-testing script will be designed to help in all future development. This package is a long-term project and such a script will help future developers.
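For readers unfamiliar with why the batch size matters, here is a minimal standalone C++ sketch of a univariate batch means estimator, the kind of computation that consumes a batch size. It is not the package's implementation: the function names `batch_means_var` and `batch_means_se` and the argument `b` are hypothetical, and the batch size is simply passed in rather than chosen by a `batchSize()`-style rule.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Batch means estimate of the asymptotic variance in the Markov chain CLT.
// x: univariate chain output; b: batch size (hypothetical argument name).
// Only the first a * b draws are used, where a = floor(n / b) batches.
double batch_means_var(const std::vector<double>& x, std::size_t b) {
    if (b == 0) throw std::invalid_argument("batch size must be positive");
    const std::size_t a = x.size() / b;
    if (a < 2) throw std::invalid_argument("need at least two full batches");

    // Mean of each batch, and the mean of those batch means.
    std::vector<double> means(a, 0.0);
    double grand = 0.0;
    for (std::size_t k = 0; k < a; ++k) {
        double s = 0.0;
        for (std::size_t i = 0; i < b; ++i) s += x[k * b + i];
        means[k] = s / static_cast<double>(b);
        grand += means[k];
    }
    grand /= static_cast<double>(a);

    // sigma^2 estimate: b / (a - 1) * sum_k (mean_k - grand)^2.
    double ss = 0.0;
    for (double m : means) ss += (m - grand) * (m - grand);
    return static_cast<double>(b) * ss / static_cast<double>(a - 1);
}

// Monte Carlo standard error of the sample mean: sqrt(sigma^2 / n),
// with n taken as the number of draws actually used (a * b).
double batch_means_se(const std::vector<double>& x, std::size_t b) {
    const double sigma2 = batch_means_var(x, b);  // also validates b
    const std::size_t used = (x.size() / b) * b;
    return std::sqrt(sigma2 / static_cast<double>(used));
}
```

A common default in the batch means literature is a batch size near the square root of the chain length; whether a data-driven choice does better is exactly what the `batchSize()` tasks above are about.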
An ideal student for this project has taken sufficient courses in statistics, is aware of Monte Carlo strategies, and has experience with Rcpp, C++, and vectorization in R.
The package mcmcse has been downloaded over 48,000 times and has 106 citations on Google Scholar. The package has already been found useful by the general scientific community, and any improvements will continue to benefit this larger community. Additionally, the `mcmcse` package will soon be the foundation for a user-oriented package for Simulation Output Analysis.
- EVALUATING MENTOR: Dootika Vats [email protected] is the author and maintainer of the R package mcmcse and a contributor to the R package stableGR. She was a GSoC student participant in 2015 for this same package and is an expert in MCMC output analysis.
- James Flegal [email protected] is the founding author of the package and an expert in MCMC output analysis.
Students, please do one or more of the following tests before contacting the mentors above.
- Easy: (1) Download the mcmcse package from CRAN and use the function `ess` on a vector `foo` of length 1e4 randomly drawn from a standard normal distribution. (2) Make a random matrix of size 10 x 10 and produce only the eigenvalues of the matrix.
- Medium: Implement an efficiency profile of the `batchSize()` function using `profvis`. Do this for varying sizes of input matrices.
- Hard: (1) Write code for a random walk Metropolis-Hastings algorithm to sample from a 100-dimensional standard normal distribution. Focus on efficient implementation of this code. (2) Calculate the effective sample size as described in this paper in a way that is numerically stable and does not use any built-in functions. Make sure you write your own function for this.
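To make the first Hard task concrete, the following is a rough standalone C++ sketch of a random walk Metropolis sampler for a d-dimensional standard normal target. It is one possible shape for such code, not a reference solution: the function name `rw_metropolis` and the parameters `step` (proposal standard deviation) and `seed` are illustrative assumptions, and the chain starts at the origin for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Random walk Metropolis targeting a d-dimensional standard normal.
// Returns the chain as a flat row-major vector of length n * d.
// `step` and `seed` are illustrative parameters, not part of any package API.
std::vector<double> rw_metropolis(std::size_t n, std::size_t d,
                                  double step, unsigned seed) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> z(0.0, 1.0);
    std::uniform_real_distribution<double> u(0.0, 1.0);

    std::vector<double> chain(n * d);
    std::vector<double> cur(d, 0.0), prop(d);
    // Cache the squared norm of the current state; the log density is
    // -cur_sq / 2 up to a constant, so it never has to be recomputed.
    double cur_sq = 0.0;

    for (std::size_t t = 0; t < n; ++t) {
        double prop_sq = 0.0;
        for (std::size_t j = 0; j < d; ++j) {
            prop[j] = cur[j] + step * z(gen);
            prop_sq += prop[j] * prop[j];
        }
        // Accept with probability min(1, exp((cur_sq - prop_sq) / 2)),
        // compared on the log scale to avoid overflow.
        if (std::log(u(gen)) < 0.5 * (cur_sq - prop_sq)) {
            cur.swap(prop);  // O(1) swap instead of copying d values
            cur_sq = prop_sq;
        }
        std::copy(cur.begin(), cur.end(),
                  chain.begin() + static_cast<std::ptrdiff_t>(t * d));
    }
    return chain;
}
```

The efficiency choices here are modest: the proposal buffer is reused across iterations, the current log density is cached, and the acceptance test stays on the log scale. For d = 100, a proposal standard deviation around 2.38 / sqrt(d) is a widely cited starting point for tuning `step`, though the best value should be checked empirically.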
Students, please post a link to your test results here.
| S No. | STUDENT NAME | GITHUB PROFILE | TEST RESULTS LINK |
| --- | --- | --- | --- |
| 1 | Akhil Kumar Jha | Github | Test results |
| 2 | Kushagra Gupta | Github | Test results |