Critical efficiency improvements of mcmcse

Dootika Vats edited this page Apr 8, 2021 · 10 revisions

Background

The mcmcse package is the leading R package for estimating Monte Carlo standard errors in Markov chain Monte Carlo. Since 2012, it has expanded to include multivariate output analysis methods and the reliable calculation of effective sample size. Its functions are often called on massive matrices, with rows on the order of millions and columns on the order of thousands, which creates efficiency bottlenecks. The primary goal of the project is to systematically identify and clear these bottlenecks through detailed benchmarking and testing.

Most of the heavy code is written in C++ using Rcpp. A CRAN hosted version of the package is here and a GitHub development version of the package is here.

Related work

A few other R packages do univariate effective sample size calculations, the most popular of which is coda. However, coda does not use consistent estimators of the variance, and its variance estimates are known to be liberal. In addition, we know of no other package that does multivariate effective sample size calculations.
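As a rough illustration of what a univariate effective sample size calculation involves, the sketch below estimates the asymptotic variance with a batch means estimator and forms ESS as n times the ratio of the sample variance to that estimate. It is standalone C++, not code taken from mcmcse or coda; the function names, the batch size floor(sqrt(n)), and the choice to drop leftover samples at the end of the chain are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Batch means estimate of the asymptotic variance sigma^2 in the
// Markov chain central limit theorem. Uses the first a * b samples,
// where a = floor(n / b) is the number of full batches of size b.
double batch_means_variance(const std::vector<double>& x, std::size_t b) {
    assert(b > 0);
    const std::size_t a = x.size() / b;
    assert(a >= 2);  // at least two batches are needed
    const std::size_t n = a * b;

    double overall = 0.0;
    for (std::size_t i = 0; i < n; ++i) overall += x[i];
    overall /= static_cast<double>(n);

    double ss = 0.0;  // sum of squared deviations of batch means
    for (std::size_t k = 0; k < a; ++k) {
        double batch_mean = 0.0;
        for (std::size_t i = k * b; i < (k + 1) * b; ++i) batch_mean += x[i];
        batch_mean /= static_cast<double>(b);
        ss += (batch_mean - overall) * (batch_mean - overall);
    }
    return static_cast<double>(b) * ss / static_cast<double>(a - 1);
}

// Unbiased sample variance (divisor n - 1).
double sample_variance(const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n >= 2);
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(n);
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return ss / static_cast<double>(n - 1);
}

// ESS = n * (sample variance) / (batch means variance estimate).
// Assumption: batch size b = floor(sqrt(n)), chosen only for illustration.
double ess_batch_means(const std::vector<double>& x) {
    const std::size_t n = x.size();
    const auto b = static_cast<std::size_t>(
        std::floor(std::sqrt(static_cast<double>(n))));
    return static_cast<double>(n) * sample_variance(x) /
           batch_means_variance(x, b);
}
```

For independent draws the two variance estimates target the same quantity, so the ESS should land near n; for a positively correlated chain the batch means estimate is larger and the ESS drops below n.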

Details of your coding project

The primary tasks for the student are the following:

  1. Improve batchSize() function: for large matrices, this function is a prime bottleneck and is integral to the smooth running of most other functions in the package. The goal is to identify the bottlenecks and see whether the function can be moved to Rcpp. This may be challenging, since the function relies on the already fast ar implementations.
  2. Develop new batchSize(): The student will be responsible for coding the batch size calculations of Andrews (1991) and comparing the implementation with the current function. This will require statistical knowledge and background reading in addition to coding aptitude.
  3. Create mcse object class: The output of mcse.multi is a list that stores essential information about the Monte Carlo standard errors of the Markov chain, along with the batch size used, the methods employed, etc. Since output from mcse.multi can be used by external packages, it is essential to make this list identifiable with a class.
  4. Rcpp contains some useful sugar functions for large matrices that should be integrated into the current code. This may require changing from RcppArmadillo to Rcpp in certain situations. Detailed efficiency tests must be run to ensure computational gains in various setups.
  5. Calculating determinants and eigenvalues of unstable matrices produces numerical instabilities. A series of tests needs to be designed to verify whether the package is immune to these instabilities. A robust understanding of matrix linear algebra will be useful.
  6. Integration with roxygen2 must be implemented as part of the project. This will lead to cleaner documentation.
  7. There are some instances of corner cases where the package currently returns error or warning messages. Such cases have been identified and as part of the project, these will need to be handled by the mcse.multi function.
  8. On completion of the above tasks, a detailed user-testing script will be designed to support all future development. This package is a long-term project and such a script will help future developers.
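One way to start on the numerical stability tests in task 5 is to compare a naive determinant with a log-determinant computed from a Cholesky factorization, which avoids the underflow and overflow that a raw product of pivots can hit. The sketch below is standalone C++ and assumes the input is symmetric positive definite, as a covariance matrix estimate ideally is; `log_det_spd` and the `Matrix` alias are hypothetical names, not part of mcmcse.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Returns log(det(A)) for a symmetric positive-definite matrix A using
// the Cholesky factorization A = L * L^T, so that
// log det(A) = sum_j log(L_jj^2).
// Returns NaN if a pivot is not strictly positive, i.e. A is not
// numerically positive definite.
double log_det_spd(const Matrix& A) {
    const std::size_t p = A.size();
    Matrix L(p, std::vector<double>(p, 0.0));
    double log_det = 0.0;

    for (std::size_t j = 0; j < p; ++j) {
        assert(A[j].size() == p);  // A must be square

        // Squared diagonal entry of L for column j.
        double d = A[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        if (!(d > 0.0)) return std::numeric_limits<double>::quiet_NaN();

        L[j][j] = std::sqrt(d);
        log_det += std::log(d);  // d equals L_jj^2

        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < p; ++i) {
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return log_det;
}
```

A 2 x 2 diagonal matrix with entries 1e-200 makes a useful test case: its determinant underflows to zero in double precision, while the log-determinant stays finite. A NaN return flags an input that is not numerically positive definite, and determinant ratios can be formed as the exponential of a difference of log-determinants.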

Ideal Student

An ideal student for this project has taken sufficient courses in statistics, is familiar with Monte Carlo strategies, and has experience with Rcpp, C++, and vectorization in R.

Expected impact

The mcmcse package has been downloaded over 48,000 times and has 106 citations on Google Scholar. The general scientific community already finds the package useful, and any improvements will continue to benefit this larger community. Additionally, mcmcse will soon be the foundation for a user-oriented package for Simulation Output Analysis.

Mentors

  • EVALUATING MENTOR: Dootika Vats [email protected] is the author and maintainer of R package mcmcse and a contributor to R package stableGR. She was a GSoC student participant in 2015 for this same package and is an expert in MCMC output analysis.
  • James Flegal [email protected] is the founding author of the package and an expert in MCMC output analysis.

Tests

Students, please do one or more of the following tests before contacting the mentors above.

  • Easy: (1) Download the mcmcse package from CRAN and use the function ess on a vector foo of length 1e4 randomly drawn from a standard normal distribution. (2) Make a random matrix of size 10 x 10 and produce only the eigenvalues of the matrix.
  • Medium: Profile the efficiency of the batchSize() function using profvis. Do this for varying sizes of input matrices.
  • Hard: (1) Write code for a random walk Metropolis-Hastings algorithm to sample from a 100-dimensional standard Gaussian distribution. Focus on efficient implementation of this code. (2) Calculate the effective sample size as described in this paper in a way that is numerically stable and does not use any built-in functions. Make sure you write your own function for this.

Solutions of tests

Students, please post a link to your test results here.

| S No. | STUDENT NAME | GITHUB PROFILE | TEST RESULTS LINK |
|---|---|---|---|
| 1 | Akhil Kumar Jha | Github | Test results |
| 2 | Kushagra Gupta | Github | Test results |