|
| 1 | +--- |
| 2 | +title: Constructing a Sparse Matrix Class in Rcpp |
| 3 | +author: Zach DeBruine and Dirk Eddelbuettel |
| 4 | +license: GPL (>= 2) |
| 5 | +tags: sparse s4 matrix |
| 6 | +summary: Creating a lightweight sparse matrix class in Rcpp |
| 7 | +layout: post |
| 8 | +src: 2021-04-16-sparse-matrix-class.Rmd |
| 9 | +--- |
| 10 | + |
| 11 | + |
| 12 | + |
| 13 | + |
| 14 | + |
| 15 | +### Introduction |
| 16 | + |
| 17 | +It is no secret that sparse matrix operations are faster in C++ than in R. |
| 18 | +[RcppArmadillo](https://github.com/RcppCore/RcppArmadillo) and [RcppEigen](https://github.com/RcppCore/RcppEigen) do a great job copying sparse matrices from R to C++ and back again. |
| 19 | +But note the word "copy". |
| 20 | +Every time RcppArmadillo converts an R sparse matrix to an `arma::SpMat<T>` object, it has to creates a _deep copy_ due to the difference in representation between dense matrices (usually occupying |
| 21 | +contiguous chunks of memory) and sparse matrices (which do not precisely because of _sparsity_). |
| 22 | +Similarly, every time RcppEigen converts an R sparse matrix to an `Eigen::SparseMatrix<T>` object, it also creates a deep copy. |
| 23 | + |
| 24 | +The price of instantiating sparse matrix object is measurable in both time and memory. |
| 25 | +But one of the advantages of types such as `Rcpp::NumericVector` is that they simply re-use the underlying memory of the R object by wrapping around the underlying `SEXP` representation and its |
| 26 | +contiguous dense memory layout so no copy is created! |
| 27 | +Can we create a sparse matrix class using `Rcpp::NumericVector` and `Rcpp::IntegerVector` that uses them similarly as references rather than actual deep-copy of each element? |
| 28 | + |
| 29 | +### Structure of a `dgCMatrix` |
| 30 | + |
| 31 | +We will focus on the `dgCMatrix` type, the most common form of compressed-sparse-column (CSC) matrix in the [Matrix](http://cloud.r-project.org/package=Matrix) package. |
| 32 | +The `dgCMatrix` class is an S4 object with several slots: |
| 33 | + |
| 34 | + |
| 35 | +{% highlight r %} |
| 36 | +library(Matrix) |
| 37 | +set.seed(123) |
| 38 | +str(rsparsematrix(5, 5, 0.5), vec.len = 12) |
| 39 | +{% endhighlight %} |
| 40 | + |
| 41 | + |
| 42 | + |
| 43 | +<pre class="output"> |
| 44 | +Formal class 'dgCMatrix' [package "Matrix"] with 6 slots |
| 45 | + ..@ i : int [1:12] 2 4 0 3 4 0 3 4 2 3 0 2 |
| 46 | + ..@ p : int [1:6] 0 2 5 8 10 12 |
| 47 | + ..@ Dim : int [1:2] 5 5 |
| 48 | + ..@ Dimnames:List of 2 |
| 49 | + .. ..$ : NULL |
| 50 | + .. ..$ : NULL |
| 51 | + ..@ x : num [1:12] 0.5 -1 0.83 -0.056 2.5 0.24 1.7 1.3 0.55 -1.7 -0.78 1.3 |
| 52 | + ..@ factors : list() |
| 53 | +</pre> |
| 54 | + |
| 55 | +Here we have: |
| 56 | + |
| 57 | +* Slot `i` is an integer vector giving the row indices of the non-zero values of the matrix |
| 58 | +* Slot `p` is an integer vector giving the index of the first non-zero value of each column in `i` |
| 59 | +* Slot `x` gives the non-zero elements of the matrix corresponding to rows in `i` and columns delineated by `p` |
| 60 | + |
| 61 | +Realize that `i` and `p` are integer vectors, and `x` is a numeric vector (stored as a `double`), corresponding to `Rcpp::IntegerVector` and `Rcpp::NumericVector`. |
| 62 | +That means that we can construct a sparse matrix class in C++ using Rcpp vectors! |
| 63 | + |
| 64 | +### A dgCMatrix reference class for Rcpp |
| 65 | + |
| 66 | +A `dgCMatrix` is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors. |
| 67 | +That makes class construction easy: |
| 68 | + |
| 69 | + |
| 70 | +{% highlight cpp %} |
| 71 | +#include <Rcpp.h> |
| 72 | + |
| 73 | +namespace Rcpp { |
| 74 | + class dgCMatrix { |
| 75 | + public: |
| 76 | + IntegerVector i, p, Dim; |
| 77 | + NumericVector x; |
| 78 | + List Dimnames; |
| 79 | + |
| 80 | + // constructor |
| 81 | + dgCMatrix(S4 mat) { |
| 82 | + i = mat.slot("i"); |
| 83 | + p = mat.slot("p"); |
| 84 | + x = mat.slot("x"); |
| 85 | + Dim = mat.slot("Dim"); |
| 86 | + Dimnames = mat.slot("Dimnames"); |
| 87 | + }; |
| 88 | + |
| 89 | + // column iterator |
| 90 | + class col_iterator { |
| 91 | + public: |
| 92 | + int index; |
| 93 | + col_iterator(dgCMatrix& g, int ind) : parent(g) { index = ind; } |
| 94 | + bool operator!=(col_iterator x) { return index != x.index; }; |
| 95 | + col_iterator& operator++(int) { ++index; return (*this); }; |
| 96 | + int row() { return parent.i[index]; }; |
| 97 | + int col() { return column; }; |
| 98 | + double& value() { return parent.x[index]; }; |
| 99 | + private: |
| 100 | + dgCMatrix& parent; |
| 101 | + int column; |
| 102 | + }; |
| 103 | + col_iterator begin_col(int j) { return col_iterator(*this, p[j]); }; |
| 104 | + col_iterator end_col(int j) { return col_iterator(*this, p[j + 1]); }; |
| 105 | + |
| 106 | + }; |
| 107 | + |
| 108 | + template <> dgCMatrix as(SEXP mat) { return dgCMatrix(mat); } |
| 109 | + |
| 110 | + template <> SEXP wrap(const dgCMatrix& sm) { |
| 111 | + S4 s(std::string("dgCMatrix")); |
| 112 | + s.slot("i") = sm.i; |
| 113 | + s.slot("p") = sm.p; |
| 114 | + s.slot("x") = sm.x; |
| 115 | + s.slot("Dim") = sm.Dim; |
| 116 | + s.slot("Dimnames") = sm.Dimnames; |
| 117 | + return s; |
| 118 | + } |
| 119 | +} |
| 120 | +{% endhighlight %} |
| 121 | + |
| 122 | +In the above code, first we create a C++ class for a `dgCMatrix` in the Rcpp namespace with public members corresponding to the slots in an R `Matrix::dgCMatrix`. |
| 123 | +Second, we add a constructor for the class that receives an S4 R object (which should be a valid `dgCMatrix` object). |
| 124 | +Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix. |
| 125 | +Finally, we use `Rcpp::as` and `Rcpp::wrap` for seamless conversion between R and C++ and back again. |
| 126 | + |
| 127 | +### Using the Rcpp::dgCMatrix class |
| 128 | + |
| 129 | +We can now simply pull a `dgCMatrix` into any Rcpp function thanks to our handy class methods and the magic of `Rcpp::as`. |
| 130 | + |
| 131 | + |
| 132 | +{% highlight cpp %} |
| 133 | +//[[Rcpp::export]] |
| 134 | +Rcpp::dgCMatrix R_to_Cpp_to_R(Rcpp::dgCMatrix& mat){ |
| 135 | + return mat; |
| 136 | +} |
| 137 | +{% endhighlight %} |
| 138 | + |
| 139 | +And call the following from R: |
| 140 | + |
| 141 | + |
| 142 | +{% highlight r %} |
| 143 | +library(Matrix) |
| 144 | +spmat <- abs(rsparsematrix(100, 100, 0.1)) |
| 145 | +spmat2 <- R_to_Cpp_to_R(spmat) |
| 146 | +all.equal(spmat, spmat2) |
| 147 | +{% endhighlight %} |
| 148 | + |
| 149 | + |
| 150 | + |
| 151 | +<pre class="output"> |
| 152 | +[1] TRUE |
| 153 | +</pre> |
| 154 | + |
| 155 | +Awesome! |
| 156 | + |
| 157 | +### Passing by reference versus value |
| 158 | + |
| 159 | +Rcpp structures such as `NumericVector` are wrappers around the existing R objects. |
| 160 | +This means that if we modify our sparse matrix in C++, it will modify the original R object. |
| 161 | +For instance: |
| 162 | + |
| 163 | + |
| 164 | +{% highlight cpp %} |
| 165 | +//[[Rcpp::export]] |
| 166 | +Rcpp::dgCMatrix Rcpp_square(Rcpp::dgCMatrix& mat){ |
| 167 | + for (Rcpp::NumericVector::iterator i = mat.x.begin(); i != mat.x.end(); ++i) |
| 168 | + (*i) *= (*i); |
| 169 | + return mat; |
| 170 | +} |
| 171 | +{% endhighlight %} |
| 172 | +used in |
| 173 | + |
| 174 | + |
| 175 | +{% highlight r %} |
| 176 | +cat("sum before squaring: ", sum(spmat)) |
| 177 | +{% endhighlight %} |
| 178 | + |
| 179 | + |
| 180 | + |
| 181 | +<pre class="output"> |
| 182 | +sum before squaring: 801.059 |
| 183 | +</pre> |
| 184 | + |
| 185 | + |
| 186 | + |
| 187 | +{% highlight r %} |
| 188 | +spmat2 <- Rcpp_square(spmat) |
| 189 | +cat("sum of original object after squaring: ", sum(spmat)) |
| 190 | +{% endhighlight %} |
| 191 | + |
| 192 | + |
| 193 | + |
| 194 | +<pre class="output"> |
| 195 | +sum of original object after squaring: 1007.97 |
| 196 | +</pre> |
| 197 | + |
| 198 | + |
| 199 | + |
| 200 | +{% highlight r %} |
| 201 | +cat("sum of assigned object after squaring: ", sum(spmat2)) |
| 202 | +{% endhighlight %} |
| 203 | + |
| 204 | + |
| 205 | + |
| 206 | +<pre class="output"> |
| 207 | +sum of assigned object after squaring: 1007.97 |
| 208 | +</pre> |
| 209 | + |
| 210 | +Notice how the original object AND the returned object are identical, yet they don't point to the same memory address (because of _copy on write_): |
| 211 | + |
| 212 | + |
| 213 | +{% highlight r %} |
| 214 | +tracemem(spmat) |
| 215 | +{% endhighlight %} |
| 216 | + |
| 217 | + |
| 218 | + |
| 219 | +<pre class="output"> |
| 220 | +[1] "<0x55f9609a3ae0>" |
| 221 | +</pre> |
| 222 | + |
| 223 | + |
| 224 | + |
| 225 | +{% highlight r %} |
| 226 | +tracemem(spmat2) |
| 227 | +{% endhighlight %} |
| 228 | + |
| 229 | + |
| 230 | + |
| 231 | +<pre class="output"> |
| 232 | +[1] "<0x55f960f449f0>" |
| 233 | +</pre> |
| 234 | + |
| 235 | +You might further inspect memory addresses within the structure using `.Internal(inspect())` and indeed, you will see the memory addresses are not the same. |
| 236 | +What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator. |
| 237 | + |
| 238 | + |
| 239 | +{% highlight r %} |
| 240 | +set.seed(123) |
| 241 | +spmat <- abs(rsparsematrix(100, 100, 0.1)) |
| 242 | +sum_before <- sum(spmat) |
| 243 | +Rcpp_square(spmat) |
| 244 | +sum_after <- sum(spmat) |
| 245 | +{% endhighlight %} |
| 246 | + |
| 247 | + |
| 248 | +<pre class="output"> |
| 249 | +sum_before = 793.861 |
| 250 | +</pre> |
| 251 | + |
| 252 | + |
| 253 | + |
| 254 | +<pre class="output"> |
| 255 | +sum_after = 970.174 |
| 256 | +</pre> |
| 257 | + |
| 258 | +This principle of course applies to other Rcpp classes such as `NumericVector` as well. |
| 259 | +But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only. |
| 260 | +If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen. |
| 261 | +These libraries are extremely capable and well-documented---and generally give you access to specific sparse Matrix algorithms. |
| 262 | + |
| 263 | +### Sparse iterator class |
| 264 | + |
| 265 | +One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator. |
| 266 | +A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class. |
| 267 | +We can use this iterator in a very Armadillo-esque manner to compute things like column sums: |
| 268 | + |
| 269 | + |
| 270 | +{% highlight cpp %} |
| 271 | +//[[Rcpp::export]] |
| 272 | +Rcpp::NumericVector Rcpp_colSums(Rcpp::dgCMatrix& mat){ |
| 273 | + int n_col = mat.p.size() - 1; |
| 274 | + Rcpp::NumericVector sums(n_col); |
| 275 | + for (int col = 0; col < n_col; col++) |
| 276 | + for (Rcpp::dgCMatrix::col_iterator it = mat.begin_col(col); it != mat.end_col(col); it++) |
| 277 | + sums[col] += it.value(); |
| 278 | + return sums; |
| 279 | +} |
| 280 | +{% endhighlight %} |
| 281 | + |
| 282 | +On the R end: |
| 283 | + |
| 284 | + |
| 285 | +{% highlight r %} |
| 286 | +sums <- Rcpp_colSums(spmat) |
| 287 | +all.equal(Matrix::colSums(spmat), sums) |
| 288 | +{% endhighlight %} |
| 289 | + |
| 290 | + |
| 291 | + |
| 292 | +<pre class="output"> |
| 293 | +[1] TRUE |
| 294 | +</pre> |
| 295 | + |
| 296 | +Great---but is it faster??? |
| 297 | + |
| 298 | + |
| 299 | +{% highlight r %} |
| 300 | +library(microbenchmark) |
| 301 | +microbenchmark(Rcpp_colSums(spmat), Matrix::colSums(spmat)) |
| 302 | +{% endhighlight %} |
| 303 | + |
| 304 | + |
| 305 | + |
| 306 | +<pre class="output"> |
| 307 | +Unit: microseconds |
| 308 | + expr min lq mean median uq max neval cld |
| 309 | + Rcpp_colSums(spmat) 2.510 2.7885 3.32225 3.1355 3.2685 21.842 100 a |
| 310 | + Matrix::colSums(spmat) 11.068 11.5380 13.26691 11.7775 12.0525 79.324 100 b |
| 311 | +</pre> |
| 312 | + |
| 313 | +### Extending the class |
| 314 | + |
| 315 | +There are many ways we can extend `Rcpp::dgCMatrix`. |
| 316 | +For example, we can use `const_iterator`, `row_iterator` and iterators that traverse all values in addition to col_iterator. |
| 317 | +We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc. |
| 318 | + |
| 319 | +We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix |
| 320 | +class. |
| 321 | +For now, see the [Github repo for RcppSparse](https://github.com/zdebruine/RcppSparse) for the in-progress package. |
0 commit comments