Skip to content

Commit 60211a9

Browse files
authored
Merge pull request #135 from zdebruine/gh-pages
Constructing a sparse matrix class in Rcpp
2 parents ed2e856 + 944506f commit 60211a9

File tree

1 file changed

+227
-0
lines changed

1 file changed

+227
-0
lines changed
Lines changed: 227 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,227 @@
1+
---
2+
title: Constructing a Sparse Matrix Class in Rcpp
3+
author: Zach DeBruine and Dirk Eddelbuettel
4+
license: GPL (>= 2)
5+
tags: sparse s4 matrix
6+
summary: Creating a lightweight sparse matrix class in Rcpp
7+
---
8+
9+
```{r echo = F, cache = F}
10+
knitr::knit_hooks$set(document = function(x){
11+
gsub("```\n*```r*\n*", "", x)
12+
})
13+
```
14+
15+
```{Rcpp, ref.label=knitr::all_rcpp_labels(), include=FALSE}
16+
```
17+
18+
### Introduction
19+
20+
It is no secret that sparse matrix operations are faster in C++ than in R.
21+
[RcppArmadillo](https://github.com/RcppCore/RcppArmadillo) and [RcppEigen](https://github.com/RcppCore/RcppEigen) do a great job copying sparse matrices from R to C++ and back again.
22+
But note the word "copy".
23+
Every time RcppArmadillo converts an R sparse matrix to an `arma::SpMat<T>` object, it has to creates a _deep copy_ due to the difference in representation between dense matrices (usually occupying
24+
contiguous chunks of memory) and sparse matrices (which do not precisely because of _sparsity_).
25+
Similarly, every time RcppEigen converts an R sparse matrix to an `Eigen::SparseMatrix<T>` object, it also creates a deep copy.
26+
27+
The price of instantiating sparse matrix object is measurable in both time and memory.
28+
But one of the advantages of types such as `Rcpp::NumericVector` is that they simply re-use the underlying memory of the R object by wrapping around the underlying `SEXP` representation and its
29+
contiguous dense memory layout so no copy is created!
30+
Can we create a sparse matrix class using `Rcpp::NumericVector` and `Rcpp::IntegerVector` that uses them similarly as references rather than actual deep-copy of each element?
31+
32+
### Structure of a `dgCMatrix`
33+
34+
We will focus on the `dgCMatrix` type, the most common form of compressed-sparse-column (CSC) matrix in the [Matrix](http://cloud.r-project.org/package=Matrix) package.
35+
The `dgCMatrix` class is an S4 object with several slots:
36+
37+
```{r}
38+
library(Matrix)
39+
set.seed(123)
40+
str(rsparsematrix(5, 5, 0.5), vec.len = 12)
41+
```
42+
43+
Here we have:
44+
45+
* Slot `i` is an integer vector giving the row indices of the non-zero values of the matrix
46+
* Slot `p` is an integer vector giving the index of the first non-zero value of each column in `i`
47+
* Slot `x` gives the non-zero elements of the matrix corresponding to rows in `i` and columns delineated by `p`
48+
49+
Realize that `i` and `p` are integer vectors, and `x` is a numeric vector (stored as a `double`), corresponding to `Rcpp::IntegerVector` and `Rcpp::NumericVector`.
50+
That means that we can construct a sparse matrix class in C++ using Rcpp vectors!
51+
52+
### A dgCMatrix reference class for Rcpp
53+
54+
A `dgCMatrix` is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors.
55+
That makes class construction easy:
56+
57+
```{Rcpp, eval = FALSE}
58+
#include <Rcpp.h>
59+
60+
namespace Rcpp {
61+
class dgCMatrix {
62+
public:
63+
IntegerVector i, p, Dim;
64+
NumericVector x;
65+
List Dimnames;
66+
67+
// constructor
68+
dgCMatrix(S4 mat) {
69+
i = mat.slot("i");
70+
p = mat.slot("p");
71+
x = mat.slot("x");
72+
Dim = mat.slot("Dim");
73+
Dimnames = mat.slot("Dimnames");
74+
};
75+
76+
// column iterator
77+
class col_iterator {
78+
public:
79+
int index;
80+
col_iterator(dgCMatrix& g, int ind) : parent(g) { index = ind; }
81+
bool operator!=(col_iterator x) { return index != x.index; };
82+
col_iterator& operator++(int) { ++index; return (*this); };
83+
int row() { return parent.i[index]; };
84+
int col() { return column; };
85+
double& value() { return parent.x[index]; };
86+
private:
87+
dgCMatrix& parent;
88+
int column;
89+
};
90+
col_iterator begin_col(int j) { return col_iterator(*this, p[j]); };
91+
col_iterator end_col(int j) { return col_iterator(*this, p[j + 1]); };
92+
93+
};
94+
95+
template <> dgCMatrix as(SEXP mat) { return dgCMatrix(mat); }
96+
97+
template <> SEXP wrap(const dgCMatrix& sm) {
98+
S4 s(std::string("dgCMatrix"));
99+
s.slot("i") = sm.i;
100+
s.slot("p") = sm.p;
101+
s.slot("x") = sm.x;
102+
s.slot("Dim") = sm.Dim;
103+
s.slot("Dimnames") = sm.Dimnames;
104+
return s;
105+
}
106+
}
107+
```
108+
109+
In the above code, first we create a C++ class for a `dgCMatrix` in the Rcpp namespace with public members corresponding to the slots in an R `Matrix::dgCMatrix`.
110+
Second, we add a constructor for the class that receives an S4 R object (which should be a valid `dgCMatrix` object).
111+
Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix.
112+
Finally, we use `Rcpp::as` and `Rcpp::wrap` for seamless conversion between R and C++ and back again.
113+
114+
### Using the Rcpp::dgCMatrix class
115+
116+
We can now simply pull a `dgCMatrix` into any Rcpp function thanks to our handy class methods and the magic of `Rcpp::as`.
117+
118+
```{Rcpp, eval = FALSE}
119+
//[[Rcpp::export]]
120+
Rcpp::dgCMatrix R_to_Cpp_to_R(Rcpp::dgCMatrix& mat){
121+
return mat;
122+
}
123+
```
124+
125+
And call the following from R:
126+
127+
```{R}
128+
library(Matrix)
129+
spmat <- abs(rsparsematrix(100, 100, 0.1))
130+
spmat2 <- R_to_Cpp_to_R(spmat)
131+
all.equal(spmat, spmat2)
132+
```
133+
134+
Awesome!
135+
136+
### Passing by reference versus value
137+
138+
Rcpp structures such as `NumericVector` are wrappers around the existing R objects.
139+
This means that if we modify our sparse matrix in C++, it will modify the original R object.
140+
For instance:
141+
142+
```{Rcpp, eval = FALSE}
143+
//[[Rcpp::export]]
144+
Rcpp::dgCMatrix Rcpp_square(Rcpp::dgCMatrix& mat){
145+
for (Rcpp::NumericVector::iterator i = mat.x.begin(); i != mat.x.end(); ++i)
146+
(*i) *= (*i);
147+
return mat;
148+
}
149+
```
150+
used in
151+
152+
```{R}
153+
cat("sum before squaring: ", sum(spmat))
154+
spmat2 <- Rcpp_square(spmat)
155+
cat("sum of original object after squaring: ", sum(spmat))
156+
cat("sum of assigned object after squaring: ", sum(spmat2))
157+
```
158+
159+
Notice how the original object AND the returned object are identical, yet they don't point to the same memory address (because of _copy on write_):
160+
161+
```{R}
162+
tracemem(spmat)
163+
tracemem(spmat2)
164+
```
165+
166+
You might further inspect memory addresses within the structure using `.Internal(inspect())` and indeed, you will see the memory addresses are not the same.
167+
What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator.
168+
169+
```{R, results = 'hide'}
170+
set.seed(123)
171+
spmat <- abs(rsparsematrix(100, 100, 0.1))
172+
sum_before <- sum(spmat)
173+
Rcpp_square(spmat)
174+
sum_after <- sum(spmat)
175+
```
176+
177+
```{R, echo = FALSE}
178+
cat("sum_before = ", sum_before)
179+
cat("sum_after = ", sum_after)
180+
```
181+
182+
This principle of course applies to other Rcpp classes such as `NumericVector` as well.
183+
But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only.
184+
If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen.
185+
These libraries are extremely capable and well-documented---and generally give you access to specific sparse Matrix algorithms.
186+
187+
### Sparse iterator class
188+
189+
One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator.
190+
A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class.
191+
We can use this iterator in a very Armadillo-esque manner to compute things like column sums:
192+
193+
```{Rcpp, eval = FALSE}
194+
//[[Rcpp::export]]
195+
Rcpp::NumericVector Rcpp_colSums(Rcpp::dgCMatrix& mat){
196+
int n_col = mat.p.size() - 1;
197+
Rcpp::NumericVector sums(n_col);
198+
for (int col = 0; col < n_col; col++)
199+
for (Rcpp::dgCMatrix::col_iterator it = mat.begin_col(col); it != mat.end_col(col); it++)
200+
sums[col] += it.value();
201+
return sums;
202+
}
203+
```
204+
205+
On the R end:
206+
207+
```{R}
208+
sums <- Rcpp_colSums(spmat)
209+
all.equal(Matrix::colSums(spmat), sums)
210+
```
211+
212+
Great---but is it faster???
213+
214+
```{R}
215+
library(microbenchmark)
216+
microbenchmark(Rcpp_colSums(spmat), Matrix::colSums(spmat))
217+
```
218+
219+
### Extending the class
220+
221+
There are many ways we can extend `Rcpp::dgCMatrix`.
222+
For example, we can use `const_iterator`, `row_iterator` and iterators that traverse all values in addition to col_iterator.
223+
We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc.
224+
225+
We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix
226+
class.
227+
For now, see the [Github repo for RcppSparse](https://github.com/zdebruine/RcppSparse) for the in-progress package.

0 commit comments

Comments
 (0)