Skip to content

Commit e334c55

Browse files
committed
sparse matrix class post
1 parent 60211a9 commit e334c55

File tree

2 files changed

+327
-0
lines changed

2 files changed

+327
-0
lines changed
+321
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,321 @@
1+
---
2+
title: Constructing a Sparse Matrix Class in Rcpp
3+
author: Zach DeBruine and Dirk Eddelbuettel
4+
license: GPL (>= 2)
5+
tags: sparse s4 matrix
6+
summary: Creating a lightweight sparse matrix class in Rcpp
7+
layout: post
8+
src: 2021-04-16-sparse-matrix-class.Rmd
9+
---
10+
11+
12+
13+
14+
15+
### Introduction
16+
17+
It is no secret that sparse matrix operations are faster in C++ than in R.
18+
[RcppArmadillo](https://github.com/RcppCore/RcppArmadillo) and [RcppEigen](https://github.com/RcppCore/RcppEigen) do a great job copying sparse matrices from R to C++ and back again.
19+
But note the word "copy".
20+
Every time RcppArmadillo converts an R sparse matrix to an `arma::SpMat<T>` object, it has to creates a _deep copy_ due to the difference in representation between dense matrices (usually occupying
21+
contiguous chunks of memory) and sparse matrices (which do not precisely because of _sparsity_).
22+
Similarly, every time RcppEigen converts an R sparse matrix to an `Eigen::SparseMatrix<T>` object, it also creates a deep copy.
23+
24+
The price of instantiating sparse matrix object is measurable in both time and memory.
25+
But one of the advantages of types such as `Rcpp::NumericVector` is that they simply re-use the underlying memory of the R object by wrapping around the underlying `SEXP` representation and its
26+
contiguous dense memory layout so no copy is created!
27+
Can we create a sparse matrix class using `Rcpp::NumericVector` and `Rcpp::IntegerVector` that uses them similarly as references rather than actual deep-copy of each element?
28+
29+
### Structure of a `dgCMatrix`
30+
31+
We will focus on the `dgCMatrix` type, the most common form of compressed-sparse-column (CSC) matrix in the [Matrix](http://cloud.r-project.org/package=Matrix) package.
32+
The `dgCMatrix` class is an S4 object with several slots:
33+
34+
35+
{% highlight r %}
36+
library(Matrix)
37+
set.seed(123)
38+
str(rsparsematrix(5, 5, 0.5), vec.len = 12)
39+
{% endhighlight %}
40+
41+
42+
43+
<pre class="output">
44+
Formal class 'dgCMatrix' [package &quot;Matrix&quot;] with 6 slots
45+
..@ i : int [1:12] 2 4 0 3 4 0 3 4 2 3 0 2
46+
..@ p : int [1:6] 0 2 5 8 10 12
47+
..@ Dim : int [1:2] 5 5
48+
..@ Dimnames:List of 2
49+
.. ..$ : NULL
50+
.. ..$ : NULL
51+
..@ x : num [1:12] 0.5 -1 0.83 -0.056 2.5 0.24 1.7 1.3 0.55 -1.7 -0.78 1.3
52+
..@ factors : list()
53+
</pre>
54+
55+
Here we have:
56+
57+
* Slot `i` is an integer vector giving the row indices of the non-zero values of the matrix
58+
* Slot `p` is an integer vector giving the index of the first non-zero value of each column in `i`
59+
* Slot `x` gives the non-zero elements of the matrix corresponding to rows in `i` and columns delineated by `p`
60+
61+
Realize that `i` and `p` are integer vectors, and `x` is a numeric vector (stored as a `double`), corresponding to `Rcpp::IntegerVector` and `Rcpp::NumericVector`.
62+
That means that we can construct a sparse matrix class in C++ using Rcpp vectors!
63+
64+
### A dgCMatrix reference class for Rcpp
65+
66+
A `dgCMatrix` is simply an S4 object containing integer and double vectors, and Rcpp already has implemented exporters and wrappers for S4 objects, integer vectors, and numeric vectors.
67+
That makes class construction easy:
68+
69+
70+
{% highlight cpp %}
71+
#include <Rcpp.h>
72+
73+
namespace Rcpp {
74+
class dgCMatrix {
75+
public:
76+
IntegerVector i, p, Dim;
77+
NumericVector x;
78+
List Dimnames;
79+
80+
// constructor
81+
dgCMatrix(S4 mat) {
82+
i = mat.slot("i");
83+
p = mat.slot("p");
84+
x = mat.slot("x");
85+
Dim = mat.slot("Dim");
86+
Dimnames = mat.slot("Dimnames");
87+
};
88+
89+
// column iterator
90+
class col_iterator {
91+
public:
92+
int index;
93+
col_iterator(dgCMatrix& g, int ind) : parent(g) { index = ind; }
94+
bool operator!=(col_iterator x) { return index != x.index; };
95+
col_iterator& operator++(int) { ++index; return (*this); };
96+
int row() { return parent.i[index]; };
97+
int col() { return column; };
98+
double& value() { return parent.x[index]; };
99+
private:
100+
dgCMatrix& parent;
101+
int column;
102+
};
103+
col_iterator begin_col(int j) { return col_iterator(*this, p[j]); };
104+
col_iterator end_col(int j) { return col_iterator(*this, p[j + 1]); };
105+
106+
};
107+
108+
template <> dgCMatrix as(SEXP mat) { return dgCMatrix(mat); }
109+
110+
template <> SEXP wrap(const dgCMatrix& sm) {
111+
S4 s(std::string("dgCMatrix"));
112+
s.slot("i") = sm.i;
113+
s.slot("p") = sm.p;
114+
s.slot("x") = sm.x;
115+
s.slot("Dim") = sm.Dim;
116+
s.slot("Dimnames") = sm.Dimnames;
117+
return s;
118+
}
119+
}
120+
{% endhighlight %}
121+
122+
In the above code, first we create a C++ class for a `dgCMatrix` in the Rcpp namespace with public members corresponding to the slots in an R `Matrix::dgCMatrix`.
123+
Second, we add a constructor for the class that receives an S4 R object (which should be a valid `dgCMatrix` object).
124+
Third, we add a simple forward-only sparse column iterator with read/write access for convenient traversal of non-zero elements in the matrix.
125+
Finally, we use `Rcpp::as` and `Rcpp::wrap` for seamless conversion between R and C++ and back again.
126+
127+
### Using the Rcpp::dgCMatrix class
128+
129+
We can now simply pull a `dgCMatrix` into any Rcpp function thanks to our handy class methods and the magic of `Rcpp::as`.
130+
131+
132+
{% highlight cpp %}
133+
//[[Rcpp::export]]
134+
Rcpp::dgCMatrix R_to_Cpp_to_R(Rcpp::dgCMatrix& mat){
135+
return mat;
136+
}
137+
{% endhighlight %}
138+
139+
And call the following from R:
140+
141+
142+
{% highlight r %}
143+
library(Matrix)
144+
spmat <- abs(rsparsematrix(100, 100, 0.1))
145+
spmat2 <- R_to_Cpp_to_R(spmat)
146+
all.equal(spmat, spmat2)
147+
{% endhighlight %}
148+
149+
150+
151+
<pre class="output">
152+
[1] TRUE
153+
</pre>
154+
155+
Awesome!
156+
157+
### Passing by reference versus value
158+
159+
Rcpp structures such as `NumericVector` are wrappers around the existing R objects.
160+
This means that if we modify our sparse matrix in C++, it will modify the original R object.
161+
For instance:
162+
163+
164+
{% highlight cpp %}
165+
//[[Rcpp::export]]
166+
Rcpp::dgCMatrix Rcpp_square(Rcpp::dgCMatrix& mat){
167+
for (Rcpp::NumericVector::iterator i = mat.x.begin(); i != mat.x.end(); ++i)
168+
(*i) *= (*i);
169+
return mat;
170+
}
171+
{% endhighlight %}
172+
used in
173+
174+
175+
{% highlight r %}
176+
cat("sum before squaring: ", sum(spmat))
177+
{% endhighlight %}
178+
179+
180+
181+
<pre class="output">
182+
sum before squaring: 801.059
183+
</pre>
184+
185+
186+
187+
{% highlight r %}
188+
spmat2 <- Rcpp_square(spmat)
189+
cat("sum of original object after squaring: ", sum(spmat))
190+
{% endhighlight %}
191+
192+
193+
194+
<pre class="output">
195+
sum of original object after squaring: 1007.97
196+
</pre>
197+
198+
199+
200+
{% highlight r %}
201+
cat("sum of assigned object after squaring: ", sum(spmat2))
202+
{% endhighlight %}
203+
204+
205+
206+
<pre class="output">
207+
sum of assigned object after squaring: 1007.97
208+
</pre>
209+
210+
Notice how the original object AND the returned object are identical, yet they don't point to the same memory address (because of _copy on write_):
211+
212+
213+
{% highlight r %}
214+
tracemem(spmat)
215+
{% endhighlight %}
216+
217+
218+
219+
<pre class="output">
220+
[1] &quot;&lt;0x55f9609a3ae0&gt;&quot;
221+
</pre>
222+
223+
224+
225+
{% highlight r %}
226+
tracemem(spmat2)
227+
{% endhighlight %}
228+
229+
230+
231+
<pre class="output">
232+
[1] &quot;&lt;0x55f960f449f0&gt;&quot;
233+
</pre>
234+
235+
You might further inspect memory addresses within the structure using `.Internal(inspect())` and indeed, you will see the memory addresses are not the same.
236+
What this all means is that we can simply call the function in R and modify the object implicitly without an assignment operator.
237+
238+
239+
{% highlight r %}
240+
set.seed(123)
241+
spmat <- abs(rsparsematrix(100, 100, 0.1))
242+
sum_before <- sum(spmat)
243+
Rcpp_square(spmat)
244+
sum_after <- sum(spmat)
245+
{% endhighlight %}
246+
247+
248+
<pre class="output">
249+
sum_before = 793.861
250+
</pre>
251+
252+
253+
254+
<pre class="output">
255+
sum_after = 970.174
256+
</pre>
257+
258+
This principle of course applies to other Rcpp classes such as `NumericVector` as well.
259+
But especially when working with very large sparse matrices, it is useful to avoid deep copies and pass by reference only.
260+
If you do need to operate on a copy of your matrix in C++, there is no reason to be using an Rcpp container when you can be using RcppArmadillo or RcppEigen.
261+
These libraries are extremely capable and well-documented---and generally give you access to specific sparse Matrix algorithms.
262+
263+
### Sparse iterator class
264+
265+
One of the best ways to read and/or write non-zero values in a sparse matrix is with an iterator.
266+
A basic column forward iterator with read/write access was presented in the declaration of our sparse matrix class.
267+
We can use this iterator in a very Armadillo-esque manner to compute things like column sums:
268+
269+
270+
{% highlight cpp %}
271+
//[[Rcpp::export]]
272+
Rcpp::NumericVector Rcpp_colSums(Rcpp::dgCMatrix& mat){
273+
int n_col = mat.p.size() - 1;
274+
Rcpp::NumericVector sums(n_col);
275+
for (int col = 0; col < n_col; col++)
276+
for (Rcpp::dgCMatrix::col_iterator it = mat.begin_col(col); it != mat.end_col(col); it++)
277+
sums[col] += it.value();
278+
return sums;
279+
}
280+
{% endhighlight %}
281+
282+
On the R end:
283+
284+
285+
{% highlight r %}
286+
sums <- Rcpp_colSums(spmat)
287+
all.equal(Matrix::colSums(spmat), sums)
288+
{% endhighlight %}
289+
290+
291+
292+
<pre class="output">
293+
[1] TRUE
294+
</pre>
295+
296+
Great---but is it faster???
297+
298+
299+
{% highlight r %}
300+
library(microbenchmark)
301+
microbenchmark(Rcpp_colSums(spmat), Matrix::colSums(spmat))
302+
{% endhighlight %}
303+
304+
305+
306+
<pre class="output">
307+
Unit: microseconds
308+
expr min lq mean median uq max neval cld
309+
Rcpp_colSums(spmat) 2.510 2.7885 3.32225 3.1355 3.2685 21.842 100 a
310+
Matrix::colSums(spmat) 11.068 11.5380 13.26691 11.7775 12.0525 79.324 100 b
311+
</pre>
312+
313+
### Extending the class
314+
315+
There are many ways we can extend `Rcpp::dgCMatrix`.
316+
For example, we can use `const_iterator`, `row_iterator` and iterators that traverse all values in addition to col_iterator.
317+
We can also add support for subview copies, dense copies of columns and rows, basic element-wise operations, cross-product, etc.
318+
319+
We have implemented and documented these methods, and hopefully there will be a Rcpp-extending CRAN package in the not-too-distant future that allows you to seamlessly interface with the dgCMatrix
320+
class.
321+
For now, see the [Github repo for RcppSparse](https://github.com/zdebruine/RcppSparse) for the in-progress package.

tags/s4/index.html

+6
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
---
2+
layout: tag
3+
title: s4
4+
---
5+
6+
{% include tag_page.html %}

0 commit comments

Comments
 (0)