Adding R bindings to sassy#61
Conversation
|
Do we need 57 modified files for this? Please shrink it to something a human can understand. Do we need 3 APIs? That's just gonna make confusion and not help anyone. Going via python sounds silly. Otherwise I don't care which one you pick but please reduce it to 1. Would it make sense to have the same API as python? Anyway yes, I'm open to merging R bindings. But you'll have to write short and to the point readme with usage instructions because I have never used R. Does it need pushing something to a registry? Do we need CI for that similar to python? Side note: did you read your full PR? If not, why should i read it and what should the takeaway be? Dropping 4 tables on me without conclusion is useless. (Ie: i don't mind AI code. But please no AI text.) |
The PR was read (and its text written by me), would be irresponsible otherwise. There are no 3 APIs, but 2, the python roundabout (in process R-python interop via reticulate) is to get an interpreted code baseline and test conformance when applicable. This is common when a library does not offer R bindings. The benchmark text is rendered via code https://github.com/sounkou-bioinfo/sassy/blob/19511d76fbb7cd7baeb7fff7c72a7e36b6974c22/r/benchmarks/benchmark-r-bindings.Rmd, i just copy past it. Conclusion is using low level Rsassy approach is the best way to go if speed were the only criteria. There are other criteria described below. The R packages have README.md files generated from README.Rmd files. Other programmatically generated files are the .Rd files in man directories. These are standard literate programming practices in the R community that i agree bloat PRs. The PR includes a workflow for testing following the current setup. On the actual code surface removing generated artifacts for Rsassy.: Rsassy is the low dependency setup i am comfortable with, i am not a rust developer, the R plumbing is done in C. But i guess it is not idiomatic rust work and it definitely is not as clean as the supported PyO3 binding. On the api of the R binding, it is the same as with the python binding, with the following exceptions.
On registry, adding a line on a personal or organization r-universe repo should be enough. I publish packages like this, example here https://sounkou-bioinfo.r-universe.dev/builds, The package can be published on CRAN, but they usually have very old rust edition requirements as this worker point out here https://bsky.app/profile/yutannihilation-en.bsky.social/post/3mm4ole6nmc2q Bottom line, if the contribution is of interest, i will remove sassyRS package from this PR, remove the benchmarks and add Rsassy centered ci and and publish the package on r-universe (which include tests on multiple platforms and R versions) Hopefully this response is informative and does not sound rude or is wasting your time |
In that case I must apologize. Sorry!
I'm not fully following here. Could you explain more?
That makes sense. It's hard for me to say which API would be best. In python, I think we now expose a batch API, so that the function-call overhead is relatively small anyway.
Ah yeah ok that's just silly. If there's no R in there, only keeping the
Ah good. I think this could be simplified a bit though; no need to install rust, update system packages, and upload artefacts?
Hmmm, I'm confused why this needs C. The rust crate already exposes C bindings; why do those not work directly? I guess this file is wrapping the C-API exposed in Rust in an R-style API. If that's standard than I guess it is what it is. But if something like PyO3 for R exists that would be nicer.
ok this is crazy but ok.
Not completely following here yet but that's OK.
This sounds good! Could you make me co-owner just in case future updates are needed (when other people request it). A further thing to keep in mind is that Sassy heavily uses SIMD instructions.
This was very informative! Thanks! |
Thanks for the feedback ! I should re-open this issue then. General note, after some benchmark work around the general FFI barrier crossing, in the generic case, using the RSassy approach is likely more fruitful long term. Two main reasons for this: performing parallel work in Rust with better guarantee compare to C and taking advantage of features not exposed to the current C API. Search_Many type of API is not exposed in sassy.h. But as you said, using extendr or savvy ( a R.rs file) should eliminate the need for the manual C layer (this is a skill issue on my part now). On registry and publishing, as a trial test, the current head of my fork that only contains Rsassy approach has been publish on my r-universe https://sounkou-bioinfo.r-universe.dev/Rsassy. r-universe is the less conservative way too publish Rust using R packages compared to CRAN (which for various reason has conservative approach toward Rust e.g strict rules around cargo https://cran.r-project.org/web/packages/using_rust.html). Setting up r-universe is fairly straightforward, https://docs.r-universe.dev/publish/set-up.html, the package could be own by a R-sassy or similar GitHub org or your own personal account. The binary packages for these systems are made available
I am unsure about SIMD support on these runners ( i assume they all have at least avx2). But the users can always install from source and some adjustment to Makefiles should make sure the features are enabled when the CPU supports them. Edit: on Plateform, it is interesting to know which SIMD architectures would work on WASM. The package now works on webR environment out of the box |
ee0d7bf to
493492c
Compare
oh neat! I think it's fine if you just own it. If/once we push to CRAN I think it makes sense for me to be co-owner, but if the package is namespaced to you, that's fine.
Supporting AVX2 is not necessarily enough to activate it during compilation. Here's some notes for how we did this with python / bioconda. In
Yeah, but that's not really practical. Better to fix it properly.
cool! |
32dda2b to
999da9f
Compare
Multi-Versioning and simd support I decided to go with multiple cross-compilations and ship the DLLs with the package and load specific simd backends at runtime. This should be a useful strategy more generally. The simd backend can be set dynamically, example scalar vs avx2 benchmark here
WebR will always be compiled with simd128. The Demo page for this can be found here https://sounkou-bioinfo.github.io/sassy-webR. WebR related additions Some local scripts to test the webR setup Some additional changes for CRAN compatibility/compliance
The R package versioning will follow sassy_version-r_package-changes (with *.9000 convention for development versions) |
|
There's still >45 files. Could you drop all generated files, and then add those in a single final commit? (Right now it's >5000 lines which is too much to review, and after it's probably still a lot.) Which parts should I review and what can I skip? Did you check all (doc) comments? Should I check the API? (I guess I don't really care if it works for you?)
What's up with all the windows-specific files? Are those really needed? Are all these All this stuff in For python, we only have the In the end it might be simpler if you just put the code in a separate repo and we link it from the readme here? I don't really feel comfortable taking responsibility for this much code. |
We will squash the commit history if/when the final pull request will be made. Rsassy can be put in an independent repository to make it non invasive. It will be clearer below why some of the bloat cannot be avoided. Generic build system requirements and differencesSome notes about the generated files and the bloat compared to the python setup. Given the setup used here, i followed exactly this layout for the R package https://github.com/r-rust/hellorust. R packages have a set of minimal requirements for submission to the common repositories (CRAN, R-multiverse, Bioconductor). Manual files are mandatory in all cases, Bioconductor would require having vignettes too. Build system files are required too. For safety compliance, it is recommended that the package installation does not required network connections from external package managers (cargo here). The r-universe submission https://sounkou-bioinfo.r-universe.dev/Rsassy does not have these restrictions since it is github hosted (soon Bioconductor too since the build system is being transferred to github). It is not a regulated repository and is indeed more like pipy. On CRAN and Bioconductor, all new Package are manually reviewed by humans and cannot violate the rules of these repositories. Side note: i had a package removed from CRAN because starting from go 1.23 the go build system write things in the users go cache because of telemetry issue. In All of these cases, unlike the python case, the only thing we control in the build system are the standard build files: configure scripts, Makevars. These files are required since we are building rust libraries. The vendor.xz file is not required if the package is not submitted to CRAN. I figure even in case of a submission to CRAN, we can generate before submitting the build tarball. But it will need to be generated. FIle Manifest in the current iterationR Package ManifestThese files make the directory a package r/Rsassy/DESCRIPTION LICENCE AND AUTHORSHIP ATTRIBUTION and related toolsThis is important for attribution and mandatory when one provide bindings to non ubiquitous library r/Rsassy/tools/update-authors.R (script to generate the AUTHORS and Licence.note) Package Top Level Documentationr/Rsassy/README.md (generated from README.Rmd) CRAN RelatedVendoringr/Rsassy/src/rust/vendor.tar.xz R's APIr/Rsassy/R/search.R (User API) R's C API plumbingbloated because we are supporting multiversion and because C is verbose. Moreover we are not using extendr or savvy because of the multiversion issue and skill issue on my part. Usually these frameworks generate the *c and *h as the the R wrappers. r/Rsassy/src/.c Rust Librarythis is splitted for readability, overlly verbose because we are handrolling the FFI. Moreover it registers build features r/Rsassy/src/rust/Cargo.toml Some note related to multiversion, i may have misread something inside sassy, but since the SIMD is a crate wide feature, i could not use the function level multiversioning i used in this demo https://github.com/sounkou-bioinfo/hellorust-multiversion. Correct me if i am wrong here because we would not need to build a complicated runtime dispatch system that require multiple DLLs instead of one. Function Level Manualsr/Rsassy/man/*rd On the rd files in man folder, these are generated from comments in the R files using roxygen2 package, i can consolidated them into one file but i have to either commit that file or accept a package dependency on roxygen2 tags to build them at package install time. Help files are mandatory for R packages (otherwise on get build warnings) Mandatory build filesFor building the package, If we support windows, these are required r/Rsassy/.Rbuildignore CIThe bloat is to test the platforms that we are supporting for the package: Linux, MacOS, Windows, WebR/WASM. If the OS_type is limited to Unix in the DESCRIPTION file, the file can be trimmed. Usually it is recommended in the current setup to test on expected user platforms and spot issues. R packages are expected to be installable on the three last R versions (academic setups were installations change more slowly) |
|
Rsassy is now hosted at https://github.com/sounkou-bioinfo/Rsassy/, the scope of the PR is one file change for documentation of its existance. After more testing on larger workloads and avx512 servers, i will start the fairly tedious process of a CRAN or Bioconductor submission |
|
Thanks! Yes looks good like this :) |
Thanks ! |
|
Oh, I hadn't heard of I saw your issue sounkou-bioinfo/Rsassy#1 and am watching that repo, so looking forward to the results. |
Yeah, lot of esoteric Bioconductor Verses out there ! Only now Bioconductor is waking up to Rust for extending R, Zulip channel here https://community-bioc.zulipchat.com/#narrow/channel/518471-rust . Huge potential for modernising the classical workflow/packages. Cargo is a blessing compared to the C/C++ situation |
|
(Ah the zulip needs an invite link. Feel free to mail me one if you think it's useful for me to read it.) |



Hi,
Thanks for this great crate ! This pull request adds two distinct R bindings using the sassy create. A more idiomatic extendr based R binding and a more traditional Rust+ R's C API approach. The second approach allows us more facilities like reading bytes from R connections and greater performance (see benchmark results compared to wrapping the python library in R).
Feel free to ignore this pull request if it is out of scope. It is put as draft to see if it is of interest within the upstream repo
LLM usage: The code and documentation were written with Codex 5.5 xhigh within Pi coding Agent. We take responsibility of the output after testing and benchmark results.
Rsassy, sassyRS, and Python binding benchmark
This report compares the R-facing overhead of three bindings to
sassy:extendrprototype.sassy-rsPythonbinding called from R with
reticulate.The benchmark uses a deterministic 1 Mbp DNA text by default, not the
tiny toy string used in examples. Exact and two-mismatch copies of the
query are inserted every 100 kbp so each binding has real matches to
return.
Machine and build details
The native benchmark build is expected to be installed before rendering
with:
Input data
Benchmarked calls
Summary