Fuzzing: Consider adding R as a target to Google's OSS-Fuzz #53

@kevinushey

Google maintains a GitHub repository, OSS-Fuzz, to which open source projects can contribute build configurations and fuzz harnesses so that their code is fuzzed continuously on Google's infrastructure.

https://github.com/google/oss-fuzz

Claude-generated proposal below.


Proposal: Adding R as a Fuzz Target in OSS-Fuzz

Background

OSS-Fuzz is Google's continuous fuzzing infrastructure for open source
software. It currently supports C/C++, Go, Java, JavaScript, Python, Rust,
Swift, and Ruby, covering over 1,300 projects. R is not yet among them.

This document proposes adding R as a fuzz target, outlines the approach, and
identifies candidate attack surfaces for initial fuzz targets.

How OSS-Fuzz Works

Each project in oss-fuzz provides three files:

| File | Purpose |
| --- | --- |
| project.yaml | Metadata: language, contacts, sanitizers, fuzzing engines |
| Dockerfile | Build environment: base image, dependencies, source checkout |
| build.sh | Compilation script: builds fuzz target binaries, places them in $OUT |

The contract

The build script receives environment variables ($CC, $CXX, $CFLAGS,
$CXXFLAGS, $LIB_FUZZING_ENGINE, $SANITIZER, $OUT) and must produce
standalone executables in the $OUT directory. Each executable must export the
LLVMFuzzerTestOneInput symbol -- that is how oss-fuzz discovers fuzz targets.

After the build completes:

  1. The infrastructure scans $OUT for executables containing
    LLVMFuzzerTestOneInput.
  2. The contents of $OUT are archived and uploaded to Google Cloud Storage.
    Everything else (source, build intermediates) is discarded.
  3. ClusterFuzz downloads the archive into a fresh runner container and
    continuously fuzzes each target using the configured engines (libFuzzer,
    AFL++, honggfuzz) and sanitizers (ASan, MSan, UBSan).
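
The discovery step can be approximated locally to sanity-check build output before submitting. The sketch below searches a file for the byte string LLVMFuzzerTestOneInput. It is only an illustration: file_has_fuzz_entry is a hypothetical helper name, and the real OSS-Fuzz check lives in its infrastructure scripts and may apply further criteria.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not part of OSS-Fuzz): returns 1 if the file at
 * `path` contains the byte string "LLVMFuzzerTestOneInput", 0 if it does
 * not, and -1 if the file cannot be read or memory runs out. */
int file_has_fuzz_entry(const char *path) {
    static const char marker[] = "LLVMFuzzerTestOneInput";
    const size_t mlen = sizeof(marker) - 1;

    FILE *fp = fopen(path, "rb");
    if (!fp) return -1;

    /* Read the whole file into a growable buffer. */
    size_t cap = 1 << 16, len = 0, n;
    char *buf = malloc(cap);
    if (!buf) { fclose(fp); return -1; }
    while ((n = fread(buf + len, 1, cap - len, fp)) > 0) {
        len += n;
        if (len == cap) {
            char *tmp = realloc(buf, cap * 2);
            if (!tmp) { free(buf); fclose(fp); return -1; }
            buf = tmp;
            cap *= 2;
        }
    }
    fclose(fp);

    /* Plain byte-wise search; binaries may contain NUL bytes, so strstr
     * would not be suitable here. */
    int found = 0;
    for (size_t i = 0; i + mlen <= len; i++) {
        if (memcmp(buf + i, marker, mlen) == 0) { found = 1; break; }
    }
    free(buf);
    return found;
}
```

Running this over every file in $OUT after a local build gives a quick indication of which binaries would be picked up as fuzz targets.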

Supporting files can be placed alongside each target:

  • <target>_seed_corpus.zip -- initial inputs to bootstrap the fuzzer
  • <target>.dict -- fuzzing dictionary of interesting tokens
  • <target>.options -- runtime configuration (e.g., memory limits)
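
As an illustration of the dictionary file, each line of a libFuzzer-style .dict holds one quoted token, with backslashes and double quotes escaped and other bytes optionally written as \xNN. The sketch below formats a single token that way, which would be useful for R tokens such as `<-` or `%in%`. The function name format_dict_entry is hypothetical, and the choice to hex-escape every non-printable byte is an assumption made here for simplicity.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper: writes `token` (`len` bytes) to `out` as one quoted
 * dictionary entry, e.g. the two bytes <- become "<-" including the quotes.
 * Returns the number of characters written (excluding the terminating NUL),
 * or -1 if `out` is too small. */
int format_dict_entry(const unsigned char *token, size_t len,
                      char *out, size_t out_size) {
    size_t pos = 0;
    if (out_size < 3) return -1;          /* need at least "" plus NUL */
    out[pos++] = '"';
    for (size_t i = 0; i < len; i++) {
        unsigned char c = token[i];
        /* Worst case: 4 chars for this byte, closing quote, NUL. */
        if (pos + 6 > out_size) return -1;
        if (c == '"' || c == '\\') {
            out[pos++] = '\\';            /* escape quote and backslash */
            out[pos++] = (char)c;
        } else if (c >= 0x20 && c < 0x7f) {
            out[pos++] = (char)c;         /* printable ASCII as-is */
        } else {
            /* Everything else as a \xNN hex escape. */
            pos += (size_t)snprintf(out + pos, out_size - pos,
                                    "\\x%02X", (unsigned)c);
        }
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return (int)pos;
}
```

A small generator program could call this for each R keyword and operator and write the results, one per line, to fuzz_parse.dict.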

Precedent: How CPython and Ruby Are Fuzzed

R is a C-based interpreter, so the most relevant precedents are CPython and
Ruby -- both of which are fuzzed by compiling the interpreter from source with
sanitizers and linking C/C++ fuzz harnesses against it.

CPython (projects/cpython3/)

  • Declares language: c++ in project.yaml (for libfuzzer linkage).
  • Builds CPython from source with --with-address-sanitizer etc.
  • Fuzz harnesses live upstream in CPython's source tree
    (Modules/_xxtestfuzz/fuzzer.c). A single C file uses preprocessor macros
    (-D _Py_FUZZ_ONE -D _Py_FUZZ_<target>) to select which target to compile.
  • The build script iterates over fuzz_tests.txt, compiles one binary per
    target, and drops them into $OUT.
  • Supports ASan, MSan, UBSan, and all three fuzzing engines.

Ruby (projects/ruby/)

  • Also declares language: c++.
  • Downloads a stable Ruby release as a "baseruby" (built without sanitizers),
    then builds the target Ruby from source with --enable-static and sanitizers.
  • Fuzz harnesses are C++ files in the oss-fuzz project directory (e.g.,
    fuzz_regex.cpp, fuzz_ruby_parser.cpp), compiled and linked against
    libruby-static.a.
  • Supports ASan, UBSan, and all three engines.

Both projects require no changes to oss-fuzz infrastructure -- they use the
standard base-builder image and declare themselves as C++ projects.

Proposed Approach for R

Follow the same pattern: build R from source with sanitizers, write C fuzz
harnesses, and link them against R's static library. No oss-fuzz infrastructure
changes are required.

Project structure

projects/r/
  project.yaml
  Dockerfile
  build.sh
  fuzz_parse.c        # example fuzz target
  fuzz_serialize.c    # example fuzz target
  ...
  r.options           # shared fuzzer options (e.g., max_len, timeout)

project.yaml

homepage: "https://www.r-project.org/"
language: c++
primary_contact: "<contact>@<domain>"
auto_ccs:
  - "<cc1>@<domain>"
sanitizers:
  - address
  - undefined
fuzzing_engines:
  - libfuzzer
  - afl
  - honggfuzz
main_repo: "https://svn.r-project.org/R/trunk/"

Dockerfile (sketch)

FROM gcr.io/oss-fuzz-base/base-builder

RUN apt-get update && apt-get install -y \
    gfortran \
    libreadline-dev \
    libx11-dev \
    libxt-dev \
    libcurl4-openssl-dev \
    libbz2-dev \
    liblzma-dev \
    libpcre2-dev \
    zlib1g-dev \
    libjpeg-dev \
    libpng-dev \
    libtiff-dev \
    libcairo2-dev \
    texinfo

# Clone R source
RUN svn checkout https://svn.r-project.org/R/trunk/ $SRC/r-source \
    --non-interactive --trust-server-cert-failures=unknown-ca

COPY *.sh *.c *.options $SRC/

build.sh (sketch)

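#!/bin/bash -eu

# R runs itself during the build (e.g. to install the base packages), so
# leak reports from the freshly built binary would likely abort make.
export ASAN_OPTIONS="detect_leaks=0"

cd "$SRC/r-source"

# Configure R with sanitizers and a static libR.a.
# Assumption to verify: --enable-R-static-lib is the configure switch that
# builds libR.a; the generic --enable-static flag may not be enough.
./configure \
    --prefix="$WORK/r-install" \
    --enable-R-static-lib \
    --without-recommended-packages \
    --with-x=no \
    CC="$CC" \
    CXX="$CXX" \
    CFLAGS="$CFLAGS" \
    CXXFLAGS="$CXXFLAGS"

make -j"$(nproc)"

R_BUILD_DIR="$SRC/r-source"
INC_R="-I${R_BUILD_DIR}/include"
LIBS_R="${R_BUILD_DIR}/lib/libR.a"  # path may vary

# Build each fuzz target
for fuzzer_src in "$SRC"/fuzz_*.c; do
    fuzzer=$(basename "$fuzzer_src" .c)

    # Compile the C harness, then link with $CXX because the fuzzing
    # engine library is C++.
    $CC $CFLAGS $INC_R -c "$fuzzer_src" -o "$WORK/${fuzzer}.o"
    # The Fortran runtime (e.g. -lgfortran) will likely be needed here too.
    $CXX $CXXFLAGS $LIB_FUZZING_ENGINE "$WORK/${fuzzer}.o" \
        "$LIBS_R" -lm -lpthread -ldl -lreadline -lpcre2-8 -llzma -lbz2 -lz \
        -o "$OUT/${fuzzer}"

    # Prefer per-target options; otherwise fall back to the shared r.options.
    if [ -f "$SRC/${fuzzer}.options" ]; then
        cp "$SRC/${fuzzer}.options" "$OUT/${fuzzer}.options"
    elif [ -f "$SRC/r.options" ]; then
        cp "$SRC/r.options" "$OUT/${fuzzer}.options"
    fi
done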

Example fuzz target: fuzz_parse.c (sketch)

#include <Rinternals.h>
#include <R.h>
#include <Rembedded.h>
#include <R_ext/Parse.h>   /* R_ParseVector, ParseStatus */
#include <stdint.h>
#include <stdlib.h>        /* malloc, free */
#include <string.h>

/*
 * LLVMFuzzerInitialize is called once by the fuzzing engine before the
 * loop begins. We use it to bootstrap the embedded R session -- this is
 * expensive and only needs to happen once. R_HOME must be set in the
 * environment for the embedded session to start.
 */
int LLVMFuzzerInitialize(int *argc, char ***argv) {
    (void)argc;
    (void)argv;
    char *r_argv[] = {"R", "--no-save", "--no-restore", "--silent"};
    Rf_initEmbeddedR(4, r_argv);
    return 0;
}

/*
 * R signals errors (including parse errors) via longjmp. Without an
 * enclosing context, such a jump would leave the harness in an undefined
 * state. We wrap all R calls inside R_ToplevelExec, which sets up a
 * protected context and returns FALSE if the wrapped call raised an error.
 */
static void do_parse(void *data) {
    const char *buf = (const char *)data;
    /* Rf_mkString stops at the first NUL byte, so input after an embedded
     * NUL is not parsed. */
    SEXP str = PROTECT(Rf_mkString(buf));
    ParseStatus status;
    SEXP parsed = R_ParseVector(str, -1, &status, R_NilValue);
    UNPROTECT(1);
    (void)parsed;
}

int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    /* Null-terminate the input for R's parser */
    char *buf = (char *)malloc(size + 1);
    if (!buf) return 0;
    memcpy(buf, data, size);
    buf[size] = '\0';

    /* The return value is ignored: a parse error is not a fuzzing finding. */
    R_ToplevelExec(do_parse, (void *)buf);

    free(buf);
    return 0;
}
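
The longjmp concern described in the harness comment can be illustrated without R. The sketch below is a toy model only: toy_error and toy_protected_call are hypothetical names standing in for an R error and for R_ToplevelExec, and the code makes no claim about how R implements either. It shows the pattern the harness relies on, where a wrapper installs a jump target, runs the callback, and reports failure through its return value instead of letting the jump escape.

```c
#include <setjmp.h>
#include <stddef.h>
#include <stdlib.h>

/* Innermost active handler; NULL when no protected call is running. */
static jmp_buf *current_handler = NULL;

/* Toy stand-in for an R error: jump to the innermost handler.
 * With no handler installed there is nowhere safe to go, so abort. */
static void toy_error(void) {
    if (current_handler) longjmp(*current_handler, 1);
    abort();
}

/* Toy analogue of R_ToplevelExec: runs fn(data) and returns 1 if it
 * completed normally, 0 if it raised toy_error(). */
int toy_protected_call(void (*fn)(void *), void *data) {
    jmp_buf env;
    jmp_buf *saved = current_handler;   /* not modified after setjmp */
    int ok;
    if (setjmp(env) == 0) {
        current_handler = &env;
        fn(data);
        ok = 1;
    } else {
        ok = 0;                         /* arrived here via longjmp */
    }
    current_handler = saved;            /* restore the outer handler */
    return ok;
}

/* Two example callbacks: one succeeds, one signals an error. */
void parses_ok(void *data)   { (void)data; }
void parse_fails(void *data) { (void)data; toy_error(); }
```

In the toy, a failing callback leaves the process intact and the next call proceeds normally, which is the property a fuzz harness needs between iterations.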

Candidate Fuzz Targets

These are R's C-level entry points most likely to benefit from fuzzing:

| Target | R API | What it exercises |
| --- | --- | --- |
| Parser | R_ParseVector | R source code parsing |
| Serialization | R_Unserialize | Deserialization of .rds/.RData files |
| Regex | PCRE2 wrappers, reached by evaluating grepl()/gsub() calls | Pattern matching in grep, gsub |
| String encoding | Rf_translateCharUTF8, mkCharLenCE | Encoding conversions |
| Connection I/O | R_ReadConnection | Reading from arbitrary input streams |
| Numeric coercion | Rf_coerceVector | Type coercion (character -> numeric, etc.) |

Starting with R_ParseVector and R_Unserialize would cover two of the
highest-value surfaces: parsing untrusted R code and deserializing untrusted
binary data.
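
A serialization harness would have to hand the fuzzer's byte buffer to R's unserializer, which reads its input through callbacks registered on an input stream (R_InitInPStream in Rinternals.h takes a read-one-byte callback and a read-block callback). The sketch below shows only the bounds-checked in-memory reader that such callbacks could delegate to. The names mem_reader, mem_read_byte, and mem_read_bytes are hypothetical, and the wiring to R is deliberately left out; on a short read, the real callback would presumably raise an R error inside R_ToplevelExec.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical cursor over the fuzzer-provided input buffer. */
typedef struct {
    const uint8_t *data;  /* start of the input */
    size_t size;          /* total number of bytes */
    size_t pos;           /* next byte to read; always <= size */
} mem_reader;

/* Returns the next byte (0..255), or -1 once the input is exhausted. */
int mem_read_byte(mem_reader *r) {
    if (r->pos >= r->size) return -1;
    return r->data[r->pos++];
}

/* Copies exactly n bytes into dst and advances the cursor.
 * Returns 1 on success; returns 0 and consumes nothing if fewer than
 * n bytes remain, so the caller can report a truncated stream. */
int mem_read_bytes(mem_reader *r, void *dst, size_t n) {
    if (n > r->size - r->pos) return 0;
    memcpy(dst, r->data + r->pos, n);
    r->pos += n;
    return 1;
}
```

Keeping the reader separate from the R callbacks means the bounds checks can be unit-tested on their own, independent of an embedded R session.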

Open Questions

  1. Static linking: R's build system does not produce a static libR.a by
    default. This needs investigation -- R's configure script has an
    --enable-R-static-lib option that should be tried first; failing that, we
    may need to build and link against individual object files. The Ruby
    project faced a similar challenge and solved it with
    --with-static-linked-ext.

  2. Fortran dependencies: R's core uses Fortran (BLAS/LAPACK). We need to
    confirm that gfortran's runtime (libgfortran) works correctly under
    ASan/UBSan. If not, we may need to use a reference BLAS/LAPACK written in C.

  3. Embedded R initialization: Rf_initEmbeddedR does significant setup.
    We use LLVMFuzzerInitialize (called once by the engine before the fuzzing
    loop) to pay this cost up front. We need to ensure this is compatible with
    the sanitizer environment and doesn't produce excessive false positives.

  4. Upstream vs. oss-fuzz: Should fuzz harnesses live upstream in R's source
    tree (like CPython) or in the oss-fuzz project directory (like Ruby)? Hosting
    them upstream makes maintenance easier long-term but requires buy-in from
    R-core.

  5. longjmp error handling: R signals errors (including parse errors) via
    longjmp, which would leave the fuzzer process in an undefined state if
    uncaught. All fuzz targets must wrap R calls in R_ToplevelExec, which
    sets up a protected context and returns FALSE when the wrapped call
    raises an error, so the jump never escapes the harness. This is already
    reflected in the example harness above.

  6. MSan support: MemorySanitizer requires all linked code to be
    instrumented. Given R's Fortran dependencies, MSan may not be feasible
    initially. Of the two precedents, CPython enables MSan while Ruby is
    limited to ASan and UBSan, and R's dependency chain is more complex.

Next Steps

  1. Prototype the Dockerfile and build.sh locally using oss-fuzz's
    helper.py (python infra/helper.py build_fuzzers r).
  2. Get fuzz_parse and fuzz_serialize working under ASan + libfuzzer.
  3. Submit a PR to oss-fuzz and iterate with Google's oss-fuzz team on review.
  4. Expand fuzz targets based on initial findings.
