damodardhakad edited this page Mar 25, 2021 · 4 revisions

marshal: Saving and Loading Objects that Otherwise Cannot be Saved or Exported to Parallel Workers

Below is a project proposal for the R Project's participation in the Google Summer of Code 2021 to work on the marshal package. The application deadline for students is on April 13, 2021, but please also note that there is a "soft" deadline to submit project tests on April 6, 2021.

Last updated: 2021-02-09

Background

The ability to save R objects to file and later load them back in a future R session is an important feature in everyday data processing in R. R provides a few different functions for this, of which saveRDS() and readRDS() are the recommended ones. For example, we can create a vector of random numbers, save it to file, and quit R:

a <- rnorm(1e3)
saveRDS(a, file = "a.rds")
quit()

and then reload this vector in another R session:

a2 <- readRDS("a.rds")

Internally, R uses "serialization" to encode the R object a to a byte stream and "unserialization" to decode that byte stream into a new R object with the same representation as the original object.

The ability to serialize and unserialize objects is essential in parallel processing. For example, let's consider the following code that sorts each of the three vectors a, b, and c in list X:

X <- list(a = rnorm(1e3), b = rnorm(1e3), c = rnorm(1e3))
Y <- lapply(X, sort)

This can be parallelized using the parallel package as:

X <- list(a = rnorm(1e3), b = rnorm(1e3), c = rnorm(1e3))

## Call sort() via three parallel R workers running in the background
workers <- parallel::makeCluster(3)
Y <- parallel::parLapply(X, sort, cl = workers)

## Shut down the background workers when done
parallel::stopCluster(workers)

The parallel package takes care of sending the elements to the workers and collecting the sorted results when done. If we look under the hood, we find that element X$a is sent to the first worker as a serialized object over a socket connection. When the worker receives this byte stream, it unserializes it to reconstruct a copy of the original object. This is repeated for all elements in X and all workers. The results are sent back from the workers to the main R session in the same way.

The saveRDS(), readRDS(), and the parLapply() functions all rely on base functions serialize() and unserialize() to "transfer" objects to and from file, or to and from other R processes.
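
As a minimal sketch of this shared mechanism, the roundtrip can also be done entirely in memory by passing connection = NULL to serialize(), which then returns the byte stream as a raw vector. The variable names below are only illustrative:

```r
## Encode a numeric vector to a raw byte stream, entirely in memory
a <- rnorm(1e3)
bytes <- serialize(a, connection = NULL)

## The byte stream is a plain 'raw' vector ...
stopifnot(is.raw(bytes))

## ... that can be decoded into a new object identical to the original
a2 <- unserialize(bytes)
stopifnot(identical(a, a2))
```

Writing these bytes to a file or to a socket is what lets another R process pick the object up.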

Problem: Not all objects can be serialized as-is

As we saw above, the serialize() and unserialize() functions provide the key mechanism for passing R objects between different R processes. However, some types of R objects break if we attempt to transfer them as-is this way - they will arrive on the other end but they will be incomplete and invalid. This often happens to objects that contain elements which are specific to and only valid in the current R session. In the worst case, we might end up with an object that can crash our R session if used.

For example, the XML package provides methods for working with XML documents, e.g. reading them from file and then manipulating them in different ways:

library(XML)
file <- system.file("exampleData", "tides.xml", package = "XML")
doc <- xmlParse(file)
print(doc)
<?xml version="1.0" encoding="ISO-8859-1"?>
<datainfo>
  <origin>NOAA/NOS/CO-OPS</origin>
  ...  
</datainfo>

Now, let's save this object to file:

saveRDS(doc, file = "doc.rds")

and read this back in, either in the same R session or a new R session:

doc2 <- readRDS("doc.rds")

So far so good. However, although we still don't know it, we now actually have a broken XML object. What is worse is that the XML package does not detect this and if we try to display it as above, R will crash(*):

## Try to display the new object
print(doc2)

 *** caught segfault ***
address 0x8, cause 'memory not mapped'

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 

If we attempted to send the doc object to a parallel worker, it would cause the worker to crash in a similar manner. This is an extreme but, unfortunately, not a unique example of how some types of R objects cannot be serialized.

Solution: Marshalling of problematic objects

So, is that it? Should we just accept that XML objects cannot be saved as-is to file or exported to parallel workers? No. Fortunately, it turns out we can do more. We can encode a non-serializable object before serializing it such that it can be reconstructed later when unserialized. If we inspect the problematic XML object:

> str(doc)
Classes 'XMLInternalDocument', 'XMLAbstractDocument' <externalptr>

we see that it is an "external-pointer" object, which is a so-called "reference" object. It is this internal reference that becomes invalid after serializing the XML object.

To our rescue come the base R functions serialize() and unserialize(). They (i) detect when there is a reference object in the stream and (ii) call an optional reference-hook function that can be used to deconstruct and reconstruct the original object on the fly. The reference-hook functions are specified via the argument refhook, which is also exposed by saveRDS() and readRDS(). What remains is to write custom refhook functions that can handle these problematic reference objects. We often refer to these steps as "marshalling" and "unmarshalling" of objects (Marshalling, Wikipedia, February 2021).
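
To make the hook mechanism concrete, here is a minimal base-R sketch. It uses an environment (also a reference object) tagged with a made-up class "toy_counter" as a stand-in for a problematic object. The class and the names toy_serialize_refhook() and toy_unserialize_refhook() are hypothetical and not part of any package. The serialize hook returns a character vector to take over the encoding, or NULL to let R serialize the object the standard way; the unserialize hook receives that character vector and rebuilds the object:

```r
## A toy reference object: an environment tagged with a hypothetical class
counter <- new.env()
counter$count <- 42L
class(counter) <- "toy_counter"

## Marshal: called by serialize() for each reference object it encounters.
## Return a character vector to take over, or NULL to let R handle it.
toy_serialize_refhook <- function(ref) {
  if (inherits(ref, "toy_counter")) {
    c("toy_counter", as.character(ref$count))
  } else {
    NULL
  }
}

## Unmarshal: called by unserialize() with the character vector stored above
toy_unserialize_refhook <- function(chr) {
  stopifnot(chr[1] == "toy_counter")
  obj <- new.env()
  obj$count <- as.integer(chr[2])
  class(obj) <- "toy_counter"
  obj
}

bytes <- serialize(counter, connection = NULL, refhook = toy_serialize_refhook)
counter2 <- unserialize(bytes, refhook = toy_unserialize_refhook)
stopifnot(inherits(counter2, "toy_counter"), counter2$count == 42L)
```

An environment would survive serialization without any hook; it is used here only because it lets the example run without extra packages.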

It turns out that the XML package provides "marshalling" functions that make it possible to serialize an XML object such that it can be fully reconstructed later when unserialized. For example, to save a marshalled version of an XML object to file, we can use:

saveRDS(doc, file = "doc.rds", refhook = XML::xmlSerializeHook)

This object on file can then be reconstructed by unmarshalling it while reading it into R:

doc2 <- readRDS("doc.rds", refhook = XML::xmlDeserializeHook)

This is now a valid "clone" of the original XML object and using it will no longer crash R, e.g.

> print(doc2)
<?xml version="1.0" encoding="ISO-8859-1"?>
<datainfo>
  <origin>NOAA/NOS/CO-OPS</origin>
  ...  
</datainfo>

Pretty neat, eh? Technically, we can use the same approach to export non-serializable objects to parallel R workers.

(*) It is never considered okay for a package to crash your R session, so the fact that it crashes is a bug in XML. At a minimum, the XML package should detect this and give an informative error message. However, that is not our problem and is outside the scope of this project.

Goal of this project

The goal of this project is to:

  1. identify objects in popular R packages that cannot be serialized as-is
  2. implement marshalling reference-hook functions to work around the problem, where possible
  3. if marshalling is not possible, then the reference-hook function should produce an informative error message
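
For the third goal, one possible shape of such a hook is sketched below. The class "toy_handle" and the function name strict_serialize_refhook() are hypothetical placeholders, and an environment again stands in for a session-bound reference object. The hook signals an error as soon as serialize() hands it an object it knows cannot be marshalled, so that no invalid byte stream is produced:

```r
## Hypothetical hook for a reference object that cannot be marshalled
strict_serialize_refhook <- function(ref) {
  if (inherits(ref, "toy_handle")) {
    stop("Cannot marshal object of class 'toy_handle': ",
         "it is only valid in the R session that created it", call. = FALSE)
  }
  NULL  # all other reference objects are serialized the standard way
}

## A stand-in for a session-bound object
handle <- new.env()
class(handle) <- "toy_handle"

## The error is raised during serialization, before anything is written
res <- tryCatch(
  serialize(handle, connection = NULL, refhook = strict_serialize_refhook),
  error = function(e) conditionMessage(e)
)
print(res)
```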

Outcome and impact

These marshalling reference-hook functions will provide an ability to save additional types of data and objects to file. Possibly even more important will be the ability to use these types of objects in parallel processing. The demand for parallelization in R has grown during the last few years, e.g. on the local machine or via cloud services. Without marshalling, parallel processing is limited to a subset of data pipelines.

These functions will be included in the marshal package, which will be published on CRAN, where it will reach a large, world-wide audience. If successful, these reference hooks can then be used in R tools such as the future parallelization framework and the targets workflow framework.

Tasks and challenges

The overall objective of this project and the marshal package is to provide individuals, the research community, and the industry with an awesome and reliable tool. Equally important is that you will learn new things from this project and have fun along the way!

Below is a sketched outline of tasks, challenges, and goals (some may be optional) this project covers. Although we, the mentors, have a good understanding of what needs to be accomplished, we do not know all the details. For example, just like with any other real-world project, there are knowns and unknowns, and some of the unknowns will force us to step back and reconsider the current design. This is what makes projects challenging and fun, and it is how we learn from them.

  • As a first use case, identify a popular package that creates non-serializable objects that you think could benefit from marshalling. See https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html for a set of candidates.

  • Design and implement a pair of reference hook functions for marshalling and unmarshalling one of those object types. These functions should be written such that they can be passed as refhook to base::serialize() and base::unserialize().

  • Write unit tests that perform roundtrip validation of these functions. Add them to your own fork of the marshal package.

  • Write a help page for this pair of functions with an illustrative example.

  • Validate your fork on a regular basis with devtools::check() and R CMD check --as-cran.

  • With a working proof-of-concept package for the above use case as a foundation, continue with more use cases from the same or other popular R packages.

  • Identify a few use cases where it is unlikely marshalling will work. For such cases, an informative error message should be produced to prevent the object from being serialized into an invalid object.

  • Document a few small "real-world" examples that show how code that previously could not be parallelized can now be parallelized thanks to the newly developed marshal framework. These examples are preferably written as package vignettes.

  • We often don't know in advance what is being serialized. To handle this case, we need a "super" reference-hook function that can be passed to refhook and that will inspect the reference object passed and call the appropriate reference-hook function (among the ones that have been implemented above).
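
The "super" reference hook and the roundtrip validation from the tasks above could be sketched as follows. This is one possible design under stated assumptions, not the marshal package's API: the registry, the class "toy_counter", and the names super_serialize_refhook() and super_unserialize_refhook() are all hypothetical. The serialize hook prefixes each payload with the class name so that the unserialize hook can route it back to the matching unmarshaller:

```r
## Registry: class name -> pair of hypothetical marshal/unmarshal functions
registry <- list(
  toy_counter = list(
    marshal   = function(ref) as.character(ref$count),
    unmarshal = function(payload) {
      obj <- new.env()
      obj$count <- as.integer(payload)
      class(obj) <- "toy_counter"
      obj
    }
  )
)

## "Super" hook for serialize(): dispatch on the class of the reference object
super_serialize_refhook <- function(ref) {
  for (name in names(registry)) {
    if (inherits(ref, name)) {
      ## Prefix the payload with the class name so it can be routed back
      return(c(name, registry[[name]]$marshal(ref)))
    }
  }
  NULL  # unknown reference objects are serialized the standard way
}

## "Super" hook for unserialize(): route on the stored class name
super_unserialize_refhook <- function(chr) {
  entry <- registry[[chr[1]]]
  if (is.null(entry)) stop("No unmarshaller registered for: ", chr[1])
  entry$unmarshal(chr[-1])
}

## Roundtrip check on a list that mixes ordinary and reference objects
counter <- new.env()
counter$count <- 7L
class(counter) <- "toy_counter"
x <- list(values = 1:3, counter = counter)

bytes <- serialize(x, connection = NULL, refhook = super_serialize_refhook)
x2 <- unserialize(bytes, refhook = super_unserialize_refhook)
stopifnot(identical(x2$values, 1:3), x2$counter$count == 7L)
```

Supporting a new object type then only means adding an entry to the registry; the two hooks passed to refhook stay the same.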

Please note that the above is just a sketch and it is not unusual that you, with or without our help, will discover additional goals and tasks that this project would benefit from, time permitting.

What you will learn or improve upon

Here are a few areas that you will learn or learn more about when working on this project:

  • R package development, e.g. building, installing, and checking

  • Robust code development through unit tests that have a high code coverage

  • Documentation using Roxygen and Markdown, e.g. help pages with examples and vignettes

  • Serialization of objects

  • Parallel processing in R

  • Write reproducible examples, troubleshoot, and fix bugs

  • Version control, pull requests, GitHub

All the above are valuable skills and fundamental concepts in programming and software engineering.

Don't be afraid to ask questions or propose alternative ideas/solutions - we all learn from asking questions and making mistakes - and behind every great idea there is a large number of "failed" ideas.

Mentors

  1. Henrik Bengtsson, Associate Professor, Department of Epidemiology and Biostatistics, University of California San Francisco (UCSF), [email protected]. Henrik is a member of the R Foundation and the R Consortium Infrastructure Steering Committee. He is the author of a large number of CRAN and Bioconductor packages including the future framework for parallel processing. Henrik has been a GSOC mentor ('Subsetted and parallel computations in matrixStats', GSOC 2015).

  2. ...

If you are interested in conducting this project, please see https://github.com/rstats-gsoc/gsoc2021/wiki for how to proceed. If anything is unclear about this project, drop us a message.

Skills required

In order to succeed with this project, the following skills are required or highly recommended:

  • R package development, e.g. building, testing and testing and testing

  • Experience with Git and GitHub is a major advantage, especially since it will be how you will access existing code and share your code

  • Basic understanding of what serialization of an object means

  • Basic understanding of the hook-function concept

  • Understanding the basics of R's parallel package is preferred

  • Being able to communicate in written English

  • Ability to communicate in spoken English is a preference but not a requirement

We expect that all of the implementation can be done in plain R, i.e. there should be no need for programming in C or C++.

We expect that the participating student meets deadlines and responds to mentors and the GSOC organization in a timely manner.

Skill tests

Students, please send us results (links) of the below tests. You are not required to do all tests, but doing more tests and more difficult tests makes you more likely to be selected for this project. These tests also serve as a means for you (and us) to get a feeling for whether this project is a good match for you. The more of the tests that you understand and try to attack and even solve, the more confident you can be that you will be able to deliver on this project.

  1. Easy: Git and R package testing

    • Fork the marshal repository and build the package and check it with both devtools::check() and R CMD check --as-cran.
  2. Easy: Prototyping in R

    • Write XML_serialize_refhook() and XML_unserialize_refhook() functions that are simple wrappers that call XML::xmlSerializeHook() and XML::xmlDeserializeHook() internally, such that they can be used in place of the latter two functions
    • Write a package help page with an example illustrating how to use your new functions for saving and reading XML objects to file
    • Try to make the package pass R CMD check with all OKs
  3. Medium: Package tests

    • Add package tests that validate your new pair of functions. The help example is often a good start

    • Write test cases where you call your functions in non-expected ways, e.g. with objects that your functions don't expect

    • Try to update your functions to handle such unexpected input, e.g. by producing an error, a warning, or silently returning NULL, or what you think makes the most sense

  4. Medium/Hard: Use marshal functions for parallelization

    • With the help of XML_serialize_refhook() and XML_unserialize_refhook(), try to pass an XML object doc to a parallel worker, which should call y <- as(doc, "character") on it and then return y. You may use the parallel package or the future package for this example
  5. Hard: Begin to work on the project.

Solutions of tests

Students, please post a link to your test results here.
