A New Serialization Mechanism For R

Luke Tierney
Department of Statistics and Actuarial Science
University of Iowa

This note describes a new saved workspace format and serialization mechanism that has been added to R. This format is now the default save format in the development version of R. The save function now takes a version argument; specifying version = 2 causes the new format to be used and version = 1 requests the format used through 1.3.1.

A PostScript version of this document is also available.


The main reason a new format is needed is to support name spaces. While the details of name spaces are not yet complete, several requirements already seem clear at this point.

The save mechanism needs to be modified to support this handling of name spaces.

Since a revision of the save format seems necessary to support name spaces, there are some other issues that can be addressed at the same time.

The new serialization approach attempts to address some of these issues. In looking at the new code there are two major questions that need to be answered:

  1. Is the serialization format adequate to do the things we need, at least in the short term?
  2. Is the code that uses the format adequate?
Resolving the first question is more important than the second. For example, exactly how to restore a name space or a package based on the information provided can be resolved later as long as the format allows sufficient information to be stored.

To take full advantage of the options offered it would be useful to have a way for users to customize the action of save.image and the way a work space is saved on exit. This would allow users to request, for example, that their work space be compressed or perhaps that it be stored in a data base. This has not been addressed yet.


R Interface

The only direct R interface provided in the core distribution for now is through the save function: calling it with version=2 produces a work space in the new format.

An experimental interface is available as a package serialize with two functions, serialize and unserialize. The usage is

<serialize package usage>=
serialize(object, connection, ascii = FALSE, refhook = NULL)
unserialize(connection, refhook = NULL)
Defines serialize, unserialize (links are to index).

The connection argument to serialize can be an open connection or NULL. If it is NULL, then object is serialized to a string and that string is returned as the value of serialize. Otherwise object is serialized to the connection and NULL is returned. The connection argument to unserialize can be an open connection or a string.

This interface may need some adjustment. It would be nice if we could use only proper connections, but the current version of text connections, in particular text output connections, doesn't seem quite adequate.

In terms of this interface,

<R session>= [D->]
save(list=mylist, file="myRData", version = 2)

is roughly equivalent to

<save equivalent using serialize package>=
conn <- file("myRData", "wb")
writeChar("RDX2\n", conn, eos=NULL)
vals <- lapply(mylist, get)  # mylist: character vector of variable names, as for save
names(vals) <- mylist
serialize(vals, conn)
close(conn)

and loading this work space with load corresponds to

<load equivalent using serialize package>=
conn <- file("myRData", "rb")
if (readChar(conn, 5) == "RDX2\n") {
    val <- unserialize(conn)
    names <- names(val)
    for (i in seq(along=names))
        assign(names[i], val[[i]], envir = envir)  # envir: target environment
}
close(conn)

The refhook functions can be used to customize the handling of non-system reference objects (all external pointers, all weak references, and all environments except .GlobalEnv, name spaces, and package environments). If serialize is provided with a refhook function, then that function is called with each reference object before that object is written. If refhook returns NULL, then the object is written in the usual way. Otherwise, refhook must return a character vector, and the character strings are saved (only the strings; no names or other attributes). If a serialization contains a value produced by a refhook, then it must be unserialized with a corresponding refhook. The unserialize hook is called with the reconstructed character vector as its argument, and should return whatever object the character vector indicates. Some examples of using this mechanism are given in Section [->].

C Interface

The pstream portion of names used in this interface is meant to be short for persistent stream.

The C interface uses structures to represent input and output serialization streams.

<serialization stream declarations>= [D->]
typedef struct R_outpstream_st {
    /* fields omitted; private to the serialization code */
} *R_outpstream_t;
typedef struct R_inpstream_st {
    /* fields omitted; private to the serialization code */
} *R_inpstream_t;
Defines R_inpstream_st, R_inpstream_t, R_outpstream_st, R_outpstream_t (links are to index).

The structure declarations are in Rinternals.h. User code must provide storage for the structures, but they should not be accessed directly, as their contents might change. Instead, a structure should be initialized by one of the initialization functions.

Two sets of higher level initialization functions are provided. One allows writing to an open FILE * pointer; this is used by the internal save code:

<serialization stream declarations>+= [<-D->]
void R_InitFileInPStream(R_inpstream_t stream, FILE *fp,
                         R_pstream_format_t type,
                         SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitFileOutPStream(R_outpstream_t stream, FILE *fp,
                          R_pstream_format_t type, int version,
                          SEXP (*phook)(SEXP, SEXP), SEXP pdata);
Defines R_InitFileInPStream, R_InitFileOutPStream (links are to index).

The format types are

<serialization stream declarations>+= [<-D->]
typedef enum {
    R_pstream_any_format,
    R_pstream_ascii_format,
    R_pstream_binary_format,
    R_pstream_xdr_format
} R_pstream_format_t;
Defines R_pstream_any_format, R_pstream_ascii_format, R_pstream_binary_format, R_pstream_format_t, R_pstream_xdr_format (links are to index).

An explicit format must be provided for output. For input, R_pstream_any_format can be used to indicate that any format is acceptable; if an explicit format is provided on input then an error is raised if the actual stream format does not match the specification.

The second higher level interface is for connections:

<serialization stream declarations>+= [<-D->]
void R_InitConnOutPStream(R_outpstream_t stream, Rconnection con,
                          R_pstream_format_t type, int version,
                          SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitConnInPStream(R_inpstream_t stream,  Rconnection con,
                         R_pstream_format_t type,
                         SEXP (*phook)(SEXP, SEXP), SEXP pdata);
Defines R_InitConnInPStream, R_InitConnOutPStream (links are to index).

The connection must be open in the appropriate direction, and must be binary for binary streams.

The hook mechanism is analogous to the refhook mechanism in the R interface. For output, phook is called with the reference object and pdata as its arguments, and should return R_NilValue or an STRSXP of length at least one. For input, an STRSXP is supplied and the appropriate object should be returned.

A lower level initialization mechanism can be used to build higher level ones. The lower level mechanism requires two routines, one for handling single characters, used mostly for ascii streams and for writing header information, and one for handling blocks of bytes.

<serialization stream declarations>+= [<-D->]
void R_InitInPStream(R_inpstream_t stream, R_pstream_data_t data,
                     R_pstream_format_t type,
                     int (*inchar)(R_inpstream_t),
                     void (*inbytes)(R_inpstream_t, void *, int),
                     SEXP (*phook)(SEXP, SEXP), SEXP pdata);
void R_InitOutPStream(R_outpstream_t stream, R_pstream_data_t data,
                      R_pstream_format_t type, int version,
                      void (*outchar)(R_outpstream_t, int),
                      void (*outbytes)(R_outpstream_t, void *, int),
                      SEXP (*phook)(SEXP, SEXP), SEXP pdata);
Defines R_InitInPStream, R_InitOutPStream (links are to index).

Once a stream is initialized, it is read and written by

<serialization stream declarations>+= [<-D]
void R_Serialize(SEXP s, R_outpstream_t ops);
SEXP R_Unserialize(R_inpstream_t ips);
Defines R_Serialize, R_Unserialize (links are to index).



Some of these examples use an assert function that can be defined as

<assert function>=
assert <- function(expr)
    if (! expr)
        stop(paste("assertion failed:", deparse(substitute(expr))))
Defines assert (links are to index).

Serialization to Files

The following writes an object to a file and reads it back in:

<file examples>= [D->]
x <- 1:10                       # any test object will do
f <- file("sertmp", open="wb")
serialize(x, f)
close(f)
f <- file("sertmp", open="rb")
y <- unserialize(f)
close(f)
assert(identical(x, y))

By using the right magic number header, we can create a saved work space that load can read:

<file examples>+= [<-D->]
f <- file("sertmp", open="wb")
writeChar("RDX2\n", f, eos=NULL)
serialize(list(x=1, y=2), f)    # the named list becomes the saved variables
close(f)
load("sertmp")
assert(x == 1 && y == 2)

Similarly, we can read a file written by save with version = 2:

<file examples>+= [<-D]
save("x", file="sertmp", version=2)
f <- file("sertmp", "rb")
readChar(f, 5)                  # skip the "RDX2\n" header
y <- unserialize(f)
close(f)
assert(identical(x, y$x))

Serialization to Strings

When called with NULL for the connection argument, serialize serializes to a string. The string is likely to contain embedded null characters unless ascii=TRUE is specified. unserialize can handle this properly, but since other aspects of R cannot, it might be worth considering an alternate form of return value for binary serializations to memory.

<string examples>= [D->]
x <- list(1, "a", TRUE)         # any test object will do
y <- unserialize(serialize(x, NULL))
assert(identical(x, y))

Sharing of environments is preserved within a serialization, but identity is not preserved by serialization:

<string examples>+= [<-D->]
e1 <- new.env()
e2 <- new.env()
y <- unserialize(serialize(list(e1, e1, e2), NULL))
assert(identical(y[[1]], y[[2]]))  # sharing within the serialization survives
assert(! identical(y[[1]], e1))    # identity with the original does not

We can use the refhook mechanism to attempt to preserve identity as well (but just in this artificial setting where we are saving from and loading into the same process---in general this is of course impossible):

<string examples>+= [<-D]
outhook <- function(e) {
    if (identical(e, e1)) "e1"
    else if (identical(e, e2)) "e2"
    else NULL
}
inhook <- function(n) get(n)
y <- unserialize(serialize(list(e1, e2), NULL, refhook = outhook),
                 refhook = inhook)
assert(identical(y[[1]], e1) && identical(y[[2]], e2))

A GDBM Interface

For the remaining examples we need a very simple (minded) interface to the GNU dbm library. This library implements a simple key/value data base. Unlike the original dbm library, GNU dbm does not limit the size of keys or values.

The interface is quite inefficient since it uses a fresh connection for each operation, but this should be adequate for simple illustrative purposes. The interface, available as the gdbm package, is:

<gdbm interface usage>=
gdbmNew(name)
gdbmInsert(name, key, value)
gdbmFetch(name, key)
gdbmExists(name, key)
gdbmDelete(name, key)
gdbmList(name)
Defines gdbmDelete, gdbmExists, gdbmFetch, gdbmInsert, gdbmList, gdbmNew (links are to index).

Some examples:

<gdbm examples>=
# (illustrative use of the interface above)
gdbmNew("testdb")
gdbmInsert("testdb", "fred", serialize(1:10, NULL))
assert(gdbmExists("testdb", "fred"))
assert(identical(unserialize(gdbmFetch("testdb", "fred")), 1:10))
gdbmDelete("testdb", "fred")
assert(! gdbmExists("testdb", "fred"))

Storing Base in a GDBM Data Base

Currently R reads the whole base library on startup. This takes time and uses a significant amount of memory. For embedded applications that might use just a small fraction of R in a given session, it might be useful to avoid this overhead by storing at least the functions in the base environment in a data base and arranging for the code that finds symbol values to search the data base if a definition is not in memory. In low memory settings, purging functions in base would also be an option.

Implementing this idea would require surgery on envir.c. But we can partially simulate it by replacing closures in base by promises that load the closure from a data base. To measure the effect, we start with a regular R session and look at the memory usage:

<R session>+= [<-D->]
> gc()
         used (Mb) gc trigger (Mb)
Ncells 196188  5.3     407500 10.9
Vcells  37757  0.3     786432  6.0

To start, we need to store the closures in base in a data base. Since this simple approach cannot deal with shared environments, only the closures with .BaseNamespaceEnv as their environment are stored.

<storing base in a gdbm data base>= [D->]
# create the data base
gdbmNew("base")
# fill it in
for (name in ls(NULL, all=TRUE)) {
    val <- get(name, env=NULL)
    if (typeof(val) == "closure" &&
        identical(environment(val), .BaseNamespaceEnv))
        gdbmInsert("base", name, serialize(val, NULL))
}
# check it
for (name in ls(NULL, all=TRUE)) {
    val <- get(name, env=NULL)
    if (typeof(val) == "closure" &&
        identical(environment(val), .BaseNamespaceEnv)) {
        if (! gdbmExists("base", name) ||
            ! identical(val, unserialize(gdbmFetch("base", name))))
            stop(paste("storing failed for", name))
    }
}

Now we can replace all closures in base by promises that load them as needed:

<storing base in a gdbm data base>+= [<-D]
wrap <- function(name) {
    name <- name # need to force evaluation!
    delay(unserialize(gdbmFetch("base", name)), env=environment())
}
for (i in gdbmList("base")) assign(i, wrap(i), env=NULL)

To see the effect, we can again run a gc:

<R session>+= [<-D->]
> gc()
        used (Mb) gc trigger (Mb)
Ncells 41995  1.2     350000  9.4
Vcells 18345  0.2     786432  6.0

Memory usage has dropped from 5.6Mb to 1.4Mb. The promises do take up some space, but even that could be eliminated by making a modification in envir.c.

Using a data base for persistent storage of R code seems like an idea worth exploring in more depth. GDBM is one option for the data base. GDBM ports are available for almost all, if not all, platforms on which R runs, including Windows and classic Mac OS, so this may be a good default choice. Other data bases may work just as well and may in some cases be more suitable, so allowing a mechanism for choosing the data base is probably a good idea.

Preserving Shared Environments in GDBM Serialization

Almost all the closures in base have .BaseNamespaceEnv as their environment, but a few do not. The simple approach of the previous section would fail if two closures shared a non-global environment, since separate serializations would not preserve that sharing. The refhook mechanism can be used to overcome this problem. This section provides a simple illustration of how this can be done. The code presented here is available as package shelf. The name is taken from a similar facility available in Python (though Python's facility does not try to preserve sharing across entries, just within entries, if I understand it correctly).

Some Utilities

To start off we need a few utilities. The first takes an environment and returns a named list of its name/value bindings:

<shelf utilities>= (U->) [D->]
envlist <- function(e) {
    names <- ls(e, all=TRUE)
    list <- lapply(names, get, env=e, inherits=FALSE)
    names(list) <- names
    list
}
Defines envlist (links are to index).

The second is sort of an inverse---given a named list and an environment, it adds the contents of the list to the environment.

<shelf utilities>+= (U->) [<-D->]
listIntoEnv <- function(list, e) {
    names <- names(list)
    for (i in seq(along = names))
        assign(names[i], list[[i]], env = e)
}
Defines listIntoEnv (links are to index).

Next, we need a means of creating unique names for a set of environments---a little data base of sorts. Given an environment, we need to be able to ask for the name of the environment if it is already in the data base. If it is not, we need to insert it and generate a name for it. The generated names are of the form env::<index>. (The getenv function is not actually needed.)

<shelf utilities>+= (U->) [<-D->]
envtable <- function() {
    idx <- 0
    envs <- NULL
    enames <- character(0)
    find <- function(v, keys, vals) {
        for (i in seq(along=keys))
            if (identical(v, keys[[i]]))
                return(vals[[i]])
        NULL
    }
    getname <- function(e) find(e, envs, enames)
    getenv <- function(n) find(n, enames, envs)
    insert <- function(e) {
        idx <<- idx + 1
        name <- paste("env", idx, sep="::")
        envs <<- c(e, envs)
        enames <<- c(name, enames)
        name
    }
    list(insert = insert, getenv = getenv, getname = getname)
}
Defines envtable (links are to index).

Some examples:

<R session>+= [<-D->]
> et<-envtable()
> e<-new.env()
> et$getname(e)
> et$insert(e)
[1] "env::1"
> et$getname(e)
[1] "env::1"

Creating Back End Connections

The mechanism described here can in principle be used with different back end key/value data bases. The default will be a GDBM data base, but other options are possible. For a GDBM data base we need a writer connection that provides an insert method,

<shelf utilities>+= (U->) [<-D->]
makeGdbmWriter <- function(file) {
    list(insert = function(name, value) gdbmInsert(file, name, value))
}
Defines makeGdbmWriter (links are to index).

and a reader connection that provides list, fetch, and exists methods:

<shelf utilities>+= (U->) [<-D]
makeGdbmReader <- function(file) {
    list(list = function() gdbmList(file),
         fetch = function(name) gdbmFetch(file, name),
         exists = function(name) gdbmExists(file, name))
}
Defines makeGdbmReader (links are to index).

Creating and Opening a Shelf

Now we can create a data base from a named list of values. The result is called a shelf, a term used by Python for a similar facility. The shelf stores its data in a gdbm data base but with slightly mangled keys. A variable x is stored under the key var::x, and environments referenced by values are stored under the generated environment keys with an env:: prefix.

<creating a shelf>= (U->)
makeShelfWriter <- function(db, ascii = TRUE) {
    if (is.character(db))
        db <- makeGdbmWriter(db)
    table <- envtable()
    ser <- function(val)
        serialize(val, NULL, ascii = ascii, refhook = envhook)
    envhook <- function(e) {
        if (is.environment(e)) {
            name <- table$getname(e)
            if (is.null(name)) {
                name <- table$insert(e)
                data <- list(bindings = envlist(e),
                             enclos = parent.env(e))
                db$insert(name, ser(data))
            }
            name
        }
        else NULL
    }
    putvar <- function(name, val, prefix="var", sep="::") {
        key <- paste(prefix, name, sep=sep)
        db$insert(key, ser(val))
    }
    list(putvar = putvar)
}

makeShelf <- function(list, db, ascii = TRUE) {
    names <- names(list)
    if (length(names) != length(list))
        stop("must provide a named list of values")
    s <- makeShelfWriter(db, ascii)
    for (i in seq(along=names))
        s$putvar(names[i], list[[i]])
}
Defines makeShelf, makeShelfWriter (links are to index).

To get a listing of the variables in a shelf, we can use

<listing a shelf>= (U->)
listShelf <- function(db) {
    if (is.character(db))
        db <- makeGdbmReader(db)
    prefpat <- "^var::"
    sub(prefpat, "", grep(prefpat, db$list(), value=TRUE))
}
Defines listShelf (links are to index).

We need a way of retrieving values from a shelf that ensures that sharing of environments is maintained for values retrieved separately. We do need some connection between retrievals to allow this, and that connection is provided by a shelf connection object, which is created and returned by openShelf. Sharing is preserved for values retrieved from the same shelf connection.

<opening shelf>= (U->)
openShelf <- function(db) {
    if (is.character(db))
        db <- makeGdbmReader(db)
    envenv <- new.env(hash = TRUE)
    varkey <- function(name) paste("var", name, sep="::")
    fetch <- function(name)
        unserialize(db$fetch(name), refhook = envhook)
    envhook <- function(n) {
        if (exists(n, env = envenv, inherits = FALSE))
            get(n, env = envenv, inherits = FALSE)
        else {
            e <- new.env(hash = TRUE)
            assign(n, e, env = envenv) # MUST do this immediately
            data <- fetch(n)
            parent.env(e) <- data$enclos
            listIntoEnv(data$bindings, e)
            e
        }
    }
    list(getvar = function(name) fetch(varkey(name)),
         exists = function(name) db$exists(varkey(name)),
         list = function() listShelf(db))
}
Defines openShelf (links are to index).

As a side note, having parent.env<- available at the R level seems like a really bad idea because of the potential for real serious mischief, like clobbering the search list and totally confusing the internal global cache mechanism. But the facility it provides is essential in this case since the environment must be created and registered before its contents and parent are unserialized so that circular references to the environment are handled properly. Currently with parent.env<- available this can be handled in pure R code. But unless we find a good way of preventing inadvertent use, it would probably be good to get rid of this at the R level, and thus require a little bit of C code to implement this stuff.

Here are some examples. First create some environments with some ordinary variables and some references to each other, and place these in a shelf:

<simple shelf example>=
e1 <- new.env()
e2 <- new.env(parent = e1)
listIntoEnv(list(x=2, y=3, v=3, ee=e2), e1)
listIntoEnv(list(x=e1, y=e2), e2)
makeShelf(list(x=e1, y=e1, z=3), "mydb")

The listShelf and gdbmList functions show the variables in the shelf and the actual keys in the data base, respectively:

<R session>+= [<-D->]
> listShelf("mydb")
[1] "z" "y" "x"
> gdbmList("mydb")
[1] "env::2" "env::1" "var::z" "var::y" "var::x"

Now we can open the shelf and examine its contents. The pointer values in the environments show that sharing is being handled properly.

<R session>+= [<-D]
> s<-openShelf("mydb")
> s$list()
[1] "z" "y" "x"
> s$getvar("x")
<environment: 0x8f89ba4>
> s$getvar("y")
<environment: 0x8f89ba4>
> ls(s$getvar("x"))
[1] "ee" "v"  "x"  "y" 
> get("ee", s$getvar("x"))
<environment: 0x8f89048>
> parent.env(get("ee",s$getvar("y")))
<environment: 0x8f89ba4>

Lazily Loading and Attaching a Shelf

Finally, we can combine the promise idea used earlier for closures in base to produce lazy load and attach functions for a shelf. The loadShelf function adds bindings for all variables in a shelf to the specified environment. The bindings are promises that load the values on demand using a common connection created when the shelf is loaded.

<loading a shelf>= (U->)
loadShelf <- function(db, envir = parent.frame()) {
    s <- openShelf(db)
    wrap <- function(name) {
        name <- name # need to force evaluation!
        delay(s$getvar(name), env=environment())
    }
    for (n in listShelf(db))
        assign(n, wrap(n), envir = envir)
}
Defines loadShelf (links are to index).

attachShelf creates an environment on the search list and fills it using loadShelf.

<attaching a shelf>= (U->)
attachShelf <- function(db, pos = 2, name) {
    if (missing(name)) {
        if (is.character(db))
            name <- paste("shelf", db, sep = ":")
        else name <- "shelf:<no name>"
    }
    env <- attach(NULL, pos = pos, name = name)
    loadShelf(db, env)
}
Defines attachShelf (links are to index).


The shelf system presented here creates the shelf and then provides only read access. This is useful for code libraries---it could for example be used for packages that are installed with the --save option.

This simple shelf system does not allow values to be deleted, or new or changed values to be inserted. Something along these lines would of course be useful. It does, however, raise a number of complicating issues that would need to be addressed. Some are quite standard, such as managing coherency for different shelf connections within a given process or even from separate processes. Others are more specific to this approach, such as garbage collection: if variables are deleted, then there may be environment entries that are no longer needed, and some mechanism for removing them would be required.

The Shelf Code

<shelf utilities>
<creating a shelf>
<listing a shelf>
<opening shelf>
<loading a shelf>
<attaching a shelf>

Distributed Scope

Serialization can be used with sockets or with the Rpvm library to allow code and data to be transferred between processes or machines for distributed computing. One issue that arises is how to handle environments. The Obliq system uses a notion of distributed scope: environments remain where they were created. This seems to provide for a nice high level model for distributed computation. It should be possible to use the refhook interface together with the active value ideas recently added to R to implement (a part of) this sort of thing, but I have not had a chance to try this yet.

The Serialization Algorithm

This section gives a bit more information on the serialization algorithm.

The algorithm uses a single pass over the node tree to be serialized. Sharing of reference objects is preserved, but sharing among other objects is ignored. The first time a reference object is encountered it is entered in a hash table; the value stored with the object is the index in the sequence of reference objects (1 for first reference object, 2 for second, etc.). When an object is seen again, i.e. it is already in the hash table, a reference marker along with the index is written out. The unserialize code does not know in advance how many reference objects it will see, so it starts with an initial array of some reasonable size and doubles it each time space runs out. Reference objects are entered as they are encountered.

This means the serialize and unserialize code needs to agree on what is a reference object. Making a non-reference object into a reference object requires a version change in the format. An alternate design would be to precede each reference object with a marker that says the next thing is a possibly shared object and needs to be entered into the reference table.

Adding new SXP types is easy, whether they are reference objects or not. The unserialize code will signal an error if it sees a type value it does not know. It is of course better to increment the serialization format number when a new SXP is added, but if that SXP is unlikely to be saved by users then it may be simpler to keep the version number and let the error handling code deal with it.

The output format for dotted pairs writes the ATTRIB value first rather than last. This allows CDRs to be processed by iterative tail calls to avoid recursion stack overflows when processing long lists. The writing code does take advantage of this, but the reading code does not. It hasn't been a big issue so far---the only case where it has come up is in saving a large unhashed environment, where saving succeeds but loading fails because the PROTECT stack overflows. With the ability to create hashed environments at the user level this is likely to be even less of an issue now. But if we do need to deal with it we can do so without a change in the serialization format---just rewrite ReadItem to pass the place to store the CDR it reads. (It's a bit of a pain to do, which is why it is being deferred until it is clearly needed.)

CHARSXPs are now handled in a way that preserves both embedded null characters and NA_STRING values.

The XDR save format now uses the in-memory XDR facility only for converting integers and doubles to a portable format.

The output format packs the type flag and other flags into a single integer. This produces more compact output for code; it has little effect on data.

Environments recognized as package or name space environments are not saved directly. Instead, a STRSXP is saved that is then used to attempt to find the package/name space when unserialized. The exact mechanism for choosing the name and finding the package/name space from the name still has to be developed, but the serialization format should be able to accommodate any reasonable mechanism.

A mechanism is provided to allow special handling of non-system reference objects (all weak references and external pointers, and all environments other than package environments, name space environments, and the global environment). The hook function consists of a function pointer and a data value. The serialization function pointer is called with the reference object and the data value as arguments. It should return R_NilValue for standard handling and an STRSXP for special handling. If an STRSXP is returned, then a special handling mark is written followed by the strings in the STRSXP (attributes are ignored). On unserializing, any specially marked entry causes a call to the hook function with the reconstructed STRSXP and data value as arguments. This should return the value to use for the reference object. A reasonable convention on how to use this mechanism is needed, but again the format should be compatible with any reasonable convention.

Eventually it may be useful to use these hooks to allow objects with a class to have a class-specific serialization mechanism. The serialization format should support this. It is trickier than in Java and other reference based languages where creation and initialization can be separated--we don't really have that option at the R level.