A Regular Expressions Package For R

Luke Tierney
School of Statistics
University of Minnesota

2009/05/28

Introduction

This is a simple regular expression package provided as an illustration of the pointer and finalization mechanism in R 1.2. It is a translation of a similar package for xlispstat, which is in turn based on the Tcl 8.0 interface. I believe more recent Tcl's have switched to a new regular expression library by Henry Spencer that provides some additional features. The code is available as a package.

Interface and Examples

Higher-Level Interface

This higher-level interface provides two public R functions, regexp and regsub, based on the Tcl functions with the same names.

The syntax for regexp is

regexp(pattern, string, ignore.case=FALSE, extended=TRUE, index.only=TRUE)
pattern is a regular expression to be matched to string. pattern and string must be character vectors of length 1. If ignore.case is TRUE, the case of characters is ignored in comparisons. If extended is TRUE, then extended regular expressions are used (the REG_EXTENDED flag is given to regcomp). If index.only is TRUE, an integer matrix with two columns, the starts and the ends of the matched substrings (C indexing) is returned. Otherwise, a character vector of the substrings is returned.

The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example 11-2], uses regexp to decompose a URL into its components:

> regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)",
+        "http://stat.umn.edu:80/xyz")
[1] "http://stat.umn.edu:80/xyz" "http://"                   
[3] "http"                       "stat.umn.edu"              
[5] ":80"                        "80"                        
[7] "/xyz"                      
Element 3 is the protocol, 4 is the host, 6 is the port, and 7 is the path. We can use this to make a function for extracting the parts of a URL:

<url.parts definition>= (U->)
url.parts <- function(url) {
    val <- regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", url)
    if (length(val) == 0)
        stop(paste("not a valid URL:", url))
    structure(val[c(3,4,6,7)], names=c("protocol","host","port","path"))
}
    
Defines url.parts (links are to index).

Using the example above,

> url.parts("http://stat.umn.edu:80/xyz")
      protocol           host           port           path 
        "http" "stat.umn.edu"           "80"         "/xyz" 

The function regsub performs substitution based on pattern matching. The syntax is

regsub(pattern, string, sub, ignore.case=FALSE, extended=TRUE, all=FALSE)
A match of pattern in string is replaced using sub. If all is TRUE, then all occurrences are replaced; otherwise, only the first is replaced. The substitution argument sub can be a string, which is used literally, or it can be a function. The function is called with the substring vector produced by the match as its argument and should return a string to use as a replacement for the match.

An example [These examples are based on the URL Decoding example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple form of regsub:

> regsub("\\+", "abc+def+ghi", " ", all=TRUE)
[1] "abc def ghi"
To illustrate the more complex form, the following example replaces %xx hexadecimal encodings in URL's with the character they represent. It uses the utility function deHex also included in this package (there may be a better way to do this with some functions already in R).
> regsub("%([0-9a-hA-H][0-9a-hA-H])", "%7ewelc%68",
+           function(s) deHex(s[2]), all=TRUE)
[1] "~welch"
These two examples can be combined into a simple URL decoding function:

<url.decode definition>= (U->)
url.decode <- function(url)
    regsub("%([0-9a-hA-H][0-9a-hA-H])", regsub("\\+", url, " ", all=TRUE),
           function(s) deHex(s[2]), all=TRUE)
Defines url.decode (links are to index).

An illustration:

> url.decode("%7ewelc%68+book")
[1] "~welch book"

Lower-Level Interface

The lower-level interface provides a closer interface to the POSIX regular expression functions. In particular regcomp can be used to compile a regular expression that is then reused repeatedly in calls to regexec. The functions return results as R objects and signal errors when appropriate rather than returning error codes.

There are six public constants to be used as flags. These are integer values that can be combined with by addition (which should correspond to a logical inclusive or for the se flag values). The four flag values for regcomp are

REG.EXTENDED Use extended regular expressions
REG.NEWLINE Special handling of newline characters
REG.NOSUB Report only success/fail in regexec
REG.ICASE Ignore case in match.
The two flags for regexec are
REG.NOTBOL First character of string is not the beginning of the line
REG.NOTEOL Last character of string is not the end of the line

The two public functions are regcomp and regexec. The syntax for regcomp is

regcomp(pattern, flags=REG.EXTENDED)
where pattern is a string containing a regular expression and flags is an integer flag constructed from the constants given above. regcomp returns a regular expression pointer or signals an error.

The syntax for regexec is

regexec(rex, string, flags=0)
where rex is a regular expression structure created by regcomp, string is a string to be matched, and flags is an integer flag.

As an example, we can compile the URL pattern used above as

> rex <- regcomp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")
and use it to match a URL with
> regexec(rex, "http://stat.umn.edu:80/xyz")
     [,1] [,2]
[1,]    0   26
[2,]    0    7
[3,]    0    4
[4,]    7   19
[5,]   19   22
[6,]   20   22
[7,]   22   26

Implementation

The implementation consists of the C file regexp.c and the R file regexp.R.

<regexp.c>=
#include "Rinternals.h"
#include <sys/types.h>
#include <regex.h>
<utility functions>
<regex_t interface>
<regmatch_t interface>
<finalization>
<error reporting>
<main functions>
<initialization>

*

<regexp.R>=
<lower-level R implementation>
<higher-level R implementation>

Higher-Level R Implementation

The higher level component consist of R code that builds on the lower level.

<higher-level R implementation>= (U->)
<compile.regex definition>
<get.substrings definition>
<regexp definition>
<regsub definition>
<url.parts definition>
<url.decode definition>

The regexp function compiles the regular expression, runs regexpr, and then turns the resulting matrix of index pairs into a vector of substrings using get.substrings.

<regexp definition>= (<-U)
regexp <- function(pat, str, index.only=FALSE, ...) {
    rex <- compile.regex(pat, ...)
    pairs <- regexec(rex, str, 0)
    if (index.only) pairs
    else get.substrings(pairs, str)        
}
Defines regexp (links are to index).

The regular expression is compiled with

<compile.regex definition>= (<-U)
compile.regex <- function(pat, extended=TRUE, ignore.case=FALSE) {
    flags <- if (extended) REG.EXTENDED else 0
    if (ignore.case) flabs <- flags + REG.ICASE
    regcomp(pat, flags)
}
Defines compile.regex (links are to index).

The function get.substrings is

<get.substrings definition>= (<-U)
get.substrings <- function(pairs, str) {
    if (is.null(pairs)) character(0)
    else apply(pairs, 1, function(i, s) substr(s, i[1] + 1, i[2]), str)
}
Defines get.substrings (links are to index).

The regsub function is defined as

<regsub definition>= (<-U)
regsub <- function(pat, str, sub, all=FALSE, ...) {
    strcat <- function(...) paste(..., sep="")
    rex <- compile.regex(pat, ...)
    head <- ""
    tail <- str
    repeat {
        val <- regexec(rex, tail, 0)
        if (is.null(val))
            return(strcat(head, tail))
        sval <- if (is.character(sub)) sub else sub(get.substrings(val, tail))
        tail.head <- substr(tail, 1, val[1,1])
        head <- strcat(head, tail.head, sval)
        tail <- substr(tail, val[1,2] + 1, nchar(tail))
        if (! all)
            return(strcat(head, tail))
    }
}
Defines regsub (links are to index).

Lower-Level Component

The lower-level component is a wrapper around the POSIX regular expression library. The approach I've taken here is to make the C level code as minimal as possible and do as much as possible in R. For most of the things called with .Call this means that the C functions and the basic wrappers around the .Call could be generated automatically from a simple interface description---this is what I do in xlispstat.

Regular Expression Data Types

There are two data types, regex_t and regmatch_t. The interfaces for these types define a tag variable, a constructor, a converter and field accessors.

The type tags are symbols that are initialized by the package initialization function.

<regex_t interface>= (<-U) [D->]
static SEXP REGEX_type_tag;
Defines REGEX_type_tag (links are to index).

<regmatch_t interface>= (<-U) [D->]
static SEXP REGMATCH_type_tag;
Defines REGMATCH_type_tag (links are to index).

<initialize type tags>= (U->)
REGEX_type_tag = install("REGEX_TYPE_TAG");
REGMATCH_type_tag = install("REGMATCH_TYPE_TAG");

The constructor for regex_t objects allocates space for a specified number of regex_t structures on the R heap and returns it wrapped as an external pointer object. The R constructor provides a wrapper around the C interface and provides a default value for the number of structures

<regex_t interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regex_t(SEXP rn)
{
    int n = sexp2int(rn, FALSE);
    if (n <= 0) return R_NilValue;
    else return R_AllocatePtr(n, sizeof(regex_t), REGEX_type_tag);
}

*

<lower-level R implementation>= (U->) [D->]
make.regex <- function(n = 1) .Call("REGEXP_make_regex_t", n)
Defines make.regex (links are to index).

The C and R constructors for regmatch_t objects are analogous.

<regmatch_t interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regmatch_t(SEXP rn)
{
    int n = sexp2int(rn, FALSE);
    if (n <= 0) return R_NilValue;
    else return R_AllocatePtr(n, sizeof(regmatch_t), REGMATCH_type_tag);
}

*

<lower-level R implementation>+= (U->) [<-D->]
make.regmatch <- function(n = 1) .Call("REGEXP_make_regmatch_t", n)
Defines make.regmatch (links are to index).

The converter functions convert from the SEXP pointer wrappers to the native C pointers, with appropriate error checking done by some utility functions. These functions are static since they will only be used internally by the C interface.

<regex_t interface>+= (<-U) [<-D->]
static regex_t *sexp2regex_t_p(SEXP s, Rboolean null_ok)
{
    return sexp2ptr(s, null_ok, REGEX_type_tag, "regex_t");
}
Defines sexp2regex_t_p (links are to index).

<regmatch_t interface>+= (<-U) [<-D->]
static regmatch_t *sexp2regmatch_t_p(SEXP s, Rboolean null_ok)
{
    return sexp2ptr(s, null_ok, REGMATCH_type_tag, "regmatch_t");
}
Defines sexp2regmatch_t_p (links are to index).

The regexp_t structure has one integer field, re_nsub. The reader interface consists of a C function and an R wrapper. An offset can be specified in case the pointer refers to data containing more than one structure; the R interface sets the default offset to zero.

<regex_t interface>+= (<-U) [<-D]
SEXP REGEXP_regex_t_re_nsub(SEXP rp, SEXP ri)
{
  regex_t *p = sexp2regex_t_p(rp, FALSE);
  return ScalarInteger(p[sexp2int(ri, FALSE)].re_nsub);
}
Defines REGEXP_regex_t_re_nsub (links are to index).

<lower-level R implementation>+= (U->) [<-D->]
regex.nsub <- function(re, i = 0) .Call("REGEXP_regex_t_re_nsub", re, i)
Defines regex.nsub (links are to index).

The regmatch_t structure has two integer fields, rm_so and rm_eo. The reader interface is analogous to the previous one.

<regmatch_t interface>+= (<-U) [<-D]
SEXP REGEXP_regmatch_t_rm_so(SEXP rp, SEXP ri)
{
  regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
  return ScalarInteger(p[sexp2int(ri, FALSE)].rm_so);
}

SEXP REGEXP_regmatch_t_rm_eo(SEXP rp, SEXP ri)
{
  regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
  return ScalarInteger(p[sexp2int(ri, FALSE)].rm_eo);
}
Defines REGEXP_regmatch_t_rm_eo, REGEXP_regmatch_t_rm_so (links are to index).

<lower-level R implementation>+= (U->) [<-D->]
regmatch.so <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_so", rm, i)
regmatch.eo <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_eo", rm, i)
Defines regmatch.eo, regmatch.so (links are to index).

Finalization

Compiled regular expressions need to have regfree called on them to release resources allocated to them by regcomp. To do this, a C finalizer is registered for the compiled regular expression. It would be possible to do this as part of the C wrapper to regcomp, but I am doing it at the R level to illustrate the possibility of automating the creation of interfaces.

<finalization>= (<-U)
static void finalize_regexp(SEXP s)
{
    regex_t *re = sexp2regex_t_p(s, TRUE);
    if (re != NULL)
        regfree(re);
}

SEXP REGEXP_register(SEXP rre)
{
    sexp2regex_t_p(rre, FALSE);
    R_RegisterCFinalizer(rre, finalize_regexp);
    return R_NilValue;
}
Defines finalize_regexp, REGEXP_register (links are to index).

Error Reporting

Errors are raised by calling REGEXP_regerror from R, which in turn calls regerror to obtain an error message. A prefix is placed before the error message to make clear it is a regular expression error.

<error reporting>= (<-U)
SEXP REGEXP_regerror(SEXP rcode, SEXP rre)
{
    char buf[512];
    int len;
    strcpy(buf, "regex error: ");
    len = strlen(buf);
    regerror(sexp2int(rcode, FALSE), sexp2regex_t_p(rre, FALSE),
             buf + len, sizeof(buf) - len);
    error(buf);
    return R_NilValue; /* not reached */
}
Defines REGEXP_regerror (links are to index).

Main Functions

The C interface to regcomp simply unpacks the arguments, calls the function, and packs up and returns its result. This code could easily be generated automatically from the function prototype.

<main functions>= (<-U) [D->]
SEXP REGEXP_regcomp(SEXP rre, SEXP pat, SEXP rflags)
{
    return ScalarInteger(regcomp(sexp2regex_t_p(rre, FALSE),
                                 sexp2char_p(pat),
                                 sexp2int(rflags, FALSE)));
}
Defines REGEXP_regcomp (links are to index).

The R implementation creates the regular expression object, compiles it, checks for errors and registers the compiled object for finalization before returning it. It might be a good idea to suspend interrupts while this is going on to insure that a compiled regular expression will have its finalizer registered (unless there is an out of memory error).

<lower-level R implementation>+= (U->) [<-D->]
regcomp <- function(pat, flags = REG.EXTENDED) {
    #**** without interrupts??
    rex <- make.regex()
    result <- .Call("REGEXP_regcomp", rex, pat, flags)
    if (result != 0)
        .Call("REGEXP_regerror", result, rex)
    .Call("REGEXP_register", rex)
    rex
}
Defines regcomp (links are to index).

The C and R interface to regexec are in the same spirit. Here there is no need for finalization though.

<main functions>+= (<-U) [<-D->]
SEXP REGEXP_regexec(SEXP rre, SEXP str, SEXP rn, SEXP rrm, SEXP rflags)
{
    return ScalarInteger(regexec(sexp2regex_t_p(rre, FALSE),
                                 sexp2char_p(str),
                                 sexp2int(rn, FALSE),
                                 sexp2regmatch_t_p(rrm, FALSE),
                                 sexp2int(rflags, FALSE)));
}
Defines REGEXP_regexec (links are to index).

<lower-level R implementation>+= (U->) [<-D->]
regexec <- function(rex, str, flags = 0) {
    nmatch <- regex.nsub(rex) + 1
    rm <- make.regmatch(nmatch)
    result <- .Call("REGEXP_regexec", rex, str, nmatch, rm, flags)
    if (result == 0) {
        val <- matrix(integer(2 * nmatch), nmatch)
        for (i in 1:nmatch) {
            val[i, 1] <- regmatch.so(rm, i - 1)
            val[i, 2] <- regmatch.eo(rm, i - 1)
        }
        val
    }
    else if (result == REG.NOMATCH)
        NULL
    else
        .Call("REGEXP_regerror", result, rex)
}
Defines regexec (links are to index).

Finally, for the url.decode example we need to be able to convert a string with a hex representation of a character into a string with the character. There is probably already some way to do this in R, but since I didn't find one here is a quick and dirty way.

<main functions>+= (<-U) [<-D]
SEXP REGEXP_deHex(SEXP s)
{
    char out[2], *str = sexp2char_p(s);
    int ch;
    if (sscanf(str, "%x", &ch) <= 0)
        error("bad hex string");
    out[0] = ch;
    out[1] = 0;
    return ScalarString(mkChar(out));
}
Defines REGEXP_deHex (links are to index).

<lower-level R implementation>+= (U->) [<-D->]
deHex <- function(str) .Call("REGEXP_deHex", str)
Defines deHex (links are to index).

Initialization

The R package initialization function .First.lib loads the library and runs the C initialization function REGEXP_init. The package environment is given to this function as an argument.

<lower-level R implementation>+= (U->) [<-D]
.First.lib <- function(lib, pkg) {
    library.dynam( "regexp", pkg, lib )
    pkgname <- paste("package", pkg, sep = ":")
    .Call("REGEXP_init", pos.to.env(match(pkgname, search())))
}

The C initialization function REGEXP_init initializes the type tags and then assigns the R variables corresponding to the C constants in the POSIX regular expression interface. The constants are assigned into the environment passed to the C initialization function by .First.lib.

<initialization>= (<-U)
SEXP REGEXP_init(SEXP env)
{
    <initialize type tags>

    defineIntVar(env, "REG.EXTENDED", REG_EXTENDED);
    defineIntVar(env, "REG.ICASE", REG_ICASE);
    defineIntVar(env, "REG.NEWLINE", REG_NEWLINE);
    defineIntVar(env, "REG.NOSUB", REG_NOSUB);
    defineIntVar(env, "REG.NOMATCH", REG_NOMATCH);

    defineIntVar(env, "REG.NOTBOL", REG_NOTBOL);
    defineIntVar(env, "REG.NOTEOL", REG_NOTEOL);
    return R_NilValue;
}
Defines REG.EXTENDED, REG.ICASE, REG.NEWLINE, REG.NOMATCH, REG.NOSUB, REG.NOTBOL, REG.NOTEOL, REGEXP_init (links are to index).

Utility Functions

The functions in this section are utility functions that would be useful in defining other R interfaces to C libraries. For now they are static functions in this library. We may eventually want to include some versions of these in the R library as support code for this kind of interface.

The R_AllocatePtr function is reproduced from the example in the simple reference implementation notes. It allocates data from the R heap and packages it in an external pointer object.

<utility functions>= (<-U) [D->]
static SEXP R_AllocatePtr(size_t nmemb, size_t size, SEXP tag)
{
    SEXP data, val;
    int bytes;
    if (INT_MAX / size < nmemb)
        error("allocation request is too large");
    bytes = nmemb * size;
    PROTECT(data = allocString(bytes));
    memset(CHAR(data), 0, bytes);
    val = R_MakeExternalPtr(CHAR(data), tag, data);
    UNPROTECT(1);
    return val;
}
Defines R_AllocatePtr (links are to index).

The function sexp2int is similar to the function asInteger in the R base code. The main difference is that sexp2int allows an error to be raised if the result would be NA. Another small difference is that complex number will not be converted to integers unless their imaginary parts are zero.

<utility functions>+= (<-U) [<-D->]
static int sexp2int(SEXP s, Rboolean na_ok)
{
    int val = NA_INTEGER;
    switch (TYPEOF(s)) {
    case LGLSXP:
        if (LENGTH(s) == 1 && LOGICAL(s)[0] != NA_LOGICAL)
            val = LOGICAL(s)[0] != 0;
        break;
    case INTSXP:
        if (LENGTH(s) == 1)
            val = INTEGER(s)[0];
        break;
    case REALSXP:
        if (LENGTH(s) == 1 && R_FINITE(REAL(s)[0]))
            val = (int) (REAL(s)[0]);
        break;
    case CPLXSXP:
        if (LENGTH(s) == 1 &&
            R_FINITE(COMPLEX(s)[0].r) && R_FINITE(COMPLEX(s)[0].i) &&
            COMPLEX(s)[0].i == 0.0)
            val = (int) (COMPLEX(s)[0].r);
        break;
    }
    if (! na_ok && val == NA_INTEGER)
        error("not a valid integer");
    return val;
}
Defines sexp2int (links are to index).

The functions sexp2char checks that its argument is a character vector of length one and returns the pointer to the string's data.

<utility functions>+= (<-U) [<-D->]
static char *sexp2char_p(SEXP s)
{
    if (TYPEOF(s) != STRSXP || length(s) != 1)
        error("argument not a string vector of length one");
    return CHAR(STRING_ELT(s, 0));
}
Defines sexp2char (links are to index).

sexp2ptr handles checking an external pointer for its type tag, optionally for a NULL value, and returning the pointer value. It might be more useful to adopt a convention where the tag contains a list of types and the beginning of the pointer object's list must match. This would allow the tag to represent a simple inheritance structure using a display for the type hierarchy. (Checking the last symbol in the display matches should be sufficient). We could allow both approaches, require that pointers use one or the other, or leave the field NULL, and use this info for a nicer printed representation of pointers.

<utility functions>+= (<-U) [<-D->]
static void *sexp2ptr(SEXP s, Rboolean null_ok, SEXP tag, char *type)
{
    void *p;
    if (TYPEOF(s) != EXTPTRSXP || R_ExternalPtrTag(s) != tag)
        error("bad %s pointer", type);
    p = R_ExternalPtrAddr(s);
    if (! null_ok && p == NULL)
        error("null %s pointer", type);
    return p;
}
Defines sexp2ptr (links are to index).

Finally, defineIntVar is a bit like defineVar, except that its symbol is specified as a C string and is value as a C int. This function is useful for installing integer constants like the flag constants in the regular expression interface.

<utility functions>+= (<-U) [<-D]
static void defineIntVar(SEXP env, char *name, int ival)
{
    SEXP sym, val;
    sym = install(name);
    PROTECT(val = ScalarInteger(ival));
    defineVar(sym, val, env);
    UNPROTECT(1);
}

[BibTeX bibliography]

Indices

Chunks

Identifiers