2009/05/28
xlispstat
, which is in turn based on the Tcl
8.0 interface. I believe more recent Tcl's have switched to a new
regular expression library by Henry Spencer that provides some
additional features. The code is available as a
package.
regexp
and regsub
, based on the Tcl functions with the same
names.
The syntax for regexp
is
regexp(pattern, string, ignore.case=FALSE, extended=TRUE, index.only=TRUE)
pattern
is a regular expression to be matched to string
.
pattern
and string
must be character vectors of length 1. If
ignore.case
is TRUE
, the case of characters is ignored in
comparisons. If extended
is TRUE
, then extended regular
expressions are used (the REG_EXTENDED
flag is given to
regcomp
). If index.only
is TRUE
, an integer matrix with
two columns, the starts and the ends of the matched substrings (C
indexing) is returned. Otherwise, a character vector of the substrings
is returned.
The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example
11-2], uses regexp
to decompose a URL
into its components:
> regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", + "http://stat.umn.edu:80/xyz") [1] "http://stat.umn.edu:80/xyz" "http://" [3] "http" "stat.umn.edu" [5] ":80" "80" [7] "/xyz"Element 3 is the protocol, 4 is the host, 6 is the port, and 7 is the path. We can use this to make a function for extracting the parts of a URL:
<url.parts
definition>= (U->)
url.parts <- function(url) {
val <- regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", url)
if (length(val) == 0)
stop(paste("not a valid URL:", url))
structure(val[c(3,4,6,7)], names=c("protocol","host","port","path"))
}
Definesurl.parts
(links are to index).
Using the example above,
> url.parts("http://stat.umn.edu:80/xyz") protocol host port path "http" "stat.umn.edu" "80" "/xyz"
The function regsub
performs substitution based on pattern
matching. The syntax is
regsub(pattern, string, sub, ignore.case=FALSE, extended=TRUE, all=FALSE)A match of
pattern
in string
is replaced using sub
. If
all
is TRUE
, then all occurrences are replaced; otherwise,
only the first is replaced. The substitution argument sub
can be a
string, which is used literally, or it can be a function. The function
is called with the substring vector produced by the match as its
argument and should return a string to use as a replacement for the
match.
An example [These examples are based on the URL Decoding
example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple
form of regsub
:
> regsub("\\+", "abc+def+ghi", " ", all=TRUE) [1] "abc def ghi"To illustrate the more complex form, the following example replaces
%xx
hexadecimal encodings in URL's with the character they
represent. It uses the utility function deHex
also included in
this package (there may be a better way to do this with some functions
already in R).
> regsub("%([0-9a-hA-H][0-9a-hA-H])", "%7ewelc%68", + function(s) deHex(s[2]), all=TRUE) [1] "~welch"These two examples can be combined into a simple URL decoding function:
<url.decode
definition>= (U->)
url.decode <- function(url)
regsub("%([0-9a-hA-H][0-9a-hA-H])", regsub("\\+", url, " ", all=TRUE),
function(s) deHex(s[2]), all=TRUE)
Definesurl.decode
(links are to index).
An illustration:
> url.decode("%7ewelc%68+book") [1] "~welch book"
regcomp
can be used to
compile a regular expression that is then reused repeatedly in calls
to regexec
. The functions return results as R objects and signal
errors when appropriate rather than returning error codes.
There are six public constants to be used as flags. These are integer
values that can be combined with by addition (which should correspond
to a logical inclusive or for the se flag values). The four flag
values for regcomp
are
The two flags for
REG.EXTENDED
Use extended regular expressions REG.NEWLINE
Special handling of newline characters REG.NOSUB
Report only success/fail in regexec
REG.ICASE
Ignore case in match.
regexec
are
REG.NOTBOL
First character of string is not the beginning of the line REG.NOTEOL
Last character of string is not the end of the line
The two public functions are regcomp
and regexec
. The syntax
for regcomp
is
regcomp(pattern, flags=REG.EXTENDED)where
pattern
is a string containing a regular expression and
flags
is an integer flag constructed from the constants given
above. regcomp
returns a regular expression pointer or signals an
error.
The syntax for regexec
is
regexec(rex, string, flags=0)where
rex
is a regular expression structure created by
regcomp
, string
is a string to be matched, and flags
is an
integer flag.
As an example, we can compile the URL pattern used above as
> rex <- regcomp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")and use it to match a URL with
> regexec(rex, "http://stat.umn.edu:80/xyz") [,1] [,2] [1,] 0 26 [2,] 0 7 [3,] 0 4 [4,] 7 19 [5,] 19 22 [6,] 20 22 [7,] 22 26
regexp.c
and the R file
regexp.R
.
<regexp.c>= #include "Rinternals.h" #include <sys/types.h> #include <regex.h> <utility functions> <regex_t
interface> <regmatch_t
interface> <finalization> <error reporting> <main functions> <initialization>
<regexp.R>= <lower-level R implementation> <higher-level R implementation>
<higher-level R implementation>= (U->) <compile.regex
definition> <get.substrings
definition> <regexp
definition> <regsub
definition> <url.parts
definition> <url.decode
definition>
The regexp
function compiles the regular expression, runs
regexpr
, and then turns the resulting matrix of index pairs into a
vector of substrings using get.substrings
.
<regexp
definition>= (<-U)
regexp <- function(pat, str, index.only=FALSE, ...) {
rex <- compile.regex(pat, ...)
pairs <- regexec(rex, str, 0)
if (index.only) pairs
else get.substrings(pairs, str)
}
Definesregexp
(links are to index).
The regular expression is compiled with
<compile.regex
definition>= (<-U)
compile.regex <- function(pat, extended=TRUE, ignore.case=FALSE) {
flags <- if (extended) REG.EXTENDED else 0
if (ignore.case) flabs <- flags + REG.ICASE
regcomp(pat, flags)
}
Definescompile.regex
(links are to index).
The function get.substrings
is
<get.substrings
definition>= (<-U)
get.substrings <- function(pairs, str) {
if (is.null(pairs)) character(0)
else apply(pairs, 1, function(i, s) substr(s, i[1] + 1, i[2]), str)
}
Definesget.substrings
(links are to index).
The regsub
function is defined as
<regsub
definition>= (<-U)
regsub <- function(pat, str, sub, all=FALSE, ...) {
strcat <- function(...) paste(..., sep="")
rex <- compile.regex(pat, ...)
head <- ""
tail <- str
repeat {
val <- regexec(rex, tail, 0)
if (is.null(val))
return(strcat(head, tail))
sval <- if (is.character(sub)) sub else sub(get.substrings(val, tail))
tail.head <- substr(tail, 1, val[1,1])
head <- strcat(head, tail.head, sval)
tail <- substr(tail, val[1,2] + 1, nchar(tail))
if (! all)
return(strcat(head, tail))
}
}
Definesregsub
(links are to index).
.Call
this means that the C
functions and the basic wrappers around the .Call
could be
generated automatically from a simple interface description---this is
what I do in xlispstat
.
regex_t
and regmatch_t
. The
interfaces for these types define a tag variable, a constructor, a
converter and field accessors.
The type tags are symbols that are initialized by the package initialization function.
<regex_t
interface>= (<-U) [D->]
static SEXP REGEX_type_tag;
DefinesREGEX_type_tag
(links are to index).
<regmatch_t
interface>= (<-U) [D->]
static SEXP REGMATCH_type_tag;
DefinesREGMATCH_type_tag
(links are to index).
<initialize type tags>= (U->) REGEX_type_tag = install("REGEX_TYPE_TAG"); REGMATCH_type_tag = install("REGMATCH_TYPE_TAG");
The constructor for regex_t
objects allocates space for a
specified number of regex_t
structures on the R heap and returns
it wrapped as an external pointer object. The R constructor provides
a wrapper around the C interface and provides a default value for the
number of structures
<regex_t
interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regex_t(SEXP rn)
{
int n = sexp2int(rn, FALSE);
if (n <= 0) return R_NilValue;
else return R_AllocatePtr(n, sizeof(regex_t), REGEX_type_tag);
}
<lower-level R implementation>= (U->) [D->] make.regex <- function(n = 1) .Call("REGEXP_make_regex_t", n)
Definesmake.regex
(links are to index).
The C and R constructors for regmatch_t
objects are analogous.
<regmatch_t
interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regmatch_t(SEXP rn)
{
int n = sexp2int(rn, FALSE);
if (n <= 0) return R_NilValue;
else return R_AllocatePtr(n, sizeof(regmatch_t), REGMATCH_type_tag);
}
<lower-level R implementation>+= (U->) [<-D->] make.regmatch <- function(n = 1) .Call("REGEXP_make_regmatch_t", n)
Definesmake.regmatch
(links are to index).
The converter functions convert from the SEXP
pointer wrappers to
the native C pointers, with appropriate error checking done by some
utility functions. These functions are static
since they will only
be used internally by the C interface.
<regex_t
interface>+= (<-U) [<-D->]
static regex_t *sexp2regex_t_p(SEXP s, Rboolean null_ok)
{
return sexp2ptr(s, null_ok, REGEX_type_tag, "regex_t");
}
Definessexp2regex_t_p
(links are to index).
<regmatch_t
interface>+= (<-U) [<-D->]
static regmatch_t *sexp2regmatch_t_p(SEXP s, Rboolean null_ok)
{
return sexp2ptr(s, null_ok, REGMATCH_type_tag, "regmatch_t");
}
Definessexp2regmatch_t_p
(links are to index).
The regexp_t
structure has one integer field, re_nsub
. The
reader interface consists of a C function and an R wrapper. An offset
can be specified in case the pointer refers to data containing more
than one structure; the R interface sets the default offset to zero.
<regex_t
interface>+= (<-U) [<-D]
SEXP REGEXP_regex_t_re_nsub(SEXP rp, SEXP ri)
{
regex_t *p = sexp2regex_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].re_nsub);
}
DefinesREGEXP_regex_t_re_nsub
(links are to index).
<lower-level R implementation>+= (U->) [<-D->] regex.nsub <- function(re, i = 0) .Call("REGEXP_regex_t_re_nsub", re, i)
Definesregex.nsub
(links are to index).
The regmatch_t
structure has two integer fields, rm_so
and
rm_eo
. The reader interface is analogous to the previous one.
<regmatch_t
interface>+= (<-U) [<-D]
SEXP REGEXP_regmatch_t_rm_so(SEXP rp, SEXP ri)
{
regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].rm_so);
}
SEXP REGEXP_regmatch_t_rm_eo(SEXP rp, SEXP ri)
{
regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].rm_eo);
}
DefinesREGEXP_regmatch_t_rm_eo
,REGEXP_regmatch_t_rm_so
(links are to index).
<lower-level R implementation>+= (U->) [<-D->] regmatch.so <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_so", rm, i) regmatch.eo <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_eo", rm, i)
Definesregmatch.eo
,regmatch.so
(links are to index).
regfree
called on them
to release resources allocated to them by regcomp
. To do this, a
C finalizer is registered for the compiled regular expression. It
would be possible to do this as part of the C wrapper to regcomp
,
but I am doing it at the R level to illustrate the possibility of
automating the creation of interfaces.
<finalization>= (<-U) static void finalize_regexp(SEXP s) { regex_t *re = sexp2regex_t_p(s, TRUE); if (re != NULL) regfree(re); } SEXP REGEXP_register(SEXP rre) { sexp2regex_t_p(rre, FALSE); R_RegisterCFinalizer(rre, finalize_regexp); return R_NilValue; }
Definesfinalize_regexp
,REGEXP_register
(links are to index).
REGEXP_regerror
from R, which in turn
calls regerror
to obtain an error message. A prefix is placed
before the error message to make clear it is a regular expression
error.
<error reporting>= (<-U) SEXP REGEXP_regerror(SEXP rcode, SEXP rre) { char buf[512]; int len; strcpy(buf, "regex error: "); len = strlen(buf); regerror(sexp2int(rcode, FALSE), sexp2regex_t_p(rre, FALSE), buf + len, sizeof(buf) - len); error(buf); return R_NilValue; /* not reached */ }
DefinesREGEXP_regerror
(links are to index).
regcomp
simply unpacks the arguments, calls the
function, and packs up and returns its result. This code could easily
be generated automatically from the function prototype.
<main functions>= (<-U) [D->] SEXP REGEXP_regcomp(SEXP rre, SEXP pat, SEXP rflags) { return ScalarInteger(regcomp(sexp2regex_t_p(rre, FALSE), sexp2char_p(pat), sexp2int(rflags, FALSE))); }
DefinesREGEXP_regcomp
(links are to index).
The R implementation creates the regular expression object, compiles it, checks for errors and registers the compiled object for finalization before returning it. It might be a good idea to suspend interrupts while this is going on to insure that a compiled regular expression will have its finalizer registered (unless there is an out of memory error).
<lower-level R implementation>+= (U->) [<-D->] regcomp <- function(pat, flags = REG.EXTENDED) { #**** without interrupts?? rex <- make.regex() result <- .Call("REGEXP_regcomp", rex, pat, flags) if (result != 0) .Call("REGEXP_regerror", result, rex) .Call("REGEXP_register", rex) rex }
Definesregcomp
(links are to index).
The C and R interface to regexec
are in the same spirit. Here
there is no need for finalization though.
<main functions>+= (<-U) [<-D->] SEXP REGEXP_regexec(SEXP rre, SEXP str, SEXP rn, SEXP rrm, SEXP rflags) { return ScalarInteger(regexec(sexp2regex_t_p(rre, FALSE), sexp2char_p(str), sexp2int(rn, FALSE), sexp2regmatch_t_p(rrm, FALSE), sexp2int(rflags, FALSE))); }
DefinesREGEXP_regexec
(links are to index).
<lower-level R implementation>+= (U->) [<-D->] regexec <- function(rex, str, flags = 0) { nmatch <- regex.nsub(rex) + 1 rm <- make.regmatch(nmatch) result <- .Call("REGEXP_regexec", rex, str, nmatch, rm, flags) if (result == 0) { val <- matrix(integer(2 * nmatch), nmatch) for (i in 1:nmatch) { val[i, 1] <- regmatch.so(rm, i - 1) val[i, 2] <- regmatch.eo(rm, i - 1) } val } else if (result == REG.NOMATCH) NULL else .Call("REGEXP_regerror", result, rex) }
Definesregexec
(links are to index).
Finally, for the url.decode
example we need to be able to convert
a string with a hex representation of a character into a string with
the character. There is probably already some way to do this in R,
but since I didn't find one here is a quick and dirty way.
<main functions>+= (<-U) [<-D] SEXP REGEXP_deHex(SEXP s) { char out[2], *str = sexp2char_p(s); int ch; if (sscanf(str, "%x", &ch) <= 0) error("bad hex string"); out[0] = ch; out[1] = 0; return ScalarString(mkChar(out)); }
DefinesREGEXP_deHex
(links are to index).
<lower-level R implementation>+= (U->) [<-D->] deHex <- function(str) .Call("REGEXP_deHex", str)
DefinesdeHex
(links are to index).
.First.lib
loads the library
and runs the C initialization function REGEXP_init
. The package
environment is given to this function as an argument.
<lower-level R implementation>+= (U->) [<-D] .First.lib <- function(lib, pkg) { library.dynam( "regexp", pkg, lib ) pkgname <- paste("package", pkg, sep = ":") .Call("REGEXP_init", pos.to.env(match(pkgname, search()))) }
The C initialization function REGEXP_init
initializes the type
tags and then assigns the R variables corresponding to the C constants
in the POSIX regular expression interface. The constants are assigned
into the environment passed to the C initialization function by
.First.lib
.
<initialization>= (<-U) SEXP REGEXP_init(SEXP env) { <initialize type tags> defineIntVar(env, "REG.EXTENDED", REG_EXTENDED); defineIntVar(env, "REG.ICASE", REG_ICASE); defineIntVar(env, "REG.NEWLINE", REG_NEWLINE); defineIntVar(env, "REG.NOSUB", REG_NOSUB); defineIntVar(env, "REG.NOMATCH", REG_NOMATCH); defineIntVar(env, "REG.NOTBOL", REG_NOTBOL); defineIntVar(env, "REG.NOTEOL", REG_NOTEOL); return R_NilValue; }
DefinesREG.EXTENDED
,REG.ICASE
,REG.NEWLINE
,REG.NOMATCH
,REG.NOSUB
,REG.NOTBOL
,REG.NOTEOL
,REGEXP_init
(links are to index).
The R_AllocatePtr
function is reproduced from the example in the
simple
reference implementation notes. It allocates data from the R heap
and packages it in an external pointer object.
<utility functions>= (<-U) [D->] static SEXP R_AllocatePtr(size_t nmemb, size_t size, SEXP tag) { SEXP data, val; int bytes; if (INT_MAX / size < nmemb) error("allocation request is too large"); bytes = nmemb * size; PROTECT(data = allocString(bytes)); memset(CHAR(data), 0, bytes); val = R_MakeExternalPtr(CHAR(data), tag, data); UNPROTECT(1); return val; }
DefinesR_AllocatePtr
(links are to index).
The function sexp2int
is similar to the function asInteger
in
the R base code. The main difference is that sexp2int
allows an
error to be raised if the result would be NA
. Another small
difference is that complex number will not be converted to integers
unless their imaginary parts are zero.
<utility functions>+= (<-U) [<-D->] static int sexp2int(SEXP s, Rboolean na_ok) { int val = NA_INTEGER; switch (TYPEOF(s)) { case LGLSXP: if (LENGTH(s) == 1 && LOGICAL(s)[0] != NA_LOGICAL) val = LOGICAL(s)[0] != 0; break; case INTSXP: if (LENGTH(s) == 1) val = INTEGER(s)[0]; break; case REALSXP: if (LENGTH(s) == 1 && R_FINITE(REAL(s)[0])) val = (int) (REAL(s)[0]); break; case CPLXSXP: if (LENGTH(s) == 1 && R_FINITE(COMPLEX(s)[0].r) && R_FINITE(COMPLEX(s)[0].i) && COMPLEX(s)[0].i == 0.0) val = (int) (COMPLEX(s)[0].r); break; } if (! na_ok && val == NA_INTEGER) error("not a valid integer"); return val; }
Definessexp2int
(links are to index).
The functions sexp2char
checks that its argument is a character
vector of length one and returns the pointer to the string's data.
<utility functions>+= (<-U) [<-D->] static char *sexp2char_p(SEXP s) { if (TYPEOF(s) != STRSXP || length(s) != 1) error("argument not a string vector of length one"); return CHAR(STRING_ELT(s, 0)); }
Definessexp2char
(links are to index).
sexp2ptr
handles checking an external pointer for its type tag,
optionally for a NULL
value, and returning the pointer value. It
might be more useful to adopt a convention where the tag
contains
a list of types and the beginning of the pointer object's list must
match. This would allow the tag
to represent a simple inheritance
structure using a display for the type hierarchy. (Checking the last
symbol in the display matches should be sufficient). We could allow
both approaches, require that pointers use one or the other, or leave
the field NULL
, and use this info for a nicer printed
representation of pointers.
<utility functions>+= (<-U) [<-D->] static void *sexp2ptr(SEXP s, Rboolean null_ok, SEXP tag, char *type) { void *p; if (TYPEOF(s) != EXTPTRSXP || R_ExternalPtrTag(s) != tag) error("bad %s pointer", type); p = R_ExternalPtrAddr(s); if (! null_ok && p == NULL) error("null %s pointer", type); return p; }
Definessexp2ptr
(links are to index).
Finally, defineIntVar
is a bit like defineVar
, except that its
symbol is specified as a C string and is value as a C int
. This
function is useful for installing integer constants like the flag
constants in the regular expression interface.
<utility functions>+= (<-U) [<-D] static void defineIntVar(SEXP env, char *name, int ival) { SEXP sym, val; sym = install(name); PROTECT(val = ScalarInteger(ival)); defineVar(sym, val, env); UNPROTECT(1); }
compile.regex
definition>: U1, D2
get.substrings
definition>: U1, D2
regex_t
interface>: U1, D2, D3, D4, D5
regexp
definition>: U1, D2
regmatch_t
interface>: U1, D2, D3, D4, D5
regsub
definition>: U1, D2
url.decode
definition>: D1, U2
url.parts
definition>: D1, U2