2009/05/28
xlispstat, which is in turn based on the Tcl
8.0 interface. I believe more recent Tcl's have switched to a new
regular expression library by Henry Spencer that provides some
additional features. The code is available as a
package.
regexp and regsub, based on the Tcl functions with the same
names.
The syntax for regexp is
regexp(pattern, string, ignore.case=FALSE, extended=TRUE, index.only=TRUE)
pattern is a regular expression to be matched to string.
pattern and string must be character vectors of length 1. If
ignore.case is TRUE, the case of characters is ignored in
comparisons. If extended is TRUE, then extended regular
expressions are used (the REG_EXTENDED flag is given to
regcomp). If index.only is TRUE, an integer matrix with
two columns, the starts and the ends of the matched substrings (C
indexing) is returned. Otherwise, a character vector of the substrings
is returned.
The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example
11-2], uses regexp to decompose a URL
into its components:
> regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)",
+ "http://stat.umn.edu:80/xyz")
[1] "http://stat.umn.edu:80/xyz" "http://"
[3] "http" "stat.umn.edu"
[5] ":80" "80"
[7] "/xyz"
Element 3 is the protocol, 4 is the host, 6 is the port, and 7 is the path.
We can use this to make a function for extracting the parts of a URL:
<url.parts definition>= (U->)
url.parts <- function(url) {
val <- regexp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)", url)
if (length(val) == 0)
stop(paste("not a valid URL:", url))
structure(val[c(3,4,6,7)], names=c("protocol","host","port","path"))
}
Definesurl.parts(links are to index).
Using the example above,
> url.parts("http://stat.umn.edu:80/xyz")
protocol host port path
"http" "stat.umn.edu" "80" "/xyz"
The function regsub performs substitution based on pattern
matching. The syntax is
regsub(pattern, string, sub, ignore.case=FALSE, extended=TRUE, all=FALSE)A match of
pattern in string is replaced using sub. If
all is TRUE, then all occurrences are replaced; otherwise,
only the first is replaced. The substitution argument sub can be a
string, which is used literally, or it can be a function. The function
is called with the substring vector produced by the match as its
argument and should return a string to use as a replacement for the
match.
An example [These examples are based on the URL Decoding
example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple
form of regsub:
> regsub("\\+", "abc+def+ghi", " ", all=TRUE)
[1] "abc def ghi"
To illustrate the more complex form, the following example replaces
%xx hexadecimal encodings in URL's with the character they
represent. It uses the utility function deHex also included in
this package (there may be a better way to do this with some functions
already in R).
> regsub("%([0-9a-hA-H][0-9a-hA-H])", "%7ewelc%68",
+ function(s) deHex(s[2]), all=TRUE)
[1] "~welch"
These two examples can be combined into a simple URL decoding
function:
<url.decode definition>= (U->)
url.decode <- function(url)
regsub("%([0-9a-hA-H][0-9a-hA-H])", regsub("\\+", url, " ", all=TRUE),
function(s) deHex(s[2]), all=TRUE)
Definesurl.decode(links are to index).
An illustration:
> url.decode("%7ewelc%68+book")
[1] "~welch book"
regcomp can be used to
compile a regular expression that is then reused repeatedly in calls
to regexec. The functions return results as R objects and signal
errors when appropriate rather than returning error codes.
There are six public constants to be used as flags. These are integer
values that can be combined with by addition (which should correspond
to a logical inclusive or for the se flag values). The four flag
values for regcomp are
The two flags for
REG.EXTENDEDUse extended regular expressions REG.NEWLINESpecial handling of newline characters REG.NOSUBReport only success/fail in regexecREG.ICASEIgnore case in match.
regexec are
REG.NOTBOLFirst character of string is not the beginning of the line REG.NOTEOLLast character of string is not the end of the line
The two public functions are regcomp and regexec. The syntax
for regcomp is
regcomp(pattern, flags=REG.EXTENDED)where
pattern is a string containing a regular expression and
flags is an integer flag constructed from the constants given
above. regcomp returns a regular expression pointer or signals an
error.
The syntax for regexec is
regexec(rex, string, flags=0)where
rex is a regular expression structure created by
regcomp, string is a string to be matched, and flags is an
integer flag.
As an example, we can compile the URL pattern used above as
> rex <- regcomp("^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)")
and use it to match a URL with
> regexec(rex, "http://stat.umn.edu:80/xyz")
[,1] [,2]
[1,] 0 26
[2,] 0 7
[3,] 0 4
[4,] 7 19
[5,] 19 22
[6,] 20 22
[7,] 22 26
regexp.c and the R file
regexp.R.
<regexp.c>= #include "Rinternals.h" #include <sys/types.h> #include <regex.h> <utility functions> <regex_tinterface> <regmatch_tinterface> <finalization> <error reporting> <main functions> <initialization>
<regexp.R>= <lower-level R implementation> <higher-level R implementation>
<higher-level R implementation>= (U->) <compile.regexdefinition> <get.substringsdefinition> <regexpdefinition> <regsubdefinition> <url.partsdefinition> <url.decodedefinition>
The regexp function compiles the regular expression, runs
regexpr, and then turns the resulting matrix of index pairs into a
vector of substrings using get.substrings.
<regexp definition>= (<-U)
regexp <- function(pat, str, index.only=FALSE, ...) {
rex <- compile.regex(pat, ...)
pairs <- regexec(rex, str, 0)
if (index.only) pairs
else get.substrings(pairs, str)
}
Definesregexp(links are to index).
The regular expression is compiled with
<compile.regex definition>= (<-U)
compile.regex <- function(pat, extended=TRUE, ignore.case=FALSE) {
flags <- if (extended) REG.EXTENDED else 0
if (ignore.case) flabs <- flags + REG.ICASE
regcomp(pat, flags)
}
Definescompile.regex(links are to index).
The function get.substrings is
<get.substrings definition>= (<-U)
get.substrings <- function(pairs, str) {
if (is.null(pairs)) character(0)
else apply(pairs, 1, function(i, s) substr(s, i[1] + 1, i[2]), str)
}
Definesget.substrings(links are to index).
The regsub function is defined as
<regsub definition>= (<-U)
regsub <- function(pat, str, sub, all=FALSE, ...) {
strcat <- function(...) paste(..., sep="")
rex <- compile.regex(pat, ...)
head <- ""
tail <- str
repeat {
val <- regexec(rex, tail, 0)
if (is.null(val))
return(strcat(head, tail))
sval <- if (is.character(sub)) sub else sub(get.substrings(val, tail))
tail.head <- substr(tail, 1, val[1,1])
head <- strcat(head, tail.head, sval)
tail <- substr(tail, val[1,2] + 1, nchar(tail))
if (! all)
return(strcat(head, tail))
}
}
Definesregsub(links are to index).
.Call this means that the C
functions and the basic wrappers around the .Call could be
generated automatically from a simple interface description---this is
what I do in xlispstat.
regex_t and regmatch_t. The
interfaces for these types define a tag variable, a constructor, a
converter and field accessors.
The type tags are symbols that are initialized by the package initialization function.
<regex_t interface>= (<-U) [D->]
static SEXP REGEX_type_tag;
DefinesREGEX_type_tag(links are to index).
<regmatch_t interface>= (<-U) [D->]
static SEXP REGMATCH_type_tag;
DefinesREGMATCH_type_tag(links are to index).
<initialize type tags>= (U->)
REGEX_type_tag = install("REGEX_TYPE_TAG");
REGMATCH_type_tag = install("REGMATCH_TYPE_TAG");
The constructor for regex_t objects allocates space for a
specified number of regex_t structures on the R heap and returns
it wrapped as an external pointer object. The R constructor provides
a wrapper around the C interface and provides a default value for the
number of structures
<regex_t interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regex_t(SEXP rn)
{
int n = sexp2int(rn, FALSE);
if (n <= 0) return R_NilValue;
else return R_AllocatePtr(n, sizeof(regex_t), REGEX_type_tag);
}
<lower-level R implementation>= (U->) [D->]
make.regex <- function(n = 1) .Call("REGEXP_make_regex_t", n)
Definesmake.regex(links are to index).
The C and R constructors for regmatch_t objects are analogous.
<regmatch_t interface>+= (<-U) [<-D->]
SEXP REGEXP_make_regmatch_t(SEXP rn)
{
int n = sexp2int(rn, FALSE);
if (n <= 0) return R_NilValue;
else return R_AllocatePtr(n, sizeof(regmatch_t), REGMATCH_type_tag);
}
<lower-level R implementation>+= (U->) [<-D->]
make.regmatch <- function(n = 1) .Call("REGEXP_make_regmatch_t", n)
Definesmake.regmatch(links are to index).
The converter functions convert from the SEXP pointer wrappers to
the native C pointers, with appropriate error checking done by some
utility functions. These functions are static since they will only
be used internally by the C interface.
<regex_t interface>+= (<-U) [<-D->]
static regex_t *sexp2regex_t_p(SEXP s, Rboolean null_ok)
{
return sexp2ptr(s, null_ok, REGEX_type_tag, "regex_t");
}
Definessexp2regex_t_p(links are to index).
<regmatch_t interface>+= (<-U) [<-D->]
static regmatch_t *sexp2regmatch_t_p(SEXP s, Rboolean null_ok)
{
return sexp2ptr(s, null_ok, REGMATCH_type_tag, "regmatch_t");
}
Definessexp2regmatch_t_p(links are to index).
The regexp_t structure has one integer field, re_nsub. The
reader interface consists of a C function and an R wrapper. An offset
can be specified in case the pointer refers to data containing more
than one structure; the R interface sets the default offset to zero.
<regex_t interface>+= (<-U) [<-D]
SEXP REGEXP_regex_t_re_nsub(SEXP rp, SEXP ri)
{
regex_t *p = sexp2regex_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].re_nsub);
}
DefinesREGEXP_regex_t_re_nsub(links are to index).
<lower-level R implementation>+= (U->) [<-D->]
regex.nsub <- function(re, i = 0) .Call("REGEXP_regex_t_re_nsub", re, i)
Definesregex.nsub(links are to index).
The regmatch_t structure has two integer fields, rm_so and
rm_eo. The reader interface is analogous to the previous one.
<regmatch_t interface>+= (<-U) [<-D]
SEXP REGEXP_regmatch_t_rm_so(SEXP rp, SEXP ri)
{
regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].rm_so);
}
SEXP REGEXP_regmatch_t_rm_eo(SEXP rp, SEXP ri)
{
regmatch_t *p = sexp2regmatch_t_p(rp, FALSE);
return ScalarInteger(p[sexp2int(ri, FALSE)].rm_eo);
}
DefinesREGEXP_regmatch_t_rm_eo,REGEXP_regmatch_t_rm_so(links are to index).
<lower-level R implementation>+= (U->) [<-D->]
regmatch.so <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_so", rm, i)
regmatch.eo <- function(rm, i = 0) .Call("REGEXP_regmatch_t_rm_eo", rm, i)
Definesregmatch.eo,regmatch.so(links are to index).
regfree called on them
to release resources allocated to them by regcomp. To do this, a
C finalizer is registered for the compiled regular expression. It
would be possible to do this as part of the C wrapper to regcomp,
but I am doing it at the R level to illustrate the possibility of
automating the creation of interfaces.
<finalization>= (<-U)
static void finalize_regexp(SEXP s)
{
regex_t *re = sexp2regex_t_p(s, TRUE);
if (re != NULL)
regfree(re);
}
SEXP REGEXP_register(SEXP rre)
{
sexp2regex_t_p(rre, FALSE);
R_RegisterCFinalizer(rre, finalize_regexp);
return R_NilValue;
}
Definesfinalize_regexp,REGEXP_register(links are to index).
REGEXP_regerror from R, which in turn
calls regerror to obtain an error message. A prefix is placed
before the error message to make clear it is a regular expression
error.
<error reporting>= (<-U)
SEXP REGEXP_regerror(SEXP rcode, SEXP rre)
{
char buf[512];
int len;
strcpy(buf, "regex error: ");
len = strlen(buf);
regerror(sexp2int(rcode, FALSE), sexp2regex_t_p(rre, FALSE),
buf + len, sizeof(buf) - len);
error(buf);
return R_NilValue; /* not reached */
}
DefinesREGEXP_regerror(links are to index).
regcomp simply unpacks the arguments, calls the
function, and packs up and returns its result. This code could easily
be generated automatically from the function prototype.
<main functions>= (<-U) [D->]
SEXP REGEXP_regcomp(SEXP rre, SEXP pat, SEXP rflags)
{
return ScalarInteger(regcomp(sexp2regex_t_p(rre, FALSE),
sexp2char_p(pat),
sexp2int(rflags, FALSE)));
}
DefinesREGEXP_regcomp(links are to index).
The R implementation creates the regular expression object, compiles it, checks for errors and registers the compiled object for finalization before returning it. It might be a good idea to suspend interrupts while this is going on to insure that a compiled regular expression will have its finalizer registered (unless there is an out of memory error).
<lower-level R implementation>+= (U->) [<-D->]
regcomp <- function(pat, flags = REG.EXTENDED) {
#**** without interrupts??
rex <- make.regex()
result <- .Call("REGEXP_regcomp", rex, pat, flags)
if (result != 0)
.Call("REGEXP_regerror", result, rex)
.Call("REGEXP_register", rex)
rex
}
Definesregcomp(links are to index).
The C and R interface to regexec are in the same spirit. Here
there is no need for finalization though.
<main functions>+= (<-U) [<-D->]
SEXP REGEXP_regexec(SEXP rre, SEXP str, SEXP rn, SEXP rrm, SEXP rflags)
{
return ScalarInteger(regexec(sexp2regex_t_p(rre, FALSE),
sexp2char_p(str),
sexp2int(rn, FALSE),
sexp2regmatch_t_p(rrm, FALSE),
sexp2int(rflags, FALSE)));
}
DefinesREGEXP_regexec(links are to index).
<lower-level R implementation>+= (U->) [<-D->]
regexec <- function(rex, str, flags = 0) {
nmatch <- regex.nsub(rex) + 1
rm <- make.regmatch(nmatch)
result <- .Call("REGEXP_regexec", rex, str, nmatch, rm, flags)
if (result == 0) {
val <- matrix(integer(2 * nmatch), nmatch)
for (i in 1:nmatch) {
val[i, 1] <- regmatch.so(rm, i - 1)
val[i, 2] <- regmatch.eo(rm, i - 1)
}
val
}
else if (result == REG.NOMATCH)
NULL
else
.Call("REGEXP_regerror", result, rex)
}
Definesregexec(links are to index).
Finally, for the url.decode example we need to be able to convert
a string with a hex representation of a character into a string with
the character. There is probably already some way to do this in R,
but since I didn't find one here is a quick and dirty way.
<main functions>+= (<-U) [<-D]
SEXP REGEXP_deHex(SEXP s)
{
char out[2], *str = sexp2char_p(s);
int ch;
if (sscanf(str, "%x", &ch) <= 0)
error("bad hex string");
out[0] = ch;
out[1] = 0;
return ScalarString(mkChar(out));
}
DefinesREGEXP_deHex(links are to index).
<lower-level R implementation>+= (U->) [<-D->]
deHex <- function(str) .Call("REGEXP_deHex", str)
DefinesdeHex(links are to index).
.First.lib loads the library
and runs the C initialization function REGEXP_init. The package
environment is given to this function as an argument.
<lower-level R implementation>+= (U->) [<-D]
.First.lib <- function(lib, pkg) {
library.dynam( "regexp", pkg, lib )
pkgname <- paste("package", pkg, sep = ":")
.Call("REGEXP_init", pos.to.env(match(pkgname, search())))
}
The C initialization function REGEXP_init initializes the type
tags and then assigns the R variables corresponding to the C constants
in the POSIX regular expression interface. The constants are assigned
into the environment passed to the C initialization function by
.First.lib.
<initialization>= (<-U)
SEXP REGEXP_init(SEXP env)
{
<initialize type tags>
defineIntVar(env, "REG.EXTENDED", REG_EXTENDED);
defineIntVar(env, "REG.ICASE", REG_ICASE);
defineIntVar(env, "REG.NEWLINE", REG_NEWLINE);
defineIntVar(env, "REG.NOSUB", REG_NOSUB);
defineIntVar(env, "REG.NOMATCH", REG_NOMATCH);
defineIntVar(env, "REG.NOTBOL", REG_NOTBOL);
defineIntVar(env, "REG.NOTEOL", REG_NOTEOL);
return R_NilValue;
}
DefinesREG.EXTENDED,REG.ICASE,REG.NEWLINE,REG.NOMATCH,REG.NOSUB,REG.NOTBOL,REG.NOTEOL,REGEXP_init(links are to index).
The R_AllocatePtr function is reproduced from the example in the
simple
reference implementation notes. It allocates data from the R heap
and packages it in an external pointer object.
<utility functions>= (<-U) [D->]
static SEXP R_AllocatePtr(size_t nmemb, size_t size, SEXP tag)
{
SEXP data, val;
int bytes;
if (INT_MAX / size < nmemb)
error("allocation request is too large");
bytes = nmemb * size;
PROTECT(data = allocString(bytes));
memset(CHAR(data), 0, bytes);
val = R_MakeExternalPtr(CHAR(data), tag, data);
UNPROTECT(1);
return val;
}
DefinesR_AllocatePtr(links are to index).
The function sexp2int is similar to the function asInteger in
the R base code. The main difference is that sexp2int allows an
error to be raised if the result would be NA. Another small
difference is that complex number will not be converted to integers
unless their imaginary parts are zero.
<utility functions>+= (<-U) [<-D->]
static int sexp2int(SEXP s, Rboolean na_ok)
{
int val = NA_INTEGER;
switch (TYPEOF(s)) {
case LGLSXP:
if (LENGTH(s) == 1 && LOGICAL(s)[0] != NA_LOGICAL)
val = LOGICAL(s)[0] != 0;
break;
case INTSXP:
if (LENGTH(s) == 1)
val = INTEGER(s)[0];
break;
case REALSXP:
if (LENGTH(s) == 1 && R_FINITE(REAL(s)[0]))
val = (int) (REAL(s)[0]);
break;
case CPLXSXP:
if (LENGTH(s) == 1 &&
R_FINITE(COMPLEX(s)[0].r) && R_FINITE(COMPLEX(s)[0].i) &&
COMPLEX(s)[0].i == 0.0)
val = (int) (COMPLEX(s)[0].r);
break;
}
if (! na_ok && val == NA_INTEGER)
error("not a valid integer");
return val;
}
Definessexp2int(links are to index).
The functions sexp2char checks that its argument is a character
vector of length one and returns the pointer to the string's data.
<utility functions>+= (<-U) [<-D->]
static char *sexp2char_p(SEXP s)
{
if (TYPEOF(s) != STRSXP || length(s) != 1)
error("argument not a string vector of length one");
return CHAR(STRING_ELT(s, 0));
}
Definessexp2char(links are to index).
sexp2ptr handles checking an external pointer for its type tag,
optionally for a NULL value, and returning the pointer value. It
might be more useful to adopt a convention where the tag contains
a list of types and the beginning of the pointer object's list must
match. This would allow the tag to represent a simple inheritance
structure using a display for the type hierarchy. (Checking the last
symbol in the display matches should be sufficient). We could allow
both approaches, require that pointers use one or the other, or leave
the field NULL, and use this info for a nicer printed
representation of pointers.
<utility functions>+= (<-U) [<-D->]
static void *sexp2ptr(SEXP s, Rboolean null_ok, SEXP tag, char *type)
{
void *p;
if (TYPEOF(s) != EXTPTRSXP || R_ExternalPtrTag(s) != tag)
error("bad %s pointer", type);
p = R_ExternalPtrAddr(s);
if (! null_ok && p == NULL)
error("null %s pointer", type);
return p;
}
Definessexp2ptr(links are to index).
Finally, defineIntVar is a bit like defineVar, except that its
symbol is specified as a C string and is value as a C int. This
function is useful for installing integer constants like the flag
constants in the regular expression interface.
<utility functions>+= (<-U) [<-D]
static void defineIntVar(SEXP env, char *name, int ival)
{
SEXP sym, val;
sym = install(name);
PROTECT(val = ScalarInteger(ival));
defineVar(sym, val, env);
UNPROTECT(1);
}
compile.regex definition>: U1, D2
get.substrings definition>: U1, D2
regex_t interface>: U1, D2, D3, D4, D5
regexp definition>: U1, D2
regmatch_t interface>: U1, D2, D3, D4, D5
regsub definition>: U1, D2
url.decode definition>: D1, U2
url.parts definition>: D1, U2