regexp and regsub, based on the Tcl functions with the same
names[cite welch97:_pract_progr_tcl_tk]. The package is
REGULAR-EXPRESSIONS with nickname REGEXP.
The syntax for regexp is
regexp pattern string &key ignore-case extended index-only
pattern is a regular expression to be matched to string. If
ignore-case is true, the case of characters is ignored in
comparisons; by default it is false. If extended is true, then
extended regular expressions are used (the REG_EXTENDED flag is
given to regcomp); extend is true by default. If
index-only is true, a list of (start . end) pairs for the
matched substrings is returned. Otherwise, a list of the substrings is
returned. The default value is false.
The following example, adapted slightly from [cite welch97:_pract_progr_tcl_tk, Example
11-2], uses regexp to decompose a URL
into its components:
> (use-package "REGEXP")
T
> (regexp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)"
"http://stat.umn.edu:80/xyz")
("http://stat.umn.edu:80/xyz"
"http://"
"http" ;; protocol
"stat.umn.edu" ;; host
":80"
"80" ;; port
"/xyz") ;; path
The return value has been edited and comments added for clarity.
The function regsub performs substitution based on pattern
matching. The syntax is
regsub pattern string sub &key ignore-case extended allA match of
pattern in string is replaced using sub. If
all is true then all occurrences are replaced; otherwise, only the
first is replaced. By default, all is false. The substitution
argument sub can be a string, which is used literally, or it can
be a function. The function is called with the substrings produced by
the match as arguments and should return a string to use as a
replacement for the match.
An example [These examples are based on the URL Decoding
example in [cite welch97:_pract_progr_tcl_tk, p. 128]] of the simple
form of regsub:
> (regsub "\\+" "abc+def+ghi" " " :all t) "abc def ghi"To illustrate the more complex form, the following example replaces
%xx hexadecimal encodings in URL's with the character they
represent.
> (regsub "%([0-9a-hA-H][0-9a-hA-H])" "%7ewelc%68"
#'(lambda (a b)
(string (int-char (let ((*read-base* 16))
(read-from-string b)))))
:all t)
"~welch"
These two examples can be combined into a simple URL decoding
function:
<url-decode definition>= (U->)
(defun url-decode (url)
(flet ((parse-hex (s)
(let ((*read-base* 16))
(read-from-string s))))
(regsub "%([0-9a-hA-H][0-9a-hA-H])"
(regsub "\\+" url " " :all t)
#'(lambda (a b) (string (int-char (parse-hex b))))
:all t)))
Definesurl-decode(links are to index).
An illustration:
> (url-decode "%7ewelc%68+book") "~welch book"
regcomp can be used to
compile a regular expression that is then reused repeatedly in calls
to regexec. The functions are ``Lispified'' by returning results
as Lisp objects and signaling errors when appropriate rather than
returning error codes. Currently these are in the same package as the
higher-level functions. [Perhaps this should be changed.]
There are six exported constants to be used as flags. These are
integer values that can be combined with logior. The four flag
values for regcomp are
The two flags for
REG_EXTENDEDUse extended regular expressions REG_NEWLINESpecial handling of newline characters REG_NOSUBReport only success/fail in regexecREG_ICASEIgnore case in match.
regexec are
REG_NOTBOLFirst character of string is not the beginning of the line REG_NOTEOLLast character of string is not the end of the line
The two public functions are regcomp and regexec. The syntax
for regcomp is
regcomp pattern flagswhere
pattern is a string containing a regular expression and
flags is an integer flag constructed from the constants given
above. regcomp returns a regular expression structure or signals
an error.
The syntax for regexec is
regexec rex string flagswhere
rex is a regular expression structure created by
regcomp, string is a string to be matched, and flags is an
integer flag.
As an example, we can compile the URL pattern used above as
> (setf rex (regcomp "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)" REG_EXTENDED)) #<Regular Expression "^(([^:]+)://)?([^:/]+)(:([0-9]+))?(/.*)">and use it to match a URL with
> (regexec rex "http://stat.umn.edu:80/xyz" 0) ((0 . 26) (0 . 7) (0 . 4) (7 . 19) (19 . 22) (20 . 22) (22 . 26))
regexp.wrp.
<regexp.wrp>= <package and module setup> <lower-level component> <higher-level component>
The package and module setup code is
<package and module setup>= (<-U U->)
(provide "regexp")
(defpackage "REGULAR-EXPRESSIONS"
(:use "COMMON-LISP")
(:nicknames "REGEXP"))
(in-package "REGEXP")
(export '(<exported variable symbols>
<exported function symbols>))
The package does not :use use the C-WRAPPERS package since the
wrappers are not needed at runtime. Instead, the wrapper macros are
referenced with the package nickname wrap qualifier.
<lower-level component>= (<-U)
(wrap:c-lines "#include <sys/types.h>"
"#include <regex.h>")
<constant wrappers>
<compound data wrappers>
<function wrappers>
<Lisp regex data structure>
<support functions for the lower-level>
<public low-level functions>
<constant wrappers>= (<-U) [D->] (wrap:c-constant REG_EXTENDED "REG_EXTENDED" :integer) (wrap:c-constant REG_NEWLINE "REG_NEWLINE" :integer) (wrap:c-constant REG_NOSUB "REG_NOSUB" :integer) (wrap:c-constant REG_ICASE "REG_ICASE" :integer) (wrap:c-constant REG_NOTBOL "REG_NOTBOL" :integer) (wrap:c-constant REG_NOTEOL "REG_NOTEOL" :integer)
DefinesREG_EXTENDED,REG_ICASE,REG_NEWLINE,REG_NOSUB,REG_NOTBOL,REG_NOTEOL(links are to index).
<exported variable symbols>= (<-U U->) REG_EXTENDED REG_NEWLINE REG_NOSUB REG_ICASE REG_NOTBOL REG_NOTEOL
In addition we need the no-match result constant for regexec,
<constant wrappers>+= (<-U) [<-D] (wrap:c-constant REG_NOMATCH "REG_NOMATCH" :integer)
DefinesREG_NOMATCH(links are to index).
but this does not need to be exported.
There are two compound C data types, regex_t for compiled regular
expressions and regmatch_t for the match results.
<compound data wrappers>= (<-U)
(wrap:c-pointer "regex_t"
(:make make-regex-t)
(:get re-nsub "re_nsub" :integer))
(wrap:c-pointer "regmatch_t"
(:make make-regmatch-t)
(:get rm-so "rm_so" :integer)
(:get rm-eo "rm_eo" :integer))
Definesmake-regex-t,make-regmatch-t,re-nsub,rm-eo,rm-so(links are to index).
There are four functions. regcomp compiles and regexec
executes regular expressions. The corresponding Lisp functions are
base-regcomp and base-regexec.
<function wrappers>= (<-U) [D->] (wrap:c-function base-regcomp "regcomp" ((:cptr "regex_t") :string :integer) :integer) (wrap:c-function base-regexec "regexec" ((:cptr "regex_t") :string :integer (:cptr "regmatch_t") :integer) :integer)
Definesbase-regcomp,base-regexec(links are to index).
The regfree function releases storage allocated for compiled
expressions and regerror fills a string with an error message.
<function wrappers>+= (<-U) [<-D] (wrap:c-function regfree "regfree" ((:cptr "regex_t")) :void) (wrap:c-function regerror "regerror" (:integer (:cptr "regex_t") :string :integer) :integer)
For the public interface, regular expressions are represented by a
Lisp structure with two slots, ptr for holding the allocated C
data, and pat for holding the pattern string.
<Lisp regex data structure>= (<-U) [D->]
(defstruct (regexp (:print-function print-regexp)
(:constructor (make-regexp (ptr pat))))
ptr pat)
Definesregexp(links are to index).
The printed representation is provided by
<Lisp regex data structure>+= (<-U) [<-D]
(defun print-regexp (x s d)
(format s "#<Regular Expression ~s>" (regexp-pat x)))
Definesprint-regexpr(links are to index).
The lower-level component exports two functions, the public versions
of regcomp and regexec.
<exported function symbols>= (<-U U->) [D->] regcomp regexec
The public version of regcomp takes a pattern and a flags argument
and returns a regexp structure or signals an
error. [perhaps it would be better to accept keywords or
optional arguments instead of flags.]
<public low-level functions>= (<-U) [D->]
(defun regcomp (pat flags)
(system:without-interrupts
(let* ((rex (make-regex-t))
(result (base-regcomp rex pat flags))
(value (make-regexp rex pat)))
(unless (= result 0)
(raise-regex-error "regcomp" result rex))
(system:register-finalizer value #'regexp-free)
value)))
Definesregcomp(links are to index).
The regexp-free function is called by the
finalization
mechanism when a regexp object is garbage-collected.
<support functions for the lower-level>= (<-U) [D->]
(defun regexp-free (x)
(without-interrupts
(let ((rex (regexp-ptr x)))
(unless (null rex)
(setf (regexp-ptr x) nil)
(regfree rex)))))
Definesregexp-free(links are to index).
Interrupts are disabled around the creation of the expression object
to insure that it will be registered for garbage collection. If
regexp-free is only called from the finalization mechanism then
interrupts will already be suspended. The without-interrupts
wrapper is used to prevent an interrupt between removing the reference
and releasing the memory with regfree if regexp-free is ever
called directly.
The public regexec function takes a regexp object, a string,
and a flags argument and returns NIL for no match, a list of index
pairs for a match, or signals an error.
<public low-level functions>+= (<-U) [<-D]
(defun regexec (r str flags)
(let* ((rex (regexp-ptr r))
(nmatch (+ (re-nsub rex) 1))
(rm (make-regmatch-t nmatch))
(result (base-regexec rex str nmatch rm flags)))
(cond
((= result 0) <convert data in rm to list of index pairs>)
((= result REG_NOMATCH) nil)
(t (raise-regex-error "regexec" result rex)))))
Definesregexec(links are to index).
The list of index pairs is constructed by
<convert data in rm to list of index pairs>= (U->)
(let ((val nil))
(dotimes (i nmatch (nreverse val))
(let ((so (rm-so rm i))
(eo (rm-eo rm i)))
(push (if (or (= so -1) (= eo -1))
nil
(cons so eo))
val))))
Errors are raised by calling raise-regex-error. This function
generates an error message by calling regerror. regerror is
called twice, once with an empty string and buffer size zero to get
the required buffer size, and once with a buffer of the correct size.
The buffer is filled with a null-terminated string; the subseq
trims off the null character.
<support functions for the lower-level>+= (<-U) [<-D]
(defun raise-regex-error (fun result rex)
(let ((size (regerror result rex "" 0))
(buf (make-string size)))
(regerror result rex buf size)
(error "~a in ~a" (subseq buf 0 (- size 1)) fun)))
<higher-level component>= (<-U) <get-substringsdefinition> <regexpdefinition> <regsubdefinition> <url-decodedefinition>
The higher-level component exports three functions.
<exported function symbols>+= (<-U U->) [<-D] regexp regsub url-decode
The regexp function calls regcomp and regexpr and then
turns the resulting list of index pairs into a list of substrings
using get-substrings.
<regexpdefinition>= (<-U) (defun regexp (pat str &key ignore-case (extended t) index-only) (let* ((rex <compile regular expression forpat>) (pairs (regexec rex str 0))) (if index-only pairs (get-substrings str pairs))))
Definesregexp(links are to index).
<compile regular expression for pat>= (<-U U->)
(let ((icase (if ignore-case REG_ICASE 0))
(ext (if extended REG_EXTENDED 0)))
(regcomp pat (logior icase ext)))
The function get-substrings is
<get-substrings definition>= (<-U)
(defun get-substrings (str val)
(mapcar #'(lambda (x) (if x (subseq str (car x) (cdr x)) nil))
val))
Definesget-substrings(links are to index).
The regsub function is defined as
<regsubdefinition>= (<-U) (defun regsub (pat str sub &key ignore-case (extended t) all) (let ((rex <compile regular expression forpat>) (head "") (tail str)) (loop (let ((val (regexec rex tail 0))) (unless val (return (concatenate 'string head tail))) (let* ((sval <string to substitute>) (start (car (first val))) (end (cdr (first val))) (tail-head (subseq tail 0 start))) (setf head (concatenate 'string head tail-head sval)) (setf tail (subseq tail end)))) (unless all (return (concatenate 'string head tail))))))
Definesregsub(links are to index).
If sub is itself a string then it is substituted
directly. Otherwise it is assumed to be a function and is applied to
the match's substrings.
<string to substitute>= (<-U)
(if (stringp sub)
sub
(apply sub (get-substrings tail val)))
_autoidx.lsp
file for the regular expression module is
<_autoidx.lsp>=
<package and module setup>
(system:define-autoload-module "regexp"
(variable <exported variable symbols>)
(function <exported function symbols>))
<rtest.c>=
#include <stdio.h>
#include <sys/types.h>
#include <regex.h>
int match(char *string, char *pattern)
{
int i;
regex_t re;
char buf[256];
i=regcomp(&re, pattern, REG_EXTENDED|REG_NOSUB);
if (i != 0) {
(void)regerror(i,&re,buf,sizeof buf);
printf("%s\n",buf);
return(0); /* report error */
}
i = regexec(&re, string, (size_t) 0, NULL, 0);
regfree(&re);
if (i != 0) {
(void)regerror(i,&re,buf,sizeof buf);
printf("%s\n",buf);
return(0); /* report error */
}
return(1);
}
void main()
{
printf("%s\n", match("ABCDE", "[A-Z]*") ? "success" : "failure");
}
[1] Brent B. Welch. Practical Programming in Tcl and Tk. Prentice-Hall, Upper Saddle River, NJ, 2nd edition, 1997.
_autoidx.lsp>: D1
pat>: U1, D2, U3
rm to list of index pairs>: U1, D2
get-substrings definition>: U1, D2
regex data structure>: U1, D2, D3
regexp definition>: U1, D2
regsub definition>: U1, D2
url-decode definition>: D1, U2