10. Regular Expressions

Part of CS:2820 Object Oriented Software Development Notes, Fall 2017
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

 

A digression into Automata Theory

To understand the scanner mechanism in Java, it is useful to digress into the world of Automata theory. This is the theory that discusses what a machine can and cannot do, in principle. Automata theory has its origins in the 1930s, with studies at places like Bell Labs of what a mechanical or electromechanical system such as a telephone exchange could do, in principle. People like Alan Turing were major contributors to this work.

In the 1950s and 1960s, formal linguistics was married to automata theory as it dawned on people that for each class of grammar, there was a corresponding class of machines that could, in principle, be configured to process languages from that grammar. Noam Chomsky was the key person in this development. (Today, many intellectuals know of Chomsky as a leftist political thinker, but his contributions to formal language theory form a key part of the foundation for modern computer science.)

The Chomsky Language Hierarchy divides languages into four classes. Here are the levels in the hierarchy and their relationship to automata theory:

    regular languages                -- processed by finite-state machines
    context-free languages           -- processed by push-down automata (finite-state machines with a stack)
    context-sensitive languages      -- processed by linear-bounded automata
    recursively enumerable languages -- processed by Turing machines

Each successive member of this hierarchy is more powerful than the one above it, so all regular languages, for example, can be processed using the methodologies of the other layers, but recursively-enumerable languages cannot be processed using the methodologies of any of the other layers.

This brings us to regular expressions. Automata theorists developed the following notion of a regular expression as the grammar describing the structure of a regular language. A regular expression is either:

    a single character, which matches just that character,
    the concatenation of two regular expressions ab, matching a match for a followed by a match for b,
    the alternation of two regular expressions ( a | b ), matching either a match for a or a match for b, or
    the closure of a regular expression a*, matching zero or more consecutive matches for a,

where parentheses may be used freely for grouping. For example, ( 0 | 1 )( 0 | 1 )* describes the binary numbers: a binary digit followed by zero or more additional binary digits.

As another example, we can describe the conventional rules for forming an integer using this regular expression:

( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*

That is, an integer is a digit followed by zero or more additional digits. Mathematicians and linguists concerned primarily with writing proofs about what regular expressions can do find the above notation entirely sufficient, but as the second example shows, it gets rather verbose. One of the key personalities in the development of the mathematics of regular expressions was Stephen Kleene; the term Kleene closure for the star notation is a reference to his work.

When computer scientists began creating software that could process regular expressions, they began developing abbreviations. This began during the development of the QED text editor in the late 1960s, and from there, it entered the world of Unix and then Linux and Windows. The common abbreviations are:

    [abc]       -- brackets abbreviate the alternation of single characters ( a | b | c )
    [a-z]       -- a range inside brackets abbreviates the alternation of every character from a to z
    a+          -- one or more repetitions of a, an abbreviation for aa*
    .           -- any single character
    \t \n \f    -- escape sequences naming non-printing characters such as tab, newline and form feed

Note that none of the above extensions change the basic power of regular expressions; they are simply abbreviations for things that can be expressed in more long-winded form using the original notation. In the case of the notations for non-printing characters, this is just a matter of moving from the abstract domain of mathematics to the concrete domain of the computer character sets we actually use.

In the vi text editor, a direct descendant of the old QED editor, search patterns are regular expressions, so if you type /[0-9][0-9]*/ the editor will move the cursor to the start of the next integer in the text (and in some versions of vi, also highlight every integer anywhere in your code). The C and C++ libraries incorporate standard tools that process this notation, so similar regular expressions are supported by a wide variety of programs written using the C library. With the standardization of the C library came official standardization of C's regular expression notation.

And, since Java is patterned on C++, the Java library includes its own support for regular expressions, buried in class Pattern. In both the C and Java regular expression notations, the implementations include features that go beyond the automata-theoretic definition of regular expressions, but the core features from automata theory are all present.
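
As a minimal sketch of Java's notation (the class name PatternDemo and the sample strings here are invented for illustration), the integer regular expression from above can be compiled and applied like this:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternDemo {
    public static void main( String[] args ) {
        // the integer regular expression from above, in Java's notation
        Pattern integer = Pattern.compile( "[0-9][0-9]*" );

        System.out.println( integer.matcher( "2017" ).matches() ); // true
        System.out.println( integer.matcher( "20x7" ).matches() ); // false

        // find each integer embedded in a longer string
        Matcher m = integer.matcher( "route 6 and route 218" );
        while (m.find()) {
            System.out.println( m.group() ); // prints 6, then 218
        }
    }
}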

Scanners

We've reached the point where it is useful to look at scanners in a bit more detail. Java's input scanning tool, class Scanner, is quite complex, and using it is both a huge benefit (writing code to replace this class could take weeks) and a problem (understanding the full potential of this class can also take weeks). Fortunately, we don't need to understand the full potential of this class; we just need to scan an input file describing a road network, a neuron network or a logic circuit.

Class Scanner has a number of methods called nextXX() which extract the next XX from the input, where XX might be a character, an integer, or a token; for each of these, it has a corresponding hasNextXX() method that tests whether the input stream's next item matches the pattern for the textual representation of an XX. The simplest of these are next() and hasNext(), which deal with the next nonblank item in the stream.
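
For example, here is a minimal sketch (the class name ScannerDemo and the sample input are invented for illustration) showing next(), hasNextInt() and nextInt() at work:

import java.util.Scanner;

public class ScannerDemo {
    public static void main( String[] args ) {
        Scanner sc = new Scanner( "road 42 times" );

        System.out.println( sc.next() );        // prints road
        if (sc.hasNextInt()) {                  // true, 42 comes next
            System.out.println( sc.nextInt() ); // prints 42
        }
        System.out.println( sc.hasNextInt() );  // prints false, times is not an integer
        System.out.println( sc.next() );        // prints times
    }
}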

Tokens in the input stream are delimited by sequences of delimiters. By default, the delimiters are white-space characters (blank, tab, newline), but the useDelimiter() method allows you to change the pattern (that is, the regular expression) used to match delimiters. For example, consider this code operating on sc, an object of class Scanner:

sc.useDelimiter( "[ \\t\\x0B\\f\\r]*" );

The string passed in the call above is the textual representation of a Pattern used to control the scanner. Java's class Pattern is itself quite complex, and forms the underpinning of the scanner mechanism.

Why are all the backslashes above doubled? Because the Java compiler converts the sequence \\ within a quoted string literal to a single backslash character, so the string actually passed to the useDelimiter method has this text:

[ \t\x0B\f\r]*

The above text is a regular expression. We can deconstruct this expression as follows:

[ \t\x0B\f\r]*
             * -- any number of repeats of the preceding regular expression
[           ]  -- any of the characters enclosed in brackets
               -- a space or
  \t           -- a tab or
    \x0B       -- a vertical tab (look it up, it's a standard control character)
        \f     -- a form feed
          \r   -- a carriage return

This list contains all the common white-space characters except for one, the \n newline character. As a result, newline is now a distinct token and next() will now return "\n" as the next token after you have scanned all the tokens on a line.
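
Returning to the doubled backslashes for a moment, here is a minimal check (the class name BackslashDemo is invented) showing that \\ in a string literal becomes one character while \t becomes an actual tab:

public class BackslashDemo {
    public static void main( String[] args ) {
        String delims = "[ \\t\\x0B\\f\\r]*";
        // the doubled backslashes in the source become single backslashes
        // in the string itself, so this prints [ \t\x0B\f\r]*
        System.out.println( delims );
        System.out.println( "\\t".length() ); // prints 2: a backslash then a t
        System.out.println( "\t".length() );  // prints 1: an actual tab character
    }
}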

The hasNext() and next() methods

The scanner provides a skip() method. This is the most elementary of the scanner's tools: it skips over whatever text from the input stream matches the pattern provided. For example, to skip white space, you can use this code:

sc.skip( "[ \\t\\x0B\\f\\r]*" );

Most of the scanner's hasNextXX() and nextXX() methods can be replaced by code that first skips whitespace and then matches the pattern for XX, whatever it may be. Thus, sc.hasNextInt() is functionally the same as this:

sc.hasNext( "[ \\t\\x0B\\f\\r]*" + "[0-9][0-9]*" )

Similarly, the assignment i = sc.nextInt() is functionally the same as this:

sc.skip( "[ \\t\\x0B\\f\\r]*" );
i = Integer.parseInt( sc.next( "[0-9][0-9]*" ) );
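
Wrapped in a runnable form (the class name NextIntDemo and the sample input are invented), the long-winded equivalent looks like this:

import java.util.Scanner;

public class NextIntDemo {
    public static void main( String[] args ) {
        Scanner sc = new Scanner( "  42 end" );

        // the long-winded equivalent of i = sc.nextInt()
        sc.skip( "[ \\t\\x0B\\f\\r]*" );
        int i = Integer.parseInt( sc.next( "[0-9][0-9]*" ) );

        System.out.println( i ); // prints 42
    }
}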

Patterns

There is one problem: interpreting regular expressions directly from their textual form is slow. Therefore, Java has a mechanism called a pattern compiler that reduces the textual representation of a regular expression to an object of class Pattern. A pattern is really a table that a finite-state-machine interpreter can process efficiently. This idea of compiling expressions into patterns is not new; the QED text editor did it in the late 1960s. Consider this code:

static final Pattern whitespace = Pattern.compile( "[ \t]*" );

This creates a final variable (effectively a constant) that holds, as its value, the output of the pattern compiler for a regular expression that matches strings of spaces and tabs. Now, if sc is a scanner, sc.skip(whitespace) will efficiently advance the scanner over whitespace.

After the scanner processes any pattern, it saves the text that matched that pattern. If you examine sc.match().group(), you will see the text that matched the most recent pattern. The result returned by sc.match() is actually an instance of class MatchResult, a difficult-to-understand class. You don't need to understand it, however, to use this bit of code to see the text that matched some pattern.
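
Here is a minimal sketch of these two ideas together (the class name MatchDemo and the sample input are invented for illustration):

import java.util.Scanner;
import java.util.regex.Pattern;

public class MatchDemo {
    static final Pattern whitespace = Pattern.compile( "[ \t]*" );
    static final Pattern digits = Pattern.compile( "[0-9]*" );

    public static void main( String[] args ) {
        Scanner sc = new Scanner( "   2017 rest of line" );

        sc.skip( whitespace );  // efficiently advance over the leading blanks
        sc.skip( digits );      // advance over the digits, if any

        // the text matched by the most recent pattern, here the digits
        System.out.println( sc.match().group() ); // prints 2017
    }
}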

The designers of Java could have processed regular expressions directly, but that would be inefficient. Regular expressions are text strings that conform to a human-oriented notation, while pattern matching is an algorithmic problem. The way a pattern is compiled does not really matter to users of regular expressions or class Pattern, but it is interesting. A compiled pattern is essentially a table with one row for each state of a finite-state machine, where each row has one entry for each letter in the alphabet. Given such a table, the following general algorithm can be used to match a pattern against the input stream:

state = 0;                       // the start state of the finite-state machine
thumb = input.mark();            // remember where in the input stream we were
while (state < table.length) {   // values >= table.length encode accept or reject
    state = table[ state ][ input.next() ];
}
if (state == table.length) {     // exactly table.length signals success
    return success;
} else {                         // anything greater signals failure
    input.reset( thumb );        //   undo all the next operations
    return failure;
}

Constructing such a table from a regular expression is an interesting problem that is typically discussed in automata theory courses.

Scan Support

The above leads to the following proposal for class ScanSupport:

import java.util.Scanner;
import java.util.regex.Pattern;

/** Support methods for scanning
 *  @see Errors
 */
class ScanSupport {

    /** Pattern for identifiers
     */
    private static final Pattern name
        = Pattern.compile( "[a-zA-Z0-9_]*" );

    /** Pattern for whitespace excluding things like newline
     */
    private static final Pattern whitespace
        = Pattern.compile( "[ \t]*" );

    /** Get next name without skipping to next line (unlike sc.next())
     *  @param sc the scanner from which end of line is scanned
     *  @return the name, if there was one, or an empty string
     */
    public static String nextName( Scanner sc ) {
        sc.skip( whitespace );

        // the following is weird code, it skips the name
        // and then returns the string that matched what was skipped
        sc.skip( name );
        return sc.match().group();
    }

    /** Advance to next line and complain if there is junk at the line end
     *  @see Errors
     *  @param sc the scanner from which the end of line is scanned
     *  @param message gives a prefix to give context to error messages
     */
    public static void lineEnd( Scanner sc, String message ) {
        sc.skip( whitespace );
        String lineEnd = sc.nextLine();
        if ( (!lineEnd.equals( "" ))
        &&   (!lineEnd.startsWith( "--" )) ) {
            Errors.warn(
                message +
                " followed unexpectedly by '" + lineEnd + "'"
            );
        }
    }
}

With the above, the following two calls are very similar in their effects:

String s1 = sc.next();
String s2 = ScanSupport.nextName( sc );

The big difference is that the call to sc.next() will skip newlines, while ScanSupport.nextName(sc) will not. There are a number of other differences.
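
To make the difference concrete, here is a hypothetical sketch of how ScanSupport might be used to read description lines; the class name ParseDemo and the sample input are invented, and it assumes the ScanSupport class above and the Errors class it references are available:

import java.util.Scanner;

public class ParseDemo {
    public static void main( String[] args ) {
        // two made-up description lines, the second with trailing junk
        Scanner sc = new Scanner( "intersection A\nintersection B extra\n" );

        while (sc.hasNext()) {
            String keyword = ScanSupport.nextName( sc ); // "intersection"
            String name = ScanSupport.nextName( sc );    // "A", then "B"
            System.out.println( keyword + " " + name );

            // warns about anything left on the line unless it starts with --
            ScanSupport.lineEnd( sc, "intersection " + name );
        }
    }
}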

The code for ScanSupport took some time to work out. The Java Scanner class is full of tempting alternative paths that look like they might lead to the same end. For example, if you don't want to skip newlines, you could first pull the input file apart into separate lines using sc.nextLine(), and then scan each line. This means you scan the entire file twice, but it is nonetheless the path taken by many Java programmers.

Another alternative is to change the definition of delimiters used by the scanner. Java scanners allow you to call sc.useDelimiter() to change the delimiter that the scanner uses to separate tokens, so you can change the delimiter to exclude newlines. My attempts to follow this route threatened to consume too much time, so I abandoned it in favor of the above code.

Java's scanners and patterns illustrate a common problem with very large software libraries. Sometimes, the research you have to do to use a library the right way for the job at hand takes more time than doing the job yourself, or than a hybrid approach that makes only light use of the library. I remain convinced that there is probably a good way to use sc.useDelimiter() to change the behavior of the scanner so that it won't skip newlines except when I want it to, but it wasn't worth my time to find it, and it was easier to develop the ScanSupport code given here.