11. Regular Expressions and Java Scanners

Part of CS:2820 Object Oriented Software Development Notes, Fall 2020
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

 

A digression into Automata Theory

To understand the scanner mechanism in Java, it is useful to digress into the world of Automata theory. This is the theory that discusses what a machine can and cannot do, in principle. Automata theory has its origins in the 1930s, with studies at places like Bell Labs of what a mechanical or electromechanical system such as a telephone exchange could do, in principle. People like Alan Turing were major contributers to this work.

In the 1950s and 1960s, formal linguistics was married to automata theory as it dawned on people that for each class of grammar, there was a corresponding class of machines that could, in principle, be configured to process languages from that grammar. Noam Chompsky was the key person in this development. (Today, many intellectuals know of Chompsky as a leftist political thinker, but his contributions to formal language theory form a key part of the foundation for modern computer science.)

The Chompsky Language Hierarchy divides languages into 4 classes. Here are the levels in the hierarchy and their relationship to automata theory:

Each successive member of this hierarchy is more powerful than the one above it, so all regular languages, for example, can be processed using the methodologies of the other layers, but recursively-enumerable languages cannot be processed using the methodologies of any of the other layers.

This brings us to regular expressions. Automata theorists developed the following notion of a regular expression as the grammar describing the structure of a regular language. A regular expression is either:

As another example, we can describe the conventional rules for forming an integer using this regular expression:

( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*

That is, an integer is a digit followed by zero or more additional digits. Mathematicians and linguists concerned primarily with writing proofs about what regular expressions can do find the above notation entirely suficient, but as the second example shows, it gets rather verbose. One of the key personalities in the development of the mathematics of regular expressions was Stephen Kleene; the term Kleene closure in reference to the star notation for regular expressions is a reference to his work.

When computer scientists began creating software that could process regular expressions, they began developing abbreviations. This began during the development of the QED text editor in the late 1960's, and from there, it entered the world of Unix and then Linux and Windows. The common abbreviations are:

Note that none of the above extensions change the basic power of regular expressions, they are either abbreviations for things that can be done in more long-winded form with the original notation. In the case of notations for non-printing characters, this is just a matter of moving from the abstract domain of mathematics to the concrete domain of the kind of computer character sets we use.

In the vi text editor, a direct descendant of the old QED editor, search patterns are regular expressions, so if you type /[0-9][0-9]*/ the editor will move the cursor to the start of the next integer in the text (and in some versions of vi, also highlight every integer anywhere in your code). The C and C++ libraries incorporate standard tools that process this notation, so similar regular expressions are supported by a wide variety of programs that are written using the C library for support. With the standardizaiton of the C library came official standardization of C's regular expression notation,

And, since Java is patterned on C++, the Java library includes its own support for regular expressions, buried in class Pattern. In both the C and Java regular expresson notations, the implementations include features that go beyond the automata theoretic definition of regular expressions, but the core features from automata theory are all present.

Scanners

We've reached the point where it is useful to descend into scanners in a bit more detail. Java's input scanning tool, class Scanner() is quite complex, and using it is both a huge benefit (writing code to replace this class can take weeks), and a problem (understanding the full potential of this class can take weeks). Fortunately, we don't need to understand the full potential of this class, we just need to scan an input file describing a road network, a neuron network or a logic circuit or a community suffering an epidemic.

Class Scanner() has a number of methods called nextXX() which extract the next XX from the input, the next character, the next integer, the next token, and for each of these, it has methods called hasNextXX() that test to see if the input stream has input that matches the pattern for textual representations of an XX. The simplest of these are next() and hasNext() which take the next nonblank item from the stream.

Tokens in the input stream are delimted by sequences of delimiters. By default, delimiters are any white-space character (blank, tab, newline), but the useDelimiter() method allows you to change the pattern (that is, the regular expression) used to match delimiters. For example, consider this code operating on sc, an object of class Scanner:

sc.useDelimiter( "[ \\t\\x0B\\f\\r]*" );

The string passed to the above code is the representation of a Pattern used to control the scanner. Java's class Pattern is itself quite complex, and forms the underpinning of the scanner mechanism.

Why are all the backslashes above doubled? Because Java's input scanner converts the sequence \\ within quotation marks to a single backslash character, so the string passed to the useDelimiter method has this text:

[ \t\x0B\f\r]*

The above text is a regular expression. We can deconstruct this expression as follows:

[ \t\x0B\f\r]*
             * -- any number of repeats of the preceeding regular expression
[           ]  -- any of the characters enclosed in brackets
               -- a space or
  \t           -- a tab or
    \x0B       -- a vertical tab (look it up, it's a standard control character)
        \f     -- a form feed
          \r   -- a carriage return

This list contains all the common white-space characters except for one, the \n newline character. As a result, newline is now a distinct token and next() will now return "\n" as the next token after you have scanned all the tokens on a line.

The hasNext() and next() methods

The scanner class provides a skip() method. This is the most elementary of the scanner's tools. What it does is skip over whatever text from the input stream matches the pattern proviced. For example, to skip white space, you can use this code:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );

Most of the scanner's hasNextXX() and nextXX() methods can be replaced by code that first skips whitespace and then matches the pattern for XX, whatever it may be. Thus, sc.hasNextInt() is functionally the same as this:

sc.hasNext( "[ \\t\\x0B\\f\\r\\n]*" + "[0-9][0-9]*" )

In more detail, this could be expressed as:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );

To skip the delimiter, followed by

sc.hasNext( "[0-9][0-9]*" )
To see if the text after the delimiter matches the pattern for an integer.

Similarly, the assignment i = sc.nextInt() is functionally the same as this:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );
i = Integer.parseInt( sc.next( "[0-9][0-9]*" ) );

If you wanted to parse an integer up to but not including the next character that was not part of an integer, regardless of whether it was officially a delimiter, you could write this:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );
i = Integer.parseInt( sc.findWithinHorizon( "[0-9][0-9]*" ), 0 );

Unlike the various next methods, findWithinHorizon() picks off and returns (as a string) the text that matches the pattern, stopping as soon as anything is found that doesn't match. If nothing matches, it simply returns an empty string. This method is the key to using scanners for such things as scanning identifiers or integers that might be terminated either by delimiters or by punctuation marks that ought to be processed as separate items.

Lots of these methods actually seem to rest on top of a strange feature of the skip() method. Whatever is skips is held, as a string, available to the user through a strange backdoor into the scanner. So, for example, this code:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );
i = Integer.parseInt( sc.findWithinHorizon( "[0-9][0-9]*" ), 0 );

can be replaced with:

sc.skip( "[ \\t\\x0B\\f\\r\\n]*" );
sc.skip( "[0-9][0-9]*" );
i = Integer.parseInt( sc.match().group() );

And that can be shortened to this:

i = Integer.parseInt(
   sc.skip( "[ \\t\\x0B\\f\\r\\n]*" ).skip( "[0-9][0-9]*" ).match().group()
);

That compact form is possible because skip() returns this after doing its job.

Patterns

There is one problem. Interpreting regular expressions directly from the text of the expression is slow. Therefore, Java has a mechanism called a pattern compiler that reduces the textual representation of a regular expression to an object of class Pattern. A pattern is really a table that a finite-state-machine interpreter can efficiently process. This idea of compiling expressions into patterns is not new; the QED text editor did this in the early 1970s. Consider this code:

static final Pattern whitespace = Pattern.compile( "[ \t]*" );

This creates a final variable (effectively a constant, since patterns themselves are not mutable) that holds, as its value, the output of the pattern compiler for a regular expression that matches strings of spaces and tabs. Now, given scanner sc, sc.skip(whitespace) will efficiently advance the scanner over whitespace.

After the scanner processes any pattern, it saves the text that matched that pattern. If you examine sc.match().group(), you will see the text that matched the most recent pattern. The result returned by sc.match() is actually an instance of class MatchResult, a difficult-to-understand class. You don't need to understand it, however, to use this bit of code to see the text that matched some pattern.

The designers of Java could have directly processed regular expressions, but that would be inefficient. Regular expressions are text strings that conform to a human-orented notation, while pattern matching is an algorithmic problem. The way a pattern is compiled does not really matter to users of regular expressions or class Pattern, but it is interesting. A compiled pattern is actually a table with one row for each state of the finite-state machine, where each row has one entry for each letter in the alphabet. Given such a table, the following general algorithm can be used to match a pattern against the input stream:

state = 0;
thumb = input.mark(); // remember where in the input stream we were
while (state < table.length) {
    state = table[ state ][ input.next() ];
}
if (state == table.length) { // success
    return success;
} else {                     // failure
    input.reset( thumb );    //   undo all the next operations
    return failure;
}

The problem of constructing the table is an interesting algorithm that is typically discussed in automata theory courses.

Regardless of how it works, this means that we should really rewrite this code:

i = Integer.parseInt(
   sc.skip( "[ \\t\\x0B\\f\\r\\n]*" ).skip( "[0-9][0-9]*" ).match().group()
);

By first creating static variables to hold each pattern:

static final Pattern delimPat = Pattern.compile( "[ \\t\\x0B\\f\\r\\n]*" );
static final Pattern intPat = Pattrn.compile( "[0-9][0-9]*" );

Having done this, we can then write this compact bit of still unreadable code:

i = Integer.parseInt( sc.skip( delimPat ).skip( numPat ).match().group() );

Of course, the above code will throw an exception when the next non-delimiter in the input stream doesn't begin with an integer. You really do need to add code to either check for the match group returning an empty string or catch the exception.

Scan Support

The above leads to the following proposal for class MyScanner:

/** Wrapper for class Scanner that makes it work the way we want
 *  @see Java.util.Scanner
 *  @see Java.util.Pattern
 *  @see Errors
 */
class MyScanner {

    ... various omitted details ...

    // Pattern for identifers -- a letter followed by letters or digits
    private static final Pattern name
        = Pattern.compile( "[a-zA-Z][a-zA-Z0-9_]*" );

    // Pattern for whitespace
    private static final Pattern whitespace
        = Pattern.compile( "[ \\t\\n\\r]*" );

    /** Get next name, allowing for a delimiter immediately after the name
     *  @param sc the scanner from which end of line is scanned
     *  @return the name, if there was one, or an empty string if not
     */
    public String getName() {
        this.skip( whitespace );

        // the following is weird code, it skips the name
        // and then returns the string that matched what was skipped
        this.skip( name );
        return self.match().group();
    }
}

With the above, the following two calls are very similar in their effects:

String s1 = sc.next();
String s2 = sc.getName();

The big difference is that the call to sc.next() will return whatever is delimited by any kind of delimiter, while sc.getName() will only return the text of a well-formed identifier that starts with a letter followed by any number of letters or digits.

The code for MyScanner took some time to work out. The Java Scanner class is full of tempting alternative paths that look like they might lead to the same end. For example, if you don't want to skip newlines, you could scan the file into lines first using sc.nextLine to pull the input file apart into separate lines, and then open a new scanner to scan each line for things within that line. This means you scan the entire file twice, which will double the processing time, but it is the path taken by many Java programmers.

Another alternative is to change the definition of delimiters used by the scanner. Java scanners allow you to call sc.useDelimiter() to change the delimiter that the scanner uses to separate tokens, so you can change the delimiter to exclude newlines. My attempts to follow this route threatened to consume too much time, so I abandoned it in favor of the above code.

Java's scanners and patterns illustrate a common problem with very large software libraries. Sometimes, the amount of research you have to do to use a library the right way to do the job you want takes more time than doing it yourself or some hybrid making light use of the library. I remain convinced that there is probably a good way to use sc.useDelimiter() to change the behavior of the scanner so it won't skip newlines except when I want to do so, but it wasn't worth my time to find it, and it was easier to develop the MyScanner code given here.