14. Regular Expressions

Part of CS:2820 Object Oriented Software Development Notes, Fall 2015
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

 

A digression into Automata Theory

To understand the scanner mechanism in Java, it is useful to digress into the world of Automata theory. This is the theory that discusses what a machine can and cannot do, in principle. Automata theory has its origins in the 1930s, with studies at places like Bell Labs of what a mechanical or electromechanical system such as a telephone exchange could do, in principle. People like Alan Turing were major contributers to this work.

In the 1950s and 1960s, formal linguistics was married to automata theory as it dawned on people that for each class of grammar, there was a corresponding class of machines that could, in principle, be configured to process languages from that grammar. Noam Chompsky was the key person in this development. (Today, many intellectuals know of Chompsky as a leftist political thinker, but his contributions to formal language theory lay a pivotal foundation for modern computing.)

The Chompsky Language Hierarchy divides languages into 4 classes. Here are the levels in the hierarchy and their relationship to automata theory:

Each successive member of this hierarchy is more powerful than the one above it, so all regular languages, for example, can be processed using the methodologies of the other layers, but recursively-enumerable languages cannot be processes using the methodologies of any of the other layers.

This brings us to regular expressions. Automata theoriets developed the following notion of a regular expression as the grammar describing the structure of a regular language. A regular expression is either:

For example, we can describe the conventional rules for forming an integer using this regular expression:

( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )*

That is, an integer is a digit followed by zero or more additional digits. When computer scientists began creating software that could process regular expressions, they began developing abbreviations. This began during the development of the QED text editor in the late 1960's, and from there, it entered the world of Unix and then Linux and Windows. The common abbreviations are:

In the vi text editor, a direct descendant of the old QED editor, search patterns are regular expressions, so if you type /[0-9][0-9]*/ the editor will move the cursor to the start of the next integer in the text (and highlight every integer anywhere in your code). The C and C++ libraries incorporate standard tools that process this notation, so similar regular expressions are supported by a wide variety of programs that are written using the C library for support.

And, since Java is patterned on C++, the Java library includes its own support for regular expressions, buried in class Pattern. In both the C and Java regular expresson notations, the implementations include features that go beyond the automata theoretic definition of regular expressions, but the core features from automata theory are all present.

Scanners

We've reached the point where we need to descend into scanners in a bit more detail. Java's input scanning tool, class Scanner() is quite complex, and using it is both a huge benefit (writing code to replace this class can take weeks), and a problem (understanding the full potential of this class can take weeks). Fortunately, we don't need to understand the full potential of this class, we just need to scan an input file describing a road network or a neuron network.

Class Scanner() has a number of methods called nextXX() which extract the next XX from the input, the next character, the next integer, the next token, and for each of these, it has methods called hasNextXX() that test to see if the input stream has input that matches the pattern for textual representations of an XX. The simplest of these are next() and hasNext() which take the next nonblank item from the stream.

Tokens in the input stream are delimted by sequences of delimiters. By default, delimiters are any white-space character (blank, tab, newline), but the useDelimiter() method allows you to change the pattern used to match delimiters. For example, consider this:

sc.useDelimiter( "[ \\t\\x0B\\f\\r]*" );

The string passed to the above code is the representation of a Pattern used to control the scanner. Java's class Pattern is itself quite complex, and forms the underpinning of the scanner mechanism.

Why are all the backslashes above doubled? Because Java's input scanner converts the sequence \\ within quotation marks to a single backslash character, and the useDelimiter method is prepared to process \t for example, as meaning tab. Here we will deconstruct the regular expression given above:

[ \t\x0B\f\r]*
             * -- any number of repeats of the preceeding regular expression
[           ]  -- any of the characters enclosed in brackets
               -- a space or
  \t           -- a tab or
    \x0B       -- a vertical tab (look it up, it's a standard control character)
        \f     -- a form feed
          \r   -- a carriage return

This list contains all the common white-space characters except for one, the \n newline character. As a result, newline is now a distinct token and next() will now return "\n" as the next token after you have scanned all the tokens on a line.