17. MyScanner and regular expressions

Part of CS:2820 Object Oriented Software Development Notes, Spring 2021
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

 

Class MyScanner

Integrating what we've learned about regular expressions into the MyScanner class will allow us to use semicolons (and other punctuation) that immediately follow names and numbers, instead of requiring spaces around everything:

/** Wrapper for class Scanner that makes it work the way we want
 *  @see java.util.Scanner
 *  @see java.util.regex.Pattern
 *  @see Errors
 */
class MyScanner {

    ... various omitted details ...

    // Pattern for identifiers -- a letter followed by letters, digits or underscores
    private static final Pattern name
        = Pattern.compile( "[a-zA-Z][a-zA-Z0-9_]*" );

    // Pattern for whitespace
    private static final Pattern whitespace
        = Pattern.compile( "[ \\t\\n\\r]*" );

    /** Get next name, allowing for a delimiter immediately after the name
     *  @return the name, if there was one, or an empty string if not
     */
    public String getNextName() {
        this.skip( whitespace );

        // the following is weird code, it skips the name
        // and then returns the string that matched what was skipped
        this.skip( name );
        return this.match().group();
    }
}

With the above, the following two calls are very similar in their effects:

String s1 = sc.next();
String s2 = sc.getNextName();

The big difference is that the call to sc.next() will return whatever text lies between delimiters (whitespace, by default), punctuation included, while sc.getNextName() will only return the text of a well-formed identifier that starts with a letter followed by any number of letters, digits or underscores.
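
To make the difference concrete, here is a minimal sketch that applies the same skip-then-match trick directly to a java.util.Scanner instead of going through MyScanner. The class name NameScanDemo and the helper nextName() are invented for this illustration and are not part of the course code.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class NameScanDemo {
    static final Pattern NAME = Pattern.compile( "[a-zA-Z][a-zA-Z0-9_]*" );
    static final Pattern WHITESPACE = Pattern.compile( "[ \\t\\n\\r]*" );

    /** Skip whitespace, then skip a name and return the text that was skipped */
    static String nextName( Scanner sc ) {
        sc.skip( WHITESPACE );
        sc.skip( NAME );
        return sc.match().group();
    }

    public static void main( String[] args ) {
        String input = "  abc; def";
        System.out.println( new Scanner( input ).next() );      // prints abc;
        System.out.println( nextName( new Scanner( input ) ) ); // prints abc
    }
}
```

The semicolon stays in the input after nextName() returns, so a later call can skip it or complain about it.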

The above code has a problem. If there is no next name, the pattern does not match at all. We really want a method that takes a default value and an error message as parameters. That version of getNextName() will have to deal with the possibility that the call to skip() matched nothing. In that case, skip() throws NoSuchElementException, so we either have to catch that or write a pattern that matches the empty string.
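
One possible shape for such a method, taking the catch-the-exception route, is sketched below. This is an assumption about how it could be done, not the actual MyScanner code: the class DefaultNameDemo is hypothetical, and the message is simply printed to System.err as a stand-in for whatever error reporting the real program uses.

```java
import java.util.NoSuchElementException;
import java.util.Scanner;
import java.util.regex.Pattern;

class DefaultNameDemo {
    static final Pattern NAME = Pattern.compile( "[a-zA-Z][a-zA-Z0-9_]*" );
    static final Pattern WHITESPACE = Pattern.compile( "[ \\t\\n\\r]*" );

    /** Get the next name, or report msg and return defalt if there is none */
    static String nextName( Scanner sc, String defalt, String msg ) {
        sc.skip( WHITESPACE ); // cannot fail, the pattern matches the empty string
        try {
            sc.skip( NAME );
        } catch (NoSuchElementException e) {
            System.err.println( "warning: " + msg ); // stand-in for real error reporting
            return defalt;
        }
        return sc.match().group();
    }

    public static void main( String[] args ) {
        System.out.println( nextName( new Scanner( "abc; x" ), "???", "name expected" ) );
        System.out.println( nextName( new Scanner( "; x" ), "???", "name expected" ) );
    }
}
```

Note that when no name is found, the method returns the very object that was passed as the default, not a copy of it.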

A tricky issue: Consider the following two patterns:

Pattern float1 = Pattern.compile(
    "([0-9][0-9]*\\.[0-9]*)|([0-9]*\\.[0-9][0-9]*)|([0-9]*)"
);
Pattern float2 = Pattern.compile(
    "([0-9]*)|([0-9][0-9]*\\.[0-9]*)|([0-9]*\\.[0-9][0-9]*)"
);

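These are identical except for the order of the 3 alternatives, which are:

    [0-9][0-9]*\\.[0-9]* -- one or more digits, a point, and optional digits
    [0-9]*\\.[0-9][0-9]* -- optional digits, a point, and one or more digits
    [0-9]* -- digits only, with no point

In float1 the simplest pattern comes last, while in float2 it comes first. This makes a difference, because the pattern matcher in the Java library does not search for the longest possible match. Patterns in a list of alternatives are tried from left to right, and whichever pattern matches first is the one that wins. As a result, when float2 is applied to the text 3.14, the first alternative matches just the 3 and the matcher stops there.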
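
The effect is easy to check with a small experiment. The sketch below uses Matcher.lookingAt(), which, like Scanner.skip(), only asks for a match anchored at the current position. The class name AlternationOrderDemo and the helper prefixMatch() are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationOrderDemo {
    static final Pattern FLOAT1 = Pattern.compile(
        "([0-9][0-9]*\\.[0-9]*)|([0-9]*\\.[0-9][0-9]*)|([0-9]*)"
    );
    static final Pattern FLOAT2 = Pattern.compile(
        "([0-9]*)|([0-9][0-9]*\\.[0-9]*)|([0-9]*\\.[0-9][0-9]*)"
    );

    /** Return the text p matches at the start of input, or null if none */
    static String prefixMatch( Pattern p, String input ) {
        Matcher m = p.matcher( input );
        return m.lookingAt() ? m.group() : null;
    }

    public static void main( String[] args ) {
        System.out.println( prefixMatch( FLOAT1, "3.14;" ) ); // prints 3.14
        System.out.println( prefixMatch( FLOAT2, "3.14;" ) ); // prints 3
    }
}
```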

To code things like getNextLiteral() and getNextFloat(), we need to invent patterns that match the desired target, and for getNextFloat(), we also need to convert the string that matched the pattern into a float. All of Java's numeric classes provide tools for this. In the case of class Float, there is Float.parseFloat(), which takes a string and converts it to a float.
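
As a rough sketch of how the pieces could fit together, the following code skips a float pattern and hands the matched text to Float.parseFloat(). It is written against java.util.Scanner directly. The class NextFloatDemo and the signature of nextFloat() are assumptions made for illustration; the real getNextFloat() also takes an error message.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

class NextFloatDemo {
    static final Pattern WHITESPACE = Pattern.compile( "[ \\t\\n\\r]*" );
    // alternatives with a point come first; the last one can match nothing
    static final Pattern FLOAT = Pattern.compile(
        "([0-9][0-9]*\\.[0-9]*)|([0-9]*\\.[0-9][0-9]*)|([0-9]*)"
    );

    /** Get the next float, or defalt if no number is next in the input */
    static float nextFloat( Scanner sc, float defalt ) {
        sc.skip( WHITESPACE );
        sc.skip( FLOAT );                  // may match the empty string
        String text = sc.match().group();
        if (text.isEmpty()) return defalt; // there was no number there
        return Float.parseFloat( text );
    }

    public static void main( String[] args ) {
        System.out.println( nextFloat( new Scanner( "  2.5; x" ), Float.NaN ) ); // prints 2.5
        System.out.println( nextFloat( new Scanner( "; x" ), Float.NaN ) );      // prints NaN
    }
}
```

Because the pattern can match the empty string, skip() does not throw here when the number is missing; the empty match is what signals the error.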

The code for MyScanner took some time to work out. The Java Scanner class is full of tempting alternative paths that look like they might lead to the same end. For example, if you don't want to skip newlines, you could use sc.nextLine() to pull the input file apart into separate lines, and then open a new scanner on each line to scan for things within that line. This means you scan the entire file twice, which roughly doubles the scanning work, but it is the path taken by many Java programmers.
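
For comparison, the two-scanner approach looks something like the following sketch. The class LineByLineDemo and the method tokensByLine() are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class LineByLineDemo {
    /** Split text into lines, then split each line into tokens */
    static List<List<String>> tokensByLine( String text ) {
        List<List<String>> result = new ArrayList<>();
        Scanner lines = new Scanner( text );
        while (lines.hasNextLine()) {
            // a second scanner rescans the text of just this line
            Scanner words = new Scanner( lines.nextLine() );
            List<String> tokens = new ArrayList<>();
            while (words.hasNext()) tokens.add( words.next() );
            result.add( tokens );
        }
        return result;
    }

    public static void main( String[] args ) {
        System.out.println( tokensByLine( "a b\nc\n" ) ); // prints [[a, b], [c]]
    }
}
```

Every character is examined once by the outer scanner and again by the inner one, which is the double scan mentioned above.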

Another alternative is to change the definition of delimiters used by the scanner. Java scanners allow you to call sc.useDelimiter() to change the delimiter that the scanner uses to separate tokens, so you can change the delimiter to exclude newlines. My attempts to follow this route threatened to consume too much time, so I abandoned it in favor of the above code.

Java's scanners and patterns illustrate a common problem with very large software libraries. Sometimes, the amount of research you have to do to use a library the right way to do the job you want takes more time than doing it yourself or some hybrid making light use of the library. I remain convinced that there is probably a good way to use sc.useDelimiter() to change the behavior of the scanner so it won't skip newlines except when I want to do so, but it wasn't worth my time to find it, and it was easier to develop the MyScanner code given here.

Preventing a Constructor from Constructing

Here is the constructor for class Road we currently have:

    public Road( MyScanner sc ) {
        // keyword Road was already scanned
        final String src;	// where does it come from
        final String dst;	// where does it go

        src = sc.getNext( "???", "road source missing" );
        dst = sc.getNext( "???", "road " + src + " to missing destination" );
        travelTime = sc.getNextFloat(
            Float.NaN, "road " + src + " " + dst + " missing travel time"
        );
        destination = Intersection.lookup( dst );
        if (destination == null) {
            Error.warn( "road " + src + " " + dst + " undefined: " + dst );
        }
        source = Intersection.lookup( src );
        if (source == null) {
            Error.warn( "road " + src + " " + dst + " undefined: " + src );
        }
        // BUG: Can we prevent creation of malformed roads (see toString bug)

        allRoads.add( this ); // this is the only place items are added!
    }

One way to handle errors in a constructor, that is, to prevent the constructor from returning a defective object, is to have the constructor throw an exception. We could throw some generic exception, searching through the list of exceptions that Java has already defined for the one that comes closest, but this is not a very satisfying solution because most of them are obviously special-purpose exceptions for some other domain. As an alternative, we can define a new subclass of Exception. Consider doing this in class Road in our running example:

class Road {
    // constructors may throw this when an error prevents construction
    public static class ConstructorFailure extends Exception {}

    ...

    public Road( MyScanner sc ) throws ConstructorFailure {
        // keyword Road was already scanned
        final String src;	// where does it come from
        final String dst;	// where does it go

        src = sc.getNext( "???", "road source missing" );
        dst = sc.getNext( "???", "road " + src + " to missing destination" );
        travelTime = sc.getNextFloat(
            Float.NaN, "road " + src + " " + dst + " missing travel time"
        );
        if ((src == "???") || (dst == "???") || Float.isNaN( travelTime )) {
            // this takes care of the errors detected above
            throw new ConstructorFailure();
        }
        destination = Intersection.lookup( dst );
        if (destination == null) {
            Error.warn( "road " + src + " " + dst + " undefined: " + dst );
            throw new ConstructorFailure();
        }
        source = Intersection.lookup( src );
        if (source == null) {
            Error.warn( "road " + src + " " + dst + " undefined: " + src );
            throw new ConstructorFailure();
        }
        if (travelTime < 0) {
            Error.warn( this.toString() + ": negative travel time?" );
            throw new ConstructorFailure();
        }

        allRoads.add( this ); // this is the only place items are added!
    }

Now, the constructor will not return a newly constructed road if that road is defective, and the defective road will not be added to the list of all roads. Instead, it will throw an exception, and the calling code will have to deal with that exception.
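
What the calling code has to do can be seen in a stripped-down sketch. To keep it self-contained, this uses a hypothetical class Segment with a single float field instead of the real Road class; only the pattern of throwing from the constructor and catching in the caller is the point.

```java
import java.util.ArrayList;
import java.util.List;

class Segment {
    // constructors may throw this when an error prevents construction
    static class ConstructorFailure extends Exception {}

    static final List<Segment> allSegments = new ArrayList<>();
    final float length;

    Segment( float length ) throws ConstructorFailure {
        if (Float.isNaN( length ) || (length < 0)) {
            throw new ConstructorFailure(); // nothing is added to the list
        }
        this.length = length;
        allSegments.add( this ); // reached only for well-formed segments
    }

    /** Caller's side: build a segment, or return null if it was defective */
    static Segment tryMake( float length ) {
        try {
            return new Segment( length );
        } catch (ConstructorFailure e) {
            return null; // the defective object never escapes
        }
    }

    public static void main( String[] args ) {
        System.out.println( tryMake( -1.0F ) == null ); // prints true
        System.out.println( tryMake( 2.0F ) != null );  // prints true
        System.out.println( allSegments.size() );       // prints 1
    }
}
```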

One line in the above raises a number of questions:

        if ((src == "???") || (dst == "???") || Float.isNaN( travelTime )) {

First, why didn't we use the "???".equals(src) construction? The answer is that the Java compiler collects all of the string constants used in a program and makes sure that just a single String object is created for each distinct string. As a result, no matter how many times the string constant "???" shows up in the program, every single use of this constant is a reference to exactly the same object. Therefore, using the == operator in this case will return true if and only if the values on both sides come from the string constant "???" and not from any other source.

In fact, because of the way string constants are handled, if the scanner finds the string ??? in the user input and that ends up in src, src=="???" will be false, because the scanner builds a new String object for that text, while "???".equals(src) would have been true. If our goal is to detect use of the default constants passed to the get methods of class MyScanner, src=="???" works better, because "???".equals(src) can also match text from the input file.
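
The following small experiment illustrates the distinction. The class StringIdentityDemo and its helpers are invented for this demonstration; a plain java.util.Scanner stands in for MyScanner as the source of input text.

```java
import java.util.Scanner;

class StringIdentityDemo {
    /** True only if s is the very object created for the constant "???" */
    static boolean isDefault( String s ) {
        return s == "???";
    }

    /** Read one token from text, the way input would arrive from a file */
    static String fromScanner( String text ) {
        return new Scanner( text ).next();
    }

    public static void main( String[] args ) {
        String fromInput = fromScanner( "??? x" );
        System.out.println( "???".equals( fromInput ) ); // prints true
        System.out.println( isDefault( fromInput ) );    // prints false
        System.out.println( isDefault( "???" ) );        // prints true
    }
}
```

This use of == is deliberate and unusual. In ordinary code that compares the text of strings, equals() remains the right tool.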

The second question is, why Float.isNaN(travelTime) instead of something more readable like travelTime==Float.NaN? The problem here is that not-a-number is a weird value. The IEEE floating point standard, which Java obeys, requires that if x is not a number, then the comparison x==y should return false for all values of y. This means that even x==x will be false when x is not a number, so travelTime==Float.NaN is always false. Paradoxically, we could have written travelTime!=travelTime to detect that the value is not a number. That works, but it's not readable.
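
These comparisons are easy to try out. In the sketch below, the class NaNDemo and its two helper methods are hypothetical names that exist only to show the three tests side by side.

```java
class NaNDemo {
    /** The tempting test; it is false for every x, including NaN */
    static boolean equalsNaN( float x ) {
        return x == Float.NaN;
    }

    /** The paradoxical test; NaN is the only value not equal to itself */
    static boolean isNaNByCompare( float x ) {
        return x != x;
    }

    public static void main( String[] args ) {
        float t = Float.NaN;
        System.out.println( equalsNaN( t ) );      // prints false
        System.out.println( isNaNByCompare( t ) ); // prints true
        System.out.println( Float.isNaN( t ) );    // prints true
    }
}
```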

Looking in the definitions of classes Float and Double, we find two solutions: x.isNaN(), a method that applies to an object x of the class, and Float.isNaN(travelTime), a static method that takes a primitive value as a parameter. Why does Java allow both? Wouldn't one suffice?

The answer to the above question has to do with the difference between float (lower case) and Float (upper case). The primitive data type float is not a class. Variables of this type are not objects, and arithmetic on them is fast, being done by machine hardware. But, because float is not a class, you can't create things like a LinkedList of float values.

In contrast, Float is a real class. Each Float object can be thought of as a box around a float value. You can create a LinkedList of Float values, and you can apply all kinds of useful methods to those values, such as isNaN().

To spare programmers from having to think about all this, Java has a feature called autoboxing. If f is a float value and it is used in a context where a Float is expected, for example as the argument in list.add(f) where list is a LinkedList of Float values, the Java compiler automatically "boxes up" f as a Float, creating code equivalent to list.add(Float.valueOf(f)). Autoboxing applies to assignments and parameters but not to the target of a method call, so if f is a float, f.isNaN() does not compile; you would have to box it explicitly, as in Float.valueOf(f).isNaN().

Similarly, if g is a Float and it is used in a context where a float is expected, such as g+1.0F, the compiler unboxes it, compiling it as if it had been written g.floatValue()+1.0F. This is called auto-unboxing. We used Float.isNaN(travelTime) above in order to avoid the computational cost of boxing.
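
Here is a short sketch showing boxing and unboxing at work in an assignment and a parameter. The class BoxingDemo and the method roundTrip() are made-up names for this illustration.

```java
import java.util.LinkedList;

class BoxingDemo {
    /** Push a float through a list of Float objects and add one to it */
    static float roundTrip( float f ) {
        LinkedList<Float> list = new LinkedList<>();
        list.add( f );            // boxing: the float argument becomes a Float
        Float g = list.getFirst();
        return g + 1.0F;          // unboxing: same as g.floatValue() + 1.0F
    }

    public static void main( String[] args ) {
        System.out.println( roundTrip( 2.5F ) ); // prints 3.5
    }
}
```

Each boxing step may allocate an object, which is the cost that the static methods taking primitive parameters let us avoid.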

If you look at the methods of Boolean, Integer, Float and Double you will find many pairs of methods like f.isNaN() and Float.isNaN(f). These are all there so that the programmer can avoid the cost of boxing and unboxing arguments, depending on whether the value being operated on is an object or one of the primitive machine data types.
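
A few such pairs are shown in the sketch below, with each static form applied to a primitive and the matching instance form applied to a boxed object. The class PairsDemo and the method allPairsAgree() are hypothetical; the library methods they call are standard.

```java
class PairsDemo {
    /** Check that static and instance forms give the same answers */
    static boolean allPairsAgree() {
        float f = Float.NaN;
        Float boxedF = f;
        int i = 42;
        Integer boxedI = i;
        double d = 1.0;
        Double boxedD = d;
        boolean b = true;
        Boolean boxedB = b;

        return (Float.isNaN( f ) == boxedF.isNaN())
            && Integer.toString( i ).equals( boxedI.toString() )
            && (Double.compare( d, 2.0 ) == boxedD.compareTo( 2.0 ))
            && Boolean.toString( b ).equals( boxedB.toString() );
    }

    public static void main( String[] args ) {
        System.out.println( allPairsAgree() ); // prints true
    }
}
```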