Lecture 1, Overview

Part of the notes for CS:4908:0004 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

This is a practical course, where our goal is to build a compiler. That means that we must study practical issues, such as how to use the tools of a programming language on a machine running a real operating system to build a compiler. We must also, of course, study the practical details of the programming language we are compiling, and the practical details of the machine for which we are generating code.

But, there is something else we need to study, and that is the nature of programming languages, as languages, and how software can be written to analyze the structure of a language in enough detail that the software can be said to understand the language.

The practical side

We will use C++

We will use C++ as the implementation language for the compiler we construct in this course. There is nothing wrong with writing a compiler in Python or COBOL, but in this course, C++ is our language of choice.

Why C++? Because it is very widely available, and not a particularly bad choice for compiler writing. Also note that C++ is, by design, agnostic about issues of programming style. Object-oriented programming is possible, but the language does not force gratuitous use of objects. In contrast, some languages try to force a particular programming style. When that is the style you want, fine, but it is annoying to be forced to declare gratuitous container objects just because some style guru involved in developing the language decided that everyone should always use objects.

We will use Subversion

Subversion, the svn command as accessed from the Linux/Unix shell, is a very powerful example of a class of tools known informally as groupware. Subversion allows a group of people working independently to collaborate on the development of any textual document, whether that document is in a single file or multiple files, and whether that document is written in English or C. There are front ends for Subversion that give it a fancy drag-and-drop GUI, but in this class, we recommend that everyone use the Unix/Linux command line interface.

Subversion allows several things: It allows multiple users to edit the same file at the same time using whatever text editor they prefer, and so long as their changes are not in conflict, it merges the results of their editing. If the results are in conflict, Subversion detects the conflict and reports it, leaving it to the people working on the project to resolve the conflict.

Subversion is a version control system. It allows you to revert to a previous version of the document collection if you discover that a sequence of changes was made in error. Almost anything you do can be undone under Subversion, unless you delete the Subversion repository itself.

Finally, Subversion allows you to track who did what. If someone makes an edit that breaks something, Subversion can tell you who did it. The Subversion logs are an excellent record of who did what on a project. A classic complaint about courses with group projects is that they allow a slacker to get by on the strength of their collaborators. Subversion largely corrects that, creating a record of exactly who did what.
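
For example, two svn subcommands answer the who-did-what question directly:

  svn log            -- list the history of commits, with author and date
  svn annotate file  -- show who last changed each line of the named file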

The Kestrel programming language

Our compiler will process code written in the Kestrel programming language. Why Kestrel? Coming from the Hawkeye state, it is, of course, natural to name a toy language after some kind of raptor -- hawks are raptors.

Kestrels are small nimble raptors, and for the purposes of this class, we need a small nimble language. Real programming languages can be huge, far too big for a compiler-construction effort that must fit in one semester. While we are at it, Kestrel also sets out to eliminate much of the superfluous syntax and the language design errors that are part of the C and Algol traditions. Why not?

The target machine

Our compiler will produce output for the ARM processor, more specifically, the ARM11, and excessively specifically, the ARM1176JZF-S. That is the specific CPU found in the Raspberry Pi, a very cheap Linux computer. Also note that the vast majority of our work will not depend on which ARM processor is involved.

Why the ARM? Because it is widely used (mostly in cellphones, these days), but more, because it is a modern RISC architecture and therefore fairly easy to write code for. In contrast, the ubiquitous Intel x86 architecture is a monster.

Language and translator theory

The classic approach to compiler construction begins with lexical analysis, the division of the source text into a string of lexemes, followed by syntax analysis, also called parsing, fitting those lexemes into the grammar of the language. The code generation phase then uses the result of parsing the input text to guide the production of output text.

Lexical analysis

Lexical analysis can be described as easily for English as for any computer language. At the most trivial level, the lexemes of English are words and punctuation marks. Consider the sentence "This is a sentence." Lexical analysis of this would yield the following sequence of lexemes:

  1. This
  2. is
  3. a
  4. sentence
  5. .

A typical lexical analyzer is linked to an assortment of dictionaries, so it would usually annotate the lexemes with various attributes:

  1. This -- pronoun
  2. is -- verb
  3. a -- indefinite article
  4. sentence -- noun
  5. . -- punctuation

Lexical analysis for English is, of course, more complex, as it is perfectly legal, in English, to construct new words from well understood root words by the addition of standard prefixes and suffixes. The fact that such a word is not in the dictionary does not make it illegal. Consider the made-up word antiindentationist. An English reader should easily be able to recognize the prefix anti-, the root indent, and the two suffixes -ation and -ist. The problem is even worse in German, but fortunately, the computer languages with which we are concerned do not require words to be broken down into components.

In exchange for this, our programming languages offer an alternative source of complexity, compound punctuation marks such as >= and <=.
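
As an illustration, here is a sketch, in C++, of a lexical analyzer that handles such compound marks by looking one character ahead. The names and details are illustrative only; this is not the lexical analyzer we will build:

    #include <cctype>
    #include <iostream>
    #include <string>
    #include <vector>

    // Divide a line of text into lexemes, treating >= and <= as single
    // compound lexemes rather than as pairs of punctuation marks.
    std::vector<std::string> lex( const std::string &s ) {
        std::vector<std::string> lexemes;
        size_t i = 0;
        while (i < s.size()) {
            if (isspace( (unsigned char)s[i] )) {        // skip blanks
                i = i + 1;
            } else if (isalnum( (unsigned char)s[i] )) { // words, numbers
                size_t j = i;
                while (j < s.size() && isalnum( (unsigned char)s[j] )) j++;
                lexemes.push_back( s.substr( i, j - i ) );
                i = j;
            } else if ((s[i] == '<' || s[i] == '>')      // compound marks
                    && (i + 1 < s.size()) && (s[i + 1] == '=')) {
                lexemes.push_back( s.substr( i, 2 ) );
                i = i + 2;
            } else {                                     // simple marks
                lexemes.push_back( s.substr( i, 1 ) );
                i = i + 1;
            }
        }
        return lexemes;
    }

    int main() {
        for (const std::string &l : lex( "limit >= x + 1" ))
            std::cout << l << std::endl;   // limit  >=  x  +  1
    }

The point to notice is the one-character lookahead: on seeing <, the analyzer cannot classify the lexeme until it has examined the next character.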

Syntax analysis

The syntax of a programming language is its grammar. Grammar determines which combinations of lexemes are legal strings in the language, and which are not. Consider the sentence "This is a sentence." From the point of view of English grammar, this has the following somewhat simplified structure:
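
      sentence
        noun phrase
          pronoun -- This
        verb phrase
          verb -- is
          noun phrase
            article -- a
            noun -- sentence
        punctuation -- .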

To do justice to this example, we must give the grammar for the language. That is difficult for English, but fortunately, the computer languages that interest us here have grammars that are small enough to deal with.

Code generation

Once the syntax of the program is understood, the compiler can output the corresponding code. We will need to go into more detail about grammar in general and about the specific grammar of our programming language before we can say much more about this, but we note here that there are some important components to code generation.

For machines with multiple registers, a code generator must include a register allocation subsystem. This subsystem determines which of the variables needed by the program will be held in registers and which will be held only in main memory at each point in the code.
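
As a crude illustration of the idea (a real allocator works region by region, from liveness information), here is a sketch that simply counts how often each variable is used and keeps the most heavily used variables in registers. Everything in it, the variables and the counts, is illustrative:

    #include <algorithm>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        // use counts for each variable, as some earlier pass might
        // have gathered them (the numbers here are made up)
        std::map<std::string, int> uses =
            { {"i", 12}, {"sum", 8}, {"tmp", 2}, {"limit", 5} };
        const size_t nregs = 2;   // registers available for variables

        // sort the variables into descending order by use count
        std::vector<std::pair<std::string, int>> vars( uses.begin(),
                                                       uses.end() );
        std::sort( vars.begin(), vars.end(),
            []( const std::pair<std::string, int> &a,
                const std::pair<std::string, int> &b ) {
                return a.second > b.second;
            } );

        // the first nregs variables get registers, the rest get memory
        for (size_t i = 0; i < vars.size(); i++)
            std::cout << vars[i].first << ": "
                      << (i < nregs ? "register" : "memory") << std::endl;
    }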

Compilers may take any of a number of approaches to optimization. Some generate code and then optimize that code using various rewriting systems. Rewrite rules typically look for patterns in the code and then perform replacements based on those patterns. Alternatively, the code generator may be sufficiently context aware that it can attempt to generate good code from the start.
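
Here, in C++, is a sketch of a single rewrite rule of this kind: in code for a single-accumulator machine, a store to a variable followed immediately by a load of the same variable makes the load redundant, since the value is still in the accumulator. The textual representation of instructions is purely illustrative:

    #include <iostream>
    #include <string>
    #include <vector>

    // Apply one rewrite rule: delete any "load x" that immediately
    // follows a "store x", since the value is still in the accumulator.
    std::vector<std::string> peephole( const std::vector<std::string> &code ) {
        std::vector<std::string> out;
        for (const std::string &insn : code) {
            if (!out.empty()
             && out.back().compare( 0, 6, "store " ) == 0
             && insn.compare( 0, 5, "load " ) == 0
             && out.back().substr( 6 ) == insn.substr( 5 )) {
                continue;           // the load is redundant; drop it
            }
            out.push_back( insn );
        }
        return out;
    }

    int main() {
        std::vector<std::string> code =
            { "load x", "add one", "store x", "load x", "add one",
              "store x" };
        for (const std::string &insn : peephole( code ))
            std::cout << insn << std::endl;  // the second load x is gone
    }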

Combining the pieces

A multipass compiler uses intermediate files to communicate between the components of the compiler. For example, the lexical analyzer outputs a file of lexemes for input to the syntax analyzer, and then the syntax analyzer outputs an annotated syntax file for input to the code generator. This approach was historically important in the era when computers had very small main memories, but it is also useful if, for example, a common code generator is to be used to generate code for several languages, each with its own lexical and syntax analyzers.

Another approach is to write the components as packages, with a software interface between components. In this case, one component is generally the main program, calling on the services of the other components. In a syntax-directed translator the syntax analyzer acts as the main program, calling on the services of the lexical analyzer each time it needs to look at the next successive lexeme, and calling on the services of the code generator whenever enough of the input program's syntax has been understood to allow code generation. This is the approach we will take.
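
To show the flavor of this approach, here is a small syntax-directed translator, in C++, for a toy grammar of expressions built from one-character lexemes. The grammar and all the names are illustrative; they are not Kestrel's. The syntax analyzer is in charge: it calls advance(), the lexical analyzer, each time it needs the next lexeme, and it emits stack-machine code the moment it has seen enough syntax to justify it:

    #include <cctype>
    #include <iostream>
    #include <string>

    std::string input;    // the source text
    size_t pos = 0;       // the lexical analyzer's position in the source
    char lexeme;          // the current lexeme (one character, here)

    void advance() {      // the lexical analyzer: fetch the next lexeme
        while (pos < input.size()
            && isspace( (unsigned char)input[pos] )) pos++;
        lexeme = (pos < input.size()) ? input[pos++] : '\0';
    }

    void parse_expression();  // forward declaration; the grammar recurses

    void parse_factor() {     // factor ::= digit | '(' expression ')'
        if (lexeme == '(') {
            advance();
            parse_expression();
            advance();        // step over ')' (no error checking here)
        } else {
            std::cout << "push " << lexeme << std::endl;  // generate code
            advance();
        }
    }

    void parse_term() {       // term ::= factor { '*' factor }
        parse_factor();
        while (lexeme == '*') {
            advance();
            parse_factor();
            std::cout << "multiply" << std::endl;         // generate code
        }
    }

    void parse_expression() { // expression ::= term { '+' term }
        parse_term();
        while (lexeme == '+') {
            advance();
            parse_term();
            std::cout << "add" << std::endl;              // generate code
        }
    }

    int main() {
        input = "1 + 2*3";
        advance();            // prime the lexical analyzer
        parse_expression();
    }

Run on the input 1 + 2*3, this prints push 1, push 2, push 3, multiply, add. The grammar has dictated the order of the generated code, and no explicit syntax tree was ever built.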

Compiler compilers and other alternatives

Starting in the 1960s, a large number of efforts have been directed at creating compiler compilers. These are programs that take, as input, the specification of the lexical structure, grammar, and semantics of a programming language and produce, as output, a compiler for that language.

For the Unix, C and C++ programming environment, the combination of Lex and Yacc has become almost standard. Yacc, developed in the 1970s, stands for "Yet another compiler compiler", a hint about the proliferation of such tools.

This is not a course on the use of Lex and Yacc. Instead, in this course, we will discuss how to write a compiler from scratch, working directly in a language like C++. This approach is justified by the simple observation that compiler compilers really only solve the easy problems in compiler construction. Lexical analysis and syntax analysis are easy enough that they do not really need automation. The hard part is code generation.