Lecture 24, Types

Part of the notes for CS:4980:1
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

What is a type

Consider the following Kestrel code again:

rt: type record
	f: ft
end;
et: type @ rt
at: type array it of et;
a:  at;

This artificial example declares the variable a of type at, where type at is an array type indexed by an index variable of type it and containing elements of type et. Type it is not defined here; presumably it is a subrange type that has been defined somewhere else, perhaps in an enclosing block. Type et, the type of the elements of array a is a pointer type, constrained to point to objects of type rt. Those objects are records with one field each, a variable field named f of type ft In the context of this example a[i]@.f means a reference to field f of the record referenced by element i of array a.

What are the attributes of a type? Some of the attributes are purely dependent on the grammar, and for these, a simple parse tree contains all of the necessary information.

<type> ::= <reference>
        |  <enumeration>
        |  <subrange>
        |  <pointer>
        |  <array>
        |  <set>
        |  <record>

A type can be declared by a reference to another type, that is, a type can simply rename a type that already exists. In this case, the new type identifier is just a new way to refer to an existing type. The simplest example of this is a simple renaming:

mybool: type boolean;

This declares mybool to be a synonym for boolean. There are, however, more complex uses. Consider:

a: array 1..10 of int32;
i: type a.index;

This declares type i to be the subrange type 1..10. The declaration of the array a could be far away from the point in the code where it is convenient to name the type i.

<enumeration> ::= "enum" "(" <identifier> { [ "," ] <identifier>} ")"
               |  "enum" "[" <identifier> { [ "," ] <identifier>} "]"
               |  "enum" "{" <identifier> { [ "," ] <identifier>} "}"

We will defer a discussion of enumerations, they are a bit more complex, although we have hinted at this in previous lectures.

Subranges

<subrange> ::= <expression> ".." <expression>

Subrange types are specified by two constant-valued expressions, the first giving the lower bound, and the second giving the upper bound. Both values must be members of finite scalar types, that is, they must be members of either subrange types or enumerated types.

The most obvious way to represent a subrange type inside the compiler is by a pair of expression values. All expression values have a type, and constant-valued expressions have a constant value of that type, while expressions that depend on variables have a reference to a variable as their value. When the value of the expression is a finite scalar type, the value is represented by an integer.

The types of each of the two expressions must be the same. If subrange types are inferred for the two expressions, those subranges must have the same base type.

Pointers

<pointer> ::= "@" <type>

Pointers have just one attribute, the type of the object they point to.

Arrays

<array> ::= "array" <type> [ "of" ] <type>
         |  "array" "of" <type>

Arrays have two obvious attributes, the finite scalar type giving the range of values permitted for the array index, and the type of the array elements.

The second form of array definition, without an index type, is only permitted in formal parameter lists. In this case, the actual parameter determines the number of array elements present. This is an obvious feature to omit in the early stages of compiler development.

Sets

<set> ::= "set" [ "of" ] <type>

Sets have a type, the type of the set elements. This must be a finite scalar type. It is fair to say that it must be very finite, since the obvious representation of a set is as a bit vector with one bit per potential set element.

In the early stages of compiler development, it would be quite fair to require that the base type for sets have 32 or fewer elements, so that sets can be represented as single words, using the set representation we have been using inside our parser.

Records

<record> ::= "record" [ "+" <reference> ] <block> "end"

Record types are the most complex type in Kestrel. These are defined by the list of fields, which is to say, they are defined by the environment constructed in the block that makes up the record body. Only the public fields in this block are visible from outside the block, while the private fields are visible only during the defintion of the block.

For now, we will ignore subclasses (the optional reference to a superclass), and this means, we can ignore protected declarations.

Enumerated Types

Now, back to enumerated types. An enumerated type is really the same as a subrange type, it has a lower bound, a finite scalar range of values, and an upper bound. As this type is defined, however, a collection of new constants is declared. Each of these new constants is a constant of that type, and they have successively larger values, starting with the lower bound and continuing to the upper bount. Inside the compiler, once this is taken care of, there is no difference between a subrange type and an enumerated type. Consider the definition of the boolean type:

boolean: type enum (false, true);

This is effectively equivalent to something like the following:

boolean: type false .. false+1;
false: const ??
true: const false + 1;

Here, we cannot say, specifically, what the value of false is, it could be anything, so long as it is a value of type boolean. As a practical matter, the value zero is obvious as a representation for false, but a Kestrel programmer cannot determine the actual representation. There have been computer architectures where -1 is the natural representation for false and 0 is the natural representation for true. On such systems, it would make sense to use these values, and a Kestrel programmer would never know.

Of course, if a Kestrel program calls a C function, the truth will be revealed because C has no real boolean data type. The C type bool is just defined as int in the header file stdbool.h. For compatibility with calls to C and C++ code from Kestrel programs, enumerations should start at zero.

Storage Requirements

To generate code within the compiler, we need to know the storage requirements for objects of each type. Storage requirements can be measured in bits, bytes or words, and each of these has a useful place.

Subranges

The minimum number of bits, bytes and 32-bit words required to represent a value in the subrange type t is

bits(t) = ⌈log₂(1+t.max–t.min)⌉
bytes(t) = ⌈bits(t)/8⌉
words(t) = ⌈bits(t)/32⌉

In a first-draft of a compiler, it is quite reasonable to represent every object of a finite scalar type as a full word. As the compiler is developed, however, it becomes useful and even necessary to use bytes or halfwords when those are large enough. In languages designed for systems work, it is even appropriate to have compiler support for data objects that fill fractions of a byte.

In such cases, compilers usually support "advice giving" directives to tell the compiler whether, for any particular declaration, you are more interested in access speed or memory compaction. For example, in Pascal, programmers could declare an array or record type to be packed, asking that all excess unused bits be squeezed out; by default, Pascal compiled for speed, not storage efficiency.

Pointers

A pointer is typically one word, 32 bits in our case, regardless of the type of the object pointed to. On 64-bit machines, pointers are sometimes only 48 bits (the high bits are ignored).

On some old machines, byte pointers consist of a word pointer combined with a displacement in bytes from the word pointer, but on most modern machines, the least significant bits of pointers specify the byte in word, and word pointers simply have these bits set to zero.

Arrays

The size of an array is the number of elements in that array times the size of each element. All elements must be the same size.

Sets

The size of a set, in bits, is the number of elements in the range of the base type, so a set of boolean values takes 2 bits, while a set of 8-bit character values takes 256 bits.

In a first draft of a compiler, it is sensible to represent all sets as one-word values, and limit the range of the base type to 32 elements.

Records

The size of a record is the sum of the sizes of all of the variables in that record. As we will find later, records sometimes contain hidden fields that are implicitly included. In that case, the size of these hidden fields needs to be added in.