Some Thoughts on Writing a Pascal Compiler

Some Thoughts on Writing a Pascal Compiler

James L. Peterson

Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

1. Introduction

The compiler should be easily broken down into the following functional modules:

Input/Output/Host operating system interface
Symbol table manipulation
Lexical Analysis
Syntax Analysis
Semantic Analysis
Code Generation
Code Emitting

One fundamental decision to be made is the number of passes to be used in compilation. Pascal has been explicitly designed to allow one pass compilation, however, this generally produces rather poor code. Hence we would suggest a minimum of two passes, the first being lexical and syntactic analysis and the second semantic analysis and code generation. Additional passes can improve the code optimization.

Another primary concern is the output object code. A production compiler should obviously produce directly relocatable loader code. On the other hand, for many systems programming functions, and while the compiler is being tested and refined, assembly language output may be more suitable.

2. Operating System Interactions

The operating system interface modules should do whatever is necessary to open, create, and setup the following functional files.

Input Source File
Output Listing File
Output Error File
Output Object File

The listing and error files are often mapped onto the same file. All input is one character at a time. Output for the listing and error files is also 1 character, or a sequence of characters, at a time. The object file may be character (assembly language) or word (binary loader input) oriented.

3. Symbol Table Manipulation

This is the key to a successful Pascal compiler. The symbol table is used to collect and organize all information in the program. Each symbol table entry would typically include,

symbol name
level of symbol (1 for outermost; each level of nesting increases the level by one. 0 can be used for intrinsics).
kind of symbol (constant, type, variable, var parameter, value parameter, procedure, function, record field, ...)
the value of the symbol (depending upon its kind: address for variables, value for constants, ...)
the type of the symbol

3.1. Types

The key to a symbol table for a Pascal compiler is the type information. In addition to a few intrinsic types, users are allowed to define types. Types are treated just as other symbols, that is, they are put into the symbol table, but identified as type symbols, not variable name symbols. Thus, the type field of a symbol is a pointer to another symbol table entry, which is its type. The symbol table entry for a type depends upon the kind of type as follows:

char - intrinsic type, minimum 0, maximum 127 (ASCII).
boolean - intrinsic, minimum 0, maximum 1.
subrange - minimum value, maximum value, and pointer to its component type. INTEGER is simply a subrange of minimum integer to maximum integer.
array - pointer to the subrange type which is its index and to the component type.
record - pointer to a list of the field elements for this record. Each field element is a symbol, type and pointer to next field.
file - pointer to component type.

This basic approach may require a fair amount of pointer following (e.g. to generate an array reference, we follow the variable type pointer to its type, then the index pointer to the subrange to find the lower bounds, then the type pointer to find the type of the component item.), but allows a general mechanism for creating and describing user defined types. See Figure 1 for an example of the structure of the following declarations.

	type t1 = 0..10;
	     r2 = record x: t1; y: t1; end;
	     a3 = array [t1] of r2;
	var x: array [0..10] of char;
	    y: array[char] of a3;

Notice that some types are declared implicitly, that is they are not explicitly named and given a type. These are "constant" types and can be treated as types with no name.

3.2. Symbol Names

Symbol names can be treated in two ways. The traditional approach is to fix a maximum number of characters which are stored. If the maximum is selected at 5, then only the first 5 characters are stored and used to identify symbols. One could then limit all identifiers to 5 characters or less, or allow longer names, but only use the first 5. Another approach uses the first 5 characters plus the actual length of the input symbol; this distinguishes between "currentlevel" and "currentprocedure" but not between "currentl" and "currentp".

A better approach is to save the entire variable name. This is easy to do by using a string table. The string table is an array of characters. each symbol table entry contains simply the length of the name and a pointer to the string table. the extra programming caused by this extra level of indirection is minimal.

3.3. Block Structure

The block structure of Pascal poses the standard problems of scope and identical variables defined at multiple levels. Although

it is possible, when exitting a procedure block at level i, to delete all symbol table entries at level i or greater, leaving only the entries at lower levels, the better approach is to simply use the stack nature of the blocks. Allocate all space in the symbol table at the end of the table. Whenever the level increases, save the current free space pointer. Upon block exit merely reet the free space pointer to the saved value. if a string table is used, a pointer will be needed for both. This is much easier and less error prone than individually deleting each element from the highest level.

3.4. Search Structures

Notice that some symbol table entries will normally not be findable by referring to them by name, but must be in the symbol table anyway. For example, field names in records should not be found unless quantified by a record name. Unnamed types can not be referred to by name. This implies the existence of a search structure which differs from the symbol table. We would suggest that an extra pointer be included in each symbol table entry which allows each entry to be linked into the search structure. Unnamed types are not linked into the search structure. Field names are not linked. The WITH statement implementation however would link them into the search structure for the duration of the WITH.

For searching, we suggest a chained hash table. A hash function of the variable name gives an index into a small hash table, each entry of which is the head of a chain of entries, ending with a null pointer. Entries are put on the chain by a stack mechanism (put on the front of the chain) so all searches automatically find the most recent version of the variable name.

Formal procedure parameters are another example of the use of a separate search structure and symbol table. When the procedure is being processed, formal parameter names are linked into the search structure, acting as normal variables, in terms of scope. When the procedure is complete, the formal parameter names are deleted from the search structure, but reman in the symbol table attached to the procedure name symbol table entry. Thus on procedure calls, the number, type and order of parameters can be checked against the formal declaration.

3.5. Operations

On the basis of this discussion, we see the following operations will be needed.

enter a symbol in the symbol table
link a symbol into the search structure
search for a symbol
delete an entire level of the symbol table (except formal parameters).
unlink an entire level from the search structure
unlink an individual entry from the search structure (for the WITH statement)
manage free space in the symbol table

4. Lexical Analysis

Classical lexical analysis techniques can be used. The lexical analyzer needs to return the following information to ther syntax analysis routine:

a token type, identifying keywords, special symbols, constants and variables.
For variables, a pointer to the symbol name. This should not be automatically put in the symbol table since it may be needed for use as a declaration (should not be in a symbol table at this level), a variable (should be in symbol table) or a field of a record (should be there but not searchable).
For a constant its value and type (integer, char, string, real, boolean).
The line number and column number of the token (for error reporting).

5. Syntax Analysis

Here several analysis techniques appear to be reasonable. The general function is of course to convert the token stream of the lexical scanner into a parse of the input program. The parse should then be communicated to the next stage, semantic analysis.

The traditional approach for Pascal is top-down recursive descent, with procedures based on the syntax diagrams. This is easy, intuitive and works reasonably well.

Another approach is to use one of the bottom-up parsing methods. Pascal is not simple precedence parsable because of conflicts with certain symbols, and particularly with the equal sign and constant. The relation between "=" and <constant> can be,

= because const <identifier> = <constant>
> because <expression> <relop> <expression> and <relop> => "=" while <expression> => <constant>
< because type <identifier> = <subrange> and <subrange> => <constant> .. <constant>.
This seems a fundamental conflict in the language design for simple precedence parsing.

Pascal is SLR(1) parsable or LALR parsable. These parsers may be small, fast and machine generated.

6. Semantic Analysis

Most of semantic analysis involves type checking, and the generation of some intermediate code for alter code generation and optimization phases.

Type checking can be done at two levels. The first can be restricted to mainly checking the type pointers in the symbol table fields for equality. This means that

	a: array[0..1] of char;
	b: array[0..1] of char;

are of different types since each will have a pointer to a different (unnamed) type entry in the symbol table.

The obvious improvement is to check for type compatibility, not type identity. This requires checking that types are of the same structure, and that any subtypes are compatible. If a single routine COMPATIBLE(t1,t2) returns true or false if type t1 is or is not compatible with type t2, then the complexity of this routine can be hidden within this one routine and need not affect the remainder of the compiler. This is another example of the need for information hiding in the compiler.

A further problem with types is to provide information for operations which may have operands of differing types (e.g. adding integers, reals, or a mixture; comparing any two types). In addition, these operations must identify the correct output type. It can be quite useful to limit expression types as much as possible to improve code generation. For example, the addition of two variables in the subrange 0..10 could result in a sum which is typed as INTEGER or, in the most restrictive case, 0..20. The INTEGER sum would need to be stored in a full word register, while the 0..20 type could potentially be stored in a small register, like an index register. This could be important in the code generation for a machine like the CDC 6600 or MIX computers with long and short registers.

The major problem in semantic analysis is the generation of appropriate intermediate code for later code generation. Several forms have been suggested, such as trees, reverse Polish, triples, quadruples, and so on. There does not appear to be any good description of these different techniques at the practical level. However, from the discussion in Aho and Ullman, "Principles of Compiler Design", we would lean towards the use of quadruples.

7. Code Generation and Emitting

Code generation is the process of converting the intermediate code from the semantic analysis routines into output code with is then formatted and output by the code emitting routines. The code generation phase must worry about such things as memory allocation, register allocation and instruction selection.

8. Conclusion

This is just a brief statement of some thoughts on writing a Pascal compiler. Hopefully, we will shortly have a compiler which satisfies most of the design criteria listed here.
Home Comments?

=	because	const <identifier> = <constant>
>	because	<expression> <relop> <expression> and <relop> => "=" while <expression> => <constant>
<	because	type <identifier> = <subrange> and <subrange> => <constant> .. <constant>.