ASSEMBLERS

Assemblers are the simplest of a class of systems programs called translators. A translator is simply a program which translates from one (computer) language to another (computer) language. In the case of an assembler, the translation is from assembly language to object language (which is input to a loader). Notice that an assembler, like all translators, adds nothing to the program which it translates, but merely changes the program from one form to another. The use of a translator allows a program to be written in a form which is convenient for the programmer, and then the program is translated into a form convenient for the computer.

Assembly language is almost the same as machine language. The major difference is that assembly language allows the declaration and use of symbols to stand for the numeric values to be used for opcodes, fields, and addresses. An assembler inputs a program written in assembly language, and translates all of the symbols in the input into numeric values, creating an output object module, suitable for loading. The object module is output to a storage device which allows the assembled program to be read back into the computer by the loader.

This is an external view of an assembler. Now we turn our attention to an internal view, in order to see how the assembler is structured internally to translate assembly programs into object programs.

8.1 DATA STRUCTURES

Appropriate data structures can make a program much easier to understand, and the data structures for an assembler are crucial to its programming. An assembler must translate two different kinds of symbols: assembler-defined symbols and programmer-defined symbols. The assembler-defined symbols are mnemonics for the machine instructions and pseudo-instructions. Programmer-defined symbols are the symbols which the programmer defines in the label field of statements in his program. These two kinds of symbols are translated by two different tables: the opcode table, and the symbol table.

The opcode table

The opcode table contains one entry for each assembly language mnemonic. Each entry needs to contain several fields. One field is the character code of the symbolic opcode. In addition, for machine instructions, each entry would contain the numeric opcode and default field specification. These are the minimal pieces of information needed. With an opcode table like this, it is possible to search for a symbolic opcode and find the correct numeric opcode and field specification.

Pseudo instructions require a little more thought. There is no numeric opcode or field specification associated with a pseudo-instruction; rather, there is a function which must be performed for each pseudo-instruction. This can be encoded in several ways. One method is to include in the opcode table, a type field. The type field is used to separate pseudo-instructions from machine instructions and to separate the various pseudo-instructions. Different things need to be done for each of the different pseudo-instructions, and so each has its own type. All machine instructions can be handled the same, however, so only one type is needed for them. One possible type assignment is then,
type   instruction class
1   Machine instruction
2   ORIG pseudo-instruction
3   CON pseudo-instruction
4   ALF pseudo-instruction
5   EQU pseudo-instruction
6   END pseudo-instruction.

Other type assignments are also possible. For example, in the above assignment, all machine instructions have one type, and are treated equally. This allows both instructions such as

LDA X(0:3)
and
ENTA X(0:3)
But the ENTA instruction is encoded as an opcode of 60 and a field of 02. Thus, it is not correct to specify a field with an ENTA. This reasoning can lead to separating memory reference instructions (which may have field specifications) from non-memory reference instructions (which should not have field specifications). Even further classification could separate the I/O instructions (which use the F field for a device number) and the MOVE instruction. The idea of attaching a type to a table entry is quite general.

Another approach to separating the pseudo-instructions in the opcode table is to consider how the type field of the above discussion would be used. It would be used in a jump table. In this case, rather than using the type to specify the code address through a jump table, we could store the address of the code directly in the opcode table. In each opcode table entry, we can store the address of the code to be executed for proper treatment of this instruction, whether it is a machine instruction or a pseudo-instruction. This is a commonly used technique for assemblers.

Other opcode table structures are possible also. Some assemblers have separate tables for machine instructions and for pseudo-instructions. Others will use a type field of one bit to distinguish between machine instructions and pseudo-instructions, and then store a numeric opcode and field specification for the machine instructions or an address of code to execute for pseudo-instructions.

For simplicity, let us assume that we store, for each entry, the symbolic mnemonic, numeric opcode, default field specification and a type, as defined above. For pseudo-instructions, the numeric opcode and default field specification of the entry will be ignored. How should we organize our opcode table entries? The opcode table should be organized to minimize both search time and table space. These two goals may not be achievable at the same time. The fastest access to table entries would require that each field of an entry be in the same relative position of a memory word, such as in Figure 8.1. But notice that, in this case, three bytes of each entry are unused, so that the table includes a large amount of wasted space. (Actually, three bytes, two signs, and the upper three bits of the type byte are unused.) To save this space would require more code and longer execution times for packing and unpacking operations. Thus, to save time it seems wise to accept this wasted space in the opcode table.


FIGURE 8.1 Opcode table entry (each opcode table entry contains the symbolic opcode, numeric opcode and field specification, and a type field).

To a great extent this wasted space is due to the design of the assembly language mnemonics. If the mnemonics had been defined as a maximum of three characters (instead of four), it would have been possible to store the mnemonic, field specifications, and opcode in one word. The sign bit would be "+"for machine instructions and "-" for pseudo-instructions. Pseudo instructions could have a type field or address in the lower bytes while machine instructions would store opcode and field specifications. On the other hand, once the decision is made that four character mnemonics are needed, then it is necessary to go to a multi-word opcode table entry. In this case, we could allow mnemonics of up to seven characters, by using the currently unused three bytes of the opcode table entry. On the other hand, there may be no desire to have opcode mnemonics of more than four characters; mnemonics should be short.

Another consideration is the order of entries in the table. This relates to the expected search time to find a particular entry in the table. The simplest search is a linear search starting at one end of the table and working towards the other end until a match is found (or the end of the table is reached). In this case, one should organize the table so that the more common mnemonics will be compared first. If, as expected, the LDA mnemonic is the most common instruction, then it should be put at the end of the table which is searched first.

Rather than use a linear search, it can be noted that the opcode table is a static structure; it does not change. The opcode table is like a dictionary explaining the numeric meaning of symbolic opcodes. Like a dictionary, it can be usefully organized in an alphabetical order. By ordering the entries, the table can be searched with a binary search. A binary search is the method commonly used to look for an entry in a dictionary, telephone book, or encyclopedia. First the middle of the book is compared with the entry being searched for (the search key). If the search key is smaller than the middle entry, then the key must be in the first half of the book, if it is larger it must be in the latter half of the book. After this comparison, the same idea can be repeated on the half of the book which still needs to be searched. Each comparison splits the remaining section of the book in half.

In MIX, this might be coded as follows. Location KEY contains the value for which we are searching. LOW contains our low index, while HIGH contains our high index. Initially, LOW would point to the first entry of the table; HIGH would point to the last entry. The search loop would then be

SEARCHLOOP LDA LOW ADD HIGH A = LOW + HIGH SRAX 5 DIV =2= A = (LOW + HIGH)/2 = MIDDLE STA *+1(0:2) MOVE TO INDEX 3 ENT3 0 LDA TABLE,3 MIDDLE OF TABLE CMPA KEY JE FOUND IF EQUAL, FOUND ENTRY JL 1F * DEC3 1 IF TABLE[MIDDLE] > KEY ST3 HIGH HIGH = MIDDLE - 1 JMP 2F * 1H INC3 1 IF TABLE[MIDDLE] < KEY ST3 LOW LOW = MIDDLE + 1 2H LDA HIGH CMPA LOW IF LOW > HIGH, STOP JGE SEARCHLOOP NOTFOUND ...
The search code has two exits. If the search key is found in the table, a jump is made to the label FOUND, with index register 3 being the index into the table of the search key entry. If the key is not in the table, control transfers to the label NOTFOUND.

A binary search is not always the best search method to use. Each time through, the search loop cuts the size of the table yet to be searched in half. In general, a table of size 2n will take about n comparisons to find an entry. Thus, a table of size 32 will take only 5 comparisons, while for a linear search, it is normally assumed, that on the average, half of the table must be searched, resulting in 16 comparisons. Thus, the binary search almost always requires fewer executions of its search loop than a linear search. However, notice that the binary search requires considerably more computation per comparison than a linear search. The binary search loop requires about 35 time units per comparison while a linear search can require only 5 time units. Thus, for a table of size 32, the binary search takes 5 comparisons at 35 time units each, for 175 time units, while the linear search takes 16 comparisons at 5 time units each for 80 time units. This is not to say that a binary search should never be used. For a table of size 128, a binary search will take about 245 (= 7 × 35) time units, while a linear search will take 320 (= 64 × 5). Thus, for a large table a binary search is better. Also, if a shift (SRB) could be used instead of the divide, the time per loop could be cut by 11 time units, making the binary search better. Since there are around 150 MIX opcodes, we use a binary search for the opcode table.

The symbol table

The opcode table is used to translate the assembler-defined symbols into their numeric equivalents; the symbol table is used to translate programmer-defined symbols into their numeric equivalents. Thus, these two tables can be quite alike. A symbol table, like an opcode table, is a table of many entries, one entry for each programmer-defined symbol. Each entry is composed of several fields.

For the symbol table, only two fields are basically needed. It is necessary to store the symbol and the value of the symbol. A symbol in a MIXAL program can be up to 10 characters in length. This requires two MIX words. In addition, the value of the symbol can be up to five bytes plus sign, requiring another MIX word. Thus, each entry in the symbol table takes at least three words. Additional fields may be added for some assemblers. A bit may be needed to indicate if the symbol has been defined or is undefined (i.e., a forward reference). Another bit may specify whether the symbol is an absolute symbol or a relative symbol (depending on whether the output is to be used with a relocatable or absolute loader). Other fields may also be included in a symbol table entry, but for the moment let us use only the two fields, for the symbol and its value.

As with the opcode table, the organization of the symbol table is very important for proper use. But the symbol table differs from the opcode table in one important respect: it is dynamic. The opcode table was static; it is never changed, neither during nor between executions of the assembler. It is the same for each and every assembly program for the entire assembly process. The symbol table is dynamic; each program has its own set of symbols with their own values and new symbols are added to the symbol table as the assembly process proceeds. Initially the symbol table is empty; no symbols have been defined. By the time assembly is completed, however, all symbols in the program have been entered into the symbol table.

This requires the definition of two subroutines to manipulate the symbol table: a search routine and an enter routine. The search routine searches the symbol table for a symbol (its search key) and returns its value (or an index into the table to its value). The enter subroutine puts a new symbol and its value into the table.

These two subroutines need to be designed together, since both of them affect the symbol table. A binary search might be quite efficient, but it requires that the table be kept ordered. This means that the enter subroutine would have to adjust the table for each new entry, so that the table was always sorted into the correct order. Thus, although a binary search might be quick, the combination of a binary search plus an ordered enter might be very expensive.

Also consider that a linear search is more efficient than a binary search for small tables. Many assembly language programs have less than 200 different symbols, and so a linear search may be quite reasonable. A linear search allows the enter routine to simply put any new symbol and its value at the end of the table. Thus, both the search and enter routines are simple.

Many other table management techniques have been considered and are used. Some of these are quite complex and useful only for special cases. Others have wide applicability. One of the most commonly used techniques is hashing. The objective of hashing is quite simple. Rather than have to search for a symbol at all, we would prefer to be able to compute the location in the table of a symbol from the symbol itself. Then, to enter a symbol, we compute where it should go, and define that entry. For accessing, we compute the address of the entry and use the value stored in the table at that location.

As a simple example, assume that all of our symbols were one-letter symbols (A, B, C, …, Z). Then if we simply allocated a symbol table of 26 locations, we could index directly to a table entry for each symbol by using its character code for an index. No searching would be needed. If our symbols were any two-letter symbols, we could apply the same idea if we had a symbol table of 676 (= 26 × 26) entries, where our hash function would be to multiply the character code of the first letter by 26 and add the character code of the second (and subtract 26 to normalize). However, for three-letter symbols, we would need 17,576 spaces in our symbol table. This is clearly impossible. (Our MIX memory is only 4000 words long.) It also is not necessary, since we assume that at most only a few hundred symbols will be used in each different assembly program. What is needed is a function which produces an address for each different symbol, but maps them all into a table of several hundred words.

Many different hashing functions can be used. For example, we can add together the character codes for the different characters of the symbol, or multiply them, or shift some of them a few bits and exclusive-OR them with the other characters, or whatever we wish. Then, after we have hashed up the input characters to get some weird number, we can divide by the length of the table and use the remainder. The remainder is guaranteed to be between 0 and the length of the table and hence can be used as an index into the table. For a binary machine and a table whose length is a power of two, the division can be done by simply masking the low-order bits.

The objective of all this calculation is to arrive at an address for a symbol which hopefully is unique for each symbol. But since there are millions of 10-character symbols, and only a few hundred table entries, it must be the case that two different symbols may hash into the same table entry. For example, if we use a hash function which adds together the character codes of the letters in the symbol, then both EVIL and LIVE will hash to the same location. This is called a collision. The simplest solution to this problem is to then search through successive entries in the table, until an empty table entry is found. (If the end of the table is found, start over at the beginning).

The search and enter routines are now straightforward. To enter a new symbol, compute its hash function, and find an empty entry in the table. Enter the new symbol in this entry. To search for a symbol, compute its hash function and search the table starting at that entry. If an empty entry is found, the symbol is not in the table. Otherwise, it will be found before the first empty location.

The problem with using hashing is defining a good hash function. A good hash function will result in very few collisions. This means that both the search and enter routines will be very fast. In the worst case, all symbols will hash to the same location and a linear search will result. Hashing is sometimes also used for opcode tables, where, since the opcode table is static and known, a hashing function can be constructed which guarantees no collisions.

More about hashing can be found in the paper by Morris (1968), or in Knuth (1973), Volume 3.

Other data structures

The opcode table and the symbol table are the major data structures for an assembler, but not the only ones. The other data structures differ from assembler to assembler, depending upon the design of the assembly language and the assembler.

A buffer is probably needed to hold each input assembly language statement, and another buffer is needed to hold the line image corresponding to that statement for the listing of the program. In addition buffers may be needed to create the object module output for the loader, and to perform double buffering in order to maximize CPU-I/O overlap. Various variables are needed to count the number of cards read, the number of symbols in the symbol table, and so on.

One variable in particular is important. This is the location counter. The location counter is a variable which stores the address of the location into which the current instruction is to be loaded. The value of the location counter is used whenever the symbol "*" is used, and this value can be set by the ORIG pseudo-instruction. Normally the value of the location counter is increased by one after each instruction is assembled, in order to load the next instruction into the next location in memory. The value of the location counter for each instruction is used to instruct the loader where to load the instruction.

8.2 GENERAL FLOW OF AN ASSEMBLER

With a familiarity with the basic data structures of an assembler (the opcode table, the symbol table, and the location counter), we can now describe the general flow of an assembler. Each input assembly statement, each card, is handled separately, so our most general flow would be simply, "Process each card until an END pseudo-instruction is found."

More specifically, consider what kind of processing is needed for each card:

  1. Read in a card.
  2. If the card is a comment (column 1 is an "*") then skip over processing to step 4 for printing.
  3. For non-comment cards, get the opcode and search the opcode table for it. Using the type field of the opcode table entry, process this card.
  4. After the card has been processed, print a line of listing for this card.
  5. If the opcode was not an END pseudo-instruction, go back to step 1 to process the next card.

This much of the assembly process is common to all of the input cards. The important processing is in step 3, where each card is processed according to its type. This level of the assembler provides an organizational framework for further development.


FIGURE 8.2 The general flow of an assembler (for MIXAL). Refinement of this general design resulted in the assembler described in Section 8.3.

Machine language instructions

Continuing then, what processing needs to be done for each type of opcode? Consider a machine language instruction. For a machine language instruction, we need to determine the contents of each of the four fields of a machine language instruction. In addition, we must define any label which is in the label field. Let us define the label first. Defining a label is simply a matter of entering the label (if any) in the label field of the assembly language statement into the symbol table with a value which is the value of the current location counter.

Now we need to determine the fields of the machine language instruction. The opcode field (byte 5:5) and the field specification (byte 4:4) are available in the opcode table entry which has already been found. To find the other fields (address and index), let us assume that we have a subroutine which will evaluate an expression. This subroutine will be written later. Then, to define the address field, we call our expression evaluator. The value it returns is the contents of the address field (bytes 0:2). When it returns, we check the next character. If the next character is a comma ",", then an index part follows and we call the expression evaluator again to evaluate the index part. If the next character is not a comma, then the index part is zero (default). When the index has been processed, we check the next character. If it is a left parenthesis, then the expression which follows is a field specification, and we call the expression evaluator again. Otherwise, we use the default specification from the opcode table.

Basically, we have defined an assembly language statement to be of the form

LABEL OP EXPRESSION,EXPRESSION(EXPRESSION)
and have written one subroutine which will evaluate each expression. From this we can determine the value of each of the fields of the machine language instruction to be generated for this assembly language instruction. Now, we assemble the generated word
LDA OPCODE STA MACHLANG(5:5) LDA FIELD STA MACHLANG(4:4) LDA INDEX STA MACHLANG(3:3) LDA ADDRESS STA MACHLANG(0:2)
and output it for the loader, to be loaded at the address indicated by the location counter. Then we advance the location counter and we are ready for the next input card.

Pseudo instruction processing

The processing for each pseudo-instruction is generally easy.

For an ORIG instruction, no code is generated. Rather, only two things need be done. First, if a label is present, it should be defined. Its value is the value of the location counter. Second, the expression in the operand field is evaluated and the value of the expression is stored as the new value of the location counter. We can use the same expression evaluator that is used for evaluation of the machine language operands.

The EQU pseudo-instruction is similarly straightforward. For an EQU instruction, we first evaluate the operand expression (using the expression evaluator as before), and then we enter the symbol in the label field into the symbol table with a value of the expression.

The ALF pseudo-instruction is even easier. In this case we first define the label of the instruction. Then we pick up the character code of the five characters of the ALF operand and issue them as a generated word, incrementing the location counter after we do so.

The most complicated pseudo-instruction is probably the CON instruction. The general form of this instruction is

CON exp1(fexp1),exp2(fexp2), …, expn(fexpn)
Notice that again we can use our expression evaluation routine. With this subroutine, the assembler code to handle a CON instruction is simply,
  1. Initialize the generated word to zero.
  2. Evaluate the next expression.
  3. If the next character is a left parenthesis, then evaluate the next expression to get a field specification; otherwise use a field specification of 0:5.
  4. Store the expression from step 2 in the indicated field of the generated word.
  5. If the next character is a comma, then go back to step 2, to repeat for the next part of the CON instruction.

Once the operand of the CON instruction has been generated, it can be output for loading and the location counter can be incremented.

The last pseudo-instruction the assembler will encounter is the END pseudo-instruction. For this pseudo-instruction, we first define the label. Then we evaluate the operand expression and output it to the loader as the starting address.

As you can see, each pseudo-instruction requires its own section of code for correct processing, and the pseudo-instruction processing for each is different. However, the processing for each individual pseudo-instruction, considered separately, is not overly complicated. The basic functions used in processing all assembly language instructions involve defining labels, evaluating expressions, and generating code. The first of these was discussed in Section 8.1, so now let us consider the evaluation of expressions.

Expression evaluation

An expression in MIXAL is composed of two basic elements: operands and operators. The operators are addition (+), subtraction (-), multiplication (*), division (/), and "multiplication by eight and addition" (:). The operands are of three types: numbers, symbols, and "*". Numbers have a value which is defined by the number itself, interpreted as a decimal number. Symbols are defined by the value field of their symbol table entry. The value of * is the value of the location counter.

The operators are applied to the operands strictly left to right, without precedence. This allows a very simple expression evaluation routine whose code is basically as follows.

  1. Evaluate the first operand. Save its value in VALUE1.
  2. If the next character is an operator, remember it and go on to step 3. Otherwise stop. VALUE1 is the value of the expression.
  3. Evaluate the second operand. Save its value in VALUE2.
  4. Apply the operator to VALUE1 and VALUE2, saving the result in VALUE1. Then go back to step 2.

And that is all that there is to it. Expressions for MIXAL have been defined in such a way that their evaluation is quite simple. Only one problem has been ignored. That problem is forward references.

Forward references

The forward referencing problem arises in a very simple way: the expression evaluator attempts to evaluate an operand which is a symbol by searching the symbol table, and finds that the symbol is not in the symbol table. The symbol is not defined, the expression cannot be evaluated, and the assembly language statement cannot be assembled. What can be done?

One solution is to disallow forward references. MIXAL uses this approach in several places. No pseudo-instruction can make a forward reference. Forward references are not allowed in the index or field specification fields of an instruction. However, a restriction to no forward references would be extremely inconvenient and so two other solutions are more commonly used for allowing some forward references. These two solutions result in two classes of assemblers: one-pass assemblers and two-pass assemblers. Since two-pass assemblers are conceptually simpler, we consider them first.

Two pass assemblers

A two-pass assembler makes two passes over the input program. That is, it reads the program twice. On the first pass, the assembler constructs the symbol table. On the second pass, the complete symbol table is used to allow expressions to be evaluated without problems due to forward references. If any symbol is not in the symbol table during an expression evaluation, it is an error, not a forward reference.


FIGURE 8.3 A block diagram of a two-pass assembler. Pass 1 produces a symbol table and a copy of the input source program for pass 2. Pass 2 produces loader code and a listing.

Briefly, for a two-pass assembler, the first pass constructs the symbol table; the second pass generates object code. This causes a major change in the basic flow of an assembler, and results in another important data structure: the intermediate text. Since two passes are being made over the input assembly language, it is necessary to save a copy of the program read for pass 1, to be used in pass two. Notice that since the assembly listing includes the generated machine language code, and the machine language code is not produced until pass 2, the assembly listing is not produced until pass 2. This requires storing the entire input program, including comments, and so forth, which are not needed by the assembler during pass 2.

The intermediate text can be stored in several ways. The best storage location would be in main memory. However, since MIX memory is so small, this technique is not possible on the MIX computer. On other machines, with more memory, this technique is sometimes used for very small programs. Another approach, used for very small machines, is to simply require the programmer to read his program into the computer twice, once for pass 1, and again for pass 2. A PDP-8 assembler has used this approach, even going so far as to require the program be read in a third time if a listing is desired.

A more common solution is to store the intermediate text on a secondary storage device, such as tape, drum, or disk. During pass 1, the original program is read in, and copied to secondary storage as the symbol table is constructed. Between passes, the device is repositioned (rewound for tapes, or the head moved back to the first track for moving head disk or drum). During pass 2, the program is read back from secondary storage for object code generation and printing a listing. (Notice that precautions should be taken that the same tape is not used for both storage of the intermediate text input for pass 2 and the storage of the object code produced as output from pass 2 for the loader.)

The general flow of a two-pass assembler differs from that given above, in that now each card must be processed twice, with different processing on each pass. It is also still necessary to process each type of card differently, depending upon its opcode type. This applies to both pass 1 and pass 2.

For machine instructions, pass 1 processing involves simply defining the label (if any) and incrementing the location counter by one. ALF and CON pseudo-instructions are handled in the same way. ORIG statements must be handled exactly as we described above, however, in order for the location counter to have the correct value for defining labels. Similarly, EQU statements must also be treated on pass 1. This means that no matter what approach is used, no forward references can be allowed in ORIG or EQU statements, since both are processed during pass 1. The END statement needs to have its label defined and then should jump to pass 2 for further processing.

During pass 2, EQU statements can be treated as comments. The label field can likewise be ignored. ALF, CON, and machine instructions will be processed as described above, as will the END statement.

The need to make two passes over the program can result in considerable duplication in code and computation during pass 1 and pass 2. For example, on both passes, we need to find the type of the opcode; on pass 1, to be able to treat ORIG, EQU, and END statements; on pass 2, for all types. This can result in having to search the opcode table twice for each assembly language statement. To prevent this, we need only save the index into the opcode table of the opcode (once it is found during pass 1) with each instruction. Also, consider that since the operand "*" may be used in the expressions evaluated during pass 2, it is necessary to duplicate during pass 2 the efforts of pass 1 to define the location counter, unless we can simply store with each assembly language statement the value of the location counter for that statement.

These considerations can result in extending the definition of the intermediate text to be more than simply the input to pass 1. Each input assembly language statement can be a record containing (at least) the following fields.

  1. the input card image.
  2. the index of the opcode into the opcode table.
  3. the location counter value for this statement.
Additional fields may include error flags, a pointer to the column in which the operand starts (for free-format assemblers), a line number, and so on; whatever is conveniently computed on pass 1 and needed on pass 2.

Even more can be computed during pass 1. For example, ALF and CON statements can be completely processed during pass 1. The opcode field, field specification, and index field for a machine instruction can be easily computed during pass 1, and most of the time the address field, too, can be processed on pass 1. It is only the occasional forward reference which causes a second pass to be needed. Most two-pass assemblers store a partially digested form of the input source program as their intermediate text for pass 2.

Assemblers, like loaders, are almost always I/O-bound programs, since the time to read a card generally far exceeds the time to assemble it. Thus, requiring two passes means an assembler takes twice as long to execute as an assembler which only uses one pass.

One pass assemblers

It is these considerations which have given rise to one-pass assemblers. A one-pass assembler does everything in one pass through the program, much as described earlier. The only problem with a one-pass assembler is caused by forward references, of course. The solutions to this problem are the same as were presented earlier for one-pass relocatable linking loaders: Use-tables or Chaining.

In either case, Use-table or Chaining, it is not possible to completely generate the machine language statement which corresponds to an assembly language statement; the address field cannot always be defined. Thus, some later program must fix-up those instructions with forward reference. In a two-pass assembler, this program is pass 2. In a one-pass assembler, there is no second pass, and so the program which must fix-up the forward references is the loader. The loader resolves forward references for a one-pass assembler.

If Use-tables are used to solve the future reference problem, then the assembler keeps track of all forward references to each symbol. After the value of the symbol is defined, the assembler generates special loader instructions to tell the loader the addresses of all forward references to a symbol and its correct value. When the loader encounters these special instructions during loading, it will fix-up the address field of the forward reference instruction to have the correct value.

A variation of this same idea is to use chaining. With chaining, the entries in the Use-table are kept in the address fields of the instructions which forward reference symbols. Only the address of the most recent use must be kept. When a new forward reference is encountered, the address of the previous reference is used, and the address of the most recent reference is kept. When a symbol is defined which has been forward referenced (or at the end of the program), special instructions are again issued to the loader to fix-up the chains which have been produced.

Variations on this basic theme are possible also. For example, if the standard loader will not fix-up forward references, it would be possible for the assembler to generate some special instructions which would be executed first, after loading, but before the assembled program is executed to fix-up forward references. But the basic idea remains the same: a one-pass assembler generates its object code for the loader in such a way that the loader, or the loaded program itself, will fix-up forward references.

Other considerations

Listings

Throughout the discussion of the assembly process, so far, we have ignored many of the (relatively) minor points concerning the writing of an assembler. One of the most visible of these considerations is the assembly listing. The assembly listing gives, for each input assembly line, the correspondingly generated machine language code. Notice, however, that not all assembly language statements result in the generation of machine language code, and not all code is meant to be interpreted in the same way (some are instructions, others numbers or character code).

For a machine language statement, useful information which can be listed includes the card image, card number, location counter, and generated code, broken down by field. For an ORIG, EQU, or END statement, on the other hand, no code is generated, but the value of the operand expression would be of interest. A CON or ALF statement would need to list the generated word, but as a number, not broken down by fields, as an instruction. Forward references in a one-pass assembler might be specially marked, as might a relocatable address field in a relocatable assembly program.

In addition to the listing of the input assembly, some assemblers also print a copy of their symbol table after it is completed. This symbol table listing would include the defined symbols, their values and perhaps the input card number where they were defined. Some assemblers will also produce a cross reference listing, which is a listing, for each symbol, of the card numbers or memory locations in the program of all references to that symbol. Most symbol table listings and cross reference listings are ordered alphabetically by symbol to aid in finding a particular symbol.

Errors

Another concern is the problem of errors. Many programs have assembler errors in them. These are errors not in what the program does, but in its form. These can be caused by mistakes in writing the program, or in the transcription of the program into a machine-readable form, like keypunching. The assembler must check for errors at all times. Typical errors include, multiply-defined symbols (using the same symbol in the label field twice), undefined symbols (using a symbol in the operand field of a statement, but never defining it in the label field), undefined opcodes (opcode cannot be found in the opcode table), illegal use of a forward reference, an illegal value for a field (index or field specification value not between 0 and 63, or address expression not between -4095 and +4095), and so on.

For each possible error, two things must be decided: (1) how to notify the programmer of the error, and (2) what to do next. The notification is often simplest. An error symbol or error number is associated with each type of error. The degree of specification of the exact error varies. One approach is to simply declare "error in assembly" or "error in xxxx part", where xxxx may be replaced by "label", "opcode", or "operand." However, these approaches may not give the programmer enough information to find and correct the error. The opposite approach is also sometimes taken, over-specifying the error to the extent that the user's manual lists thousands of errors, most of which are equivalent (from the point of view of the programmer), and with each error highly unlikely.

More typically, the major errors are classified and identified, while more obscure and unlikely errors are grouped into a category such as "syntax error in operand part." The listing line format may include a field in which errors are flagged, or a statement with an error may be followed by an error message.

In any case, the writer of the assembler must check for all possible errors in the input assembly program. If one is found, a decision must be made as to what to do next. This often may depend upon the type of error which is found, and the error handling code for each error is generally different. For a multiply-defined label, the second definition is often ignored, or the latest definition may always be used. An undefined opcode may be assumed to be a comment card, or treated like a HLT or NOP instruction. Illegal values for any of the fields of an instruction can result in default values being used.

Whatever the specific approach taken to a specific error, a general approach must also be taken. From the way in which an assembly language program is written, each input statement is basically a separate item to be translated. Thus, an error in one statement still allows the assembler to continue its assembly of the remaining program by simply ignoring the statement which is in error. This is in contrast to some systems which cease all operations whenever the first error is found. An assembler can always continue with the next card, after an error is found in the current card.

A more subtle point is whether or not an assembler should attempt to continue assembling the card in which the error occurs. If the opcode is undefined, the assembler may misinterpret the label field or operand field, causing it to appear that there are more (or fewer) errors than exist, once the incorrect opcode is corrected. (For example, a typing error on an ALF card may result in an undefined opcode. If the operand field of the incorrect ALF is treated as if the opcode were a machine instruction, then the character sequence for the ALF may result in an apparent reference to an undefined symbol.) On the other hand, attempting to continue assembling a card after an error may identify additional errors which would save the programmer several extra assemblies to find.

Relocatable versus absolute assembly

The discussion of loaders and linkers in Chapter 7 mentioned several differences between an assembly language which is used with a relocatable loader and an assembly language for an absolute loader. These differences show up in the assembly language in terms of the pseudo-instructions available and also in restrictions on the types of expressions (absolute or relocatable) which can be used.

It should be relatively obvious that the changes in the assembler for absolute and relocatable programs are not major changes, but consist mainly in the writing of some new code to handle the new pseudo-instructions and some changes in code generation format to match the input expected by a relocatable loader. Extra tables may be required for listing entry points and external symbols, and a type field will need to be added to the symbol table to distinguish between absolute symbols, relocatable symbols, entry points, and external symbols.

Load-and-go assemblers

Another type of assembler, in addition to relocatable versus absolute and one-pass versus two-pass, is the load-and-go assembler. These assemblers are typically used on large computers for small programs (like programs written by students just learning the assembly language). The idea of a load-and-go assembler is to both assemble and load a program at the same time. The major problem with this, on most machines, including the MIX computer, is the size of memory and the size of the assembler. There is simply not enough room for both a program and the assembler in memory at the same time.

However, for a simple assembly language, giving rise to a simple and small assembler, and with a large memory, it is possible to write an assembler which, rather than generating loader code, loads the program directly into memory as it is assembled. The assembler acts as both an assembler and a loader at the same time. These load-and-go systems often are one-pass, absolute assemblers, but can just as well be two-pass and/or relocatable assemblers.

Where they are possible, load-and-go systems are generally significantly faster than non-load-and-go systems, since they save on I/O.

8.3 AN EXAMPLE ASSEMBLER

To demonstrate the techniques which have been discussed in this Chapter, we present here the actual code for a one-pass absolute assembler for MIXAL. The assembler accepts a MIXAL program from the card reader. It produces a listing on the line printer, and object code similar to that accepted by the absolute loader of Section 7.1 (with modifications for fix-ups).

The assembler is presented from the bottom up. That is, the simpler routines, which do not call other routines, are written first. Then we write routines which may call these routines, and so forth until the main program is written. This order of writing the assembler is possible mainly because the discussion in Section 8.2 has shown which routines we need for the assembler and how they fit together.

Each routine is written to be an independent function, as much as possible. Registers should be saved and restored by each routine as needed, with the exception of registers I5 and I6. Register I6 is used as the location counter of the assembly program. Register I5 is a column indicator for the lexical scan portion of the assembler.

Tape output routines

Since the loader output is being written to tape, but is produced one word at a time by the assembler, several routines are needed to handle the tape output. The first two routines TAPEOUT and FINISHBUF are standard blocking routines such as discussed in Chapter 5. TAPEOUT accepts as a parameter one word, in the A register. These words are stored into a buffer of 100 words (one physical record) until the buffer is full. When the buffer is full, TAPEOUT calls FINISHBUF, which simply outputs the current buffer and resets the buffer pointers to allow double buffering. FINISHBUF can also be called to empty the last, partially filled buffer before the assembler halts.


FIGURE 8.4 The subroutine structure of the example MIXAL assembler. Each name is the name of a subroutine, and an arrow from one subroutine to another indicates that one subroutine may call the other in the course of its program. For example EXPRESSION may call EVALSYM, Which may call GETSYM, SEARCHSYM, and DEFINESYM.

Variables for these routines would include the two buffers (BUF1 and BUF2), pointers to these two buffers for double buffering, and a counter of the number of words in the buffer.

* * TAPE BUFFER VARIABLES * BUF1 ORIG *+100 BUFFERS BUF2 ORIG *+100 STOREBUF CON BUF1 POINTER TO BUFFER FOR STORING TAPEBUF CON BUF2 POINTER TO BUFFER FOR OUTPUT TAPEBUFPTR CON BUF1 CURRENT WORD POINTER TAPECNTR CON 100 NUMBER OF WORDS LEFT * * * SUBROUTINE FINISHBUF * * LOADER WORDS TO BE WRITTEN TO TAPE FOR * LOADING ARE STORED IN A BUFFER UNTIL 100 * WORDS ARE ACCUMULATED. THEN THIS ROUTINE * IS CALLED TO DUMP THE BUFFER TO TAPE. DOUBLE * BUFFERING IS USED. * FINISHBUF STJ FIBEXIT ST1 FIBSAVE1(0:2) SAVE REGISTERS ST2 FIBSAVE2(0:2) * LD1 STOREBUF CURRENT BUFFER OUT 0,1(TAPE) WRITE BUFFER TO TAPE LD2 TAPEBUF SWITCH BUFFER POINTERS ST2 STOREBUF ST2 TAPEBUFPTR AND RESET POINTER ST1 TAPEBUF * FIBSAVE1 ENT1 * RESTORE REGISTERS FIBSAVE2 ENT2 * FIBEXIT JMP * * * * SUBROUTINE TAPEOUT * * THIS SUBROUTINE ACCEPTS ONE WORD TO BE * WRITTEN TO THE TAPE FOR THE LOADER AND * STORES IT IN THE BUFFER UNTIL THE BUFFER IS * FULL. THEN IT CALLS FINISHBUF TO EMPTY * THE BUFFER. * * INPUT WORD IS IN THE A REGISTER * TAPEOUT STJ TOEXIT SAVE REGISTERS ST1 TOSAVE1(0:2) * LD1 TAPEBUFPTR NEXT WORD POINTER STA 0,1 SAVE WORD INC1 1 ST1 TAPEBUFPTR UPDATE POINTER * LD1 TAPECNTR CHECK FOR FULL BUFFER DEC1 1 J1P STILLROOM JMP FINISHBUF BUFFER FULL ENT1 100 RESET COUNTER * STILLROOM ST1 TAPECNTR RESTORE COUNTER * TOSAVE1 ENT1 * TOEXIT JMP * *

Loader code generation routines

The TAPEOUT routine is called mainly by the routines which create the loader output. These routines are mainly concerned with formatting the output from the assembler. The loader output is a series of logical records, as described in Chapter 7. For a one-pass absolute loader, we need three types of loader records: (a) words to be stored in memory, (b) chain addresses for forward reference fix-up, and (c) the start address of the assembled program. Each block is identified by a header word of the format shown in Figure 8.5.


FIGURE 8.5 The format of the header word for a loader record.

Byte 3 (T) is a type byte which is used to indicate the type of information in N (bytes 4:5) and LA (bytes 0:2). They can have only a value of 0, 1, or 2, as follows,

  1. T = 0. The address LA is a load address. The next N words on the tape should be loaded into locations LA, LA+1, …, LA+N-1. A checksum follows these N words.
  2. T = 1. The address LA is the first address of a chain of addresses of locations which were forward references. The value N should be used to fix-up this chain of forward references.
  3. T = 2. The address LA is the starting address of the program. N should be zero.

FIGURE 8.6 The three types of loader records. The type of record is determined by byte 3:3.

Most of the output to the tape will be of type 0. These words are produced by the assembler one at a time and need to be blocked into N word groups with a header in front and a checksum following. Notice that the header cannot be output until the value of N is known. So words are stored in a buffer (LDRBLOCK) until a discontinuity in load addresses occurs, or the buffer is full. Then the header and checksum are attached and the entire block written to tape by use of the routine TAPEOUT. The variables and code for this are

* * LOADER BLOCK VARIABLES * BLCKLENGTH CON 63 LENGTH OF LDRBLOCK LDRBLOCK ORIG *+64 BLOCK FOR LOADER RECORDS NXTLOADLOC CON 0 INDEX INTO LDRBLOCK

Two routines are used; one (GENERATE) puts the words into the loader record block (LDRBLOCK), while the other (FINISHBLCK) will empty the buffer, output it to tape, and compute the checksum.

* * GENERATE * * GENERATE A WORD OF LOADER OUTPUT. * THE LOADER WORD IS IN VALUE. REGISTER I6 HAS * THE ADDRESS OF THE LOCATION WHERE IT SHOULD * BE LOADED. IF THIS WORD IS A CONTINUATION * OF THE CURRENT BUFFER, IT IS SIMPLY STORED. * IF THE WORD IS NONCONTIGUOUS OR FILLS THE * BUFFER, THE BUFFER IS EMPTIED. * GNSAVEA ORIG *+1 * GENERATE STJ GENEXIT STA GNSAVEA ST1 GNSAVE1(0:2) * CMP6 NXTLOADLOC CHECK IF CONTIGUOUS JNE FINISHBLCK IF NOT, FINISH OLD BLOCK * LD1 LDRBLOCK(4:5) NUMBER OF WORDS INC1 1 ST1 LDRBLOCK(4:5) * LDA VALUE STA LDRBLOCK,1 STORE GENERATED WORD * INC6 1 INCREASE LOCATION COUNTER ST6 NXTLOADLOC * CMP1 BLCKLENGTH CHECK FOR END OF BLOCK JGE FINISHBLCK IF SO, FINISH BLOCK LDA GNSAVEA GNSAVE1 ENT1 * RESTORE REGISTER GENEXIT JMP * * * * SUBROUTINE FINISHBLCK * * OUTPUT TO THE LOADER THE BLOCK IN LDRBLOCK. * NUMBER OF WORDS IS IN BYTE 4:5 OF FIRST WORD. * (MAY BE ZERO, IN WHICH CASE IGNORE CALL). * COMPUTE CHECKSUM AND OUTPUT IT TOO * CHECKSUM ORIG *+1 FLBSAVEA ORIG *+1 FINISHBLCK STJ FLBEXIT STA FLBSAVEA ST1 FLBSAVE1(0:2) * LD1 LDRBLOCK(4:5) J1Z FLBQUIT IF BLOCK IS EMPTY * STZ CHECKSUM INITIALIZE CHECKSUM ENT1 0 INDEX AND COUNTER * BLOCKOUT LDA LDRBLOCK,1 JMP TAPEOUT OUTPUT EACH WORD ADD CHECKSUM(1:5) STA CHECKSUM(1:5) AND COMPUTE CHECKSUM * INC1 1 CMP1 LDRBLOCK(4:5) CHECK ALL WORD OUT JLE BLOCKOUT * LDA CHECKSUM(1:5) OUTPUT CHECKSUM JMP TAPEOUT * JOV *+1 TURN OVERFLOW OFF (IF ON) * FLBQUIT STZ LDRBLOCK NEW HEADER WORD ST6 LDRBLOCK(0:2) * LDA FLBSAVEA RESTORE REGISTERS FLBSAVE1 ENT1 * FLBEXIT JMP * *

Input routines

Subroutine READCARD will read one card and unpack it into a one-character-per-word card image form. Double buffering is used.

* INBUF ORIG *+16 CARD ORIG *+80 CARDNUMBER CON 0 NUMBER OF CARDS READ * * * SUBROUTINE READCARD * * READ THE NEXT CARD FROM THE CARD READER AND * UNPACK IT INTO THE CHARACTER ARRAY CARD. THE * CARD IS READ INTO THE BUFFER INBUF AND WE * TRY TO KEEP THE CARD READER BUSY BY THIS * DOUBLE BUFFERING. * RCSAVEA ORIG *+1 * READCARD STJ RCEXIT STA RCSAVEA SAVE REGISTERS ST1 RCSAVE1(0:2) ST2 RCSAVE2(0:2) ST3 RCSAVE3(0:2) * JBUS *(CR) WAIT TILL CARD READ ENT1 79 79..0 CHARACTER COUNTER ENT2 15 15..0 WORD COUNTER * * UNPACK CARD FROM RIGHT TO LEFT * NEXTWORD LDA INBUF,2 DEC2 1 ENT3 4 4..0 NUMBER OF CHARACTERS * NEXTCHAR STA CARD,1(5:5) DEC1 1 J1N CARDDONE IF ALL DONE DEC3 1 J3N NEXTWORD OR NEW WORD NEEDED SRA 1 JMP NEXTCHAR ELSE SHIFT AND CONTINUE * CARDDONE IN INBUF(CR) START NEXT READ LDA CARDNUMBER INCA 1 INCREMENT NUMBER OF CARDS STA CARDNUMBER * LDA RCSAVEA RESTORE REGISTERS RCSAVE1 ENT1 * RCSAVE2 ENT2 * RCSAVE3 ENT3 * RCEXIT JMP * *

Lexical scan routines

GETSYM is the main lexical scan routine. Using index register I5 as a pointer to the card image in CARD, GETSYM packs the next symbol into SYM. SYM is 2 words, to allow up to 10 characters per symbol. A symbol is delimited by any non-alphabetic, non-numeric character. If the current character is non-alphanumeric, then SYM will be blank. The variable LETTER is used to indicate if the symbol is strictly numeric (LETTER zero) or contains an alphabetic character (LETTER non-zero). Numeric symbols are right-justified (to allow NUMing), while symbols with letters are left-justified.

* SYM ORIG *+2 SYMBOL LETTER ORIG *+1 NUMERIC=ZERO * * * SUBROUTINE GETSYM * * GET THE NEXT SYMBOL FROM CARD(15) AND PACK * IT INTO SYM UNTIL A DELIMITER. I5 WILL BE * MOVED TO POINT TO THE DELIMITER. LETTER IS * ZERO IF NO LETTER FOUND. * GSSAVEA ORIG *+1 GSSAVEX ORIG *+1 * GETSYM STJ GSEXIT SAVE REGISTERS STA GSSAVEA STX GSSAVEX ST1 GSSAVE1(0:2) ST2 GSSAVE2(0:2) * ENT1 10 MAXIMUM NUMBER OF CHARACTERS STZ LETTER ENTX 0 ENTA 0 BLANK AX * SCANSYM LD2 CARD,5 CMP2 CHARA(5:5) MUST BE AT LEAST A JL ENDSYM CMP2 CHAR9(5:5) AND NOT MORE THAN 9 JG ENDSYM CMP2 CHARO(5:5) ALSO CHECK IF 0..9 JGE *+2 STJ LETTER(4:5) LETTER FOUND * DEC1 1 DECREMENT NUMBER OF CHARACTERS J1N *+3 SLAX 1 INCX 0,2 ADD NEW CHARACTER TO SYMBOL INC5 1 NEXT COLUMN JMP SCANSYM * SYMERROR JMP BADSYM BAD SYMBOL (TOO LONG) JMP GSQUIT NO JUSTIFICATION NEEDED * ENDSYM J1N SYMERROR CHECK IF TOO LONG LD2 LETTER CHECK IF NEED JUSTIFY. J2Z *+2 SLAX 0,1 REGISTER 1 HAS COUNT * GSQUIT STA SYM SAVE SYMBOL STX SYM+1 * LDA GSSAVEA LDX GSSAVEX GSSAVE1 ENT1 * GSSAVE2 ENT2 * GSEXIT JMP *

GETSYM is used by GETFIELDS (among others). GETFIELDS gets the label and opcode fields of an assembly language program for later processing. This MIXAL assembler accepts only fixed-format input, and GETFIELDS reflects this. Notice that the change to a free-format assembler would require only that this one subroutine need be changed.

* LABEL ORIG *+2 SPACE FOR LABEL (IF ANY) OP ORIG *+1 AND OPCODE * * * SUBROUTINE GETFIELDS * * GET THE FIELDS FOR THE ASSEMBLY LANGUAGE. * FIELDS ARE LABEL, OPCODE AND OPERAND. LABEL * MAY BE MISSING. REGISTER I5 IS LEFT * POINTING JUST BEFORE THE OPERAND. * THE EXPRESSION ROUTINE WILL SKIP ONE COLUMN * TO BEGIN EVALUATION, SO LEAVE I5 ONE BEFORE * THE START OF THE OPERAND. * * FIXED FIELD * GFSAVEA ORIG *+1 * GETFIELDS STJ GFEXIT SAVE REGISTERS STA GFSAVEA ST1 GFSAVE1(0:2) * STZ LABEL DEFAULT IS BLANK LABEL LDA CARD CHECK COLUMN ONE JAZ NOLABEL ENT5 0 COLUMN OF LABEL JMP GETSYM * LDA LETTER ERROR CHECKING JANZ LEGALLABEL JMP BADSYM NUMERIC SYMBOL JMP NOLABEL * LEGALLABEL ENT1 LABEL MOVE LABEL FROM SYM MOVE SYM(2) * NOLABEL ENT5 11 OPCODE JMP GETSYM LDA SYM(1:4) FOUR CHAR OPCODE STA OP * ENT5 15 READY FOR OPERAND * (COLUMN 17) * LDA GFSAVEA GFSAVE1 ENT1 * GFEXIT JMP * *

Table search and enter routines

Both the opcode table and symbol table need search routines, and the symbol table needs an enter routine.

Each entry of the opcode table is two words. The symbolic opcode is in bytes 1:4 of the first word. Byte 5:5 is a type field. For machine instructions, word 2 has the default field and the numeric opcode in bytes 4 and 5, while word 2 is ignored for pseudo-instructions.

OPTAB EQU * ALF ADD A ADD ALF ALF D CON 0 ALF CHARA CHAR ALF CMPAA CMPA ALF CMPXA CMPX ALF CMP1A CMP1 ALF CMP2A CMP2 ... ALF ENT6A ENT6 ALF EQU E CON 0 * HLTINDEX EQU *-OPTAB/2 INDEX OF HLT ALF HLT A HLT ALF IN A IN ALF INCAA INCA ... ALF ST6 A ST6 ALF SUB A SUB * NUMBEROPS EQU *-OPTAB/2 2 LOCATIONS PER ENTRY * OPTYPE CON 0 * VALUE ORIG *+1

Each entry of the symbol table takes four words. The name of the symbol is stored in the first two words, and the value in the third word. If there were any forward references to this symbol, the fourth word contains the address of the last forward reference in bytes 4:5. Forward references are chained through this address.


FIGURE 8.7 A symbol table entry (each entry contains the 10-character symbol, value, and a fix-up address, along with a defined/forward reference bit, D).

The sign bit of the first word (D) is used to indicate if the symbol has been defined (D = +) or undefined (D = -). A symbol is undefined only if there has been a (forward) reference to it but it has not yet appeared in the label field.

* NSYMBOL CON 0 NUMBER OF SYMBOLS (TIMES 4) SYMINDEX ORIG *+1 SYMTAB ORIG 4*MAXSYMBOL+* SYMBOL TABLE *

The opcode table search routine uses a binary search on the ordered opcode table. If we search with a binary search, we need the opcode table to be sorted alphabetically. If the opcode is not found, an error is flagged and a HLT instruction is assumed as the opcode. The division by two in the binary search could be replaced by a right shift by one bit if a binary MIX machine is used.

* * SUBROUTINE SEARCH OP * * SEARCH THE OPCODE TABLE (OPTAB) FOR THE * OPCODE IN OP. A BINARY SEARCH IS USED. IF * THE OPCODE IS NOT FOUND, AN UNOP ERROR * OCCURS. A HLT IS USED FOR UNOP ERRORS. THE * VALUE OF THE SECOND WORD IS * STORED IN VALUE AND THE TYPE OF THE OPCODE * IN OPTYPE. * SOPSAVEA ORIG *+1 SOPSAVEX ORIG *+1 OPHIGH ORIG *+1 OPLOW ORIG *+1 OPMID ORIG *+1 * SEARCHOP STJ SOPEXIT SAVE REGISTERS STA SOPSAVEA STX SOPSAVEX ST1 SOPSAVE1(0:2) * ENT1 NUMBEROPS-1 ST1 OPHIGH STZ OPLOW INITIALIZE HIGH AND LOW * SOPLOOP LDA OPLOW ADD OPHIGH SRAX 5 DIV TWO COMPUTE OPMID = (HIGH+LOW)/2 STA OPMID LD1 OPMID INC1 0,1 2*OPMID (TWO WORDS PER ENTRY) * LDA OPTAB,1(1:4) CMPA OP COMPARE OPCODES JE SOPFOUND * JL UPLOW DOWNHIGH LD1 OPMID OP < OPTAB[OPMID] DEC1 1 ST1 OPHIGH SO HIGH = OPMID - 1 JMP SOPDONE * UPLOW LD1 OPMID OP > OPTAB[MID] INC1 1 ST1 OPLOW SO LOW = OPMID + 1 * JMP SOPDONE * SOPDONE LDA OPHIGH CHECK FOR END CMPA OPLOW JGE SOPLOOP CHECK FROM LOW TO HIGH * JMP UNOP OPCODE NOT FOUND IN OPTAB ENT1 2*HLTINDEX DEFAULT OPCODE * SOPFOUND LDA OPTAB+1,1 NUMERIC OPCODE STA VALUE LD1 OPTAB,1(5:5) OPCODE TYPE ST1 OPTYPE * LDA SOPSAVEA RESTORE REGISTERS LDX SOPSAVEX SOPSAVE1 ENT1 * SOPEXIT JMP * *

A linear search is used for the symbol table. This is not the best search possible, but it is relatively simple. Since the search is done only in this one routine, it can be changed later, if we find a linear search to be too slow. The symbol definition routine simply adds symbols to the end of the table.

* * SUBROUTINE DEFINESYM * * DEFINE A SYMBOL. PUT IT IN THE SYMBOL * TABLE. THE A REGISTER HAS ITS VALUE. THE * FORWARD REFERENCE FIELD IS SET TO INDICATE * NO FORWARD REFERENCES (YET) BY SETTING IT * TO CHAINEND. * DEFINESYM STJ DSEXIT SAVE REGISTERS ST1 DSSAVE1(0:2) * LD1 NSYMBOL NEXT SYMBOL TABLE SPACE STA SYMTAB+2,1 SYMBOL VALUE LDA SYM STA SYMTAB,1 FIRST FIVE CHARS LDA SYM+1 STA SYMTAB+1,1 SECOND FIVE LDA CHAINEND STA SYMTAB+3,1(4:5) FORWARD REFERENCE * ST1 SYMINDEX SYMBOL INDEX INC1 4 ST1 NSYMBOL UPDATE NUMBER OF SYMBOLS * LDA SYMTAB-2,1 RESTORE A REGISTER DSSAVE1 ENT1 * DSEXIT JMP * * * * SUBROUTINE SEARCHSYM * * SEARCH SYMBOL TABLE. SYMBOL TABLE * IS SEARCHED FROM THE BACK TO THE FRONT. * THE RESULT OF THE SEARCH IS RETURNED IN * SYMINDEX. SYMINDEX IS AN INDEX INTO SYMTAB * IF FOUND, OR NEGATIVE * SSSAVEA ORIG *+1 SSSAVEX ORIG *+1 * SEARCHSYM STJ SSEXIT SAVE REGISTERS STA SSSAVEA STX SSSAVEX ST1 SSSAVE1(0:2) * LDA SYM LDX SYM+1 SYMBOL TO SEARCH FOR LD1 NSYMBOL SYMTAB INDEX * SYMLOOP DEC1 4 J1N SYMDONE IF NEGATIVE, NOT FOUND CMPA SYMTAB,1(1:5) COMPARE JNE SYMLOOP CMPX SYMTAB+1,1(1:5) AND COMPARE JNE SYMLOOP * SYMDONE ST1 SYMINDEX EITHER FOUND OR NOT * SSSAVE1 ENT1 * LDA SSSAVEA LDX SSSAVEX SSEXIT JMP * *

Expression evaluation

Expression evaluation for the MIXAL language is relatively simple since it involves only three types of operands (symbols, numbers, and *) and evaluation of operators is strictly left to right. After checking first for a *, the EVALSYM routine uses GETSYM to get the next symbol from the input card. If the symbol has no letters (LETTER = 0), it is numeric and is converted from character code to a numeric value by the NUM instruction. If the symbol has letters, then the symbol table is searched and the value of the symbol in the symbol table is used.

The evaluation of a symbol is actually more complicated due to forward references. Two special cases may arise in the symbol table search. First, the symbol may not be there. In this case, it must be entered into the table (using DEFINESYM). Then it can be treated like the second case, which involves symbols in the table, but not defined. These are symbols which have previously appeared as forward references. They are distinguished by a "-" sign in the sign bit of the first word of the symbol table entry. In this case, the "value" of the symbol is the previous reference address and the reference address is updated to be this instruction. Since forward references are not always allowed, the undefined nature of the symbol is noted by setting the variable UNDEFSYM to non-zero.

EVALSYM is used by EXPRESSION to evaluate the components of an expression. EXPRESSION evaluates the first component (using EVALSYM) and then examines the next character. If it is a delimiter, the evaluation stops; if it is an operator, the evaluation continues. This decision is made by the use of an array (OPERATOR) which is indexed by the character code of the next character. If the value is zero, the character is a delimiter. If it is non-zero, then the character is an operator and the value is in the range 1 to 5, to be used in a jump table for interpreting that operator. Thus, since the character code for "+" is 44, OPERATOR+44 is 1, OPERATOR+45 is 2 ("-"), OPERATOR+46 is 3 ("*"), OPERATOR+47 is 4 ("/"), and OPERATOR+54 is 5 (":"). All other values are zero.

If the character following an evaluated symbol is an operator, EVALSYM is called again and the two values are combined according to the operator. This process of finding operators, calling EVALSYM to evaluate the symbol, and combining its value with the previously computed value, continues until a delimiter is found.

Several problems must be attended to. Forward references are only allowed when NOFORWARD is zero. UNDEFSYM is used to tell when a symbol is a forward reference. Overflow must also be checked and flagged as an error. This includes expressions which exceed the range which is appropriate for their intended use. (An expression for an index or field specification can only be in the range 0 to 63.) These upper and lower bounds are passed as a parameter to EXPRESSION. The address of the lower bound is passed in register I1; the upper bound follows at 1,1.

* UNDEFSYM ORIG *+1 NOFORWARD CON 1 NONZERO, NO FORWARDS * * UPPER AND LOWER BOUNDS FOR EXPRESSIONS * HLBYTE CON 0 INDEX, FIELD CON 63 HLADDR CON 0 ADDRESSES (ORIG, END) CON 3999 HL2BYTE CON -4095 ADDRESS FIELD CON +4095 HLWORD CON -1073741823 MIN AND MAX WORD CON +1073741823 * * OPERATOR ORIG *+44 FIRST 44 ZEROS CON 1 ADD CON 2 SUBTRACT CON 3 MULTIPLY CON 4 DIVIDE ORIG *+6 CON 5 COLON OPERATOR ORIG *+10 * * * SUBROUTINE EVALSYM * * EVALUATE THE NEXT SYMBOL. VALUE RETURNED * IN A. UNDEFSYM IS NONZERO IF VALUE IS * UNDEFINED. SYMBOLS MAY BE *, NUMBER * OR SYMBOL. GETSYM IS USED TO GET NUMBERS * OR SYMBOLS. * ESSAVEX ORIG *+1 * EVALSYM STJ ESEXIT SAVE REGISTERS ST1 ESSAVE1(0:2) STX ESSAVEX * LDA CARD,5 CARD(COLUMN) CMPA CHARSTAR(5:5) CHECK FOR * JNE NOTSTAR * ISSTAR INC5 1 INCREASE COLUMN COUNTER ENTA 0,6 VALUE STZ UNDEFSYM DEFINED JMP ESQUIT * NOTSTAR JMP GETSYM GET SYMBOL LD1 LETTER J1NZ ISSYMBOL LETTER NONZERO MEANS SYMBOL * ISNUMBER LDA SYM CONVERT SYMBOL TO NUMERIC LDX SYM+1 NUM JOV EXPOV IF OVERFLOW, ERROR STZ UNDEFSYM DEFINED JMP ESQUIT * ISSYMBOL JMP SEARCHSYM SEARCH FOR SYMBOL LD1 SYMINDEX J1N ISNOTTHERE NOT IN SYMBOL TABLE LDA SYMTAB,1 JAN ISNOTDEF IF IN TABLE, NOT DEFINED * ISDEFINED LDA SYMTAB+2,1 STZ UNDEFSYM DEFINED JMP ESQUIT * ISNOTTHERE ENTA -1 NOT IN SYMBOL TABLE, ENTER JMP DEFINESYM ARBITRARY VALUE LD1 SYMINDEX STA SYMTAB,1(0:0) MARK NEGATIVE: FORWARD REF * ISNOTDEF LDA SYMTAB+3,1 FORWARD REFERENCE STJ UNDEFSYM(4:5) UNDEFINED VALUE ST6 SYMTAB+3,1 UPDATE CHAIN ADDRESS * ESQUIT EQU * LDX ESSAVEX ESSAVE1 ENT1 * ESEXIT JMP * * * * SUBROUTINE EXPRESSION * * EVALUATE THE NEXT EXPRESSION. EXPRESSION * STARTS AT COLUMN+1 (COLUMN IN I5). (THE * PLUS ONE IS TO SKIP OVER THE LAST DELIMITER) * OPERANDS ARE EVALUATED BY EVALSYM. * OPERATORS ARE + - * / : * AND ARE APPLIED LEFT TO RIGHT UNTIL A * DELIMITER IS FOUND. * VALUE1 ORIG *+1 VALUE2 ORIG *+1 EXPSAVEX ORIG *+1 * EXPRESSION STJ EXPEXIT SAVE REGISTERS ST1 EXPSAVE1(0:2) ST2 EXPSAVE2(0:2) STX EXPSAVEX * INC5 1 JMP EVALSYM EVALUATE FIRST OPERAND * EXPLOOP LD2 CARD,5 CHECK NEXT COLUMN LD2 OPERATOR,2 FOR OPERATOR J2NP EXPOVER * LD1 UNDEFSYM NO UNDEFS AND OPERATORS J1NZ FORERROR FORWARD REFERENCE ERROR * INC5 1 SKIP OPERATOR STA VALUE1 JMP EVALSYM SECOND OPERAND STA VALUE2 LDA VALUE1 JMP *,2 APPROPRIATE OPERATOR JMP OPADD JMP OPSUB JMP OPMUL JMP OPDIV JMP OP8ADD * OPADD ADD VALUE2 VALUE1 = VALUE1 + VALUE2 JMP NEXTOP * OPSUB SUB VALUE2 VALUE1 = VALUE1 - VALUE2 JMP NEXTOP * OPMUL MUL VALUE2 VALUE1 = VALUE1 * VALUE2 JANZ EXPOV IF A NONZERO, OVERFLOW SLAX 5 PUT VALUE IN A JMP NEXTOP * OPDIV SRAX 5 VALUE1 = VALUE1 / VALUE2 DIV VALUE2 JMP NEXTOP * OP8ADD MUL EIGHT VALUE1 = 8*VALUE1 + VALUE2 JANZ EXPOV SLAX 5 ADD VALUE2 JMP NEXTOP * NEXTOP JOV EXPOV CHECK OVERFLOW JMP EXPLOOP * EXPOVER LD1 UNDEFSYM J1Z EXPQUIT CHECK IF UNDEFINED LD1 NOFORWARD AND FORWARD REFERENCES FORERROR J1NZ ILLFOR NOT ALLOWED => ERROR * EXPQUIT EQU * EXPSAVE1 ENT1 * RESTORE INDEX 1 CMPA 0,1 CHECK LOWER BOUND JGE *+3 JMP EXPOV LESS THAN LOWER LDA 0,1 REPLACE WITH LOWER CMPA 1,1 CHECK UPPER JLE *+3 JMP EXPOV MORE THAN UPPER LDA 1,1 REPLACE WITH UPPER * STJ NOFORWARD(4:5) FORWARDS NOT ALLOWED * LDX EXPSAVEX EXPSAVE2 ENT2 * EXPEXIT JMP * *

Formatting print lines

The print routine is relatively straightforward, although lengthy. Each input statement generates an output line in the listing. The most common format is the format of a machine instruction, which is
Location counter   Generated instruction Card image Card number
+3516   +3514 00 05 30 STA ERRSAVEA 469
Since the individual fields for an instruction are of interest, these fields are separated by a blank.

For other types of assembly language statements, this format is not always appropriate. For the CON and ALF statements, the generated code is not normally interpreted as an instruction, so it could be better presented as a five-byte signed value. For the EQU statement, no code is generated, and no location counter can thus be meaningfully associated with the statement, but the operand expression should be printed. The ORIG and END statements likewise do not generate code, but their operand expressions, which are addresses (not five-byte values) should be printed. Thus, there are several different formats for output, depending upon the type of the opcode. These are encoded in the PRTFMT table.

The PRINTLINE routine formats the output line in LINE according to the entry in PRTFMT determined by OPTYPE. LINE is then packed into PRTBUF and printed. By positioning the CARD array in the LINE array, the copying from CARD to LINE is not needed for the card image. Since the location counter may have been changed by the time that the line is formatted and printed, a separate variable, PRINTLOC, is used to store the location to be printed on the output line.

* * PRINTER SELECTION FORMAT VARIABLES * PRT CON 0 PRINT FORMAT HOLDING VARIABLE ADR EQU 4:4 PRINT VALUE AS ADDRESS VAL EQU 3:3 PRINT VALUE AS WORD INST EQU 2:2 PRINT VALUE AS INSTRUCTION LOC EQU 1:1 PRINT LOCATION COUNTER * PRTFMT CON 0 COMMENT PRINT CON 1(LOC),1(INST) MACHINE: TYPE 1 CON 1(LOC),1(ADR) ORIG: TYPE 2 CON 1(LOC),1(VAL) CON: TYPE 3 CON 1(LOC),1(VAL) ALF: TYPE 4 CON 1(VAL) EQU: TYPE 5 CON 1(LOC),1(ADR) END: TYPE 6 * PRTBUF ORIG *+24 PRINTER BUFFER LINE ORIG *+30 FIRST 30 COLUMNS CARD ORIG *+80 CARD IMAGE ORIG *+10 CARD NUMBER * PRINTLOC ORIG *+1 LOCATION COUNTER FOR PRT *

On a binary machine, it is most useful to print the output in octal, not decimal. Since the CHAR instruction converts from numeric into decimal character code, a separate routine, OCTCHAR, is used to convert into octal character code. Notice that by simply changing the divide by 8 to a divide by 10, decimal output can be generated.

* * SUBROUTINE OCTCHAR * * CONVERTS A NUMBER FROM A NUMERIC FORMAT INTO * AN OCTAL CHARACTER REPRESENTATION. * CHARACTERS ARE STORED ONE PER WORD, SIGN FIRST, * ZERO FILL. THE NUMBER IS IN THE A REGISTER, * THE NUMBER OF CHARACTERS TO BE USED IN I2 * AND THE ADDRESS IN WHICH THEY SHOULD BE * STORED IN REGISTER I1. * OCSAVEX ORIG *+1 OCTEMPA ORIG *+1 * OCTCHAR STJ OCEXIT SAVE REGISTERS STX OCSAVEX * STA OCTEMPA SAVE VALUE STA *+1(0:0) SAVE SIGN FOR TESTING ENTA 1 PLUS OR MINUS ENTX 44 PLUS SIGN JAP STORESIGN ENTX 45 MINUS SIGN STORESIGN STX 0,1 FIRST CHARACTER IS SIGN LDA OCTEMPA(1:5) MAGNITUDE ONLY * INC1 0,2 LOW ORDER CHARACTERS FIRST * NXTDIGIT SRAX 5 SHIFT TO X FOR DIVIDE DIV EIGHT OCTAL INCX 30 X HAS DIGIT, CONVERT TO CHAR STX 0,1 STORE CHARACTER * DEC1 1 DEC2 1 NUMBER OF CHARACTERS J2P NXTDIGIT * LDX OCSAVEX RESTORE REGISTERS OCEXIT JMP * * * * * SUBROUTINE PRINTLINE * * PRINT A LINE FOR THE OUTPUT LISTING. * LINES CAN BE OF DIFFERENT TYPES. * THE FORMAT OF EACH LINE IS * DETERMINED BY PRTFMT(OPTYPE). * PLSAVEA ORIG *+1 PLSAVEX ORIG *+1 * PRINTLINE STJ PLEXIT SAVE REGISTERS STA PLSAVEA STX PLSAVEX ST1 PLSAVE1(0:2) ST2 PLSAVE2(0:2) ST3 PLSAVE3(0:2) * LD1 OPTYPE STZ OPTYPE LAST USE OF OPTYPE, RESET LDA PRTFMT,1 PRINT FORMAT FOR THIS TYPE STA PRT * * CHECK IF LOCATION COUNTER TO BE PRINTED. * IF SO PRINT IN COLUMNS 8-13. * LDA PRT(LOC) LOCATION COUNTER JAZ NOLOCPRT LDA PRINTLOC PRINT VALUE ENT1 LINE+7 ENT2 4 FOUR CHARACTERS (PLUS SIGN) JMP OCTCHAR NOLOCPRT EQU * * * * CHECK IF VALUE SHOULD BE PRINTED AS NUMBER * LDA PRT(VAL) VALUE AS NUMBER JAZ NOVALPRT LDA VALUE VALUE HAS VALUE ENT1 LINE+17 COLUMNS 18 - 27 ENT2 10 JMP OCTCHAR NOVALPRT EQU * * * PRINTVALUE AS AN ADDRESS FOR ORIG AND END * LDA PRT(ADR) JAZ NOADRPRT LDA VALUE VALUE HAS ADDRESS ENT1 LINE+23 COLUMNS 24 - 28 ENT2 4 JMP OCTCHAR NOADRPRT EQU * * * CHECK IF VALUE SHOULD BE AN INSTRUCTION * LDA PRT(INST) JAZ NOOPPRT LDA VALUE(0:2) ADDRESS FIELD ENT1 LINE+14 COLUMNS 15 - 19 ENT2 4 JMP OCTCHAR * LDA VALUE(3:3) INDEX FIELD ENT1 LINE+19 COLUMNS 20-22 BUT ENT2 2 JMP OCTCHAR STZ LINE+19 CLEAR SIGN * LDA VALUE(4:4) FIELD SPECIFICATION ENT1 LINE+22 COLUMNS 23-25 BUT ENT2 2 JMP OCTCHAR STZ LINE+22 CLEAR SIGN * LDA VALUE(5:5) OPCODE FIELD ENT1 LINE+25 COLUMNS 26-28 BUT ENT2 2 JMP OCTCHAR STZ LINE+25 CLEAR SIGN * NOOPPRT EQU * * * CARD IMAGE IS ALREADY IN LINE IMAGE IN CARD. * NOW APPEND CARD NUMBER. CARD NUMBER IS * DECIMAL, SO WE USE CHAR, BUT THEN MUST * SUPPRESS LEADING ZEROS. * LDA CARDNUMBER CHAR ENT1 LINE+110 COLUMNS 111-120 * LEADZERO STZ 0,1 BLANK INC1 1 ENTA 0 SLC 1 CMPA CHARO(5:5) CHECK FOR LEADING ZEROS JE LEADZERO * STRCHR STA 0,1 STORE NONZERO CHARACTER JXZ ENDCARDNUM INC1 1 ENTA 0 SLC 1 NEXT CHARACTER JMP STRCHR ENDCARDNUM EQU * * JBUS *(LP) * * NOW OUTPUT LINE MUST BE PACKED INTO BUFFER * ENT1 119 120 CHARACTERS ENT2 24 NUMBER OF WORDS * PCKWORD ENT3 5 ENTX 0 PACK BACKWARDS * PCKCHAR LDA LINE,1 NEXT CHARACTER STZ LINE,1 BLANK FOR NEXT TIME DEC1 1 SRC 1 DEC3 1 J3P PCKCHAR GET NEXT CHARACTER DEC2 1 STX PRTBUF,2 J2P PCKWORD * OUT PRTBUF(LP) * LDA LINERROR NUMBER OF ERRORS THIS LINE ADD NERROR TOTAL NUMBER OF ERRORS STA NERROR STZ LINERROR * LDA PLSAVEA LDX PLSAVEX PLSAVE1 ENT1 * PLSAVE2 ENT2 * PLSAVE3 ENT3 * PLEXIT JMP * *

Error routines

Any system program must check its input carefully for possible errors, and an assembler has many opportunities for errors. Thus, errors must be checked for continuously, throughout the program.

When an error is detected, it should be signalled to the programmer somehow. For this assembler, we have elected to place flags in the first five characters to indicate any errors. If the first five characters of an output line are blank, no errors were found. If errors are found, the type of error can be identified by the character in the first five columns.
M   Multiply-defined label. This label has been previously defined.
L   Bad symbol or label. A symbol exceeds ten characters in length or the label is numeric.
U   Undefined opcode. The symbolic opcode in the opcode field cannot be found in the opcode table.
F   Illegal forward reference. A forward reference occurs in an expression or where it is not allowed (EQU, CON, ORIG or END, or in I or F field).
O   Expression overflow. The expression evaluation resulted in a value which exceeded one MIX word, or exceeded the allowable range of the expression.
S   Illegal syntax. The assembly language statement does not have correct syntax; it is not of the correct form.

NERROR CON 0 TOTAL NUMBER OF ERRORS IN INPUT LINERROR CON 0 NUMBER OF ERRORS THIS LINE * * SUBROUTINE ERROR * * THIS SUBROUTINE PUTS AN ERROR FLAG INTO THE * OUTPUT LINE IMAGE AND COUNTS THE NUMBER OF * ERRORS PER LINE. UP TO 5 ERROR FLAGS WILL BE * SIGNALLED. THE ERROR FLAG IS THE CHARACTER IN * BYTE 1:1 OF THE WORD FOLLOWING THE CALL TO * ERROR. * ERRSAVA ORIG *+1 * ERROR STJ *+3(0:2) SAVE REGISTERS STA ERRSAVA ST1 ERRSAVE1(0:2) * ENT1 * RETURN ADDRESS LDA 0,1(1:1) ACTUALLY FLAG ADDRESS INC1 1 INCREASE ADDRESS ST1 ERREXIT(0:2) STORE REAL RETURN ADDRESS * LD1 LINERROR INCREASE NUMBER OF INC1 1 ST1 LINERROR ERRORS IN THIS LINE * DEC1 5 CHECK IF MORE THAN 5 J1P *+2 STA LINE+5,1 STORE ERROR FLAG IN OUTPUT * LDA ERRSAVA RESTORE REGISTERS ERRSAVE1 ENT1 * ERREXIT JMP * * * * A SEPARATE ERROR ROUTINE FOR EACH TYPE. * MULDEF STJ *+3 MULTIPLY-DEFINED LABELS JMP ERROR ALF M JMP * * BADSYM STJ *+3 BAD SYMBOL (TOO LONG, NUMERIC) JMP ERROR ALF L L FOR LABEL JMP * * UNOP STJ *+3 UNDEFINED OPCODE JMP ERROR ALF U JMP * * ILLFOR STJ *+3 ILLEGAL FORWARD REFERENCE JMP ERROR ALF F JMP * * EXPOV STJ *+3 OVERFLOW IN EXPRESSION JMP ERROR ALF O JMP * * ILLSYN STJ *+3 ILLEGAL SYNTAX JMP ERROR ALF S JMP * * *

Label definition

One additional routine which is useful is DEFINELAB. This routine uses DEFINESYM to define a label for an assembly language statement. It does not simply call DEFINESYM, however, but must first call SEARCHSYM to check for multiply-defined symbols, or symbols which have been forward referenced. The label is taken out of the variable LABEL, where it was put by GETFIELDS. The value of the label is in the A register.

* * SUBROUTINE DEFINELAB * * DEFINE A LABEL IF THERE IS ONE. THE VALUE * OF THE LABEL IS IN THE A REGISTER. FIRST * SEARCH THE SYMBOL TABLE FOR MULTIPLY * DEFINED LABELS OR FORWARD REFERENCES. * DEFINELAB STJ DLEXIT SAVE REGISTERS ST1 DLSAVE1(0:2) ST2 DLSAVE2(0:2) * LD1 LABEL(1:1) CHECK IF THERE IS LABEL J1Z DLSAVE1 NO LABEL * ENT1 SYM MOVE LABEL(2) MOVE LABEL TO SYM * JMP SEARCHSYM SEARCH TABLE FOR LABEL * LD1 SYMINDEX J1N NEWLABDEFN NOT FOUND * LD2 SYMTAB,1(0:1) CHECK SIGN FOR DEFINED/NOT J2P MULDEF MULTIPLY-DEFINED * STA SYMTAB+2,1 SAVE VALUE OF FORWARD REF STZ SYMTAB,1(0:0) JMP DLSAVE1 * NEWLABDEFN JMP DEFINESYM DEFINE NEW LABEL * DLSAVE1 ENT1 * DLSAVE2 ENT2 * DLEXIT JMP * *

Main loop code

With these subroutines to perform most of the processing, the main assembler code is now quite simple. The main loop is

* * MAIN LOOP STARTS HERE * MAIN JMP INITIALIZE * MAINLOOP JMP READCARD * LDA CARD CHECK FOR COMMENT CMPA CHARSTAR(5:5) JE PRTLINE IF COMMENT JUST PRINT * JMP GETFIELDS GET LABEL, OP, OPERAND JMP SEARCHOP SEARCH FOR OPCODE * ST6 PRINTLOC SAVE LOCATION COUNTER FOR PRT * LD1 OPTYPE JMP *+1,1 JUMP TABLE ON TYPE OF OPCODE JMP PRTLINE COMMENT JMP MACHINEOP MACHINE OPCODE: TYPE 1 (A) JMP ORIGOP ORIG: TYPE 2 (B) JMP CONOP CON: TYPE 3 (C) JMP ALFOP ALF: TYPE 4 (D) JMP EQUOP EQU: TYPE 5 (E) JMP ENDOP END: TYPE 6 (F) * * ENDCASE LDA CARD,5 AFTER PROCESSING COLUMN SHOULD JANZ ILLSYN BE BLANK, IF NOT, ERROR * PRTLINE JMP PRINTLINE PRINT LDA ENDASSEM CHECK FOR END JAZ MAINLOOP JMP FINISHUP FINISH ASSEMBLY HLT END MAIN

Machine instructions

Each opcode type has its own section of code to process each assembly language statement. For a machine opcode, this involves first defining the label (JMP DEFINELAB). Then the address expression is evaluated (JMP EXPRESSION) and saved in the 0:2 field of the word to be generated (VALUE). If the next character is a comma, an index field is evaluated; if the next character is a left parenthesis, a field specification is evaluated. Finally, the word is generated.

MACHINEOP ENTA 0,6 VALUE OF LABEL IS * JMP DEFINELAB * ENT1 HL2BYTE STZ NOFORWARD FORWARDS ALLOWED JMP EXPRESSION GET ADDRESS PART STA VALUE(0:2) * LDA CARD,5 CARD(COLUMN) CMPA CHARCOMMA(5:5) IS IT COMMA? JNE NOIPART * IPART ENT1 HLBYTE I FIELD IS ONE BYTE JMP EXPRESSION STA VALUE(3:3) NOIPART EQU * * LDA CARD,5 CMPA CHARLEFTP(5:5) CHECK FOR ( FOR FIELD JNE NOFPART ENT1 HLBYTE F FIELD IS ONE BYTE JMP EXPRESSION STA VALUE(4:4) * LDA CARD,5 CHECK FOR TRAILING ) CMPA CHARRIGHTP(5:5) JNE ILLSYN INC5 1 SKIP OVER ) NOFPART EQU * * JMP GENERATE JMP ENDCASE *

EQU statements

EQU statements are even simpler. The expression is evaluated, and the label defined to have this value.

* EQUOP ENT1 HLWORD EQU VALUE CAN BE ANY WORD JMP EXPRESSION STA VALUE JMP DEFINELAB DEFINE LABEL JMP ENDCASE *

ORIG statements

ORIG statements are almost as simple as EQUs.

* ORIGOP ENTA 0,6 DEFINE LABEL FIRST JMP DEFINELAB * ENT1 HLADDR ORIG VALUE IS ADDRESS JMP EXPRESSION * STA VALUE (FOR PRINT) LD6 VALUE SET LOCATION COUNTER * JMP ENDCASE

ALF statements

ALF statements are processed by simply picking up the five characters in columns 17 through 21 and packing them into VALUE.

ALFOP ENTA 0,6 DEFINE LABEL JMP DEFINELAB * ENT1 16 COLUMN 17 LDA CARD,1 ENT2 4 FOUR MORE * NXTALFCHAR INC1 1 SLA 1 SHIFT OVER CHARACTERS ADD CARD,1(5:5) AND ADD IN NEXT DEC2 1 ONE LESS CHARACTER J2P NXTALFCHAR * STA VALUE JMP GENERATE OUTPUT WORD JMP ENDCASE *

CON statements

The CON statement is perhaps the most complicated of the pseudo-instructions. It consists of evaluating the first expression, and checking for a field specification. If a field is given, the value is stored in that field. This repeats until no more expressions are found. The major complexity in the code comes from the necessity of checking that the field specification given is valid.

CONTEMPVAL ORIG *+1 CONTEMP ORIG *+1 TEMPORARY VARIABLES * CONOP ENTA 0,6 JMP DEFINELAB DEFINE LABEL FIRST * STZ VALUE DEFAULT VALUE * NEXTCON ENT1 HLWORD EXPRESSION ARE ANY VALUE JMP EXPRESSION * LDX CARD,5 CHECK FOR (FIELD) CMPX CHARLEFTP(5:5) JE CONF STA VALUE NO FIELD IS (0:5) JMP NOF * CONERROR JMP ILLSYN ERROR IN CON JMP NOFSTORE * CONF STA CONTEMPVAL SAVE EXPRESSION ENT1 HLBYTE UNTIL FIELD EVALUATED JMP EXPRESSION * STA FMOD(4:4) CHANGE FIELD OF STORE * SRAX 5 CHECK IF 0 ≤ L ≤ R ≤ 5 DIV EIGHT L IN A, R IN X CMPX FIVE JG CONERROR R > 5 STA CONTEMP CMPX CONTEMP JL CONERROR L > R LDA CONTEMPVAL EXPRESSION FMOD STA VALUE(0) FIELD TO BE CHANGED * NOFSTORE LDA CARD,5 CHECK FOR ) CMPA CHARRIGHTP(5:5) JNE ILLSYN SYNTAX ERROR INC5 1 NOF EQU * * LDA CARD,5 CHECK FOR COMMA CMPA CHARCOMMA(5:5) JE NEXTCON IF SO, DO NEXT PART * JMP GENERATE JMP ENDCASE

END statements

The last pseudo-instruction processed will be an END statement. First, any label is defined. Then the starting address is evaluated and an end-of-assembly flag is set.

ENDASSEM CON 0 STARTADD CON 0 STARTING ADDRESS * * ENDOP ENTA 0,6 JMP DEFINELAB DEFINE LABEL * ENT1 HLADDR STARTING ADDRESS IS ADDRESS JMP EXPRESSION STA VALUE FOR PRINT STA STARTADD FOR LOADER STJ ENDASSEM(4:5) SET NONZERO TO STOP JMP ENDCASE *

Initialization and termination

With most of the assembler written, we can easily see what needs to be initialized. The first card input should be started, the loader tape should be rewound, and the location counter set to zero. All other variables were initialized by CON statements when they were declared.

INITIALIZE STJ INITEXIT IN INBUF(CR) READ FIRST CARD IOC 0(TAPE) REWIND TAPE ENT6 0 SET LOCATION COUNTER TO ZERO INITEXIT JMP * *
In addition, we need to define the following symbols and constants.
* TAPE EQU 0 TAPE UNIT NUMBER FOR LOADER CR EQU 16 INPUT CARD READER LP EQU 18 OUTPUT LINE PRINTER * MAXSYMBOL EQU 250 MAXIMUM NUMBER OF SYMBOLS * CHARA ALF AAAAA CHARACTER CONSTANT CHARO ALF 00000 CHAR9 ALF 99999 CHARSTAR ALF ***** CHARLEFTP ALF ((((( CHARRIGHTP ALF ))))) CHARCOMMA ALF ,,,,, TWO CON 2 FIVE CON 5 EIGHT CON 8 CHAINEND CON 4095 *

Termination is more complicated. First, any unfinished loader block should be finished. Then the symbol table needs to be searched. Any undefined symbols are defined, and fix-up codes are output to the loader tape for symbols which were forward referenced. Finally, the start address is output and the last tape buffer written to tape.

ASMDEF ALF PRINT LINE FOR SYMBOLS ORIG *+1 WHICH ARE DEFINED BY ASSEMBLER ALF ORIG *+5 ALF ORIG ALF *+1 ORIG *+3 ALF SYMBO ALF L DEF ALF INED ALF BY AS ALF SEMBL ALF ER ORIG *+5 * * * SUBROUTINE FINISH UP * * WHEN AN END CARD IS READ, WE MUST FINISH * LAST LOADER BLOCK, OUTPUT FORWARD REFERENCE * FIX-UP COMMANDS, AND START ADDRESS. * FINTEMP ORIG *+5 FOR CHARACTERS FOR LOCATION * FINISHUP STJ FINEXIT JMP FINISHBLCK FINISH LAST BLOCK * * CHECK FOR UNDEFINED SYMBOLS AND DEFINE THEM * ENTA 1 STA VALUE(0:3) TYPE FIELD FOR LOADER FIX-UPS * ENT3 0 COUNTER TO SYMBOL TABLE NEXTSYM1 LDA SYMTAB,3 LOAD DEFINED FLAG (SIGN) JAP DENFDSYM IF DEFINED * STA LABEL(1:5) STA ASMDEF+6(1:5) FOR PRINT LDA SYMTAB+1,3 SECOND PART OF NAME STA LABEL+1 STA ASMDEF+7 ENTA 0,6 VALUE FOR DEFINING SYMBOL JMP DEFINELAB INC6 1 INCREASE * FOR NEXT * ENTA -1,6 ENT1 FINTEMP TEMPORARY FOR OCTAL CHARACTERS ENT2 4 JMP OCTCHAR CONVERT * TO OCTAL FOR PRINT * LDA FINTEMP STA ASMDEF+1(3:3) LDA FINTEMP+1 STA ASMDEF+1(4:4) LDA FINTEMP+2 STA ASMDEF+1(5:5) LDA FINTEMP+3 STA ASMDEF+2(1:1) LDA FINTEMP+4 STA ASMDEF+2(2:2) INSERT * OCTAL IN PRT LINE * OUT ASMDEF(LP) PRINT ASSEMBLER DEFINED JBUS *(LP) * DENFDSYM EQU * SYMBOL IS DEFINED * LDA SYMTAB+3,3(4:5) CHECK FOR CHAIN CMPA CHAINEND JE UPSYM1 IF NOT, GO ON TO NEXT * STA VALUE(4:5) PREPARE VALUE FOR LOADER LDA SYMTAB+2,3(4:5) ADDRESS OF SYMBOL STA VALUE(0:2) LDA VALUE JMP TAPEOUT OUTPUT FIX-UP COMMAND * UPSYM1 INC3 4 CMP3 NSYMBOL CHECK END OF TABLE JL NEXTSYM1 * LDA STARTADD STA VALUE(0:2) STARTING ADDRESS ENTA 2 STA VALUE(3:3) TYPE FOR STARTING ADDRESS STZ VALUE(4:5) BYTES 4:5 ZERO LDA VALUE JMP TAPEOUT OUTPUT STARTING ADDRESS * JMP FINISHBUF CLEAR LAST BUFFER TO TAPE * FINEXIT JMP * *

Evaluation

The MIXAL assembler which we have presented in this chapter is a relatively simple one, despite its length. A number of improvements can be made. However, the basic structure of the assembler is such that most improvements can be made by only local modifications to a few subroutines. The entire assembler will not need to be rewritten.

  1. By changing only the GETFIELDS routine, the assembler could be changed from a fixed-format to a free-format assembler.
  2. Literals were not included. However, the introduction of literals would involve only changing EVALSYM, to recognize a literal by its preceding and terminating equal signs, adding code to FINISHUP to define them and a decision on whether to store literals in the symbol table or in a separate literal table. Notice that the definition of literals as being limited to no more than nine characters would allow them to be stored in the symbol table, and distinguished by their leading equal sign. The trailing equal sign is not needed.
  3. Local symbols (iH, iF, iB) would similarly require only a change in EVALSYM and DEFINESYM plus a local symbol table.
  4. Additional error checking should be added to consider the problems of symbol table overflow, additional types of syntax errors, and programs which attempt to use more than 4000 words of memory.
  5. The binary and floating point opcodes could be added.
  6. The input program, output listing, and loader tape routines could be modified to allow input or output from tape, disk, or drum and to produce loader output on cards, tape, disk, or drum as desired by the programmer.
  7. The FINISHUP subroutine could be extended to print a symbol table following the assembly program listing. With a non-trivial amount of new code, a cross reference listing could be added.

These are just a few of the changes which can be, or should be made. One of the major evaluation criteria is the ability of a programmer, other than its author, to understand a program, and be able to correctly modify it.

Another evaluation criteria is performance. This can be measured in terms of either memory size or speed. For a 4000-word MIX memory, the assembler occupies about 1600 words for its code and opcode table. This leaves over half of memory for the symbol table, allowing the symbol table to hold a maximum of about 600 symbols. Reducing the size of a symbol table entry to 3 words would increase this to 800 symbols.

Since card input, listing output, and tape output are all double buffered, and overlap most of the computation, the speed of the assembler is bounded mainly by the speed of the I/O devices. The assembler took 80.9 seconds to assemble itself (1579 cards) on a 1-microsecond-per-time-unit MIX computer, with a 1200-card-per-minute card reader, and a 1200-line-per-minute line printer. Of this time, 76.1 seconds were spent waiting for I/O devices. This means that only 5.8 percent of the total execution time was needed for the assembly of the program, the remainder of the time is all I/O. Assemblers are typically very I/O bound programs.

This simple measurement means that it would be very difficult to significantly speed up the assembler. A number of minor modifications can be made to speed up the assembler (such as a better symbol table search algorithm, better use of registers, and not saving registers which are not needed). But the effect of these changes on the total processing time would be minimal at best, and hence they are probably not worth the bother.

8.4 SUMMARY

The basic function of an assembler is to translate a program from assembly language into loader code for loading. The major data structures which assist in this translation are the opcode table, which is used to translate from symbolic opcode to numeric opcode, and the symbol table, which is used to translate programmer-defined symbols to their numeric values.

With these major data structures and subroutines to search and enter these tables, as necessary, to evaluate symbols, to evaluate expressions, to print lines, to handle errors, to buffer and block loader output, to read cards and get symbols, to initialize and terminate the assembler, the code for assembling each type of assembly language statement is relatively easy to write and understand.

The major problem for an assembler is forward references. These can be handled by either a two-pass assembler or a one-pass assembler. A one-pass assembler requires the loader to fix-up forward references.

Assemblers are a major topic for books on systems programming, and chapters on assemblers are included in Stone and Siewiorek (1975), Graham (1975), Hsiao (1975), and Donovan (1972). Gear (1974) and Ullman (1976) also discuss assemblers. The book by Barron (1969) has an extensive description of assemblers and how they work. For a look at the insides of a real assembler, try the Program Logic Manual for the assembler for the IBM 360 computers (IBM order number GY26-3716).

EXERCISES

  1. Hand assemble the following MIXAL program. Use two passes. First construct the symbol table, then assemble the program into octal machine language.
    MAXSYM EQU 100 SYMBOL ORIG 2*MAXSYM+2 * SEARCH STJ ENDSEAR ST2 SAVSEAR(0:2) STA 0,1 LD2 1,1 INC2 2,2 2H EQU * DEC2 2 CMPA 0,1:2 JNE 2B ENT1 0,2 SAVSEAR ENT2 * ENDSEAR JMP * *
  2. Hand assemble the above MIXAL program in one pass. Show the assembled code in octal. Chain all forward references in the address fields of instructions with forward references. Use +7777 as the end of the list of forward references to a symbol. Explain any errors discovered. Give the symbol table at the end of assembly.
  3. A friend of mine complained that she had two MIX routines. One was prepared for fixed-format MIXAL; the other for free-format MIXAL. She wants to combine the two subroutines in one program. Which will have to be reformatted for a fixed-format assembler? Which will need reformatting for a free-format assembler?
  4. A free-format assembler will often include the non-free-format restriction that any label in the location field must begin in column 1. Can this restriction be relaxed? What are the problems?
  5. What would be the result of using only one table for both the symbol table and the opcode table?
  6. What information is kept in the symbol table of a one-pass assembler?
  7. Assume that an assembler uses a linear search for a symbol table. Entries into the symbol table are made by adding to the end of the table. Symbols are added to the table whenever they occur in the label field. Multiply-defined symbols are allowed. If the first defined value of a multiply-defined symbol is wanted, how should the search routine search? What if the most recently defined value is desired?
  8. Suppose it were very easy to read the time of day. Would this be useful in computing a hash function?
  9. For a two-pass assembler, which of these functions must be done on pass 1, which must be done on pass 2, which can be done on either, and which must be done on both?
    1. construct the symbol table.
    2. output code to the loader.
    3. scan for the opcode.
    4. print the listing.
    5. process ORIG pseudo-instructions.
    6. search the symbol table.
    7. update the program location counter.
    8. treat an LDA different from an STA.
    9. process a CON pseudo-instructions.
    10. process the label field.
  10. What can be done on the first pass of a two-pass assembler if a symbol in the location field has already been entered into the symbol table (i.e., it is multiply-defined)?
  11. On which pass of a two-pass assembler would you detect the following types of errors?
    1. undefined symbol
    2. undefined opcode
    3. multiply-defined symbol
    4. ORIG to a forward reference
    5. expression value too large for address field
  12. Name three techniques for handling forward references in an assembler?
  13. An assembler gives a listing of the symbol table before listing the assembly program and assembled code. How many passes is it?
  14. Suppose you had available (already written) both a one-pass and a two-pass assembler for a particular machine, and suppose that any program accepted by one was also acceptable to the other. Give one advantage you might expect the one-pass assembler to have, and one advantage you might expect the two-pass assembler to have.
  15. A one-pass assembler cannot handle forward references in expressions. Thus, the following is illegal in MIXAL.
    X EQU Y+1 Y EQU 0
    A two-pass assembler can do this without problems, since on pass 1 it finds the value of Y, and on pass 2 it can compute the value of X, enter it in its symbol table, and use it in the remainder of the program. Does this mean that a two-pass assembler has no restrictions on forward references? If it does not, show an example to demonstrate a problem which even a two-pass assembler cannot handle.
  16. An assembler is assembling for a relocating linking loader. What internal tables and data structures must it have that an absolute assembler need not have?
  17. Suppose we want to extend MIXAL by adding the following new symbolic opcodes.
    LDN* * = A, 1, 2, 3, 4, 5, 6, X
    These new opcodes should be treated like the LDiN opcodes. That is, both the opcodes should generate the same code and accept the same form of operand. How would we need to modify the existing MIXAL assembler to allow these new opcodes?
  18. Consider a restricted MIX machine with only the A register and X register (no index registers). Eliminate all indexing and partial field considerations and all mnemonics involving index registers. Write a fixed-format assembler for this restricted assembly language. Allow only symbols or constants (no expressions) for operands.