Computer Programs for Detecting and Correcting Spelling Errors

Computer Programs for Detecting and Correcting Spelling Errors

James L. Peterson

Department of Computer Sciences
The University of Texas at Austin
Austin, Texas 78712

September 1980

This paper has been published. It should be cited as

James L. Peterson, ``Computer Programs for Detecting and Correcting Spelling Errors'', Communications of the ACM, Volume 23, Number 12, (December 1980), pages 676-687.

ACM Copyright Notice

Copyright © 1978 by the Association for Computing Machinery, Inc. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Publications Dept, ACM Inc., fax +1 (212) 869-0481, or permissions@acm.org.

ACM did not prepare this copy and does not guarantee that is it an accurate copy of the author's original work.

ABSTRACT

With the increase in word and text processing computer systems, programs which check and correct spelling will become common. Several of these programs exist already. We investigate the basic structure of these programs and their approaches to solving the programming problems which arise when such a program is created. This paper provides the basic framework and background necessary to allow a spelling checker or corrector to be written.

1. Introduction

Computers can assist in the production of documents in many ways. Formatting and editing aid in producing a high quality document. Appropriate software can be devised to improve the appearance of output on typewriters and line printers. Use of sophisticated output devices, such as computer driven phototypesetters, xerographic printers, or electrostatic printers, can produce documents of outstanding quality.

Computers are being extensively used for document preparation. The systems used are typically a time-shared computer system with their file systems, text editors and text formatting programs. The author can enter the document directly on-line with the text editor, storing it in the computer's file system. The document has formatting commands indicating appropriate margins, spacing, paging, indenting, and so on. The text formatter interprets this to produce a file suitable for printing. The text file can be modified using a text editor which is generally an interactive program, while the formatter is more often a batch program.

This method of document preparation has created a market for word processing systems. This market will grow as more people apply computers to improve their document preparation.

Once this mode of document preparation is adopted, several new operations on text files become possible. Our interest here is in the possibility of analyzing documents for spelling errors. Several systems can analyze a text file for potential spelling errors, pointing out probable correct spellings. We expect spelling programs to be a standard part of all future text processing systems. Several companies, including IBM, have recently announced systems including spelling programs.

There are two types of spelling programs: spelling checkers and spelling correctors. The problem for a spelling checker is quite simple: Given an input file of text, identify those words which are incorrect. A spelling corrector both detects misspelled words and tries to find the most likely correct word. This problem has elements of pattern recognition and coding theory. A solution is possible only because of redundancy in the English language.

Each word can be considered a point in a multi-dimensional space of letter sequences. Not all letter sequences are words, and the spelling checker must classify each candidate word (called a token) as a correctly spelled word or not. We want to minimize errors, both of Type I (saying a correctly spelled word is incorrect) and errors of Type II (saying an incorrectly spelled word is correct). This involves a search of the word space to select the nearest neighbor(s) of the incorrectly spelled word. (This view of words in a multi-dimensional space can lead to peculiar representations of words such as in [Giangardella, et al 1967] in which each word is represented as a vector in a 2-dimensional space.)

Spelling errors can be introduced in many ways. The following three are probably most important.

Author ignorance. Such errors would lead to consistent misspellings, probably related to the differences between how a word sounds and is spelled.

Typographical errors on typing. These would be less consistent but perhaps more predictable, since they would be related to the position of keys on the keyboard, and probably result from errors in finger movements. Studies have shown that large data bases may have significant keyboarding errors [Bourne 1977].

Transmission and storage errors. These would be related to the specific encoding and transmission mechanisms. Early work in spelling correction was often aimed at the problem of optical character recognition (OCR) input [Bledsoe and Browning 1959; Harmon 1962; Vossler and Branston 1964; Carlson 1966; Rosenbaum and Hilliard 1975] and the recognition of Morse code [McElwain and Evans 1962].

Different sources of errors suggest that some algorithms will be better for some errors.

The original motivation for research on spellers was to correct errors in data entry; much early work was directed at finding and correcting errors resulting from specific input devices in specific contexts. Davidson [1962] was concerned with finding the (potentially misspelled) names of passengers for a specific airline flight. Either (or both) name might be misspelled. Carlson [1966] was concerned with names and places in a genealogical data base. Freeman [1963] was working only with variable names and keywords in the CORC programming language while McElwain and Evans [1962] were concerned with improving the output of a system to recognize Morse code. Each of these projects considered the spelling problem as only one aspect of a larger problem, and not as a separate software tool on its own.

Many academic studies have been made on the general problem of string matching and correction algorithms [Damerau 1964; Alberga 1967; Riseman and Ehrich 1971; Wagner and Fisher 1974; Riseman and Hanson 1974; Lowrance and Wagner 1975; Wong and Chandra 1976], but not with the aim of producing a working spelling program for general text.

Recently several spelling checkers have been written for the sole purpose of checking text. Research extends back to 1957, but the first spelling checker written as an application program (rather than research) appears to have been SPELL for the DEC-10, written by Ralph Gorin at Stanford in 1971. This program, and its revisions, are widely distributed today.

The UNIX operating system provides two spelling checkers: TYPO and SPELL. These are different approaches. Another spelling checker has been written for IBM/360 and IBM/370 systems and is in use at the IBM Thomas J. Watson Research Center at Yorktown Heights. [Late addition: ispell ]

This paper reports on an investigation of these programs: how they work and their advantages and disadvantages. As a result, we determined how a spelling checker can be written using existing software technology. On the basis of this study, we then proceeded to write our own spelling checker/corrector using a top-down programming methodology [Peterson 1980].

2. Token Lists

The simplest assistance in checking spelling is an algorithm which lists all distinct tokens in the input. This requires a person to identify any misspellings. A program of this sort is a simple exercise.

Figure 1. Constructing a word list. The words have been ordered by number of occurrences.

Even at this point significant decisions must be made: e.g., what is a potential word (token)? It is a sequence of letters, separated by delimiters. Delimiters include blanks and special characters, such as commas, periods, colons, and so on. In most cases, the classification of a character as a letter or a delimiter is clear, but careful thought should be given to the interpretation of numbers, hyphens, and apostrophes.

Tokens which include numbers are often ignored by spelling checkers. This includes both tokens which are totally numbers ("1978", "10", and so on), and tokens which are part number and part letter ("3M", "IBM360", "R2D2", and so on). Whether these sorts of tokens should be considered by a speller might be best left as an option to the user.

Hyphens are generally considered delimiters; each part of the hyphenated token is considered a separate token. This follows from the normal use of a hyphen to construct compound words (such as "German-American", "great-grandmother", "four-footed", and so on). Hyphenation at the end of a line is best not allowed in the input to a spelling checker, since it should be well known that it is not possible, in general, to undo such hyphenation correctly.

Apostrophes are generally considered to be letters for the purpose of spelling correction. This is because of their use as an indicator of omitted letters (as in "don't" and "I'd"), and possessives ("King's" and "John's") An apostrophe at the beginning or end of a token might be considered a delimiter, since they are sometimes used as quote markers (as in: "the token 'ten' ..."), but plural possessives also have apostrophes at the ends of words ("kids'").

Another concern is for the case of letters in a token. Most frequently all letters will be in lower case, although the first letter is capitalized at the start of a sentence. Since "the" is normally considered the same word as "The", most spellers map all letters into one case (upper or lower) for analysis. A flag may be associated with the token to allow proper capitalization for spelling correctors, since we want the capitalization of corrected tokens to match the capitalization of the input token. Thus "amung" should be corrected to "among" while "Amung" is corrected to "Among".

The case problem is actually more complex because of the existence of words, particularly in computer literature, which are strictly upper case (such as "FORTRAN" and "IBM") and the widespread use of acronyms ("GCD" and "CPU") which are upper case. Should "fortran" or "ibm" be considered misspellings or separate words? What should be the cases of the letters used to correct an input token such as "aMunG"? These problems are minor but need thought in designing spelling correctors. Spelling checkers need only report that "aMunG" is not correctly spelled in any case; this allows all tokens to be mapped into a uniform case for checking.

A separate problem is a possible decision that tokens which consist solely of upper case characters should (optionally) not even be considered by a spelling corrector since they will most likely be proper names, variable names, or acronyms, none of which would typically be understood by a spelling program anyway.

Once these decisions are made, it is relatively straightforward to create a list of all distinct tokens. The basic algorithm is:

Initialize. Get the name of the input file and the output file, open them, and set all variables to their initial state.

Loop. While there are still tokens in the input file, get the next token, and enter it into an internal table of tokens.

To enter the token, first search to see if it is already in the table. If not, add it to the table.

Print. When all tokens have been read, print the table and stop.

The key to this program is obviously its internal table of tokens. We need to be able to search the table quickly and find a token (if it exists), or determine that it is not in the table. We also need to be able to add new tokens to the table, though we need not delete them. The speed of the program is most dependent on the search time. A hash table approach would seem the most reasonable for the table.

The table would be most useful if it were sorted. Sorting can be either alphabetically or by frequency. Attaching a frequency count to each table entry provides the number of uses of each token. This can speed the process by searching higher frequency items first (a self-modifying table structure); it also may give help in determining misspellings. Typographical errors are generally not repeated, and so tokens typed incorrectly will tend to have very low frequency. Any token with low frequency is thus suspect. Consistent misspellings (due to the author not knowing the correct spelling) are not as easily found by this technique.

This form of spelling program was used to check spelling for the text, Computer Organization and Assembly Language Programming, (Academic Press, 1978). Although it was quite tedious to check the resulting list of words (5,649 unique words out of 156,134 tokens), the result was a book with no (known) spelling errors.

3. TYPO

An extension of this idea is the basis of the TYPO program on UNIX [Morris and Cherry 1975; McMahon, et al 1978]. This program resulted from research on the frequency of two letter pairs (digrams) and three letter triples (trigrams) in English text. If there are 28 letters (alphabetic, blank, and apostrophe) then there are 28 letters, 28² (= 784) digrams and 28³ (= 21,952) trigrams. However, the frequency of these digrams and trigrams varies greatly, with many being extremely rare. In a large sample of text, only 550 digrams (70%) and 5000 trigrams (25%) actually occurred. If a token contains several very rare digrams or trigrams, it is potentially misspelled.

The use of digrams and trigrams to both detect probable spelling errors and indicate probable corrections has been one of the most popular techniques in the literature [Harmon 1962; McElwain and Evans 1962; Vossler and Branston 1964; Carlson 1966; Cornew 1968; Riseman and Ehrich 1971; Riseman and Hanson 1974; Morris and Cherry 1975; McMahon, et al 1978].

TYPO computes the actual frequency of digrams and trigrams in the input text and a list of the distinct tokens in the text. Then for each distinct token, an index of peculiarity is computed. The index for a token is the root-mean-square of the indices for each trigram of the token. The index for a trigram xyz given digram and trigram frequencies f(xy), f(yz) and f(xyz) is [log(f(xy)-1) + log(f(yz)-1)]/2 - log(f(xyz)-1). (The log of zero is defined as -10 for this computation.) This index is a statistical measure of the probability that the trigram xyz was produced by the same source that produced the rest of the text.

The index of peculiarity measures how unlikely the token is in the context of the rest of the text. The output of TYPO is a list of tokens, sorted by index of peculiarity. Experience indicates that misspelled tokens tend to have high indices of peculiarity, and hence appear towards the front of the list. Errors tend to be found since (a) misspelled words are found quickly at the beginning of the list (motivating the author to continue looking), and (b) the list is relatively short. In a document of ten thousand tokens only approximately 1500 distinct tokens occur. This number is further reduced in TYPO by comparing each token with a list of over 2500 common words. If the token occurs in this list, it is known to be correct and is not output thereby eliminating about half the distinct input tokens, producing a much shorter list.

4. Dictionary Look-up

The use of an external list (a dictionary) of correctly spelled words is the next level of sophistication in spelling checkers. The spelling checker algorithm using a dictionary is:

Initialize.

Build List. Construct a list of all distinct tokens in the input file.

Search. Look up each token of the list in the dictionary.

If the token is in the dictionary, it is correctly spelled.

If the token is not in the dictionary, it is not known to be correctly spelled, and is put on the output list.

Print. Print the list of tokens which were not found in the dictionary.

If the list of input tokens and the dictionary are sorted, then only one pass through the list and dictionary is needed to check all input tokens, providing a fast checker.

The major component of this algorithm is the dictionary. Several dictionaries exist in machine readable form. It may be most practical to use the output of the spelling checker to create the dictionary: i.e., a list of tokens which are not in the dictionary. Starting with a small dictionary, many correctly spelled words will be output by the checker. By deleting spelling errors and proper names from this list, we have a list of new words which can be added to the dictionary. A new dictionary can easily be created by merging the two.

The size of a reasonable dictionary can be estimated by the existence of books which list words, mainly for use by secretaries; e.g., [Leslie 1977]. These usually have from twenty to forty thousand words.

One must be careful not to produce too large a dictionary. A large dictionary may include rare, archaic or obsolete words. In technical writing, these words are most likely to be the result of a misspelling. Another problem with most large dictionaries is the tendency to include incorrectly spelled words. It is neither interesting nor pleasant to verify the correct spelling of 100,000 words.

A dictionary of 10,000 words should be quite reasonable for a small community of users. Since the average English word is about eight characters long, this requires about 100K bytes of storage. Controls will be needed on who may modify the dictionary.

The need for controls creates the post of system dictionary administrator. This person would be responsible for maintaining the shared system dictionary: the addition of new words and the selective deletion of those infrequently used.

One approach to limiting the size of the system dictionary is to create multiple dictionaries, one for each major topic area. This leads naturally to the concept of a common base dictionary with multiple local dictionaries of special words specific to a given topic. When a particular file is to be checked, a temporary master dictionary is created by augmenting the common base with selected local dictionaries. These local dictionaries can be created and maintained by the dictionary administrator or by individual users reflecting their own vocabularies.

The above algorithm for dictionary look-up is a batch algorithm. It is essentially the algorithm used for the UNIX and IBM spelling programs. There are some problems with this approach. First, a substantial real-time wait may be required while the program is running; this can be quite annoying to a user at an interactive console. Second, the output list of misspelled and unknown tokens lacks context. It may be difficult to find these tokens in the text using some text editors due to differences in case (for editors which pay attention to upper/lower case) and search difficulties (the file may be quite large, and/or the misspelled token may be a commonly occurring substring in correctly spelled words).

5. Interactive Spelling Checkers

Problems such as those just discussed can be easily corrected with an interactive spelling checker. An interactive checker uses the following basic algorithm.

Initialize. Possibly asking the user for mode settings and the names of local dictionaries.

Check. For each input token, search the dictionary.

If the token is not in the dictionary, ask the user what to do.

This is the approach of the DEC-10 SPELL program and the approach which we used with our own spelling corrector.

5.1. Modes of Operation

Several modes of operation may be available in an interactive checker. The interaction with the user may be verbose (novice users) or terse (experts familiar with the program). Local dictionaries of specialized terms may be temporarily added to the large shared system dictionary. A training mode (where all tokens are defined to be correctly spelled) may be useful for constructing local dictionaries.

The options available to the user are determined by user needs. The following list indicates some options:

Replace. The unknown token is taken to be misspelled and is to be replaced. The token is deleted and a correctly spelled word is requested from the user.

Replace and Remember. The unknown token is replaced by a user specified word, and all future uses of this token are also replaced by the new word.

Accept. The token is to be considered to be correct (in this context) and left alone.

Accept and Remember. The unknown token is correct and to be left alone. In addition, all future uses of this token are correct in this document; the user should not be asked about them again.

Edit. Enter an editing submode, allowing the file to be modified in its local context.

Use of a CRT as the terminal, lets the spelling checker display the context of an unknown token: at least the line containing the token should be displayed. On terminals with sufficient bandwidth and features, a larger context of several lines or an entire page can be displayed with the token emphasized by brightness, blinking, size or font.

Figure 2. An interactive spelling program as it might appear on a CRT terminal. Here, a page of text with a highlighted misspelled word, along with a list of corrective actions, are displayed.

5.2. The Dictionary

The performance of an interactive spelling checker is important. The user is waiting for the checker and so the checker must be sufficiently fast to avoid frustration. Also, unlike batch checkers which need to look up each distinct token once (sorting them to optimize the search), an interactive checker must look up each occurrence in the order in which they are used. Thus, the interactive checker must do more work.

(It would be possible to batch small portions, up to say a page, of the input file. A list of all distinct tokens on a page would be created and the dictionary searched for each token. Each token which was not found would be presented to the user. This is repeated for each page in the input file. The main problems are with response time and in keeping track of the context of each usage of a token for display.)

The structure of the dictionary is thus of great importance. It must allow very fast searches. The "correct" structure must be determined for the configuration of the computer system. Such factors as memory size, file access methods, and the existence of virtual memory can be significant factors determining appropriate structures. If memory is large enough, the entire dictionary can be kept in memory, making operations easier and faster. If memory is virtual as opposed to physical, however, the dictionary structure should minimize page faults while searching. If memory is too small, a two-layer structure is needed, keeping the most frequently reference words in memory, while accessing the remainder with reads as necessary.

Algorithms and data structures for text processing may be tailored to the properties of the English language. Several studies have created large computer files of text which can be analyzed to determine the statistical properties of the language. The most commonly cited collection is the Brown Corpus [Kucera and Francis 1967] created at Brown University. The Brown Corpus contains 1,014,232 total tokens with 50,406 distinct words. The longest word is 44 characters, while 99% are 17 characters or less in length; the average word length is 8.1 characters. It is well known, however, that short words are used more frequently than long words. The ten most common words are "the", "of", "and", "to", "a", "in", "that", "is", "was", and "he". Because of this usage of shorter words, the average token length, when weighted by frequency of use, is only 4.7 characters, with 99% having 12 characters or less.

Knuth [1973] is a good source for different data structures and search algorithms. However, there is no "best" algorithm; each machine and each system makes different demands on the dictionary data structure. A few approaches are now outlined.

The DEC-10 SPELL program uses a hash chain table of 6760 entries. The hash function for a token is the first two letters (L1 and L2) and the length (WL) of the token (2, 3, ..., 10, 11 and over) as (L1*26 + L2) * 10 + min(WL-2,9). Each hash table entry is a pointer to a chain of words all of which have the same first two letters and the same length. (This program assumes all tokens of length 1 are correctly spelled.)

Another suggested structure for the dictionary is based on tries [Knuth 1973; Partridge and James 1974; Muth and Tharp 1977]. A large tree structure represents the dictionary. The root of the tree branches to 26 different nodes, one for each of the possible first letters of words in the dictionary. Each of these nodes would branch according to the possible second letters, given the known first letter associated with the node. These new nodes branch on possible third letters, and so on. A special bit would indicate the possible ends of words. This structure is particularly appropriate for small computer systems in which a token must be compared with a dictionary entry one character at a time. To search for a token of WL characters requires following the tree WL levels down and checking the end-of-word bit.

Alternative structures are generally based on the frequency of use of words. It is known that usage is extremely variable, and a small number of words are used most often. Hence, we want to structure our search strategy so that the most common words are searched first. One suggestion [Sheil 1978] is for a two-level search strategy.

In the two-level search strategy, the token would first be compared with entries in the small in-core table of most frequently used words. If the token is not in this table, a search would be made of the remaining words. This larger table might be stored on secondary disk or drum storage, or in a separate part of virtual memory, requiring longer access and search times. If this table is on secondary storage, access to a block can be accomplished by an in-core index.

From the Brown Corpus, we can determine the following statistics for the percentage of most common English words. The size of the table determines the percentage of tokens found. As the table size is doubled, a larger percentage are found.

Table Size          Percent Tokens Found        Gain
----------          --------------------        ----
   16                   28.8
   32                   36.0                    7.2
   64                   43.2                    7.2
   128                  49.6                    6.4
   256                  55.6                    6.0

Thus over half of the tokens in normal English occur in a vocabulary of only 134 words. A table of this size can easily be stored in memory. A good search strategy can be designed for such a small table; for example, Knuth [1973] gives a hashing algorithm which results in only one comparison to identify any of the 31 most frequently used words in English (35.7% of all tokens).

Figure 3. The possible size and use of three separate dictionaries.

Another improvement in search time can be made by noticing that the total number of distinct tokens in a document tends to be small to moderate (1500 for a 10,000 word document) and often uses words of particular interest to the subject area. This means that for each specific document, there exists a (small) table of words which occur frequently in that document (but not in normal English). Thus, it is wise to build another table of words which have been used in this specific document. This three-table structure would be searched by:

First, the small table of most common English words.

Next, the table of words which have already been used in this document.

Finally, the large list of the remaining words in the main dictionary. If a word is found at this level, add it to the table of words for this document.

Distinct data structures may be appropriate for these three tables since they exhibit the following different properties.

Most common English words -- static, small (100-200).

Document specific words -- dynamic, small to moderate (1000-2000).

Secondary storage dictionary -- static, very large (10,000-100,000).

(You may recognize the latter two data structures as similar to a paging or segmentation scheme. Notice that even if the underlying computer system provides a virtual memory by such a memory management scheme, this data structure will provide improved locality and hence a lower fault rate.)

6. Spelling Correction

An interactive spelling checker can also be helpful in correction. In the simplest case, the checker remembers all tokens which the user indicates should be replaced, and the words with which they are to be replaced. After the first such replacement, future occurrences of the misspelled token can be automatically corrected, though this feature might be undesirable in some instances.

A similar approach is to include common misspellings in the dictionary with the correct spelling. Thus "greatfully" would be tagged as a misspelling of "gratefully" and "prehaps" as a misspelling of "perhaps". This approach has not been included in any current spellers: probably because of the lack of an obvious source of known misspellings and the low frequency of even common misspellings.

Most misspellings can be generated from correct spellings by a few simple rules. Damerau [1964] indicates that 80% of all spelling errors are the result of:

1. transposition of two letters;

2. one letter extra;

3. one letter missing;

4. one letter wrong.

Figure 4. An illustration of an incorrectly spelled token and the correctly spelled tokens which result when certain simple rules are applied.

These four rules are the basis of the correction portion of the DEC-10 spelling corrector. The basic algorithm for correction is:

For each token which is not found in the dictionary, construct a list of all words which could produce this token by one of the above rules. These are candidate spellings.

If the list has exactly one candidate, guess that word, and ask the user if that word is the correct spelling.

If the list contains several words, state this. The user may then request the list and select one as the correct word, or indicate one of the normal options of replace, replace and remember, accept, accept and remember, or edit.

The candidate list is formed by multiple searches of the dictionary. Transpositions can be detected by transposing each pair of adjacent characters in the token, one at a time, and searching for the resulting token. If the resulting token is found, it is a candidate and is added to the list. For a token of WL characters, this requires WL-1 searches of the dictionary. Checking for one extra letter requires deleting each character one at a time, and searching for the resulting tokens. This requires an additional WL searches; most tokens are short (five characters on average) so this need not be expensive. However, no statistics are available on the average length of misspelled words.

The remaining two types of errors (one missing letter and one wrong letter) are difficult to detect. Two strategies are possible. The comparison operation can be defined to include a special match-any character. This means that a search for a missing letter could be achieved by inserting or appending the special match-any character and searching. The characteristics of this type of search are more complex than a simple bit-wise compare and will take more time. Considering the critical nature of performance, two separate searches may be needed: with and without the match-any character feature. A search with a match-any character feature cannot stop when the first word match is found, but must continue, since many words may match the token. This approach requires WL+1 searches for a missing letter error and WL searches for a wrong letter error.

A second approach is to substitute each potential character in each character position and search normally. If there are N potential characters (N approximately 27 depending on the classification of the apostrophe and similar characters), then a missing letter error requires N*WL+1 searches and a wrong letter requires N*WL searches.

A specific dictionary structure may allow improvements. The DEC-10 SPELL program separates the dictionary into 6760 chains by the first two letters and length. For a token of length WL, candidate spellings for one letter wrong and transposition errors are of the same length, candidates for one letter extra errors are of length WL-1, and one letter missing candidates are of length WL+1. This considerably shortens the number of chains which must be searched. The search strategy is best described by quoting from the comments of the program itself.

"There are four kinds of errors that the program attempts to correct:
one wrong letter.
one missing letter.
one extra letter.
two transposed letters.
For a wrong letter in the third or subsequent character, all words that are candidates must exist on the same chain that the suspect token hashes to. Hence, each entry on that chain is inspected to determine if the suspect differs from the entry by exactly one character. This is accomplished by an exclusive-or (XOR) between the suspect and the dictionary. Then a JFFO instruction selects the first nonzero byte in the XOR. This byte is zeroed and if the result is all zero then the dictionary word differs from the suspect in only one letter. All such words are listed at CANDBF, where they can be inspected later.

For a wrong letter in the first or second character, the program tries varying the second letter through all 26 possible values, searching for an exact match. Then all 26 possible values of the first letter are tried, after setting the second letter to its original value. This means that 52 more chains are searched for possible matches.

To correct transposed letters, all combinations of transposed letters are tried. There are only WL-1 such combinations, so it is fairly cheap to do that.

To correct one extra letter, WL copies of the token are made, each with some letter removed. Each of these is looked up in the dictionary. This takes WL searches.

To correct one missing letter, WL+1 copies of the token are made, each time inserting a null character in a new position in the suspect. The null character is never part of any word, so the suspect token augmented by an embedded null can be thought of as a word with one wrong letter (the null). Then the algorithm for matching one wrong letter is used. If the first character is omitted, all 26 possible first characters are tried. Also, 26 more tokens are formed by varying the second character in case that had been omitted."

Counting, we find that a total of 3*WL+103 chains must be searched, with WL such chain searches requiring a special algorithm to look for (exactly) one wrong character. The dictionary (of say 100,000 words) is split into 6760 chains, so a chain search looks at only on the order of 5 to 50 words. For a token of 8 characters this might mean 3000 comparisons.

Spelling correction is certainly not cheap. But it is not prohibitively expensive nor is it normally needed, only those tokens not found in the dictionary.

Other correction algorithms have been suggested. One theoretically pleasing suggestion [Alberga 1967] is to determine the lengths of common substrings in a token and a dictionary word. This can be used to determine an index of matching. The index of matching can be considered to be the probability that the token resulted from a misspelling of the dictionary word. The word with the highest probability can be selected as the most likely correct spelling. This seems to be a very expensive algorithm.

Similar algorithms have been investigated by Wagner and Fischer [1974] and Lowrance and Wagner [1975].

Additional information can often be used to increase correction accuracy or speed. One important source of additional information is the source of possible errors. For example, errors introduced by OCR input devices are only the substitution of one letter for another; characters are never inserted or deleted by the OCR reader. In addition, it is possible to create a confusion matrix giving the probability p_i,j of reading a character i when the correct character is j. This confusion matrix can be used to indicate which characters should be considered when an input letter is to be corrected. In a more general system, digram and trigram frequencies could be used to limit the number of alternative spellings which must be considered.

The layout of the standard typewriter keyboard may also be of use in correcting errors introduced in keyboarding text.

Another technique attacks the problem of words which the original author spelled incorrectly. The misspellings in this case are often based on word sound. It has been suggested that tokens should be converted to standard phonetic spelling, which would be used to find similarly sounding words in the dictionary. This would help correct such apparent double errors as using "f" for "ph" or "k" for "qu".

On the other hand, correction based upon the four rules used by the DEC-10 SPELL program is quite successful and relatively cheap, leaving the more difficult corrections to the user.

7. Affix Analysis

For performance reasons, it is desirable to keep the dictionary in memory. Thus serious thought has gone into means of reducing the size of the dictionary. In the Brown Corpus of 50,406 words, half were used only once or twice. A dictionary of 8000 words would represent 90% of the words; 16,000 words would yield 95%.

A major reduction may be possible however by considering common affixes. (An affix is either a suffix or a prefix.) Most plural nouns and verbs result from adding an "-s" to the end of the singular form. Other common suffixes include "-ly", "-ing", "-est", "-er", "-ed", "-ion" and so on. By removing these suffixes and storing only the root word, the dictionary can be significantly reduced. Similar approaches are possible for prefixes.

Two approaches are possible. In the simplest case, each token is examined for a list of common affixes. These are removed if found. Hence any token ending in "-s", has the "-s" removed. (The order of search can assure that larger suffixes, like "-ness" are removed first.) Then the dictionary is searched. If found, the token is judged correct. A major problem with this approach is that it does not catch misspelled tokens which are the result of correct affixes incorrectly applied to correct words, such as "implyed" or "fixs". This can allow misspelled tokens to escape detection.

A solution to this problem is to flag each word in the dictionary with its legal affixes. Then the dictionary is first searched for the token. If the token is not found, it is examined for possible affixes. These are removed and the dictionary searched for the root. If the root is found, its flags are examined to see that the particular affix is legal for this root.

The rules associated with affix removal can be complex. As an example, the following flags, and their interpretation are used in the DEC-10 SPELL program. These rules show how to add a suffix to a root word. Let a and b be "variables" that can stand for any letter. Upper case letters are constants. "..." stands for any string of zero or more letters.

"G" flag:
        ...E --> ...ING
        ...a --> ...aING, if a is not E.


"J" flag:
        ...E --> ...INGS
        ...a --> ...aINGS, if a is not E.


"O" flag:
        ...E --> ...ED
        ...bY --> ...bIED, if b is not A, E, I, O, or U
        ...ba --> ...baED, otherwise.


"S" flag:
        ...bY --> ...bIES, if b is not A, E, I, O, or U
        ...a --> ...aES, if a is S, X, Z, or H
        ...a --> ...aS, otherwise.

Suffix removal requires the correct reversal of the above rules, showing how to produce a root and suffix from an input token. Similar rules are defined for prefix removal.

The interaction of affix removal and spelling correction is not well understood, nor is iterated affix removal (removing several affixes).

Figure 5. Prefix and suffix analysis iin a spelling corrector showing three different levels of difficulty.

Of the three spelling checkers, each uses a different approach to affix analysis. The IBM speller uses no affix analysis, preferring a larger but simpler dictionary. The UNIX speller removes a few common suffixes and prefixes with no check for the legality of these affixes on the resulting root. Some of the DEC-10 SPELL programs maintain the flags listed above.

8. Design and Implementation

Based upon the research on spelling programs recounted above, we implemented a program to check and correct spelling errors [Peterson 1980]. Our intent was twofold: (1) to produce a machine independent spelling program design, and (2) to test modern programming principles in the construction of a non-trivial program.

A spelling program is an excellent example for a programming project, especially for students. The concept of a program which detects (and possibly corrects) spelling errors in a text file excites students with its apparent intelligence. Yet in its simplest implementation, it is little more than a table look-up program. At the same time, advanced students can extend the project to almost arbitrary complexity.

Our spelling program is an interactive spelling corrector. A three-level dictionary of common English words (256 of them), a hash table of document words, and a disk-based dictionary is used. Local dictionaries are supported. The program was developed completely top-down in a Pascal-like pseudo-language.

The design consists of two parts. Starting with the main program, a text portion discusses the possible alternative designs and the decisions which lead us to our choice. This choice is then written in a code portion. The code portion uses procedure calls to represent those parts of the design yet to be completed. This approach is then repeated with the undefined procedures, alternating between a text discussion of the design and an eventual code representation of our design choice.

The result of this approach to design is a document which details not only what the program is, but also why the program has developed as it has. The design document serves to both describe and explain.

To implement the speller, a text editor was used to delete the text portion of the design document, leaving only the code portion. The text editor was then used to effect a massive inversion of the code. Note that a top-down design produces first the main program and then its subprocedures. This is the inverse order required by, for example, Pascal. Thus, it is necessary to completely reorder the procedures as produced by the top-down design.

In addition, the use of abstract data types resulted in associating data type and variable declarations with the few procedures and functions which actually use the data items. The declarations in a Pascal program must be at the beginning of the program, preceding all procedure declarations. This required additional movement of code. Finally, some limitations of the Pascal language (e.g. inability to use a set value in a case statement) were overcome by local recoding.

The result of this relatively mechanical editting of the design document was an executable Pascal program. There were no major errors; the errors were either syntactic (e.g. missing semicolons or index variable declarations) or due to failing to properly initialize some local variables.

There are still some problems with our spelling program, particularly in terms of performance. The program is somewhat sluggish. We are investigating the cause of this sluggish behavior, but feel it is probably the result of one or more of the following three factors:

The dictionary is disk based. With the dictionary stored on disk, a search for a new word requires considerable time, both for the search and for the disk I/O. This is particularly a problem when the program is attempting to generate correct spellings of a misspelled word. As many as 200 potential spellings, almost all of them incorrect, must be searched for, each generally requiring a disk access.
The use of separate procedures for subgoals in the top-down design has resulted in, as it now stands, 105 procedures. No single procedure is more than a page in length; many are only a couple of lines long. All but a handful of these procedures are called from exactly one place in the program. This means that substantial procedure call overhead could be eliminated by textually substituting the body of the procedure for the procedure call, renaming local variables if necessary. This might (substantially) improve execution time, but would destroy program readability. Hence, we are investigating the possiblity of mechanically performing these substitutions as a prepass before compilation.
Finally, the use of a high-level language, such as Pascal, for this program may result in substantial execution time increase over a lower-level language. We expect to translate the program design into lower-level languages, such as C and various assembly languages, in order to both improve memory storage requirements (for both code and data) and execution time. This problem may be aggravated in the case of Pascal since it lacks a built-in string data type.

9. Future Checkers

The existing spelling checkers clearly demonstrate the feasibility of spelling checkers and correctors for English, and by implication, for similar languages. Suffix and prefix analysis for Romance languages might be easier and more necessary. The applicability of this work to German, Russian or Oriental languages is unknown.

The existence of a spelling checker and corrector suggests that the next step is to build more knowledge into the checking programs. The following checkers and correctors are possible next steps.

Consistency. In some cases, several correct spellings exist for a given word, such as "grey" and "gray". In addition to assuring that all tokens in a file are correctly spelled, we should assure that a consistent spelling is used. In addition we should check that appropriate case is used on proper names such as IBM, Algol, Jim and so on.

Syntax. Given that we can identify words in a sentence, the next step is to classify each word according to its word class (noun, verb, adjective, article, and so on). Given this identification, the structure of each sentence can be checked to assure that each sentence is properly constructed and syntactically correct.

Grammar. Once syntax can be checked, we must check grammar. This includes spelling and syntax, but also proper usage of the words and punctuation.

Semantics. Finally, a program which examines a document for correct meaning would be useful. This program would check that the ideas in the document are properly developed and presented, that conclusions are correctly based on the facts presented (or known), that the document is complete and consistent and that the document contributes new information or insights into significant problems.

Each checker would have an associated corrector which would attempt to correct the problem. These programs would greatly aid the preparation of documents, although they would be more difficult to construct.

We have begun efforts to construct a syntax checker. The on-line dictionary is being expanded to include parts of speech for each word, and various table-driven parsing methods are being researched. There is a considerable question of how much syntax checking can be done without semantic information, particularly for such problems as number agreement and missing articles.

Discussions with William B. Ackerman, Robert Amsler, Michael H. Conner and Leigh R. Power have helped in the writing of this paper.

10. Bibliography

C. N. Alberga, "String Similarity and Misspellings", Communications of the ACM, Volume 10, Number 5, (May 1967), pages 302-313.
Master's Thesis. Reviews previous work. Mentions two researchers at IBM Watson who suggest finding the longest common substrings and assigning probabilities based on the portion of the correct string matched. Does rather extensive but unreadable analysis of different algorithms, but with no real results. Reviewed in Computing Reviews, Volume 8, Number 5, Review 12,712.
W. W. Bledsoe and I. Browning, "Pattern Recognition and Reading by Machine", Proceedings of the Eastern Joint Computer Conference, (1959), pages 225-232.
Uses a small dictionary with probability of each word for OCR.
C. R. Blair, "A Program for Correcting Spelling Errors", Information and Control, Volume 3, Number 1, (March 1960), pages 60-67.
Weights the letters to create a four or five letter abbreviation for each word. If abbreviations match, the words are assumed to be the same. Mentions the possibility (impossibility) of building in rules like: i before e except after c and when like a as in neighbor and weigh, ...
C. P. Bourne, "Frequency and Impact of Spelling Errors in Bibliographic Data Bases", Information Processing and Management, Volume 13, Number 1, (1977), pages 1-12.
Examines the frequency of spelling errors in a sample drawn from 11 machine-readable bibliographic data bases. Finds that spelling errors are sufficiently severe that they should influence the search strategy to find information in the data base. Errors are not only in the input queries, but also in the data base itself.
G. Carlson, "Techniques for Replacing Characters that are Garbled on Input", Proceedings of the 1966 Spring Joint Computer Conference, (1966), pages 189-192.
Uses trigrams to correct OCR input of genealogical records.
R. W. Cornew, "A Statistical Method of Spelling Correction", Information and Control, Volume 12, Number 2, (February 1968), pages 79-93.
Uses digrams first, then a dictionary search to correct one character substitutions. Dictionary already exists for speech output problem.
F. J. Damerau, "A Technique for Computer Detection and Correction of Spelling Errors", Communications of the ACM, Volume 7, Number 3, (March 1964), pages 171-176.
The four errors: wrong, missing, extra, transposed, are mentioned here as accounting for 80% of errors. Uses a bit vector for preliminary compare. (bit[i] is on if letter i is in word). Reviewed in Computing Reviews, Volume 5, Number 4, Review 5,962.
L. Davidson, "Retrieval of Misspelled Names in an Airlines Passenger Record System", Communications of the ACM, Volume 5, Number 3, (March 1962), pages 169-171.
Abbreviates name to match stored name. Either name (token or dictionary) may be wrong.
E. G. Fisher, "The Use of Context in Character Recognition", COINS TR 76-12, Department of Computer and Information Sciences, University of Massachusetts, Amherst, (July 1976), 189 pages.
Considers the problem of automatically reading addresses from letters by the Post Office; also Morse code recognition.
D. N. Freeman, Error Correction in CORC: The Cornell Computing Language, Ph.D. Thesis, Department of Computer Science, Cornell University, (September 1963).
E. J. Galli and H. M. Yamada, "An Automatic Dictionary and Verification of Machine-Readable Text", IBM Systems Journal, Volume 6, Number 3, (1967), page 192-207.
Good discussion of the general problem of token identification and verification.
E. J. Galli and H. M. Yamada, "Experimental Studies in Computer-Assisted Correction of Unorthographic Text", IEEE Transactions on Engineering Writing and Speech, Volume EWS-11, Number 2, (August 1968), page 75-84.
Good review and explanation of techniques and problems.
J. J. Giangardella, J. Hudson and R. S. Roper, "Spelling Correction by Vector Representation Using a Digital Computer", IEEE Transactions on Engineering Writing and Speech, Volume EWS-10, Number 2, (December 1967), pages 57-62.
Defines hash functions to give a vector representation of a word as: Norm, Angle, and Distance. This speeds search time (over linear search) and aids in localizing the search for correct spellings since interchanged characters have the same Norm and extra or deleted letter is within fixed range.
H. T. Glantz, "On the Recognition of Information with a Digital Computer", Journal of the ACM, Volume 4, Number 2, (April 1957), pages 178-188.
Seems to want either exact match or greatest number of equal characters in equal positions. Good for OCR.
A. R. Hanson, E. M. Riseman and E. G. Fisher, "Context in Word Recognition", Pattern Recognition, Volume 8, Number 1, (January 1976), pages 35-45.
L. D. Harmon, "Automatic Reading of Cursive Script", Proceedings of a Symposium on Optical Character Recognition, Spartan Books, (January 1962), pages 151-152.
Uses digrams and a "confusion matrix" to give the probability of letter substitutions.
L. D. Harmon, "Automatic Recognition of Print and Script", Proceedings of the IEEE, Volume 60, Number 10, (October 1972), pages 1165-1176.
Surveys the techniques for computer input of print, including a section on error detection and correction. Indicates that digrams can catch 70 percent of incorrect letter errors.
E. B. James and D. P. Partridge, "Tolerance to Inaccuracy in Computer Programs", Computer Journal, Volume 19, Number 3, (August 1976), pages 207-212.
D. E. Knuth, The Art of Computer Programming, Volume 3: Sorting and Searching, Addison-Wesley, (1973), 722 pages.
H. Kucera and W. N. Francis, Computational Analysis of Present-Day American English, Brown University Press, (1967), 424 pages.
Gives frequency and statistical information for the Brown Corpus of over a million tokens.
L. A. Leslie, 20,000 Words, McGraw-Hill, (1977), 282 pages.
Representative of several books which list words.
R. Lowrance and R. A. Wagner, "An Extension of the String-to-String Correction Problem", Journal of the ACM, Volume 22, Number 2, (April 1975), pages 175-183.
Extends Wagner and Fischer [1974] to include adjacent transpositions as an edit operation.
C. K. McElwain and M. E. Evans, "The Degarbler -- A Program for Correcting Machine-Read Morse Code", Information and Control, Volume 5, Number 4, (December 1962), pages 368-384.
Uses digrams, trigrams and a dictionary to correct up to 70% of errors in machine recognized Morse code. Uses 5 special rules for the types of errors which can occur (dot interpreted as dash, ...)
L. E. McMahon, L. L. Cherry, and R. Morris, "Statistical Text Processing", The Bell System Technical Journal, Volume 57, Number 6, Part 2, (July-August 1978), pages 2137-2154.
Good description of how computer systems can be used to process text, including spelling correction and an attempt at a syntax checker.
H. L. Morgan, "Spelling Correction in Systems Programs", Communications of the ACM, Volume 13, Number 2, (February 1970), pages 90-94.
Use of spelling correction for compilers and operating system JCL. Uses dictionary with the four classes of errors. Also possible to use syntax and semantics to narrow search space. Reports on the CORC compiler [Freeman 1963] which associated a probability with each possible misspelling.
R. Morris and L. L. Cherry, "Computer Detection of Typographical Errors", IEEE Transactions on Professional Communications, Volume PC-18, Number 1, (March 1975), pages 54-64.
Describes the TYPO program for the UNIX system.
F. Muth and A. L. Tharp, "Correcting Human Error in Alphanumeric Terminal Input", Information Processing and Management, Volume 13, Number 6, (1977), pages 329-337.
Suggests a tree structure (like a trie) with special search procedures to allow corrections to be found. Damerau's review points out that their search strategies need improvement and that their tree is much too big to be practical. Each node of the tree has one character (data) and three pointers (father, brother, son). Reviewed in Computing Reviews, Volume 19, Number 6, Review 33,119.
J. A. O'Brien, "Computer Program for Automatic Spelling Correction", Technical Report RADC-TR-66-696, Rome Air Development Center, New York, (March 1967).
T. Okuda, E. Tanaka and T. Kasai, "A Method for the Correction of Garbled Words Based on the Levenshtein Metric", IEEE Transactions on Computers, Volume C-25, Number 2, (February 1976), pages 172-177.
D. P. Partridge and E. B. James, "Natural Information Processing", International Journal of Man-Machine Studies, Volume 6, Number 2, (March 1974), pages 205-235.
Uses a tree structure representation of words to allow checks for incorrect input words. Done in the context of correcting keywords in a Fortran program, but more is there. Frequencies are kept with tree branches to allow the tree to modify itself to optimize search.
J. L. Peterson, "Design of a Spelling Program: An Experiment in Program Design", Volume 96, Lecture Notes in Computer Science, Springer-Verlag, New York, (October 1980).
Gives the complete top-down design of a program to detect and correct spelling errors. The design is machine-independent, but directly results in a Pascal program which implements the design.
E. M. Riseman and R. W. Ehrich, "Contextual Word Recognition Using Binary Digrams", IEEE Transactions on Computers, Volume C-20, Number 4, (April 1971), pages 397-403.
Indicates the important property of digrams is only their zero or non-zero nature.
E. M. Riseman and A. R. Hanson, "A Contextual Postprocessing System for Error Correction Using Binary n-Grams", IEEE Transactions on Computers, Volume C-23, Number 5, (May 1974), pages 480-493.
Suggests using digrams (2-grams), trigrams (3-grams), or in general n-grams, but only storing whether the probability is zero or non-zero (1 bit). Also positional n-grams which keeps a separate n-gram table for each pair of positions (for each i and j we have the digram table for characters in position i and position j in a word).
W. S. Rosenbaum and J. J. Hilliard, "Multifont OCR Postprocessing System", IBM Journal of Research and Development, Volume 19, Number 4, (July 1975), pages 398-421.
Very specific emphasis on OCR problems. Some on searching with a match-any character.
B. A. Sheil, "Median Split Trees: A Fast Look-up Technique for Frequently Occurring Keys", Communications of the ACM, Volume 21, Number 11, (November 1978), pages 947-958.
A. J. Szanser, "Error-Correcting Methods in Natural Language Processing", Information Processing 68 - Proceedings of IFIP 68, North Holland, Amsterdam, (August 1968), pages 1412-1416.
Confused paper dealing with correction for machine translation and automatic interpretation of shorthand transcript tapes. Suggests "elastic" matching.
A. J. Szanser, "Automatic Error-Correction in Natural Languages", Information Storage and Retrieval, Volume 5, Number 4, (February 1970), pages 169-174.
E. Tanaka and T. Kasai, "Correcting Method of Garbled Languages Using Ordered Key Letters", Electronics and Communications in Japan, Volume 55, Number 6, (1972), pages 127-133.
P. J. Tenczar and W. W. Golden, "Spelling, Word and Concept Recognition", Report CERL-X-35, University of Illinois, Urbana, Illinois, (October 1962).
R. B. Thomas and M. Kassler, "Character Recognition in Context", Information and Control, Volume 10, Number 1, (January 1967), pages 43-64.
Considers tetragrams (sequences of 4 letters). Of 27⁴ possible tetragrams, only 12 percent (61,273) are legal.
L. Thorelli, "Automatic Correction of Errors in Text", BIT, Volume 2, Number 1, (1962), pages 45-62.
Sort of a survey/tutorial. Mentions digrams and dictionary look-up. Suggests maximizing probabilities.
C. M. Vossler and N. M. Branston, "The Use of Context for Correcting Garbled English Text", Proceedings of the 19th ACM National Convention, (August 1964), pages D2.4-1 to D2.4-13.
Uses confusion matrix and word probabilities to select the most probable input word. Also uses digrams. Trying to improve OCR input.
R. A. Wagner and M. J. Fischer, "The String-to-String Correction Problem", Journal of the ACM, Volume 21, Number 1, (January 1974), pages 168-173.
Algorithm for determining similarity of two strings as minimum number of edit operations to transform one into the other. Allowed edit operations are add, delete or substitute one character.
C. K. Wong and A. K. Chandra, "Bounds for the String Editing Problem", Journal of the ACM, Volume 23, Number 1, (January 1976), pages 13-16.
Shows that the complexity bounds of [Wagner and Fischer 1974] are not only sufficient but also necessary.

Home

Comments?