Increasing use is being made of computer based systems for the preparation of written material. Word and text processing systems allow easy entry of text which can be modified, corrected and updated with a text editor. Reports and papers can be formatted, introducing line justification, spacing, paging, by a text formatter. Filing systems allow documents to be stored and retrieved with ease.
A more sophisticated use of computers in text processing is the use of a spelling checker. A spelling checker is a program that searches a document and helps to identify incorrectly spelled words. This raises the quality of the document, making it easier to read and understand, by eliminating misleading and annoying spelling (or keyboarding) errors.
The technology of a spelling checker is relatively simple: a large dictionary of correctly spelled words is kept in computer memory. Each word in a document is then searched for in the dictionary. If it is found, it is correctly spelled; if it is not found, the user is given the opportunity to decide whether the word should be added to the dictionary or whether it is an error.
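The lookup loop just described can be sketched in a few lines. This is a minimal illustration, not an actual spelling checker: the dictionary contents and the tokenizing pattern are assumptions made for the example.

```python
import re

# A toy in-memory dictionary of correctly spelled words (illustrative only;
# a real checker would hold tens of thousands of entries).
DICTIONARY = {"the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"}

def find_unknown_words(text):
    """Return the words in `text` that are not found in the dictionary.

    Words not found are candidates for either correction by the user
    or addition to the dictionary.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in DICTIONARY]

# "lasy" is not in the dictionary, so it is flagged for the user.
flagged = find_unknown_words("The quick brown fox jumps over the lasy dog")
```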
This process is an example of work-sharing between a human user and a computer, in which the computer does those tasks that it does best (fastest, most accurately, most consistently) and the person does the tasks he or she can do better. As computer technology improves, it becomes feasible to shift more and more work onto the computer. This proposal consists of one more step in this direction, a step motivated by the limitations of current spelling checkers.
A straightforward spelling checker can only assure that every word in a document is some correctly spelled word, not necessarily the desired correctly spelled word. For example, a sentence in which "their" has been typed for "there" is perfectly acceptable to a spelling checker, since every word in it is correctly spelled.
We propose to attempt to create such a program: a syntax checker. This program would read through a document flagging possible syntax errors. Like a spelling checker, the syntax checker would have to have access to a dictionary. It will proceed by attempting to parse the input sentences one word at a time, looking up words in the dictionary as necessary to determine their part of speech. This will require that the dictionary contain not only a list of words but also their possible parts of speech. Syntax checking and spelling checking can go on in tandem thus allowing a single dictionary access per word.
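The single-access idea above can be made concrete: one dictionary lookup per word yields both a spelling verdict (is the word known?) and the word's possible parts of speech for the parser. The tiny lexicon here is an illustrative assumption.

```python
# Toy lexicon mapping each word to its possible parts of speech.
# A single probe of this table serves both the spelling check and the parser.
LEXICON = {
    "the":   {"article"},
    "dog":   {"noun"},
    "dogs":  {"noun", "verb"},   # many words have several parts of speech
    "barks": {"verb", "noun"},
}

def look_up(word):
    """One dictionary access per word.

    Returns (correctly_spelled, possible_parts_of_speech); an unknown word
    is flagged for spelling and contributes no parse information.
    """
    parts = LEXICON.get(word.lower())
    if parts is None:
        return (False, set())
    return (True, parts)
```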
A syntax checker such as this could detect such errors by reducing each sentence to a statement of its form: a sequence of the parts of speech listed in Table 1.
Table 1. A first classification of words into parts of speech.

    noun
    verb (transitive, intransitive)
    adjective
    adverb
    article
    conjunction
    preposition
    pronoun
    interjection
The parse may be very difficult. One problem is simple combinatorics: each word may serve as many parts of speech, and the checker must choose a correct interpretation of each word. We realize that, without understanding the meaning or semantics of each word, we cannot make the "correct" choice for each word. However, we can determine (by exhaustively trying all possible combinations if necessary) whether there exists any interpretation that is syntactically correct. We will accept a statement as correct syntax if any acceptable parse exists.
We thus bypass many of the problems that have troubled previous English parsers. For example, we would not care which of several acceptable parses of an ambiguous sentence is the "correct" one, only that at least one exists.
We realize that this may allow some incorrect statements to be accepted. However, to do more would require a program that "understood" the meaning of the sentence, a most difficult problem currently under investigation by many AI researchers. We expect, however, that our approach will detect a significant number of typical errors. The success of such an attempt will depend both on the overall accuracy of the grammar rules we develop and on the way those rules apply to the sentences people actually write.
Several problems must be addressed in the design of such a program: human interface, dictionary structure, parsing method, and especially grammar definition. Some of these are classical systems problems (human interface, dictionary), while others (grammar definition) involve linguistics and artificial intelligence research. The major problem would seem to be the definition of an acceptable grammar: there is no complete, universally accepted definition of English grammar.
The definition of the grammar will be crucial to the construction of the syntax checker. At one extreme, one could allow any sequence of words to be a sentence (sentence ::= word*); at the other extreme, one could define separate productions for each existing English sentence. In between lies the more reasonable definition in terms of a few (on the order of 100) productions concerning the definition and use of common syntactic constructs (noun phrase, verb phrase, ...). These rules will need revision as experience is gained, so an easily changed, perhaps table-driven, parser is desirable. Augmented Transition Networks (ATNs) seem a reasonable choice.
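To make the middle course concrete, here is a deliberately tiny table-driven grammar of that kind: a handful of productions over syntactic categories that can be revised without touching the parser. The grammar and lexicon are illustrative assumptions, not the proposed hundred-rule grammar, and the parser simply asks whether any interpretation of the words yields an acceptable parse, as described above.

```python
# Productions: each nonterminal maps to its alternative right-hand sides.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["article", "noun"], ["noun"]],
    "VP": [["verb"], ["verb", "NP"]],
}

# Toy lexicon: each word maps to its possible parts of speech.
LEXICON = {
    "the": {"article"}, "dog": {"noun"}, "dogs": {"noun", "verb"},
    "bites": {"verb"}, "man": {"noun", "verb"},
}

def derives(symbols, words):
    """True if the symbol sequence can derive exactly `words`.

    Nonterminals are expanded by trying each production; a terminal
    (a part of speech) matches if the next word can serve as that part.
    """
    if not symbols:
        return not words
    head, rest = symbols[0], symbols[1:]
    if head in GRAMMAR:
        return any(derives(prod + rest, words) for prod in GRAMMAR[head])
    return (bool(words)
            and head in LEXICON.get(words[0], set())
            and derives(rest, words[1:]))

def is_acceptable(sentence):
    """Accept the sentence if any parse as S exists."""
    return derives(["S"], sentence.lower().split())
```

Note that "man" is listed as both noun and verb, so "man bites dog" is accepted under the interpretation that happens to parse; the checker never decides which reading is intended.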
One advantage of a definable grammar is the ability to produce grammars that are progressively more sophisticated and complex in order to check for more and more errors. For example, many of the problems in written English tend to involve the use of mismatched types. Nouns and verbs can be typed according to many features: gender (male/female), number (singular/plural), tense (past, present, future), case (subjective, objective, possessive), and so on. These types can only be combined in defined ways. One common problem is number agreement: the use of a singular noun with a plural verb (or vice versa). An interesting empirical question to be addressed by this research is that of determining the size of the grammar required to catch most of the errors people actually make. We will have to experiment with these and other features to determine when additional rules cease to provide significant increases in accuracy.
A syntax checker such as the one we are describing would be useful for correcting the ordinary typing and grammar mistakes that we all make every day. It could be of even more benefit, however, to authors whose native language is not English, particularly when the native language does not include certain constructs (such as articles). The problems of these authors are evident in the university environment in the papers and theses of foreign students, but they also appear in papers presented at international conferences and in multinational business environments.
A couple of possible obstacles to the feasibility of a practical syntax checker are not as serious as they might at first appear. One problem, particularly with technical writing, is that new words appear considerably more often than do new dictionaries. So the syntax checker may look up a word to determine its part of speech only to discover that it is not in the dictionary. But of the major parts of speech, only four (noun, verb, adjective, and adverb) ever acquire new words. So in the worst case, only four possibilities would have to be considered, and often a simple morphological analysis will narrow the options even further. The problem of unknown words is therefore not likely to be as serious as it might at first seem.
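The morphological narrowing mentioned above might look like this: an unknown word's suffix restricts it to a subset of the four open classes. The suffix table is an illustrative assumption covering only a few common endings.

```python
# The four parts of speech that ever acquire new words.
OPEN_CLASSES = {"noun", "verb", "adjective", "adverb"}

# A few suffixes and the classes they suggest (illustrative, not exhaustive).
SUFFIX_HINTS = [
    ("tion", {"noun"}),                  # e.g. compilation
    ("ize",  {"verb"}),                  # e.g. tokenize
    ("ous",  {"adjective"}),             # e.g. erroneous
    ("ly",   {"adverb", "adjective"}),   # e.g. quickly, friendly
]

def possible_classes(unknown_word):
    """Narrow an unknown word's part of speech by its suffix, if any.

    With no recognizable suffix, the worst case applies: all four
    open classes must be considered.
    """
    for suffix, classes in SUFFIX_HINTS:
        if unknown_word.endswith(suffix):
            return classes
    return OPEN_CLASSES
```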
Another difficulty with technical text is that it often contains many non-sentences in such structures as enumerations of things, section headings, and figure captions. Here we may be aided by one recent trend in the development of text processing systems. In at least one such system (Scribe), all such segments are demarcated by their function (such as heading or list) so that they can be correctly formatted by the system. Using that same information, the syntax checker could discover that those segments should not be analyzed as sentences.
We believe that a syntax checker is now feasible and would be of great use in improving written material, catching syntax errors that would slow down or confuse the reader and hinder effective communication. In addition, once the syntax checker is properly working, it should also be possible to extend it to check certain stylistic constructs of proper usage, such as those found in Strunk and White's The Elements of Style.
We are currently in the process of creating a dictionary, complete with parts of speech information, based upon a tape of Webster's Seventh Collegiate Dictionary. We expect to begin shortly to construct the parser and to experiment with various grammars.