Use of Webster's Seventh New Collegiate Dictionary To Construct a Master Hyphenation List




James L. Peterson




Department of Computer Sciences
University of Texas
Austin, TX




June 1982




This paper has been published. It should be cited as

James L. Peterson, ``Use of Webster's Seventh New Collegiate Dictionary to Construct a Master Hyphenation List'', Proceedings of the AFIPS 1982 National Computer Conference, (June 1982), pages 665-670.




ABSTRACT

A machine-readable form of Webster's Seventh New Collegiate Dictionary has been obtained. After substantial processing to understand the form and content of the dictionary and to correct residual typographical errors, we have begun the task of constructing a master word-hyphenation list based on this dictionary and other sources. Substantial problems can arise in preparing the master hyphenation list because of incompatible hyphenation definitions from various sources. Some statistics are given.


MOTIVATION

Over the past decade an increasing number of documents have been prepared with the use of a computer. These computer-based systems provide certain standard functions: input, storage, editing, formatting, and output. Recently, however, a new function has been introduced: analysis. The idea is to have the computer analyze the text of the document to catch errors and improve the quality of the document.

Many different types of analysis are possible. Perhaps the best known is spelling checking and correcting [1]. A large number of programs currently available will check each word in a document for correct spelling. This is generally done by providing a list of words that are correctly spelled. Any word in the document that is not in the list of correctly spelled words is a candidate misspelling and is flagged for the author's attention.

Although the best-known form of document analysis is detecting spelling errors, it is far from the only form. The Writer's Workbench of PWB/UNIX[2] provides several forms of document analysis, including readability indexes, wordiness, punctuation, and style. More advanced forms of analysis would include checking grammar and syntax of documents [3][4].

For all of these tasks two things are needed: an analysis algorithm and a suitable word list annotated with the needed properties of the words.

In our case, we are trying to evaluate several different readability formulas[5]. Many of these formulas involve counting the number of syllables in a word or sentence. Thus, we would like to be able to divide a word into its constituent syllables.

Trying to find existing algorithms for splitting a word into syllables, we noticed that this is the hyphenation problem: given a word, where can that word be hyphenated? A word can be hyphenated only on a syllable boundary. Thus, if we can determine hyphenation points correctly, we can use that information to define syllable boundaries and hence the number of syllables.

There are several published hyphenation algorithms [6][7], but they tend to be either very simple (and obviously not very good) or rather complex but of unknown validity. A study of hyphenation rules leads immediately to the conclusion that the only totally correct algorithm would be to look up the word in a word list annotated with hyphenation information [8].

At the same time, we are attempting to investigate the difficulty of checking grammar and syntax in English documents. We need to know, for each word, the possible parts of speech for that word. This, plus a partial English grammar, should allow us to create a syntax-checking program for English. A sentence with no parse in the grammar would be flagged for the author as a possible syntax error.

In searching for a word list with part-of-speech and hyphenation information, we discovered the machine-readable Webster's Seventh New Collegiate Dictionary (W7) [9].

WEBSTER'S SEVENTH NEW COLLEGIATE DICTIONARY

W7 exists in a computer-readable form. This is not just a word list, but a copy of the entire dictionary, including definitions, cross-references, variants, synonyms, and so on. It consists of some 12,242,868 characters, with 68,766 main entries.

The original dictionary was keyboarded onto the Q-32 computer at System Development Corporation (SDC) for a project headed by John Olney[10]. The dictionary was then heavily edited and moved onto an IBM 360. Tapes of this version were widely distributed; one copy was sent to the IBM T.J. Watson Research Center, where it was further processed by C. Alberga. A copy of that version was acquired by Robert Amsler [11]. We have acquired a copy of the dictionary from Amsler and have modified it in many minor ways.




F;Charybdis;;;33;n;;
P;k{e}-'rib-d{e}s
E;L, fr. Gk
D;0;;;n;a whirlpool off the Sicilian coast personified by the ancients#
as a female monster
F;chase;1;;;vb;;
P;'ch{a-}s
E;ME [italic chasen], fr. MF [italic chasser], fr. (assumed) VL [italic#
captiare] -- more at [mini CATCH]
D;1;a;;vt;to follow rapidly : [mini PURSUE]
D;1;b;;vt;[mini HUNT]
D;1;c;;vt;to follow regularly or persistently with the intention of#
attracting or alluring
L;2;;;[italic obs]
D;2;;;vt;[mini HARASS]
D;3;;;vt;to seek out
D;4;a;;vt;to cause to depart or flee : [mini DRIVE]
L;4;b;;[italic slang]
D;4;b;;vt;to take (oneself) off
D;1;;;vi;to chase an animal, person, or thing
D;2;;;vi;[mini RUSH], [mini HASTEN]
S;0;[mini PURSUE], [mini FOLLOW], [mini TRAIL]:
S;1;[mini CHASE] implies going swiftly after and trying to overtake#
something fleeing or running;
S;2;[mini PURSUE] suggests a continuing effort to overtake, reach,#
attain;
S;3;[mini FOLLOW] puts less emphasis upon speed or intent to overtake#
and may not imply an awareness on the part of the leader that he is#
pursued;
S;4;[mini TRAIL] may stress a following of tracks or traces rather than#
a visible object
F;chase;2;;;n;;
D;1;a;;n;the act of chasing : [mini PURSUIT]
D;1;b;;n;[mini HUNTING] -- used with [italic the]
D;1;c;;n;an earnest or frenzied seeking after something desired
D;2;;;n;something pursued
D;3;a;;n;a franchise to hunt within certain limits of land
D;3;b;;n;a tract of unenclosed land used as a game preserve
F;chase;3;;;vt;;
E;ME [italic chassen], modif. of MF [italic enchasser] to set
D;1;a;;vt;to ornament (metal) by indenting with a hammer and tools#
without a cutting edge
D;1;b;;vt;to make by such indentation
D;1;c;;vt;to set with gems
D;2;a;;vt;[mini GROOVE], [mini INDENT]
D;2;b;;vt;to cut (a thread) with a chaser
F;chase;4;;;n;;
E;F [italic chas] eye of a needle, fr. LL [italic capsus] enclosed#
space, fr. L, pen, alter. of [italic capsa] box -- more at [mini CASE]
D;1;;;n;[mini GROOVE], [mini FURROW]
D;2;;;n;the bore of a cannon
D;3;a;;n;[mini TRENCH]
D;3;b;;n;a channel (as in a wall) for something to lie in or pass#
through
F;chase;5;;;n;;
E;prob. fr. F [italic ch{a^}sse] frame, fr. L [italic capsa]
D;0;;;n;a rectangular steel or iron frame into which letterpress matter#
is locked for printing or plating
X;form;;;4;


Figure 1. Sample of the dictionary file.

Figure 1 shows a sample of the dictionary file. Each line of the file has a character in Column 1 identifying the type and format of the line. Table I shows the number and meaning of each line type.


Table I: Number and meaning of line types in W7 dictionary file

Line type    Number   Meaning
F            68,766   First line, start for a new word
V             9,957   Variant
D           140,500   Definition, one per line
L            11,990   Label
R            19,123   Related word
X             4,598   Cross-reference
S               834   Synonym block

Each line is composed of a number of fields. Fields are separated by semicolons and are defined by their position. The first field of each line is the line type character (F, V, D, L, R, X, or S, as given above). The remaining fields depend on the type of the line. For example, the second field on an F-line is the main-entry word, the fifth field has hyphenation information, and the seventh has part-of-speech information.
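The positional layout can be illustrated with a short Python sketch that splits one line into its fields. The names `DictLine` and `parse_line` are hypothetical, and the field positions used in the example are read off the F-lines in Figure 1 (counting the line-type character as the first field); this is an illustration, not a specification of the file format. Lines in Figure 1 that end in `#` appear to continue on the next line, and such continuations would have to be joined before parsing; the sketch does not do that.

```python
from typing import List, NamedTuple

LINE_TYPES = {"F", "V", "D", "L", "R", "X", "S"}  # line types from Table I


class DictLine(NamedTuple):
    line_type: str     # first field: the line-type character
    fields: List[str]  # all fields in order, including the line type


def parse_line(line: str) -> DictLine:
    """Split one dictionary line into its semicolon-separated fields."""
    fields = line.rstrip("\n").split(";")
    if fields[0] not in LINE_TYPES:
        raise ValueError(f"unknown line type: {fields[0]!r}")
    return DictLine(fields[0], fields)


entry = parse_line("F;Charybdis;;;33;n;;")
word = entry.fields[1]         # second field: main-entry word
hyphenation = entry.fields[4]  # fifth field: hyphenation information
```

Keeping the complete field list, rather than naming every position, avoids committing to the meaning of fields that the text does not describe.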

Character Codes

A major problem with the dictionary is its character set. First, the dictionary publisher did not feel constrained in its use of characters, but chose whatever symbols best fit the purpose. Second, the dictionary was originally encoded in an extended BCD (for the Q-32 computer), then translated into EBCDIC (for the IBM 360/370) and now has been translated into ASCII (for our PDP-11/60). None of these character sets is completely compatible with the others, and none of them is sufficient to represent the variation found in the original printed dictionary. Hence an encoding scheme must be used to expand the set of representable characters. This expansion occurs in two independent directions: font information and special characters.

We have represented font information by using ASCII square brackets to surround any special font material. Five font types are recognized: (1) italic, (2) mini-caps, (3) bold, (4) subscripts, and (5) superscripts. Each is denoted by an identifying keyword immediately after the opening (left) square bracket, followed by a space, followed by the material to be set in that font, followed by the closing (right) square bracket. For example, the word "was" in italics is represented as [italic was], the word AMBIENT in mini-caps is [mini AMBIENT], and the word "syn" in bold is [bold syn]. Superscripts and subscripts may be italic, mini-caps, or bold; and a few superscripted superscripts also occur, as in 6.24 {times} 10 [sup 10 [sup 10]].

The dictionary includes a large number of special symbols that are not representable in ASCII. These include all the Greek alphabet, the Hebrew alphabet, and many miscellaneous other special symbols. All special symbols which are not available in ASCII (and some that are) have been given representations by enclosing the name in braces, as: {degrees} (for a degree symbol), {times} (for multiplication represented by a small x), and {tau} (for the lower-case Greek letter tau).

Each symbol name has been selected to exclude embedded blanks. Thus all characters between an opening left brace and its closing right brace are non-blank. Certain characters in ASCII (braces, brackets, question mark, exclamation mark, and so on) have also been represented as extended characters to allow the ASCII character to be used for other purposes (such as font and special character representation). They occur only infrequently (fewer than 100 times).
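The two encoding conventions described above can be handled with a pair of regular expressions. The following Python sketch is only an illustration: the function names are hypothetical, and the keyword `sub` for subscripts is an assumption, since the examples in the text show only `italic`, `mini`, `bold`, and `sup`.

```python
import re

# {name}: a special symbol; names contain no blanks.
SYMBOL = re.compile(r"\{([^\s{}]+)\}")

# Innermost [keyword text] span. "sub" is an assumed keyword; the text
# shows only italic, mini, bold, and sup.
FONT = re.compile(r"\[(italic|mini|bold|sub|sup) ([^\[\]]*)\]")


def symbol_names(text: str) -> list:
    """Return the names of all {name} special symbols in text, in order."""
    return SYMBOL.findall(text)


def strip_fonts(text: str) -> str:
    """Remove font markup but keep the enclosed text. Nested spans are
    handled by repeatedly unwrapping the innermost bracketed span."""
    previous = None
    while previous != text:
        previous = text
        text = FONT.sub(r"\2", text)
    return text
```

For example, `strip_fonts("[mini AMBIENT]")` yields `AMBIENT`, and `symbol_names` applied to the superscript example above yields the single name `times`. Stripping discards the font information, so a program that needs it would have to parse the nesting instead.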

ERRORS IN W7

While processing W7 both to understand its contents and to put those contents into a usable form, we encountered a large number of errors. These errors were of several types:

  1. Merged illustrations. For example, under false the illustration was ( ~ documents ~ teeth) and should have been ( ~ documents) (~ teeth). To correct this we searched for any line of the form ``(... ~ ... ~ ...)''.

  2. Words containing letters with accents (236 entries). The accent field was wrong about half the time. The normal problem was that the accent was on the wrong letter. In these cases, the hyphenation information generally showed syllables that were two letters too long.

  3. Incorrect values in fields. We created a list sorted by frequency of the contents of each field (as listed in the appendix of Peterson[12]). These could then be examined for rare or inappropriate values; for example, a ``g'' in a numeric field, or a zero in an alphabetic field.

  4. Mismatched parentheses or brackets. We wrote a program to simply count parentheses, braces, and brackets. Many were found to be mismatched.

All these errors, once found and verified, were corrected by hand, using a text editor.
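Two of the checks listed above, the search for merged illustrations (item 1) and the delimiter count (item 4), are simple enough to sketch in Python. The regular expression and the function names are assumptions made for illustration; the programs actually used are not described in this paper.

```python
import re
from collections import Counter

PAIRS = {"(": ")", "[": "]", "{": "}"}

# Assumed pattern for a merged illustration: one parenthesized group
# containing two "~" placeholders, e.g. "( ~ documents ~ teeth)".
MERGED = re.compile(r"\([^()~]*~[^()~]*~[^()]*\)")


def unbalanced(text: str) -> list:
    """Return each opening delimiter whose count differs from the count
    of its closing partner (a plain count, with no check of nesting)."""
    counts = Counter(text)
    return [left for left, right in PAIRS.items() if counts[left] != counts[right]]


def has_merged_illustration(text: str) -> bool:
    """True if text contains a group of the form (... ~ ... ~ ...)."""
    return MERGED.search(text) is not None
```

Both functions only flag candidate lines; as noted above, each flagged line still has to be verified and corrected by hand.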

A last form of error analysis was an attempt to find typographical errors. The approach was simple: we extracted a list of all unique words used in the dictionary definitions. This produced a list of 54,298 words. We compared this list with the list of all words defined in the dictionary (main entries, variants, or related words). This reduced our list to 20,292 words that were used in definitions but not defined. Many of these were derived forms of defined words: past tense, plurals, and so on. Doing some simple suffix analysis, we were left with about 8,000 words. Most of these were apparently Greek or Latin botanical or zoological names. Deleting those ending in '-ia' or '-ae' and all words in italics in the dictionary left a list of 2,821 words.
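The first two filtering steps (removing defined words, then removing simple derived forms) can be sketched in Python as follows. The suffix list and the final-"e" rule are hypothetical; the paper does not state which suffixes were actually used.

```python
# Hypothetical suffix list; the suffixes actually used are not stated.
SUFFIXES = ("ing", "ed", "es", "s", "ly")


def undefined_words(used: set, defined: set) -> set:
    """Words used in definitions that are neither defined themselves
    nor a simple suffixed form of a defined word."""
    candidates = set()
    for word in used - defined:
        stems = [word[: -len(s)] for s in SUFFIXES if word.endswith(s)]
        # Also try restoring a dropped final "e" (chased -> chase).
        stems += [stem + "e" for stem in stems]
        if not any(stem in defined for stem in stems):
            candidates.add(word)
    return candidates
```

The words that survive are only candidates; they include legitimate undefined words as well as typographical errors, so a manual check is still required.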

These were checked by hand to produce a list of 903 incorrectly spelled words. We also found 54 words which were used, but not defined, such as Australasian, nubby, spondunmene, seneschal. We also found a smaller list of words with typographical errors in the main entry in the computer files.

Of the 903 typographical errors, 543 were the result of a missing blank between two words. Of the remaining 360, 34% were a missing letter, 27% were a wrong letter, 20% were an extra letter, and 13% were the result of transposed letters. The remaining errors were caused by two extra or two missing letters, or by transposing two letters around a third. The middle letter in this case was always a vowel. (For example, min would be typed nim.)
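The error categories above can be made concrete with a small classifier that compares a typed form with the intended form. This Python sketch is an assumed reconstruction for illustration only (the classification reported here was done by hand), and the function names are hypothetical.

```python
def drop_one(s: str) -> set:
    """All strings obtained by deleting exactly one character of s."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}


def classify(typed: str, intended: str) -> str:
    """Name the error category that turns intended into typed."""
    if " " in intended and typed == intended.replace(" ", ""):
        return "missing blank"
    if typed in drop_one(intended):
        return "missing letter"
    if intended in drop_one(typed):
        return "extra letter"
    if len(typed) == len(intended):
        diff = [i for i in range(len(typed)) if typed[i] != intended[i]]
        if len(diff) == 1:
            return "wrong letter"
        if len(diff) == 2:
            i, j = diff
            swapped = typed[i] == intended[j] and typed[j] == intended[i]
            if swapped and j - i == 1:
                return "transposed letters"
            if swapped and j - i == 2:
                return "transposed around a third letter"
    return "other"  # e.g. two extra or two missing letters
```

With these definitions, typing "nim" for "min" is classified as a transposition around a third letter, and "knicknack" for "knickknack" as a missing letter.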

We also found 10 cases of typographical errors in the original printed dictionary. It was interesting to follow these errors through the various printings and editions of the Merriam-Webster dictionaries. Four errors were corrected in the 1970 printing of W7, one in the 1971 printing of W7, and one in the New Collegiate Dictionary[13]; and four errors remain in the most recent Collegiate [14]:

  1. In bitch, doublecross should be double-cross.
  2. In vanity, knicknack should be knickknack.
  3. In drift, quantitive should be quantitative.
  4. In barranca, gulley should be gully.

The first two errors are also in Webster's Third New International Dictionary[15].

CONSTRUCTION OF THE MASTER HYPHENATION LIST

As mentioned before, each line in the dictionary is a sequence of fields, separated by semicolons. The fourth field on F-cards, the second field on R-cards, and the second field on V-cards contain hyphenation information for their respective entries. The hyphenation information is a sequence of one-digit numbers giving the number of characters between possible hyphenation points. As an example, a word such as devilish is hyphenated as dev-il-ish and would be encoded as 323. The distance from the start of the word to the first hyphenation point is 3; the next syllable is of length 2; the last syllable is of length 3. The word ethnological, hyphenated as eth-no-log-i-cal, is encoded as 32313.

The encoding of the distance between hyphenation points is one digit, from 1 to 9. If the distance exceeds 9, we continue with the upper-case letters (as with a hexadecimal representation). Thus A is 10, B is 11, C is 12, ..., Z is 35. The most common syllable-length patterns are 232, 322, and 223.
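The encoding described above can be decoded with a few lines of Python. This is a minimal sketch with hypothetical function names. It assumes, as in the examples given in the text, that the encoding lists every syllable (including the last), so that the lengths sum to the length of the word.

```python
def decode(code: str) -> list:
    """Turn an encoding such as "323" into syllable lengths [3, 2, 3].
    Digits 1-9 stand for themselves; A-Z stand for 10-35."""
    return [int(ch, 36) for ch in code]


def hyphenate(word: str, code: str) -> str:
    """Insert hyphens into word at the points given by code."""
    lengths = decode(code)
    if sum(lengths) != len(word):
        raise ValueError("encoding does not cover the whole word")
    parts, start = [], 0
    for length in lengths:
        parts.append(word[start:start + length])
        start += length
    return "-".join(parts)
```

Under this assumption, `hyphenate("devilish", "323")` gives `dev-il-ish`, and the number of syllables in a word is simply the length of its encoding, which is the quantity the readability formulas need.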

The longest hyphenation encoding is

pneumonoultramicroscopicsilicovolcanoconiosis;4222323423123222213
pneu-mo-no-ul-tra-mi-cro-scop-ic-sil-i-co-vol-ca-no-co-ni-o-sis

There are five words with nine syllables each.

While we were processing W7, we became aware of several word lists with hyphenation data. We were able to acquire copies of three of these:

  1. LONG, based on Longman's Dictionary of Contemporary English [16]
  2. RADC, from Rome Air Development Center
  3. IBM, from the Advanced Office Systems Laboratory of IBM

Each of these word lists recorded hyphenation information in a different way, generally by an embedded hyphenation indicator (either a hyphen, an equal sign, or a period). We converted all of these to a uniform encoding based on the W7 encoding.
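Converting an embedded-indicator form to the W7-style encoding is mechanical. The Python sketch below shows one way to do it; the function name and the default indicator set are assumptions for illustration, and the exact formats of the three source lists are not reproduced here.

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # a symbol's value is its index


def encode(marked: str, indicators: str = "-=.") -> tuple:
    """Convert a word with embedded hyphenation indicators, such as
    "dev-il-ish", into the pair (word, encoding), here ("devilish", "323")."""
    for ch in indicators:
        marked = marked.replace(ch, "-")
    syllables = [s for s in marked.split("-") if s]
    if any(len(s) > 35 for s in syllables):
        raise ValueError("syllable too long for a one-symbol length")
    code = "".join(DIGITS[len(s)] for s in syllables)
    return "".join(syllables), code
```

Because every indicator is treated as a break, this only works after words that legitimately contain hyphens or periods have been set aside.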

Each source was separately checked for accuracy. We started with the four files: W7, LONG, RADC, IBM. After putting these in a common format, each file was checked for acceptable forms of words: no internal blanks, no apostrophes, no numbers, no hyphens, and no foreign characters. Each file was checked to ensure that every syllable had a vowel (except dirndl, Houyhnhnm, Niflheim, McCoy, McCarthy, and their derivatives). Each file was checked to identify cases where the same word might have multiple hyphenations.
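The form check and the vowel check might look like the following Python sketch. Several details are assumptions and are marked in the comments: "foreign characters" is read as anything other than the unaccented letters A to Z, "y" is counted as a vowel, and derivatives of the exception words are not handled.

```python
import re

VOWELS = set("aeiouy")  # assumption: "y" counts as a vowel

# Exceptions named in the text; their derivatives are not handled here.
EXCEPTIONS = {"dirndl", "houyhnhnm", "niflheim", "mccoy", "mccarthy"}


def acceptable_form(word: str) -> bool:
    """Unaccented letters only: no blanks, apostrophes, digits, or hyphens."""
    return re.fullmatch(r"[A-Za-z]+", word) is not None


def syllables_have_vowels(word: str, code: str) -> bool:
    """True if every syllable of word, as divided by code, has a vowel."""
    if word.lower() in EXCEPTIONS:
        return True
    start = 0
    for ch in code:
        length = int(ch, 36)  # digits 1-9, then A-Z for 10-35
        if not VOWELS & set(word[start:start + length].lower()):
            return False
        start += length
    return True
```

A failed vowel check usually points to a misplaced hyphenation point rather than to a misspelled word, so it is a useful test of the encoding itself.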

We then combined these lists in an attempt to create a master hyphenation list. Major advantages of this approach are the large set of words in the resulting list and the redundant verification of the hyphenation information.

One of the problems with hyphenation is that some spellings have more than one hyphenation, depending on the meaning of the word. The differing hyphenations are a result of varying pronunciations, which are related to the source of the word or its part of speech. For example, project used as a noun is proj-ect, but used as a verb it is pro-ject. Differing meanings are the reason for the differing hyphenation of add-er (one who adds) and ad-der (a type of snake); in both cases the word is a noun. A list of 187 of these words has been constructed.

It is sometimes difficult to determine when a word has, in fact, multiple hyphenations. The word division supplement of the Government Printing Office Style Manual [17] lists 100 such words; the others were found while processing the various lists. However, one also finds anomalous situations such as, in W7, the varying hyphenation of footedness in slow-footedness (footed-ness) and flat-footedness (foot-ed-ness) or the variation in spoken and fair-spoken (spo-ken versus spok-en).

In general, multiple hyphenations come in two forms: compatible and incompatible. If the hyphenation points of one hyphenation are a subset of those of another, the hyphenations are compatible, and we choose the larger set of hyphenation points. Thus, the hyphenations footed-ness and foot-ed-ness are compatible, and we choose foot-ed-ness.
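The compatibility test is a subset test on the sets of hyphenation points. A minimal Python sketch, with hypothetical function names and assuming encodings that list every syllable of the word:

```python
def break_points(code: str) -> frozenset:
    """Character offsets at which the word may be hyphenated."""
    points, offset = [], 0
    for ch in code[:-1]:  # the last syllable ends the word, not a break
        offset += int(ch, 36)
        points.append(offset)
    return frozenset(points)


def merge_if_compatible(code_a: str, code_b: str):
    """Return the encoding with the larger set of hyphenation points if
    one set contains the other; return None if they are incompatible."""
    a, b = break_points(code_a), break_points(code_b)
    if a <= b:
        return code_b
    if b <= a:
        return code_a
    return None
```

For footedness, footed-ness is encoded as 64 and foot-ed-ness as 424; the point sets are {6} and {4, 6}, so the two are compatible and 424 is chosen.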

When neither hyphenation is a subset of the other, the two are incompatible (such as spo-ken and spok-en), and other means must be used to resolve the differences. The simplest means would be reference to a universally accepted definitive source. However, no single source appears universally acceptable. We have a list of 32 words for which W7 indicates one hyphenation and all our other sources indicate a different one.

For words not on the list of known multiple hyphenations, multiple hyphenations may mean an error in at least one of the possible hyphenations. We initially had the following number of words in each file:

W7 72,983
LONG 20,506
RADC 28,689
IBM 70,355

If we combine all these files, we end up with a list of 108,190 unique words, of which 2,011 have a disputed hyphenation.

To resolve these disputes, we counted the number of times each hyphenation occurs in the combined files. Where two or three sources showed the hyphenation one way and only one source showed an incompatible hyphenation, the more common hyphenation was chosen. In cases in which two sources indicated one hyphenation and two indicated another, or in which only two or three sources included the word and each indicated a different hyphenation, we had to consider each word separately.
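The voting rule just described can be sketched in Python as follows. The function name and the dictionary layout are hypothetical, and the sketch assumes that compatible hyphenations have already been merged, so that any two different encodings for a word are incompatible.

```python
from collections import Counter


def resolve(votes: dict):
    """votes maps a source name (e.g. "W7") to its hyphenation encoding.
    Return the chosen encoding, or None if the word needs manual review."""
    if not votes:
        return None
    counts = Counter(votes.values())
    if len(counts) == 1:  # every source that has the word agrees
        return next(iter(counts))
    best, n_best = counts.most_common(1)[0]
    # Two or three sources agree and exactly one source dissents.
    if n_best >= 2 and len(votes) - n_best == 1:
        return best
    return None
```

A two-against-two split, or a word found in only two sources that disagree, returns None and is left for individual consideration.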

At the moment we are attempting to get either of two additional sources: the word division list of the U.S. Government Printing Office[17] or the word-hyphenation list of the American Heritage Dictionary [18]. With either one of these we will have an odd number of hyphenation sources and hence may be able to use a strict majority rule to arrive at a defined hyphenation.

An interesting side point is to examine the source of hyphenations deemed incorrect. The following list shows the number of hyphenations of words deleted by our selection procedure:

W7 171
LONG 817
RADC 120
IBM 316

The high number of variant hyphenations from LONG undoubtedly reflects the different pronunciation (and hyphenation) resulting from British English.

CONCLUSIONS

We are still tying up minor loose ends in the W7 dictionary and our master hyphenation list. We have recently acquired the word frequency information of the American Heritage word frequency study[19] and are adding that frequency information to our hyphenation word list. This will allow us to evaluate hyphenation algorithms and readability formulas in ways not previously possible. We will shortly be able to determine quantitatively the accuracy of various hyphenation algorithms with respect to both a given word list and the word list weighted by frequency. This will allow us to select appropriate algorithms for various situations.

REFERENCES