Word Hy-phen-a-tion by Com-put«er - PDF

Description
August 1983 Report No. STAN-CS Word Hy-phen-a-tion by Com-put«er Franklin Mark Liang Deparlment of Computer Science Stanford University Stanford, CA WORD HY-PHEN-A-TION BY COM-PUT-ER Franklin

Please download to get full document.

View again

of 20
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Information
Category:

Industry

Publish on:

Views: 39 | Pages: 20

Extension: PDF | Download: 0

Share
Transcript
August 1983 Report No. STAN-CS Word Hy-phen-a-tion by Com-put«er Franklin Mark Liang Deparlment of Computer Science Stanford University Stanford, CA 94305 WORD HY-PHEN-A-TION BY COM-PUT-ER Franklin Mark Liang Department of Computer Science Stanford University Stanford, California Abstract *. This thesis describes research leading to an improved word hyphenation algorithm for the TjrjX82 typesetting system. Hyphenation is viewed primarily as a data compression problem, where we are given a dictionary of words with allowable division points, and try to devise methods that take advantage of the large amount of redundancy present. The new hyphenation algorithm is based on the idea of hyphenating and inhibiting patterns. These are simply strings of letters that, when they match hi a word, give us information about hyphenation at some point in the pattern. For example, '-tion' and *c-c' are good hyphenating patterns. An important feature of this method is that a suitable set of patterns can be extracted automatical!/ from the dictionary. In order to represent the set of patterns in a compart form that is also reasonably efficient for searching, the author has developed a new data structure called a packed trie. This data structure allows the very fast search times characteristic of indexed tries, but in many cases it entirely eliminates the wasted space for null links usually present in such tries. We demonstrate the versatility and practical advantages of this data structure by using a variant of it as the critical component of the program that generates the patterns from the dictionary. The resulting hyphenation algorithm uses about 4500 patterns that compile into a packed trie occupying 25K bytes of storage. These patterns find 89% of the hyphens in a pocket dictionary word list, with essentially no error. By comparison, the uncompressed dictionary occupies over 500K bytes. This research was supported in part by the National Science Foundation under grants IST-8B and MSC-8S-00984, and by the System Development Foundation. 'TgK' is a trademark of the American Mathematical Society. WORD HY-PHEN-A-TION BY COM-PUT-ER A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE f AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY W PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY by Franklin Mark Liang June 1983 ii Copyright 1983 by Franklin Mark Liang iii Acknowledgments I am greatly indebted to my adviser, Donald Knuth, for creating the research environment that made this work possible. When I began work on the I^jX project as a summer job, I would not have predicted that computer typesetting would become such an active area of computer science research. Prof. Knuth's foresight was to recognize that there were a number of fascinating problems in the field waiting to be explored, and his pioneering efforts have stimulated many others to think about these problems. I am also grateful to the Stanford Computer Science Department for providing the facilities and the community that have formed the major part of my life for the past several yean. I thank my readers, Luis Trabb Pardo and John Gill, as well as Leo Guibas who served on my orals committee on short notice. In addition, thanks to David Fuchs and Tom Pressburger for helpful advice and encouragement. Finally, this thesis is dedicated to my parents, for whom the experience of pursuing a graduate degree has bee.i perhaps even more traumatic than it was for myself. IV Table of contents Introduction 1 Examples. 2 T^X and hyphenation 3 Time magazine algorithm 4 Patterns 5 Overview of thesis 7 The dictionary problem 8 Data structures 0 Superimposed coding. 10 Tries 11 Packed tries 15 Suffix compression Derived forms 18 Spelling checkers 10 Related work 21 Hyphenation 23 Finite-state machines with output. 23 Minimization with don't cares. 24 Pattern matching 26 Pattern generation 29 Heuristics 30 Collecting pattern statistics 31 Dynamic packed tries. 32 Experimental results 34 Examples.37 History and Conclusion 39 Appendix The PATGEW program.,.45 List of patterns References Chapter 1 Introduction * The work described in this thesis was inspired by the need for a word hyphenation routine as part «f Don Knuth's T^jX typesetting system [1]. This system was initially designed in order to typeset Prof. Knuth's seven-volume series of books, The Art of Computer Programming, when he became dissatisfied with the quality of computer typesetting done by his publisher. Since Prof. Knuth's books were to be a definitive treatise on computer science, he could not bear to see his scholarly work presented in an inferior manner, when the degradation was entirely due to the fact that the material had been typeset by a computer! Since then, TjrjX (also known as Tau Epsilon Chi, a system for technical text) has gained wide popularity, and it is being adopted by the American Mathematical Society, the world's largest publisher of mathematical literature, for use in its journals. TjjjX is distinctive among other systems for word processing/document preparation in its emphasis on the highest quality output, especially for technical material. One necessary component of the system is a computer-based algorithm for hyphenating English words. This is part of the paragraph justification routine, and it is intended to eliminate the need for the user to specify word division points explicitly when they are necessary for good paragraph layout. Hyphenation occurs relatively infrequently in most book-format printing, but it becomes rather critical in narrow-column formats such as newspaper printing. Insufficient attention paid to this aspect of layout results in large expanses of unsightly white space, or (even worse) in words split at inappropriate points, e.g. new-spaper. Hyphenation algorithms for existing typesetting systems are usually either rulebased or dictionary-based. Rule-based algorithms rely on a set of division rules such as given for English in the preface of Webster's Un;ibridged Dictionary [2]. These include recognition of common prefixes and suffixes, splitting between double consonants, and other more specialized rules. Some of the rules are not particularly 2 INTRODUCTION amenable to computer implementation; e.g. split between the elements of a compound word . Rule-based schemes are inevitably subject to error, and they rarely cover all possible cases. In addition, the task of finding a suitable set of rules in the first place can be a difficult and lengthy project. Dictionary-based routines sim'ply store an entire word list along with the allowable division points. The obvious disadvantage of this method is the excessive storage required, as well as the slowing down of the justification process when the hyphenation routine needs to access a part of the dictionary on secondary store. Examples To demonstrate the importance of hyphenation, consider Figure 1, which shows a paragraph set in three different ways by T p(. The first example uses TjjjX's normal paragraph justification parameters, but with the hyphenation routine turned off. Because the line width in this example is rather narrow, TjjX is unable to find an acceptable way of justifying the paragraph, resulting in the phenomenon known as an overfull box*. One way to fix this problem is to increase the stretchability of the spaces between words, as shown in the second example. (TjjX users: This was done by increasing the stretch component of spaceship to. 5em.) The right margin is now straight, as desired, but the overall spacing is somewhat loose. In the third example, the hyphenation routine is turned on, and everything is beautiful. In olden time* when wishing till helped one, there lived a king whose daughters were all beautifi I, jut the youngest WM «O beautiful hat the sun itself, which has seer 10 much, was astonished wheneve r I ihone in her face. Close by he king's custlc lay a great dark orcat, and under nn old lime-trcr n tlir Forest a well, and when the Hay wns very warm, the king*: child went out into the forest and sat down by the side of the tool 'nuntain, and when shr was borcc ihe took a golden ball, and threw I up on high and caught it, and this hall was her favorite plaything. In olden limes when wishing still helped one, there lived a king whose daughters were all beautiful, but the youngest was so beautiful that the sun itself, which has seen so much, was astonished whenever it shone in her face. Close by the king's castle lay a great dark forest, and under t \ old lime tree in the forest was a well, and when the day was very warm, the king's child went out into the forest and sat down by the side of the cool fountain, and when she was bored she took a golden ball, and threw it up on high and caught it, and this ball was her favorite plaything. In olden times'when wishing still helped one, there lived a king whose daughters were all beautiful, but the youngest waa so beautiful that the tun itself, which has seen so much, was astonished whenever it shone in her face. Close by the king's castle lay a great dark forest, and under an old lime-tree in the forest was a well, and when the day waa very warm, the king's child went out into the forest and sat down by the side of the cool fountain, and when she was bored she look *. golden ball, and threw it up on high and cnuplit it, and this ball was her favorite plaything. Figure 1. A typical paragraph with and without hyphenation. INTRODUCTION. S sel-fadjotnt as-so-ciate as-so-cl-ate Pit-tsburgh prog-ress pro-gresa clcarin-ghouee rec-ord re-cord fun-draising a-rith me-tic ar-ith-met-ic ho-meowners eve-ning even-ing playw-right pe-ri-od-ic per-i-o-dic algori-thm walkth-rough in-de-pen-dent in-de-jend-ent Re-agan tri-bune trib-une Figure 2. Difficult hyphenations. However, life is not always so simple. Figure 2 shows that hyphenation can be difficult. The first column shows erroneous hyphenations made by various typesetting systems (which shall remain nameless). The next group of examples are words that hyphenate differently depending on how they are used. This happens most commonly with words that can serve as both nouns and verbs. The last two examples show that different dictionaries do not always agree on hyphenation (in this case Webster's vs. American Heritage). TjrjX and hyphenation The original TgX hyphenation algorithm was designed by Prof. Knuth and the author in the summer of It is essentially a rule-based algorithm, with three main types of rules: (1) suffix removal, (2) prefix removal, and (3) vowelconsonant-consonant-vowel (veev) breaking. The latter rule states that when the pattern 'vowel-consonant-consonant-vowcl' appears in a word, we can in most cases split between the consonants. There are also many special case rules; for example, break vowel-q or break after ck . Finally a small exception dictionary (about 300 words) is used to handle particularly objectionable errors made by the above rules, and to hyphenate certain common words (e.g. pro-gram) that are not split by the rules. The complete algorithm is described in Appendix H of the old TjjX manual. In practice, the above algorithm has served quite well. Although it does not find all possible division points in a word, it very rarely makes an error. Tests on a pocket dictionary word list indicate that about 40% of the allowable hyphen points are found, with 1% error (relative to the total number of hyphen points). The algorithm requires 4K 36-bit words of code, including the exception dictionary. 4 INTRODUCTION The goal of the present research was to develop a better hyphenation algorithm. By better we mean finding more hyphens, with little or no error, and using as little additional space as possible. Recall that one way to perform hyphenation is to simply store the entire dictionary. Thus we can view our task as a data compression problem. Since there is a good deal of redundancy in English, we can hope for substantial improvement over the straightforward representation. Another goal was to automate the design of the algorithm as much as possible. The original T rjx algorithm was developed mostly by hand, with a good deal of trial and etror. Extending such a rule-based scheme to find the remaining hyphens seems very difficult. Furthermore such an effort must be repeated for each new language. The former approach can be a problem even for English, because pronunciation (and thus hyphenation) tends to change over time, and because different types of publication may call for different sets of admissible hyphens., Time magazine algorithm A number of approaches were considered, including methods that have been discussed in the literature or implemented in existing typesetting systems. One of the methods studied was the so-called Time magazine algorithm, which is table-based rather than rule-based. The idea is to look at four letters surrounding each possible ' Breakpoint, namely two letters preceding and two letters following the given point. However we do not want to store a table of 26 4 = 456,976 entries representing all possible four-letter combinations. (In practice only about 15% of these four-letter combinations actually occur in English words, but it is not immediately obvious how to take advantage of this.) Instead, the method uses three tables of size 26 2, corresponding to the two letters preceding, surrounding, and following a potential hyphen point. That is, if the letter pattern wx-yz occurs in a word, we look up three values corresponding to the letter pairs wx, xy, and yz, and use these values to determine if we can split the pattern. What should the three tables contain? In the T:\ne algorithm the table values were the probabilities that a hyphen could occur after, between, or before two given letters, respectively. The probability that the pattern wx-yz can be split is then estimated as the product of these three values (as if the probabilities were independent, which they aren't). Finally the estimated value is compared against a threshold to determine hyphenation. Figure 3 shows an example of hyphenation probabilities computed by this method. INTRODUCTION 1,1 I SUPERCALIFRAGIIISTICEIPIALIDOCIOUS Figure S. Hyphenation probabilities. The advantage of this table-based approach is that the tables can be generated automatically from the dictionary. However, some experiments with the method yielded discouraging results. One estimate is 40% of the hyphens found, with 8% error. Thus a large exception dictionary would be required for good performance. The reason for the limited performance of the above scheme is that just four letters of context surrounding the potential break point are not enough in many cases. In an extreme example, we might have to look as many as 10 letters ahead in order to determine hyphenation, e.g. dem-on-stra-tion vs. de-mpn-stra-tive. So a more powerful method is needed. Patterns A good deal of experimentation led the author to a more powerful method based on the idea of hyphenation patterns. These ore simply strings of letters that, when they match in a word, will tell us how to hyphenate at some point in the pattern. For example, the pattern 'tion' might tell us that we can hyphenate before the V. Or when the pattern 'cc' appears in a word, we can usually hyphenate between the c's. Here arc some more examples of good hyphenating patterns:.in-d.in-s.in-t.un-d b-s -cia con-s con-t e-ly er-1 er-ra ex- -ful it-t i-ty -lees 1-ly -ment n-co -ness n-f n-1 n-ei n-v om-m -sion 8-ly s-nos ti-ca x-p (The character '.' matches the beginning or end of a word.) 6 INTRODUCTION Patterns have many advantages. They arc a general form of hyphenation rule that can include prefix, suffix, and other rules as special cases. Patterns can even describe an exception dictionary, namely by using entire words as patterns. (Actually, patterns are often more concise than an exception dictionary because a single pattern can handle several variant forms of a word; e.g. pro-gram, pro-grams, and pro-grammed.) More importantly, the pattern matching approach has proven very effective. An ft appropriate set of patterns captures very concisely the information needed to perform hyphenation. Yet the pattern rules are of simple enough form that they can be generated automatically from the dictionary. When looking for good hyphenating patterns, we soon discover that almost all of them have some exceptions. Although -tion is a very safe pattern, it fails on the word cat-ion. Most other cases are less clear-cut; for example, the common pattern n-t can be hyphenated about SO percent of the time. It definitely seems worthwhile to use such patterns, provided that we can deal with the exceptions in some manner. After chooskg a set of hyphenating patterns, we may end up with thousands of exceptions. Theie could be listed in an exception dictionary, but we soon notice there are many similarities among the exceptions. For example, in the original T]jjX algorithm we found that the vowcl-consonant-consonant-vowel rule resulted in hundreds f errors of the form X-Yer or X-Yers, for certain consonant pairs XY, so we put in a new rule to prevent those errors. Thus, there may be rules that can handle large classes of exceptions. To take advantage of this, patterns come to the rescue again; but this time they are inhibit-* ivg patterns, because they show where hyphens should not be placed. Some good eximples of inhibiting patterns are: b=ly (don't break between b and ly), bs=, =cing, io=n, i«tin, «l8, nn», ns s t, n=ted, =pt, ti=al, =tly, «ts, and tt«. As it turns out, this approach is worth pursuing further. That is, after applying hyphenating and inhibiting patterns as discussed above, we might have another»e( of hyphenating patterns, then another set of inhibiting patterns, and BO on. We can think of each level of patterns as being exceptions to the exceptions of the previous level. The current Tj}X82 algorithm uses five alternating levels of hyphenating and inhibiting patterns. The reasons for this will be explained in Chapter 4. The idea of patterns is the basis of the new TJJX hyphenation algorithm, and it was the inspiration for much of the intermediate investigation, that will be described. INTRODUCTION 7 Overview of thesis In developing the pattern scheme, two main questions arose: (1) How can we represent the set of hyphenation patterns in a compact form that is also reasonably efficient for searching? (2) Given a hyphenated word list, how can we generate a suitable set of patterns? To solve these problems, the author has developed a new data structure called a parked trie. This data structure allows the very fast search times characteristic of indexed tries, but in many cases it entirely eliminates the wasted space for null links usually present in such tries. We will demonstrate the versatility and practical advantages of this data structure *y using it not only to represent the hyphenation patterns in the final algorithm, but also d'j the critical component of the program that generates the patterns from the dictionary. Packed tries have many other potential applications, including identifier lookup, spelling checking, and lexicographic sorting. Chapter 2 considers the simpler problem of recognizing, rather than hyphenating, a set of words such as a dictionary, and uses this problem to motivate and explain the advantages of the packed trie data, structure. We also point out the close relationship between tries and finite-state machines. Chapter 3 discusses ways of applying these ideas to hyphenation. After considering various approaches, including minimization with don't cares, we return to the idea of patterns. Chapter 4 discusses the heuristic method used to select patterns, introduces dynamic packed tries, and describes some experiments with the pattern generation pro-* gram. Chapter 5 gives a brief history, and mentions ideas for futu
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks