`
`Design and Implementation of the WordNet Lexical Database
`and Searching Software†
`
`Richard Beckwith, George A. Miller, and Randee Tengi
`
`Lexicographers must be concerned with the presentation as well as the content of
`their work, and this concern is heightened when presentation moves from the printed
`page to the computer monitor. Printed dictionaries have become relatively standardized
`through many years of publishing (Vizetelly, 1915); expectations for electronic lexicons
`are still up for grabs. Indeed, computer technology itself is evolving rapidly; an
`indefinite variety of ways to present lexical information is possible with this new
`technology, and the advantages and disadvantages of many possible alternatives are still
`matters for experimentation and debate. Given this degree of uncertainty, manner of
`presentation must be a central concern for the electronic lexicographer.
`WordNet is a pioneering excursion into this new medium. Considerable attention
`has been devoted to making it useful and convenient, but the solutions described here are
`unlikely to be the final word on these matters. It is hoped that readers will not merely
`note the shortcomings of this work, but will also be inspired to make improvements on it.
`One’s first impression of WordNet is likely to be that it is an on-line thesaurus. It is
`true that sets of synonyms are basic building blocks, and with nothing more than these
`synonym sets the system would have all the power of a thesaurus. When short glosses
`are added to the synonym sets, it resembles an on-line dictionary that has been
`supplemented with synonyms for cross referencing (Calzolari, 1988). But WordNet
`includes much more information than that. In an attempt to model the lexical knowledge
`of a native speaker of English, WordNet has been given detailed information about
`relations between word forms and synonym sets. How this relational structure should be
`presented to a user raises questions that outrun the experience of conventional
`lexicography.
`In developing this on-line lexical database, it has been convenient to divide the
`work into two interdependent tasks which bear a vague similarity to the traditional tasks
`of writing and printing a dictionary. One task was to write the source files that contain
`the basic lexical data — the contents of those files are the lexical substance of WordNet.
The second task was to create a set of computer programs that would accept the source
files and do all the work leading ultimately to the generation of a display for the user.

† This is a revised version of "Implementing a Lexical Network" in CSL Report #43, prepared
by Randee Tengi. UNIX is a registered trademark of UNIX System Laboratories, Inc. Sun, Sun 3
and Sun 4 are trademarks of Sun Microsystems, Inc. Macintosh is a trademark of Macintosh
Laboratory, Inc. licensed to Apple Computer, Inc. NeXT is a trademark of NeXT. Microsoft
Windows is a trademark of Microsoft Corporation. IBM is a registered trademark of
International Business Machines Corporation. X Windows is a trademark of the Massachusetts
Institute of Technology. DECstation is a trademark of Digital Equipment Corporation.

`The WordNet system falls naturally into four parts: the WordNet lexicographers’
`source files; the software to convert these files into the WordNet lexical database; the
`WordNet lexical database; and the suite of software tools used to access the database.
`The WordNet system is developed on a network of Sun-4 workstations. The software
`programs and tools are written using the C programming language, Unix utilities, and
`shell scripts. To date, WordNet has been ported to the following computer systems:
`Sun-3; DECstation; NeXT; IBM PC and PC clones; Macintosh.
`The remainder of this paper discusses general features of the design and
`implementation of WordNet. The ‘‘WordNet Reference Manual’’ is a set of manual
`pages that describe aspects of the WordNet system in detail, particularly the user
`interfaces and file formats. Together the two provide a fairly comprehensive view of the
`WordNet system.
`
`Index of Familiarity
`One of the best known and most important psycholinguistic facts about the mental
`lexicon is that some words are much more familiar than others. The familiarity of a word
`is known to influence a wide range of performance variables: speed of reading, speed of
`comprehension, ease of recall, probability of use. The effects are so ubiquitous that
`experimenters who hope to study anything else must take great pains to equate the words
`they use for familiarity. To ignore this variable in a lexical database that is supposed to
`reflect psycholinguistic principles would be unthinkable.
`In order to incorporate differences in familiarity into WordNet, a syntactically
`tagged index of familiarity is associated with each word form. This index does not
`reflect all of the consequences of differences of familiarity — some theorists would ask
`for strength indices associated with each relation — but accurate information on all of
`the consequences is not easily obtained. The present index is a first step.
`Frequency of use is usually assumed to be the best indicator of familiarity. The
`closed class words that play an important syntactic role are the most frequently used, of
`course, but even within the open classes of words there are large differences in frequency
`of occurrence that are assumed to correlate with — or to explain — the large differences
`in familiarity. The frequency data that are readily available in the technical literature,
`however, are inadequate for a database as extensive as WordNet. Thorndike and Lorge
`(1944) published data based on a count of some 5,000,000 running words of text, but
`they reported their results only for the 30,000 most frequent words. Moreover, they
`defined a ‘‘word’’ as any string of letters between successive spaces, so their counts for
`homographs are untrustworthy; there is no way to tell, for example, how often lead
occurred as a noun and how often as a verb. Francis and Kučera (1982) tag words for
`their syntactic category, but they report results for only 1,014,000 running words of text
`— or 50,400 word types, including many proper names — which is not a large enough
`sample to yield reliable counts for infrequently used words. (A comfortable rate of
`speaking is about 120 words/minute, so that 1,000,000 words corresponds to 140 hours,
`or about two weeks of normal exposure to language.)
`
`Fortunately, an alternative indicator of familiarity is available. It has been known at
`least since Zipf (1945) that frequency of occurrence and polysemy are correlated. That is
`to say, on the average, the more frequently a word is used the more different meanings it
will have in a dictionary. An intriguing finding in psycholinguistics (Jastrzembski,
`1981) is that polysemy seems to predict lexical access times as well as frequency does.
`Indeed, if the effect of frequency is controlled by choosing words of equivalent
`frequencies, polysemy is still a significant predictor of lexical decision times.
`Instead of using frequency of occurrence as an index of familiarity, therefore,
`WordNet uses polysemy. This measure can be determined from an on-line dictionary. If
`an index value of 0 is assigned to words that do not appear in the dictionary, and if values
`of 1 or more are assigned according to the number of senses the word has, then an index
`value can be made available for every word in every syntactic category. Associated with
every word form in WordNet, therefore, there is an integer that represents a count (taken
from the Collins Dictionary of the English Language) of the number of senses that word
form has when it is used as a noun, verb, adjective, or adverb.
`A simple example of how the familiarity index might be used is shown in Table 1.
`If, say, the superordinates of bronco are requested, WordNet can respond with the
sequence of hypernyms shown in Table 1. Now, if all the terms with a familiarity index
(polysemy count) of 0 or 1, which are primarily technical terms, are omitted, the
hypernyms of bronco include simply: bronco @→ pony @→ horse @→ animal @→ organism @→
entity. This shortened chain is much closer to what a layman would
`expect. The index of familiarity should be useful, therefore, when making suggestions
`for changes in wording. A user can search for a more familiar word by inspecting the
`polysemy in the WordNet hierarchy.
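The filtering described above can be sketched in a few lines of Python. This is an illustration only, not WordNet's own code (which is written in C); the names HYPERNYMS_OF_BRONCO and familiar_chain are hypothetical, and the polysemy counts are those given in Table 1.

```python
# Hypernym chain for "bronco" with the polysemy counts listed in Table 1.
HYPERNYMS_OF_BRONCO = [
    ("bronco", 1), ("mustang", 1), ("pony", 5), ("horse", 14),
    ("equine", 0), ("odd-toed ungulate", 0), ("placental mammal", 0),
    ("mammal", 1), ("vertebrate", 1), ("chordate", 1),
    ("animal", 4), ("organism", 2), ("entity", 3),
]


def familiar_chain(chain, min_index=2):
    """Keep the search word itself, then only those hypernyms whose
    familiarity index (polysemy count) is at least min_index."""
    head, rest = chain[0], chain[1:]
    return [head[0]] + [word for word, index in rest if index >= min_index]


shortened = familiar_chain(HYPERNYMS_OF_BRONCO)
# shortened == ["bronco", "pony", "horse", "animal", "organism", "entity"]
```

The threshold is a parameter because the cut-off of 0 or 1 is only the example used in the text; an application might choose a different value.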
`WordNet would be a better simulation of human semantic memory if a familiarity
`index could be assigned to word-meaning pairs rather than to word forms. The noun tie,
`for example, is used far more often with the meaning {tie, necktie} than with the
`meaning {tie, tie beam}, yet both are presently assigned the same index, 13.
`
`Lexicographers’ Source Files
`WordNet’s source files are written by lexicographers. They are the product of a
`detailed relational analysis of lexical semantics: a variety of lexical and semantic
`relations are used to represent the organization of lexical knowledge. Two kinds of
`building blocks are distinguished in the source files: word forms and word meanings.
`Word forms are represented in their familiar orthography; word meanings are represented
`by synonym sets — lists of synonymous word forms that are interchangeable in some
`syntax. Two kinds of relations are recognized: lexical and semantic. Lexical relations
`hold between word forms; semantic relations hold between word meanings.
`WordNet organizes nouns, verbs, adjectives and adverbs into synonym sets
`(synsets), which are further arranged into a set of lexicographers’ source files by syntactic
`category and other organizational criteria. Adverbs are maintained in one file, while
`nouns and verbs are grouped according to semantic fields. Adjectives are divided
`between two files: one for descriptive adjectives and one for relational adjectives.
`
Hypernyms of bronco and their index values

Word                     Polysemy
----------------------   --------
bronco                   1
@→ mustang               1
@→ pony                  5
@→ horse                 14
@→ equine                0
@→ odd-toed ungulate     0
@→ placental mammal      0
@→ mammal                1
@→ vertebrate            1
@→ chordate              1
@→ animal                4
@→ organism              2
@→ entity                3

Table 1
`
`Appendix A lists the names of the lexicographers’ source files.
`Each source file contains a list of synsets for one part of speech. Each synset
`consists of synonymous word forms, relational pointers, and other information. The
`relations represented by these pointers include (but are not limited to):
`hypernymy/hyponymy, antonymy, entailment, and meronymy/holonymy. Polysemous
`word forms are those that appear in more than one synset, therefore representing more
`than one concept. A lexicographer often enters a textual gloss in a synset, usually to
`provide some insight into the semantics intended by the synonymous word forms and
`their usage. If present, the textual gloss is included in the database and can be displayed
by retrieval software. Comments can be entered outside of a synset by enclosing the
text of the comment in parentheses; comments are not included in the database.
`Descriptive adjectives are organized into clusters that represent the values, from one
`extreme to the other, of some attribute. Thus each adjective cluster has two (occasionally
`three) parts, each part headed by an antonymous pair of word forms called a head synset.
`Most head synsets are followed by one or more satellite synsets, each representing a
`concept that is similar in meaning to the concept represented by the head synset. One
`way to think of the cluster organization is to visualize a wheel, with each head synset as a
`hub and its satellite synsets as the spokes. Two or more wheels are logically connected
`via antonymy, which can be thought of as an axle between wheels.
`The Grinder utility compiles the lexicographers’ files. It verifies the syntax of the
`files, resolves the relational pointers, then generates the WordNet database that is used
`with the retrieval software and other research tools.
`
`Word Forms
`In WordNet, a word form is represented as the orthographic representation of an
`individual word or a string of individual words joined with underscore characters. A
`string of words so joined is referred to as a collocation and represents a single concept,
`such as the noun collocation fountain_pen.
`In the lexicographers’ files a word form may be augmented with additional
`information, necessary for the correct processing and interpretation of the data. An
`integer sense number is added for sense disambiguation if the same word form appears
`more than once in a lexicographer file. A syntactic marker, enclosed in parentheses, is
`added to any adjectival word form whose use is limited to a specific syntactic position in
`relation to the noun that it modifies. Each word form in WordNet is known by its
`orthographic representation, syntactic category, semantic field, and sense number.
`Together, these data make a ‘‘key’’ which uniquely identifies each word form in the
`database.
`
`Relational Pointers
`Relational pointers represent the relations between the word forms in a synset and
other synsets, and are either lexical or semantic. Lexical relations exist between
relational adjectives and the nouns that they relate to, and between adverbs and the
adjectives from which they are derived. The semantic relation between adjectives and
the nouns for which they express values is encoded as an attribute. The semantic relation
between noun attributes and the adjectives expressing their values is also encoded.
`Presently these are the only pointers that cross from one syntactic category to another.
`Antonyms are also lexically related. Synonymy of word forms is implicit by inclusion in
`the same synset. Table 2 summarizes the relational pointers by syntactic category.
`Meronymy is further specified by appending one of the following characters to the
`meronymy pointer: p to indicate a part of something; s to indicate the substance of
`something; m to indicate a member of some group. Holonymy is specified in the same
`manner, each pointer representing the semantic relation opposite to the corresponding
`meronymy relation.
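As a small illustration of this encoding, the Python sketch below decodes a meronymy or holonymy pointer and finds its opposite. It assumes the symbols # (meronym) and % (holonym) from Table 2 with the suffix appended directly; the function names are hypothetical and are not part of the WordNet software.

```python
BASES = {"#": "meronym", "%": "holonym"}
SUBTYPES = {"p": "part", "s": "substance", "m": "member"}


def describe_pointer(symbol):
    """Turn a two-character pointer such as '#p' into a readable label."""
    if len(symbol) != 2 or symbol[0] not in BASES or symbol[1] not in SUBTYPES:
        raise ValueError(f"not a meronymy/holonymy pointer: {symbol!r}")
    return f"{SUBTYPES[symbol[1]]} {BASES[symbol[0]]}"


def opposite_pointer(symbol):
    """Return the pointer for the opposite relation, keeping the subtype."""
    describe_pointer(symbol)  # validates the symbol
    return ("%" if symbol[0] == "#" else "#") + symbol[1]
```

For example, describe_pointer("#p") gives "part meronym", and opposite_pointer("#p") gives "%p".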
`Many pointers are reflexive, meaning that if a synset contains a pointer to another
`synset, the other synset should contain a corresponding reflexive pointer back to the
`original synset. The Grinder automatically generates the relations for missing reflexive
`pointers of the types listed in Table 3.
`A relational pointer can be entered by the lexicographer in one of two ways. If a
`pointer is to represent a relation between synsets — a semantic relation — it is entered
`following the list of word forms in the synset. Hypernymy always relates one synset to
`another, and is an example of a semantic relation. The lexicographer can also enclose a
`word form and a list of pointers within square brackets ([...]) to define a lexical relation
`between word forms. Relational adjectives are entered in this manner, showing the
`lexical relation between the adjective and the noun that it pertains to.
`
WordNet Relational Pointers

Noun          Verb           Adjective           Adverb
-----------   ------------   -----------------   --------------
Antonym   !   Antonym    !   Antonym         !   Antonym      !
Hyponym   ~   Troponym   ~   Similar         &   Derived from \
Hypernym  @   Hypernym   @   Relational Adj. \
Meronym   #   Entailment *   Also See        ^
Holonym   %   Cause      >   Attribute       =
Attribute =   Also See   ^

Table 2
`
Reflexive Pointers

Pointer      Reflect
----------   ----------
Antonym      Antonym
Hyponym      Hypernym
Hypernym     Hyponym
Holonym      Meronym
Meronym      Holonym
Similar to   Similar to
Attribute    Attribute

Table 3
`
`Verb Sentence Frames
`Each verb synset contains a list of verb frames illustrating the types of simple
`sentences in which the verbs in the synset can be used. A list of verb frames can be
`restricted to a word form by using the square bracket syntax described above. See
`Appendix B for a list of the verb sentence frames.
`
`Synset Syntax
`Strings in the source files that conform to the following syntactic rules are treated as
`synsets. Note that this is a brief description of the general synset syntax and is not a
`formal description of the source file format. A formal specification is found in the
`manual page wninput(5) of the ‘‘WordNet Reference Manual’’.
`
`[1] Each synset begins with a left curly bracket ({).
`[2] Each synset is terminated with a right curly bracket (}).
`[3] Each synset contains a list of one or more word forms, each followed by a
`comma.
`[4] To code semantic relations, the list of word forms is followed by a list of
`relational pointers using the following syntax: a word form (optionally preceded
`by "filename:" to indicate a word form in a different lexicographer file) followed
`by a comma, followed by a relational pointer symbol.
`[5] For verb synsets, "frames:" is followed by a comma separated list of applicable
`verb frames. The verb frames follow all relational pointers.
`[6] To code lexical relations, a word form is followed by a list of elements from [4]
`and/or [5] inside square brackets ([...]).
`[7] To code adjective clusters, each part of a cluster (a head synset, optionally
`followed by satellite synsets) is separated from other parts of a cluster by a line
`containing only hyphens. Each entire cluster is enclosed in square brackets.
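To make rules [1] to [4] concrete, here is a minimal Python sketch of a reader for the simplest kind of synset. It is not the Grinder's parser (which is written in yacc and lex) and it deliberately ignores rules [5] to [7], glosses, and sense numbers. It also assumes that items are separated by white space and that a pointer symbol directly follows its comma; the function name and the file name noun.animal in the usage note are illustrative only.

```python
def parse_synset(text):
    """Split a simple synset string into word forms and semantic pointers."""
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError("a synset must be enclosed in curly brackets")
    words, pointers = [], []
    for token in text[1:-1].split():
        form, comma, symbol = token.partition(",")
        if not comma:
            raise ValueError(f"missing comma after {token!r}")
        if symbol:
            # Rule [4]: word form, comma, pointer symbol, with an
            # optional "filename:" prefix on the word form.
            filename, _, target = form.rpartition(":")
            pointers.append((filename or None, target, symbol))
        else:
            # Rule [3]: a word form followed by a comma.
            words.append(form)
    if not words:
        raise ValueError("a synset needs at least one word form")
    return words, pointers
```

With this sketch, parse_synset("{ bronco, bronc, mustang,@ }") returns (["bronco", "bronc"], [(None, "mustang", "@")]), and a pointer written as noun.animal:mustang,@ yields ("noun.animal", "mustang", "@").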
`
`Archive System
`The lexicographers’ source files are maintained in an archive system based on the
`Unix Revision Control System (RCS) for managing multiple revisions of text files. The
`archive system has been established for several reasons — to allow the reconstruction of
`any version of the WordNet database, to keep a history of all the changes to
`lexicographers’ files, to prevent people from making conflicting changes to the same file,
`and to ensure that it is always possible to produce an up-to-date version of the WordNet
`database. The programs in the archive system are Unix shell scripts which envelop RCS
`commands in a manner that maintains the desired control over the lexicographers’ source
`files and provides a user-friendly interface for the lexicographers.
`The reserve command extracts from the archive the most recent revision of a given
`file or files and locks the file for as long as a user is working on it. The review command
`extracts from the archive the most recent revision of a given file or files for the purpose
of examination only; the file is therefore not locked. To discourage changes, files
extracted for review do not have write permission, since any such changes could not be
incorporated into the archive. The restore command verifies the integrity of a reserved
`file and returns it to the archive system. The release command is used to break a lock
`placed on a file with the reserve command. This is generally used if the lexicographer
`decides that changes should not be returned to the archive. The whose command is used
`to find out whether files are currently reserved, and if so, by whom.
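The locking behaviour of these five commands can be modelled in a few lines. The Python class below is a toy in-memory model, not the shell scripts or RCS itself: the class and method names simply mirror the command names, the integrity check performed by restore is omitted, and the file name used in the usage note is illustrative.

```python
class Archive:
    """Toy in-memory model of the five archive commands."""

    def __init__(self, files):
        # Each file keeps its full revision history, newest last.
        self.revisions = {name: [text] for name, text in files.items()}
        self.locks = {}  # file name -> user holding the lock

    def reserve(self, name, user):
        """Check out the latest revision and lock the file."""
        if name in self.locks:
            raise RuntimeError(f"{name} is reserved by {self.locks[name]}")
        self.locks[name] = user
        return self.revisions[name][-1]

    def review(self, name):
        """Return the latest revision for reading; no lock is taken."""
        return self.revisions[name][-1]

    def restore(self, name, user, text):
        """Store a new revision and give up the lock."""
        self._require_lock(name, user)
        self.revisions[name].append(text)
        del self.locks[name]

    def release(self, name, user):
        """Give up the lock without storing any changes."""
        self._require_lock(name, user)
        del self.locks[name]

    def whose(self, name):
        """Return the user holding the lock, or None."""
        return self.locks.get(name)

    def _require_lock(self, name, user):
        if self.locks.get(name) != user:
            raise RuntimeError(f"{name} is not reserved by {user}")
```

For instance, after archive = Archive({"noun.animal": "v1"}) and archive.reserve("noun.animal", "rt"), a second reserve by another user raises an error until the file is restored or released.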
`
`Grinder Utility
`The Grinder is a versatile utility with the primary purpose of compiling the
`lexicographers’ files into a database format that facilitates machine retrieval of the
`information in WordNet. The Grinder has several options that control its operation on a
`set of input files. To build a complete WordNet database, all of the lexicographers’ files
`must be processed at the same time. The Grinder is also used as a verification tool to
`ensure the syntactic integrity of the lexicographers’ files when they are returned to the
`archive system with the restore command.
`
`Implementation
`The Grinder is a multi-pass compiler that is coded in C. The first pass uses a parser,
`written in yacc and lex, to verify that the syntax of the input files conforms to the
`specification of the input grammar and lexical items, and builds an internal representation
`of the parsed synsets. Additional passes refer only to this internal representation of the
`lexicographic data. Pass one attempts to find as many syntactic and structural errors as
`possible. Syntactic errors are those in which the input file fails to conform to the input
`grammar’s specification, and structural errors refer to relational pointers that cannot be
resolved for some reason. Usually these errors occur because the lexicographer has made
a typographical error, such as constructing a pointer to a non-existent file, or has failed
to specify a sense number when referring to an ambiguous word form. Pass one cannot
`determine structural errors in pointers to files that are not processed together. When used
`as a verification tool, as from the restore command, only pass one is run.
`In its second pass, the Grinder resolves all of the semantic and lexical pointers. To
`do this, the pointers that were specified in each synset are examined in turn, and the
`target of each pointer (either a synset or a word form in a synset) is found. The source
`pointer is then resolved by adding an entry to the internal data structure which notes the
‘‘location’’ of the target. In the case of reflexive pointers, the target synset is
`then searched for a corresponding reflexive pointer. If found, the data structure
`representing the reflexive pointer is modified to note the ‘‘location’’ of its target, the
`original source pointer. If a reflexive pointer is not found, the Grinder automatically
`creates one with all the pertinent information.
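The creation of missing reflexive pointers can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the Grinder's C code: synsets are identified by hypothetical names rather than internal addresses, each synset holds a set of (relation, target) pairs, and the mapping REFLECT restates Table 3.

```python
# Pointer type -> type of the reflexive pointer (Table 3).
REFLECT = {
    "antonym": "antonym",
    "hyponym": "hypernym",
    "hypernym": "hyponym",
    "holonym": "meronym",
    "meronym": "holonym",
    "similar to": "similar to",
    "attribute": "attribute",
}


def add_missing_reflexive_pointers(synsets):
    """synsets maps a synset name to a set of (relation, target) pairs.

    For every pointer with a reflexive counterpart, make sure the target
    synset points back at the source. Adding to a set is a no-op when
    the reflexive pointer was already entered by the lexicographer.
    """
    for source, pointers in list(synsets.items()):
        for relation, target in list(pointers):
            reflected = REFLECT.get(relation)
            if reflected is not None:
                synsets[target].add((reflected, source))
    return synsets
```

A pointer whose target is not among the synsets raises KeyError here, which corresponds loosely to the structural errors described above.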
`A subsequent pass through the list of word forms assigns a polysemy index value, or
`sense count, to each word form found in the on-line dictionary. There is a separate sense
`count for each syntactic category that the word form is found in. The Grinder’s final pass
`generates the WordNet database.
`
`Internal Representation
`The internal representation of the lexicographic data is a network of interrelated
`linked lists. A hash table of word forms is created as the lexicographers’ files are parsed.
`Lower-case strings are used as keys; the original orthographic word form, if not in
`lower-case, is retained as part of the data structure for inclusion in the database files. As
`the parser processes an input file, it calls functions which create data structures for the
word forms, pointers, and verb frames in a synset. Once an entire synset has been
`parsed, a data structure is created for it which includes pointers to the various structures
`representing the word forms, pointers, and verb frames. All of the synsets from the input
`files are maintained as a single linked list. The Grinder’s different passes access the
`structures either through the linked list of synsets or the hash table of word forms. A list
`of synsets that specify each word form is maintained for the purposes of resolving
`pointers and generating the database’s index files.
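A rough Python analogue of this arrangement is shown below. It is only a sketch: a Python list stands in for the linked list of synsets, a dictionary stands in for the hash table, and the class and method names are hypothetical.

```python
from collections import defaultdict


class WordFormTable:
    """Synsets in input order plus a lower-case index of word forms."""

    def __init__(self):
        self.synsets = []                # stands in for the linked list
        self.by_key = defaultdict(list)  # lower-case key -> entries

    def add_synset(self, word_forms):
        """Record a parsed synset and index each of its word forms."""
        synset_id = len(self.synsets)
        self.synsets.append(list(word_forms))
        for form in word_forms:
            # The key is lower case; the original spelling is retained.
            self.by_key[form.lower()].append((form, synset_id))
        return synset_id

    def synsets_containing(self, form):
        """Return every synset that specifies the given word form."""
        return [self.synsets[i] for _, i in self.by_key.get(form.lower(), [])]
```

Looking up "Tie" or "tie" finds the same entries, while the stored spelling of each word form is kept for later output.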
`
`WordNet Database
`For each syntactic category, two files represent the WordNet database — index.pos
`and data.pos, where pos is either noun, verb, adj or adv (the actual file names may be
`different on platforms other than Sun-4). The database is in an ASCII format that is
`human- and machine-readable, and is easily accessible to those who wish to use it with
`their own applications. Each index file is an alphabetized list of all of the word forms in
`WordNet for the corresponding syntactic category. Each data file contains all of the
`lexicographic data gathered from the lexicographers’ files for the corresponding syntactic
`category, with relational pointers resolved to addresses in data files.
`The index and data files are interrelated. Part of each entry in an index file is a list
`of one or more byte offsets, each indicating the starting address of a synset in a data file.
`The first step to the retrieval of synsets or other information is typically a search for a
`word form in one or more index files to obtain all data file addresses of the synsets
`containing the word form. Each address is the byte offset (in the data file corresponding
`to the syntactic category of the index file) at which the synset’s information begins. The
`information pertaining to a single synset is encoded as described in the Data Files
`section below.
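The relationship between the two files can be illustrated with a small Python sketch. It builds a miniature data file in memory, records the byte offset of each line in an index, and then retrieves synsets by seeking to those offsets. The line layout (an eight-digit offset followed by free text) and the function names are assumptions made for the example; the real formats are described in wndb(5).

```python
import io


def build_database(synsets):
    """Return (data_bytes, index) for a list of (word_forms, text) pairs."""
    data = bytearray()
    index = {}
    for word_forms, text in synsets:
        offset = len(data)  # byte address at which this synset begins
        data += f"{offset:08d} {text}\n".encode("ascii")
        for form in word_forms:
            index.setdefault(form.lower(), []).append(offset)
    return bytes(data), index


def lookup(word, data, index):
    """Fetch every synset line containing word by seeking to its offset."""
    stream = io.BytesIO(data)
    results = []
    for offset in index.get(word.lower(), []):
        stream.seek(offset)
        results.append(stream.readline().decode("ascii").rstrip("\n"))
    return results
```

Because every address is a byte count from the start of the file, inserting or deleting even one character in the data shifts all later synsets away from the offsets stored in the index.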
`One shortcoming of the database’s structure is that although all the files are in
`ASCII, and are therefore editable, and in theory extensible, in practice this is almost
`impossible. One of the Grinder’s primary functions is the calculation of addresses for the
`synsets in the data files. Editing any of the database files would (most likely) create
`incorrect byte offsets, and would thus derail many searching strategies. At the present
`time, building a WordNet database requires the use of the Grinder and the processing of
`all lexicographers’ source files at the same time.
`The descriptions of the Index and Data files that follow are brief and are intended to
`provide only a glimpse into the structure, syntax, and organization of the database. More
`detailed descriptions can be found in the manual page wndb(5) included in the
`‘‘WordNet Reference Manual’’.
`
`Index Files
`Word forms in an index file are in lower case regardless of how they were entered in
`the lexicographers’ files. The files are sorted according to the ASCII character set
`collating sequence and can be searched quickly with a binary search.
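A binary search over such a sorted list might look like the Python sketch below. It assumes, for illustration only, that the word form is the first space-delimited field of each data line; the sample lines in the usage note are invented and do not reproduce the real index format.

```python
def find_entry(lines, word):
    """Binary search for the index line whose first field is word."""
    key = word.lower()  # index entries are stored in lower case
    lo, hi = 0, len(lines)
    while lo < hi:
        mid = (lo + hi) // 2
        # Python compares ASCII strings in ASCII collating order.
        if lines[mid].split(" ", 1)[0] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < len(lines) and lines[lo].split(" ", 1)[0] == key:
        return lines[lo]
    return None
```

Given the invented lines ["bronco 1 @ 100", "horse 14 @~ 200 300", "pony 5 @~ 400"], find_entry(lines, "Horse") returns the second line and find_entry(lines, "zebra") returns None.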
`Each index file begins with several lines containing a copyright notice, version
`number and license agreement, followed by the data lines. Each line of data contains the
`following information: the sense count from the on-line dictionary; a list of the relational
`pointer types used in all synsets containing the word (this is used by the retrieval
`software to indicate to a user which searches are applicable); a list of indices which are
`byte offsets into the corresponding data file, one for each occurrence of the word form in
`a synset. Each data line is terminated with an end-of-line character.
`
`Data Files
`A data file contains information corresponding to the synsets that were defined in
`the lexicographers’ files with pointers resolved to byte offsets in data.pos files.
`Each data file begins with several lines containing a copyright notice, version
`number and license agreement. This is followed by a list of the names of all the input
`files that were specified to the Grinder, in the order that they were given on the command
`line, followed by the data lines. Each line of data contains an encoding of the
`information entered by the lexicographer for a synset, as well as additional information
`provided by the Grinder which is useful to the retrieval software and other programs.
`Each data line is terminated with an end-of-line character. In the data files, word forms
`in a synset match the orthographic representation entered in the lexicographers’ files.
`The first piece of information on each line is the byte offset, or address, of the
synset. This is slightly redundant, since almost any computer program that reads a synset
from a data file knows the byte offset from which it was read; however, this piece of
information is useful when using UNIX utilities like grep to trace synsets and pointers
`without the use of sophisticated software. It also provides a unique ‘‘key’’ for a synset,
`if a user’s application requires one. An integer, corresponding to the location in the list
`of file names of the file from which the synset originated, follows. This can be used by
`retrieval software to annotate the display of a synset with the name of the originating file,
`and can be helpful for distinguishing senses. A list of word forms, relational pointers,
`and verb frames follows. An optional textual gloss is the final component of a data line.
`Relational pointers are represented by several pieces of information. The symbol
`for the pointer comes first, followed by the address of the target synset and its syntactic
`category (necessary for pointers that cross over into a different syntactic category),
`followed by a field which differentiates lexical and semantic pointers. If a lexical pointer
`is being represented, this field indicates which word forms in the source and target
`synsets the pointer pertains to. For a semantic pointer, this field is 0.
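The four pieces of information in a pointer can be unpacked as in the Python sketch below. The text above does not give the exact encoding, so the details here are assumptions: the syntactic category is taken to be a single letter, and the last field is taken to be four hexadecimal digits, two for the source word number and two for the target word number, with all zeros marking a semantic pointer. The names Pointer and parse_pointer are hypothetical; wndb(5) is the authority on the actual format.

```python
from typing import NamedTuple, Optional


class Pointer(NamedTuple):
    symbol: str
    target_offset: int
    target_pos: str
    source_word: Optional[int]  # None for a semantic pointer
    target_word: Optional[int]  # None for a semantic pointer


def parse_pointer(fields):
    """Decode [symbol, offset, category, source/target] into a Pointer."""
    symbol, offset, pos, source_target = fields
    if len(source_target) != 4:
        raise ValueError("expected a four-digit source/target field")
    source = int(source_target[:2], 16)
    target = int(source_target[2:], 16)
    if source == 0 and target == 0:
        # Semantic pointer: relates whole synsets, not word forms.
        return Pointer(symbol, int(offset), pos, None, None)
    # Lexical pointer: relates one word form to another.
    return Pointer(symbol, int(offset), pos, source, target)
```

Under these assumptions, parse_pointer("@ 00001740 n 0000".split()) describes a semantic hypernym pointer to the synset at byte 1740, while a final field of 0102 would relate the first word form of the source synset to the second word form of the target.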
`
`Retrieving Lexical Information
`In order to give a user access to information in the database, an interface is required.
`Interfaces enable end users to retrieve the lexical data and display it via a window-based
`tool or the command line. When considering the role of the interface, it is important to
`recognize the difference between a printed dictionary and a lexical database. WordNet’s
`interface software creates its responses to a user’s requests on the fly. Unlike an on-line
`version of a printed dictionary, where information is stored in a fixed format and
`displayed on demand, WordNet’s information is stored in a format that would be
`meaningless to an ordinary reader. The interface provides a user with a variety of ways
`to retrieve and display lexical information. Different interfaces can be created to serve
`the purposes of different users, but all of them will draw on the same underlying lexical
`database, and may use the same software functions that interface to the database files.
`User interfaces to WordNet can take on many forms. The standard interface is an X
`Windows application, which has been ported to several computer platforms. Microsoft
`Windows and Macintosh interfaces have also been written. An alternative command line
`interface allows the user to retrieve the same data, with exactly the same output as the
`window-based interfaces, although the specification of the retrieval criteria is more
`cumbersome, and the whole effect is less impressive. Nevertheless, the command line
`interface is useful because some users do not have access to windowing environments.
`Shell scripts and other programs can also be written around the command line interface.
`The search process is the same regardless of the type of search requested. The first
`step is to retrieve the index entry located in the appropriate index file. This will contain a
`list of addresses of the synsets in the data file in which the word appears. Then each of
`the synsets in the data file is searched for the requested information, which is retrieved
`and formatted for output. Searching is complicated by the fact that each synset
`containing the search word also contains pointers to other synsets in the data file that may
`need to be retrieved and displayed, depending on the search type. For example, each
synset in the hypernymic pathway points to the next synset in the hierarchy. If a user
requests a recursive search on hypernyms, the retrieval process is repeated until a
synset is encountered that contains no further pointers.
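The recursive retrieval can be sketched in Python as follows. This is a simplified model rather than the library's own search code: synsets live in a dictionary keyed by hypothetical addresses instead of being read from a data file, and the function name and output layout are inventions for the example. It assumes the hypernym pointers contain no cycles.

```python
def trace_hypernyms(synsets, address, depth=0):
    """Return indented lines for a synset and all of its hypernyms."""
    synset = synsets[address]
    lines = ["  " * depth + ", ".join(synset["words"])]
    # Follow each hypernym pointer in turn; the recursion stops at a
    # synset that contains no further hypernym pointers.
    for target in synset["hypernyms"]:
        lines.extend(trace_hypernyms(synsets, target, depth + 1))
    return lines
```

For a chain in which bronco points to mustang and mustang points to pony, the result is one line per synset, each indented one level deeper than the synset that points to it.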
`The user interfaces to WordNet and other software tools rely upon a library of
`functions that interface to the database files. A fairly comprehensive set of functions is
`provided: they perform searches and retrievals, morphology, and various other utility
`functions. Appendix C contains a brief description of these functions. The structured,
`flexible design of the library provides a simple programming interface to the WordNet
`database. Low-level, complex, and utility functions are included. The user interface
`software depends upon the more complex functions to perform the actual data retrieval
`and formatting of the search results for display to the user. Low-level functions provide
`basic access to the lexical data in the index and data files, while shielding the
`programmer from the details of opening files, reading files, and parsing a line of data.
`These functions return the requested information in a data structure that can be
`interpreted and used as required by the application. Utility functions allow simple
`manipulations of the search strings.
`The basic searching function, findtheinfo(), receives as its input arguments a word
`form, syntactic category, and search type; findtheinfo() calls a low-level function to find
`the corresponding entry in the index file, and for each sense calls the appropriate function
`to trace the pointer corresponding to the search type. Most traces are done with the
`function traceptrs(), but specialized functions exist for search types which do not
`conform to the standard hierarchical search. As a synset is retrieved from the database, it
`is formatted as required by the search type into a large output buffer. The resulting
`buffer, containing all of the formatted synsets for all of the senses of the search word, is
`returned to the caller. The calling function simply has to print the buffer returned from
`findtheinfo().
`This general search and retrieval algorithm is used in s