\input pcparse % common TeX setup @c -*-texinfo-*- \input texinfo @c %**start of header @setfilename pckimmo.info @settitle PC-Kimmo Reference Manual @c %**end of header @syncodeindex fn cp @set TITLE PC-Kimmo Reference Manual @set SUBTITLE a two-level processor for morphological analysis @set VERSION version 2.1.0 @set DATE October 1997 @set AUTHOR by Evan Antworth and Stephen McConnel @set COPYRIGHT Copyright @copyright{} 2000 SIL International @include front.txi @c ---------------------------------------------------------------------------- @node Top, Introduction, (dir), (dir) @comment node-name, next, previous, up @ifinfo @ifclear txt This is the reference manual for the PC-Kimmo program. @end ifclear @end ifinfo @menu * Introduction:: * Two-level formalism:: * Running PC-Kimmo:: * Rules file:: * Lexicon files:: * Grammar file:: * Convlex:: * Bibliography:: @end menu @c ---------------------------------------------------------------------------- @node Introduction, Two-level formalism, Top, Top @comment node-name, next, previous, up @chapter Introduction to the PC-Kimmo program This document describes PC-Kimmo, an implementation of the two-level computational linguistic formalism for personal computers. It is available for MS-DOS, Microsoft Windows, Macintosh, and Unix.@footnote{The Microsoft Windows implementation uses the Microsoft C QuickWin function, and the Macintosh implementation uses the Metrowerks C SIOUX function.} The authors would appreciate feedback directed to the following addresses. For linguistic questions, contact: @example Gary Simons SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 gary.simons@@sil.org U.S.A. @end example @noindent For programming questions, contact: @example Stephen McConnel (972)708-7361 (office) Language Software Development (972)708-7561 (fax) SIL International 7500 W. Camp Wisdom Road Dallas, TX 75236 steve@@acadcomp.sil.org U.S.A. 
or Stephen_McConnel@@sil.org @end example An online user manual for PC-Kimmo is available on the world wide web at the URL @ifclear html @code{http://www.sil.org/pckimmo/v2/doc/guide.html}. @end ifclear @ifset html http://www.sil.org/pckimmo/v2/doc/guide.html. @end ifset @c ---------------------------------------------------------------------------- @node Two-level formalism, Running PC-Kimmo, Introduction, Top @chapter The Two-level Formalism Two-level phonology is a linguistic tool developed by computational linguists. Its primary use is in systems for natural language processing such as PC-Kimmo. This chapter describes the linguistic and computational basis of two-level phonology.@footnote{This chapter is excerpted from Antworth 1991.} @menu * Roots:: * Rule application:: * How it works:: * Zero:: @end menu @c ---------------------------------------------------------------------------- @node Roots, Rule application, Two-level formalism, Two-level formalism @section Computational and linguistic roots As the fields of computer science and linguistics have grown up together during the past several decades, they have each benefited from cross-fertilization. Modern linguistics has especially been influenced by the formal language theory that underlies computation. The most famous application of formal language theory to linguistics was Chomsky's (1957) transformational generative grammar. Chomsky's strategy was to consider several types of formal languages to see if they were capable of modeling natural language syntax. He started by considering the simplest type of formal languages, called finite state languages. As a general principle, computational linguists try to use the least powerful computational devices possible. This is because the less powerful devices are better understood, their behavior is predictable, and they are computationally more efficient. 
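To make the notion concrete: a finite state language is exactly one that can be accepted by a finite state machine, a device with a fixed set of states and a transition table. The following sketch (the alphabet and transitions are invented for illustration; nothing here is taken from PC-Kimmo) shows such a recognizer in Python:

```python
# A deterministic finite state recognizer, the least powerful class of
# device discussed in this chapter.  This invented machine accepts
# strings over {'a', 'b'} consisting of one or more 'ab' pairs, i.e.
# the finite state language (ab)+.

TRANSITIONS = {
    (0, 'a'): 1,   # start state: expect 'a'
    (1, 'b'): 2,   # after 'a': expect 'b'
    (2, 'a'): 1,   # after a complete 'ab', another pair may follow
}
ACCEPTING = {2}

def accepts(s):
    state = 0
    for symbol in s:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False        # no transition defined: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING

print(accepts("abab"))   # True:  two complete 'ab' pairs
print(accepts("aba"))    # False: ends in the middle of a pair
```

The appeal of such machines is visible even in this toy: recognition is a single left-to-right pass with constant memory, which is what makes finite state phonology computationally attractive.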
Chomsky (1957:18ff) demonstrated that natural language syntax could not be effectively modeled as a finite state language; thus he rejected finite state languages as a theory of syntax and proposed that syntax requires the use of more powerful, non-finite state languages. However, there is no reason to assume that the same should be true for natural language phonology. A finite state model of phonology is especially desirable from the computational point of view, since it makes possible a computational implementation that is simple and efficient. While various linguists proposed that generative phonological rules could be implemented by finite state devices (see Johnson 1972, Kay 1983), the most successful model of finite state phonology was developed by Kimmo Koskenniemi, a Finnish computer scientist. He called his model two-level morphology (Koskenniemi 1983), though his use of the term morphology should be understood to encompass both what linguists would consider morphology proper (the decomposition of words into morphemes) and phonology (at least in the sense of morphophonemics). Our main interest in this article is the phonological formalism used by the two-level model, hereafter called two-level phonology. Two-level phonology traces its linguistic heritage to ``classical'' generative phonology as codified in @cite{The Sound Pattern of English} (Chomsky and Halle 1968). The basic insight of two-level phonology is due to the phonologist C. Douglas Johnson (1972) who showed that the SPE theory of phonology could be implemented using finite state devices by replacing sequential rule application with simultaneous rule application. At its core, then, two-level phonology is a rule formalism, not a complete theory of phonology. The following sections of this article describe the mechanism of two-level rule application by contrasting it with rule application in classical generative phonology. 
It should be noted that Chomsky and Halle's theory of rule application became the focal point of much controversy during the 1970s with the result that current theories of phonology differ significantly from classical generative phonology. The relevance of two-level phonology to current theory is an important issue, but one that will not be fully addressed here. Rather, the comparison of two-level phonology to classical generative phonology is done mainly for expository purposes, recognizing that while classical generative phonology has been superseded by subsequent theoretical work, it constitutes a historically coherent view of phonology that continues to influence current theory and practice. One feature that two-level phonology shares with classical generative phonology is linear representation. That is, phonological forms are represented as linear strings of symbols. This is in contrast to the nonlinear representations used in much current work in phonology, namely autosegmental and metrical phonology (see Goldsmith 1990). On the computational side, two-level phonology is consistent with natural language processing systems that are designed to operate on linear orthographic input. @c ---------------------------------------------------------------------------- @node Rule application, How it works, Roots, Two-level formalism @section Two-level rule application We will begin by reviewing the formal properties of generative rules. Stated succinctly, generative rules are sequentially ordered rewriting rules. What does this mean? First, rewriting rules are rules that change or transform one symbol into another symbol. For example, a rewriting rule of the form @w{@samp{a --> b}} interprets the relationship between the symbols @samp{a} and @samp{b} as a dynamic change whereby the symbol @samp{a} is rewritten or turned into the symbol @samp{b}. 
This means that after this operation takes place, the symbol @samp{a} no longer ``exists,'' in the sense that it is no longer available to other rules. In linguistic theory generative rules are known as process rules. Process rules attempt to characterize the relationship between levels of representation (such as the phonemic and phonetic levels) by specifying how to transform representations from one level into representations on the other level. Second, generative phonological rules apply sequentially, that is, one after another, rather than applying simultaneously. This means that each rule creates as its output a new intermediate level of representation. This intermediate level then serves as the input to the next rule. As a consequence, the underlying form becomes inaccessible to later rules. Third, generative phonological rules are ordered; that is, the description specifies the sequence in which the rules must apply. Applying rules in any other order may result in incorrect output. As an example of a set of generative rules, consider the following rules: @example @group (1) Vowel Raising e --> i / ___C_0 i @end group @group (2) Palatalization t --> c / ___i @end group @end example @noindent Rule 1 (Vowel Raising) states that @samp{e} becomes (is rewritten as) @samp{i} in the environment preceding @samp{Ci} (where @samp{C} stands for the set of consonants and @samp{C_0} stands for zero or more consonants). Rule 2 (Palatalization) states that @samp{t} becomes @samp{c} preceding @samp{i}. A sample derivation of forms to which these rules apply looks like this (where UR stands for Underlying Representation, SR stands for Surface Representation):@footnote{This made-up example is used for expository purposes. To make better phonological sense, the forms should have internal morpheme boundaries, for instance @samp{te+mi} (otherwise there would be no basis for positing an underlying @samp{e}). 
See the section below on the use of zero to see how morpheme boundaries are handled.} @example @group UR: temi (1) timi (2) cimi SR: cimi @end group @end example @noindent Notice that in addition to the underlying and surface levels, an intermediate level has been created as the result of sequentially applying rules 1 and 2. The application of rule 1 produces the intermediate form @samp{timi}, which then serves as the input to rule 2. Not only are these rules sequential, they are ordered, such that rule 1 must apply before rule 2. Rule 1 has a feeding relationship to rule 2; that is, rule 1 increases the number of forms that can undergo rule 2 by creating more instances of @samp{i}. Consider what would happen if they were applied in the reverse order. Given the input form @samp{temi}, rule 2 would do nothing, since its environment is not satisfied. Rule 1 would then apply to produce the incorrect surface form @samp{timi}. Two-level rules differ from generative rules in the following ways. First, whereas generative rules apply in a sequential order, two-level rules apply simultaneously, which is better described as applying in parallel. Applying rules in parallel to an input form means that for each segment in the form all of the rules must apply successfully, even if only vacuously. Second, whereas sequentially applied generative rules create intermediate levels of derivation, simultaneously applied two-level rules require only two levels of representation: the underlying or lexical level and the surface level. There are no intermediate levels of derivation. It is in this sense that the model is called two-level. Third, whereas generative rules relate the underlying and surface levels by rewriting underlying symbols as surface symbols, two-level rules express the relationship between the underlying and surface levels by positing direct, static correspondences between pairs of underlying and surface symbols. 
For instance, instead of rewriting underlying @samp{a} as surface @samp{b}, a two-level rule states that an underlying @samp{a} corresponds to a surface @samp{b}. The two-level rule does not change @samp{a} into @samp{b}, so @samp{a} is available to other rules. In other words, after a two-level rule applies, both the underlying and surface symbols still ``exist.'' Fourth, whereas generative rules have access only to the current intermediate form at each stage of the derivation, two-level rules have access to both underlying and surface environments. Generative rules cannot ``look back'' at underlying environments or ``look ahead'' to surface environments. In contrast, the environments of two-level rules are stated as lexical-to-surface correspondences. This means that a two-level rule can easily refer to an underlying @samp{a} that corresponds to a surface @samp{b}, or to a surface @samp{b} that corresponds to an underlying @samp{a}. In generative phonology, the interaction between a pair of rules is controlled by requiring that they apply in a certain sequential order. In two-level phonology, rule interactions are controlled not by ordering the rules but by carefully specifying their environments as strings of two-level correspondences. Fifth, whereas generative, rewriting rules are unidirectional (that is, they operate only in an underlying to surface direction), two-level rules are bidirectional. Two-level rules can operate either in an underlying to surface direction (generation mode) or in a surface to underlying direction (recognition mode). Thus in generation mode two-level rules accept an underlying form as input and return a surface form, while in recognition mode they accept a surface form as input and return an underlying form. 
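This bidirectionality can be sketched in a few lines of Python using the feasible pairs of the Raising and Palatalization example (@samp{t:t}, @samp{m:m}, @samp{e:e}, @samp{i:i}, @samp{t:c}, @samp{e:i}). The function names are invented for illustration, and rule environments are deliberately omitted, so the sketch enumerates every candidate licensed by the pairs alone:

```python
from itertools import product

# Feasible pairs: default identity correspondences plus the special
# correspondences t:c and e:i.  The same static table is consulted in
# both directions; nothing is "reversed" or "undone".
FEASIBLE = {('t', 't'), ('m', 'm'), ('e', 'e'), ('i', 'i'),
            ('t', 'c'), ('e', 'i')}

def generate(lexical):
    """Generation mode: lexical form in, surface candidates out."""
    choices = [[s for (l, s) in FEASIBLE if l == lex] for lex in lexical]
    return {''.join(p) for p in product(*choices)}

def recognize(surface):
    """Recognition mode: surface form in, lexical candidates out."""
    choices = [[l for (l, s) in FEASIBLE if s == surf] for surf in surface]
    return {''.join(p) for p in product(*choices)}

print(sorted(generate('temi')))   # ['cemi', 'cimi', 'temi', 'timi']
print(sorted(recognize('cimi')))  # ['teme', 'temi', 'time', 'timi']
```

With no environment constraints the table overgenerates in both directions; restricting those candidates to the correct ones is precisely the role the rules play.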
The practical application of bidirectional phonological rules is obvious: a computational implementation of bidirectional rules is not limited to generation mode to produce words; it can also be used in the recognition direction to parse words.
@c ----------------------------------------------------------------------------
@node How it works, Zero, Rule application, Two-level formalism
@section How a two-level description works
To understand how a two-level phonological description works, we will use the example given above involving Raising and Palatalization. The two-level model treats the relationship between the underlying form @samp{temi} and the surface form @samp{cimi} as a direct, symbol-to-symbol correspondence:
@example
@group
UR: t e m i
SR: c i m i
@end group
@end example
@noindent
Each pair of lexical and surface symbols is a correspondence pair. We refer to a correspondence pair by joining the lexical and surface symbols with a colon (@samp{:}), for instance @samp{e:i} and @samp{m:m}. There must be an exact one-to-one correspondence between the symbols of the underlying form and the symbols of the surface form. Deletion and insertion of symbols (explained in detail in the next section) are handled by positing correspondences with zero, a null segment. The two-level model uses a notation for expressing two-level rules that is similar to the notation linguists use for phonological rules. Corresponding to the generative rule for Palatalization (rule 2 above), here is the two-level rule for the @samp{t:c} correspondence:
@example
@group
(3) Palatalization
    t:c <=> ___ @@:i
@end group
@end example
This rule is a statement about the distribution of the pair @samp{t:c} on the left side of the arrow with respect to the context or environment on the right side of the arrow. A two-level rule has three parts: the correspondence, the operator, and the environment. The correspondence part of rule 3 is the pair @samp{t:c}, which is the correspondence that the rule sanctions.
The operator part of rule 3 is the double-headed arrow. It indicates the nature of the logical relationship between the correspondence and the environment (thus it means something very different from the rewriting arrow @samp{-->} of generative phonology). The @samp{<=>} arrow is equivalent to the biconditional operator of formal logic and means that the correspondence occurs always and only in the stated context; that is, @samp{t:c} is allowed if and only if it is found in the context @samp{___@@:i}. In short, rule 3 is an obligatory rule. The environment part of rule 3 is everything to the right of the arrow. The long underline indicates the gap where the pair @samp{t:c} occurs. Notice that even the environment part of the rule is specified as two-level correspondence pairs. The environment part of rule 3 requires further explanation. Instead of using a correspondence such as @samp{i:i}, it uses the correspondence @samp{@@:i}. The @samp{@@} symbol is a special ``wildcard'' symbol that stands for any phonological segment included in the description. In the context of rule 3, the correspondence @samp{@@:i} stands for all the feasible pairs in the description whose surface segment is @samp{i}, in this case @samp{e:i} and @samp{i:i}. Thus by using the correspondence @samp{@@:i}, we allow Palatalization to apply in the environment of either a lexical @samp{e} or a lexical @samp{i}. In other words, we are claiming that Palatalization is sensitive to a surface (phonetic) environment rather than an underlying (phonemic) environment. Thus rule 3 will apply to both underlying forms @samp{timi} and @samp{temi} to produce a surface form with an initial @samp{c}.
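The biconditional (``always and only'') force of rule 3 can be sketched as a constraint checker over pre-aligned lexical and surface strings. This toy Python function is an illustration of the rule's logic, not PC-Kimmo's actual finite state implementation: it rejects an alignment if @samp{t:c} occurs outside the context, or if @samp{t:t} occurs inside it.

```python
# Rule 3 (Palatalization), t:c <=> ___ @:i, as a two-level constraint:
# the pair t:c is allowed if and only if the next pair has surface 'i'.
# A toy checker over pre-aligned forms of equal length.

def palatalization_ok(lexical, surface):
    assert len(lexical) == len(surface)    # two-level alignment
    for k, (lex, surf) in enumerate(zip(lexical, surface)):
        before_i = k + 1 < len(surface) and surface[k + 1] == 'i'
        if (lex, surf) == ('t', 'c') and not before_i:
            return False   # "only": t:c outside the context is blocked
        if (lex, surf) == ('t', 't') and before_i:
            return False   # "always": t must be realized as c here
    return True

print(palatalization_ok('temi', 'cimi'))  # True:  t:c before surface i
print(palatalization_ok('temi', 'timi'))  # False: surface 'ti' prohibited
print(palatalization_ok('temi', 'cemi'))  # False: t:c not before surface i
```

Note that the context test looks only at the *surface* side of the following pair, which is how the wildcard correspondence @samp{@@:i} lets the rule apply regardless of whether that surface @samp{i} comes from a lexical @samp{e} or @samp{i}.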
Corresponding to the generative rule for Raising (rule 1 above) is the following two-level rule for the @samp{e:i} correspondence:
@example
@group
(4) Vowel Raising
    e:i <=> ___ C:C* @@:i
@end group
@end example
@noindent
(The asterisk in @samp{C:C*} indicates zero or more instances of the correspondence @samp{C:C}.) Similar to rule 3 above, rule 4 uses the correspondence @samp{@@:i} in its environment. Thus rule 4 states that the correspondence @samp{e:i} occurs preceding a surface @samp{i}, regardless of whether it is derived from a lexical @samp{e} or @samp{i}. Why is this necessary? Consider the case of an underlying form such as @samp{pememi}. In order to derive the surface form @samp{pimimi}, Raising must apply twice: once before a lexical @samp{i} and again before a lexical @samp{e}, both of which correspond to a surface @samp{i}. Thus rule 4 will apply to both instances of lexical @samp{e}, capturing the regressive spreading of Raising through the word. By applying rules 3 and 4 in parallel, they work in concert to produce the right output. For example,
@example
@group
UR:     t e m i
        | | | |
Rules   3 4 | |
        | | | |
SR:     c i m i
@end group
@end example
@noindent
Conceptually, a two-level phonological description of a data set such as this can be understood as follows. First, the two-level description declares an alphabet of all the phonological segments used in the data in both underlying and surface forms, in the case of our example, @samp{t}, @samp{m}, @samp{c}, @samp{e}, and @samp{i}. Second, the description declares a set of feasible pairs, which is the complete set of all underlying-to-surface correspondences of segments that occur in the data.
The set of feasible pairs for these data is the union of the set of default correspondences, whose underlying and surface segments are identical (namely @samp{t:t}, @samp{m:m}, @samp{e:e}, and @samp{i:i}) and the set of special correspondences, whose underlying and surface segments are different (namely @samp{t:c} and @samp{e:i}). Notice that since the segment @samp{c} only occurs as a surface segment in the feasible pairs, the description will disallow any underlying form that contains a @samp{c}. A minimal two-level description, then, consists of nothing more than this declaration of the feasible pairs. Since it contains all possible underlying-to-surface correspondences, such a description will produce the correct output form, but because it does not constrain the environments where the special correspondences can occur, it will also allow many incorrect output forms. For example, given the underlying form @samp{temi}, it will produce the surface forms @samp{temi}, @samp{timi}, @samp{cemi}, and @samp{cimi}, of which only the last is correct. Third, in order to restrict the output to only correct forms, we include rules in the description that specify where the special correspondences are allowed to occur. Thus the rules function as constraints or filters, blocking incorrect forms while allowing correct forms to pass through. For instance, rule 3 (Palatalization) states that a lexical @samp{t} must be realized as a surface @samp{c} when it precedes @samp{@@:i}; thus, given the underlying form @samp{temi} it will block the potential surface output forms @samp{timi} (because the surface sequence @samp{ti} is prohibited) and @samp{cemi} (because surface @samp{c} is prohibited before anything except surface @samp{i}). 
Rule 4 (Raising) states that a lexical @samp{e} must be realized as a surface @samp{i} when it precedes the sequence @samp{C:C} @samp{@@:i}; thus, given the underlying form @samp{temi} it will block the potential surface output forms @samp{temi} and @samp{cemi} (because the surface sequence @samp{emi} is prohibited). Therefore of the four potential surface forms, three are filtered out; rules 3 and 4 leave only the correct form @samp{cimi}. Two-level phonology facilitates a rather different way of thinking about phonological rules. We think of generative rules as processes that change one segment into another. In contrast, two-level rules do not perform operations on segments, rather they state static constraints on correspondences between underlying and surface forms. Generative phonology and two-level phonology also differ in how they characterize relationships between rules. Rules in generative phonology are described in terms of their relative order of application and their effect on the input of other rules (the so-called feeding and bleeding relations). Thus the generative rule 1 for Raising precedes and feeds rule 2 for Palatalization. In contrast, rules in the two-level model are categorized according to whether they apply in lexical versus surface environments. So we say that the two-level rules for Raising and Palatalization are sensitive to a surface rather than underlying environment. @c ---------------------------------------------------------------------------- @node Zero, , How it works, Two-level formalism @section With zero you can do (almost) anything Phonological processes that delete or insert segments pose a special challenge to two-level phonology. Since an underlying form and its surface form must correspond segment for segment, how can segments be deleted from an underlying form or inserted into a surface form? The answer lies in the use of the special null symbol @samp{0} (zero). 
Thus the correspondence @samp{x:0} represents the deletion of @samp{x}, while @samp{0:x} represents the insertion of @samp{x}. (It should be understood that these zeros are provided by the rule application mechanism and exist only internally; that is, zeros are neither included in input forms nor printed in output forms.) As an example of deletion, consider these forms from Tagalog (where @samp{+} represents a morpheme boundary):
@example
@group
UR: m a n + b i l i
SR: m a m 0 0 i l i
@end group
@end example
@noindent
Using process terminology, these forms exemplify phonological coalescence, whereby the sequence @samp{nb} becomes @samp{m}. Since in the two-level model a sequence of two underlying segments cannot correspond to a single surface segment, coalescence must be interpreted as simultaneous assimilation and deletion. Thus we need two rules: an assimilation rule for the correspondence @samp{n:m} and a deletion rule for the correspondence @samp{b:0} (note that the morpheme boundary @samp{+} is treated as a special symbol that is always deleted).
@example
@group
(5) Nasal Assimilation
    n:m <=> ___ +:0 b:@@
@end group
@group
(6) Deletion
    b:0 <=> @@:m +:0 ___
@end group
@end example
@noindent
Notice the interaction between the rules: Nasal Assimilation occurs in a lexical environment, namely a lexical @samp{b} (which can correspond to either a surface @samp{b} or @samp{0}), while Deletion occurs in a surface environment, namely a surface @samp{m} (which could be the realization of either a lexical @samp{n} or @samp{m}). In this way the two rules interact with each other to produce the correct output. Insertion correspondences, where the lexical segment is @samp{0}, enable one to write rules for processes such as stress insertion, gemination, infixation, and reduplication. For example, Tagalog has a verbalizing infix @samp{um} that attaches between the first consonant and vowel of a stem; thus the infixed form of @samp{bili} is @samp{bumili}.
To account for this formation with two-level rules, we represent the underlying form of the infix @samp{um} as the prefix @samp{X+}, where @samp{X} is a special symbol that has no phonological purpose other than standing for the infix. We then write a rule that inserts the sequence @samp{um} in the presence of @samp{X+}, which is deleted. Here is the two-level correspondence: @example @group UR: X + b 0 0 i l i SR: 0 0 b u m i l i @end group @end example @noindent and here is the two-level rule, which simultaneously deletes @samp{X} and inserts @samp{um}: @example @group (7) Infixation X:0 <=> ___ +:0 C:C 0:u 0:m V:V @end group @end example @noindent These examples involving deletion and insertion show that the invention of zero is just as important for phonology as it was for arithmetic. Without zero, two-level phonology would be limited to the most trivial phonological processes; with zero, the two-level model has the expressive power to handle complex phonological or morphological phenomena (though not necessarily with the degree of felicity that a linguist might desire). @c ---------------------------------------------------------------------------- @node Running PC-Kimmo, Rules file, Two-level formalism, Top @chapter Running PC-Kimmo PC-Kimmo is an interactive program. It has a few command line options, but it is controlled primarily by commands typed at the keyboard (or loaded from a file previously prepared). @menu * Command line options:: * Interactive commands:: @end menu @c ---------------------------------------------------------------------------- @node Command line options, Interactive commands, Running PC-Kimmo, Running PC-Kimmo @section PC-Kimmo Command Line Options The PC-Kimmo program uses an old-fashioned command line interface following the convention of options starting with a dash character (@samp{-}). The available options are listed below in alphabetical order. 
Those options which require an argument have the argument type following the option letter. @ftable @code @item -g filename loads the grammar from a PC-Kimmo grammar file. @item -l filename loads an analysis lexicon from a PC-Kimmo lexicon file. @item -r filename loads the two-level rules from a PC-Kimmo rules file. @item -s filename loads a synthesis lexicon from a PC-Kimmo lexicon file. @item -t filename opens a file containing one or more PC-Kimmo commands. @ifset txt See `Interactive Commands' below. @end ifset @ifclear txt @xref{Interactive commands}. @end ifclear @end ftable The following options exist only in beta-test versions of the program, since they are used only for debugging. @ftable @code @item -/ increments the debugging level. The default is zero (no debugging output). @item -z filename opens a file for recording a memory allocation log. @item -Z address,count traps the program at the point where @code{address} is allocated or freed for the @code{count}'th time. @end ftable @c ---------------------------------------------------------------------------- @node Interactive commands, , Command line options, Running PC-Kimmo @section Interactive Commands Each of the commands available in PC-Kimmo is described below. Each command consists of one or more keywords followed by zero or more arguments. Keywords may be abbreviated to the minimum length necessary to prevent ambiguity. @menu * cd:: * clear:: * close:: * compare:: * directory:: * edit:: * exit:: * file:: * generate:: * help:: * list:: * load:: * log:: * quit:: * recognize:: * save:: * set:: * show:: * status:: * synthesize:: * system:: * take:: @end menu @c ---------------------------------------------------------------------------- @node cd, clear, Interactive commands, Interactive commands @subsection cd @w{@code{cd} @var{directory}} changes the current directory to the one specified. Spaces in the directory pathname are not permitted. 
For MS-DOS or Windows, you can give a full path starting with the disk letter and a colon (for example, @code{a:}); a path starting with @code{\} which indicates a directory at the top level of the current disk; a path starting with @code{..} which indicates the directory above the current one; and so on. Directories are separated by the @code{\} character. (The forward slash @code{/} works just as well as the backslash @code{\} for MS-DOS or Windows.) For the Macintosh, you can give a full path starting with the name of a hard disk, a path starting with @code{:} which means the current folder, or one starting @code{::} which means the folder containing the current one (and so on). For Unix, you can give a full path starting with a @code{/} (for example, @code{/usr/pckimmo}); a path starting with @code{..} which indicates the directory above the current one; and so on. Directories are separated by the @code{/} character. @c ---------------------------------------------------------------------------- @node clear, close, cd, Interactive commands @subsection clear @code{clear} erases all existing rules, lexicon, and grammar information, allowing the user to prepare to load information for a new language. Strictly speaking, it is not needed since the @w{@code{load rules}} command erases any previously existing rules, the @w{@code{load lexicon}} command erases any previously existing analysis lexicon, the @w{@code{load synthesis-lexicon}} command erases any previously existing synthesis lexicon, and the @w{@code{load grammar}} command erases any previously existing grammar. @code{cle} is the minimal abbreviation for @code{clear}. @c ---------------------------------------------------------------------------- @node close, compare, clear, Interactive commands @subsection close @code{close} closes the current log file opened by a previous @code{log} command. @code{clo} is the minimal abbreviation for @code{close}. 
@c ----------------------------------------------------------------------------
@node compare, directory, close, Interactive commands
@subsection compare
The @code{compare} commands all test the current language description files by processing data against known (precomputed) results. @w{@code{co}} is the minimal abbreviation for @code{compare}. @w{@code{file compare}} is a synonym for @code{compare}.
@menu
* compare generate::
* compare pairs::
* compare recognize::
* compare synthesize::
@end menu
@c ----------------------------------------------------------------------------
@node compare generate, compare pairs, compare, compare
@subsubsection compare generate
@w{@code{compare generate} @var{filename}} reads lexical and surface forms from the specified file. After reading a lexical form, PC-Kimmo generates the corresponding surface form(s) and compares the result to the surface form(s) read from the file. If @code{VERBOSE} is @code{ON}, then each form from the file is echoed on the screen with a message indicating whether or not the surface forms generated by PC-Kimmo and read from the file are in agreement. If @code{VERBOSE} is @code{OFF}, then only the disagreements in surface form are displayed fully. Each result which agrees is indicated by a single dot written to the screen. The default filetype extension for @w{@code{compare generate}} is @file{.gen}, and the default filename is @file{data.gen}. @w{@code{co g}} is the minimal abbreviation for @w{@code{compare generate}}. @w{@code{file compare generate}} is a synonym for @w{@code{compare generate}}.
@c ----------------------------------------------------------------------------
@node compare pairs, compare recognize, compare generate, compare
@subsubsection compare pairs
@w{@code{compare pairs} @var{filename}} reads pairs of surface and lexical forms from the specified file. After reading a lexical form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form read from the file.
For each surface form, PC-Kimmo also produces any corresponding lexical form(s) and compares the result to the lexical form read from the file. If @code{VERBOSE} is @code{ON}, then each form from the file is echoed on the screen with a message indicating whether or not the forms produced by PC-Kimmo and read from the file are in agreement. If @code{VERBOSE} is @code{OFF}, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully.

The default filetype extension for @code{compare pairs} is @file{.pai}, and the default filename is @file{data.pai}.

@w{@code{co p}} is the minimal abbreviation for @w{@code{compare pairs}}.
@w{@code{file compare pairs}} is a synonym for @w{@code{compare pairs}}.

@c ----------------------------------------------------------------------------
@node compare recognize, compare synthesize, compare pairs, compare
@subsubsection compare recognize

@w{@code{compare recognize} @var{filename}} reads surface and lexical forms from the specified file. After reading a surface form, PC-Kimmo produces any corresponding lexical form(s) and compares the result(s) to the lexical form(s) read from the file. If @code{VERBOSE} is @code{ON}, then each form from the file is echoed on the screen with a message indicating whether or not the lexical forms produced by PC-Kimmo and read from the file are in agreement. If @code{VERBOSE} is @code{OFF}, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully.

The default filetype extension for @code{compare recognize} is @file{.rec}, and the default filename is @file{data.rec}.

@w{@code{co r}} is the minimal abbreviation for @w{@code{compare recognize}}.
@w{@code{file compare recognize}} is a synonym for @w{@code{compare recognize}}.
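As an illustration, a @code{compare recognize} run over a test file might look like the following session sketch. The prompt, filename, and output layout shown here are illustrative only and are not taken from an actual PC-Kimmo session; with @code{VERBOSE} off, each dot stands for one form whose recognized result agreed with the file, while a disagreement would be displayed fully.

@example
PC-KIMMO>set verbose off
PC-KIMMO>compare recognize english.rec
.....
PC-KIMMO>
@end example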
@c ----------------------------------------------------------------------------
@node compare synthesize, , compare recognize, compare
@subsubsection compare synthesize

@w{@code{compare synthesize} @var{filename}} reads morphological and surface forms from the specified file. After reading a morphological form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form(s) read from the file. If @code{VERBOSE} is @code{ON}, then each form from the file is echoed on the screen with a message indicating whether or not the surface forms produced by PC-Kimmo and read from the file are in agreement. If @code{VERBOSE} is @code{OFF}, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in surface forms are displayed fully.

The default filetype extension for @code{compare synthesize} is @file{.syn}, and the default filename is @file{data.syn}.

@w{@code{co s}} is the minimal abbreviation for @w{@code{compare synthesize}}.
@w{@code{file compare synthesize}} is a synonym for @w{@code{compare synthesize}}.

@c ----------------------------------------------------------------------------
@node directory, edit, compare, Interactive commands
@subsection directory

@code{directory} lists the contents of the current directory. This command is available only in the MS-DOS and Unix implementations; it does not exist in the Microsoft Windows or Macintosh implementations.

@c ----------------------------------------------------------------------------
@node edit, exit, directory, Interactive commands
@subsection edit

@w{@code{edit} @var{filename}} attempts to edit the specified file using the program indicated by the environment variable @code{EDITOR}. If this environment variable is not defined, then @code{edit} is used to edit the file on MS-DOS, and @code{emacs} is used to edit the file on Unix. This command is not available in the Microsoft Windows or Macintosh implementations.
@c ----------------------------------------------------------------------------
@node exit, file, edit, Interactive commands
@subsection exit

@code{exit} stops PC-Kimmo, returning control to the operating system. This is the same as @code{quit}.

@c ----------------------------------------------------------------------------
@node file, generate, exit, Interactive commands
@subsection file

The @code{file} commands process data from a file, optionally writing the results to another file. Each of these commands is described below.

@menu
* file compare::
* file generate::
* file recognize::
* file synthesize::
@end menu

@c ----------------------------------------------------------------------------
@node file compare, file generate, file, file
@subsubsection file compare

The @code{file compare} commands all test the current language description files by processing data against known (precomputed) results.

@w{@code{f c}} is the minimal abbreviation for @w{@code{file compare}}.
@w{@code{file compare}} is a synonym for @code{compare}.
@ifset txt
See `compare generate', `compare pairs', `compare recognize', and `compare synthesize' above.
@end ifset
@ifclear txt
@xref{compare generate}, @ref{compare pairs}, @ref{compare recognize}, and @ref{compare synthesize}.
@end ifclear

@menu
* compare generate:: is the same as file compare generate
* compare pairs:: is the same as file compare pairs
* compare recognize:: is the same as file compare recognize
* compare synthesize:: is the same as file compare synthesize
@end menu

@c ----------------------------------------------------------------------------
@node file generate, file recognize, file compare, file
@subsubsection file generate

@w{@code{file generate} @var{input-file} [@var{output-file}]} reads lexical forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file.
This command behaves the same as @code{generate} except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen.
@ifset txt
See `generate' below.
@end ifset
@ifclear txt
@xref{generate}.
@end ifclear

@w{@code{f g}} is the minimal abbreviation for @w{@code{file generate}}.

@c ----------------------------------------------------------------------------
@node file recognize, file synthesize, file generate, file
@subsubsection file recognize

@w{@code{file recognize} @var{input-file} [@var{output-file}]} reads surface forms from the specified input file and writes the corresponding computed morphological and lexical forms either to the screen or to an optionally specified output file. This command behaves the same as @code{recognize} except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen.
@ifset txt
See `recognize' below.
@end ifset
@ifclear txt
@xref{recognize}.
@end ifclear

@w{@code{f r}} is the minimal abbreviation for @w{@code{file recognize}}.

@c ----------------------------------------------------------------------------
@node file synthesize, , file recognize, file
@subsubsection file synthesize

@w{@code{file synthesize} @var{input-file} [@var{output-file}]} reads morphological forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file. This command behaves the same as @code{synthesize} except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen.
@ifset txt
See `synthesize' below.
@end ifset
@ifclear txt
@xref{synthesize}.
@end ifclear

@w{@code{f s}} is the minimal abbreviation for @w{@code{file synthesize}}.

@c ----------------------------------------------------------------------------
@node generate, help, file, Interactive commands
@subsection generate

@w{@code{generate} @var{[lexical-form]}} attempts to produce a surface form from a lexical form provided by the user.
If a lexical form is typed on the same line as the command, then that lexical form is used to generate a surface form. If the command is typed without a form, then PC-Kimmo prompts the user for lexical forms with a special generator prompt, and processes each form in turn. This cycle of typing and generating is terminated by typing an empty ``form'' (that is, nothing but the @code{Enter} or @code{Return} key).

The rules must be loaded before using this command. It does not require either a lexicon or a grammar.

@code{g} is the minimal abbreviation for @code{generate}.

@c ----------------------------------------------------------------------------
@node help, list, generate, Interactive commands
@subsection help

@w{@code{help} @var{[command]}} displays a description of the specified command. If @code{help} is typed by itself, PC-Kimmo displays a list of commands with short descriptions of each command.

@code{h} is the minimal abbreviation for @code{help}.

@c ----------------------------------------------------------------------------
@node list, load, help, Interactive commands
@subsection list

The @code{list} commands all display information about the currently loaded data. Each of these commands is described below.

@code{li} is the minimal abbreviation for @code{list}.

@menu
* list lexicon::
* list pairs::
* list rules::
@end menu

@c ----------------------------------------------------------------------------
@node list lexicon, list pairs, list, list
@subsubsection list lexicon

@w{@code{list lexicon}} displays the names of all the (sub)lexicons currently loaded. The order of presentation is the order in which they are referenced in the @code{ALTERNATIONS} declarations.

@w{@code{li l}} is the minimal abbreviation for @w{@code{list lexicon}}.
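For example, if the loaded lexicon's @code{ALTERNATIONS} declarations reference sublexicons named @code{PREFIX}, @code{ROOT}, and @code{SUFFIX} (hypothetical names chosen only for illustration; the exact display layout may differ), the command would list those names in that order:

@example
PC-KIMMO>list lexicon
PREFIX ROOT SUFFIX
@end example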
@c ----------------------------------------------------------------------------
@node list pairs, list rules, list lexicon, list
@subsubsection list pairs

@w{@code{list pairs}} displays all the feasible pairs for the current set of active rules. The feasible pairs are displayed as pairs of lines, with the lexical characters shown above the corresponding surface characters.

@w{@code{li p}} is the minimal abbreviation for @w{@code{list pairs}}.

@c ----------------------------------------------------------------------------
@node list rules, , list pairs, list
@subsubsection list rules

@w{@code{list rules}} displays the names of the current rules, preceded by the number of the rule (used by the @w{@code{set rules}} command) and an indication of whether the rule is @code{ON} or @code{OFF}.

@w{@code{li r}} is the minimal abbreviation for @w{@code{list rules}}.

@c ----------------------------------------------------------------------------
@node load, log, list, Interactive commands
@subsection load

The @code{load} commands all load information stored in specially formatted files. Each of the @code{load} commands is described below.

@code{l} is the minimal abbreviation for @code{load}.

@menu
* load grammar::
* load lexicon::
* load rules::
* load synthesis-lexicon::
@end menu

@c ----------------------------------------------------------------------------
@node load grammar, load lexicon, load, load
@subsubsection load grammar

@w{@code{load grammar} @var{[filename]}} erases any existing word grammar and reads a new word grammar from the specified file. The default filetype extension for @w{@code{load grammar}} is @file{.grm}, and the default filename is @file{grammar.grm}.

A grammar file can also be loaded by using the @samp{-g} command line option when starting PC-Kimmo.

@w{@code{l g}} is the minimal abbreviation for @w{@code{load grammar}}.
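As an illustration, the following session loads a complete language description by hand; the filenames are hypothetical. Note that the rules must be loaded before the lexicon, as described under @code{load lexicon} below.

@example
PC-KIMMO>load rules english.rul
PC-KIMMO>load lexicon english.lex
PC-KIMMO>load grammar english.grm
@end example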
@c ----------------------------------------------------------------------------
@node load lexicon, load rules, load grammar, load
@subsubsection load lexicon

@w{@code{load lexicon} @var{[filename]}} erases any existing analysis lexicon information and reads a new analysis lexicon from the specified file. A rules file must be loaded before an analysis lexicon file can be loaded. The default filetype extension for @w{@code{load lexicon}} is @file{.lex}, and the default filename is @file{lexicon.lex}.

An analysis lexicon file can also be loaded by using the @samp{-l} command line option when starting PC-Kimmo. This requires that a @samp{-r} option also be used to load a rules file.

@w{@code{l l}} is the minimal abbreviation for @w{@code{load lexicon}}.

@c ----------------------------------------------------------------------------
@node load rules, load synthesis-lexicon, load lexicon, load
@subsubsection load rules

@w{@code{load rules} @var{[filename]}} erases any existing rules and reads a new set of two-level rules from the specified file. The default filetype extension for @w{@code{load rules}} is @file{.rul}, and the default filename is @file{rules.rul}.

A rules file can also be loaded by using the @samp{-r} command line option when starting PC-Kimmo.

@w{@code{l r}} is the minimal abbreviation for @w{@code{load rules}}.

@c ----------------------------------------------------------------------------
@node load synthesis-lexicon, , load rules, load
@subsubsection load synthesis-lexicon

@w{@code{load synthesis-lexicon} @var{[filename]}} erases any existing synthesis lexicon and reads a new synthesis lexicon from the specified file. A rules file must be loaded before a synthesis lexicon file can be loaded. The default filetype extension for @w{@code{load synthesis-lexicon}} is @file{.lex}, and the default filename is @file{lexicon.lex}.

A synthesis lexicon file can also be loaded by using the @samp{-s} command line option when starting PC-Kimmo.
This requires that a @samp{-r} option also be used to load a rules file.

@w{@code{l s}} is the minimal abbreviation for @w{@code{load synthesis-lexicon}}.

@c ----------------------------------------------------------------------------
@node log, quit, load, Interactive commands
@subsection log

@w{@code{log} @var{[filename]}} opens a log file. Each item processed by a @code{generate}, @code{recognize}, @code{synthesize}, @code{compare}, or @code{file} command is recorded in the log file as well as being displayed on the screen. If a filename is given on the same line as the @code{log} command, then that file is used for the log file. Any previously existing file with the same name will be overwritten. If no filename is provided, then the file @file{pckimmo.log} in the current directory is used for the log file.

Use @code{close} to stop recording in a log file. If a @code{log} command is given when a log file is already open, then the earlier log file is closed before the new log file is opened.

@c ----------------------------------------------------------------------------
@node quit, recognize, log, Interactive commands
@subsection quit

@code{quit} stops PC-Kimmo, returning control to the operating system. This is the same as @code{exit}.

@c ----------------------------------------------------------------------------
@node recognize, save, quit, Interactive commands
@subsection recognize

@w{@code{recognize} @var{[word]}} attempts to produce lexical and morphological forms from a surface wordform provided by the user. If a wordform is typed on the same line as the command, then that word is parsed. If the command is typed without a form, then PC-Kimmo prompts the user for surface forms with a special recognizer prompt, and processes each form in turn. This cycle of typing and parsing is terminated by typing an empty ``word'' (that is, nothing but the @code{Enter} or @code{Return} key).

Both the rules and the lexicon must be loaded before using this command.
A grammar may also be loaded and used to eliminate invalid parses from the two-level processor results. If a grammar is used, then parse trees and feature structures may be displayed as well as the lexical and morphological forms. @c ---------------------------------------------------------------------------- @node save, set, recognize, Interactive commands @subsection save @w{@code{save} @var{[file.tak]}} writes the current settings to the designated file in the form of PC-Kimmo commands. If the file is not specified, the settings are written to @code{pckimmo.tak} in the current directory. @c ---------------------------------------------------------------------------- @node set, show, save, Interactive commands @subsection set The @code{set} commands control program behavior by setting internal program variables. Each of these commands (and variables) is described below. @menu * set ambiguities:: * set ample-dictionary:: * set check-cycles:: * set comment:: * set failures:: * set features:: * set gloss:: * set marker category:: * set marker features:: * set marker gloss:: * set marker record:: * set marker word:: * set timing:: * set top-down-filter:: * set tree:: * set trim-empty-features:: * set unification:: * set verbose:: * set warnings:: * set write-ample-parses:: @end menu @c ---------------------------------------------------------------------------- @node set ambiguities, set ample-dictionary, set, set @subsubsection set ambiguities @w{@code{set ambiguities} @var{number}} limits the number of analyses printed to the given number. The default value is 10. Note that this does not limit the number of analyses produced, just the number printed. @c ---------------------------------------------------------------------------- @node set ample-dictionary, set check-cycles, set ambiguities, set @subsubsection set ample-dictionary @w{@code{set ample-dictionary} @var{value}} determines whether or not the AMPLE dictionary files are divided according to morpheme type. 
@w{@code{set ample-dictionary split}} declares that the AMPLE dictionary is divided into a prefix dictionary file, an infix dictionary file, a suffix dictionary file, and one or more root dictionary files. The existence of the three affix dictionaries depends on settings in the AMPLE analysis data file. If they exist, the @w{@code{load ample dictionary}} command requires that they be given in this relative order: prefix, infix, suffix, root(s).

@w{@code{set ample-dictionary unified}} declares that any of the AMPLE dictionary files may contain any type of morpheme. This implies that each dictionary entry may contain a field specifying the type of morpheme (the default is @var{root}), and that the dictionary code table contains a @code{\unified} field. One of the changes listed under @code{\unified} must convert a backslash code to @code{T}.

The default is for the AMPLE dictionary to be @emph{split}.@footnote{The unified dictionary is a new feature of AMPLE version 3.}

@c ----------------------------------------------------------------------------
@node set check-cycles, set comment, set ample-dictionary, set
@subsubsection set check-cycles

@w{@code{set check-cycles} @var{value}} enables or disables a check to prevent cycles in the parse chart. @w{@code{set check-cycles on}} turns on this check, and @w{@code{set check-cycles off}} turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars.

The default setting is @code{on}.

@c ----------------------------------------------------------------------------
@node set comment, set failures, set check-cycles, set
@subsubsection set comment

@w{@code{set comment} @var{character}} sets the comment character to the indicated value. If @var{character} is missing (or equal to the current comment character), then comment handling is disabled.

The default comment character is @code{;} (semicolon).
@c ----------------------------------------------------------------------------
@node set failures, set features, set comment, set
@subsubsection set failures

@w{@code{set failures} @var{value}} enables or disables @emph{grammar failure mode}. @w{@code{set failures on}} turns on grammar failure mode, and @w{@code{set failures off}} turns it off. When grammar failure mode is on, the partial results of forms that fail the grammar module are displayed. A form may fail the grammar either by failing the feature constraints or by failing the constituent structure rules. In the latter case, a partial tree (bush) will be returned.

The default setting is @code{off}.

Be careful with this option. Setting failures to @code{on} can cause PC-Kimmo to go into an infinite loop for certain recursive grammars and certain input sentences. @sc{we may try to do something to detect this type of behavior, at least partially.}

@c ----------------------------------------------------------------------------
@node set features, set gloss, set failures, set
@subsubsection set features

@w{@code{set features} @var{value}} determines how features will be displayed.

@w{@code{set features all}} enables the display of the features for all nodes of the parse tree.

@w{@code{set features top}} enables the display of the feature structure for only the top node of the parse tree. This is the default setting.

@w{@code{set features flat}} causes features to be displayed in a flat, linear string that uses less space on the screen.

@w{@code{set features full}} causes features to be displayed in an indented form that makes the embedded structure of the feature set clear. This is the default setting.

@w{@code{set features on}} turns on features display mode, allowing features to be shown. This is the default setting.

@w{@code{set features off}} turns off features display mode, preventing features from being shown.
@c ---------------------------------------------------------------------------- @node set gloss, set marker category, set features, set @subsubsection set gloss @w{@code{set gloss} @var{value}} enables the display of glosses in the parse tree output if @var{value} is @code{on}, and disables the display of glosses if @var{value} is @code{off}. If any glosses exist in the lexicon file, then @code{gloss} is automatically turned @code{on} when the lexicon is loaded. If no glosses exist in the lexicon, then this flag is ignored. @c ---------------------------------------------------------------------------- @node set marker category, set marker features, set gloss, set @subsubsection set marker category @w{@code{set marker category} @var{marker}} establishes the marker for the field containing the category (part of speech) feature. The default is @code{\c}. @c ---------------------------------------------------------------------------- @node set marker features, set marker gloss, set marker category, set @subsubsection set marker features @w{@code{set marker features} @var{marker}} establishes the marker for the field containing miscellaneous features. (This field is not needed for many words.) The default is @code{\f}. @c ---------------------------------------------------------------------------- @node set marker gloss, set marker record, set marker features, set @subsubsection set marker gloss @w{@code{set marker gloss} @var{marker}} establishes the marker for the field containing the word gloss. The default is @code{\g}. @c ---------------------------------------------------------------------------- @node set marker record, set marker word, set marker gloss, set @subsubsection set marker record @w{@code{set marker record} @var{marker}} establishes the field marker that begins a new record in the lexicon file. This may or may not be the same as the @code{word} marker. The default is @code{\w}. 
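With the default markers, a record in such a lexicon file might look like the following sketch. The word, category, and gloss values here are invented for illustration; only the field markers come from the defaults described above, and the optional @code{\f} features field is omitted.

@example
\w cows
\c N
\g cow+PL
@end example

Here @code{\w} both begins the record and supplies the word field, since the @code{record} and @code{word} markers default to the same value.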
@c ---------------------------------------------------------------------------- @node set marker word, set timing, set marker record, set @subsubsection set marker word @w{@code{set marker word} @var{marker}} establishes the marker for the word field. The default is @code{\w}. @c ---------------------------------------------------------------------------- @node set timing, set top-down-filter, set marker word, set @subsubsection set timing @w{@code{set timing} @var{value}} enables timing mode if @var{value} is @code{on}, and disables timing mode if @var{value} is @code{off}. If timing mode is @code{on}, then the elapsed time required to process a command is displayed when the command finishes. If timing mode is @code{off}, then the elapsed time is not shown. The default is @code{off}. (This option is useful only to satisfy idle curiosity.) @c ---------------------------------------------------------------------------- @node set top-down-filter, set tree, set timing, set @subsubsection set top-down-filter @w{@code{set top-down-filter} @var{value}} enables or disables top-down filtering based on the categories. @w{@code{set top-down-filter on}} turns on this filtering, and @w{@code{set top-down-filter off}} turns it off. The top-down filter speeds up the parsing of a sentence, but might cause the parser to miss some valid parses. The default setting is @code{on}. This should not be required in the final version of PC-Kimmo. @c ---------------------------------------------------------------------------- @node set tree, set trim-empty-features, set top-down-filter, set @subsubsection set tree @w{@code{set tree} @var{value}} specifies how parse trees should be displayed. @w{@code{set tree full}} turns on the parse tree display, displaying the result of the parse as a full tree. This is the default setting. 
A short sentence would look something like this:

@example
@group
        Sentence
           |
      Declarative
      _____|_____
     NP         VP
     |       ___|____
     N       V      COMP
    cows    eat       |
                      NP
                      |
                      N
                    grass
@end group
@end example

@w{@code{set tree flat}} turns on the parse tree display, displaying the result of the parse as a flat tree structure in the form of a bracketed string. The same short sentence would look something like this:

@example
@group
(Sentence (Declarative (NP (N cows)) (VP (V eat) (COMP (NP (N grass))))))
@end group
@end example

@w{@code{set tree indented}} turns on the parse tree display, displaying the result of the parse in an indented format sometimes called a @emph{northwest tree}. The same short sentence would look like this:

@example
@group
Sentence
  Declarative
    NP
      N cows
    VP
      V eat
      COMP
        NP
          N grass
@end group
@end example

@w{@code{set tree off}} disables the display of parse trees altogether.

@c ----------------------------------------------------------------------------
@node set trim-empty-features, set unification, set tree, set
@subsubsection set trim-empty-features

@w{@code{set trim-empty-features} @var{value}} disables the display of empty feature values if @var{value} is @code{on}, and enables the display of empty feature values if @var{value} is @code{off}. The default is not to display empty feature values.

@c ----------------------------------------------------------------------------
@node set unification, set verbose, set trim-empty-features, set
@subsubsection set unification

@w{@code{set unification} @var{value}} enables or disables feature unification.

@w{@code{set unification on}} turns on unification mode. This is the default setting.

@w{@code{set unification off}} turns off feature unification in the grammar. Only the context-free phrase structure rules are used to guide the parse; the feature constraints are ignored. This can be dangerous, as it is easy to introduce infinite cycles in recursive phrase structure rules.
@c ---------------------------------------------------------------------------- @node set verbose, set warnings, set unification, set @subsubsection set verbose @w{@code{set verbose} @var{value}} enables or disables the screen display of parse trees in the @w{@code{file parse}} command. @w{@code{set verbose on}} enables the screen display of parse trees, and @w{@code{set verbose off}} disables such display. The default setting is @code{off}. @c ---------------------------------------------------------------------------- @node set warnings, set write-ample-parses, set verbose, set @subsubsection set warnings @w{@code{set warnings} @var{value}} enables warning mode if @var{value} is @code{on}, and disables warning mode if @var{value} is @code{off}. If warning mode is enabled, then warning messages are displayed on the output. If warning mode is disabled, then no warning messages are displayed. The default setting is @code{on}. @c ---------------------------------------------------------------------------- @node set write-ample-parses, , set warnings, set @subsubsection set write-ample-parses @w{@code{set write-ample-parses} @var{value}} enables writing @code{\parse} and @code{\features} fields at the end of each sentence in the disambiguated analysis file if @var{value} is @code{on}, and disables writing these fields if @var{value} is @code{off}. The default setting is @code{off}. This variable setting affects only the @w{@code{file disambiguate}} command. @c ---------------------------------------------------------------------------- @node show, status, set, Interactive commands @subsection show The @code{show} commands display internal settings on the screen. Each of these commands is described below. 
@menu
* show lexicon::
* show status::
@end menu

@c ----------------------------------------------------------------------------
@node show lexicon, show status, show, show
@subsubsection show lexicon

@w{@code{show lexicon}} prints the contents of the lexicon stored in memory on the standard output. @sc{this is not very useful, and may be removed.}

@c ----------------------------------------------------------------------------
@node show status, , show lexicon, show
@subsubsection show status

@w{@code{show status}} displays the names of the current grammar, sentences, and log files, and the values of the switches established by the @code{set} command.

@code{show} (by itself) and @code{status} are synonyms for @w{@code{show status}}.

@c ----------------------------------------------------------------------------
@node status, synthesize, show, Interactive commands
@subsection status

@code{status} displays the names of the current grammar, sentences, and log files, and the values of the switches established by the @code{set} command.

@c ----------------------------------------------------------------------------
@node synthesize, system, status, Interactive commands
@subsection synthesize

@w{@code{synthesize} @var{[morphological-form]}} attempts to produce surface forms from a morphological form provided by the user. If a morphological form is typed on the same line as the command, then that form is synthesized. If the command is typed without a form, then PC-Kimmo repeatedly prompts the user for morphological forms with a special synthesizer prompt, processing each form. This cycle of typing and synthesizing is terminated by typing an empty ``form'' (that is, nothing but the @code{Enter} or @code{Return} key).

Note that the morphemes in the morphological form must be separated by spaces, and must match gloss entries loaded from the lexicon. Also, the morphemes must be given in the proper order.

Both the rules and the synthesis lexicon must be loaded before using this command.
It does not use a grammar.

@c ----------------------------------------------------------------------------
@node system, take, synthesize, Interactive commands
@subsection system

@w{@code{system} @var{[command]}} allows the user to execute an operating system command (such as checking the available space on a disk) from within PC-Kimmo. This is available only for MS-DOS and Unix, not for Microsoft Windows or the Macintosh. If no system-level command is given on the line with the @code{system} command, then PC-Kimmo is pushed into the background and a new system command processor (shell) is started. Control is usually returned to PC-Kimmo in this case by typing @code{exit} as the operating system command.

@code{sys} is the minimal abbreviation for @code{system}. @code{!} (exclamation point) is a synonym for @code{system}. (@code{!} does not require a space to separate it from the command.)

@c ----------------------------------------------------------------------------
@node take, , system, Interactive commands
@subsection take

@w{@code{take} @var{[file.tak]}} redirects command input to the specified file. The default filetype extension for @code{take} is @file{.tak}, and the default filename is @file{pckimmo.tak}.

@code{take} files can be nested three deep. That is, the user types @w{@code{take file1}}, @code{file1} contains the command @w{@code{take file2}}, and @code{file2} has the command @w{@code{take file3}}. It would be an error for @code{file3} to contain a @code{take} command. This should not prove to be a serious limitation.

A @code{take} file can also be specified by using the @code{-t} command line option when starting PC-Kimmo. When started, PC-Kimmo looks for a @code{take} file named @file{pckimmo.tak} in the current directory to initialize itself with.
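As an illustration, a startup @file{pckimmo.tak} file might contain nothing more than commands to load a language description and adjust a setting or two. The filenames here are hypothetical, and the comment line assumes that the default @code{;} comment character is honored in @code{take} files.

@example
; pckimmo.tak - load the English description at startup
load rules english.rul
load lexicon english.lex
load grammar english.grm
set timing on
@end example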
@c ----------------------------------------------------------------------------
@node Rules file, Lexicon files, Running PC-Kimmo, Top
@chapter The PC-Kimmo Rules File

@set rules-structure 1
The general structure of the rules file is a list of keyword declarations. Figure @value{rules-structure} shows the conventional structure of the rules file. Note that the notation @w{@samp{@{x | y@}}} means either @samp{x} or @samp{y} (but not both).

@example
@group
@b{Figure @value{rules-structure} Structure of the rules file}

COMMENT @var{comment-character}
ALPHABET @var{symbol-list}
NULL @var{symbol}
ANY @var{symbol}
BOUNDARY @var{symbol}
SUBSET @var{subset-name} @var{symbol-list}
 .  (more subsets)
 .
 .
RULE @var{rule-name} @var{number-of-states} @var{number-of-columns}
     @var{lexical-characters}
     @var{surface-characters}
     @var{state-number}@{: | .@} @var{state-transitions}
      .  (more states)
      .
      .
 .  (more rules)
 .
 .
END
@end group
@end example

The following specifications apply to the rules file.
@itemize @bullet
@item
Extra spaces, blank lines, and comment lines are ignored. In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character). The following control characters when used in a file are whitespace characters: @code{^I} (ASCII 9, tab), @code{^J} (ASCII 10, line feed), @code{^K} (ASCII 11, vertical tab), @code{^L} (ASCII 12, form feed), and @code{^M} (ASCII 13, carriage return).
@item
Comments may be placed anywhere in the file. All data following a comment character to the end of the line is ignored. (See below on the @code{COMMENT} declaration.)
@item
The set of valid keywords used to form declarations includes @code{COMMENT}, @code{ALPHABET}, @code{NULL}, @code{ANY}, @code{BOUNDARY}, @code{SUBSET}, @code{RULE}, and @code{END}.
@item
These declarations are obligatory and can occur only once in a file: @code{ALPHABET}, @code{NULL}, @code{ANY}, @code{BOUNDARY}.
@item
These declarations are optional and can occur one or more times in a file: @code{COMMENT}, @code{SUBSET}, and @code{RULE}.
@item
The @code{COMMENT} declaration sets the comment character used in the rules file, lexicon files, and grammar file. The @code{COMMENT} declaration can only be used in the rules file, not in the lexicon or grammar file. The @code{COMMENT} declaration is optional. If it is not used, the comment character is set to @code{;} (semicolon) as a default.
@item
The @code{COMMENT} declaration can be used anywhere in the rules file and can be used more than once. That is, different parts of the rules file can use different comment characters. The @code{COMMENT} declaration can (and in practice usually does) occur as the first keyword in the rules file, followed by either one or more @code{COMMENT} declarations or the @code{ALPHABET} declaration.
@item
Note that if you use the @code{COMMENT} declaration to declare the character that is already in use as the comment character, an error will result. For instance, if semicolon is the current comment character, the declaration @w{@code{COMMENT ;}} will result in an error.
@item
The comment character can no longer be set using a command line option or with a command in the user interface, as was the case in version 1 of PC-Kimmo.
@item
The @code{ALPHABET} declaration must either occur first in the file or follow only one or more @code{COMMENT} declarations. The other declarations can appear in any order. The @code{COMMENT}, @code{NULL}, @code{ANY}, @code{BOUNDARY}, and @code{SUBSET} declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result.
@item
The @code{ALPHABET} declaration defines the set of symbols used in either lexical or surface representations. The keyword @code{ALPHABET} is followed by a @var{symbol-list} of all alphabetic symbols. Each symbol must be separated from the others by at least one space. The list can span multiple lines, and ends with the next valid keyword.
All alphanumeric characters (such as @code{a}, @code{B}, and @code{2}), symbols (such as @code{$} and @code{+}), and punctuation characters (such as @code{.} and @code{?}) are available as alphabet members. The characters in the IBM extended character set (above ASCII 127) are also available. Control characters (below ASCII 32) can also be used, with the exception of whitespace characters (see above), @code{^Z} (end of file), and @code{^@@} (null). The alphabet can contain a maximum of 255 symbols.

An alphabetic symbol can also be a multigraph, that is, a sequence of two or more characters. The individual characters composing a multigraph do not necessarily have to also be declared as alphabetic characters. For example, an alphabet could include the characters @code{s} and @code{z} and the multigraph @code{sz%}, but not include @code{%} as an alphabetic character. Note that a multigraph cannot also be interpreted as a sequence of the individual characters that comprise it.
@item
The keyword @code{NULL} is followed by a single @var{symbol} that represents a null (empty, zero) element. The @code{NULL} symbol is considered to be an alphabetic character, but cannot also be listed in the @code{ALPHABET} declaration. The @code{NULL} symbol declared in the rules file is also used in the lexicon file to represent a null lexical entry.
@item
The keyword @code{ANY} is followed by a single ``wildcard'' @var{symbol} that represents a match of any character in the alphabet. The @code{ANY} symbol is not considered to be an alphabetic character, though it is used in the column headers of state tables. It cannot be listed in the @code{ALPHABET} declaration. It is not used in the lexicon file.
@item
The keyword @code{BOUNDARY} is followed by a single @var{symbol} that represents an initial or final word boundary. The @code{BOUNDARY} symbol is considered to be an alphabetic character, but cannot also be listed in the @code{ALPHABET} declaration.
When used in the column header of a state table, it can only appear as the pair @code{#:#} (where, for instance, @code{#} has been declared as the @code{BOUNDARY} symbol). The @code{BOUNDARY} symbol is also used in the lexicon file in the continuation class field of a lexical entry to indicate the end of a word (that is, no continuation class).
@item
The @code{SUBSET} declaration defines a set of characters that are referred to in the column headers of rules. The keyword @code{SUBSET} is followed by the @var{subset-name} and @var{symbol-list}. @var{subset-name} is a single word (one or more characters) that names the list of characters that follows it. The subset name must be unique (that is, if it is a single character it cannot also be in the alphabet or be any other declared symbol). It can be composed of any characters (except space); that is, it is not limited to the characters declared in the @code{ALPHABET} section. It must not be identical to any keyword used in the rules file. The subset name is used in rules to represent all members of the subset of the alphabet that it defines.

Note that @code{SUBSET} declarations can be interspersed among the rules. This allows subsets to be placed near the rule that uses them if such a style is desired. However, a subset must be declared before a rule that uses it.
@item
The @var{symbol-list} following a @var{subset-name} is a list of single symbols, each of which is separated by at least one space. The list can span multiple lines. Each symbol in the list must be a member of the previously defined @code{ALPHABET}, with the exception of the @code{NULL} symbol, which can appear in a subset list but is not included in the @code{ALPHABET} declaration. Neither the @code{ANY} symbol nor the @code{BOUNDARY} symbol can appear in a subset symbol list.
@item
The keyword @code{RULE} signals that a state table immediately follows.
Note that two-level rules must be expressed as a state table rather than in the form discussed in
@ifset txt
chapter 2 `The Two-level Formalism'
@end ifset
@ifclear txt
@ref{Two-level formalism}
@end ifclear
above.
@item
@var{rule-name} is the name or description of the rule which the state table encodes. It functions as an annotation to the state table and has no effect on the computational operation of the table. It is displayed by the @code{list rules} and @code{show rule} commands and is also displayed in traces.

The rule name must be surrounded by a pair of identical delimiter characters. Any material can be used between the delimiters of the rule name with the exception of the current comment character and, of course, the rule name delimiter character itself. Each rule in the file can use a different pair of delimiters. The rule name must be all on one line, but it does not have to be on the same line as the @code{RULE} keyword.
@item
@var{number-of-states} is the number of states (rows in the table) that will be defined for this table. The states must begin at 1 and go in sequence through the number defined here (that is, gaps in state numbers are not allowed).
@item
@var{number-of-columns} is the number of state transitions (columns in the table) that will be defined for each state.
@item
@var{lexical-characters} is a list of elements separated by one or more spaces. Each element represents the lexical half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the @code{NULL} symbol, the @code{ANY} symbol, or the @code{BOUNDARY} symbol (in which case the corresponding surface character must also be the @code{BOUNDARY} symbol). The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.
@item
@var{surface-characters} is a list of elements separated by one or more spaces.
Each element represents the surface half of a lexical:surface correspondence which, when matched, defines a state transition. Each element in the list must be either a member of the alphabet, a subset name, the @code{NULL} symbol, the @code{ANY} symbol, or the @code{BOUNDARY} symbol (in which case the corresponding lexical character must also be the @code{BOUNDARY} symbol). The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.
@item
@var{state-number} is the number of the state or row of the table. The first state number must be 1, and subsequent state numbers must follow in numerical sequence without any gaps.
@item
@samp{@{: | .@}} is the final or nonfinal state indicator. This should be a colon (@code{:}) if the state is a final state and a period (@code{.}) if it is a nonfinal state. It must follow the @var{state-number} with no intervening space.
@item
@var{state-transitions} is a list of state transition numbers for a particular state. Each number must be between 0 and the number of states (inclusive) declared for the table; a transition to state 0 indicates failure. The list can span multiple lines, but the number of elements in the list must be equal to the number of columns declared for this rule.
@item
The keyword @code{END} follows all other declarations and indicates the end of the rules file. Any material in the file thereafter is ignored by PC-Kimmo. The @code{END} keyword is optional; the physical end of the file also terminates the rules file.
@end itemize

@set rules-sample 2
Figure @value{rules-sample} shows a sample rules file.
@example
@group
@b{Figure @value{rules-sample} A sample rules file}

ALPHABET
    b c d f g h j k l m n p q r s t v w x y z + ; + is morpheme boundary
    a e i o u
NULL 0
ANY @@
BOUNDARY #

SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets

RULE "Consonant defaults" 1 23
    b c d f g h j k l m n p q r s t v w x y z + @@
    b c d f g h j k l m n p q r s t v w x y z 0 @@
 1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

RULE "Vowel defaults" 1 6
    a e i o u @@
    a e i o u @@
 1: 1 1 1 1 1 1

RULE "Voicing s:z <=> V___V" 4 4
    V s s @@
    V z @@ @@
 1: 2 0 1 1
 2: 2 4 3 1
 3: 0 0 1 1
 4. 2 0 0 0
; more rules

END
@end group
@end example

@c ----------------------------------------------------------------------------
@node Lexicon files, Grammar file, Rules file, Top
@chapter The PC-Kimmo Lexicon Files

@set lex-main 3
A lexicon consists of one main lexicon file plus one or more files of lexical entries. The general structure of the main lexicon file is a list of keyword declarations. The set of valid keywords is @code{ALTERNATION}, @code{FEATURES}, @code{FIELDCODE}, @code{INCLUDE}, and @code{END}. Figure @value{lex-main} shows the conventional structure of the main lexicon file.

@example
@group
@b{Figure @value{lex-main} Structure of the main lexicon file}

ALTERNATION @var{alternation-name} @var{sublexicon-list}
 .  (more ALTERNATIONs)
 .
 .
FEATURES @var{feature-list}
FIELDCODE @var{field-code} U
FIELDCODE @var{field-code} L
FIELDCODE @var{field-code} A
FIELDCODE @var{field-code} F
FIELDCODE @var{field-code} G
INCLUDE @var{file-name}
 .  (more INCLUDEd files)
 .
 .
END
@end group
@end example

The following specifications apply to the main lexicon file.
@itemize @bullet
@item
Extra spaces, blank lines, and comment lines are ignored. In the descriptions below, reference to the use of a space character implies any whitespace character (that is, any character treated like a space character).
The following control characters when used in a file are whitespace characters: @code{^I} (ASCII 9, tab), @code{^J} (ASCII 10, line feed), @code{^K} (ASCII 11, vertical tab), @code{^L} (ASCII 12, form feed), and @code{^M} (ASCII 13, carriage return).
@item
The comment character declared in the rules file is operative in the main lexicon file. Comments may be placed anywhere in the file. All data following a comment character to the end of the line is ignored.
@item
The set of valid keywords used to form declarations includes @code{ALTERNATION}, @code{FEATURES}, @code{FIELDCODE}, @code{INCLUDE}, and @code{END}.
@item
The declarations can appear in any order with the proviso that any alternation name, feature name, or fieldcode used in a lexical entry must be declared before the lexical entry is read. In practice, this means that the @code{INCLUDE} declarations should appear last, but the @code{ALTERNATION}, @code{FEATURES}, and @code{FIELDCODE} declarations can appear in any order.
@item
The @code{ALTERNATION} declaration defines a set of sublexicon names that serve as the continuation class of a lexical item. The @code{ALTERNATION} keyword is followed by an @var{alternation-name} and a @var{sublexicon-list}. @code{ALTERNATION} declarations are optional (but nearly always used in practice) and can occur as many times as needed.
@item
@var{alternation-name} is a name associated with the following @var{sublexicon-list}. It is a word composed of one or more characters, not limited to the @code{ALPHABET} characters declared in the rules file. An alternation name can be any word other than a keyword used in the lexicon file. The program does not check to see if an alternation name is actually used in the lexicon file.
@item
@var{sublexicon-list} is a list of sublexicon names. It can span multiple lines until the next valid keyword is encountered. Each sublexicon name in the list must be used in the sublexicon field of a lexical entry.
Although it is not enforced at the time the lexicon file is loaded, an undeclared sublexicon named in a sublexicon name list will cause an error when the recognizer tries to use it.
@item
The @code{FEATURES} keyword is followed by a @var{feature-list}. A @var{feature-list} is a list of words, each of which is expanded into feature structures by the word grammar.
@item
The @code{FIELDCODE} declaration is used to define what fieldcode will be used to mark each type of field in a lexical entry. The @code{FIELDCODE} keyword is followed by a @var{field-code} and one of five possible internal codes: @code{U}, @code{L}, @code{A}, @code{F}, or @code{G}. There must be five @code{FIELDCODE} declarations, one for each of these internal codes, where @code{U} indicates the lexical item field, @code{L} indicates the sublexicon field, @code{A} indicates the alternation field, @code{F} indicates the features field, and @code{G} indicates the gloss field.
@item
The @code{INCLUDE} keyword is followed by a @var{file-name} that names a file containing lexical entries to be loaded. An @code{INCLUDE}d file cannot contain any declarations (such as a @code{FIELDCODE} or an @code{INCLUDE} declaration), only lexical entries and comment lines.
@item
The keyword @code{END} follows all other declarations and indicates the end of the main lexicon file. Any material in the file thereafter is ignored by PC-Kimmo. The @code{END} keyword is optional; the physical end of the file also terminates the main lexicon file.
@end itemize

@noindent
@set lex-sample-main 4
Figure @value{lex-sample-main} shows a sample main lexicon file.
@example
@group
@b{Figure @value{lex-sample-main} A sample main lexicon file}

ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX

FEATURES sg pl reg irreg

FIELDCODE lf U        ;lexical item
FIELDCODE lx L        ;sublexicon
FIELDCODE alt A       ;alternation
FIELDCODE fea F       ;features
FIELDCODE gl G        ;gloss

INCLUDE affix.lex     ;file of affixes
INCLUDE noun.lex      ;file of nouns
INCLUDE verb.lex      ;file of verbs
INCLUDE adjectiv.lex  ;file of adjectives
INCLUDE adverb.lex    ;file of adverbs
END
@end group
@end example

@set lex-entry 5
Figure @value{lex-entry} shows the structure of a lexical entry. Lexical entries are encoded in ``field-oriented standard format.'' Standard format is an information interchange convention developed by SIL International. It tags the kinds of information in ASCII text files by means of markers which begin with backslash. Field-oriented standard format (FOSF) is a refinement of standard format geared toward representing data which has a database-like record and field structure.

@example
@group
@b{Figure @value{lex-entry} Structure of a lexical entry}

\@var{lexical-item-fieldcode} @var{lexical-item}
\@var{sublexicon-fieldcode} @var{sublexicon-name}
\@var{alternation-fieldcode} @{@var{alternation-name} | @var{boundary-symbol}@}
\@var{features-fieldcode} @var{feature-abbreviations}
\@var{gloss-fieldcode} @var{gloss}
@end group
@end example

@noindent
The following points provide an informal description of the syntax of FOSF files.
@itemize @bullet
@item
A field-oriented standard format (FOSF) file consists of a sequence of records.
@item
A record consists of a sequence of fields.
@item
A field consists of a field marker and a field value.
@item
A field marker consists of a backslash character at the beginning of a line, followed by an alphabetic or numeric character, followed by zero or more printable characters, and terminated by a space, tab, or the end of a line. A field marker without its initial backslash character is termed a field code.
@item
A field marker must begin in the first position of a line. Backslash characters occurring elsewhere in the file are not interpreted as field markers.
@item The first field marker of the record is considered the record marker, and thus the same field must occur first in every record of the file. @item Each field marker is separated from the field value by one or more spaces, tabs, or newlines. The field value continues up to the next field marker. @item Any line that is empty or contains only whitespace characters is considered a comment line and is ignored. Comment lines may occur between or within fields. @item Fields and lines in an FOSF file can be arbitrarily long. @item There are two basic types of fields in FOSF files: nonrepeating and repeating. Repeating fields are multiple consecutive occurrences of fields marked by the same marker. Individual fields within a repeating field can be called subfields. @end itemize @noindent The following specifications apply to how FOSF is implemented in PC-Kimmo. @itemize @bullet @item Lexical entries are encoded as records in a FOSF file. @item Only those fields whose field codes are declared in the main lexicon file are recognized (see above on the @code{FIELDCODE} declaration). All other fields are considered to be extraneous and are ignored. @item The first field of each lexical entry must be the lexical item field. The lexical item field code is assigned to the internal code U by a @code{FIELDCODE} declaration in the main lexicon file. @item Only nonrepeating fields are permitted. @item The comment character declared in the rules file is operative in included files of lexical entries. All data following a comment character to the end of the line is ignored. @end itemize A file of lexical entries is loaded by using an @code{INCLUDE} declaration in the main lexicon file (see above). An @code{INCLUDE}d file of lexical entries cannot contain any declarations (such as a @code{FIELDCODE} or an @code{INCLUDE} declaration), only lexical entries and comment lines. The following specifications apply to lexical entries. 
@itemize @bullet
@item
A lexical entry is composed of five fields: lexical item, sublexicon, alternation, features, and gloss. The lexical item, sublexicon, and alternation fields are obligatory; the features and gloss fields are optional. The first field of the entry must always be the lexical item. The other fields can appear in any order, even differing from one entry to another.
@item
Although the gloss field is optional, if a lexical entry does not include one, a warning message to that effect will be displayed when the entry is loaded. To suppress this warning message, do the command @w{@code{set warnings off}}
@ifset txt
(see section 3.2.17.19 `set warnings')
@end ifset
@ifclear txt
(@pxref{set warnings})
@end ifclear
before loading the lexicon.
@item
If an entry has an empty gloss field (that is, the field marker for the gloss field is present but there is no data after it), then the contents of the lexical item field will also be used as the gloss for that entry.
@item
A lexical item field consists of a @var{lexical-item-fieldcode} and a @var{lexical-item}.
@item
A @var{lexical-item-fieldcode} is a field code assigned to the internal code @code{U} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{lexical-item} is one or more characters that represent an element (typically a morpheme or word) of the lexicon. Each character (or multigraph) must be in the alphabet defined for the language. The lexical item uses only the lexical subset of the alphabet.
@item
A sublexicon field consists of a @var{sublexicon-fieldcode} and a @var{sublexicon-name}.
@item
A @var{sublexicon-fieldcode} is a field code assigned to the internal code @code{L} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{sublexicon-name} is the name associated with a sublexicon. It is a word composed of one or more characters, not limited to the alphabetic characters declared in the rules file. Every lexical item must belong to a sublexicon.
Every lexicon must include a special sublexicon named INITIAL (that is, there must be at least one lexical entry that belongs to the INITIAL sublexicon).
@item
Lexical entries belonging to a sublexicon do not have to be listed consecutively in a single file (as was the case for PC-Kimmo version 1); rather, lexical entries in a file can occur in any order, regardless of what sublexicon they belong to. Lexical entries of a sublexicon can even be placed in two or more separate files.
@item
An alternation field consists of an @var{alternation-fieldcode} followed by either an @var{alternation-name} or the @var{boundary-symbol}.
@item
An @var{alternation-name} is declared in an @code{ALTERNATION} declaration in the main lexicon file. The @var{boundary-symbol} is declared in the rules file and indicates the end of all possible continuations in the lexicon.
@item
A features field consists of a @var{features-fieldcode} and a @var{feature-abbreviation-list}.
@item
A @var{features-fieldcode} is a field code assigned to the internal code @code{F} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{feature-abbreviation-list} is a list of feature abbreviations. Each abbreviation is a single word consisting of alphanumeric characters or other characters except @code{()@{@}[]<>=:$!} (these are used for special purposes in the grammar file). The character @code{\} should not be used as the first character of an abbreviation because that is how fields are marked in the lexicon file. Upper and lower case letters used in feature abbreviations are considered different. For example, @code{PLURAL} is not the same as @code{Plural} or @code{plural}. Feature abbreviations are expanded into full feature structures by the word grammar
@ifset txt
(see chapter 6 `The Grammar File').
@end ifset
@ifclear txt
(@pxref{Grammar file}).
@end ifclear
@item
A gloss field consists of a @var{gloss-fieldcode} and a @var{gloss}.
@item
A @var{gloss-fieldcode} is a field code assigned to the internal code @code{G} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{gloss} is a string of text. Any material can be used in the gloss field with the exception of the comment character.
@end itemize

@set lex-sample-entry 6
Figure @value{lex-sample-entry} shows a sample lexical entry.

@example
@group
@b{Figure @value{lex-sample-entry} A sample lexical entry}

\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
@end group
@end example

@c ----------------------------------------------------------------------------
@node Grammar file, Convlex, Lexicon files, Top
@chapter The Grammar File

The following specifications apply generally to the word grammar file:
@itemize @bullet
@item
Blank lines, spaces, and tabs separate elements of the grammar file from one another, but are otherwise ignored.
@item
The comment character declared by the @w{@code{set comment}} command
@ifset txt
(see section 3.2.17.4 `set comment' above)
@end ifset
@ifclear txt
(@pxref{set comment})
@end ifclear
is operative in the grammar file. The default comment character is the semicolon (@code{;}). Comments may be placed anywhere in the grammar file. Everything following a comment character to the end of the line is ignored.
@item
A grammar file is divided into fields identified by a small set of keywords.
@enumerate
@item
@code{Rule} starts a context-free phrase structure rule with its set of feature constraints. These rules define how words join together to form phrases, clauses, or sentences. The lexicon and grammar are tied together by using the lexical categories as the terminal symbols of the phrase structure rules and by using the other lexical features in the feature constraints.
@item
@code{Let} starts a feature template definition. Feature templates are used as macros (abbreviations) in the lexicon. They may also be used to assign default feature structures to the categories.
@item
@code{Parameter} starts a program parameter definition. These parameters control various aspects of the program.
@item
@code{Define} starts a lexical rule definition.
As noted in Shieber (1985), something more powerful than just abbreviations for common feature elements is sometimes needed to represent systematic relationships among the elements of a lexicon. This need is met by lexical rules, which express transformations rather than mere abbreviations. Lexical rules are not yet implemented properly. They may or may not be useful for word grammars used by PC-Kimmo. @item @code{Lexicon} starts a lexicon section. This is only for compatibility with the original PATR-II. The section name is skipped over properly, but nothing is done with it. @item @code{Word} starts an entry in the lexicon. This is only for compatibility with the original PATR-II. The entry is skipped over properly, but nothing is done with it. @item @code{End} effectively terminates the file. Anything following this keyword is ignored. @end enumerate Note that these keywords are not case sensitive: @code{RULE} is the same as @code{rule}, and both are the same as @code{Rule}. @item Each of the fields in the grammar file may optionally end with a period. If there is no period, the next keyword (in an appropriate slot) marks the end of one field and the beginning of the next. 
@end itemize

@menu
* Rule::                defining a word structure rule
* Let::                 defining a feature template
* Parameter::           setting control variables
* Define::              defining a lexical rule
@end menu

@c ----------------------------------------------------------------------------
@node Rule, Let, Grammar file, Grammar file
@section Rules

A PC-Kimmo word grammar rule has these parts, in the order listed:
@enumerate
@item
the keyword @code{Rule}
@item
an optional rule identifier enclosed in braces (@code{@{@}})
@item
the nonterminal symbol to be expanded
@item
an arrow (@code{->}) or equal sign (@code{=})
@item
zero or more terminal or nonterminal symbols, possibly marked for alternation or optionality
@item
an optional colon (@code{:})
@item
zero or more feature constraints
@item
an optional period (@code{.})
@end enumerate

The optional rule identifier consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it not contain the comment character and that it all appear on the same line in the grammar file.

The terminal and nonterminal symbols in the rule have the following characteristics:
@itemize @bullet
@item
Upper and lower case letters used in symbols are considered different. For example, @code{NOUN} is not the same as @code{Noun}, and neither is the same as @code{noun}.
@item
The symbol X may be used to stand for any terminal or nonterminal. For example, this rule says that any category in the grammar rules can be replaced by two copies of the same category separated by a CJ:
@example
@group
Rule X -> X_1 CJ X_2
     <X> = <X_1>
     <X> = <X_2>
@end group
@end example
The symbol X can be useful for capturing generalities. Care must be taken, since it can be replaced by anything.
@item
Index numbers are used to distinguish instances of a symbol that is used more than once in a rule.
They are added to the end of a symbol following an underscore character (@code{_}). This is illustrated in the rule for X above. @item The characters @code{()@{@}[]<>=:/} cannot be used in terminal or nonterminal symbols since they are used for special purposes in the grammar file. The character @code{_} can be used @emph{only} for attaching an index number to a symbol. @item By default, the left hand symbol of the first rule in the grammar file is the start symbol of the grammar. @end itemize The symbols on the right hand side of a phrase structure rule may be marked or grouped in various ways: @itemize @bullet @item Parentheses around an element of the expansion (right hand) part of a rule indicate that the element is optional. Parentheses may be placed around multiple elements. This makes an optional group of elements. @item A forward slash (/) is used to separate alternative elements of the expansion (right hand) part of a rule. @item Curly braces can be used for grouping elements. For example the following says that an S consists of an NP followed by either a TVP or an IV: @example Rule S -> NP @{TVP / IV@} @end example @item Alternatives are taken to be as long as possible. Thus if the curly braces were omitted from the rule above, as in the rule below, the TVP would be treated as part of the alternative containing the NP. It would not be allowed before the IV. @example Rule S -> NP TVP / IV @end example @item Parentheses group enclosed elements the same as curly braces do. Alternatives and groups delimited by parentheses or curly braces may be nested to any depth. @end itemize A rule can be followed by zero or more @emph{feature constraints} that refer to symbols used in the rule. 
A feature constraint has these parts, in the order listed:
@enumerate
@item
a feature path that begins with one of the symbols from the phrase structure rule
@item
an equal sign
@item
either another path or a value
@end enumerate

A feature constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the values of the @emph{agr} features for the NP and VP nodes of the parse tree must unify:
@example
@group
Rule S -> NP VP
     <NP agr> = <VP agr>
@end group
@end example

If a feature constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the @emph{head case} feature for the NP node of the parse tree must either be originally undefined or equal to NOM:
@example
@group
Rule S -> NP VP
     <NP head case> = NOM
@end group
@end example
(After unification succeeds, the @emph{head case} feature for the NP node of the parse tree will be equal to NOM.)

A feature constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree. In the following rule and constraint, the value of the @emph{tense} feature is passed from the VP node up to the S node:
@example
@group
Rule S -> NP VP
     <S tense> = <VP tense>
@end group
@end example

@c ----------------------------------------------------------------------------
@node Let, Parameter, Rule, Grammar file
@section Feature templates

A PC-Kimmo grammar feature template has these parts, in the order listed:
@enumerate
@item
the keyword @code{Let}
@item
the template name
@item
the keyword @code{be}
@item
a feature definition
@item
an optional period (@code{.})
@end enumerate

If the template name is a terminal category (a terminal symbol in one of the phrase structure rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure.
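The unification behavior described above can be sketched in a few lines of illustrative Python. This is not PC-Kimmo's implementation; for the sketch, feature structures are modeled as nested dictionaries with atomic string values.

```python
def unify(a, b):
    """Unify two feature structures (nested dicts / atomic strings).

    Returns the unified structure, or None on failure (a clash between
    two different atomic values).  This mirrors the behavior described
    above: a constraint like <NP head case> = NOM succeeds when the
    feature is undefined or already NOM, and fails otherwise.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None           # clash somewhere below this key
                result[key] = merged
            else:
                result[key] = value       # previously undefined: fill in
        return result
    return a if a == b else None          # atoms must match exactly
```

For example, unifying an NP with no @emph{case} feature against the constraint structure for @code{<NP head case> = NOM} succeeds and adds the value, while an NP already marked ACC fails to unify.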
The characters @code{()@{@}[]<>=:} cannot be used in template names since they are used for special purposes in the grammar file.  The characters @code{/_} can be freely used in template names.  The character @code{\} should not be used as the first character of a template name because that is how fields are marked in the lexicon file.  The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file.  For example, the lexical entry for the irregular plural form @emph{feet} may have the abbreviation @emph{pl} in its features field.  The grammar file would define this abbreviation with a template like this:
@example
Let pl be [number: PL]
@end example
The path notation may also be used:
@example
Let pl be <number> = PL
@end example
More complicated feature structures may be defined in templates.  For example,
@example
@group
Let 3sg be [tense:  PRES
            agr:    3SG
            finite: +
            vform:  S]
@end group
@end example
which is equivalent to:
@example
@group
Let 3sg be <tense>  = PRES
           <agr>    = 3SG
           <finite> = +
           <vform>  = S
@end group
@end example
In the following example, the abbreviation @emph{irreg} is defined using another abbreviation:
@example
@group
Let irreg be <regular> = -
             pl
@end group
@end example
The abbreviation @emph{pl} must be defined previously in the grammar file or an error will result.  A subsequent template could also use the abbreviation @emph{irreg} in its definition.  In this way, an inheritance hierarchy of features may be constructed.  Feature templates permit disjunctive definitions.  For example, the lexical entry for the word @emph{deer} may specify the feature abbreviation @emph{sg/pl}.  The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:
@example
@group
Let sg/pl be @{[number:SG]
              [number:PL]@}
@end group
@end example
This has the effect of creating two entries for @emph{deer}, one with singular number and another with plural.
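The effect of a disjunctive template can be sketched in Python.  This is a hypothetical illustration, not PC-Kimmo code: a template whose value is a list of feature structures yields one candidate lexical entry per disjunct.

```python
# feature structures for the two disjuncts of sg/pl
SG = {"number": "SG"}
PL = {"number": "PL"}
templates = {"sg": SG, "pl": PL, "sg/pl": [SG, PL]}

def entries_for(lexical_form, abbrev):
    """Expand a (possibly disjunctive) template abbreviation into one
    candidate lexical entry per disjunct."""
    value = templates[abbrev]
    disjuncts = value if isinstance(value, list) else [value]
    return [dict(d, lex=lexical_form) for d in disjuncts]

print(entries_for("deer", "sg/pl"))
# -> one singular entry and one plural entry for the same form
```

A word like @emph{book}, whose template is not disjunctive, expands to a single entry by the same mechanism.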
Note that there is no limit to the number of disjunct structures listed between the braces.  Also, there is no slash (@code{/}) between the elements of the disjunction as there is between the elements of a disjunction in the rules.  A shorter version of the above template using the path notation looks like this:
@example
Let sg/pl be <number> = @{SG PL@}
@end example
Abbreviations can also be used in disjunctions, provided that they have previously been defined:
@example
@group
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be @{[sg] [pl]@}
@end group
@end example
Note the square brackets around the abbreviations @emph{sg} and @emph{pl}; without square brackets they would be interpreted as simple values instead.  Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (@code{!}).  A default value can be overridden by an explicit feature assignment.  This template says that all members of category N have singular number as a default value:
@example
Let N be <number> = !SG
@end example
The effect of this template is to make all nouns singular unless they are explicitly marked as plural.  For example, regular nouns such as @emph{book} do not need any feature in their lexical entries to signal that they are singular; but an irregular noun such as @emph{feet} would have a feature abbreviation such as @emph{pl} in its lexical entry.  This would be defined in the grammar as @w{@code{[number: PL]}}, and would override the default value for the @emph{number} feature specified by the template above.  If the N template above used @code{SG} instead of @code{!SG}, then the word @emph{feet} would fail to parse, since its @emph{number} feature would have an internal conflict between @code{SG} and @code{PL}.
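The difference between a default value and an ordinary value can be sketched in Python.  This is an illustrative sketch, not PC-Kimmo's implementation: a default is installed only where the feature is still undefined, so an explicit value from the lexical entry always wins.

```python
def apply_default(fs, path, default):
    """Install a default atomic value (the '!' prefix in a template)
    only where the feature is not already specified; an explicit
    value from the lexical entry always wins."""
    node = fs
    for name in path[:-1]:
        node = node.setdefault(name, {})
    node.setdefault(path[-1], default)
    return fs

book = {}                      # no number feature in the lexical entry
feet = {"number": "PL"}        # explicitly marked plural
print(apply_default(book, ["number"], "SG"))   # {'number': 'SG'}
print(apply_default(feet, ["number"], "SG"))   # {'number': 'PL'}
```

Had @code{SG} been an ordinary (non-default) value, the second case would instead be a unification conflict between @code{SG} and @code{PL}.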
@c ----------------------------------------------------------------------------
@node Parameter, Define, Let, Grammar file
@section Parameter settings
A PC-Kimmo grammar parameter setting has these parts, in the order listed:
@enumerate
@item
the keyword @code{Parameter}
@item
an optional colon (@code{:})
@item
one or more keywords identifying the parameter
@item
the keyword @code{is}
@item
the parameter value
@item
an optional period (@code{.})
@end enumerate
PC-Kimmo recognizes the following grammar parameters:
@table @code
@item Start symbol
defines the start symbol of the grammar.  For example,
@example
Parameter Start symbol is S
@end example
declares that the parse goal of the grammar is the nonterminal category S.  The default start symbol is the left hand symbol of the first phrase structure rule in the grammar file.
@item Restrictor
defines a set of features to use for top-down filtering, expressed as a list of feature paths.  For example,
@example
Parameter Restrictor is <cat> <head form>
@end example
declares that the @emph{cat} and @emph{head form} features should be used to screen rules before adding them to the parse chart.  The default is not to use any features for such filtering.  This filtering, named @emph{restriction} in Shieber (1985), is performed in addition to the normal top-down filtering based on categories alone.
@sc{restriction is not yet implemented.  should it be instead of normal filtering rather than in addition to?}
@item Attribute order
specifies the order in which feature attributes are displayed.  For example,
@example
@group
Parameter Attribute order is cat lex sense head
                             first rest agreement
@end group
@end example
declares that the @emph{cat} attribute should be the first one shown in any output from PC-Kimmo, and that the other attributes should be shown in the relative order listed, with the @emph{agreement} attribute shown last among those listed, but ahead of any attributes that are not listed above.
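The attribute ordering rule can be expressed as a sort key.  The following Python sketch is an illustration (not PC-Kimmo code): listed attributes sort by their position in the parameter setting, and unlisted ones follow in character code order.

```python
# the order given in the Attribute order parameter above
ORDER = ["cat", "lex", "sense", "head", "first", "rest", "agreement"]

def attribute_key(name):
    """Listed attributes sort by their position in the parameter
    setting; unlisted ones follow, in character code order."""
    if name in ORDER:
        return (0, ORDER.index(name))
    return (1, name)

print(sorted(["tense", "cat", "agreement", "head", "aspect"],
             key=attribute_key))
# -> ['cat', 'head', 'agreement', 'aspect', 'tense']
```

Note that @emph{agreement}, though last among the listed attributes, still precedes the unlisted @emph{aspect} and @emph{tense}.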
Attributes that are not listed are ordered according to their character code sort order.  If the attribute order is not specified, then the category feature @emph{cat} is shown first, with all other attributes sorted according to their character codes.
@item Category feature
defines the label for the category attribute.  For example,
@example
Parameter Category feature is Categ
@end example
declares that @emph{Categ} is the name of the category attribute.  The default name for this attribute is @emph{cat}.
@item Lexical feature
defines the label for the lexical attribute.  For example,
@example
Parameter Lexical feature is Lex
@end example
declares that @emph{Lex} is the name of the lexical attribute.  The default name for this attribute is @emph{lex}.
@item Gloss feature
defines the label for the gloss attribute.  For example,
@example
Parameter Gloss feature is Gloss
@end example
declares that @emph{Gloss} is the name of the gloss attribute.  The default name for this attribute is @emph{gloss}.
@end table
@c ----------------------------------------------------------------------------
@node Define, , Parameter, Grammar file
@section Lexical rules
A PC-Kimmo grammar lexical rule has these parts, in the order listed:
@enumerate
@item
the keyword @code{Define}
@item
the name of the lexical rule
@item
the keyword @code{as}
@item
the rule definition
@item
an optional period (@code{.})
@end enumerate
The rule definition consists of one or more mappings.  Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value.  Every output path begins with the feature name @code{out} and every input path begins with the feature name @code{in}.  The assignment operator is either an equal sign (@code{=}) or an equal sign followed by a ``greater than'' sign (@code{=>}).  As noted before, lexical rules are not yet implemented properly, and may not prove to be useful for PC-Kimmo word grammars in any case.
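Although lexical rules are not yet implemented, the mapping format above suggests behavior along the lines of the following Python sketch.  Everything here is hypothetical: the passivization rule, the feature names, and the representation are invented for illustration, and the leading @code{out}/@code{in} path keywords are left implicit.

```python
def apply_lexical_rule(mappings, in_fs):
    """Apply mappings of the form <out ...> = <in ...> or
    <out ...> = VALUE, building a new output feature structure.
    Paths are tuples of feature names."""
    out_fs = {}
    for out_path, source in mappings:
        if isinstance(source, tuple):   # an input feature path
            value = in_fs
            for name in source:
                value = value[name]
        else:                           # an atomic value
            value = source
        node = out_fs
        for name in out_path[:-1]:
            node = node.setdefault(name, {})
        node[out_path[-1]] = value
    return out_fs

# hypothetical rule: promote the object to subject and mark voice
passive = [(("subj",), ("obj",)), (("voice",), "PASSIVE")]
print(apply_lexical_rule(passive, {"subj": "kim", "obj": "sandy"}))
# -> {'subj': 'sandy', 'voice': 'PASSIVE'}
```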
@c ----------------------------------------------------------------------------
@node Convlex, Bibliography, Grammar file, Top
@chapter Convlex: converting version 1 lexicons
The format of the lexicon files changed significantly between version 1 and version 2 of PC-Kimmo.  For this reason, an auxiliary program to convert lexicon files was written.  A version 1 PC-Kimmo lexicon file looks like this:
@example
@group
; SAMPLE.LEX  25-OCT-89

; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.

ALTERNATION Begin  NOUN
ALTERNATION Noun   End

LEXICON INITIAL
0       Begin   "[ "

LEXICON NOUN
s'ati   Noun    "Noun1"
s'adi   Noun    "Noun2"
bab'at  Noun    "Noun3"
bab'ad  Noun    "Noun4"

LEXICON End
0       #       " ]"

END
@end group
@end example
For PC-Kimmo version 2, the same lexicon must be split into two files.  The first one would look like this:
@example
@group
; SAMPLE.LEX  25-OCT-89

; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.

ALTERNATION Begin  NOUN
ALTERNATION Noun   End

FIELDCODE lf  U
FIELDCODE lx  L
FIELDCODE alt A
FIELDCODE fea F
FIELDCODE gl  G

INCLUDE sample2.sfm

END
@end group
@end example
Note that everything except the lexicon sections and entries has been copied verbatim into this new primary lexicon file.  The @code{FIELDCODE} statements define how to interpret the other lexicon files containing the actual lexicon sections and entries.
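The backslash-coded (SFM) field format used by the included lexicon files shown below can be read with a short Python sketch.  This is an illustration of the general record format, not convlex itself; it assumes each record begins with the @code{\lf} field, as in the example.

```python
def read_sfm_records(lines, record_marker="lf"):
    """Group backslash-coded (SFM) lines into records; a new record
    starts at each occurrence of the record marker field."""
    records, current = [], None
    for line in lines:
        line = line.strip()
        if not line.startswith("\\"):
            continue                     # skip blank or stray lines
        code, _, value = line[1:].partition(" ")
        if code == record_marker and current is not None:
            records.append(current)      # marker field starts a new record
            current = {}
        if current is None:
            current = {}
        current[code] = value
    if current:
        records.append(current)
    return records

sample = r"""
\lf s'ati
\lx NOUN
\alt Noun
\fea
\gl Noun1
\lf s'adi
\lx NOUN
\alt Noun
\fea
\gl Noun2
""".splitlines()

recs = read_sfm_records(sample)
print(len(recs), recs[0]["lf"], recs[1]["gl"])
```

A field with no value, such as @code{\fea} above, simply yields an empty string for that field.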
These files are indicated by @code{INCLUDE} statements, and look like this:
@example
@group
\lf 0
\lx INITIAL
\alt Begin
\fea
\gl [
@end group

@group
\lf s'ati
\lx NOUN
\alt Noun
\fea
\gl Noun1
@end group

@group
\lf s'adi
\lx NOUN
\alt Noun
\fea
\gl Noun2
@end group

@group
\lf bab'at
\lx NOUN
\alt Noun
\fea
\gl Noun3
@end group

@group
\lf bab'ad
\lx NOUN
\alt Noun
\fea
\gl Noun4
@end group

@group
\lf 0
\lx End
\alt #
\fea
\gl ]
@end group
@end example
@file{convlex} was written to make the transition from version 1 to version 2 of PC-Kimmo as painless as possible.  It reads a version 1 lexicon file, including any @code{INCLUDE}d files, and writes a version 2 set of lexicon files.  For a trivial case like the example above, the interaction with the user might go something like this:
@example
@group
C:\>convlex
CONVLEX: convert lexicon from PC-KIMMO version 1 to version 2
Comment character: [;]
Input lexicon file: sample.lex
Output lexicon file: sample2.lex
Primary sfm lexicon file: sample2.sfm
@end group
@end example
For each @code{INCLUDE} statement in the version 1 lexicon file, @file{convlex} prompts for a replacement filename like this:
@example
New sfm include file to replace noun.lex: noun2.sfm
@end example
The user interface is extremely crude, but since this is a program that most users run only once or twice, that should not be regarded as a problem.
@c ----------------------------------------------------------------------------
@node Bibliography, , Convlex, Top
@unnumbered Bibliography
@enumerate
@item
Antworth, Evan L.@. 1990.  @cite{PC-KIMMO: a two-level processor for morphological analysis}.
@tex
\break
@end tex
Occasional Publications in Academic Computing No.@: 16.  Dallas, TX: Summer Institute of Linguistics.
@item
Antworth, Evan L.@. 1991.  Introduction to two-level phonology.  @cite{Notes on Linguistics} 53:4@value{endash}18.  Dallas, TX: Summer Institute of Linguistics.
@item
Antworth, Evan L.@. 1995.  @cite{User's Guide to PC-KIMMO version 2}.
URL @tex \hfil\break @end tex @ifset html ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip @end ifset @ifclear html @w{@t{ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip}} @end ifclear (visited August 29, 1997). @item Chomsky, Noam. 1957. @cite{Syntactic structures.} The Hague: Mouton. @item Chomsky, Noam, and Morris Halle. 1968. @cite{The sound pattern of English.} New York: Harper and Row. @item Goldsmith, John A. 1990. @cite{Autosegmental and metrical phonology.} Basil Blackwell. @item Johnson, C. Douglas. 1972. @cite{Formal aspects of phonological description.} The Hague: Mouton. @item Kay, Martin. 1983. When meta-rules are not meta-rules. In Karen Sparck Jones and Yorick Wilks, eds., @cite{Automatic natural language parsing,} 94@value{endash}116. Chichester: Ellis Horwood Ltd. See pages 100@value{endash}104. @item Koskenniemi, Kimmo. 1983. @cite{Two-level morphology: a general computational model for word-form recognition and production.} Publication No. 11. Helsinki: University of Helsinki Department of General Linguistics. @end enumerate @c ---------------------------------------------------------------------------- @contents @bye