This document describes PC-Kimmo, an implementation of the two-level computational linguistic formalism for personal computers. It is available for MS-DOS, Microsoft Windows, Macintosh, and Unix.(1)
The authors would appreciate feedback directed to the following addresses. For linguistic questions, contact:
Gary Simons
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
gary.simons@sil.org
For programming questions, contact:
Stephen McConnel
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
(972)708-7361 (office)
(972)708-7561 (fax)
steve@acadcomp.sil.org or Stephen_McConnel@sil.org
An online user manual for PC-Kimmo is available on the world wide web at the URL http://www.sil.org/pckimmo/v2/doc/guide.html.
Two-level phonology is a linguistic tool developed by computational linguists. Its primary use is in systems for natural language processing such as PC-Kimmo. This chapter describes the linguistic and computational basis of two-level phonology.(2)
As the fields of computer science and linguistics have grown up together during the past several decades, they have each benefited from cross-fertilization. Modern linguistics has especially been influenced by the formal language theory that underlies computation. The most famous application of formal language theory to linguistics was Chomsky's (1957) transformational generative grammar. Chomsky's strategy was to consider several types of formal languages to see if they were capable of modeling natural language syntax. He started by considering the simplest type of formal languages, called finite state languages. As a general principle, computational linguists try to use the least powerful computational devices possible. This is because the less powerful devices are better understood, their behavior is predictable, and they are computationally more efficient. Chomsky (1957:18ff) demonstrated that natural language syntax could not be effectively modeled as a finite state language; thus he rejected finite state languages as a theory of syntax and proposed that syntax requires the use of more powerful, non-finite state languages. However, there is no reason to assume that the same should be true for natural language phonology. A finite state model of phonology is especially desirable from the computational point of view, since it makes possible a computational implementation that is simple and efficient.
While various linguists proposed that generative phonological rules could be implemented by finite state devices (see Johnson 1972, Kay 1983), the most successful model of finite state phonology was developed by Kimmo Koskenniemi, a Finnish computer scientist. He called his model two-level morphology (Koskenniemi 1983), though his use of the term morphology should be understood to encompass both what linguists would consider morphology proper (the decomposition of words into morphemes) and phonology (at least in the sense of morphophonemics). Our main interest in this article is the phonological formalism used by the two-level model, hereafter called two-level phonology. Two-level phonology traces its linguistic heritage to "classical" generative phonology as codified in The Sound Pattern of English (Chomsky and Halle 1968). The basic insight of two-level phonology is due to the phonologist C. Douglas Johnson (1972) who showed that the SPE theory of phonology could be implemented using finite state devices by replacing sequential rule application with simultaneous rule application. At its core, then, two-level phonology is a rule formalism, not a complete theory of phonology. The following sections of this article describe the mechanism of two-level rule application by contrasting it with rule application in classical generative phonology. It should be noted that Chomsky and Halle's theory of rule application became the focal point of much controversy during the 1970s with the result that current theories of phonology differ significantly from classical generative phonology. The relevance of two-level phonology to current theory is an important issue, but one that will not be fully addressed here. Rather, the comparison of two-level phonology to classical generative phonology is done mainly for expository purposes, recognizing that while classical generative phonology has been superseded by subsequent theoretical work, it constitutes a historically coherent view of phonology that continues to influence current theory and practice.
One feature that two-level phonology shares with classical generative phonology is linear representation. That is, phonological forms are represented as linear strings of symbols. This is in contrast to the nonlinear representations used in much current work in phonology, namely autosegmental and metrical phonology (see Goldsmith 1990). On the computational side, two-level phonology is consistent with natural language processing systems that are designed to operate on linear orthographic input.
We will begin by reviewing the formal properties of generative rules. Stated succinctly, generative rules are sequentially ordered rewriting rules. What does this mean?
First, rewriting rules are rules that change or transform one symbol into another symbol. For example, a rewriting rule of the form `a --> b' interprets the relationship between the symbols `a' and `b' as a dynamic change whereby the symbol `a' is rewritten or turned into the symbol `b'. This means that after this operation takes place, the symbol `a' no longer "exists," in the sense that it is no longer available to other rules. In linguistic theory generative rules are known as process rules. Process rules attempt to characterize the relationship between levels of representation (such as the phonemic and phonetic levels) by specifying how to transform representations from one level into representations on the other level.
Second, generative phonological rules apply sequentially, that is, one after another, rather than applying simultaneously. This means that each rule creates as its output a new intermediate level of representation. This intermediate level then serves as the input to the next rule. As a consequence, the underlying form becomes inaccessible to later rules.
Third, generative phonological rules are ordered; that is, the description specifies the sequence in which the rules must apply. Applying rules in any other order may result in incorrect output.
As an example of a set of generative rules, consider the following rules:
(1) Vowel Raising      e --> i / ___ C_0 i
(2) Palatalization     t --> c / ___ i
Rule 1 (Vowel Raising) states that `e' becomes (is rewritten as) `i' in the environment preceding `Ci' (where `C' stands for the set of consonants and `C_0' stands for zero or more consonants). Rule 2 (Palatalization) states that `t' becomes `c' preceding `i'. A sample derivation of forms to which these rules apply looks like this (where UR stands for Underlying Representation, SR stands for Surface Representation):(3)
UR:   temi
(1)   timi
(2)   cimi
SR:   cimi
Notice that in addition to the underlying and surface levels, an intermediate level has been created as the result of sequentially applying rules 1 and 2. The application of rule 1 produces the intermediate form `timi', which then serves as the input to rule 2.
Not only are these rules sequential, they are ordered, such that rule 1 must apply before rule 2. Rule 1 has a feeding relationship to rule 2; that is, rule 1 increases the number of forms that can undergo rule 2 by creating more instances of `i'. Consider what would happen if they were applied in the reverse order. Given the input form `temi', rule 2 would do nothing, since its environment is not satisfied. Rule 1 would then apply to produce the incorrect surface form `timi'.
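To make the mechanics concrete, here is a minimal Python sketch (not part of PC-Kimmo; the function names and the use of regular expressions are the sketch's own) of sequential, ordered rewriting. The two functions implement rules 1 and 2, and reversing their order reproduces the incorrect derivation just described.

import re

# A minimal sketch of ordered, sequential rewriting.  Each rule rewrites the
# output of the previous rule, so an intermediate form is created and the
# underlying form is no longer visible to later rules.

CONSONANTS = "bcdfghjklmnpqrstvwxyz"

def raising(form):               # (1) Vowel Raising:  e --> i / ___ C_0 i
    return re.sub(rf"e(?=[{CONSONANTS}]*i)", "i", form)

def palatalization(form):        # (2) Palatalization: t --> c / ___ i
    return re.sub(r"t(?=i)", "c", form)

def derive(underlying, rules):
    form = underlying
    for rule in rules:           # sequential application
        form = rule(form)
    return form

print(derive("temi", [raising, palatalization]))   # cimi  (correct)
print(derive("temi", [palatalization, raising]))   # timi  (wrong order)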
Two-level rules differ from generative rules in the following ways. First, whereas generative rules apply in a sequential order, two-level rules apply simultaneously, which is better described as applying in parallel. Applying rules in parallel to an input form means that for each segment in the form all of the rules must apply successfully, even if only vacuously.
Second, whereas sequentially applied generative rules create intermediate levels of derivation, simultaneously applied two-level rules require only two levels of representation: the underlying or lexical level and the surface level. There are no intermediate levels of derivation. It is in this sense that the model is called two-level.
Third, whereas generative rules relate the underlying and surface levels by rewriting underlying symbols as surface symbols, two-level rules express the relationship between the underlying and surface levels by positing direct, static correspondences between pairs of underlying and surface symbols. For instance, instead of rewriting underlying `a' as surface `b', a two-level rule states that an underlying `a' corresponds to a surface `b'. The two-level rule does not change `a' into `b', so `a' is available to other rules. In other words, after a two-level rule applies, both the underlying and surface symbols still "exist."
Fourth, whereas generative rules have access only to the current intermediate form at each stage of the derivation, two-level rules have access to both underlying and surface environments. Generative rules cannot "look back" at underlying environments or "look ahead" to surface environments. In contrast, the environments of two-level rules are stated as lexical-to-surface correspondences. This means that a two-level rule can easily refer to an underlying `a' that corresponds to a surface `b', or to a surface `b' that corresponds to an underlying `a'. In generative phonology, the interaction between a pair of rules is controlled by requiring that they apply in a certain sequential order. In two-level phonology, rule interactions are controlled not by ordering the rules but by carefully specifying their environments as strings of two-level correspondences.
Fifth, whereas generative, rewriting rules are unidirectional (that is, they operate only in an underlying to surface direction), two-level rules are bidirectional. Two-level rules can operate either in an underlying to surface direction (generation mode) or in a surface to underlying direction (recognition mode). Thus in generation mode two-level rules accept an underlying form as input and return a surface form, while in recognition mode they accept a surface form as input and return an underlying form. The practical application of bidirectional phonological rules is obvious: a computational implementation of bidirectional rules is not limited to generation mode to produce words; it can also be used in recognition direction to parse words.
To understand how a two-level phonological description works, we will use the example given above involving Raising and Palatalization. The two-level model treats the relationship between the underlying form `temi' and the surface form `cimi' as a direct, symbol-to-symbol correspondence:
UR:  t e m i
SR:  c i m i
Each pair of lexical and surface symbols is a correspondence pair. We refer to a correspondence pair with the notation `<underlying symbol>:<surface symbol>', for instance `e:i' and `m:m'. There must be an exact one-to-one correspondence between the symbols of the underlying form and the symbols of the surface form. Deletion and insertion of symbols (explained in detail in the next section) are handled by positing correspondences with zero, a null segment. The two-level model uses a notation for expressing two-level rules that is similar to the notation linguists use for phonological rules. Corresponding to the generative rule for Palatalization (rule 2 above), here is the two-level rule for the `t:c' correspondence:
(3) Palatalization t:c <=> ___ @:i
This rule is a statement about the distribution of the pair `t:c' on the left side of the arrow with respect to the context or environment on the right side of the arrow. A two-level rule has three parts: the correspondence, the operator, and the environment. The correspondence part of rule 3 is the pair `t:c', which is the correspondence that the rule sanctions. The operator part of rule 3 is the double-headed arrow. It indicates the nature of the logical relationship between the correspondence and the environment (thus it means something very different from the rewriting arrow `-->' of generative phonology). The `<=>' arrow is equivalent to the biconditional operator of formal logic and means that the correspondence occurs always and only in the stated context; that is, `t:c' is allowed if and only if it is found in the context `___i'. In short, rule 3 is an obligatory rule. The environment part of rule 3 is everything to the right of the arrow. The long underline indicates the gap where the pair `t:c' occurs. Notice that even the environment part of the rule is specified as two-level correspondence pairs.
The environment part of rule 3 requires further explanation. Instead of using a correspondence such as `i:i', it uses the correspondence `@:i'. The `@' symbol is a special "wildcard" symbol that stands for any phonological segment included in the description. In the context of rule 3, the correspondence `@:i' stands for all the feasible pairs in the description whose surface segment is `i', in this case `e:i' and `i:i'. Thus by using the correspondence `@:i', we allow Palatalization to apply in the environment of either a lexical `e' or lexical `i'. In other words, we are claiming that Palatalization is sensitive to a surface (phonetic) environment rather than an underlying (phonemic) environment. Thus rule 3 will apply to both underlying forms `timi' and `temi' to produce a surface form with an initial `c'.
Corresponding to the generative rule for Raising (rule 1 above) is the following two-level rule for the `e:i' correspondence:
(4) Vowel Raising e:i <=> ___ C:C* @:i
(The asterisk in `C:C*' indicates zero or more instances of the correspondence `C:C') Similar to rule 3 above, rule 4 uses the correspondence `@:i' in its environment. Thus rule 4 states that the correspondence `e:i' occurs preceding a surface `i', regardless of whether it is derived from a lexical `e' or `i'. Why is this necessary? Consider the case of an underlying form such as `pememi'. In order to derive the surface form `pimimi', Raising must apply twice: once before a lexical `i' and again before a lexical `e', both of which correspond to a surface `i'. Thus rule 4 will apply to both instances of lexical `e', capturing the regressive spreading of Raising through the word.
When rules 3 and 4 are applied in parallel, they work in concert to produce the correct output. For example,
UR:     t e m i
        | | | |
Rules   3 4 | |
        | | | |
SR:     c i m i
Conceptually, a two-level phonological description of a data set such as this can be understood as follows. First, the two-level description declares an alphabet of all the phonological segments used in the data in both underlying and surface forms, in the case of our example, `t', `m', `c', `e', and `i'. Second, the description declares a set of feasible pairs, which is the complete set of all underlying-to-surface correspondences of segments that occur in the data. The set of feasible pairs for these data is the union of the set of default correspondences, whose underlying and surface segments are identical (namely `t:t', `m:m', `e:e', and `i:i') and the set of special correspondences, whose underlying and surface segments are different (namely `t:c' and `e:i'). Notice that since the segment `c' only occurs as a surface segment in the feasible pairs, the description will disallow any underlying form that contains a `c'.
A minimal two-level description, then, consists of nothing more than this declaration of the feasible pairs. Since it contains all possible underlying-to-surface correspondences, such a description will produce the correct output form, but because it does not constrain the environments where the special correspondences can occur, it will also allow many incorrect output forms. For example, given the underlying form `temi', it will produce the surface forms `temi', `timi', `cemi', and `cimi', of which only the last is correct.
Third, in order to restrict the output to only correct forms, we include rules in the description that specify where the special correspondences are allowed to occur. Thus the rules function as constraints or filters, blocking incorrect forms while allowing correct forms to pass through. For instance, rule 3 (Palatalization) states that a lexical `t' must be realized as a surface `c' when it precedes `@:i'; thus, given the underlying form `temi' it will block the potential surface output forms `timi' (because the surface sequence `ti' is prohibited) and `cemi' (because surface `c' is prohibited before anything except surface `i'). Rule 4 (Raising) states that a lexical `e' must be realized as a surface `i' when it precedes the sequence `C:C' `@:i'; thus, given the underlying form `temi' it will block the potential surface output forms `temi' and `cemi' (because the surface sequence `emi' is prohibited). Therefore of the four potential surface forms, three are filtered out; rules 3 and 4 leave only the correct form `cimi'.
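This filtering view lends itself to a small illustration. The following Python sketch is illustrative only: PC-Kimmo itself compiles rules into finite state tables (described later), while this sketch simply enumerates candidates from the feasible pairs and filters them with rules 3 and 4 stated as biconditional checks. The function names and the brute-force search are the sketch's own, and the pair `p:p' is added to the feasible pairs so that the `pememi' example can also be run.

from itertools import product

# Two-level rules acting as parallel filters over candidate correspondences.

CONS = set("ptmc")                                     # subset C
FEASIBLE = [("t", "t"), ("t", "c"), ("m", "m"), ("p", "p"),
            ("e", "e"), ("e", "i"), ("i", "i")]        # lexical:surface pairs

def rule3_ok(pairs):
    """Palatalization  t:c <=> ___ @:i"""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "t":
            context = k + 1 < len(pairs) and pairs[k + 1][1] == "i"
            if (surf == "c") != context:               # biconditional <=>
                return False
    return True

def rule4_ok(pairs):
    """Vowel Raising  e:i <=> ___ C:C* @:i"""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "e":
            j = k + 1
            while j < len(pairs) and pairs[j][0] in CONS and pairs[j][1] in CONS:
                j += 1                                 # skip zero or more C:C pairs
            context = j < len(pairs) and pairs[j][1] == "i"
            if (surf == "i") != context:
                return False
    return True

RULES = [rule3_ok, rule4_ok]

def generate(lexical):
    """All surface forms compatible with the feasible pairs and the rules."""
    options = [[s for (l, s) in FEASIBLE if l == ch] for ch in lexical]
    for surface in product(*options):
        pairs = list(zip(lexical, surface))
        if all(rule(pairs) for rule in RULES):         # rules apply in parallel
            yield "".join(surface)

def recognize(surface):
    """The same relation run in the other direction (surface to lexical)."""
    options = [[l for (l, s) in FEASIBLE if s == ch] for ch in surface]
    for lexical in product(*options):
        pairs = list(zip(lexical, surface))
        if all(rule(pairs) for rule in RULES):
            yield "".join(lexical)

print(list(generate("temi")))      # ['cimi']
print(list(generate("pememi")))    # ['pimimi']
print(list(recognize("cimi")))     # ['temi', 'timi']

Because the same two checks constrain the relation in both directions, the recognize function returns both underlying `temi' and `timi' for surface `cimi', which is the bidirectionality described above.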
Two-level phonology facilitates a rather different way of thinking about phonological rules. We think of generative rules as processes that change one segment into another. In contrast, two-level rules do not perform operations on segments, rather they state static constraints on correspondences between underlying and surface forms. Generative phonology and two-level phonology also differ in how they characterize relationships between rules. Rules in generative phonology are described in terms of their relative order of application and their effect on the input of other rules (the so-called feeding and bleeding relations). Thus the generative rule 1 for Raising precedes and feeds rule 2 for Palatalization. In contrast, rules in the two-level model are categorized according to whether they apply in lexical versus surface environments. So we say that the two-level rules for Raising and Palatalization are sensitive to a surface rather than underlying environment.
Phonological processes that delete or insert segments pose a special challenge to two-level phonology. Since an underlying form and its surface form must correspond segment for segment, how can segments be deleted from an underlying form or inserted into a surface form? The answer lies in the use of the special null symbol `0' (zero). Thus the correspondence `x:0' represents the deletion of `x', while `0:x' represents the insertion of `x'. (It should be understood that these zeros are provided by the rule application mechanism and exist only internally; that is, zeros are not included in input forms nor are they printed in output forms.) As an example of deletion, consider these forms from Tagalog (where `+' represents a morpheme boundary):
UR:  m a n + b i l i
SR:  m a m 0 0 i l i
Using process terminology, these forms exemplify phonological coalescence, whereby the sequence `nb' becomes `m'. Since in the two-level model a sequence of two underlying segments cannot correspond to a single surface segment, coalescence must be interpreted as simultaneous assimilation and deletion. Thus we need two rules: an assimilation rule for the correspondence `n:m' and a deletion rule for the correspondence `b:0' (note that the morpheme boundary `+' is treated as a special symbol that is always deleted).
(5) Nasal Assimilation     n:m <=> ___ +:0 b:@
(6) Deletion               b:0 <=> @:m +:0 ___
Notice the interaction between the rules: Nasal Assimilation occurs in a lexical environment, namely a lexical `b' (which can correspond to either a surface `b' or `0'), while Deletion occurs in a surface environment, namely a surface `m' (which could be the realization of either a lexical `n' or `m'). In this way the two rules interact with each other to produce the correct output.
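Extending the same sketch style, the Tagalog alignment can be written out with explicit zeros and checked against rules 5 and 6. This is illustrative only: the real program hypothesizes the zeros itself, whereas here the alignment is supplied by hand.

# Deletion modeled with explicit zeros in the aligned pair sequence.

NULL = "0"
pairs = list(zip("man+bili", "mam00ili"))   # m:m a:a n:m +:0 b:0 i:i l:l i:i

def rule5_ok(pairs):
    """Nasal Assimilation  n:m <=> ___ +:0 b:@  (lexical environment)"""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "n":
            context = (k + 2 < len(pairs)
                       and pairs[k + 1] == ("+", NULL)
                       and pairs[k + 2][0] == "b")
            if (surf == "m") != context:
                return False
    return True

def rule6_ok(pairs):
    """Deletion  b:0 <=> @:m +:0 ___  (surface environment)"""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "b":
            context = (k >= 2
                       and pairs[k - 2][1] == "m"
                       and pairs[k - 1] == ("+", NULL))
            if (surf == NULL) != context:
                return False
    return True

print(rule5_ok(pairs) and rule6_ok(pairs))           # True:  man+bili : mam00ili is licensed
print(rule5_ok(list(zip("man+bili", "man0bili"))))   # False: assimilation is obligatory here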
Insertion correspondences, where the lexical segment is `0', enable one to write rules for processes such as stress insertion, gemination, infixation, and reduplication. For example, Tagalog has a verbalizing infix `um' that attaches between the first consonant and vowel of a stem; thus the infixed form of `bili' is `bumili'. To account for this formation with two-level rules, we represent the underlying form of the infix `um' as the prefix `X+', where `X' is a special symbol that has no phonological purpose other than standing for the infix. We then write a rule that inserts the sequence `um' in the presence of `X+', which is deleted. Here is the two-level correspondence:
UR:  X + b 0 0 i l i
SR:  0 0 b u m i l i
and here is the two-level rule, which simultaneously deletes `X' and inserts `um':
(7) Infixation X:0 <=> ___ +:0 C:C 0:u 0:m V:V
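In the same illustrative style, rule 7 can be read over the hand-built alignment of the infixation example; only the distribution of the `X:0' correspondence is checked in this sketch.

# Insertion pairs (0:u, 0:m) appearing in the context of an infixation rule.

NULL = "0"
CONS, VOWELS = set("bcdfghjklmnpqrstvwxyz"), set("aeiou")
pairs = list(zip("X+b00ili", "00bumili"))   # X:0 +:0 b:b 0:u 0:m i:i l:l i:i

def rule7_ok(pairs):
    """Infixation  X:0 <=> ___ +:0 C:C 0:u 0:m V:V"""
    for k, (lex, surf) in enumerate(pairs):
        if lex == "X":
            context = (k + 5 < len(pairs)
                       and pairs[k + 1] == ("+", NULL)
                       and pairs[k + 2][0] in CONS and pairs[k + 2][1] in CONS
                       and pairs[k + 3] == (NULL, "u")
                       and pairs[k + 4] == (NULL, "m")
                       and pairs[k + 5][0] in VOWELS and pairs[k + 5][1] in VOWELS)
            if (surf == NULL) != context:
                return False
    return True

print(rule7_ok(pairs))    # True: X and + are deleted while u and m are inserted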
These examples involving deletion and insertion show that the invention of zero is just as important for phonology as it was for arithmetic. Without zero, two-level phonology would be limited to the most trivial phonological processes; with zero, the two-level model has the expressive power to handle complex phonological or morphological phenomena (though not necessarily with the degree of felicity that a linguist might desire).
PC-Kimmo is an interactive program. It has a few command line options, but it is controlled primarily by commands typed at the keyboard (or loaded from a file previously prepared).
The PC-Kimmo program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.
-g filename
loads the grammar file at startup (see the load grammar command).
-l filename
loads the analysis lexicon file at startup (see the load lexicon command); this requires that -r also be used.
-r filename
loads the rules file at startup (see the load rules command).
-s filename
loads the synthesis lexicon file at startup (see the load synthesis-lexicon command).
-t filename
reads commands from the specified take file at startup (see the take command).
The following options exist only in beta-test versions of the program, since they are used only for debugging.
-/
-z filename
-Z address,count
used for debugging dynamic memory: takes effect when the memory block at address is allocated or freed for the count'th time.
Each of the commands available in PC-Kimmo is described below. Each command consists of one or more keywords followed by zero or more arguments. Keywords may be abbreviated to the minimum length necessary to prevent ambiguity.
cd directory
changes the current directory to the one specified. Spaces in the directory pathname are not permitted.
For MS-DOS or Windows, you can give a full path starting with the disk letter and a colon (for example, a:); a path starting with \, which indicates a directory at the top level of the current disk; a path starting with .., which indicates the directory above the current one; and so on. Directories are separated by the \ character. (The forward slash / works just as well as the backslash \ for MS-DOS or Windows.)
For the Macintosh, you can give a full path starting with the name of a hard disk, a path starting with :, which means the current folder, or one starting with ::, which means the folder containing the current one (and so on).
For Unix, you can give a full path starting with a / (for example, /usr/pckimmo); a path starting with .., which indicates the directory above the current one; and so on. Directories are separated by the / character.
clear
erases all existing rules, lexicon, and grammar information, allowing the user to prepare to load information for a new language. Strictly speaking, it is not needed, since the load rules command erases any previously existing rules, the load lexicon command erases any previously existing analysis lexicon, the load synthesis-lexicon command erases any previously existing synthesis lexicon, and the load grammar command erases any previously existing grammar.
cle is the minimal abbreviation for clear.
close
closes the current log file opened by a previous log command.
clo is the minimal abbreviation for close.
The compare commands all test the current language description files by processing data against known (precomputed) results.
co is the minimal abbreviation for compare.
file compare is a synonym for compare.
compare generate <file>
reads lexical and surface forms from the specified file. After reading a lexical form, PC-Kimmo generates the corresponding surface form(s) and compares the result to the surface form(s) read from the file. If VERBOSE is ON, then each form from the file is echoed on the screen with a message indicating whether or not the surface forms generated by PC-Kimmo and read from the file are in agreement. If VERBOSE is OFF, then only the disagreements in surface form are displayed fully. Each result which agrees is indicated by a single dot written to the screen.
The default filetype extension for compare generate is `.gen', and the default filename is `data.gen'.
co g is the minimal abbreviation for compare generate.
file compare generate is a synonym for compare generate.
compare pairs <file>
reads pairs of surface and lexical forms from the specified file. After reading a lexical form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form read from the file. For each surface form, PC-Kimmo also produces any corresponding lexical form(s) and compares the result to the lexical form read from the file. If VERBOSE is ON, then each form from the file is echoed on the screen with a message indicating whether or not the forms produced by PC-Kimmo and read from the file are in agreement. If VERBOSE is OFF, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully.
The default filetype extension for compare pairs is `.pai', and the default filename is `data.pai'.
co p is the minimal abbreviation for compare pairs.
file compare pairs is a synonym for compare pairs.
compare recognize <file>
reads surface and lexical forms from the specified file. After reading a surface form, PC-Kimmo produces any corresponding lexical form(s) and compares the result(s) to the lexical form(s) read from the file. If VERBOSE is ON, then each form from the file is echoed on the screen with a message indicating whether or not the lexical forms produced by PC-Kimmo and read from the file are in agreement. If VERBOSE is OFF, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in lexical forms are displayed fully.
The default filetype extension for compare recognize is `.rec', and the default filename is `data.rec'.
co r is the minimal abbreviation for compare recognize.
file compare recognize is a synonym for compare recognize.
compare synthesize <file>
reads morphological and surface forms from the specified file. After reading a morphological form, PC-Kimmo produces any corresponding surface form(s) and compares the result(s) to the surface form(s) read from the file. If VERBOSE is ON, then each form from the file is echoed on the screen with a message indicating whether or not the surface forms produced by PC-Kimmo and read from the file are in agreement. If VERBOSE is OFF, then each result which agrees is indicated by a single dot written to the screen, and only disagreements in surface forms are displayed fully.
The default filetype extension for compare synthesize is `.syn', and the default filename is `data.syn'.
co s is the minimal abbreviation for compare synthesize.
file compare synthesize is a synonym for compare synthesize.
directory
lists the contents of the current directory. This command is available only for the MS-DOS and Unix implementations. It does not exist for the Microsoft Windows or Macintosh implementations.
edit filename
attempts to edit the specified file using the program indicated by the environment variable EDITOR. If this environment variable is not defined, then edit is used to edit the file on MS-DOS, and emacs is used to edit the file on Unix. This command is not available for the Microsoft Windows or Macintosh implementations.
exit
stops PC-Kimmo, returning control to the operating system. This is the same as quit.
The file commands process data from a file, optionally writing the results to another file. Each of these commands is described below.
The file compare commands all test the current language description files by processing data against known (precomputed) results.
f c is the minimal abbreviation for file compare.
file compare is a synonym for compare.
file generate <infile> [<outfile>]
reads lexical forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file.
This command behaves the same as generate except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See section 3.2.9 generate.
f g is the minimal abbreviation for file generate.
file recognize <infile> [<outfile>]
reads surface forms from the specified input file and writes the corresponding computed morphological and lexical forms either to the screen or to an optionally specified output file.
This command behaves the same as recognize except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See section 3.2.15 recognize.
f r is the minimal abbreviation for file recognize.
file synthesize <infile> [<outfile>]
reads morphological forms from the specified input file and writes the corresponding computed surface forms either to the screen or to an optionally specified output file.
This command behaves the same as synthesize except that input comes from a file rather than the keyboard, and output may go to a file rather than the screen. See section 3.2.20 synthesize.
f s is the minimal abbreviation for file synthesize.
generate [<lexical-form>]
attempts to produce a surface form from a lexical form provided by the user. If a lexical form is typed on the same line as the command, then that lexical form is used to generate a surface form. If the command is typed without a form, then PC-Kimmo prompts the user for lexical forms with a special generator prompt, and processes each form in turn. This cycle of typing and generating is terminated by typing an empty "form" (that is, nothing but the Enter or Return key).
The rules must be loaded before using this command. It does not require either a lexicon or a grammar.
g is the minimal abbreviation for generate.
help [command]
displays a description of the specified command. If help is typed by itself, PC-Kimmo displays a list of commands with short descriptions of each command.
h is the minimal abbreviation for help.
The list commands all display information about the currently loaded data. Each of these commands is described below.
li is the minimal abbreviation for list.
list lexicon
displays the names of all the (sub)lexicons currently loaded. The order of presentation is the order in which they are referenced in the ALTERNATION declarations.
li l is the minimal abbreviation for list lexicon.
list pairs
displays all the feasible pairs for the current set of active rules. The feasible pairs are displayed as pairs of lines, with the lexical characters shown above the corresponding surface characters.
li p is the minimal abbreviation for list pairs.
list rules
displays the names of the current rules, preceded by the number of the rule (used by the set rules command) and an indication of whether the rule is ON or OFF.
li r is the minimal abbreviation for list rules.
The load commands all load information stored in specially formatted files. Each of the load commands is described below.
l is the minimal abbreviation for load.
load grammar [<file>]
erases any existing word grammar and reads a new word grammar from the specified file.
The default filetype extension for load grammar is `.grm', and the default filename is `grammar.grm'.
A grammar file can also be loaded by using the `-g' command line option when starting PC-Kimmo.
l g is the minimal abbreviation for load grammar.
load lexicon [<file>]
erases any existing analysis lexicon information and reads a new analysis lexicon from the specified file. A rules file must be loaded before an analysis lexicon file can be loaded.
The default filetype extension for load lexicon is `.lex', and the default filename is `lexicon.lex'.
An analysis lexicon file can also be loaded by using the `-l' command line option when starting PC-Kimmo. This requires that a `-r' option also be used to load a rules file.
l l is the minimal abbreviation for load lexicon.
load rules [<file>]
erases any existing rules and reads a new set of two-level rules from the specified file.
The default filetype extension for load rules is `.rul', and the default filename is `rules.rul'.
A rules file can also be loaded by using the `-r' command line option when starting PC-Kimmo.
l r is the minimal abbreviation for load rules.
load synthesis-lexicon [<file>]
erases any existing synthesis lexicon and reads a new synthesis lexicon from the specified file. A rules file must be loaded before a synthesis lexicon file can be loaded.
The default filetype extension for load synthesis-lexicon is `.lex', and the default filename is `lexicon.lex'.
A synthesis lexicon file can also be loaded by using the `-s' command line option when starting PC-Kimmo. This requires that a `-r' option also be used to load a rules file.
l s is the minimal abbreviation for load synthesis-lexicon.
log [<file>]
opens a log file. Each item processed by a generate, recognize, synthesize, compare, or file command is recorded in the log file as well as being displayed on the screen.
If a filename is given on the same line as the log command, then that file is used for the log file. Any previously existing file with the same name will be overwritten. If no filename is provided, then the file `pckimmo.log' in the current directory is used for the log file.
Use close to stop recording in a log file. If a log command is given when a log file is already open, then the earlier log file is closed before the new log file is opened.
quit
stops PC-Kimmo, returning control to the operating system. This is the same as exit.
recognize [<surface-form>]
attempts to produce lexical and morphological forms from a surface wordform provided by the user. If a wordform is typed on the same line as a command, then that word is parsed. If the command is typed without a form, then PC-Kimmo prompts the user for surface forms with a special recognizer prompt, and processes each form in turn. This cycle of typing and parsing is terminated by typing an empty "word" (that is, nothing but the Enter or Return key).
Both the rules and the lexicon must be loaded before using this command. A grammar may also be loaded and used to eliminate invalid parses from the two-level processor results. If a grammar is used, then parse trees and feature structures may be displayed as well as the lexical and morphological forms.
save [file.tak]
writes the current settings to the designated file in the form of PC-Kimmo commands. If the file is not specified, the settings are written to pckimmo.tak in the current directory.
The set commands control program behavior by setting internal program variables. Each of these commands (and variables) is described below.
set ambiguities number
limits the number of analyses printed to the given number. The default value is 10. Note that this does not limit the number of analyses produced, just the number printed.
set ample-dictionary value
determines whether or not the AMPLE dictionary files are divided according to morpheme type.
set ample-dictionary split declares that the AMPLE dictionary is divided into a prefix dictionary file, an infix dictionary file, a suffix dictionary file, and one or more root dictionary files. The existence of the three affix dictionaries depends on settings in the AMPLE analysis data file. If they exist, the load ample dictionary command requires that they be given in this relative order: prefix, infix, suffix, root(s).
set ample-dictionary unified declares that any of the AMPLE dictionary files may contain any type of morpheme. This implies that each dictionary entry may contain a field specifying the type of morpheme (the default is root), and that the dictionary code table contains a \unified field. One of the changes listed under \unified must convert a backslash code to T.
The default is for the AMPLE dictionary to be split.(4)
set check-cycles value
enables or disables a check to prevent cycles in the parse chart. set check-cycles on turns on this check, and set check-cycles off turns it off. This check slows down the parsing of a sentence, but it makes the parser less vulnerable to hanging on perverse grammars. The default setting is on.
set comment character
sets the comment character to the indicated value. If character is missing (or equal to the current comment character), then comment handling is disabled. The default comment character is ; (semicolon).
set failures value
enables or disables grammar failure mode. set failures on turns on grammar failure mode, and set failures off turns it off. When grammar failure mode is on, the partial results of forms that fail the grammar module are displayed. A form may fail the grammar either by failing the feature constraints or by failing the constituent structure rules. In the latter case, a partial tree (bush) will be returned. The default setting is off.
Be careful with this option. Setting failures to on can cause PC-Kimmo to go into an infinite loop for certain recursive grammars and certain input sentences. WE MAY TRY TO DO SOMETHING TO DETECT THIS TYPE OF BEHAVIOR, AT LEAST PARTIALLY.
set features value
determines how features will be displayed.
set features all enables the display of the features for all nodes of the parse tree.
set features top enables the display of the feature structure for only the top node of the parse tree. This is the default setting.
set features flat causes features to be displayed in a flat, linear string that uses less space on the screen.
set features full causes features to be displayed in an indented form that makes the embedded structure of the feature set clear. This is the default setting.
set features on turns on features display mode, allowing features to be shown. This is the default setting.
set features off turns off features display mode, preventing features from being shown.
set gloss value
enables the display of glosses in the parse tree output if value is on, and disables the display of glosses if value is off. If any glosses exist in the lexicon file, then gloss is automatically turned on when the lexicon is loaded. If no glosses exist in the lexicon, then this flag is ignored.
set marker category marker
establishes the marker for the field containing the category (part of speech) feature. The default is \c.
set marker features marker
establishes the marker for the field containing miscellaneous features. (This field is not needed for many words.) The default is \f.
set marker gloss marker
establishes the marker for the field containing the word gloss. The default is \g.
set marker record marker
establishes the field marker that begins a new record in the lexicon file. This may or may not be the same as the word marker. The default is \w.
set marker word marker
establishes the marker for the word field. The default is \w.
set timing value
enables timing mode if value is on, and disables timing mode if value is off. If timing mode is on, then the elapsed time required to process a command is displayed when the command finishes. If timing mode is off, then the elapsed time is not shown. The default is off. (This option is useful only to satisfy idle curiosity.)
set top-down-filter value
enables or disables top-down filtering based on the categories. set top-down-filter on turns on this filtering, and set top-down-filter off turns it off. The top-down filter speeds up the parsing of a sentence, but might cause the parser to miss some valid parses. The default setting is on.
This should not be required in the final version of PC-Kimmo.
set tree value
specifies how parse trees should be displayed.
set tree full turns on the parse tree display, displaying the result of the parse as a full tree. This is the default setting. A short sentence would look something like this:

        Sentence
           |
      Declarative
       ____|_____
      NP        VP
      |      ___|____
      N      V     COMP
    cows    eat      |
                     NP
                      |
                      N
                    grass
set tree flat turns on the parse tree display, displaying the result of the parse as a flat tree structure in the form of a bracketed string. The same short sentence would look something like this:
(Sentence (Declarative (NP (N cows)) (VP (V eat) (COMP (NP (N grass))))))
set tree indented turns on the parse tree display, displaying the result of the parse in an indented format sometimes called a northwest tree. The same short sentence would look like this:

Sentence
 Declarative
  NP
   N cows
  VP
   V eat
   COMP
    NP
     N grass
set tree off disables the display of parse trees altogether.
set trim-empty-features value
disables the display of empty feature values if value is on, and enables the display of empty feature values if value is off. The default is not to display empty feature values.
set unification value
enables or disables feature unification.
set unification on turns on unification mode. This is the default setting.
set unification off turns off feature unification in the grammar. Only the context-free phrase structure rules are used to guide the parse; the feature constraints are ignored. This can be dangerous, as it is easy to introduce infinite cycles in recursive phrase structure rules.
set verbose value
enables or disables the screen display of parse trees in the file parse command. set verbose on enables the screen display of parse trees, and set verbose off disables such display. The default setting is off.
set warnings value
enables warning mode if value is on, and disables warning mode if value is off. If warning mode is enabled, then warning messages are displayed on the output. If warning mode is disabled, then no warning messages are displayed. The default setting is on.
set write-ample-parses value
enables writing \parse and \features fields at the end of each sentence in the disambiguated analysis file if value is on, and disables writing these fields if value is off. The default setting is off.
This variable setting affects only the file disambiguate command.
The show commands display internal settings on the screen. Each of these commands is described below.
show lexicon
prints the contents of the lexicon stored in memory on the standard output. THIS IS NOT VERY USEFUL, AND MAY BE REMOVED.
show status
displays the names of the current grammar, sentences, and log files, and the values of the switches established by the set command.
show (by itself) and status are synonyms for show status.
status
displays the names of the current grammar, sentences, and log files, and the values of the switches established by the set command.
synthesize [<morphological-form>]
attempts to produce surface forms from a morphological form provided by the user. If a morphological form is typed on the same line as the command, then that form is synthesized. If the command is typed without a form, then PC-Kimmo repeatedly prompts the user for morphological forms with a special synthesizer prompt, processing each form. This cycle of typing and synthesizing is terminated by typing an empty "form" (that is, nothing but the Enter or Return key).
Note that the morphemes in the morphological form must be separated by spaces, and must match gloss entries loaded from the lexicon. Also, the morphemes must be given in the proper order.
Both the rules and the synthesis lexicon must be loaded before using this command. It does not use a grammar.
system [command]
allows the user to execute an operating system command (such as checking the available space on a disk) from within PC-Kimmo. This is available only for MS-DOS and Unix, not for Microsoft Windows or the Macintosh.
If no system-level command is given on the line with the system command, then PC-Kimmo is pushed into the background and a new system command processor (shell) is started. Control is usually returned to PC-Kimmo in this case by typing exit as the operating system command.
sys is the minimal abbreviation for system.
! (exclamation point) is a synonym for system. (! does not require a space to separate it from the command.)
take [file.tak]
redirects command input to the specified file.
The default filetype extension for take is .tak, and the default filename is pckimmo.tak.
take files can be nested three deep. That is, the user types take file1, file1 contains the command take file2, and file2 has the command take file3. It would be an error for file3 to contain a take command. This should not prove to be a serious limitation.
A take file can also be specified by using the -t command line option when starting PC-Kimmo. When started, PC-Kimmo looks for a take file named `pckimmo.tak' in the current directory to initialize itself with.
The general structure of the rules file is a list of keyword declarations. Figure 1 shows the conventional structure of the rules file. Note that the notation `{x | y}' means either `x' or `y' (but not both).
Figure 1  Structure of the rules file

COMMENT <character>
ALPHABET <symbol list>
NULL <character>
ANY <character>
BOUNDARY <character>
SUBSET <subset name> <symbol list>
   .
   .      (more subsets)
   .
RULE <rule name> <number of states> <number of columns>
     <lexical symbol list>
     <surface symbol list>
     <state number>{: | .} <state number list>
        .
        . (more states)
        .
   .
   .      (more rules)
   .
END
The following specifications apply to the rules file.
Whitespace consists of the space character, ^I (ASCII 9, tab), ^J (ASCII 10, line feed), ^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII 13, carriage return).
Comments may appear anywhere in the file; everything from the comment character to the end of the line is ignored (see the COMMENT declaration.)
The set of valid keywords is COMMENT, ALPHABET, NULL, ANY, BOUNDARY, SUBSET, RULE, and END.
The following declarations may each appear only once: ALPHABET, NULL, ANY, BOUNDARY.
The following declarations may appear more than once: COMMENT, SUBSET, and RULE.
The COMMENT declaration sets the comment character used in the rules file, lexicon files, and grammar file. The COMMENT declaration can only be used in the rules file, not in the lexicon or grammar file. The COMMENT declaration is optional. If it is not used, the comment character is set to ; (semicolon) as a default.
The COMMENT declaration can be used anywhere in the rules file and can be used more than once. That is, different parts of the rules file can use different comment characters. The COMMENT declaration can (and in practice usually does) occur as the first keyword in the rules file, followed by either one or more COMMENT declarations or the ALPHABET declaration.
If the COMMENT declaration is used to declare the character that is already in use as the comment character, an error will result. For instance, if semicolon is the current comment character, the declaration COMMENT ; will result in an error.
The ALPHABET declaration must either occur first in the file or follow one or more COMMENT declarations only. The other declarations can appear in any order. The COMMENT, NULL, ANY, BOUNDARY, and SUBSET declarations can even be interspersed among the rules. However, these declarations must appear before any rule that uses them or an error will result.
The ALPHABET declaration defines the set of symbols used in either lexical or surface representations. The keyword ALPHABET is followed by a <symbol list> of all alphabetic symbols. Each symbol must be separated from the others by at least one space. The list can span multiple lines, and ends with the next valid keyword. All alphanumeric characters (such as a, B, and 2), symbols (such as $ and +), and punctuation characters (such as . and ?) are available as alphabet members. The characters in the IBM extended character set (above ASCII 127) are also available. Control characters (below ASCII 32) can also be used, with the exception of whitespace characters (see above), ^Z (end of file), and ^@ (null). The alphabet can contain a maximum of 255 symbols. An alphabetic symbol can also be a multigraph, that is, a sequence of two or more characters. The individual characters composing a multigraph do not necessarily have to also be declared as alphabetic characters. For example, an alphabet could include the characters s and z and the multigraph sz%, but not include % as an alphabetic character. Note that a multigraph cannot also be interpreted as a sequence of the individual characters that comprise it.
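The treatment of multigraphs as single units can be pictured with a small Python sketch of longest-match tokenization. It is illustrative only (not PC-Kimmo's actual code), and the sample alphabet is the one from the example above.

# Tokenize an input form into alphabet symbols, longest match first, so that
# "sz%" is always taken as one symbol and never as "s" + "z" + "%".

ALPHABET = {"s", "z", "sz%", "a", "e", "i", "o", "u"}   # "%" alone is not a symbol

def tokenize(form):
    symbols = sorted(ALPHABET, key=len, reverse=True)   # try longer symbols first
    tokens, pos = [], 0
    while pos < len(form):
        for sym in symbols:
            if form.startswith(sym, pos):
                tokens.append(sym)
                pos += len(sym)
                break
        else:
            raise ValueError(f"character {form[pos]!r} is not in the alphabet")
    return tokens

print(tokenize("sasz%e"))   # ['s', 'a', 'sz%', 'e']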
NULL is followed by a single <character> that represents a null (empty, zero) element. The NULL symbol is considered to be an alphabetic character, but cannot also be listed in the ALPHABET declaration. The NULL symbol declared in the rules file is also used in the lexicon file to represent a null lexical entry.
ANY is followed by a single "wildcard" <character> that represents a match of any character in the alphabet. The ANY symbol is not considered to be an alphabetic character, though it is used in the column headers of state tables. It cannot be listed in the ALPHABET declaration. It is not used in the lexicon file.
BOUNDARY is followed by a single <character> that represents an initial or final word boundary. The BOUNDARY symbol is considered to be an alphabetic character, but cannot also be listed in the ALPHABET declaration. When used in the column header of a state table, it can only appear as the pair #:# (where, for instance, # has been declared as the BOUNDARY symbol). The BOUNDARY symbol is also used in the lexicon file in the continuation class field of a lexical entry to indicate the end of a word (that is, no continuation class).
The SUBSET declaration defines a set of characters that are referred to in the column headers of rules. The keyword SUBSET is followed by the <subset name> and <symbol list>. <subset name> is a single word (one or more characters) that names the list of characters that follows it. The subset name must be unique (that is, if it is a single character it cannot also be in the alphabet or be any other declared symbol). It can be composed of any characters (except space); that is, it is not limited to the characters declared in the ALPHABET section. It must not be identical to any keyword used in the rules file. The subset name is used in rules to represent all members of the subset of the alphabet that it defines. Note that SUBSET declarations can be interspersed among the rules. This allows subsets to be placed near the rule that uses them if such a style is desired. However, a subset must be declared before a rule that uses it.
The members of a subset must be characters declared in the ALPHABET, with the exception of the NULL symbol, which can appear in a subset list but is not included in the ALPHABET declaration. Neither the ANY symbol nor the BOUNDARY symbol can appear in a subset symbol list.
RULE signals that a state table immediately follows. Note that two-level rules must be expressed as a state table rather than in the form discussed in section 2 (The Two-level Formalism) above.
The <rule name>, <number of states>, and <number of columns> appear on the same line as the RULE keyword.
In addition to alphabetic symbols, the <lexical symbol list> may contain the NULL symbol, the ANY symbol, or the BOUNDARY symbol (in which case the corresponding surface character must also be the BOUNDARY symbol). The list can span multiple lines, but the number of elements in the list must be equal to the number of columns defined for the rule.
In addition to alphabetic symbols, the <surface symbol list> may contain the NULL symbol, the ANY symbol, or the BOUNDARY symbol (in which case the corresponding lexical character must also be the BOUNDARY symbol). The list can span multiple lines, but the number of characters in the list must be equal to the number of columns defined for the rule.
Each <state number> that begins a row of the table is followed by a colon (:) if the state is a final state and a period (.) if it is a nonfinal state. The colon or period must follow the <state number> with no intervening space.
The keyword END follows all other declarations and indicates the end of the rules file. Any material in the file thereafter is ignored by PC-Kimmo. The END keyword is optional; the physical end of the file also terminates the rules file.
Figure 2 shows a sample rules file.
Figure 2  A sample rules file

ALPHABET b c d f g h j k l m n p q r s t v w x y z +   ; + is morpheme boundary
         a e i o u
NULL 0
ANY @
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets

RULE "Consonant defaults" 1 23
     b c d f g h j k l m n p q r s t v w x y z + @
     b c d f g h j k l m n p q r s t v w x y z 0 @
  1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

RULE "Vowel defaults" 1 6
     a e i o u @
     a e i o u @
  1: 1 1 1 1 1 1

RULE "Voicing s:z <=> V___V" 4 4
     V s s @
     V z @ @
  1: 2 0 1 1
  2: 2 4 3 1
  3: 0 0 1 1
  4. 2 0 0 0
; more rules

END
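To show how such a state table is interpreted, here is a simplified Python sketch that checks an aligned lexical:surface pair sequence against the Voicing table of Figure 2. It is illustrative only: the real program runs all of its tables in parallel over the same form, supplies NULL and BOUNDARY symbols itself, and has its own column-matching logic; here column matching is approximated by letting the most specific matching header win, and only this one table is consulted.

# Interpreting the "Voicing s:z <=> V___V" state table from Figure 2.

SUBSETS = {"V": set("aeiou"), "C": set("bcdfghjklmnpqrstvwxyz")}
ANY = "@"

COLUMNS = [("V", "V"), ("s", "z"), ("s", "@"), ("@", "@")]
TABLE = {                       # state -> next state for each column (0 = fail)
    1: [2, 0, 1, 1],
    2: [2, 4, 3, 1],
    3: [0, 0, 1, 1],
    4: [2, 0, 0, 0],
}
FINAL = {1, 2, 3}               # state 4 is nonfinal ("4." in the table)

def side_matches(header, char):
    if header == ANY or header == char:
        return True
    return char in SUBSETS.get(header, set())

def specificity(header):
    return 2 if header not in SUBSETS and header != ANY else (1 if header in SUBSETS else 0)

def column_for(pair):
    best, best_score = None, -1
    for col, (lex_h, surf_h) in enumerate(COLUMNS):
        if side_matches(lex_h, pair[0]) and side_matches(surf_h, pair[1]):
            score = specificity(lex_h) + specificity(surf_h)
            if score > best_score:
                best, best_score = col, score
    return best

def accepts(lexical, surface):
    state = 1
    for pair in zip(lexical, surface):
        col = column_for(pair)
        if col is None:
            return False
        state = TABLE[state][col]
        if state == 0:
            return False
    return state in FINAL

print(accepts("kasa", "kaza"))   # True:  s is voiced between vowels
print(accepts("kasa", "kasa"))   # False: the <=> rule makes voicing obligatory here
print(accepts("kas",  "kaz"))    # False: s:z is not allowed without a following vowel

A pair sequence is accepted only if it ends in a final state with no transition to 0 along the way; this is how a single table both forces s:z between vowels and forbids it elsewhere.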
A lexicon consists of one main lexicon file plus one or more files of lexical entries. The general structure of the main lexicon file is a list of keyword declarations. The set of valid keywords is ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END. Figure 3 shows the conventional structure of the main lexicon file.
Figure 3  Structure of the main lexicon file

ALTERNATION <alternation name> <sublexicon name list>
   .
   .      (more ALTERNATIONs)
   .
FEATURES <feature abbreviation list>
FIELDCODE <lexical item code> U
FIELDCODE <sublexicon code> L
FIELDCODE <alternation code> A
FIELDCODE <features code> F
FIELDCODE <gloss code> G
INCLUDE <filespec>
   .
   .      (more INCLUDEd files)
   .
END
The following specifications apply to the main lexicon file.
Whitespace consists of the space character, ^I (ASCII 9, tab), ^J (ASCII 10, line feed), ^K (ASCII 11, vertical tab), ^L (ASCII 12, form feed), and ^M (ASCII 13, carriage return).
The set of valid keywords is ALTERNATION, FEATURES, FIELDCODE, INCLUDE, and END.
The INCLUDE declarations should appear last, but the ALTERNATION, FEATURES, and FIELDCODE declarations can appear in any order.
The ALTERNATION declaration defines a set of sublexicon names that serve as the continuation class of a lexical item. The ALTERNATION keyword is followed by an <alternation name> and a <sublexicon name list>. ALTERNATION declarations are optional (but nearly always used in practice) and can occur as many times as needed.
An alternation name is not limited to the ALPHABET characters declared in the rules file. An alternation name can be any word other than a keyword used in the lexicon file. The program does not check to see if an alternation name is actually used in the lexicon file.
The FEATURES declaration consists of the FEATURES keyword followed by a <feature abbreviation list>. A <feature abbreviation list> is a list of words, each of which is expanded into feature structures by the word grammar.
The FIELDCODE declaration is used to define the field code that will be used to mark each type of field in a lexical entry. The FIELDCODE keyword is followed by a <code> and one of five possible internal codes: U, L, A, F, or G. There must be five FIELDCODE declarations, one for each of these internal codes, where U indicates the lexical item field, L indicates the sublexicon field, A indicates the alternation field, F indicates the features field, and G indicates the gloss field.
The INCLUDE keyword is followed by a <filespec> that names a file containing lexical entries to be loaded. An INCLUDEd file cannot contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only lexical entries and comment lines.
The keyword END follows all other declarations and indicates the end of the main lexicon file. Any material in the file thereafter is ignored by PC-Kimmo. The END keyword is optional; the physical end of the file also terminates the main lexicon file.
Figure 4 shows a sample main lexicon file.
Figure 4  A sample main lexicon file

ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX
FEATURES sg pl reg irreg
FIELDCODE lf  U    ;lexical item
FIELDCODE lx  L    ;sublexicon
FIELDCODE alt A    ;alternation
FIELDCODE fea F    ;features
FIELDCODE gl  G    ;gloss
INCLUDE affix.lex     ;file of affixes
INCLUDE noun.lex      ;file of nouns
INCLUDE verb.lex      ;file of verbs
INCLUDE adjectiv.lex  ;file of adjectives
INCLUDE adverb.lex    ;file of adverbs
END
Figure 5 shows the structure of a lexical entry. Lexical entries are encoded in "field-oriented standard format." Standard format is an information interchange convention developed by SIL International. It tags the kinds of information in ASCII text files by means of markers which begin with backslash. Field-oriented standard format (FOSF) is a refinement of standard format geared toward representing data which has a database-like record and field structure.
Figure 5  Structure of a lexical entry

\<lexical item code> <lexical item>
\<sublexicon code> <sublexicon name>
\<alternation code> {<alternation name> | <BOUNDARY symbol>}
\<features code> <features list>
\<gloss code> <gloss string>
The following points provide an informal description of the syntax of FOSF files.
The following specifications apply to how FOSF is implemented in PC-Kimmo.
Only the field types declared by a FIELDCODE declaration in the main lexicon file are used. All other fields are considered to be extraneous and are ignored.
A file of lexical entries is loaded by using an INCLUDE declaration in the main lexicon file (see above). An INCLUDEd file of lexical entries cannot contain any declarations (such as a FIELDCODE or an INCLUDE declaration), only lexical entries and comment lines.
The following specifications apply to lexical entries.
Warnings issued while the lexicon is loaded can be suppressed by giving the command set warnings off (see section 3.2.17.19 set warnings) before loading the lexicon.
The lexical item field is marked by the field code assigned to the internal code U by a FIELDCODE declaration in the main lexicon file.
The sublexicon field is marked by the field code assigned to the internal code L by a FIELDCODE declaration in the main lexicon file.
The alternation field is marked by the field code assigned to the internal code A and contains either an <alternation name> defined by an ALTERNATION declaration in the main lexicon file or a <BOUNDARY symbol>. The <BOUNDARY symbol> is declared in the rules file and indicates the end of all possible continuations in the lexicon.
The features field is marked by the field code assigned to the internal code F by a FIELDCODE declaration in the main lexicon file.
Feature abbreviations cannot contain the characters (){}[]<>=:$! (these are used for special purposes in the grammar file). The character \ should not be used as the first character of an abbreviation because that is how fields are marked in the lexicon file. Upper and lower case letters used in template names are considered different. For example, PLURAL is not the same as Plural or plural.
Feature abbreviations are expanded into full feature structures by the word grammar (see section 6. The Grammar File).
The gloss field is marked by the field code assigned to the internal code G by a FIELDCODE declaration in the main lexicon file.
Figure 6 shows a sample lexical entry.
Figure 6 A sample lexical entry

\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
The following specifications apply generally to the word grammar file:
The comment character defined by the set comment command (see section 3.2.17.4 set comment) is operative in the grammar file. The default comment character is the semicolon (;). Comments may be placed anywhere in the grammar file. Everything following a comment character to the end of the line is ignored.
Rule starts a context-free phrase structure rule with its set of feature constraints. These rules define how words join together to form phrases, clauses, or sentences. The lexicon and grammar are tied together by using the lexical categories as the terminal symbols of the phrase structure rules and by using the other lexical features in the feature constraints.
Let starts a feature template definition. Feature templates are used as macros (abbreviations) in the lexicon. They may also be used to assign default feature structures to the categories.
Parameter starts a program parameter definition. These parameters control various aspects of the program.
Define starts a lexical rule definition.
As noted in Shieber (1985), something more powerful than just
abbreviations for common feature elements is sometimes needed to
represent systematic relationships among the elements of a lexicon.
This need is met by lexical rules, which express transformations rather
than mere abbreviations.
Lexical rules are not yet implemented properly. They may or may not be
useful for word grammars used by PC-Kimmo.
Lexicon starts a lexicon section. This is only for compatibility with the original PATR-II. The section name is skipped over properly, but nothing is done with it.
Word starts an entry in the lexicon. This is only for compatibility with the original PATR-II. The entry is skipped over properly, but nothing is done with it.
End effectively terminates the file. Anything following this keyword is ignored.
Keywords in the grammar file are not case sensitive: RULE is the same as rule, and both are the same as Rule.
A PC-Kimmo word grammar rule has these parts, in the order listed: the keyword Rule; an optional rule identifier enclosed in braces ({}); a nonterminal symbol, which is the left hand side of the rule; an arrow (->) or equal sign (=); zero or more terminal or nonterminal symbols, which form the right hand side of the rule; an optional colon (:); zero or more feature constraints; and an optional terminating period (.).
The optional rule identifier consists of one or more words enclosed in braces. Its current utility is only as a special form of comment describing the intent of the rule. (Eventually it may be used as a tag for interactively adding and removing rules.) The only limits on the rule identifier are that it must not contain the comment character and that it must appear entirely on one line in the grammar file.
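For instance, a complete rule with an identifier and two feature constraints might look like the following sketch (the symbols and feature names are invented for illustration):

Rule {Noun phrase} NP -> DET N
<DET number> = <N number>
<NP number> = <N number>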
The terminal and nonterminal symbols in the rule have the following characteristics:
Upper and lower case letters in symbols are considered different: NOUN is not the same as Noun, and neither is the same as noun.
Rule X -> X_1 CJ X_2
<X cat> = <X_1 cat>
<X cat> = <X_2 cat>
<X arg1> = <X_1 arg1>
<X arg1> = <X_2 arg1>

The symbol X can be useful for capturing generalities. Care must be taken, since it can be replaced by anything.
An index number may be attached to a symbol with an underscore (_). This is illustrated in the rule for X above.
The characters (){}[]<>=:/ cannot be used in terminal or nonterminal symbols since they are used for special purposes in the grammar file. The character _ can be used only for attaching an index number to a symbol.
The symbols on the right hand side of a phrase structure rule may be marked or grouped in various ways:
Rule S -> NP {TVP / IV}
Rule S -> NP TVP / IV
A rule can be followed by zero or more feature constraints that refer to symbols used in the rule. A feature constraint has these parts, in the order listed: a feature path enclosed in angle brackets (<>), whose first element is one of the symbols used in the rule; an equal sign (=); and either another feature path or an atomic feature value.
A feature constraint that refers only to symbols on the right hand side of the rule constrains their co-occurrence. In the following rule and constraint, the values of the agr features for the NP and VP nodes of the parse tree must unify:
Rule S -> NP VP <NP agr> = <VP agr>
If a feature constraint refers to a symbol on the right hand side of the rule, and has an atomic value on its right hand side, then the designated feature must not have a different value. In the following rule and constraint, the head case feature for the NP node of the parse tree must either be originally undefined or equal to NOM:
Rule S -> NP VP <NP head case> = NOM
(After unification succeeds, the head case feature for the NP node of the parse tree will be equal to NOM.)
A feature constraint that refers to the symbol on the left hand side of the rule passes information up the parse tree. In the following rule and constraint, the value of the tense feature is passed from the VP node up to the S node:
Rule S -> NP VP <S tense> = <VP tense>
A PC-Kimmo grammar feature template has these parts, in the order listed: the keyword Let; the template name; the keyword be; the feature structure (or disjunction of feature structures) that the name abbreviates; and an optional terminating period (.).
If the template name is a terminal category (a terminal symbol in one of the phrase structure rules), the template defines the default features for that category. Otherwise the template name serves as an abbreviation for the associated feature structure.
The characters (){}[]<>=: cannot be used in template names since they are used for special purposes in the grammar file. The characters /_ can be freely used in template names. The character \ should not be used as the first character of a template name because that is how fields are marked in the lexicon file.
The abbreviations defined by templates are usually used in the feature field of entries in the lexicon file. For example, the lexical entry for the irregular plural form feet may have the abbreviation pl in its features field. The grammar file would define this abbreviation with a template like this:
Let pl be [number: PL]
The path notation may also be used:
Let pl be <number> = PL
More complicated feature structures may be defined in templates. For example,
Let 3sg be [tense: PRES agr: 3SG finite: + vform: S]
which is equivalent to:
Let 3sg be <tense> = PRES <agr> = 3SG <finite> = + <vform> = S
In the following example, the abbreviation irreg is defined using another abbreviation:
Let irreg be <reg> = - pl
The abbreviation pl must be defined previously in the grammar file or an error will result. A subsequent template could also use the abbreviation irreg in its definition. In this way, an inheritance hierarchy of features may be constructed.
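For example (a made-up illustration; the feature name suppletive is invented), a further template could build on irreg by adding another constraint, mirroring the pattern of the irreg definition above:

Let suppl be <suppletive> = + irreg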
Feature templates permit disjunctive definitions. For example, the lexical entry for the word deer may specify the feature abbreviation sg/pl. The grammar file would define this as a disjunction of feature structures reflecting the fact that the word can be either singular or plural:
Let sg/pl be {[number:SG] [number:PL]}
This has the effect of creating two entries for deer, one with singular number and another with plural. Note that there is no limit to the number of disjunct structures listed between the braces. Also, there is no slash (/) between the elements of the disjunction as there is between the elements of a disjunction in the rules.
A shorter version of the above template using the path notation looks
like this:
Let sg/pl be <number> = {SG PL}
Abbreviations can also be used in disjunctions, provided that they have previously been defined:
Let sg be <number> = SG Let pl be <number> = PL Let sg/pl be {[sg] [pl]}
Note the square brackets around the abbreviations sg and pl; without square brackets they would be interpreted as simple values instead.
Feature templates can assign default atomic feature values, indicated by prefixing an exclamation point (!). A default value can be overridden by an explicit feature assignment. This template says that all members of category N have singular number as a default value:
Let N be <number> = !SG
The effect of this template is to make all nouns singular unless they are explicitly marked as plural. For example, regular nouns such as book do not need any feature in their lexical entries to signal that they are singular; but an irregular noun such as feet would have a feature abbreviation such as pl in its lexical entry. This would be defined in the grammar as [number: PL], and would override the default value for the feature number specified by the template above. If the N template above used SG instead of !SG, then the word feet would fail to parse, since its number feature would have an internal conflict between SG and PL.
A PC-Kimmo grammar parameter setting has these parts, in the order listed: the keyword Parameter; an optional colon (:); the parameter name; the keyword is; the parameter value; and an optional terminating period (.).
PC-Kimmo recognizes the following grammar parameters:
Start symbol
Parameter Start symbol is S
declares that the parse goal of the grammar is the nonterminal category S. The default start symbol is the left hand symbol of the first phrase structure rule in the grammar file.
Restrictor
Parameter Restrictor is <cat> <head form>
declares that the cat and head form features should be used to screen rules before adding them to the parse chart. The default is not to use any features for such filtering. This filtering, named restriction in Shieber (1985), is performed in addition to the normal top-down filtering based on categories alone. Note that restriction is not yet implemented; it remains an open question whether it should replace the normal filtering rather than supplement it.
Attribute order
Parameter Attribute order is cat lex sense head first rest agreement
declares that the cat attribute should be the first one shown in any output from PC-Kimmo, and that the other attributes should be shown in the relative order listed, with the agreement attribute shown last among those listed but ahead of any attributes that are not listed. Attributes that are not listed are ordered according to their character code sort order. If the attribute order is not specified, then the category feature cat is shown first, with all other attributes sorted according to their character codes.
Category feature
Parameter Category feature is Categ
declares that Categ is the name of the category attribute. The default name for this attribute is cat.
Lexical feature
Parameter Lexical feature is Lex
declares that Lex is the name of the lexical attribute. The default name for this attribute is lex.
Gloss feature
Parameter Gloss feature is Gloss
declares that Gloss is the name of the gloss attribute. The default name for this attribute is gloss.
A PC-Kimmo grammar lexical rule has these parts, in the order listed: the keyword Define; the rule name; the keyword as; one or more mappings, described below; and an optional terminating period (.).
The rule definition consists of one or more mappings. Each mapping has three parts: an output feature path, an assignment operator, and the value assigned, either an input feature path or an atomic value. Every output path begins with the feature name out and every input path begins with the feature name in. The assignment operator is either an equal sign (=) or an equal sign followed by a "greater than" sign (=>).
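As an illustration of the notation only (the rule name and feature names are invented, and lexical rules are not yet functional), a lexical rule definition might look like this:

Define plural as
<out cat> = <in cat>
<out gloss> = <in gloss>
<out number> = PL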
As noted before, lexical rules are not yet implemented properly, and may not prove to be useful for PC-Kimmo word grammars in any case.
The format of the lexicon files changed significantly between version 1 and version 2 of PC-Kimmo. For this reason, an auxiliary program to convert lexicon files was written.
A version 1 PC-Kimmo lexicon file looks like this:
; SAMPLE.LEX  25-OCT-89
; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.
ALTERNATION Begin NOUN
ALTERNATION Noun End
LEXICON INITIAL
0 Begin "[ "
LEXICON NOUN
s'ati Noun "Noun1"
s'adi Noun "Noun2"
bab'at Noun "Noun3"
bab'ad Noun "Noun4"
LEXICON End
0 # " ]"
END
For PC-Kimmo version 2, the same lexicon must be split into two files. The first one would look like this:
; SAMPLE.LEX  25-OCT-89
; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.
ALTERNATION Begin NOUN
ALTERNATION Noun End
FIELDCODE lf U
FIELDCODE lx L
FIELDCODE alt A
FIELDCODE fea F
FIELDCODE gl G
INCLUDE sample2.sfm
END
Note that everything except the lexicon sections and entries has been copied verbatim into this new primary lexicon file. The FIELDCODE statements define how to interpret the other lexicon files containing the actual lexicon sections and entries. These files are indicated by INCLUDE statements, and look like this:
\lf 0
\lx INITIAL
\alt Begin
\fea
\gl [

\lf s'ati
\lx NOUN
\alt Noun
\fea
\gl Noun1

\lf s'adi
\lx NOUN
\alt Noun
\fea
\gl Noun2

\lf bab'at
\lx NOUN
\alt Noun
\fea
\gl Noun3

\lf bab'ad
\lx NOUN
\alt Noun
\fea
\gl Noun4

\lf 0
\lx End
\alt #
\fea
\gl ]
`convlex' was written to make the transition from version 1 to version 2 of PC-Kimmo as painless as possible. It reads a version 1 lexicon file, including any INCLUDEd files, and writes a version 2 set of lexicon files. For a trivial case like the example above, the interaction with the user might go something like this:
C:\>convlex
CONVLEX: convert lexicon from PC-KIMMO version 1 to version 2
Comment character: [;]
Input lexicon file: sample.lex
Output lexicon file: sample2.lex
Primary sfm lexicon file: sample2.sfm
For each INCLUDE statement in the version 1 lexicon file, `convlex' prompts for a replacement filename like this:
New sfm include file to replace noun.lex: noun2.sfm
The user interface is extremely crude, but since most users will run this program only once or twice, that should not be regarded as a problem.
The Microsoft Windows implementation uses the Microsoft C QuickWin library, and the Macintosh implementation uses the Metrowerks C SIOUX library.
This chapter is excerpted from Antworth 1991.
This made-up example is used for expository purposes. To make better phonological sense, the forms should have internal morpheme boundaries, for instance `te+mi' (otherwise there would be no basis for positing an underlying `e'). See the section below on the use of zero to see how morpheme boundaries are handled.
The unified dictionary is a new feature of AMPLE version 3.
This document was generated on 20 March 2003 using texi2html 1.56k.