\input texinfo   @c -*-texinfo-*-
@c %**start of header
@setfilename pckimmo.info
@settitle PC-Kimmo Reference Manual
@c %**end of header
@syncodeindex fn cp
@set TITLE PC-Kimmo Reference Manual
@set SUBTITLE a two-level processor for morphological analysis
@set VERSION version 2.1.0
@set DATE October 1997
@set AUTHOR by Evan Antworth and Stephen McConnel
@set COPYRIGHT Copyright @copyright{} 2000 SIL International
@include front.txi
@c ----------------------------------------------------------------------------
@node Top, Introduction, (dir), (dir)
@comment node-name, next, previous, up
@ifinfo
@ifclear txt
This is the reference manual for the PC-Kimmo program.
@end ifclear
@end ifinfo
@menu
* Introduction::
* Two-level formalism::
* Running PC-Kimmo::
* Rules file::
* Lexicon files::
* Grammar file::
* Convlex::
* Bibliography::
@end menu
@c ----------------------------------------------------------------------------
@node Introduction, Two-level formalism, Top, Top
@comment node-name, next, previous, up
@chapter Introduction to the PC-Kimmo program
This document describes PC-Kimmo, an implementation of the
two-level computational linguistic formalism for personal computers.
It is available for MS-DOS, Microsoft Windows, Macintosh, and
Unix.@footnote{The Microsoft Windows implementation uses the Microsoft
C QuickWin function, and the Macintosh implementation uses the Metrowerks C
SIOUX function.}
The authors would appreciate feedback directed to the following
addresses. For linguistic questions, contact:
@example
Gary Simons
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236 gary.simons@@sil.org
U.S.A.
@end example
@noindent
For programming questions, contact:
@example
Stephen McConnel (972)708-7361 (office)
Language Software Development (972)708-7561 (fax)
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236 steve@@acadcomp.sil.org
U.S.A. or Stephen_McConnel@@sil.org
@end example
An online user manual for PC-Kimmo is available on the World Wide Web
at the URL
@ifclear html
@code{http://www.sil.org/pckimmo/v2/doc/guide.html}.
@end ifclear
@ifset html
http://www.sil.org/pckimmo/v2/doc/guide.html.
@end ifset
@c ----------------------------------------------------------------------------
@node Two-level formalism, Running PC-Kimmo, Introduction, Top
@chapter The Two-level Formalism
Two-level phonology is a linguistic tool developed by computational
linguists. Its primary use is in systems for natural language
processing such as PC-Kimmo. This chapter describes the linguistic and
computational basis of two-level phonology.@footnote{This chapter is
excerpted from Antworth 1991.}
@menu
* Roots::
* Rule application::
* How it works::
* Zero::
@end menu
@c ----------------------------------------------------------------------------
@node Roots, Rule application, Two-level formalism, Two-level formalism
@section Computational and linguistic roots
As the fields of computer science and linguistics have grown up
together during the past several decades, they have each benefited from
cross-fertilization. Modern linguistics has especially been influenced
by the formal language theory that underlies computation. The most
famous application of formal language theory to linguistics was
Chomsky's (1957) transformational generative grammar. Chomsky's
strategy was to consider several types of formal languages to see if
they were capable of modeling natural language syntax. He started by
considering the simplest type of formal languages, called finite state
languages. As a general principle, computational linguists try to use
the least powerful computational devices possible. This is because the
less powerful devices are better understood, their behavior is
predictable, and they are computationally more efficient. Chomsky
(1957:18ff) demonstrated that natural language syntax could not be
effectively modeled as a finite state language; thus he rejected finite
state languages as a theory of syntax and proposed that syntax requires
the use of more powerful, non-finite state languages. However, there
is no reason to assume that the same should be true for natural
language phonology. A finite state model of phonology is especially
desirable from the computational point of view, since it makes possible
a computational implementation that is simple and efficient.
While various linguists proposed that generative phonological rules
could be implemented by finite state devices (see Johnson 1972, Kay
1983), the most successful model of finite state phonology was
developed by Kimmo Koskenniemi, a Finnish computer scientist. He
called his model two-level morphology (Koskenniemi 1983), though his
use of the term morphology should be understood to encompass both what
linguists would consider morphology proper (the decomposition of words
into morphemes) and phonology (at least in the sense of
morphophonemics). Our main interest in this article is the
phonological formalism used by the two-level model, hereafter called
two-level phonology. Two-level phonology traces its linguistic
heritage to ``classical'' generative phonology as codified in @cite{The
Sound Pattern of English} (Chomsky and Halle 1968). The basic insight
of two-level phonology is due to the phonologist C. Douglas Johnson
(1972) who showed that the SPE theory of phonology could be implemented
using finite state devices by replacing sequential rule application
with simultaneous rule application. At its core, then, two-level
phonology is a rule formalism, not a complete theory of phonology. The
following sections of this article describe the mechanism of two-level
rule application by contrasting it with rule application in classical
generative phonology. It should be noted that Chomsky and Halle's
theory of rule application became the focal point of much controversy
during the 1970s with the result that current theories of phonology
differ significantly from classical generative phonology. The
relevance of two-level phonology to current theory is an important
issue, but one that will not be fully addressed here. Rather, the
comparison of two-level phonology to classical generative phonology is
done mainly for expository purposes, recognizing that while classical
generative phonology has been superseded by subsequent theoretical
work, it constitutes a historically coherent view of phonology that
continues to influence current theory and practice.
One feature that two-level phonology shares with classical generative
phonology is linear representation. That is, phonological forms are
represented as linear strings of symbols. This is in contrast to the
nonlinear representations used in much current work in phonology,
namely autosegmental and metrical phonology (see Goldsmith 1990). On
the computational side, two-level phonology is consistent with natural
language processing systems that are designed to operate on linear
orthographic input.
@c ----------------------------------------------------------------------------
@node Rule application, How it works, Roots, Two-level formalism
@section Two-level rule application
We will begin by reviewing the formal properties of generative rules.
Stated succinctly, generative rules are sequentially ordered rewriting
rules. What does this mean?
First, rewriting rules are rules that change or transform one symbol
into another symbol. For example, a rewriting rule of the form
@w{@samp{a --> b}} interprets the relationship between the symbols
@samp{a} and @samp{b} as a dynamic change whereby the symbol @samp{a}
is rewritten or turned into the symbol @samp{b}. This means that after
this operation takes place, the symbol @samp{a} no longer ``exists,''
in the sense that it is no longer available to other rules. In
linguistic theory generative rules are known as process rules. Process
rules attempt to characterize the relationship between levels of
representation (such as the phonemic and phonetic levels) by specifying
how to transform representations from one level into representations on
the other level.
Second, generative phonological rules apply sequentially, that is, one
after another, rather than applying simultaneously. This means that
each rule creates as its output a new intermediate level of
representation. This intermediate level then serves as the input to
the next rule. As a consequence, the underlying form becomes
inaccessible to later rules.
Third, generative phonological rules are ordered; that is, the
description specifies the sequence in which the rules must apply.
Applying rules in any other order may result in incorrect output.
As an example of a set of generative rules, consider the following
rules:
@example
@group
(1) Vowel Raising
e --> i / ___C_0 i
@end group
@group
(2) Palatalization
t --> c / ___i
@end group
@end example
@noindent
Rule 1 (Vowel Raising) states that @samp{e} becomes (is rewritten as)
@samp{i} in the environment preceding @samp{Ci} (where @samp{C} stands
for the set of consonants and @samp{C_0} stands for zero or more
consonants). Rule 2 (Palatalization) states that @samp{t} becomes
@samp{c} preceding @samp{i}. A sample derivation of forms to which
these rules apply looks like this (where UR stands for Underlying
Representation, SR stands for Surface Representation):@footnote{This
made-up example is used for expository purposes. To make better
phonological sense, the forms should have internal morpheme boundaries,
for instance @samp{te+mi} (otherwise there would be no basis for
positing an underlying @samp{e}). See the section below on the use of
zero to see how morpheme boundaries are handled.}
@example
@group
UR: temi
(1) timi
(2) cimi
SR: cimi
@end group
@end example
@noindent
Notice that in addition to the underlying and surface levels, an
intermediate level has been created as the result of sequentially
applying rules 1 and 2. The application of rule 1 produces the
intermediate form @samp{timi}, which then serves as the input to rule 2.
Not only are these rules sequential, they are ordered, such that rule 1
must apply before rule 2. Rule 1 has a feeding relationship to rule 2;
that is, rule 1 increases the number of forms that can undergo rule 2
by creating more instances of @samp{i}. Consider what would happen if
they were applied in the reverse order. Given the input form
@samp{temi}, rule 2 would do nothing, since its environment is not
satisfied. Rule 1 would then apply to produce the incorrect surface
form @samp{timi}.
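The sequential, ordered application just described can be sketched in a few lines of Python. This is an illustration only, not part of PC-Kimmo; the consonant inventory is limited to the symbols of the made-up example.

```python
import re

# Illustrative consonant inventory for the made-up example.
CONSONANTS = "tmcp"

def raising(form):
    # Rule 1: e --> i / ___ C_0 i  (e before zero or more consonants plus i)
    return re.sub(rf"e(?=[{CONSONANTS}]*i)", "i", form)

def palatalization(form):
    # Rule 2: t --> c / ___ i  (t immediately before i)
    return re.sub(r"t(?=i)", "c", form)

def derive(underlying):
    # Sequential, ordered application: rule 1's output feeds rule 2.
    return palatalization(raising(underlying))

print(derive("temi"))                   # cimi (correct)
print(raising(palatalization("temi")))  # timi (reverse order is wrong)
```

Applying the rules in the reverse order leaves Palatalization with nothing to do, reproducing the ordering argument above.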
Two-level rules differ from generative rules in the following ways.
First, whereas generative rules apply in a sequential order, two-level
rules apply simultaneously, which is better described as applying in
parallel. Applying rules in parallel to an input form means that for
each segment in the form all of the rules must apply successfully, even
if only vacuously.
Second, whereas sequentially applied generative rules create
intermediate levels of derivation, simultaneously applied two-level
rules require only two levels of representation: the underlying or
lexical level and the surface level. There are no intermediate levels
of derivation. It is in this sense that the model is called two-level.
Third, whereas generative rules relate the underlying and surface
levels by rewriting underlying symbols as surface symbols, two-level
rules express the relationship between the underlying and surface
levels by positing direct, static correspondences between pairs of
underlying and surface symbols. For instance, instead of rewriting
underlying @samp{a} as surface @samp{b}, a two-level rule states that
an underlying @samp{a} corresponds to a surface @samp{b}. The
two-level rule does not change @samp{a} into @samp{b}, so @samp{a} is
available to other rules. In other words, after a two-level rule
applies, both the underlying and surface symbols still ``exist.''
Fourth, whereas generative rules have access only to the current
intermediate form at each stage of the derivation, two-level rules have
access to both underlying and surface environments. Generative rules
cannot ``look back'' at underlying environments or ``look ahead'' to
surface environments. In contrast, the environments of two-level rules
are stated as lexical-to-surface correspondences. This means that a
two-level rule can easily refer to an underlying @samp{a} that
corresponds to a surface @samp{b}, or to a surface @samp{b} that
corresponds to an underlying @samp{a}. In generative phonology, the
interaction between a pair of rules is controlled by requiring that
they apply in a certain sequential order. In two-level phonology, rule
interactions are controlled not by ordering the rules but by carefully
specifying their environments as strings of two-level correspondences.
Fifth, whereas generative, rewriting rules are unidirectional (that is,
they operate only in an underlying to surface direction), two-level
rules are bidirectional. Two-level rules can operate either in an
underlying to surface direction (generation mode) or in a surface to
underlying direction (recognition mode). Thus in generation mode
two-level rules accept an underlying form as input and return a surface
form, while in recognition mode they accept a surface form as input and
return an underlying form. The practical application of bidirectional
phonological rules is obvious: a computational implementation of
bidirectional rules is not limited to generation mode to produce words;
it can also be used in recognition mode to parse words.
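Bidirectionality can be illustrated with a minimal sketch (hypothetical Python, not PC-Kimmo's implementation): the same table of feasible pairs is read lexical-to-surface for generation and surface-to-lexical for recognition. Rules, which would prune these candidate sets, are omitted here.

```python
from itertools import product

# Feasible pairs of the running example as (lexical, surface) tuples;
# an illustration only, not PC-Kimmo's internal representation.
FEASIBLE = [("t", "t"), ("t", "c"), ("m", "m"), ("e", "e"), ("e", "i"), ("i", "i")]

def expand(form, side):
    # side 0: read pairs lexical->surface (generation mode);
    # side 1: read pairs surface->lexical (recognition mode).
    choices = [[p[1 - side] for p in FEASIBLE if p[side] == seg] for seg in form]
    return {"".join(c) for c in product(*choices)}

print(sorted(expand("temi", 0)))  # generation candidates for lexical 'temi'
print(sorted(expand("cimi", 1)))  # recognition candidates for surface 'cimi'
```

Because the pair table itself is direction-neutral, the same description serves both modes; only the side read as input changes.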
@c ----------------------------------------------------------------------------
@node How it works, Zero, Rule application, Two-level formalism
@section How a two-level description works
To understand how a two-level phonological description works, we will
use the example given above involving Raising and Palatalization. The
two-level model treats the relationship between the underlying form
@samp{temi} and the surface form @samp{cimi} as a direct,
symbol-to-symbol correspondence:
@example
@group
UR: t e m i
SR: c i m i
@end group
@end example
@noindent
Each pair of lexical and surface symbols is a correspondence pair. We
refer to a correspondence pair with the notation
@w{@samp{@var{lexical}:@var{surface}}}, for instance
@samp{e:i} and @samp{m:m}. There must be an exact one-to-one
correspondence between the symbols of the underlying form and the
symbols of the surface form. Deletion and insertion of symbols
(explained in detail in the next section) is handled by positing
correspondences with zero, a null segment. The two-level model uses a
notation for expressing two-level rules that is similar to the notation
linguists use for phonological rules. Corresponding to the generative
rule for Palatalization (rule 2 above), here is the two-level rule for
the @samp{t:c} correspondence:
@example
@group
(3) Palatalization
t:c <=> ___ @@:i
@end group
@end example
This rule is a statement about the distribution of the pair @samp{t:c}
on the left side of the arrow with respect to the context or
environment on the right side of the arrow. A two-level rule has three
parts: the correspondence, the operator, and the environment. The
correspondence part of rule 3 is the pair @samp{t:c}, which is the
correspondence that the rule sanctions. The operator part of rule 3 is
the double-headed arrow. It indicates the nature of the logical
relationship between the correspondence and the environment (thus it
means something very different from the rewriting arrow @samp{-->} of
generative phonology). The @samp{<=>} arrow is equivalent to the
biconditional operator of formal logic and means that the
correspondence occurs always and only in the stated context; that is,
@samp{t:c} is allowed if and only if it is found in the context
@w{@samp{___ @@:i}}. In short, rule 3 is an obligatory rule. The
environment part of rule 3 is everything to the right of the arrow.
The long underline indicates the gap where the pair @samp{t:c} occurs.
Notice that even the environment part of the rule is specified as
two-level correspondence pairs.
The environment part of rule 3 requires further explanation. Instead
of using a correspondence such as @samp{i:i}, it uses the
correspondence @samp{@@:i}. The @samp{@@} symbol is a special
``wildcard'' symbol that stands for any phonological segment included
in the description. In the context of rule 3, the correspondence
@samp{@@:i} stands for all the feasible pairs in the description whose
surface segment is @samp{i}, in this case @samp{e:i} and @samp{i:i}.
Thus by using the correspondence @samp{@@:i}, we allow Palatalization
to apply in the environment of either a lexical @samp{e} or lexical
@samp{i}. In other words, we are claiming that Palatalization is
sensitive to a surface (phonetic) environment rather than an underlying
(phonemic) environment. Thus rule 3 will apply to both underlying
forms @samp{timi} and @samp{temi} to produce a surface form with an
initial @samp{c}.
Corresponding to the generative rule for Raising (rule 1 above) is the
following two-level rule for the @samp{e:i} correspondence:
@example
@group
(4) Vowel Raising
e:i <=> ___ C:C* @@:i
@end group
@end example
@noindent
(The asterisk in @samp{C:C*} indicates zero or more instances of the
correspondence @samp{C:C}.) Similar to rule 3 above, rule 4 uses the
correspondence @samp{@@:i} in its environment. Thus rule 4 states that
the correspondence @samp{e:i} occurs preceding a surface @samp{i},
regardless of whether it is derived from a lexical @samp{e} or
@samp{i}. Why is this necessary? Consider the case of an underlying
form such as @samp{pememi}. In order to derive the surface form
@samp{pimimi}, Raising must apply twice: once before a lexical @samp{i}
and again before a lexical @samp{e}, both of which correspond to a
surface @samp{i}. Thus rule 4 will apply to both instances of lexical
@samp{e}, capturing the regressive spreading of Raising through the
word.
Applied in parallel, rules 3 and 4 work in concert to produce the
correct output. For example,
@example
@group
UR: t e m i
| | | |
Rules 3 4 | |
| | | |
SR: c i m i
@end group
@end example
@noindent
Conceptually, a two-level phonological description of a data set such
as this can be understood as follows. First, the two-level description
declares an alphabet of all the phonological segments used in the data
in both underlying and surface forms, in the case of our example,
@samp{t}, @samp{m}, @samp{c}, @samp{e}, and @samp{i}. Second, the
description declares a set of feasible pairs, which is the complete set of
all underlying-to-surface correspondences of segments that occur in the
data. The set of feasible pairs for these data is the union of the set
of default correspondences, whose underlying and surface segments are
identical (namely @samp{t:t}, @samp{m:m}, @samp{e:e}, and @samp{i:i})
and the set of special correspondences, whose underlying and surface
segments are different (namely @samp{t:c} and @samp{e:i}). Notice that
since the segment @samp{c} only occurs as a surface segment in the
feasible pairs, the description will disallow any underlying form that
contains a @samp{c}.
A minimal two-level description, then, consists of nothing more than
this declaration of the feasible pairs. Since it contains all possible
underlying-to-surface correspondences, such a description will produce
the correct output form, but because it does not constrain the
environments where the special correspondences can occur, it will also
allow many incorrect output forms. For example, given the underlying
form @samp{temi}, it will produce the surface forms @samp{temi},
@samp{timi}, @samp{cemi}, and @samp{cimi}, of which only the last is
correct.
Third, in order to restrict the output to only correct forms, we
include rules in the description that specify where the special
correspondences are allowed to occur. Thus the rules function as
constraints or filters, blocking incorrect forms while allowing correct
forms to pass through. For instance, rule 3 (Palatalization) states
that a lexical @samp{t} must be realized as a surface @samp{c} when it
precedes @samp{@@:i}; thus, given the underlying form @samp{temi} it
will block the potential surface output forms @samp{timi} (because the
surface sequence @samp{ti} is prohibited) and @samp{cemi} (because
surface @samp{c} is prohibited before anything except surface
@samp{i}). Rule 4 (Raising) states that a lexical @samp{e} must be
realized as a surface @samp{i} when it precedes the sequence @samp{C:C}
@samp{@@:i}; thus, given the underlying form @samp{temi} it will block
the potential surface output forms @samp{temi} and @samp{cemi} (because
the surface sequence @samp{emi} is prohibited). Therefore of the four
potential surface forms, three are filtered out; rules 3 and 4 leave
only the correct form @samp{cimi}.
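This generate-and-filter view can be sketched in Python. The sketch is an illustration under the assumptions of the running example; PC-Kimmo itself implements rules as finite state devices rather than interpreting them as below.

```python
from itertools import product

# Feasible pairs for the running example (illustrative, not a real PC-Kimmo
# description): each lexical segment maps to its possible surface segments.
FEASIBLE = {"p": {"p"}, "t": {"t", "c"}, "m": {"m"}, "e": {"e", "i"}, "i": {"i"}}
CONSONANTS = "tmcp"

def candidates(underlying):
    # With only the feasible pairs, every combination is allowed.
    choices = [sorted(FEASIBLE[seg]) for seg in underlying]
    return {"".join(c) for c in product(*choices)}

def rules_allow(ur, sr):
    # Rules 3 and 4 applied in parallel as constraints on each position.
    for k in range(len(ur)):
        # Rule 3 (Palatalization): t:c <=> ___ @:i  (biconditional)
        env3 = k + 1 < len(sr) and sr[k + 1] == "i"
        if ur[k] == "t" and (sr[k] == "c") != env3:
            return False
        # Rule 4 (Raising): e:i <=> ___ C:C* @:i
        j = k + 1
        while j < len(sr) and ur[j] == sr[j] and sr[j] in CONSONANTS:
            j += 1  # skip C:C correspondences
        env4 = j < len(sr) and sr[j] == "i"
        if ur[k] == "e" and (sr[k] == "i") != env4:
            return False
    return True

def generate(underlying):
    # The rules act as filters over the unconstrained candidate set.
    return {sr for sr in candidates(underlying) if rules_allow(underlying, sr)}

print(sorted(candidates("temi")))  # ['cemi', 'cimi', 'temi', 'timi']
print(generate("temi"))            # {'cimi'}
print(generate("pememi"))          # {'pimimi'}
```

Note that the same filters also yield @samp{pimimi} from @samp{pememi}, since rule 4's surface environment lets Raising feed itself across the word.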
Two-level phonology facilitates a rather different way of thinking
about phonological rules. We think of generative rules as processes
that change one segment into another. In contrast, two-level rules do
not perform operations on segments, rather they state static
constraints on correspondences between underlying and surface forms.
Generative phonology and two-level phonology also differ in how they
characterize relationships between rules. Rules in generative
phonology are described in terms of their relative order of application
and their effect on the input of other rules (the so-called feeding and
bleeding relations). Thus the generative rule 1 for Raising precedes
and feeds rule 2 for Palatalization. In contrast, rules in the
two-level model are categorized according to whether they apply in
lexical versus surface environments. So we say that the two-level
rules for Raising and Palatalization are sensitive to a surface rather
than underlying environment.
@c ----------------------------------------------------------------------------
@node Zero, , How it works, Two-level formalism
@section With zero you can do (almost) anything
Phonological processes that delete or insert segments pose a special
challenge to two-level phonology. Since an underlying form and its
surface form must correspond segment for segment, how can segments be
deleted from an underlying form or inserted into a surface form? The
answer lies in the use of the special null symbol @samp{0} (zero).
Thus the correspondence @samp{x:0} represents the deletion of @samp{x},
while @samp{0:x} represents the insertion of @samp{x}. (It should be
understood that these zeros are provided by the rule application mechanism
and exist only internally; that is, zeros are not included in input
forms nor are they printed in output forms.) As an example of
deletion, consider these forms from Tagalog (where @samp{+} represents
a morpheme boundary):
@example
@group
UR: m a n + b i l i
SR: m a m 0 0 i l i
@end group
@end example
@noindent
Using process terminology, these forms exemplify phonological
coalescence, whereby the sequence @samp{nb} becomes @samp{m}. Since in
the two-level model a sequence of two underlying segments cannot
correspond to a single surface segment, coalescence must be interpreted
as simultaneous assimilation and deletion. Thus we need two rules: an
assimilation rule for the correspondence @samp{n:m} and a deletion rule
for the correspondence @samp{b:0} (note that the morpheme boundary
@samp{+} is treated as a special symbol that is always deleted).
@example
@group
(5) Nasal Assimilation
n:m <=> ___ +:0 b:@@
@end group
@group
(6) Deletion
b:0 <=> @@:m +:0 ___
@end group
@end example
@noindent
Notice the interaction between the rules: Nasal Assimilation occurs in
a lexical environment, namely a lexical @samp{b} (which can correspond
to either a surface @samp{b} or @samp{0}), while Deletion occurs in a
surface environment, namely a surface @samp{m} (which could be the
realization of either a lexical @samp{n} or @samp{m}). In this way the
two rules interact with each other to produce the correct output.
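The interaction of rules 5 and 6 can be sketched as a check over the explicit zero-padded alignment shown above (illustrative Python; pair notation is lexical:surface).

```python
# Sketch checking a zero-padded two-level alignment against rules 5 and 6
# of the Tagalog example; not PC-Kimmo's actual rule interpreter.
def check_pair(ur, sr):
    assert len(ur) == len(sr)  # exact segment-for-segment correspondence
    n = len(ur)
    for k in range(n):
        u, s = ur[k], sr[k]
        # Rule 5 (Nasal Assimilation): n:m <=> ___ +:0 b:@
        env5 = (k + 2 < n and ur[k + 1] == "+" and sr[k + 1] == "0"
                and ur[k + 2] == "b")
        if u == "n" and (s == "m") != env5:
            return False
        # Rule 6 (Deletion): b:0 <=> @:m +:0 ___
        env6 = (k - 2 >= 0 and sr[k - 2] == "m"
                and ur[k - 1] == "+" and sr[k - 1] == "0")
        if u == "b" and (s == "0") != env6:
            return False
    return True

def surface(sr):
    # Zeros are internal only; they are stripped from printed output.
    return sr.replace("0", "")

print(check_pair("man+bili", "mam00ili"))  # True
print(check_pair("man+bili", "man0bili"))  # False (rules block it)
print(surface("mam00ili"))                 # mamili
```

Rule 5 looks ahead at a lexical environment (@samp{b:@@}) while rule 6 looks back at a surface one (@samp{@@:m}), so neither needs the other to have "already applied".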
Insertion correspondences, where the lexical segment is @samp{0},
enable one to write rules for processes such as stress insertion,
gemination, infixation, and reduplication. For example, Tagalog has a
verbalizing infix @samp{um} that attaches between the first consonant
and vowel of a stem; thus the infixed form of @samp{bili} is
@samp{bumili}. To account for this formation with two-level rules, we
represent the underlying form of the infix @samp{um} as the prefix
@samp{X+}, where @samp{X} is a special symbol that has no phonological
purpose other than standing for the infix. We then write a rule that
inserts the sequence @samp{um} in the presence of @samp{X+}, which is
deleted. Here is the two-level correspondence:
@example
@group
UR: X + b 0 0 i l i
SR: 0 0 b u m i l i
@end group
@end example
@noindent
and here is the two-level rule, which simultaneously deletes @samp{X}
and inserts @samp{um}:
@example
@group
(7) Infixation
X:0 <=> ___ +:0 C:C 0:u 0:m V:V
@end group
@end example
@noindent
These examples involving deletion and insertion show that the invention
of zero is just as important for phonology as it was for arithmetic.
Without zero, two-level phonology would be limited to the most trivial
phonological processes; with zero, the two-level model has the
expressive power to handle complex phonological or morphological
phenomena (though not necessarily with the degree of felicity that a
linguist might desire).
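The infixation alignment of rule 7 can be built mechanically, as in this hypothetical sketch (the @samp{X} placeholder and the zero padding follow the example above).

```python
# Hypothetical sketch of the infixation alignment: the abstract prefix X+
# is realized as zeros while 0:u 0:m insert the infix after the first
# consonant, mirroring rule 7.
def infix_um(underlying):
    # underlying like "X+bili": X is a placeholder with no phonological content.
    assert underlying.startswith("X+")
    stem = underlying[2:]
    ur = "X+" + stem[0] + "00" + stem[1:]  # lexical side, zero-padded
    sr = "00" + stem[0] + "um" + stem[1:]  # surface side, X and + deleted
    return ur, sr, sr.replace("0", "")     # zeros are internal only

print(infix_um("X+bili"))  # ('X+b00ili', '00bumili', 'bumili')
```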
@c ----------------------------------------------------------------------------
@node Running PC-Kimmo, Rules file, Two-level formalism, Top
@chapter Running PC-Kimmo
PC-Kimmo is an interactive program. It has a few command line options,
but it is controlled primarily by commands typed at the keyboard (or
loaded from a previously prepared file).
@menu
* Command line options::
* Interactive commands::
@end menu
@c ----------------------------------------------------------------------------
@node Command line options, Interactive commands, Running PC-Kimmo, Running PC-Kimmo
@section PC-Kimmo Command Line Options
The PC-Kimmo program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(@samp{-}). The available options are listed below in alphabetical
order. Those options which require an argument have the argument type
following the option letter.
@ftable @code
@item -g filename
loads the grammar from a PC-Kimmo grammar file.
@item -l filename
loads an analysis lexicon from a PC-Kimmo lexicon file.
@item -r filename
loads the two-level rules from a PC-Kimmo rules file.
@item -s filename
loads a synthesis lexicon from a PC-Kimmo lexicon file.
@item -t filename
opens a file containing one or more PC-Kimmo commands.
@ifset txt
See `Interactive Commands' below.
@end ifset
@ifclear txt
@xref{Interactive commands}.
@end ifclear
@end ftable
The following options exist only in beta-test versions of the program,
since they are used only for debugging.
@ftable @code
@item -/
increments the debugging level. The default is zero (no debugging output).
@item -z filename
opens a file for recording a memory allocation log.
@item -Z address,count
traps the program at the point where @code{address} is allocated or
freed for the @code{count}'th time.
@end ftable
@c ----------------------------------------------------------------------------
@node Interactive commands, , Command line options, Running PC-Kimmo
@section Interactive Commands
Each of the commands available in PC-Kimmo is described below. Each
command consists of one or more keywords followed by zero or more
arguments. Keywords may be abbreviated to the minimum length necessary
to prevent ambiguity.
@menu
* cd::
* clear::
* close::
* compare::
* directory::
* edit::
* exit::
* file::
* generate::
* help::
* list::
* load::
* log::
* quit::
* recognize::
* save::
* set::
* show::
* status::
* synthesize::
* system::
* take::
@end menu
@c ----------------------------------------------------------------------------
@node cd, clear, Interactive commands, Interactive commands
@subsection cd
@w{@code{cd} @var{directory}}
changes the current directory to the one specified. Spaces in the
directory pathname are not permitted.
For MS-DOS or Windows, you can give a full path starting with the disk
letter and a colon (for example, @code{a:}); a path starting with
@code{\} which indicates a directory at the top level of the current
disk; a path starting with @code{..} which indicates the directory
above the current one; and so on. Directories are separated by the
@code{\} character. (The forward slash @code{/} works just as well as
the backslash @code{\} for MS-DOS or Windows.)
For the Macintosh, you can give a full path starting with the name of
a hard disk, a path starting with @code{:} which means the current
folder, or one starting @code{::} which means the folder containing the
current one (and so on).
For Unix, you can give a full path starting with a @code{/} (for
example, @code{/usr/pckimmo}); a path starting with @code{..} which
indicates the directory above the current one; and so on. Directories
are separated by the @code{/} character.
@c ----------------------------------------------------------------------------
@node clear, close, cd, Interactive commands
@subsection clear
@code{clear}
erases all existing rules, lexicon, and grammar information, allowing
the user to prepare to load information for a new language. Strictly
speaking, it is not needed since the @w{@code{load rules}} command
erases any previously existing rules, the @w{@code{load lexicon}}
command erases any previously existing analysis lexicon, the
@w{@code{load synthesis-lexicon}} command erases any previously
existing synthesis lexicon, and the @w{@code{load grammar}} command
erases any previously existing grammar.
@code{cle} is the minimal abbreviation for @code{clear}.
@c ----------------------------------------------------------------------------
@node close, compare, clear, Interactive commands
@subsection close
@code{close}
closes the current log file opened by a previous @code{log} command.
@code{clo} is the minimal abbreviation for @code{close}.
@c ----------------------------------------------------------------------------
@node compare, directory, close, Interactive commands
@subsection compare
The @code{compare} commands all test the current language description
files by processing data against known (precomputed) results.
@w{@code{co}} is the minimal abbreviation for @code{compare}.
@w{@code{file compare}} is a synonym for @code{compare}.
@menu
* compare generate::
* compare pairs::
* compare recognize::
* compare synthesize::
@end menu
@c ----------------------------------------------------------------------------
@node compare generate, compare pairs, compare, compare
@subsubsection compare generate
@w{@code{compare generate} @var{filename}}
reads lexical and surface forms from the specified file. After reading
a lexical form, PC-Kimmo generates the corresponding surface form(s)
and compares the result to the surface form(s) read from the file. If
@code{VERBOSE} is @code{ON}, then each form from the file is echoed on
the screen with a message indicating whether or not the surface forms
generated by PC-Kimmo and read from the file are in agreement. If
@code{VERBOSE} is @code{OFF}, then only the disagreements in surface
form are displayed fully. Each result which agrees is indicated by a
single dot written to the screen.
The default filetype extension for @w{@code{compare generate}} is
@file{.gen}, and the default filename is @file{data.gen}.
@w{@code{co g}} is the minimal abbreviation for @w{@code{compare generate}}.
@w{@code{file compare generate}} is a synonym for
@w{@code{compare generate}}.
@c ----------------------------------------------------------------------------
@node compare pairs, compare recognize, compare generate, compare
@subsubsection compare pairs
@w{@code{compare pairs} @var{filename}}
reads pairs of surface and lexical forms from the specified file.
After reading a lexical form, PC-Kimmo produces any corresponding
surface form(s) and compares the result(s) to the surface form read
from the file. For each surface form, PC-Kimmo also produces any
corresponding lexical form(s) and compares the result to the lexical
form read from the file. If @code{VERBOSE} is @code{ON}, then each
form from the file is echoed on the screen with a message indicating
whether or not the forms produced by PC-Kimmo and read from the file
are in agreement. If @code{VERBOSE} is @code{OFF}, then each result
which agrees is indicated by a single dot written to the screen, and
only disagreements in lexical forms are displayed fully.
The default filetype extension for @code{compare pairs} is @file{.pai},
and the default filename is @file{data.pai}.
@w{@code{co p}} is the minimal abbreviation for @w{@code{compare pairs}}.
@w{@code{file compare pairs}} is a synonym for
@w{@code{compare pairs}}.
@c ----------------------------------------------------------------------------
@node compare recognize, compare synthesize, compare pairs, compare
@subsubsection compare recognize
@w{@code{compare recognize} @var{filename}}
reads surface and lexical forms from the specified file. After reading
a surface form, PC-Kimmo produces any corresponding lexical form(s) and
compares the result(s) to the lexical form(s) read from the file. If
@code{VERBOSE} is @code{ON}, then each form from the file is echoed on
the screen with a message indicating whether or not the lexical forms
produced by PC-Kimmo and read from the file are in agreement. If
@code{VERBOSE} is @code{OFF}, then each result which agrees is
indicated by a single dot written to the screen, and only disagreements
in lexical forms are displayed fully.
The default filetype extension for @code{compare recognize} is
@file{.rec}, and the default filename is @file{data.rec}.
@w{@code{co r}} is the minimal abbreviation for @w{@code{compare recognize}}.
@w{@code{file compare recognize}} is a synonym for
@w{@code{compare recognize}}.
@c ----------------------------------------------------------------------------
@node compare synthesize, , compare recognize, compare
@subsubsection compare synthesize
@w{@code{compare synthesize} @var{filename}}
reads morphological and surface forms from the specified file. After
reading a morphological form, PC-Kimmo produces any corresponding
surface form(s) and compares the result(s) to the surface form(s) read
from the file. If @code{VERBOSE} is @code{ON}, then each form from the file
is echoed on the screen with a message indicating whether or not the surface
forms produced by PC-Kimmo and read from the file are in agreement. If
@code{VERBOSE} is @code{OFF}, then each result which agrees is
indicated by a single dot written to the screen, and only disagreements
in surface forms are displayed fully.
The default filetype extension for @code{compare synthesize} is
@file{.syn}, and the default filename is @file{data.syn}.
@w{@code{co s}} is the minimal abbreviation for @w{@code{compare synthesize}}.
@w{@code{file compare synthesize}} is a synonym for
@w{@code{compare synthesize}}.
@c ----------------------------------------------------------------------------
@node directory, edit, compare, Interactive commands
@subsection directory
@code{directory}
lists the contents of the current directory. This command is available
only for the MS-DOS and Unix implementations. It does not exist for the
Microsoft Windows or Macintosh implementations.
@c ----------------------------------------------------------------------------
@node edit, exit, directory, Interactive commands
@subsection edit
@w{@code{edit} @var{filename}}
attempts to edit the specified file using the program indicated by the
environment variable @code{EDITOR}. If this environment variable is not
defined, then @code{edit} is used to edit the file on MS-DOS, and
@code{emacs} is used to edit the file on Unix. This command is not
available for the Microsoft Windows or Macintosh implementations.
@c ----------------------------------------------------------------------------
@node exit, file, edit, Interactive commands
@subsection exit
@code{exit}
stops PC-Kimmo, returning control to the operating system. This is the
same as @code{quit}.
@c ----------------------------------------------------------------------------
@node file, generate, exit, Interactive commands
@subsection file
The @code{file} commands process data from a file, optionally writing
the results to another file. Each of these commands is described
below.
@menu
* file compare::
* file generate::
* file recognize::
* file synthesize::
@end menu
@c ----------------------------------------------------------------------------
@node file compare, file generate, file, file
@subsubsection file compare
The @code{file compare} commands all test the current language description
files by processing data against known (precomputed) results.
@w{@code{f c}} is the minimal abbreviation for @w{@code{file compare}}.
@w{@code{file compare}} is a synonym for @code{compare}.
@ifset txt
See `compare generate', `compare pairs', `compare recognize', and
`compare synthesize' above.
@end ifset
@ifclear txt
@xref{compare generate}, @ref{compare pairs}, @ref{compare recognize},
and @ref{compare synthesize}.
@end ifclear
@menu
* compare generate:: is the same as file compare generate
* compare pairs:: is the same as file compare pairs
* compare recognize:: is the same as file compare recognize
* compare synthesize:: is the same as file compare synthesize
@end menu
@c ----------------------------------------------------------------------------
@node file generate, file recognize, file compare, file
@subsubsection file generate
@w{@code{file generate} @var{input-file [output-file]}}
reads lexical forms from the specified input file and writes the
corresponding computed surface forms either to the screen or to an
optionally specified output file.
This command behaves the same as @code{generate} except that input
comes from a file rather than the keyboard, and output may go to a file
rather than the screen.
@ifset txt
See `generate' below.
@end ifset
@ifclear txt
@xref{generate}.
@end ifclear
@w{@code{f g}} is the minimal abbreviation for @w{@code{file generate}}.
@c ----------------------------------------------------------------------------
@node file recognize, file synthesize, file generate, file
@subsubsection file recognize
@w{@code{file recognize} @var{input-file [output-file]}}
reads surface forms from the specified input file and writes the
corresponding computed morphological and lexical forms either to the
screen or to an optionally specified output file.
This command behaves the same as @code{recognize} except that input
comes from a file rather than the keyboard, and output may go to a file
rather than the screen.
@ifset txt
See `recognize' below.
@end ifset
@ifclear txt
@xref{recognize}.
@end ifclear
@w{@code{f r}} is the minimal abbreviation for @w{@code{file recognize}}.
@c ----------------------------------------------------------------------------
@node file synthesize, , file recognize, file
@subsubsection file synthesize
@w{@code{file synthesize} @var{input-file [output-file]}}
reads morphological forms from the specified input file and writes the
corresponding computed surface forms either to the screen or to an
optionally specified output file.
This command behaves the same as @code{synthesize} except that input
comes from a file rather than the keyboard, and output may go to a file
rather than the screen.
@ifset txt
See `synthesize' below.
@end ifset
@ifclear txt
@xref{synthesize}.
@end ifclear
@w{@code{f s}} is the minimal abbreviation for @w{@code{file synthesize}}.
@c ----------------------------------------------------------------------------
@node generate, help, file, Interactive commands
@subsection generate
@w{@code{generate} @var{[lexical-form]}}
attempts to produce a surface form from a lexical form provided by the
user.
If a lexical form is typed on the same line as the command, then that
lexical form is used to generate a surface form.
If the command is typed without a form, then PC-Kimmo prompts the user
for lexical forms with a special generator prompt, and processes each
form in turn.
This cycle of typing and generating is terminated by typing an empty
``form'' (that is, nothing but the @code{Enter} or @code{Return}
key).
The rules must be loaded before using this command. It does not
require either a lexicon or a grammar.
@code{g} is the minimal abbreviation for @code{generate}.
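For instance, with an English-like description loaded, an exchange
might look like the following (the lexical form, its stress mark
@samp{`}, its morpheme boundary @samp{+}, and the resulting surface
form are all illustrative, not output guaranteed by any particular
language description):
@example
@group
generate `spy+s
spies
@end group
@end example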
@c ----------------------------------------------------------------------------
@node help, list, generate, Interactive commands
@subsection help
@w{@code{help} @var{command}}
displays a description of the specified command. If @code{help} is typed
by itself, PC-Kimmo displays a list of commands with short descriptions of
each command.
@code{h} is the minimal abbreviation for @code{help}.
@c ----------------------------------------------------------------------------
@node list, load, help, Interactive commands
@subsection list
The @code{list} commands all display information about the currently
loaded data. Each of these commands is described below.
@code{li} is the minimal abbreviation for @code{list}.
@menu
* list lexicon::
* list pairs::
* list rules::
@end menu
@c ----------------------------------------------------------------------------
@node list lexicon, list pairs, list, list
@subsubsection list lexicon
@w{@code{list lexicon}}
displays the names of all the (sub)lexicons currently loaded. The
order of presentation is the order in which they are referenced in the
@code{ALTERNATIONS} declarations.
@w{@code{li l}} is the minimal abbreviation for @w{@code{list lexicon}}.
@c ----------------------------------------------------------------------------
@node list pairs, list rules, list lexicon, list
@subsubsection list pairs
@w{@code{list pairs}}
displays all the feasible pairs for the current set of active rules.
The feasible pairs are displayed as pairs of lines, with the lexical
characters shown above the corresponding surface characters.
@w{@code{li p}} is the minimal abbreviation for @w{@code{list pairs}}.
@c ----------------------------------------------------------------------------
@node list rules, , list pairs, list
@subsubsection list rules
@w{@code{list rules}}
displays the names of the current rules, preceded by the number of the
rule (used by the @w{@code{set rules}} command) and an indication of
whether the rule is @code{ON} or @code{OFF}.
@w{@code{li r}} is the minimal abbreviation for @w{@code{list rules}}.
@c ----------------------------------------------------------------------------
@node load, log, list, Interactive commands
@subsection load
The @code{load} commands all load information stored in specially
formatted files. Each of the @code{load} commands is described below.
@code{l} is the minimal abbreviation for @code{load}.
@menu
* load grammar::
* load lexicon::
* load rules::
* load synthesis-lexicon::
@end menu
@c ----------------------------------------------------------------------------
@node load grammar, load lexicon, load, load
@subsubsection load grammar
@w{@code{load grammar} @var{[filename]}}
erases any existing word grammar and reads a new word grammar from the
specified file.
The default filetype extension for @w{@code{load grammar}} is
@file{.grm}, and the default filename is @file{grammar.grm}.
A grammar file can also be loaded by using the @samp{-g} command line
option when starting PC-Kimmo.
@w{@code{l g}} is the minimal abbreviation for
@w{@code{load grammar}}.
@c ----------------------------------------------------------------------------
@node load lexicon, load rules, load grammar, load
@subsubsection load lexicon
@w{@code{load lexicon} @var{[filename]}}
erases any existing analysis lexicon information and reads a new
analysis lexicon from the specified file. A rules file must
be loaded before an analysis lexicon file can be loaded.
The default filetype extension for @w{@code{load lexicon}} is
@file{.lex}, and the default filename is @file{lexicon.lex}.
An analysis lexicon file can also be loaded by using the @samp{-l}
command line option when starting PC-Kimmo. This requires that a
@samp{-r} option also be used to load a rules file.
@w{@code{l l}} is the minimal abbreviation for
@w{@code{load lexicon}}.
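For example, a rules file and an analysis lexicon could both be loaded
when starting the program from the operating system shell (the
filenames here are hypothetical):
@example
pckimmo -r english.rul -l english.lex
@end example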
@c ----------------------------------------------------------------------------
@node load rules, load synthesis-lexicon, load lexicon, load
@subsubsection load rules
@w{@code{load rules} @var{[filename]}}
erases any existing rules and reads a new set of two-level rules from
the specified file.
The default filetype extension for @w{@code{load rules}} is
@file{.rul}, and the default filename is @file{rules.rul}.
A rules file can also be loaded by using the @samp{-r} command line
option when starting PC-Kimmo.
@w{@code{l r}} is the minimal abbreviation for
@w{@code{load rules}}.
@c ----------------------------------------------------------------------------
@node load synthesis-lexicon, , load rules, load
@subsubsection load synthesis-lexicon
@w{@code{load synthesis-lexicon} @var{[filename]}}
erases any existing synthesis lexicon and reads a new synthesis lexicon
from the specified file. A rules file must be loaded before a
synthesis lexicon file can be loaded.
The default filetype extension for @w{@code{load synthesis-lexicon}}
is @file{.lex}, and the default filename is @file{lexicon.lex}.
A synthesis lexicon file can also be loaded by using the @samp{-s}
command line option when starting PC-Kimmo. This requires that a
@samp{-r} option also be used to load a rules file.
@w{@code{l s}} is the minimal abbreviation for
@w{@code{load synthesis-lexicon}}.
@c ----------------------------------------------------------------------------
@node log, quit, load, Interactive commands
@subsection log
@w{@code{log} @var{[filename]}}
opens a log file. Each item processed by a @code{generate},
@code{recognize}, @code{synthesize}, @code{compare}, or @code{file}
command is recorded in the log file as well as being displayed on the
screen.
If a filename is given on the same line as the @code{log} command, then
that file is used for the log file. Any previously existing file with
the same name will be overwritten. If no filename is provided, then
the file @file{pckimmo.log} in the current directory is used for the log
file.
Use @code{close} to stop recording in a log file. If a @code{log}
command is given when a log file is already open, then the earlier log
file is closed before the new log file is opened.
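A typical sequence, with a hypothetical log filename and input word,
might be:
@example
@group
log session.log
recognize spies
close
@end group
@end example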
@c ----------------------------------------------------------------------------
@node quit, recognize, log, Interactive commands
@subsection quit
@code{quit}
stops PC-Kimmo, returning control to the operating system. This is the
same as @code{exit}.
@c ----------------------------------------------------------------------------
@node recognize, save, quit, Interactive commands
@subsection recognize
@w{@code{recognize} @var{[surface-form]}}
attempts to produce lexical and morphological forms from a surface
wordform provided by the user. If a wordform is typed on the same line
as a command, then that word is parsed. If the command is typed
without a form, then PC-Kimmo prompts the user for surface forms with a
special recognizer prompt, and processes each form in turn. This cycle
of typing and parsing is terminated by typing an empty ``word'' (that
is, nothing but the @code{Enter} or @code{Return} key).
Both the rules and the lexicon must be loaded before using this
command. A grammar may also be loaded and used to eliminate invalid
parses from the two-level processor results. If a grammar is used,
then parse trees and feature structures may be displayed as well as the
lexical and morphological forms.
@c ----------------------------------------------------------------------------
@node save, set, recognize, Interactive commands
@subsection save
@w{@code{save} @var{[file.tak]}}
writes the current settings to the designated file in the form of
PC-Kimmo commands. If the file is not specified, the settings are
written to @file{pckimmo.tak} in the current directory.
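Because the saved file contains ordinary PC-Kimmo commands, it can
later be replayed with the @code{take} command. A hypothetical excerpt
from such a file might look like:
@example
@group
set ambiguities 10
set tree full
set verbose off
@end group
@end example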
@c ----------------------------------------------------------------------------
@node set, show, save, Interactive commands
@subsection set
The @code{set} commands control program behavior by setting internal
program variables. Each of these commands (and variables) is described
below.
@menu
* set ambiguities::
* set ample-dictionary::
* set check-cycles::
* set comment::
* set failures::
* set features::
* set gloss::
* set marker category::
* set marker features::
* set marker gloss::
* set marker record::
* set marker word::
* set timing::
* set top-down-filter::
* set tree::
* set trim-empty-features::
* set unification::
* set verbose::
* set warnings::
* set write-ample-parses::
@end menu
@c ----------------------------------------------------------------------------
@node set ambiguities, set ample-dictionary, set, set
@subsubsection set ambiguities
@w{@code{set ambiguities} @var{number}}
limits the number of analyses printed to the given number. The default
value is 10. Note that this does not limit the number of analyses
produced, just the number printed.
@c ----------------------------------------------------------------------------
@node set ample-dictionary, set check-cycles, set ambiguities, set
@subsubsection set ample-dictionary
@w{@code{set ample-dictionary} @var{value}}
determines whether or not the AMPLE dictionary files are divided
according to morpheme type. @w{@code{set ample-dictionary split}} declares
that the AMPLE dictionary is divided into a
prefix dictionary file, an infix dictionary file, a suffix dictionary
file, and one or more root dictionary files. The existence of the
three affix dictionaries depends on settings in the AMPLE analysis
data file. If they exist, the @w{@code{load ample dictionary}} command
requires that they be given in this relative
order: prefix, infix, suffix, root(s).
@w{@code{set ample-dictionary unified}} declares that any of the AMPLE
dictionary files may contain any type of morpheme. This implies that
each dictionary entry may contain a field specifying the type of
morpheme (the default is @var{root}), and that the dictionary code
table contains a @code{\unified} field. One of the changes
listed under @code{\unified} must convert a backslash code to @code{T}.
The default is for the AMPLE dictionary to be
@emph{split}.@footnote{The unified dictionary is a new feature of AMPLE
version 3.}
@c ----------------------------------------------------------------------------
@node set check-cycles, set comment, set ample-dictionary, set
@subsubsection set check-cycles
@w{@code{set check-cycles} @var{value}}
enables or disables a check to prevent cycles in the parse chart.
@w{@code{set check-cycles on}} turns on this check, and
@w{@code{set check-cycles off}} turns it off. This check
slows down the parsing of a sentence, but it makes the parser less
vulnerable to hanging on perverse grammars. The default setting is
@code{on}.
@c ----------------------------------------------------------------------------
@node set comment, set failures, set check-cycles, set
@subsubsection set comment
@w{@code{set comment} @var{character}}
sets the comment character to the indicated value. If @var{character}
is missing (or equal to the current comment character), then comment
handling is disabled. The default comment character is @code{;}
(semicolon).
@c ----------------------------------------------------------------------------
@node set failures, set features, set comment, set
@subsubsection set failures
@w{@code{set failures} @var{value}}
enables or disables @emph{grammar failure mode}. @w{@code{set failures on}}
turns on grammar failure mode, and @w{@code{set failures off}} turns it
off. When grammar failure mode is on, the partial results of forms
that fail the grammar module are displayed. A form may fail the
grammar either by failing the feature constraints or by failing the
constituent structure rules. In the latter case, a partial tree (bush)
will be returned. The default setting is @code{off}.
Be careful with this option. Setting failures to @code{on} can cause
PC-Kimmo to go into an infinite loop for certain recursive grammars
and certain input sentences. @sc{we may try to do something to detect
this type of behavior, at least partially.}
@c ----------------------------------------------------------------------------
@node set features, set gloss, set failures, set
@subsubsection set features
@w{@code{set features} @var{value}}
determines how features will be displayed.
@w{@code{set features all}} enables the display of the features for all
nodes of the parse tree.
@w{@code{set features top}} enables the display of the feature
structure for only the top node of the parse tree. This is the default
setting.
@w{@code{set features flat}} causes features to be displayed in a flat,
linear string that uses less space on the screen.
@w{@code{set features full}} causes features to be displayed in an
indented form that makes the embedded structure of the feature set
clear. This is the default setting.
@w{@code{set features on}} turns on features display mode, allowing
features to be shown. This is the default setting.
@w{@code{set features off}} turns off features display mode, preventing
features from being shown.
@c ----------------------------------------------------------------------------
@node set gloss, set marker category, set features, set
@subsubsection set gloss
@w{@code{set gloss} @var{value}}
enables the display of glosses in the parse tree output if @var{value} is
@code{on}, and disables the display of glosses if @var{value} is
@code{off}. If any glosses exist in the lexicon file, then @code{gloss} is
automatically turned @code{on} when the lexicon is loaded. If no glosses
exist in the lexicon, then this flag is ignored.
@c ----------------------------------------------------------------------------
@node set marker category, set marker features, set gloss, set
@subsubsection set marker category
@w{@code{set marker category} @var{marker}}
establishes the marker for the field containing the category (part of
speech) feature. The default is @code{\c}.
@c ----------------------------------------------------------------------------
@node set marker features, set marker gloss, set marker category, set
@subsubsection set marker features
@w{@code{set marker features} @var{marker}}
establishes the marker for the field containing miscellaneous features.
(This field is not needed for many words.) The default is @code{\f}.
@c ----------------------------------------------------------------------------
@node set marker gloss, set marker record, set marker features, set
@subsubsection set marker gloss
@w{@code{set marker gloss} @var{marker}}
establishes the marker for the field containing the word gloss. The
default is @code{\g}.
@c ----------------------------------------------------------------------------
@node set marker record, set marker word, set marker gloss, set
@subsubsection set marker record
@w{@code{set marker record} @var{marker}}
establishes the field marker that begins a new record in the lexicon
file. This may or may not be the same as the @code{word} marker. The
default is @code{\w}.
@c ----------------------------------------------------------------------------
@node set marker word, set timing, set marker record, set
@subsubsection set marker word
@w{@code{set marker word} @var{marker}}
establishes the marker for the word field. The default is @code{\w}.
@c ----------------------------------------------------------------------------
@node set timing, set top-down-filter, set marker word, set
@subsubsection set timing
@w{@code{set timing} @var{value}}
enables timing mode if @var{value} is @code{on}, and disables timing
mode if @var{value} is @code{off}. If timing mode is @code{on}, then
the elapsed time required to process a command is displayed when the
command finishes. If timing mode is @code{off}, then the elapsed time
is not shown. The default is @code{off}. (This option is useful only
to satisfy idle curiosity.)
@c ----------------------------------------------------------------------------
@node set top-down-filter, set tree, set timing, set
@subsubsection set top-down-filter
@w{@code{set top-down-filter} @var{value}}
enables or disables top-down filtering based on the categories.
@w{@code{set top-down-filter on}} turns on this filtering,
and @w{@code{set top-down-filter off}} turns it off. The
top-down filter speeds up the parsing of a sentence, but might cause
the parser to miss some valid parses. The default setting is
@code{on}.
This should not be required in the final version of PC-Kimmo.
@c ----------------------------------------------------------------------------
@node set tree, set trim-empty-features, set top-down-filter, set
@subsubsection set tree
@w{@code{set tree} @var{value}}
specifies how parse trees should be displayed.
@w{@code{set tree full}} turns on the parse tree display, displaying the
result of the parse as a full tree. This is the default setting.
A short sentence would look something like this:
@example
@group
Sentence
|
Declarative
_____|_____
NP VP
| ___|____
N V COMP
cows eat |
NP
|
N
grass
@end group
@end example
@w{@code{set tree flat}} turns on the parse tree display, displaying the
result of the parse as a flat tree structure in the form of a bracketed
string. The same short sentence would look something like this:
@example
@group
(Sentence (Declarative (NP
(N cows)) (VP (V eat) (COMP
(NP (N grass))))))
@end group
@end example
@w{@code{set tree indented}} turns on the parse tree display, displaying
the result of the parse in an indented format sometimes called a
@emph{northwest tree}. The same short sentence would look like this:
@example
@group
Sentence
Declarative
NP
N cows
VP
V eat
COMP
NP
N grass
@end group
@end example
@w{@code{set tree off}} disables the display of parse trees altogether.
@c ----------------------------------------------------------------------------
@node set trim-empty-features, set unification, set tree, set
@subsubsection set trim-empty-features
@w{@code{set trim-empty-features} @var{value}}
disables the display of empty feature values if @var{value} is
@code{on}, and enables the display of empty feature values if
@var{value} is @code{off}. The default is not to display empty feature
values.
@c ----------------------------------------------------------------------------
@node set unification, set verbose, set trim-empty-features, set
@subsubsection set unification
@w{@code{set unification} @var{value}}
enables or disables feature unification.
@w{@code{set unification on}} turns on unification mode. This is the
default setting.
@w{@code{set unification off}} turns off feature unification in the
grammar. Only the context-free phrase structure rules are used to
guide the parse; the feature constraints are ignored. This can be
dangerous, as it is easy to introduce infinite cycles in recursive
phrase structure rules.
@c ----------------------------------------------------------------------------
@node set verbose, set warnings, set unification, set
@subsubsection set verbose
@w{@code{set verbose} @var{value}}
enables or disables the screen display of parse trees in the
@w{@code{file parse}}
command. @w{@code{set verbose on}} enables the screen display of parse
trees, and @w{@code{set verbose off}} disables such display. The default
setting is @code{off}.
@c ----------------------------------------------------------------------------
@node set warnings, set write-ample-parses, set verbose, set
@subsubsection set warnings
@w{@code{set warnings} @var{value}}
enables warning mode if @var{value} is @code{on}, and disables
warning mode if @var{value} is @code{off}. If warning mode is
enabled, then warning messages are displayed on the output. If warning
mode is disabled, then no warning messages are displayed. The default
setting is @code{on}.
@c ----------------------------------------------------------------------------
@node set write-ample-parses, , set warnings, set
@subsubsection set write-ample-parses
@w{@code{set write-ample-parses} @var{value}}
enables writing @code{\parse} and @code{\features} fields at the end of
each sentence in the disambiguated analysis file if @var{value} is
@code{on}, and disables writing these fields if @var{value} is
@code{off}. The default setting is @code{off}.
This variable setting affects only the @w{@code{file disambiguate}} command.
@c ----------------------------------------------------------------------------
@node show, status, set, Interactive commands
@subsection show
The @code{show} commands display internal settings on the screen. Each
of these commands is described below.
@menu
* show lexicon::
* show status::
@end menu
@c ----------------------------------------------------------------------------
@node show lexicon, show status, show, show
@subsubsection show lexicon
@w{@code{show lexicon}}
prints the contents of the lexicon stored in memory to the standard
output. @sc{this is not very useful, and may be removed.}
@c ----------------------------------------------------------------------------
@node show status, , show lexicon, show
@subsubsection show status
@w{@code{show status}}
displays the names of the current grammar, sentences, and log files,
and the values of the switches established by the @code{set} command.
@code{show} (by itself) and @code{status} are synonyms for
@w{@code{show status}}.
@c ----------------------------------------------------------------------------
@node status, synthesize, show, Interactive commands
@subsection status
@code{status}
displays the names of the current grammar, sentences, and log files,
and the values of the switches established by the @code{set} command.
@c ----------------------------------------------------------------------------
@node synthesize, system, status, Interactive commands
@subsection synthesize
@w{@code{synthesize} @var{[morphological-form]}} attempts to produce
surface forms from a morphological form provided by the user. If a
morphological form is typed on the same line as the command, then that
form is synthesized. If the command is typed without a form, then
PC-Kimmo repeatedly prompts the user for morphological forms with a
special synthesizer prompt, processing each form. This cycle of typing
and synthesizing is terminated by typing an empty ``form'' (that is,
nothing but the @code{Enter} or @code{Return} key).
Note that the morphemes in the morphological form must be separated by
spaces, and must match gloss entries loaded from the lexicon. Also, the
morphemes must be given in the proper order.
Both the rules and the synthesis lexicon must be loaded before using
this command. It does not use a grammar.
@c ----------------------------------------------------------------------------
@node system, take, synthesize, Interactive commands
@subsection system
@w{@code{system} @var{[command]}}
allows the user to execute an operating system command (such as
checking the available space on a disk) from within PC-Kimmo. This is
available only for MS-DOS and Unix, not for Microsoft Windows or the
Macintosh.
If no system-level command is given on the line with the @code{system}
command, then PC-Kimmo is pushed into the background and a new system
command processor (shell) is started. Control is usually returned to
PC-Kimmo in this case by typing @code{exit} as the operating system
command.
@code{sys} is the minimal abbreviation for @code{system}.
@code{!} (exclamation point) is a synonym for @code{system}.
(@code{!} does not require a space to separate it from the command.)
@c ----------------------------------------------------------------------------
@node take, , system, Interactive commands
@subsection take
@w{@code{take} @var{[file.tak]}}
redirects command input to the specified file.
The default filetype extension for @code{take} is @file{.tak}, and the
default filename is @file{pckimmo.tak}.
@code{take} files can be nested three deep. That is, the user types
@w{@code{take file1}}, @code{file1} contains the command @w{@code{take file2}},
and @code{file2} has the command @w{@code{take file3}}. It would be an
error for @code{file3} to contain a @code{take} command. This should
not prove to be a serious limitation.
A @code{take} file can also be specified by using the @code{-t} command
line option when starting PC-Kimmo. When started, PC-Kimmo looks for a
@code{take} file named @file{pckimmo.tak} in the current directory to
initialize itself with.
@c ----------------------------------------------------------------------------
@node Rules file, Lexicon files, Running PC-Kimmo, Top
@chapter The PC-Kimmo Rules File
@set rules-structure 1
The general structure of the rules file is a list of keyword
declarations. Figure @value{rules-structure} shows the conventional
structure of the rules file. Note that the notation
@w{@samp{@{x | y@}}} means either @samp{x} or @samp{y} (but not both).
@example
@group
@b{Figure @value{rules-structure} Structure of the rules file}
COMMENT @var{character}
ALPHABET @var{symbol-list}
NULL @var{symbol}
ANY @var{symbol}
BOUNDARY @var{symbol}
SUBSET @var{subset-name} @var{symbol-list}
. (more subsets)
.
.
RULE @var{rule-name} @var{number-of-states} @var{number-of-columns}
@var{lexical-characters}
@var{surface-characters}
@var{state-number}@{: | .@} @var{transition-list}
. (more states)
.
.
. (more rules)
.
.
END
@end group
@end example
The following specifications apply to the rules file.
@itemize @bullet
@item
Extra spaces, blank lines, and comment lines are ignored. In the
descriptions below, reference to the use of a space character implies
any whitespace character (that is, any character treated like a space
character). The following control characters when used in a file are
whitespace characters: @code{^I} (ASCII 9, tab), @code{^J} (ASCII 10,
line feed), @code{^K} (ASCII 11, vertical tab), @code{^L} (ASCII 12,
form feed), and @code{^M} (ASCII 13, carriage return).
@item
Comments may be placed anywhere in the file. All data following a
comment character to the end of the line is ignored. (See below on the
@code{COMMENT} declaration.)
@item
The set of valid keywords used to form declarations includes
@code{COMMENT}, @code{ALPHABET}, @code{NULL}, @code{ANY},
@code{BOUNDARY}, @code{SUBSET}, @code{RULE}, and @code{END}.
@item
These declarations are obligatory and can occur only once in a file:
@code{ALPHABET}, @code{NULL}, @code{ANY}, @code{BOUNDARY}.
@item
These declarations are optional and can occur one or more times in a
file: @code{COMMENT}, @code{SUBSET}, and @code{RULE}.
@item
The @code{COMMENT} declaration sets the comment character used in the
rules file, lexicon files, and grammar file. The @code{COMMENT}
declaration can only be used in the rules file, not in the lexicon or
grammar file. The @code{COMMENT} declaration is optional. If it is
not used, the comment character is set to @code{;} (semicolon) as a
default.
@item
The @code{COMMENT} declaration can be used anywhere in the rules file
and can be used more than once. That is, different parts of the rules
file can use different comment characters. The @code{COMMENT}
declaration can (and in practice usually does) occur as the first
keyword in the rules file, followed by either one or more
@code{COMMENT} declarations or the @code{ALPHABET} declaration.
@item
Note that if you use the @code{COMMENT} declaration to declare the
character that is already in use as the comment character, an error
will result. For instance, if semicolon is the current comment
character, the declaration @w{@code{COMMENT ;}} will result in an error.
@item
The comment character can no longer be set using a command line option
or with a command in the user interface, as was the case in version 1
of PC-Kimmo.
@item
The @code{ALPHABET} declaration must either occur first in the file or
follow one or more @code{COMMENT} declarations only. The other
declarations can appear in any order. The @code{COMMENT}, @code{NULL},
@code{ANY}, @code{BOUNDARY}, and @code{SUBSET} declarations can even be
interspersed among the rules. However, these declarations must appear
before any rule that uses them or an error will result.
@item
The @code{ALPHABET} declaration defines the set of symbols used in
either lexical or surface representations. The keyword @code{ALPHABET}
is followed by a @var{symbol-list} of all the alphabetic symbols.  Each
symbol must be separated from the others by at least one space.  The
list can span multiple lines, and ends at the next valid keyword.
All alphanumeric characters (such as @code{a}, @code{B}, and @code{2}),
symbols (such as @code{$} and @code{+}), and punctuation characters
(such as @code{.} and @code{?}) are available as alphabet members. The
characters in the IBM extended character set (above ASCII 127) are also
available. Control characters (below ASCII 32) can also be used, with
the exception of whitespace characters (see above), @code{^Z} (end of
file), and @code{^@@} (null). The alphabet can contain a maximum of 255
symbols. An alphabetic symbol can also be a multigraph, that is, a
sequence of two or more characters.  The individual characters
composing a multigraph need not themselves be declared as
alphabetic characters.  For example, an alphabet could include the
characters @code{s} and @code{z} and the multigraph @code{sz%}, but not
include @code{%} as an alphabetic character. Note that a multigraph cannot
also be interpreted as a sequence of the individual characters that
comprise it.
@item
The keyword @code{NULL} is followed by a single @var{symbol} that
represents a null (empty, zero) element. The @code{NULL} symbol is
considered to be an alphabetic character, but cannot also be listed in
the @code{ALPHABET} declaration. The @code{NULL} symbol declared in
the rules file is also used in the lexicon file to represent a null
lexical entry.
@item
The keyword @code{ANY} is followed by a single ``wildcard''
@var{symbol} that represents a match of any character in the
alphabet. The @code{ANY} symbol is not considered to be an alphabetic
character, though it is used in the column headers of state tables. It
cannot be listed in the @code{ALPHABET} declaration. It is not used in
the lexicon file.
@item
The keyword @code{BOUNDARY} is followed by a single @var{symbol} that
represents an initial or final word boundary.  The
@code{BOUNDARY} symbol is considered to be an alphabetic character, but
cannot also be listed in the @code{ALPHABET} declaration. When used in
the column header of a state table, it can only appear as the pair
@code{#:#} (where, for instance, @code{#} has been declared as the
@code{BOUNDARY} symbol). The @code{BOUNDARY} symbol is also used in
the lexicon file in the continuation class field of a lexical entry to
indicate the end of a word (that is, no continuation class).
@item
The @code{SUBSET} declaration defines a set of characters that are
referred to in the column headers of rules.  The keyword @code{SUBSET}
is followed by a @var{subset-name} and a @var{symbol-list}.
@var{subset-name} is a single word (one or more
characters) that names the list of characters that follows it. The
subset name must be unique (that is, if it is a single character it
cannot also be in the alphabet or be any other declared symbol). It
can be composed of any characters (except space); that is, it is not
limited to the characters declared in the @code{ALPHABET} section. It
must not be identical to any keyword used in the rules file. The
subset name is used in rules to represent all members of the subset of
the alphabet that it defines. Note that @code{SUBSET} declarations can
be interspersed among the rules. This allows subsets to be placed near
the rule that uses them if such a style is desired. However, a subset
must be declared before a rule that uses it.
@item
The @var{symbol-list} following a @var{subset-name} is a list of
single symbols, each of which is separated by at least one space. The
list can span multiple lines. Each symbol in the list must be a member
of the previously defined @code{ALPHABET}, with the exception of the
@code{NULL} symbol, which can appear in a subset list but is not
included in the @code{ALPHABET} declaration. Neither the @code{ANY}
symbol nor the @code{BOUNDARY} symbol can appear in a subset symbol
list.
@item
The keyword @code{RULE} signals that a state table immediately follows.
Note that two-level rules must be expressed as a state table rather
than in the form discussed in
@ifset txt
chapter 2 `The Two-level Formalism'
@end ifset
@ifclear txt
@ref{Two-level formalism}
@end ifclear
above.
@item
@var{rule-name} is the name or description of the rule that the state
table encodes.  It functions as an annotation to the state table and has
no effect on the computational operation of the table.  It is displayed
by the @code{list rules} and @code{show rule} commands and is also displayed in
traces. The rule name must be surrounded by a pair of identical
delimiter characters. Any material can be used between the delimiters
of the rule name with the exception of the current comment character
and of course the rule name delimiter character of the rule itself.
Each rule in the file can use a different pair of delimiters. The rule
name must be all on one line, but it does not have to be on the same
line as the @code{RULE} keyword.
@item
@var{number-of-states} is the number of states (rows in the table) that
will be defined for this table. The states must begin at 1 and go in
sequence through the number defined here (that is, gaps in state
numbers are not allowed).
@item
@var{number-of-columns} is the number of state transitions (columns in the
table) that will be defined for each state.
@item
@var{lexical-characters} is a list of elements separated by one or
more spaces. Each element represents the lexical half of a
lexical:surface correspondence which, when matched, defines a state
transition. Each element in the list must be either a member of the
alphabet, a subset name, the @code{NULL} symbol, the @code{ANY} symbol,
or the @code{BOUNDARY} symbol (in which case the corresponding surface
character must also be the @code{BOUNDARY} symbol). The list can span
multiple lines, but the number of elements in the list must be equal to
the number of columns defined for the rule.
@item
@var{surface-characters} is a list of elements separated by one or
more spaces.  Each element represents the surface half of a
lexical:surface correspondence which, when matched, defines a state
transition.  Each element in the list must be either a member of the
alphabet, a subset name, the @code{NULL} symbol, the @code{ANY} symbol,
or the @code{BOUNDARY} symbol (in which case the corresponding lexical
character must also be the @code{BOUNDARY} symbol).  The list can span
multiple lines, but the number of elements in the list must be equal
to the number of columns defined for the rule.
@item
@var{state-number} is the number of the state or row of the table.  The
first state number must be 1, and subsequent state numbers must follow
in numerical sequence without any gaps.
@item
@samp{@{: | .@}} is the final or nonfinal state indicator. This should
be a colon (@code{:}) if the state is a final state and a period
(@code{.}) if it is a nonfinal state. It must follow the
@var{state-number} with no intervening space.
@item
@var{transition-list} is a list of state transition numbers for a
particular state.  Each number must be between 0 and the number of
states (inclusive) declared for the table, where 0 indicates failure
(as in the sample tables below).  The list can span multiple lines,
but the number of elements in the list must be equal to the number of
columns declared for this rule.
@item
The keyword @code{END} follows all other declarations and indicates the
end of the rules file. Any material in the file thereafter is ignored
by PC-Kimmo. The @code{END} keyword is optional; the physical end of
the file also terminates the rules file.
@end itemize
@set rules-sample 2
Figure @value{rules-sample} shows a sample rules file.
@example
@group
@b{Figure @value{rules-sample} A sample rules file}
ALPHABET
b c d f g h j k l m n p q r s t v w x y z + ; + is morpheme boundary
a e i o u
NULL 0
ANY @@
BOUNDARY #
SUBSET C b c d f g h j k l m n p q r s t v w x y z
SUBSET V a e i o u
; more subsets
RULE "Consonant defaults" 1 23
b c d f g h j k l m n p q r s t v w x y z + @@
b c d f g h j k l m n p q r s t v w x y z 0 @@
1: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
RULE "Vowel defaults" 1 6
a e i o u @@
a e i o u @@
1: 1 1 1 1 1 1
RULE "Voicing s:z <=> V___V" 4 4
V s s @@
V z @@ @@
1: 2 0 1 1
2: 2 4 3 1
3: 0 0 1 1
4. 2 0 0 0
; more rules
END
@end group
@end example
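The way a state table is applied to a pair of forms can be sketched in
Python.  This is a simplified illustration built around the ``Voicing''
rule above, with hypothetical helper names; it is not PC-Kimmo's actual
implementation, which runs all rules in parallel and also handles the
@code{NULL} and @code{BOUNDARY} symbols.

```python
# Simplified sketch of state table traversal for the sample rule
# "Voicing s:z <=> V___V"; not PC-Kimmo's actual code.
VOWELS = set("aeiou")

# Column headers as (lexical, surface) pairs; "@" is the ANY symbol,
# "V" the vowel subset declared in the sample rules file.  Columns are
# tried in order here, a simplification of PC-Kimmo's column matching.
COLUMNS = [("V", "V"), ("s", "z"), ("s", "@"), ("@", "@")]

# TABLE[state] lists the next state for each column; 0 means failure.
TABLE = {1: [2, 0, 1, 1],
         2: [2, 4, 3, 1],
         3: [0, 0, 1, 1],
         4: [2, 0, 0, 0]}
FINAL = {1: True, 2: True, 3: True, 4: False}  # ":" vs "." states

def match_half(header, ch):
    if header == "@":            # ANY symbol matches any character
        return True
    if header == "V":            # subset name matches its members
        return ch in VOWELS
    return header == ch          # ordinary alphabetic symbol

def accepts(lexical, surface):
    """Step the paired forms through the table; accept in a final state."""
    state = 1
    for lch, sch in zip(lexical, surface):
        for col, (lh, sh) in enumerate(COLUMNS):
            if match_half(lh, lch) and match_half(sh, sch):
                state = TABLE[state][col]
                break
        else:
            return False         # no column matched the pair
        if state == 0:
            return False         # explicit failure transition
    return FINAL[state]
```

For example, @code{accepts("kisa", "kiza")} succeeds while
@code{accepts("kisa", "kisa")} fails, since an intervocalic lexical
@code{s} must surface as @code{z} under this rule.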
@c ----------------------------------------------------------------------------
@node Lexicon files, Grammar file, Rules file, Top
@chapter The PC-Kimmo Lexicon Files
@set lex-main 3
A lexicon consists of one main lexicon file plus one or more files of
lexical entries. The general structure of the main lexicon file is a
list of keyword declarations. The set of valid keywords is
@code{ALTERNATION}, @code{FEATURES}, @code{FIELDCODE}, @code{INCLUDE},
and @code{END}. Figure @value{lex-main} shows the conventional
structure of the main lexicon file.
@example
@group
@b{Figure @value{lex-main} Structure of the main lexicon file}
ALTERNATION @var{alternation-name} @var{sublexicon-list}
. (more ALTERNATIONs)
.
.
FEATURES @var{feature-list}
FIELDCODE @var{fieldcode} U
FIELDCODE @var{fieldcode} L
FIELDCODE @var{fieldcode} A
FIELDCODE @var{fieldcode} F
FIELDCODE @var{fieldcode} G
INCLUDE @var{filename}
. (more INCLUDEd files)
.
.
END
@end group
@end example
The following specifications apply to the main lexicon file.
@itemize @bullet
@item
Extra spaces, blank lines, and comment lines are ignored. In the
descriptions below, reference to the use of a space character implies
any whitespace character (that is, any character treated like a space
character). The following control characters when used in a file are
whitespace characters: @code{^I} (ASCII 9, tab), @code{^J} (ASCII 10,
line feed), @code{^K} (ASCII 11, vertical tab), @code{^L} (ASCII 12,
form feed), and @code{^M} (ASCII 13, carriage return).
@item
The comment character declared in the rules file is operative in the
main lexicon file. Comments may be placed anywhere in the file. All
data following a comment character to the end of the line is ignored.
@item
The set of valid keywords used to form declarations includes
@code{ALTERNATION}, @code{FEATURES}, @code{FIELDCODE}, @code{INCLUDE},
and @code{END}.
@item
The declarations can appear in any order with the proviso that any
alternation name, feature name, or fieldcode used in a lexical entry
must be declared before the lexical entry is read. In practice, this
means that the @code{INCLUDE} declarations should appear last, but the
@code{ALTERNATION}, @code{FEATURES}, and @code{FIELDCODE} declarations
can appear in any order.
@item
The @code{ALTERNATION} declaration defines a set of sublexicon names
that serve as the continuation class of a lexical item. The
@code{ALTERNATION} keyword is followed by an @var{alternation-name}
and a @var{sublexicon-list}.  @code{ALTERNATION} declarations
are optional (but nearly always used in practice) and can occur as many
times as needed.
@item
@var{alternation-name} is a name associated with the following
@var{sublexicon-list}.  It is a word composed of one or more
characters, not limited to the @code{ALPHABET} characters declared in
the rules file. An alternation name can be any word other than a
keyword used in the lexicon file. The program does not check to see if
an alternation name is actually used in the lexicon file.
@item
@var{sublexicon-list} is a list of sublexicon names.  It can
span multiple lines until the next valid keyword is encountered. Each
sublexicon name in the list must be used in the sublexicon field of a
lexical entry. Although it is not enforced at the time the lexicon
file is loaded, an undeclared sublexicon named in a sublexicon name
list will cause an error when the recognizer tries to use it.
@item
The @code{FEATURES} keyword is followed by a @var{feature-list}.  A
@var{feature-list} is a list of words, each
of which is expanded into feature structures by the word grammar.
@item
The @code{FIELDCODE} declaration is used to define what fieldcode will
be used to mark each type of field in a lexical entry. The
@code{FIELDCODE} keyword is followed by a @var{fieldcode} and one of five
possible internal codes: @code{U}, @code{L}, @code{A}, @code{F}, or
@code{G}. There must be five @code{FIELDCODE} declarations, one for
each of these internal codes, where @code{U} indicates the lexical item
field, @code{L} indicates the sublexicon field, @code{A} indicates the
alternation field, @code{F} indicates the features field, and @code{G}
indicates the gloss field.
@item
The @code{INCLUDE} keyword is followed by a @var{filename} that names
a file containing lexical entries to be loaded. An @code{INCLUDE}d
file cannot contain any declarations (such as a @code{FIELDCODE} or an
@code{INCLUDE} declaration), only lexical entries and comment lines.
@item
The keyword @code{END} follows all other declarations and indicates the
end of the main lexicon file. Any material in the file thereafter is
ignored by PC-Kimmo. The @code{END} keyword is optional; the physical
end of the file also terminates the main lexicon file.
@end itemize
@noindent
@set lex-sample-main 4
Figure @value{lex-sample-main} shows a sample main lexicon file.
@example
@group
@b{Figure @value{lex-sample-main} A sample main lexicon file}
ALTERNATION Begin PREF
ALTERNATION Pref N AJ V AV
ALTERNATION Stem SUFFIX
FEATURES sg pl reg irreg
FIELDCODE lf U ;lexical item
FIELDCODE lx L ;sublexicon
FIELDCODE alt A ;alternation
FIELDCODE fea F ;features
FIELDCODE gl G ;gloss
INCLUDE affix.lex ;file of affixes
INCLUDE noun.lex ;file of nouns
INCLUDE verb.lex ;file of verbs
INCLUDE adjectiv.lex ;file of adjectives
INCLUDE adverb.lex ;file of adverbs
END
@end group
@end example
@set lex-entry 5
Figure @value{lex-entry} shows the structure of a lexical entry. Lexical
entries are encoded in ``field-oriented standard format.'' Standard
format is an information interchange convention developed by SIL
International. It tags the kinds of information in ASCII text files by
means of markers which begin with backslash. Field-oriented standard
format (FOSF) is a refinement of standard format geared toward
representing data which has a database-like record and field structure.
@example
@group
@b{Figure @value{lex-entry} Structure of a lexical entry}
\@var{lexical-item-fieldcode} @var{lexical-item}
\@var{sublexicon-fieldcode} @var{sublexicon-name}
\@var{alternation-fieldcode} @{@var{alternation-name} | @var{boundary-symbol}@}
\@var{features-fieldcode} @var{feature-list}
\@var{gloss-fieldcode} @var{gloss}
@end group
@end example
@noindent
The following points provide an informal description of the syntax of
FOSF files.
@itemize @bullet
@item
A field-oriented standard format (FOSF) file consists of a sequence of
records.
@item
A record consists of a sequence of fields.
@item
A field consists of a field marker and a field value.
@item
A field marker consists of a backslash character at the beginning of a
line, followed by an alphabetic or numeric character, followed by zero
or more printable characters, and terminated by a space, tab, or the
end of a line. A field marker without its initial backslash character
is termed a field code.
@item
A field marker must begin in the first position of a line. Backslash
characters occurring elsewhere in the file are not interpreted as field
markers.
@item
The first field marker in the file is considered the record marker,
and thus the same field must occur first in every record of the file.
@item
Each field marker is separated from the field value by one or more
spaces, tabs, or newlines. The field value continues up to the next
field marker.
@item
Any line that is empty or contains only whitespace characters is
considered a comment line and is ignored. Comment lines may occur
between or within fields.
@item
Fields and lines in an FOSF file can be arbitrarily long.
@item
There are two basic types of fields in FOSF files: nonrepeating and
repeating. Repeating fields are multiple consecutive occurrences of
fields marked by the same marker. Individual fields within a repeating
field can be called subfields.
@end itemize
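The record and field structure just described can be sketched in
Python.  This is a deliberate simplification (for instance, it treats
only a space as the marker terminator and keeps continuation lines on
one line); the function name is hypothetical, and this is not
PC-Kimmo's actual parser.

```python
# Sketch of splitting field-oriented standard format (FOSF) text into
# records and fields; a simplification, not PC-Kimmo's actual parser.
def parse_fosf(text):
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue                      # blank lines are comment lines
        if line.startswith("\\"):         # a field marker starts a field
            marker, _, rest = line[1:].partition(" ")
            fields.append((marker, rest.strip()))
        elif fields:
            # continuation line: the value runs up to the next marker
            old_marker, old_value = fields[-1]
            fields[-1] = (old_marker, (old_value + " " + line.strip()).strip())
    if not fields:
        return []
    record_marker = fields[0][0]          # first marker = record marker
    records, current = [], []
    for marker, value in fields:
        if marker == record_marker and current:
            records.append(current)       # a record marker starts a record
            current = []
        current.append((marker, value))
    records.append(current)
    return records
```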
@noindent
The following specifications apply to how FOSF is implemented in PC-Kimmo.
@itemize @bullet
@item
Lexical entries are encoded as records in a FOSF file.
@item
Only those fields whose field codes are declared in the main lexicon
file are recognized (see above on the @code{FIELDCODE} declaration). All other
fields are considered to be extraneous and are ignored.
@item
The first field of each lexical entry must be the lexical item field.
The lexical item field code is assigned to the internal code @code{U} by a
@code{FIELDCODE} declaration in the main lexicon file.
@item
Only nonrepeating fields are permitted.
@item
The comment character declared in the rules file is operative in
included files of lexical entries. All data following a comment
character to the end of the line is ignored.
@end itemize
A file of lexical entries is loaded by using an @code{INCLUDE}
declaration in the main lexicon file (see above). An @code{INCLUDE}d
file of lexical entries cannot contain any declarations (such as a
@code{FIELDCODE} or an @code{INCLUDE} declaration), only lexical
entries and comment lines.
The following specifications apply to lexical entries.
@itemize @bullet
@item
A lexical entry is composed of five fields: lexical item, sublexicon,
alternation, features, and gloss. The lexical item, sublexicon, and
alternation fields are obligatory; the features and gloss fields are
optional.  The first field of the entry must always be the lexical
item. The other fields can appear in any order, even differing from
one entry to another.
@item
Although the gloss field is optional, if a lexical entry does not
include one, a warning message to that effect will be displayed when
the entry is loaded. To suppress this warning message, do the command
@w{@code{set warnings off}}
@ifset txt
(see section 3.2.17.19 `set warnings')
@end ifset
@ifclear txt
(@pxref{set warnings})
@end ifclear
before loading the lexicon.
@item
If an entry has an empty gloss field (that is, the field marker for the
gloss field is present but there is no data after it), then the
contents of the lexical item field will also be used as the gloss
for that entry.
@item
A lexical item field consists of a @var{lexical-item-fieldcode} and a
@var{lexical-item}.
@item
A @var{lexical-item-fieldcode} is a field code assigned to the internal
code @code{U} by a @code{FIELDCODE} declaration in the main lexicon
file.
@item
A @var{lexical-item} is one or more characters that represent an
element (typically a morpheme or word) of the lexicon. Each character
(or multigraph) must be in the alphabet defined for the language. The
lexical item uses only the lexical subset of the alphabet.
@item
A sublexicon field consists of a @var{sublexicon-fieldcode} and a
@var{sublexicon-name}.
@item
A @var{sublexicon-fieldcode} is a field code assigned to the internal code
@code{L} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{sublexicon-name} is the name associated with a sublexicon.  It
is a word composed of one or more characters, not limited to the
alphabetic characters declared in the rules file. Every lexical item
must belong to a sublexicon. Every lexicon must include a special
sublexicon named INITIAL (that is, there must be at least one lexical
entry that belongs to the INITIAL sublexicon).
@item
Lexical entries belonging to a sublexicon do not have to be listed
consecutively in a single file (as was the case for PC-Kimmo version
1); rather, lexical entries in a file can occur in any order,
regardless of what sublexicon they belong to. Lexical entries of a
sublexicon can even be placed in two or more separate files.
@item
An alternation field consists of an @var{alternation-fieldcode}
followed by either an @var{alternation-name} or the
@var{boundary-symbol}.
@item
An @var{alternation-name} is declared in an @code{ALTERNATION}
declaration in the main lexicon file.  The @var{boundary-symbol} is
declared in the rules file and indicates the end of all possible
continuations in the lexicon.
@item
A features field consists of a @var{features-fieldcode} and a
@var{feature-list}.
@item
A @var{features-fieldcode} is a field code assigned to the internal code
@code{F} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{feature-list} is a list of feature abbreviations.  Each
abbreviation is a single word consisting of alphanumeric characters or
other characters except @code{()@{@}[]<>=:$!} (these are used for
special purposes in the grammar file). The character @code{\} should
not be used as the first character of an abbreviation because that is
how fields are marked in the lexicon file. Upper and lower case
letters used in template names are considered different. For example,
@code{PLURAL} is not the same as @code{Plural} or @code{plural}.
Feature abbreviations are expanded into full feature structures by the
word grammar
@ifset txt
(see chapter 6 `The Grammar File').
@end ifset
@ifclear txt
(@pxref{Grammar file}).
@end ifclear
@item
A gloss field consists of a @var{gloss-fieldcode} and a
@var{gloss}.
@item
A @var{gloss-fieldcode} is a field code assigned to the internal code
@code{G} by a @code{FIELDCODE} declaration in the main lexicon file.
@item
A @var{gloss} is a string of text.  Any material can be used
in the gloss field with the exception of the comment character.
@end itemize
@set lex-sample-entry 6
Figure @value{lex-sample-entry} shows a sample lexical entry.
@example
@group
@b{Figure @value{lex-sample-entry} A sample lexical entry}
\lf `knives
\lx N
\alt Infl
\fea pl irreg
\gl N(`knife)+PL
@end group
@end example
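Given the @code{FIELDCODE} declarations from the sample main lexicon
file above, the sample entry would be interpreted along these lines.
The function name and the dictionary representation are hypothetical
illustrations, not PC-Kimmo's internal data structure; the
empty-gloss fallback follows the behavior described above.

```python
# Sketch: mapping a lexical entry's fields through the FIELDCODE
# declarations (lf->U, lx->L, alt->A, fea->F, gl->G from the sample
# main lexicon file).  Not PC-Kimmo's internal representation.
FIELDCODES = {"lf": "U", "lx": "L", "alt": "A", "fea": "F", "gl": "G"}

def interpret_entry(fields):
    entry = {}
    for marker, value in fields:
        code = FIELDCODES.get(marker)
        if code is None:
            continue                      # undeclared fields are ignored
        entry[code] = value
    # an empty gloss field falls back to the lexical item
    if not entry.get("G"):
        entry["G"] = entry.get("U", "")
    return entry

# The sample lexical entry from the figure above, as (marker, value) pairs.
sample = [("lf", "`knives"), ("lx", "N"), ("alt", "Infl"),
          ("fea", "pl irreg"), ("gl", "N(`knife)+PL")]
```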
@c ----------------------------------------------------------------------------
@node Grammar file, Convlex, Lexicon files, Top
@chapter The Grammar File
The following specifications apply generally to the word grammar file:
@itemize @bullet
@item
Blank lines, spaces, and tabs separate elements of the grammar file from one
another, but are ignored otherwise.
@item
The comment character declared by the @w{@code{set comment}} command
@ifset txt
(see section 3.2.17.4 `set comment' above)
@end ifset
@ifclear txt
(@pxref{set comment})
@end ifclear
is operative in the grammar file. The default comment character is the
semicolon (@code{;}). Comments may be placed anywhere in the grammar
file. Everything following a comment character to the end of the line
is ignored.
@item
A grammar file is divided into fields identified by a small set of keywords.
@enumerate
@item
@code{Rule} starts a context-free phrase structure rule with its
set of feature constraints. These rules define how words join together
to form phrases, clauses, or sentences. The lexicon and grammar are
tied together by using the lexical categories as the terminal symbols
of the phrase structure rules and by using the other lexical features
in the feature constraints.
@item
@code{Let} starts a feature template definition. Feature
templates are used as macros (abbreviations) in the lexicon. They may
also be used to assign default feature structures to the categories.
@item
@code{Parameter} starts a program parameter definition. These
parameters control various aspects of the program.
@item
@code{Define} starts a lexical rule definition.
As noted in Shieber (1985), something more powerful than just
abbreviations for common feature elements is sometimes needed to
represent systematic relationships among the elements of a lexicon.
This need is met by lexical rules, which express transformations rather
than mere abbreviations.
Lexical rules are not yet implemented properly. They may or may not be
useful for word grammars used by PC-Kimmo.
@item
@code{Lexicon} starts a lexicon section. This is only for
compatibility with the original PATR-II. The section name is
skipped over properly, but nothing is done with it.
@item
@code{Word} starts an entry in the lexicon. This is only for
compatibility with the original PATR-II. The entry is skipped
over properly, but nothing is done with it.
@item
@code{End} effectively terminates the file. Anything following this
keyword is ignored.
@end enumerate
Note that these keywords are not case sensitive: @code{RULE} is the
same as @code{rule}, and both are the same as @code{Rule}.
@item
Each of the fields in the grammar file may optionally end with a
period. If there is no period, the next keyword (in an appropriate slot)
marks the end of one field and the beginning of the next.
@end itemize
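The keyword-and-field layout described above can be sketched as a
simple splitter.  This is only an illustration under simplifying
assumptions (tokens already split on whitespace, comments already
stripped, the optional trailing period left in place); the function
name is hypothetical and this is not PC-Kimmo's actual reader.

```python
# Sketch: splitting a grammar file into keyword-marked fields.
# Keywords are case-insensitive; everything after End is ignored.
KEYWORDS = {"rule", "let", "parameter", "define", "lexicon", "word", "end"}

def split_fields(tokens):
    fields, current = [], []
    for tok in tokens:
        low = tok.lower()
        if low in KEYWORDS:
            if current:
                fields.append(current)    # next keyword ends the field
            current = []
            if low == "end":
                return fields             # anything after End is ignored
            current = [tok]
        else:
            current.append(tok)
    if current:
        fields.append(current)
    return fields
```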
@menu
* Rule:: defining a word structure rule
* Let:: defining a feature template
* Parameter:: setting control variables
* Define:: defining a lexical rule
@end menu
@c ----------------------------------------------------------------------------
@node Rule, Let, Grammar file, Grammar file
@section Rules
A PC-Kimmo word grammar rule has these parts, in the order listed:
@enumerate
@item the keyword @code{Rule}
@item an optional rule identifier enclosed in braces (@code{@{@}})
@item the nonterminal symbol to be expanded
@item an arrow (@code{->}) or equal sign (@code{=})
@item zero or more terminal or nonterminal symbols, possibly marked for
alternation or optionality
@item an optional colon (@code{:})
@item zero or more feature constraints
@item an optional period (@code{.})
@end enumerate
The optional rule identifier consists of one or more words enclosed in
braces. Its current utility is only as a special form of comment
describing the intent of the rule. (Eventually it may be used as a tag
for interactively adding and removing rules.) The only limits on the
rule identifier are that it not contain the comment character and that
it appear entirely on one line in the grammar file.
The terminal and nonterminal symbols in the rule have the following
characteristics:
@itemize @bullet
@item
Upper and lower case letters used in symbols are considered different.
For example, @code{NOUN} is not the same as @code{Noun}, and neither is
the same as @code{noun}.
@item
The symbol X may be used to stand for any terminal or nonterminal. For
example, this rule says that any category in the grammar rules can be
replaced by two copies of the same category separated by a CJ.
@example
@group
Rule X -> X_1 CJ X_2
    <X> = <X_1>
    <X> = <X_2>
@end group
@end example
The symbol X can be useful for capturing generalities. Care must be
taken, since it can be replaced by anything.
@item
Index numbers are used to distinguish instances of a symbol that
is used more than once in a rule. They are added to the end of a
symbol following an underscore character (@code{_}). This is
illustrated in the rule for X above.
@item
The characters @code{()@{@}[]<>=:/} cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the
grammar file. The character @code{_} can be used @emph{only} for
attaching an index number to a symbol.
@item
By default, the left hand symbol of the first rule in the grammar file
is the start symbol of the grammar.
@end itemize
The symbols on the right hand side of a phrase structure rule may be
marked or grouped in various ways:
@itemize @bullet
@item
Parentheses around an element of the expansion (right hand) part of a
rule indicate that the element is optional. Parentheses may be placed
around multiple elements. This makes an optional group of elements.
@item
A forward slash (/) is used to separate alternative elements of the
expansion (right hand) part of a rule.
@item
Curly braces can be used for grouping elements.  For example, the
following says that an S consists of an NP followed by either a TVP
or an IV:
@example
Rule S -> NP @{TVP / IV@}
@end example
@item
Alternatives are taken to be as long as possible. Thus if the curly
braces were omitted from the rule above, as in the rule below, the
TVP would be treated as part of the alternative containing the
NP. It would not be allowed before the IV.
@example
Rule S -> NP TVP / IV
@end example
@item
Parentheses group enclosed elements the same as curly braces
do. Alternatives and groups delimited by parentheses or curly braces
may be nested to any depth.
@end itemize
A rule can be followed by zero or more @emph{feature constraints}
that refer to symbols used in the rule.
A feature constraint has these parts, in the order listed:
@enumerate
@item a feature path that begins with one of the symbols from the
phrase structure rule
@item an equal sign
@item either another path or a value
@end enumerate
A feature constraint that refers only to symbols on the right hand side
of the rule constrains their co-occurrence. In the following rule and
constraint, the values of the @emph{agr} features for the NP and VP
nodes of the parse tree must unify:
@example
@group
Rule S -> NP VP
    <NP agr> = <VP agr>
@end group
@end example
If a feature constraint refers to a symbol on the right hand side of
the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value. In the following
rule and constraint, the @emph{head case} feature for the NP node of
the parse tree must either be originally undefined or equal to NOM:
@example
@group
Rule S -> NP VP
<NP head case> = NOM
@end group
@end example
(After unification succeeds, the @emph{head case} feature for the NP
node of the parse tree will be equal to NOM.)
A feature constraint that refers to the symbol on the left hand side of
the rule passes information up the parse tree. In the following rule
and constraint, the value of the @emph{tense} feature is passed from
the VP node up to the S node:
@example
@group
Rule S -> NP VP
<S tense> = <VP tense>
@end group
@end example
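All three kinds of feature constraint rest on unification. The
following Python sketch is an illustration only, not PC-Kimmo's
implementation: feature structures are modeled as plain dicts, and the
simplified @code{unify} function shows why the constraints above
succeed or fail.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, value in b.items():
            if key in result:
                merged = unify(result[key], value)
                if merged is None:
                    return None          # feature clash
                result[key] = merged
            else:
                result[key] = value      # absent features unify freely
        return result
    return a if a == b else None         # atomic values must be identical

# <NP agr> = <VP agr>: the two agr values must unify.
np = {"agr": {"number": "SG", "person": "3"}}
vp = {"agr": {"number": "SG"}}
assert unify(np["agr"], vp["agr"]) == {"number": "SG", "person": "3"}

# <NP head case> = NOM: succeeds when case is undefined or already NOM,
# but fails against a conflicting explicit value.
np2 = {"head": {"case": "ACC"}}
assert unify(np2["head"], {"case": "NOM"}) is None
```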
@c ----------------------------------------------------------------------------
@node Let, Parameter, Rule, Grammar file
@section Feature templates
A PC-Kimmo grammar feature template has these parts, in the order listed:
@enumerate
@item the keyword @code{Let}
@item the template name
@item the keyword @code{be}
@item a feature definition
@item an optional period (@code{.})
@end enumerate
If the template name is a terminal category (a terminal symbol in one
of the phrase structure rules), the template defines the default
features for that category. Otherwise the template name serves as an
abbreviation for the associated feature structure.
The characters @code{()@{@}[]<>=:} cannot be used in template names
since they are used for special purposes in the grammar file. The
characters @code{/_} can be freely used in template names. The
character @code{\} should not be used as the first character of a
template name because that is how fields are marked in the lexicon
file.
The abbreviations defined by templates are usually used in the feature
field of entries in the lexicon file. For example, the lexical entry
for the irregular plural form @emph{feet} may have the abbreviation
@emph{pl} in its features field. The grammar file would define this
abbreviation with a template like this:
@example
Let pl be [number: PL]
@end example
The path notation may also be used:
@example
Let pl be <number> = PL
@end example
More complicated feature structures may be defined in templates. For
example,
@example
@group
Let 3sg be [tense:  PRES
            agr:    3SG
            finite: +
            vform:  S]
@end group
@end example
which is equivalent to:
@example
@group
Let 3sg be <tense>  = PRES
           <agr>    = 3SG
           <finite> = +
           <vform>  = S
@end group
@end example
In the following example, the abbreviation @emph{irreg} is defined using
another abbreviation:
@example
@group
Let irreg be <regular> = -
             pl
@end group
@end example
The abbreviation @emph{pl} must be defined previously in the grammar
file or an error will result. A subsequent template could also use the
abbreviation @emph{irreg} in its definition. In this way, an
inheritance hierarchy of features may be constructed.
Feature templates permit disjunctive definitions. For example, the
lexical entry for the word @emph{deer} may specify the feature
abbreviation @emph{sg/pl}. The grammar file would define this as a
disjunction of feature structures reflecting the fact that the word can
be either singular or plural:
@example
@group
Let sg/pl be @{[number:SG]
              [number:PL]@}
@end group
@end example
This has the effect of creating two entries for @emph{deer}, one with
singular number and another with plural. Note that there is no limit
to the number of disjunct structures listed between the braces. Also,
there is no slash (@code{/}) between the elements of the disjunction as
there is between the elements of a disjunction in the rules.
A shorter version of the above template using the path notation looks
like this:
@example
Let sg/pl be <number> = @{SG PL@}
@end example
Abbreviations can also be used in disjunctions, provided that they
have previously been defined:
@example
@group
Let sg be <number> = SG
Let pl be <number> = PL
Let sg/pl be @{[sg] [pl]@}
@end group
@end example
Note the square brackets around the abbreviations @emph{sg} and @emph{pl};
without square brackets they would be interpreted as simple values
instead.
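The effect of a disjunctive template on a lexical entry can be
pictured as follows. This Python sketch is illustrative only; the
template table and the @code{expand} function are inventions for this
example, not part of PC-Kimmo. Each combination of disjuncts yields a
separate feature structure, so @emph{deer} behaves as two entries.

```python
from itertools import product

# Hypothetical template table: each abbreviation maps to a list of
# disjunct feature structures (a one-element list when not disjunctive).
templates = {
    "sg":    [{"number": "SG"}],
    "pl":    [{"number": "PL"}],
    "sg/pl": [{"number": "SG"}, {"number": "PL"}],  # Let sg/pl be {[sg] [pl]}
}

def expand(abbrevs):
    """Yield one merged feature structure per combination of disjuncts."""
    for combo in product(*(templates[a] for a in abbrevs)):
        merged = {}
        for fs in combo:
            merged.update(fs)
        yield merged

# An entry marked sg/pl expands into singular and plural variants:
assert list(expand(["sg/pl"])) == [{"number": "SG"}, {"number": "PL"}]
```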
Feature templates can assign default atomic feature values, indicated
by prefixing an exclamation point (!). A default value can be
overridden by an explicit feature assignment. This template says that
all members of category N have singular number as a default value:
@example
Let N be <number> = !SG
@end example
The effect of this template is to make all nouns singular unless they
are explicitly marked as plural. For example, regular nouns such as
@emph{book} do not need any feature in their lexical entries to signal
that they are singular; but an irregular noun such as @emph{feet} would
have a feature abbreviation such as @emph{pl} in its lexical entry.
This would be defined in the grammar as @w{@code{[number: PL]}}, and would
override the default value for the feature number specified by the
template above. If the N template above used @code{SG} instead of
@code{!SG}, then the word @emph{feet} would fail to parse, since its
@emph{number} feature would have an internal conflict between @code{SG}
and @code{PL}.
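The interaction of default and explicit values can be sketched as
follows. This is illustrative Python, not PC-Kimmo code: a default
value simply fills in when the entry supplies no explicit value, while
an explicit value always wins.

```python
def apply_defaults(entry_features, defaults):
    """Fill in category defaults, letting explicit entry features override."""
    result = dict(defaults)        # start from the category defaults ...
    result.update(entry_features)  # ... then explicit features override them
    return result

n_defaults = {"number": "SG"}      # Let N be <number> = !SG

# Regular "book" carries no number feature and defaults to singular:
assert apply_defaults({}, n_defaults) == {"number": "SG"}
# Irregular "feet", marked pl ([number: PL]), overrides the default:
assert apply_defaults({"number": "PL"}, n_defaults) == {"number": "PL"}
```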
@c ----------------------------------------------------------------------------
@node Parameter, Define, Let, Grammar file
@section Parameter settings
A PC-Kimmo grammar parameter setting has these parts, in the order listed:
@enumerate
@item the keyword @code{Parameter}
@item an optional colon (@code{:})
@item one or more keywords identifying the parameter
@item the keyword @code{is}
@item the parameter value
@item an optional period (@code{.})
@end enumerate
PC-Kimmo recognizes the following grammar parameters:
@table @code
@item Start symbol
defines the start symbol of the grammar. For example,
@example
Parameter Start symbol is S
@end example
declares that the parse goal of the grammar is the nonterminal category
S. The default start symbol is the left hand symbol of the first
phrase structure rule in the grammar file.
@item Restrictor
defines a set of features to use for top-down filtering, expressed as a
list of feature paths. For example,
@example
Parameter Restrictor is <cat> <head form>
@end example
declares that the @emph{cat} and @emph{head form} features should be
used to screen rules before adding them to the parse chart. The
default is not to use any features for such filtering. This filtering,
named @emph{restriction} in Shieber (1985), is performed in addition to
the normal top-down filtering based on categories alone.
@sc{restriction is not yet implemented. should it be instead of
normal filtering rather than in addition to?}
@item Attribute order
specifies the order in which feature attributes are displayed. For
example,
@example
@group
Parameter Attribute order is cat lex sense head
                             first rest agreement
@end group
@end example
declares that the @emph{cat} attribute should be the first one shown
in any output from PC-Kimmo, and that the other attributes should
be shown in the relative order shown, with the @emph{agreement}
attribute shown last among those listed, but ahead of any attributes
that are not listed above. Attributes that are not listed are ordered
according to their character code sort order. If the attribute order
is not specified, then the category feature @emph{cat} is shown first,
with all other attributes sorted according to their character codes.
@item Category feature
defines the label for the category attribute. For example,
@example
Parameter Category feature is Categ
@end example
declares that @emph{Categ} is the name of the category attribute. The
default name for this attribute is @emph{cat}.
@item Lexical feature
defines the label for the lexical attribute. For example,
@example
Parameter Lexical feature is Lex
@end example
declares that @emph{Lex} is the name of the lexical attribute. The
default name for this attribute is @emph{lex}.
@item Gloss feature
defines the label for the gloss attribute. For example,
@example
Parameter Gloss feature is Gloss
@end example
declares that @emph{Gloss} is the name of the gloss attribute. The
default name for this attribute is @emph{gloss}.
@end table
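The ordering behavior of the @code{Attribute order} parameter can be
sketched as follows. This is illustrative Python; the @code{sort_key}
function is an invention for this example. Listed attributes come
first, in the order given, and unlisted attributes follow, sorted by
character code.

```python
# Priority list taken from the Attribute order example above.
order = ["cat", "lex", "sense", "head", "first", "rest", "agreement"]

def sort_key(attr):
    """Listed attributes sort by list position; the rest sort alphabetically after."""
    if attr in order:
        return (order.index(attr), "")
    return (len(order), attr)

attrs = ["zeta", "head", "cat", "alpha", "agreement"]
assert sorted(attrs, key=sort_key) == ["cat", "head", "agreement", "alpha", "zeta"]
```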
@c ----------------------------------------------------------------------------
@node Define, , Parameter, Grammar file
@section Lexical rules
A PC-Kimmo grammar lexical rule has these parts, in the order listed:
@enumerate
@item the keyword @code{Define}
@item the name of the lexical rule
@item the keyword @code{as}
@item the rule definition
@item an optional period (@code{.})
@end enumerate
The rule definition consists of one or more mappings. Each mapping has
three parts: an output feature path, an assignment operator, and the
value assigned, either an input feature path or an atomic value. Every
output path begins with the feature name @code{out} and every input
path begins with the feature name @code{in}. The assignment operator
is either an equal sign (@code{=}) or an equal sign followed by a
``greater than'' sign (@code{=>}).
As noted before, lexical rules are not yet implemented properly, and
may not prove to be useful for PC-Kimmo word grammars in any case.
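For concreteness, a lexical rule conforming to the syntax above might
look like the following. This is a purely hypothetical example (the
rule name @emph{passive} and all feature names are invented), and
since lexical rules are not fully implemented, it should not be taken
as working input:

```text
Define passive as <out cat>   = <in cat>
                  <out voice> = PASSIVE
                  <out subj>  => <in obj>
```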
@c ----------------------------------------------------------------------------
@node Convlex, Bibliography, Grammar file, Top
@chapter Convlex: converting version 1 lexicons
The format of the lexicon files changed significantly between version 1
and version 2 of PC-Kimmo. For this reason, an auxiliary program to
convert lexicon files was written.
A version 1 PC-Kimmo lexicon file looks like this:
@example
@group
; SAMPLE.LEX 25-OCT-89
; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.
ALTERNATION Begin NOUN
ALTERNATION Noun End
LEXICON INITIAL
0 Begin "[ "
LEXICON NOUN
s'ati Noun "Noun1"
s'adi Noun "Noun2"
bab'at Noun "Noun3"
bab'ad Noun "Noun4"
LEXICON End
0 # " ]"
END
@end group
@end example
For PC-Kimmo version 2, the same lexicon must be split into two files.
The first one would look like this:
@example
@group
; SAMPLE.LEX 25-OCT-89
; To load this file, first load the rules file SAMPLE.RUL and
; then enter the command LOAD LEXICON SAMPLE.
ALTERNATION Begin NOUN
ALTERNATION Noun End
FIELDCODE lf U
FIELDCODE lx L
FIELDCODE alt A
FIELDCODE fea F
FIELDCODE gl G
INCLUDE sample2.sfm
END
@end group
@end example
Note that everything except the lexicon sections and entries has been
copied verbatim into this new primary lexicon file. The
@code{FIELDCODE} statements define how to interpret the other lexicon
files containing the actual lexicon sections and entries. These files
are indicated by @code{INCLUDE} statements, and look like this:
@example
@group
\lf 0
\lx INITIAL
\alt Begin
\fea
\gl [
@end group
@group
\lf s'ati
\lx NOUN
\alt Noun
\fea
\gl Noun1
@end group
@group
\lf s'adi
\lx NOUN
\alt Noun
\fea
\gl Noun2
@end group
@group
\lf bab'at
\lx NOUN
\alt Noun
\fea
\gl Noun3
@end group
@group
\lf bab'ad
\lx NOUN
\alt Noun
\fea
\gl Noun4
@end group
@group
\lf 0
\lx End
\alt #
\fea
\gl ]
@end group
@end example
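The correspondence between a version 1 entry line and a version 2
record can be sketched in Python as follows. This is an illustration
of the mapping only, not the actual @file{convlex} source; the
@code{convert_entry} function is an invention for this example.

```python
import re

def convert_entry(line, section):
    """Map a version 1 entry ('form Alternation "gloss"') in the given
    LEXICON section to the version 2 backslash-coded fields, using the
    field codes declared above: lf, lx, alt, fea, gl."""
    match = re.match(r'(\S+)\s+(\S+)\s+"(.*)"', line)
    form, alt, gloss = match.groups()
    return [f"\\lf {form}",
            f"\\lx {section}",
            f"\\alt {alt}",
            "\\fea",
            f"\\gl {gloss.strip()}"]

fields = convert_entry('s\'ati Noun "Noun1"', "NOUN")
assert fields == ["\\lf s'ati", "\\lx NOUN", "\\alt Noun", "\\fea", "\\gl Noun1"]
```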
@file{convlex} was written to make the transition from version 1 to
version 2 of PC-Kimmo as painless as possible. It reads a version 1
lexicon file, including any @code{INCLUDE}d files, and writes a version
2 set of lexicon files. For a trivial case like the example above, the
interaction with the user might go something like this:
@example
@group
C:\>convlex
CONVLEX: convert lexicon from PC-KIMMO version 1 to version 2
Comment character: [;]
Input lexicon file: sample.lex
Output lexicon file: sample2.lex
Primary sfm lexicon file: sample2.sfm
@end group
@end example
For each @code{INCLUDE} statement in the version 1 lexicon file,
@file{convlex} prompts for a replacement filename like this:
@example
New sfm include file to replace noun.lex: noun2.sfm
@end example
The user interface is extremely crude, but since this is a program that
is run only once or twice by most users, that should not be regarded as
a problem.
@c ----------------------------------------------------------------------------
@node Bibliography, , Convlex, Top
@unnumbered Bibliography
@enumerate
@item
Antworth, Evan L.@. 1990.
@cite{PC-KIMMO: a two-level processor for morphological analysis}.
@tex
\break
@end tex
Occasional Publications in Academic Computing No.@: 16. Dallas, TX:
Summer Institute of Linguistics.
@item
Antworth, Evan L.@. 1991.
Introduction to two-level phonology.
@cite{Notes on Linguistics} 53:4@value{endash}18.
Dallas, TX: Summer Institute of Linguistics.
@item
Antworth, Evan L.@. 1995. @cite{User's Guide to PC-KIMMO version 2}. URL
@tex
\hfil\break
@end tex
@ifset html
ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip
@end ifset
@ifclear html
@w{@t{ftp://ftp.sil.org/software/dos/pc-kimmo/guide.zip}}
@end ifclear
(visited August 29, 1997).
@item
Chomsky, Noam. 1957.
@cite{Syntactic structures.}
The Hague: Mouton.
@item
Chomsky, Noam, and Morris Halle. 1968.
@cite{The sound pattern of English.}
New York: Harper and Row.
@item
Goldsmith, John A. 1990.
@cite{Autosegmental and metrical phonology.}
Basil Blackwell.
@item
Johnson, C. Douglas. 1972.
@cite{Formal aspects of phonological description.}
The Hague: Mouton.
@item
Kay, Martin. 1983.
When meta-rules are not meta-rules.
In Karen Sparck Jones and Yorick Wilks, eds.,
@cite{Automatic natural language parsing,}
94@value{endash}116.
Chichester: Ellis Horwood Ltd. See pages 100@value{endash}104.
@item
Koskenniemi, Kimmo. 1983.
@cite{Two-level morphology: a general computational model for word-form
recognition and production.}
Publication No. 11.
Helsinki: University of Helsinki Department of General Linguistics.
@end enumerate
@c ----------------------------------------------------------------------------
@contents
@bye