PC-PATR Reference Manual
a unification based syntactic parser
version 1.3.0
March 2002
by Stephen McConnel (changes for v. 1.2.5-1.3.0 by H. Andrew Black)
Copyright (C) 2000-2002 SIL International
Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.
Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.
The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.
Introduction to the PC-PATR program
***********************************
This document describes PC-PATR, an implementation of the PATR-II
computational linguistic formalism (plus a few enhancements) for
personal computers. It is available for MS-DOS, Microsoft Windows,
Macintosh, and Unix.(1)
PC-PATR uses a left corner chart parser with these characteristics:
* bottom-up parse with top-down filtering based on the categories
* left-to-right order: after each word is added to the chart, all
possible edges that can be derived up to that point are computed as a
side effect
PC-PATR is still under development. The author would appreciate
feedback directed to the following address:
Stephen McConnel
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

(972)708-7361 (office)
(972)708-7561 (fax)
steve@acadcomp.sil.org or Stephen_McConnel@sil.org
---------- Footnotes ----------
(1) The Microsoft Windows implementation uses the Microsoft C QuickWin
function, and the Macintosh implementation uses the Metrowerks C SIOUX
function.
The PATR-II Formalism
*********************
The PATR-II formalism can be viewed as a computer language for encoding
linguistic information. It does not presuppose any particular theory
of syntax. It was originally developed by Stuart M. Shieber at
Stanford University in the early 1980s (Shieber 1984, Shieber 1986).
A PATR-II grammar consists of a set of rules and a lexicon. Each rule
consists of a context-free _phrase structure rule_ and a set of
_feature constraints_, that is, _unifications_ on the _feature
structures_ associated with the constituents of the phrase structure
rules. The lexicon provides the items that can replace the terminal
symbols of the phrase structure rules, that is, the words of the
language together with their relevant features.
Phrase structure rules
======================
Context-free phrase structure rules should be familiar to anyone who
has studied either linguistic theory or computer science. They look
like this:
LHS -> RHS_1 RHS_2 ...
`LHS' (the symbol to the left of the arrow) is a nonterminal symbol for
the type of phrase that is being described. To the right of the arrow
is an ordered list of the constituents of the phrase. These
constituents are either nonterminal symbols, appearing on the left hand
side of some rule in the grammar, or terminal symbols, representing
basic classes of elements from the lexicon. These basic classes
usually correspond to what are commonly called _parts of speech_. In
PATR-II, the terminal and nonterminal symbols are both referred to as
_categories_.
Figure 1. Context-free phrase structure grammar
Rule S -> NP VP (SubCl)
Rule NP -> {(Det) (AdjP) N (PrepP)} / PR
Rule Det -> DT / PR
Rule VP -> VerbalP (NP / AdjP) (AdvP)
Rule VerbalP -> V
Rule VerbalP -> AuxP V
Rule AuxP -> AUX (AuxP_1)
Rule PrepP -> PP NP
Rule AdjP -> (AV) AJ (AdjP_1)
Rule AdvP -> {AV / PrepP} (AdvP_1)
Rule SubCl -> CJ S
Consider the PC-PATR style context-free phrase structure grammar in
figure 1. It has ten nonterminal symbols (S, NP, Det, VP, VerbalP,
AuxP, PrepP, AdjP, AdvP, and SubCl), and nine terminal symbols (N, PR,
DT, V, AUX, PP, AV, AJ, and CJ). This grammar describes a small subset
of English sentences. Several aspects of this grammar are worth
mentioning.
1. Optional constituents (or sets of constituents) on the right hand
side are enclosed in parentheses.
2. Alternative constituents (or sets of constituents) on the right
hand side are separated by slashes.
3. Braces are used to group alternative sets of elements together, so
that alternations are not ambiguous.
4. Symbols should not be repeated verbatim within a rule. Repeated
symbols should be distinguished from each other by adding a
different index number to a symbol each time it is repeated.
Index numbers are introduced by the underscore (`_') character.
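The notation for optional and alternative constituents can be read as
shorthand for a set of plain context-free rules. The following Python
sketch illustrates that reading for the NP rule of figure 1. The tuple
encoding (`sym', `opt', `alt', `seq') and the function names are
assumptions made for this illustration; they are not part of PC-PATR.

```python
from itertools import product


def expand(node):
    """Return every plain symbol sequence that `node` can stand for."""
    kind = node[0]
    if kind == "sym":                 # a single category symbol
        return [[node[1]]]
    if kind == "opt":                 # (X): either absent or present
        return [[]] + expand(node[1])
    if kind == "alt":                 # {X / Y}: exactly one of the alternatives
        return [seq for choice in node[1:] for seq in expand(choice)]
    if kind == "seq":                 # X Y Z: concatenate the parts in order
        parts = [expand(child) for child in node[1:]]
        return [sum(combo, []) for combo in product(*parts)]
    raise ValueError(f"unknown node kind: {kind!r}")


def sym(name):
    return ("sym", name)


def opt(child):
    return ("opt", child)


# Right hand side of:  Rule NP -> {(Det) (AdjP) N (PrepP)} / PR
NP_RHS = ("alt",
          ("seq", opt(sym("Det")), opt(sym("AdjP")), sym("N"), opt(sym("PrepP"))),
          sym("PR"))

for rhs in expand(NP_RHS):
    print("NP ->", " ".join(rhs))
```

Three optional constituents around N give eight sequences, and the
alternative PR adds a ninth, so this one rule abbreviates nine plain
rules.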
Figure 2. Parse of sample English sentence
                   S
                   /\
                  /  \
                 /    \
                /      \
               /        \
              /          \
             /            \
           NP              VP
           /\             /|\
          /  \           / | \
         /    \         /  |  \
       Det     N  VerbalP NP  AdvP
        |      |     |     |   |
       DT     man    V    PR PrepP
        |            |     |   /\
       the          sees  us  /  \
                             /    \
                           PP      NP
                            |      /\
                          with    /  \
                                 /    \
                               Det     N
                                |      |
                               DT  telescope
                                |
                                a
Figure 3. Parse of sample sentence (PC-PATR output)
                       S
             __________|__________
            NP                   VP
         ___|____       _________|__________
        Det     N    VerbalP    NP       AdvP
         |     man      |        |         |
        DT              V       PR       PrepP
        the            sees     us    _____|______
                                      PP        NP
                                     with   ____|_____
                                           Det       N
                                            |    telescope
                                           DT
                                            a
A significant amount of grammar development can be done just with
context-free phrase structure rules such as these. For example,
parsing the sentence "the man sees us with a telescope" with this
simple grammar produces a parse tree like that shown in figure
2. (In order to minimize the height of parse trees without needing to
use a graphical interface, PC-PATR actually draws parse trees like the
one shown in figure 3.) Parsing the similar sentence "we see the man
with a telescope" produces two different parses as shown in figure
4, correctly showing the ambiguity between whether we used a telescope
to see the man, or the man had a telescope when we saw him.
Figure 4. Parses of an ambiguous English sentence
                          S_1
                 __________|__________
               NP_2+               VP_4
                 |      _____________|_____________
               PR_3+ VerbalP_5+    NP_7        AdvP_11
                we       |        ___|____        |
                       V_6+    Det_8+  N_10+  PrepP_12+
                        see      |      man  _____|______
                               DT_9+      PP_13+     NP_14+
                                the        with     ____|_____
                                                 Det_15+   N_17+
                                                    |    telescope
                                                 DT_16+
                                                    a

                       S_18
                 _______|________
               NP_2+          VP_19
                 |      ________|________
               PR_3+ VerbalP_5+       NP_20
                we       |     _________|__________
                       V_6+  Det_8+   N_10+   PrepP_12+
                        see    |       man   _____|______
                             DT_9+        PP_13+     NP_14+
                              the          with     ____|_____
                                                 Det_15+   N_17+
                                                    |    telescope
                                                 DT_16+
                                                    a
A fundamental problem with context-free phrase structure grammars is
that they tend to grossly overgenerate. For example, the sample
grammar would incorrectly recognize the sentence "*he see the man with
a telescope", assigning it tree structures similar to those shown in
figure 4. With only the simple categories used by context-free phrase
structure rules, a very large number of rules are required to
accurately handle even a small subset of a language's grammar. This is
the primary motivation behind feature structures, the basic enhancement
of PATR-II over context-free phrase structure grammars.(1)
---------- Footnotes ----------
(1) Gazdar and Mellish (1989, pages 142-147) discuss why context-free
phrase structure grammars are inadequate to model some human languages.
The PATR-II formalism (unification of feature structures added to the
context-free phrase structure rules) is shown to be adequate for those
cases.
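The ambiguity and the overgeneration described above can be checked
with a small recognizer. The Python sketch below counts the derivations
that the grammar of figure 1 assigns to a sentence. It is a plain
top-down counter, not the left corner chart parser that PC-PATR uses.
The rule encoding (a trailing `?' marks an optional constituent, and
the VP rule is restated as two alternatives), the small lexicon, and
the function names are all assumptions made for this illustration.

```python
from collections import Counter
from functools import lru_cache

# Figure 1, restated: each alternative is a list of symbols, "?" = optional.
GRAMMAR = {
    "S": [["NP", "VP", "SubCl?"]],
    "NP": [["Det?", "AdjP?", "N", "PrepP?"], ["PR"]],
    "Det": [["DT"], ["PR"]],
    "VP": [["VerbalP", "NP?", "AdvP?"], ["VerbalP", "AdjP", "AdvP?"]],
    "VerbalP": [["V"], ["AuxP", "V"]],
    "AuxP": [["AUX", "AuxP?"]],
    "PrepP": [["PP", "NP"]],
    "AdjP": [["AV?", "AJ", "AdjP?"]],
    "AdvP": [["AV", "AdvP?"], ["PrepP", "AdvP?"]],
    "SubCl": [["CJ", "S"]],
}

# Hypothetical lexicon: word -> terminal category.
LEXICON = {"the": "DT", "a": "DT", "man": "N", "telescope": "N",
           "we": "PR", "us": "PR", "he": "PR",
           "see": "V", "sees": "V", "with": "PP"}


def count_parses(sentence):
    """Return the number of distinct S derivations covering the whole sentence."""
    words = tuple(sentence.split())

    @lru_cache(maxsize=None)
    def spans(symbol, start):
        # Result: pairs (end, n) meaning `symbol` covers words[start:end] in n ways.
        if symbol not in GRAMMAR:                     # terminal category
            if start < len(words) and LEXICON.get(words[start]) == symbol:
                return ((start + 1, 1),)
            return ()
        totals = Counter()
        for rhs in GRAMMAR[symbol]:
            partial = Counter({start: 1})             # positions reached so far
            for item in rhs:
                name = item.rstrip("?")
                # An optional constituent may be skipped, so keep the old positions.
                step = Counter(partial) if item.endswith("?") else Counter()
                for pos, ways in partial.items():
                    for end, more in spans(name, pos):
                        step[end] += ways * more
                partial = step
            totals.update(partial)
        return tuple(totals.items())

    return dict(spans("S", 0)).get(len(words), 0)


for text in ("the man sees us with a telescope",
             "we see the man with a telescope",
             "he see the man with a telescope"):
    print(count_parses(text), "parse(s):", text)
```

With categories alone, the ungrammatical third sentence is accepted
just like the second, because nothing in the rules distinguishes "he
see" from "we see".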
Feature structures
==================
The basic data structure of the PATR-II formalism is called a _feature
structure_. A feature structure contains one or more _features_. A
feature consists of an attribute name and a value. Feature structures
are commonly written as attribute-value matrices like this (example 1):
(1)  [ lex: telescope
       cat: N ]
where _lex_ and _cat_ are attribute names, and _telescope_ and _N_ are
the values for those attributes. Note that the feature structure is
enclosed in brackets. Each feature occurs on a separate line, with the
name coming first, followed by a colon and then its value. Feature
names and (simple) values are single words consisting of alphanumeric
characters.
Feature structures can have either simple values, such as the example
above, or complex values, such as this (example 2):
(2)  [ lex:      telescope
       cat:      N
       gloss:    `telescope
       head:     [ agr:    [ 3sg: + ]
                   number: SG
                   pos:    N
                   proper: -
                   verbal: - ]
       root_pos: N ]
where the value of the _head_ feature is another feature structure,
which itself contains an embedded feature structure. Feature structures
can be arbitrarily nested in this manner.
Portions of a feature structure can be referred to using the _path_
notation. A path is a sequence of one or more feature names enclosed
in angle brackets (`<>'). For instance, examples 3-5 would all be
valid feature paths based on the feature structure of example 2:
(3)  <head>
(4)  <head number>
(5)  <head agr 3sg>
Paths are used in feature templates and feature constraints, described
below.
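Following a path through a feature structure amounts to a sequence of
attribute lookups. The Python sketch below models the feature structure
of example 2 (without its gloss) as nested dictionaries. The dictionary
representation and the name `follow_path' are assumptions made for this
illustration; they say nothing about how PC-PATR stores features
internally.

```python
def follow_path(structure, path):
    """Return the value reached by `path`, or None if the path is undefined."""
    value = structure
    for name in path.strip("<>").split():
        if not isinstance(value, dict) or name not in value:
            return None
        value = value[name]
    return value


# Example 2 as nested dictionaries (gloss omitted).
TELESCOPE = {
    "lex": "telescope",
    "cat": "N",
    "head": {
        "agr": {"3sg": "+"},
        "number": "SG",
        "pos": "N",
        "proper": "-",
        "verbal": "-",
    },
    "root_pos": "N",
}

print(follow_path(TELESCOPE, "<head number>"))   # a simple value
print(follow_path(TELESCOPE, "<head agr>"))      # an embedded feature structure
print(follow_path(TELESCOPE, "<head case>"))     # undefined, so None
```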
Different features within a feature structure can share values. This
is not the same thing as two features having identical values. In
Example 6 below, the `<pred head agr>' and `<subj head agr>' features have
identical values, but in Example 7, they share the same value:
(6)  [ cat:  S
       pred: [ cat:  VP
               head: [ agr:    [ 3sg: + ]
                       finite: +
                       pos:    V
                       tense:  PAST
                       vform:  ED ] ]
       subj: [ cat:  NP
               head: [ agr:    [ 3sg: + ]
                       case:   NOM
                       number: SG
                       pos:    N
                       proper: -
                       verbal: - ] ] ]
(7)  [ cat:  S
       pred: [ cat:  VP
               head: [ agr:    $1[ 3sg: + ]
                       finite: +
                       pos:    V
                       tense:  PAST
                       vform:  ED ] ]
       subj: [ cat:  NP
               head: [ agr:    $1[ 3sg: + ]
                       case:   NOM
                       number: SG
                       pos:    N
                       proper: -
                       verbal: - ] ] ]
Shared values are indicated by the coindexing markers `$1', `$2', and
so on.
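The difference between identical and shared values resembles the
difference between equal objects and the same object in a programming
language. The Python sketch below contrasts the two cases using
abbreviated versions of examples 6 and 7. The nested dictionary
representation and the added `person' feature are assumptions made for
this illustration.

```python
# Identical values: two separate structures that happen to look alike.
identical = {
    "pred": {"head": {"agr": {"3sg": "+"}}},
    "subj": {"head": {"agr": {"3sg": "+"}}},
}

# Shared value: one structure reachable by two paths (the $1 of example 7).
agr = {"3sg": "+"}
shared = {
    "pred": {"head": {"agr": agr}},
    "subj": {"head": {"agr": agr}},
}


def is_shared(structure):
    """True if both agr paths lead to the very same value."""
    return structure["pred"]["head"]["agr"] is structure["subj"]["head"]["agr"]


# Add information by way of the subject path only.
identical["subj"]["head"]["agr"]["person"] = "third"
shared["subj"]["head"]["agr"]["person"] = "third"

print(is_shared(identical), "person" in identical["pred"]["head"]["agr"])  # False False
print(is_shared(shared), "person" in shared["pred"]["head"]["agr"])        # True True
```

With a shared value, information added through one path is visible
through the other. With merely identical values, it is not.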
Note that upper and lower case letters used in feature names and values
are distinctive. For example, _NUMBER_ is not the same as _Number_ or
_number_. (This is also true of the symbols used in the context-free
phrase structure rules.)
Unification
===========
_Unification_ is the basic operation applied to feature structures in
PC-PATR. It consists of the merging of the information from two
feature structures. Two feature structures can unify if their common
features have the same values, but do not unify if any feature values
conflict.
Consider the following feature structures:
(8)  [ agreement: [ number: singular
                    person: first ] ]
(9)  [ agreement: [ number: singular ]
       case:      nominative ]
(10) [ agreement: [ number: singular
                    person: third ] ]
(11) [ agreement: [ number: singular
                    person: first ]
       case:      nominative ]
(12) [ agreement: [ number: singular
                    person: third ]
       case:      nominative ]
Feature structure 9 can unify with either feature structure 8
(producing feature structure 11) or feature structure 10 (producing
feature structure 12). However, feature structure 8 cannot unify with
feature structure 10 due to the conflict in the values of their
`<agreement person>' features.
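The behavior of unification on feature structures 8 through 12 can be
reproduced with a few lines of Python. This is a minimal sketch that
models feature structures as nested dictionaries and returns a new
merged structure, or `None' on conflict. It ignores shared values and
is not the algorithm used inside PC-PATR; the function name `unify' and
the variable names are assumptions made for this illustration.

```python
def unify(a, b):
    """Return the unification of a and b, or None if any values conflict."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)                       # start from a copy of a
        for name, b_value in b.items():
            if name in result:                 # common feature: must unify
                merged = unify(result[name], b_value)
                if merged is None:
                    return None
                result[name] = merged
            else:                              # feature found only in b
                result[name] = b_value
        return result
    return a if a == b else None               # simple values must be equal


FS8 = {"agreement": {"number": "singular", "person": "first"}}
FS9 = {"agreement": {"number": "singular"}, "case": "nominative"}
FS10 = {"agreement": {"number": "singular", "person": "third"}}
FS11 = {"agreement": {"number": "singular", "person": "first"},
        "case": "nominative"}
FS12 = {"agreement": {"number": "singular", "person": "third"},
        "case": "nominative"}

print(unify(FS9, FS8) == FS11)    # True
print(unify(FS9, FS10) == FS12)   # True
print(unify(FS8, FS10))           # None: first conflicts with third
```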
Feature constraints
===================
The feature constraints associated with phrase structure rules in
PATR-II consist of a set of unification expressions (the _unification
constraints_). Each unification expression has three parts, in this
order:
1. a feature path, the first element of which is one of the symbols
from the phrase structure rule
2. an equal sign (`=')
3. either a simple value, or another feature path that also starts
with a symbol from the phrase structure rule
As an example, consider the following PC-PATR rules:
(13) Rule S -> NP VP (SubCl)
        <NP head agr> = <VP head agr>
        <NP head case> = NOM
        <S subj> = <NP>
        <S pred> = <VP>
(14) Rule NP -> {(Det) (AJ) N (PrepP)} / PR
= = =
Rule 13 has two feature constraints that limit the co-occurrence of NP
and VP, and two feature constraints that build the feature structures
for S. This highlights the dual purpose of feature constraints in
PC-PATR: limiting the co-occurrence of phrase structure elements and
constructing the feature structure for the element defined by a rule.
The first constraint states that the NP and VP `<head agr>' features
must unify successfully, and also modifies both of those features if
they do unify. The second constraint states that NP's `<head case>'
feature must either be equal to `NOM' or else be undefined. In the
latter case, it is set equal to `NOM'. The last two constraints create
a new feature structure for S from the feature structures for NP and VP.
Rule 14 illustrates another important point about feature unification
constraints: they are applied only if they involve the phrase
structure constituents actually found for the rule.
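Two of the behaviors just described can be sketched in Python: a
constraint with a simple value succeeds when the feature is undefined
(and defines it) or already equal, and a constraint is skipped when it
mentions a constituent that was not found. The sketch handles only
constraints whose right side is a simple value, and expects a path with
a symbol followed by at least one feature name. The dictionary
representation and the name `apply_atomic_constraint' are assumptions
made for this illustration.

```python
def apply_atomic_constraint(found, path, value):
    """Apply a constraint of the form <SYMBOL f1 f2 ...> = value.

    `found` maps each constituent actually found to its feature structure.
    Returns False only when the constraint applies and conflicts.
    """
    symbol, *names = path
    if symbol not in found:            # constituent absent: constraint not applied
        return True
    node = found[symbol]
    for name in names[:-1]:            # walk down, creating undefined levels
        node = node.setdefault(name, {})
        if not isinstance(node, dict):
            return False               # a simple value blocks the path
    last = names[-1]
    if last not in node:               # undefined: set it to the value
        node[last] = value
        return True
    return node[last] == value         # defined: must already be equal


found = {"NP": {"head": {"agr": {"3sg": "+"}}}, "VP": {"head": {}}}

print(apply_atomic_constraint(found, ["NP", "head", "case"], "NOM"))  # True
print(found["NP"]["head"]["case"])                                    # NOM
print(apply_atomic_constraint(found, ["NP", "head", "case"], "ACC"))  # False
print(apply_atomic_constraint(found, ["SubCl", "head", "x"], "y"))    # True (skipped)
```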
Figure 5. PC-PATR grammar of English subset
Rule S -> NP VP (SubCl)
= = NOM
= =
Rule NP -> {(Det) (AdjP) N (PrepP)} / PR
= = =
Rule Det -> DT / PR
= GEN
=
=
Rule VP -> VerbalP (NP / AdjP) (AdvP)
= ACC
= -
=
Rule VerbalP -> V
= +
=
Rule VerbalP -> AuxP V
= -
=
Rule AuxP -> AUX (AuxP_1)
=
Rule PrepP -> PP NP
= ACC
=
Rule AdjP -> (AV) AJ (AdjP_1)
Rule AdvP -> {AV / PrepP} (AdvP_1)
Rule SubCl -> CJ S
Figure 6. PC-PATR output with feature structure
1:
                       S
             __________|__________
            NP                   VP
         ___|____       _________|__________
        Det     N    VerbalP    NP       AdvP
         |     man      |        |         |
        DT              V       PR       PrepP
        the            saw      us    _____|______
                                      PP        NP
                                     with   ____|_____
                                           Det       N
                                            |    telescope
                                           DT
                                            a

S:
[ cat:   S
  pred:  [ cat:   VP
           head:  [ agr:   $1[ 3sg:   + ]
                    finite:+
                    pos:   V
                    tense: PAST
                    vform: ED ] ]
  subj:  [ cat:   NP
           head:  [ agr:   $1[ 3sg:   + ]
                    case:  NOM
                    number:SG
                    pos:   N
                    proper:-
                    verbal:- ] ] ]

1 parse found
Figure 5 shows the grammar of figure 1 augmented with a number of
feature constraints. With this grammar (and a suitable lexicon), the
parse output shown in figure 2 would include the sentence feature
structure, as shown in figure 6. Note that the `<pred head agr>' and
`<subj head agr>' features share a common value as a result of the
feature constraint unifications associated with the rule
`S -> NP VP (SubCl)'.
PC-PATR allows disjunctive feature unification constraints with its
phrase structure rules. Consider rules 15 and 16 below. These two
rules have the same phrase structure rule part. They can therefore be
collapsed into the single rule 17, which has a disjunction in its
unification constraints.
(15) Rule CP -> NP C' ; for wh questions with NP fronted
= +
= = = = none
= + ; root clauses
= +
= +
= none
= none
(16) Rule CP -> NP C' ; for wh questions with NP fronted
= +
= = = = none
= - ; non-root clauses
(17) Rule CP -> NP C' ; for wh questions with NP fronted
= +
= = = = none
{
= + ; root clauses
= +
= +
= none
= none
/
= - ; non-root clauses
}
Not only does PC-PATR allow disjunctive unification constraints, but it
also allows disjunctive phrase structure rules. Consider rule 18: it is
very similar to rule 17. These two rules can be further combined to
form rule 19, which has disjunctions in both its phrase structure rule
and its unification constraints.
(18) Rule CP -> PP C' ; for wh questions with PP fronted
= +
= = = = none
{
= + ; root clauses
= +
= +
= none
= none
/
= - ; non-root clauses
}
(19) ; for wh questions with NP or PP fronted
Rule CP -> { NP / PP } C'
= +
= = = +
= = = = none
{
= + ; root clauses
= +
= +
= none
= none
/
= - ; non-root clauses
}
Since the open brace (`{') introduces disjunctions both in the phrase
structure rule and in the unification constraints, care must be taken
to avoid confusing PC-PATR when it is loading the grammar file. The
end of the phrase structure rule, and the beginning of the unification
constraints, is signaled either by the first constraint beginning with
an open angle bracket (`<') or by a colon (`:'). If the first
constraint is part of a disjunction, then the phrase structure rule
must end with a colon. Otherwise, PC-PATR will treat the unification
constraint as part of the phrase structure rule, and will shortly
complain about syntax errors in the grammar file.
Note that disjunctions in phrase structure rules or unifications are
expanded when the grammar file is read. They serve only as a
convenience for the person writing the rules.
The lexicon
===========
The lexicon provides the basic elements (atoms) of the grammar, which
are usually words. Information like that shown in feature 2 is
provided for each lexicon entry. Unlike the original implementation of
PATR-II, PC-PATR stores the lexicon in a separate file from the grammar
rules. See `Lexicon File' below for details.
Running PC-PATR
***************
PC-PATR is an interactive program. It has a few command line options,
but it is controlled primarily by commands typed at the keyboard (or
loaded from a file previously prepared).
PC-PATR Command Line Options
============================
The PC-PATR program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-'). The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.
`-a filename'
loads the lexicon from an AMPLE analysis output file.
`-g filename'
loads the grammar from a PC-PATR grammar file.
`-l filename'
loads the lexicon from a PC-PATR lexicon file.
`-t filename'
opens a file containing one or more PC-PATR commands. See
`Interactive Commands' below.
The following options exist only in beta-test versions of the program,
since they are used only for debugging.
`-/'
increments the debugging level. The default is zero (no debugging
output).
`-z filename'
opens a file for recording a memory allocation log.
`-Z address,count'
traps the program at the point where `address' is allocated or
freed for the `count''th time.
Interactive Commands
====================
Each of the commands available in PC-PATR is described below. Each
command consists of one or more keywords followed by zero or more
arguments. Keywords may be abbreviated to the minimum length necessary
to prevent ambiguity.
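Abbreviating a keyword to its shortest unambiguous form can be modeled
as unique prefix matching. The Python sketch below is only an
illustration of that idea: the keyword list holds just the top-level
commands named in this part of the manual, the function name `resolve'
is hypothetical, and the documented short synonyms such as `l g' are
not modeled.

```python
# Top-level commands described in this section (the program may have more).
COMMANDS = ["cd", "clear", "close", "directory", "edit", "exit", "file",
            "help", "load", "log", "parse", "quit", "save", "set"]


def resolve(abbreviation, keywords=COMMANDS):
    """Return the single keyword that `abbreviation` stands for."""
    if abbreviation in keywords:               # an exact match always wins
        return abbreviation
    matches = [word for word in keywords if word.startswith(abbreviation)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown command: {abbreviation!r}")
    raise ValueError(f"ambiguous command {abbreviation!r}: {', '.join(matches)}")


print(resolve("p"))      # parse
print(resolve("cle"))    # clear
try:
    resolve("c")         # cd, clear, and close all start with "c"
except ValueError as error:
    print(error)
```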
cd
--
`cd' DIRECTORY changes the current directory to the one specified.
Spaces in the directory pathname are not permitted.
For MS-DOS or Windows, you can give a full path starting with the disk
letter and a colon (for example, `a:'); a path starting with `\' which
indicates a directory at the top level of the current disk; a path
starting with `..' which indicates the directory above the current one;
and so on. Directories are separated by the `\' character. (The
forward slash `/' works just as well as the backslash `\' for MS-DOS or
Windows.)
For the Macintosh, you can give a full path starting with the name of a
hard disk, a path starting with `:' which means the current folder, or
one starting `::' which means the folder containing the current one
(and so on).
For Unix, you can give a full path starting with a `/' (for example,
`/usr/pcpatr'); a path starting with `..' which indicates the directory
above the current one; and so on. Directories are separated by the `/'
character.
clear
-----
`clear' erases all existing grammar and lexicon information, allowing
the user to prepare to load information for a new language. Strictly
speaking, it is not needed since the `load grammar' command erases the
previously existing grammar, and the `load lexicon' and `load analysis'
commands erase any previously existing lexicon.
close
-----
`close' closes the current log file opened by a previous `log' command.
directory
---------
`directory' lists the contents of the current directory. This command
is available only for the MS-DOS and Unix implementations. It does not
exist for Microsoft Windows or the Macintosh.
edit
----
`edit' FILENAME attempts to edit the specified file using the program
indicated by the environment variable `EDITOR'. If this environment
variable is not defined, then `edlin' is used to edit the file on
MS-DOS, and `vi' is used to edit the file on Unix. (These defaults
should convince you to set this variable!) This command is not
available for Microsoft Windows or the Macintosh.
exit
----
`exit' stops PC-PATR, returning control to the operating system. This
is the same as `quit'.
file
----
The `file' commands process data from a file, optionally writing the
parse results to another file. Each of these commands is described
below.
file disambiguate
.................
`file disambiguate' INPUT.ANA [OUT.ANA] reads sentences from the
specified AMPLE analysis file and writes the corresponding parse trees
and feature structures either to the screen or to the optionally
specified output file. If the output file is written, ambiguous word
parses are eliminated as much as possible as a result of the sentence
parsing. When finished, a statistical report of successful (sentence)
parses is displayed on the screen.
file parse
..........
`file parse' INPUT-FILE [OUTPUT-FILE] reads sentences from the
specified input file, one per line, and writes the corresponding parse
trees and feature structures to the screen or to the optionally
specified output file. The comment character is in effect while
reading this file. PC-PATR currently makes no attempt to handle either
capitalization or punctuation. Some capability for handling
punctuation will probably be added at some point.
This command behaves the same as `parse' except that input comes from a
file rather than the keyboard, and output may go to a file rather than
the screen. When finished, a statistical report of successful parses
is displayed on the screen.
help
----
`help' COMMAND displays a description of the specified command. If
`help' is typed by itself, PC-PATR displays a list of commands with
short descriptions of each command.
load
----
The `load' commands all load information stored in specially formatted
files. The `load ample' and `load kimmo' commands activate
morphological parsers, and serve as alternatives to `load lexicon' (or
`load analysis') for obtaining the category and other feature
information for words. Each of the `load' commands is described below.
load ample control
..................
`load ample control' XXAD01.CTL XXANCD.TAB [XXORDC.TAB] erases any
existing AMPLE information (including dictionaries) and reads control
information from the specified files. This also erases any stored
PC-Kimmo information.
At least two and possibly three files are loaded by this command. The
first file is the AMPLE analysis data file. It has a default filetype
extension of `.ctl' but no default filename. The second file is the
AMPLE dictionary code table file. It has a default filetype extension
of `.tab' but no default filename. The third file is an optional
dictionary orthography change table. It has a default filetype
extension of `.tab' and no default filename.
`l am c' is a synonym for `load ample control'.
load ample dictionary
.....................
`load ample dictionary' [PREFIX.DIC] [INFIX.DIC] [SUFFIX.DIC] ROOT1.DIC [...]
or
`load ample dictionary' FILE01.DIC [FILE02.DIC ...] erases any
existing AMPLE dictionary information and reads the specified files.
This also erases any stored PC-Kimmo information.
The first form of the command is for using a dictionary whose files are
divided according to morpheme type (`set ample-dictionary split'). The
different types of dictionary files must be loaded in the order shown,
with any unneeded affix dictionaries omitted.
The second form of the command is for using a dictionary whose entries
contain the type of morpheme (`set ample-dictionary unified').(1)
`l am d' is a synonym for `load ample dictionary'.
---------- Footnotes ----------
(1) This is a new feature of AMPLE version 3.
load ample text-control
.......................
`load ample text-control' XXINTX.CTL erases any existing AMPLE text
input control information and reads the specified file. This also
erases any stored PC-Kimmo information.
The text input control file has a default filetype extension of `.ctl'
but no default filename.
`l am t' is a synonym for `load ample text-control'.
load analysis
.............
`load analysis' FILE1.ANA [FILE2.ANA ...] erases any existing lexicon
and reads a new lexicon from the specified AMPLE analysis file(s).
Note that more than one file may be loaded with the single
`load analysis' command: duplicate entries are not stored in the
lexicon.
The default filetype extension for `load analysis' is `.ana', and the
default filename is `ample.ana'.
`l a' is a synonym for `load analysis'.
load grammar
............
`load grammar' FILE.GRM erases any existing grammar and reads a new
grammar from the specified file.
The default filetype extension for `load grammar' is `.grm', and the
default filename is `grammar.grm'.
`l g' is a synonym for `load grammar'.
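The default filename and default filetype extension mentioned for the
`load' commands could work along the lines of the Python sketch below.
The manual does not spell out the exact rule, so this is an assumption:
the default filename is used when no argument is given, and the default
extension is appended only when the argument has no extension of its
own. The function name `resolve_filename' is hypothetical.

```python
import os.path


def resolve_filename(argument, default_name, default_ext):
    """Fill in a missing filename or a missing filetype extension."""
    if not argument:                       # nothing typed: use the default filename
        return default_name
    _, ext = os.path.splitext(argument)
    if ext:                                # an explicit extension is kept as typed
        return argument
    return argument + default_ext          # otherwise add the default extension


print(resolve_filename("", "grammar.grm", ".grm"))             # grammar.grm
print(resolve_filename("english", "grammar.grm", ".grm"))      # english.grm
print(resolve_filename("english.txt", "grammar.grm", ".grm"))  # english.txt
```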
load kimmo grammar
..................
`load kimmo grammar' FILE.GRM erases any existing PC-Kimmo (word)
grammar and reads a new word grammar from the specified file.
The default filetype extension for `load kimmo grammar' is `.grm', and
the default filename is `grammar.grm'.
`l k g' is a synonym for `load kimmo grammar'.
load kimmo lexicon
..................
`load kimmo lexicon' FILE.LEX erases any existing PC-Kimmo lexicon
information and reads a new morpheme lexicon from the specified file.
A PC-Kimmo rules file must be loaded before a PC-Kimmo lexicon file can
be loaded.
The default filetype extension for `load kimmo lexicon' is `.lex', and
the default filename is `lexicon.lex'.
`l k l' is a synonym for `load kimmo lexicon'.
load kimmo rules
................
`load kimmo rules' FILE.RUL erases any existing PC-Kimmo rules and
reads a new set of rules from the specified file. This also erases any
stored AMPLE information.
The default filetype extension for `load kimmo rules' is `.rul', and
the default filename is `rules.rul'.
`l k r' is a synonym for `load kimmo rules'.
load lexicon
............
`load lexicon' FILE1.LEX [FILE2.LEX ...] erases any existing lexicon
and reads a new lexicon from the specified file(s). Note that more
than one file may be loaded with a single `load lexicon' command.
The default filetype extension for `load lexicon' is `.lex', and the
default filename is `lexicon.lex'.
`l l' is a synonym for `load lexicon'.
log
---
`log' [FILE.LOG] opens a log file. Each item processed by a `parse'
command is stored to the log file as well as being displayed on the
screen.
If a filename is given on the same line as the `log' command, then that
file is used for the log file. Any previously existing file with the
same name will be overwritten. If no filename is provided, then the
file `pcpatr.log' in the current directory is used for the log file.
Use `close' to stop recording in a log file. If a `log' command is
given when a log file is already open, then the earlier log file is
closed before the new log file is opened.
parse
-----
`parse' [SENTENCE OR PHRASE] attempts to parse the input sentence
according to the loaded grammar. If a sentence is typed on the same
line as the command, then that sentence is parsed. If the `parse'
command is given by itself, then the user is prompted repeatedly for
sentences to parse. This cycle of typing and parsing is terminated by
typing an empty "sentence" (that is, nothing but the `Enter' or `Return'
key).
Both the grammar and the lexicon must be loaded before using this
command.
quit
----
`quit' stops PC-PATR, returning control to the operating system. This
is the same as `exit'.
save
----
The `save' commands write information stored in memory to a file
suitable for reloading into PC-PATR later. Each of these commands is
described below.
save lexicon
............
`save lexicon' FILE.LEX writes the current lexicon contents to the
designated file. The output lexicon file must be specified. This can
be useful if you are using a morphological parser to populate the
lexicon.
save status
...........
`save status' [FILE.TAK] writes the current settings to the designated
file in the form of PC-PATR commands. If the file is not specified,
the settings are written to `pcpatr.tak' in the current directory.
set
---
The `set' commands control program behavior by setting internal program
variables. Each of these commands (and variables) is described below.
set ambiguities
...............
`set ambiguities' NUMBER limits the number of analyses printed to the
given number. The default value is 10. Note that this does not limit
the number of analyses produced, just the number printed.
set ample-dictionary
....................
`set ample-dictionary' VALUE determines whether or not the AMPLE
dictionary files are divided according to morpheme type.
`set ample-dictionary split' declares that the AMPLE dictionary is
divided into a prefix dictionary file, an infix dictionary file, a
suffix dictionary file, and one or more root dictionary files. The
existence of the three affix dictionaries depends on settings in the
AMPLE analysis data file. If they exist, the `load ample dictionary'
command requires that they be given in this relative order: prefix,
infix, suffix, root(s).
`set ample-dictionary unified' declares that any of the AMPLE
dictionary files may contain any type of morpheme. This implies that
each dictionary entry may contain a field specifying the type of
morpheme (the default is ROOT), and that the dictionary code table
contains a `\unified' field. One of the changes listed under
`\unified' must convert a backslash code to `T'.
The default is for the AMPLE dictionary to be _split_.(1)
---------- Footnotes ----------
(1) The unified dictionary is a new feature of AMPLE version 3.
set check-cycles
................
`set check-cycles' VALUE enables or disables a check to prevent cycles
in the parse chart. `set check-cycles on' turns on this check, and
`set check-cycles off' turns it off. This check slows down the parsing
of a sentence, but it makes the parser less vulnerable to hanging on
perverse grammars. The default setting is `on'.
set comment
...........
`set comment' CHARACTER sets the comment character to the indicated
value. If CHARACTER is missing (or equal to the current comment
character), then comment handling is disabled. The default comment
character is `;' (semicolon).
set failures
............
`set failures' VALUE enables or disables _grammar failure mode_.
`set failures on' turns on grammar failure mode, and `set failures off'
turns it off. When grammar failure mode is on, the partial results of
forms that fail the grammar module are displayed. A form may fail the
grammar either by failing the feature constraints or by failing the
constituent structure rules. In the latter case, a partial tree (bush)
will be returned. The default setting is `off'.
Be careful with this option. Setting failures to `on' can cause
PC-PATR to go into an infinite loop for certain recursive grammars and
certain input sentences. We may try to do something to detect this
type of behavior, at least partially.
set features
............
`set features' VALUE determines how features will be displayed.
`set features all' enables the display of the features for all nodes of
the parse tree.
`set features top' enables the display of the feature structure for
only the top node of the parse tree. This is the default setting.
`set features flat' causes features to be displayed in a flat, linear
string that uses less space on the screen.
`set features full' causes features to be displayed in an indented form
that makes the embedded structure of the feature set clear. This is
the default setting.
`set features on' turns on features display mode, allowing features to
be shown. This is the default setting.
`set features off' turns off features display mode, preventing features
from being shown.
set final-punctuation
.....................
`set final-punctuation' VALUE defines the set of characters used to
mark the ends of sentences. The individual characters must be
separated by spaces so that digraphs and trigraphs can be used, not
just single character units. The default is `. ! ? : ;'.
This variable setting affects only the `file disambiguate' command.
set gloss
.........
`set gloss' VALUE enables the display of glosses in the parse tree
output if VALUE is `on', and disables the display of glosses if VALUE is
`off'. If any glosses exist in the lexicon file, then `gloss' is
automatically turned `on' when the lexicon is loaded. If no glosses
exist in the lexicon, then this flag is ignored.
set kimmo check-cycles
......................
`set kimmo check-cycles' VALUE enables or disables a check to prevent
cycles in a word parse chart created by the embedded PC-Kimmo
morphological parser. `set kimmo check-cycles on' turns on this check,
and `set kimmo check-cycles off' turns it off. This check slows down
the parsing of a sentence, but it makes the parser less vulnerable to
hanging on perverse grammars. The default setting is `on'.
set kimmo promote-defaults
..........................
`set kimmo promote-defaults' VALUE controls whether default atomic
values in the feature structures loaded from the lexicon are "promoted"
to ordinary atomic values before parsing a word with the embedded
PC-Kimmo morphological parser. `set kimmo promote-defaults on' turns
on this behavior, and `set kimmo promote-defaults off' turns it off.
The default setting is `on'. (It is arguable that this is the wrong
choice for the default, but this has been the behavior since the
program was first written.)
set kimmo top-down-filter
.........................
`set kimmo top-down-filter' VALUE enables or disables top-down
filtering in the embedded PC-Kimmo morphological parser, based on the
morpheme categories. `set kimmo top-down-filter on' turns on this
filtering, and `set kimmo top-down-filter off' turns it off. The
top-down filter speeds up the parsing of a sentence, but might cause
the parser to miss some valid parses. The default setting is `on'.
This should not be required in the final version of PC-PATR.
set limit
.........
`set limit' NUMBER sets the time limit (in seconds) for parsing a
sentence. Its argument is a number greater than or equal to zero,
which is the maximum number of seconds that a parse is allowed before
being cancelled. The default value is `0', which has the special
meaning that no time limit is imposed.
NOTE: this feature is new and still somewhat experimental. It may not
be fully debugged, and may cause unforeseen side effects such as program
crashes some time after one or more parses are cancelled due to
exceeding the set time limit.
set marker category
...................
`set marker category' MARKER establishes the marker for the field
containing the category (part of speech) feature. The default is `\c'.
set marker features
...................
`set marker features' MARKER establishes the marker for the field
containing miscellaneous features. (This field is not needed for many
words.) The default is `\f'.
set marker gloss
................
`set marker gloss' MARKER establishes the marker for the field
containing the word gloss. The default is `\g'.
set marker record
.................
`set marker record' MARKER establishes the field marker that begins a
new record in the lexicon file. This may or may not be the same as the
`word' marker. The default is `\w'.
set marker rootgloss
....................
`set marker rootgloss' MARKER establishes the marker for the field
containing the word rootgloss. The default is `\r'. The word's root
gloss may be useful for handling syntactic constructions such as verb
reduplication. One can write a unification constraint that ensures
that the rootgloss unifies between two successive lexical
items/terminal symbols. Note that this does not work when using Kimmo
to parse words.
set marker word
...............
`set marker word' MARKER establishes the marker for the word field.
The default is `\w'.
set promote-defaults
....................
`set promote-defaults' VALUE controls whether default atomic values in
the feature structures loaded from the lexicon are "promoted" to
ordinary atomic values before parsing a sentence.
`set promote-defaults on' turns on this behavior, and
`set promote-defaults off' turns it off. (This can affect feature
unification since a conflicting default value does not cause a failure:
the default value merely disappears.) The default setting is `on'.
(It is arguable that this is the wrong choice for the default, but this
has been the behavior since the program was first written.)
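The effect of promotion on unification can be sketched in Python. This
is a simplified model under stated assumptions, not the program's
implementation: an atomic value is represented as a hypothetical
`(value, is_default)' pair, and the case of two conflicting defaults is
deliberately left out because the text above does not describe it.

```python
class Conflict(Exception):
    """Raised when two ordinary atomic values differ."""


def promote(atom):
    """Turn a default atomic value into an ordinary one."""
    value, _is_default = atom
    return (value, False)


def unify_atoms(left, right):
    """Unify two (value, is_default) pairs.

    Assumption: a default that conflicts with an ordinary value simply
    disappears, while two different ordinary values cause a failure.
    """
    (lval, ldef), (rval, rdef) = left, right
    if lval == rval:
        return (lval, ldef and rdef)
    if ldef and not rdef:
        return right
    if rdef and not ldef:
        return left
    if ldef and rdef:
        raise NotImplementedError("two conflicting defaults: not modeled here")
    raise Conflict(f"{lval!r} conflicts with {rval!r}")


default_sg = ("sg", True)
ordinary_pl = ("pl", False)
print(unify_atoms(default_sg, ordinary_pl))   # the default gives way
```

With promotion, `unify_atoms(promote(default_sg), ordinary_pl)' raises
`Conflict' instead, which is the difference this setting controls.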
set property-is-feature
.......................
`set property-is-feature' VALUE controls whether the values in the
AMPLE analysis `\p' (property) field are to be interpreted as feature
template names, the same as the values in the AMPLE analysis `\fd'
(feature descriptor) field. `set property-is-feature on' turns on this
behavior, and `set property-is-feature off' turns it off. The default
setting is `off'. (It is arguable that this is the wrong choice for
the default, but this has been the behavior since the program was first
written.)
set rootgloss
.............
`set rootgloss' VALUE specifies if root glosses should be treated as a
lexical feature and, if so, which root(s) in compound roots are used.
The word's root gloss may be useful for handling syntactic
constructions such as verb reduplication. Note that this does not work
when using Kimmo to parse words.
`set rootgloss off' turns off the use of the root gloss feature. This
is the default setting.
`set rootgloss on' turns on the use of the root gloss feature. This
value should be used when using a word lexicon (i.e. when using the
`load lexicon file' command). Note that it must be set before one
loads the lexicon file (otherwise, no root glosses will be loaded).
`set rootgloss leftheaded' turns on the use of the root gloss feature
and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the leftmost root in compound roots
will be used as the root gloss feature value.
`set rootgloss rightheaded' turns on the use of the root gloss feature
and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the rightmost root in compound roots
will be used as the root gloss feature value.
`set rootgloss all' turns on the use of the root gloss feature and, if
one is either disambiguating an ANA file or using AMPLE to parse the
words in a sentence, every root gloss in compound roots will be used as
the root gloss feature value.
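The three settings that apply to compound roots differ only in which
root glosses are kept. A minimal Python sketch follows; the function
name and the list representation of the glosses are assumptions made for
illustration, and how PC-PATR stores several glosses in one feature
value is not shown.

```python
def select_root_glosses(root_glosses, mode):
    """Pick the root glosses used for a compound root.

    root_glosses lists the glosses of the roots from left to right.
    """
    if mode == "leftheaded":
        return root_glosses[:1]
    if mode == "rightheaded":
        return root_glosses[-1:]
    if mode == "all":
        return list(root_glosses)
    raise ValueError("unsupported mode: " + mode)


glosses = ["black", "bird"]   # hypothetical compound with two roots
for mode in ("leftheaded", "rightheaded", "all"):
    print(mode, select_root_glosses(glosses, mode))
```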
set timing
..........
`set timing' VALUE enables timing mode if VALUE is `on', and disables
timing mode if VALUE is `off'. If timing mode is `on', then the
elapsed time required to process a command is displayed when the
command finishes. If timing mode is `off', then the elapsed time is
not shown. The default is `off'. (This option is useful only to
satisfy idle curiosity.)
set top-down-filter
...................
`set top-down-filter' VALUE enables or disables top-down filtering
based on the categories. `set top-down-filter on' turns on this
filtering, and `set top-down-filter off' turns it off. The top-down
filter speeds up the parsing of a sentence, but might cause the parser
to miss some valid parses. The default setting is `on'.
This should not be required in the final version of PC-PATR.
set tree
........
`set tree' VALUE specifies how parse trees should be displayed.
`set tree full' turns on the parse tree display, displaying the result
of the parse as a full tree. This is the default setting. A short
sentence would look something like this:
        Sentence_1
            |
      Declarative_2
       _____|_____
     NP_3        VP_5
      |       ___|____
     N_4     V_6    COMP_7
     cows    eat      |
                     NP_8
                      |
                     N_9
                     grass
`set tree flat' turns on the parse tree display, displaying the result
of the parse as a flat tree structure in the form of a bracketed
string. The same short sentence would look something like this:
(Sentence_1 (Declarative_2 (NP_3 (N_4 cows))(VP_5 (V_6 eat)(COMP_7
(NP_8 (N_9 grass))))))
`set tree indented' turns on the parse tree display, displaying the
result of the parse in an indented format sometimes called a _northwest
tree_. The same short sentence would look like this:
Sentence_1
  Declarative_2
    NP_3
      N_4 cows
    VP_5
      V_6 eat
      COMP_7
        NP_8
          N_9 grass
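The flat and indented displays carry the same information. As an
illustration only, the following Python sketch reads a bracketed string
of the kind produced by `set tree flat' and prints it in indented form.
The function names and the two-space indentation step are assumptions,
and the sketch does no error checking on malformed input.

```python
import re


def parse_flat_tree(text):
    """Parse a bracketed tree string into (label, word, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        word, children = None, []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                word = tokens[pos]      # the word under a lexical node
                pos += 1
        pos += 1                        # consume the closing parenthesis
        return (label, word, children)

    return node()


def indented(tree, depth=0, step=2):
    """Return the lines of an indented display of the tree."""
    label, word, children = tree
    lines = [" " * (depth * step) + label + (" " + word if word else "")]
    for child in children:
        lines.extend(indented(child, depth + 1, step))
    return lines


flat = ("(Sentence_1 (Declarative_2 (NP_3 (N_4 cows))"
        "(VP_5 (V_6 eat)(COMP_7 (NP_8 (N_9 grass))))))")
print("\n".join(indented(parse_flat_tree(flat))))
```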
`set tree xml' turns on the parse tree display, displaying the result
of the parse in an XML format. The same short sentence would look like
this:
SentenceDeclarativeNPNcowsNcowscows
... (35 lines omitted)
`set tree off' disables the display of parse trees altogether.
set trim-empty-features
.......................
`set trim-empty-features' VALUE disables the display of empty feature
values if VALUE is `on', and enables the display of empty feature
values if VALUE is `off'. The default is not to display empty feature
values.
set unification
...............
`set unification' VALUE enables or disables feature unification.
`set unification on' turns on unification mode. This is the default
setting.
`set unification off' turns off feature unification in the grammar.
Only the context-free phrase structure rules are used to guide the
parse; the feature constraints are ignored. This can be dangerous, as
it is easy to introduce infinite cycles in recursive phrase structure
rules.
set verbose
...........
`set verbose' VALUE enables or disables the screen display of parse
trees in the `file parse' command. `set verbose on' enables the screen
display of parse trees, and `set verbose off' disables such display.
The default setting is `off'.
set warnings
............
`set warnings' VALUE enables warning mode if VALUE is `on', and disables
warning mode if VALUE is `off'. If warning mode is enabled, then
warning messages are displayed on the output. If warning mode is
disabled, then no warning messages are displayed. The default setting
is `on'.
set write-ample-parses
......................
`set write-ample-parses' VALUE enables writing `\parse' and `\features'
fields at the end of each sentence in the disambiguated analysis file
if VALUE is `on', and disables writing these fields if VALUE is `off'.
The default setting is `off'.
This variable setting affects only the `file disambiguate' command.
show
----
The `show' commands display internal settings on the screen. Each of
these commands is described below.
show lexicon
............
`show lexicon' prints the contents of the lexicon stored in memory on
the standard output. THIS IS NOT VERY USEFUL, AND MAY BE REMOVED.
show status
...........
`show status' displays the names of the current grammar, sentences, and
log files, and the values of the switches established by the `set'
command.
`show' (by itself) and `status' are synonyms for `show status'.
status
------
`status' displays the names of the current grammar, sentences, and log
files, and the values of the switches established by the `set' command.
system
------
`system' [COMMAND] allows the user to execute an operating system
command (such as checking the available space on a disk) from within
PC-PATR. This is available only for MS-DOS and Unix, not for Microsoft
Windows or the Macintosh.
If no system-level command is given on the line with the `system'
command, then PC-PATR is pushed into the background and a new system
command processor (shell) is started. Control is usually returned to
PC-PATR in this case by typing `exit' as the operating system command.
`!' (exclamation point) is a synonym for `system'.
take
----
`take' [FILE.TAK] redirects command input to the specified file.
The default filetype extension for `take' is `.tak', and the default
filename is `pcpatr.tak'.
`take' files can be nested three deep. That is, the user types
`take file1', `file1' contains the command `take file2', and `file2'
has the command `take file3'. It would be an error for `file3' to
contain a `take' command. This should not prove to be a serious
limitation.
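The nesting limit can be pictured with a small Python sketch. It is a
model only: the `take' files are represented by a hypothetical
dictionary that maps file names to lists of command lines, nothing is
read from disk, and default filetype extensions are not handled.

```python
MAX_TAKE_DEPTH = 3   # `take' files can be nested three deep


def run_take(name, files, depth=1, executed=None):
    """Collect the commands reached from files[name], following nested takes."""
    if executed is None:
        executed = []
    if depth > MAX_TAKE_DEPTH:
        raise RuntimeError("take files nested too deeply: " + name)
    for command in files[name]:
        words = command.split()
        if words and words[0] == "take":
            target = words[1] if len(words) > 1 else "pcpatr.tak"
            run_take(target, files, depth + 1, executed)
        else:
            executed.append(command)
    return executed


files = {
    "file1": ["set timing on", "take file2"],
    "file2": ["take file3"],
    "file3": ["set gloss off"],
}
print(run_take("file1", files))
```

If `file3' in this example also contained a `take' command, the sketch
would raise an error at that point, which mirrors the restriction
described above.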
A `take' file can also be specified by using the `-t' command line
option when starting PC-PATR. When started, PC-PATR looks for a `take'
file named `pcpatr.tak' in the current directory to initialize itself
with.
The PC-PATR Grammar File
************************
The following specifications apply generally to the grammar file:
* Blank lines, spaces, and tabs separate elements of the grammar
file from one another, but are ignored otherwise.
* The comment character declared by the `set comment' command (see
`set comment' above) is operative in the grammar file. The default
comment character is the semicolon (`;'). Comments may be placed
anywhere in the grammar file. Everything following a comment
character to the end of the line is ignored.
* A grammar file is divided into fields identified by a small set of
keywords.
1. `Rule' starts a context-free phrase structure rule with its
set of feature constraints. These rules define how words
join together to form phrases, clauses, or sentences. The
lexicon and grammar are tied together by using the lexical
categories as the terminal symbols of the phrase structure
rules and by using the other lexical features in the feature
constraints.
2. `Let' starts a feature template definition. Feature
templates are used as macros (abbreviations) in the lexicon.
They may also be used to assign default feature structures to
the categories.
3. `Parameter' starts a program parameter definition. These
parameters control various aspects of the program.
4. `Define' starts a lexical rule definition. As noted in
Shieber (1985), something more powerful than just
abbreviations for common feature elements is sometimes needed
to represent systematic relationships among the elements of a
lexicon. This need is met by lexical rules, which express
transformations rather than mere abbreviations. Lexical
rules serve two primary purposes in PC-PATR: modifying the
feature structures associated with lexicon entries to produce
additional lexicon entries, and modifying the feature
structures produced by a morphological parser to fit the
syntactic grammar description.
5. `Constraint' starts a constraint template definition.
Constraint templates are used as macros (abbreviations) in
the grammar file.
6. `Lexicon' starts a lexicon section. This is only for
compatibility with the original PATR-II. The section name is
skipped over properly, but nothing is done with it.
7. `Word' starts an entry in the lexicon. This is only for
compatibility with the original PATR-II. The entry is skipped
over properly, but nothing is done with it.(1)
8. `End' effectively terminates the file. Anything following
this keyword is ignored.
9. `Comment' starts a comment field. The rest of the line
following the keyword is skipped over, and everything in
following lines until the next keyword is also ignored. If
you must use a keyword (other than `comment') verbatim in one
of the extra lines of a comment, put a comment character at
the beginning of the line containing the keyword.
Note that these keywords are not case sensitive: `RULE' is the
same as `rule', and both are the same as `Rule'. Also, in order
to facilitate interaction with the `Shoebox' program, any of the
keywords may begin with a backslash `\' character. For example,
`\Rule' and `\rule' are both acceptable alternatives to `RULE' or
`rule'. The abbreviated form `\co' is a special synonym for
`comment' or `\comment'. Note that there is no requirement that
these keywords appear at the beginning of a line.
* Except for `comment', each of the fields in the grammar file may
optionally end with a period. If there is no period, the next
keyword (in an appropriate slot) marks the end of one field and
the beginning of the next.
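The keyword conventions described above (case insensitivity, the
optional leading backslash, and the `\co' abbreviation) can be
summarized in a short Python sketch. The function name is hypothetical,
and treating `\CO' the same as `\co' is an assumption based on the
general case insensitivity of the keywords.

```python
KEYWORDS = {"rule", "let", "parameter", "define", "constraint",
            "lexicon", "word", "end", "comment"}


def normalize_keyword(token):
    """Return the canonical lower-case keyword for a token, or None."""
    name = token.lower()
    if name.startswith("\\"):
        name = name[1:]
        if name == "co":            # `\co' abbreviates `comment'
            return "comment"
    return name if name in KEYWORDS else None


for token in ("RULE", "\\Rule", "\\co", "co", "NP"):
    print(token, normalize_keyword(token))
```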
---------- Footnotes ----------
(1) Would this be a useful enhancement to PC-PATR?
Rules
=====
A PC-PATR grammar rule has these parts, in the order listed:
1. the keyword `Rule'
2. an optional rule identifier enclosed in braces (`{}')
3. a phrase structure rule consisting of the following:
a. the nonterminal symbol to be expanded
b. an arrow (`->') or equal sign (`=')
c. zero or more terminal or nonterminal symbols, possibly marked
for alternation or optionality
4. an optional colon (`:')
5. zero or more unification constraints
6. zero or more priority union operations
7. zero or more logical constraint operations
8. an optional period (`.')
The optional rule identifier consists of one or more words enclosed in
braces. Its current utility is only as a special form of comment
describing the intent of the rule. (Eventually it may be used as a tag
for interactively adding and removing rules.) The only limits on the
rule identifier are that it not contain the comment character and that
it appear entirely on one line in the grammar file.
The terminal and nonterminal symbols in the rule have the following
characteristics:
* Upper and lower case letters used in symbols are considered
different. For example, `NOUN' is not the same as `Noun', and
neither is the same as `noun'.
* The symbol `X' (capital letter x) may be used to stand for any
terminal or nonterminal. For example, this rule says that any
category in the grammar rules can be replaced by two copies of the
same category separated by a CJ.
Rule X -> X_1 CJ X_2
= = = =
The symbol X can be useful for capturing generalities. Care must
be taken, since it can be replaced by anything.
* Index numbers are used to distinguish instances of a symbol that
is used more than once in a rule. They are added to the end of a
symbol following an underscore character (`_'). This is
illustrated in the rule for X above.
* The characters `(){}[]<>=:/' cannot be used in terminal or
nonterminal symbols since they are used for special purposes in the
grammar file. The character `_' can be used _only_ for attaching
an index number to a symbol.
* By default, the left hand symbol of the first rule in the grammar
file is the start symbol of the grammar.
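The convention for index numbers can be illustrated with a small Python
sketch that separates a symbol from its index. The function name is
hypothetical; the sketch simply rejects any other use of the underscore,
in line with the restriction stated above.

```python
def split_symbol(symbol):
    """Split a rule symbol into (category, index); index is None if absent."""
    base, sep, index = symbol.partition("_")
    if not sep:
        return (symbol, None)
    if not base or not index.isdigit():
        raise ValueError("`_' may only attach an index number: " + symbol)
    return (base, int(index))


print(split_symbol("X_1"))    # indexed instance of X
print(split_symbol("CJ"))     # no index
print(split_symbol("NOUN") == split_symbol("Noun"))   # case matters
```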
The symbols on the right hand side of a phrase structure rule may be
marked or grouped in various ways:
* Parentheses around an element of the expansion (right hand) part
of a rule indicate that the element is optional. Parentheses may
be placed around multiple elements. This makes an optional group
of elements.
* A forward slash (`/') is used to separate alternative elements of the
expansion (right hand) part of a rule.
* Curly braces can be used for grouping alternative elements. For
example the following says that an S consists of an NP followed by
either a TVP or an IV:
Rule S -> NP {TVP / IV}
* Alternatives are taken to be as long as possible. Thus if the curly
braces were omitted from the rule above, as in the rule below, the
TVP would be treated as part of the alternative containing the NP.
It would not be allowed before the IV.
Rule S -> NP TVP / IV
* Parentheses group enclosed elements the same as curly braces do.
Alternatives and groups delimited by parentheses or curly braces
may be nested to any depth.
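One way to check one's reading of a right hand side is to expand it into
the plain sequences of symbols that it allows. The following Python
sketch does this for parentheses, curly braces, and the slash. It
illustrates the notation only and is not the way PC-PATR itself
processes rules; the function name is hypothetical, and malformed input
is not diagnosed.

```python
import re
from itertools import product


def expand(rhs):
    """Expand the right hand side of a rule into plain symbol sequences."""
    tokens = re.findall(r"[(){}/]|[^\s(){}/]+", rhs)
    pos = 0

    def alternatives(closer):
        # alternatives ::= sequence ('/' sequence)*
        nonlocal pos
        results = sequence(closer)
        while pos < len(tokens) and tokens[pos] == "/":
            pos += 1
            results += sequence(closer)
        return results

    def sequence(closer):
        # A sequence runs up to a slash, the expected closer, or the end,
        # so each alternative is as long as possible.
        nonlocal pos
        results = [[]]
        while pos < len(tokens) and tokens[pos] not in ("/", closer):
            token = tokens[pos]
            pos += 1
            if token == "(":
                options = alternatives(")") + [[]]   # group may be omitted
                pos += 1                             # consume ")"
            elif token == "{":
                options = alternatives("}")
                pos += 1                             # consume "}"
            else:
                options = [[token]]
            results = [left + right for left, right in product(results, options)]
        return results

    return alternatives(None)


print(expand("NP {TVP / IV}"))
print(expand("NP TVP / IV"))
```

The two calls correspond to the two S rules above: with the braces the
NP is shared by both alternatives, and without them the NP belongs only
to the alternative containing the TVP.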
The phrase structure rule can be followed by zero or more _unification
constraints_ that refer to symbols used in the rule. A unification
constraint has these parts, in the order listed:
1. a feature path that begins with one of the symbols from the phrase
structure rule
2. an equal sign
3. either another path or a value
A unification constraint that refers only to symbols on the right hand
side of the rule constrains their co-occurrence. In the following rule
and constraint, the values of the _agr_ features for the NP and VP
nodes of the parse tree must unify:
Rule S -> NP VP
        <NP agr> = <VP agr>
If a unification constraint refers to a symbol on the right hand side of
the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value. In the following
rule and constraint, the _head case_ feature for the NP node of the
parse tree must either be originally undefined or equal to NOM:
Rule S -> NP VP
        <NP head case> = NOM
(After unification succeeds, the _head case_ feature for the NP node of
the parse tree will be equal to NOM.)
A unification constraint that refers to the symbol on the left hand
side of the rule passes information up the parse tree. In the
following rule and constraint, the value of the _tense_ feature is
passed from the VP node up to the S node:
Rule S -> NP VP
        <S tense> = <VP tense>
See `Feature constraints' above for more details about unification
constraints.
The phrase structure rule can also be followed by zero or more
_priority union operations_ that refer to symbols used in the rule. A
priority union operation has these parts, in the order listed:
1. a feature path that begins with one of the symbols from the phrase
structure rule
2. a priority union operation sign (`<=')
3. either another path or an atomic value
Although priority union operations may be intermingled with unification
constraints following the phrase structure rule, they are applied only
after all unification constraints have succeeded. Therefore, it makes
more sense to place them after all of the unification constraints as a
reminder of the order of application.
Priority union operations may not appear inside a disjunction: if two
rules logically differ only in the application of one priority union or
another, both rules must be written out in full.
The phrase structure rule can also be followed by zero or more _logical
constraint operations_ that refer to symbols used in the rule. A
logical constraint operation has these parts, in the order listed:
1. a feature path that begins with one of the symbols from the phrase
structure rule
2. a logical constraint operation sign (`==')
3. a logical constraint expression, or a constraint template label
Although logical constraint operations may be intermingled with
unification constraints or priority union operations following the
phrase structure rule, they are applied only after all unification
constraints have succeeded and all priority union operations have been
applied. Therefore, it makes more sense to place them after all of the
unification constraints, and after any priority union operations, as a
reminder of the order of application.
Logical constraint operations may not appear inside a disjunction: if
two rules logically differ only in the application of one logical
constraint or another, both rules must be written out in full.
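The order of application described above can be sketched in Python:
whatever order the three kinds of operation are written in, unification
constraints come first, then priority unions, then logical constraints,
with the written order preserved within each kind. The function names,
the feature paths, and the template label `SomeTemplate' in the example
are all hypothetical.

```python
ORDER = {"=": 0, "<=": 1, "==": 2}


def operator_of(line):
    """Return the operator that follows the first feature path on the line."""
    rest = line[line.index(">") + 1:].lstrip()
    for op in ("<=", "==", "="):      # test the two-character operators first
        if rest.startswith(op):
            return op
    raise ValueError("no operator found: " + line)


def application_order(lines):
    """Stable sort: unification, then priority union, then logical constraints."""
    return sorted(lines, key=lambda line: ORDER[operator_of(line)])


written = [
    "<Stem a> <= <Deriv a>",
    "<Stem b> = <Root b>",
    "<Stem c> == SomeTemplate",
    "<Root d> = <Deriv d>",
]
for line in application_order(written):
    print(line)
```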
These last two elements of a PC-PATR rule are enhancements to the
original PATR-II formalism. For this reason, they are discussed in more
detail in the following two sections.
Priority union operations
-------------------------
Unification is the only mechanism implemented in the original PATR-II
formalism for merging two feature structures. There are situations
where the desired percolation of information is not easily expressed in
terms of unification. For example, consider the following rule (where
_ms_ stands for _morphosyntactic features_):
Stem -> Root Deriv:
= = =
The first unification expression above imposes the agreement constraints
for this rule. The second and third unification expressions attempt to
provide the percolation of information up to the `Stem'. However, it
is quite possible for there to be a conflict between `' and
`'. Any such conflict would cause the third unification
expression to fail, causing the rule as a whole to fail. The only way
around this at present is to provide a large number of unification
expressions that go into greater depth in the feature structures. Even
then it may not be possible to always avoid conflicts.
An additional mechanism for merging feature structures is provided to
properly handle percolation of information: overwriting via priority
union. The notation of the previous example changes slightly to the
following:
Stem -> Root Deriv:
= = <=
The only change is in the third expression under the rule: the
unification operator `=' has been changed to a priority union operator
`<='. This new operator is the same as unification except for handling
conflicts and storing results. In unification, a conflict causes the
operation to fail. In priority union, a conflict is resolved by taking
the value in the right hand feature structure. In unification, both
the left hand feature structure and the right hand feature structure
are replaced by the unified result. In priority union, only the left
hand feature structure is replaced by the result.
There is one other significant difference between unification and
priority union. Unification is logically an unordered process; it makes
no difference in what order the unification expressions are written.
Priority union, on the other hand, is inherently ordered; a priority
union operation always overrides any earlier priority union (or
unification) result. For this reason, all unification expressions are
evaluated before any priority union expressions, and the ordering of the
priority union expressions is significant.
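The difference in how conflicts are handled can be illustrated with a
minimal Python sketch in which feature structures are modeled as nested
dictionaries with strings as atomic values. This is a simplification
for illustration, not PC-PATR's implementation: the feature names are
invented, and both functions return a new structure rather than
replacing their arguments, so the difference in what gets stored is not
modeled.

```python
class UnificationFailure(Exception):
    """Raised when two feature structures conflict."""


def unify(left, right):
    """Return the unification of two feature structures, or raise on conflict."""
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for name, value in right.items():
            result[name] = unify(result[name], value) if name in result else value
        return result
    if left == right:
        return left
    raise UnificationFailure(f"{left!r} conflicts with {right!r}")


def priority_union(left, right):
    """Like unify, except that on a conflict the right hand value wins."""
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for name, value in right.items():
            result[name] = (priority_union(result[name], value)
                            if name in result else value)
        return result
    return right


root = {"cat": "N", "num": "sg"}     # hypothetical feature values
deriv = {"cat": "V"}
print(priority_union(root, deriv))   # the conflict on cat is resolved
```

Calling `unify(root, deriv)' on the same two structures raises
`UnificationFailure', since the two values of `cat' conflict.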
A BNF grammar for PC-PATR priority union operations follows.
::= '<='
| '<=' ::= '<' '
::=