PC-PATR Reference Manual
                 a unification based syntactic parser
                             version 1.3.0
                              March 2002

  by Stephen McConnel (changes for v. 1.2.5-1.3.0 by H. Andrew Black)

               Copyright (C) 2000-2002 SIL International

Published by:
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the PC-PATR program
***********************************

This document describes PC-PATR, an implementation of the PATR-II
computational linguistic formalism (plus a few enhancements) for
personal computers.  It is available for MS-DOS, Microsoft Windows,
Macintosh, and Unix.(1)

PC-PATR uses a left corner chart parser with these characteristics:

   * bottom-up parse with top-down filtering based on the categories

   * left-to-right order: after each word is added to the chart, all
     possible edges that can be derived up to that point are computed
     as a side effect

PC-PATR is still under development.  The author would appreciate
feedback directed to the following address:

     Stephen McConnel                 (972)708-7361 (office)
     Language Software Development    (972)708-7561 (fax)
     SIL International
     7500 W. Camp Wisdom Road
     Dallas, TX 75236                 steve@acadcomp.sil.org
     U.S.A.                        or Stephen_McConnel@sil.org

---------- Footnotes ----------

(1) The Microsoft Windows implementation uses the Microsoft C QuickWin
function, and the Macintosh implementation uses the Metrowerks C SIOUX
function.

The PATR-II Formalism
*********************

The PATR-II formalism can be viewed as a computer language for encoding
linguistic information.  It does not presuppose any particular theory
of syntax.  It was originally developed by Stuart M. Shieber at
Stanford University in the early 1980s (Shieber 1984, Shieber 1986).
A PATR-II grammar consists of a set of rules and a lexicon.  Each rule
consists of a context-free _phrase structure rule_ and a set of
_feature constraints_, that is, _unifications_ on the _feature
structures_ associated with the constituents of the phrase structure
rules.  The lexicon provides the items that can replace the terminal
symbols of the phrase structure rules, that is, the words of the
language together with their relevant features.

Phrase structure rules
======================

Context-free phrase structure rules should be familiar to anyone who
has studied either linguistic theory or computer science.  They look
like this:

     LHS -> RHS_1 RHS_2 ...

`LHS' (the symbol to the left of the arrow) is a nonterminal symbol for
the type of phrase that is being described.  To the right of the arrow
is an ordered list of the constituents of the phrase.  These
constituents are either nonterminal symbols, appearing on the left hand
side of some rule in the grammar, or terminal symbols, representing
basic classes of elements from the lexicon.  These basic classes
usually correspond to what are commonly called _parts of speech_.  In
PATR-II, the terminal and nonterminal symbols are both referred to as
_categories_.


     Figure 1. Context-free phrase structure grammar
     
     Rule  S       -> NP VP (SubCl)
     Rule  NP      -> {(Det) (AdjP) N (PrepP)} / PR
     Rule  Det     -> DT / PR
     Rule  VP      -> VerbalP (NP / AdjP) (AdvP)
     Rule  VerbalP -> V
     Rule  VerbalP -> AuxP V
     Rule  AuxP    -> AUX (AuxP_1)
     Rule  PrepP   -> PP NP
     Rule  AdjP    -> (AV) AJ (AdjP_1)
     Rule  AdvP    -> {AV / PrepP} (AdvP_1)
     Rule  SubCl   -> CJ S


Consider the PC-PATR style context-free phrase structure grammar in
figure 1.  It has ten nonterminal symbols (S, NP, Det, VP, VerbalP,
AuxP, PrepP, AdjP, AdvP, and SubCl), and nine terminal symbols (N, PR,
DT, V, AUX, PP, AV, AJ, and CJ).  This grammar describes a small subset
of English sentences.  Several aspects of this grammar are worth
mentioning.

  1. Optional constituents (or sets of constituents) on the right hand
     side are enclosed in parentheses.

  2. Alternative constituents (or sets of constituents) on the right
     hand side are separated by slashes.

  3. Braces are used to group alternative sets of elements together, so
     that alternations are not ambiguous.

  4. Symbols should not be repeated verbatim within a rule.  Repeated
     symbols should be distinguished from each other by adding a
     different index number to a symbol each time it is repeated.
     Index numbers are introduced by the underscore (`_') character.
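
The notation described in points 1-3 can be made concrete by listing
the plain symbol sequences that a right hand side stands for.  The
following Python sketch is an illustration only, not PC-PATR's own
code.  The names `tokenize' and `expand' are hypothetical, and the
sketch assumes that a slash separates whole sequences within its
enclosing group (an assumption that makes no difference for the rules
of figure 1).

```python
import re


def tokenize(rhs):
    """Split a right hand side into symbols and the marks ( ) { } /."""
    return re.findall(r"[(){}/]|[^\s(){}/]+", rhs)


def expand(rhs):
    """Return every plain symbol sequence described by a right hand side."""
    tokens = tokenize(rhs)
    pos = 0

    def alternation(closer):
        # One or more sequences separated by slashes.
        nonlocal pos
        results = sequence(closer)
        while pos < len(tokens) and tokens[pos] == "/":
            pos += 1
            results += sequence(closer)
        return results

    def sequence(closer):
        # Items in order, up to a slash or the closing mark of the group.
        nonlocal pos
        results = [()]
        while pos < len(tokens) and tokens[pos] not in ("/", closer):
            tok = tokens[pos]
            pos += 1
            if tok == "(":
                # Optional group: its alternatives, or nothing at all.
                choices = alternation(")") + [()]
                pos += 1  # skip the closing parenthesis
            elif tok == "{":
                # Braces only group; exactly one alternative must be used.
                choices = alternation("}")
                pos += 1  # skip the closing brace
            else:
                choices = [(tok,)]
            results = [r + c for r in results for c in choices]
        return results

    return alternation(None)


vp_expansions = expand("VerbalP (NP / AdjP) (AdvP)")
np_expansions = expand("{(Det) (AdjP) N (PrepP)} / PR")
for symbols in vp_expansions:
    print("VP ->", " ".join(symbols))
```

Under these assumptions the VP rule of figure 1 stands for six plain
rules, and the NP rule for nine (eight from the braced group plus the
single `PR' alternative).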


     Figure 2. Parse of sample English sentence
     
                   S
                   /\
                  /  \
                 /    \
                /      \
               /        \
              /          \
             /            \
           NP              VP
           /\             /|\
          /  \           / | \
         /    \         /  |  \
       Det     N  VerbalP NP   AdvP
        |      |     |     |     |
       DT     man    V    PR   PrepP
        |            |     |    /\
       the         sees   us   /  \
                              /    \
                             PP     NP
                              |     /\
                            with   /  \
                                  /    \
                                Det     N
                                 |      |
                                DT  telescope
                                 |
                                 a


     Figure 3. Parse of sample sentence (PC-PATR output)
     
                     S
           __________|__________
          NP                  VP
        ___|____      _________|__________
       Det     N   VerbalP  NP         AdvP
        |     man     |      |           |
       DT             V     PR         PrepP
       the          sees    us      _____|______
                                   PP         NP
                                  with     ____|_____
                                          Det       N
                                           |    telescope
                                          DT
                                           a


A significant amount of grammar development can be done just with
context-free phrase structure rules such as these.  For example,
parsing the sentence "the man sees us with a telescope" with this
simple grammar produces a parse tree like that shown in figure 2.  (In
order to minimize the height of parse trees without needing to use a
graphical interface, PC-PATR actually draws parse trees like the one
shown in figure 3.)  Parsing the similar sentence "we see the man with
a telescope" produces two different parses as shown in figure 4,
correctly showing the ambiguity between whether we used a telescope to
see the man, or the man had a telescope when we saw him.


     Figure 4. Parses of an ambiguous English sentence
     
                 S_1
        __________|__________
      NP_2+               VP_4
        |      _____________|_____________
      PR_3+ VerbalP_5+ NP_7           AdvP_11
       we      |      ___|____           |
             V_6+  Det_8+  N_10+     PrepP_12+
              see     |     man     _____|______
                    DT_9+        PP_13+     NP_14+
                     the          with     ____|_____
                                        Det_15+   N_17+
                                           |    telescope
                                        DT_16+
                                           a
     
             S_18
        _______|________
      NP_2+          VP_19
        |      ________|________
      PR_3+ VerbalP_5+       NP_20
       we      |      _________|__________
             V_6+  Det_8+  N_10+     PrepP_12+
              see     |     man     _____|______
                    DT_9+        PP_13+     NP_14+
                     the          with     ____|_____
                                        Det_15+   N_17+
                                           |    telescope
                                        DT_16+
                                           a


A fundamental problem with context-free phrase structure grammars is
that they tend to grossly overgenerate.  For example, the sample
grammar would incorrectly recognize the sentence "*he see the man with
a telescope", assigning it tree structures similar to those shown in
figure 4.  With only the simple categories used by context-free phrase
structure rules, a very large number of rules are required to
accurately handle even a small subset of a language's grammar.  This is
the primary motivation behind feature structures, the basic enhancement
of PATR-II over context-free phrase structure grammars.(1)

---------- Footnotes ----------

(1) Gazdar and Mellish (1989, pages 142-147) discuss why context-free
phrase structure grammars are inadequate to model some human languages.
The PATR-II formalism (unification of feature structures added to the
context-free phrase structure rules) is shown to be adequate for those
cases.

Feature structures
==================

The basic data structure of the PATR-II formalism is called a _feature
structure_.  A feature structure contains one or more _features_.  A
feature consists of an attribute name and a value.  Feature structures
are commonly written as attribute-value matrices like this (example 1):

     (1)     [ lex: telescope
               cat: N ]


where _lex_ and _cat_ are attribute names, and _telescope_ and _N_ are
the values for those attributes.  Note that the feature structure is
enclosed in brackets.  Each feature occurs on a separate line, with the
name coming first, followed by a colon and then its value.  Feature
names and (simple) values are single words consisting of alphanumeric
characters.

Feature structures can have either simple values, such as the example
above, or complex values, such as this (example 2):

     (2)     [ lex:      telescope
               cat:      N
               gloss:    `telescope
               head:     [ agr:    [ 3sg: + ]
                           number: SG
                           pos:    N
                           proper: -
                           verbal: - ]
               root_pos: N ]


where the value of the _head_ feature is another feature structure,
that also contains an embedded feature structure.  Feature structures
can be arbitrarily nested in this manner.

Portions of a feature structure can be referred to using the _path_
notation.  A path is a sequence of one or more feature names enclosed
in angled brackets (`<>').  For instance, examples 3-5 would all be
valid feature paths based on the feature structure of example 2:

     (3)     <head>
     (4)     <head number>
     (5)     <head agr 3sg>


Paths are used in feature templates and feature constraints, described
below.
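
As an informal illustration, a feature structure can be modelled as a
nested Python dictionary and a path as a walk through it.  This is a
sketch only: the dictionary representation and the function name
`follow' are assumptions made for the example, not part of PC-PATR.

```python
# Example 2 as nested dictionaries (the gloss feature is left out).
example_2 = {
    "lex": "telescope",
    "cat": "N",
    "head": {
        "agr": {"3sg": "+"},
        "number": "SG",
        "pos": "N",
        "proper": "-",
        "verbal": "-",
    },
    "root_pos": "N",
}


def follow(fs, path):
    """Return the value found at a path such as '<head agr 3sg>'.

    Returns None if the path does not exist in the feature structure.
    """
    node = fs
    for name in path.strip("<>").split():
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


print(follow(example_2, "<head number>"))   # SG
print(follow(example_2, "<head agr 3sg>"))  # +
print(follow(example_2, "<head case>"))     # None
```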

Different features within a feature structure can share values.  This
is not the same thing as two features having identical values.  In
Example 6 below, the `<head agr>' and `<subj head agr>' features have
identical values, but in Example 7, they share the same value:

     (6)     [ cat:  S
               pred: [ cat:  VP
                       head: [ agr:    [ 3sg: + ]
                               finite: +
                               pos:    V
                               tense:  PAST
                               vform:  ED ] ]
               subj: [ cat:  NP
                       head: [ agr:    [ 3sg: + ]
                               case:   NOM
                               number: SG
                               pos:    N
                               proper: -
                               verbal: - ] ] ]

     (7)     [ cat:  S
               pred: [ cat:  VP
                       head: [ agr:    $1[ 3sg: + ]
                               finite: +
                               pos:    V
                               tense:  PAST
                               vform:  ED ] ]
               subj: [ cat:  NP
                       head: [ agr:    $1[ 3sg: + ]
                               case:   NOM
                               number: SG
                               pos:    N
                               proper: -
                               verbal: - ] ] ]

Shared values are indicated by the coindexing markers `$1', `$2', and
so on.
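
The practical difference between identical and shared values shows up
when more information is added.  The Python sketch below models the
`agr' features of examples 6 and 7 as dictionaries (the other features
are trimmed away).  The `person' feature and the function name
`add_person' are hypothetical, introduced only for this illustration.

```python
# Example 6, trimmed: two separate values that happen to be equal.
example_6 = {
    "pred": {"head": {"agr": {"3sg": "+"}}},
    "subj": {"head": {"agr": {"3sg": "+"}}},
}

# Example 7, trimmed: one value reachable by two paths (the $1 marker).
shared_agr = {"3sg": "+"}
example_7 = {
    "pred": {"head": {"agr": shared_agr}},
    "subj": {"head": {"agr": shared_agr}},
}


def add_person(fs):
    """Add a feature via <pred head agr>, then return <subj head agr>."""
    fs["pred"]["head"]["agr"]["person"] = "third"
    return fs["subj"]["head"]["agr"]


subj_agr_6 = add_person(example_6)
subj_agr_7 = add_person(example_7)
print(subj_agr_6)  # {'3sg': '+'}
print(subj_agr_7)  # {'3sg': '+', 'person': 'third'}
```

With identical values the subject is unaffected by the change; with a
shared value the new information is visible through both paths.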

Note that upper and lower case letters used in feature names and values
are distinctive.  For example, _NUMBER_ is not the same as _Number_ or
_number_.  (This is also true of the symbols used in the context-free
phrase structure rules.)

Unification
===========

_Unification_ is the basic operation applied to feature structures in
PC-PATR.  It consists of the merging of the information from two
feature structures.  Two feature structures can unify if their common
features have the same values, but do not unify if any feature values
conflict.

Consider the following feature structures:

     (8)     [ agreement: [ number: singular
                            person: first ] ]
     
     (9)     [ agreement: [ number: singular ]
               case:      nominative ]
     
     (10)    [ agreement: [ number: singular
                            person: third ] ]
     
     (11)    [ agreement: [ number: singular
                            person: first ]
               case:      nominative ]
     
     (12)    [ agreement: [ number: singular
                            person: third ]
               case:      nominative ]


Feature structure 9 can unify with either feature structure 8
(producing feature structure 11) or feature structure 10 (producing
feature structure 12).  However, feature structure 8 cannot unify with
feature structure 10 because their `<agreement person>' values
conflict.
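
The behavior of examples 8-12 can be reproduced with a small Python
sketch that treats feature structures as nested dictionaries.  This is
a simplified illustration under stated assumptions, not PC-PATR's
implementation: the function name `unify' is hypothetical, failure is
reported as `None', and shared (coindexed) values are not modelled.

```python
def unify(a, b):
    """Return the unification of two feature structures, or None.

    Complex values are dictionaries and simple values are strings.
    The inputs are not modified.
    """
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for name, value in b.items():
            if name in result:
                merged = unify(result[name], value)
                if merged is None:
                    return None  # conflict somewhere below this feature
                result[name] = merged
            else:
                result[name] = value  # information found only in b
        return result
    # Simple values (or a simple and a complex value) must match exactly.
    return a if a == b else None


fs8 = {"agreement": {"number": "singular", "person": "first"}}
fs9 = {"agreement": {"number": "singular"}, "case": "nominative"}
fs10 = {"agreement": {"number": "singular", "person": "third"}}

fs11 = unify(fs9, fs8)
fs12 = unify(fs9, fs10)
print(fs11)
print(fs12)
print(unify(fs8, fs10))  # None
```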

Feature constraints
===================

The feature constraints associated with phrase structure rules in
PATR-II consist of a set of unification expressions (the _unification
constraints_).  Each unification expression has three parts, in this
order:

  1. a feature path, the first element of which is one of the symbols
     from the phrase structure rule

  2. an equal sign (`=')

  3. either a simple value, or another feature path that also starts
     with a symbol from the phrase structure rule

As an example, consider the following PC-PATR rules:

     (13)	Rule S -> NP VP (SubCl)
     	    <NP head agr>  = <VP head agr>
     	    <NP head case> = NOM
     	    <S subj>       = <NP>
     	    <S head>       = <VP head>
     
     (14)	Rule NP -> {(Det) (AJ) N (PrepP)} / PR
     	    <Det head number> = <N head number>
     	    <NP head>         = <N head>
     	    <NP head>         = <PR head>


Rule 13 has two feature constraints that limit the co-occurrence of NP
and VP, and two feature constraints that build the feature structures
for S.  This highlights the dual purpose of feature constraints in
PC-PATR: limiting the co-occurrence of phrase structure elements and
constructing the feature structure for the element defined by a rule.
The first constraint states that the NP and VP `<head agr>' features
must unify successfully, and also modifies both of those features if
they do unify.  The second constraint states that NP's `<head case>'
feature must either be equal to `NOM' or else be undefined.  In the
latter case, it is set equal to `NOM'.  The last two constraints create
a new feature structure for S from the feature structures for NP and VP.
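
The "equal or else undefined" behavior of a constraint with a simple
value, such as `<NP head case> = NOM', can be sketched in Python as
follows.  This is an illustration under assumptions: feature structures
are modelled as nested dictionaries, the function name `constrain_atom'
is hypothetical, and the path is given relative to the constituent
(that is, without the leading `NP').

```python
def constrain_atom(fs, path, value):
    """Apply `<path> = value' to fs in place; return False on conflict."""
    names = path.strip("<>").split()
    node = fs
    for name in names[:-1]:
        # Missing intermediate features are created as empty structures.
        node = node.setdefault(name, {})
        if not isinstance(node, dict):
            return False  # a simple value stands where a structure is needed
    last = names[-1]
    if last not in node:
        node[last] = value  # undefined: the constraint supplies the value
        return True
    return node[last] == value  # defined: it must already be equal


np = {"cat": "NP", "head": {"pos": "N"}}
first = constrain_atom(np, "<head case>", "NOM")
again = constrain_atom(np, "<head case>", "NOM")
clash = constrain_atom(np, "<head case>", "ACC")
print(first, again, clash)  # True True False
print(np["head"]["case"])   # NOM
```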

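Rule 14 illustrates another important point about feature unification
constraints:  they are applied only if they involve the phrase
structure constituents actually found for the rule.  For example, when
the NP consists of a single PR, the constraints that mention Det or N
are skipped, and only `<NP head> = <PR head>' is applied.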
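
The selection of constraints for rule 14 can be sketched in Python.
This is only an illustration of the principle: constraints are kept as
plain strings, and the function name `active_constraints' is
hypothetical.

```python
import re


def active_constraints(constraints, present):
    """Keep the constraints whose paths all start with a present symbol."""
    kept = []
    for text in constraints:
        # The first name after each `<' is the constituent referred to.
        heads = re.findall(r"<\s*([^\s<>]+)", text)
        if all(head in present for head in heads):
            kept.append(text)
    return kept


rule_14 = [
    "<Det head number> = <N head number>",
    "<NP head> = <N head>",
    "<NP head> = <PR head>",
]

# NP realized as a pronoun: only the PR constraint applies.
print(active_constraints(rule_14, {"NP", "PR"}))
# NP realized as a bare noun: only the N constraint applies.
print(active_constraints(rule_14, {"NP", "N"}))
# NP realized as Det N: both the Det and the N constraints apply.
print(active_constraints(rule_14, {"NP", "Det", "N"}))
```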


     Figure 5. PC-PATR grammar of English subset
     
     Rule  S -> NP VP (SubCl)
             <NP head agr>  = <VP head agr>
             <NP head case> = NOM
             <S subj> = <NP>
             <S pred> = <VP>
     Rule  NP -> {(Det) (AdjP) N (PrepP)} / PR
             <Det head number> = <N head number>
             <NP head> = <N head>
             <NP head> = <PR head>
     Rule  Det -> DT / PR
             <PR head case> = GEN
             <Det head> = <DT head>
             <Det head> = <PR head>
     Rule  VP -> VerbalP (NP / AdjP) (AdvP)
             <NP head case>   = ACC
             <NP head verbal> = -
             <VP head> = <VerbalP head>
     Rule  VerbalP -> V
             <V head finite> = +
             <VerbalP head>  = <V head>
     Rule  VerbalP -> AuxP V
             <V head finite> = -
             <VerbalP head>  = <AuxP head>
     Rule  AuxP -> AUX (AuxP_1)
             <AuxP head> = <AUX head>
     Rule  PrepP -> PP NP
             <NP head case> = ACC
             <PrepP head> = <PP head>
     Rule  AdjP -> (AV) AJ (AdjP_1)
     Rule  AdvP -> {AV / PrepP} (AdvP_1)
     Rule  SubCl -> CJ S


     Figure 6. PC-PATR output with feature structure
     
     1:
                     S
           __________|__________
          NP                  VP
        ___|____      _________|__________
       Det     N   VerbalP  NP         AdvP
        |     man     |      |           |
       DT             V     PR         PrepP
       the           saw    us      _____|______
                                   PP         NP
                                  with     ____|_____
                                          Det       N
                                           |    telescope
                                          DT
                                           a
     
     S:
     [ cat:   S
       pred:    [ cat:   VP
                  head:    [ agr:   $1[ 3sg:   + ]
                             finite:+
                             pos:   V
                             tense: PAST
                             vform: ED ] ]
       subj:    [ cat:   NP
                  head:    [ agr:   $1[ 3sg:   + ]
                             case:  NOM
                             number:SG
                             pos:   N
                             proper:-
                             verbal:- ] ] ]
     
     1 parse found


Figure 5 shows the grammar of figure 1 augmented with a number of
feature constraints.  With this grammar (and a suitable lexicon), the
parse output shown in figure 2 would include the sentence feature
structure, as shown in figure 6.  Note that the `<subj head agr>' and
`<pred head agr>' features share a common value as a result of the
feature constraint unifications associated with the rule
`S -> NP VP (SubCl)'.

PC-PATR allows disjunctive feature unification constraints with its
phrase structure rules.  Consider rules 15 and 16 below.  These two
rules have the same phrase structure rule part.  They can therefore be
collapsed into the single rule 17, which has a disjunction in its
unification constraints.

     (15)	Rule CP -> NP C'        ; for wh questions with NP fronted
     	    <NP type wh> = +
     	    <C' moved A-bar> = <NP>
     	    <CP type wh> = <NP type wh>
     	    <CP type> = <C' type>
     	    <CP moved A-bar> = none
     	    <CP type root> = +          ; root clauses
     	    <CP type q> = +
     	    <CP type fin> = +
     	    <CP moved A> = none
     	    <CP moved head> = none
     
     (16)	Rule CP -> NP C'        ; for wh questions with NP fronted
     	    <NP type wh> = +
     	    <C' moved A-bar> = <NP>
     	    <CP type wh> = <NP type wh>
     	    <CP type> = <C' type>
     	    <CP moved A-bar> = none
     	    <CP type root> = -          ; non-root clauses
     
     (17)	Rule CP -> NP C'        ; for wh questions with NP fronted
     	    <NP type wh> = +
     	    <C' moved A-bar> = <NP>
     	    <CP type wh> = <NP type wh>
     	    <CP type> = <C' type>
     	    <CP moved A-bar> = none
     	    {
     	    <CP type root> = +		; root clauses
     	    <CP type q> = +
     	    <CP type fin> = +
     	    <CP moved A> = none
     	    <CP moved head> = none
     	        /
     	    <CP type root> = -		; non-root clauses
     	    }


Not only does PC-PATR allow disjunctive unification constraints, but it
also allows disjunctive phrase structure rules.  Consider rule 18: it is
very similar to rule 17.  These two rules can be further combined to
form rule 19, which has disjunctions in both its phrase structure rule
and its unification constraints.

     (18)	Rule CP -> PP C'        ; for wh questions with PP fronted
     	    <PP type wh> = +
     	    <C' moved A-bar> = <PP>
     	    <CP type wh> = <PP type wh>
     	    <CP type> = <C' type>
     	    <CP moved A-bar> = none
     	    {
     	    <CP type root> = +		; root clauses
     	    <CP type q> = +
     	    <CP type fin> = +
     	    <CP moved A> = none
     	    <CP moved head> = none
     	        /
     	    <CP type root> = -		; non-root clauses
     	    }
     
     (19)	; for wh questions with NP or PP fronted
     	Rule CP -> { NP / PP } C'
     	    <NP type wh> = +
     	    <C' moved A-bar> = <NP>
     	    <CP type wh> = <NP type wh>
     	    <PP type wh> = +
     	    <C' moved A-bar> = <PP>
     	    <CP type wh> = <PP type wh>
     	    <CP type> = <C' type>
     	    <CP moved A-bar> = none
     	    {
     	    <CP type root> = +		; root clauses
     	    <CP type q> = +
     	    <CP type fin> = +
     	    <CP moved A> = none
     	    <CP moved head> = none
     	        /
     	    <CP type root> = -		; non-root clauses
     	    }


Since the open brace (`{') introduces disjunctions both in the phrase
structure rule and in the unification constraints, care must be taken
to avoid confusing PC-PATR when it is loading the grammar file.  The
end of the phrase structure rule, and the beginning of the unification
constraints, is signaled either by the first constraint beginning with
an open angle bracket (`<') or by a colon (`:').  If the first
constraint is part of a disjunction, then the phrase structure rule
must end with a colon.  Otherwise, PC-PATR will treat the unification
constraint as part of the phrase structure rule, and will shortly
complain about syntax errors in the grammar file.

Note that disjunctions in phrase structure rules or unification
constraints are expanded when the grammar file is read.  They serve
only as a convenience for the person writing the rules.
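
The effect of expanding a disjunction in the unification constraints
can be sketched in Python, using rule 17 as the input.  This shows the
idea only and makes no claim about how PC-PATR performs the expansion
internally.  Constraints are kept as plain strings, a disjunction is
represented as a list of alternatives, and the function name
`expand_constraints' is hypothetical.

```python
from itertools import product


def expand_constraints(items):
    """Expand disjunctions into one flat constraint list per choice.

    Each item is either a constraint string or a disjunction, written
    as a list of alternatives (each alternative a list of strings).
    """
    choices = [item if isinstance(item, list) else [[item]] for item in items]
    return [
        [constraint for part in combo for constraint in part]
        for combo in product(*choices)
    ]


common = [
    "<NP type wh> = +",
    "<C' moved A-bar> = <NP>",
    "<CP type wh> = <NP type wh>",
    "<CP type> = <C' type>",
    "<CP moved A-bar> = none",
]
root_clause = [
    "<CP type root> = +",
    "<CP type q> = +",
    "<CP type fin> = +",
    "<CP moved A> = none",
    "<CP moved head> = none",
]
non_root_clause = ["<CP type root> = -"]

# Rule 17: five plain constraints followed by one disjunction.
rule_17 = common + [[root_clause, non_root_clause]]

expanded = expand_constraints(rule_17)
for constraints in expanded:
    print(len(constraints), "constraints")
```

The two resulting lists correspond to the constraints of rules 15 and
16 respectively.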

The lexicon
===========

The lexicon provides the basic elements (atoms) of the grammar, which
are usually words.  Information like that shown in example 2 is
provided for each lexicon entry.  Unlike the original implementation of
PATR-II, PC-PATR stores the lexicon in a separate file from the grammar
rules.  See `Lexicon File' below for details.

Running PC-PATR
***************

PC-PATR is an interactive program.  It has a few command line options,
but it is controlled primarily by commands typed at the keyboard (or
loaded from a file previously prepared).

PC-PATR Command Line Options
============================

The PC-PATR program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').  The available options are listed below in alphabetical order.
Those options which require an argument have the argument type
following the option letter.

`-a filename'
     loads the lexicon from an AMPLE analysis output file.

`-g filename'
     loads the grammar from a PC-PATR grammar file.

`-l filename'
     loads the lexicon from a PC-PATR lexicon file.

`-t filename'
     opens a file containing one or more PC-PATR commands.  See
     `Interactive Commands' below.

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-/'
     increments the debugging level.  The default is zero (no debugging
     output).

`-z filename'
     opens a file for recording a memory allocation log.

`-Z address,count'
     traps the program at the point where `address' is allocated or
     freed for the `count''th time.

Interactive Commands
====================

Each of the commands available in PC-PATR is described below.  Each
command consists of one or more keywords followed by zero or more
arguments.  Keywords may be abbreviated to the minimum length necessary
to prevent ambiguity.

cd
--

`cd' DIRECTORY changes the current directory to the one specified.
Spaces in the directory pathname are not permitted.

For MS-DOS or Windows, you can give a full path starting with the disk
letter and a colon (for example, `a:'); a path starting with `\' which
indicates a directory at the top level of the current disk; a path
starting with `..' which indicates the directory above the current one;
and so on.  Directories are separated by the `\' character.  (The
forward slash `/' works just as well as the backslash `\' for MS-DOS or
Windows.)

For the Macintosh, you can give a full path starting with the name of a
hard disk, a path starting with `:' which means the current folder, or
one starting `::' which means the folder containing the current one
(and so on).

For Unix, you can give a full path starting with a `/' (for example,
`/usr/pcpatr'); a path starting with `..' which indicates the directory
above the current one; and so on.  Directories are separated by the `/'
character.

clear
-----

`clear' erases all existing grammar and lexicon information, allowing
the user to prepare to load information for a new language.  Strictly
speaking, it is not needed since the `load grammar' command erases the
previously existing grammar, and the `load lexicon' and `load analysis'
commands erase any previously existing lexicon.

close
-----

`close' closes the current log file opened by a previous `log' command.

directory
---------

`directory' lists the contents of the current directory.  This command
is available only for the MS-DOS and Unix implementations.  It does not
exist for Microsoft Windows or the Macintosh.

edit
----

`edit' FILENAME attempts to edit the specified file using the program
indicated by the environment variable `EDITOR'.  If this environment
variable is not defined, then `edlin' is used to edit the file on
MS-DOS, and `vi' is used to edit the file on Unix.  (These defaults
should convince you to set this variable!)  This command is not
available for Microsoft Windows or the Macintosh.

exit
----

`exit' stops PC-PATR, returning control to the operating system.  This
is the same as `quit'.

file
----

The `file' commands process data from a file, optionally writing the
parse results to another file.  Each of these commands is described
below.

file disambiguate
.................

`file disambiguate' INPUT.ANA [OUT.ANA] reads sentences from the
specified AMPLE analysis file and writes the corresponding parse trees
and feature structures either to the screen or to the optionally
specified output file.  If the output file is written, ambiguous word
parses are eliminated as much as possible as a result of the sentence
parsing.  When finished, a statistical report of successful (sentence)
parses is displayed on the screen.

file parse
..........

`file parse' INPUT-FILE [OUTPUT-FILE] reads sentences from the
specified input file, one per line, and writes the corresponding parse
trees and feature structures to the screen or to the optionally
specified output file.  The comment character is in effect while
reading this file.  PC-PATR currently makes no attempt to handle either
capitalization or punctuation.  Some capability for handling
punctuation will probably be added at some point.

This command behaves the same as `parse' except that input comes from a
file rather than the keyboard, and output may go to a file rather than
the screen.  When finished, a statistical report of successful parses
is displayed on the screen.

help
----

`help' COMMAND displays a description of the specified command.  If
`help' is typed by itself, PC-PATR displays a list of commands with
short descriptions of each command.

load
----

The `load' commands all load information stored in specially formatted
files.  The `load ample' and `load kimmo' commands activate
morphological parsers, and serve as alternatives to `load lexicon' (or
`load analysis') for obtaining the category and other feature
information for words.  Each of the `load' commands is described below.

load ample control
..................

`load ample control' XXAD01.CTL XXANCD.TAB [XXORDC.TAB] erases any
existing AMPLE information (including dictionaries) and reads control
information from the specified files.  This also erases any stored
PC-Kimmo information.

At least two and possibly three files are loaded by this command.  The
first file is the AMPLE analysis data file.  It has a default filetype
extension of `.ctl' but no default filename.  The second file is the
AMPLE dictionary code table file.  It has a default filetype extension
of `.tab' but no default filename.  The third file is an optional
dictionary orthography change table.  It has a default filetype
extension of `.tab' and no default filename.

`l am c' is a synonym for `load ample control'.

load ample dictionary
.....................

`load ample dictionary' [PREFIX.DIC] [INFIX.DIC] [SUFFIX.DIC] ROOT1.DIC [...]
or
`load ample dictionary' FILE01.DIC [FILE02.DIC ...]  erases any
existing AMPLE dictionary information and reads the specified files.
This also erases any stored PC-Kimmo information.

The first form of the command is for using a dictionary whose files are
divided according to morpheme type (`set ample-dictionary split').  The
different types of dictionary files must be loaded in the order shown,
with any unneeded affix dictionaries omitted.

The second form of the command is for using a dictionary whose entries
contain the type of morpheme (`set ample-dictionary unified').(1)

`l am d' is a synonym for `load ample dictionary'.

---------- Footnotes ----------

(1) This is a new feature of AMPLE version 3.

load ample text-control
.......................

`load ample text-control' XXINTX.CTL erases any existing AMPLE text
input control information and reads the specified file.  This also
erases any stored PC-Kimmo information.

The text input control file has a default filetype extension of `.ctl'
but no default filename.

`l am t' is a synonym for `load ample text-control'.

load analysis
.............

`load analysis' FILE1.ANA [FILE2.ANA ...]  erases any existing lexicon
and reads a new lexicon from the specified AMPLE analysis file(s).
Note that more than one file may be loaded with the single
`load analysis' command: duplicate entries are not stored in the
lexicon.

The default filetype extension for `load analysis' is `.ana', and the
default filename is `ample.ana'.

`l a' is a synonym for `load analysis'.

load grammar
............

`load grammar' FILE.GRM erases any existing grammar and reads a new
grammar from the specified file.

The default filetype extension for `load grammar' is `.grm', and the
default filename is `grammar.grm'.

`l g' is a synonym for `load grammar'.

load kimmo grammar
..................

`load kimmo grammar' FILE.GRM erases any existing PC-Kimmo (word)
grammar and reads a new word grammar from the specified file.

The default filetype extension for `load kimmo grammar' is `.grm', and
the default filename is `grammar.grm'.

`l k g' is a synonym for `load kimmo grammar'.

load kimmo lexicon
..................

`load kimmo lexicon' FILE.LEX erases any existing PC-Kimmo lexicon
information and reads a new morpheme lexicon from the specified file.
A PC-Kimmo rules file must be loaded before a PC-Kimmo lexicon file can
be loaded.

The default filetype extension for `load kimmo lexicon' is `.lex', and
the default filename is `lexicon.lex'.

`l k l' is a synonym for `load kimmo lexicon'.

load kimmo rules
................

`load kimmo rules' FILE.RUL erases any existing PC-Kimmo rules and
reads a new set of rules from the specified file.  This also erases any
stored AMPLE information.

The default filetype extension for `load kimmo rules' is `.rul', and
the default filename is `rules.rul'.

`l k r' is a synonym for `load kimmo rules'.

load lexicon
............

`load lexicon' FILE1.LEX [FILE2.LEX ...]  erases any existing lexicon
and reads a new lexicon from the specified file(s).  Note that more
than one file may be loaded with a single `load lexicon' command.

The default filetype extension for `load lexicon' is `.lex', and the
default filename is `lexicon.lex'.

`l l' is a synonym for `load lexicon'.

log
---

`log' [FILE.LOG] opens a log file.  Each item processed by a `parse'
command is stored to the log file as well as being displayed on the
screen.

If a filename is given on the same line as the `log' command, then that
file is used for the log file.  Any previously existing file with the
same name will be overwritten.  If no filename is provided, then the
file `pcpatr.log' in the current directory is used for the log file.

Use `close' to stop recording in a log file.  If a `log' command is
given when a log file is already open, then the earlier log file is
closed before the new log file is opened.

parse
-----

`parse' [SENTENCE OR PHRASE] attempts to parse the input sentence
according to the loaded grammar.  If a sentence is typed on the same
line as the command, then that sentence is parsed.  If the `parse'
command is given by itself, then the user is prompted repeatedly for
sentences to parse.  This cycle of typing and parsing is terminated by
typing an empty "sentence" (that is, nothing but the `Enter' or `Return'
key).

Both the grammar and the lexicon must be loaded before using this
command.
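
As an illustrative sketch (the file names `english.grm', `english.lex',
and `session.log' are hypothetical), a sequence of commands that loads
both files, opens a log file, parses one sentence, and then closes the
log might look like this:

     load grammar english.grm
     load lexicon english.lex
     log session.log
     parse cows eat grass
     close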

quit
----

`quit' stops PC-PATR, returning control to the operating system.  This
is the same as `exit'.

save
----

The `save' commands write information stored in memory to a file
suitable for reloading into PC-PATR later.  Each of these commands is
described below.

save lexicon
............

`save lexicon' FILE.LEX writes the current lexicon contents to the
designated file.  The output lexicon file must be specified.  This can
be useful if you are using a morphological parser to populate the
lexicon.

save status
...........

`save status' [FILE.TAK] writes the current settings to the designated
file in the form of PC-PATR commands.  If the file is not specified,
the settings are written to `pcpatr.tak' in the current directory.

set
---

The `set' commands control program behavior by setting internal program
variables.  Each of these commands (and variables) is described below.

set ambiguities
...............

`set ambiguities' NUMBER limits the number of analyses printed to the
given number.  The default value is 10.  Note that this does not limit
the number of analyses produced, just the number printed.

set ample-dictionary
....................

`set ample-dictionary' VALUE determines whether or not the AMPLE
dictionary files are divided according to morpheme type.
`set ample-dictionary split' declares that the AMPLE dictionary is
divided into a prefix dictionary file, an infix dictionary file, a
suffix dictionary file, and one or more root dictionary files.  The
existence of the three affix dictionaries depends on settings in the
AMPLE analysis data file.  If they exist, the `load ample dictionary'
command requires that they be given in this relative order: prefix,
infix, suffix, root(s).

`set ample-dictionary unified' declares that any of the AMPLE
dictionary files may contain any type of morpheme.  This implies that
each dictionary entry may contain a field specifying the type of
morpheme (the default is ROOT), and that the dictionary code table
contains a `\unified' field.  One of the changes listed under
`\unified' must convert a backslash code to `T'.

The default is for the AMPLE dictionary to be _split_.(1)

---------- Footnotes ----------

(1) The unified dictionary is a new feature of AMPLE version 3.

set check-cycles
................

`set check-cycles' VALUE enables or disables a check to prevent cycles
in the parse chart.  `set check-cycles on' turns on this check, and
`set check-cycles off' turns it off.  This check slows down the parsing
of a sentence, but it makes the parser less vulnerable to hanging on
perverse grammars.  The default setting is `on'.

set comment
...........

`set comment' CHARACTER sets the comment character to the indicated
value.  If CHARACTER is missing (or equal to the current comment
character), then comment handling is disabled.  The default comment
character is `;' (semicolon).

set failures
............

`set failures' VALUE enables or disables GRAMMAR FAILURE MODE.
`set failures on' turns on grammar failure mode, and `set failures off'
turns it off.  When grammar failure mode is on, the partial results of
forms that fail the grammar module are displayed.  A form may fail the
grammar either by failing the feature constraints or by failing the
constituent structure rules. In the latter case, a partial tree (bush)
will be returned.  The default setting is `off'.

Be careful with this option.  Setting failures to `on' can cause
PC-PATR to go into an infinite loop for certain recursive grammars and
certain input sentences.  WE MAY TRY TO DO SOMETHING TO DETECT THIS
TYPE OF BEHAVIOR, AT LEAST PARTIALLY.

set features
............

`set features' VALUE determines how features will be displayed.

`set features all' enables the display of the features for all nodes of
the parse tree.

`set features top' enables the display of the feature structure for
only the top node of the parse tree.  This is the default setting.

`set features flat' causes features to be displayed in a flat, linear
string that uses less space on the screen.

`set features full' causes features to be displayed in an indented form
that makes the embedded structure of the feature set clear.  This is
the default setting.

`set features on' turns on features display mode, allowing features to
be shown.  This is the default setting.

`set features off' turns off features display mode, preventing features
from being shown.

set final-punctuation
.....................

`set final-punctuation' VALUE defines the set of characters used to
mark the ends of sentences.  The individual characters must be
separated by spaces so that digraphs and trigraphs can be used, not
just single character units.  The default is `. ! ? : ;'.

This variable setting affects only the `file disambiguate' command.
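
For example, assuming a (hypothetical) orthography in which a sentence
may end with a period, a question mark, an exclamation point, or the
digraph `||', the set could be declared as follows.  The digraph is
typed as a unit and is separated from the other characters by spaces:

     set final-punctuation . ? ! ||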

set gloss
.........

`set gloss' VALUE enables the display of glosses in the parse tree
output if VALUE is `on', and disables the display of glosses if VALUE is
`off'.  If any glosses exist in the lexicon file, then `gloss' is
automatically turned `on' when the lexicon is loaded.  If no glosses
exist in the lexicon, then this flag is ignored.

set kimmo check-cycles
......................

`set kimmo check-cycles' VALUE enables or disables a check to prevent
cycles in a word parse chart created by the embedded PC-Kimmo
morphological parser.  `set kimmo check-cycles on' turns on this check,
and `set kimmo check-cycles off' turns it off.  This check slows down
the parsing of a sentence, but it makes the parser less vulnerable to
hanging on perverse grammars.  The default setting is `on'.

set kimmo promote-defaults
..........................

`set kimmo promote-defaults' VALUE controls whether default atomic
values in the feature structures loaded from the lexicon are "promoted"
to ordinary atomic values before parsing a word with the embedded
PC-Kimmo morphological parser.  `set kimmo promote-defaults on' turns
on this behavior, and `set kimmo promote-defaults off' turns it off.
The default setting is `on'.  (It is arguable that this is the wrong
choice for the default, but this has been the behavior since the
program was first written.)

set kimmo top-down-filter
.........................

`set kimmo top-down-filter' VALUE enables or disables top-down
filtering in the embedded PC-Kimmo morphological parser, based on the
morpheme categories.  `set kimmo top-down-filter on' turns on this
filtering, and `set kimmo top-down-filter off' turns it off.  The
top-down filter speeds up the parsing of a sentence, but might cause
the parser to miss some valid parses.  The default setting is `on'.

This should not be required in the final version of PC-PATR.

set limit
.........

`set limit' NUMBER sets the time limit (in seconds) for parsing a
sentence.  Its argument is a number greater than or equal to zero,
which is the maximum number of seconds that a parse is allowed before
being cancelled.  The default value is `0', which has the special
meaning that no time limit is imposed.

NOTE: this feature is new and still somewhat experimental.  It may not
be fully debugged, and may cause unforeseen side effects such as program
crashes some time after one or more parses are cancelled due to
exceeding the set time limit.

set marker category
...................

`set marker category' MARKER establishes the marker for the field
containing the category (part of speech) feature.  The default is `\c'.

set marker features
...................

`set marker features' MARKER establishes the marker for the field
containing miscellaneous features.  (This field is not needed for many
words.)  The default is `\f'.

set marker gloss
................

`set marker gloss' MARKER establishes the marker for the field
containing the word gloss.  The default is `\g'.

set marker record
.................

`set marker record' MARKER establishes the field marker that begins a
new record in the lexicon file.  This may or may not be the same as the
`word' marker.  The default is `\w'.

set marker rootgloss
....................

`set marker rootgloss' MARKER establishes the marker for the field
containing the word rootgloss.  The default is `\r'.  The word's root
gloss may be useful for handling syntactic constructions such as verb
reduplication.  One can write a unification constraint that ensures
that the rootgloss unifies between two successive lexical
items/terminal symbols.  Note that this does not work when using Kimmo
to parse words.

set marker word
...............

`set marker word' MARKER establishes the marker for the word field.
The default is `\w'.
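
With all of the markers left at their default values, a lexicon record
might look like the following sketch.  The word, gloss, and feature
values are hypothetical, and the name in the `\f' field is assumed to
be a feature template defined in the grammar file:

     \w cows
     \c N
     \g cow+PL
     \f plural

Here `\w' serves both as the record marker and as the word marker,
which is the default arrangement.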

set promote-defaults
....................

`set promote-defaults' VALUE controls whether default atomic values in
the feature structures loaded from the lexicon are "promoted" to
ordinary atomic values before parsing a sentence.
`set promote-defaults on' turns on this behavior, and
`set promote-defaults off' turns it off.  (This can affect feature
unification since a conflicting default value does not cause a failure:
the default value merely disappears.)  The default setting is `on'.
(It is arguable that this is the wrong choice for the default, but this
has been the behavior since the program was first written.)

set property-is-feature
.......................

`set property-is-feature' VALUE controls whether the values in the
AMPLE analysis `\p' (property) field are to be interpreted as feature
template names, the same as the values in the AMPLE analysis `\fd'
(feature descriptor) field.  `set property-is-feature on' turns on this
behavior, and `set property-is-feature off' turns it off.  The default
setting is `off'.  (It is arguable that this is the wrong choice for
the default, but this has been the behavior since the program was first
written.)

set rootgloss
.............

`set rootgloss' VALUE specifies if root glosses should be treated as a
lexical feature and, if so, which root(s) in compound roots are used.
The word's root gloss may be useful for handling syntactic
constructions such as verb reduplication.  Note that this does not work
when using Kimmo to parse words.

`set rootgloss off' turns off the use of the root gloss feature.  This
is the default setting.

`set rootgloss on' turns on the use of the root gloss feature.  This
value should be used when using a word lexicon (i.e. when using the
`load lexicon' command).  Note that it must be set before the lexicon
file is loaded (otherwise, no root glosses will be loaded).

`set rootgloss leftheaded' turns on the use of the root gloss feature
and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the leftmost root in compound roots
will be used as the root gloss feature value.

`set rootgloss rightheaded' turns on the use of the root gloss feature
and, if one is either disambiguating an ANA file or using AMPLE to
parse the words in a sentence, only the rightmost root in compound roots
will be used as the root gloss feature value.

`set rootgloss all' turns on the use of the root gloss feature and, if
one is either disambiguating an ANA file or using AMPLE to parse the
words in a sentence, every root gloss in compound roots will be used as
the root gloss feature value.

set timing
..........

`set timing' VALUE enables timing mode if VALUE is `on', and disables
timing mode if VALUE is `off'.  If timing mode is `on', then the
elapsed time required to process a command is displayed when the
command finishes.  If timing mode is `off', then the elapsed time is
not shown.  The default is `off'.  (This option is useful only to
satisfy idle curiosity.)

set top-down-filter
...................

`set top-down-filter' VALUE enables or disables top-down filtering
based on the categories.  `set top-down-filter on' turns on this
filtering, and `set top-down-filter off' turns it off.  The top-down
filter speeds up the parsing of a sentence, but might cause the parser
to miss some valid parses.  The default setting is `on'.

This should not be required in the final version of PC-PATR.

set tree
........

`set tree' VALUE specifies how parse trees should be displayed.

`set tree full' turns on the parse tree display, displaying the result
of the parse as a full tree.  This is the default setting.  A short
sentence would look something like this:

        Sentence_1
             |
       Declarative_2
        _____|_____
      NP_3      VP_5
        |      ___|____
       N_4    V_6  COMP_7
      cows    eat     |
                    NP_8
                      |
                     N_9
                    grass


`set tree flat' turns on the parse tree display, displaying the result
of the parse as a flat tree structure in the form of a bracketed
string.  The same short sentence would look something like this:

     (Sentence_1 (Declarative_2 (NP_3 (N_4  cows))(VP_5 (V_6  eat)(COMP_7
             (NP_8 (N_9  grass))))))


`set tree indented' turns on the parse tree display, displaying the
result of the parse in an indented format sometimes called a _northwest
tree_.  The same short sentence would look like this:

     Sentence_1
         Declarative_2
             NP_3
                 N_4  cows
             VP_5
                 V_6  eat
                 COMP_7
                     NP_8
                         N_9  grass


`set tree xml' turns on the parse tree display, displaying the result
of the parse in an XML format.  The same short sentence would look like
this:

     <Analysis count="1">
       <Parse>
         <Node cat="Sentence" id="_1._1">
           <Fs>
           <F name="cat"><Str>Sentence</Str></F>
           </Fs>
           <Node cat="Declarative" id="_1._2">
             <Fs>
             <F name="cat"><Str>Declarative</Str></F>
             </Fs>
             <Node cat="NP" id="_1._3">
               <Fs>
               <F name="cat"><Str>NP</Str></F>
               </Fs>
               <Leaf cat="N" id="_1._4">
                 <Fs>
                 <F name="cat"><Str>N</Str></F>
                 <F name="lex"><Str>cows</Str></F>
                 </Fs>
                 <Lexfs>
                 <F name="cat"><Str>N</Str></F>
                 <F name="lex"><Str>cows</Str></F>
                 </Lexfs>
                 <Str>cows</Str>
               </Leaf>
             </Node>
             <Node cat="VP" id="_1._5">
               ...                   (35 lines omitted)
             </Node>
           </Node>
         </Node>
       </Parse>
     </Analysis>


`set tree off' disables the display of parse trees altogether.

set trim-empty-features
.......................

`set trim-empty-features' VALUE disables the display of empty feature
values if VALUE is `on', and enables the display of empty feature
values if VALUE is `off'.  The default is not to display empty feature
values.

set unification
...............

`set unification' VALUE enables or disables feature unification.
`set unification on' turns on unification mode.  This is the default
setting.

`set unification off' turns off feature unification in the grammar.
Only the context-free phrase structure rules are used to guide the
parse; the feature constraints are ignored.  This can be dangerous, as
it is easy to introduce infinite cycles in recursive phrase structure
rules.

set verbose
...........

`set verbose' VALUE enables or disables the screen display of parse
trees in the `file parse' command.  `set verbose on' enables the screen
display of parse trees, and `set verbose off' disables such display.
The default setting is `off'.

set warnings
............

`set warnings' VALUE enables warning mode if VALUE is `on', and disables
warning mode if VALUE is `off'.  If warning mode is enabled, then
warning messages are displayed on the output. If warning mode is
disabled, then no warning messages are displayed.  The default setting
is `on'.

set write-ample-parses
......................

`set write-ample-parses' VALUE enables writing `\parse' and `\features'
fields at the end of each sentence in the disambiguated analysis file
if VALUE is `on', and disables writing these fields if VALUE is `off'.
The default setting is `off'.

This variable setting affects only the `file disambiguate' command.

show
----

The `show' commands display internal settings on the screen.  Each of
these commands is described below.

show lexicon
............

`show lexicon' prints the contents of the lexicon stored in memory on
the standard output.  THIS IS NOT VERY USEFUL, AND MAY BE REMOVED.

show status
...........

`show status' displays the names of the current grammar, sentences, and
log files, and the values of the switches established by the `set'
command.

`show' (by itself) and `status' are synonyms for `show status'.

status
------

`status' displays the names of the current grammar, sentences, and log
files, and the values of the switches established by the `set' command.

system
------

`system' [COMMAND] allows the user to execute an operating system
command (such as checking the available space on a disk) from within
PC-PATR.  This is available only for MS-DOS and Unix, not for Microsoft
Windows or the Macintosh.

If no system-level command is given on the line with the `system'
command, then PC-PATR is pushed into the background and a new system
command processor (shell) is started.  Control is usually returned to
PC-PATR in this case by typing `exit' as the operating system command.

`!' (exclamation point) is a synonym for `system'.

take
----

`take' [FILE.TAK] redirects command input to the specified file.

The default filetype extension for `take' is `.tak', and the default
filename is `pcpatr.tak'.

`take' files can be nested three deep.  That is, the user types
`take file1', `file1' contains the command `take file2', and `file2'
has the command `take file3'.  It would be an error for `file3' to
contain a `take' command.  This should not prove to be a serious
limitation.

A `take' file can also be specified by using the `-t' command line
option when starting PC-PATR.  When started, PC-PATR looks for a `take'
file named `pcpatr.tak' in the current directory to initialize itself
with.
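
A minimal sketch of a `take' file follows.  The grammar and lexicon
file names are hypothetical, and each line is simply one of the
commands described above:

     set timing on
     set tree indented
     set features all
     load grammar english.grm
     load lexicon english.lex

If these lines were stored in a file named `english.tak', typing
`take english' would execute them in order, since `.tak' is the default
filetype extension.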

The PC-PATR Grammar File
************************

The following specifications apply generally to the grammar file:

   * Blank lines, spaces, and tabs separate elements of the grammar
     file from one another, but are ignored otherwise.

   * The comment character declared by the `set comment' command (see
     `set comment' above) is operative in the grammar file.  The default
     comment character is the semicolon (`;').  Comments may be placed
     anywhere in the grammar file.  Everything following a comment
     character to the end of the line is ignored.

   * A grammar file is divided into fields identified by a small set of
     keywords.

       1. `Rule' starts a context-free phrase structure rule with its
          set of feature constraints.  These rules define how words
          join together to form phrases, clauses, or sentences.  The
          lexicon and grammar are tied together by using the lexical
          categories as the terminal symbols of the phrase structure
          rules and by using the other lexical features in the feature
          constraints.

       2. `Let' starts a feature template definition.  Feature
          templates are used as macros (abbreviations) in the lexicon.
          They may also be used to assign default feature structures to
          the categories.

       3. `Parameter' starts a program parameter definition.  These
          parameters control various aspects of the program.

       4. `Define' starts a lexical rule definition.  As noted in
          Shieber (1985), something more powerful than just
          abbreviations for common feature elements is sometimes needed
          to represent systematic relationships among the elements of a
          lexicon.  This need is met by lexical rules, which express
          transformations rather than mere abbreviations.  Lexical
          rules serve two primary purposes in PC-PATR: modifying the
          feature structures associated with lexicon entries to produce
          additional lexicon entries, and modifying the feature
          structures produced by a morphological parser to fit the
          syntactic grammar description.

       5. `Constraint' starts a constraint template definition.
          Constraint templates are used as macros (abbreviations) in
          the grammar file.

       6. `Lexicon' starts a lexicon section.  This is only for
          compatibility with the original PATR-II.  The section name is
          skipped over properly, but nothing is done with it.

       7. `Word' starts an entry in the lexicon.  This is only for
          compatibility with the original PATR-II.  The entry is skipped
          over properly, but nothing is done with it.(1)

       8. `End' effectively terminates the file.  Anything following
          this keyword is ignored.

       9. `Comment' starts a comment field.  The rest of the line
          following the keyword is skipped over, and everything in
          following lines until the next keyword is also ignored.  If
          you must use a keyword (other than `comment') verbatim in one
          of the extra lines of a comment, put a comment character at
          the beginning of the line containing the keyword.

     Note that these keywords are not case sensitive:  `RULE' is the
     same as `rule', and both are the same as `Rule'.  Also, in order
     to facilitate interaction with the `Shoebox' program, any of the
     keywords may begin with a backslash `\' character.  For example,
     `\Rule' and `\rule' are both acceptable alternatives to `RULE' or
     `rule'.  The abbreviated form `\co' is a special synonym for
     `comment' or `\comment'.  Note that there is no requirement that
     these keywords appear at the beginning of a line.

   * Except for `comment', each of the fields in the grammar file may
     optionally end with a period.  If there is no period, the next
     keyword (in an appropriate slot) marks the end of one field and
     the beginning of the next.
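
The conventions above can be seen together in the following sketch of a
very small grammar file (the rules themselves are hypothetical).  The
first field ends with the optional period, the second is introduced by
a lower case keyword with a backslash and ends where the next keyword
begins, and the text of the `\co' field is skipped:

     Rule S -> NP VP
             <NP agr> = <VP agr>.
     \rule VP -> V (NP)
     \co The rest of this line is skipped.
     End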

---------- Footnotes ----------

(1) Would this be a useful enhancement to PC-PATR?

Rules
=====

A PC-PATR grammar rule has these parts, in the order listed:

  1. the keyword `Rule'

  2. an optional rule identifier enclosed in braces (`{}')

  3. a phrase structure rule consisting of the following:

       a. the nonterminal symbol to be expanded

       b. an arrow (`->') or equal sign (`=')

       c. zero or more terminal or nonterminal symbols, possibly marked
          for alternation or optionality

  4. an optional colon (`:')

  5. zero or more unification constraints

  6. zero or more priority union operations

  7. zero or more logical constraint operations

  8. an optional period (`.')

The optional rule identifier consists of one or more words enclosed in
braces.  Its current utility is only as a special form of comment
describing the intent of the rule.  (Eventually it may be used as a tag
for interactively adding and removing rules.)  The only limits on the
rule identifier are that it must not contain the comment character and
that it must appear entirely on one line in the grammar file.
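
For instance, in the following sketch the (hypothetical) identifier
merely documents the purpose of the rule:

     Rule {simple declarative sentence} S -> NP VP
             <NP agr> = <VP agr>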

The terminal and nonterminal symbols in the rule have the following
characteristics:

   * Upper and lower case letters used in symbols are considered
     different.  For example, `NOUN' is not the same as `Noun', and
     neither is the same as `noun'.

   * The symbol `X' (capital letter x) may be used to stand for any
     terminal or nonterminal.  For example, this rule says that any
     category in the grammar rules can be replaced by two copies of the
     same category separated by a CJ.

          Rule X -> X_1 CJ X_2
                  <X cat>  = <X_1 cat>
                  <X cat>  = <X_2 cat>
                  <X arg1> = <X_1 arg1>
                  <X arg1> = <X_2 arg1>


     The symbol X can be useful for capturing generalities.  Care must
     be taken, since it can be replaced by anything.

   * Index numbers are used to distinguish instances of a symbol that
     is used more than once in a rule.  They are added to the end of a
     symbol following an underscore character (`_').  This is
     illustrated in the rule for X above.

   * The characters `(){}[]<>=:/' cannot be used in terminal or
     nonterminal symbols since they are used for special purposes in the
     grammar file.  The character `_' can be used _only_ for attaching
     an index number to a symbol.

   * By default, the left hand symbol of the first rule in the grammar
     file is the start symbol of the grammar.

The symbols on the right hand side of a phrase structure rule may be
marked or grouped in various ways:

   * Parentheses around an element of the expansion (right hand) part
     of a rule indicate that the element is optional. Parentheses may
     be placed around multiple elements. This makes an optional group
     of elements.

   * A forward slash (/) is used to separate alternative elements of the
     expansion (right hand) part of a rule.

   * Curly braces can be used for grouping alternative elements. For
     example the following says that an S consists of an NP followed by
     either a TVP or an IV:

          Rule S -> NP {TVP / IV}

   * Alternatives are taken to be as long as possible. Thus if the curly
     braces were omitted from the rule above, as in the rule below, the
     TVP would be treated as part of the alternative containing the NP,
     and the rule would expand S to either `NP TVP' or `IV' alone.  The
     NP would not be allowed before the IV.

          Rule S -> NP TVP / IV

   * Parentheses group enclosed elements the same as curly braces do.
     Alternatives and groups delimited by parentheses or curly braces
     may be nested to any depth.
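
As a further sketch of these conventions (the category names are
hypothetical), the following rule allows a noun phrase to be either a
pronoun alone, or a noun with an optional determiner before it and an
optional prepositional phrase after it:

     Rule NP -> {PRO / (DET) N (PP)}

Under the conventions described above, this single rule covers the
expansions `PRO', `N', `DET N', `N PP', and `DET N PP'.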

The phrase structure rule can be followed by zero or more _unification
constraints_ that refer to symbols used in the rule.  A unification
constraint has these parts, in the order listed:

  1. a feature path that begins with one of the symbols from the phrase
     structure rule

  2. an equal sign

  3. either another path or a value

A unification constraint that refers only to symbols on the right hand
side of the rule constrains their co-occurrence.  In the following rule
and constraint, the values of the _agr_ features for the NP and VP
nodes of the parse tree must unify:

     Rule S -> NP VP
             <NP agr> = <VP agr>


If a unification constraint refers to a symbol on the right hand side of
the rule, and has an atomic value on its right hand side, then the
designated feature must not have a different value.  In the following
rule and constraint, the _head case_ feature for the NP node of the
parse tree must either be originally undefined or equal to NOM:

     Rule S -> NP VP
             <NP head case> = NOM


(After unification succeeds, the _head case_ feature for the NP node of
the parse tree will be equal to NOM.)

A unification constraint that refers to the symbol on the left hand
side of the rule passes information up the parse tree.  In the
following rule and constraint, the value of the _tense_ feature is
passed from the VP node up to the S node:

     Rule S -> NP VP
             <S tense> = <VP tense>


See `Feature constraints' above for more details about unification
constraints.

The phrase structure rule can also be followed by zero or more
_priority union operations_ that refer to symbols used in the rule.  A
priority union operation has these parts, in the order listed:

  1. a feature path that begins with one of the symbols from the phrase
     structure rule

  2. a priority union operation sign (`<=')

  3. either another path or an atomic value

Although priority union operations may be intermingled with unification
constraints following the phrase structure rule, they are applied only
after all unification constraints have succeeded.  Therefore, it makes
more sense to place them after all of the unification constraints as a
reminder of the order of application.

Priority union operations may not appear inside a disjunction: if two
rules logically differ only in the application of one priority union or
another, both rules must be written out in full.

The phrase structure rule can also be followed by zero or more _logical
constraint operations_ that refer to symbols used in the rule.  A
logical constraint operation has these parts, in the order listed:

  1. a feature path that begins with one of the symbols from the phrase
     structure rule

  2. a logical constraint operation sign (`==')

  3. a logical constraint expression, or a constraint template label

Although logical constraint operations may be intermingled with
unification constraints or priority union operations following the
phrase structure rule, they are applied only after all unification
constraints have succeeded and all priority union operations have been
applied.  Therefore, it makes more sense to place them after all of the
unification constraints, and after any priority union operations, as a
reminder of the order of application.

Logical constraint operations may not appear inside a disjunction: if
two rules logically differ only in the application of one logical
constraint or another, both rules must be written out in full.

These last two elements of a PC-PATR rule are enhancements to the
original PATR-II formalism.  For this reason, they are discussed in more
detail in the following two sections.

Priority union operations
-------------------------

Unification is the only mechanism implemented in the original PATR-II
formalism for merging two feature structures.  There are situations
where the desired percolation of information is not easily expressed in
terms of unification.  For example, consider the following rule (where
_ms_ stands for _morphosyntactic features_):

     Stem -> Root Deriv:
             <Root ms>  =  <Deriv msFrom>
             <Stem ms>  =  <Root ms>
             <Stem ms>  =  <Deriv msTo>


The first unification expression above imposes the agreement constraints
for this rule.  The second and third unification expressions attempt to
provide the percolation of information up to the `Stem'.  However, it
is quite possible for there to be a conflict between `<Root ms>' and
`<Deriv msTo>'.  Any such conflict would cause the third unification
expression to fail, causing the rule as a whole to fail.  The only way
around this at present is to provide a large number of unification
expressions that go into greater depth in the feature structures.  Even
then it may not be possible to always avoid conflicts.

An additional mechanism for merging feature structures is provided to
properly handle percolation of information: overwriting via priority
union.  The notation of the previous example changes slightly to the
following:

     Stem -> Root Deriv:
             <Root ms>  =  <Deriv msFrom>
             <Stem ms>  =  <Root ms>
             <Stem ms> <=  <Deriv msTo>


The only change is in the third expression under the rule: the
unification operator `=' has been changed to a priority union operator
`<='.  This new operator is the same as unification except for handling
conflicts and storing results.  In unification, a conflict causes the
operation to fail.  In priority union, a conflict is resolved by taking
the value in the right hand feature structure.  In unification, both
the left hand feature structure and the right hand feature structure
are replaced by the unified result.  In priority union, only the left
hand feature structure is replaced by the result.

There is one other significant difference between unification and
priority union.  Unification is logically an unordered process; it makes
no difference in what order the unification expressions are written.
Priority union, on the other hand, is inherently ordered; a priority
union operation always overrides any earlier priority union (or
unification) result.  For this reason, all unification expressions are
evaluated before any priority union expressions, and the ordering of the
priority union expressions is significant.
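
The contrast between the two operators can be sketched in Python.  This
is only an illustration, not the PC-PATR implementation: it assumes
that feature structures are plain nested dictionaries with string
atoms, it ignores shared (reentrant) values, and the function names
`unify' and `priority_union' are hypothetical.

```python
def unify(left, right):
    """Return the unification of two structures, or None on a conflict."""
    if left == {}:
        return right
    if right == {}:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for label, value in right.items():
            if label in result:
                merged = unify(result[label], value)
                if merged is None:
                    return None          # conflict: unification fails
                result[label] = merged
            else:
                result[label] = value
        return result
    return left if left == right else None   # atoms must be identical


def priority_union(left, right):
    """Merge right into left; on a conflict the right-hand value wins."""
    if right == {}:
        return left
    if isinstance(left, dict) and isinstance(right, dict):
        result = dict(left)
        for label, value in right.items():
            if label in result:
                result[label] = priority_union(result[label], value)
            else:
                result[label] = value
        return result
    return right


# Assumed sample values for <Root ms> and <Deriv msTo>.
root_ms = {"cat": "V", "finite": "-"}
deriv_ms_to = {"cat": "N"}

print(unify(root_ms, deriv_ms_to))           # None
print(priority_union(root_ms, deriv_ms_to))  # {'cat': 'N', 'finite': '-'}
```

With these sample values the `cat' features conflict, so unification
fails, while priority union keeps `finite' from the root and takes
`cat' from the suffix.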

A BNF grammar for PC-PATR priority union operations follows.

     <priority-union> ::= <feature-path> '<=' <feature-path>
                        | <feature-path> '<=' <ATOM>
     
     <feature-path>   ::= '<' <label-list> '>'
     
     <label-list>     ::= <LABEL>
                        | <LABEL> <label-list>


Note that both `<LABEL>' and `<ATOM>' refer to a single string token of
contiguous characters.

Logical constraint operations
-----------------------------

Unification is the only mechanism implemented in the original PATR-II
formalism for imposing constraints on feature structures.  There are
situations where the desired constraint is not easily expressed in terms
of unification.  For example, consider the following rule:

     Stem -> Root Deriv:
             <Root ms>  =  <Deriv msFrom>
             <Stem ms>  =  <Root ms>
             <Stem ms> <=  <Deriv msTo>


where `<Root ms>' and `<Deriv msFrom>' have the following feature
structures:

     [Root: [ms: [finite: - ...]]]
     
     [Deriv: [msFrom: [tense: past ...]]]


Assume that from our knowledge of verb morphology, we would like to
rule out this analysis because only finite verb roots (`[finite: +]')
are marked for tense.  The only way to do this with unification is to
add `[finite: +]' to the `msFrom' feature of all the tense bearing
derivational suffixes.  This would work, but it adds information to
suffixes that properly belongs only to roots.  A better approach would
be some way to express the desired constraint more directly.  Consider
the following rule:

     Stem -> Root Deriv:
             <Root ms>  =  <Deriv msFrom>
             <Stem ms>  =  <Root ms>
             <Stem ms> <=  <Deriv msTo>
             <Stem ms> ==  [finite: +] <-> [tense: []]


The fourth feature expression under the rule is a new operation called a
constraint.  This particular constraint is interpreted as follows: if
the feature structure `[finite: +]' _subsumes_ the feature structure
that is the value of `<Stem ms>', then the feature structure `[tense:
[]]' must also subsume the feature structure that is the value of
`<Stem ms>', and if the feature structure `[finite: +]' does not
subsume the feature structure that is the value of `<Stem ms>', then
the feature structure `[tense: []]' must not subsume the feature
structure that is the value of `<Stem ms>'.  (A feature structure _F1_
subsumes another feature structure _F2_ if _F1_ contains a subset of
the information contained by _F2_.  The empty feature structure `[]'
subsumes all other feature structures.  Subsumption is a partial
ordering: not every two feature structures are in a subsumption
relation to each other.)
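
The subsumption relation defined above can be sketched as follows.  The
sketch assumes that feature structures are nested Python dictionaries
with string atoms and that `[]' is the empty dictionary; the function
name `subsumes' is hypothetical and is not taken from the PC-PATR
sources.

```python
def subsumes(general, specific):
    """Return True if `general` holds a subset of the information in `specific`."""
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return not general       # only the empty structure subsumes an atom
        return all(
            label in specific and subsumes(value, specific[label])
            for label, value in general.items()
        )
    return general == specific       # atoms subsume only themselves


finite = {"finite": "+"}
finite_past = {"finite": "+", "tense": "past"}
past = {"tense": "past"}

print(subsumes({}, finite_past))       # True
print(subsumes(finite, finite_past))   # True
print(subsumes(finite_past, finite))   # False
print(subsumes(finite, past), subsumes(past, finite))   # False False
```

The last line shows the partial ordering: neither of the two structures
subsumes the other.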

A constraint differs both syntactically and semantically from
either unification or priority union.  The first difference is that a
constraint does not modify any feature structures; it merely compares
the content of two feature structures.  The second difference is that
the right hand side of a constraint expression is a logical expression
involving one or more feature structures rather than a feature path.

Constraints support two unary and four binary logical operations:
existence, negation, logical and, logical or, conditional, and
biconditional.  The following tables summarize these logical operations.
(`$' is used for the subsumption operation.  `*P' represents the
feature structure pointed to by the feature path associated with the
logical constraint.  `F', `L', and `R' represent a feature structure
associated with the logical constraint.)

              existence negation
     F $ *P    P == F    P == ~F
     ------    ------    -------
      true      true      false
      false     false     true
     
                      logical and    logical or    conditional    biconditional
     L $ *P  R $ *P    P == L & R    P == L / R    P == L -> R    P == L <-> R
     ------  ------    ----------    ----------    -----------    ------------
      true    true        true          true          true            true
      true    false       false         true          false           false
      false   true        false         true          true            false
      false   false       false         false         true            true


Since they apply to the final feature structure, constraint expressions
are evaluated after all of the unification and priority union
expressions.  Like unification and unlike priority union, the relative
order of constraints is not (logically) important.
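
The truth tables above can be reproduced with a small evaluator.  This
is a sketch under stated assumptions rather than the actual
implementation: feature structures are nested dictionaries, an
expression is either a feature structure or a tuple headed by one of
the operator strings, and the names `subsumes' and `holds' are
hypothetical.

```python
def subsumes(general, specific):
    """Return True if `general` holds a subset of the information in `specific`."""
    if isinstance(general, dict):
        if not isinstance(specific, dict):
            return not general
        return all(
            label in specific and subsumes(value, specific[label])
            for label, value in general.items()
        )
    return general == specific


def holds(expression, target):
    """Evaluate a logical constraint expression against the structure *P."""
    if isinstance(expression, dict):      # bare feature structure: existence
        return subsumes(expression, target)
    operator, *operands = expression
    if operator == "~":                   # negation
        return not holds(operands[0], target)
    left, right = (holds(operand, target) for operand in operands)
    if operator == "&":                   # logical and
        return left and right
    if operator == "/":                   # logical or
        return left or right
    if operator == "->":                  # conditional
        return (not left) or right
    if operator == "<->":                 # biconditional
        return left == right
    raise ValueError("unknown operator: " + operator)


# <Stem ms> == [finite: +] <-> [tense: []]
valid_verb = ("<->", {"finite": "+"}, {"tense": {}})

print(holds(valid_verb, {"finite": "+", "tense": "past"}))   # True
print(holds(valid_verb, {"finite": "-", "tense": "past"}))   # False
print(holds(valid_verb, {"finite": "-"}))                    # True
```

Because `holds' only reads the target structure, the order in which
several constraints are checked does not affect the outcome.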

A BNF grammar for PC-PATR logical constraint operations follows.

     <logical-constraint> ::= <feature-path> '==' <expression>
     
     <feature-path>       ::= '<' <label-list> '>'
     
     <label-list>         ::= <LABEL>
                            | <LABEL> <label-list>
     
     <expression>         ::=     <factor>
                            | '~' <factor>
                            |     <factor> <binop>     <factor>
                            | '~' <factor> <binop>     <factor>
                            |     <factor> <binop> '~' <factor>
                            | '~' <factor> <binop> '~' <factor>
     
     <factor>             ::= <feature>
                            | '(' <expression> ')'
     
     <binop>              ::= '&'
                            | '/'
                            | '->'
                            | '<->'
     
     <feature>            ::= '[' <attribute-list> ']'
                            | '[]'
     
     <attribute-list>     ::= <attribute>
                            | <attribute> <attribute-list>
     
     <attribute>          ::= <LABEL> ':' <ATOM>
                            | <LABEL> ':' <feature>
                            | <LABEL> ':' <indexedvariable>
     
     <indexedvariable>    ::= '^1'
                            | '^2'
                            | '^3'
                            | '^4'
                            | '^5'
                            | '^6'
                            | '^7'
                            | '^8'
                            | '^9'

Note that both `<LABEL>' and `<ATOM>' refer to a single string token of
contiguous characters.

An `<indexedvariable>' is interpreted as a variable for the atomic
value at that place in the feature structure.  The first such variable
is instantiated by the atomic value of the feature at that place in the
feature-path.  All subsequent instances of the variable are compared for
equality with the first instantiated one.

Why might one need such an indexed variable?  In some SOV languages
with pro-drop and noun-verb compounding, a clause consisting just of a
`Noun Verb' sequence is potentially at least three ways ambiguous:
   * `Subject Verb'

   * pro-drop `Object Verb'

   * pro-drop `Noun-Verb-compound'
In at least one of these languages, it is the case that when a
noun-verb compound is possible, it is the only valid reading.
Therefore, the correct thing to do is to ensure that none of the other
possible readings are allowed by the grammar.

Here's a (simplified) example of how one can use indexed variables to
rule out the `Subject Verb' case.  (The `Noun' is realized as the `DP'
node and the `Verb' is realized as a `VP' which is a daughter of the
`I'' node in the following rule.)

     rule {IP option 2cI - subject initial, required, root clause}
     IP = DP I'
         <IP head> = <I' head>
         <IP head type root> = +
         <IP head type pro-drop> = -
            ...
         <DP head case nominative> = +
            ...
         <IP head> == [rootgloss:^1] ->
                      ~ ( [type:[no_intervening:+]] &
                        (( [subject:[head:[type:[compounds_with1:^1]]]]
                         / [subject:[head:[type:[compounds_with2:^1]]]])
                         / ([subject:[head:[type:[compounds_with3:^1]]]]
                         / [subject:[head:[type:[compounds_with4:^1]]]]) ) )
            ...

In the final logical constraint above, the
atomic value of the `rootgloss' feature is stored in variable `^1' in
the antecedent (the "if" part) of the conditional.  This atomic value
is then compared with the values of the various `compounds_with'
features.  The idea is that the value of the `rootgloss' feature should
not be any of the values of the various `compounds_with' features (there
is more than one of these because a given noun may compound with more
than one verb).
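
The binding behaviour of an indexed variable can be sketched as
follows.  The sketch assumes nested dictionaries for feature
structures, represents a variable as the string `^1', and keeps the
bindings in an ordinary dictionary; the function name `match' and the
sample values are hypothetical, and only one `compounds_with' feature
is shown.

```python
def match(pattern, target, bindings):
    """Subsumption test in which '^n' strings act as indexed variables."""
    if isinstance(pattern, str) and pattern.startswith("^"):
        if isinstance(target, dict):
            return False                  # variables stand for atomic values
        if pattern not in bindings:
            bindings[pattern] = target    # first occurrence instantiates
            return True
        return bindings[pattern] == target   # later ones are compared
    if isinstance(pattern, dict):
        if not isinstance(target, dict):
            return not pattern
        return all(
            label in target and match(value, target[label], bindings)
            for label, value in pattern.items()
        )
    return pattern == target


antecedent = {"rootgloss": "^1"}
compound = {"subject": {"head": {"type": {"compounds_with1": "^1"}}}}

# Assumed value of <IP head>: the verb `eat' with a subject noun that
# compounds with `eat'.
ip_head = {
    "rootgloss": "eat",
    "subject": {"head": {"type": {"compounds_with1": "eat"}}},
}

bindings = {}
print(match(antecedent, ip_head, bindings))   # True
print(bindings)                               # {'^1': 'eat'}
print(match(compound, ip_head, bindings))     # True
```

Since the `compounds_with1' value equals the bound root gloss, the
negated part of the constraint would be false for this structure, and
the `Subject Verb' reading would be rejected.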

Feature templates
=================

A PC-PATR feature template has these parts, in the order listed:
  1. the keyword `Let'

  2. the template name

  3. the keyword `be'

  4. a feature definition

  5. an optional period (`.')

If the template name is a terminal category (a terminal symbol in
one of the phrase structure rules), the template defines the default
features for that category.  Otherwise the template name serves as an
abbreviation for the associated feature structure.

The characters `(){}[]<>=:' cannot be used in template names since they
are used for special purposes in the grammar file.  The characters `/_'
can be freely used in template names.  The character `\' should not be
used as the first character of a template name because that is how
fields are marked in the lexicon file.

The abbreviations defined by templates are usually used in the feature
field of entries in the lexicon file.  For example, the lexical entry
for the irregular plural form _feet_ may have the abbreviation _pl_ in
its features field.  The grammar file would define this abbreviation
with a template like this:

     Let pl be [number: PL]

The path notation may also be used:

     Let pl be <number> = PL

More complicated feature structures may be defined in templates.  For
example,

     Let 3sg be [tense:  PRES
                 agr:    3SG
                 finite: +
                 vform:  S]


which is equivalent to:

     Let 3sg be <tense>  = PRES
                <agr>    = 3SG
                <finite> = +
                <vform>  = S


In the following example, the abbreviation _irreg_ is defined using
another abbreviation:

     Let irreg be <reg> = -
                  pl


The abbreviation _pl_ must be defined previously in the grammar file or
an error will result.  A subsequent template could also use the
abbreviation _irreg_ in its definition.  In this way, an inheritance
hierarchy of features may be constructed.

Feature templates permit disjunctive definitions.  For example, the
lexical entry for the word _deer_ may specify the feature abbreviation
_sg/pl_.  The grammar file would define this as a disjunction of
feature structures reflecting the fact that the word can be either
singular or plural:

     Let sg/pl be {[number:SG]
                   [number:PL]}


This has the effect of creating two entries for _deer_, one with
singular number and another with plural.  Note that there is no limit
to the number of disjunct structures listed between the braces.  Also,
there is no slash (`/') between the elements of the disjunction as
there is between the elements of a disjunction in the rules.  A shorter
version of the above template using the path notation looks like this:

     Let sg/pl be <number> = {SG PL}

Abbreviations can also be used in disjunctions, provided that they have
previously been defined:

     Let sg be <number> = SG
     Let pl be <number> = PL
     Let sg/pl be {[sg] [pl]}


Note the square brackets around the abbreviations _sg_ and _pl_;
without square brackets they would be interpreted as simple values
instead.

Feature templates can assign default atomic feature values, indicated
by prefixing an exclamation point (!).  A default value can be
overridden by an explicit feature assignment.  This template says that
all members of category N have singular number as a default value:

     Let N be <number> = !SG

The effect of this template is to make all nouns singular unless they
are explicitly marked as plural.  For example, regular nouns such as
_book_ do not need any feature in their lexical entries to signal that
they are singular; but an irregular noun such as _feet_ would have a
feature abbreviation such as _pl_ in its lexical entry.  This would be
defined in the grammar as `[number: PL]', and would override the
default value for the feature number specified by the template above.
If the N template above used `SG' instead of `!SG', then the word
_feet_ would fail to parse, since its _number_ feature would have an
internal conflict between `SG' and `PL'.
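
The difference between a default value and an ordinary value can be
sketched as follows.  The sketch assumes flat dictionaries of atomic
features, marks a default with a leading `!' as in the template
notation, and returns `None' where PC-PATR would reject the word; the
function name `build_entry' is hypothetical.

```python
def build_entry(category_template, lexical_features):
    """Combine a category template with the features of a lexical entry."""
    entry = dict(lexical_features)
    for label, value in category_template.items():
        if value.startswith("!"):
            entry.setdefault(label, value[1:])   # default: used only if absent
        elif entry.get(label, value) != value:
            return None                          # ordinary value: conflict
        else:
            entry[label] = value
    return entry


print(build_entry({"number": "!SG"}, {}))                 # {'number': 'SG'}
print(build_entry({"number": "!SG"}, {"number": "PL"}))   # {'number': 'PL'}
print(build_entry({"number": "SG"}, {"number": "PL"}))    # None
```

The first call corresponds to _book_, the second to _feet_ with the
default template, and the third to _feet_ with a template that uses
`SG' instead of `!SG'.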

Parameter settings
==================

A PC-PATR parameter setting has these parts, in the order listed:
  1. the keyword `Parameter'

  2. an optional colon (`:')

  3. one or more keywords identifying the parameter

  4. the keyword `is'

  5. the parameter value

  6. an optional period (`.')

PC-PATR recognizes the following parameters:
`Start symbol'
     defines the start symbol of the grammar.  For example,

          Parameter Start symbol is S

     declares that the parse goal of the grammar is the nonterminal
     category S.  The default start symbol is the left hand symbol of
     the first phrase structure rule in the grammar file.

`Restrictor'
     defines a set of features to use for top-down filtering, expressed
     as a list of feature paths.  For example,

          Parameter Restrictor is <cat> <head form>

     declares that the _cat_ and _head form_ features should be used to
     screen rules before adding them to the parse chart.  The default
     is not to use any features for such filtering.  This filtering,
     named _restriction_ in Shieber (1985), is performed in addition to
     the normal top-down filtering based on categories alone.
     Note that restriction is not yet implemented; whether it should
     replace the category-based filtering rather than supplement it is
     still an open question.

`Attribute order'
     specifies the order in which feature attributes are displayed.  For
     example,

          Parameter Attribute order is cat lex sense head
                                       first rest agreement


     declares that the _cat_ attribute should be the first one shown in
     any output from PC-PATR, and that the other attributes should be
     shown in the relative order listed, with the _agreement_ attribute
     shown last among those listed, but ahead of any attributes that
     are not listed above.  Attributes that are not listed are ordered
     according to their character code sort order.  If the attribute
     order is not specified, then the category feature _cat_ is shown
     first, with all other attributes sorted according to their
     character codes.

`Category feature'
     defines the label for the category attribute.  For example,

          Parameter Category feature is Categ

     declares that _Categ_ is the name of the category attribute.  The
     default name for this attribute is _cat_.

`Lexical feature'
     defines the label for the lexical attribute.  For example,

          Parameter Lexical feature is Lex

     declares that _Lex_ is the name of the lexical attribute.  The
     default name for this attribute is _lex_.

`Gloss feature'
     defines the label for the gloss attribute.  For example,

          Parameter Gloss feature is Gloss

     declares that _Gloss_ is the name of the gloss attribute.  The
     default name for this attribute is _gloss_.

`RootGloss feature'
     defines the label for the root gloss attribute.  For example,

          Parameter RootGloss feature is RootGloss

     declares that _RootGloss_ is the name of the root gloss attribute.
     The default name for this attribute is _rootgloss_.  Note that
     this does not work when using Kimmo to parse words.
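
The display order described under `Attribute order' above can be
sketched as a sort key.  This is an illustration only: it assumes that
"character code sort order" matches Python's default string
comparison, and the function name `display_order' and the sample
labels are hypothetical.

```python
def display_order(labels, attribute_order=None, category_feature="cat"):
    """Sort attribute labels: listed ones first, the rest by character code."""
    if attribute_order is None:
        listed = [category_feature]      # default: category feature first
    else:
        listed = attribute_order
    rank = {label: index for index, label in enumerate(listed)}
    return sorted(labels, key=lambda label: (rank.get(label, len(listed)), label))


labels = ["tense", "head", "agreement", "cat", "lex", "number"]
order = "cat lex sense head first rest agreement".split()

print(display_order(labels, order))
# ['cat', 'lex', 'head', 'agreement', 'number', 'tense']
print(display_order(labels))
# ['cat', 'agreement', 'head', 'lex', 'number', 'tense']
```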

Lexical rules
=============

Lexical rules serve two purposes: providing a flexible means of creating
multiple related lexicon entries, and converting morphological parser
output into a form suitable for syntactic parser input.


     Figure 7. PC-PATR lexical rule example
     
     ; lexicon entry
     \w stormed
     \c V
     \f Transitive AgentlessPassive
        <head trans pred> = storm
     
     ; definitions from the grammar file
     Let Transitive be
             <subcat first cat> = NP
             <subcat rest first cat> = NP
             <subcat rest rest> = end
             <head trans arg1> = <subcat first head trans>
             <head trans arg2> = <subcat rest first head trans>.
     
     Define AgentlessPassive as
             <out cat> = <in cat>
             <out subcat> = <in subcat rest>
             <out lex> = <in lex> ; added for PC-PATR
             <out head> = <in head>
             <out head form> => passiveparticiple.


     Figure 8. Feature structure before lexical rule
     
     [ lex:    stormed
       cat:    V
       head:   [ trans: [ arg1:  $1 []
                          arg2:  $2 []
                          pred:  storm ] ]
       subcat: [ first: [ cat:   NP
                          head:  [ trans: $1 [] ] ]
                 rest:  [ first: [ cat:   NP
                                    head: [ trans: $2 [] ] ]
                          rest:  end                      ] ] ]


     Figure 9. Feature structures after lexical rule
     
     [ lex:    stormed
       cat:    V
       head:   [ trans: [ arg1:  $1 []
                          arg2:  $2 []
                          pred:  storm ] ]
       subcat: [ first: [ cat:   NP
                          head:  [ trans: $1 [] ] ]
                 rest:  [ first: [ cat:   NP
                                    head: [ trans: $2 [] ] ]
                          rest:  end                      ] ] ]
     
     [ lex:    stormed
       cat:    V
       head:   [ trans: [ arg1: []
                          arg2: $1 []
                          pred: storm ]
                 form:  passiveparticiple ]
       subcat: [ first: [ cat:  NP
                          head: [ trans: $1 [] ] ]
                 rest:  end                     ] ]


A PC-PATR lexical rule has these parts, in the order listed:

  1. the keyword `Define'

  2. the name of the lexical rule

  3. the keyword `as'

  4. the rule definition

  5. an optional period (`.')

The rule definition consists of one or more mappings.  Each mapping has
three parts: an output feature path, an assignment operator, and the
value assigned, either an input feature path or an atomic value.  Every
output path begins with the feature name `out' and every input path
begins with the feature name `in'.  The assignment operator is either
an equal sign (`=') or an equal sign followed by a "greater than" sign
(`=>').(1)

Consider the information shown in figure 7.  When the lexicon entry is
loaded, it is initially assigned the feature structure shown in figure
8, which is the unification of the information given in the various
fields of the lexicon entry.  Since one of the labels stored in the
`\f' (feature) field is actually the name of a lexical rule, after the
complete feature structure has been built, the named lexical rule is
applied.  After the rule has been applied, the original single feature
structure has been changed to the two feature structures shown in
figure 9.  Note that not all of the input feature information is found
in both of the output feature structures.


     Figure 10. PC-PATR lexical rule for using PC-Kimmo
     
     Define MapKimmoFeatures as
             <out cat>       = <in head pos>
             <out head>      = <in head>
             <out gloss>     = <in root>
             <out root_pos>  = <in root_pos>


     Figure 11. Feature structure received from PC-Kimmo
     
     [ cat:      Word
       clitic:   -
       drvstem:  -
       head:     [ agr:    [ 3sg: + ]
                   finite: +
                   pos:    V
                   tense:  PRES
                   vform:  S          ]
       root:     `sleep
       root_pos: V                      ]


     Figure 12. Feature structure sent to PC-PATR
     
     [ cat:       V
       gloss:     `sleep
       head:      [ agr:    [ 3sg: + ]
                    finite: +
                    pos:    V
                    tense:  PRES
                    vform:  S          ]
       lex:       sleeps
       root_pos:  V                      ]


Using a lexical rule in conjunction with the PC-Kimmo morphological
parser within PC-PATR is illustrated in figures 10-12.  Figure
10 shows the lexical rule for mapping from the top-level feature
structure produced by the morphological parser to the bottom-level
feature structure used by the sentence parser.  Note that this rule
must be named `MapKimmoFeatures' (unorthodox capitalization and all).
Figure 11 shows the feature structure created by the PC-Kimmo parser.
After the lexical rule shown in figure 10 has been applied (and after
some additional automatic processing), the feature structure shown in
figure 12 is passed to the PC-PATR parser.  Note that only a single
feature structure results from this operation, unlike the result of a
lexical rule applied to a lexicon entry.

Note that the feature structure passed to the PC-PATR parser always has
both a `lex' feature and a `gloss' feature, even if the
`MapKimmoFeatures' lexical rule does not create them.  The default
value for the `lex' feature is the original word from the sentence
being parsed.  The default value for the `gloss' feature is the
concatenation of the glosses of the individual morphemes in the word.

In contrast to the `lex' and `gloss' features which are provided
automatically by default, the `cat' feature must be provided by the
`MapKimmoFeatures' lexical rule.  There is no way to provide this
feature automatically, and it is required for the phrase structure rule
portion of PC-PATR.
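
The mapping in figures 10-12 can be sketched as an ordered list of
path assignments followed by the automatic defaults.  The sketch
assumes nested dictionaries for feature structures, writes paths
without the leading `in' and `out' labels, and silently skips a
mapping whose input path is missing; all function names are
hypothetical and the code is not taken from PC-PATR.

```python
import copy


def get_path(structure, path):
    """Follow a feature path; return None if any label is missing."""
    for label in path:
        if not isinstance(structure, dict) or label not in structure:
            return None
        structure = structure[label]
    return structure


def set_path(structure, path, value):
    """Store a value at a feature path, creating intermediate structures."""
    for label in path[:-1]:
        structure = structure.setdefault(label, {})
    structure[path[-1]] = value


def apply_lexical_rule(mappings, in_structure):
    """Apply (output path, input path) pairs in order to build the output."""
    out = {}
    for out_path, in_path in mappings:
        value = get_path(in_structure, in_path)
        if value is not None:
            set_path(out, out_path, copy.deepcopy(value))
    return out


MAP_KIMMO_FEATURES = [
    (("cat",), ("head", "pos")),
    (("head",), ("head",)),
    (("gloss",), ("root",)),
    (("root_pos",), ("root_pos",)),
]


def map_kimmo(kimmo_structure, word, default_gloss):
    """Apply the rule, then supply `lex' and `gloss' if they are missing."""
    out = apply_lexical_rule(MAP_KIMMO_FEATURES, kimmo_structure)
    out.setdefault("lex", word)
    out.setdefault("gloss", default_gloss)
    return out


figure_11 = {
    "cat": "Word", "clitic": "-", "drvstem": "-",
    "head": {"agr": {"3sg": "+"}, "finite": "+", "pos": "V",
             "tense": "PRES", "vform": "S"},
    "root": "`sleep", "root_pos": "V",
}
# The default gloss passed here is a made-up placeholder; it is not
# used because the rule already supplies <out gloss>.
figure_12 = map_kimmo(figure_11, "sleeps", "placeholder-gloss")
print(figure_12["cat"], figure_12["lex"], figure_12["gloss"])
```

Under these assumptions the result has the same content as figure 12:
`cat' comes from `<in head pos>', `gloss' from `<in root>', and `lex'
is filled in from the original word.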

---------- Footnotes ----------

(1) These two operators are equivalent in PC-PATR, since the
implementation treats each lexical rule as an ordered list of
assignments rather than using unification for the mappings that have an
equal sign operator.

Constraint templates
====================

A PC-PATR constraint template has these parts, in the order listed:
  1. the keyword `Constraint'

  2. the template name

  3. the keyword `is'

  4. a logical constraint expression

  5. an optional period (`.')

The characters `(){}[]<>=:/' cannot be used in constraint template
names since they are used for special purposes in the grammar file.  The
characters `_\' can be freely used in constraint template names.

The abbreviations defined by constraint templates are used in the
logical constraint operations that are part of the rules defined in the
grammar file.  A constraint template must be defined in the grammar
file before it can be used in a rule.

Consider the following rules in a grammar file:

     RULE Word -> Stem
             <Word ms> = <Stem ms>
             <Stem ms> == [finite: +] <-> [tense: []]
     
     RULE Word -> Stem Infl
             <Word ms> = <Stem ms>
             <Word ms> = <Infl ms>
             <Stem ms> == [finite: +] <-> [tense: []]
     
     RULE Stem -> Root Deriv
             <Root ms>  = <Deriv msFrom>
             <Stem ms>  = <Root ms>
             <Stem ms> <= <Deriv msTo>
             <Stem ms> == [finite: +] <-> [tense: []]
     
     RULE Stem -> Root
             <Stem ms> = <Root ms>
             <Stem ms> == [finite: +] <-> [tense: []]


These rules can be simplified by defining a constraint template:

     CONSTRAINT ValidVerb is [finite: +] <-> [tense: []]
     
     RULE Word -> Stem
             <Word ms> = <Stem ms>
             <Stem ms> == ValidVerb
     
     RULE Word -> Stem Infl
             <Word ms> = <Stem ms>
             <Word ms> = <Infl ms>
             <Stem ms> == ValidVerb
     
     RULE Stem -> Root Deriv
             <Root ms>  = <Deriv msFrom>
             <Stem ms>  = <Root ms>
             <Stem ms> <= <Deriv msTo>
             <Stem ms> == ValidVerb
     
     RULE Stem -> Root
             <Stem ms> = <Root ms>
             <Stem ms> == ValidVerb


Standard format
***************

Some of the input control files that PC-PATR reads are "standard
format" files.  This means that the files are divided into records and
fields.  A standard format file contains at least one record, and some
files may contain a large number of records.  Each record contains one
or more fields.  Each field occupies at least one line, and is marked
by a "field code" at the beginning of the line.  A field code begins
with a backslash character (`\'), and contains one or more printing
characters (usually alphabetic) in addition.

If the file is designed to have multiple records, then one of the field
codes must be designated to be the "record marker", and every record
begins with that field, even if it is empty apart from the field code.
If the file contains only one record, then the relative order of the
fields is constrained only by their semantics.

It is worth emphasizing that field codes must be at the _beginning_ of
a line.  Even a single space before the backslash character prevents it
from being recognized as a field code.

It is also worth emphasizing that record markers _must_ be present even
if that field has no information for that record.  Omitting the record
marker causes two records to be merged into a single record, with
unpredictable results.
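
The record and field rules above can be sketched with a small reader.
This is an illustration under stated assumptions, not the code PC-PATR
uses: lines that do not start with a backslash are treated as
continuations of the preceding field, and the function name
`parse_standard_format' is hypothetical.

```python
def parse_standard_format(text, record_marker):
    """Split standard format text into records of [field code, value] pairs."""
    records = []
    for line in text.splitlines():
        if line.startswith("\\"):         # a field code must start the line
            parts = line.split(None, 1)
            code = parts[0]
            value = parts[1].strip() if len(parts) > 1 else ""
            if code == record_marker or not records:
                records.append([])        # the record marker opens a record
            records[-1].append([code, value])
        elif records and line.strip():    # continuation of the last field
            field = records[-1][-1]
            field[1] = (field[1] + " " + line.strip()).strip()
    return records


LEXICON = r"""
\w fox
\c N
\g canine

\w foxes
\c N
\g canine+PL
"""

for record in parse_standard_format(LEXICON, r"\w"):
    print(record)
```

If the second `\w' line is deleted from the sample, the reader returns
a single record containing the fields of both words, which is the
merging problem described above.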

The PC-PATR Lexicon File
************************

The lexicon file is a "standard format" database file consisting of any
number of records, each of which represents one word.  These records
are divided into fields, each of which begins with a standard format
marker at the beginning of a line.  These markers begin with the `\'
(backslash) character followed by one or more alphanumeric characters.
Each record begins with a designated field.  PC-PATR recognizes four
different fields, with these default field markers:
`\w'
     the lexical form of the word, spelled exactly as it will appear in
     any sentences or phrases input to PC-PATR(1)

`\c'
     word category (part of speech)

`\g'
     word gloss

`\f'
     additional features of this word

Note that the fields containing the lexical form of the word and its
category must be present for each word (record) in the lexicon.  The
other two fields (glosses and features) are optional, as are additional
fields that may be present for other purposes.

Each word loaded from the lexicon file is assigned certain features
based on the fields described above.
   * The value of the "lex" feature is the lexical form of the word,
     taken from the lexical form field of the word's entry in the
     lexicon.

   * The value of the "cat" feature is the lexical category of the word,
     for example, Noun, Verb, Adjective, and so on.  This is taken from
     the category field of the word's entry in the lexicon.  Note that
     the same lexical form can appear multiple times in the lexicon,
     with a different category for each occurrence.

   * The value of the "gloss" feature is the gloss of the word, taken
     from the gloss field of the word's entry in the lexicon.  Unlike
     the previous two items, this feature is optional.
These feature names should be treated as reserved names and not used
for other purposes.

For example, consider these entries for the words _fox_ and _foxes_:

     \w fox
     \c N
     \g canine
     \f <number> = singular
     
     \w foxes
     \c N
     \g canine+PL
     \f <number> = plural


When these entries are used by the grammar, they are represented by
these feature structures:

     [cat:    N
      gloss:  canine
      lex:    fox
      number: singular]
     
     [cat:    N
      gloss:  canine+PL
      lex:    foxes
      number: plural]


The lexicon entries can be simplified by defining feature templates in
the grammar file.  Consider the following templates:

     Let PL be <number> = plural
     Let N  be <number> = !singular


With these two templates, defining an abbreviation for "plural" and
defining a default feature for category N (noun), the lexicon entries
can be rewritten as follows:

     \w fox
     \c N
     \g canine
     \f
     
     \w foxes
     \c N
     \g canine+PL
     \f PL


Note that the feature (`\f') field of the first entry could be omitted
altogether since it is now empty.

---------- Footnotes ----------

(1) By default, `\w' also marks the initial field of each word's record.

The AMPLE Analysis File
***********************

Rather than using a dedicated lexicon file, PC-PATR can load its
internal lexicon from one or more analysis files produced by the AMPLE
morphological analysis program.  AMPLE writes a standard format
database for its output, each record of which corresponds to a word of
the source text.  The first field of each entry contains the analysis.
Other fields, which may or may not occur, contain additional
information.

The utility of this command has been greatly reduced by the
availability of the `load ample' and `load kimmo' commands which allow
morphological analysis on demand to populate PC-PATR's word lexicon.
However, the `file disambiguate' command also operates on AMPLE
analysis files, so this information is still of interest.

AMPLE analysis file fields
==========================

This section describes the fields that AMPLE writes to the output
analysis file.  The only field that is guaranteed to exist is the
analysis (`\a') field.  All other fields are either data dependent or
optional.

Analysis: \a
------------

The analysis field (`\a') starts each record of the output analysis
file.  It has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.  In the simplest case, an analysis field would look
like this:

     \a < CAT root >

The `\rd' field in the analysis data file can replace the characters
used to bracket the root category and gloss/etymology; see `Root
Delimiter Characters: \rd' in the AMPLE Reference Manual.  The
dictionary field code mapped to `M' in the dictionary codes file
controls the affix and default root morphnames; see `Morphname
(internal code M)' in the AMPLE Reference Manual.  If the AMPLE `-g'
command line option was given, the output analysis file contains
glosses from the root dictionary marked by the field code mapped to `G'
in the dictionary codes file; see `AMPLE Command Options' and `Root
Gloss (internal code G)' in the AMPLE Reference Manual.
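
The form above can be taken apart mechanically.  The following Python
sketch is one way to do so; it is an illustration only, not code from
PC-PATR.  The name `parse_analysis' is invented, and the sketch assumes
the default `<' and `>' root delimiters, a single unambiguous analysis,
and morphnames, categories, and roots that contain no whitespace.

```python
def parse_analysis(field, open_mark="<", close_mark=">"):
    """Split the body of an analysis field into its three parts.

    Returns (prefix_side, roots, suffix_side), where roots is a list
    of (category, root) pairs.  Infix morphnames stay wherever they
    appear on the prefix or suffix side.
    """
    tokens = field.split()
    start = tokens.index(open_mark)   # ValueError if the delimiter is missing
    end = tokens.index(close_mark)
    inner = tokens[start + 1:end]
    # Inside the delimiters, tokens alternate: CAT root CAT root ...
    roots = list(zip(inner[0::2], inner[1::2]))
    return tokens[:start], roots, tokens[end + 1:]


print(parse_analysis("< A0 imaika > CNJT AUG"))
```

If the `\rd' field has replaced the delimiters, the replacement
characters would be passed as `open_mark' and `close_mark'.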

Decomposition (surface forms): \d
---------------------------------

The morpheme decomposition field (`\d') follows the analysis field.  It
has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface form
of the word.

The `\dsc' field in the text input control file can replace the hyphen
with another character for separating the morphemes; see `Decomposition
Separation Character: \dsc' in the AMPLE Reference Manual.

The morpheme decomposition field is optional.  It is enabled either by
an AMPLE `-w d' command line option (see `AMPLE Command Options' in the
AMPLE Reference Manual), or by an interactive query.

Category (possible word or morpheme): \cat
------------------------------------------

The category field (`\cat') provides rudimentary category information.
It has the following form:

     \cat CAT

where `CAT' is the proposed word category.  A more complex example is

     \cat C0 C1/C0=C2=C2/C1=C1/C1

where `C0' is the proposed word category, `C1/C0' is a prefix category
pair, `C2' is a root category, and `C2/C1' and `C1/C1' are suffix
category pairs.  The equal signs (`=') serve to separate the category
information of the individual morphemes.

The `\cat' field of the analysis data file controls whether the
category field is written to the output analysis file; see `Category
output control: \cat' in the AMPLE Reference Manual.
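
The structure of the category field can be sketched in Python as
follows.  This is an illustration only; the function name
`parse_category' is invented, and the sketch assumes an unambiguous
field whose category names contain neither `=' nor `/'.

```python
def parse_category(field):
    """Split a category field body into the word category and the
    per-morpheme categories.

    A morpheme with a category pair yields a 2-tuple; a root yields a
    1-tuple holding only its category.
    """
    parts = field.split(None, 1)
    word_category = parts[0] if parts else ""
    morphemes = parts[1].split("=") if len(parts) > 1 else []
    return word_category, [tuple(m.split("/")) for m in morphemes]


print(parse_category("C0 C1/C0=C2=C2/C1=C1/C1"))
```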

Properties: \p
--------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word.  It has the form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names.  The equal
signs (`=') serve to separate the property information of the
individual morphemes.  Note that morphemes may have more than one
property, with the names separated by spaces, or no properties at all.

By default, the properties field is written to the output analysis
file.  The `-w 0' command option, or any `-w' option that does not
include `p' in its argument, disables the properties field.
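
Because the equal signs separate morphemes, a word of N morphemes has
N-1 equal signs in this field, and an empty slot means that the
morpheme has no properties.  The following Python sketch shows the
split; the function name `split_per_morpheme' is invented for this
example.  The same layout is used by the `\fd' field.

```python
def split_per_morpheme(field):
    """Return one list of names per morpheme; empty slots give []."""
    return [chunk.split() for chunk in field.split("=")]


# Five morphemes: two without properties, one with two, one with one,
# and a final morpheme without any.
print(split_per_morpheme("==prop1 prop2=prop3="))
```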

Feature Descriptors: \fd
------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The equal
signs (`=') serve to separate the feature descriptors of the individual
morphemes.  Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.

The dictionary field code mapped to `F' in the dictionary code table
file controls whether feature descriptors are written to the output
analysis file; if this mapping is not defined, then the `\fd' field is
not written.  See `Feature Descriptor (internal code F)' in the AMPLE
Reference Manual.

Underlying forms (decomposition): \u
------------------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms.  It
looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.

The `\dsc' field in the text input control file can replace the hyphen
with another character for separating the morphemes; see `Decomposition
Separation Character: \dsc' in the AMPLE Reference Manual.

The dictionary field code mapped to `U' in the dictionary code table
file controls whether underlying forms are written to the output
analysis file; if this mapping is not defined, then the `\u' field is
not written.  See `Underlying Form (internal code U)' in the AMPLE
Reference Manual.
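
When both the `\d' and `\u' fields are present, the surface and
underlying form of each morpheme can be lined up.  The Python sketch
below is only an illustration: the name `pair_forms' is invented, and
it assumes that both fields are unambiguous, use the same separation
character, and contain the same number of morphemes.

```python
def pair_forms(d_field, u_field, sep="-"):
    """Pair each surface morpheme with its underlying form."""
    surface = d_field.split(sep)
    underlying = u_field.split(sep)
    if len(surface) != len(underlying):
        # The assumption of equal morpheme counts does not hold.
        raise ValueError("fields have different numbers of morphemes")
    return list(zip(surface, underlying))


print(pair_forms("imaika-Npa-ni", "imaika-Npa-ni"))
```

If `\dsc' has replaced the hyphen, the replacement character would be
passed as `sep'.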

Word (before decapitalization and orthography changes): \w
----------------------------------------------------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE,
which wrote the decapitalized form.

The original word field is optional.  It is enabled either by an AMPLE
`-w w' command line option (see `AMPLE Command Options' in the AMPLE
Reference Manual), or by an interactive query.

Formatting (junk before the word): \f
-------------------------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
             \\c 5\n\n
             \\s


where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a tab
character.

The format information field is written to the output analysis file
whenever it is needed, that is, whenever formatting codes or
punctuation exist before words.
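
To recover the original text, a reader of this field undoes the two
escapes described above.  The following Python sketch shows one way to
do this; the name `decode_escapes' is invented, and the sketch assumes
that the continuation lines have already been joined with their
leading tab characters removed.  The `\n' convention is also used by
the nonalphabetic field.

```python
def decode_escapes(text):
    """Turn doubled backslashes and \\n back into the original text."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "\\":          # doubled backslash -> one backslash
                out.append("\\")
                i += 2
                continue
            if nxt == "n":           # backslash + n -> newline
                out.append("\n")
                i += 2
                continue
        out.append(ch)               # anything else is copied unchanged
        i += 1
    return "".join(out)


print(repr(decode_escapes(r"\\c 5\n\n\\s")))
```

Scanning left to right matters here: replacing `\n' first with a
simple global substitution would corrupt a doubled backslash followed
by the letter `n'.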

Capitalization flag: \c
-----------------------

The capitalization field (`\c') records any capitalization of the input
word.  It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized

`2'
     all letters of the word are capitalized

`4-32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of the author's last name.

The capitalization field is written to the output analysis file
whenever any of the letters in the word are capitalized; see `Prevent
Any Decapitalization: \nocap' and `Prevent Decapitalization of
Individual Characters: \noincap' in the AMPLE Reference Manual.
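
A program reading the analysis file can use this flag to restore the
capitalization of a decapitalized word.  The Python sketch below
handles only the values `1' and `2'; this section does not say how the
values from `4' upward encode mixed capitalization, so the sketch
leaves such words unchanged.  The function name `recapitalize' is
invented for this example.

```python
def recapitalize(word, flag):
    """Restore capitalization for flag values 1 and 2 only."""
    if flag == 1:
        # First (or only) letter capitalized.
        return word[:1].upper() + word[1:]
    if flag == 2:
        # All letters capitalized.
        return word.upper()
    # Other values: encoding not described here, so do nothing.
    return word


print(recapitalize("the", 1))
print(recapitalize("ta", 2))
```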

Nonalphabetic (junk after the word): \n
---------------------------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
code (see `Bar Code Format Code Characters: \barcodes' in the AMPLE
Reference Manual), or whitespace characters.  It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.

The nonalphabetic field is written to the output analysis file whenever
the word is followed by anything other than a single space character.
This includes the case when a word ends a file with nothing following
it.

Ambiguous analyses
==================

The previous section assumed that AMPLE produced only one analysis for
a word.  This is not always the case, since words in isolation are
frequently ambiguous.  AMPLE handles multiple analyses by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n


where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses that AMPLE
discovers.

The `\ambig' field in the text input control file can replace the
percent sign with another character for separating the analyses; see
`Ambiguity Marker Character: \ambig' in the AMPLE Reference Manual for
details.
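
The ambiguity marking can be unpacked as in the following Python
sketch.  It is an illustration only: the name `split_ambiguous' is
invented, and the sketch assumes that the marker character never
occurs inside an analysis.

```python
def split_ambiguous(field, marker="%"):
    """Return (count, alternatives) for the body of one field.

    A field without ambiguity marking counts as a single analysis.
    """
    body = field.strip()
    if not body.startswith(marker):
        return 1, [body]
    parts = body.split(marker)
    # parts[0] is the empty text before the first marker, parts[1] is
    # the count, and the last item is the empty text after the final
    # marker.
    return int(parts[1]), parts[2:-1]


print(split_ambiguous("%2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%"))
print(split_ambiguous("Imaicampani"))
```

If `\ambig' has replaced the percent sign, the replacement character
would be passed as `marker'.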

Analysis failures
=================

The previous sections assumed that AMPLE successfully analyzed a word.
This does not always happen.  AMPLE marks analysis failures the same
way it marks multiple analyses, but with zero (`0') for the ambiguity
count.  For example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n


Note that only the `\a' and `\d' fields contain any analysis
information, and both hold the decapitalized word as a placeholder.

The `\ambig' field in the text input control file can replace the
percent sign with another character for marking analysis failures and
ambiguities; see `Ambiguity Marker Character: \ambig' in the AMPLE
Reference Manual.
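
A program reading the file can therefore tell failed, unique, and
ambiguous analyses apart by looking only at the start of the `\a'
field.  The following Python sketch shows the test; the function name
`analysis_status' and its return values are invented for this example.

```python
def analysis_status(a_field, marker="%"):
    """Classify an analysis field body as failed, unique, or ambiguous."""
    body = a_field.strip()
    if not body.startswith(marker):
        return "unique"          # no ambiguity marking at all
    count = int(body.split(marker)[1])
    if count == 0:
        return "failed"          # the field holds only a placeholder
    return "unique" if count == 1 else "ambiguous"


print(analysis_status("%0%ta%"))
print(analysis_status("< CAT root >"))
```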

Using the Embedded Morphological Parsers
****************************************

Normally, PC-PATR requires the linguist to develop a full-fledged
lexicon of words with their features.  This may be unnecessary if a
morphological analysis and a comprehensive lexicon of morphemes have
already been developed using either PC-Kimmo (version 2) or AMPLE
(version 3).  These morphological parsing programs are also available
from SIL.

PC-Kimmo
========

Version 2 of PC-Kimmo supports a PC-PATR style grammar for defining
word structure in terms of morphemes.  This provides a straightforward
way to obtain word features as a result of the morphological analysis
process.  For best results, the (PC-Kimmo) word grammar and the
(PC-PATR) sentence or phrase grammar should be developed together.

When using the PC-Kimmo morphological parser, PC-PATR requires a
special lexical rule in the (sentence level) grammar file.  This rule
is named `MapKimmoFeatures' and is used automatically to map from the
features produced by the word parse to the features needed by the
sentence parse.  For example, consider the following definition:

     Define MapKimmoFeatures as
             <out cat>       = <in head pos>
             <out lex>       = <in lex>
             <out head>      = <in head>

This lexical rule uses the `<head pos>' feature produced by the
PC-Kimmo parser as the `<cat>' feature for the PC-PATR parser, and
passes the `<lex>' and `<head>' features from the morphological parser
to the sentence parser unchanged.
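
The effect of this particular rule can be pictured with nested Python
dictionaries standing in for feature structures.  This is only an
analogy, not how PC-PATR is implemented: the sketch copies values
instead of unifying them, and the function name and the sample feature
structure are invented for the example.

```python
def map_kimmo_features(word_fs):
    """Mimic the three equations of the MapKimmoFeatures example."""
    out = {}
    out["cat"] = word_fs["head"]["pos"]   # <out cat>  = <in head pos>
    out["lex"] = word_fs["lex"]           # <out lex>  = <in lex>
    out["head"] = word_fs["head"]         # <out head> = <in head>
    return out


# Invented result of a word parse.
word = {"lex": "dogs", "head": {"pos": "N", "number": "PL"}}
print(map_kimmo_features(word))
```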

AMPLE
=====

The only thing necessary to use the AMPLE morphological parser inside
PC-PATR is to load the appropriate control files and dictionaries.
This is of little use, however, unless the AMPLE dictionaries contain
feature descriptors to pass through to PC-PATR.  The AMPLE data must
also define the word category: either the word-final suffix category
or the word-initial prefix category can be designated in the analysis
data file.  Consult the AMPLE documentation for more details on either
of these issues.

Index
*****

-/:
          See ``PC-PATR Command Line Options''.
-a filename:
          See ``PC-PATR Command Line Options''.
-g filename:
          See ``PC-PATR Command Line Options''.
-l filename:
          See ``PC-PATR Command Line Options''.
-t filename:
          See ``PC-PATR Command Line Options''.
-Z address,count:
          See ``PC-PATR Command Line Options''.
-z filename:
          See ``PC-PATR Command Line Options''.
\a:
          See ``Analysis: \a''.
\c:
          See ``Capitalization flag: \c''.
\cat:
          See ``Category (possible word or morpheme): \cat''.
\d:
          See ``Decomposition (surface forms): \d''.
\f:
          See ``Formatting (junk before the word): \f''.
\fd:
          See ``Feature Descriptors: \fd''.
\n:
          See ``Nonalphabetic (junk after the word): \n''.
\p:
          See ``Properties: \p''.
\u:
          See ``Underlying forms (decomposition): \u''.
\w:
          See ``Word (before decapitalization and orthography changes): \w''.
standard format:
          See ``Standard format''.

Table of Contents
*****************


Introduction to the PC-PATR program

The PATR-II Formalism
  Phrase structure rules
  Feature structures
  Unification
  Feature constraints
  The lexicon

Running PC-PATR
  PC-PATR Command Line Options
  Interactive Commands
    cd
    clear
    close
    directory
    edit
    exit
    file
      file disambiguate
      file parse
    help
    load
      load ample control
      load ample dictionary
      load ample text-control
      load analysis
      load grammar
      load kimmo grammar
      load kimmo lexicon
      load kimmo rules
      load lexicon
    log
    parse
    quit
    save
      save lexicon
      save status
    set
      set ambiguities
      set ample-dictionary
      set check-cycles
      set comment
      set failures
      set features
      set final-punctuation
      set gloss
      set kimmo check-cycles
      set kimmo promote-defaults
      set kimmo top-down-filter
      set limit
      set marker category
      set marker features
      set marker gloss
      set marker record
      set marker rootgloss
      set marker word
      set promote-defaults
      set property-is-feature
      set rootgloss
      set timing
      set top-down-filter
      set tree
      set trim-empty-features
      set unification
      set verbose
      set warnings
      set write-ample-parses
    show
      show lexicon
      show status
    status
    system
    take

The PC-PATR Grammar File
  Rules
    Priority union operations
    Logical constraint operations
  Feature templates
  Parameter settings
  Lexical rules
  Constraint templates

Standard format

The PC-PATR Lexicon File

The AMPLE Analysis File
  AMPLE analysis file fields
    Analysis: \a
    Decomposition (surface forms): \d
    Category (possible word or morpheme): \cat
    Properties: \p
    Feature Descriptors: \fd
    Underlying forms (decomposition): \u
    Word (before decapitalization and orthography changes): \w
    Formatting (junk before the word): \f
    Capitalization flag: \c
    Nonalphabetic (junk after the word): \n
  Ambiguous analyses
  Analysis failures

Using the Embedded Morphological Parsers
  PC-Kimmo
  AMPLE

Index