This manual describes STAMP, a computer program for adapting text in conjunction with the AMPLE program. This combination falls under the Analysis Transfer Synthesis (ATS) paradigm. It involves the following steps:
(1) AMPLE is a morphological parser that is applied to source language text to analyze each word into morphemes.
(2) STAMP is applied to these analyses to make changes that will produce the corresponding target language word.
(3) An interactive editor is applied to STAMP output to correct the words that AMPLE failed to analyze and those for which AMPLE or STAMP produced multiple possibilities. The result is a word-for-word draft of the source language text in the target language.
(4) After eliminating analysis failures and ambiguities, the text must be checked and corrected by a competent speaker of the target language.
STAMP incorporates no language-specific facts; the user makes linguistic facts known to STAMP entirely through external files. STAMP is sufficiently general to serve over a wide range of language families. (However, AMPLE and STAMP do not adequately handle highly isolating languages, that is, languages which have virtually no morphology.)
The name STAMP is derived by taking AMP from AMPLE, and T and S from STAMP's main modules, TRANSFER and SYNTHESIS, which are applied in succession to the output of AMPLE. Thus one can think of STAMP as S(T(AMP)), or more explicitly as:
adapted text = Synthesis[Transfer[AMPLE[source text]]]
Note: much of this reference manual is based almost verbatim on the book published in 1990 (Weber, Black, McConnel, and Buseman), without explicit permission from the coauthors.
STAMP is a batch-oriented program. It reads a number of control files, and then processes one or more input analysis files to produce an equal number of output files.
The STAMP program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.
-a
-c character
    Use character as the comment character in the control and dictionary files (the default comment character is `|').
-d number
-f filename
-i filename
-m
    Display a progress mark for each word processed: `*' means an analysis failure, `.' means a single analysis, `2' through `9' mean 2-9 ambiguities, and `>' means 10 or more ambiguities. This is not compatible with the `-q' option.
-n
-o filename
-q
-r
-t
-u
-v
-x
The following options exist only in beta-test versions of the program, since they are used only for debugging.
-/
-z filename
-Z address,count
    Watch for the memory block at address being allocated or freed for the count'th time.
If the `-f', `-i', and `-o' command options are not used, STAMP prompts for a number of file names, reading the standard input for the desired values. The interactive dialog goes like this:
C> stamp
STAMP: Synthesis(Transfer(AMPle(text))) = adapted text
Version 2.0b1 (July 21, 1998), Copyright 1998 SIL, Inc.
Beta test version compiled Jul 27 1998 16:04:11
Transfer/Synthesis Performed Tue Jul 28 14:54:04 1998
STAMP declarations file (zzSTAMP.DEC): pnstamp.dec
Transfer file (xxzzTR.CHG) [none]: hgpntr.chg
Synthesis file (zzSYNT.CHG) [none]: pnsynt.chg
Dictionary code table (zzSYCD.TAB): pnsycd.tab
Dictionary orthography change table (zzORDC.TAB) [none]: pnordc.tab
10 changes loaded from suffix dictionary code table.
Suffix dictionary file (zzSF01.DIC): pnsf01.dic
SUFFIX DICTIONARY: Loaded 137 records
10 changes loaded from root dictionary code table.
Root dictionary file (xxRTnn.DIC): pnsyrt.dic
ROOT DICTIONARY: Loaded 176 records
Next Root dictionary file (xxRTnn.DIC) [no more]:
Output text control file (zzOUTTX.CTL) [none]: pnoutx.ctl
10 output orthography changes were loaded from pnoutx.ctl
First Input file: pntest.ana
Output file: pntest.syn
Next Input file (or RETURN if no more):
C>
Note that each prompt contains a reminder of the expected form of the answer in parentheses and ends with a colon. Several of the prompts also contain the default answer in brackets.
Using the command options does not change the appearance of the program screen output significantly, but the program displays the answers to each of its prompts without waiting for input. Assume that the file `pntest.cmd' contains the following, which is the same as the answers given above:
pnstamp.dec
hgpntr.chg
pnsynt.chg
pnsycd.tab
pnordc.tab
pnsf01.dic
pnsyrt.dic
pnoutx.ctl
Then running STAMP with the command options produces screen output like the following:
C> stamp -f pntest.cmd -i pntest.ana -o pntest.syn
STAMP: Synthesis(Transfer(AMPle(text))) = adapted text
Version 2.0b1 (July 21, 1998), Copyright 1998 SIL, Inc.
Beta test version compiled Jul 27 1998 16:04:11
Transfer/Synthesis Performed Tue Jul 28 14:59:34 1998
STAMP declarations file (zzSTAMP.DEC): pnstamp.dec
Transfer file (xxzzTR.CHG) [none]: hgpntr.chg
Synthesis file (zzSYNT.CHG) [none]: pnsynt.chg
Dictionary code table (zzSYCD.TAB): pnsycd.tab
Dictionary orthography change table (zzORDC.TAB) [none]: pnordc.tab
10 changes loaded from suffix dictionary code table.
Suffix dictionary file (zzSF01.DIC): pnsf01.dic
SUFFIX DICTIONARY: Loaded 137 records
10 changes loaded from root dictionary code table.
Root dictionary file (xxRTnn.DIC): pnsyrt.dic
ROOT DICTIONARY: Loaded 176 records
Next Root dictionary file (xxRTnn.DIC) [no more]:
Output text control file (zzOUTTX.CTL) [none]: pnoutx.ctl
10 output orthography changes were loaded from pnoutx.ctl
C>
The only difference in the screen output is that the prompts for the input and output files are not displayed.
The input control files and input analysis file that STAMP reads are all standard format files. This means that the files are divided into records and fields. Each file contains at least one record, and some files may contain a large number of records. Each record contains one or more fields. Each field occupies at least one line, and is marked by a field code at the beginning of the line. A field code begins with a backslash character (\) and contains one or more additional printing characters (usually alphabetic).
If the file is designed to have multiple records, then one of the field codes must be designated to be the record marker, and every record begins with that field, even if it is empty apart from the field code. If the file contains only one record, then the relative order of the fields is constrained only by their semantics.
It is worth emphasizing that field codes must be at the beginning of a line. Even a single space before the backslash character prevents it from being recognized as a field code.
It is also worth emphasizing that record markers must be present even if that field has no information for that record. Omitting the record marker causes two records to be merged into a single record, with unpredictable results.
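As a minimal illustration (the field codes \w and \a here are invented, not prescribed by STAMP), a standard format file whose record marker is \w might contain two records like these:

\w kam
\a kam-ta | a second field in the same record
\w li
\a li-n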
The fields that STAMP recognizes in its "declarations file" are described below. Fields that start with any other backslash codes are ignored by STAMP.
When AMPLE produces more than one analysis, each analysis is set off by a unique character. Likewise, when AMPLE fails to analyze a source language word, it flags this word with the same character, the default for which is a percent sign (%). However, a user may override AMPLE's default.

Like AMPLE, STAMP assumes this delimiter to be a percent sign. If an analyzed text does not use this character, STAMP must be informed as to what character was used. To do this, use the \ambig field to define the desired character. For example, the following would change the analytic ambiguity delimiter to @:

\ambig @
Allomorph properties are defined by the field code \ap followed by one or more allomorph property names. An allomorph property name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. For example, the following would declare the allomorph properties deletedK, deletedG, and underlyingV:

\ap deletedK deletedG | elided morpheme final velars
\ap underlyingV | underlying long vowel
A maximum of 255 properties (including both allomorph and morpheme properties) may be defined unless the \maxprops field is used to define a larger number. Any number of \ap fields may be used so long as the number of property names does not exceed 255 (or the number defined by the \maxprops field). Note that any \maxprops field must occur before any \ap or \mp fields.
Categories are defined by the field code \ca followed by one or more category names. A category name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used.

A maximum of 255 categories may be defined. Any number of \ca fields may be used so long as the number of category names does not exceed 255.
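For example (the category names here are invented for illustration), the following declares four categories:

\ca N V V1X V1Y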
The category information to write to the analysis output file is defined by the field code \cat followed by one or two words. The first word must be either prefix or suffix (or an abbreviation of one of those words), either capitalized or lowercase.

The \cat field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.
NOTE: at present, this does not do anything in the code. Is this a feature that has never been used? When was it introduced? I'd be quite willing to rip it out of the code.
A category class declaration has three parts: the field code \ccl, the name of the class, and the list of categories in the class (separated by spaces). For example, the following defines the class IVERB containing the categories V1X and V1Y:

\ccl IVERB V1X V1Y

The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The category names must have been defined by an earlier \ca field; see section 4.3 Category declarations: \ca.
In transfer, category classes can only be used in the match strings of lexical changes; see section 5.1.8 Lexical change: \lc.
Each \ccl field defines a single category class. Any number of \ccl fields may appear in the file.
The maximum number of properties that can be defined can be increased from the default of 255 by giving the \maxprops field code followed by a number greater than or equal to 255 but less than 65536.

The \maxprops field may appear any number of times, but once is enough. If more than one such field occurs, the one containing the largest valid value is the one that is used.

The \maxprops field must be used before any properties are defined. This is the case for both morpheme and allomorph properties.

If no \maxprops fields appear in the declarations file, then STAMP limits the number of properties which can be defined to 255.
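For example, the following would allow up to 1000 allomorph and morpheme properties to be defined:

\maxprops 1000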
A morpheme class declaration has three parts: the field code \mcl, the name of the class, and the list of morphnames in the class (separated by spaces). For example, a morpheme class DIRECTION could be defined as follows:

\mcl DIRECTION UP DOWN IN OUT
Such a class could be used in conditioning environments for lexical changes, insertion rules, or substitution rules. For example, the following environment would limit the rule to apply only preceding one of the directional morphemes:
/ _ [DIRECTION]
The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The morpheme names should be defined by an entry in one of the dictionary files.
Each \mcl field defines a single morpheme class. Any number of \mcl fields may appear in the file.
Morpheme properties are defined by the field code \mp followed by one or more morpheme property names. A morpheme property name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. For example, the following would declare the morpheme properties XYZ, ABC, and DEF:

\mp XYZ
\mp ABC DEF
A maximum of 255 properties (including both allomorph and morpheme properties) may be defined unless the \maxprops field is used to define a larger number. Any number of \mp fields may be used so long as the number of property names does not exceed 255 (or the number defined by the \maxprops field). Note that any \maxprops field must occur before any \mp or \ap fields.
A punctuation class is defined by the field code \pcl followed by the class name, which is followed in turn by one or more punctuation characters or (previously defined) punctuation class names. A punctuation class name used as part of the class definition must be enclosed in square brackets.
The class name must be a single, contiguous sequence of printing characters. The individual members of the class are separated by spaces, tabs, or newlines.
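For example (the class names and members are purely illustrative), the following defines a class of sentence-final punctuation and a larger class that includes it by name:

\pcl FINAL . ? !
\pcl ANYPUNCT [FINAL] , ; :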
Each \pcl field defines a single punctuation class. Any number of \pcl fields may appear in the file.

If no \pcl fields appear in the declarations file, then STAMP does not allow any punctuation classes in tests, and does not allow any punctuation classes in punctuation environment constraints.
For each analysis, the root (or roots), and the category of the first root, are delimited by a pair of reserved characters. By default, AMPLE uses wedges (< and >). If some characters other than wedges are used for this purpose, they must be declared using the \rd field. (\rd is mnemonic for "root delimiter".) For example, the following line might be included in the input control file:

\rd ( )

Two characters are expected after the field code, optionally separated by white space. The first is taken to be the opening (that is, left) delimiter and the second is taken to be the closing (that is, right) delimiter. Different characters must be used for the opening and closing delimiters.

The delimiters used to set off the root should not be used for any other purpose in the analysis field. The following may not be used for a delimiter: the backslash (\), whatever character is used to indicate analytic failures and ambiguities, or any orthographic character.

The \rd field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \rd fields appear in the declarations file, then STAMP uses the delimiter characters < and >.
A string class declaration has three parts: the field code \scl, the name of the class, and the list of strings in the class (separated by spaces). String classes are used in synthesis in specifying string environment constraints on regular sound changes and on allomorph entries in the dictionaries.

The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. The individual members of the class are separated by spaces, tabs, or newlines.

Each \scl field defines a single string class. Any number of \scl fields may appear in the file.
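For example (the class name and member strings are invented for illustration), the following defines a class of high vowels:

\scl HIVOWEL i u ii uu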
The characters considered to be valid for allomorph strings and string environment constraints are defined by a \strcheck field code followed by the list of characters. Spaces are not significant in this list.

The \strcheck field may appear any number of times, but once is enough. If more than one such field occurs, the last one is the one that is used.

If no \strcheck fields appear in the declarations file, then STAMP does not check allomorph strings and string environment constraints for containing only valid characters.
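For example (this particular inventory is purely illustrative), the following declares the lowercase letters and the apostrophe as the only valid characters:

\strcheck abcdefghijklmnopqrstuvwxyz'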
The transfer control file for the STAMP program is a standard format file containing a single data record.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.1 Analytic ambiguity delimiter: \ambig.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.2 Allomorph property declaration: \ap.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.3 Category declarations: \ca.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.5 Category class declaration: \ccl.
Suppose that a single source language morpheme corresponds to either of two target language morphemes, where the choice between them is not determined by any contextual factor within the word. For example, Huallaga Quechua -ra `PAST' (simple past) corresponds to two suffixes in Pasco Quechua, in some cases to -rqU `RECPST' (recent past) and in other cases to -rqa `REMPST' (remote past). Whether the recent or remote past tense is appropriate is a semantic matter, not determinable by any structural factor, morphological or syntactic.
The best one can do in such a case is to create ambiguous output for every instance of the past tense morpheme and leave the choice between them to the person who edits the computer-adapted text. Thus, for every Huallaga Quechua word containing -ra `PAST' (as in 1 below) the program should produce two Pasco Quechua words (as in 2), one with -rqU `RECPST' and the other with -rqa `REMPST':
1. aywaran
2. %2%aywarqun%aywarqan%
This can be accomplished by means of a copying rule. A copying rule produces two output analyses: a copy of the input, and the result of applying the rule as though it were a substitution rule; see section 5.1.15 Substitution rule: \sr.

The only syntactic difference between a copying rule and a substitution rule is that the former begins with the field code \cr and the latter begins with \sr. See section 5.2 Syntax of transfer rules, for a description of the syntax of these rules.
Returning to the Quechua example, the copying rule in 3 would apply to the analysis in 4 to produce the two analyses in 5:
3. \cr "PAST" "REMPST" 4. < V1 *aywa > PAST 3 5. %2%< V1 *aywa > PAST 3%< V1 *aywa > REMPST 3%
Copying rules apply to the output of previous rules in the transfer file, and subsequent rules apply to each of the outputs. Because subsequent rules apply to each of the outputs of a copying rule, it is possible for copying rules to feed copying rules. For example, consider the two hypothetical copying rules in 6 below. If these are applied to a single analysis containing A and Q (as schematized in 7 below), they would produce the four analyses shown in 9.
6. \cr "A" "B" \cr "Q" "R" 7. ...A...Q... 8. %2%...A...Q...%...B...Q...% 9. %4%...A...Q...%...A...R...%...B...Q...%...B...R...%
That is, the first rule of 6 produces the two analyses in 8. Then the second rule of 6 applies to each of these outputs, producing two outputs for each as in 9.
Note that if the original analysis had been ambiguous (with two analyses each containing A and Q), the two copying rules would have produced an eight-way ambiguity. The moral: if copying rules are used too liberally, there might be a dramatic increase in the levels of ambiguity produced. Let the user beware!
Flags are a mechanism for temporarily remembering some information about an analysis. A rule conditioned by one or more flags affects an analysis only if all the conditions implied by those flags are true for that analysis. Flags make it possible to "insulate" source language phenomena from target language phenomena.
The definition of a flag has three obligatory parts: (1) the field code \fl, (2) the name of the flag, and (3) the list of morphnames which trigger the raising of that flag. For example, consider the following definition:

\fl PLURAL PLIMPF PLDIR PLSTAT

The name of the flag is PLURAL. The morphnames whose presence causes the flag to be raised are PLIMPF, PLDIR, and PLSTAT.
Flag definitions are a type of rule. Recall that rules are applied in the order in which they are given in the transfer file. (This excludes lexical changes, which are applied before all rules.) Thus, a flag is raised only if one of the morphnames in its definition is present in the analysis resulting from all previous rules (in the order they are given in the transfer file). For example, the plural flag defined above would be raised only if PLIMPF, PLDIR, or PLSTAT were present in an analysis at the point where the rule is defined in the transfer file.
Suppose there are two rules. The first deletes PLIMPF. The next one is a flag definition which raises the PLURAL flag whenever PLIMPF is present in an analysis. This flag-raising rule only sees the result of all previous rules. Thus it would never raise the PLURAL flag in this case, since PLIMPF would always have been deleted by the preceding rule. To get the proper effect, the flag definition rule should be ordered before the rule which deletes the morphname that causes the flag to be raised.
Flags cannot be used in a rule until they have been defined with a \fl field. Sometimes it is conceptually simpler to define all the flags at the beginning of the transfer file; in other cases it is advantageous to define each of them close to--but preceding--the rules which use them.
Flags may be tested in copy and substitution rules following the match and substitution strings, or in insertion rules following the string to be inserted. The flag names must always precede any conditioning environments. A flag name can be preceded by a tilde (~) to complement the sense of the flag; that is, a rule so modified applies only if the named flag is not raised.
When a particular analysis has undergone all the rules of the transfer file, all the flags are automatically lowered before another analysis is considered. To put it another way, flags do not stay raised from one word to another. Many flags are raised and never lowered until the next word, but in some cases it is desirable to have the flags lowered before some subsequent rule.
Whenever an analysis changes as the result of a rule that tests a flag (or flags), then that rule's flags are automatically lowered. The user does not have to do anything because TRANSFER is designed to automatically lower flags under this condition. This avoids the application of subsequent rules on the basis of the same flag. For example, consider the following rules. (Insertion rules are discussed in the next section.)
\fl XFLG M1 M2         |raise flag when M1 or M2 present
\ir "X1" XFLG / M1 _   |insert X1 after M1 when XFLG up
\ir "X2" XFLG / _ M2   |insert X2 before M2 when XFLG up
Given an analysis with the sequence M1 M2, the result of these changes is M1 X1 M2. It would not be M1 X1 X2 M2 because, when the first insertion rule applies, it lowers XFLG. Consequently, the second insertion rule does not apply.
Suppose one wished to drop a flag (say PFLG) whenever a particular morpheme (say XY) is present in an analysis. The following rule would do it. (Substitution rules are discussed below.)

\sr "XY" "XY" PFLG |substitute XY for XY when PFLG up

When this rule applies, it produces no net change in the analysis, but it has the important side effect of dropping the PFLG flag.
Consider the following substitution rule, where the FLG flag is followed by three environments:

\sr "XX" "YY" FLG / M1 _ / M2 _ / M3 _

This is equivalent to the following sequence of rules:

\sr "XX" "YY" FLG / M1 _
\sr "XX" "YY" FLG / M2 _
\sr "XX" "YY" FLG / M3 _
Note that if the first applies, FLG is lowered, so the second and third could not apply. Likewise, if the second applies, the third could not apply. Only one of the rules could possibly apply; transfer's behavior is the same whether multiple environments are included in a single rule or spread across several rules.
It is possible to limit a rule by a complemented flag, in which case the rule applies only if the flag is not raised. For example, definition 1 would raise the flag KUFLAG when REFL, INTNS, or CMPLT were present, and rule 2 would insert KU when KUFLAG is not raised and INSERTKUFLAG is raised:

1. \fl KUFLAG REFL INTNS CMPLT
2. \ir "KU" ~KUFLAG INSERTKUFLAG
Note that the constraint imposed by each of these flags must be simultaneously met for the rule to apply.
In general, a flag automatically lowers whenever a rule constrained by that flag applies. Note, however, that the same is not true of a complemented flag that constrains a rule that applies. If rule 2 above were to apply, then INSERTKUFLAG would be lowered but KUFLAG would not be affected; that is, it would remain lowered.
Flags and morpheme classes have some interesting similarities and differences. Both are defined in the same way (but with different field codes), and both are used as conditions on rules. They differ, however, in where they are used in rules: morpheme classes occur in environments, whereas flags occur between the rule's main part and before any conditioning environments. Perhaps the most important difference is one of persistence. Once defined, a morpheme class persists until STAMP finishes; there is simply no way to "undefine" a class. But flags are volatile: they are raised when certain morphemes are present in an analysis and lowered when a rule having the flag applies effectively.
Insertion rules insert morphnames into an analysis. An insertion rule may have (in order): (1) the field code \ir, (2) the morphname to be inserted, (3) optionally one or more flags, (4) optionally one or more environments into which the morphname should be inserted, and (5) a comment. The insertion string is delimited by some printing character. The morphname is inserted into an analysis if all the rule's flags are raised (or, for complemented flags, not raised) and at least one of its environments is satisfied (if any are specified).
Of the five parts listed above, only the first two are obligatory. However, an insertion rule comprised only of a field code and the morphname (without any conditioning flags or environments) would insert the morphname into every analysis. The following example has the first three parts; it would insert PL whenever the PLFLG flag is up:

\ir "PL" PLFLG
(How transfer determines where to insert PL is discussed below.) The following inserts PXT immediately after BDJ:

\ir "PXT" / BDJ _

The following inserts PLDIR whenever both the PLURAL flag is up and a directional suffix (that is, a member of the class DIR) is present. PLDIR is inserted immediately following the directional morpheme.

\ir "PLDIR" PLURAL / [DIR] _
When an insertion rule has multiple environments, it applies only for the first environment satisfied by a given analysis. For example, consider the following:
\ir "PXT" / BDJ _ / QMR _
This rule is applied in the following way. Potential insertion sites in the current analysis are considered in order from left to right. At the first one, if BDJ occurs to the left, PXT is inserted, and nothing more is done by this rule. If BDJ does not occur there but QMR does, PXT is inserted and nothing more is done by this rule. Failing to find either BDJ or QMR to the left, the potential insertion site is shifted one place to the right and the process is repeated. In this way, all potential insertion sites in the analysis are evaluated until either an insertion is made or there are no more potential insertion sites in the analysis. When one of these conditions is met, the next rule in the transfer file is applied.
Each insertion rule may affect an analysis only once. This prevents multiple insertions in cases where more than one environment is satisfied. For example, consider the following rule:
\ir "X" / _ Y / Z _
Consider how this affects an analysis with the sequence Z Y. Both environments are satisfied, so if multiple insertions were permitted by a single rule, the result would be Z X X Y. However, the desired result is more likely to be Z X Y, which is what the program will produce. Note, it is possible to get the former result by using two rules:

\ir "X" / _ Y
\ir "X" / Z _

Since there are two rules, both would apply to Z Y, the first producing Z X Y, the second applying to Z X Y to produce Z X X Y.
Insertion rules are frequently conditioned by flags; thus some comments about them are in order. First, recall the discussion above: the application of a rule automatically results in the lowering of any flags in that rule.
Second, flags may be complemented by prefixing a tilde (~). The following set of rules illustrates the use of a complemented flag. It is motivated by a situation in Quechua, where pluralization with -pakU occurs only in the absence of other morphemes which have kU.

\fl KUFLG REFL CMPL INTNS   |flag for suffixes with kU
\fl PLFLG PL1 PL2           |flag for pluralizers
\sr "PL1" ""                |remove PL1
\sr "PL2" ""                |remove PL2
\ir "PLKU" PLFLG ~KUFLG     |insert if plural and no kU
The first line defines a flag for suffixes containing kU; the second defines a flag for pluralizers. The third and fourth lines delete the pluralizers. The last line is a rule that inserts PLKU if the PLFLG is up and the KUFLG is down, that is, the original analysis had a pluralizer and no suffix with kU is now present in the analysis. If PLKU is ever inserted by this rule, PLFLG will be lowered.
If an insertion rule has an environment with a simple environment bar, then the position of the bar defines the site for insertion. But when the rule has no environment, or when the environment bar has ellipsis marking, then the insertion site is not explicitly defined. TRANSFER has mechanisms for treating these cases.
Generally, the items to be inserted are either prefixes or suffixes. In the absence of an explicit environment statement, prefixes are inserted somewhere before the leftmost root and suffixes are inserted somewhere after the rightmost root. TRANSFER determines whether the morpheme to be inserted is a prefix or a suffix by determining which dictionary it occurs in. (For this reason each affix should have a unique morphname.) Then it uses the orderclass of the morpheme, as defined in the dictionary entry, to determine exactly where to insert the morpheme.
Consider an insertion rule with no environment, such as the following one, which inserts ABC whenever XYZFLAG is raised:

\ir "ABC" XYZFLAG
TRANSFER determines the orderclass of ABC from the affix dictionaries of the target language. Given an analysis, if the XYZFLAG is up at the point this insertion rule is applied, TRANSFER searches for an acceptable place to put ABC, attempting to place it as far right as possible without violating orderclass, that is, without placing it after an affix with a greater orderclass. To illustrate, consider an analysis like the following (with the orderclasses given below each morphname):

< C1 root > M1 M2 M3 M4 M5
            10 20 30 40 50
Assuming that the orderclass of ABC is 40, the result of the insertion rule would be the following:

< C1 root > M1 M2 M3 M4 ABC M5

If it is necessary to insert a sequence of morphnames, they can be inserted by a sequence of insertion rules. For example, the following three rules insert ABC DEF GHI when XYZFLAG is up:

\ir "ABC" XYZFLAG
\ir "DEF" / ABC _
\ir "GHI" / DEF _
(A slightly more complicated solution would be needed if there were analyses containing ABC or DEF into which these rules would incorrectly insert DEF or GHI.) Applied to the previous example, the result would be:

< C1 root > M1 M2 M3 M4 ABC DEF GHI M5
Whenever the insertion site is not precisely defined by the environment bar, insertion will be based on orderclass. Therefore, ellipsis marking can be used to constrain an insertion by the presence of one or more morphnames and yet have the insertion based on orderclass. For example, either of the following rules inserts PQR as far right as possible without violating orderclass whenever M4 occurs in an analysis:

\ir "PQR" / _... M4
\ir "PQR" / M4 ..._
FIX ME!
This field can also occur in the STAMP declarations file or the STAMP synthesis file instead; see section 4.6 Maximum number of properties: \maxprops.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.7 Morpheme class declaration: \mcl.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.8 Morpheme property declaration: \mp.
This field can also occur in the STAMP declarations file or the STAMP synthesis file instead; see section 4.9 Punctuation class: \pcl.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.10 Root delimiter: \rd.
This field can also occur in the STAMP declarations file or the STAMP synthesis control file instead; see section 4.11 String class declaration: \scl.
FIX ME!
FIX ME!
FIX ME!
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.1 Analytic ambiguity delimiter: \ambig.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.2 Allomorph property declaration: \ap.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.3 Category declarations: \ca.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.5 Category class declaration: \ccl.
FIX ME!
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.6 Maximum number of properties: \maxprops.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.7 Morpheme class declaration: \mcl.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.8 Morpheme property declaration: \mp.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.9 Punctuation class: \pcl.
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.10 Root delimiter: \rd.
FIX ME!
FIX ME!
This field can also occur in the STAMP declarations file or the STAMP transfer control file instead; see section 4.11 String class declaration: \scl.
FIX ME!
The fourth control file read by STAMP contains the dictionary code table. Each entry of a STAMP dictionary (whether for roots, prefixes, infixes, or suffixes) is structured by field codes that indicate the type of information that follows. The dictionary code table maps the field codes used in the dictionary files onto the internal codes that STAMP uses. This allows linguists to use their favorite dictionary field codes rather than constraining them to a predefined set.
The dictionary code table is divided into one or more sections, one for each type of dictionary file. Each section contains several mappings of field codes in the form of simple changes. The field codes used in the dictionary code table file are described in the remainder of this chapter.
A dictionary field code change is defined by \ch followed by two quoted strings. The first string is the field code used in the dictionary (including the leading backslash character). The second string is the single capital letter designating the field type. For the lists of dictionary field type codes, see section 9. Dictionary Files.

Any character not found in either the dictionary field code string or the dictionary field type code may be used as the quoting character. The double quote (") or single quote (') are most often used for this purpose.
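As a hypothetical illustration (the dictionary field code \a and the type letter A are assumed here, not prescribed; see section 9. Dictionary Files for the actual type codes), the following change would map the dictionary's \a field onto the internal allomorph field type:

\ch "\a" "A"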
The set of dictionary field code changes for an infix dictionary file begins with \infix, optionally followed by the record marker field code for the infix dictionary. If the record marker is not given, then the field code ("from string") from the first infix dictionary field code change is used. See section 9. Dictionary Files, for the set of infix dictionary field type codes.
The set of dictionary field code changes for a prefix dictionary file begins with \prefix, optionally followed by the record marker field code for the prefix dictionary. If the record marker is not given, then the field code ("from string") from the first prefix dictionary field code change is used. See section 9. Dictionary Files, for the set of prefix dictionary field type codes.
The set of dictionary field code changes for a root dictionary file begins with \root, optionally followed by the record marker field code for the root dictionary. If the record marker is not given, then the field code ("from string") from the first root dictionary field code change is used. See section 9. Dictionary Files, for the set of root dictionary field type codes.
The set of dictionary field code changes for a suffix dictionary file begins with \suffix, optionally followed by the record marker field code for the suffix dictionary. If the record marker is not given, then the field code ("from string") from the first suffix dictionary field code change is used. See section 9. Dictionary Files, for the set of suffix dictionary field type codes.
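Putting these pieces together, a hypothetical suffix section of a dictionary code table might look like the following (all of the dictionary field codes and type letters here are invented for illustration; see section 9. Dictionary Files for the actual type codes):

\suffix \m
\ch "\m" "M" | morphname field, used as the record marker
\ch "\a" "A" | allomorph field
\ch "\c" "C" | category field
\ch "\o" "O" | order class field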
The set of dictionary field code changes for a unified dictionary file begins with \unified, optionally followed by the record marker field code for the unified dictionary. If the record marker is not given, then the field code ("from string") from the first unified dictionary field code change is used. See section 9. Dictionary Files, for the set of unified dictionary field type codes.
The fifth control file read by STAMP, and the third optional one, contains the dictionary orthography change table. This table maps the allomorph strings in the dictionary files into the internal orthographic representation. When the text and internal orthographies differ, it may be desirable to have the allomorphs in the dictionaries stored in the same orthography as the texts, or it may be desirable to have them in the internal form, or it might even be desirable to have them in a third form. STAMP allows for any of these choices.
The dictionary orthography change table is defined by a special standard format file. This file contains a single record with two types of fields, either of which may appear any number of times. The rest of this chapter describes these fields, focusing on the syntax of the orthography changes.
An orthography change is defined by the \ch field code followed by the actual orthography change. Any number of orthography changes may be defined in the dictionary orthography change table. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to a dictionary allomorph before the next change from the dictionary orthography change table is applied. See section 10.2 Text output orthographic changes: \ch, for the syntax of orthography changes.
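For example (the particular orthographic correspondence is invented), the following pair of changes would convert dictionary allomorphs written with qu or c into the internal k:

\ch "qu" "k"
\ch "c" "k"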
A string class is defined by the \scl field code followed by the class name, which is followed in turn by one or more contiguous character strings or (previously defined) string class names. A string class name used as part of the class definition must be enclosed in square brackets.
The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. The individual members of the class are separated by spaces, tabs, or newlines.
Each \scl field defines a single string class. Any number of \scl fields may appear in the file. The only restriction is that a string class must be defined before it is used.
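For example (the class names and members are invented for illustration), the following defines a class of stops and then a larger consonant class that includes it by name:

\scl STOP p t k q
\scl CONS [STOP] m n s l r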
If no \scl fields appear in the dictionary orthography changes file, then STAMP does not allow any string classes in dictionary orthography change environment constraints unless they are defined in the STAMP declarations file, the transfer control file, or the synthesis control file.
This chapter describes the content of STAMP dictionary files. These are normally divided into separate files by morpheme type: prefix, infix, suffix, and root.
With the `-u' command line option, in conjunction with the \unified field in the dictionary code table file, the dictionary can be stored as one or more files containing entries of any type: prefix, infix, suffix, or root.
The following sections describe the different types of fields used in the different types of dictionary files. Remember, the mapping from the actual field codes used in the dictionary files to the type codes that STAMP uses internally is controlled by the dictionary code table file (see section 7. Dictionary Code Table File).
Each dictionary entry must contain one or more allomorph fields. Each of these contains one of the morpheme's allomorphs, that is, a string of characters by which the morpheme is represented in text and recognized by STAMP.
If an affix has multiple allomorphs, each one must be entered in its own allomorph field. These fields should be ordered with those on which the strictest constraints have been imposed preceding those with less strict or no constraints. The only exception to this is the use of indexed string classes to indicate reduplication. (See the <redup_pattern> rules in the grammar below.)
Properties, constraints, and comments may follow the allomorph string. Any properties must be listed before any constraints. String, punctuation and morpheme environment constraints may be intermixed, but must come before any comments. A complete BNF grammar of an allomorph field is given below.
1a. <allomorph_field> ::= <allomorph>
1b. <allomorph> <properties>
1c. <allomorph> <constraints>
1d. <allomorph> <properties> <constraints>
1e. <allomorph> <comment>
1f. <allomorph> <properties> <comment>
1g. <allomorph> <constraints> <comment>
1h. <allomorph> <properties> <constraints> <comment>
2a. <allomorph> ::= <literal>
2b. <literal> { <literal> }
2c. <redup_pattern>
2d. <redup_pattern> { <literal> }
3a. <properties> ::= <literal>
3b. <literal> <properties>
4a. <constraints> ::= <string_constraint>
4b. <morph_constraint>
4c. <punct_constraint>
4d. <string_constraint> <constraints>
4e. <morph_constraint> <constraints>
4f. <punct_constraint> <constraints>
5. <comment> ::= <comment_char> anything to the end of the line
6a. <string_constraint> ::= / <envbar> <string_right>
6b. / <string_left> <envbar>
6c. / <string_left> <envbar> <string_right>
7a. <string_left> ::= <string_side>
7b. <boundary>
7c. <boundary> <string_side>
7d. <string_side> # <string_side>
7e. <boundary> <string_side> # <string_side>
8a. <string_right> ::= <string_side>
8b. <boundary>
8c. <string_side> <boundary>
8d. <string_side> # <string_side>
8e. <string_side> # <string_side> <boundary>
9a. <string_side> ::= <string_item>
9b. <string_item> <string_side>
9c. <string_item> ... <string_side>
10a. <string_item> ::= <string_piece>
10b. ( <string_piece> )
11a. <string_piece> ::= ~ <string_piece>
11b. <literal>
11c. [ <literal> ]
11d. [ <indexed_literal> ]
12a. <morph_constraint> ::= +/ <envbar> <morph_right>
12b. +/ <morph_left> <envbar>
12c. +/ <morph_left> <envbar> <morph_right>
13a. <morph_left> ::= <morph_side>
13b. <boundary>
13c. <boundary> <morph_side>
13d. <morph_side> # <morph_side>
13e. <boundary> <morph_side> # <morph_side>
14a. <morph_right> ::= <morph_side>
14b. <boundary>
14c. <morph_side> <boundary>
14d. <morph_side> # <morph_side>
14e. <morph_side> # <morph_side> <boundary>
15a. <morph_side> ::= <morph_item>
15b. <morph_item> <morph_side>
15c. <morph_item> ... <morph_side>
16a. <morph_item> ::= <morph_piece>
16b. ( <morph_piece> )
17a. <morph_piece> ::= ~ <morph_piece>
17b. <literal>
17c. [ <literal> ]
17d. { <literal> }
18a. <punct_constraint> ::= ./ <envbar> <punct_right>
18b. ./ <punct_left> <envbar>
18c. ./ <punct_left> <envbar> <punct_right>
19a. <punct_left> ::= <punct_side>
19b. <boundary>
19c. <boundary> <punct_side>
20a. <punct_right> ::= <punct_side>
20b. <boundary>
20c. <punct_side> <boundary>
21a. <punct_side> ::= <punct_item>
21b. <punct_item> <punct_side>
22a. <punct_item> ::= <punct_piece>
22b. ( <punct_piece> )
23a. <punct_piece> ::= ~ <punct_piece>
23b. <literal>
23c. [ <literal> ]
24a. <envbar> ::= _
24b. ~_
25a. <boundary> ::= #
25b. ~#
26a. <redup_pattern> ::= [ <indexed_literal> ]
26b. <literal> [ <indexed_literal> ]
26c. [ <indexed_literal> ] <literal>
26d. [ <indexed_literal> ] <redup_pattern>
26e. <redup_pattern> [ <indexed_literal> ]
27. <indexed_literal> ::= <literal> ^ <number>
28. <literal> ::= one or more contiguous characters
29. <comment_char> ::= character defined by the `-c' command line option, or | by default
30. <number> ::= one or more contiguous digits (0-9)
Notes on this grammar:

Allomorph properties must have been defined by an \ap field in the analysis data file.

An ellipsis (...) indicates a possible break in contiguity.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if the element is found rather than fail if it is not found.

String classes must have been defined by an \scl field in the analysis data file or the dictionary orthography change table file.

Morpheme classes must have been defined by an \mcl field in the analysis data file.

A morphname is one defined by an entry in one of the dictionary files: root, prefix, infix, or suffix.

Properties used in morpheme environment constraints must have been defined by an \ap or \mp field in the analysis data file.

Categories must have been defined by a \ca field in the analysis data file.

Category classes must have been defined by a \ccl field in the analysis data file.

A tilde (~) attached to the environment bar inverts the sense of the constraint as a whole.

A negated word boundary (~#) indicates that the position must not be a word boundary.

The literal in an indexed literal (a literal followed by a caret (^) and a number) must be the name of a string class defined by an \scl field in the analysis data file or the dictionary orthography change table file.

The following characters must be preceded by a backslash to be used literally: \+ \/ \# \~ \[ \] \( \) \{ \} \. \_ \\
The allomorph field is used in all types of dictionary entries: prefix, infix, suffix, and root.
Each dictionary entry must contain a category field. If multiple category fields exist, then their contents are merged together.
For affix entries, this field must contain at least one category pair for the morpheme, but may contain any number of category pairs separated by spaces or tabs. Each category pair consists of two category names separated by a slash (/). The category names must have been defined by a \ca field in the analysis data file. The first category is the from category, that is, the category of the unit to which this morpheme can be affixed. The second category is the to category, that is, the category of the result after this morpheme has been affixed.

For root entries, this field contains one or more morphological categories as defined by a \ca field in the analysis data file. If multiple categories are listed, they should be separated by spaces or tabs.
The category field is used in all types of dictionary entries: prefix, infix, suffix, and root.
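As a hypothetical illustration (assuming the dictionary code table maps \c to the category field, and that V and N have been declared with \ca), a suffix that attaches to verbs and yields verbs, and a root that can be either a noun or a verb, might have these category fields:

\c V/V | affix entry: from verb, to verb
\c N V | root entry: noun or verb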
WRITE ME!
The elsewhere allomorph field is used in all types of dictionary entries: prefix, infix, suffix, and root.
The infix location field serves to restrict where infixes may be found, and must be included in each infix dictionary entry. Subject to the constraints imposed by the infix location field, STAMP searches the rest of the word for any occurrence of any allomorph string of the infix. This makes infixes rather expensive, computationally, so they should be constrained as much as possible.
1.   <infix_location> ::= <types> <constraints>
2a.  <types> ::= <type>
2b.              <type> <types>
3a.  <constraints> ::= <environment>
3b.                    <environment> <constraints>
4a.  <environment> ::= <marker> <leftside> <envbar> <rightside>
4b.                    <marker> <leftside> <envbar>
4c.                    <marker> <envbar> <rightside>
5a.  <leftside> ::= <side>
5b.                 <boundary>
5c.                 <boundary> <side>
6a.  <rightside> ::= <side>
6b.                  <boundary>
6c.                  <side> <boundary>
7a.  <side> ::= <item>
7b.             <item> <side>
7c.             <item> ... <side>
8a.  <item> ::= <piece>
8b.             ( <piece> )
9a.  <piece> ::= ~ <piece>
9b.              <literal>
9c.              [ <literal> ]
10a. <type> ::= prefix
10b.            root
10c.            suffix
11a. <marker> ::= /
11b.              +/
12a. <envbar> ::= _
12b.              ~_
13a. <boundary> ::= #
13b.                ~#
14.  <literal> ::= one or more contiguous characters
Notes on this grammar:

The type is prefix, root, or suffix. If prefix is given, then STAMP looks for infixes after exhausting the possible prefixes at a given point in the word, and resumes looking for more prefixes after finding an infix. Similarly, if root is given, then STAMP looks for infixes after running out of roots while parsing the word, and if it finds an infix, it looks for more roots. Suffixes are treated the same way if suffix is given in the infix location field.

A word boundary (#) on the left side of the environment bar refers to the place in the word which the parse has reached before looking for infixes, not to the beginning of the word.

A word boundary (#) on the right side of the environment bar refers to the end of the word.

An ellipsis (...) indicates a possible break in contiguity.

A tilde (~) reverses the desirability of an element, causing the constraint to fail if the element is found rather than fail if it is not found.

The marker +/ is usually used for morpheme environment constraints, but may be used for infix location environment constraints as well.

A tilde attached to the environment bar (~_) inverts the sense of the constraint as a whole.

A negated word boundary (~#) indicates that the position must not be a word boundary.

The following characters must be preceded by a backslash to be used literally: \+ \/ \# \~ \[ \] \( \) \{ \} \. \_ \\
The infix location field is used only in infix dictionary entries.
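A minimal sketch of an infix location field (assuming the dictionary code table maps \loc to the infix location field type, and that C is a string class of consonants defined with an \scl field), restricting the infix to roots and to a position immediately after an initial consonant:

\loc root / # [C] _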
A morphname is an arbitrary name for a given morpheme. Only the first word (string of contiguous nonspace characters) following the morphname field code is used as the morphname. Morphnames must be less than 64 characters long.
A morphname serves two important functions:
Generally, a morphname is an identifier of a morpheme and does not need to faithfully represent that morpheme's meaning or function.
If a dictionary entry has more than one morphname field, the morphname from the first one is used; the others cause an error message. The morphname field is used in all types of dictionary entries: prefix, infix, suffix, and root. The usage differs somewhat between affix and root dictionary entries, so these two types of morphnames are described separately.
Every affix dictionary entry must have a morphname field. Users are strongly encouraged to observe the following suggestions in creating affix morphnames:
Keep affix morphnames as short as possible; for example, for first person use 1 rather than 1P unless there is good reason to add the P for person or possessive. For a first person object marker, 1O might serve as well as 1OBJ.
For sets of related affixes, compose the names systematically. For example, a paradigm of gender-case-number suffixes might use names of the form MORPHNAME = GENDER CASE NUMBER, where GENDER is M for masculine, F for feminine, and N for neuter; CASE is N for nominative, A for accusative, G for genitive, and so on; and NUMBER is S for singular and P for plural. The name for masculine nominative singular would then be MNS.
Root morphnames are generally either glosses or etymologies. Etymologies are frequently marked with a leading asterisk (*). (This is used by STAMP to indicate regular sound changes.)
If the morphname field contains only an asterisk, the morphname becomes an asterisk followed by whatever allomorph is matched. If the morphname field is omitted, or if it contains only a comment, STAMP puts whatever allomorph was matched in the text into the analysis. If the morpheme contains any alternate forms, it is wise to include an explicit morphname field.
The order class of an affix is a number indicating its position relative to other morphemes. Prefixes should be assigned negative numbers and suffixes should be assigned positive numbers. Infixes should be assigned order class values appropriate to where they can appear in the word relative to the prefixes and suffixes.
If the order class field is omitted, then a default value of zero (0) is assigned to the affix. Order class values must be between -32767 and 32767.
Order classes are used only by tests in the analysis data file. They are needed only if appropriate tests are written to take advantage of them.
The order class field is used only in affix type dictionary entries: prefix, infix, and suffix. Roots always have an implicit order class of zero.
Beginning with AMPLE version 3.6.0, one may have up to two order class numbers in an order class field (separated by white space). These represent the minimum and the maximum values of the positions this affix can span. The first number is the minimum and the second is the maximum. Therefore the first number should be less than or equal to the second. If only one number appears, both the minimum and maximum values are set to that number. If no number appears, then both the minimum and maximum are set to zero.
Note that for STAMP, only the first order class number has any use (it is used for transfer insertion rules whose environments do not indicate a location where the morpheme is to be inserted).
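As a hypothetical illustration (assuming the dictionary code table maps \o to the order class field), a suffix in order class 40 would have the first field below, while one that can span positions 30 through 50 would have the second:

\o 40
\o 30 50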
This field contains one or more morpheme properties. These properties must have been defined by a \mp field in the analysis data file. A morpheme property is inherited by all allomorphs of the morpheme.
The morpheme property field is optional, and may be repeated. If multiple properties apply to a morpheme, they may be given all in a single field or each in a separate field.
Morpheme properties typically indicate a characteristic of the morpheme which conditions the occurrence of allomorphs of an adjacent morpheme. Morpheme properties are used in tests defined in the analysis data file and in morpheme environment constraints.
The morpheme property field is used in all types of dictionary entries: prefix, infix, suffix, and root.
In a unified dictionary, the type of an entry is determined by the first letter following the morpheme type field code: p or P for prefixes, i or I for infixes, s or S for suffixes, and r or R for roots. The morpheme type field is not needed for root entries because the entry type defaults to root.
The morpheme type field is used only in unified dictionary files, since the morpheme type is otherwise implicit.
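For example (assuming the dictionary code table maps \t to the morpheme type field), a suffix entry in a unified dictionary would carry a field like this:

\t s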
When a "do not load" field is included in a record, STAMP ignores the record altogether. This makes it possible to include records in the dictionary for linguistic purposes, while not needlessly taking up memory space if the dictionary is used for some other purpose.
The "do not load" field is used in all types of dictionary entries: prefix, infix, suffix, and root.
The text output module restores a processed document from the internal format to its textual form. It re-imposes capitalization on words and restores punctuation, format markers, white space, and line breaks. Also, orthography changes can be made, and the delimiter that marks ambiguities and failures can be changed. This chapter describes the control file given to the text output module.(1)
The text output module flags words that either produced no results or multiple results when processed. These are flagged with percent signs (%) by default, but this can be changed by declaring the desired character with the \ambig field code. For example, the following would change the ambiguity delimiter to @:

\ambig @
The text output module allows orthographic changes to be made to the processed words. These are given in the text output control file.
An orthography change is defined by the \ch field code followed by the actual orthography change. Any number of orthography changes may be defined in the text output control file. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to an input word before the next change from the text output control file is applied.
To substitute one string of characters for another, these must be made known to the program in a change. (The technical term for this sort of change is a production, but we will simply call them changes.) In the simplest case, a change is given in three parts: (1) the field code \ch must be given at the extreme left margin to indicate that this line contains a change; (2) the match string is the string for which the program must search; and (3) the substitution string is the replacement for the match string, wherever it is found.
The beginning and end of the match and substitution strings must be marked. The first printing character following \ch (with at least one space or tab between) is used as the delimiter for that line. The match string is taken as whatever lies between the first and second occurrences of the delimiter on the line and the substitution string is whatever lies between the third and fourth occurrences. For example, the following lines indicate the change of hi to bye, where the delimiters are the double quote mark ("), the single quote mark ('), the period (.), and the at sign (@).

\ch "hi" "bye"
\ch 'hi' 'bye'
\ch .hi. .bye.
\ch @hi@ @bye@
Throughout this document, we use the double quote mark as the delimiter unless there is some reason to do otherwise.
Change tables follow these conventions:

\ch "thou" "you"
\ch "thou" to "you"
\ch "thou" > "you"
\ch "thou" --> "you"
\ch "thou" becomes "you"

Anything between the second and third occurrences of the delimiter is ignored, so the five changes above are all equivalent. A comment may also be added at the end of a change line, introduced by the comment character (| by default). The following lines illustrate the use of comments:

\ch "qeki" "qiki" | for cases like wawqeki
\ch "thou" "you" | for modern English
A change can be disabled without removing it from the table, either by placing \no before the \ch, or by placing the comment character (|) in front of it. For example, only the first of the following three lines would effect a change:

\ch "nb" "mp"
\no \ch "np" "np"
|\ch "mb" "nb"
The changes in the text output control file are applied as an ordered set of changes. The first change is applied to the entire word by searching from left to right for any matching strings and, upon finding any, replacing them with the substitution string. After the first change has been applied to the entire word, then the next change is applied, and so on. Thus, each change applies to the result of all prior changes. When all the changes have been applied, the resulting word is returned. For example, suppose we have the following changes:
\ch "aib" > "ayb" \ch "yb" > "yp"
Consider the effect these have on the word paiba. The first changes i to y, yielding payba; the second changes b to p, to yield paypa. (This would be better than the single change of aib to ayp if there were sources of yb other than the output of the first rule.)
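By contrast, if the same two changes were listed in the opposite order, as below, the first change would find no yb in paiba, and the final result would be payba rather than paypa:

\ch "yb" > "yp"
\ch "aib" > "ayb"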
The way in which change tables are applied allows certain tricks. For example, suppose that for Quechua, we wish to change hw to f, so that hwista becomes fista and hwis becomes fis. However, we do not wish to change the sequence shw or chw to sf or cf (respectively). This could be done by the following sequence of changes. (Note, @ and $ are not otherwise used in the orthography.)

\ch "shw" > "@" | (1)
\ch "chw" > "$" | (2)
\ch "hw" > "f" | (3)
\ch "@" > "shw" | (4)
\ch "$" > "chw" | (5)
Lines (1) and (2) protect shw and chw by changing them to distinguished symbols. This clears the way for the change of hw to f in (3). Then lines (4) and (5) restore @ and $ to shw and chw, respectively. (An alternative, simpler way to do this is discussed in the next section.)
It is possible to impose string environment constraints (SECs) on changes in the orthography change tables. The syntax of SECs is described in detail in section .
For example, suppose we wish to change the mid vowels (e and o) to high vowels (i and u respectively) immediately before and after q. This could be done with the following changes:
\ch "o" "u" / _ q / q _ \ch "e" "i" / _ q / q _
This is not entirely a hypothetical example; some Quechua practical orthographies write the mid vowels e and o. However, in the environment of /q/ these could be considered phonemically high vowels /i/ and /u/. Changing the mid vowels to high upon loading texts has the advantage that--for cases like upun "he drinks" and upoq "the one who drinks"--the root needs to be represented internally only as upu "drink". But note, because of Spanish loans, it is not possible to change all cases of e to i and o to u. The changes must be conditioned.
In reality, the regressive vowel-lowering effect of /q/ can pass over various intervening consonants, including /y/, /w/, /l/, /ll/, /r/, /m/, /n/, and /n~/. For example, /ullq/ becomes ollq, /irq/ becomes erq, and so on. Rather than list each of these cases as a separate constraint, it is convenient to define a class (which we label +resonant) and use this class to simplify the SEC. Note that the string class must be defined (with the \scl field code) before it is used in a constraint.
\scl +resonant y w l ll r m n n~
\ch "o" "u" / q _ / _ ([+resonant]) q
\ch "e" "i" / q _ / _ ([+resonant]) q
This says that the mid vowels become high vowels after /q/ and before /q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/, or /n~/.
Consider the problem posed for Quechua in the previous section, that of changing hw to f. An alternative is to condition the change so that it does not apply adjacent to a member of the string class Affric, which contains s and c.

\scl Affric c s
\ch "hw" "f" / [Affric] ~_
It is sometimes convenient to make certain changes only at word boundaries, that is, to change a sequence of characters only if they initiate or terminate the word. This conditioning is easily expressed, as shown in the following examples.
\ch "this" "that" | anywhere in the word \ch "this" "that" / # _ | only if word initial \ch "this" "that" / _ # | only if word final \ch "this" "that" / # _ # | only if entire word
The purpose of orthography change is to convert text from an external orthography to an internal representation more suitable for morphological analysis. In many cases this is unnecessary, the practical orthography being completely adequate as the internal representation. In other cases, the practical orthography is an inconvenience that can be circumvented by converting to a more phonemic representation.
Let us take a simple example from Latin. In the Latin orthography, the nominative singular masculine of the word "king" is rex. However, phonemically, this is really /reks/; /rek/ is the root meaning king and the /s/ is an inflectional suffix. If the program is to recover such an analysis, then it is necessary to convert the x of the external, practical orthography into ks internally. This can be done by including the following orthography change in the text output control file:
\ch "x" "ks"
In this, x is the match string and ks is the substitution string, as discussed in section . Whenever x is found, ks is substituted for it.
Let us consider next an example from Huallaga Quechua. The practical orthography currently represents long vowels by doubling the vowel. For example, what is written as kaa is /ka:/ "I am", where the length (represented by a colon) is the morpheme meaning "first person subject". Other examples, such as upoo /upu:/ "I drink" and upichee /upi-chi-:/ "I extinguish", motivate us to convert all long vowels into a vowel followed by a colon. The following changes do this:
\ch "aa" "a:" \ch "ee" "i:" \ch "ii" "i:" \ch "oo" "u:" \ch "uu" "u:"
Note that the long high vowels (i and u) have become mid vowels (e and o respectively); consequently, the vowel in the substitution string is not necessarily the same as that of the match string. What is the utility of these changes? In the lexicon, the morphemes can be represented in their phonemic forms; they do not have to be represented in all their orthographic variants. For example, the first person subject morpheme can be represented simply as a colon (-:), rather than as -a in cases like kaa, as -o in cases like qoo, and as -e in cases like upichee. Further, the verb "drink" can be represented as upu and the causative suffix (in upichee) can be represented as -chi; these are the forms these morphemes have in other (nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and that we wish to work internally with k rather than c (before a, o, and u) and qu (before i and e). (Of course, this is probably not the only change we would want to make.) Consider the following changes:
\ch "ca" "ka" \ch "co" "ko" \ch "cu" "ku" \ch "qu" "k"
The first three handle c and the last handles qu. By virtue of including the vowel after c, we avoid changing ch to kh. There are other ways to achieve the same effect. One way exploits the fact that each change is applied to the output of all previous changes. Thus, we could first protect ch by changing it to some distinguished character (say @), then changing c to k, and then restoring @ to ch:

\ch "ch" "@"
\ch "c" "k"
\ch "@" "ch"
\ch "qu" "k"
Another approach conditions the change by the adjacent characters. The changes could be rewritten as
\ch "c" "k" / _a / _o / _u | only before a, o, or u \ch "qu" "k" | in all cases
The first change says, "change c to k when followed by a, o, or u." (This would, for example, change como to komo, but would not affect chal.) The syntax of such conditions is exactly that used in string environment constraints; see section .
Input orthography changes are made when the text being processed may be written in a practical orthography. Rather than requiring that it be converted as a prerequisite to running the program, it is possible to have the program convert the orthography as it loads and before it processes each word.
The changes loaded from the text output control file are applied after all the text is converted to lower case (and the information about upper and lower case, along with information about format marking, punctuation, and white space, has been put to one side). Consequently, the match strings of these orthography changes should be all lower case; any change that has an uppercase character in the match string will never apply.
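For example (a hypothetical pair of changes), only the second of the following would ever apply, because the text has already been converted to lower case by the time the changes are made:

\ch "Qu" "K" | never applies: the match string contains an uppercase letter
\ch "qu" "k" | applies as expected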
We include here the entire orthography input change table for Caquinte (a language of Peru). There are basically four changes that need to be made: (1) nasals, which in the practical orthography reflect their assimilation to the point of articulation of a following noncontinuant, must be changed into an unspecified nasal, represented by N; (2) c and qu are changed to k; (3) j is changed to h; and (4) gu is changed to g before i and e.
\ch "mp" "Np" | for unspecified nasals \ch "nch" "Nch" \ch "nc" "Nk" \ch "nqu" "Nk" \ch "nt" "Nt" \ch "ch" "@" | to protect ch \ch "c" "k" | other c's to k \ch "@" "ch" | to restore ch \ch "qu" "k" \ch "j" "h" \ch "gue" "ge" \ch "gui" "gi"
This change table can be simplified by the judicious use of string environment constraints:
\ch "m" > "N" / _p \ch "n" > "N" / _c / _t / _qu \ch "c" > "k" / _~h \ch "qu" > "k" \ch "j" > "h" \ch "gu" > "g" / _e /_i
As suggested by the preceding examples, the text orthography change table is composed of all the \ch fields found in the text output control file. These may appear anywhere in the file relative to the other fields. It is recommended that all the orthography changes be placed together in one section of the text output control file, rather than being mixed in with other fields.
This section presents a grammatical description of the syntax of orthography changes in BNF notation.
1a. <orthochange>  ::= <basic_change>
1b.                    <basic_change> <constraints>
2a. <basic_change> ::= <quote><quote> <quote><string><quote>
2b.                    <quote><string><quote> <quote><quote>
2c.                    <quote><string><quote> <quote><string><quote>
3.  <quote>        ::= any printing character not used in either the "from" string or the "to" string
4.  <string>       ::= one or more characters other than the quote character used by this orthography change
5a. <constraints>  ::= <change_envir>
5b.                    <change_envir> <constraints>
6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
6b.                    <marker> <leftside> <envbar>
6c.                    <marker> <envbar> <rightside>
7a. <leftside>     ::= <side>
7b.                    <boundary>
7c.                    <boundary> <side>
8a. <rightside>    ::= <side>
8b.                    <boundary>
8c.                    <side> <boundary>
9a. <side>         ::= <item>
9b.                    <item> <side>
9c.                    <item> ... <side>
10a. <item>        ::= <piece>
10b.                   ( <piece> )
11a. <piece>       ::= ~ <piece>
11b.                   <literal>
11c.                   [ <literal> ]
12. <marker>       ::= /
                       +/
13. <envbar>       ::= _
                       ~_
14. <boundary>     ::= #
                       ~#
15. <literal>      ::= one or more contiguous characters
The following notes apply to this syntax:

The same <quote> character must be used at both the beginning and the end of both the "from" string and the "to" string. The double quote (") and single quote (') characters are most often used.

The ellipsis (...) indicates a possible break in contiguity.

The tilde (~) reverses the desirability of an element, causing the constraint to fail if the element is found rather than fail if it is not found.

A string class referred to in a constraint must have been defined, either by a \scl field in the analysis data file, or earlier in the dictionary orthography change file.

The +/ marker is usually used for morpheme environment constraints, but may be used for change environment constraints in \ch fields in the dictionary orthography change table file.

A negated environment bar (~_) inverts the sense of the constraint as a whole.

A negated boundary marker (~#) indicates that the position must not be a word boundary.

To use one of the following special characters literally, precede it with a backslash:

\+ \/ \# \~ \[ \] \( \) \. \_ \\
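As an illustration of this syntax (a sketch with invented strings and class members, not drawn from any particular language), the following changes combine several of these elements:

\scl V a i u | a string class of vowels
\ch "k" "g" / [V] _ ([V]) # | k to g after a vowel and before an optional vowel at the end of the word
\ch "h" "" / ~# _ | delete h except at the beginning of a word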
The \dsc field defines the character used to separate the morphemes in the decomposition field of the input analysis file. For example, to use the equal sign (=), the text output control file would include:

\dsc =

This would handle a decomposition field like the following:

\d %3%kay%ka=y%ka=y%
It makes sense to use the \dsc field only once in the text output control file. If multiple \dsc fields do occur in the file, the value given in the first one is used. If the text output control file does not have a \dsc field, a dash (-) is used.
The first printing character following the \dsc field code is used as the morpheme decomposition separator character. The same character cannot be used both for separating decomposed morphemes in the analysis output file and for marking comments in the output control files. Thus, one normally cannot use the vertical bar (|) as the decomposition separation character.
This field is provided for use by the INTERGEN program. It is of little use to STAMP.
The \format field designates a single character to flag the beginning of a primary format marker. For example, if the format markers in the text files begin with the at sign (@), the following would be placed in the text output control file:

\format @
This would be used, for example, if the text contained format markers like the following:
@
@p
@sp
@make(Article)
@very-long.and;muddled/format*marker,to#be$sure
If a \format field occurs in the text output control file without a following character to serve for flagging format markers, then the program will not recognize any format markers and will try to parse everything other than punctuation characters.
It makes sense to use the \format field only once in the text output control file. If multiple \format fields do occur in the file, the value given in the first one is used.
The first printing character following the \format field code is used to flag format markers. The character currently used to mark comments cannot be assigned to also flag format markers. Thus, the vertical bar (|) cannot normally be used to flag format markers.
This field is provided for use by the INTERGEN program. It is of little use to STAMP.
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any other characters that have lowercase and uppercase forms, these must be given in a \luwfc field in the text output control file.
The \luwfc field defines pairs of characters; the first member of each pair is a lowercase character and the second is the corresponding uppercase character. Several such pairs may be placed in the field, or they may be placed in separate fields. Whitespace may be interspersed freely. For example, the following three examples are equivalent:
\luwfc éÉ ñÑ
or
\luwfc éÉ | e with acute accent
\luwfc ñÑ | enyee
or
\luwfc é É ñ Ñ
Note that comments can be used as well (just as they can in any STAMP control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the vertical bar (|), then a different comment character must be defined with the `-c' command line option when STAMP is initiated; see section 2.1 STAMP Command Options.
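For example (assuming the exclamation point is not otherwise needed in the control files), STAMP could be started with the comment character changed to an exclamation point:

C> stamp -c !

The exclamation point would then introduce comments in the control files, freeing the vertical bar for use in the orthography.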
The \luwfc field can be entered anywhere in the text output control file, although a natural place would be before the \wfc (word formation character) field.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfc field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfc z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfc field is allowed to map multiple lowercase characters to the same uppercase character, and vice versa. This is needed for languages that do not mark tone on uppercase letters.
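For example, in a (hypothetical) single-byte orthography that marks tone on lowercase vowels only, both the high-tone and low-tone forms of a could be paired with the same uppercase letter:

\luwfc áA àA | both tone-marked forms of a capitalize to plain A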
The \luwfcs field extends the character pair definitions of the \luwfc field to multibyte character sequences. Like the \luwfc field, the \luwfcs field defines pairs of characters; the first member of each pair is a multibyte lowercase character and the second is the corresponding multibyte uppercase character. Several such pairs may be placed in the field, or they may be placed in separate fields. Whitespace separates the members of each pair, and the pairs from each other. For example, the following three examples are equivalent:
\luwfcs e' E` n~ N^ ç C&
or
\luwfcs e' E` | e with acute accent
\luwfcs n~ N^ | enyee
\luwfcs ç C& | c cedilla
or
\luwfcs e' E`
        n~ N^
        ç C&
Note that comments can be used as well (just as they can in any STAMP control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the vertical bar (|), then a different comment character must be defined with the `-c' command line option when STAMP is initiated; see section 2.1 STAMP Command Options.
Also note that there is no requirement that the lowercase form be the same length (number of bytes) as the uppercase form. The examples shown above are only one or two bytes (character codes) in length, but there is no limit placed on the length of a multibyte character.
The \luwfcs field can be entered anywhere in the text output control file. \luwfcs fields may be mixed with \luwfc fields in the same file.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfcs field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfcs z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfcs field is allowed to map multiple multibyte lowercase characters to the same multibyte uppercase character, and vice versa. This may be useful in some situations, but it introduces an element of ambiguity into the decapitalization and recapitalization processes. If ambiguous capitalization is supported, then for the previous example, z will have both X and Z as uppercase equivalents, and X will have both x and z as lowercase equivalents.
A string class is defined by the \scl
field code followed by the
class name, which is followed in turn by one or more contiguous
character strings or (previously defined) string class names. A string
class name used as part of the class definition must be enclosed in
square brackets.
For example, the sample text output control file given below contains
the following lines:
a. \scl X t s c
b. \ch "h" "j" / [X] ~_
Line a defines a string class including t, s, and c; change rule b makes use of this class to block the change of h to j when it occurs in the digraphs th, sh, and ch.
The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. The individual members of the class are separated by spaces, tabs, or newlines.
Each \scl field defines a single string class. Any number of \scl fields may appear in the file. The only restriction is that a string class must be defined before it is used.
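Since a previously defined class name may be included (in square brackets) in a later definition, string classes can be built up incrementally. For example (with hypothetical class names and members):

\scl Vowel a e i o u
\scl VowGlide [Vowel] y w | the vowels plus the glides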
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any characters that do not have different lowercase and uppercase forms, these must be given in a \wfc field in the text output control file.
For example, English uses an apostrophe character (') that could be considered a word formation character. This information is provided by the following example:

\wfc ' | needed for words like don't
Notice that the characters in the \wfc field may be separated by spaces, although this is not required. If more than one \wfc field occurs in the text output control file, the program uses the combination of all characters defined in all such fields as word formation characters.
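For example (treating the hyphen as a word formation character is hypothetical here), the following two declarations have the same effect:

\wfc ' -

and

\wfc ' | apostrophe
\wfc - | hyphen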
The comment character cannot be designated as a word formation character. If the orthography includes the vertical bar (|), then a different comment character must be defined with the `-c' command line option when STAMP is initiated; see section 2.1 STAMP Command Options.
The \wfcs field allows multibyte characters to be defined as "caseless" word formation characters. It has the same relationship to \wfc that \luwfcs has to \luwfc. The multibyte word formation characters are separated from each other by whitespace.
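For example, in a (hypothetical) orthography where a vowel followed by a circumflex character is written as a two-byte sequence that has no uppercase form, the following could be used:

\wfcs e^ o^ | two-byte caseless vowel symbols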
A complete text output control file used for adapting to Asheninca Campa is given below.
\id AEouttx.ctl for Asheninca Campa
\ch "N" "m" / _ p | assimilates before p
\ch "N" "n" | otherwise becomes n
\ch "ny" "n~"
\ch "ts" "th" / ~_ i | (N)tsi is unchanged
\ch "tsy" "ch"
\ch "sy" "sh"
\ch "t" "tz" / n _ i
\ch "k" "qu" / _ i / _ e
\ch "k" "q" / _ y
\ch "k" "c"
\scl X t s c | define class of t s c
\ch "h" "j" / [X] ~_ | change except in th, sh, ch
\ch "#" " " | remove fixed space
\ch "@" "" | remove blocking character
Analysis files are record oriented standard format files. This means that the files are divided into records, each representing a single word in the original input text file, and records are divided into fields. An analysis file contains at least one record, and may contain a large number of records. Each record contains one or more fields. Each field occupies at least one line, and is marked by a field code at the beginning of the line. A field code begins with a backslash character (\), and contains one or more letters in addition.
This section describes the possible fields in an analysis file. The only field that is guaranteed to exist is the analysis (\a) field. All other fields are either data dependent or optional.
The analysis field (\a) starts each record of an analysis file. It has the following form:
\a PFX IFX PFX < CAT root CAT root > SFX IFX SFX
where PFX is a prefix morphname, IFX is an infix morphname, SFX is a suffix morphname, CAT is a root category, and root is a root gloss or etymology. In the simplest case, an analysis field would look like this:
\a < CAT root >
where CAT is a root category and root is a root gloss or etymology.
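For example (with a hypothetical category, gloss, and suffix morphname), a word consisting of a noun root followed by a plural suffix might be analyzed as:

\a < N0 house > PL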
The morpheme decomposition field (\d) follows the analysis field. It has the following form:
\d anti-dis-establish-ment-arian-ism-s
where the hyphens separate the individual morphemes in the surface form of the word.
The category field (\cat) provides rudimentary category information. This may be useful for sentence level parsing. It has the following form:
\cat CAT
where CAT is the word category.
If there are multiple analyses, there will be multiple categories in the output, separated by ambiguity markers.
The properties field (\p) contains the names of any allomorph or morpheme properties found in the analysis of the word. It has the form:
\p ==prop1 prop2=prop3=
where prop1, prop2, and prop3 are property names. The equal signs (=) serve to separate the property information of the individual morphemes. Note that morphemes may have more than one property, with the names separated by spaces, or no properties at all.
The feature descriptor field (\fd) contains the feature names associated with each morpheme in the analysis. It has the following form:
\fd ==feat1 feat2=feat3=
where feat1, feat2, and feat3 are feature descriptors. The equal signs (=) serve to separate the feature descriptors of the individual morphemes. Note that morphemes may have more than one feature descriptor, with the names separated by spaces, or no feature descriptors at all.
If there are multiple analyses, there will be multiple feature sets in the output, separated by ambiguity markers.
The underlying form field (\u) is similar to the decomposition field except that it shows underlying forms instead of surface forms. It looks like this:
\u a-para-a-i-ri-me
where the hyphens separate the individual morphemes.
The original word field (\w) contains the original input word as it looks before decapitalization and orthography changes. It looks like this:
\w The
Note that this is a gratuitous change from earlier versions of AMPLE and KTEXT, which wrote the decapitalized form.
The format information field (\f) records any formatting codes or punctuation that appeared in the input text file before the word. It looks like this:
\f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n \\c 5\n\n \\s
where backslashes (\) in the input text are doubled, newlines are represented by \n, and additional lines in the field start with a tab character.
The format information field is written to the output analysis file whenever it is needed, that is, whenever formatting codes or punctuation exist before words.
The capitalization field (\c) records any capitalization of the input word. It looks like this:
\c 1
where the number following the field code has one of these values:

1        only the first letter of the word is capitalized
2        all of the letters in the word are capitalized
4-32767  some other combination of letters is capitalized; the value records which individual letters were capitalized

Note that the third form is of limited utility, but still exists because of words like the author's last name.
The capitalization field is written to the output analysis file whenever any of the letters in the word are capitalized.
The nonalphabetic field (\n) records any trailing punctuation, bar codes, or whitespace characters. It looks like this:
\n |r.\n
where newlines are represented by \n. The nonalphabetic field ends with the last whitespace character immediately following the word.
The nonalphabetic field is written to the output analysis file whenever the word is followed by anything other than a single space character. This includes the case when a word ends a file with nothing following it.
The previous section assumed that only one analysis is produced for each word. This is not always possible since words in isolation are frequently ambiguous. Multiple analyses are handled by writing each analysis field in parallel, with the number of analyses at the beginning of each output field. For example,
\a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
\d %2%imaika-Npa-ni%imaika-Npani%
\cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
\p %2%==%=%
\fd %2%==%=%
\u %2%imaika-Npa-ni%imaika-Npani%
\w Imaicampani
\f \\v124
\c 1
\n \n
where the percent sign (%) separates the different analyses in each field. Note that only those fields which contain analysis information are marked for ambiguity. The other fields (\w, \f, \c, and \n) are the same regardless of the number of analyses.
The previous sections assumed that words are successfully analyzed. This does not always happen. Analysis failures are marked the same way as multiple analyses, but with zero (0) for the ambiguity count. For example,
\a %0%ta%
\d %0%ta%
\cat %0%%
\p %0%%
\fd %0%%
\u %0%%
\w TA
\f \\v 12 |b
\c 2
\n |r\n
Note that only the \a and \d fields contain any information, and those both have the original word as a place holder. The other analysis fields (\cat, \p, \fd, and \u) are marked for failure, but otherwise left empty.
(1) This chapter is adapted from chapter 8 of Weber (1990).