Comparing the output from two different runs of programs like AMPLE or
KTEXT can be quite difficult. The analysis files can differ only in
the order of ambiguous analyses, or in the number of analyses. Either
situation is difficult to analyze with simple file comparison utilities
like fc.exe or diff. That is why ANADIFF was written.
The ANADIFF program uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.
-a character
%
.)
-q
-v
ANADIFF expects the names of two analysis files following any options on the command line. The examples below illustrate this.
ANADIFF normally writes a summary when it has finished comparing the two analysis files. For example, comparing two identical analysis files would look like this:
C> anadiff a.aaa a.ana
a.aaa and a.ana are the same (78 of 78 are identical)
Comparing two analysis files that are logically, but not literally, identical produces similar output:
C> anadiff b.aaa b.ana
b.aaa and b.ana are the same (70 of 78 are identical)
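The notion of two fields being logically, but not literally, identical can be sketched in Python. This is only an illustration under stated assumptions, not ANADIFF's actual algorithm: the helper names `alternatives` and `same_ignoring_order` are hypothetical, and the sketch assumes that ambiguous fields use the percent sign as the ambiguity marker in the layout described later in this document.

```python
from collections import Counter


def alternatives(field, marker="%"):
    # An unmarked field carries exactly one analysis.
    field = field.strip()
    if not field.startswith(marker):
        return [field]
    # "%2%x%y%" splits into ['', '2', 'x', 'y', '']; keep only 'x' and 'y'.
    return field.split(marker)[2:-1]


def same_ignoring_order(field_a, field_b, marker="%"):
    # Compare the alternatives as multisets, so their order does not matter.
    return Counter(alternatives(field_a, marker)) == Counter(
        alternatives(field_b, marker)
    )


print(same_ignoring_order("%2%x%y%", "%2%y%x%"))
print(same_ignoring_order("%2%x%y%", "x"))
```

Comparing each field separately is weaker than comparing whole analyses, because the alternatives of the different fields in a record are written in parallel; a fuller comparison would pair them up before ignoring their order.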
As would be expected, more output is generated when the two analysis files actually are different. The following example illustrates four types of differences:
C> anadiff c.aaa c.ana
1.
diff: \a < N0 *hirka > LOC
         < N0 *hirka > LOCATIVE
\d hirka-chaw
\p =
\w HIRKACHAW
\f \\id HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
   \\c 5\n\n
   \\s
\c 2
2.
\a < V2 *yatra > CAUS 3
\d yacha-chi-n
diff: \p =Mlowers=foreshortens
         =Morphlowers=foreshortens
\w YACHACHIN
\c 2
\n \n
3.
\a < N0 *runa > PLUR
\d runa-kuna
\p =
diff: \a < N0 *runa > plural [missing in c.ana]
diff: \d runa-kuna [missing in c.ana]
diff: \p = [missing in c.ana]
\w runakuna
4.
\a < V1 *n~awpa > 3 COND
\d n~awpa-n-man
\p =foreshortens=
diff: \a < N0 *n~awpa > 3P GOAL
         < N0 *n~awpa > 3P GOALIE
\d n~awpa-n-man
\p ==
\w n~awpanman
c.aaa and c.ana differ 4 times (63 of 78 are identical)
Note that ANADIFF displays analyses somewhat differently than they appear in the analysis files.
ANADIFF returns the number of actual differences that it found when it exits. This allows it to be used in batch files for regression tests. For example, consider the following MS-DOS batch file for testing different versions of AMPLE on a specific set of data:
if "%1" == "" goto done
%1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
anadiff aetest.aaa aetest.ana >aetest.dif
if errorlevel 1 goto done
del aetest.dif
del aetest.ana
del aetest.log
:done
(The %1 is an MS-DOS batch variable corresponding to the first
command line argument following the name of the batch file in the
command.) This batch file runs the given version of AMPLE on a
specific set of data, and then compares the output file (aetest.ana)
to the output from a previous run (aetest.aaa). If the output is the
same, except possibly for the order of ambiguous analyses, then the
new output and log files are deleted. The same is accomplished by the
following Unix shell script.
#! /bin/sh
if [ $# -gt 0 ]; then
    $1 -m -f aetest.cmd -i aetest.txt -o aetest.ana >aetest.log
    if (anadiff aetest.aaa aetest.ana >aetest.dif) then
        rm aetest.dif aetest.ana aetest.log
    fi
fi
Analysis files are record-oriented standard format files.
This means that the files are divided into records, each representing a
single word in the original input text file, and records are divided
into fields. An analysis file contains at least one record, and may
contain a large number of records. Each record contains one or more
fields. Each field occupies at least one line, and is marked by a
field code at the beginning of the line. A field code begins
with a backslash character (\), followed by one or more letters.
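The record and field structure described above can be read with a short Python sketch. It is an illustration only, under these assumptions: the function name `parse_records` is hypothetical, each record begins with an \a field (as the field descriptions in this document state), and any line that does not begin with a field code continues the previous field.

```python
import re

# A field code is a backslash followed by one or more letters at line start.
FIELD_RE = re.compile(r"^\\([A-Za-z]+)(?:[ \t](.*))?$")


def parse_records(text):
    records = []
    current = None
    last_code = None
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            code = match.group(1)
            value = match.group(2) or ""
            # Assumption: an analysis field opens a new record.
            if code == "a" or current is None:
                current = {}
                records.append(current)
            current[code] = value
            last_code = code
        elif current is not None and line.strip():
            # Continuation line: keep it verbatim after a line break.
            current[last_code] += "\n" + line
    return records


sample = "\\a < N0 dog >\n\\d dog\n\\w Dog\n\\c 1\n\\a < V1 run > 3\n\\d run-s\n"
for record in parse_records(sample):
    print(record)
```

Each record is returned as a dictionary keyed by field code without the backslash, so a repeated field code within one record would overwrite the earlier value in this sketch.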
This section describes the possible fields in an analysis file. The
only field that is guaranteed to exist is the analysis (\a) field.
All other fields are either data dependent or optional.
The analysis field (\a) starts each record of an analysis file.
It has the following form:
\a PFX IFX PFX < CAT root CAT root > SFX IFX SFX
where PFX is a prefix morphname, IFX is an infix morphname, SFX is a
suffix morphname, CAT is a root category, and root is a root gloss or
etymology. In the simplest case, an analysis field would look like
this:
\a < CAT root >
where CAT is a root category and root is a root gloss or etymology.
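As a rough illustration of the analysis field layout, the following Python sketch splits a field value into the morphnames before the roots, the category and root pairs between the angle brackets, and the morphnames after the roots. The function name `parse_analysis` is hypothetical, and the sketch assumes that categories and roots contain no spaces. Prefixes and infixes cannot be told apart by position alone, so they are returned together.

```python
def parse_analysis(value):
    # Everything before "<" is prefix/infix morphnames.
    before, _, rest = value.partition("<")
    # Everything between "<" and ">" is CAT root pairs; the remainder
    # is suffix/infix morphnames.
    inside, _, after = rest.partition(">")
    tokens = inside.split()
    roots = list(zip(tokens[0::2], tokens[1::2]))
    return {
        "prefixes": before.split(),
        "roots": roots,
        "suffixes": after.split(),
    }


print(parse_analysis("< V2 *yatra > CAUS 3"))
```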
The morpheme decomposition field (\d) follows the analysis field.
It has the following form:
\d anti-dis-establish-ment-arian-ism-s
where the hyphens separate the individual morphemes in the surface form of the word.
The category field (\cat) provides rudimentary category information.
This may be useful for sentence level parsing. It has the following
form:
\cat CAT
where CAT is the word category.
If there are multiple analyses, there will be multiple categories in the output, separated by ambiguity markers.
The properties field (\p) contains the names of any allomorph or
morpheme properties found in the analysis of the word. It has the
form:
\p ==prop1 prop2=prop3=
where prop1, prop2, and prop3 are property names. The equal signs (=)
serve to separate the property information of the individual
morphemes. Note that morphemes may have more than one property, with
the names separated by spaces, or no properties at all.
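The layout of the properties field can be unpacked with a small Python sketch. The function names `parse_properties` and `pair_morphemes` are hypothetical. Pairing the n-th slot with the n-th morpheme of the decomposition field is an assumption inferred from the examples in this document, where a word with three morphemes has two equal signs in its \p field; it is not stated explicitly in the text.

```python
def parse_properties(value):
    # Splitting on "=" yields one slot per morpheme; an empty slot
    # means the morpheme has no properties.
    return [slot.split() for slot in value.strip().split("=")]


def pair_morphemes(decomposition, properties):
    morphemes = decomposition.strip().split("-")
    slots = parse_properties(properties)
    if len(morphemes) != len(slots):
        raise ValueError("morpheme and property counts differ")
    return list(zip(morphemes, slots))


print(parse_properties("==prop1 prop2=prop3="))
print(pair_morphemes("yacha-chi-n", "=Mlowers=foreshortens"))
```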
The feature descriptor field (\fd) contains the feature names
associated with each morpheme in the analysis. It has the following
form:
\fd ==feat1 feat2=feat3=
where feat1, feat2, and feat3 are feature descriptors. The equal
signs (=) serve to separate the feature descriptors of the individual
morphemes. Note that morphemes may have more than one feature
descriptor, with the names separated by spaces, or no feature
descriptors at all.
If there are multiple analyses, there will be multiple feature sets in the output, separated by ambiguity markers.
The underlying form field (\u) is similar to the decomposition
field except that it shows underlying forms instead of surface forms.
It looks like this:
\u a-para-a-i-ri-me
where the hyphens separate the individual morphemes.
The original word field (\w) contains the original input word as
it looks before decapitalization and orthography changes. It looks
like this:
\w The
Note that this is a gratuitous change from earlier versions of AMPLE and KTEXT, which wrote the decapitalized form.
The format information field (\f) records any formatting codes
or punctuation that appeared in the input text file before the word.
It looks like this:
\f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
	\\c 5\n\n
	\\s
where backslashes (\) in the input text are doubled, newlines are
represented by \n, and additional lines in the field start with a tab
character.
The format information field is written to the output analysis file whenever it is needed, that is, whenever formatting codes or punctuation exist before words.
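The escaping rules for the format information field can be reversed with a short Python sketch. The function name `decode_format` is hypothetical. The sketch assumes that the physical line break and leading tab of each additional line are layout only and are not part of the original text, which the description above suggests but does not state outright.

```python
import re


def decode_format(value):
    # Assumption: a line break followed by a tab only continues the field.
    joined = value.replace("\n\t", "")
    # Scan left to right: a doubled backslash becomes one backslash,
    # and backslash-n becomes a real newline.
    return re.sub(
        r"\\(\\|n)",
        lambda m: "\\" if m.group(1) == "\\" else "\n",
        joined,
    )


# The field text \\c 5\n\n stands for the input text "\c 5" plus two newlines.
print(repr(decode_format("\\\\c 5\\n\\n")))
```

Handling both escapes in one pass matters: replacing doubled backslashes first and \n second would misread an escaped backslash that happens to be followed by the letter n.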
The capitalization field (\c) records any capitalization of the
input word. It looks like this:
\c 1
where the number following the field code has one of these values:
1
2
4-32767
Note that the third form is of limited utility, but still exists because of words like the author's last name.
The capitalization field is written to the output analysis file whenever any of the letters in the word are capitalized.
The nonalphabetic field (\n) records any trailing punctuation,
bar codes, or whitespace characters. It looks like this:
\n |r.\n
where newlines are represented by \n. The nonalphabetic field
ends with the last whitespace character immediately following the word.
The nonalphabetic field is written to the output analysis file whenever the word is followed by anything other than a single space character. This includes the case when a word ends a file with nothing following it.
The previous section assumed that only one analysis is produced for each word. This is not always possible since words in isolation are frequently ambiguous. Multiple analyses are handled by writing each analysis field in parallel, with the number of analyses at the beginning of each output field. For example,
\a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
\d %2%imaika-Npa-ni%imaika-Npani%
\cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
\p %2%==%=%
\fd %2%==%=%
\u %2%imaika-Npa-ni%imaika-Npani%
\w Imaicampani
\f \\v124
\c 1
\n \n
where the percent sign (%) separates the different analyses in
each field. Note that only those fields which contain analysis
information are marked for ambiguity. The other fields (\w, \f, \c,
and \n) are the same regardless of the number of analyses.
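The ambiguity marking can be taken apart with a few lines of Python. This is a minimal sketch: the function name `split_ambiguities` is hypothetical, and the percent sign is assumed to be the ambiguity marker, as in the example above.

```python
def split_ambiguities(value, marker="%"):
    value = value.strip()
    # A field without a leading marker holds a single analysis.
    if not value.startswith(marker):
        return 1, [value]
    # "%2%x%y%" splits into ['', '2', 'x', 'y', ''].
    parts = value.split(marker)
    return int(parts[1]), parts[2:-1]


count, analyses = split_ambiguities("%2%imaika-Npa-ni%imaika-Npani%")
print(count, analyses)
```

Because the fields of a record are marked in parallel, the alternative at a given position in the \a field belongs with the alternative at the same position in the \d, \cat, \p, \fd, and \u fields.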
The previous sections assumed that words are successfully analyzed.
This does not always happen. Analysis failures are marked the same way
as multiple analyses, but with zero (0) for the ambiguity count.
For example,
\a %0%ta%
\d %0%ta%
\cat %0%%
\p %0%%
\fd %0%%
\u %0%%
\w TA
\f \\v 12 |b
\c 2
\n |r\n
Note that only the \a and \d fields contain any information, and
those both have the original word as a place holder. The other
analysis fields (\cat, \p, \fd, and \u) are marked for failure, but
otherwise left empty.
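A failed analysis can be recognized from the zero ambiguity count alone. The following Python sketch returns the place holder word when an analysis field marks a failure; the function name `failed_form` is hypothetical, and the percent sign is again assumed to be the ambiguity marker.

```python
def failed_form(a_field, marker="%"):
    # A failure is written as marker, zero, marker, place holder, marker.
    prefix = marker + "0" + marker
    value = a_field.strip()
    if value.startswith(prefix) and value.endswith(marker):
        return value[len(prefix):-1]
    # Any other value is a successful (possibly ambiguous) analysis.
    return None


print(failed_form("%0%ta%"))
print(failed_form("< N0 *runa > PLUR"))
```

Collecting the place holders from every record in a file gives a quick list of the words that still need attention.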
This document was generated on 20 March 2003 using texi2html 1.56k.