This section briefly describes what KTEXT does, places KTEXT in its computational context, lists technical specifications of the program, and gives information on use and support of the program.
KTEXT is a text processing program that uses the PC-KIMMO parser (see below about PC-KIMMO). KTEXT operates in two modes: analysis and synthesis. In analysis mode, KTEXT reads a text from a disk file, parses each word, and writes the results to a new disk file. This new file is in the form of a structured text file where each word of the original text is represented as a database record composed of several fields. Each word record contains a field for the original word, a field for the underlying or lexical form of the word, and a field for the gloss string. For example, if the text in the input file contains the word `beginning' (to use an English example), KTEXT's output file will have a record of this format:
\a be`gin +ING
\d be`gin-+ING
\cat V
\fd ing
\w beginning
This record consists of five fields, each tagged with a backslash code.(1) The first field, tagged with \a for analysis, contains the gloss string for the word. The second field, tagged with \d for (morpheme) decomposition, contains the underlying or lexical form of the word. The third field, tagged with \cat for category, contains the grammatical category of the word. The fourth field, tagged with \fd for feature descriptions, contains a list of feature abbreviations associated with the word, and the fifth field, tagged with \w for word, contains the original word. The word `pictures' (which can be analyzed as either a verb or a noun) demonstrates how KTEXT handles multiple parses:
\a %2%`picture +3SG%`picture +PL%
\d %2%`picture-+s%`picture-+s%
\cat %2%V%N%
\fd %2%s%-3sg pl%
\w pictures
Percent signs (or some other designated character) separate the multiple results in the \a, \d, \cat, and \fd fields, with a number indicating how many results were found.
A word record also saves any capitalization or punctuation associated with the original word. For example, if a sentence begins "Obviously, this hypothesis...", KTEXT will output the first word like this:
\a `obvious +AVR1
\d `obvious-+ly
\cat AV
\w obviously
\c 1
\n ,
The \w field contains the original word without capitalization or the following comma. The \c field contains the number 1 which indicates that the first letter of the original word is upper case. The \n field contains the comma that follows the original word. The purpose of retaining the capitalization and punctuation of the original text is, of course, to enable one to recover the original text from KTEXT's output file.
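The record layout described above is easy to work with programmatically. The following is a minimal sketch of a reader for one word record, assuming one backslash-tagged field per line and the default percent-sign ambiguity marker; the function name `parse_record' is our own, not part of KTEXT.

```python
def parse_record(lines, ambig="%"):
    """Return {tag: value} for one KTEXT word record.

    Fields whose value starts with the ambiguity marker (e.g.
    "%2%V%N%") are split into a list of analyses, dropping the
    leading count; all other fields are kept as plain strings.
    """
    record = {}
    for line in lines:
        if not line.startswith("\\"):
            continue
        tag, _, value = line.partition(" ")
        tag = tag.lstrip("\\")
        if value.startswith(ambig):
            # "%2%`picture +3SG%`picture +PL%" -> count, analyses...
            parts = value.strip(ambig).split(ambig)
            record[tag] = parts[1:]          # drop the leading count
        else:
            record[tag] = value
    return record

rec = parse_record([
    "\\a %2%`picture +3SG%`picture +PL%",
    "\\d %2%`picture-+s%`picture-+s%",
    "\\cat %2%V%N%",
    "\\w pictures",
])
```

A downstream tool could then, for example, test `len(rec["cat"])` to see whether a word was ambiguous.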
In synthesis mode, KTEXT takes an analysis file compatible with that produced by KTEXT in analysis mode and produces an orthographic text file comparable to the original.
KTEXT is best understood in relation to two other programs: PC-KIMMO and CARLA. First, we will take a look at PC-KIMMO.
KTEXT is intended to be used with PC-KIMMO (though it is a stand-alone program). PC-KIMMO is a program for doing computational phonology and morphology. It is typically used to build morphological parsers for natural language processing systems. PC-KIMMO is described in the book PC-KIMMO: a two-level processor for morphological analysis by Evan L. Antworth, published by the Summer Institute of Linguistics (1990). The PC-KIMMO software is available for MS-DOS and Windows (IBM PCs and compatibles), Macintosh, and UNIX. The book (including software) is available for $23.00 (plus postage) from:
International Academic Bookstore
7500 W. Camp Wisdom Road
Dallas, TX 75236 U.S.A.
phone: 972/708-7404
fax: 972/708-7433
The KTEXT program which this document describes will be of very little use to you without the PC-KIMMO program and book. The remainder of this document assumes that you are familiar with PC-KIMMO.
PC-KIMMO was deliberately designed to be reusable. The core of PC-KIMMO is a library of functions such as load rules, load lexicon, generate, and recognize. The PC-KIMMO program supplied on the release diskette is just a user shell built around these basic functions. This shell provides an environment for developing and testing sets of rules and lexicons. Since the shell is a development environment, it has very little built-in data processing capability. But because PC-KIMMO is modular and portable, you can write your own data processing program that uses PC-KIMMO's function library. KTEXT is an example of how to use PC-KIMMO to create a new natural language processing program. KTEXT is a text processing program that uses PC-KIMMO to do morphological parsing.
KTEXT is also closely related to a system called CARLA, which stands for Computer Assisted Related Language Adaptation. CARLA is a type of machine translation system designed to work between closely related languages. CARLA is based on the Analysis Transfer Synthesis (ATS) paradigm of adaptation. This paradigm involves three stages:
When used in analysis mode, KTEXT performs the Analysis task. In the original CARLA system, analysis is done by a program called AMPLE (Weber et al. 1988), which is also a morphological parser designed to process text. KTEXT was created by replacing AMPLE's parsing engine with the PC-KIMMO parser. Thus KTEXT has the same text-handling mechanisms as AMPLE and produces output similar or even identical to AMPLE's. The advantages of this design are (1) we were able to develop KTEXT very quickly and easily since it involved very little new code, and (2) existing programs that use AMPLE's output format can also use KTEXT's output. The disadvantage of basing KTEXT on AMPLE is that the format of the output file is perhaps not consistent with terminology already established for PC-KIMMO.
When KTEXT is used in synthesis mode, it performs the Synthesis task. In the original CARLA system, synthesis is done by a program called STAMP (Weber et al. 1990). However, STAMP also performs the Transfer task; KTEXT does not have this capability.
KTEXT runs under four operating systems:
KTEXT does not require any graphics capability. It handles eight-bit characters (such as the IBM PC extended character set or the Windows ANSI character set). The Windows and Macintosh versions have the same user interface as the MS-DOS and UNIX versions, namely a batch-processing, command-line interface. In other words, a GUI version does not exist.
The MS-DOS executable requires a 386 or newer CPU and a DPMI server. This has the advantage of allowing the program to use as much memory as necessary without constraining it to the archaic 640K limit. (DPMI is provided automatically by Windows. A free DPMI server is distributed with the MS-DOS executable.)
The program is written in C and is very portable. The Macintosh version was compiled with the Metrowerks C compiler. The sources available at URL ftp://ftp.sil.org/software/test/ can be compiled for any of the four target platforms.
KTEXT was developed by Stephen McConnel and Evan Antworth of SIL International. Several qualifications apply to its use and support:
Bug reports, wish lists, requests for support, and positive feedback should be directed to Stephen McConnel at this address:
Stephen McConnel
Language Software Development
SIL International
7500 W. Camp Wisdom Road
Dallas, TX 75236
phone: 972/708-7361
email: Stephen_McConnel@sil.org
Typically, the steps involved in using KTEXT to analyze texts are:
To demonstrate how to use KTEXT to process a text in analysis, we will use Englex, a morphological grammar of English for PC-KIMMO, and analyze a paragraph of Alice's Adventures in Wonderland, by Lewis Carroll. The first paragraph of the text is shown in figure 1.
Figure 1. Excerpt from Alice

\id Alice.txt - Lewis Carroll's Alice's Adventures in Wonderland
\ti Down the Rabbit-Hole
\p
Alice was beginning to get very tired of sitting by her sister on the bank and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, "and what is the use of a book," thought Alice, "without pictures or conversations?"
\p
So she was considering, in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
The text was keyboarded using a very simple system of document markup that tags parts of the document with backslash codes. The \id tag identifies the text; the \ti tag indicates the title of the story; and the \p tag indicates the beginning of a paragraph.
The next step is to process the text with KTEXT in analysis mode. Run
the KTEXT application with these command line options:
ktext -x ana.ctl -i alice.txt -o alice.ana -l ana.log
where `ana.ctl' is the analysis control file, `alice.txt' is the input text file, `alice.ana' is the output analysis file, and `ana.log' is the analysis log file. The following display will appear on the screen:
KTEXT (analyze/synthesize words using PC-Kimmo functions)
Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
Beta test version compiled Nov 7 1996 15:11:16
with PC-Kimmo functions version 2.1b7 (November 6, 1996)
and PC-PATR functions version 0.99b0 (November 7, 1996)
For 386 CPU (or better) under MS-DOS
[compiled with DJGPP 2.1/GNU C 2.7]
affix.lex        255 entries
noun.lex       10461 entries
verb.lex        4215 entries
adjectiv.lex    3345 entries
adverb.lex       400 entries
minor.lex        379 entries
proper.lex      1057 entries
abbrev.lex       127 entries
technica.lex     813 entries
natural.lex      435 entries
foreign.lex       88 entries
5...2.2..2 ..22.2.2.. 2.222.2.22 ..22.2.222 22..22.22.
2..23.22.. 2.2.22.225 2..2...22. ......2... 4.22.2..4. 100
...2..2..2 2.222
Each dot represents one word successfully processed. Multiple analyses of a word are indicated by numbers; thus the first word, `down', received five analyses. When the program is done, it will return you to the operating system prompt. A fragment of the resulting output file is shown in figure 2.
Figure 2. Output of KTEXT

\a %5%`down%`down%`down%`down%`down%
\d %5%`down%`down%`down%`down%`down%
\cat %5%AV%V%AJ%N%PP%
\fd %5%%vbase%%sg%%
\w down
\c 1

\a the
\d the
\cat DT
\w the

\a `rabbit - `hole
\d `rabbit---`hole
\cat N
\fd sg
\w rabbit-hole
\c 516
\n
\n
One obvious way to continue is to reassemble the text in interlinear format. That is, we could write a program that would take the data structures shown in figure 2 and create a new file where the text is stored in interlinear format. The resulting interlinear text is shown in figure 3. An interlinear text editor like IT(2) could then be used to add more lines of annotations to the text.
Figure 3. An English example of interlinear text format

Down  the Rabbit  - Hole
`down the `rabbit - `hole
PP    DT  N       - N
Interlinear translation is a time-honored format for presenting analyzed vernacular texts. An interlinear text consists of a baseline text and one or more lines of annotations that are vertically aligned with the baseline. In the text shown in figure 3, the first line is the baseline text. The second line provides the lexical form of each original word, including morpheme breaks. The third line gives the category or part-of-speech of each word.
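The column alignment described above can be produced mechanically from the word records: pad each word, lexical form, and category to a common width before joining the rows. The following is a minimal sketch of such a formatter; the function name and the triple-per-word input layout are our own, not part of KTEXT or IT.

```python
def interlinearize(words):
    """Column-align a baseline and two annotation lines.

    `words` is a list of (word, lexical form, category) triples; each
    column is padded to the width of its widest cell.
    """
    rows = [[], [], []]
    for w, d, cat in words:
        width = max(len(w), len(d), len(cat))
        for row, cell in zip(rows, (w, d, cat)):
            row.append(cell.ljust(width))
    return "\n".join(" ".join(r).rstrip() for r in rows)

table = interlinearize([
    ("Down", "`down", "PP"),
    ("the", "the", "DT"),
    ("Rabbit-Hole", "`rabbit---`hole", "N"),
])
print(table)
```

A real tool would also wrap long sentences into interlinear "bundles" that fit the page width, which is exactly the sort of work IT and ITF take care of.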
Another way to proceed would be to take the output of KTEXT as shown in figure 2 and format it directly for printing. In other words, there would be no disk file of interlinear text corresponding to figure 3; rather, the interlinear text is created on the fly as it is prepared for printing. Fortunately, the software required to print interlinear text is now available. As a complement to the IT program, a system for formatting interlinear text for typesetting has recently been developed (see Kew and McConnel, 1991). Called ITF, for Interlinear Text Formatter,(3) it is a set of TeX(4) macros that can format an arbitrary number of aligning annotations with up to two freeform (nonaligning) annotations. While ITF is primarily intended to format the data files produced by IT (similar to the interlinear text shown in figure 3), an auxiliary program provided with ITF accepts the output of the KTEXT program. The final printed result of the formatting process is shown in figure 4.(5) It should be noted that this is just one of many formats that ITF can produce. Because ITF is built on a full-featured typesetting system, virtually all aspects of the formatting detail can be customized, including half a dozen different schemes for laying out the freeform annotations relative to the interlinear text.
Figure 4. Output of ITF
Normally, in an adaptation project, the text is adapted from a source language to a target language via a Transfer component. For the purpose of this example, we will use English as both the source and target language, thus obviating the need for a Transfer component. If the synthesis operation produces a text which is identical to the original text, then we have proved the efficacy of the system.
Typically, the steps involved in using KTEXT in synthesis mode are:
To synthesize the original text from the analysis file, run the KTEXT application with these command line options:
ktext -s -x syn.ctl -i alice.ana -o alice.syn -l syn.log
where `syn.ctl' is the synthesis control file, `alice.ana' is the input analysis file, `alice.syn' is the synthesized output text file, and `syn.log' is the synthesis log file. The following display will appear on the screen:
KTEXT (analyze/synthesize words using PC-Kimmo functions)
Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
Beta test version compiled Nov 7 1996 15:11:16
with PC-Kimmo functions version 2.1b7 (November 6, 1996)
and PC-PATR functions version 0.99b0 (November 7, 1996)
For 386 CPU (or better) under MS-DOS
[compiled with DJGPP 2.1/GNU C 2.7]
affix.lex        255 entries
noun.lex       10461 entries
verb.lex        4215 entries
adjectiv.lex    3345 entries
adverb.lex       400 entries
minor.lex        379 entries
proper.lex      1057 entries
abbrev.lex       127 entries
technica.lex     813 entries
natural.lex      435 entries
foreign.lex       88 entries
.......... .......... .......... .......... ..........
.......... .......... .......... .......... .......... 100
.......... .....
Notice that every word received a single synthesis. Open the output file `alice.syn' and you will see that it is identical to the input text file shown in figure 1.
This section describes KTEXT's user interface and the input files it uses.
KTEXT is a batch-processing program. This means that the program takes as input a text from a disk file and returns as output the processed text in a new disk file. KTEXT is run from the command line by giving it the information it needs (file names and other options). It does not have an interactive interface. The user controls KTEXT's operation by means of special files that contain all the information KTEXT needs to process the input text. These files are called control files.
The operation of the program is controlled by using command line options. To see a list of the command line options, run the KTEXT application with -h as a command line option. You will see a display similar to this:
KTEXT (analyze/synthesize words using PC-Kimmo functions)
Version 2.0b11 (November 1, 1996), Copyright 1996 SIL
Beta test version compiled Nov 7 1996 15:11:16
with PC-Kimmo functions version 2.1b7 (November 6, 1996)
and PC-PATR functions version 0.99b0 (November 7, 1996)
For 386 CPU (or better) under MS-DOS
[compiled with DJGPP 2.1/GNU C 2.7]

Usage: ktext [options]
 -c char     make char the comment character for the control files (default is ;)
 -s          synthesis mode (default is analysis)
 -v          for synthesis, verify each result with a word parse
 -x ctlfile  specify the KTEXT control file (default is ktext.ctl)
 -i infile   specify the input file (required: no default)
 -o outfile  specify the output file (default is based on infile)
 -l logfile  specify the KTEXT log file (default is none)
The command line options (`-c', `-s', and so on) are all lower case letters. Here is a detailed description of each command line option.
For example, given the option

-x english

KTEXT will try to load the file `english.ctl'. If the `-x' option is not used, KTEXT will try to load a control file with the default file name `ktext.ctl'. In all instances where file names are supplied to KTEXT, an optional directory path can be included; for example, -i c:\texts\alice.txt.
KTEXT uses three main functional modules in analysis mode: the text input module, the analysis module, and the structured output module. The diagram in figure 5 shows the flow of data through these modules. The input text is fed into the text input module which outputs the text as a stream of normalized words with capitalization and punctuation stripped out and saved. The text input module is controlled by a file that specifies orthographic changes. Each word is then passed to the analysis module where it is parsed. The analysis module is controlled by the PC-KIMMO rules, lexicon, and grammar files. The parsed words are then passed to the structured output module and written to the output file as database records.
In analysis mode, KTEXT uses six different input files and produces one output file (plus an optional log file). These six input files are:
The PC-KIMMO rules, lexicon, and grammar files are described in the PC-KIMMO documentation and will not be discussed further in this document; see Antworth (1990) and Antworth (1995). The other input files and the analysis output data file are described in the following chapters.
KTEXT also uses three main functional modules in synthesis mode: the structured input module, the synthesis module, and the text output module. The diagram in figure 6 shows the flow of data through these modules. A structured input text containing parsed words is fed into the structured input module, which outputs the text as a stream of parsed words with capitalization and punctuation stripped out and saved. Each parsed word is then passed to the synthesis module where it is rebuilt from its pieces. The synthesis module is controlled by the PC-KIMMO rules and lexicon files. (Synthesis normally does not use the grammar file.) The synthesized words are then passed to the text output module and written to the output file as a synthesized text with the punctuation and capitalization merged back in. The text output module is controlled by a file that specifies orthographic changes.
In synthesis mode, KTEXT also uses six different input files and produces one output file (plus an optional log file). These six input files are:
The PC-KIMMO rules, lexicon, and grammar files are described in the PC-KIMMO documentation and will not be discussed further in this document; see Antworth (1990) and Antworth (1995). The other input files and the synthesis output text file are described in the following chapters.
The input text file contains the text that KTEXT will process. It must be a plain text file, not a file formatted by a word processor. If you use a word processor such as Microsoft Word to create your text, you must save it as plain text with no formatting. KTEXT preserves all the "white space" used in the text file. That is, it saves in its output file the location of all line breaks, blank lines, tabs, spaces, and other nonalphabetic characters. This enables you to recover from the output file the precise format and page layout of the original text.
While KTEXT will accept text with no formatting information other than white space characters, it will also handle text that contains special format markers. These format markers can indicate parts of the text such as sentences, paragraphs, sections, section headings, and titles. The use of special format markers is called descriptive markup. KTEXT (because it is based on AMPLE) works best with a system of descriptive markup called "standard format" that is used by SIL International. SIL standard format marks the beginning of each text unit with a format marker. There is no explicit indication of the end of a unit. A format marker is composed of a special character (a backslash by default) followed by a code of one or more letters. For example, \ti for title, \ch for chapter, \p for paragraph, \s for sentence, and so on. KTEXT does not "know" any particular format markers. You can use whatever markers you like, as long as you declare them in the text input control file. For more on format markers, see section 7.3.2 below.
One of the best known systems of descriptive markup is SGML (Standard Generalized Markup Language). One very significant difference between SGML and SIL standard format is that SGML uses markers in pairs, one at the beginning of a text unit and a matching one at the end. This should not pose a problem for KTEXT, since KTEXT just preserves all format markers wherever they occur. Another difference is that SGML flags format markers with angle brackets, for instance <paragraph>. KTEXT can recognize SGML markers by changing the format marker flag character from backslash to left angle bracket (see section 7.3.2 below). Recognizing the end of the SGML format marker is a bit of a problem. While SGML uses a matching right angle bracket to indicate the end of the marker, SIL standard format simply uses a space to delineate the format marker from the following text. This means that for KTEXT to find the end of an SGML tag, you must leave at least one space after it, and there must not be any spaces in the middle of the SGML tag.
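The rule just described, that a format marker runs from the flag character to the next space or tab, makes marker recognition a matter of splitting on whitespace. The following is a minimal sketch of that idea, assuming whitespace-delimited tokens; the function name `split_markers' is our own, not part of KTEXT.

```python
def split_markers(line, flag="\\"):
    """Separate a line of input text into format markers and words.

    A token is a format marker if it begins with the flag character
    (backslash by default); the marker runs to the next whitespace,
    so the whole token belongs to the marker.
    """
    tokens, markers, words = line.split(), [], []
    for tok in tokens:
        (markers if tok.startswith(flag) else words).append(tok)
    return markers, words

m, w = split_markers("\\p Alice was beginning")
# SGML-style tags work too, provided they are space-delimited:
m2, w2 = split_markers("<paragraph> Alice was beginning", flag="<")
```

Note that this captures the limitation stated above: an SGML tag with an internal space, or one not followed by a space, would not be recognized as a single marker.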
KTEXT uses an overall control file to customize its operation. This file is structured as a standard format database, composed of various fields marked by backslash codes. The fields in the control file are as follows.
Figure 7 shows a sample KTEXT control file.
Figure 7. Sample KTEXT control file

\textin engintx.ctl
\rules d:\opac\test\ktext\englex\english.rul
\lexicon d:\opac\test\ktext\englex\english.lex
\grammar d:\opac\test\ktext\englex\english.grm
\textout engoutx.ctl
\cat <head pos>
\fd singular <number> = SG
\fd plural <number> = PL
When KTEXT reads its control file, it ignores any lines beginning with field codes other than those listed above. For example, a line beginning \co would be ignored. Such lines are treated as comments. Comments in the control file can also be indicated with the comment character, which by default is semicolon. This is the only way to place comments on the same line as a field. The comment character can be changed with the command line option `-c' when running KTEXT (see chapter 3).
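The control file reading behavior just described can be sketched in a few lines: keep only lines that begin with a known field code, and trim anything after the comment character. The function name, and the set of field codes (taken from the sample control file in figure 7), are our own illustration, not KTEXT internals.

```python
# Field codes as they appear in the sample control file (figure 7).
KNOWN = {"\\textin", "\\rules", "\\lexicon", "\\grammar",
         "\\textout", "\\cat", "\\fd"}

def read_control(lines, comment=";"):
    """Return (code, value) pairs for recognized control file fields.

    Lines with unknown field codes (e.g. \\co) are treated as
    comments; text after the comment character is discarded.
    """
    fields = []
    for line in lines:
        line = line.split(comment, 1)[0].rstrip()
        code, _, value = line.partition(" ")
        if code in KNOWN:
            fields.append((code, value))
    return fields

fields = read_control([
    "\\co this whole line is ignored",
    "\\rules english.rul ; PC-KIMMO rules file",
])
```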
This chapter describes the expected characteristics of an input text file, and the options offered for describing these characteristics by a text input control file.(6)
Text input control files define a simple model of input text files. They are plain text files with two types of embedded format markers.

The primary format markers begin with a special format character, which defaults to the backslash (\). Thus, each of the following would be recognized as a format marker and would not be processed by the program:

\ \p \sp \begin{enumerate} \very-long.and;muddled/format*marker,to#be$sure

Note that format markers cannot have a space or tab embedded in them; the first space or tab encountered terminates the format marker. One final note: the format character under discussion here applies only to the input text files which are to be processed. It has absolutely nothing to do with the use of backslash (\) to flag field codes in control files such as the text input control file.
The secondary format markers are two characters long and begin with a special character, which defaults to the vertical bar (|), causing this type of format marker to be frequently called a bar code. The following could be valid (secondary) format markers and would not be processed by the program:

|b |i |r

Consider the following two lines of input text:

\bgoodbye\r
|bgoodbye|r

Using the default definitions of format markers, the first line is considered to be a single format marker, and provides nothing which the program should try to parse. The second line, however, contains two format markers, |b and |r, and the word goodbye which would be processed by the program.
The primary format markers serve to divide the text into fields. See section 7.7 Fields to Exclude: \excl and section 7.9 Fields to Include: \incl for details on how these fields are used. There is no requirement that the format markers be at the beginning of a line as with the field codes used in KTEXT control files.
The \ambig field defines the character used to mark ambiguities and failures in the analysis output file. For example, to use the hash mark (#), the text input control file would include:

\ambig #

This would cause an ambiguous analysis to be output as follows:

\a #3#< N0 kay >#< V1 ka > IMP#< V1 ka > INF#

It makes sense to use the \ambig field only once in the text input control file. If multiple \ambig fields do occur in the file, the value given in the first one is used. If the text input control file does not have an \ambig field, the percent sign (%) is used.

The first printing character following the \ambig field code is used as the ambiguity marker. The character currently being used to mark comments cannot be assigned to also mark ambiguities in the output file. Thus, the semicolon (;) cannot normally be used as the ambiguity marker. Logically, this field should be in the KTEXT control file rather than the text input control file since it affects output instead of input. Nevertheless, compatibility demands that it stay this way.
The \barchar field defines the character that begins a two-character secondary format marker. For example, if this type of format marker begins with the dollar sign ($), the following would be placed in the text input control file:

\barchar $

An empty \barchar field in the text input control file prevents any bar code format markers from being recognized. Thus, the following field effectively turns off special treatment of this style of format marking (assuming the ; is marking comments):

\barchar ; no bar character

It makes sense to use the \barchar field only once in the text input control file. If multiple \barchar fields do occur in the file, the value given in the first one is used.

The first printing character following the \barchar field code is used as the bar code format marker. The character currently being used to mark comments cannot be assigned to also flag format markers in input text files. That is, \barchar ; is treated as \barchar followed only by a comment, which effectively removes the concept of bar codes since no marker character is defined.
In conjunction with the special format marking character discussed in the previous section, the \barcodes field defines the individual characters used in bar codes. These characters may be separated by spaces or lumped together. Thus, the following two fields are equivalent:

\barcodes abcdefg ; lumped together
\barcodes a b c d e f g ; separated

If more than one \barcodes field is provided in the text input control file, the combination of all characters defined in all such fields is used. No check is made for repeated characters: the previous example would be accepted without complaint despite the redundancy of the second line.

The default value for the bar codes is the set of lowercase alphabetic letters a-z. Therefore, if the text input control file contains neither a \barchar nor a \barcodes field, the following bar codes are considered to be formatting information by KTEXT: |a, |b, |c, ..., |x, |y, and |z.
An orthography change is defined by the \ch field code followed by the actual orthography change. Any number of orthography changes may be defined in the text input control file. The output of each change serves as the input to the following change. That is, each change is applied as many times as necessary to an input word before the next change from the text input control file is applied.
To substitute one string of characters for another, these must be made known to the program in a change. (The technical term for this sort of change is a production, but we will simply call them changes.) In the simplest case, a change is given in three parts: (1) the field code \ch must be given at the extreme left margin to indicate that this line contains a change; (2) the match string is the string for which the program must search; and (3) the substitution string is the replacement for the match string, wherever it is found.
The beginning and end of the match and substitution strings must be marked. The first printing character following \ch (with at least one space or tab between) is used as the delimiter for that line. The match string is taken as whatever lies between the first and second occurrences of the delimiter on the line and the substitution string is whatever lies between the third and fourth occurrences. For example, the following lines indicate the change of hi to bye, where the delimiters are the double quote mark ("), the single quote mark ('), the period (.), and the at sign (@).

\ch "hi" "bye"
\ch 'hi' 'bye'
\ch .hi. .bye.
\ch @hi@ @bye@
Throughout this document, we use the double quote mark as the delimiter unless there is some reason to do otherwise.
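The delimiter convention just described can be sketched in a few lines: take the first printing character after \ch as the delimiter, then pick out what lies between its first and second, and third and fourth, occurrences. The function name `parse_change' is our own illustration, not part of KTEXT.

```python
def parse_change(line):
    """Parse a \\ch line into (match, substitution).

    The first printing character after the field code is the
    delimiter; splitting on it puts the match and substitution
    strings at indices 1 and 3.
    """
    rest = line[len("\\ch"):].lstrip()
    delim = rest[0]
    parts = rest.split(delim)
    # parts[0] is empty; the odd slots lie between delimiter pairs
    return parts[1], parts[3]

assert parse_change('\\ch "hi" "bye"') == ("hi", "bye")
```

Note that this scheme also tolerates connective text between the second and third delimiters, which is exactly what allows the `\ch "thou" to "you"' style shown below.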
Change tables follow these conventions:
\ch "thou" "you"
\ch "thou" to "you"
\ch "thou" > "you"
\ch "thou" --> "you"
\ch "thou" becomes "you"
Comments may be introduced with the semicolon (;), or whatever is indicated as the comment character by means of the -c option when KTEXT is started.
The following lines illustrate the use of comments:
\ch "qeki" "qiki" ; for cases like wawqeki
\ch "thou" "you" ; for modern English
A change can be disabled by placing \no in front of the field code \ch, or by placing the comment character (;) in front of it. For example, only the first of the following three lines would effect a change:

\ch "nb" "mp"
\no \ch "np" "np"
;\ch "mb" "nb"
The changes in the text input control file are applied as an ordered set of changes. The first change is applied to the entire word by searching from left to right for any matching strings and, upon finding any, replacing them with the substitution string. After the first change has been applied to the entire word, then the next change is applied, and so on. Thus, each change applies to the result of all prior changes. When all the changes have been applied, the resulting word is returned. For example, suppose we have the following changes:
\ch "aib" > "ayb"
\ch "yb" > "yp"
Consider the effect these have on the word paiba. The first changes i to y, yielding payba; the second changes b to p, to yield paypa. (This would be better than the single change of aib to ayp if there were sources of yb other than the output of the first rule.)
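The ordered, left-to-right application just described can be sketched as a simple fold over the change list, with each change rewriting the output of the previous one. The function name is our own; string replacement in Python happens to match the left-to-right, replace-all behavior described above.

```python
def apply_changes(word, changes):
    """Apply an ordered list of (match, substitution) changes.

    Each change is applied to the entire word, left to right,
    replacing every occurrence, before the next change is tried.
    """
    for match, sub in changes:
        word = word.replace(match, sub)
    return word

# The example above: paiba -> payba -> paypa
assert apply_changes("paiba", [("aib", "ayb"), ("yb", "yp")]) == "paypa"
```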
The way in which change tables are applied allows certain tricks. For example, suppose that for Quechua, we wish to change hw to f, so that hwista becomes fista and hwis becomes fis. However, we do not wish to change the sequence shw or chw to sf or cf (respectively). This could be done by the following sequence of changes. (Note, @ and $ are not otherwise used in the orthography.)
\ch "shw" > "@"   ; (1)
\ch "chw" > "$"   ; (2)
\ch "hw" > "f"    ; (3)
\ch "@" > "shw"   ; (4)
\ch "$" > "chw"   ; (5)
Lines (1) and (2) protect the sh and ch by changing them to distinguished symbols. This clears the way for the change of hw to f in (3). Then lines (4) and (5) restore @ and $ to shw and chw, respectively. (An alternative, simpler way to do this is discussed in the next section.)
It is possible to impose string environment constraints (SECs) on changes in the orthography change tables. The syntax of SECs is described in detail in section .
For example, suppose we wish to change the mid vowels (e and o) to high vowels (i and u respectively) immediately before and after q. This could be done with the following changes:
\ch "o" "u" / _ q / q _
\ch "e" "i" / _ q / q _
This is not entirely a hypothetical example; some Quechua practical orthographies write the mid vowels e and o. However, in the environment of /q/ these could be considered phonemically high vowels /i/ and /u/. Changing the mid vowels to high upon loading texts has the advantage that--for cases like upun "he drinks" and upoq "the one who drinks"--the root needs to be represented internally only as upu "drink". But note, because of Spanish loans, it is not possible to change all cases of e to i and o to u. The changes must be conditioned.
In reality, the regressive vowel-lowering effect of /q/ can pass over various intervening consonants, including /y/, /w/, /l/, /ll/, /r/, /m/, /n/, and /n~/. For example, /ullq/ becomes ollq, /irq/ becomes erq, and so on. Rather than list each of these cases as a separate constraint, it is convenient to define a class (which we label +resonant) and use this class to simplify the SEC. Note that the string class must be defined (with the \scl field code) before it is used in a constraint.
\scl +resonant y w l ll r m n n~
\ch "o" "u" / q _ / _ ([+resonant]) q
\ch "e" "i" / q _ / _ ([+resonant]) q
This says that the mid vowels become high vowels after /q/ and before /q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/, or /n~/.
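In standard regular-expression terms, the optional ([+resonant]) member corresponds to an optional group. The following Python sketch mimics the effect of these constraints; it is a rough analogue for illustration, not KTEXT's matching engine (note that the digraphs ll and n~ must be tried before l and n):

```python
import re

RES = r"(?:ll|n~|[ywlrmn])"  # the +resonant class; digraphs first

def raise_mid_vowels(word):
    # o -> u and e -> i after q, or before q with an optional resonant between
    word = re.sub(r"(?<=q)o", "u", word)
    word = re.sub(rf"o(?={RES}?q)", "u", word)
    word = re.sub(r"(?<=q)e", "i", word)
    word = re.sub(rf"e(?={RES}?q)", "i", word)
    return word

print(raise_mid_vowels("upoq"))   # -> upuq
print(raise_mid_vowels("ollqo"))  # -> ullqu
```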
Consider the problem posed for Quechua in the previous section, that of
changing hw to f. An alternative is to condition the change
so that it does not apply immediately after a member of the string class Affric, which contains s and c.
\scl Affric c s
\ch "hw" "f" / [Affric] ~_
It is sometimes convenient to make certain changes only at word boundaries, that is, to change a sequence of characters only if they initiate or terminate the word. This conditioning is easily expressed, as shown in the following examples.
\ch "this" "that"         ; anywhere in the word
\ch "this" "that" / # _   ; only if word initial
\ch "this" "that" / _ #   ; only if word final
\ch "this" "that" / # _ # ; only if entire word
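When a change is applied to a single word, the boundary symbol # behaves like the regex anchors ^ and $. A hedged Python sketch of the four cases (an analogy for illustration, not the program's internals):

```python
import re

def change(word, match, subst, initial=False, final=False):
    # '# _' pins the match to the start of the word, '_ #' to the end
    pattern = re.escape(match)
    if initial:
        pattern = "^" + pattern
    if final:
        pattern = pattern + "$"
    return re.sub(pattern, subst, word)

print(change("thistle", "this", "that"))              # anywhere -> thattle
print(change("thistle", "this", "that", final=True))  # word final -> thistle
print(change("this", "this", "that", initial=True, final=True))  # -> that
```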
The purpose of orthography change is to convert text from an external orthography to an internal representation more suitable for morphological analysis. In many cases this is unnecessary, the practical orthography being completely adequate as the internal representation. In other cases, the practical orthography is an inconvenience that can be circumvented by converting to a more phonemic representation.
Let us take a simple example from Latin. In the Latin orthography, the nominative singular masculine of the word "king" is rex. However, phonemically, this is really /reks/; /rek/ is the root meaning king and the /s/ is an inflectional suffix. If the program is to recover such an analysis, then it is necessary to convert the x of the external, practical orthography into ks internally. This can be done by including the following orthography change in the text input control file:
\ch "x" "ks"
In this, x is the match string and ks is the substitution string, as discussed in section . Whenever x is found, ks is substituted for it.
Let us consider next an example from Huallaga Quechua. The practical orthography currently represents long vowels by doubling the vowel. For example, what is written as kaa is /ka:/ "I am", where the length (represented by a colon) is the morpheme meaning "first person subject". Other examples, such as upoo /upu:/ "I drink" and upichee /upi-chi-:/ "I extinguish", motivate us to convert all long vowels into a vowel followed by a colon. The following changes do this:
\ch "aa" "a:"
\ch "ee" "i:"
\ch "ii" "i:"
\ch "oo" "u:"
\ch "uu" "u:"
Note that the long high vowels (i and u) have become mid vowels (e and o respectively); consequently, the vowel in the substitution string is not necessarily the same as that of the match string. What is the utility of these changes? In the lexicon, the morphemes can be represented in their phonemic forms; they do not have to be represented in all their orthographic variants. For example, the first person subject morpheme can be represented simply as a colon (-:), rather than as -a in cases like kaa, as -o in cases like qoo, and as -e in cases like upichee. Further, the verb "drink" can be represented as upu and the causative suffix (in upichee) can be represented as -chi; these are the forms these morphemes have in other (nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and that we wish to work internally with k rather than c (before a, o, and u) and qu (before i and e). (Of course, this is probably not the only change we would want to make.) Consider the following changes:
\ch "ca" "ka"
\ch "co" "ko"
\ch "cu" "ku"
\ch "qu" "k"
The first three handle c and the last handles qu. By virtue of
including the vowel after c, we avoid changing ch to kh.
There are other ways to achieve the same effect. One way exploits the
fact that each change is applied to the output of all previous changes.
Thus, we could first protect ch by changing it to some distinguished character (say @), then change c to k, and then restore @ to ch:
\ch "ch" "@"
\ch "c" "k"
\ch "@" "ch"
\ch "qu" "k"
Another approach conditions the change by the adjacent characters. The changes could be rewritten as
\ch "c" "k" / _a / _o / _u ; only before a, o, or u
\ch "qu" "k"               ; in all cases
The first change says, "change c to k when followed by a, o, or u." (This would, for example, change como to komo, but would not affect chal.) The syntax of such conditions is exactly that used in string environment constraints; see section .
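Such a condition corresponds to a lookahead in regular-expression terms. This sketch reproduces the behavior just described (an analogy for illustration, not the program's internals):

```python
import re

def to_internal(word):
    word = re.sub(r"c(?=[aou])", "k", word)  # c -> k only before a, o, u
    return word.replace("qu", "k")           # qu -> k in all cases

print(to_internal("como"))   # -> komo
print(to_internal("chal"))   # -> chal (unchanged)
print(to_internal("queso"))  # -> keso
```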
Input orthography changes are useful when the text being processed is written in a practical orthography. Rather than requiring that the text be converted as a prerequisite to running the program, the program can convert the orthography as it loads the text, before it processes each word.
The changes loaded from the text input control file are applied after all the text is converted to lowercase (and after the information about uppercase and lowercase, along with information about format marking, punctuation, and white space, has been put to one side). Consequently, the match strings of these orthography changes should be all lowercase; any change that has an uppercase character in its match string will never apply.
We include here the entire orthography input change table for Caquinte (a language of Peru). There are basically four changes that need to be made: (1) nasals, which in the practical orthography reflect their assimilation to the point of articulation of a following noncontinuant, must be changed into an unspecified nasal, represented by N; (2) c and qu are changed to k; (3) j is changed to h; and (4) gu is changed to g before i and e.
\ch "mp" "Np"   ; for unspecified nasals
\ch "nch" "Nch"
\ch "nc" "Nk"
\ch "nqu" "Nk"
\ch "nt" "Nt"
\ch "ch" "@"    ; to protect ch
\ch "c" "k"     ; other c's to k
\ch "@" "ch"    ; to restore ch
\ch "qu" "k"
\ch "j" "h"
\ch "gue" "ge"
\ch "gui" "gi"
This change table can be simplified by the judicious use of string environment constraints:
\ch "m" > "N" / _p
\ch "n" > "N" / _c / _t / _qu
\ch "c" > "k" / _~h
\ch "qu" > "k"
\ch "j" > "h"
\ch "gu" > "g" / _e / _i
As suggested by the preceding examples, the text orthography change
table is composed of all the \ch fields found in the
text input control file. These may appear anywhere in the file relative to
the other fields. It is recommended that all the orthography changes
be placed together in one section of the text input control file, rather than
being mixed in with other fields.
This section presents a grammatical description of the syntax of orthography changes in BNF notation.
1a. <orthochange> ::= <basic_change>
1b.                   <basic_change> <constraints>
2a. <basic_change> ::= <quote><quote> <quote><string><quote>
2b.                    <quote><string><quote> <quote><quote>
2c.                    <quote><string><quote> <quote><string><quote>
3.  <quote> ::= any printing character not used in either the ``from'' string or the ``to'' string
4.  <string> ::= one or more characters other than the quote character used by this orthography change
5a. <constraints> ::= <change_envir>
5b.                   <change_envir> <constraints>
6a. <change_envir> ::= <marker> <leftside> <envbar> <rightside>
6b.                    <marker> <leftside> <envbar>
6c.                    <marker> <envbar> <rightside>
7a. <leftside> ::= <side>
7b.                <boundary>
7c.                <boundary> <side>
8a. <rightside> ::= <side>
8b.                 <boundary>
8c.                 <side> <boundary>
9a. <side> ::= <item>
9b.            <item> <side>
9c.            <item> ... <side>
10a. <item> ::= <piece>
10b.             ( <piece> )
11a. <piece> ::= ~ <piece>
11b.             <literal>
11c.             [ <literal> ]
12. <marker> ::= / +/
13. <envbar> ::= _ ~_
14. <boundary> ::= # ~#
15. <literal> ::= one or more contiguous characters
The same <quote> character must be used at both the beginning and the end of both the "from" string and the "to" string.
The double quote (") and single quote (') characters are most often used.
The ellipsis (...) indicates a possible break in contiguity.
The tilde (~) reverses the desirability of an element, causing the constraint to fail if the element is found rather than if it is not found.
A literal enclosed in square brackets is a string class name; the class must have been defined by an \scl field in the analysis data file, or earlier in the dictionary orthography change file.
The marker +/ is usually used for morpheme environment constraints, but may be used for change environment constraints in \ch fields in the dictionary orthography change table file.
A negated environment bar (~_) inverts the sense of the constraint as a whole.
A negated boundary marker (~#) indicates that the position must not be a word boundary.
Characters that have special meaning in a constraint may be written literally by preceding them with a backslash:

\+ \/ \# \~ \[ \] \( \) \. \_ \\
The \dsc field defines the character used to separate the morphemes in the decomposition field of the output analysis file. For example, to use the equal sign (=), the text input control file would include:
\dsc =
This would cause a decomposition field to be output as follows:
\d %3%kay%ka=y%ka=y%
It makes sense to use the \dsc field only once in the text input control file. If multiple \dsc fields do occur in the file, the value given in the first one is used. If the text input control file does not have a \dsc field, a dash (-) is used.
The first printing character following the \dsc field code is used as the morpheme decomposition separator character. The same character cannot be used both for separating decomposed morphemes in the analysis output file and for marking comments in the input control files. Thus, one normally cannot use the semicolon (;) as the decomposition separation character.
Logically, this field should be in the KTEXT control file rather than the text input control file, since it affects output rather than input. Nevertheless, compatibility demands that it stay this way.
The \excl field excludes one or more fields from processing. For example, to have the program ignore everything in \co and \id fields, the following line is included in the text input control file:
\excl \co \id ; ignore these fields
If more than one \excl field is found in the text input control file, the contents of each field are added to the overall list of text fields to exclude. This list is initially empty, and stays empty unless the text input control file contains an \excl field. Thus, no text fields are excluded from processing by default.
If the text input control file contains \excl fields, then only those text fields are not processed. Every word in every text field not mentioned explicitly in an \excl field will be processed.
Note that every text field in the input text files is processed unless the text input control file contains either an \excl or an \incl field. One or the other is used to limit processing, but never both.
The \format field designates a single character to flag the beginning of a primary format marker. For example, if the format markers in the text files begin with the at sign (@), the following would be placed in the text input control file:
\format @
This would be used, for example, if the text contained format markers like the following:
@ @p @sp @make(Article) @very-long.and;muddled/format*marker,to#be$sure
If a \format field occurs in the text input control file without a following character to serve for flagging format markers, then the program will not recognize any format markers and will try to parse everything other than punctuation characters.
It makes sense to use the \format field only once in the text input control file. If multiple \format fields do occur in the file, the value given in the first one is used.
The first printing character following the \format field code is used to flag format markers. The character currently used to mark comments cannot be assigned to also flag format markers. Thus, the semicolon (;) cannot normally be used to flag format markers.
The \incl field explicitly includes one or more text fields for processing, excluding all other fields. For instance, to process everything in \txt and \qt fields, but ignore everything else, the following line is placed in the text input control file:
\incl \txt \qt ; process these fields
If more than one \incl field is found in the text input control file, the contents of each field are added to the overall list of text fields to process. This list is initially empty, and stays empty unless the text input control file contains an \incl field.
If the text input control file contains \incl fields, then only those text fields are processed. Every word in every text field not mentioned explicitly in an \incl field will not be processed.
Note that every text field in the input text files is processed unless the text input control file contains either an \excl or an \incl field. One or the other is used to limit processing, but never both.
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any other characters that have lowercase and uppercase forms, these must be given in a \luwfc field in the text input control file.
The \luwfc field defines pairs of characters; the first member of each pair is a lowercase character and the second is the corresponding uppercase character. Several such pairs may be placed in one field, or they may be placed in separate fields. Whitespace may be interspersed freely. For example, the following three examples are equivalent:
\luwfc éÉ ñÑ
or
\luwfc éÉ ; e with acute accent \luwfc ñÑ ; enyee
or
\luwfc é É ñ Ñ
Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
The \luwfc field can be entered anywhere in the text input control file, although a natural place would be before the \wfc (word formation character) field.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfc field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfc z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfc field is allowed to map multiple lowercase characters to the same uppercase character, and vice versa. This is needed for languages that do not mark tone on uppercase letters.
The \luwfcs field extends the character pair definitions of the \luwfc field to multibyte character sequences. Like the \luwfc field, the \luwfcs field defines pairs of characters; the first member of each pair is a multibyte lowercase character and the second is the corresponding multibyte uppercase character. Several such pairs may be placed in one field, or they may be placed in separate fields. Whitespace separates the members of each pair, and the pairs from each other. For example, the following three examples are equivalent:
\luwfcs e' E` n~ N^ ç C&
or
\luwfcs e' E` ; e with acute accent \luwfcs n~ N^ ; enyee \luwfcs ç C& ; c cedilla
or
\luwfcs e' E` n~ N^ ç C&
Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
Also note that there is no requirement that the lowercase form be the same length (number of bytes) as the uppercase form. The examples shown above are only one or two bytes (character codes) in length, but there is no limit placed on the length of a multibyte character.
The \luwfcs field can be entered anywhere in the text input control file. \luwfcs fields may be mixed with \luwfc fields in the same file.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfcs field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfcs z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfcs field is allowed to map multiple multibyte lowercase characters to the same multibyte uppercase character, and vice versa. This may be useful in some situations, but it introduces an element of ambiguity into the decapitalization and recapitalization processes. If ambiguous capitalization is supported, then for the previous example, z will have both X and Z as uppercase equivalents, and X will have both x and z as lowercase equivalents.
The \maxdecap field sets the maximum number of different decapitalizations allowed. Since the \luwfc field can map several lowercase characters onto a single uppercase character, a word with uppercase characters can (logically) generate a number of alternatives when decapitalized. This is especially true of words that are entirely capitalized to begin with. The default limit is 100.
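The combinatorics behind this limit can be sketched as follows. The mapping of X to both x and z echoes the hypothetical \luwfc example above; this is an illustration of why a cap is needed, not KTEXT's algorithm:

```python
from itertools import islice, product

# Hypothetical inverse of the \luwfc mappings: uppercase -> possible lowercases
LOWER = {"X": ["x", "z"]}

def decapitalizations(word, maxdecap=100):
    # each uppercase letter contributes one choice point
    choices = [LOWER.get(ch, [ch.lower()]) for ch in word]
    return ["".join(combo) for combo in islice(product(*choices), maxdecap)]

print(decapitalizations("MAX"))  # -> ['max', 'maz']
```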
The usual behavior is to normalize input words to lowercase. The program remembers the case of the word as one of four possibilities: no capitalization, capitalization of the first letter only, full capitalization, or mixed capitalization.
However, not all orthographies use the concept of capitalization. To help deal with these, the field code \nocap disables all case normalization if it appears anywhere in the text input control file.
The handling of mixed uppercase and lowercase is limited in utility, and sometimes causes more problems than it solves. For this reason, the \noincap field code turns off mixed case decapitalization. The program will still decapitalize words that are entirely capitalized and words that begin with a capital letter.
A string class is defined by the \scl field code followed by the class name, which is followed in turn by one or more contiguous character strings or (previously defined) string class names. A string class name used as part of the class definition must be enclosed in square brackets.
The class name must be a single, contiguous sequence of printing characters. Characters and words which have special meanings in tests should not be used. The actual character strings have no such restrictions. The individual members of the class are separated by spaces, tabs, or newlines.
Each \scl field defines a single string class. Any number of \scl fields may appear in the file. The only restriction is that a string class must be defined before it is used.
String classes must be defined before being used. For example, the first two lines of the simpler Caquinte example above could be given as follows:
\scl -bilabial c t qu
\ch "m" > "N" / _ p
\ch "n" > "N" / _ [-bilabial]
The string class definition could be in another control file: string classes defined elsewhere can be used in the text input control file as well.
If no \scl fields appear in the text input control file, then KTEXT does not allow any string classes in text input orthography change environment constraints unless they are defined in the KTEXT control file.
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any characters that do not have distinct lowercase and uppercase forms, these must be given in a \wfc field in the text input control file.
For example, English uses an apostrophe character (') that could be considered a word formation character. This information is provided by the following example:
\wfc ' ; needed for words like don't
Notice that the characters in the \wfc field may be separated by spaces, although this is not required. If more than one \wfc field occurs in the text input control file, the program uses the combination of all characters defined in all such fields as word formation characters.
The comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
The \wfcs field allows multibyte characters to be defined as "caseless" word formation characters. It has the same relationship to \wfc that \luwfcs has to \luwfc. The multibyte word formation characters are separated from each other by whitespace.
The following is the complete text input control file for Huallaga Quechua (a language of Peru):
\id HGTEXT.CTL - for Huallaga Quechua, 25-May-88
\co WORD FORMATION CHARACTERS
\wfc ' ~
\co FIELDS TO EXCLUDE
\excl \id ; identification fields
\co ORTHOGRAPHY CHANGES
\ch "aa" > "a:" ; for long vowels
\ch "ee" > "i:"
\ch "ii" > "i:"
\ch "oo" > "u:"
\ch "uu" > "u:"
\ch "qeki" > "qiki" ; for cases like wawqeki
\ch "~n" > "n~" ; for typos
; for Spanish loans like hwista
\scl sib s c ; sibilants
\ch "hw" > "f" / ~[sib]_
The text output module restores a processed document from the internal format to its textual form. It re-imposes capitalization on words and restores punctuation, format markers, white space, and line breaks. Also, orthography changes can be made, and the delimiter that marks ambiguities and failures can be changed. This chapter describes the control file given to the text output module.(7)
The text output module flags words that produced either no results or multiple results when processed. These are flagged with percent signs (%) by default, but this can be changed by declaring the desired character with the \ambig field code. For example, the following would change the ambiguity delimiter to @:
\ambig @
The text output module allows orthographic changes to be made to the processed words. These are given in the text output control file. (They have exactly the same form as the input orthographic changes; see section .) The output orthographic changes allow conversion from the internal representation used by the program to the practical orthography of the target language. These changes are applied to the words after they have been processed, but before the text is re-assembled (from the internal format) for output.
\ch "N" "m" / _ p ; assimilates before p
\ch "N" "n"       ; otherwise becomes n
The first change makes N into m when it directly precedes p; the second makes all other N's into n.
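The ordering of the two rules matters: the specific (conditioned) change must come before the general one. A sketch of the effect, using aNpa and aNta as made-up internal forms (an illustration, not KTEXT code):

```python
import re

def realize_nasal(word):
    word = re.sub(r"N(?=p)", "m", word)  # N -> m when directly before p
    return word.replace("N", "n")        # any remaining N -> n

print(realize_nasal("aNpa"))  # -> ampa
print(realize_nasal("aNta"))  # -> anta
```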
The \dsc field defines the character used to separate the morphemes in the decomposition field of the input analysis file. For example, to use the equal sign (=), the text output control file would include:
\dsc =
This would handle a decomposition field like the following:
\d %3%kay%ka=y%ka=y%
It makes sense to use the \dsc field only once in the text output control file. If multiple \dsc fields do occur in the file, the value given in the first one is used. If the text output control file does not have a \dsc field, a dash (-) is used.
The first printing character following the \dsc field code is used as the morpheme decomposition separator character. The same character cannot be used both for separating decomposed morphemes in the analysis output file and for marking comments in the output control files. Thus, one normally cannot use the semicolon (;) as the decomposition separation character.
This field is provided for use by the INTERGEN program. It is of little use to KTEXT.
The \format field designates a single character to flag the beginning of a primary format marker. For example, if the format markers in the text files begin with the at sign (@), the following would be placed in the text output control file:
\format @
This would be used, for example, if the text contained format markers like the following:
@ @p @sp @make(Article) @very-long.and;muddled/format*marker,to#be$sure
If a \format field occurs in the text output control file without a following character to serve for flagging format markers, then the program will not recognize any format markers and will try to parse everything other than punctuation characters.
It makes sense to use the \format field only once in the text output control file. If multiple \format fields do occur in the file, the value given in the first one is used.
The first printing character following the \format field code is used to flag format markers. The character currently used to mark comments cannot be assigned to also flag format markers. Thus, the semicolon (;) cannot normally be used to flag format markers.
This field is provided for use by the INTERGEN program. It is of little use to KTEXT.
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any other characters that have lowercase and uppercase forms, these must be given in a \luwfc field in the text output control file.
The \luwfc field defines pairs of characters; the first member of each pair is a lowercase character and the second is the corresponding uppercase character. Several such pairs may be placed in one field, or they may be placed in separate fields. Whitespace may be interspersed freely. For example, the following three examples are equivalent:
\luwfc éÉ ñÑ
or
\luwfc éÉ ; e with acute accent \luwfc ñÑ ; enyee
or
\luwfc é É ñ Ñ
Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
The \luwfc field can be entered anywhere in the text output control file, although a natural place would be before the \wfc (word formation character) field.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfc field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfc z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfc field is allowed to map multiple lowercase characters to the same uppercase character, and vice versa. This is needed for languages that do not mark tone on uppercase letters.
The \luwfcs field extends the character pair definitions of the \luwfc field to multibyte character sequences. Like the \luwfc field, the \luwfcs field defines pairs of characters; the first member of each pair is a multibyte lowercase character and the second is the corresponding multibyte uppercase character. Several such pairs may be placed in one field, or they may be placed in separate fields. Whitespace separates the members of each pair, and the pairs from each other. For example, the following three examples are equivalent:
\luwfcs e' E` n~ N^ ç C&
or
\luwfcs e' E` ; e with acute accent \luwfcs n~ N^ ; enyee \luwfcs ç C& ; c cedilla
or
\luwfcs e' E` n~ N^ ç C&
Note that comments can be used as well (just as they can in any KTEXT control file). This means that the comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
Also note that there is no requirement that the lowercase form be the same length (number of bytes) as the uppercase form. The examples shown above are only one or two bytes (character codes) in length, but there is no limit placed on the length of a multibyte character.
The \luwfcs field can be entered anywhere in the text output control file. \luwfcs fields may be mixed with \luwfc fields in the same file.
Any standard alphabetic character (that is, a through z or A through Z) in the \luwfcs field will override the standard lowercase-uppercase pairing. For example, the following will treat X as the uppercase equivalent of z:
\luwfcs z X
Note that Z will still have z as its lowercase equivalent in this case.
The \luwfcs field is allowed to map multiple multibyte lowercase characters to the same multibyte uppercase character, and vice versa. This may be useful in some situations, but it introduces an element of ambiguity into the decapitalization and recapitalization processes. If ambiguous capitalization is supported, then for the previous example, z will have both X and Z as uppercase equivalents, and X will have both x and z as lowercase equivalents.
It is possible to define string classes, as discussed in section 7.15, `String class: \scl'. For example, the sample text output control file given below contains the following lines:
a. \scl X t s c
b. \ch "h" "j" / [X] ~_
Line a defines a string class including t, s, and c; change rule b makes use of this class to block the change of h to j when it occurs in the digraphs th, sh, and ch.
Changes in the text output control file may also make use of string classes defined in the KTEXT control file.
To break a text into words, the program needs to know which characters are used to form words. It always assumes that the letters A through Z and a through z are used as word formation characters. If the orthography of the language the user is working in uses any characters that do not have distinct lowercase and uppercase forms, these must be given in a \wfc field in the text output control file.
For example, English uses an apostrophe character (') that could be considered a word formation character. This information is provided by the following example:
\wfc ' ; needed for words like don't
Notice that the characters in the \wfc field may be separated by spaces, although this is not required. If more than one \wfc field occurs in the text output control file, the program uses the combination of all characters defined in all such fields as word formation characters.
The comment character cannot be designated as a word formation character. If the orthography includes the semicolon (;), then a different comment character must be defined with the `-c' command line option when KTEXT is initiated; see section 3, `Running KTEXT', above.
The \wfcs field allows multibyte characters to be defined as "caseless" word formation characters. It has the same relationship to \wfc that \luwfcs has to \luwfc. The multibyte word formation characters are separated from each other by whitespace.
A complete text output control file used for adapting to Asheninca Campa is given below.
\id AEouttx.ctl for Asheninca Campa
\ch "N" "m" / _ p ; assimilates before p
\ch "N" "n" ; otherwise becomes n
\ch "ny" "n~"
\ch "ts" "th" / ~_ i ; (N)tsi is unchanged
\ch "tsy" "ch"
\ch "sy" "sh"
\ch "t" "tz" / n _ i
\ch "k" "qu" / _ i / _ e
\ch "k" "q" / _ y
\ch "k" "c"
\scl X t s c ; define class of t s c
\ch "h" "j" / [X] ~_ ; change except in th, sh, ch
\ch "#" " " ; remove fixed space
\ch "@" "" ; remove blocking character
Analysis files are record oriented standard format files.
This means that the files are divided into records, each representing a
single word in the original input text file, and records are divided
into fields. An analysis file contains at least one record, and may
contain a large number of records. Each record contains one or more
fields. Each field occupies at least one line, and is marked by a
field code at the beginning of the line. A field code begins
with a backslash character (\) and contains one or more letters
in addition.
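The record-and-field layout described above can be read with a short script. The sketch below is mine, not part of KTEXT: it treats each \a field as opening a new record, as the analysis file format specifies, and appends tab-indented continuation lines to the previous field.

```python
# Sketch of a reader for KTEXT's record-oriented standard format
# analysis files. Each record begins with the \a field; lines that
# start with a tab continue the previous field's value.

def read_records(text):
    records = []
    record = None
    last_field = None
    for line in text.splitlines():
        if line.startswith('\t') and record is not None and last_field:
            # continuation line of the previous field (e.g. in \f)
            record[last_field][-1] += '\n' + line[1:]
        elif line.startswith('\\'):
            code, _, value = line[1:].partition(' ')
            if code == 'a':          # \a opens a new record
                record = {}
                records.append(record)
            if record is not None:
                # a field code may repeat, so values are kept in a list
                record.setdefault(code, []).append(value)
                last_field = code
    return records

sample = '''\\a be`gin +ING
\\d be`gin-+ING
\\cat V
\\fd ing
\\w beginning'''
recs = read_records(sample)
```

Keeping each field's values in a list is a deliberate hedge: the format permits repeated field codes within a record.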
This section describes the possible fields in an analysis file. The
only field that is guaranteed to exist is the analysis (\a)
field. All other fields are either data dependent or optional.
The analysis field (\a) starts each record of an analysis file.
It has the following form:
\a PFX IFX PFX < CAT root CAT root > SFX IFX SFX
where PFX is a prefix morphname, IFX is an infix
morphname, SFX is a suffix morphname, CAT is a root
category, and root is a root gloss or etymology. In the
simplest case, an analysis field would look like this:
\a < CAT root >
where CAT is a root category and root is a root gloss or
etymology.
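The bracketed shape of the \a field can be pulled apart with ordinary string handling. A minimal sketch (the function name `parse_analysis` is mine, and infixes are not distinguished from the prefix and suffix morphnames they appear among):

```python
# Split a \a field value into prefix morphnames, the root portion
# between < and >, and suffix morphnames. Inside the brackets the
# items alternate: root category, then root gloss.

def parse_analysis(a_value):
    before, _, rest = a_value.partition('<')
    inside, _, after = rest.partition('>')
    return before.split(), inside.split(), after.split()

prefixes, root, suffixes = parse_analysis('< A0 imaika > CNJT AUG')
```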
The morpheme decomposition field (\d) follows the analysis
field. It has the following form:
\d anti-dis-establish-ment-arian-ism-s
where the hyphens separate the individual morphemes in the surface form of the word.
The \dsc field in the text input control file can replace the
hyphen with another character for separating the morphemes; see
section 7.6 Decomposition Separation Character: \dsc.
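Recovering the individual morphemes from a \d value is then a single split on the separator character, hyphen by default or whatever \dsc designated. A sketch:

```python
# Split a \d field value into its morphemes. The separator defaults
# to the hyphen, but may be whatever character \dsc designated.

def morphemes(d_value, sep='-'):
    return d_value.split(sep)

parts = morphemes('anti-dis-establish-ment-arian-ism-s')
```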
The category field (\cat) provides rudimentary category
information. This may be useful for sentence level parsing. It has
the following form:
\cat CAT
where CAT is the word category.
To have KTEXT output the final category, include the \cat
field in the KTEXT control file. This field specifies the
feature path in the word level feature structure that contains the
grammatical category (part of speech). Note that this requires a word
grammar to be loaded.
If there are multiple analyses, there will be multiple categories in the output, separated by ambiguity markers.
The feature descriptor field (\fd) contains the feature names
associated with each morpheme in the analysis. It has the following
form:
\fd ==feat1 feat2=feat3=
where feat1, feat2, and feat3 are feature
descriptors. The equal signs (=) serve to separate the feature
descriptors of the individual morphemes. Note that morphemes may have
more than one feature descriptor, with the names separated by spaces,
or no feature descriptors at all.
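Under that convention, the per-morpheme descriptor lists can be recovered by splitting first on the equal sign and then on spaces. A sketch (the function name is mine):

```python
# Split a \fd field value into one descriptor list per morpheme.
# Morphemes are separated by '='; within a morpheme, descriptor
# names are separated by spaces. An empty list means the morpheme
# carried no feature descriptors.

def feature_descriptors(fd_value):
    return [part.split() for part in fd_value.split('=')]

feats = feature_descriptors('==feat1 feat2=feat3=')
```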
The feature descriptor field requires a word grammar and one or more
\feat fields in the KTEXT control file.
If there are multiple analyses, there will be multiple feature sets in the output, separated by ambiguity markers.
The original word field (\w) contains the original input word as
it appeared before decapitalization and orthography changes. It looks
like this:
\w The
Note that this is a gratuitous change from earlier versions of AMPLE and KTEXT, which wrote the decapitalized form.
The format information field (\f) records any formatting codes
or punctuation that appeared in the input text file before the word.
It looks like this:
\f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n \\c 5\n\n \\s
where backslashes (\) in the input text are doubled, newlines
are represented by \n, and additional lines in the field start
with a tab character.
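Given those two escape conventions (doubled backslashes and a literal \n for newlines), the stored text can be decoded mechanically. A sketch, with the function name my own:

```python
# Decode a \f field value: '\\' stands for a single backslash and
# '\n' for a newline; every other character is literal.

def decode_format(value):
    out, i = [], 0
    while i < len(value):
        if value.startswith('\\\\', i):      # doubled backslash
            out.append('\\')
            i += 2
        elif value.startswith('\\n', i):     # encoded newline
            out.append('\n')
            i += 2
        else:
            out.append(value[i])
            i += 1
    return ''.join(out)

decoded = decode_format('\\\\c 5\\n\\n')
```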
The format information field is written to the output analysis file whenever it is needed, that is, whenever formatting codes or punctuation exist before words.
The capitalization field (\c) records any capitalization of the
input word. It looks like this:
\c 1
where the number following the field code has one of these values:
1        the first letter of the word is capitalized
2        all of the letters in the word are capitalized
4-32767  the capitalization of the word is mixed, with the value encoding which individual letters are capitalized
Note that the third form is of limited utility, but it still exists
because of words with internal capitals, such as the author's last name.
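When regenerating text, the two common \c values can be re-applied mechanically to the decapitalized word. The sketch below covers only those two values; mixed-case encodings (4-32767) are passed through untouched, since their exact bit layout is not restated here:

```python
# Re-apply the capitalization recorded in a \c field to a
# decapitalized word. Only the two common values are handled;
# mixed-case values (4-32767) are left unchanged in this sketch.

def recapitalize(word, c_value):
    if c_value == 1:        # first letter capitalized
        return word[:1].upper() + word[1:]
    if c_value == 2:        # all letters capitalized
        return word.upper()
    return word             # mixed case: not handled here
```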
The capitalization field is written to the output analysis file whenever any of the letters in the word are capitalized.
The nonalphabetic field (\n) records any trailing punctuation,
bar codes, or whitespace characters. It looks like this:
\n |r.\n
where newlines are represented by \n. The nonalphabetic field
ends with the last whitespace character immediately following the word.
The nonalphabetic field is written to the output analysis file whenever the word is followed by anything other than a single space character. This includes the case when a word ends a file with nothing following it.
The previous section assumed that only one analysis is produced for each word. This is not always possible since words in isolation are frequently ambiguous. Multiple analyses are handled by writing each analysis field in parallel, with the number of analyses at the beginning of each output field. For example,
\a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
\d %2%imaika-Npa-ni%imaika-Npani%
\cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
\p %2%==%=%
\fd %2%==%=%
\u %2%imaika-Npa-ni%imaika-Npani%
\w Imaicampani
\f \\v124
\c 1
\n \n
where the percent sign (%) separates the different analyses in
each field. Note that only those fields which contain analysis
information are marked for ambiguity. The other fields (\w,
\f, \c, and \n) are the same regardless of the
number of analyses.
The \ambig field in the text input control file can replace the
percent sign with another character for separating the analyses; see
section 7.2 Ambiguity Marker Character: \ambig.
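Fields written in this parallel style can be unpacked by splitting on the ambiguity marker. A sketch assuming the default percent sign; a count of zero, which KTEXT uses to mark analysis failures, yields an empty result list:

```python
# Unpack an ambiguity-marked field value such as '%2%first%second%'.
# The number after the first marker is the analysis count; a count
# of 0 (an analysis failure) yields an empty list. Fields without a
# leading marker hold a single unambiguous analysis.

def unpack_ambiguous(value, marker='%'):
    if not value.startswith(marker):
        return [value]
    parts = value.split(marker)
    count = int(parts[1])
    return parts[2:2 + count]

alts = unpack_ambiguous('%2%`picture +3SG%`picture +PL%')
```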
The previous sections assumed that words are successfully analyzed.
This does not always happen. Analysis failures are marked the same way
as multiple analyses, but with zero (0) for the ambiguity count.
For example,
\a %0%ta%
\d %0%ta%
\cat %0%%
\p %0%%
\fd %0%%
\u %0%%
\w TA
\f \\v 12 |b
\c 2
\n |r\n
Note that only the \a and \d fields contain any
information, and those both have the original word as a place
holder. The other analysis fields (\cat, \p, \fd,
and \u) are marked for failure, but otherwise left empty.
The \ambig field in the text input control file can replace the
percent sign with another character for marking analysis failures and
ambiguities; see section 7.2 Ambiguity Marker Character: \ambig.
KTEXT tries to recreate the format of the original input to analysis in its synthesis output. The main feature worth noting is that synthesis ambiguities and failures are marked similarly to analysis ambiguities and failures in KTEXT analysis output.
The particular choice of field markers and the order of fields in a record is due to the fact that KTEXT uses the same text-handling routines as an existing program called AMPLE (Weber et al., 1988). This has the advantage that KTEXT's output is compatible with that program, but the disadvantage that the record structure is perhaps not consistent with terminology already established for PC-KIMMO. It should also be noted that the quasi-database design of KTEXT's output is used by many other programs developed by SIL International.
IT (pronounced "eye-tee") is an interlinear text editor that maintains the vertical alignment of the interlinear lines of text and uses a lexicon to semi-automatically gloss the text. See Simons and Versaw (1991) and Simons and Thomson (1988).
IT was developed by the Academic Computing Department of the Summer Institute of Linguistics. It runs under MS-DOS, UNIX, and the Apple Macintosh.
TeX is a typesetting language developed by Donald Knuth (see Knuth, 1986).
The plain text version of this documentation does not include figure 4, since it is an image of typeset output.
This chapter is adapted from chapters 7, 8, and 9 of Weber (1988).
This chapter is adapted from chapter 8 of Weber (1990).
This document was generated on 20 March 2003 using texi2html 1.56k.