                        INTERGEN Reference Manual
                    producing simple interlinear text

                             version 2.1b3
                              May 11, 1998

                           by H. Andrew Black

Copyright (C) 2000 SIL International

Published by:
     Language Software Development
     SIL International
     7500 W. Camp Wisdom Road
     Dallas, TX 75236
     U.S.A.

Permission is granted to make and distribute verbatim copies of this
file provided the copyright notice and this permission notice are
preserved in all copies.

The author may be reached at the address above or via email as
`steve@acadcomp.sil.org'.

Introduction to the INTERGEN program
************************************

AMPLE outputs its result as a database in which there is one record per
word, with capitalization, white space, format marking, and punctuation
represented in other fields.  INTERGEN converts text from this format
into something much more like an interlinear text.  It restores the
capitalization, format markers, and white space of the text.  By
default it produces one line for the decomposition and another line
for the analysis.  In the analysis line, the root categories are
discarded and hyphens are inserted between morphnames.

INTERGEN v. 1.0 was designed to be a preprocessor for the INTER.PTP
program (which was used with the ManuScripter program).  Since
INTER.PTP expected no more than one space between words, the
individual words in the decomposition and analysis lines were not
aligned with each other.  Beginning with version 1.0d, an `-a' command
line option caused the words to be aligned.  Version 1.2g allowed the
underlying form field to also be shown whenever it was present in the
database.  Version 2.0 adds an option to selectively show any or all
of the available content fields from the database.  They may be shown
in any order and may be repeated.

Running INTERGEN
****************

Command line options
--------------------

The INTERGEN program uses an old-fashioned command line interface
following the convention of options starting with a dash character
(`-').
The available options are listed below.  Those options which require
an argument have the argument type following the option letter.

`-a'
     Each word is aligned with its associated gloss in the following
     line.  There is also a blank line after each gloss line.
     Failures are given a gloss of `%0%'.  Ambiguous analyses are
     shown one after the other with a percent sign `%' separating each
     one.

`-n'
     To be used in conjunction with the `-a' option.  For ambiguous
     analyses, rather than showing every analysis, it shows the number
     of ambiguities and one analysis.

`-e'
     To be used in conjunction with the `-a' option.  For ambiguous
     analyses, rather than showing just one decomposition ambiguity,
     it shows all of them.  The `-e' option is only effective if the
     `-n' option is not present.

`-f'
     To be used with the `-a' option.  Each line of output is
     prepended with a standard format field code.  The default field
     codes are:

          `\\wrd' for the `\w' (original word) field
          `\\dec' for the `\d' (morpheme decomposition) field
          `\\ana' for the `\a' (analysis) field
          `\\und' for the `\u' (underlying form) field
          `\\prp' for the `\p' (properties) field
          `\\cat' for the `\c' (category) field
          `\\fea' for the `\fd' (feature descriptor) field

     The `-g' option may be used to change the defaults.

`-g code'
     To be used in conjunction with the `-f' option.  This allows the
     user to select the standard format field code to be prepended for
     each output line.  `code' consists of the single character code
     of the field as listed under the `-s' option, followed
     immediately by the standard format field (without the backslash)
     to be used for that field.  E.g. `-g agls' will prepend `\\gls'
     to the analysis line.  More than one field may be designated by
     separating each instance with a forward slash `/' character.
     E.g.
     `-g ww/dd/ag' will prepend the following standard format field
     codes to the indicated lines:

          `\\w' for the original word line
          `\\d' for the morpheme decomposition line
          `\\g' for the analysis line

     The other lines (if output) will use the default standard format
     field codes.  See `-f'.  Instead of using the forward slash
     convention, one may also use multiple instances of the `-g'
     switch.  Thus, `-g ww/dd/ag' is equivalent to `-g ww -g dd -g ag'.

`-m'
     Monitors the progress of the interlinear conversion by displaying
     a code on stderr (usually the screen) for each word: `*' means an
     analysis failure, `.' means a single analysis, `2'-`9' means 2-9
     ambiguities, and `>' means 10 or more ambiguities.  This is not
     compatible with the `-q' option.

`-c character'
     Selects the control file comment character.  The default is the
     vertical bar (`|').

`-d character'
     Selects the decomposition separation character (the one used to
     separate morphnames in the analysis).  The default is a hyphen
     (`-').

`-i filename'
     Selects a single input AMPLE analysis database file.

`-o filename'
     Selects a single output interlinear text file.

`-q'
     Tells INTERGEN to operate _quietly_ with minimal screen output.
     This is not compatible with the `-m' option.

`-s codes'
     Tells INTERGEN which fields to output and in what order.  The
     fields are indicated by the characters in `codes'.  The possible
     characters within `codes' along with their meanings are:

          `w' means show the `\w' (original word) field
          `d' means show the `\d' (morpheme decomposition) field
          `a' means show the `\a' (analysis) field
          `u' means show the `\u' (underlying form) field
          `p' means show the `\p' (properties) field
          `c' means show the `\c' (category) field
          `f' means show the `\fd' (feature descriptor) field

     The characters may be repeated more than once.  The default is
     `-sdua'.  If more than one instance of `-s' appears, only the
     last one takes effect.

`-t filename'
     Selects a single text output control file (see `Text Output
     Control File' below).
`-w number'
     Sets the maximum line width to `number'.  No output line will be
     longer than `number' characters.  The default is 77.

`-/'
     Increments the debugging level.  The default is zero (no
     debugging output).

The following options exist only in beta-test versions of the program,
since they are used only for debugging.

`-z filename'
     Opens a file for recording a memory allocation log.

`-Z address,count'
     Traps the program at the point where `address' is allocated or
     freed for the `count''th time.

Examples
--------

If the `-i' command line option is not used, INTERGEN prompts for a
number of file names, reading the standard input for the desired
values.  The interactive dialog goes like this:

     C> intergen
     INTERGEN: Generate interlinear text from AMPLE output
     Version 2.1b3 (May 11, 1998), Copyright 1998 SIL, Inc.
     Beta test version compiled May 11 1998 16:21:42
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]

     Conversion Performed Mon May 11 16:36:53 1998

     Text Control File (xxOUTX.CTL) [none]:
     Input File in Database Format (xxxxxx.CED):
     Output file: [aetest.it]
     47 words read from aetest.aaa
     Next Input File: [no more]:

     C>

Note that each prompt contains a reminder of the expected form of the
answer in parentheses and ends with a colon.  Several of the prompts
also contain the default answer in brackets.

Using the command options for input and output filenames changes the
appearance of the program screen output only in that no file names are
requested.  For example,

     C> intergen -i aetest.aaa -o aetest.it
     INTERGEN: Generate interlinear text from AMPLE output
     Version 2.1b3 (May 11, 1998), Copyright 1998 SIL, Inc.
     Beta test version compiled May 11 1998 16:21:42
     For 386 CPU (or better) under MS-DOS [compiled with DJGPP 2.1/GNU C 2.7]

     Conversion Performed Mon May 11 16:38:22 1998

     47 words read from aetest.aaa
     C>

The format of the output file depends on the command line options
chosen.
The default might look like:

     \p
     A-kem-ako-veNt-i-ri pairani apaani maini
     1I-*kem-DAT-BEN-NF-3MO%1I-*kem-DAT-BEN-NF-REL *pairani aparoni *maini
     h-ay-i-ro kooya. Te-ma osyeki
     3MNF-*ag-NF-3FO *kooya *te-CNJT *osyeki
     hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri
     3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-REL%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-NFR-CAUTION
     ovaa-Ntsi-poosyi-ki, i-tsoNk-at-ii-ro i-tso-t-i-ro tivana.
     *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana
     O-pony-aasyi-t-ak-a
     3F-*pony-PURP-&-PERF-NFR%3F-*pony-PURP-&-NFR-INDEF%3F-*pony-&-hoja-&-PERF-NFR%3F-*pony-&-hoja-&-NFR-INDEF
     ironyaaka i-kaNt-a-ve-t-ak-a-ro 0-iri kooya-ka:
     *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO 3FPOS-*iri *kooya-PROX
     \p --
     N-isyiNtyo-', atake pi-pipiya-t-ak-a
     1POS-*isyiNtyo-VOC *atake 2-pipiya-&-PERF-NFR%2-pipiya-&-NFR-INDEF
     ovaa-Ntsi-poosyi-ki, h-ay-i-mi-kari maini.
     *ovaa-ABS-BAJADA-LOC 3MNF-*ag-NF-2O-CAUTION *maini
     ...

Using `-a', the output would look like:

     \p
     A-kem-ako-veNt-i-ri pairani apaani maini
     1I-*kem-DAT-BEN-NF-3MO%1I-*kem-DAT-BEN-NF-REL *pairani aparoni *maini

     h-ay-i-ro kooya. Te-ma osyeki
     3MNF-*ag-NF-3FO *kooya *te-CNJT *osyeki

     hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri
     3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-REL%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-NFR-CAUTION

     ovaa-Ntsi-poosyi-ki, i-tsoNk-at-ii-ro i-tso-t-i-ro tivana.
     *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana

     O-pony-aasyi-t-ak-a
     3F-*pony-PURP-&-PERF-NFR%3F-*pony-PURP-&-NFR-INDEF%3F-*pony-&-hoja-&-PERF-NFR%3F-*pony-&-hoja-&-NFR-INDEF

     ironyaaka i-kaNt-a-ve-t-ak-a-ro 0-iri kooya-ka:
     *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO 3FPOS-*iri *kooya-PROX

     \p --
     N-isyiNtyo-', atake pi-pipiya-t-ak-a
     1POS-*isyiNtyo-VOC *atake 2-pipiya-&-PERF-NFR%2-pipiya-&-NFR-INDEF

     ovaa-Ntsi-poosyi-ki, h-ay-i-mi-kari maini.
     *ovaa-ABS-BAJADA-LOC 3MNF-*ag-NF-2O-CAUTION *maini
     ...
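The column alignment that `-a' produces can be approximated with a short sketch (a hypothetical `align' helper written for illustration, not code from INTERGEN itself): each word and its gloss are padded to the width of the longer of the two, so the pairs stay in step across the two lines.

```python
def align(words, glosses, pad=1):
    # Pad each word/gloss pair to a common column width so that the
    # word line and the gloss line stay aligned (sketch of the `-a'
    # behavior described above).
    widths = [max(len(w), len(g)) + pad for w, g in zip(words, glosses)]
    word_line = "".join(w.ljust(n) for w, n in zip(words, widths)).rstrip()
    gloss_line = "".join(g.ljust(n) for g, n in zip(glosses, widths)).rstrip()
    return word_line, gloss_line
```

For example, align(["h-ay-i-ro", "kooya."], ["3MNF-*ag-NF-3FO", "*kooya"]) places kooya. and *kooya at the same column, since the long gloss of the first word forces the padding.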
Using `-a -n' the output would look like:

     \p
     A-kem-ako-veNt-i-ri pairani apaani maini h-ay-i-ro kooya.
     %2 1I-*kem-DAT-BEN-NF-3MO *pairani aparoni *maini 3MNF-*ag-NF-3FO *kooya

     Te-ma osyeki hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri
     *te-CNJT *osyeki %3 3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO

     ovaa-Ntsi-poosyi-ki, i-tsoNk-at-ii-ro i-tso-t-i-ro tivana.
     *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana

     O-pony-aasyi-t-ak-a ironyaaka i-kaNt-a-ve-t-ak-a-ro
     %4 3F-*pony-PURP-&-PERF-NFR *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO

     0-iri kooya-ka:
     3FPOS-*iri *kooya-PROX

     \p --
     N-isyiNtyo-', atake pi-pipiya-t-ak-a ovaa-Ntsi-poosyi-ki,
     1POS-*isyiNtyo-VOC *atake %2 2-pipiya-&-PERF-NFR *ovaa-ABS-BAJADA-LOC

     h-ay-i-mi-kari maini.
     3MNF-*ag-NF-2O-CAUTION *maini
     ...

Using `-a -e' the output would look like:

     \p
     A-kem-ako-veNt-i-ri%a-kem-ako-veNt-i-ri pairani apaani maini
     1I-*kem-DAT-BEN-NF-3MO%1I-*kem-DAT-BEN-NF-REL *pairani aparoni *maini

     h-ay-i-ro kooya. Te-ma osyeki
     3MNF-*ag-NF-3FO *kooya *te-CNJT *osyeki

     hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri%hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri%hi-nyaaNpoiri-t-apiiNt-a-ve-t-a-kari
     3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-REL%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-NFR-CAUTION

     ovaa-Ntsi-poosyi-ki, i-tsoNk-at-ii-ro i-tso-t-i-ro tivana.
     *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana

     O-pony-aasyi-t-ak-a%o-pony-aasyi-t-a-ka%o-pony-aa-syi-t-ak-a%o-pony-aa-syi-t-a-ka
     3F-*pony-PURP-&-PERF-NFR%3F-*pony-PURP-&-NFR-INDEF%3F-*pony-&-hoja-&-PERF-NFR%3F-*pony-&-hoja-&-NFR-INDEF

     ironyaaka i-kaNt-a-ve-t-ak-a-ro 0-iri kooya-ka:
     *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO 3FPOS-*iri *kooya-PROX

     \p --
     N-isyiNtyo-', atake pi-pipiya-t-ak-a%pi-pipiya-t-a-ka
     1POS-*isyiNtyo-VOC *atake 2-pipiya-&-PERF-NFR%2-pipiya-&-NFR-INDEF

     ovaa-Ntsi-poosyi-ki, h-ay-i-mi-kari maini.
     *ovaa-ABS-BAJADA-LOC 3MNF-*ag-NF-2O-CAUTION *maini
     ...
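The `-n' behavior shown above (an ambiguity count plus a single analysis in place of the full list) amounts to a simple transformation of an ambiguous gloss string; `collapse' is a hypothetical name used only for this sketch.

```python
def collapse(gloss, sep="%"):
    # Turn an ambiguous gloss `g1%g2%g3' into `%3 g1', as the
    # `-a -n' combination does; unambiguous glosses pass through.
    parts = gloss.split(sep)
    if len(parts) == 1:
        return gloss
    return "%s%d %s" % (sep, len(parts), parts[0])
```

Applied to the two-way ambiguity from the first example, this yields the `%2 1I-*kem-DAT-BEN-NF-3MO' form seen in the sample output.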
Using the `-a -swda' options would look like:

     \p
     Aquemacoventziri pairani apaani maini
     a-kem-ako-veNt-i-ri pairani apaani maini
     1I-*kem-DAT-BEN-NF-3MO%1I-*kem-DAT-BEN-NF-REL *pairani aparoni *maini

     jayiro cooya. Tema oshequi
     h-ay-i-ro kooya te-ma osyeki
     3MNF-*ag-NF-3FO *kooya *te-CNJT *osyeki

     jin~aampoiritapiintavetacari
     hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri
     3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-REL%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-NFR-CAUTION

     ovaantsipooshiqui, ithoncatziiro ithotziro tzivana.
     ovaa-Ntsi-poosyi-ki i-tsoNk-at-ii-ro i-tso-t-i-ro tivana
     *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana

     Opon~aashitaca
     o-pony-aasyi-t-ak-a
     3F-*pony-PURP-&-PERF-NFR%3F-*pony-PURP-&-NFR-INDEF%3F-*pony-&-hoja-&-PERF-NFR%3F-*pony-&-hoja-&-NFR-INDEF

     iron~aaca icantavetacaro iri cooyaca:
     ironyaaka i-kaNt-a-ve-t-ak-a-ro 0-iri kooya-ka
     *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO 3FPOS-*iri *kooya-PROX

     \p --
     Nishintyo', ataque pipipiyataca
     n-isyiNtyo-' atake pi-pipiya-t-ak-a
     1POS-*isyiNtyo-VOC *atake 2-pipiya-&-PERF-NFR%2-pipiya-&-NFR-INDEF

     ovaantsipooshiqui, jayimicari maini.
     ovaa-Ntsi-poosyi-ki h-ay-i-mi-kari maini
     *ovaa-ABS-BAJADA-LOC 3MNF-*ag-NF-2O-CAUTION *maini
     ...

As a final example, using the `-a -swda' options along with `-f -g
wword/ddecomp/agls' would look like:

     \p
     \word Aquemacoventziri pairani apaani maini
     \decomp a-kem-ako-veNt-i-ri pairani apaani maini
     \gls 1I-*kem-DAT-BEN-NF-3MO%1I-*kem-DAT-BEN-NF-REL *pairani aparoni *maini
     \word jayiro cooya. Tema oshequi
     \decomp h-ay-i-ro kooya te-ma osyeki
     \gls 3MNF-*ag-NF-3FO *kooya *te-CNJT *osyeki
     \word jin~aampoiritapiintavetacari
     \decomp hi-nyaaNpoiri-t-apiiNt-a-ve-t-ak-a-ri
     \gls 3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-3MO%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-PERF-NFR-REL%3MNF-nyaaNpoiri-&-HAB-&-FRUS-&-NFR-CAUTION
     \word ovaantsipooshiqui, ithoncatziiro ithotziro tzivana.
     \decomp ovaa-Ntsi-poosyi-ki i-tsoNk-at-ii-ro i-tso-t-i-ro tivana
     \gls *ovaa-ABS-BAJADA-LOC 3M-*tsoNk-PROG-NF-3FO 3M-*tso-&-NF-3FO *tivana
     \word Opon~aashitaca
     \decomp o-pony-aasyi-t-ak-a
     \gls 3F-*pony-PURP-&-PERF-NFR%3F-*pony-PURP-&-NFR-INDEF%3F-*pony-&-hoja-&-PERF-NFR%3F-*pony-&-hoja-&-NFR-INDEF
     \word iron~aaca icantavetacaro iri cooyaca:
     \decomp ironyaaka i-kaNt-a-ve-t-ak-a-ro 0-iri kooya-ka
     \gls *ironyaaka 3M-*kaNt-&-FRUS-&-PERF-NFR-3FO 3FPOS-*iri *kooya-PROX
     \p --
     \word Nishintyo', ataque pipipiyataca
     \decomp n-isyiNtyo-' atake pi-pipiya-t-ak-a
     \gls 1POS-*isyiNtyo-VOC *atake 2-pipiya-&-PERF-NFR%2-pipiya-&-NFR-INDEF
     \word ovaantsipooshiqui, jayimicari maini.
     \decomp ovaa-Ntsi-poosyi-ki h-ay-i-mi-kari maini
     \gls *ovaa-ABS-BAJADA-LOC 3MNF-*ag-NF-2O-CAUTION *maini
     ...

Input Analysis Files
********************

Analysis files are "record oriented standard format files".  This
means that the files are divided into records, each representing a
single word in the original input text file, and records are divided
into fields.  An analysis file contains at least one record, and may
contain a large number of records.  Each record contains one or more
fields.  Each field occupies at least one line, and is marked by a
"field code" at the beginning of the line.  A field code begins with a
backslash character (`\'), and contains 1 or more letters in addition.

Analysis file fields
====================

This section describes the possible fields in an analysis file.  The
only field that is guaranteed to exist is the analysis (`\a') field.
All other fields are either data dependent or optional.

Analysis field: \a
------------------

The analysis field (`\a') starts each record of an analysis file.  It
has the following form:

     \a PFX IFX PFX < CAT root CAT root > SFX IFX SFX

where `PFX' is a prefix morphname, `IFX' is an infix morphname, `SFX'
is a suffix morphname, `CAT' is a root category, and `root' is a root
gloss or etymology.
In the simplest case, an analysis field would look like this:

     \a < CAT root >

where `CAT' is a root category and `root' is a root gloss or etymology.

Decomposition field: \d
-----------------------

The morpheme decomposition field (`\d') follows the analysis field.
It has the following form:

     \d anti-dis-establish-ment-arian-ism-s

where the hyphens separate the individual morphemes in the surface
form of the word.

Category field: \cat
--------------------

The category field (`\cat') provides rudimentary category information.
This may be useful for sentence level parsing.  It has the following
form:

     \cat CAT

where `CAT' is the word category.  If there are multiple analyses,
there will be multiple categories in the output, separated by
ambiguity markers.

Properties field: \p
--------------------

The properties field (`\p') contains the names of any allomorph or
morpheme properties found in the analysis of the word.  It has the
form:

     \p ==prop1 prop2=prop3=

where `prop1', `prop2', and `prop3' are property names.  The equal
signs (`=') serve to separate the property information of the
individual morphemes.  Note that morphemes may have more than one
property, with the names separated by spaces, or no properties at all.

Feature Descriptors field: \fd
------------------------------

The feature descriptor field (`\fd') contains the feature names
associated with each morpheme in the analysis.  It has the following
form:

     \fd ==feat1 feat2=feat3=

where `feat1', `feat2', and `feat3' are feature descriptors.  The
equal signs (`=') serve to separate the feature descriptors of the
individual morphemes.  Note that morphemes may have more than one
feature descriptor, with the names separated by spaces, or no feature
descriptors at all.  If there are multiple analyses, there will be
multiple feature sets in the output, separated by ambiguity markers.
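The `='-separated layout of the `\p' and `\fd' fields can be unpacked with a short sketch, assuming (as the examples above suggest) that each `=' separates the entries of adjacent morphemes and that spaces separate multiple names for one morpheme; `split_morpheme_field' is a hypothetical name written for illustration.

```python
def split_morpheme_field(value):
    # `==prop1 prop2=prop3=' -> one list of names per morpheme;
    # empty segments mean a morpheme with no properties/features.
    return [segment.split() for segment in value.split("=")]
```

On the example above this yields empty lists for the morphemes with no properties, `['prop1', 'prop2']` for the morpheme with two, and `['prop3']` for the last named one.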
Underlying form field: \u
-------------------------

The underlying form field (`\u') is similar to the decomposition field
except that it shows underlying forms instead of surface forms.  It
looks like this:

     \u a-para-a-i-ri-me

where the hyphens separate the individual morphemes.

Word field: \w
--------------

The original word field (`\w') contains the original input word as it
looks before decapitalization and orthography changes.  It looks like
this:

     \w The

Note that this is a gratuitous change from earlier versions of AMPLE
and KTEXT, which wrote the decapitalized form.

Formatting field: \f
--------------------

The format information field (`\f') records any formatting codes or
punctuation that appeared in the input text file before the word.  It
looks like this:

     \f \\id MAT 5 HGMT05.SFM, 14-feb-84 D. Weber, Huallaga Quechua\n
     	\\c 5\n\n
     	\\s

where backslashes (`\') in the input text are doubled, newlines are
represented by `\n', and additional lines in the field start with a
tab character.  The format information field is written to the output
analysis file whenever it is needed, that is, whenever formatting
codes or punctuation exist before words.

Capitalization field: \c
------------------------

The capitalization field (`\c') records any capitalization of the
input word.  It looks like this:

     \c 1

where the number following the field code has one of these values:

`1'
     the first (or only) letter of the word is capitalized
`2'
     all letters of the word are capitalized
`4'-`32767'
     some letters of the word are capitalized and some are not

Note that the third form is of limited utility, but still exists
because of words like the author's last name.  The capitalization
field is written to the output analysis file whenever any of the
letters in the word are capitalized.

Nonalphabetic field: \n
-----------------------

The nonalphabetic field (`\n') records any trailing punctuation, bar
codes, or whitespace characters.
It looks like this:

     \n |r.\n

where newlines are represented by `\n'.  The nonalphabetic field ends
with the last whitespace character immediately following the word.
The nonalphabetic field is written to the output analysis file
whenever the word is followed by anything other than a single space
character.  This includes the case when a word ends a file with
nothing following it.

Ambiguous analyses
==================

The previous section assumed that only one analysis is produced for
each word.  This is not always possible since words in isolation are
frequently ambiguous.  Multiple analyses are handled by writing each
analysis field in parallel, with the number of analyses at the
beginning of each output field.  For example,

     \a %2%< A0 imaika > CNJT AUG%< A0 imaika > ADVS%
     \d %2%imaika-Npa-ni%imaika-Npani%
     \cat %2%A0 A0=A0/A0=A0/A0%A0 A0=A0/A0%
     \p %2%==%=%
     \fd %2%==%=%
     \u %2%imaika-Npa-ni%imaika-Npani%
     \w Imaicampani
     \f \\v124
     \c 1
     \n \n

where the percent sign (`%') separates the different analyses in each
field.  Note that only those fields which contain analysis information
are marked for ambiguity.  The other fields (`\w', `\f', `\c', and
`\n') are the same regardless of the number of analyses.

Analysis failures
=================

The previous sections assumed that words are successfully analyzed.
This does not always happen.  Analysis failures are marked the same
way as multiple analyses, but with zero (`0') for the ambiguity count.
For example,

     \a %0%ta%
     \d %0%ta%
     \cat %0%%
     \p %0%%
     \fd %0%%
     \u %0%%
     \w TA
     \f \\v 12 |b
     \c 2
     \n |r\n

Note that only the `\a' and `\d' fields contain any information, and
those both have the original word as a place holder.  The other
analysis fields (`\cat', `\p', `\fd', and `\u') are marked for
failure, but otherwise left empty.

Text Output Control File
************************

The text output module restores a processed document from the internal
format to its textual form.
It re-imposes capitalization on words and restores punctuation, format
markers, white space, and line breaks.  Also, orthography changes can
be made, and the delimiter that marks ambiguities and failures can be
changed.  This chapter describes the control file given to the text
output module.(1)

     ---------- Footnotes ----------

     (1) This chapter is adapted from chapter 8 of Weber (1990).

Text output ambiguity delimiter: \ambig
=======================================

The text output module flags words that either produced no results or
multiple results when processed.  These are flagged with percent signs
(`%') by default, but this can be changed by declaring the desired
character with the `\ambig' field code.  For example, the following
would change the ambiguity delimiter to `@':

     \ambig @

Text output orthographic changes: \ch
=====================================

The text output module allows orthographic changes to be made to the
processed words.  These are given in the text output control file.  An
orthography change is defined by the `\ch' field code followed by the
actual orthography change.  Any number of orthography changes may be
defined in the text output control file.  The output of each change
serves as the input to the following change.  That is, each change is
applied as many times as necessary to an input word before the next
change from the text output control file is applied.

Basic changes
-------------

To substitute one string of characters for another, these strings must
be made known to the program in a change.  (The technical term for
this sort of change is a production, but we will simply call them
changes.)  In the simplest case, a change is given in three parts:

  1. the field code `\ch' must be given at the extreme left margin to
     indicate that this line contains a change;
  2. the match string is the string for which the program must search;
     and
  3. the substitution string is the replacement for the match string,
     wherever it is found.
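The ordered application just described, where the output of each change serves as the input to the next, can be modeled directly; the sketch below illustrates the behavior and is not INTERGEN's own code.

```python
def apply_changes(word, changes):
    # Apply an ordered table of (match, substitution) pairs.  Each
    # change is applied throughout the word, left to right, before
    # the next change sees the result.
    for match, substitution in changes:
        word = word.replace(match, substitution)
    return word
```

For instance, with the single change `\ch "hi" "bye"' loaded as the pair ("hi", "bye"), apply_changes("hiho", [("hi", "bye")]) yields "byeho".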
The beginning and end of the match and substitution strings must be
marked.  The first printing character following `\ch' (with at least
one space or tab between) is used as the delimiter for that line.  The
match string is taken as whatever lies between the first and second
occurrences of the delimiter on the line, and the substitution string
is whatever lies between the third and fourth occurrences.  For
example, the following lines all indicate the change of hi to bye,
where the delimiters are the double quote mark (`"'), the single quote
mark (`''), the period (`.'), and the at sign (`@'):

     \ch "hi" "bye"
     \ch 'hi' 'bye'
     \ch .hi. .bye.
     \ch @hi@ @bye@

Throughout this document, we use the double quote mark as the
delimiter unless there is some reason to do otherwise.

Change tables follow these conventions:

  1. Any characters (other than the delimiter) may be placed between
     the match and substitution strings.  This allows various
     notations to symbolize the change.  For example, the following
     are equivalent:

          \ch "thou" "you"
          \ch "thou" to "you"
          \ch "thou" > "you"
          \ch "thou" --> "you"
          \ch "thou" becomes "you"

  2. Comments may be included after the substitution string, initiated
     by the comment character (by default the vertical bar, `|'; see
     the `-c' option).  The following lines illustrate the use of
     comments:

          \ch "qeki" "qiki" | for cases like wawqeki
          \ch "thou" "you"  | for modern English

  3. A change can be ignored temporarily by turning it into a comment
     field.  This is done either by placing an unrecognized field code
     in front of the normal `\ch', or by placing the comment character
     (`|') in front of it.  For example, only the first of the
     following three lines would effect a change:

          \ch "nb" "mp"
          \no \ch "np" "np"
          |\ch "mb" "nb"

The changes in the text output control file are applied as an ordered
set of changes.  The first change is applied to the entire word by
searching from left to right for any matching strings and, upon
finding any, replacing them with the substitution string.
After the first change has been applied to the entire word, the next
change is applied, and so on.  Thus, each change applies to the result
of all prior changes.  When all the changes have been applied, the
resulting word is returned.

For example, suppose we have the following changes:

     \ch "aib" > "ayb"
     \ch "yb" > "yp"

Consider the effect these have on the word paiba.  The first changes i
to y, yielding payba; the second changes b to p, to yield paypa.
(This would be better than the single change of aib to ayp if there
were sources of yb other than the output of the first rule.)

The way in which change tables are applied allows certain tricks.  For
example, suppose that for Quechua, we wish to change hw to f, so that
hwista becomes fista and hwis becomes fis.  However, we do not wish to
change the sequence shw or chw to sf or cf (respectively).  This could
be done by the following sequence of changes.  (Note, `@' and `$' are
not otherwise used in the orthography.)

     \ch "shw" > "@"    | (1)
     \ch "chw" > "$"    | (2)
     \ch "hw"  > "f"    | (3)
     \ch "@"   > "shw"  | (4)
     \ch "$"   > "chw"  | (5)

Lines (1) and (2) protect the sh and ch by changing them to
distinguished symbols.  This clears the way for the change of hw to f
in (3).  Then lines (4) and (5) restore `@' and `$' to shw and chw,
respectively.  (An alternative, simpler way to do this is discussed in
the next section.)

Environmentally constrained changes
-----------------------------------

It is possible to impose string environment constraints (SECs) on
changes in the orthography change tables.  The syntax of SECs is
described in detail in section {No value for `words.vs.format'}.

For example, suppose we wish to change the mid vowels (e and o) to
high vowels (i and u respectively) immediately before and after q.
This could be done with the following changes:

     \ch "o" "u" / _ q / q _
     \ch "e" "i" / _ q / q _

This is not entirely a hypothetical example; some Quechua practical
orthographies write the mid vowels e and o.
However, in the environment of /q/ these could be considered
phonemically high vowels /i/ and /u/.  Changing the mid vowels to high
upon loading texts has the advantage that (for cases like upun "he
drinks" and upoq "the one who drinks") the root needs to be
represented internally only as upu "drink".  But note, because of
Spanish loans, it is not possible to change all cases of e to i and o
to u.  The changes must be conditioned.

In reality, the regressive vowel-lowering effect of /q/ can pass over
various intervening consonants, including /y/, /w/, /l/, /ll/, /r/,
/m/, /n/, and /n~/.  For example, /ullq/ becomes ollq, /irq/ becomes
erq, and so on.  Rather than list each of these cases as a separate
constraint, it is convenient to define a class (which we label
`+resonant') and use this class to simplify the SEC.  Note that the
string class must be defined (with the `\scl' field code) before it is
used in a constraint.

     \scl +resonant y w l ll r m n n~
     \ch "o" "u" / q _ / _ ([+resonant]) q
     \ch "e" "i" / q _ / _ ([+resonant]) q

This says that the mid vowels become high vowels after /q/ and before
/q/, possibly with an intervening /y/, /w/, /l/, /ll/, /r/, /m/, /n/,
or /n~/.

Consider the problem posed for Quechua in the previous section, that
of changing hw to f.  An alternative is to condition the change so
that it does not apply adjacent to a member of the string class
`Affric', which contains s and c:

     \scl Affric c s
     \ch "hw" "f" / [Affric] ~_

It is sometimes convenient to make certain changes only at word
boundaries, that is, to change a sequence of characters only if they
initiate or terminate the word.  This conditioning is easily
expressed, as shown in the following examples.
     \ch "this" "that"         | anywhere in the word
     \ch "this" "that" / # _   | only if word initial
     \ch "this" "that" / _ #   | only if word final
     \ch "this" "that" / # _ # | only if entire word

Using text orthography changes
------------------------------

The purpose of orthography change is to convert text from an external
orthography to an internal representation more suitable for
morphological analysis.  In many cases this is unnecessary, the
practical orthography being completely adequate as the internal
representation.  In other cases, the practical orthography is an
inconvenience that can be circumvented by converting to a more
phonemic representation.

Let us take a simple example from Latin.  In the Latin orthography,
the nominative singular masculine of the word for "king" is rex.
However, phonemically this is really /reks/; /rek/ is the root meaning
king and the /s/ is an inflectional suffix.  If the program is to
recover such an analysis, then it is necessary to convert the x of the
external, practical orthography into ks internally.  This can be done
by including the following orthography change in the text output
control file:

     \ch "x" "ks"

In this, x is the match string and ks is the substitution string, as
discussed in section {No value for `output.file'}.  Whenever x is
found, ks is substituted for it.

Let us consider next an example from Huallaga Quechua.  The practical
orthography currently represents long vowels by doubling the vowel.
For example, what is written as kaa is /ka:/ "I am", where the length
(represented by a colon) is the morpheme meaning "first person
subject".  Other examples, such as upoo /upu:/ "I drink" and upichee
/upi-chi-:/ "I extinguish", motivate us to convert all long vowels
into a vowel followed by a colon.
The following changes do this:

     \ch "aa" "a:"
     \ch "ee" "i:"
     \ch "ii" "i:"
     \ch "oo" "u:"
     \ch "uu" "u:"

Note that the long high vowels (i and u) have become mid vowels (e and
o respectively); consequently, the vowel in the substitution string is
not necessarily the same as that of the match string.

What is the utility of these changes?  In the lexicon, the morphemes
can be represented in their phonemic forms; they do not have to be
represented in all their orthographic variants.  For example, the
first person subject morpheme can be represented simply as a colon
(-:), rather than as -a in cases like kaa, as -o in cases like qoo,
and as -e in cases like upichee.  Further, the verb "drink" can be
represented as upu and the causative suffix (in upichee) can be
represented as -chi; these are the forms these morphemes have in other
(nonlowered) environments.

As the next example, let us suppose that we are analyzing Spanish, and
that we wish to work internally with k rather than c (before a, o, and
u) and qu (before i and e).  (Of course, this is probably not the only
change we would want to make.)  Consider the following changes:

     \ch "ca" "ka"
     \ch "co" "ko"
     \ch "cu" "ku"
     \ch "qu" "k"

The first three handle c and the last handles qu.  By virtue of
including the vowel after c, we avoid changing ch to kh.

There are other ways to achieve the same effect.  One way exploits the
fact that each change is applied to the output of all previous
changes.  Thus, we could first protect ch by changing it to some
distinguished character (say `@'), then change c to k, and then
restore `@' to ch:

     \ch "ch" "@"
     \ch "c" "k"
     \ch "@" "ch"
     \ch "qu" "k"

Another approach conditions the change by the adjacent characters.
The changes could be rewritten as

     \ch "c" "k" / _a / _o / _u | only before a, o, or u
     \ch "qu" "k"               | in all cases

The first change says, "change c to k when followed by a, o, or u."
(This would, for example, change como to komo, but would not affect
chal.)
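A constraint such as `/ _a / _o / _u' acts like a lookahead on the following character.  The analogy can be sketched with a regular expression; this mimics the Spanish example above and is not INTERGEN's implementation.

```python
import re

def spanish_c_to_k(word):
    # `\ch "c" "k" / _a / _o / _u': c -> k only before a, o, or u.
    # The lookahead (?=...) checks the environment without consuming it.
    word = re.sub(r"c(?=[aou])", "k", word)
    # `\ch "qu" "k"': qu -> k in all cases.
    return word.replace("qu", "k")
```

As in the text, this changes como to komo but leaves chal alone, since the c there is followed by h rather than a, o, or u.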
The syntax of such conditions is exactly that used in string
environment constraints; see section {No value for `words.vs.format'}.

Where orthography changes apply
-------------------------------

Input orthography changes are made when the text being processed may be
written in a practical orthography.  Rather than requiring that the
text be converted as a prerequisite to running the program, it is
possible to have the program convert the orthography as it loads the
text and before it processes each word.

The changes loaded from the text output control file are applied after
all the text is converted to lower case (and the information about
upper and lower case, along with information about format marking,
punctuation, and white space, has been put to one side).  Consequently,
the match strings of these orthography changes should be all lower
case; any change that has an uppercase character in its match string
will never apply.

A sample orthography change table
---------------------------------

We include here the entire input orthography change table for Caquinte
(a language of Peru).  There are basically four changes that need to be
made:

  1. nasals, which in the practical orthography reflect their
     assimilation to the point of articulation of a following
     noncontinuant, must be changed into an unspecified nasal,
     represented by N;
  2. c and qu are changed to k;
  3. j is changed to h; and
  4. gu is changed to g before i and e.
     \ch "mp" "Np"    | for unspecified nasals
     \ch "nch" "Nch"
     \ch "nc" "Nk"
     \ch "nqu" "Nk"
     \ch "nt" "Nt"
     \ch "ch" "@"     | to protect ch
     \ch "c" "k"      | other c's to k
     \ch "@" "ch"     | to restore ch
     \ch "qu" "k"
     \ch "j" "h"
     \ch "gue" "ge"
     \ch "gui" "gi"

This change table can be simplified by the judicious use of string
environment constraints:

     \ch "m" > "N" / _p
     \ch "n" > "N" / _c / _t / _qu
     \ch "c" > "k" / _~h
     \ch "qu" > "k"
     \ch "j" > "h"
     \ch "gu" > "g" / _e / _i

As suggested by the preceding examples, the text orthography change
table is composed of all the `\ch' fields found in the text output
control file.  These may appear anywhere in the file relative to the
other fields.  It is recommended that all the orthography changes be
placed together in one section of the text output control file, rather
than being mixed in with other fields.

Syntax of Orthography Changes
-----------------------------

This section presents a grammatical description of the syntax of
orthography changes in BNF notation.

     1a.  <change>        ::= <string> <string>
     1b.                      <string> <string> <environments>
     2a.  <string>        ::= <quotemark> <characters> <quotemark>
     2b.                      <quotemark> <quotemark>
     2c.                      > <string>
     3.   <quotemark>     ::= any printing character not used in either
                              the ``from'' string or the ``to'' string
     4.   <characters>    ::= one or more characters other than the
                              quote character used by this orthography
                              change
     5a.  <environments>  ::= <environment>
     5b.                      <environments> <environment>
     6a.  <environment>   ::= <slash> <left-environ> <bar>
     6b.                      <slash> <bar> <right-environ>
     6c.                      <slash> <left-environ> <bar> <right-environ>
     7a.  <left-environ>  ::= <boundary>
     7b.                      <literal-sequence>
     7c.                      <boundary> <literal-sequence>
     8a.  <right-environ> ::= <literal-sequence>
     8b.                      <boundary>
     8c.                      <literal-sequence> <boundary>
     9a.  <literal-sequence> ::= <item>
     9b.                      <literal-sequence> <item>
     9c.                      <literal-sequence> ... <item>
     10a. <item>          ::= <element>
     10b.                     ( <element> )
     11a. <element>       ::= ~ <element>
     11b.                     <literal>
     11c.                     [ <literal> ]
     12.  <slash>         ::= /
                              +/
     13.  <bar>           ::= _
                              ~_
     14a. <boundary>      ::= #
     14b.                     ~#
     15.  <literal>       ::= one or more contiguous characters

Comments on selected BNF rules
..............................

2.   The same <quotemark> character must be used at both the beginning
     and the end of both the "from" string and the "to" string.

3.   The double quote (`"') and single quote (`'') characters are most
     often used.

7-8. Note that what can appear to the left of the environment bar is a
     mirror image of what can appear to the right.

9c.  An ellipsis (`...') indicates a possible break in contiguity.

10b. Something enclosed in parentheses is optional.

11a.
A tilde (`~') reverses the desirability of an element, causing the
constraint to fail if the element is found rather than if it is not
found.

11c. A literal enclosed in square brackets must be the name of a string
     class defined by a `\scl' field in the analysis data file, or
     earlier in the dictionary orthography change file.

12.  A `+/' is usually used for morpheme environment constraints, but
     may be used for change environment constraints in `\ch' fields in
     the dictionary orthography change table file.

13.  A tilde attached to the environment bar (`~_') inverts the sense
     of the constraint as a whole.

14b. The boundary marker preceded by a tilde (`~#') indicates that it
     must not be a word boundary.

15.  The special characters used by environment constraints can be
     included in a literal only if they are immediately preceded by a
     backslash:

          \+  \/  \#  \~  \[  \]  \(  \)  \.  \_  \\

Decomposition Separation Character: \dsc
========================================

The `\dsc' field defines the character used to separate the morphemes
in the decomposition field of the input analysis file.  For example, to
use the equal sign (`='), the text input control file would include:

     \dsc =

This would handle a decomposition field like the following:

     \d %3%kay%ka=y%ka=y%

It makes sense to use the `\dsc' field only once in the text output
control file.  If multiple `\dsc' fields do occur in the file, the
value given in the first one is used.  If the text output control file
does not have a `\dsc' field, a dash (`-') is used.

The first printing character following the `\dsc' field code is used as
the morpheme decomposition separator character.  The same character
cannot be used both for separating decomposed morphemes in the analysis
output file and for marking comments in the output control files.
Thus, one normally cannot use the vertical bar (`|') as the
decomposition separation character.
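To make the role of the separator concrete, here is a hedged Python sketch of how a field like `\d %3%kay%ka=y%ka=y%' might be taken apart. The function and its name are illustrative assumptions, not INTERGEN's actual parser; it assumes `%' as the ambiguity delimiter and `=' as the `\dsc' separator.

```python
# Illustrative sketch: split an ambiguous decomposition field on the
# ambiguity delimiter '%', then split each analysis on the separator
# designated by \dsc (here '=').
def parse_decomposition(field, dsc="="):
    parts = field.strip("%").split("%")
    count = int(parts[0])                         # leading ambiguity count
    analyses = [a.split(dsc) for a in parts[1:]]  # one list per analysis
    return count, analyses

count, analyses = parse_decomposition("%3%kay%ka=y%ka=y%")
# count is 3; analyses are ['kay'], ['ka', 'y'], ['ka', 'y']
```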
Primary format marker character: \format
========================================

The `\format' field designates a single character to flag the beginning
of a primary format marker.  For example, if the format markers in the
text files begin with the at sign (`@'), the following would be placed
in the text input control file:

     \format @

This would be used, for example, if the text contained format markers
like the following:

     @
     @p
     @sp
     @make(Article)
     @very-long.and;muddled/format*marker,to#be$sure

If a `\format' field occurs in the text input control file without a
following character to serve for flagging format markers, then the
program will not recognize any format markers and will try to parse
everything other than punctuation characters.

It makes sense to use the `\format' field only once in the text input
control file.  If multiple `\format' fields do occur in the file, the
value given in the first one is used.

The first printing character following the `\format' field code is
used to flag format markers.  The character currently used to mark
comments cannot be assigned to also flag format markers.  Thus, the
vertical bar (`|') cannot normally be used to flag format markers.

Lowercase/uppercase character pairs: \luwfc
===========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any other
characters that have lowercase and uppercase forms, these must be given
in a `\luwfc' field in the text input control file.

The `\luwfc' field defines pairs of characters; the first member of
each pair is a lowercase character and the second is the corresponding
uppercase character.  Several such pairs may be placed in the field, or
they may be placed in separate fields.  Whitespace may be interspersed
freely.
For example, the following three examples are equivalent:

     \luwfc éÉ ñÑ

or

     \luwfc éÉ | e with acute accent
     \luwfc ñÑ | enyee

or

     \luwfc é É ñ Ñ

Note that comments can be used as well (just as they can in any
INTERGEN control file).  This means that the comment character cannot
be designated as a word formation character.  If the orthography
includes the vertical bar (`|'), then a different comment character
must be defined with the `-c' command line option when INTERGEN is
initiated; see `Running INTERGEN' above.

The `\luwfc' field can be entered anywhere in the text input control
file, although a natural place would be before the `\wfc' (word
formation character) field.

Any standard alphabetic character (that is, `a' through `z' or `A'
through `Z') in the `\luwfc' field will override the standard
lower-upper case pairing.  For example, the following will treat `X' as
the uppercase equivalent of `z':

     \luwfc z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfc' field is allowed to map multiple lowercase characters to
the same uppercase character, and vice versa.  This is needed for
languages that do not mark tone on uppercase letters.

Multibyte lowercase/uppercase character pairs: \luwfcs
======================================================

The `\luwfcs' field extends the character pair definitions of the
`\luwfc' field to multibyte character sequences.  Like the `\luwfc'
field, the `\luwfcs' field defines pairs of characters; the first
member of each pair is a multibyte lowercase character and the second
is the corresponding multibyte uppercase character.  Several such pairs
may be placed in the field, or they may be placed in separate fields.
Whitespace separates the members of each pair, and the pairs from each
other.
For example, the following three examples are equivalent:

     \luwfcs e' E` n~ N^ ç C&

or

     \luwfcs e' E` | e with acute accent
     \luwfcs n~ N^ | enyee
     \luwfcs ç C&  | c cedilla

or

     \luwfcs e' E` n~ N^ ç C&

Note that comments can be used as well (just as they can in any
INTERGEN control file).  This means that the comment character cannot
be designated as a word formation character.  If the orthography
includes the vertical bar (`|'), then a different comment character
must be defined with the `-c' command line option when INTERGEN is
initiated; see `Running INTERGEN' above.

Also note that there is no requirement that the lowercase form be the
same length (number of bytes) as the uppercase form.  The examples
shown above are only one or two bytes (character codes) in length, but
there is no limit placed on the length of a multibyte character.

The `\luwfcs' field can be entered anywhere in the text input control
file.  `\luwfcs' fields may be mixed with `\luwfc' fields in the same
file.

Any standard alphabetic character (that is, `a' through `z' or `A'
through `Z') in the `\luwfcs' field will override the standard
lower-upper case pairing.  For example, the following will treat `X' as
the uppercase equivalent of `z':

     \luwfcs z X

Note that `Z' will still have `z' as its lowercase equivalent in this
case.

The `\luwfcs' field is allowed to map multiple multibyte lowercase
characters to the same multibyte uppercase character, and vice versa.
This may be useful in some situations, but it introduces an element of
ambiguity into the decapitalization and recapitalization processes.  If
ambiguous capitalization is supported, then for the previous example,
`z' will have both `X' and `Z' as uppercase equivalents, and `X' will
have both `x' and `z' as lowercase equivalents.
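The ambiguity just described can be pictured as a pair of many-to-many mappings. The Python sketch below is only an illustration of the `z'/`X' example together with the standard `z'/`Z' and `x'/`X' pairings; the pair list and dictionary layout are assumptions, not INTERGEN's internal representation.

```python
# Sketch: case pairs as many-to-many mappings.  The pair list is
# assumed for illustration: the user's "z X" pair plus the standard
# z/Z and x/X pairings that remain in force.
from collections import defaultdict

pairs = [("z", "X"), ("z", "Z"), ("x", "X")]

upper_of = defaultdict(set)   # lowercase -> possible uppercase forms
lower_of = defaultdict(set)   # uppercase -> possible lowercase forms
for lo, up in pairs:
    upper_of[lo].add(up)
    lower_of[up].add(lo)

# 'z' now has both 'X' and 'Z' as uppercase equivalents, and
# 'X' has both 'x' and 'z' as lowercase equivalents.
```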
Text output string classes: \scl
================================

A string class is defined by the `\scl' field code followed by the
class name, which is followed in turn by one or more contiguous
character strings or (previously defined) string class names.  A string
class name used as part of the class definition must be enclosed in
square brackets.  For example, the sample text output control file
given below contains the following lines:

     a. \scl X t s c
     b. \ch "h" "j" / [X] ~_

Line a defines a string class including t, s, and c; change rule b
makes use of this class to block the change of h to j when it occurs in
the digraphs th, sh, and ch.

The class name must be a single, contiguous sequence of printing
characters.  Characters and words which have special meanings in tests
should not be used.  The actual character strings have no such
restrictions.  The individual members of the class are separated by
spaces, tabs, or newlines.

Each `\scl' field defines a single string class.  Any number of `\scl'
fields may appear in the file.  The only restriction is that a string
class must be defined before it is used.

Caseless word formation characters: \wfc
========================================

To break a text into words, the program needs to know which characters
are used to form words.  It always assumes that the letters `A' through
`Z' and `a' through `z' are used as word formation characters.  If the
orthography of the language the user is working in uses any characters
that do not have different lowercase and uppercase forms, these must be
given in a `\wfc' field in the text input control file.

For example, English uses an apostrophe character (`'') that could be
considered a word formation character.  This information is provided by
the following example:

     \wfc ' | needed for words like don't

Notice that the characters in the `\wfc' field may be separated by
spaces, although this is not required.
If more than one `\wfc' field occurs in the text input control file,
the program uses the combination of all characters defined in all such
fields as word formation characters.

The comment character cannot be designated as a word formation
character.  If the orthography includes the vertical bar (`|'), then a
different comment character must be defined with the `-c' command line
option when INTERGEN is initiated; see `Running INTERGEN' above.

Multibyte caseless word formation characters: \wfcs
===================================================

The `\wfcs' field allows multibyte characters to be defined as
"caseless" word formation characters.  It has the same relationship to
`\wfc' that `\luwfcs' has to `\luwfc'.  The multibyte word formation
characters are separated from each other by whitespace.

A sample text output control file
=================================

A complete text output control file used for adapting to Asheninca
Campa is given below.

     \id AEouttx.ctl for Asheninca Campa
     \ch "N" "m" / _ p         | assimilates before p
     \ch "N" "n"               | otherwise becomes n
     \ch "ny" "n~"
     \ch "ts" "th" / ~_ i      | (N)tsi is unchanged
     \ch "tsy" "ch"
     \ch "sy" "sh"
     \ch "t" "tz" / n _ i
     \ch "k" "qu" / _ i / _ e
     \ch "k" "q" / _ y
     \ch "k" "c"
     \scl X t s c              | define class of t s c
     \ch "h" "j" / [X] ~_      | change except in th, sh, ch
     \ch "#" " "               | remove fixed space
     \ch "@" ""                | remove blocking character

Table of Contents
*****************

Introduction to the INTERGEN program
Running INTERGEN
Input Analysis Files
  Analysis file fields
    Analysis field: \a
    Decomposition field: \d
    Category field: \cat
    Properties field: \p
    Feature Descriptors field: \fd
    Underlying form field: \u
    Word field: \w
    Formatting field: \f
    Capitalization field: \c
    Nonalphabetic field: \n
  Ambiguous analyses
  Analysis failures
Text Output Control File
  Text output ambiguity delimiter: \ambig
  Text output orthographic changes: \ch
    Basic changes
    Environmentally constrained changes
    Using text orthography changes
    Where orthography
    changes apply
    A sample orthography change table
    Syntax of Orthography Changes
  Decomposition Separation Character: \dsc
  Primary format marker character: \format
  Lowercase/uppercase character pairs: \luwfc
  Multibyte lowercase/uppercase character pairs: \luwfcs
  Text output string classes: \scl
  Caseless word formation characters: \wfc
  Multibyte caseless word formation characters: \wfcs
  A sample text output control file