KTagger is a stand-alone application built with PC-KIMMO's basic parsing functions. It accepts as input a word list file, consisting of one word per line, and produces as output a structured text file containing the morphological parse(s) of each word. The content and format of the output file are determined by a "control" file constructed by the user. KTagger can be used to do part-of-speech tagging, or to produce a word lexicon or any other kind of structured output.
To use KTagger, you need a PC-KIMMO language description such as Englex. The description must include a word grammar file. You do not need PC-KIMMO itself to use KTagger.
KTagger runs on these systems: MS-DOS, Windows, Macintosh, and Unix.
KTagger is a batch-oriented program. It reads a control file and then processes an input text file to produce an output analysis file.
KTagger uses an old-fashioned command line interface following the convention of options starting with a dash character (`-'). The available options are listed below in alphabetical order. Those options which require an argument have the argument type following the option letter.
-h
-i filename
-l filename
-o filename
-q
-x filename
The following options exist only in beta-test versions of the program, since they are used only for debugging.
-/
-z filename
-Z address,count
address is allocated or freed for the count'th time.
\rules
\lexicon
\grammar
\header
\footer
A \header or \footer declaration contains a string delimited by double-quote characters (an empty string is indicated by ""). A string can contain these special characters:
\n newline
\t tab
\f form feed
\" double-quote character
\\ backslash character
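The escape handling can be sketched in Python as follows. This is only an illustration, not KTagger's own code: the function name decode_control_string is hypothetical, and the sketch assumes the escapes carry their conventional C-style meanings (newline, tab, form feed, double quote, backslash).

```python
# Hypothetical helper, not part of KTagger: decode a double-quoted
# control-file string, assuming C-style meanings for the escapes.
ESCAPES = {"n": "\n", "t": "\t", "f": "\f", '"': '"', "\\": "\\"}


def decode_control_string(quoted: str) -> str:
    """Strip the surrounding double quotes and expand the escapes."""
    if len(quoted) < 2 or quoted[0] != '"' or quoted[-1] != '"':
        raise ValueError("string must be delimited by double quotes")
    body = quoted[1:-1]
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ESCAPES:
            # Recognized two-character escape: emit its expansion.
            out.append(ESCAPES[body[i + 1]])
            i += 2
        else:
            # Any other character is copied unchanged.
            out.append(ch)
            i += 1
    return "".join(out)


empty = decode_control_string('""')
marker = decode_control_string(r'"\\w "')
end_tag = decode_control_string(r'"</w>\n"')
```

Under these assumptions, the string "\\w " decodes to a backslash, the letter w, and a space, which is how a standard format marker can be written as a start tag.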
\recordstarttag
\recordendtag
\field
\starttag
\endtag
\path
The \field declaration has no content; it merely indicates the start of a field definition. The \starttag declaration contains a string (possibly empty) inserted before each instance of that field in the output file. The \endtag declaration contains a string (possibly empty) inserted after each instance of that field in the output file. The \path declaration contains a feature path specification that refers to the parse result of a word. There are five reserved path specifications:
<WORD>
<LEX>
<GLOSS>
<TREE>
<FEAT>
A \path declaration may also specify any feature path found in the top node features. Using Englex, a path declaration of <head> would return all head features, while <head pos> would return just the value of the pos feature. Thus it is possible to output any feature value made available by the grammar of the language description.
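The way a feature path selects a value can be sketched in Python. This is an illustration under stated assumptions, not KTagger's implementation: the top node features are modeled as a nested dictionary, the function name resolve_path is hypothetical, and the reserved path specifications are not handled.

```python
# Hypothetical sketch: follow a feature path such as "<head pos>" through
# a feature structure modeled (as an assumption) by nested dictionaries.
def resolve_path(features: dict, path: str):
    """Return the value at the given path, or None if the path is absent."""
    spec = path.strip()
    if not (spec.startswith("<") and spec.endswith(">")):
        raise ValueError("a path must be enclosed in angle brackets")
    node = features
    for name in spec[1:-1].split():
        if not isinstance(node, dict) or name not in node:
            return None
        node = node[name]
    return node


# Head features of "began", taken from the Englex sample output.
top_node = {"head": {"finite": "+", "pos": "V", "tense": "PAST", "vform": "ED"}}

all_head = resolve_path(top_node, "<head>")
just_pos = resolve_path(top_node, "<head pos>")
```

Here all_head holds every head feature, while just_pos holds only the value V.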
\rem
This chapter illustrates how to use KTagger by giving three sample control files used in conjunction with the Englex PC-KIMMO description of English and the following list of words:
be began but by child children compute computer computerize could
Consider the following control file:
\rem TDF.CTL - control file for KTagger
\rem Produces output file in tab-delimited format
\rem Uses Englex (English description for PC-KIMMO)
\rules ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm
\header ""
\footer ""
\recordstarttag ""
\recordendtag "\n"
\field
\starttag ""
\endtag "\t"
\path <WORD>
\field
\starttag ""
\endtag "\t"
\path <LEX>
\field
\starttag ""
\endtag "\t"
\path <head pos>
\field
\starttag ""
\endtag ""
\path <head>
For the given set of input words, the following output is created:
be	be	V	[ pos:V vform:BASE ]
be	be	AUX	[ neg:- pos:AUX ]
began	be`gan	V	[ finite:+ pos:V tense:PAST vform:ED ]
but	but	CJ	[ pos:CJ ]
but	but	PP	[ pos:PP ]
by	by	PP	[ pos:PP ]
by	by	AV	[ pos:AV ]
child	`child	N
	[ agr:[ 3sg:+ ] number:SG pos:N proper:- verbal:- ]
children	`children	N
	[ agr:[ 3sg:- ] number:PL pos:N proper:- verbal:- ]
compute	com`pute	V	[ pos:V vform:BASE ]
computer	com`pute+er	N
	[ agr:[ 3sg:+ ] number:SG pos:N ]
computerize	com`pute+er+ize	V	[ finite:- pos:V vform:BASE ]
could	could	AUX	[ modal:+ neg:- pos:AUX ]
(Lines that are too long have been split, with the <head> feature shown on the second line indented one tab stop.)
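A tab-delimited file of this kind is easy to load into other tools. The Python sketch below is one way to do it, assuming that each record in the actual file occupies a single line (as the "\n" record end tag suggests) and has exactly the four fields defined by this control file. The names Parse, read_tdf, and tags_by_word are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Parse(NamedTuple):
    """One output record: the four fields defined in the control file."""
    word: str
    lex: str
    pos: str
    head: str


def read_tdf(text: str) -> List[Parse]:
    """Split each non-empty line on tabs into a Parse record."""
    parses = []
    for line in text.splitlines():
        if not line.strip():
            continue
        word, lex, pos, head = line.split("\t")
        parses.append(Parse(word, lex, pos, head))
    return parses


def tags_by_word(parses: List[Parse]) -> Dict[str, List[str]]:
    """Collect the part-of-speech tags proposed for each word."""
    tags = defaultdict(list)
    for parse in parses:
        tags[parse.word].append(parse.pos)
    return dict(tags)


sample = (
    "be\tbe\tV\t[ pos:V vform:BASE ]\n"
    "be\tbe\tAUX\t[ neg:- pos:AUX ]\n"
    "by\tby\tPP\t[ pos:PP ]\n"
)
parses = read_tdf(sample)
tags = tags_by_word(parses)
```

Grouping by word makes ambiguous words visible: in this sample, "be" receives both the V and the AUX tag.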
Consider the following control file:
\rem SFM.CTL - control file for KTagger
\rem Produces output file in standard format
\rem Uses Englex (English description for PC-KIMMO)
\rules ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm
\header ""
\footer ""
\recordstarttag ""
\recordendtag "\n"
\field
\starttag "\\w "
\endtag "\n"
\path <WORD>
\field
\starttag "\\lx "
\endtag "\n"
\path <LEX>
\field
\starttag "\\pos "
\endtag "\n"
\path <head pos>
\field
\starttag "\\lemma "
\endtag "\n"
\path <root>
\field
\starttag "\\lempos "
\endtag "\n"
\path <root_pos>
For the given set of input words, the following output is created:
\w be
\lx be
\pos V
\lemma be
\lempos V

\w be
\lx be
\pos AUX
\lemma be
\lempos AUX

\w began
\lx be`gan
\pos V
\lemma be`gin
\lempos V

\w but
\lx but
\pos CJ
\lemma but
\lempos CJ

\w but
\lx but
\pos PP
\lemma but
\lempos PP

\w by
\lx by
\pos PP
\lemma by
\lempos PP

\w by
\lx by
\pos AV
\lemma by
\lempos AV

\w child
\lx `child
\pos N
\lemma `child
\lempos N

\w children
\lx `children
\pos N
\lemma `child
\lempos N

\w compute
\lx com`pute
\pos V
\lemma com`pute
\lempos V

\w computer
\lx com`pute+er
\pos N
\lemma com`pute
\lempos V

\w computerize
\lx com`pute+er+ize
\pos V
\lemma com`pute
\lempos V

\w could
\lx could
\pos AUX
\lemma could
\lempos AUX
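Standard format output like this can be read back with a few lines of Python. The sketch below assumes that records are separated by a blank line (produced by the "\n" record end tag) and that each line consists of a backslash marker, a space, and a value. The function name read_sfm is hypothetical.

```python
from typing import Dict, List


def read_sfm(text: str) -> List[Dict[str, str]]:
    """Parse blank-line-separated records of '\\marker value' lines."""
    records = []
    for block in text.strip().split("\n\n"):
        if not block.strip():
            continue
        record = {}
        for line in block.splitlines():
            # The marker runs up to the first space; the rest is the value.
            marker, _, value = line.partition(" ")
            record[marker.lstrip("\\")] = value
        records.append(record)
    return records


sample = (
    "\\w began\n\\lx be`gan\n\\pos V\n\\lemma be`gin\n\\lempos V\n"
    "\n"
    "\\w children\n\\lx `children\n\\pos N\n\\lemma `child\n\\lempos N\n"
    "\n"
)
records = read_sfm(sample)
lemmas = {record["w"]: record["lemma"] for record in records}
```

The lemmas dictionary then maps each surface word to the lemma reported by the \lemma field.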
Consider the following control file:
\rem SGML.CTL - control file for KTagger
\rem Produces output file in SGML LEXICON format
\rem Uses Englex (English description for PC-KIMMO)
\rules ../../../pckimmo/test/eng/english.rul
\lexicon ../../../pckimmo/test/eng/english.lex
\grammar ../../../pckimmo/test/eng/english.grm
\header "<!DOCTYPE LEXICON [\n <!ELEMENT LEXICON - - (LE+)>\n <!ELEMENT LE - - ( W, LX, POS, LEMMA, LEMPOS )>\n <!ELEMENT W - - (#PCDATA)>\n <!ELEMENT LX - - (#PCDATA)>\n <!ELEMENT POS - - (#PCDATA)>\n <!ELEMENT LEMMA - - (#PCDATA)>\n <!ELEMENT LEMPOS - - (#PCDATA)>\n ]>\n\n <lexicon>\n"
\footer "</lexicon>\n"
\recordstarttag "<le>\n"
\recordendtag "</le>\n"
\field
\starttag "<w>"
\endtag "</w>\n"
\path <WORD>
\field
\starttag "<lx>"
\endtag "</lx>\n"
\path <LEX>
\field
\starttag "<pos>"
\endtag "</pos>\n"
\path <head pos>
\field
\starttag "<lemma>"
\endtag "</lemma>\n"
\path <root>
\field
\starttag "<lempos>"
\endtag "</lempos>\n"
\path <root_pos>
For the given set of input words, the following output is created:
<!DOCTYPE LEXICON [
<!ELEMENT LEXICON - - (LE+)>
<!ELEMENT LE - - ( W, LX, POS, LEMMA, LEMPOS )>
<!ELEMENT W - - (#PCDATA)>
<!ELEMENT LX - - (#PCDATA)>
<!ELEMENT POS - - (#PCDATA)>
<!ELEMENT LEMMA - - (#PCDATA)>
<!ELEMENT LEMPOS - - (#PCDATA)>
]>

<lexicon>
<le>
<w>be</w>
<lx>be</lx>
<pos>V</pos>
<lemma>be</lemma>
<lempos>V</lempos>
</le>
<le>
<w>be</w>
<lx>be</lx>
<pos>AUX</pos>
<lemma>be</lemma>
<lempos>AUX</lempos>
</le>
<le>
<w>began</w>
<lx>be`gan</lx>
<pos>V</pos>
<lemma>be`gin</lemma>
<lempos>V</lempos>
</le>
<le>
<w>but</w>
<lx>but</lx>
<pos>CJ</pos>
<lemma>but</lemma>
<lempos>CJ</lempos>
</le>
<le>
<w>but</w>
<lx>but</lx>
<pos>PP</pos>
<lemma>but</lemma>
<lempos>PP</lempos>
</le>
<le>
<w>by</w>
<lx>by</lx>
<pos>PP</pos>
<lemma>by</lemma>
<lempos>PP</lempos>
</le>
<le>
<w>by</w>
<lx>by</lx>
<pos>AV</pos>
<lemma>by</lemma>
<lempos>AV</lempos>
</le>
<le>
<w>child</w>
<lx>`child</lx>
<pos>N</pos>
<lemma>`child</lemma>
<lempos>N</lempos>
</le>
<le>
<w>children</w>
<lx>`children</lx>
<pos>N</pos>
<lemma>`child</lemma>
<lempos>N</lempos>
</le>
<le>
<w>compute</w>
<lx>com`pute</lx>
<pos>V</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>computer</w>
<lx>com`pute+er</lx>
<pos>N</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>computerize</w>
<lx>com`pute+er+ize</lx>
<pos>V</pos>
<lemma>com`pute</lemma>
<lempos>V</lempos>
</le>
<le>
<w>could</w>
<lx>could</lx>
<pos>AUX</pos>
<lemma>could</lemma>
<lempos>AUX</lempos>
</le>
</lexicon>
Note that this output contains exactly the same information as the previous example, except for being packaged as SGML rather than as a standard format file.
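To see that the two packagings carry the same fields, the SGML records can be pulled apart with a short Python sketch. This uses regular expressions rather than a real SGML parser, so it is only adequate for flat, regular output like the sample above; the function name read_le_elements is hypothetical.

```python
import re
from typing import Dict, List

# Matches one <le>...</le> record; DOTALL lets "." span line breaks.
LE_PATTERN = re.compile(r"<le>(.*?)</le>", re.DOTALL)
# Matches one simple field element such as <pos>V</pos>.
FIELD_PATTERN = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def read_le_elements(text: str) -> List[Dict[str, str]]:
    """Return one dictionary of field name to value per <le> element."""
    records = []
    for body in LE_PATTERN.findall(text):
        records.append({tag: value for tag, value in FIELD_PATTERN.findall(body)})
    return records


sample = (
    "<lexicon>\n"
    "<le>\n<w>computer</w>\n<lx>com`pute+er</lx>\n<pos>N</pos>\n"
    "<lemma>com`pute</lemma>\n<lempos>V</lempos>\n</le>\n"
    "</lexicon>\n"
)
records = read_le_elements(sample)
```

Each resulting dictionary has the keys w, lx, pos, lemma, and lempos, matching the markers of the standard format example.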
All of the files in this release of KTagger (source code, executables, examples, documentation) are copyrighted by SIL International (Language Software Development, 7500 W. Camp Wisdom Road, Dallas, TX 75236, U.S.A.). Permission is hereby granted to the user to copy, use, and distribute the KTagger files under the following conditions:
This document was generated on 20 March 2003 using texi2html 1.56k.