ptmorph -- statistical morphological tokeniser

(The "pt" in ptmorph comes from "prefix tree".)

Ptmorph is an implementation of a simple and well-known idea -- extracting information from character streams by predicting the next character. The assumption is that at a morphological boundary, the effectiveness of prediction will suddenly drop.

The method is very vulnerable to "false morphemes" such as overly common letter combinations. Therefore, you should make sure that every phoneme in your language is written as one character (ptmorph uses gauche, which uses UTF-8 for input and output). For example, if your language has a lot of "ch" combinations but "c" is almost never followed by any other character, the prediction success spikes between the "c" and the "h" of every "ch" and then falls back, which in turn causes an almost certain false "morpheme boundary" after the "h".

Ptmorph works in three modes. You must first convert your corpus into a prefix tree representation, which is saved into a database file; this database file can then be used to analyse further data into morphemes or to generate random gibberish that resembles the corpus data.

Unlike most history-based prediction algorithms, ptmorph simultaneously uses histories of 0, 1, ..., max-depth characters to predict the next character. This means that adding depth to the tree does not generally make the results worse; it just takes (a lot of) additional space.

Modes

ptmorph build

Reads data from the given file and writes out a prefix tree of the given depth. The size of the prefix tree is governed by the number of unique character sequences of that length in your corpus. A good depth to start with is 6 for most languages.

ptmorph analyse

Reads a prefix tree from the given file and processes data from standard input by adding "|" (pipe/bar) characters at probable morpheme boundaries. The precision argument controls the sensitivity of the analysis -- the higher the precision, the lower the number of boundaries. A good precision for most applications is 2.0.

ptmorph generate

This is included mostly for fun. Generates the requested number of characters of gibberish that is "predictable" by the prefix tree in the given file.
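
The build and analyse steps can be illustrated with a small Python sketch. This is not ptmorph's actual implementation (ptmorph runs on gauche), and everything specific in it is an assumption made for illustration: the function names, the flat dictionary standing in for the prefix tree, the plain averaging of predictions over history lengths 0..max_depth, and the rule that a boundary is marked when predictability falls by more than a factor of the precision value.

```python
from collections import defaultdict


def build_counts(text, max_depth):
    """Count how often each character follows every history of length 0..max_depth.

    A flat dict keyed by the history string stands in for the prefix tree
    (an assumption; ptmorph's on-disk format is not described here).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, ch in enumerate(text):
        for d in range(max_depth + 1):
            if i - d < 0:
                break  # not enough preceding characters for this depth
            counts[text[i - d:i]][ch] += 1
    return counts


def predictability(counts, history, ch, max_depth):
    """Average P(ch | last d characters) over all usable d in 0..max_depth.

    Averaging is a hypothetical way of combining the history lengths.
    """
    probs = []
    for d in range(min(max_depth, len(history)) + 1):
        context = history[len(history) - d:]
        followers = counts.get(context)
        if not followers:
            break  # context never seen; longer contexts are unseen too
        probs.append(followers.get(ch, 0) / sum(followers.values()))
    return sum(probs) / len(probs) if probs else 0.0


def segment(counts, word, max_depth, precision):
    """Insert '|' before a character whose predictability dropped sharply.

    A boundary is marked when the score is more than `precision` times
    lower than the previous character's score, so a higher precision
    yields fewer boundaries.
    """
    out = [word[0]] if word else []
    prev = None
    for i in range(1, len(word)):
        score = predictability(counts, word[:i], word[i], max_depth)
        if prev is not None and score * precision < prev:
            out.append("|")
        out.append(word[i])
        prev = score
    return "".join(out)


if __name__ == "__main__":
    corpus = "walked walking talked talking jumped jumping "
    tree = build_counts(corpus, 3)
    for w in ["walking", "jumped"]:
        print(segment(tree, w, 3, 2.0))
```

On a toy corpus like the one above the boundaries are not meaningful; the sketch only shows where the depth and precision values enter the computation.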