 ISSN 0235-716X
2006, No. 2
Usage of Grammatical Forms of Contemporary Lithuanian in a Morphologically Annotated Corpus
Erika RIMKUTĖ
This paper examines the usage of parts of speech and their grammatical features in the morphologically annotated corpus of the Lithuanian language. The corpus was compiled and processed at the Center of Computational Linguistics of Vytautas Magnus University. It is a set of XML files containing 1 million morphologically annotated running words. Each annotation of a word form contains its normalized form (lemma) and a full set of morphological properties. Non-word textual units, such as punctuation marks, spaces, paragraphs, and numbers, are represented by special marks. The corpus shows that the variety of inflectional forms in actual usage is smaller than the grammatical system allows: highly inflected parts of speech such as verbs and nouns have fewer than 3 word forms on average. Pronouns show a surprisingly large number of word forms actually used in contemporary Lithuanian. Overall, the usage tendencies of the different word classes agree with the data obtained by other researchers: nouns and other nominal words have the largest coverage (39% are nouns, 8.7% pronouns, 7.33% adverbs, 6.72% adjectives, and 20% are verbs). The morphologically annotated corpus is of great importance for the future development of parsing tools, treebanks, and other resources for the Lithuanian language.
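As an illustration of the kind of statistics reported above, the following is a minimal sketch of how part-of-speech shares and the average number of distinct word forms per lemma might be computed from such annotations. The abstract does not describe the corpus schema, so the `word` element and its `form`, `lemma`, and `pos` attributes, the `sep` element, and the function name `pos_statistics` are all hypothetical; the sample data is invented for the example and is not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample: the element and attribute names are assumptions,
# not the actual schema of the annotated corpus.
SAMPLE = """<corpus>
  <word form="namas" lemma="namas" pos="noun"/>
  <word form="namo" lemma="namas" pos="noun"/>
  <word form="namas" lemma="namas" pos="noun"/>
  <word form="eina" lemma="eiti" pos="verb"/>
  <sep type="punct"/>
</corpus>"""


def pos_statistics(xml_text):
    """Return (share, avg_forms) for the annotated words in xml_text.

    share:     part of speech -> percentage of running words
    avg_forms: part of speech -> mean number of distinct forms per lemma
    """
    root = ET.fromstring(xml_text)
    pos_counts = Counter()
    forms = defaultdict(set)  # (pos, lemma) -> distinct word forms seen

    # Only <word> elements are counted; non-word units such as <sep> are skipped.
    for word in root.iter("word"):
        pos = word.get("pos")
        pos_counts[pos] += 1
        forms[(pos, word.get("lemma"))].add(word.get("form").lower())

    total = sum(pos_counts.values())
    share = {pos: 100 * n / total for pos, n in pos_counts.items()}

    lemma_count = Counter()
    form_count = Counter()
    for (pos, _lemma), seen in forms.items():
        lemma_count[pos] += 1
        form_count[pos] += len(seen)
    avg_forms = {pos: form_count[pos] / lemma_count[pos] for pos in lemma_count}

    return share, avg_forms


share, avg_forms = pos_statistics(SAMPLE)
```

On the invented sample, nouns account for three of the four running words, and the single noun lemma occurs in two distinct forms. Figures computed this way depend on how forms are normalized (here, by lowercasing only), which is a design choice of the sketch rather than of the paper.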