Quantitative Estimation of Grammatical Ambiguity: Case of European Languages

The material was received by the Editorial Board: 25.10.2019

Abstract 

 Grammatical ambiguity (multiple sets of grammatical features for one word form, or coinciding surface forms of different words) can be of different types. We distinguish six classes of word forms with respect to grammatical ambiguity: unambiguous, ambiguous by grammatical features, by part of speech, by lemma, by lemma and part of speech, and out-of-vocabulary words. These classes are found in all languages, but the distribution of words among them may vary significantly. We calculated and analysed the statistics of these six classes for a number of European languages. We found that the distribution of words among these classes depends primarily on the basic linguistic features of a language that determine its typological class. Although the distribution is influenced by text style and by the vocabulary considered, its distinctive shape is preserved under different conditions and differs significantly from the distributions for other languages. That the shape is primarily defined by linguistic properties is corroborated by the observation that closely related languages showed similar properties in our experiments as far as their ambiguous words are concerned. We established that Slavic languages feature a low rate of part-of-speech ambiguous words and a high rate of words that are ambiguous by grammatical features. The former is also true for French and Italian, and the latter holds for German and Swedish, but the combination of these traits is characteristic of Slavic languages alone. The experiments showed that reducing the grammatical feature set does not change the shape of the distribution and therefore does not reflect similarity among languages. On the other hand, we found that in all the languages considered the 1000 most frequent words are distributed among the ambiguity classes differently from the rest of the words. At the same time, for the majority of the languages considered, less frequent words are less ambiguous by part of speech. In Romance and Germanic languages, ambiguity is reduced for less frequent words. We also investigated the differences in statistics for texts of different genres in Russian. We found that fiction texts are more ambiguous by part of speech than newswire texts, which are in turn more ambiguous by grammatical features. Our results suggest that the quality of multilingual morphological taggers should be measured on ambiguous words only, as opposed to all words of the processed text. Such an approach can help obtain a more objective linguistic picture and enhance the performance of linguistic tools.
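
The six classes named in the abstract can be illustrated with a minimal sketch. The decision rule below is an assumption made for illustration, not the procedure used in the paper: a word form is assigned to a class by comparing the lemmas and part-of-speech tags of all its dictionary analyses. The names `Analysis`, `ambiguity_class`, `class_distribution`, `accuracy_on_ambiguous` and the toy English lexicon `TOY_LEXICON` are hypothetical. The last function sketches the suggested evaluation of a tagger on ambiguous words only; excluding out-of-vocabulary words there is likewise an assumption.

```python
from collections import Counter
from typing import NamedTuple


class Analysis(NamedTuple):
    """One dictionary reading of a word form (hypothetical structure)."""
    lemma: str
    pos: str
    feats: str


def ambiguity_class(analyses):
    """Assign a word form to one of the six classes (assumed decision rule)."""
    readings = set(analyses)
    if not readings:
        return "out-of-vocabulary"
    if len(readings) == 1:
        return "unambiguous"
    lemmas = {r.lemma for r in readings}
    tags = {r.pos for r in readings}
    if len(lemmas) > 1 and len(tags) > 1:
        return "lemma and part of speech"
    if len(lemmas) > 1:
        return "lemma"
    if len(tags) > 1:
        return "part of speech"
    # Same lemma and same part of speech: only the feature sets differ.
    return "grammatical features"


def class_distribution(tokens, lexicon):
    """Share of tokens falling into each ambiguity class."""
    counts = Counter(ambiguity_class(lexicon.get(t.lower(), ())) for t in tokens)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def accuracy_on_ambiguous(tokens, gold, predicted, lexicon):
    """Tagging accuracy over ambiguous tokens only.

    Unambiguous and out-of-vocabulary tokens are skipped (an assumption).
    Returns None when the text contains no ambiguous tokens.
    """
    skipped = {"unambiguous", "out-of-vocabulary"}
    pairs = [
        (g, p)
        for t, g, p in zip(tokens, gold, predicted)
        if ambiguity_class(lexicon.get(t.lower(), ())) not in skipped
    ]
    if not pairs:
        return None
    return sum(g == p for g, p in pairs) / len(pairs)


# Toy lexicon invented for illustration; it is not data from the paper.
TOY_LEXICON = {
    "cat": [Analysis("cat", "NOUN", "Number=Sing")],
    "sheep": [
        Analysis("sheep", "NOUN", "Number=Sing"),
        Analysis("sheep", "NOUN", "Number=Plur"),
    ],
    "walk": [
        Analysis("walk", "NOUN", "Number=Sing"),
        Analysis("walk", "VERB", "VerbForm=Inf"),
    ],
    "found": [
        Analysis("find", "VERB", "Tense=Past"),
        Analysis("found", "VERB", "VerbForm=Inf"),
    ],
    "saw": [
        Analysis("see", "VERB", "Tense=Past"),
        Analysis("saw", "NOUN", "Number=Sing"),
    ],
}
```

With this rule, "sheep" is ambiguous by grammatical features, "walk" by part of speech, "found" by lemma, and "saw" by lemma and part of speech. In practice the lexicon would come from a morphological dictionary for each language, and the class shares returned by `class_distribution` would be compared across languages, genres, or frequency bands.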

Keywords: natural written language processing, grammatical ambiguity, statistics of occurrence

References: Klyshinsky, Eduard S.; Logacheva, Varvara K.; Karpik, Olesya V.; Bondarenko, Alexander V. Quantitative Estimation of Grammatical Ambiguity: Case of European Languages. NSU Vestnik. Series: Linguistics and Intercultural Communication. 2020, Vol. 18, No. 1. P. 5–21. DOI: 10.25205/1818-7935-2020-18-1-5-21