Concentration indices for measuring and comparing of word frequency lists

The material was received by the Editorial Board: 05.02.2017
Abstract
We analyze a system of indices that characterize the frequency concentration and scattering of lexical units in word frequency lists. If a word frequency list is presented as a rank distribution, the classic index proposed by the Italian scholar C. Gini (the Gini index, or Gini ratio) can be applied to it. Other indices applicable here are the index proposed by the Russian statistician V. P. Trofimov and two indices proposed by G. Ya. Martynenko, which are based on the rank mean. The relationship between these four indices is examined, and their applicability to studying the structure of word frequency lists is demonstrated. The indices are important summary statistics that make it possible to compare different word frequency lists in terms of the concentration and scattering of lexical units. The paper further examines classical statistical distributions (Zipf-Pareto, Weibull, logistic) in rank form, together with the analytical expressions corresponding to these distributions. The application of the concentration indices is demonstrated on three word frequency lists of classical Russian fiction (by Anton Chekhov, Leonid Andreev and Alexander Kuprin), a specialized word frequency dictionary of electronics, and two small frequency dictionaries.
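As an illustration of how a concentration index can be read off a word frequency list, the following minimal sketch computes the Gini index using the standard textbook formula for a finite sample. It is an assumption that this matches the formulation used in the paper, which may normalize the index differently. The function name `gini` and the toy frequencies are hypothetical and serve only as an example. The Trofimov and Martynenko indices are not sketched here, since their definitions are not given in this abstract.

```python
def gini(frequencies):
    """Gini index of a list of non-negative word frequencies.

    Uses the standard sample formula (an assumption, not necessarily
    the paper's own normalization):
        G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    where x_1 <= ... <= x_n are the frequencies in ascending order.
    """
    xs = sorted(frequencies)          # ascending order, as the formula requires
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero frequency")
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical toy frequency lists (not data from the paper).
uniform = [5, 5, 5, 5]        # every word equally frequent: no concentration
skewed = [10, 0, 0, 0]        # all occurrences fall on a single word

print(gini(uniform))          # 0.0
print(gini(skewed))           # 0.75, i.e. (n - 1) / n for n = 4
```

Under this formula the index is 0 when all words are equally frequent and approaches 1 as the occurrences concentrate on a few high-rank words, so two frequency lists can be compared by a single number regardless of their length.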

Keywords: word frequency list, automatic text analysis, rank distribution, status distribution, concentration, dispersion, rank means, concentration index, Gini index, Trofimov index, Martynenko indices, Zipf distribution, Weibull distribution, logistic distribution

References: Martynenko, G. Ya. Concentration indices for measuring and comparing of word frequency lists. NSU Vestnik Journal, Series: Linguistics and Intercultural Communication. 15, 1. P. 41–53.