A new toolkit for natural text processing with the TXM platform and its application to a corpus for analysis of texts propagating extremist views

The material was received by the Editorial Board: 29.05.2018

Abstract
The TXM platform provides a wide range of corpus analysis tools, including correspondence analysis, clustering, lexical table construction, and parametrized subcorpus selection. The default structural unit of analysis in TXM is the token. The only TXM extension available by default is TreeTagger, which performs automated morphological analysis and lemmatization during corpus import. However, each token can be supplied with a number of additional features that enable more advanced text analysis. In this work we present a set of tools developed for more extensive, complex, and flexible corpus analysis with TXM, relying both on tools previously developed by our team and on publicly available software libraries. We focus in particular on a stemming technique based on word structural patterns and on noun phrase recognition, which together make it possible to run more sophisticated and powerful queries and analyses of the corpus that are not limited to word forms.
The structural pattern stemming method is based on a set of language-specific rules that separate a word stem from all affixes. The recognition of noun phrases is based on rules that detect subordination and coordination relations among nouns. These extensions improve the performance of the statistical tools used by TXM, such as specificity scores and correspondence analysis. The new set of tools has been tested on a corpus that includes texts marked as «extremist» by experts along with «neutral» texts in similar domains. The corpus of approximately 900,000 words is divided into eight subcorpora: neutral texts are opposed to seven thematic subcorpora considered extremist (namely aggressive, fascist, ideological, nationalistic, religious, separatist, and terroristic). The specificity analysis detects the words (or other structural units) that are significantly more or less frequent in a given subcorpus than in the entire corpus. The specificity score for selected units can be compared across all the subcorpora in order to verify their difference or similarity. The correspondence analysis produces a chart where the subcorpora are represented as points in a two-dimensional space based on their similarity in the frequency of selected units. All tests demonstrated a significant difference between the neutral texts, on the one hand, and the marked texts, on the other. Two «extremist» subcorpora, religious and ideological, demonstrated similar results and can probably be merged. These findings encourage further research on fully automatic or computer-aided expert recognition of extremist texts.
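
The idea behind a specificity score can be illustrated with a minimal sketch. It assumes the hypergeometric model usually associated with specificity in textometry; the abstract itself does not state the exact formula, so this should not be read as the authors' or TXM's implementation. The function names and the toy counts are hypothetical and chosen only for illustration.

```python
from math import comb, log10, inf


def hypergeom_pmf(k, corpus_size, word_total, sub_size):
    """Probability of seeing exactly k occurrences of a unit in a subcorpus
    of sub_size tokens, given word_total occurrences among corpus_size tokens."""
    return (comb(word_total, k) * comb(corpus_size - word_total, sub_size - k)
            / comb(corpus_size, sub_size))


def specificity(k, corpus_size, word_total, sub_size):
    """Signed score (assumed form): positive when the unit is over-represented
    in the subcorpus, negative when it is under-represented. The magnitude is
    -log10 of the corresponding hypergeometric tail probability."""
    expected = word_total * sub_size / corpus_size
    lowest = max(0, word_total + sub_size - corpus_size)
    highest = min(word_total, sub_size)
    if k >= expected:
        # Over-representation: probability of k or more occurrences.
        tail = sum(hypergeom_pmf(i, corpus_size, word_total, sub_size)
                   for i in range(k, highest + 1))
        sign = 1.0
    else:
        # Under-representation: probability of k or fewer occurrences.
        tail = sum(hypergeom_pmf(i, corpus_size, word_total, sub_size)
                   for i in range(lowest, k + 1))
        sign = -1.0
    tail = min(tail, 1.0)  # guard against floating-point rounding above 1
    return sign * inf if tail == 0.0 else sign * -log10(tail)


# Toy counts (hypothetical): a 1,000-token corpus, a 100-token subcorpus,
# and a unit that occurs 50 times in the whole corpus.
print(specificity(20, 1000, 50, 100))  # unit is frequent in the subcorpus
print(specificity(0, 1000, 50, 100))   # unit is absent from the subcorpus
```

Under these assumptions, comparing the score of the same unit across subcorpora gives a rough picture of which subcorpora over-use or under-use it relative to the corpus as a whole.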

Keywords: corpus linguistics, automated morphological analysis, automated syntactic parsing, TXM platform, correspondence analysis, specificity, detecting extremist texts.

References: Lavrentiev A.M., Solovyev F.N., Suvorova (Ananyeva) M.I., Fokina A.I., Chepovskiy A.M. A new toolkit for natural text processing with the TXM platform and its application to a corpus for analysis of texts propagating extremist views. NSU Vestnik Journal, Series: Linguistics and Intercultural Communication. 2018, vol. 16, no. 3, pp. 19–31. DOI: 10.25205/1818-7935-2018-16-3-19-31