- NSU Vestnik. Series: Linguistics and Intercultural Communication
- 2018
- Volume 16. Issue 2
- Computer and Applied Linguistics
AUTOMATIC EXTRACTION OF FORMULAIC EXPRESSIONS FROM RUSSIAN TEXTS
The material was received by the Editorial Board: 31.03.2018
Abstract: This paper describes the automatic extraction of linguistic items that we call formulaic expressions from Russian drama texts. By formulaic expressions (FEs) we mean multiword constructions that contain no variables and are used as reactions to verbal stimuli. We treat FEs as a specific kind of construction within the framework of Construction Grammar. They are therefore to be described in the Constructicon project, a web platform that presents the constructions of a single language in a form that supports automatic search by various parameters. To facilitate the compilation of a comprehensive FE list, we developed a module for automatic FE extraction. Implementing the module involved several stages, including manual annotation of dramatic texts. The first step was to describe the features of FEs and how they differ from other syntactic items such as parenthetical words, lexical verbs, and meaningful parts of a sentence. Afterwards, two annotators marked FEs in 24 dramatic texts, and 46 texts were annotated semi-automatically. Subsequently, we used the 34 dramatic texts with the highest inter-annotator agreement. FE extraction consists of splitting the text into special fragments corresponding to clauses, predicting from a particular feature set whether each fragment is an FE, and compiling the final list of FEs. For prediction, we use a uniformly weighted vote of four classifiers (Random Forest Classifier, Logistic Regression, Ridge Classifier, Support Vector Classifier), which outperformed both a rule-based baseline and classifiers outside the ensemble. We also compared the prediction quality of systems based on different feature sets and adopted the one that uses all the features. The best quality achieved so far is precision 0.30 and recall 0.73 (F1-score 0.42). Further development includes improving the preprocessing stage and making use of the left context, where the FE stimulus is located. We also consider using distributional semantic models such as word2vec for word embeddings, as well as neural networks.
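The voting step and the reported metric can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `uniform_vote` and `f1_score`, the toy predictions, and the rule that a tie counts as "not an FE" are all assumptions made for illustration. Each classifier is assumed to emit one binary label per text fragment (1 = FE, 0 = not an FE), and every classifier carries the same weight.

```python
from typing import List, Sequence


def uniform_vote(predictions: Sequence[Sequence[int]]) -> List[int]:
    """Combine binary labels (1 = FE, 0 = not FE) from several classifiers.

    Every classifier has the same weight. A fragment is labelled FE only when
    strictly more than half of the classifiers vote for it; a tie falls back
    to 0 (an assumption of this sketch, not stated in the abstract).
    """
    if not predictions:
        raise ValueError("need predictions from at least one classifier")
    n_classifiers = len(predictions)
    n_fragments = len(predictions[0])
    if any(len(p) != n_fragments for p in predictions):
        raise ValueError("all classifiers must label the same fragments")

    combined = []
    for i in range(n_fragments):
        votes_for_fe = sum(p[i] for p in predictions)
        combined.append(1 if 2 * votes_for_fe > n_classifiers else 0)
    return combined


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels from four classifiers for five fragments.
random_forest = [1, 0, 1, 1, 0]
logistic_regression = [1, 0, 0, 1, 0]
ridge = [1, 1, 0, 0, 0]
support_vector = [0, 0, 1, 1, 0]

ensemble_labels = uniform_vote(
    [random_forest, logistic_regression, ridge, support_vector]
)
reported_f1 = f1_score(0.30, 0.73)
```

With an even number of voters a 2:2 split is possible, so any real system needs an explicit tie rule; the one used here favours precision. Applying `f1_score` to the rounded precision and recall gives roughly 0.425, which agrees with the reported 0.42 once rounding of the two inputs is taken into account.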
Keywords: formulaic expressions, construction grammar, machine learning, automatic entity extraction.
References: Puzhaeva S.Yu., Gerasimenko E.A., Zakharova E.S., Rakhilina E.V. AUTOMATIC EXTRACTION OF FORMULAIC EXPRESSIONS FROM RUSSIAN TEXTS. NSU Vestnik. Series: Linguistics and Intercultural Communication. 2018. Vol. 16, no. 2. P. 5–18. DOI: 10.25205/1818-7935-2018-16-2-5-18