Texts of “internet confessions” as a source for training data set for the research on the sentiment-analysis field

The material was received by the Editorial Board:

Abstract

The article analyzes the validity of Internet confession texts as a source of training data for a computer classifier that sorts Russian-language Internet texts by their emotional tonality. The classifier, based on Lövheim’s cube of emotion, is expected to detect eight classes of emotion represented in a text or to assign the text to an emotionally neutral class. The first and one of the most important stages in building the classifier is the selection of the training data set, that is, the data actually used to fit the model. The Internet text genres traditionally used in sentiment analysis to train two- or three-class tonality classifiers are tweets, film and product reviews, blogs, and financial reports. The novelty of our project lies in designing a multiclass classifier, which requires new, non-trivial training data. For this purpose, we chose texts from the public group Overheard on the Russian social network VKontakte. Because the texts share common features, we grouped them under the genre name “Internet confession”. To characterize the genre, we applied the method of narrative semiotics, describing six positions that form the deep narrative structure of an “Internet confession”: Addresser (a person aware of his/her separateness from society); Addressee (society/public opinion); Subject (a narrator describing his/her emotional state); Object (the person’s self-image); Helper (the person’s frankness); Adversary (the person’s shame). These genre features determine its primary, qualitative advantage: the texts focus specifically on emotionality, whereas more traditional sources of textual data rest on categories such as expressivity (tweets) or axiological evaluation (reviews of all kinds).

The structural analysis of the texts under discussion has also revealed several advantages that stem from the technological basis of the Overheard project: the hashtagging of texts spares the researcher from having to submit the whole collection to crowdsourced assessment; text size is optimal for assessment by experts; and, despite their hyperbolized emotionality, texts of the Internet confession genre share the stylistic features typical of various types of personal Internet discourse.

However, the narrative character of all Internet confession texts imposes some restrictions on their use within a sentiment analysis project.
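
The nine-class label scheme described above (eight emotion classes plus a neutral class) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not list the eight classes, so the label names below are an assumption taken from the basic affects that Lövheim's cube places at its corners; `Emotion`, `LabeledText` and `label_counts` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Emotion(Enum):
    """Target classes: the eight cube corners plus a neutral class.

    Label names are an assumption based on Lövheim's cube of emotion;
    the abstract itself does not enumerate them.
    """
    SHAME_HUMILIATION = "shame/humiliation"
    DISTRESS_ANGUISH = "distress/anguish"
    FEAR_TERROR = "fear/terror"
    ANGER_RAGE = "anger/rage"
    CONTEMPT_DISGUST = "contempt/disgust"
    SURPRISE = "surprise"
    ENJOYMENT_JOY = "enjoyment/joy"
    INTEREST_EXCITEMENT = "interest/excitement"
    NEUTRAL = "neutral"


@dataclass(frozen=True)
class LabeledText:
    """One training example: a text and its single assigned class."""
    text: str
    label: Emotion


def label_counts(samples):
    """Count examples per class, keeping classes with zero examples."""
    counts = {emotion: 0 for emotion in Emotion}
    for sample in samples:
        counts[sample.label] += 1
    return counts


# Placeholder examples only; these are not texts from the Overheard corpus.
samples = [
    LabeledText("placeholder text A", Emotion.SHAME_HUMILIATION),
    LabeledText("placeholder text B", Emotion.NEUTRAL),
    LabeledText("placeholder text C", Emotion.SHAME_HUMILIATION),
]
counts = label_counts(samples)
```

Keeping zero-count classes in the tally makes it easy to see which of the nine classes are still missing from a candidate training set before it goes to expert assessment.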

Keywords: sentiment analysis, training data set, Internet texts, Internet confession genre, social networks, narratives.


For citation: Kolmogorova A.V. Texts of “internet confessions” as a source for training data set for the research on the sentiment-analysis field. NSU Vestnik Journal, Series: Linguistics and Intercultural Communication, vol. 17, no. 3.