Text Vectorization Methods for Retrieval-Based Chatbot

The material was received by the Editorial Board: 21.02.2020

Abstract

Dialogue systems and conversational agents are among the most rapidly growing research areas in applied artificial intelligence. Business and industry show increasing interest in building intelligent conversational agents into their products. Chatbots are already used in industry, banking, healthcare, and education, and the number of applications grows every year. Many recent studies have focused on the possibility of creating intelligent bots that not only help users accomplish specific tasks (by identifying their intents from text or voice conversations using artificial intelligence) but also capture the user’s identity, attributes, engagement data, and any feedback the user provides, in order to better handle a wide variety of conversational topics and imitate human-like behavior. In this paper, we review recent progress in developing intelligent conversational agents (chatbots) and their current architectures (rule-based, retrieval-based, and generative models), and discuss the main advantages and disadvantages of each approach. Additionally, we conduct a comparative analysis of state-of-the-art text vectorization methods (i.e., word and sentence embeddings), which we apply in an experimental implementation of a retrieval-based chatbot. The results of the experiment are reported as the quality of the chatbot's response selection, measured with various R10@k metrics. We also describe the features of open data sources that provide dialogues in Russian, and the natural language processing (NLP) techniques applied to the collected dialogue data. Both the final dataset and the program code are published. We also discuss the issues of assessing the quality of chatbot response selection, in particular emphasizing the importance of choosing a proper evaluation method. Finally, we demonstrate examples of chatbot dialogues implemented using the text vectorization models that showed the best performance (TF-IDF-weighted Word2Vec embeddings and LASER sentence embeddings), and briefly describe our future work.
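As an illustration of the R10@k measure mentioned above, the following minimal sketch assumes the definition commonly used for response selection: each test context comes with 10 candidate responses, exactly one of which is the ground truth, and R10@k is the share of contexts for which the ground-truth response is ranked among the top k candidates. The function names (`hit_at_k`, `r10_at_k`) and the toy scores are hypothetical and are not taken from the published code; the exact evaluation protocol of the paper may differ.

```python
from typing import List, Sequence, Tuple


def hit_at_k(scores: Sequence[float], true_index: int, k: int) -> int:
    """Return 1 if the ground-truth candidate is among the k best-scored ones, else 0."""
    # Order candidate indices from the highest to the lowest similarity score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return int(true_index in ranked[:k])


def r10_at_k(examples: List[Tuple[Sequence[float], int]], k: int) -> float:
    """Average hit rate over examples, each holding 10 candidate scores (assumed)."""
    hits = [hit_at_k(scores, true_index, k) for scores, true_index in examples]
    return sum(hits) / len(hits)


# Toy data (hypothetical): similarity scores of 10 candidates per context,
# paired with the index of the ground-truth response.
examples = [
    ([0.9, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.05], 0),  # truth ranked 1st
    ([0.7, 0.9, 0.8, 0.1, 0.0, 0.3, 0.4, 0.5, 0.6, 0.2], 0),   # truth ranked 3rd
]

for k in (1, 2, 5):
    print(f"R10@{k} = {r10_at_k(examples, k):.2f}")
```

In practice the scores would be similarities (for example, cosine similarity) between the vectorized context and each vectorized candidate response, so the same function can be used to compare different vectorization methods on one test set.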

Keywords: natural language processing, natural language understanding, dialogue systems, conversational AI, intelligent chatbot, retrieval-based chatbot, word embeddings, text vectorization, generative models

References: Chizhik Anna, Zherebtsova Yulia. Text Vectorization Methods for Retrieval-Based Chatbot. Vestnik NSU. Series: Linguistics and Intercultural Communication. 2020. Vol. 18, 3.