Information and Innovations

Problems of Algorithms Development to Determine Quality of Topic Models Ensembles for Make Rubricators

https://doi.org/10.31432/1994-2443-2018-13-3-53-58

Abstract

Data mining is one of the most active areas of current research. Its range of applications is extremely wide and covers practically all scientific disciplines. One pressing task is the analysis of text collections in order to establish thematic headings, which are to be classified as separate articles following the principle of systematization “from the general to the particular”, and to form a list of “core” categories. Clustering, and topic modeling in particular, is one of the methods of intelligent text analysis. The problem of clustering text collections is fundamentally ambiguous, for several reasons. First, there is no single, clearly best criterion of clustering quality: many reasonable criteria exist, but they can give different results. Second, the number of clusters is usually unknown in advance and is determined according to some subjective criterion. Third, the clustering result depends significantly on the distance metric, which is usually chosen subjectively by an expert. Ensembles of models are becoming increasingly widespread among data mining techniques and can significantly improve the accuracy of modeling results. The main purpose of this research is to increase the effectiveness of clustering textual information by using ensembles of topic models. This article describes the use of a voting algorithm based on a group of different evaluation algorithms. The voting algorithm makes it possible to select the most appropriate solution, to accurately assess the quality of the topic model, and to generate a set of relevant topics. A computational experiment shows that the results generally agree with expert assessments and with evaluations by formal criteria. A concept for evaluating the quality of topic model ensembles using a simple voting algorithm was explored and is proposed for further research.
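The abstract does not spell out the voting procedure, so the following is only a minimal sketch of one plausible form of simple voting, not the authors' implementation. It assumes each evaluation criterion casts one vote for the candidate topic model it scores best, and the candidate with the most votes is selected. The function name `simple_vote`, the criterion names, the candidate labels, and all numeric scores are hypothetical and purely illustrative.

```python
from collections import Counter


def simple_vote(scores, higher_is_better):
    """Pick the candidate model preferred by the most criteria.

    scores: {criterion: {candidate: value}}
    higher_is_better: {criterion: bool}, direction of each criterion.
    Returns (list of winning candidates, vote counts per candidate).
    """
    votes = Counter()
    for criterion, by_candidate in scores.items():
        pick = max if higher_is_better[criterion] else min
        # Iterate candidates in sorted order so ties inside one
        # criterion are broken deterministically.
        best = pick(sorted(by_candidate), key=by_candidate.get)
        votes[best] += 1
    top = max(votes.values())
    winners = sorted(c for c, v in votes.items() if v == top)
    return winners, dict(votes)


# Hypothetical scores for three candidate models (illustrative numbers only).
scores = {
    "coherence": {"k=10": 0.41, "k=20": 0.47, "k=30": 0.44},
    "perplexity": {"k=10": 1250.0, "k=20": 1180.0, "k=30": 1120.0},
    "silhouette": {"k=10": 0.12, "k=20": 0.19, "k=30": 0.15},
}
higher_is_better = {"coherence": True, "perplexity": False, "silhouette": True}

winners, votes = simple_vote(scores, higher_is_better)
print(winners, votes)
```

Returning a list of winners rather than a single label keeps ties between candidates visible, so a caller can defer to expert judgment instead of having the tie broken silently.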

About the Authors

A. P. Shiryaev
National Research University of Electronic Technology, Moscow, Russia
Russian Federation


A. R. Fedorov
National Research University of Electronic Technology, Moscow, Russia
Russian Federation


P. A. Fedorov
National Research University of Electronic Technology, Moscow, Russia
Russian Federation


L. G. Gagarina
National Research University of Electronic Technology, Moscow, Russia
Russian Federation


E. M. Portnov
National Research University of Electronic Technology, Moscow, Russia
Russian Federation



For citations:


Shiryaev A.P., Fedorov A.R., Fedorov P.A., Gagarina L.G., Portnov E.M. Problems of Algorithms Development to Determine Quality of Topic Models Ensembles for Make Rubricators. Information and Innovations. 2018;13(3):53-58. (In Russ.) https://doi.org/10.31432/1994-2443-2018-13-3-53-58

This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 1994-2443 (Print)
ISSN 2949-2157 (Online)