Journal "Information Technologies and Computing Systems" - E. A. Golovastova, D. N. Krasotin Effective Clustering of a Text Sample Depending on the Different Parameterization of this Sample

Viewing issue 2019 / 04

CONTROL AND DECISION-MAKING

M. A. Kudrov, K. D. Bukharov, E. A. Zakharov, D. R. Mahotkin, N. E. Krivoshein, N. A. Grishin, V. Semenkin Intelligent control algorithm for a group of unmanned aerial vehicles

CONTROL SYSTEMS

S. A. Ilyuhin, D. V. Polevoy, T. S. Chernov Improving the Accuracy of Neural Network Methods of Verification of Persons by Spatial-Weighted Normalization of Brightness Image

SOFTWARE ENGINEERING

A. S. Suleikin, N. N. Bakhtadze Architecture Models of Supply Chain Management Digital Ecosystem

DATA PROCESSING AND ANALYSIS

R. N. Ermakov, V. V. Alekseev Primary Data Processing for Constructing Network Package Classifiers in Deep Packet Inspection Analysis and in the Intrusion Detection Systems

R. K. Klassen, V. A. Raikhlin Improving the Efficiency of ClusterixLike DBMS for Big Data Analytical Processing

E. A. Golovastova, D. N. Krasotin Effective Clustering of a Text Sample Depending on the Different Parameterization of this Sample

V.N. Gridin, D.S. Smirnov, V.A. Perepelov The development of Modern Tools for Morphometric Analysis of the Hippocampus of the Brain According to MRI

PATTERN RECOGNITION

E. I. Andreeva, V. V. Arlazarov, A. V. Gayer, E. P. Dorokhov, A.V. Sheshkus, O.A. Slavin Document Recognition Method Based on Convolutional Neural Network Invariant to 180 Degree Rotation Angle

I. M. Janiszewski, V. V. Arlazarov, D. G. Slugin Achieving Statistical Dependence of the CNN Response on the Input Data Distortion for OCR Problem

SECURITY ISSUES

G.P. Akimova, A.Yu. Danilenko, E.V. Pashkina, M.A. Pashkin, A.A. Podrabinovich, A.V. Soloviev, I.V. Tumanova Ensuring Safety in the digitalization of Educational Institutions


E. A. Golovastova, D. N. Krasotin Effective Clustering of a Text Sample Depending on the Different Parameterization of this Sample
Abstract. The Internet has become the primary means of receiving text news. As a result, large volumes of data must be processed automatically. One of the most important tasks is the automated organization of text information. In this paper we consider the problem of effectively clustering the objects of a text sample. The most common representation of a text set is a matrix whose elements are values of a statistical measure computed from word frequencies. In contrast, we propose parameterizing the sample by the keywords of the texts. Clustering is performed with two methods: K-means and DBSCAN. The paper analyzes these methods and compares the quality of the resulting clusterings depending on the text parameterization and the algorithm used.

Keywords: clustering, text set, sample parameterization, tf-idf measure, keywords, effective method.

PP. 60-69.

DOI 10.14357/20718632190406
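
As a rough illustration of the two parameterizations contrasted in the abstract, the sketch below builds a tf-idf matrix and a binary keyword matrix for a toy corpus. It is not the authors' implementation. The corpus, the `KEYWORDS` list, the whitespace tokenizer, the unsmoothed `log(N / df)` form of idf and all function names are assumptions made for this example.

```python
import math
from collections import Counter

# Toy corpus and keyword list: both are illustrative assumptions,
# not data from the paper.
docs = [
    "stocks fall as markets react to rate hike",
    "central bank announces rate hike markets fall",
    "team wins final match after late goal",
    "late goal decides final match for home team",
]
KEYWORDS = ["markets", "rate", "match", "goal"]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_matrix(tokenized):
    """Return (vocab, rows): one row per document, one column per term."""
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(w for doc in tokenized for w in set(doc))
    # Unsmoothed idf; a term present in every document gets weight 0.
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        rows.append([counts[w] / len(doc) * idf[w] for w in vocab])
    return vocab, rows


def keyword_matrix(tokenized, keywords):
    """Binary parameterization: 1 if the keyword occurs in the document."""
    rows = []
    for doc in tokenized:
        present = set(doc)
        rows.append([int(k in present) for k in keywords])
    return rows


tokenized = [tokenize(d) for d in docs]
vocab, X_tfidf = tfidf_matrix(tokenized)
X_keywords = keyword_matrix(tokenized, KEYWORDS)
```

Either matrix could then be passed to a clustering routine, for example `KMeans` or `DBSCAN` from scikit-learn. The keyword matrix has one column per keyword instead of one per vocabulary term, so it is much narrower than the tf-idf matrix.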
