DATA PROCESSING AND ANALYSIS
E. A. Golovastova, D. N. Krasotin. Effective Clustering of a Text Sample Depending on the Different Parameterization of this Sample

Abstract.

The Internet has become the primary means of receiving text news. As a result, there is a need for automated processing of large amounts of data. One of the most important tasks is the automated organization of text information. In this paper we consider the problem of effectively clustering objects from a text sample. The most common representation of a text set is a matrix whose elements are values of a statistical measure computed from word frequencies. As an alternative, we propose parameterizing the sample by text keywords. We use two clustering methods: K-means and DBSCAN. The paper analyzes these methods and compares the quality of the clustering results, which depends on the text parameterization and the algorithm used.
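The word-frequency matrix representation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact formula, so the tf-idf variant below (term frequency normalized by document length, idf = log(N / df)), the whitespace tokenization, the function name `tfidf_matrix`, and the sample texts are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] is the tf-idf weight
    of vocabulary[j] in docs[i]. Tokenization is naive whitespace splitting."""
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({w for tokens in tokenized for w in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        # Assumed variant: tf = count / document length, idf = log(N / df).
        row = [(counts[w] / total) * math.log(n_docs / df[w]) for w in vocabulary]
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical three-document sample, one row per text in the result.
docs = [
    "markets rise as banks report profits",
    "banks cut rates as markets fall",
    "team wins final match",
]
vocab, X = tfidf_matrix(docs)
```

Each row of `X` is the parameterization of one text. Rows of this kind are what a clustering method such as K-means or DBSCAN would take as input.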

Keywords:

Clustering, text set, sample parameterization, tf-idf-measure, keywords, effective method.

PP. 60-69.

DOI 10.14357/20718632190406

 
