Yu. A. Kotov Comparative Analysis of Four Methods for Identifying Letters of Texts


The article compares four known frequency-based methods for identifying the letters of a text. Such methods are needed in applied cryptanalysis, steganography, and the general text analysis problems known in computer science as text mining. To characterize the methods completely and uniformly, an evaluation procedure is proposed that measures three identification errors and combines them into an integral characteristic called the goodness of the method. Using this procedure, one unigram method and three bigram methods were compared experimentally and analyzed qualitatively on representative samples of fragments of Russian texts. The comparison determines the qualitative and quantitative features of the methods, the limits of their effective use, and their dependence on the type and volume of the processed text. It is also shown that, for frequency methods applied to Russian-language texts, an important threshold of text volume is approximately 4,000 characters. This volume is sufficient to identify the alphabet characters of a Russian-language text by frequency with minimal error and, in some cases, to obtain an exact solution. At this volume and above, frequency methods of alphabet character identification, together with the proposed error estimates, can be used to quantify certain stylistic features of a text.
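
As a rough illustration of what unigram frequency identification of letters under a one-to-one substitution can look like, the following Python sketch matches cipher symbols to plaintext letters by frequency rank and reports the share of misidentified symbols. It is a generic textbook-style sketch, not the article's method: the function names, the rank-matching rule, and the single error measure are assumptions made for illustration, and the article's three identification errors and its goodness characteristic are not reproduced here.

```python
from collections import Counter


def rank_by_frequency(text, alphabet):
    """Order the alphabet's symbols from most to least frequent in text.

    Ties are broken by position in the alphabet so the result is deterministic.
    """
    counts = Counter(ch for ch in text if ch in alphabet)
    return sorted(alphabet, key=lambda ch: (-counts[ch], alphabet.index(ch)))


def identify_unigram(ciphertext, reference_text, cipher_alphabet, plain_alphabet):
    """Hypothetical unigram identification: pair symbols of equal frequency rank.

    Returns a dict mapping each cipher symbol to the guessed plaintext letter.
    """
    cipher_rank = rank_by_frequency(ciphertext, cipher_alphabet)
    plain_rank = rank_by_frequency(reference_text, plain_alphabet)
    return dict(zip(cipher_rank, plain_rank))


def identification_error(guess, true_key):
    """Share of cipher symbols whose guessed letter differs from the true one.

    This is an illustrative error measure, not one of the article's three.
    """
    wrong = sum(1 for c, p in true_key.items() if guess.get(c) != p)
    return wrong / len(true_key)


if __name__ == "__main__":
    # Toy three-letter alphabet; real use would need a full alphabet and
    # a large reference corpus of the target language.
    true_key = {"x": "a", "y": "b", "z": "c"}
    guess = identify_unigram("xxxyyz", "aaabbc", "xyz", "abc")
    print(guess, identification_error(guess, true_key))
```

With short texts the observed frequency ranks are unstable, so a rank-matching rule like this one misidentifies many symbols; the error would be expected to fall as the text grows, which is the kind of dependence on text volume that the comparison examines.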


text, alphabet character, unigram, bigram, identification, one-to-one substitution, cipher, text analysis.

PP. 41-56.

DOI 10.14357/20718632190304



