M.A. Panov1, I.I. Ivanovichev2
1,2 Ural State University of Economics (Yekaterinburg, Russia)
1 panov79@ya.ru, 2 Ivan0vi4ev.ivan@gmail.com
Problem statement. Social networks, forums, and advertising platforms that carry reviews from manufacturers and buyers serve as sources of complete and reliable information about how relevant and truthful the offered services are and how satisfied consumers are with their quality. However, to increase conversion or, conversely, to damage a manufacturer's reputation, some users artificially inflate the volume of content with automatically generated text. This paper discusses the analysis of textual big data using the Naive Bayes machine learning algorithm.
Purpose. To develop a method for processing, grouping, and annotating materials from reference and information sites in order to identify fake reviews using a Naive Bayes classifier. The main task is to create a system that automatically identifies fake reviews on various Internet resources with the lowest possible share of misclassifications.
Results. The classifier was applied to user reviews on the analyzed resource. Based on its training data, it estimated the ratio of genuine reviews to fake reviews written by Internet bots on the site. Text classifiers make it possible to quickly analyze heterogeneous big data from web content and are a cost-effective technology for intelligent systems. The main advantage of the model is that it is trained on real reviews taken from open Internet sources, owing to which it shows high accuracy in distinguishing genuine reviews from fake ones in practice.
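The idea of classifying reviews and then estimating the share of fake ones can be illustrated with a minimal sketch. It assumes a standard multinomial Naive Bayes model with Laplace (add-one) smoothing written in plain Python; it is not the authors' implementation, and the class name `NaiveBayes`, the helper `tokenize`, the labels, and all review texts are invented for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing (illustrative)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training set (made-up reviews, hypothetical labels).
train_texts = [
    "best product ever buy now amazing deal",
    "amazing amazing best deal click now",
    "battery lasted two days, screen scratched after a week",
    "delivery was late but the support team replied quickly",
]
train_labels = ["fake", "fake", "real", "real"]

model = NaiveBayes().fit(train_texts, train_labels)

# Unlabeled reviews from a hypothetical site.
new_reviews = [
    "best deal ever buy now",
    "the screen cracked after two days",
    "amazing product click now",
]
predictions = [model.predict(r) for r in new_reviews]
fake_share = predictions.count("fake") / len(predictions)
print(predictions, f"fake share: {fake_share:.2f}")
```

On a real resource the training set would have to be far larger and the preprocessing language-aware; the sketch only shows how per-class word frequencies turn into a predicted label and an aggregate ratio of fake to genuine reviews.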
Practical significance. The study offers practical tools for combating fake reviews and improving the reliability of information on reference and information sites, which matters to businesses, consumers, and the scientific community. Because the model is trained on real data and the classifier is highly accurate, it can be applied to various Internet resources to reduce the number of fake reviews.
Panov M.A., Ivanovichev I.I. Development of a method for processing, grouping and annotating materials from reference and information sites in order to identify fake reviews using the Naive Bayes classifier algorithm. Neurocomputers. 2023. V. 25. № 6. P. 13–26. DOI: https://doi.org/10.18127/j19998554-202306-02 (In Russian)