Journal Science Intensive Technologies №1 for 2023
Article in issue:
Application of machine learning methods in plagiarism detection systems in order to solve the problem of highly specialized concepts
Type of article: scientific article
DOI: https://doi.org/10.18127/j19998554-202206-03
UDC: 349
Authors:

I.I. Starkov1, A.I. Vlasov2

1,2 Bauman Moscow State Technical University (Moscow, Russia)
 

Abstract:

This work is devoted to the application of machine learning methods in plagiarism detection systems to solve the problem of highly specialized concepts. Plagiarism detection systems commonly use the Shingles algorithm to determine the uniqueness of a text. This algorithm cannot determine the context and topic of a scientific article, so it cannot exclude from the check the words that are characteristic of that topic. As a result, a human reader may judge the uniqueness of an article to be above the required threshold, while the algorithm considers it insufficiently unique because the work is replete with the same words (terms).
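The effect can be illustrated with a minimal sketch of a shingle-based uniqueness check. The shingle width of three words, the function names, and the sample sentences are assumptions made for illustration; the abstract does not specify the exact variant of the algorithm.

```python
import re


def shingles(text, w=3):
    """Return the set of overlapping w-word shingles of a text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def uniqueness(candidate, source, w=3):
    """Share of the candidate's shingles that do not occur in the source."""
    cand = shingles(candidate, w)
    if not cand:
        return 1.0  # nothing to compare, treat as fully unique
    return len(cand - shingles(source, w)) / len(cand)


# Two independent sentences that share only standard terminology.
source = "the random forest classifier is trained on labeled article titles"
candidate = "our random forest classifier is trained on a new corpus"

print(uniqueness(candidate, source))  # 0.5
```

Half of the candidate's shingles match the source here, although the overlap consists only of a common technical phrase.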

The purpose of this work is to build a machine learning model that determines the characteristic words for a selected thematic area. These words are removed from the article, and the processed text is then passed as input to the Shingles algorithm.
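The preprocessing step can be sketched as follows, under the assumption that the characteristic words of a topic are simply its most frequent non-stop-words. The stop-word list, the `top_k` parameter, and all function names are hypothetical and serve only to illustrate the idea.

```python
import re
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"the", "a", "of", "in", "on", "and", "for", "is", "to", "with"}


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def characteristic_words(topic_texts, top_k=3):
    """Most frequent non-stop-words across the texts of one topic."""
    counts = Counter(
        word
        for text in topic_texts
        for word in tokenize(text)
        if word not in STOP_WORDS
    )
    return {word for word, _ in counts.most_common(top_k)}


def strip_terms(text, terms):
    """Drop topic-characteristic terms before the uniqueness check."""
    return " ".join(w for w in tokenize(text) if w not in terms)


corpus = [
    "training of the neural network on images",
    "the neural network for speech recognition",
    "pruning a deep neural network",
]
terms = characteristic_words(corpus, top_k=2)
print(sorted(terms))  # ['network', 'neural']
print(strip_terms("a neural network for text ranking", terms))  # a for text ranking
```

The filtered string is what would then be split into shingles, so shared terminology no longer lowers the uniqueness score.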

The problems of identifying plagiarism in highly specialized texts are discussed. A brief overview of existing plagiarism detection systems is given, reflecting their strengths and weaknesses. The Shingles algorithm, a classic algorithm for detecting plagiarism in text works, is considered. A machine learning classification model is developed that determines the topic of an article from its title, so that a list of the words most often found in articles on this topic can then be compiled (not counting prepositions, conjunctions, etc.). The results of this model are presented in terms of the corresponding quality metrics for classification models. Recommendations for improving the quality of the developed model are given.
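Classifying an article by its title can be sketched with a small multinomial Naive Bayes classifier. The abstract does not name the model family, so Naive Bayes is used here only as a simple stand-in; the class name, topic labels, and training titles are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    return re.findall(r"\w+", title.lower())


class TitleTopicClassifier:
    """Multinomial Naive Bayes over title words with add-one smoothing."""

    def fit(self, titles, topics):
        self.topic_counts = Counter(topics)
        self.word_counts = defaultdict(Counter)
        for title, topic in zip(titles, topics):
            self.word_counts[topic].update(tokenize(title))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, title):
        total = sum(self.topic_counts.values())
        best_topic, best_score = None, -math.inf
        for topic, n in self.topic_counts.items():
            counts = self.word_counts[topic]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior of the topic
            for word in tokenize(title):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_topic, best_score = topic, score
        return best_topic


titles = [
    "neural network training for image recognition",
    "gradient boosting for text classification",
    "copyright protection in higher education",
    "criminal liability for copyright infringement",
]
topics = ["ml", "ml", "law", "law"]

model = TitleTopicClassifier().fit(titles, topics)
print(model.predict("neural network for text ranking"))  # ml
print(model.predict("copyright law in education"))       # law
```

Once the topic is predicted, the word list for that topic can be built from a corpus of articles carrying the same label.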

Introducing the described machine learning model into modern online services that check text for borrowings will reduce the number of cases in which a scientific paper is found to have low uniqueness only because it uses highly specialized terms. Authors will not have to spend time selecting synonyms for these terms where doing so makes no sense; as a result, user interest in such services is expected to increase.

Pages: 69-76
For citation

Starkov I.I., Vlasov A.I. Application of machine learning methods in plagiarism detection systems in order to solve the problem of highly specialized concepts. Neurocomputers. 2022. V. 24. № 6. P. 30-37. DOI: https://doi.org/10.18127/j19998554-202206-03 (In Russian)

Date of receipt: 04.10.2022
Approved after review: 20.10.2022
Accepted for publication: 22.11.2022