Journal Neurocomputers, No. 1, 2017
Article in issue:
Automating text language identification on the basis of existing solutions
Authors:
S.N. Kalegin - Post-graduate Student, Head of NTO Sector, Moscow Television Research Institute; Applicant, ICS RAS. E-mail: skalegin@inbox.ru
Abstract:
The article presents the results of a pilot study of modern software solutions for determining the language of a text. The study was conducted by S.N. Kalegin of the Moscow Television Research Institute to assess how effective these solutions are when used in systems for the automatic processing of unstructured text data. Programs and systems of different types (monomethod and hybrid) from different vendors were compared. The study was conducted in two stages: first, each program was tested on separate phrases in different languages (50 phrases in each); then each program was given 1000 text fragments on various subjects (news, stories, detective fiction, etc.), each one paragraph long, which made it possible to obtain reliable results. Testing showed that the language identifiers often make mistakes, for various reasons that were documented in detail. The study led to the following conclusions.
1. The results of language identification depend on the writing system used and on adherence to the rules of grammar and spelling throughout the analyzed text. This makes the identifiers unusable for text written in a non-literary, non-traditional, or archaic manner (for example, when the Latin alphabet is used instead of Cyrillic, or when diacritics are omitted).
2. The accuracy of the reviewed language identification programs and systems depends on the identification methods and their technical implementation, as well as on the volume of the analyzed text.
3. The best results were shown by hybrid systems and by programs whose identification method is based on comparing n-gram templates. However, the drawbacks of this method, which stem from its lack of linguistic analysis, should be taken into account when such programs are used in automatic text processing systems, since an error can cause critical failures and even crash the entire software complex.
4. Hybrid systems identify the language better than monomethod identifiers, but they combine not only the advantages but also the disadvantages of all the methods they use. This significantly complicates such systems and can lead to critical errors.
5. Modern solutions for machine language identification are far from perfect and require substantial refinement. Because of the high probability of errors, the language identification process in multilingual information processing systems cannot yet be fully automated.
6. This software can be used reasonably well with a small number of languages that are known in advance, unrelated, and markedly different from one another, in non-critical tasks such as the automatic sorting of original Russian, Arabic, and Chinese literature in digital libraries, in certain automated data selection systems, and so on, where an identification error will not cause the entire software complex to fail.
The results of this study will be useful not only to developers and users of language identifiers and of systems for processing unstructured information, but also to compilers of teaching aids on the relevant subjects for faculties and departments of information technology, and to teachers and students in these fields.
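The n-gram template comparison named in the abstract can be illustrated with a minimal sketch. It assumes a rank-order ("out-of-place") comparison of character n-gram profiles, a variant commonly associated with TextCat-style identifiers; the abstract does not state which variant the tested programs implement. The function names, the profile size of 300, and the two toy training samples are illustrative assumptions, not data from the study.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed cut-off; real tools tune this value


def ngram_profile(text, n_max=3, size=PROFILE_SIZE):
    """Map the most frequent character n-grams (lengths 1..n_max) to their rank."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter(
        padded[i:i + n]
        for n in range(1, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    # Most frequent first; ties broken alphabetically so ranks are deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))[:size]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile, max_penalty=PROFILE_SIZE):
    """Sum rank differences; n-grams absent from the language profile cost max_penalty."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Toy training samples (hypothetical; real profiles are built from large corpora).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and then the weather "
          "was nice for all of the people in the town",
    "ru": "в этом городе живут люди которые каждый день ходят на работу "
          "и читают новости о погоде и о том что происходит в мире",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

print(identify("the weather is nice today", PROFILES))
print(identify("сегодня хорошая погода", PROFILES))
# Russian written in Latin letters shares almost no n-grams with the "ru" profile.
print(identify("segodnya horoshaya pogoda", PROFILES))
```

With only these two toy profiles, the transliterated Russian phrase in the last call is assigned to "en". The comparison sees only character statistics and performs no linguistic analysis, which is the kind of failure the first and third conclusions describe.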
Pages: 56-65