Journal Information-measuring and Control Systems, No. 12, 2014
Article in issue:
OCR-systems benchmarking for the task of building a search engine for archival documents images
Keywords:
OCR benchmarking
OCR-evaluation
digitization of documents
confidentiality of information
assessment of efficiency
OCR-system
Authors:
S. V. Smirnov - Ph.D. (Eng.), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: serge.smir@gmail.com
Abstract:
This paper considers the topical task of selecting an OCR system suited to a specific project. The comparative analysis and selection are carried out in the context of recognizing an archival fond of Russian-language documents. The recognition results are intended for subsequent full-text indexing, which enables full-text search over document content.
The comparison involves recognized leaders among commercial systems («Abbyy Finereader», «Nuance Omnipage»), one less popular commercial system («IRIS Readiris»), and all freely distributed systems that support recognition of the Russian language («Cuneiform Linux», «Cuneiform Windows», «Tesseract»).
The analysis of the results focuses on evaluation criteria that operate on words, since these criteria are better suited to the task of searching over document images.
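To illustrate why word-level criteria matter for search, here is a minimal sketch that compares a recognized text against a reference transcription as bags of words. It is not the set of criteria used in the paper; the function names `tokenize` and `word_scores` and the choice of precision and recall are assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w is Unicode-aware for str patterns, so Cyrillic words are kept.
    """
    return re.findall(r"\w+", text.lower())


def word_scores(reference, recognized):
    """Return (precision, recall) computed over bags of words.

    A recognized word counts as correct only if it matches a reference
    word exactly; each reference word can be matched at most once.
    """
    ref = Counter(tokenize(reference))
    hyp = Counter(tokenize(recognized))
    matched = sum((ref & hyp).values())  # multiset intersection
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


# One wrong character out of 23 spoils one word out of three.
precision, recall = word_scores("архивный документ фонда",
                                "архивный докумеит фонда")
print(precision, recall)
```

In this example a single misrecognized character leaves character-level accuracy high, yet one of the three words can no longer be found by an exact-match query, which is the effect a word-based criterion captures.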
«Abbyy Finereader» achieves the best results on all data sets and is the undisputed leader. «Cuneiform Linux», by contrast, shows the worst recognition results. The remaining systems occupy an intermediate position, with minor differences among them.
Of all the freely distributed systems, «Tesseract» is the only one that provides the coordinates of words in the image, and its recognition quality is high relative to the other systems, with the exception of «Abbyy Finereader». Thus, «Tesseract» is a good choice for the task of building a search engine for images.
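Word coordinates are what allow a search engine to highlight hits on the page image. The sketch below assumes the coordinates are read from hOCR output, a format Tesseract can produce, in which each word is a `span` with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in its `title` attribute. The class `HocrWords` and the function `extract_words` are hypothetical names, and the sample markup is hand-written rather than real OCR output.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR word spans."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in (a.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", a.get("title") or "")
            if m:
                self._bbox = tuple(int(v) for v in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes no <span> is nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    """Return the list of (text, bbox) pairs found in an hOCR string."""
    parser = HocrWords()
    parser.feed(hocr)
    parser.close()
    return parser.words


sample = (
    "<span class='ocr_line' title='bbox 10 10 200 40'>"
    "<span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 95'>архив</span> "
    "<span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 91'>фонд</span>"
    "</span>"
)
for text, bbox in extract_words(sample):
    print(text, bbox)
```

An index built from such pairs can store the bounding box alongside each term, so that a query result points to a region of the scanned page rather than only to the page itself.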
The quality of the final recognition results depends largely on the preprocessing and post-correction procedures. To improve the quality of both open-source and commercial systems, preprocessing and post-processing methods need to be developed and implemented.
A more detailed assessment of recognition quality in the context of building search engines for images requires the development of additional criteria that take into account the needs and characteristics of the search engine.
Pages: 44-51