Benchmarking OCR systems for building a search engine over archival document images

Journal Information-measuring and Control Systems, No. 12, 2014

Keywords: OCR benchmarking; OCR evaluation; digitization of documents; confidentiality of information; assessment of efficiency; OCR system

Authors:

S. V. Smirnov - Ph.D. (Eng.), St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS). E-mail: serge.smir@gmail.com

Abstract:

This paper addresses the practical problem of selecting an OCR system suited to a specific project. The comparison and selection are carried out in the context of recognizing Russian-language documents from an archival collection. The recognition results are intended for subsequent full-text indexing, which enables full-text search over document content. The comparison covers two recognized leaders among commercial systems («ABBYY FineReader», «Nuance OmniPage»), one less popular commercial system («IRIS Readiris»), and all freely available systems that support Russian («Cuneiform Linux», «Cuneiform Windows», «Tesseract»). The analysis of the results focuses on evaluation criteria that operate on words, since these criteria are better suited to the task of searching images. «ABBYY FineReader» achieves the best performance on all data sets and is the undisputed leader. «Cuneiform Linux», by contrast, shows the worst recognition results. The remaining systems occupy an intermediate position, with only minor differences between them. Among the freely distributed systems, «Tesseract» is the only one that provides the coordinates of words in the image, and it shows high quality relative to the other systems, with the exception of «ABBYY FineReader». «Tesseract» is therefore a good choice for building a search engine for images. The quality of the final recognition results depends largely on preprocessing and post-correction. Improving the quality of both open-source and commercial systems requires developing and implementing pre- and post-processing methods. A more detailed assessment of recognition quality in the context of building search engines for images requires additional criteria that take into account the needs and characteristics of the search engine.
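The abstract does not define the word-level criteria it relies on. As an illustration only, one common word-level measure compares the bag of words in the OCR output with the bag of words in a reference transcription and reports precision and recall. The sketch below is an assumption and not the author's method: the function names `tokenize` and `word_scores` are hypothetical, and tokenization is simplified to lowercase runs of word characters.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens.

    In Python 3, \\w is Unicode-aware, so Cyrillic words are matched too.
    """
    return re.findall(r"\w+", text.lower())


def word_scores(reference, recognized):
    """Return (precision, recall) over word multisets.

    A word counts as matched once for each occurrence that appears in both
    the reference transcription and the OCR output.
    """
    ref = Counter(tokenize(reference))
    hyp = Counter(tokenize(recognized))
    matched = sum((ref & hyp).values())  # multiset intersection
    precision = matched / sum(hyp.values()) if hyp else 0.0
    recall = matched / sum(ref.values()) if ref else 0.0
    return precision, recall


# One misrecognized word out of three lowers both scores to 2/3.
precision, recall = word_scores("архив документов фонда", "архив докуметов фонда")
```

For a search engine, recall at the word level is arguably the more telling number, because a word missing from the OCR output cannot be found by a full-text query.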

Pages: 44-51

References

Smirnov S.V., Belozerova M.V. Ocifrovka, katalogizacija, khranenie i poisk arkhivnojj dokumentacii // Informacionno-izmeritelnye i upravljajushhie sistemy. 2010. T.8. № 7. S. 97-101.
ABBYY FineReader. http://www.abbyy.ru/finereader/
AnyDoc Software. http://www.anydocsoftware.com/
Cvision OCR. https://www.cvisiontech.com
Dynamsoft OCR SDK. http://www.dynamsoft.com
ExperVision TypeReader & RTK. http://www.expervision.com
IRIS Readiris. http://www.irislink.com
LEADTOOLS OCR SDK. http://www.leadtools.com
Nuance OmniPage. http://www.nuance.com
Transym OCR. http://www.transym.com
Clara OCR. http://freecode.com/projects/claraocr
Cuneiform Linux. https://launchpad.net/cuneiform-linux
Cuneiform Windows. http://cognitiveforms.com/ru/products_and_services/cuneiform
GOCR. http://jocr.sourceforge.net
Javaocr. http://sourceforge.net/projects/javaocr/
LOCR. http://www.math.northwestern.edu/~mlerma/locr/
Ocrad. http://www.gnu.org/software/ocrad/
OCRchie. http://www.eecs.berkeley.edu/~fateman/kathey/ocrchie.html
Ocre. http://lem.eui.upm.es/ocre.html
OCRFeeder. https://wiki.gnome.org/action/show/Apps/OCRFeeder
Ocropus. https://code.google.com/p/ocropus/
SimpleOCR. http://www.simpleocr.com/
Tesseract-ocr. http://code.google.com/p/tesseract-ocr/
Wikipedia, Cuneiform. http://ru.wikipedia.org/wiki/CuneiForm
Cuneiform Linux repository. http://bazaar.launchpad.net/~jpakkane/cuneiform-linux/trunk/revision/536-start_revid=536
The hOCR Embedded OCR Workflow and Output Format. https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview#
Smirnov S.V. Kriterii ocenki kachestva rezultatov opticheskogo raspoznavanija // Sbornik materialov XVI Mezhdunar. nauchno-prakt. konf. «Perspektivy razvitija informacionnykh tekhnologijj». Novosibirsk. 2013. S. 33-38.