N.N. Oltyan1
1 Financial University under the Government of the Russian Federation (Moscow, Russia)
1 nikitaoltyan@mail.ru
In modern data analysis and machine learning, textual information often has to be transformed into semi-structured formats such as JSON and XML. Existing approaches, whether rule-based, built on deep learning, or built on Large Language Models (LLMs), have significant drawbacks: high maintenance costs, the need for large amounts of labeled data, and unstable output.
The goal of this work is to systematize approaches to data structuring, including post-validation algorithms, and to develop the concept of a deterministic pipeline that ensures both the syntactic and the semantic validity of the output.
An analysis of modern data structuring methods showed that none of them simultaneously provides strict typing, deterministic output, and both syntactic and semantic validity. The key limitations of existing solutions were identified. An approach based on deterministic pipelines is proposed; it combines the strengths of LLMs with formal validation mechanisms to achieve reliable data structuring.
The proposed approach helps to overcome the limitations of existing methods and to improve the reliability and quality of structured output obtained from text. This will facilitate the development of data analysis systems that require precise and predictable transformation of unstructured information.
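As an illustration only, the idea of pairing model output with formal validation can be sketched as a fixed sequence of checks applied to a raw JSON string. This is not the pipeline developed in the article: the schema, the field names (`name`, `quantity`, `unit`), the semantic rules, and the function `validate` are all hypothetical and were chosen purely for the example. The sketch shows three stages run in a fixed order, so the same input always yields the same verdict: syntactic validity, strict typing, and semantic validity.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical schema: field name -> expected Python type.
SCHEMA: dict[str, type] = {"name": str, "quantity": int, "unit": str}

# Hypothetical semantic rules: (description, predicate over the parsed record).
RULES: list[tuple[str, Callable[[dict[str, Any]], bool]]] = [
    ("quantity must be positive", lambda r: r["quantity"] > 0),
    ("unit must be known", lambda r: r["unit"] in {"kg", "g", "l"}),
]


def validate(raw: str) -> tuple[Optional[dict[str, Any]], list[str]]:
    """Return (record, errors); record is None when any check fails."""
    # Stage 1: syntactic validity (the text must be a well-formed JSON object).
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"syntax: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["syntax: top-level value is not an object"]

    # Stage 2: strict typing (exact type match, so True is not accepted as int).
    errors = []
    for field, expected in SCHEMA.items():
        if field not in record:
            errors.append(f"type: missing field '{field}'")
        elif type(record[field]) is not expected:
            errors.append(f"type: '{field}' is not {expected.__name__}")
    extra = sorted(set(record) - set(SCHEMA))
    errors += [f"type: unexpected field '{f}'" for f in extra]
    if errors:
        return None, errors

    # Stage 3: semantic validity (domain rules run only on well-typed data).
    errors = [f"semantic: {desc}" for desc, ok in RULES if not ok(record)]
    return (record, []) if not errors else (None, errors)
```

In such a sketch, a candidate string produced by an LLM would be passed to `validate`, and the returned error list could drive a repair or regeneration step; how that step is organized is outside the scope of this example.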
Oltyan N.N. Evolution of methods for extracting and structuring data from text to JSON and XML. Neurocomputers. 2025. V. 27. № 6. P. 37−49. DOI: 10.18127/j19997493-202506-04 (in Russian).

