Journal Neurocomputers №2, 2026
Article in issue:
Hardware accelerator for vision transformer inference on programmable logic using microscaling integer format
Type of article: scientific article
DOI: https://doi.org/10.18127/j19998554-202602-01
UDC: 004.31
Authors:

O.V. Zobov1, A.A. Spiridonov2
1,2 JSC «SPC «Kryptonite» (Moscow, Russia)

1 o.zobov@kryptonite.ru, 2 a.spiridonov@kryptonite.ru

Abstract:

Vision transformers demonstrate high efficiency in computer vision tasks owing to their ability to capture global dependencies between spatially distant image regions. The key challenge of hardware acceleration is the presence of nonlinear operations sensitive to quantization and therefore inefficient on low-bitwidth computing units. Existing accelerators on field-programmable gate arrays use fixed-point integer formats for matrix multiplications but execute nonlinear operations on the host CPU or in floating-point format, creating a performance bottleneck due to intermediate data transfers between the processor and the accelerator.

The objective of this work is to develop a vision transformer inference accelerator on programmable logic that executes all operations fully on-device, using a low-bitwidth data representation format while maintaining high inference accuracy of the transformer model.

An accelerator has been proposed that uses the microscaling integer (MXINT) format with an adapted configuration: 6-bit elements, with blocks of 16 elements for activations and 256 elements for weights. Specialized lookup-table-based approximations have been developed for LayerNorm, GELU, and Softmax. The design achieves a 2.46-fold reduction in data bitwidth (from 16 to 6.5 bits per element), an 8–13-fold reduction in nonlinear operator area, and speedups of 92 times relative to Float16 and 1.9 times relative to reference Int8 solutions on FPGA, with a classification accuracy loss of no more than 1% on the ImageNet dataset.
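As an illustration of the scheme above, the following sketch models block-scaled MXINT quantization and a lookup-table GELU in NumPy. The 6-bit element width, the 16-element activation block, and the 8-bit shared exponent follow the configuration stated in the abstract; the power-of-two scale selection, the 256-entry table size, and the [-8, 8] input range are illustrative assumptions rather than the authors' exact hardware design.

    import numpy as np

    def quantize_mxint(x, elem_bits=6, block_size=16):
        """Block-scaled integer quantization in the spirit of MXINT:
        each block shares one power-of-two scale (an 8-bit exponent in
        the MX specification); elements are signed elem_bits-bit ints."""
        qmax = 2 ** (elem_bits - 1) - 1            # 31 for 6-bit elements
        blocks = x.reshape(-1, block_size)
        amax = np.abs(blocks).max(axis=1, keepdims=True)
        amax = np.where(amax == 0.0, 1.0, amax)    # guard all-zero blocks
        # Shared power-of-two scale so the block maximum fits into qmax.
        scale = 2.0 ** np.ceil(np.log2(amax / qmax))
        q = np.clip(np.round(blocks / scale), -qmax - 1, qmax).astype(np.int8)
        return q, scale

    def dequantize_mxint(q, scale):
        # With a power-of-two scale this multiply is a bit shift in hardware.
        return (q.astype(np.float32) * scale).ravel()

    def build_gelu_lut(n_entries=256, x_min=-8.0, x_max=8.0):
        """Precompute GELU (tanh form) on a uniform grid; on an FPGA the
        table would sit in on-chip memory, indexed by the input's high bits."""
        grid = np.linspace(x_min, x_max, n_entries, dtype=np.float32)
        table = 0.5 * grid * (1.0 + np.tanh(np.sqrt(2.0 / np.pi)
                                            * (grid + 0.044715 * grid ** 3)))
        return grid, table

    x = np.random.randn(64).astype(np.float32)
    q, scale = quantize_mxint(x)
    print("max abs error:", np.abs(dequantize_mxint(q, scale) - x).max())

    grid, table = build_gelu_lut()
    idx = np.clip(np.searchsorted(grid, x), 0, len(grid) - 1)
    print("LUT GELU of first inputs:", table[idx][:4])

A power-of-two shared scale is what makes such formats attractive on programmable logic, since dequantization reduces to a bit shift; the per-element storage cost works out to 6 + 8/16 = 6.5 bits for activations, consistent with the 2.46-fold reduction from 16-bit data quoted above.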

The proposed approach provides a throughput of 500 images per second on the Alveo U250 platform, enabling the use of vision transformers in real-time applications (autonomous vehicles, video surveillance, medical diagnostics). The methodology is applicable to a broad class of transformer architectures for computer vision and natural language processing tasks.

Pages: 5–20
For citation

Zobov O.V., Spiridonov A.A. Hardware accelerator for vision transformer inference on programmable logic using microscaling integer format. Neurocomputers. 2026. V. 28. № 2. P. 5–20. DOI: https://doi.org/10.18127/j19998554-202602-01 (in Russian)

References
  1. Dosovitskiy A., Beyer L., Kolesnikov A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (ICLR). 2021.
  2. Touvron H., Cord M., Douze M. et al. Training data-efficient image transformers & distillation through attention. Proceedings of the 38th International Conference on Machine Learning (ICML). 2021. V. 139. P. 10347–10357.
  3. Deng J., Dong W., Socher R. et al. ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. 2009. P. 248–255.
  4. Darvish Rouhani B., Zhao R., Klinefelter A. et al. Pushing the limits of narrow precision inferencing at cloud scale with Microsoft floating point. Advances in Neural Information Processing Systems (NeurIPS). 2020. V. 33. P. 22292–22303.
  5. Open Compute Project. OCP microscaling formats (MX) specification version 1.0. 2023 [Electronic resource]. URL: https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf (accessed: 08.11.2024).
  6. Liu Z., Wang Y., Han K. et al. Post-training quantization for vision transformer. Advances in Neural Information Processing Systems (NeurIPS). 2021. V. 34. P. 28092–28103.
  7. Lin J., Tang J., Tang H. et al. FQ-ViT: Post-training quantization for fully quantized vision transformer. International Journal of Computer Vision. 2022. V. 130. № 12. P. 3091–3108.
  8. Li Z., Gu Q. I-ViT: Integer-only quantization for efficient vision transformer inference. IEEE/CVF International Conference on Computer Vision (ICCV). 2023. P. 17174–17185.
  9. Yuan Z., Xue C., Chen Y. et al. PTQ4ViT: Post-training quantization framework for vision transformers with twin uniform quantization. European Conference on Computer Vision (ECCV). 2022. P. 191–207.
  10. Li Y., Wei Y., Yan J. et al. PSAQ-ViT V2: Towards accurate and general data-free quantization for vision transformers. IEEE Transactions on Neural Networks and Learning Systems. 2024. V. 35. № 4. P. 4654–4667.
  11. Zhong Y., Huang L., Chen C., Wang Y. Data-free quantization via mixed-precision compensation without fine-tuning. Neural Computing and Applications. 2023. V. 35. P. 15067–15082.
  12. Han Y., Wang Y., Zhang C. et al. AutoViT-Acc: An FPGA-aware automatic acceleration framework for vision transformer with mixed-scheme quantization. 2023 60th ACM/IEEE Design Automation Conference (DAC). 2023. P. 1–6.
  13. Dong P., Kalantidis Y., Hsieh C.-J., Wang Y. HeatViT: Hardware-efficient adaptive token pruning for vision transformers. 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 2023. P. 1–13.
  14. Huang M., Xu W., Wang J. et al. An integer-only and group-vector systolic accelerator for efficiently mapping vision transformer on edge. IEEE Transactions on Circuits and Systems I: Regular Papers. 2023. V. 70. № 4. P. 1439–1452.
  15. Wang K., Liu Z., Lin Y. et al. HAQ: Hardware-aware automated quantization with mixed precision. 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019. P. 8612–8620.
  16. Xiao G., Lin J., Seznec M. et al. SmoothQuant: Accurate and efficient post-training quantization for large language models. ArXiv preprint arXiv:2211.10438. 2022.
  17. Frantar E., Ashkboos S., Hoefler T., Alistarh D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ArXiv preprint arXiv:2210.17323. 2022.
  18. Dettmers T., Lewis M., Belkada Y., Zettlemoyer L. LLM.int8(): 8-bit matrix multiplication for transformers at scale. ArXiv preprint arXiv:2208.07339. 2022.
  19. Yao Z., Aminabadi R.Y., Zhang M. et al. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers. ArXiv preprint arXiv:2206.01861. 2022.
  20. Zhang C., Zhao Y., Chen Y. et al. Revisiting block-based quantisation: What is important for sub-8-bit LLM inference? ArXiv preprint arXiv:2310.05079. 2023.
  21. Song Z., Liu Z., Wang Y. et al. DRQ: Dynamic region-based quantization for deep neural network acceleration. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). 2020. P. 474–487.
  22. Zadeh A.H., Edo I., Santana O.M. et al. Mokey: Enabling narrow fixed-point inference for out-of-the-box floating-point transformer models. Proceedings of the 49th Annual International Symposium on Computer Architecture. 2022. P. 876–892.
  23. Vaswani A., Shazeer N., Parmar N. et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS). 2017. V. 30. P. 5998–6008.
  24. Devlin J., Chang M.-W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT). 2019. P. 4171–4186.
  25. Radford A., Wu J., Child R. et al. Language models are unsupervised multitask learners. OpenAI blog. 2019. V. 1. № 8. P. 9.
  26. Paszke A., Gross S., Massa F. et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NeurIPS). 2019. V. 32. P. 8024–8035.
  27. Wightman R. PyTorch image models. 2019 [Electronic resource]. URL: https://github.com/huggingface/pytorch-image-models (accessed: 08.11.2024).
  28. Xilinx Inc. Vivado design suite user guide: High-level synthesis (UG902, v2023.2). 2023 [Electronic resource]. URL: https://www.xilinx.com (accessed: 08.11.2024).
  29. Demin A., Vlasov A., Selivanov K. et al. Integration of embedded components into cyber-physical systems: Design, analysis, and applications. Artificial Intelligence and Digital Transformation. Lecture Notes in Information Systems and Organisation. 2025. V. 78. P. 207–221.
  30. Vlasov A., Gladkikh A., Kutaev K. Application of modern programming languages in solving the problem of emulator development for embedded systems. Artificial Intelligence Algorithm Design for Systems. CSOC 2024. Lecture Notes in Networks and Systems. 2024. V. 1120. P. 574–598.
  31. Yuldashev M.N., Vlasov A.I., Novikov A.N. Energy-efficient algorithm for classification of states of wireless sensor network using machine learning methods. Journal of Physics: Conference Series. 2018. V. 1015. № 032153.
  32. Zhalnin V.P., Zakharova A.S., Uzenkov D.A. et al. Configuration-making algorithm for the smart machine controller based on the Internet of Things concept. International Review of Electrical Engineering. 2019. V. 14. № 5. P. 375–384.
Date of receipt: 26.11.2025
Approved after review: 16.12.2025
Accepted for publication: 10.03.2026