V.O. Ivanov1
1 Military University of Radioelectronics (Cherepovets, Russia)
1 info@radiotec.ru
Modern information-analytical processing systems operating within the "human-information" paradigm inevitably run into limitations imposed by the cognitive constraints of an analyst's memory. Integrating generative language models based on the Transformer architecture is a significant step toward automating information processing. However, phenomena such as generative confabulation (hallucination) and the limited context window inherent in these models can distort factual information and thereby reduce the reliability of generated outputs.
The objective of the article is to investigate the nature of hallucination effects in autoregressive language models and to identify robust informative features for detecting, and potentially regulating, factual distortions in the latent space of neural networks.
A model of an autoregressive neural network is developed that accounts for knowledge obsolescence and the superposition of multilayer-perceptron parameters, enabling analysis of the relationship between model parameters and the distributions of generated tokens. It is found that the first eigenvector of the spectral decomposition of the difference between covariance matrices of the final layer serves as the most stable discriminative feature of hallucination. It is demonstrated that manipulating activations along this direction can reduce factual distortions and control the semantics of the output text.
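The feature described above can be sketched in a few lines of numpy. This is a minimal illustration, not the author's implementation: the function names, the additive steering rule, and the synthetic activation matrices (`h_factual`, `h_halluc`, rows are final-layer activation vectors collected on factual vs. hallucinated generations) are all assumptions for the sake of the example.

```python
import numpy as np

def hallucination_direction(h_factual, h_halluc):
    """Dominant eigenvector of the difference of covariance matrices.

    h_factual, h_halluc: arrays of shape (n_samples, d) holding
    final-layer activations for factual vs. hallucinated outputs
    (hypothetical data; how such samples are labeled is out of scope).
    Returns a unit vector in activation space.
    """
    cov_f = np.cov(h_factual, rowvar=False)
    cov_h = np.cov(h_halluc, rowvar=False)
    diff = cov_h - cov_f                      # symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(diff)   # eigh: for symmetric input
    v = eigvecs[:, np.argmax(np.abs(eigvals))]  # largest-magnitude eigenvalue
    return v / np.linalg.norm(v)

def steer(h, v, alpha):
    """Shift activations along v; a negative alpha is meant to
    suppress the hallucination-associated component."""
    return h + alpha * v
```

As a sanity check, if hallucinated activations carry extra variance along one axis, that axis is recovered as the dominant eigenvector of the covariance difference; `steer` then simply translates activations along (or against) that direction at inference time.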
The developed feature can be integrated into tools for monitoring and controlling text generation to automatically detect and correct factual distortions during inference. It has the potential to be extended to any abstract concepts, paving the way for more flexible and reliable semantic control of large language models.
Ivanov V.O. Mechanisms of emergence and suppression of factual distortions in autoregressive language models. Neurocomputers. 2025. V. 27. № 3. P. 40–48. DOI: https://doi.org/10.18127/j19998554-202503-06 (in Russian)

