Hardware solutions for fault-tolerant strategies for computing systems

Journal: Nonlinear World, No. 4, 2022

Type of article: scientific article

DOI: https://doi.org/10.18127/j20700970-202204-04

UDC: 681.326.7

Keywords: fault tolerance; redundancy; hardware solutions; computing systems; reliability enhancement; dynamic redundancy; fault tolerance strategies; masking; detection codes; error correction; self-test modules; fault localization; reconfiguration and repair

Authors:

A.N. Stalnov1, O.N. Andreeva2, E.G. Berger3

1,2 JSC “Concern Morinsis-Agat” (Moscow, Russia)

3 MIREA – Russian Technological University (Moscow, Russia)

Abstract:

Relevance. As hardware density grows, the risk of multiple transient faults increases by an order of magnitude, which makes it a prime concern for designers of new computer systems for safety-critical applications. Hardware faults arise from natural phenomena such as ionizing radiation, variations in the manufacturing process, vibration, etc. Computer systems intended for critical control applications must be extremely reliable, because their faults can cause vast economic losses and even endanger human life. Some computer systems are built from commercial off-the-shelf components that do not provide the required degree of reliability. One solution to this reliability problem is to build fault-tolerant systems. Redundant configurations of computer systems have long been used in research and design to provide system fault tolerance. Fault tolerance is concerned with the continued correct operation of a system despite an internal fault. It is achieved through various methods of time redundancy, information redundancy, and software redundancy; hardware redundancy is also frequently used. Hardware fault tolerance is the most mature area of fault-tolerant computing. Many hardware fault tolerance techniques have been developed and used in practice in critical applications, ranging from telephone exchanges to space missions. In the past, the main obstacle to the wide use of hardware fault tolerance was the cost of the extra hardware required. With the continued reduction in hardware cost, this is no longer a significant drawback, and the use of hardware fault tolerance techniques is expected to increase. However, other constraints, notably on power consumption, may continue to restrict the use of massive redundancy in many applications.
Designing and understanding fault-tolerant distributed system architectures is notoriously difficult: one has to stay in control not only of the standard system activities when all components work correctly, but also of the complex situations that can arise when some components fail. The lack of clear structuring concepts can make this task even harder. In this regard, the analysis and systematization of methods and hardware solutions for fault-tolerant strategies of computing systems is relevant.
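To make the notion of information redundancy mentioned above more concrete, here is a minimal sketch of a single even-parity bit, one of the simplest detection codes. It is an illustration only and is not taken from the paper; the function names `add_parity` and `parity_ok` are hypothetical.

```python
def add_parity(bits: list[int]) -> list[int]:
    """Append an even-parity bit so the codeword has an even number of ones."""
    return bits + [sum(bits) % 2]


def parity_ok(codeword: list[int]) -> bool:
    """Return True when the codeword still has even parity."""
    return sum(codeword) % 2 == 0


word = add_parity([1, 0, 1, 1])   # codeword as stored or transmitted
corrupted = word.copy()
corrupted[2] ^= 1                 # simulate a single transient bit flip

print(parity_ok(word), parity_ok(corrupted))
```

A single parity bit detects any odd number of flipped bits but cannot locate or correct them; correction needs more redundant bits, which is the overhead trade-off the abstract refers to.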

The aim of this paper is to analyze hardware solutions of fault-tolerant strategies for computing systems.

Results and their novelty. The novelty of the work lies in the identified general development trends and open problems in the formation and functioning of hardware mechanisms that ensure the fault tolerance of computing systems. This article discusses the basic concepts of hardware redundancy and their relationship to the fault tolerance of computing systems. In particular, it is shown that hardware redundancy can range from simple duplication to complex structures that switch in spare blocks when the active ones fail. These forms of hardware redundancy carry high overhead, so their use is usually reserved for mission-critical systems where such overhead can be justified. The paper describes the mechanisms by which fault tolerance increases reliability. The elements of fault-tolerant strategies and the mechanisms of their implementation are identified.
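As a rough illustration of the two ends of that range, the sketch below contrasts masking redundancy (a majority voter over three replicas) with dynamic redundancy (a standby spare switched in after a detected fault). It models the idea in software only; `majority_vote`, `StandbyPair`, the two unit functions, and the evenness self-check are hypothetical names and assumptions, not constructs from the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(outputs: Sequence[int]) -> int:
    """Mask a minority of faulty replica outputs by returning the majority value."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many replicas disagree")
    return value


class StandbyPair:
    """Dynamic redundancy: use the active unit, switch to the spare on a detected fault."""

    def __init__(self, active: Callable[[int], int], spare: Callable[[int], int],
                 check: Callable[[int, int], bool]) -> None:
        self.active = active
        self.spare = spare
        self.check = check        # assumed self-test: is (input, output) plausible?
        self.switched = False

    def run(self, x: int) -> int:
        y = self.active(x)
        if not self.check(x, y):  # fault detected, so reconfigure to the spare
            self.active, self.switched = self.spare, True
            y = self.active(x)    # the spare's output is not re-checked in this sketch
        return y


def good_unit(x: int) -> int:
    return x * 2


def faulty_unit(x: int) -> int:
    return x * 2 + 1              # models a permanent fault in the unit


voted = majority_vote([good_unit(5), faulty_unit(5), good_unit(5)])
pair = StandbyPair(faulty_unit, good_unit, check=lambda x, y: y % 2 == 0)
recovered = pair.run(5)
```

The voter hides the fault without ever identifying the failed replica, whereas the standby scheme depends on detection and reconfiguration but needs only one spare instead of two extra replicas.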

Practical significance. The presented analysis will be useful for developers of computing systems to substantiate new technological solutions that ensure their hardware fault tolerance.

Pages: 38-50

For citation

Stalnov A.N., Andreeva O.N., Berger E.G. Hardware solutions for fault-tolerant strategies for computing systems. Nonlinear World. 2022. V. 20. № 4. P. 38-50. DOI: https://doi.org/10.18127/j20700970-202204-04 (In Russian)

Date of receipt: 15.09.2022

Approved after review: 29.09.2022

Accepted for publication: 27.10.2022