Fault Tolerance And Risk Management
Abstract
Fault tolerance is one of the more mundane topics in computer science. It is not a sexy subject, but it is essential. For example, in the course of producing this paper, something went wrong with the Windows XP operating system on the writer’s computer. Normally, Office 2003 does an excellent job of preserving work in progress. For some reason, this time the backup and recovery function failed, and days of work were lost. Even the autosave function (set to save the document every ten minutes) failed. This story illustrates the agony that follows when faults occur.
Nevertheless, finding general information on fault tolerance is difficult. One reason is that the subject is huge: it extends beyond computer information systems into virtually all areas of automation engineering.
Anywhere there are machines operating, fault tolerance is a design issue. This is because no machine operates completely within its parameters on every piston stroke. Even microprocessors can fail on some clock pulses while succeeding on others. Engineers face the reality of predictable failures by employing means to detect and correct them without bringing down the entire system. US Navy ships are packed with redundant systems, from engines to firefighting equipment. In some jurisdictions, gaming devices are required to preserve accounting information for as long as six months after a power loss.
This paper seeks to define “fault tolerance”.
What exactly is “fault tolerance”? Whatis.com defines it as follows:
Fault-tolerant describes a computer system or component designed so that, in the event that a component fails, a backup component or procedure can immediately take its place with no loss of service. Fault tolerance can be provided with software, or embedded in hardware, or provided by some combination.
(http://searchsmallbizit.techtarget.com/sDefinition/0,,sid44_gci214456,00.html)
The difference between fault tolerant systems and systems that are simply prone to fail is that in fault tolerant systems malfunctions are anticipated and handled. Even with the highest design standards and the most skilled workers assembling the system, failures are expected. For fault tolerance to work, specific failures must be planned for, and action must be taken in advance to handle them. This is an important feature of fault tolerance: the system should be able to “fight hurt”.
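The idea of planning for a specific failure and letting a backup take over can be illustrated with a minimal Python sketch. It is not drawn from any of the cited sources; the names `ComponentFailure`, `primary`, `backup` and `serve` are hypothetical and exist only for this illustration.

```python
class ComponentFailure(Exception):
    """An anticipated fault: a component cannot service a request (hypothetical)."""


def primary(request):
    # Simulate a primary component that has failed.
    raise ComponentFailure("primary unavailable")


def backup(request):
    # The standby component services the same request.
    return f"backup handled {request}"


def serve(request, components):
    """Try each redundant component in order, failing over on an anticipated fault."""
    errors = []
    for component in components:
        try:
            return component(request)
        except ComponentFailure as exc:
            # Planned-for fault: record it and move on to the next component.
            errors.append(exc)
    # Every redundant component failed, so the fault can no longer be masked.
    raise RuntimeError(f"all components failed: {errors}")


result = serve("read block 7", [primary, backup])
print(result)
```

Only the fault type that was planned for (`ComponentFailure`) triggers the failover. Any other exception still propagates, which mirrors the point that fault tolerance handles anticipated failures rather than every conceivable one.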
Catch-all general fault protection can be built into a system; “Traditionally, fault-tolerance has referred to building subsystems from redundant components that are placed in parallel” (http://www.cigitallabs.com/resources/definitions/fault_tolerance.html). A more efficient way to provide reliability in a system is to anticipate specific failure probabilities and provide protection against them. To accomplish this, planning is the key (Hutchinson, para 7). The scope of this paper is limited to building fault tolerance into computer and information systems. It will not explore fault tolerance in other kinds of systems, although sources from other fields may be used.
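The benefit of placing redundant components in parallel can be estimated with a short calculation. The sketch below assumes that the components fail independently of one another, an assumption that is not made in the sources and that rarely holds exactly in practice. The function name `parallel_availability` is hypothetical.

```python
def parallel_availability(component_availability, copies):
    """Availability of `copies` identical components placed in parallel.

    Assumes independent failures: the group is down only when every
    copy is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_availability) ** copies


for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6f}")
```

Under this assumption, a component that is available 99% of the time reaches roughly 99.99% when duplicated. Each further copy adds less than the one before it, which is one reason blanket redundancy becomes costly.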
In robust computer systems, fault tolerance is built into the software, the hardware and the network. The computer network is a combination of hardware and software. Physical redundancy can be designed into the network by creating parallel paths to network resources with extra cable runs and creative router configurations. This alone can increase the fault tolerance of the network. The network protocols, however, are also an important part of network reliability. For that reason, the writer sees networks as a third environment requiring fault tolerant design, distinct from the software and hardware environments.
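The effect of a parallel path can be checked with a small reachability test. The sketch below is illustrative only; the topology (`client`, `router_a`, `router_b`, `server`) and the function `reachable` are hypothetical and do not model any real routing protocol.

```python
from collections import deque


def reachable(links, source, target, failed=frozenset()):
    """Return True if target can be reached from source while ignoring failed links."""
    neighbors = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue  # skip links that are currently down
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Breadth-first search over the links that are still up.
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical topologies: one cable run versus an extra, parallel run.
single_path = [("client", "router_a"), ("router_a", "server")]
redundant = single_path + [("client", "router_b"), ("router_b", "server")]
failed = {("router_a", "server")}

print(reachable(single_path, "client", "server", failed))
print(reachable(redundant, "client", "server", failed))
```

With the single path, the failed link cuts the client off from the server. With the extra run through `router_b`, the server remains reachable.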
In the software environment, fault tolerance is built into the operating system, but can also be built into applications and other layers as needed. For example, a multi-threaded application, correctly implemented, will be more robust than an ill-conceived single-threaded program. This is because in a multi-threaded application a fault in a single execution thread does not necessarily bring down the entire application. This may give the end user time to save work before closing the entire application. Threading is so common in programming as a way to build robustness into code that most mainstream languages and runtimes support it, although Visual Basic for Applications, the “macro” language used to extend and automate Microsoft Office applications, runs its code on a single thread.
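A minimal Python sketch of this kind of isolation follows. It uses the standard library's `concurrent.futures.ThreadPoolExecutor`; the functions `process` and `run_all` and the deliberately failing item are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process(item):
    # Hypothetical unit of work; item 3 stands in for a faulty input.
    if item == 3:
        raise ValueError(f"cannot process item {item}")
    return item * 10


def run_all(items):
    """Run each item on a worker thread and collect successes and failures separately."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {item: pool.submit(process, item) for item in items}
        for item, future in futures.items():
            try:
                # result() re-raises any exception raised inside the worker.
                results[item] = future.result()
            except ValueError as exc:
                # The fault stays confined to this one task.
                failures[item] = str(exc)
    return results, failures


results, failures = run_all([1, 2, 3, 4])
print(results)
print(failures)
```

The failure of one task does not stop the others from completing. Whether a fault stays confined in this way depends on the runtime and on the kind of fault: an exception in a managed language can be caught per thread, whereas a memory fault in a native thread typically ends the whole process.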
At the hardware level, fault tolerance is usually achieved through some kind of redundancy scheme. RAID and parallel processing are examples of this. The hardware environment, like the physical layer of the network, can achieve a large degree of robustness simply by employing redundant layers of fault handling circuits. This approach is crude and very costly. Many pieces of hardware are also designed with large tolerances. This “worst case” approach handles large ranges of faults before they can occur (Horowitz and Hill, page 212).
So fault tolerance manifests differently at different levels in a computer network. Nevertheless, at each level, fault tolerance is the ability of a system to continue functioning despite a fault that, without pre-planned intervention, would have resulted in a sudden, catastrophic loss of the system. Combined, the various implementations of fault tolerance in a system produce an overall increase in system reliability.
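One form of hardware redundancy can be sketched in a few lines: the XOR parity used by parity-based RAID levels such as RAID 5. The example below is a simplified illustration of the parity idea only, not an implementation of any RAID controller. The block contents and the function `xor_blocks` are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data blocks, each imagined on its own disk.
data = [b"FAUL", b"T TO", b"LERA"]

# The parity block is stored on a further disk.
parity = xor_blocks(data)

# Simulate losing the disk that held data[1]; XOR the survivors with parity.
survivors = [data[0], data[2], parity]
recovered = xor_blocks(survivors)

print(recovered == data[1])
```

Because XOR is its own inverse, any single missing block can be rebuilt from the remaining blocks plus the parity. A second simultaneous loss cannot be recovered with one parity block, so the scheme plans for one specific fault rather than for all faults.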
The goal frequently discussed in the literature on fault tolerance is reliability (Caldara et al, Gilmer, Hutchinson). But what is reliability? Sometimes the view is promoted that only 100% reliability is acceptable. This standard is impractical. Something can always go wrong, and no one can anticipate every possible fault. Engineers accept that as a premise. Rather than eliminating all errors, reliability can be achieved by allowing “reasonable performance over a wide range of possible conditions” (Hutchinson, para 7). The usual approach is to let a certain number of errors pass unhandled, or to categorize errors according to their impact. This allows engineers to focus most of their effort on managing errors that have the potential to crash the entire system. So in these schools of thought, the definition of reliability does not envision 100% success. Rather, it seeks to manage and control the impact of errors and produce a network that has “high availability” (Gilmer, para 3).
Security deserves mention at this point. Although fault tolerance does not mean security, a fault tolerant system would contemplate the various security risks that threaten contemporary computer networks. Virus and worm attacks can insert faults into a system, and this possibility should be considered when designing for network fault tolerance. A reliable system, then, is not a system that never fails but a system that can fail without causing a catastrophic loss of service or data. As with network security, faults must be managed, not eliminated.
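Categorizing errors by impact can be sketched as a simple triage step. The severity levels, the error names and the function `triage` below are hypothetical and are not taken from the cited sources.

```python
from enum import IntEnum


class Severity(IntEnum):
    MINOR = 1      # tolerated: counted but not acted on
    DEGRADED = 2   # service continues at reduced quality
    CRITICAL = 3   # could bring down the entire system


# Hypothetical classification of error types by impact.
SEVERITY_BY_ERROR = {
    "checksum_retry": Severity.MINOR,
    "slow_response": Severity.DEGRADED,
    "disk_failure": Severity.CRITICAL,
}


def severity_of(name):
    # Unknown errors are treated as critical, the cautious default.
    return SEVERITY_BY_ERROR.get(name, Severity.CRITICAL)


def triage(errors, threshold=Severity.DEGRADED):
    """Split errors into those to act on (worst first) and those to tolerate."""
    act_on = [e for e in errors if severity_of(e) >= threshold]
    tolerated = [e for e in errors if severity_of(e) < threshold]
    act_on.sort(key=severity_of, reverse=True)
    return act_on, tolerated


act_on, tolerated = triage(["checksum_retry", "disk_failure", "slow_response"])
print(act_on)
print(tolerated)
```

Errors below the threshold are allowed to pass, while the remainder are ordered so that the faults most likely to crash the system receive attention first.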
Conclusions
Fault tolerant systems are not bug free, but they do exhibit a large degree of robustness under a wide range of conditions. Sometimes the law, but most of the time the demand of the marketplace, drives improvements in system performance. Recently, even open source software has jumped on the high-reliability, fault tolerant bandwagon. MySQL has produced an enterprise-caliber version of its database software that uses clustering (Krill), a fault tolerance technique that allows multiple servers to be treated as one unit, thus building redundancy into the system that is available to end users.
Future systems and architectures will become increasingly reliable despite constant hacker attacks, hardware failures and other yet-to-be-seen vulnerabilities. In fact, it may be that the adversity every network faces only serves to make the next generation of networks stronger.
References
Caldara, Stephen; Manning, Thomas; and Quinlivan, Joseph. (December 2000). Guidelines for a fault tolerant network. Reprinted from Network Reliability supplement to America’s Network. Retrieved from www.ITpapers.com on July 2, 2004.
Gilmer, Brad. (November 2003). Building disaster-resistant computer networks. Broadcast Engineering, Vol. 45, Issue 11. Retrieved from EBSCOhost on June 30, 2004.
Horowitz, Paul and Hill, Winfield. (August 1, 1989). The art of electronics. Cambridge University Press; 2nd edition.
Hutchinson, Art. (February 17, 1997). Security is just one issue in understanding e-commerce. Electronic Commerce. Communications Week. Retrieved from EBSCOhost on June 29, 2004.
Krill, Paul. (April 19, 2004). Database vendors update wares. Retrieved from www.infoworld.com on June 27, 2004.