Dependability of a System

Since the advent of artificial intelligence, technology has become increasingly complex. Such complexity has put tremendous pressure on experts to build dependable systems that can carry out their intended operations without risk to humans. Experts suggest that dependability is an integration of attributes such as reliability, availability, security and maintainability, among others. A system that is capable of performing its job without negative effects is therefore considered dependable.

Sophisticated medical devices like the Multileaf Collimator (MLC) utilize artificial intelligence to replicate procedures that were once possible only under acute human supervision. Developers of the Multileaf Collimator should understand the implications of a dependable system that can provide effective radiotherapy without harming patients. Threats to a dependable system usually take the form of faults, errors and failures, and developers must devise precautions to safeguard against them.

Faults, Errors and Failures - A Potent Combination
A fault mostly lies dormant, but whenever it becomes active it produces an error. Users and developers of a system may not know about a fault until it activates. The error generated by such a fault can, in turn, cause a failure. Most errors occur without major consequences, which is why users pay little attention to them; only when an error produces a major failure do developers and users become concerned.
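To make the chain concrete, consider a minimal sketch (hypothetical code, not drawn from any real system): the defect in the code is the fault; it stays dormant until a particular input activates it, producing an error; whether that error becomes a visible failure depends on how the system handles it.

```python
# Sketch of the fault -> error -> failure chain (hypothetical example).

def average(values):
    # FAULT: no guard against an empty list. The defect lies dormant
    # for every non-empty input.
    return sum(values) / len(values)

print(average([10, 20, 30]))    # fault dormant: correct result 20.0

try:
    average([])                 # the empty input activates the fault...
except ZeroDivisionError as exc:
    # ...producing an ERROR; if nothing contains it, the service stops,
    # which users observe as a FAILURE.
    print("failure observed:", exc)
```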

Case Study - Therac-25
Such failures can be catastrophic. In 1986, such was the case at the East Texas Cancer Center, where a fault in the Therac-25 linear accelerator produced an error that ultimately caused a catastrophic failure, leading to the death of a patient (Leveson, 1995). This case helps developers understand how faults, errors and failures interact in a chain reaction that threatens the dependability of an otherwise very popular system.

It all started when a cancer patient was being treated by the linear accelerator. The operator at the console was experienced and had performed such procedures many times before. On that unfortunate day, the operator mistakenly entered erroneous dosage data into the system. Upon realizing the mistake, and before turning the beam on, the operator quickly retyped the correct parameters.

The entire edit took less than eight seconds. Unknown to the operator, the system never registered the corrected parameters because the Therac-25 was still processing the original command. The operator had become so fast at typing that the correction was completed within the window during which the system was acting on the initial request, so the corrected command was silently ignored.
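Investigators attributed this to a race between the data-entry task and the treatment task. The following is a minimal sketch of that class of bug in modern terms (hypothetical code; the actual Therac-25 software was custom assembly tasks, not Python threads):

```python
import threading
import time

# Hypothetical sketch of a data-entry race: the treatment task
# snapshots the parameters once, so edits made while it is running
# are silently lost.

params = {"mode": "x-ray", "dose": 25000}   # erroneous first entry

def treatment_task():
    snapshot = dict(params)         # parameters are read once
    time.sleep(8)                   # simulated setup window
    print("delivering:", snapshot)  # edits after the snapshot are ignored

task = threading.Thread(target=treatment_task)
task.start()
time.sleep(2)                       # operator notices the mistake...
params["dose"] = 180                # ...and retypes within the window
task.join()                         # the stale, erroneous dose is used
```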

This fault produced a rare error, Malfunction 54. The error propagated through the service interface, and a lethal dose was delivered to the patient; that lethal dose was a catastrophic failure, resulting in the patient's death. Interestingly, the underlying defect was also present in the earlier Therac-20. Subsequent investigations revealed that the Therac-20 produced the same error, but that machine had hardware interlocks, which prevented such a lethal dose from being delivered.

It is no overstatement to suggest that the developers of the Therac-25 were overconfident in their software design. They were so confident that they omitted the hardware interlocks and checkpoints that could have stopped the system from activating the beam. Developers of sophisticated systems should therefore not focus merely on software but also coordinate with hardware manufacturers to build a dependable system capable of preventing such threats. In short, safety should be seen as a property of the entire system, not just of the software.

PART II
Moral Standards
Software developers are entrusted by the general public to hold high moral values. Although software development is largely a behind-the-scenes job that requires little public interaction, computer scientists are as accountable to society as any other professionals. Professional bodies such as the Association for Computing Machinery (ACM) provide guiding principles that prescribe appropriate courses of action for their members.

While it is almost impossible to enumerate every rule of moral conduct, professionals are expected to understand the rules that exist and, in their absence, make sound decisions based on the prescribed guiding principles. The importance of reading and understanding these guidelines becomes apparent whenever professionals must make an ethical decision that no precise rule covers. Under those circumstances, only knowledgeable professionals can reach a decision consistent with the adopted guidelines.

High Standards of Software Developers
As stated, devices like the Multileaf Collimator are sophisticated medical equipment that depend on artificial intelligence and, without proper safeguards, may inflict fatal injuries. Given its function, a development team designing such an instrument is ethically more liable to its users than the makers of software for personal computers. Sociology may not rank ethics as high or low, but it does differentiate between software that can affect people mentally and software that can directly or indirectly inflict mortal injury. This is precisely why development of the Collimator should only be undertaken by professionals who not only understand ethics but can derive decisions from the relevant codes of ethics. In short, developers of Collimator control software are responsible to the entire medical community, including those who seek treatment and advice.

According to the moral imperatives defined by the ACM, the aim of computer professionals should be to make a positive contribution to society by improving quality of life and creating a safe environment. Since 1965, medical software developers have certainly improved the quality of care by introducing advanced radiation therapy techniques that help patients undergo complex medical procedures (Avizienis, Laprie & Randell). Still, the events surrounding the Therac-25 continue to challenge the typical notion of morality. It should be understood that harm involves not only intentional injury but also associated negative consequences. The Therac-25 incident at the Kennestone Regional Oncology Center exemplifies how negative consequences may escalate, leading to seriously harmful situations.

Investigations into these cases revealed that Atomic Energy of Canada Limited (AECL), which had been instrumental in developing the Therac machines, had no formal procedure for following up reports of suspected accidents. It was only after a lawsuit was filed that AECL kept any sort of official record. Furthermore, it is assumed that AECL's management was aware of the incident yet took no action to solve the problem. It can be argued that AECL management did not act morally, because its reluctance to do anything without a formal report helped induce further negative consequences. Software developers of the MLC should therefore not wait for official confirmation but instead be proactive in managing compliance affairs.

In this regard, a software engineer may question the feasibility of acting on every software glitch. A well-placed mechanism for reporting such incidents may be the answer. NASA implemented such a confidential reporting system after the Challenger accident, which was caused by a series of minor faults. The new system called on all 100,000 NASA employees and contractors to anonymously submit reports of incidents, major or minor, to a third party that acted as NASA's safety consultant on space missions (Miya, 1987). Such a reporting system allowed NASA to address problems that were never publicized. One wonders whether, had a similar procedure been in place at AECL, the subsequent incidents might not have taken place. These examples highlight that software engineers must not only consider the ramifications of their programming but also coordinate with other professionals in their industry who are directly or indirectly engaged in the development process.

Such a task may also require software developers to work with the person in charge of writing the operating manual. Radiotherapy, for example, is conducted by an operator who is neither a medical doctor nor a software developer, so the operator's duties rely primarily on the operating manuals and instructions provided by technicians. Under these circumstances, a good software developer can make sure that error messages have simple, detailed explanations that are easy to understand. Developers may even help write the operating manual by elaborating on frequently generated errors. Good coordination among departments is thus also good ethics. The case files of the Therac-25 accidents expose poor communication on the part of the software developers; they indicate that the developers never adequately coordinated with the hardware equipment supplier, which could have provided additional safeguards.

ACM guidelines also provide clear instructions on privacy. Privacy may not seem a huge concern for software developers of the MLC, but an example from the Sears department store illustrates its importance. In January 2008, a software glitch allowed anyone in the public to access Sears customers' purchase histories by typing a name and related profile details at www.managemyhome.com (Benjamin, 2008). To most of us, a purchase history may not reveal much, but consider what would happen if a miscreant posing as a serviceman or marketer used this data to gain access to the homes of Sears customers. Such an intrusion could occur if an appliance were recalled and the impostor posed as a repairman.

Interestingly, before this news was published, a similar breach at the Buy.com store had also been noticed. Now consider how vulnerable a patient receiving radiotherapy would feel if their personal information were accessed by an unauthorized individual. A software developer of the MLC needs to consider how much information should be provided to an operator or to anyone reading a particular medical report, and to analyze when additional information will benefit medical experts. As stated earlier, software development for the MLC is not a stand-alone process; it requires engineers to consult the relevant authorities so that the data revealed is not only useful and transparent but also safe. This attitude is in line with the ACM's professional responsibilities, which require computer professionals to give comprehensive evaluations of computer systems and their impacts, including analysis of possible risks.

This year, the New York Times has started publishing detailed analyses of radiation therapy accidents involving linear accelerators (Bogdanich, 2010). Investigators found 621 mistakes from 2001 to 2008, of which 133 incidents directly involved a radiotherapy device. Moreover, no single agency oversees medical radiation. Dr. John J. Feldmeier of the University of Toledo, a well-known radiation injury expert, estimates that 1 in 20 of all therapy patients sustains some kind of injury. The Times further reports some 3,000 cases of radiation injury recorded by a leading US wound-care company.
These reports suggest that there is still a long way to go in developing a truly dependable system. Beyond software malfunctions, developers of the MLC should be concerned not only with programming but with envisioning their responsibilities on a much broader scale, incorporating moral values such as communication and privacy. Only then will they be able to realize their roles within the ethical framework of society as a whole.

PART III
Fault Analysis Techniques
Scientists have described various fault analysis tools that help software designers perform checks before, during and after the implementation of a system. By definition, terms such as hazard, fault and error are intertwined, so it is reasonable to discuss safeguarding techniques in the context of faults. It is no oversimplification to say that the dependability of a system rests on four techniques: fault prevention, fault tolerance, fault removal and fault forecasting. It can generally be assumed that all fault analysis methods fall within these four categories.

In his book Hazard Analysis Techniques for System Safety, Clifton A. Ericson presents multiple fault analysis techniques that vary with the type of analysis and the development phase. These include proven models such as Petri net analysis, fault tree analysis, Markov analysis and bent pin analysis, among others. The author explains more than 20 methods, but in general all of them are forms of fault prevention, fault tolerance, fault removal or fault forecasting (Figure 1). Among these, fault removal and fault prevention are hazard analysis techniques that continue during and after the development phase.

Fault Removal
Fault removal is an ongoing process that includes active analysis during the development phase of software. Verification is the first step: system designers verify that the system can perform all the functions it is designed for. If a fault is detected, the system must be corrected and re-verified before development continues. Static verification is carried out without executing the system; walkthroughs, inspections and theorem proving are kinds of static verification that do not require engineers to run the system. In contrast, dynamic verification requires engineers to supply inputs to the system. These inputs can be symbolic in nature, but mostly they are real inputs, in which case the activity is termed testing.
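As a simple illustration of dynamic verification, the sketch below tests a hypothetical dose-validation routine with real inputs, including boundary values (the function name and the limit are assumptions made for illustration, not part of any actual device):

```python
import unittest

MAX_DOSE = 200  # hypothetical machine limit, arbitrary units

def validate_dose(dose: float) -> float:
    """Reject doses outside the machine's safe envelope."""
    if not 0 < dose <= MAX_DOSE:
        raise ValueError(f"dose {dose} outside safe range")
    return dose

class TestValidateDose(unittest.TestCase):
    def test_typical_dose_accepted(self):
        self.assertEqual(validate_dose(180), 180)

    def test_boundary_dose_accepted(self):
        self.assertEqual(validate_dose(MAX_DOSE), MAX_DOSE)

    def test_excessive_dose_rejected(self):
        with self.assertRaises(ValueError):
            validate_dose(MAX_DOSE + 1)   # just past the boundary

    def test_zero_and_negative_rejected(self):
        for bad in (0, -5):
            with self.assertRaises(ValueError):
                validate_dose(bad)

if __name__ == "__main__":
    unittest.main()
```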

Fault Removal - Therac-25
In fault removal analysis, fault injection is probably one of the most popular verification techniques. Designers intentionally inject a fault into the system to check whether it can operate safely when pushed beyond the limits of its prescribed usage. For a software designer, fault injection analysis helps confirm that the software stays within the prescribed boundary of human endurance by limiting the dose that can be delivered. Reconsider the Therac-25 cases, where a software fault led to the beam delivering a lethal dose, a hundred times more than a person can withstand.
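A minimal sketch of this idea follows (hypothetical names and values throughout): a corrupted-parameter fault is injected deliberately, and the verification run asserts that an independent safety check, analogous to the Therac-20's hardware interlocks, still fires.

```python
MAX_DOSE = 200  # hypothetical safe limit, arbitrary units

class HardwareInterlock:
    """Independent last-line check, analogous to the Therac-20 interlocks."""
    def permit(self, dose: float) -> bool:
        return 0 < dose <= MAX_DOSE

def deliver(dose: float, interlock: HardwareInterlock) -> str:
    if not interlock.permit(dose):
        return "ABORT: interlock tripped"
    return f"delivered {dose}"

def inject_corruption(dose: float) -> float:
    # Fault injection: simulate a data-entry/race fault that multiplies
    # the dose by a large factor before it reaches the beam controller.
    return dose * 100

# Verification run: even with the injected fault, the interlock must trip.
interlock = HardwareInterlock()
corrupted = inject_corruption(180)
assert deliver(corrupted, interlock) == "ABORT: interlock tripped"
print("fault injected, interlock held:", corrupted)
```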

In fault prevention, preventive maintenance is routinely carried out before operations; its aim is to uncover faults before they are activated. Corrective maintenance, by contrast, is deployed after one or more errors have already been reported. Again, it is useful to recall the Therac-25 incidents. Had a corrective maintenance mechanism been in place, errors such as Malfunction 54 could have been reproduced and dealt with accordingly. The lack of communication and of fault removal methods allowed six patients to undergo faulty procedures, even though such errors had already surfaced on the earlier Therac-20 when they were produced by trainee students.

Fault Forecasting
Fault forecasting is usually conducted whenever a fault is activated. It comes in two types: qualitative and quantitative evaluation. In qualitative fault forecasting, analysts identify faults by classifying them into categories, which allows the developer to set priorities when working with multiple faults. Quantitative fault forecasting, on the other hand, evaluates faults in terms of probabilities: each fault is measured by the probability with which the system still satisfies the different attributes of dependability. Failure mode and effects analysis is an example of qualitative fault forecasting, while Markov chains are considered a type of quantitative analysis.
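As a small quantitative sketch, assume a two-state Markov model (working/failed) with hypothetical per-step failure and repair probabilities; iterating the transition equations yields the steady-state availability of the system.

```python
# Quantitative forecasting sketch: a two-state discrete-time Markov
# chain. The transition probabilities below are assumed, not measured.
p_fail, p_repair = 0.001, 0.1

# state vector: [P(working), P(failed)]
state = [1.0, 0.0]
for _ in range(10_000):        # iterate until the chain settles
    working, failed = state
    state = [
        working * (1 - p_fail) + failed * p_repair,
        working * p_fail + failed * (1 - p_repair),
    ]

print(f"steady-state availability ~ {state[0]:.4f}")  # ~0.9901
```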

Example - Fault Tree Forecasting
Fault tree analysis is an increasingly popular method of fault forecasting that combines both quantitative and qualitative evaluation, so a brief overview can provide useful insight into the functioning of a reliable system. A fault tree is a robust model designed to calculate the probability of occurrence of a certain event.

The popularity of the fault tree rests on the fact that it is very easy to implement, understand and design in comparison with other fault analysis systems. Using the fault tree methodology, analysts apply a pre-defined notation to model the combinations of fault events that can trigger a fault within a system (Figure 2). As the name suggests, the tree consists of branches protruding from different causes of failure, reproducing all possible faults behind a problem, and its structure is elaborated whenever a new fault emerges. Graphically, it is simply a combination of logic gates and fault events, all of which translate easily into a mathematical model that calculates the probability of the top-level fault from the probabilities of individual causes. The hierarchical nature of the fault tree allows software developers to add or remove fault events, letting them build an extensive yet understandable cause-and-effect model.
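A minimal sketch of this translation into a mathematical model follows. The events, probabilities and tree shape are hypothetical, chosen to echo the Therac case, and the gate formulas assume the basic events are independent.

```python
# Fault tree sketch: basic events feed OR/AND gates; under the
# independence assumption, each gate's probability follows directly
# from its inputs.

def gate_or(*probs):
    # P(at least one input event) = 1 - product of (1 - p_i)
    out = 1.0
    for p in probs:
        out *= (1.0 - p)
    return 1.0 - out

def gate_and(*probs):
    # P(all input events) = product of p_i
    out = 1.0
    for p in probs:
        out *= p
    return out

# Basic events (assumed probabilities per treatment session)
p_entry_race   = 1e-4   # editing race corrupts parameters
p_mode_race    = 5e-5   # mode-switch race corrupts beam settings
p_check_miss   = 1e-2   # software consistency check misses the corruption
p_interlock_ko = 1e-3   # hardware interlock absent or failed

# Top event: an overdose needs a corrupted parameter (either race),
# AND the software check missing it, AND the interlock failing.
p_corrupt  = gate_or(p_entry_race, p_mode_race)
p_overdose = gate_and(p_corrupt, p_check_miss, p_interlock_ko)
print(f"P(overdose) with interlock ~ {p_overdose:.2e}")          # ~1.5e-9
print(f"P(overdose) no interlock   ~ {gate_and(p_corrupt, p_check_miss):.2e}")  # ~1.5e-6
```

Note how removing the interlock branch, as the Therac-25 design effectively did, raises the top-event probability by three orders of magnitude in this toy model.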

Fault Forecasting - Therac-25
As with fault removal, fault forecasting with fault trees serves two functions: proactive analysis is integrated during the development stages, while reactive analysis is performed after a fault is detected. The simple graphical notation of a fault tree is helpful for drawing fault events that have not yet been activated, and it allows developers to understand how system events interrelate and depend on one another. Recall the Therac-25 events, where the software developers evidently never realized how much the earlier Therac-20 had depended on hardware checks to prevent an accident. Fault tree analysis could, in theory, have helped the designers generate further failure scenarios once multiple errors had been detected.

Recommendations
To be cost-effective, experts recommend that fault tree analysis be used early in the design process. SHARPE, SPNP and SURF-2 are some of the tools that help software engineers use fault tree and Markov chain forecasting models. Irrespective of the methodology chosen, it is important that software developers consult literature that defines fault analysis theory in detail (Ericson, 2005). Until now there has been a shortage of such literature, so books like Ericson's Hazard Analysis Techniques for System Safety should serve as a good source of information on the different methods of fault analysis.
