System Reliability, Availability, and Maintainability
Reliability, availability, and maintainability (RAM) are three system attributes of tremendous interest to systems engineers, logisticians, and users alike. Collectively, they affect the economic life-cycle costs of a system and its utility.
a) Probability models for populations
Reliability is defined as the probability that an item will perform a defined function in a defined environment without failure for a specified period of time. A precise definition must include a detailed description of the function and what constitutes a failure, the environment, and a definition of the time scale. Each can be surprisingly difficult to define precisely. Different failure mechanisms are referred to as failure modes, and can be modeled separately or aggregated into a single failure model. Let T be a random time to failure. Reliability can be thought of as the complement of the cumulative distribution function (CDF) F for T for a given set e of environmental conditions:
$$R(t \mid e) = P(T > t \mid e) = 1 - F(t \mid e)$$
Maintainability is defined as the probability that an item can be repaired in a defined environment within a specified period of time. Increased maintainability implies shorter repair times.
Availability is the probability that a repairable system is operational at a given point in time, under a given set of environmental conditions. Availability depends on reliability and maintainability, and is discussed in detail later in this topic.
Each of these probability models is usually specified by a continuous, non-negative distribution. Typical distributions used in practice include the exponential (possibly with a threshold parameter), the Weibull (possibly with a threshold parameter), the log-normal, the generalized gamma, and others.
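As an illustration of one of these distributions, the following is a minimal sketch of a Weibull reliability function with an optional threshold parameter. The function names and the parameter values in the usage lines are hypothetical and chosen only for the example; the exponential case is obtained by setting the shape to 1.

```python
import math

def weibull_reliability(t, shape, scale, threshold=0.0):
    """R(t) = exp(-((t - threshold) / scale) ** shape) for t > threshold, else 1."""
    if t <= threshold:
        return 1.0  # no failure can occur before the threshold
    return math.exp(-(((t - threshold) / scale) ** shape))

def exponential_reliability(t, rate, threshold=0.0):
    """The exponential is the Weibull with shape 1 and scale 1 / rate."""
    return weibull_reliability(t, shape=1.0, scale=1.0 / rate, threshold=threshold)

if __name__ == "__main__":
    # Hypothetical parameters: shape 1.5, scale 1000 hours, threshold 100 hours
    for t in (0, 500, 1000, 2000):
        print(t, round(weibull_reliability(t, 1.5, 1000.0, 100.0), 4))
```

A shape below 1 gives a decreasing failure rate, a shape of 1 a constant rate, and a shape above 1 an increasing rate, which is one reason the Weibull is a common first choice for life data.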
Maintainability models present some interesting challenges. The time to repair an item is the sum of the time required for evacuation, diagnosis, assembly of resources (parts, bays, tools, and mechanics), repair, inspection, and return. Administrative delay (such as holidays) can also affect repair times. Often these sub-processes have a minimum time to complete that is not zero, resulting in the distribution used to model maintainability having a threshold parameter. A threshold parameter is defined as the minimum time to repair that has positive probability. Estimation of the maintainability can be further complicated by queuing effects, resulting in times to repair that are not independent. This dependency frequently makes analytical solution of problems involving maintainability intractable and promotes the use of simulation to support analysis.
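The sum-of-sub-processes structure can be sketched with a small simulation. This is only an illustration under strong assumptions: the stage names follow the paragraph above, but the minimum times and mean delays are invented, each stage is modeled as a threshold plus an exponential delay, and the stages are treated as independent (so queuing effects are ignored).

```python
import random

# Hypothetical stages: (name, minimum hours, mean of the extra exponential delay in hours)
STAGES = [
    ("evacuation", 2.0, 4.0),
    ("diagnosis", 0.5, 1.5),
    ("assembly of resources", 1.0, 6.0),
    ("repair", 1.0, 3.0),
    ("inspection and return", 0.5, 1.0),
]

# The overall threshold is the sum of the stage minimums (5.0 hours here).
THRESHOLD = sum(minimum for _, minimum, _ in STAGES)

def sample_repair_time(rng):
    """One simulated time to repair: each stage is its minimum plus an exponential delay."""
    return sum(minimum + rng.expovariate(1.0 / mean_extra)
               for _, minimum, mean_extra in STAGES)

def maintainability(t, n=20000, seed=1):
    """Estimate M(t) = P(time to repair <= t) by Monte Carlo simulation."""
    rng = random.Random(seed)
    hits = sum(sample_repair_time(rng) <= t for _ in range(n))
    return hits / n

if __name__ == "__main__":
    for hours in (5, 12, 24, 48):
        print(hours, maintainability(hours))
```

Because every stage has a nonzero minimum, the estimated maintainability is zero for any time below the overall threshold, which is the behavior a threshold parameter is meant to capture.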
b) Data issues
The true RAM models for a system are generally not known. Data on a given system is assumed or collected, and that data is used to select a distribution for a model and then to fit the parameters of the distribution. This process differs significantly from the one usually taught in an introductory statistics course. First, the normal distribution is seldom used as a life distribution, since it assigns positive probability to negative times. Second, and more importantly, reliability data is different from classic experimental data. Reliability data is often censored, biased, observational, and missing information about covariates such as environmental conditions. Data from testing is often expensive, resulting in small sample sizes. These problems with reliability data require sophisticated strategies and processes to mitigate them. One consequence of these issues is that estimates based on limited data can be very imprecise.
c) Design issues
System requirements should include specifications for reliability, maintainability, and availability, and each should be conditioned on the projected operating environments.
A proposed design should be analyzed prior to development to estimate if it meets those specifications. This is usually done by assuming historical data on actual or similar components represents the future performance of the components for the proposed system. If no data is available, conservative engineering judgment is often applied. The system dependency on the reliability of its components can be captured in several ways, including reliability block diagrams, fault trees, and failure mode effects and criticality analyses.
If a proposed design does not meet the preliminary RAM specifications, it can be adjusted. Critical failures are mitigated so that the overall risk is reduced to acceptable levels. This can be done in several ways.
Fault tolerance is a strategy that seeks to make the system robust against the failure of a component. This can be done by introducing redundancy. Redundant units can operate in a ‘stand-by’ mode. A second tolerance strategy is to have the redundant components share the load, so that one or more of them may fail yet the system continues to operate. There are modeling issues associated with redundancy, including switching between components, warm-up, and increased failure rates for surviving units under increased load when another load-sharing unit fails.
Redundancy can be an expensive strategy as there are cost, weight, volume, and power penalties associated with stand-by components.
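The benefit of redundancy can be illustrated with a minimal sketch that compares a single unit, two units in active parallel, and one unit with a cold stand-by spare. The sketch assumes identical units with exponential lives, perfect and instantaneous switching, and no load-sharing effect on failure rates; these are simplifying assumptions, and the function names and numbers are hypothetical.

```python
import math

def single_unit(t, rate):
    """Reliability of one unit with a constant failure rate."""
    return math.exp(-rate * t)

def active_parallel(t, rate, n=2):
    """n identical units all running; the system fails only when all have failed."""
    return 1.0 - (1.0 - math.exp(-rate * t)) ** n

def cold_standby(t, rate):
    """One running unit plus one idle spare, with perfect switching.
    The system life is the sum of two exponential lives (an Erlang-2 distribution)."""
    return math.exp(-rate * t) * (1.0 + rate * t)

if __name__ == "__main__":
    rate, t = 0.001, 1000.0  # hypothetical: one failure per 1000 hours, 1000-hour mission
    print(single_unit(t, rate), active_parallel(t, rate), cold_standby(t, rate))
```

Under these idealized assumptions the stand-by arrangement does best because the spare does not age while idle; imperfect switching or warm-up failures would narrow or reverse that advantage.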
Fault avoidance seeks to improve individual components so that they are more reliable. This too can be an expensive strategy, but it avoids the power, weight, and volume penalties associated with using redundant components, as well as the switching issues.
A third strategy is to repair or replace a component following a preventive maintenance schedule. This usually requires the assumption that the repair returns the component to “good as new” status, or possibly to an earlier age-equivalent. These assumptions can cause difficulties: for example, an oil change on a vehicle does not return the engine to ‘good as new’ status. Scheduled replacement can return a unit to good as new, but at the cost of wasting potential life for the replaced unit. As a result, the selection of a replacement period is a non-linear optimization problem that minimizes total expected life-cycle costs. These costs are the sum of the expected costs of planned and unplanned maintenance actions.
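One common way to frame this optimization is an age-replacement policy: replace the unit at age T or at failure, whichever comes first, and choose T to minimize the long-run expected cost per unit time. The sketch below assumes a Weibull life distribution, "good as new" replacement, and made-up costs; it uses a simple grid search rather than a formal optimizer, and all names and values are hypothetical.

```python
import math

def weibull_reliability(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

def cost_rate(T, shape, scale, c_planned, c_unplanned, steps=2000):
    """Expected cost per unit time when replacing at age T or at failure, whichever is first."""
    r_T = weibull_reliability(T, shape, scale)
    # Expected cycle length is the integral of R(t) from 0 to T (trapezoid rule).
    h = T / steps
    area = 0.5 * (1.0 + r_T) * h
    area += h * sum(weibull_reliability(i * h, shape, scale) for i in range(1, steps))
    # Expected cost per cycle: planned if the unit survives to T, unplanned otherwise.
    expected_cost = c_planned * r_T + c_unplanned * (1.0 - r_T)
    return expected_cost / area

def best_replacement_age(shape, scale, c_planned, c_unplanned, candidates):
    """Grid search over candidate replacement ages."""
    return min(candidates, key=lambda T: cost_rate(T, shape, scale, c_planned, c_unplanned))

if __name__ == "__main__":
    ages = range(50, 3001, 50)
    # Hypothetical: wear-out (shape 2.5), scale 1000 h, unplanned failure costs 10x planned
    print(best_replacement_age(2.5, 1000.0, 1.0, 10.0, ages))
```

With a constant failure rate (shape 1) the search drifts to the largest candidate age, reflecting that scheduled replacement only pays off when units wear out.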
A fourth strategy is to control the environment so that a system is not operated under conditions that accelerate the aging of its components.
Any or all of the above strategies (fault tolerance, fault avoidance, preventive maintenance, and environmental control) may be applied to improve the designed reliability of a system.
d) Post-production management systems
Once a system is fielded, its reliability and availability should be tracked. Doing so allows the producer / owner to verify that the design has met its RAM objectives, to identify unexpected failure modes, to record fixes, to assess the utilization of maintenance resources, and to assess the operating environment.
One such tracking system is generically known as a FRACAS (Failure Reporting, Analysis, and Corrective Action System). Such a system captures data on failures and improvements to correct failures. This database is separate from a warranty database, which is typically run by the financial function of an organization and tracks costs only.
A FRACAS for an organization is a system, and should itself be designed following systems engineering principles. In particular, a FRACAS supports later analyses, and those analyses impose data requirements. Unfortunately, careful consideration of the backward flow from decision to analysis to model to required data too seldom occurs, and analysts find after the fact that their data collection systems are missing essential information. Proper prior planning prevents this poor performance.
Of particular import is a plan to track data on units that have not failed. Units whose precise times of failure are unknown are referred to as censored units. Inexperienced analysts frequently do not know how to analyze censored data, and omit the censored units as a result. This can badly bias an analysis.
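The size of this bias can be seen in a small sketch. It assumes an exponential life model with right-censored units, for which the maximum likelihood estimate of mean life is the total time on test divided by the number of observed failures. The data values are invented for illustration.

```python
def exponential_mean_life(times, failed):
    """MLE of mean life for exponential data with right censoring:
    total time on test divided by the number of observed failures."""
    n_failures = sum(failed)
    if n_failures == 0:
        raise ValueError("no failures observed; the estimate is undefined")
    return sum(times) / n_failures

# Hypothetical test: 6 units, test stopped at 1000 h; 3 failed, 3 still running (censored)
times = [210.0, 480.0, 790.0, 1000.0, 1000.0, 1000.0]
failed = [True, True, True, False, False, False]

with_censoring = exponential_mean_life(times, failed)          # 4480 / 3
failures_only = exponential_mean_life(times[:3], failed[:3])   # 1480 / 3

if __name__ == "__main__":
    print(with_censoring, failures_only)
```

Dropping the three surviving units discards 3000 hours of failure-free operation, so the failures-only estimate of mean life is far too pessimistic in this example.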
An organization should have an integrated data system that allows reliability data to be considered with logistical data, such as parts, personnel, tools, bays, transportation and evacuation, queues, and costs, allowing a total awareness of the interplay of logistical and RAM issues. These issues in turn must be integrated with management and operational systems, for an organization to reap the benefits that can occur from complete situational awareness with respect to RAM.
e) Models
There is a wide range of models that estimate and predict reliability. Simple models, such as the exponential distribution, can be useful for ‘back of the envelope’ calculations. There are more sophisticated probability models used for life data analysis. They are best characterized by their failure rate behavior. The failure rate is defined as the probability that a unit fails in the next small interval of time, given it has lived until the beginning of the interval, divided by the length of the interval, in the limit as the interval length goes to zero.
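The failure rate definition can be checked numerically. The sketch below assumes a Weibull life model (an illustrative choice, with hypothetical parameter values) and compares the closed-form failure rate with the conditional failure probability over a small interval divided by the interval length.

```python
import math

def weibull_reliability(t, shape, scale):
    return math.exp(-((t / scale) ** shape))

def weibull_failure_rate(t, shape, scale):
    """Closed form h(t) = f(t) / R(t) = (shape / scale) * (t / scale) ** (shape - 1), t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def interval_failure_rate(t, dt, shape, scale):
    """P(fail in (t, t + dt] | alive at t) / dt, which approaches h(t) as dt shrinks."""
    r_now = weibull_reliability(t, shape, scale)
    r_later = weibull_reliability(t + dt, shape, scale)
    return (r_now - r_later) / (r_now * dt)

if __name__ == "__main__":
    for shape in (0.5, 1.0, 2.0):  # decreasing, constant, and increasing failure rates
        print(shape, [weibull_failure_rate(t, shape, 1000.0) for t in (100.0, 500.0, 1000.0)])
```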
The models can be considered for a fixed environmental condition, or they can be extended to model the effect of environmental conditions on system life. Those models can in turn be used for accelerated life testing (ALT) where a system is deliberately and carefully overstressed to induce failures more quickly, and the data extrapolated to usual use conditions. This is often the only way to obtain estimates of the life of highly reliable products in a reasonable amount of time.
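As one illustration of extrapolating from stress to use conditions, the sketch below uses the Arrhenius relationship, a widely used model for temperature-accelerated failure mechanisms. The text above does not name a specific model, so this choice, the activation energy, and the temperatures are all assumptions made for the example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius_acceleration_factor(activation_energy_ev, use_temp_c, stress_temp_c):
    """Ratio of life at the use temperature to life at the stress temperature."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (1.0 / use_k - 1.0 / stress_k)
    return math.exp(exponent)

def equivalent_use_hours(test_hours, acceleration_factor):
    """Hours at use conditions represented by a given number of hours under stress."""
    return test_hours * acceleration_factor

if __name__ == "__main__":
    # Hypothetical: 0.7 eV activation energy, 40 C in use, 100 C on test
    af = arrhenius_acceleration_factor(0.7, 40.0, 100.0)
    print(af, equivalent_use_hours(1000.0, af))
```

The extrapolation is only as good as the assumption that the overstress excites the same failure mechanism, with the same activation energy, as normal use.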
Also useful are degradation models, where some characteristic of the system is associated with the propensity of the unit to fail. As that characteristic degrades, we can estimate times of failure before they occur.
It is often the case that the initial developmental units of a system do not meet their RAM specifications. Reliability growth models allow us to estimate the amount of resources (particularly testing time) necessary before a system will mature to meet those goals.
Maintainability models describe the time necessary to return a failed repairable system to service. They are usually the sum of a set of models describing different aspects of the maintenance process, such as diagnosis, repair, inspection, reporting, and evacuation. These models often have threshold parameters, which are minimum times until an event can occur.
Logistical support models attempt to describe flows through a logistics system, and quantify the interaction between maintenance activities and the resources available to support those activities. Queue delays, in particular, are a major source of down-time for a repairable system and a logistical support model allows one to explore the trade space between resources and availability.
All these models are abstractions of reality, and all are at best approximations to reality. To the extent they provide useful insights, they are still very valuable. The more complicated the model, the more data necessary to estimate it precisely. The greater the extrapolation required for a prediction, the greater the imprecision. Extrapolation is often unavoidable, because high-reliability equipment typically has a long life, and the amount of time required to observe failures may exceed test times. This requires that strong assumptions be made about future life (such as the absence of masked failure modes), and these assumptions increase our uncertainty about our predictions. The uncertainty introduced by strong model assumptions is often not quantified, and presents a risk to the systems engineer.
f) System metrics
Probabilistic metrics describe system performance for RAM. Reliability, availability, and maintainability were described earlier in this topic. Quantiles, means, and modes of the distributions used to model RAM are also useful.
Availability has some additional definitions, characterizing what downtime is counted against a system. For inherent availability, only downtime associated with corrective maintenance counts against the system. For achieved availability, downtime associated with both corrective and preventive maintenance counts against a system. Finally, operational availability counts all sources of downtime, including logistical and administrative, against a system.
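These three definitions are often written as steady-state ratios of uptime to uptime plus the counted downtime. The sketch below uses a common textbook form of each ratio; the exact formulas and parameter names are assumptions, since the text above only states which downtime is counted, and the numbers are hypothetical.

```python
def inherent_availability(mtbf, mttr):
    """Counts only corrective maintenance: MTBF / (MTBF + mean time to repair)."""
    return mtbf / (mtbf + mttr)

def achieved_availability(mtbm, mean_active_maintenance_time):
    """Counts corrective and preventive maintenance, using the mean time
    between maintenance actions (MTBM) and the mean active maintenance time."""
    return mtbm / (mtbm + mean_active_maintenance_time)

def operational_availability(mtbm, mean_downtime):
    """Counts all downtime, including logistical and administrative delay."""
    return mtbm / (mtbm + mean_downtime)

if __name__ == "__main__":
    # Hypothetical values in hours
    print(inherent_availability(500.0, 4.0))
    print(achieved_availability(300.0, 5.0))
    print(operational_availability(300.0, 20.0))
```

Each successive definition counts more downtime against the system, so for a given system the values typically fall from inherent to achieved to operational availability.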
Availability can also be calculated instantaneously, averaged over an interval, or reported as an asymptotic value. Asymptotic availability can be calculated easily, but care must be taken to analyze whether a system settles down or settles up to the asymptotic value, and how long it takes until the system approaches that asymptotic value.
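The difference between the three calculations can be sketched for the simplest repairable system: a single unit with a constant failure rate and a constant repair rate that starts in the operational state. This two-state Markov model is an assumption made for illustration, and the rates are hypothetical.

```python
import math

def asymptotic_availability(failure_rate, repair_rate):
    """Long-run fraction of time the unit is up."""
    return repair_rate / (failure_rate + repair_rate)

def point_availability(t, failure_rate, repair_rate):
    """Probability the unit is up at time t, given it was up at time 0."""
    total = failure_rate + repair_rate
    steady = repair_rate / total
    return steady + (failure_rate / total) * math.exp(-total * t)

def average_availability(T, failure_rate, repair_rate):
    """Point availability averaged over the interval (0, T]."""
    total = failure_rate + repair_rate
    steady = repair_rate / total
    return steady + (failure_rate / (total * total * T)) * (1.0 - math.exp(-total * T))

if __name__ == "__main__":
    lam, mu = 0.01, 0.1  # hypothetical: MTBF 100 h, mean repair time 10 h
    for t in (0.0, 10.0, 50.0, 200.0):
        print(t, point_availability(t, lam, mu), asymptotic_availability(lam, mu))
```

In this model a system that starts in the operational state settles down to the asymptotic value, so quoting only the asymptote understates availability early in life; the time constant is 1 / (failure rate + repair rate).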
Reliability importance measures the effect on the system reliability of a small improvement in a component’s reliability. It is defined as the partial derivative of the system reliability with respect to the reliability of a component.
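The partial derivative can be approximated numerically for any system whose reliability is a function of its component reliabilities. The sketch below uses a hypothetical three-component system and a central difference; the system layout and the component values are invented for the example.

```python
def example_system(r):
    """Hypothetical system: component 0 in series with a parallel pair (1, 2)."""
    return r[0] * (1.0 - (1.0 - r[1]) * (1.0 - r[2]))

def reliability_importance(system, r, i, eps=1e-6):
    """Central-difference estimate of d(system reliability) / d(r[i])."""
    up = list(r)
    up[i] += eps
    down = list(r)
    down[i] -= eps
    return (system(up) - system(down)) / (2.0 * eps)

if __name__ == "__main__":
    r = [0.95, 0.80, 0.70]
    for i in range(len(r)):
        print(i, round(reliability_importance(example_system, r, i), 4))
```

In this example the series component has by far the largest importance, even though it is already the most reliable, because every path through the system depends on it.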
Criticality is the product of a component’s reliability, the consequences of a component failure, and the frequency with which a component failure results in a system failure. Criticality is a guide to prioritizing reliability improvement efforts.
Many of these metrics cannot be calculated directly because the integrals involved are intractable. They are usually estimated using simulation.
g) System models
There are many ways to characterize the reliability of a system, including fault trees, failure mode effects analysis, and reliability block diagrams.
A reliability block diagram (RBD) is a graphical representation of the reliability dependence of a system on its components. It is a directed, acyclic graph. Each path through the graph represents a subset of system components, and if the components in that path are operational, the system is operational. Component lives are usually assumed to be independent in an RBD. Simple topologies include a series system, a parallel system, a k-out-of-n system, and combinations of these.
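For these simple topologies, system reliability follows directly from the component reliabilities when component lives are independent. The sketch below is a minimal illustration with hypothetical function names and values; the k-out-of-n function further assumes identical components.

```python
from math import comb, prod

def series(reliabilities):
    """All components must work."""
    return prod(reliabilities)

def parallel(reliabilities):
    """At least one component must work."""
    return 1.0 - prod(1.0 - r for r in reliabilities)

def k_of_n(k, n, r):
    """At least k of n identical, independent components must work."""
    return sum(comb(n, j) * r ** j * (1.0 - r) ** (n - j) for j in range(k, n + 1))

if __name__ == "__main__":
    # Combination: one block in series with a parallel pair and a 2-out-of-3 group
    system = series([0.99, parallel([0.9, 0.9]), k_of_n(2, 3, 0.9)])
    print(system)
```

Because each function returns a single reliability value, the result of one block can be passed as a component to another, as in the combined system in the usage lines.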
Reliability block diagrams are often nested, with one RBD serving as a component in a higher level model. These hierarchical models allow the analyst to have the appropriate resolution of detail as necessary while permitting abstraction.
System models require even more data to fit them well. “Garbage in, garbage out” (GIGO) applies with particular strength to system models.
h) Software tools
The specialized analyses required for RAM drive specialized software. General purpose statistical languages, and even spreadsheets, can be used with sufficient effort for reliability analysis, but almost every serious practitioner uses specialized software.
Minitab (versions 13 and later) includes functions for life data analysis. Win Smith is a specialized package that fits reliability models to life data, and can be extended for reliability growth analysis and other analyses. Relex has an extensive historical database of component reliability data, and is useful for estimating system reliability in the design phase.
There is a suite of products from ReliaSoft. Weibull++ fits life models to life data. ALTA fits accelerated life models to accelerated life test data. BlockSim models system reliability, given component data.