Resilience Engineering

From SEBoK
Jump to: navigation, search

According to the Oxford English Dictionary on Historical Principles (1973), resilience is “the act of rebounding or springing back.” This definition most directly fits the situation of materials which return to their original shape after deformation. For human-made systems this definition can be extended to say “the ability of a system to recover from a disruption .” The US government definition for infrastructure systems is the “ability of systems, infrastructures, government, business, communities, and individuals to resist, tolerate, absorb, recover from, prepare for, or adapt to an adverse occurrence that causes harm, destruction, or loss of national significance” (DHS 2010). The concept of creating a resilient human-made system or resilience engineering is discussed by Hollnagel, Woods, and Leveson (2006). The principles are elaborated by Jackson (2010).

Overview

Resilience is a relatively new term in the SE realm, appearing only in the 2006 timeframe and becoming popularized in the 2010 timeframe. The recent application of “resilience” to engineered systems has led to confusion over its meaning and a proliferation of alterative definitions. (One expert claims that well over 100 unique definitions of resilience have appeared.) The details will continue to be discussed and debated, but the information here should provide a working understanding of the meaning and implementation of resilience, sufficient for a system engineer to effectively address it.

Definition

It is difficult to identify a single definition that – word for word – satisfies all. However, it is possible to gain general agreement of what is meant by resilience of engineered systems; viz., Resilience is the ability to provide required capability in the face of adversity.

Scope of the means

In applying this definition, one needs to consider the range of means by which resilience is achieved: The means of achieving resilience include avoiding, withstanding, recovering from and evolving and adapting to adversity. These may also be considered the fundamental objectives of resilience. Classically, resilience includes “withstanding” and “recovering” from adversity. For the purpose of engineered systems, “avoiding” adversity is considered a legitimate means of achieving resilience. Jackson and Ferris (2016). Also, it is believed that resilience should consider the system’s ability to “evolve and adapt” to future threats and unknown-unknowns.

Scope of the adversity

Another consideration will be to identify the range of adversities that are to be considered by resilience. We propose that the SE must consider all; i.e., environmental, normal failure and opponent, friendly & neutral parties.

The purpose of resilience engineering and architecting is to achieve full or partial recovery of a system following an encounter with a threat that disrupts the functionality of that system. Threats can be natural, such as earthquakes, hurricanes, tornadoes, or tsunamis. Threats can be internal and human-made such as reliability flaws and human error. Threats can be external and human-made, such as terrorist attacks. Often, a single incident is the result of multiple threats, such as a human error committed in the attempt to recover from another threat.

Figure 1 depicts the loss and recovery of the functionality of a system. System types include product systems of a technological nature and enterprise systems such as civil infrastructures. They can be either individual systems or systems of systems. A resilient system possesses four attributes — capacity , flexibility , tolerance , and cohesion — and thirteen top level design principles through which to achieve these attributes. The four attributes are adapted from Hollnagel, Woods, and Leveson (2006), and the design principles are extracted from Hollnagel et al. and are elaborated based on Jackson (2010).

Figure 1. Disruption Diagram. (SEBoK Original)

The Capacity Attribute

Capacity is the attribute of a system that allows it to withstand a threat. Resilience allows that the capacity of a system may be exceeded, forcing the system to rely on the remaining attributes to achieve recovery. The following design principles apply to the capacity attribute:

  • The absorption design principle calls for the system to be designed including adequate margin to withstand a design-level threat.
  • The physical redundancy design principle states that the resilience of a system is enhanced when critical components are physically redundant.
  • The functional redundancy design principle calls for critical functions to be duplicated using different means.
  • The layered defense design principle states that single points of failure should be avoided.

The absorption design principle requires the implementation of traditional specialties, such as Reliability and Safety.

The Flexibility Attribute

Flexibility is the attribute of a system that allows it to restructure itself in the face of a threat. The following design principles apply to the flexibility attribute:

  • The reorganization design principle says that the system should be able to change its own architecture before, during, or after the encounter with a threat. This design principle is applicable particularly to human systems.
  • The human backup design principle requires that humans be involved to back up automated systems especially when unprecedented threats are involved.
  • The complexity avoidance design principle calls for the minimization of complex elements, such as software and humans, except where they are essential (see human backup design principle).
  • The drift correction design principle states that detected threats or conditions should be corrected before the encounter with the threat. The condition can either be immediate as for example the approach of a threat, or they can be latent within the design or the organization.

The Tolerance Attribute

Tolerance is the attribute of a system that allows it to degrade gracefully following an encounter with a threat. The following design principles apply to the tolerance attribute.

  • The localized capacity design principle states that, when possible, the functionality of a system should be concentrated in individual nodes of the system and stay independent of the other nodes.
  • The loose coupling design principle states that cascading failures in systems should be checked by inserting pauses between the nodes. According to Perrow (1999) humans at these nodes have been found to be the most effective.
  • The neutral state design principle states that systems should be brought into a neutral state before actions are taken.
  • The reparability design principle states that systems should be reparable to bring the system back to full or partial functionality.

Most resilience design principles affect system design processes such as architecting. The reparability design principle affects the design of the sustainment system.

The Cohesion Attribute

Cohesion is the attribute of a system that allows it to operate before, during, and after an encounter with a threat. According to (Hitchins 2009), cohesion is a basic characteristic of a system. The following global design principle applies to the cohesion attribute.

  • The inter-node interaction design principle requires that nodes (elements) of a system be capable of communicating, cooperating, and collaborating with each other. This design principle also calls for all nodes to understand the intent of all the other nodes as described by (Billings 1991).

The Resilience Process

Implementation of resilience in a system requires the execution of both analytic and holistic processes. In particular, the use of architecting with the associated heuristics is required. Inputs are the desired level of resilience and the characteristics of a threat or disruption. Outputs are the characteristics of the system, particularly the architectural characteristics and the nature of the elements (e.g., hardware, software, or humans).

Artifacts depend on the domain of the system. For technological systems, specification and architectural descriptions will result. For enterprise systems, enterprise plans will result.

Both analytic and holistic methods, including the principles of architecting, are required. Analytic methods determine required capacity. Holistic methods determine required flexibility, tolerance, and cohesion. The only aspect of resilience that is easily measurable is that of capacity. For the attributes of flexibility, tolerance, and cohesion, the measures are either Boolean (yes/no) or qualitative. Finally, as an overall measure of resilience, the four attributes (capacity, flexibility, tolerance, and cohesion) can be weighted to produce an overall resilience score.

The greatest pitfall is to ignore resilience and fall back on the assumption of protection. The Critical Thinking project (CIPP 2007) lays out the path from protection to resilience. Since resilience depends in large part on holistic analysis, it is a pitfall to resort to reductionist thinking and analysis. Another pitfall is failure to consider the systems of systems philosophy, especially in the analysis of infrastructure systems. Many examples show that systems are more resilient when they employ the cohesion attribute — the New York Power Restoration case study by Mendoca and Wallace (2006, 209-219) is one. The lesson is that every component system in a system of systems must recognize itself as such, and not as an independent system.

Practical Considerations

Resilience is difficult to achieve for infrastructure systems because the nodes (cities, counties, states, and private entities) are reluctant to cooperate with each other. Another barrier to resilience is cost. For example, achieving redundancy in dams and levees can be prohibitively expensive. Other aspects, such as communicating on common frequencies, can be low or moderate cost; even there, cultural barriers have to be overcome for implementation.

System Description

A system is “[a]n integrated set of elements, subsystems or assemblies that accomplish a defined objective.” INCOSE (2015) A capability is “…an expression of a system … to achieve a specific objective under stated conditions.” INCOSE (2015)

Resilience is the ability of a system to provide required capability in the face of adversity. Resilience in the realm of systems engineering involves identifying: 1) the capabilities that are required of the system, 2) the adverse conditions under which the system is required to deliver those capabilities, and 3) the systems engineering to ensure that the system can provide the required capabilities.

Put simply, resilience is achieved by a systems engineering focusing on adverse conditions.

Principles for Achieving Resilience

34 principles and support principles described by Jackson and Ferris (2013) include both design and process principles that will be used to define a system of interest in an effort to make it resilient. These principles were extracted from many sources. Prominent among these sources is Hollnagel et al (2006). Other sources include Leveson (1995), Reason (1997), Perrow (1999), and Billings(1997). Some principles were implied in case study reports, such as the 9/11 Commission report (2004) and the US-Canada Task Force report (2004) following the 2003 blackout.

These principles include very simple and well-known principles as physical redundancy and more sophisticated principles as loose coupling. Some of these principles are domain dependent, such as loose coupling, which is important in the power distribution domain as discussed by Perrow (1999). These principles will be the input to the state-transition analysis of Section 8 to determine the characteristics of a given system for a given threat.

In the resilience literature the term principle is used to describe both scientifically accepted principles and also heuristics, design rules determined from experience as described by Rechtin (1991). Jackson and Ferris (2013) showed that it is necessary to invoke these principles in combinations to enhance resilience. This concept is called defense in depth. Pariès (2011) illustrates how defense in depth was used to achieve resilience in the 2009 ditching of US Airways Flight 1549.

Uday and Marais (2015) apply the above principles to the design of a system-of-systems. Henry and Ramirez-Marquez (2016) describe the state of the U.S. East Coast infrastructure in resilience terms following the impact of Hurricane Sandy in 2012. Bodeau & Graubert (2011) propose a framework for understanding and addressing cyber-resilience. They propose a taxonomy comprised of four goals, eight objectives, and fourteen cyber-resilience practices. Many of these goals, objectives and practices can be applied to non-cyber resilience.

Discipline Management

Most enterprises, both military and commercial, include organizations generally known as Advanced Design. These organizations are responsible for defining the architecture of a system at the very highest level of the system architecture. This architecture reflects the resilience principles described in Section 2 and the processes associated with that system. In many domains, such as fire protection, no such organization will exist. However, the system architecture will still need to be defined by the highest level of management in that organization. In addition, some aspects of resilience will be defined by government imposed requirements as described in Section 5.

Discipline Relationships

Interactions

Resilience Discipline Outputs

The primary outputs of the resilience discipline are a subset of the principles described by Jackson and Ferris (2013) which have been determined to be appropriate for a given system, threat, and desired state of resilience as determined by the state-transition analysis described below. The processes requiring these outputs are the system design and system architecture processes.

Resilience Discipline Inputs

Inputs to the state-transition analysis described in Section 8 include (1) type of system of interest, (2) nature of threats to the system (earthquakes, terrorist threats, human error, etc.).

Dependencies

Discipline Standards

ASIS International

ASIS (2009) has published a standard pertaining to the resilience of organizational systems. Some of the principles described in this standard can also be found in the larger set of principles described by Jackson and Ferris (2013) for engineered systems in general containing hardware, software, and humans.

Personnel Considerations

None have been identified.

Metrics

Uday & Marais (2015) performed a survey of resilience metrics. Those identified include:

  • Time duration of failure
  • Time duration of recovery
  • Ratio of performance recovery to performance loss
  • A function of speed of recovery
  • Performance before and after the disruption and recovery actions
  • System importance measures

Jackson (2016) developed a metric to evaluate various systems in four domains: aviation, fire protection, rail, and power distribution, for the principles that were lacking in ten different case studies. The principles are from the set identified by Jackson and Ferris (2013) and are represented in the form of a histogram plotting principles against frequency of omission. The data in these gaps were taken from case studies in which the lack of principles was inferred from recommendations by domain experts in the various cases cited.

Brtis (2016) surveyed and evaluated a number of potential resilience metrics and identified the following: [Note: This reference is going through approval for public release and should be referenceable by the end of July 2016.]

  • Maximum outage period
  • Maximum brownout period
  • Maximum outage depth
  • Expected value of capability: the probability-weighted average of capability delivered
  • Threat resiliency (the time integrated ratio of the capability provided divided by the minimum needed capability)
  • Expected availability of required capability (the likelihood that for a given adverse environment the required capability level will be available)
  • Resilience levels (the ability to provide required capability in a hierarchy of increasingly difficult adversity)
  • Cost to the opponent
  • Cost-benefit to the opponent
  • Resource resiliency (the degradation of capability that occurs as successive contributing assets are lost)

Brtis found that multiple metrics may be required, depending on the situation. However, if one had to select a single most effective metric for reflecting the meaning of resilience, it would be the expected availability of the required capability. Expected availability of the required capability is the probability-weighted sum of the availability summed across the scenarios under consideration. In its most basic form, this metric can be represented mathematically as:

R=\sum _{{1}}^{{n}}\left({\frac  {^{{P_{{i}}}}}{T}}\int _{{0}}^{{T}}Cr(t)_{{i}},dt\right)

where,

R = Resilience of the required capability (Cr);

n = the number of exhaustive and mutually exclusive adversity scenarios within a context (n can equal 1);

Pi = the probability of adversity scenario I;

Cr(t)_i = timewise availability of the required capability during scenario I; --- 0 if below the required level --- 1 if at or above the required value (Where circumstances dictate this may take on a more complex, non-binary function of time.);

T = length of the time of interest.

Models

The state-transition model described by Jackson et al (2015) describes a system in its various states before, during, and after an encounter with a threat. The model identifies seven different states as the system passes from a nominal operational state to minimally acceptable functional state. In addition, the model identifies 28 transition paths from state to state. To accomplish each transition the designer must invoke one or more of the 34 principles or support principles described by Jackson and Ferris (2013). The designs implied by these principles can then be entered into a simulation to determine the total effectiveness of each design.

Tools

No tools dedicated to resilience have been identified.

References

Works Cited

9/11 Commission. (2004). 9/11 Commission Report.

ASIS International. (2009). Organizational Resilience: Security, Preparedness, and Continuity Management Systems--Requirements With Guidance for Use. Alexandria, VA, USA: ASIS International.

Billings, C. (1997). Aviation Automation: The Search for Human-Centered Approach. Mahwah, NJ: Lawrence Erlbaum Associates.

Bodeau, D. K, & Graubart, R. (2011). Cyber Resiliency Engineering Framework, MITRE Technical Report #110237, The MITRE Corporation.

Brtis, J. S. (2016). How to Think About Resilience, MITRE Technical Report, MITRE Corporation.

Hollnagel, E., D. Woods, and N. Leveson (eds). 2006. Resilience Engineering: Concepts and Precepts. Aldershot, UK: Ashgate Publishing Limited.

INCOSE (2015). Systems Engineering Handbook, a Guide for System Life Cycle Processes and Activities. San Diego, Wiley.

Jackson, S., & Ferris, T. (2013). Resilience Principles for Engineered Systems. Systems Engineering, 16(2), 152-164.

Jackson, S., Cook, S. C., & Ferris, T. (2015). A Generic State-Machine Model of System Resilience. Insight, 18.

Jackson, S. & Ferris, T. L. (2016). Proactive and Reactive Resilience: A Comparison of Perspectives.

Jackson, W. S. (2016). Evaluation of Resilience Principles for Engineered Systems. Unpublished PhD, University of South Australia, Adelaide, Australia.

Leveson, N. (1995). Safeware: System Safety and Computers. Reading, Massachusetts: Addison Wesley.

Madni, Azad,, & Jackson, S. (2009). Towards a conceptual framework for resilience engineering. Institute of Electrical and Electronics Engineers (IEEE) Systems Journal, 3(2), 181-191.

Pariès, J. (2011). Lessons from the Hudson. In E. Hollnagel, J. Pariès, D. D. Woods & J. Wreathhall (Eds.), Resilience Engineering in Practice: A Guidebook. Farnham, Surrey: Ashgate Publishing Limited.

Perrow, C. (1999). Normal Accidents: Living With High Risk Technologies. Princeton, NJ: Princeton University Press.

Reason, J. (1997). Managing the Risks of Organisational Accidents. Aldershot, UK: Ashgate Publishing Limited.

Rechtin, E. (1991). Systems Architecting: Creating and Building Complex Systems. Englewood Cliffs, NJ: CRC Press.

US-Canada Power System Outage Task Force. (2004). Final Report on the August 14, 2003 Blackout in the United States and Canada: Causes and Recommendations. Washington-Ottawa.

Uday, P., & Morais, K. (2015). Designing Resilient Systems-of-Systems: A Survey of Metrics, Methods, and Challenges. Systems Engineering, 18(5), 491-510.

Primary References

Hollnagel, E., Woods, D. D., & Leveson, N. (Eds.). (2006). Resilience Engineering: Concepts and Precepts. Aldershot, UK: Ashgate Publishing Limited.

Jackson, S., & Ferris, T. (2013). Resilience Principles for Engineered Systems. Systems Engineering, 16(2), 152-164.

Jackson, S., Cook, S. C., & Ferris, T. (2015). Towards a Method to Describe Resilience to Assist in System Specification. Paper presented at the INCOSE Systems 2015. 

Jackson, S.: Principles for Resilient Design - A Guide for Understanding and Implementation. Available at https://www.irgc.org/irgc-resource-guide-on-resilience Accessed 18th August 2016

Madni, Azad,, & Jackson, S. (2009). Towards a conceptual framework for resilience engineering. Institute of Electrical and Electronics Engineers (IEEE) Systems Journal, 3(2), 181-191.

Additional References

Henry, D., & Ramirez-Marquez, E. (2016). On the Impacts of Power Outages during Hurrican Sandy -- A Resilience Based Analysis. Systems Engineering, 19(1), 59-75.

Jackson, W. S. (2016). Evaluation of Resilience Principles for Engineered Systems. Unpublished PhD, University of South Australia, Adelaide, Australia.


< Previous Article | Parent Article | Next Article >
SEBoK v. 1.7 released 27 October 2016

SEBoK Discussion

Please provide your comments and feedback on the SEBoK below. You will need to log in to DISQUS using an existing account (e.g. Yahoo, Google, Facebook, Twitter, etc.) or create a DISQUS account. Simply type your comment in the text field below and DISQUS will guide you through the login or registration steps. Feedback will be archived and used for future updates to the SEBoK. If you provided a comment that is no longer listed, that comment has been adjudicated. You can view adjudication for comments submitted prior to SEBoK v. 1.0 at SEBoK Review and Adjudication. Later comments are addressed and changes are summarized in the Letter from the Editor and Acknowledgements and Release History.

If you would like to provide edits on this article, recommend new content, or make comments on the SEBoK as a whole, please see the SEBoK Sandbox.

blog comments powered by Disqus