It has been fascinating to watch HBO’s Chernobyl, which depicts the worst nuclear disaster in human history: an event that released roughly 400 times more radioactive material than Hiroshima’s atomic bomb.
I realized over the course of the series that this was a catastrophic failure destined to occur. There was a slow drift into failure from the very beginning, accelerated by the people executing the experiment in accordance with their rules-based culture. But why? And does failure only matter once it happens?
It’s a mystery how anything ever works
The pursuit of success in today’s dynamic, constantly changing, and complex business environment, with limited resources and many conflicting goals, can eventually cause monumental breakdowns: a domino effect of latent failures.
We see daily debacles in news feeds and newspapers, highlighting how the systems we design – with positive intentions – can create unintended consequences and harmful effects on society.
As we struggle to cope in so many contexts, from Facebook hacks to Boeing 737 Max crashes, from arguments between autonomous algorithmic bots to legacy top-down management structures, information flows, and decision-making, it’s almost unbelievable that anything ever works as intended.
In order to hold someone accountable for problems, we look for the root cause: the one broken piece or the one individual at fault. We continue to analyze complex system breakdowns linearly, componentially, and reductively. In a nutshell, that approach is inhumane.
In society, complexity has grown faster than our understanding of how complex systems work. According to Sidney Dekker, a human factors and safety author, “our theories have been overtaken by our technologies.”
Identifying and modeling failure drift
It was Jens Rasmussen who identified this failure-mode phenomenon, referred to as “drift to danger,” or “systemic migration of organizational behavior toward accident under the pressure of cost-effectiveness in an aggressive, competitive environment.”
Major initiatives are subject to multiple pressures, and we have to navigate within the space of possibilities formed by economic, workload, and safety constraints to achieve the desired outcomes.
Our capitalist landscape encourages decision-makers to prioritize short-term incentives, financial success, and survival over long-term criteria such as safety, security, and scalability. As innovation cycles compound, customer expectations increase exponentially. As a result, workers must become ever more productive to stay ahead: “cheaper, faster, better.” These pressures push teams to the limits of acceptable (safe) performance. An accident occurs when the activity of the system crosses the boundary of acceptable safety.
Rasmussen’s model maps this complexity so that we can navigate it toward the properties we want to optimize. For example, if we want to optimize for safety, we need to know where our safety boundary lies. Optimizing for safety is the primary, explicit outcome of chaos engineering.
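To make the idea of drift more concrete, here is a minimal Python sketch of an operating point being pushed around by economic, workload, and safety pressures. It is only an illustration of the dynamic, not Rasmussen’s model itself: the boundaries, pressure sizes, and weekly cadence are all made-up numbers.

```python
# A toy sketch of "drift to danger": each week, economic and workload
# pressure nudge the system toward the safety boundary, while occasional
# safety investment pushes it back. All numbers are illustrative.

import random

SAFETY_BOUNDARY = 0.0    # crossing this means an accident
MARGIN_BOUNDARY = 0.2    # warning buffer before the safety boundary
STARTING_MARGIN = 1.0    # how far from the safety boundary we begin


def simulate_drift(weeks: int = 52, seed: int = 7) -> None:
    """Simulate an operating point drifting toward the safety boundary."""
    random.seed(seed)
    margin = STARTING_MARGIN
    for week in range(1, weeks + 1):
        economic_pressure = random.uniform(0.00, 0.04)   # "cheaper, faster"
        workload_pressure = random.uniform(0.00, 0.03)   # "do more with less"
        safety_investment = 0.05 if random.random() < 0.2 else 0.0
        margin += safety_investment - economic_pressure - workload_pressure

        if margin <= SAFETY_BOUNDARY:
            print(f"week {week}: safety boundary crossed -- accident")
            return
        if margin <= MARGIN_BOUNDARY:
            print(f"week {week}: inside the margin of error -- time to push back")

    print(f"after {weeks} weeks, remaining margin: {margin:.2f}")


if __name__ == "__main__":
    simulate_drift()
```

No single week in this sketch looks reckless, yet the expected net movement is toward the boundary. That is the “systemic migration” Rasmussen describes.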
There Is No Spoiler Alert (Or There Is)
As the Chernobyl reactor was being shut down, the crew performed a low-power test to determine whether the residual spin of the turbine could generate enough electricity to keep the cooling system running.
An experiment of this type should have been planned in detail and conservatively. This was not the case: no criteria for terminating the experiment were established ahead of time. It was a poorly designed “see what happens” experiment, and multiple previous attempts had already failed.
Due to the lack of design and engineering assistance, the crew proceeded without safety precautions and without properly communicating the procedure to safety personnel. As the top-performing reactor site in the Soviet Union, Chernobyl had won awards for its performance.
The experiment went awry.
A number of safety systems were shut down in order to prevent the reactor from shutting down completely. When the one remaining alarm sounded, the crew ignored it for 20 seconds.
Beyond the questionable behavior of the team, there was also a deeper, latent failure. An engineering flaw caused the Chernobyl reactor to overheat during the test, a flaw hidden from the plant’s scientists by the government’s policy of keeping state secrets secret. The graphite components chosen for the reactor design were believed to offer comparable safety at a lower cost. Did they? The answer is, of course, no.
The reactor’s rated output during normal operation was 3,200 MWt (megawatts thermal), but during the power surge the output spiked to over 320,000 MWt. The resulting steam explosions and fire ruptured the reactor housing, destroyed the reactor building, and released large amounts of radiation.
An official explanation of the Chernobyl accident was published in August 1986, within three months of the accident. According to the report, the catastrophe was caused by gross violations of operating rules and regulations at the power plant. The operators’ errors were attributed to a lack of knowledge and experience and a poor understanding of the physics and engineering of nuclear reactors. Once the root cause and the individuals’ errors had been identified, the case was closed.
Only after the International Atomic Energy Agency published its revised analysis in 1993 did serious debate over the reactor’s design take place.
Because the primary data covering the catastrophe, as recorded by the instruments and sensors, was not fully published in official sources, there are differing viewpoints and debates about the accident’s causes.
Several low-level details about how the plant was designed were also kept from operators due to Soviet secrecy and censorship.
The four Chernobyl reactors were very different from most commercial designs: they used a unique combination of a graphite moderator and water coolant rather than the pressurized water design common elsewhere. That combination makes the RBMK design very unstable at low power levels and prone to sudden, dangerous surges in energy production. The operating crews were unaware of this counterintuitive behavior.
Moreover, unlike most nuclear power plants, Chernobyl had no fortified containment structure. Without this protection, radioactive material escaped directly into the environment.
Factors that contributed to the failure
Performance is driven by KPIs
In the case of Chernobyl, the crew was required to complete the test to confirm that the plant could operate safely. Their focus narrowed to what was being measured: completing the test rather than running the plant safely. No boundaries or success and failure criteria were defined before the experiment, and the information the crew had was not sufficient to set them up for success. They ignored the other anomalies flagged by the system and pressed ahead to finish the experiment on time.
The flow of information
You can only make good decisions if you have good information and a good decision-making process. Operators followed a bad process with missing information.
There are many times when policy makers are not the ones doing the actual work, resulting in breakdowns between work-as-expected and work-as-done.
When employees are in a double bind, they often choose one value (such as timeliness or efficiency) over another (such as verification).
Behavior is guided by values and influenced by resources
If the company tells you what behaviors lead to success, but you believe they lead to failure, what would you do? If following the rules went against your values, what would you do?
In some cases, KPIs are set in ways that stress behaviors and create unsafe systems. At Chernobyl, when the test was delayed, a night-shift crew ultimately had to replace the team that had been fully prepared to run it.
This put the chief engineer under further pressure to complete the test (and avoid further delay), ultimately clouding his judgment about the inherent risks associated with using an alternate crew.
Choosing success over failure (how to drift into success)
Variability in performance may lead to drift in your own situation. It is possible to drift toward success rather than failure by creating experiences and social structures that help people learn to handle uncertainty and navigate toward desired outcomes.
Here are some principles and practices to consider.
- Share high-quality information frequently and liberally
- Where are decisions made in your organization? Can you move authority to where information is best, context is most current, and employees are closest to the situation?
- Keep an eye out for KPIs that conflict with company values, and compare them with the behaviors they may inspire
- Clearly communicate economic, workload, and safety constraints. These constraints are implicit in Rasmussen’s model and exist whether you acknowledge them or not. In your context, it is important to ensure that everyone doing the work understands all three boundaries
- Ensure you have fast feedback mechanisms in place to let you know when you reach your predefined risk and experiment boundaries (see the sketch after this list)
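As a sketch of the last two points, here is a small, hypothetical example of predefined experiment boundaries checked by a fast feedback loop. The metric names and thresholds are invented for illustration and are not taken from any specific tool.

```python
# A minimal sketch of predefined experiment boundaries (abort criteria),
# agreed on *before* a risky experiment starts and checked continuously
# as a fast feedback loop. All metric names and thresholds are hypothetical.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AbortCriterion:
    name: str
    exceeded: Callable[[Dict[str, float]], bool]  # True means "stop now"


# Hypothetical boundaries, defined ahead of time.
ABORT_CRITERIA: List[AbortCriterion] = [
    AbortCriterion("error rate above 2%", lambda m: m["error_rate"] > 0.02),
    AbortCriterion("p99 latency above 800 ms", lambda m: m["p99_latency_ms"] > 800),
    AbortCriterion("operator requested stop", lambda m: m["manual_stop"] > 0),
]


def should_abort(metrics: Dict[str, float]) -> List[str]:
    """Return the names of any predefined boundaries the experiment has crossed."""
    return [c.name for c in ABORT_CRITERIA if c.exceeded(metrics)]


if __name__ == "__main__":
    # Example observation pulled from a (hypothetical) monitoring system.
    observed = {"error_rate": 0.035, "p99_latency_ms": 420, "manual_stop": 0}
    crossed = should_abort(observed)
    if crossed:
        print("Abort the experiment:", ", ".join(crossed))
    else:
        print("Within predefined boundaries; continue.")
```

The point is less the code than the discipline: the boundaries are written down before anyone touches the system, which is exactly what the Chernobyl crew never did.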
Conclusion
In complex systems, accidents are caused by emergent properties, so explaining them by reasoning backward from the outcome will never provide a complete explanation.
Failure is rarely driven by one single act or decision, but rather by a series of tiny events and decisions over time that eventually reveal your system’s latent failures. Because of this, we must always keep in mind that our work progresses through new information, understanding, and knowledge, not just through time.
Do you have systems in place so that you can safely control the problem domain (not your people)?
Despite being the world’s worst nuclear accident, Chernobyl led to major changes in safety culture and industry cooperation, particularly between East and West before the end of the Soviet Union. During his tenure as Soviet President, Gorbachev said that the Chernobyl accident was perhaps even more important than his liberal reform program, Perestroika.