How big are the holes?

In issue 33 (Oct-Nov 2011) of Digital Energy Journal, consultant editor David Bamford wrote an interesting article on the ‘Swiss cheese’ model – particularly referencing its use within BP’s accident investigation report into Deepwater Horizon, published 8 September 2010.

Briefly, the Swiss cheese model is a way of envisaging the various barriers within a system (technological, people and systems) that should prevent a hazard (such as a dangerous substance) from becoming an accident (e.g. leaking or exploding).  Each barrier is represented as a slice of Swiss cheese, and the holes are imperfections in that barrier (e.g. the failure of equipment or people).  If one layer fails, the hazard will usually be caught by the next, but if the holes in each layer align, this allows a ‘trajectory of accident opportunity’ to pass through them and lead to an accident.

Mr. Bamford considers whether smaller holes (suggesting lower probability failures, such as the failure of very high reliability equipment) are less likely to align than larger holes, which naturally have a greater area of overlap – i.e. more chance of failing simultaneously.  Holes are unavoidable – we can’t guarantee a barrier to be 100% effective – but we can make the holes as small as possible.  In order to understand how well our systems can withstand risk, do we therefore need to ask “how big are our holes”?
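As a rough back-of-the-envelope illustration of the point (not taken from Mr. Bamford’s article or the BP report): if each barrier is treated as an independent ‘hole’ with its own probability of failing on demand, the chance of all the holes lining up is simply the product of those probabilities, so shrinking any one hole shrinks the overall accident likelihood in proportion.  The barrier names and figures below are invented purely for illustration.

```python
# Illustrative only: treats each barrier as an independent 'hole' with a
# nominal probability of failure on demand. Real barriers are rarely
# independent (a point returned to later in this article), so this gives an
# intuition, not a method from the Deepwater Horizon report.
from math import prod

barrier_failure_probs = {
    "cement job":        1 / 10,    # hypothetical figures for illustration
    "pressure test":     1 / 20,
    "well monitoring":   1 / 50,
    "blowout preventer": 1 / 100,
}

p_all_holes_align = prod(barrier_failure_probs.values())
print(f"P(all holes align) = {p_all_holes_align:.1e}")   # 1.0e-06

# Shrinking one hole (cement job from 1 in 10 to 1 in 100) scales the result:
barrier_failure_probs["cement job"] = 1 / 100
print(f"With a smaller cement-job hole: {prod(barrier_failure_probs.values()):.1e}")   # 1.0e-07
```

The independence assumption is doing a lot of the work here, which is exactly the issue the rest of this article turns on.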

The Deepwater Horizon report showed that the accident happened due to the failure of 7 barriers, plus one extra that failed to mitigate the consequences: the well cement job; mechanical barriers; the pressure integrity testing; well monitoring; well control response; hydrocarbon surface containment; fire and gas system; and blowout preventer operation.

Mr. Bamford asks:  “It has been reported that cement jobs… fail from time to time.  How often? 1 time in 5; 1 time in 10; 1 time in 100?”  Can we accurately quantify our ‘hole sizes’ as probabilities of failure?

For a while now some industries have been trying to do just that for human factors issues – notably in the nuclear sector.  Human reliability assessment (HRA) techniques, such as THERP and HEART, are attempts at quantifying the probability of human failure for different tasks and under different circumstances (error producing conditions) – e.g. when distracted, tired, in cramped conditions, etc.

Typically, the starting point of using HRA is to determine the base failure rate (average failure under ideal circumstances) for a task – for a few industries, databases of tasks and their failure rates have been accumulated.  We should then multiply that base value by certain amounts depending on which error producing conditions (EPCs) are present – e.g. ×3 if the task is done by an inexperienced person, or ×1.2 if the worker is low on morale.  The goal is then to remove these EPCs and bring the probability down as low as possible, whilst balancing this against the financial or practical costs of doing so.
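A minimal sketch of that adjustment is shown below.  The base rate and multipliers are invented for illustration, and the calculation follows the simple multiplicative description above – published techniques such as HEART additionally weight each EPC by an assessed proportion of effect before applying it.

```python
# Sketch of the multiplicative adjustment described above. The base failure
# rate and the EPC multipliers are invented for illustration; real HRA
# techniques such as HEART also weight each multiplier by an assessed
# 'proportion of effect' before applying it.

base_failure_rate = 0.003            # hypothetical base rate under ideal conditions

error_producing_conditions = {
    "inexperienced operator": 3.0,   # the x3 example above
    "low morale":             1.2,   # the x1.2 example above
}

adjusted_rate = base_failure_rate
for condition, multiplier in error_producing_conditions.items():
    adjusted_rate *= multiplier

print(f"Base rate:     {base_failure_rate:.4f}")
print(f"Adjusted rate: {adjusted_rate:.4f}")   # 0.003 x 3 x 1.2 = 0.0108
```

The arithmetic is trivial; the hard part, as the next paragraph asks, is whether the numbers going into it can be trusted.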

The difficulty arises in the validity of the data in question.  Can we really be sure that the probability of failure is X?  Can we use failure rates that have been ascertained within one industry (e.g. nuclear) within another (e.g. oil and gas)?  Can we be sure of the effect that EPCs will have, and do several act independently or synergistically?

We should also consider whether HRA techniques have an advantage over qualitative techniques.  If the objective is to be able to justify the measures put in place and balance these against costs, will a qualitative methodology (such as that found in the EI Guidance on human factors safety critical task analysis), working to ALARP (as low as reasonably practicable), provide a sufficient justification?

These are difficult questions to answer – although the EI is developing Guidance on quantified human reliability analysis (QHRA), scheduled for publication by the end of Q2 2012, which should help to answer them.

Human behaviour is hard to quantify, but failure rates for equipment may appear more straightforward.  Indeed, understanding failure rates (often in terms of the mean time between failures) is an important part of high reliability engineering.  However, uncertainty can arise when we begin to ask questions such as: who designed the equipment; who manufactured it, installed it, maintained (or not maintained) it, and proof tested it to show that it works; and who has used (or misused) it?  The probability of equipment failure is likely to vary from company to company and site to site.
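To make the equipment side concrete: a common (and itself assumption-laden) way to turn a mean time between failures into a ‘hole size’ is to assume a constant failure rate, so the probability of at least one failure over an exposure time t is 1 − e^(−t/MTBF).  A short sketch, with invented figures:

```python
# Converts a mean time between failures (MTBF) into a probability of failure
# over a given exposure period, assuming a constant (exponential) failure
# rate. The MTBF figure is invented; in practice it depends on who designed,
# installed, maintained and proof tested this particular item of equipment.
from math import exp

mtbf_hours = 50_000        # hypothetical manufacturer figure
exposure_hours = 8_760     # one year of continuous service

p_failure = 1 - exp(-exposure_hours / mtbf_hours)
print(f"P(failure within a year) = {p_failure:.3f}")   # ~0.161
```

The neat number hides exactly the uncertainties listed above – a quoted MTBF says nothing about how this particular item was installed, maintained or proof tested on this particular site.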

Can we really think of the holes within the Swiss cheese model as being due to single failures of equipment (or even people)?  Few of the barriers discussed in the Deepwater Horizon report are purely mechanical.  For example, the report concluded that the cement job failed (allowing hydrocarbons to leak into the well), but this may have been caused, in part, by insufficient testing of the foam cement slurry before the cement barrier was constructed, and by insufficient risk assessment when designing the placement of the cement.  Similarly, whilst the blowout preventer failed, human and organisational factors contributed to that failure, including maintenance records not being accurately reported within the maintenance management system.

It can be argued that barriers are often complex interactions between people, the organisation and equipment.  Whilst it is possible that a hole will appear due to the failure of one piece of equipment, the causes of that failure may lie in human and organisational factors, making the size of the hole (its probability of happening or of contributing to an accident) hard to predict accurately.  Attempting to understand the size of a hole based on the failure of people or equipment working in isolation could result in the implementation of a system that does not reflect the reality of operations.
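One way to see why isolated-failure figures can mislead is to add even a modest common cause – say, the same organisational weakness in maintenance affecting two barriers at once.  The sketch below uses a simple beta-factor style split (all figures invented) to show how quickly it overtakes the naive independent product:

```python
# Illustration of how dependence between barriers inflates the joint failure
# probability. 'beta' is the fraction of each barrier's failures attributed
# to a shared (common) cause, e.g. the same organisational weakness in
# maintenance. All numbers are invented for illustration.

p_barrier = 0.01     # nominal failure probability of each of two barriers
beta = 0.1           # 10% of failures assumed to stem from a common cause

p_independent = p_barrier * p_barrier                          # 1.0e-04
p_with_dependence = beta * p_barrier + ((1 - beta) * p_barrier) ** 2

print(f"Assuming independence: {p_independent:.1e}")           # 1.0e-04
print(f"With a common cause:   {p_with_dependence:.1e}")       # ~1.1e-03, roughly ten times larger
```

In other words, once barriers share people, procedures or an organisation, the size of the hole is governed less by the individual failure rates and more by the strength of the coupling between them.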
