Why Most Root Cause Analyses Fail: Engineering Mistakes That Hide the Real Cause

What is root cause analysis in engineering?

Root cause analysis (RCA) in engineering is a structured investigation process used to identify the underlying mechanism behind an equipment or system failure. Its purpose is not only to explain how a component failed, but to determine why the failure occurred, so that recurrence can be prevented, reliability improved, and maintenance strategy informed by evidence.

A pump trips unexpectedly during a night shift. Operations restores it by morning. A week later, the investigation team submits its RCA report: bearing failure due to overload. Recommendation: replace the bearing.

Six months later, the same pump fails again.

This scenario plays out in plants across every industry: oil and gas, petrochemical, power generation, and manufacturing. RCA reports are written. Recommendations are issued. But the failures keep coming back.

Equipment reliability stagnates. Shutdown costs accumulate. Safety risks persist.

The uncomfortable truth is this: most RCA reports identify what happened, not why it truly happened. They stop at the visible failure instead of tracing the underlying mechanism. That distinction is the difference between solving a problem permanently and scheduling the next breakdown.

Most RCAs fail not because engineers lack expertise, but because investigations stop too early, rely on incomplete data, or focus on the wrong problem entirely.

This article explains the nine most common engineering mistakes that cause root cause analysis to fail, how to recognise them in practice, and what engineering teams should do differently to identify the true failure mechanism.

What Root Cause Analysis Is Supposed to Achieve

Root Cause Analysis is a structured engineering investigation method designed to identify the true, underlying cause of a failure. When performed correctly, RCA should:

  • Identify the underlying failure mechanism driving equipment degradation
  • Prevent recurrence by addressing the system-level cause, not just the symptom
  • Improve long-term system reliability and reduce unplanned downtime
  • Inform maintenance strategy, inspection intervals, and design improvements
  • Provide documented engineering evidence for asset management decisions

Several structured methods support this process: 5 Whys, Fault Tree Analysis (FTA), Fishbone (Ishikawa) Diagram, Event and Causal Factor Analysis, and Failure Mode and Effects Analysis (FMEA).

Each method has its strengths. But none will produce a valid result if the fundamental investigation discipline is flawed. The tool is only as good as the investigation behind it.
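As an illustration of how one of these methods turns a diagram into numbers, the sketch below evaluates a small fault tree by combining AND and OR gates. It assumes the basic events are statistically independent; the tree, its probabilities, and the class and function names are hypothetical, loosely based on the pump scenario above, and are not taken from any particular FTA tool.

```python
from dataclasses import dataclass, field
from math import prod
from typing import List, Union


@dataclass
class BasicEvent:
    name: str
    probability: float  # hypothetical probability over the period of interest


@dataclass
class Gate:
    name: str
    kind: str  # "AND" or "OR"
    inputs: List["Node"] = field(default_factory=list)


Node = Union[BasicEvent, Gate]


def evaluate(node: Node) -> float:
    """Probability of a node, assuming all basic events are independent."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [evaluate(child) for child in node.inputs]
    if node.kind == "AND":
        # Every input must occur: multiply the probabilities.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one input occurs: one minus the chance that none occur.
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"Unknown gate type: {node.kind}")


# Hypothetical tree: the bearing seizes if lubrication is lost
# (contaminant ingress AND filter bypass open) OR the bearing is overloaded.
lube_loss = Gate("Lubrication failure", "AND", [
    BasicEvent("Contaminant ingress", 0.10),
    BasicEvent("Filter bypass open", 0.20),
])
top = Gate("Bearing seizure", "OR", [lube_loss, BasicEvent("Overload", 0.01)])

print(f"P(top event) = {evaluate(top):.4f}")  # 0.0298
```

The numbers matter less than the structure: the tree makes explicit which combinations of conditions must exist for the top event, and each basic event becomes something the investigation can confirm or rule out with evidence.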

Why Root Cause Analysis Often Fails in Industrial Investigations

The core problem in most failed RCAs is a fundamental confusion between the failure mode and the root cause. These are not the same thing. The failure mode describes how a component failed. The root cause explains why that failure occurred. Consider the following examples:

| Failure Event | Incorrect RCA Finding | True Root Cause |
|---|---|---|
| Pump failure | Bearing failure | Lubrication contamination from ingress |
| Vessel crack | Material defect | Thermal fatigue from process cycling |
| Pipe leak | Corrosion | Undocumented process chemistry change |
| Compressor trip | High vibration | Impeller imbalance from fouling |

In each case, the incorrect finding describes what broke, while the true root cause explains the mechanism that drove the failure. Addressing the failure mode alone (replacing the bearing, patching the weld, repainting the pipe) only resets the countdown to the next failure.

Warning Signs in RCA Reports

Before diving into the nine mistakes, there is a fast way to identify a weak RCA. If an investigation report concludes with any of the following phrases as its final finding, it almost certainly has not reached the true root cause:

"Bearing failure"

"Material defect"

"Corrosion"

"Operator error"

"Component overload"

"Improper maintenance"

These phrases describe observable conditions: starting points for investigation, not conclusions. Each one prompts the same engineering question: why?

Why did the bearing fail? Why did the material defect go undetected? Why did corrosion progress to failure? Why was the maintenance improper?

A root cause analysis that ends at one of these phrases has stopped halfway. The investigation must continue until it reaches a mechanism that, if corrected, would prevent the failure from recurring.
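The warning-sign test above can be applied mechanically as a first screen on a report's final finding. The sketch below is a hypothetical helper, not a standard tool: it only flags a finding that is nothing more than one of the listed phrases, and it cannot judge whether a deeper finding is actually correct.

```python
# The warning-sign phrases listed above: observable conditions, not root causes.
SYMPTOM_PHRASES = {
    "bearing failure",
    "material defect",
    "corrosion",
    "operator error",
    "component overload",
    "improper maintenance",
}


def is_symptom_finding(final_finding: str) -> bool:
    """True if the final finding is nothing more than one of the symptom phrases."""
    normalised = final_finding.strip().rstrip(".").strip().lower()
    return normalised in SYMPTOM_PHRASES


for finding in ["Bearing failure", "Corrosion.", "Lubricant contamination from ingress"]:
    verdict = "keep asking why" if is_symptom_finding(finding) else "candidate mechanism"
    print(f"{finding!r}: {verdict}")
```

The exact match is deliberately narrow: a finding such as "corrosion due to an undocumented chemistry change" passes this screen, as it should, but the screen says nothing about whether that deeper mechanism is supported by evidence.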

9 Engineering Mistakes That Cause Root Cause Analysis to Fail

The following nine mistakes are the most common reasons engineering RCAs fail to identify the true failure mechanism. Each represents a specific breakdown in investigation discipline that leads to recurring failures and wasted resources.

Mistake 1 - Stopping the Root Cause Analysis Too Early

The most pervasive mistake is concluding the RCA once the failed component has been identified.

The team finds a cracked weld, a seized bearing, or a fractured shaft, and the investigation stops. The failed part becomes the root cause in the report.

But the failed component is almost always a victim, not a cause. The investigation needs to continue:

  • Pump failed > Bearing seized > Lubrication failed > Filtration system undersized
  • Filtration undersized > Design specification did not account for process change two years prior

True root causes rarely sit at the first layer of observation. Engineering investigations must trace the causal chain until they reach a mechanism that, if corrected, would prevent the failure from recurring.
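One way to keep an investigation from stopping at the first layer is to record the causal chain explicitly and check where it ends. The sketch below encodes the pump chain above; the `level` labels ("component", "mechanism", "system") are an assumed classification for illustration, not a standard taxonomy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cause:
    description: str
    level: str  # assumed labels: "component", "mechanism", or "system"


def deepest_cause(chain: List[Cause]) -> Cause:
    """Each entry answers 'why?' for the one before it, so the last entry is the deepest cause found."""
    if not chain:
        raise ValueError("Empty causal chain")
    return chain[-1]


def stopped_too_early(chain: List[Cause]) -> bool:
    """Flag chains whose deepest cause is still a component-level observation."""
    return deepest_cause(chain).level == "component"


pump_chain = [
    Cause("Pump failed", "component"),
    Cause("Bearing seized", "component"),
    Cause("Lubrication failed", "mechanism"),
    Cause("Filtration system undersized", "system"),
    Cause("Design specification did not account for process change two years prior", "system"),
]

print(stopped_too_early(pump_chain[:2]))  # True: a report ending at "Bearing seized"
print(stopped_too_early(pump_chain))      # False
print(deepest_cause(pump_chain).description)
```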

Mistake 2 - Confusing Failure Mode With Root Cause in RCA

This distinction is fundamental and is frequently misunderstood even by experienced engineers.

| Term | Definition | Example |
|---|---|---|
| Failure Mode | How the component failed | Fatigue crack propagation |
| Root Cause | Why the failure occurred | Cyclic vibration from shaft misalignment |
| Contributing Factor | Condition that worsened the failure | Elevated temperature accelerating crack growth |

An RCA that reports ‘fatigue cracking’ as the root cause has only described the failure mode. The engineering question that matters: what was the source of the cyclic loading that drove the fatigue? Answering that question leads to the actual root cause, and the corrective action that will prevent recurrence.

Mistake 3 - Ignoring System-Level Interactions During Failure Investigation

Engineering failures rarely occur in isolation. A component that fails is almost always responding to its operating environment: the process conditions imposed on it, the mechanical loads applied, its maintenance history, and the control system behavior.

Investigations that focus exclusively on the failed component miss the wider system context. Cracking in a heat exchanger may appear to be a materials issue. But a system-level investigation might reveal uneven flow distribution creating localized thermal stress concentrations that a properly loaded exchanger would never experience.

Effective RCA requires investigators to map the interactions between process, mechanical, instrumentation, and maintenance systems, and to examine how those interactions contributed to the failure mechanism.

Mistake 4 - Performing Root Cause Analysis Without Operating Data

An RCA without data is engineering speculation dressed in a report format. To identify the true failure mechanism, investigators need access to the operating history of the failed system. In many cases, that data is not collected, not retained, or not accessible when the investigation begins.

The most commonly missing data categories include:

  • Operating pressure and temperature history in the period leading up to failure
  • Vibration monitoring logs and trending data
  • Corrosion monitoring and thickness measurement records
  • Process chemistry changes and upset event logs
  • Maintenance records, inspection findings, and previous repair history

When this data is unavailable, investigators are forced to make assumptions, and those assumptions become the conclusions in the RCA report. Data collection systems must be in place before failures occur, not established in response to them.
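Once historian data has been extracted, a first screening step is to isolate the period before the failure and list readings outside the design envelope. The sketch below does that for a single pressure tag; the (timestamp, value) record format, the 30-day window, the 45 bar limit, and all readings are hypothetical, and real historian exports will need their own parsing.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Reading = Tuple[datetime, float]  # (timestamp, pressure in bar), an assumed record format


def readings_before_failure(readings: List[Reading], failure_time: datetime,
                            window: timedelta) -> List[Reading]:
    """Return readings in the interval [failure_time - window, failure_time]."""
    start = failure_time - window
    return [(t, v) for t, v in readings if start <= t <= failure_time]


def excursions(readings: List[Reading], design_limit: float) -> List[Reading]:
    """Return readings that exceed the design limit."""
    return [(t, v) for t, v in readings if v > design_limit]


failure = datetime(2024, 3, 15, 2, 30)      # hypothetical failure time
history = [                                 # hypothetical historian extract
    (datetime(2024, 1, 1), 41.0),           # outside the 30-day window
    (datetime(2024, 2, 20), 42.5),
    (datetime(2024, 3, 10), 47.8),          # above the assumed 45 bar limit
    (datetime(2024, 3, 15, 1, 0), 44.0),
]

window_data = readings_before_failure(history, failure, timedelta(days=30))
for t, v in excursions(window_data, design_limit=45.0):
    print(f"{t:%Y-%m-%d %H:%M}  {v} bar exceeds limit")
```

An excursion found this way is a lead to investigate, not a conclusion: it still has to be linked to the observed damage mechanism.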

Mistake 5 - Confirmation Bias in Investigation Teams

Confirmation bias is particularly dangerous in engineering failure investigations. It occurs when the team arrives with a pre-formed hypothesis and gathers evidence to confirm it rather than to test it.

In engineering, this often manifests as a rapid attribution to ‘material defect’ or ‘operator error’, explanations that remove design and system responsibility from the failure narrative.

The correct approach is to treat every investigation as hypothesis testing. Begin with multiple candidate causes. Design evidence collection to distinguish between them. Do not close the investigation until alternative causes have been explicitly evaluated and ruled out with evidence.
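The hypothesis-testing discipline can be made explicit with a simple evidence matrix: each candidate cause predicts what the investigation should observe, and any cause contradicted by a verified observation is ruled out. The hypotheses, predictions, observations, and key names below are hypothetical; in practice each prediction should come from engineering knowledge of the damage mechanism.

```python
from typing import Dict, List

# Each candidate cause maps an observation to the value expected if that cause were true.
hypotheses: Dict[str, Dict[str, bool]] = {
    "Material defect": {"beach_marks_on_fracture": False, "inclusion_at_origin": True},
    "Cyclic fatigue": {"beach_marks_on_fracture": True, "pressure_cycling_in_history": True},
    "Single overload": {"beach_marks_on_fracture": False, "pressure_cycling_in_history": False},
}

# Verified findings from the investigation (hypothetical).
observed: Dict[str, bool] = {
    "beach_marks_on_fracture": True,
    "inclusion_at_origin": False,
    "pressure_cycling_in_history": True,
}


def surviving_hypotheses(hypotheses: Dict[str, Dict[str, bool]],
                         observed: Dict[str, bool]) -> List[str]:
    """Keep only causes whose predictions agree with every observation actually made."""
    survivors = []
    for cause, predictions in hypotheses.items():
        # Predictions about observations not yet collected neither confirm nor eliminate a cause.
        contradicted = [key for key, expected in predictions.items()
                        if key in observed and observed[key] != expected]
        if not contradicted:
            survivors.append(cause)
    return survivors


print(surviving_hypotheses(hypotheses, observed))  # ['Cyclic fatigue']
```

A surviving hypothesis is only "not yet contradicted"; the next step is to collect the evidence that could still disprove it.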

Mistake 6 - Not Involving the Right Technical Experts

Many RCA teams are assembled from the operations and maintenance personnel closest to the failure event. This makes practical sense, but for complex failures, first-hand familiarity is not a substitute for specialist technical expertise. Depending on the failure type, a complete investigation may require:

  • Metallurgists and materials scientists for fracture surface analysis and materials characterisation
  • Corrosion engineers for electrochemical damage mechanism identification
  • Structural and mechanical engineers for stress and fatigue analysis
  • FEA specialists for computational stress evaluation
  • Process engineers for operating envelope assessment and chemistry review

A fracture surface that looks like overload to a maintenance engineer may show clear fatigue striations and crack initiation features to a metallurgist: features that completely change the failure narrative and the corrective actions required.

Mistake 7 - Performing Root Cause Analysis Without Engineering Calculations

There is a significant difference between an investigation and a discussion.

Many RCA processes consist primarily of meetings, interviews, and diagram exercises, such as 5 Whys sessions and fishbone workshops, that produce a narrative conclusion without any supporting engineering calculation or physical analysis.

For serious equipment failures, this is insufficient. Depending on the failure type, a technically credible RCA may require:

  • Stress analysis to evaluate whether operating loads exceeded design margins
  • Fracture mechanics assessment to determine crack initiation and propagation rates
  • Fatigue analysis to quantify the effect of cyclic loading history
  • Corrosion rate evaluation to assess damage mechanism severity
  • Finite Element Analysis to validate stress concentration hypotheses
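As an example of the arithmetic behind corrosion rate evaluation, the sketch below uses a common simplified linear approach: an average rate from two wall-thickness readings, and a remaining life to a minimum required thickness. All thicknesses, the interval, and the minimum thickness are hypothetical, and a linear rate assumes the damage mechanism was steady over the interval.

```python
def corrosion_rate(t_previous_mm: float, t_current_mm: float, years: float) -> float:
    """Average metal loss rate in mm/year between two thickness readings."""
    if years <= 0:
        raise ValueError("Interval must be positive")
    return (t_previous_mm - t_current_mm) / years


def remaining_life(t_current_mm: float, t_required_mm: float, rate_mm_per_year: float) -> float:
    """Years until the wall reaches the minimum required thickness at the current rate."""
    if rate_mm_per_year <= 0:
        return float("inf")  # no measurable loss over the interval
    return (t_current_mm - t_required_mm) / rate_mm_per_year


# Hypothetical readings: 12.0 mm four years ago, 10.8 mm now, 9.0 mm minimum required.
rate = corrosion_rate(t_previous_mm=12.0, t_current_mm=10.8, years=4.0)
life = remaining_life(t_current_mm=10.8, t_required_mm=9.0, rate_mm_per_year=rate)
print(f"rate = {rate:.2f} mm/y, remaining life = {life:.1f} y")  # rate = 0.30 mm/y, remaining life = 6.0 y
```

A rate that is much higher than historical readings suggest is itself a finding: it points to a change in the damage mechanism or process conditions that the RCA must explain.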

A pressure vessel nozzle cracking failure, for example, cannot be properly investigated through discussion alone. The investigation needs FEA to evaluate stress distribution under operating conditions.

Without the analysis, the corrective action is a guess.

Mistake 8 - Poor Documentation and Preservation of Failure Evidence

Failure evidence has a very short window of availability.

During the urgency of a plant shutdown, failed components are frequently removed, cleaned, and discarded before any forensic documentation has been completed. Fracture surfaces, which contain critical information about crack initiation, propagation direction, and loading mode, are contaminated or destroyed by handling.

Once this evidence is gone, it cannot be recovered, and the investigation that follows is permanently compromised. Proper evidence preservation requires a defined protocol, activated the moment a failure occurs:

  • Immediate photographic documentation before any disturbance
  • Careful removal and packaging of failed components
  • Preservation of fracture surfaces in an uncontaminated state
  • Secure storage until forensic engineering analysis is complete
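Part of a preservation protocol is a record of what was captured, when, and by whom, so that later analysis can show the evidence was not altered. The sketch below is one hypothetical way to log evidence items with a SHA-256 digest of each photo file; the field names, item IDs, and file paths are illustrative assumptions, not a prescribed format.

```python
import hashlib
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class EvidenceItem:
    item_id: str
    description: str
    photo_path: str
    sha256: str
    recorded_by: str
    recorded_at: datetime


def sha256_of(path: Path) -> str:
    """Hash the file in chunks so large photos are not loaded into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_evidence(item_id: str, description: str, photo: Path, recorded_by: str) -> EvidenceItem:
    """Create an immutable record with the file digest and a UTC timestamp."""
    return EvidenceItem(item_id, description, str(photo), sha256_of(photo),
                        recorded_by, datetime.now(timezone.utc))


def is_unchanged(item: EvidenceItem) -> bool:
    """Re-hash the file and compare with the digest recorded at capture time."""
    return sha256_of(Path(item.photo_path)) == item.sha256


# Usage with a throwaway file standing in for a site photograph (hypothetical names).
with tempfile.TemporaryDirectory() as tmp:
    photo = Path(tmp) / "fracture_face_01.jpg"
    photo.write_bytes(b"placeholder image bytes")
    item = log_evidence("EV-001", "Fracture face, as found", photo, "Investigator A")
    print(item.item_id, item.sha256[:12], is_unchanged(item))
```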

Mistake 9 - Attributing Root Cause to Human Error Instead of System Design

‘Operator error’ is one of the most common root cause conclusions in industrial RCA reports, and one of the least useful.

While human performance failures do contribute to some incidents, attributing a failure to human error without investigating the system conditions that made that error possible is an incomplete analysis.

When investigators dig deeper into ‘human error’ conclusions, they frequently find:

  • Control interface designs that made incorrect actions easy and correct actions difficult
  • Alarm systems in a state of chronic overload, causing critical warnings to be missed
  • Operating procedures that were unclear, outdated, or inconsistent with current plant configuration
  • Training systems that did not prepare operators for the scenario that preceded the failure

These are system design deficiencies, not individual failures. Blaming the operator without addressing the system conditions that enabled the error all but guarantees that the same performance failure will occur again, likely with the next person assigned to that role. Effective RCA asks not just ‘what did the person do?’ but ‘what system conditions made this outcome likely?’
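A finding such as "alarm systems in chronic overload" can be checked against the alarm log rather than asserted. The sketch below finds the peak number of alarms in any 10-minute window using a sliding window over sorted timestamps; the log and the threshold of 10 alarms per 10 minutes are assumptions for illustration, and the site's own alarm-management criterion should be used instead.

```python
from datetime import datetime, timedelta
from typing import List


def peak_alarm_count(alarm_times: List[datetime],
                     window: timedelta = timedelta(minutes=10)) -> int:
    """Maximum number of alarms falling within any span of the given length."""
    times = sorted(alarm_times)
    peak = 0
    start = 0
    for end in range(len(times)):
        # Advance the window start until the span from start to end fits within `window`.
        while times[end] - times[start] > window:
            start += 1
        peak = max(peak, end - start + 1)
    return peak


base = datetime(2024, 5, 1, 3, 0)                                  # hypothetical log start
log = [base + timedelta(seconds=30 * k) for k in range(15)]         # burst: 15 alarms in 7 minutes
log += [base + timedelta(hours=1, minutes=m) for m in (0, 20, 40)]  # normal rate later

FLOOD_THRESHOLD = 10  # assumed alarms per 10 minutes; replace with the site criterion
peak = peak_alarm_count(log)
status = "possible alarm flood" if peak > FLOOD_THRESHOLD else "within assumed limit"
print(f"Peak: {peak} alarms per 10 min ({status})")
```

If the peak preceding the failure is far above what one operator can act on, the investigation question shifts from "why did the operator miss the alarm?" to "why did the alarm system present that many alarms at once?"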

What a Proper Engineering Root Cause Analysis Looks Like

A technically credible root cause analysis follows a structured methodology that separates evidence gathering from analysis, and analysis from conclusion. The process must be driven by physical evidence, not by organisational pressure to reach a quick finding.

| Stage | Activities | Output |
|---|---|---|
| 1. Event Timeline Reconstruction | Document event sequence, operating conditions, alarm history, maintenance actions | Verified chronology of failure event |
| 2. Physical Evidence Analysis | Photograph, sample, and forensically examine failed components | Fracture characterisation, damage mechanism identification |
| 3. Failure Mechanism Identification | Metallurgical testing, corrosion analysis, fractography | Confirmed failure mode and damage mechanism |
| 4. Engineering Validation | Stress analysis, FEA, fatigue assessment, corrosion modelling | Quantified root cause with engineering evidence |
| 5. Corrective Action Implementation | Design modification, process change, maintenance improvement | Prevention of recurrence verified by analysis |

A complete RCA report must distinguish between three categories:

  • Apparent cause: the immediately visible condition
  • Contributing factors: conditions that worsened the outcome
  • True root cause: the fundamental mechanism that must be eliminated to prevent recurrence

Engineering Tools That Improve Root Cause Analysis

Advanced investigations apply analytical tools that reveal failure mechanisms invisible to the naked eye or inaccessible through operational data alone:

  • Finite Element Analysis (FEA): Computational stress modelling to evaluate stress distribution, concentration factors, and load paths under operating conditions. Essential for pressure vessel, structural, and piping investigations.
  • Fractography: Microscopic examination of fracture surfaces to identify crack initiation sites, propagation direction, loading mode, and fracture mechanism. Provides definitive evidence for fatigue, overload, stress corrosion, or hydrogen embrittlement.
  • Metallurgical Testing: Material characterisation including hardness testing, microstructural examination, chemical composition analysis, and mechanical property verification.
  • Corrosion Analysis: Identification of active mechanisms including uniform corrosion, pitting, crevice corrosion, stress corrosion cracking, and microbiologically influenced corrosion.
  • Process Simulation: Modelling of fluid flow, heat transfer, and mass transfer to identify process conditions that deviate from design intent.
  • Reliability Modelling: Statistical analysis of failure history to identify patterns, recurrence intervals, and systemic weaknesses in maintenance and inspection strategy.
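As a small example of reliability modelling, the sketch below estimates Weibull shape and scale parameters from times to failure using median rank regression with Benard's approximation, F = (i - 0.3)/(n + 0.4). The failure times are hypothetical; the method as written assumes complete (uncensored) data and treats successive intervals as independent samples, which may not hold for a repairable asset whose condition is trending.

```python
import math
from typing import List, Tuple


def weibull_mrr(times_to_failure: List[float]) -> Tuple[float, float]:
    """Estimate Weibull (shape beta, scale eta) by regressing Y = ln(-ln(1 - F)) on X = ln(t)."""
    t = sorted(times_to_failure)
    n = len(t)
    if n < 2:
        raise ValueError("Need at least two failure times")
    xs, ys = [], []
    for i, ti in enumerate(t, start=1):
        f = (i - 0.3) / (n + 0.4)                # Benard's median rank approximation
        xs.append(math.log(ti))
        ys.append(math.log(-math.log(1.0 - f)))
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta = sxy / sxx                             # slope of the Weibull plot = shape parameter
    intercept = y_mean - beta * x_mean           # intercept = -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


# Hypothetical operating hours between successive failures of one pump.
hours = [1200.0, 1850.0, 2300.0, 2700.0, 3400.0]
beta, eta = weibull_mrr(hours)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h")
```

A shape parameter well above 1 indicates an increasing failure rate (wear-out), which points the investigation toward a progressive degradation mechanism; a value near 1 suggests failures that look random in time, and a value below 1 suggests early-life problems such as installation or manufacturing defects.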

Example of a True Root Cause Analysis: Pressure Vessel Nozzle Cracking

The following case study illustrates the difference between a superficial investigation and a technically complete root cause analysis.

Failure Event: Circumferential cracking discovered at a nozzle-to-shell junction on a high-pressure process vessel during scheduled inspection. The crack had propagated through approximately 60% of the nozzle wall thickness.

Initial Assessment: Visual inspection identified the crack location, and NDT confirmed its dimensions. An initial finding attributed the crack to a potential material defect, with the recommendation to repair-weld the nozzle and return the vessel to service.

Complete Root Cause Analysis: Before any repair was undertaken, a multidisciplinary team conducted a structured investigation:

  • Metallographic examination confirmed high-cycle fatigue as the failure mode, based on beach marks and crack front morphology
  • Review of operating history identified cyclic pressure fluctuations from a process control valve in a hunting condition
  • FEA stress analysis revealed a stress concentration factor above design intent due to non-standard reinforcement pad geometry from a previous modification
  • Fatigue life calculation confirmed that the combination of elevated stress concentration and cyclic loading was sufficient to initiate and propagate the crack within the observed service interval

| RCA Stage | Finding |
|---|---|
| Failure observed | Circumferential cracking at nozzle-to-shell junction |
| Failure mode | High-cycle fatigue |
| Contributing factor | Cyclic pressure fluctuation from hunting control valve |
| Root cause | Stress concentration from non-standard reinforcement pad geometry |
| Corrective action | Redesign nozzle geometry and correct the process control issue |

Weld repair of the crack alone, without addressing the stress concentration or the cyclic loading source, would have returned the vessel to service and immediately restarted the fatigue damage mechanism. The RCA prevented that outcome.
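The fatigue-life step in the case study can be illustrated with a deliberately simplified calculation: a Basquin-type S-N curve, a stress concentration factor applied to the nominal stress range, and a constant cycle rate from the hunting valve. Every number here (the S-N constants, nominal stress range, Kt values, and cycle rate) is hypothetical; a real assessment would use the applicable design code's fatigue curves and the stresses from the FEA.

```python
def cycles_to_failure(stress_range_mpa: float, sn_constant: float, sn_exponent: float) -> float:
    """Basquin-type S-N curve, N = C / S**m, with S the local stress range in MPa."""
    return sn_constant / stress_range_mpa ** sn_exponent


def fatigue_life_years(nominal_range_mpa: float, kt: float, cycles_per_day: float,
                       sn_constant: float, sn_exponent: float) -> float:
    """Estimated years to fatigue failure at a constant cycle rate."""
    # Simplification: the full Kt is applied to the nominal range
    # (conservative compared with using a fatigue notch factor).
    local_range = kt * nominal_range_mpa
    allowable_cycles = cycles_to_failure(local_range, sn_constant, sn_exponent)
    return allowable_cycles / (cycles_per_day * 365.0)


C, m = 1.0e12, 3.0       # hypothetical S-N curve constants
NOMINAL_RANGE = 40.0     # hypothetical nominal stress range from pressure cycling, MPa
CYCLES_PER_DAY = 200     # hypothetical cycle rate from the hunting valve

for label, kt in [("design-intent geometry", 2.0), ("modified pad geometry", 3.5)]:
    years = fatigue_life_years(NOMINAL_RANGE, kt, CYCLES_PER_DAY, C, m)
    print(f"{label}: Kt = {kt}, estimated life = {years:.1f} years")
```

Because life scales with the local stress range to the power m, raising Kt from 2.0 to 3.5 cuts the estimated life by a factor of (3.5/2.0)^3, roughly 5.4. That sensitivity is why the corrective action has to address the geometry as well as the cycling.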

Quick Self-Check for Engineering Teams After Any Failure

Use this checklist immediately after any significant equipment failure. If the answer to most of these questions is ‘no’, the investigation is not yet complete.

| Question | Why It Matters |
|---|---|
| Did we preserve the failed part before cleaning or repair? | Prevents loss of forensic evidence that cannot be recovered later |
| Did we distinguish failure mode from root cause? | Avoids symptom-based conclusions that allow the failure to recur |
| Did we collect process, vibration, and maintenance history? | Validates the failure mechanism with real operating data |
| Did we test more than one hypothesis? | Reduces confirmation bias and improves analytical accuracy |
| Did we involve the right technical specialists? | Ensures complex mechanisms are not missed by generalist teams |
| Did we perform engineering calculations to validate our conclusion? | Distinguishes evidence-based findings from informed speculation |
| Is the corrective action addressing the mechanism, not just the part? | Prevents recurrence rather than resetting the failure cycle |

Key Signs Your Root Cause Analysis Is Incomplete

Your RCA is likely incomplete if any of the following apply:

  • The same failure has recurred after previous RCA corrective actions were implemented
  • The root cause is listed as ‘human error’ without analysis of the enabling system conditions
  • No physical evidence from the failed component was collected, examined, or analysed
  • No engineering calculations were performed to validate the proposed failure mechanism
  • The corrective action consists solely of replacing the failed component with an identical part
  • The investigation was completed in less than 48 hours without specialist engineering input
  • Contributing factors and system interactions were not documented in the final report

How Engineers Can Perform Better Root Cause Analyses

Improving the quality of engineering failure investigations requires changes to both investigation process and organizational culture. The following practices form the foundation of a technically credible RCA program:

  • Activate evidence preservation protocols immediately: The first action after any failure should be documentation and preservation before any component is moved, cleaned, or discarded.
  • Collect and secure operating data without delay: Extract process historian data, vibration logs, alarm records, and maintenance histories covering the period leading up to the failure.
  • Assemble multidisciplinary investigation teams: Include specialist engineers from the outset, not as a last resort when the standard investigation has stalled.
  • Apply engineering analysis tools: Do not rely exclusively on discussion and diagram methods for complex failures. Apply stress analysis, fractography, corrosion evaluation, or process simulation to validate the failure mechanism.
  • Validate conclusions before closing the investigation: Can the proposed mechanism explain the observed damage pattern, failure location, timing, and operating conditions? If not, the investigation is not complete.
  • Track corrective action effectiveness: Implement a formal process for monitoring whether recommended actions have actually prevented recurrence. If the failure returns, the RCA must be reopened, not simply repeated.
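Tracking effectiveness can start with something as simple as comparing the failure history before the corrective action with what has happened since. The sketch below assumes a list of failure dates for a single asset; the dates and function names are hypothetical, and with only a few failures the comparison is indicative rather than statistically conclusive.

```python
from datetime import date
from typing import Dict, List, Optional


def mean_days_between(failures: List[date]) -> Optional[float]:
    """Mean interval in days between consecutive failures, or None if fewer than two."""
    ordered = sorted(failures)
    if len(ordered) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def recurrence_summary(failures: List[date], action_date: date,
                       today: date) -> Dict[str, Optional[float]]:
    """Compare failure history before a corrective action with what has happened since."""
    before = sorted(d for d in failures if d < action_date)
    after = sorted(d for d in failures if d >= action_date)
    return {
        "mean_days_between_before": mean_days_between(before),
        "failures_since_action": len(after),
        "days_to_first_recurrence": (after[0] - action_date).days if after else None,
        "days_failure_free_so_far": None if after else (today - action_date).days,
    }


# Hypothetical pump history: four failures, then a corrective action.
history = [date(2022, 1, 10), date(2022, 5, 2), date(2022, 9, 15), date(2023, 1, 20)]
summary = recurrence_summary(history, action_date=date(2023, 2, 1), today=date(2024, 2, 1))
print(summary)  # mean_days_between_before = 125.0, days_failure_free_so_far = 365
```

A failure-free period longer than the previous mean interval is encouraging but not proof; the evidence strengthens only as the failure-free time grows well beyond the historical interval. A recurrence is the signal to reopen the original RCA.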

Conclusion

Root cause analysis fails when investigation teams stop at the first visible failure: the broken component, the cracked weld, the failed seal. These findings describe what broke. They do not explain why.

True engineering root cause analysis demands a deeper commitment: to trace the causal chain to its origin, to apply the analytical tools needed to validate the failure mechanism, and to develop corrective actions that address the system-level cause rather than the component-level symptom.

The recurring failure is not evidence that the problem is unsolvable. It is evidence that the previous investigation did not reach the root cause. The purpose of root cause analysis is not to explain what broke. It is to ensure it never breaks again.

Written By

PANDHARINATH SANAP

CEO and Co-Founder | IntPE

Pandharinath Sanap is the CEO and Co-Founder of Ideametrics, with more than 15 years of experience in mechanical engineering, engineering assessments, and technical reviews across industrial projects. He is an International Professional Engineer (IntPE)…

Turning Complex Engineering Into Confident Decisions.

Ideametrics is where precision, compliance, and innovation come together, helping industries to solve complex challenges, achieve global standards, and move forward with confidence.
