What is root cause analysis in engineering?
Root cause analysis (RCA) in engineering is a structured investigation process used to identify the underlying mechanism behind an equipment or system failure. Its purpose is not only to explain how a component failed, but to determine why the failure occurred, so that recurrence can be prevented, reliability improved, and maintenance strategy informed by evidence.
A pump trips unexpectedly during a night shift. Operations restores it by morning. A week later, the investigation team submits its RCA report: bearing failure due to overload. Recommendation: replace the bearing.
Six months later, the same pump fails again.
This scenario plays out in plants across every industry: oil and gas, petrochemical, power generation, manufacturing. RCA reports are written. Recommendations are issued. But the failures keep coming back.
Equipment reliability stagnates. Shutdown costs accumulate. Safety risks persist.
The uncomfortable truth is this: most RCA reports identify what happened, not why it truly happened. They stop at the visible failure, not the underlying mechanism. And that distinction is the difference between solving a problem permanently and scheduling the next breakdown.
Most RCAs fail not because engineers lack expertise, but because investigations stop too early, rely on incomplete data, or focus on the wrong problem entirely.
This article explains the nine most common engineering mistakes that cause root cause analysis to fail, how to recognise them in practice, and what engineering teams should do differently to identify the true failure mechanism.
What Root Cause Analysis Is Supposed to Achieve
Root Cause Analysis is a structured engineering investigation method designed to identify the true, underlying cause of a failure. When performed correctly, RCA delivers several critical outcomes:
- Identify the underlying failure mechanism driving equipment degradation
- Prevent recurrence by addressing the system-level cause, not just the symptom
- Improve long-term system reliability and reduce unplanned downtime
- Inform maintenance strategy, inspection intervals, and design improvements
- Provide documented engineering evidence for asset management decisions
Several structured methods support this process: 5 Whys, Fault Tree Analysis (FTA), Fishbone (Ishikawa) Diagram, Event and Causal Factor Analysis, and Failure Mode and Effects Analysis (FMEA).
Each method has its strengths. But none will produce a valid result if the fundamental investigation discipline is flawed. The tool is only as good as the investigation behind it.
Why Root Cause Analysis Often Fails in Industrial Investigations
The core problem in most failed RCAs is a fundamental confusion between the failure mode and the root cause. These are not the same thing. The failure mode describes how a component failed. The root cause explains why that failure occurred. Consider the following examples:
| Failure Event | Incorrect RCA Finding | True Root Cause |
|---|---|---|
| Pump failure | Bearing failure | Lubrication contamination from ingress |
| Vessel crack | Material defect | Thermal fatigue from process cycling |
| Pipe leak | Corrosion | Undocumented process chemistry change |
| Compressor trip | High vibration | Impeller imbalance from fouling |
In each case, the incorrect finding describes what broke. The true root cause explains the mechanism that drove the failure. Addressing the failure mode alone (replacing the bearing, patching the weld, repainting the pipe) only resets the countdown to the next failure.
Warning Signs in RCA Reports
Before diving into the nine mistakes, there is a fast way to identify a weak RCA. If an investigation report concludes with any of the following phrases as its final finding, it almost certainly has not reached the true root cause:
- "Bearing failure"
- "Material defect"
- "Corrosion"
- "Operator error"
- "Component overload"
- "Improper maintenance"
These phrases describe observable conditions. They are starting points for investigation, not conclusions. Each one prompts the same engineering question: why?
Why did the bearing fail? Why did the material defect go undetected? Why did corrosion progress to failure? Why was the maintenance improper?
A root cause analysis that ends at one of these phrases has stopped halfway. The investigation must continue until it reaches a mechanism that, if corrected, would prevent the failure from recurring.
9 Engineering Mistakes That Cause Root Cause Analysis to Fail
The following nine mistakes are the most common reasons engineering RCAs fail to identify the true failure mechanism. Each represents a specific breakdown in investigation discipline that leads to recurring failures and wasted resources.
Mistake 1 - Stopping the Root Cause Analysis Too Early
The most pervasive mistake is concluding the RCA once the failed component has been identified.
The team finds a cracked weld, a seized bearing, or a fractured shaft, and the investigation stops. The failed part becomes the root cause in the report.
But the failed component is almost always a victim, not a cause. The investigation needs to continue:
- Pump failed > Bearing seized > Lubrication failed > Filtration system undersized
- Filtration undersized > Design specification did not account for process change two years prior
The true root cause rarely sits at the first layer of observation. Engineering investigations must trace the causal chain until they reach a mechanism that, if corrected, would prevent the failure from recurring.
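The pump chain above can be written down as a simple linked structure, which makes it easy to see where an investigation stopped. This is only a minimal sketch: the `Cause` class, its `prevents_recurrence_if_fixed` flag, and the helper functions are hypothetical names invented for illustration, not part of any standard RCA tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cause:
    """One link in a causal chain (hypothetical structure for illustration)."""
    description: str
    # Set to True only when correcting this item would prevent recurrence.
    prevents_recurrence_if_fixed: bool = False
    why: Optional["Cause"] = None  # the next, deeper cause, if one was found


def deepest_cause(event: Cause) -> Cause:
    """Walk the chain until no deeper 'why' has been recorded."""
    node = event
    while node.why is not None:
        node = node.why
    return node


def investigation_complete(event: Cause) -> bool:
    """The chain is complete only if its last link is a correctable mechanism."""
    return deepest_cause(event).prevents_recurrence_if_fixed


# The pump example from the text, built from the deepest cause upward.
spec = Cause("Design specification did not account for process change",
             prevents_recurrence_if_fixed=True)
filtration = Cause("Filtration system undersized", why=spec)
lubrication = Cause("Lubrication failed", why=filtration)
bearing = Cause("Bearing seized", why=lubrication)
pump = Cause("Pump failed", why=bearing)

# A report that stops at the bearing records no deeper 'why'.
shallow = Cause("Pump failed", why=Cause("Bearing seized"))

print(deepest_cause(pump).description)
print(investigation_complete(pump), investigation_complete(shallow))
```

The flag is a judgment the investigation team has to make and defend with evidence. The code only makes it visible that a chain ending at "Bearing seized" never reached that point.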
Mistake 2 - Confusing Failure Mode With Root Cause in RCA
This distinction is fundamental and is frequently misunderstood even by experienced engineers.
| Term | Definition | Example |
|---|---|---|
| Failure Mode | How the component failed | Fatigue crack propagation |
| Root Cause | Why the failure occurred | Cyclic vibration from shaft misalignment |
| Contributing Factor | Condition that worsened the failure | Elevated temperature accelerating crack growth |
An RCA that reports ‘fatigue cracking’ as the root cause has only described the failure mode. The engineering question that matters: what was the source of the cyclic loading that drove the fatigue? Answering that question leads to the actual root cause, and the corrective action that will prevent recurrence.
Mistake 3 - Ignoring System-Level Interactions During Failure Investigation
Engineering failures rarely occur in isolation. A component that fails is almost always responding to its operating environment: the process conditions imposed on it, the mechanical loads applied, the maintenance history, and the control system behavior.
Investigations that focus exclusively on the failed component miss the wider system context. A heat exchanger that cracks may appear to be a materials issue. But a system-level investigation might reveal uneven flow distribution creating localized thermal stress concentrations that a properly loaded exchanger would never experience.
Effective RCA requires investigators to map the interactions between process, mechanical, instrumentation, and maintenance systems and to examine how those interactions contributed to the failure mechanism.
Mistake 4 - Performing Root Cause Analysis Without Operating Data
An RCA without data is engineering speculation dressed in a report format. To identify the true failure mechanism, investigators need access to the operating history of the failed system. In many cases, that data is not collected, not retained, or not accessible when the investigation begins.
The most commonly missing data categories include:
- Operating pressure and temperature history in the period leading up to failure
- Vibration monitoring logs and trending data
- Corrosion monitoring and thickness measurement records
- Process chemistry changes and upset event logs
- Maintenance records, inspection findings, and previous repair history
When this data is unavailable, investigators are forced to make assumptions. Those assumptions become the conclusions in the RCA report. Data collection systems must be in place before failures occur, not established in response to them.
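To show what retained operating data makes possible, the sketch below fits a straight-line trend to a short series of vibration readings. The readings, their units, and the function name `trend_slope` are assumptions made up for this example. A real investigation would pull the values from the plant historian or condition monitoring system.

```python
from statistics import mean


def trend_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    t_bar, v_bar = mean(times), mean(values)
    numerator = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    denominator = sum((t - t_bar) ** 2 for t in times)
    return numerator / denominator


# Hypothetical weekly overall vibration readings (mm/s RMS) before the trip.
weeks = [0, 1, 2, 3, 4, 5]
vibration = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0]

slope = trend_slope(weeks, vibration)
print(f"Vibration trend: {slope:.2f} mm/s per week")
```

A steady upward slope in the weeks before a trip points towards progressive degradation rather than a sudden event. Without the stored readings, that distinction would have to be assumed.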
Mistake 5 - Confirmation Bias in Investigation Teams
Confirmation bias is particularly dangerous in engineering failure investigations. It occurs when the team arrives with a pre-formed hypothesis and gathers evidence to confirm it rather than to test it.
In engineering, this often manifests as a rapid attribution to ‘material defect’ or ‘operator error’, explanations that remove design and system responsibility from the failure narrative.
The correct approach is to treat every investigation as hypothesis testing. Begin with multiple candidate causes. Design evidence collection to distinguish between them. Do not close the investigation until alternative causes have been explicitly evaluated and ruled out with evidence.
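One way to enforce this discipline is to keep an explicit register of candidate causes and refuse to close the investigation while any remain undecided. The sketch below is a hypothetical illustration: the `Hypothesis` class, its status values, and the evidence strings are invented for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """A candidate cause and the evidence recorded for or against it."""
    cause: str
    status: str = "open"  # "open", "ruled_out", or "supported"
    evidence: List[str] = field(default_factory=list)


def can_close(hypotheses: List[Hypothesis]) -> bool:
    """Allow closure only when every candidate has been decided on evidence."""
    if len(hypotheses) < 2:
        return False  # a single candidate was never tested against alternatives
    decided = all(
        h.status in ("ruled_out", "supported") and len(h.evidence) > 0
        for h in hypotheses
    )
    supported = [h for h in hypotheses if h.status == "supported"]
    return decided and len(supported) >= 1


# Illustrative register; the evidence entries are placeholders.
register = [
    Hypothesis("Material defect", "ruled_out", ["Chemistry and hardness within specification"]),
    Hypothesis("Cyclic loading", "supported", ["Pressure trend shows repeated fluctuation"]),
    Hypothesis("Operator error"),  # still open, no evidence collected
]

print(can_close(register))
```

In this register the third candidate has not been evaluated, so `can_close` returns `False` until evidence is recorded against it.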
Mistake 6 - Not Involving the Right Technical Experts
Many RCA teams are assembled from the operations and maintenance personnel closest to the failure event. This makes practical sense, but for complex failures, first-hand familiarity is not a substitute for specialist technical expertise. Depending on the failure type, a complete investigation may require:
- Metallurgists and materials scientists for fracture surface analysis and materials characterisation
- Corrosion engineers for electrochemical damage mechanism identification
- Structural and mechanical engineers for stress and fatigue analysis
- FEA specialists for computational stress evaluation
- Process engineers for operating envelope assessment and chemistry review
A fracture surface that looks like overload to a maintenance engineer may exhibit clear fatigue striations and crack initiation features to a metallurgist, features that completely change the failure narrative and the corrective actions required.
Mistake 7 - Performing Root Cause Analysis Without Engineering Calculations
There is a significant difference between an investigation and a discussion.
Many RCA processes consist primarily of meetings, interviews, and diagram exercises: 5 Whys sessions and fishbone workshops that produce a narrative conclusion without any supporting engineering calculation or physical analysis.
For serious equipment failures, this is insufficient. Depending on the failure type, a technically credible RCA may require:
- Stress analysis to evaluate whether operating loads exceeded design margins
- Fracture mechanics assessment to determine crack initiation and propagation rates
- Fatigue analysis to quantify the effect of cyclic loading history
- Corrosion rate evaluation to assess damage mechanism severity
- Finite Element Analysis to validate stress concentration hypotheses
A pressure vessel nozzle cracking failure, for example, cannot be properly investigated through discussion alone. The investigation needs FEA to evaluate stress distribution under operating conditions.
Without the analysis, the corrective action is a guess.
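Even a very simple calculation can move a finding from opinion to evidence. The sketch below estimates a corrosion rate from two thickness readings and the remaining life to a minimum required thickness. All numbers, including the assumed design corrosion rate, are hypothetical, and the linear-rate assumption is a simplification. A real assessment would follow the applicable inspection code.

```python
def corrosion_rate(t_previous_mm, t_current_mm, years_between):
    """Average wall loss per year between two thickness measurements."""
    return (t_previous_mm - t_current_mm) / years_between


def remaining_life_years(t_current_mm, t_required_mm, rate_mm_per_year):
    """Years until the wall reaches the required minimum, at a constant rate."""
    if rate_mm_per_year <= 0:
        return float("inf")  # no measurable wall loss
    return max(t_current_mm - t_required_mm, 0.0) / rate_mm_per_year


# Hypothetical inspection data for one measurement location.
rate = corrosion_rate(t_previous_mm=12.0, t_current_mm=11.0, years_between=4)
life = remaining_life_years(t_current_mm=11.0, t_required_mm=9.5,
                            rate_mm_per_year=rate)

assumed_design_rate = 0.1  # mm/year, an assumed design basis value
print(f"Measured rate: {rate} mm/year, remaining life: {life} years")
print("Exceeds design basis:", rate > assumed_design_rate)
```

A measured rate well above the design basis is itself a finding. It prompts the next question: what changed in the process or environment to accelerate the damage?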
Mistake 8 - Poor Documentation and Preservation of Failure Evidence
Failure evidence has a very short window of availability.
During the urgency of a plant shutdown, failed components are frequently removed, cleaned, and discarded before any forensic documentation has been completed. Fracture surfaces, which contain critical information about crack initiation, propagation direction, and loading mode, are contaminated or destroyed by handling.
Once this evidence is gone, it cannot be recovered. The investigation that follows is permanently compromised. Proper evidence preservation requires a defined protocol activated the moment a failure occurs:
- Immediate photographic documentation before any disturbance
- Careful removal and packaging of failed components
- Preservation of fracture surfaces in an uncontaminated state
- Secure storage until forensic engineering analysis is complete
Mistake 9 - Attributing Root Cause to Human Error Instead of System Design
‘Operator error’ is one of the most common root cause conclusions in industrial RCA reports, and one of the least useful.
While human performance failures do contribute to some incidents, attributing a failure to human error without investigating the system conditions that made that error possible is an incomplete analysis.
When investigators dig deeper into ‘human error’ conclusions, they frequently find:
- Control interface designs that made incorrect actions easy and correct actions difficult
- Alarm systems in a state of chronic overload, causing critical warnings to be missed
- Operating procedures that were unclear, outdated, or inconsistent with current plant configuration
- Training systems that did not prepare operators for the scenario that preceded the failure
These are system design deficiencies, not individual failures. Blaming the operator without addressing the system conditions that enabled the error guarantees that the same performance failure will occur again, likely with the next person assigned to that role. Effective RCA asks not just ‘what did the person do?’ but ‘what system conditions made this outcome likely?’
What a Proper Engineering Root Cause Analysis Looks Like
A technically credible root cause analysis follows a structured methodology that separates evidence gathering from analysis, and analysis from conclusion. The process must be driven by physical evidence, not by organisational pressure to reach a quick finding.
| Stage | Activities | Output |
|---|---|---|
| 1. Event Timeline Reconstruction | Document event sequence, operating conditions, alarm history, maintenance actions | Verified chronology of failure event |
| 2. Physical Evidence Analysis | Photograph, sample, and forensically examine failed components | Fracture characterisation, damage mechanism identification |
| 3. Failure Mechanism Identification | Metallurgical testing, corrosion analysis, fractography | Confirmed failure mode and damage mechanism |
| 4. Engineering Validation | Stress analysis, FEA, fatigue assessment, corrosion modelling | Quantified root cause with engineering evidence |
| 5. Corrective Action Implementation | Design modification, process change, maintenance improvement | Prevention of recurrence verified by analysis |
A complete RCA report must distinguish between three categories:
- Apparent cause: the immediately visible condition
- Contributing factors: conditions that worsened the outcome
- True root cause: the fundamental mechanism that must be eliminated to prevent recurrence
Engineering Tools That Improve Root Cause Analysis
Advanced investigations apply analytical tools that reveal failure mechanisms invisible to the naked eye or inaccessible through operational data alone:
- Finite Element Analysis (FEA): Computational stress modelling to evaluate stress distribution, concentration factors, and load paths under operating conditions. Essential for pressure vessel, structural, and piping investigations.
- Fractography: Microscopic examination of fracture surfaces to identify crack initiation sites, propagation direction, loading mode, and fracture mechanism. Provides definitive evidence for fatigue, overload, stress corrosion, or hydrogen embrittlement.
- Metallurgical Testing: Material characterisation including hardness testing, microstructural examination, chemical composition analysis, and mechanical property verification.
- Corrosion Analysis: Identification of active mechanisms including uniform corrosion, pitting, crevice corrosion, stress corrosion cracking, and microbiologically influenced corrosion.
- Process Simulation: Modelling of fluid flow, heat transfer, and mass transfer to identify process conditions that deviate from design intent.
- Reliability Modelling: Statistical analysis of failure history to identify patterns, recurrence intervals, and systemic weaknesses in maintenance and inspection strategy.
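As a small illustration of the reliability modelling idea, the sketch below computes the intervals between recorded failures of one asset and checks whether they are getting shorter. The failure dates and function names are invented for this example. A real study would draw on the maintenance management system and would normally use a proper statistical model.

```python
from datetime import date
from statistics import mean

# Hypothetical failure dates for one asset (invented for illustration).
failure_dates = [
    date(2021, 1, 10),
    date(2021, 9, 7),
    date(2022, 3, 6),
    date(2022, 7, 4),
]


def days_between_failures(dates):
    """Return the number of days between each consecutive pair of failures."""
    ordered = sorted(dates)
    return [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]


def intervals_shortening(intervals):
    """True when there are at least two intervals and each is shorter than the last."""
    return len(intervals) >= 2 and all(
        b < a for a, b in zip(intervals, intervals[1:])
    )


intervals = days_between_failures(failure_dates)
mean_interval = mean(intervals)
print(intervals, mean_interval, intervals_shortening(intervals))
```

Shrinking intervals after each repair suggest that the corrective actions are not addressing the underlying mechanism.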
Example of a True Root Cause Analysis: Pressure Vessel Nozzle Cracking
The following case study illustrates the difference between a superficial investigation and a technically complete root cause analysis.
Failure Event: Circumferential cracking discovered at a nozzle-to-shell junction on a high-pressure process vessel during scheduled inspection. The crack had propagated through approximately 60% of the nozzle wall thickness.
Initial Assessment: Visual inspection identified the crack location. NDT confirmed dimensions. An initial finding attributed the crack to a potential material defect. Recommendation: repair-weld the nozzle and return to service.
Complete Root Cause Analysis: Before any repair was undertaken, a multidisciplinary team conducted a structured investigation:
- Fractographic and metallographic examination confirmed high-cycle fatigue as the failure mode, based on beach marks and crack front morphology
- Review of operating history identified cyclic pressure fluctuations from a process control valve in a hunting condition
- FEA stress analysis revealed a stress concentration factor above design intent due to non-standard reinforcement pad geometry from a previous modification
- Fatigue life calculation confirmed that the combination of elevated stress concentration and cyclic loading was sufficient to initiate and propagate the crack within the observed service interval
| RCA Stage | Finding |
|---|---|
| Failure observed | Circumferential cracking at nozzle-to-shell junction |
| Failure mode | High-cycle fatigue |
| Contributing factor | Cyclic pressure fluctuation from hunting control valve |
| Root cause | Stress concentration from non-standard reinforcement pad geometry |
| Corrective action | Redesign nozzle geometry + repair process control issue |
Weld repair of the crack alone, without addressing the stress concentration or the cyclic loading source, would have restored the vessel to service and immediately reinitiated the fatigue damage mechanism. The RCA prevented that outcome.
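The sensitivity of fatigue life to stress concentration can be illustrated with a deliberately simplified calculation. The sketch below uses the Basquin stress-life relation and applies the stress concentration factor directly to the nominal stress amplitude. It is not the calculation from the case study: the material constants, stress amplitude, and both concentration factors are hypothetical, and mean stress, weld effects, and code design curves are ignored.

```python
def cycles_to_failure(nominal_amplitude_mpa, kt, sigma_f_mpa, b):
    """Basquin relation, local_amplitude = sigma_f * (2N)**b, solved for N."""
    local_amplitude = kt * nominal_amplitude_mpa
    return 0.5 * (local_amplitude / sigma_f_mpa) ** (1.0 / b)


# Hypothetical inputs chosen only to show the trend.
SIGMA_F = 900.0   # fatigue strength coefficient, MPa (assumed)
B = -0.1          # fatigue strength exponent (assumed)
AMPLITUDE = 40.0  # nominal stress amplitude from pressure cycling, MPa (assumed)

life_as_designed = cycles_to_failure(AMPLITUDE, kt=2.0, sigma_f_mpa=SIGMA_F, b=B)
life_as_modified = cycles_to_failure(AMPLITUDE, kt=3.0, sigma_f_mpa=SIGMA_F, b=B)

print(f"Life ratio (designed / modified): {life_as_designed / life_as_modified:.1f}")
```

With these assumed constants, raising the concentration factor by half cuts the estimated life by well over an order of magnitude. That is why a geometry change that looks minor can dominate the failure mechanism.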
Quick Self-Check for Engineering Teams After Any Failure
Use this checklist immediately after any significant equipment failure. If the answer to most of these questions is ‘no’, the investigation is not yet complete.
| Question | Why It Matters |
|---|---|
| Did we preserve the failed part before cleaning or repair? | Prevents loss of forensic evidence that cannot be recovered later |
| Did we distinguish failure mode from root cause? | Avoids symptom-based conclusions that allow the failure to recur |
| Did we collect process, vibration, and maintenance history? | Validates the failure mechanism with real operating data |
| Did we test more than one hypothesis? | Reduces confirmation bias and improves analytical accuracy |
| Did we involve the right technical specialists? | Ensures complex mechanisms are not missed by generalist teams |
| Did we perform engineering calculations to validate our conclusion? | Distinguishes evidence-based findings from informed speculation |
| Is the corrective action addressing the mechanism, not just the part? | Prevents recurrence rather than resetting the failure cycle |
Key Signs Your Root Cause Analysis Is Incomplete
Your RCA is likely incomplete if any of the following apply:
- The same failure has recurred after previous RCA corrective actions were implemented
- The root cause is listed as ‘human error’ without analysis of the enabling system conditions
- No physical evidence from the failed component was collected, examined, or analysed
- No engineering calculations were performed to validate the proposed failure mechanism
- The corrective action consists solely of replacing the failed component with an identical part
- The investigation was completed in less than 48 hours without specialist engineering input
- Contributing factors and system interactions were not documented in the final report
How Engineers Can Perform Better Root Cause Analyses
Improving the quality of engineering failure investigations requires changes to both investigation process and organizational culture. The following practices form the foundation of a technically credible RCA program:
- Activate evidence preservation protocols immediately: The first action after any failure should be documentation and preservation before any component is moved, cleaned, or discarded.
- Collect and secure operating data without delay: Extract process historian data, vibration logs, alarm records, and maintenance histories covering the period leading up to the failure.
- Assemble multidisciplinary investigation teams: Include specialist engineers from the outset, not as a last resort when the standard investigation has stalled.
- Apply engineering analysis tools: Do not rely exclusively on discussion and diagram methods for complex failures. Apply stress analysis, fractography, corrosion evaluation, or process simulation to validate the failure mechanism.
- Validate conclusions before closing the investigation: Can the proposed mechanism explain the observed damage pattern, failure location, timing, and operating conditions? If not, the investigation is not complete.
- Track corrective action effectiveness: Implement a formal process for monitoring whether recommended actions have actually prevented recurrence. If the failure returns, the RCA must be reopened, not simply repeated.
Conclusion
Root cause analysis fails when investigation teams stop at the first visible failure: the broken component, the cracked weld, the failed seal. These findings describe what broke. They do not explain why.
True engineering root cause analysis demands a deeper commitment: to trace the causal chain to its origin, to apply the analytical tools needed to validate the failure mechanism, and to develop corrective actions that address the system-level cause rather than the component-level symptom.
The recurring failure is not evidence that the problem is unsolvable. It is evidence that the previous investigation did not reach the root cause. The purpose of root cause analysis is not to explain what broke. It is to ensure it never breaks again.
Written By
PANDHARINATH SANAP
CEO and Co-Founder | IntPE
Pandharinath Sanap is the CEO and Co-Founder of Ideametrics, with more than 15 years of experience in mechanical engineering, engineering assessments, and technical reviews across industrial projects. He is an International Professional Engineer (IntPE)…