Rethinking Reliability: An Introduction to Reliability-centred Maintenance (RCM)

In the modern industrial landscape, maintenance managers are facing a crisis of complexity. The old question, “Should I change this lightbulb after 10,000 hours?”, has been replaced by a web of competing demands: environmental regulations, safety protocols, edge computing integration, and the relentless pressure to cut costs while increasing uptime.

Maintenance is no longer just about changing the oil in your car or greasing a conveyor belt. It has become the critical firewall preventing plant downtime, catastrophic safety events, and environmental disasters.

As we move deeper into the “Third Generation” of maintenance, organizations are realizing that digitalization is not a silver bullet. Taking an outdated maintenance regime—one based on guesswork or 1970s manufacturer recommendations—and digitizing it does not create reliability. It simply makes a bad process faster.

To navigate this complexity, we need a robust framework. We need Reliability-centred Maintenance (RCM). Based on the gold-standard work of John Moubray, this article explores why our understanding of failure is wrong, and how RCM corrects it.

It gives us a robust methodology for systematically identifying risks to safety, the environment and asset performance, and for deciding how to remedy them.

1. The Changing World of Maintenance

Over the last eighty years, the expectations placed on maintenance teams have evolved radically. We have moved through three distinct generations:

  1. First Generation (Pre-WWII):
    • Equipment was simple and over-engineered.
    • The strategy was “Fix it when it breaks.”
  2. Second Generation (Post-WWII):
    • Mechanization increased, and labor costs rose.
    • The focus shifted to “Preventive Maintenance” and scheduled overhauls to keep assets alive.
  3. Third Generation (Current):
    • Just-in-Time manufacturing and automation mean that downtime is now exorbitantly expensive.
    • More advanced techniques are required to optimize maintenance within these tighter system constraints.

However, the Third Generation is defined by more than just cost. There is a rapidly growing awareness of the connection between maintenance, product quality, and safety. A failure in a chemical plant or an aircraft isn’t just an operational nuisance; it is a potential tragedy.

The Cultural Shift: Operators Thinking Like Engineers

This high-stakes environment has forced a cultural convergence. Operations personnel can no longer treat machines as “black boxes” that maintenance fixes. They now have to think and act like engineers and managers, understanding the limits and capabilities of their equipment. Conversely, maintenance teams have to understand the commercial and operational consequences of their decisions.

2. The Reality of Failure

Why do traditional maintenance schedules fail to deliver reliability in modern plants? The answer lies in our fundamental misunderstanding of how things break.

For decades, the industry has relied on the assumption that most assets have a reliable life period, after which they wear out. The logic followed that if you overhauled a machine right before it hit that “wear-out zone,” you could prevent failure.

The Aviation Discovery

The commercial aviation industry was the first to be confronted with the lethal inadequacy of this model. Despite aggressive scheduled maintenance, crash rates remained worryingly high in the mid-20th century.

Research into Third Generation maintenance revealed a startling truth: Failure is rarely about age.

As shown in the six patterns of failure above:

  • Pattern A (The Bathtub Curve): Accounts for only a very small percentage of failures (typically simple items like brake pads or tires).
  • Pattern F (Infant Mortality): Shows a high probability of failure immediately after commissioning or maintenance, which then stabilizes.

The Insight: In civil aviation, studies showed that 68% of items conform to Pattern F.

  • Assets were most likely to fail shortly after being replaced or maintained, which runs counter to the traditional assumption that failures cluster at the end of an asset’s life.
  • For complex equipment, invasive scheduled maintenance therefore often increases the risk of failure by introducing human error or infant mortality into a stable system.
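
To make the difference between the patterns concrete, here is a minimal sketch using the Weibull hazard rate as a stand-in for the conditional probability of failure. The Weibull model, the shape values, and the 5,000-hour scale are illustrative assumptions, not figures from the aviation studies. A shape below 1 only approximates Pattern F, because its hazard keeps declining rather than levelling off.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Hazard rate (failures per hour) at age t for a Weibull life model."""
    return (shape / scale) * (t / scale) ** (shape - 1)


# Illustrative shape parameters (assumed, not taken from the article):
#   shape < 1  -> hazard falls with age   (infant mortality, similar to Pattern F)
#   shape == 1 -> hazard is constant      (failure unrelated to age)
#   shape > 1  -> hazard rises with age   (wear-out)
PATTERNS = {"infant mortality": 0.5, "random": 1.0, "wear-out": 3.0}
AGES_HOURS = [100, 1000, 5000]

for label, shape in PATTERNS.items():
    rates = [weibull_hazard(t, shape, scale=5000.0) for t in AGES_HOURS]
    formatted = ", ".join(f"{r:.6f}" for r in rates)
    print(f"{label:>16}: {formatted}")
```

In the infant-mortality row the hazard is highest at 100 hours and lowest at 5,000 hours. Under that assumption, an overhaul that resets the asset's age moves it back to the riskiest part of the curve.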

Multi-Modal Failure

Furthermore, modern equipment failure is multi-modal. A machine rarely loses functionality for a single reason. It could be mechanical wear, but it is just as likely to be software corruption, operator error, design flaws, or environmental stress.

Addressing this requires a toolkit that goes beyond wrenches. It requires Condition Monitoring, Design for Reliability, Hazard and Operability (HAZOP) studies, edge computing, and Failure Modes and Effects Analysis (FMEA).

Defining RCM

To manage this multi-modal, non-age-related risk, Moubray defined RCM as:

“A process used to determine what must be done to ensure that any physical asset continues to do what its users want it to do in its present operating context.”

The phrase “present operating context” is the key. A diesel generator used as a primary power source for a remote mine has a completely different maintenance requirement than the exact same generator used as a backup for a hospital. One runs constantly; the other sits idle. RCM adapts the maintenance to the context, not the hardware.

3. RCM: The Seven Basic Questions

RCM is not a brainstorm; it is a rigorous, structured audit of an asset. It is governed by seven specific questions that guide a team from the asset’s function down to the specific task required to sustain it.

  1. What are the functions and associated performance standards of the asset in its present operating context?
    • Insight: We don’t just “maintain the pump.” We maintain the pump’s ability to move 500 liters per minute (primary function) and contain the fluid without leaking (secondary function).
  2. In what ways does it fail to fulfill its functions?
    • Insight: “Broken” is too vague. Functional failures include “Stopped completely,” “Pumping too slow,” or “Leaking.”
  3. What causes each functional failure? (Failure Modes)
    • Insight: This is the specific cause—e.g., “Bearing seized due to lack of lubrication” or “Impeller eroded.” RCM forces us to distinguish between normal wear and human error.
  4. What happens when each failure occurs? (Failure Effects)
    • Insight: What is the evidence? Does a light flash? Does the machine smoke? This helps in diagnosis.
  5. In what way does each failure matter? (Failure Consequences)
    • Insight: This is the prioritization filter. Does the failure kill people (Safety), breach regulations (Environmental), stop production (Operational), or just cost money to fix (Non-Operational)?
  6. What can be done to predict or prevent each failure? (Proactive Tasks)
    • Insight: Can we use the P-F Interval (the time between a potential failure becoming detectable and the actual functional failure) to intervene?
  7. What should be done if a suitable proactive task cannot be found? (Default Actions)
    • Insight: If we can’t predict it, do we redesign the machine? Or, if the consequences are low, do we consciously decide to let it run to failure?
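
The seven questions can be read as one record per failure mode, followed by a decision. The sketch below is a deliberately reduced illustration, not Moubray's full decision diagram: it leaves out scheduled restoration, scheduled discard, and failure-finding tasks. The class and function names are hypothetical. Inspecting at half the P-F interval is a common rule of thumb, used here as an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Consequence(Enum):
    SAFETY = "safety"
    ENVIRONMENTAL = "environmental"
    OPERATIONAL = "operational"
    NON_OPERATIONAL = "non-operational"


@dataclass
class FailureMode:
    function: str                  # Q1: function and performance standard
    functional_failure: str        # Q2: how the function is lost
    cause: str                     # Q3: failure mode
    effect: str                    # Q4: what happens when it occurs
    consequence: Consequence       # Q5: why it matters
    pf_interval_days: Optional[float] = None  # Q6: None means no detectable warning


def select_task(fm: FailureMode, min_practical_interval_days: float = 1.0) -> str:
    """Pick a task for one failure mode (simplified, hypothetical logic)."""
    # Q6: an on-condition task only works if we can inspect often enough
    # to catch the potential failure before it becomes a functional failure.
    if fm.pf_interval_days is not None:
        inspection_interval = fm.pf_interval_days / 2  # assumed rule of thumb
        if inspection_interval >= min_practical_interval_days:
            return f"on-condition inspection every {inspection_interval:g} days"
    # Q7: default actions when no suitable proactive task exists.
    if fm.consequence in (Consequence.SAFETY, Consequence.ENVIRONMENTAL):
        return "redesign"
    return "run to failure"


bearing = FailureMode(
    function="Move 500 liters per minute",
    functional_failure="Stopped completely",
    cause="Bearing seized due to lack of lubrication",
    effect="Rising vibration and temperature, then pump trips",
    consequence=Consequence.OPERATIONAL,
    pf_interval_days=30,  # illustrative value
)
print(select_task(bearing))
```

For the bearing example, the 30-day P-F interval is an invented value and gives an inspection every 15 days. A failure mode with no warning period falls through to the default actions: redesign when safety or the environment is at stake, otherwise a conscious decision to run to failure.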

4. Applying the RCM Process

One of the greatest misconceptions about RCM is that it is a desk job for a reliability engineer. In reality, RCM is a team sport.

To answer the seven questions effectively, the process utilizes RCM Review Groups. These groups are designed to break down the traditional silos between “the guys who run it” and “the guys who fix it.”

A typical group consists of:

  • A Facilitator: The RCM process expert who keeps the group disciplined.
  • Operations Supervisors & Operators: They hold the deep knowledge of how the machine behaves day-to-day.
  • Engineering Supervisors & Craftsmen: They hold the technical knowledge of the internal mechanics.
  • External Specialists: Brought in for specific technical expertise.

This structure ensures that the Tribal Knowledge of the shop floor is captured and synthesized with engineering data. It transforms maintenance from a top-down directive into a collaborative strategy.

5. What RCM Achieves

When organizations stop guessing and start applying the logic of RCM, the benefits are systemic:

  • Greater Safety and Environmental Integrity: By explicitly identifying failure modes that affect safety and environment, RCM prioritizes these above all else. It moves safety from a slogan to a process.
  • Greater Maintenance Cost-Effectiveness: RCM ruthlessly eliminates unnecessary scheduled maintenance (the “busy work” that comes from assuming every asset follows the Bathtub Curve). Resources are redirected to where they actually add value.
  • Improved Operating Performance: By focusing on “Function” rather than “Assets,” the maintenance plan is tied directly to production goals and product quality.
  • A Comprehensive Database: RCM leaves behind a fully documented audit trail. If a regulator asks, “Why do you check this valve every 3 months?”, you have a defensible, engineering-based answer.
  • Greater Motivation and Teamwork: Perhaps the most valuable insight is human. By involving operators and craftsmen in the decision-making process, RCM gives them ownership. They are no longer just turning wrenches; they are architects of the plant’s reliability.

Conclusion

We are living in a world where equipment is complex, failure is random, and the cost of being wrong is unacceptable. RCM offers the only viable path forward: a move away from the comfortable illusions of scheduled overhauls and toward a dynamic, evidence-based understanding of reliability.

See my previous post on the topic
