Resilience

NIST defines resilience as "the ability to prepare for and adapt to changing conditions and withstand and recover rapidly from disruption. Resilience includes the ability to withstand and recover from deliberate attacks, accidents, or naturally occurring threats or incidents."

Why Is Resilience A Term GRC Professionals Should Be Familiar With?

Resilience is part of a "three-legged stool" concept, where a cybersecurity function needs to have three key capabilities to remain stable and support the organization's business needs:

  1. Security Leg. The appropriate controls are in place to protect the system/initiative/organization from reasonable risks and threats.
  2. Compliance Leg. Reasonable evidence of due diligence and due care exists to demonstrate compliance with applicable laws, regulations and contractual obligations.
  3. Resilience Leg. The organization is capable of withstanding and recovering from reasonable cybersecurity incidents.

Resilience

There is a military saying that "the more you sweat in peace, the less you bleed in war," and it applies to the concept of resilience. If an organization invests the time and effort to ensure resilience (i.e., the more you sweat in peace), then the effort required to recover from accidental or intentional incidents will be minimal (i.e., the less you bleed in war). This goes far beyond planning and involves addressing the spectrum of People, Processes, Technologies, Data and Facilities (PPTDF) to create a holistic approach to resilient operations.

Resilience Spans Incident Response, Disaster Recovery and Business Continuity Plans

At the time of an incident or suspected incident, those responding generally do not know the magnitude and duration of any disruption to business operations. This "fog of war" can be reduced to a degree by creating Indicators of Compromise (IoCs) specific to the organization, which can better guide responders down the right path for incident response operations. Those incident response operations may lead to Disaster Recovery (DR) operations, which may in turn lead to longer-term Business Continuity (BC) operations.

Resilience focuses on minimizing DR/BC operations by having the capabilities in place to adapt and respond / recover quickly, but that requires significant preparation to do properly.
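As a minimal sketch of how organization-specific IoCs could guide responders toward a response path, the example below matches fields of an observed event against a small indicator list. The `Indicator` type, the `playbook` field, the event layout and all values are hypothetical and chosen only for illustration; they do not come from NIST or from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Indicator:
    """A hypothetical organization-specific Indicator of Compromise."""
    kind: str      # event field to inspect, e.g. "src_ip", "sha256", "domain"
    value: str     # value that signals compromise
    playbook: str  # assumed name of the response path this indicator points to


def match_indicators(event: dict[str, str], indicators: list[Indicator]) -> list[Indicator]:
    """Return every indicator whose field/value pair appears in the event."""
    return [
        ioc for ioc in indicators
        if event.get(ioc.kind, "").lower() == ioc.value.lower()
    ]


# Illustrative values only (documentation IP range and example domain).
INDICATORS = [
    Indicator("src_ip", "203.0.113.7", "contain-host"),
    Indicator("domain", "bad.example.com", "block-dns-and-review"),
]
```

A responder-facing tool could call `match_indicators(event, INDICATORS)` on each incoming event and surface the matching `playbook` names, so the first decision under the "fog of war" is driven by preparation done before the incident.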

Reactive vs Proactive Cybersecurity Capabilities

Fundamentally, resilience is an operational mindset of being proactive rather than reactive. An incident (the "boom" event) is the trigger that sets in motion incident response (IR) and DR/BC operations:

Reactive Cybersecurity Operations

In reactive cybersecurity operations, minimal PPTDF preparation leaves a weak or non-existent resilience capability, so "right of boom" incident response requires significant time and resources to restore Business As Usual (BAU) operations.

[Figure: Reactive Cybersecurity Operations]

Proactive Cybersecurity Operations

In proactive cybersecurity operations, significant PPTDF preparation "left of boom" creates a resilience capability where the "right of boom" incident response and recovery effort is minimal:

[Figure: Proactive Cybersecurity Operations]

Remediation Enables Resiliency in IT Security

In cybersecurity, resilience is the ability of systems to withstand, recover from, and adapt to threats or disruptions while keeping operations running with minimal interruption. Many current IT security operations are geared toward reactive, post-boom (right of boom) activities, and lack the knowledge or implementation of controls that proactively reduce risk, breaches, downtime, and cost.

Rollback and remediation to baseline using integrity controls is innovative because it surgically reverses only unauthorized or malicious changes, preserving uptime and forensic visibility instead of wiping entire systems. Unlike traditional backup and reprovisioning, which are disruptive, time-consuming, and often erase critical evidence, integrity-based rollback is fast, precise, and minimizes data loss. This approach aligns with modern Zero Trust strategies by continuously maintaining system trust without sacrificing operational continuity.

Two Core Remediation Approaches

1. Integrity-Based Remediation to Baseline

  • Definition: Restores a system to its last known trusted baseline by detecting and reversing only unauthorized or non-compliant changes.
  • How it works: Integrity monitoring continuously tracks changes to files, configurations, binaries, and system settings. If a malicious or unauthorized change is detected, only that change is rolled back, automatically or manually, without affecting the rest of the system.
  • Key Advantages:
    • Surgical, fast recovery without rebuilding the system
    • Preserves uptime and business continuity
    • Maintains forensic logs for root-cause analysis
    • Allows suspicious changes to be quarantined for investigation
    • Delivers low RPO and RTO, minimizing data loss and downtime
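The detect-and-reverse loop described above can be sketched in a few lines. This is a minimal illustration, not how any particular integrity-monitoring product works: it assumes the trusted baseline is simply a copy of each file's bytes held in memory, covers files only (not registry keys, services or other settings), and uses hypothetical function names (`capture_baseline`, `detect_drift`, `roll_back`).

```python
import hashlib
from pathlib import Path


def digest(data: bytes) -> str:
    """SHA-256 hex digest used to compare current content with the baseline."""
    return hashlib.sha256(data).hexdigest()


def capture_baseline(root: Path) -> dict[str, bytes]:
    """Record the trusted content of every file under root (relative path -> bytes)."""
    return {
        p.relative_to(root).as_posix(): p.read_bytes()
        for p in root.rglob("*")
        if p.is_file()
    }


def detect_drift(root: Path, baseline: dict[str, bytes]) -> dict[str, str]:
    """Return {relative path: 'modified' | 'deleted' | 'added'} for files that differ."""
    current = {
        p.relative_to(root).as_posix(): digest(p.read_bytes())
        for p in root.rglob("*")
        if p.is_file()
    }
    drift: dict[str, str] = {}
    for rel, trusted in baseline.items():
        if rel not in current:
            drift[rel] = "deleted"
        elif current[rel] != digest(trusted):
            drift[rel] = "modified"
    for rel in current:
        if rel not in baseline:
            drift[rel] = "added"
    return drift


def roll_back(root: Path, baseline: dict[str, bytes],
              drift: dict[str, str], quarantine: Path) -> list[str]:
    """Reverse only the drifted files; keep a copy of unauthorized content as evidence."""
    quarantine.mkdir(parents=True, exist_ok=True)
    for rel, kind in drift.items():
        target = root / rel
        if kind in ("modified", "added"):
            # Quarantine the unauthorized content for later investigation.
            (quarantine / rel.replace("/", "__")).write_bytes(target.read_bytes())
        if kind == "added":
            target.unlink()  # file was never part of the trusted baseline
        else:
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(baseline[rel])  # restore trusted content
    return sorted(drift)
```

Files that still match the baseline are never written to, which is the "surgical" property: only the drifted items are touched, and the quarantined copies preserve evidence that a full rebuild would erase. The quarantine directory is assumed to live outside the monitored root.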

2. System Reprovisioning or Software Backup Recovery

  • Definition: Completely wipes and rebuilds the system from a gold image, clean build, or backup.
  • How it works: After a compromise, the system is replaced by redeploying a fresh image. This method is often standard in traditional incident response and disaster recovery.
  • Drawbacks:
    • Time-intensive and disruptive
    • Higher Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
    • Loss of forensic evidence and change history
    • Risks reintroducing vulnerabilities if the image is outdated
    • May overlook system-specific updates or customizations
    • Often restores operations without identifying the root cause
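The RTO and RPO comparison between the two approaches comes down to simple time arithmetic. The sketch below, with hypothetical names and caller-supplied timestamps, computes the data-loss window (time between the last trusted recovery point and the incident) and the downtime (time between the incident and restored operations), then checks them against stated objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RecoveryEvent:
    """Timestamps for one recovery; the field names are illustrative assumptions."""
    last_good_state: datetime  # most recent trusted recovery point (backup or baseline)
    incident_at: datetime      # when the disruption occurred
    restored_at: datetime      # when normal operations resumed

    @property
    def data_loss_window(self) -> timedelta:
        """Work done after the last recovery point is lost; compare with the RPO."""
        return self.incident_at - self.last_good_state

    @property
    def downtime(self) -> timedelta:
        """Time the system was unavailable; compare with the RTO."""
        return self.restored_at - self.incident_at


def meets_objectives(event: RecoveryEvent, rpo: timedelta, rto: timedelta) -> bool:
    """True when both the data-loss window and the downtime are within objectives."""
    return event.data_loss_window <= rpo and event.downtime <= rto
```

Feeding the same incident through both recovery methods, with their respective `last_good_state` and `restored_at` values, makes the trade-off concrete: a rebuild from an older image typically lengthens both intervals, while a targeted rollback to a continuously maintained baseline shortens them.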

Why Integrity-Based Remediation Is the Preferred First Response

  • Precision: Only unauthorized changes are undone, minimizing disruption.
  • Speed: Systems are restored quickly without requiring a rebuild.
  • Compliance: Retains audit trails and configuration history for regulatory needs.
  • Resilience: Enables proactive defense, containment, and alignment with Zero Trust and continuous monitoring practices.

Complementary Roles in Cyber Resilience

Both approaches are essential parts of a layered resilience strategy:

Integrity-Based Remediation: Strengthening Incident Response Plans (IRP) & Business Continuity Plans (BCP)

  • Real-time rollback of malicious or unauthorized changes
  • Maintains uptime and continuity by restoring only affected components
  • Preserves forensic visibility for investigations
  • Best suited for incident response and business continuity plans where rapid recovery is the goal

System Reprovisioning: Safeguarding Disaster Recovery

  • Full system rebuild from a trusted image after catastrophic failures
  • Ensures recovery when system-wide integrity is lost
  • Critical for disaster recovery plans (DRPs) to restore functionality after major incidents like ransomware lockouts, physical destruction, or nation-state attacks
  • Essential in large-scale scenarios such as data center outages or systemic corruption

Why Both Are Necessary

  • Integrity controls with resiliency → fast, precise recovery to limit damage and maintain uptime.
  • Reprovisioning → complete recovery when systems require a full reset.
  • Together, they form a layered resilience model:
    • Rapid recovery from incidents (integrity remediation)
    • Full restoration from disasters (reprovisioning)
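The layered model can be expressed as a simple routing decision. The sketch below is one possible reading of it, not a prescribed policy: the three inputs (`system_reachable`, `baseline_trusted`, `integrity_lost_system_wide`) are assumed criteria drawn from the scenarios described above, and a real organization would define its own.

```python
from enum import Enum


class Recovery(Enum):
    INTEGRITY_ROLLBACK = "integrity-based remediation"
    REPROVISION = "system reprovisioning"


def choose_recovery(system_reachable: bool,
                    baseline_trusted: bool,
                    integrity_lost_system_wide: bool) -> Recovery:
    """Pick the first-response recovery path under the assumed criteria.

    Rollback is tried first; reprovisioning is the safety net when the system
    cannot be reached (e.g., ransomware lockout or physical destruction), the
    baseline itself cannot be trusted, or integrity is lost across the system.
    """
    if system_reachable and baseline_trusted and not integrity_lost_system_wide:
        return Recovery.INTEGRITY_ROLLBACK
    return Recovery.REPROVISION
```

For example, an unauthorized configuration change on a reachable server with a trusted baseline routes to `Recovery.INTEGRITY_ROLLBACK`, while a data center outage routes to `Recovery.REPROVISION`.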

Final Takeaway

Reprovisioning resets the system. Integrity-based remediation restores trust faster while keeping operations online. For true resilience, integrity-driven remediation should be the frontline approach, with reprovisioning reserved as a critical safety net. Federal agencies and enterprises achieve maximum resilience by combining both: integrity-based remediation for everyday incident recovery, and reprovisioning as a core element of disaster recovery planning.