Lessons from the Global IT Outage

The July 2024 CrowdStrike incident, in which a faulty software update caused widespread disruptions, exposed weaknesses in how organizations manage software updates, particularly for critical systems. It is a reminder of the delicate balance between rapid security improvements and system stability.
Here are some lessons learned from this incident and best practices for managing software updates in enterprise environments.

The Two Sides of Rapid Updates

Benefits of Fast Rollouts

1. Improved Security:

Faster updates often mean quicker patches for known vulnerabilities, reducing the window of opportunity for potential attackers.

2. New Features:

Rapid update cycles can bring new functionality to users more quickly, potentially improving productivity or user experience.

3. Bug Fixes:

Quick updates allow for faster resolution of known issues or bugs.

Risks of Automatic and Rapid Updates

1. Rapid Spread of Issues:

When updates are automatically applied, a faulty update can quickly spread across an entire network, potentially causing widespread disruptions.

2. Lack of Control:

Automatic updates take away IT teams’ ability to test updates before they are widely deployed.

3. Difficulty in Rolling Back:

Once an automatic update has been applied across a network, rolling it back can be challenging and time-consuming.

Striking a Balance: The N+1 or N+2 Update Scheme

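One of the key lessons from the CrowdStrike incident is the value of using a staged update scheme, such as N+1 or N+2 (many vendors, CrowdStrike among them, label the same idea N-1 and N-2, counting back from the latest release N):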

N+1 Scheme:

In this approach, production systems run one version behind the latest release.

N+2 Scheme:

Here, production systems run two versions behind the latest release.

Benefits of these approaches include:

1. Buffer for Testing:

These schemes provide a buffer period for thorough testing before updates reach production environments.

2. Early Issue Detection:

Problems can be identified and addressed before they impact critical systems.

3. Controlled Rollout:

IT teams can manage the update process more effectively, rolling out changes in stages.
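
To make the version arithmetic concrete, here is a minimal sketch of how a deployment tool might pick the release for each environment under an N+1 or N+2 scheme. The environment names, the VERSIONS_BEHIND policy table, and the version numbers are all assumptions made for illustration; they are not taken from any vendor's product.

```python
from typing import Sequence

# Hypothetical policy: how many releases each environment stays behind the newest one.
VERSIONS_BEHIND = {"test": 0, "production": 1, "critical": 2}


def target_version(releases: Sequence[str], environment: str) -> str:
    """Pick the release an environment should run.

    `releases` must be ordered from oldest to newest.
    """
    lag = VERSIONS_BEHIND[environment]
    if len(releases) <= lag:
        raise ValueError(f"need more than {lag} releases to stay {lag} behind")
    return releases[-1 - lag]


releases = ["1.0", "1.1", "1.2", "1.3"]  # made-up version numbers
for environment in VERSIONS_BEHIND:
    print(environment, "->", target_version(releases, environment))
```

With these inputs the test environment gets 1.3, production gets 1.2 (N+1), and critical systems get 1.1 (N+2), so every release is exercised in testing before it reaches production.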

Best Practices for Software Update Management

Drawing from the CrowdStrike incident, here are some best practices for managing software updates:

1. Assess Critical Systems:

Identify which systems are crucial to operations and may require more cautious update strategies.

2. Implement a Staged Deployment Strategy:

Use an N+1 or N+2 scheme for critical systems so that updates are thoroughly tested before reaching production.

3. Create a Test Environment:

Maintain a separate environment that mirrors production for testing updates before deployment.

4. Develop a Rollback Plan:

Always have a clear, tested plan for quickly reverting to a previous version if issues arise.

5. Monitor Post-Update Performance:

Closely watch system performance after updates to quickly identify any issues.

6. Stay Informed:

Keep abreast of industry news and vendor announcements to be aware of potential issues with updates.

7. Regular Audits:

Periodically review your update management strategy to ensure it aligns with current best practices and organizational needs.
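
Several of these practices (staged deployment, a rollback plan, and post-update monitoring) can be combined in a single rollout loop. The sketch below is only an illustration under stated assumptions: the Host class, the staged_rollout function, the ring layout, and the health check are hypothetical names invented for this example, and a real deployment would call its own update and monitoring tools instead of changing a field in memory.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Host:
    name: str
    version: str


def staged_rollout(
    rings: List[List[Host]],
    new_version: str,
    is_healthy: Callable[[Host], bool],
    max_failure_rate: float = 0.0,
) -> bool:
    """Update one ring at a time; roll back and stop if a ring looks unhealthy.

    Returns True if every ring was updated, False if the rollout was reverted.
    """
    previous: Dict[str, str] = {}  # host name -> version before the update
    updated: List[Host] = []

    for ring in rings:
        for host in ring:
            previous[host.name] = host.version
            host.version = new_version
            updated.append(host)

        # Post-update monitoring: count hosts in this ring that fail the health check.
        failures = sum(1 for host in ring if not is_healthy(host))
        if ring and failures / len(ring) > max_failure_rate:
            # Rollback plan: restore every host touched so far, then halt.
            for host in updated:
                host.version = previous[host.name]
            return False
    return True


# Hypothetical fleet: a test ring first, production last.
rings = [
    [Host("test-1", "1.0")],
    [Host("prod-1", "1.0"), Host("prod-2", "1.0")],
]

# Simulate a faulty release: any host running 1.1 fails its health check.
ok = staged_rollout(rings, "1.1", is_healthy=lambda host: host.version != "1.1")
print(ok, [host.version for ring in rings for host in ring])
```

In this run the faulty release fails in the test ring, the test host is reverted to 1.0, and the production ring is never touched. That containment is the point of rolling out in stages.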

Lessons Learned

The CrowdStrike incident is a valuable lesson in the importance of careful software update management. Keeping systems up to date is crucial for security and functionality, but how updates are rolled out deserves as much thought as whether to apply them.

The key is to strike a balance between the benefits of rapid updates (improved security, new features, and quick bug fixes) and the need for stability and control in critical systems. By implementing staged update schemes like N+1 or N+2, following best practices for update management, and engaging in effective vendor management, organizations can keep their systems secure and functional while minimizing the risk of disruptive incidents.

The continuing impact of this global disruption is a reminder of how much we rely on digital systems and of the importance of robust, adaptable cybersecurity measures.