
How to Do Production Testing Safely. Lessons From the CrowdStrike/Microsoft Incident

Mateusz Czerewko • QA Engineer

from Cogworks • 31 Jul 2024 • 1 min read

What can we learn from one of the biggest outages in recent times, one that caused airlines and major industry organisations to grind to a halt? Testing. Testing. Testing. Our QA Engineer is on hand with practical tips to employ now:

Continuous deployment and rapid iteration are “non-negotiables” for modern software development teams.

However, pushing updates directly into production without proper testing can lead to disastrous outcomes, as seen in recent high-profile incidents like the CrowdStrike outage.

Companies must adopt robust strategies like Canary Testing and Blue-Green Deployment to ensure rapid and reliable software releases.

This article delves into these methodologies, their differences, and how they can prevent critical failures like CrowdStrike's recent mishap with Microsoft that caused major financial losses to some of the world's biggest brands.

The Role of CI/CD

A smooth and well-managed Continuous Integration/Continuous Deployment (CI/CD) pipeline is the backbone of any effective deployment strategy.

CI/CD pipelines automate the integration and delivery processes, allowing teams to deploy code changes frequently and reliably. By leveraging CI/CD, organisations can implement sophisticated deployment strategies like Canary Testing and Blue-Green Deployment, ensuring safer and more efficient releases.

Benefits of a Robust CI/CD Pipeline

Automation: Reduces manual intervention, minimising errors and speeding up the deployment process.
Consistency: Ensures that deployments follow a repeatable and predictable process.
Scalability: Supports the deployment of complex applications across various environments.
Feedback Loops: Integrates monitoring and feedback mechanisms to continuously improve the deployment process.
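To make the "repeatable and predictable" idea concrete, here is a minimal sketch of a pipeline runner in Python. It is an illustration only: the `run_pipeline` function and the stage names are hypothetical, and a real pipeline would normally be defined in your CI/CD tool rather than hand-written. The point it shows is that stages run in a fixed order and a failing stage stops everything after it, so untested code never reaches the deploy step.

```python
from typing import Callable, List, Tuple

# A stage is a (name, step) pair; the step returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names_of_completed_stages).
    """
    completed: List[str] = []
    for name, step in stages:
        if not step():
            # A failed stage blocks every later stage, including deploy.
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # Hypothetical stages; the lambdas stand in for real build/test/deploy jobs.
    ok, done = run_pipeline([
        ("build", lambda: True),
        ("test", lambda: False),   # simulate a failing test suite
        ("deploy", lambda: True),  # never reached in this run
    ])
    print(ok, done)
```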

Canary Testing

Everything you need to know about the popular testing method

Canary Testing, named after the canaries used in coal mines to detect toxic gases, is a strategy where a new software version is released to a small subset of users before a full-scale rollout.

Canary Testing allows teams to monitor the latest release for issues in a controlled environment, ensuring potential problems can be addressed before the software reaches a broader audience. It is also used for User Experience (UX) testing to gather feedback on the new features and interface changes from a small group of users.

How Does Canary Testing Work?

Release to a Subset: The new software version is deployed to a small group of users, typically from 1% to 10% of the total user base.
Monitor Performance: Detailed monitoring and logging tools track the behaviour and performance of the new release.
Analyse Feedback: Both automated metrics and user feedback are analysed to identify issues.
Gradual Rollout: If no significant problems are detected, the update is gradually rolled out to the remaining users.
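The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a production implementation: the function names, the rollout stages (1%, 5%, 10%, 50%, 100%) and the 1% error-rate threshold are all hypothetical choices. The sketch hashes each user ID into a stable bucket so the same user always gets the same version, then only widens the rollout while the canary's error rate stays under the threshold.

```python
import hashlib


def in_canary(user_id: str, percent: float) -> bool:
    """Deterministically decide whether a user receives the canary release.

    Hashing the user ID gives a stable bucket from 0 to 9999, so a user
    who is in the canary group at 5% is still in it at 10%.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000
    return bucket < percent * 100


def next_stage(current_percent: int, error_rate: float,
               max_error_rate: float = 0.01,
               stages=(1, 5, 10, 50, 100)) -> int:
    """Return the next rollout percentage, or 0 to halt the rollout.

    The stages and threshold are illustrative assumptions.
    """
    if error_rate > max_error_rate:
        return 0  # canary is unhealthy: stop and send everyone to the old version
    for stage in stages:
        if stage > current_percent:
            return stage
    return 100  # already fully rolled out


if __name__ == "__main__":
    print(in_canary("user-42", 5))           # same answer on every call
    print(next_stage(1, error_rate=0.002))   # healthy canary, widen the rollout
    print(next_stage(5, error_rate=0.20))    # unhealthy canary, halt
```

In practice the error rate would come from your monitoring tools (the "Monitor Performance" step), and the routing decision would usually live in a load balancer or feature-flag service instead of application code.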

Benefits of Canary Testing

Risk Mitigation: Limits the exposure of potential bugs or performance issues to a small user base.
Real-World Testing: Allows for real-world usage data, providing insights that may not be captured in a staging environment.
User Feedback: Early feedback from actual users can be invaluable for making last-minute adjustments.

Blue-Green Deployment

What is Blue-Green Deployment?

Blue-Green Deployment is a methodology where two identical production environments, termed Blue and Green, are maintained. At any given time, one environment (say Blue) is live, serving all production traffic, while the other (Green) is idle or being updated. When a new software version is ready, it is deployed to the idle environment. Once it has been tested there, traffic is switched from the live environment to the updated one.

How Does Blue-Green Deployment Work?

Duplicate Environments: Maintain two identical production environments.
Deploy to Idle: Deploy the new software version to the idle environment (Green).
Testing: Perform thorough testing in the Green environment without affecting the live users.
Switch Traffic: If tests pass, switch the traffic from Blue to Green, making Green the live environment.
Monitor and Rollback: Monitor the new environment closely. If issues arise, a quick rollback to the Blue environment is possible.
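As a rough illustration of these steps, the Python sketch below models the two environments and the traffic switch. Everything in it is an assumption made for the example: the `BlueGreenRouter` class, its method names and the `smoke_test` callback are hypothetical, and in a real setup the switch would typically happen at the load balancer or DNS level. It shows the key property of the approach: the previous environment is left untouched, so rolling back is just pointing traffic at it again.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Toy model of two environments with one of them serving live traffic."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"})
    live: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def deploy_to_idle(self, version: str) -> None:
        # Live users are unaffected: only the idle environment changes.
        self.versions[self.idle] = version

    def switch(self, smoke_test: Callable[[str], bool]) -> bool:
        """Move traffic to the idle environment only if its tests pass."""
        candidate = self.idle
        if not smoke_test(self.versions[candidate]):
            return False  # keep serving from the current live environment
        self.live = candidate
        return True

    def rollback(self) -> None:
        # The previous environment still runs the old version, so rolling
        # back is just pointing traffic at it again.
        self.live = self.idle


if __name__ == "__main__":
    router = BlueGreenRouter()
    router.deploy_to_idle("v2")
    switched = router.switch(lambda version: version == "v2")
    print(switched, router.live, router.versions[router.live])
    router.rollback()
    print(router.live, router.versions[router.live])
```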

Benefits of Blue-Green Deployment

Zero Downtime: Enables seamless transitions with no downtime, ensuring continuous availability.
Instant Rollback: If a critical issue is discovered, switching back to the previous environment is straightforward.
Comprehensive Testing: Allows exhaustive testing in a production-like environment before going live.

Canary Testing vs. Blue-Green Deployment

Key differences

Granularity: Canary Testing is granular, targeting a small subset of users, whereas Blue-Green Deployment switches the entire user base between two environments.
Risk Exposure: Canary Testing gradually increases risk exposure, while Blue-Green Deployment involves a complete switch, requiring confidence in the new release's stability.
Rollback Complexity: Rolling back in Blue-Green Deployment is more straightforward since the previous environment remains intact. By contrast, Canary Testing might require halting the rollout and fixing issues in the existing environment.

Lessons from CrowdStrike's Microsoft Outage

Recently, a botched CrowdStrike update crashed Microsoft Windows machines worldwide, leading to widespread disruption. This incident underscores the importance of meticulous production testing. Strategies like Canary Testing or Blue-Green Deployment could have mitigated CrowdStrike's situation, or possibly avoided it entirely.

How Canary Testing Could Have Prevented the CrowdStrike Incident

By releasing the problematic update to a small user group first, CrowdStrike could have detected the issue early and prevented a large-scale impact. Monitoring the initial rollout would have provided insights and allowed for timely corrections.

The Blue-Green Advantage

Using a Blue-Green Deployment strategy, CrowdStrike could have tested the new update in the Green environment without affecting the live users. Any issues discovered would have remained isolated, ensuring the live environment remained stable.

Takeaway for web development teams? Be more “Canary, Blue and Green”.

Robust production testing strategies like Canary Testing and Blue-Green Deployment are crucial for maintaining software reliability and user trust. These methodologies mitigate risks, provide valuable real-world feedback, and ensure seamless transitions. The CrowdStrike-Microsoft incident is a powerful reminder of the potential consequences of inadequate production testing. Companies can achieve safer, more reliable software releases by learning from such examples and implementing these best practices.

