Some of the world's biggest websites are back online after a major outage in June this year. Thousands of sites, including major players like gov.uk, The New York Times, PayPal and BBC.com, were affected. The BBC estimates that even an hour's worth of downtime can cost companies up to $250,000 (£176,000). Ouch.
What causes network downtime?
Situations like this happen a lot, whether to a small hosting company or a large cloud provider such as OVH, Fastly or Microsoft Azure. Just check the Azure status history to see how often incidents happen every month and what causes them. The effects of these incidents are, of course, not only financial. Outages can also damage the reputation of multiple businesses, starting with the solution creator and rippling down to a number of end clients.
Sometimes the physical world causes havoc for the digital one too. A fire at OVH in eastern France in March 2021 sent millions of websites offline, knocking out government agencies’ portals, banks, shops and news websites, and taking out a chunk of the .FR web space, according to internet monitors.
Could website downtime be prevented?
To find the answer, let’s look at the three most common approaches developers and project teams choose in relation to outages and downtime:
1. Team Reaction.
The worst-case scenario is to just react to situations like the ones described above without a backup of your solution. In a nutshell, this means you lose everything. Unfortunately, this kind of reactive approach is common across agencies and software houses. Either the agencies don’t have the experience, or end clients don’t want to invest in advanced prevention or detection-based scenarios.
Tip to level up your approach:
If you don’t have data backups, get your data backed up. You really never know! If you’re already using something like Azure, data retention and backups are extremely easy to configure for most of its services, so complexity and infrastructure are no longer obstacles for businesses.
Having a backup means you can restore the solution from a recent state, re-point the domains and redirect users to the newly set up instance of your service, and you’re saved. A backup can only do so much, though. Not much can bring back the traffic and potential revenue lost while your site was down.
2. Team Detection.
If you and your colleagues are in Team Detection, you’ve probably suffered at least one major incident and you simply want to be prepared for similar events in the future.
The detection approach can be as simple or as sophisticated as you need it to be.
Although at the most basic level you are still reacting to reports after the event, the detection approach puts you in a much better position to make quick, logical and appropriate decisions based on real data. Uptime monitors can even contact your team through a preferred channel, such as email, a phone call or a Slack notification!
If you just want things dealt with (and don’t want to hear about it whilst you’re sipping a piña colada on holiday), the detection approach can be set up to detect and react automatically.
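To make the idea concrete, here is a minimal sketch of an automated check in Python. It is not how UptimeRobot, Pingdom or any other product works internally. The names `http_probe` and `UptimeMonitor`, the threshold of three failed checks and the message formats are all assumptions made for illustration. The monitor only alerts after several failures in a row, so a single blip doesn’t wake anyone up, and it sends one more message when the site comes back.

```python
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError (4xx/5xx) and timeouts are all OSError subclasses.
        return False


class UptimeMonitor:
    """Hypothetical monitor: alert once after N consecutive failed checks."""

    def __init__(
        self,
        probe: Callable[[str], bool],
        notify: Callable[[str], None],
        failure_threshold: int = 3,  # assumed value, tune to your needs
    ) -> None:
        self.probe = probe
        self.notify = notify
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.alerted = False

    def tick(self, url: str) -> bool:
        """Run one check and notify on state changes. Returns the probe result."""
        if self.probe(url):
            if self.alerted:
                self.notify(f"RECOVERED: {url}")
            self.consecutive_failures = 0
            self.alerted = False
            return True

        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold and not self.alerted:
            self.notify(
                f"DOWN: {url} failed {self.consecutive_failures} checks in a row"
            )
            self.alerted = True
        return False
```

In practice you would call `tick` on a schedule (say, once a minute) with `http_probe` as the probe, and pass a `notify` function that posts to email, SMS or Slack. Passing the probe and notifier in as arguments also makes the logic easy to test without touching the network.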
The tools that are used with this approach:
- Uptime monitors. These are leading services like UptimeRobot, Pingdom, Sentry or Azure’s built-in Monitor or Insights services. They all help you detect problems with your services and support a reaction plan.
- ARM templates, Infrastructure as Code (IaC), scripts. Any form of template that can help your team recover quickly is gold. For example, say you notice that your West EU data centre is not working. How long will it take you to restore and set up the whole solution in another location? Thanks to the templates and scripts created at the beginning of a project, you can save a lot of time at this stage.
- Automation runbooks. With automation runbooks, your development team can implement scenarios to help you recover solutions, though how much they help depends on the type of incident. If the issue was at application level, for example, simple restarts or reboots might solve the problem. But if the infrastructure is broken or, even worse, the whole data centre is down, your runbook probably won’t work either.
It’s also worth subscribing to any monitors available on third-party providers’ websites, such as Azure’s, so you will always be notified first about any issues they may be having. Almost every software provider these days exposes a status page or notification service that can alert you to problems as soon as possible. As a general rule, it’d be great to know about any issues before your client or the end customer notices and reports them.
3. Team Prevention.
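The decision a runbook has to make can be sketched in a few lines of Python. This is an illustrative assumption rather than an Azure Automation API: the `Scope` enum, the action names and the limit of two restarts are all hypothetical. It simply encodes the rule of thumb from the list above: restart for application-level issues, fall back to templates when the infrastructure is broken, and move to another region when the whole data centre is gone.

```python
from enum import Enum


class Scope(Enum):
    """How far the incident reaches (hypothetical classification)."""

    APPLICATION = "application"
    INFRASTRUCTURE = "infrastructure"
    DATA_CENTRE = "data_centre"


def choose_recovery_action(
    scope: Scope, restart_attempts: int, max_restarts: int = 2
) -> str:
    """Pick the next recovery step for an incident of the given scope."""
    if scope is Scope.APPLICATION:
        # A simple restart often fixes application-level issues,
        # but stop retrying after a few attempts and call a human.
        if restart_attempts < max_restarts:
            return "restart_application"
        return "escalate_to_on_call"
    if scope is Scope.INFRASTRUCTURE:
        # Restarting won't help: rebuild from templates and scripts.
        return "redeploy_from_templates"
    # The whole data centre is down: rebuild somewhere else.
    return "redeploy_in_secondary_region"
```

For example, `choose_recovery_action(Scope.APPLICATION, restart_attempts=0)` returns `"restart_application"`, while the same call with `restart_attempts=2` returns `"escalate_to_on_call"`.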
Companies like Netflix, whose services are widely used across the globe and on whom many rely, have implemented policies and tools to help them prevent common failure scenarios and outages to their services. For example, Netflix’s Chaos Monkey is responsible for randomly terminating instances in production to ensure that engineers implement their services to be resilient to instance failures! (They run a tight ship.)
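The core idea is simple enough to sketch in Python. To be clear, this is not Netflix’s Chaos Monkey code. It is a toy, in-memory simulation with hypothetical function names and instance names, and it only shows the principle: remove one instance at random, then check whether enough instances are left to keep serving traffic.

```python
import random


def terminate_random_instance(instances: list[str], rng: random.Random) -> str:
    """Remove one randomly chosen instance from the pool and return its name."""
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim


def survives_one_failure(
    instances: list[str], min_healthy: int, rng: random.Random
) -> bool:
    """Return True if the service still has enough instances after one loss."""
    pool = list(instances)  # work on a copy so the caller's list is untouched
    terminate_random_instance(pool, rng)
    return len(pool) >= min_healthy


if __name__ == "__main__":
    rng = random.Random()
    fleet = ["web-1", "web-2", "web-3"]  # hypothetical instance names
    print(survives_one_failure(fleet, min_healthy=2, rng=rng))
```

A real experiment would terminate actual instances and watch real health checks, but even this toy version makes the design question obvious: if you need two healthy instances to serve traffic, running exactly two is not enough.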
You can set up a kind of “Chaos Monkey” in your team too, though there it goes by the name of a red team. This is a great way to put your setups to the test, although some would argue that a junior developer with full access privileges on Azure is a Chaos Monkey in human form! I’d also argue a whiskey-driven infrastructure cleanup completed by a project manager on a Friday night is an effective method too. But there are, of course, better ways of preventing a digital nightmare from the inside of your organisation...
With a proper disaster recovery and backup plan, you and your team can prepare for, and avoid, losses and outages in the future.
Azure offers a great service called Azure Site Recovery, which is ‘built-in disaster recovery as a service’. It uses Traffic Manager and a set of additional services, such as Backups and Storage Accounts, to detect failures and replicate setups so that downtime is reduced. It’s easy to set up and requires minimal technical knowledge. It’s worth noting that this doesn’t work for all types of applications, so do check out the full FAQs.
Other tools to help you avoid failure scenarios and outages:
For solutions that require a high Service Level Agreement (SLA) or level of availability, you might need to set up a multi-cloud and multi-region solution. This is just one of the options to make your solution more resilient, stable and scalable.
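At its simplest, a multi-region setup needs a rule for deciding where to send traffic. The Python sketch below shows one such rule, priority failover: try the endpoints in order of preference and use the first healthy one. The function name, the endpoint URLs and the health-check callable are assumptions for illustration, and a managed traffic-routing service would normally make this decision for you.

```python
from typing import Callable, Optional, Sequence


def pick_endpoint(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first healthy endpoint in priority order, or None if all are down."""
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


if __name__ == "__main__":
    # Hypothetical endpoints, listed from most to least preferred.
    regions = [
        "https://westeurope.example.com",
        "https://northeurope.example.com",
        "https://other-cloud.example.com",
    ]
    down = {"https://westeurope.example.com"}  # pretend the primary is offline
    print(pick_endpoint(regions, lambda url: url not in down))
```

With the primary marked as down, the sketch falls through to the second region. If every endpoint fails its health check, it returns `None`, which is the point at which your backup and recovery plan takes over.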
Digital disruption and outages to the services you deliver to your clients can certainly be prevented with a failsafe plan. That plan should weigh the cost of the plan and its tools against the cost of the losses from a one-hour, one-day or one-week outage.
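That calculation is easy to rough out in Python. The sketch below uses the BBC’s upper estimate of $250,000 per hour of downtime quoted at the start of this article as a stand-in. It is an assumption, so substitute your own hourly figure, and note that the function names and the plan cost in the example are made up for illustration.

```python
# Upper-bound BBC estimate quoted earlier; replace with your own figure.
HOURLY_LOSS_USD = 250_000

DURATIONS_HOURS = {
    "one hour": 1,
    "one day": 24,
    "one week": 24 * 7,
}


def outage_loss(hours: int, hourly_loss: int = HOURLY_LOSS_USD) -> int:
    """Estimated loss for an outage of the given length, assuming a flat hourly rate."""
    return hours * hourly_loss


def plan_pays_off(
    plan_cost: int, outage_hours: int, hourly_loss: int = HOURLY_LOSS_USD
) -> bool:
    """True if preventing a single outage of this length covers the plan's cost."""
    return outage_loss(outage_hours, hourly_loss) > plan_cost


if __name__ == "__main__":
    hypothetical_plan_cost = 100_000  # made-up cost of the plan and its tools
    for label, hours in DURATIONS_HOURS.items():
        loss = outage_loss(hours)
        verdict = plan_pays_off(hypothetical_plan_cost, hours)
        print(f"{label}: ${loss:,} lost, plan pays off: {verdict}")
```

At that hourly rate, the losses come to $250,000 for one hour, $6,000,000 for one day and $42,000,000 for one week. A flat hourly rate is a simplification, since real losses depend on the time of day, the season and how long customers take to return, but it is enough to start the conversation with a client.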
Reaction, detection or prevention at Cogworks?
We’re definitely Team Prevention! For our Umbraco clients, we have a range of support plans available that actively monitor and resolve outages and site failures before clients are aware they’ve happened. Outside of Umbraco clients (for our own services), we continue to practise and repeat prevention with everything we do. The insurance and peace of mind that come with a solid and resilient setup are worth more than anything else.
We’d love to hear about your experiences or top tips when it comes to all things outages. Why not leave us a comment in the box below to start the conversation?