Could Website Downtime Be Prevented?

Cogworks

29 Apr 2024 • 6 min read

We examine three common ways development teams are protecting companies from digital disruption.

Innerworks is coming soon...

This blog was originally published on our previous Cogworks blog page. The Cogworks Blog is in the process of evolving into Innerworks, our new community-driven tech blog. With Innerworks, we aim to provide a space for collaboration, knowledge-sharing, and connection within the wider tech community. Watch this space for Innerworks updates, but don't worry - you'll still be able to access content from the original Cogworks Blog if you want. 

Intro.

Some of the world's biggest websites are back online after a significant June outage. Thousands of sites, including major players like GOV.UK, The New York Times, PayPal and BBC.com, were affected, and the BBC estimates that even an hour's downtime can cost companies up to $250,000 (£176,000). Ouch.

What causes network downtime?

Situations like this happen often, whether to a small hosting company or a large cloud provider such as OVH, Fastly or Microsoft Azure. Just check the Azure status history to see how many incidents occur every month and what causes them. The effects of these incidents are, of course, not only financial; outages can damage the reputation of multiple businesses, starting with the solution creator and rippling down to several end clients.

Sometimes, the physical world wreaks havoc on the digital one, too. A fire at OVH in eastern France in March 2021 sent millions of websites offline, knocking out government agencies' portals, banks, shops and news websites, and taking out a chunk of the .FR webspace, according to internet monitors.

Could website downtime be prevented? 

To find the answer, let's look at the three most common approaches developers and project teams take to outages and downtime:

1. Team Reaction.

The worst-case scenario is to simply react to situations like the ones described above without a backup of your solution. In a nutshell, this means you lose everything. Unfortunately, this reactive approach is common across agencies and software houses. Either the agencies don't have the experience, or end clients don't want to invest in more advanced detection or prevention.

Tip to level up your approach:

If you don't have data backups, get your data backed up; you really never know! If you're already using Azure services, data retention and backups are extremely easy to configure for most of their services, so complexity and infrastructure are no longer an obstacle for businesses.

Having a backup means you can restore the solution from a recent state, re-point the domains and redirect users to the newly set up instance of your service, and you’re saved. A backup can only do so much, though. Nothing can bring back the traffic, and the potential revenue, lost while your site was down.
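How recent is "recent"? A minimal sketch of a freshness check, assuming you can list the times your backups completed, might look like this. The function name `latest_backup_is_fresh` and the 24-hour threshold in the usage note are illustrative assumptions, not Azure settings.

```python
from datetime import datetime, timedelta, timezone


def latest_backup_is_fresh(backup_times, max_age, now=None):
    """Return True if the newest backup is no older than max_age.

    backup_times: iterable of timezone-aware datetimes, one per completed backup.
    max_age: timedelta for how much data loss you can tolerate (an assumption).
    now: override for the current time, mainly useful in tests.
    """
    now = now or datetime.now(timezone.utc)
    newest = max(backup_times, default=None)
    if newest is None:
        return False  # no backups at all: the "lose everything" case
    return now - newest <= max_age
```

Calling `latest_backup_is_fresh(times, timedelta(hours=24))` on a schedule would tell you whether a restore today would cost you more than a day of data.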

2. Team Detection.

If you and your colleagues are in Team Detection, you’ve probably suffered at least one major incident, and you simply want to be prepared for similar events in the future.

The detection approach can be as simple or as sophisticated as you need it to be.

At the most basic level, you are still reacting to reports after the event, but the detection approach puts you in a much better position to make quick, logical and appropriate decisions based on real data. Uptime monitors can even contact your team through a preferred channel, such as email, a phone call or a Slack notification!

If you just want things dealt with (and don’t want to hear about it whilst you’re sipping a pina colada on holiday), the detection approach can be automated, so that problems are both detected and acted on without anyone stepping in.
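To make the idea concrete, here is a rough sketch of what an uptime check does under the hood, using only the Python standard library. The names `http_probe` and `check_site`, the three-failure threshold and the `notify` callback are all assumptions for illustration; hosted monitors also space their checks out over time and probe from several locations.

```python
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, DNS failures, refused connections and timeouts.
        return False


def check_site(url, notify, probe=http_probe, failures_before_alert=3):
    """Probe the site up to failures_before_alert times; alert only if every probe fails.

    notify: callable that receives a message (email, Slack, etc. are up to you).
    Returns True when the site is considered up.
    """
    for _ in range(failures_before_alert):
        if probe(url):
            return True
    notify(f"{url} failed {failures_before_alert} checks in a row")
    return False
```

Requiring several failed probes before alerting is a design choice to avoid waking someone up over a single dropped request; `notify` could just as easily kick off an automated recovery step instead of sending a message.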

The tools used with this approach:

- Uptime monitors. Leading services such as UptimeRobot, Pingdom and Sentry, or Azure's built-in monitoring and insights services, help you detect problems with your services and plan your reaction.

- ARM templates, IaC, scripts. Any form of template that helps your team recover quickly is gold. For example, if you notice that your West Europe data centre is not working, how long will it take you to restore and set up the whole solution in another location? Templates and scripts created at the beginning of a project save a lot of time at this stage.

- Automation runbooks. With automation runbooks, your development team can implement scenarios that help you recover solutions. How well they work depends on the type of incident. If the issue is at the application level, for example, a simple restart or reboot might solve the problem. But if the infrastructure is broken or, even worse, the whole data centre is down, your runbook probably won’t work either.
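The point about incident type can be sketched as a tiny decision table. This is a hypothetical illustration only: the `Scope` values, the step descriptions and the region name passed in are assumptions, and a real runbook would call your own deployment scripts rather than return strings.

```python
from enum import Enum


class Scope(Enum):
    APPLICATION = "application"
    INFRASTRUCTURE = "infrastructure"
    DATA_CENTRE = "data centre"


def recovery_steps(scope, secondary_region):
    """Return the ordered recovery steps for an incident of the given scope."""
    if scope is Scope.APPLICATION:
        return ["restart the application"]
    if scope is Scope.INFRASTRUCTURE:
        return ["redeploy infrastructure from templates", "restart the application"]
    # A whole data centre is down: restarting in place cannot help,
    # so rebuild everything elsewhere and move traffic over.
    return [
        f"deploy templates to {secondary_region}",
        "restore data from the latest backup",
        f"re-point domains to {secondary_region}",
    ]
```

For example, `recovery_steps(Scope.DATA_CENTRE, "North Europe")` lists the rebuild-elsewhere path, which is exactly where templates and backups pay for themselves.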

It’s also worth subscribing to the status notifications offered by third-party providers such as Azure, so you are among the first to hear about any problems they are having. These days, most software providers expose a status page or notification service for exactly this purpose. As a general rule, you want to know about an issue before your client or the end customer notices and reports it.

3. Team Prevention.

Companies like Netflix, whose services are used across the globe and relied on by many, have implemented policies and tools to help them prevent common failure scenarios and outages in their services. For example, Netflix’s Chaos Monkey randomly terminates instances in production to ensure that engineers build their services to be resilient to instance failures! (They run a tight ship.)
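The idea behind that kind of experiment fits in a few lines. The toy below is not Netflix's implementation; it only simulates the principle on a list of instance names, and `chaos_round` and `min_healthy` are made-up names for illustration.

```python
import random


def chaos_round(instances, min_healthy, rng=random):
    """Terminate one randomly chosen instance and report whether the service survives.

    instances: list of running instance names (mutated in place).
    min_healthy: how many instances the service needs to keep serving traffic.
    rng: source of randomness; pass random.Random(seed) for repeatable runs.
    Returns a (victim, survived) tuple.
    """
    victim = rng.choice(instances)
    instances.remove(victim)
    return victim, len(instances) >= min_healthy
```

If a round ever returns `survived` as `False`, the setup has a single point of failure, and it is far better to learn that from an experiment than from a real outage.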

You can set up a kind of “Chaos Monkey” in your team too, but it goes by the name of Red Team. This is a great way to put your setups to the test, although some would argue that a junior developer with full access privileges on Azure is a Chaos Monkey in human form! I’d also argue that a whiskey-driven infrastructure cleanup completed by a project manager on a Friday night is an effective method. But there are, of course, better ways of preventing a digital nightmare from inside your organisation...

With a proper disaster recovery and backup plan, you and your team can prepare for outages and avoid the losses that come with them.

Azure offers a great service called Azure Site Recovery, a built-in ‘disaster recovery as a service’ offering. It works with Traffic Manager and a set of additional services, such as backup storage accounts, to detect failures and replicate setups so that downtime is reduced. It’s easy to set up and requires minimal technical knowledge.
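Conceptually, the traffic-routing half of this is priority-based failover: send users to the primary site while it is healthy, and to a replica when it is not. The sketch below illustrates that logic only; it is not Azure's API, and `pick_endpoint`, the example URLs and the `is_healthy` callback are assumptions.

```python
def pick_endpoint(endpoints, is_healthy):
    """Return the highest-priority healthy endpoint, or None if all are down.

    endpoints: list of (priority, url) tuples; a lower number means higher priority.
    is_healthy: callable taking a url and returning True or False.
    """
    for _, url in sorted(endpoints):
        if is_healthy(url):
            return url
    return None
```

With `[(1, "https://primary.example.com"), (2, "https://secondary.example.com")]`, traffic stays on the primary until its health check fails, then moves to the secondary, which is why the replica has to be kept up to date before the outage rather than during it.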

Reaction, detection or prevention at Cogworks?

We’re definitely Team Prevention! For our Umbraco clients, we have a range of support plans available that actively monitor and resolve outages and site failures before clients are aware they’ve happened. Outside of Umbraco clients (for our own services), we continue to practice and repeat prevention with everything we do. The insurance and peace of mind which come with a solid and resilient setup are worth more than anything else.


We’d love to hear about your experiences or top tips when it comes to all things outages. Why not leave us a comment in the box below to start the conversation?