Disaster Recovery for Worst-Case Scenarios

An unfortunate reality of computing is that hardware can fail and malicious actions by third parties can never be fully predicted.  For this reason, we highly recommend preparing for these events with a disaster recovery (DR) plan.  A DR plan consists of the steps to take during an incident, as well as preventative measures taken beforehand to minimize its impact.  Both are necessary, and they need to work in concert so you can recover as quickly as possible after the worst happens.

This article serves as a starting point for putting together your own DR plan.  Every environment is unique, so the DR plan for that environment must be tailored to match.  For this reason, you may wish to take steps beyond those outlined here, or forgo steps that are unnecessary for your environment.  Each suggestion here will also contain a link with more detail on how to accomplish that idea.

Plan ahead

Disaster recovery plans only work if you prepare ahead of time and use those preparations during an incident.  With that in mind, these are some items you should set up as soon as possible to help safeguard against future problems.

Create a golden image

The purpose of a ‘golden image’ is to preserve the baseline of the server.  This allows you to spin up new servers or rebuild a damaged one with an image that already has your applications and settings on it, reducing the time to recreate lost resources.

Back up your data

Having your applications and settings taken care of via a golden image means you only need to worry about backing up your data via file-level backups.  This allows you to create small, efficient backups that can run more often and take up less storage space than backing up the entire server.  Since file-level backups can run more frequently, you limit the amount of data that could potentially be lost during an incident.  Smaller backups also mean quicker restores.  Combined with a golden image, this can mean mere minutes to recover from a device failure instead of potentially hours spent configuring a new device from scratch.
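
As a rough illustration of a file-level backup, the sketch below archives a single data directory into a timestamped, compressed tarball.  The function name create_backup and the directory paths are hypothetical, chosen only for this example; a production environment would more likely rely on a dedicated backup agent or service.

```python
import tarfile
import time
from pathlib import Path


def create_backup(data_dir: str, backup_dir: str) -> Path:
    """Archive data_dir into a timestamped .tar.gz file inside backup_dir."""
    source = Path(data_dir)
    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # A UTC timestamp in the file name keeps backups taken at different
    # times from overwriting each other.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive_path = target_dir / f"{source.name}-{stamp}.tar.gz"

    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname stores entries under the data directory's own name
        # instead of its full absolute path.
        archive.add(source, arcname=source.name)
    return archive_path
```

Keep backup_dir outside of data_dir so that the archive does not end up inside the directory it is backing up.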

Use a Load Balancer as the front-end to the environment

Load Balancers are assigned a static IP just like a Cloud Server, but they are highly available.  This means that the Load Balancer will remain available even through a failure of the underlying physical hardware.  For this reason, we recommend using a Load Balancer as your front end and pointing your DNS to it.  This has two key advantages.  The first is that the IP address stays the same until you delete the Load Balancer, so once you set it, you do not have to change DNS and wait for propagation again.  The second is flexibility in what serves the site behind the Load Balancer.  For example, say you have a problem with a web server.  Simply spin up a new one from your golden image, copy your latest backup onto it to update the data, then swap it with the problem node behind the Load Balancer.  The new node can continue to serve the site with little to no difference from the old one, which can then be further examined or deleted.  No more waiting on DNS to propagate before your site is back up.

Build redundancy into your environment

Problems can occur at any time, and it is helpful to have some reinforcements to take up the slack while you troubleshoot.  If you have two web servers behind a Load Balancer instead of one and a problem develops on one, you can simply take that one out of rotation and work on it or replace it.  You should apply this mentality to as many parts of your environment as possible and eliminate single points of failure.  For example, a simple master and slave DB setup allows for more functionality during a failure than a single-server database setup.  Even better is a master-master setup, as this would mean no change in functionality during a failure.

Test your backups and images

“A backup is only good if it’s been tested” is a phrase you should make a mantra.  While testing every single backup you create is probably an unattainable goal, you should aim to test as often as time allows.  Testing confirms that your backups are viable and that there were no issues in the backup process itself.  It also lets you practice your recovery methods, making for a faster recovery should the unthinkable happen.  Finally, it serves as a way to double-check that you have configured the backup to capture everything you need and to trim out what you don’t.
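
One way to test a backup is to restore it into a scratch location and compare the result against the source.  The sketch below does this with SHA-256 checksums.  It assumes the backup is a gzip-compressed tarball whose top-level folder has the same name as the data directory; the function names are hypothetical and not part of any particular backup product.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_backup(archive_path: str, data_dir: str) -> list[str]:
    """Restore the archive to a scratch directory and list files that differ."""
    source = Path(data_dir)
    with tempfile.TemporaryDirectory() as scratch:
        # Only extract archives you created yourself or otherwise trust.
        with tarfile.open(archive_path, "r:gz") as archive:
            archive.extractall(scratch)
        restored = file_digests(Path(scratch) / source.name)
    original = file_digests(source)
    # Report files missing from either side or whose contents do not match.
    return sorted(
        name for name in original.keys() | restored.keys()
        if original.get(name) != restored.get(name)
    )
```

An empty list means the restored files match the source.  On a busy server the live data changes after the backup runs, so a real check would more likely compare against a checksum manifest captured at backup time.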

Have offsite or local backups

With the many automation tools available for version control and continuous integration, offsite backups may be easier to set up for a custom-developed application than for a website running on one of the CMS platforms.  Either way, it’s important to build redundancy into your backups, just as you would for important photos you don’t want to lose.  Most of the common CMS platforms also have plugins that can automate local or offsite backups and take that work off your plate.
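
If you script your own backup redundancy, a second copy is only useful if it arrived intact.  The sketch below copies a backup archive to a second location and confirms the copy by checksum.  Here offsite_dir is a hypothetical stand-in for wherever your second copy lives, such as a mounted remote volume; an object store or a CMS plugin would use its own upload mechanism instead.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large archives do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(archive: str, offsite_dir: str) -> Path:
    """Copy the archive into offsite_dir and verify the copy by checksum."""
    source = Path(archive)
    destination_dir = Path(offsite_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)

    # copy2 preserves timestamps and returns the path of the new file.
    destination = Path(shutil.copy2(source, destination_dir))
    if sha256_of(source) != sha256_of(destination):
        raise OSError(f"checksum mismatch after copying {source.name}")
    return destination
```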

What to do if your environment is down

This section assumes you have, at the very least, implemented golden images, regular backups, and a Load Balancer front end.

If an issue knocks out one of your servers, all that is necessary is to spin up a new server from the golden image.  Once the build completes, perform a backup recovery to the new server with the latest backup you have available.  Once the restore completes, give the server a quick once-over to make sure all services are started and it responds as expected to requests.  Then simply add the new server as a node to the Load Balancer, and remove the problem node.  With practice, this process can be as short as just a few minutes from start to end, meaning very little downtime even in the worst cases.
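
The last two steps, checking the new server and then swapping it into rotation, can be sketched as follows.  This is a simplified model, not a real Load Balancer API: the pool is a plain list of node names standing in for the Load Balancer's node list, and http_ok and swap_node are hypothetical helper names.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with an HTTP 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Connection failures, timeouts, and 4xx/5xx responses land here.
        return False


def swap_node(pool: list[str], failed: str, replacement: str,
              is_healthy: Callable[[str], bool]) -> list[str]:
    """Return a new pool with the replacement added and the failed node removed."""
    if not is_healthy(replacement):
        raise RuntimeError(f"{replacement} failed its health check; pool unchanged")
    # Add first, then remove, so capacity never drops below the current level.
    updated = pool + [replacement]
    return [node for node in updated if node != failed]
```

Passing the health check in as a function keeps the swap logic independent of how you probe a server.  For a web node, the check could be something like lambda name: http_ok(f"http://{name}/"), assuming the node names resolve to your servers.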

Once the immediate concern of keeping your site or application running is addressed, you can then opt to investigate the problem node.  We recommend doing this so you can determine whether any additional preventative measures should be added to your DR plan, or whether changes should be made to the server and thus to the golden image as well.  In either case, once the server has been taken out of rotation and any investigation has been performed, we recommend either bringing the repaired node back into rotation behind the Load Balancer or deleting it to reduce costs.

After the situation is resolved and the results from any investigation are in, make any changes necessary to your DR plan or environment based on the issues you ran into.  It can be painful to extend a stressful situation, but implementing extra precautions while the issue is fresh in your mind reduces the chances of neglecting them and running into the same issue again in the future.

Preparing For a High-Traffic Event

You’re Making the Big Time!

You caught a big break and your site ended up on the front page of Reddit or is lined up to air on Shark Tank. First of all, congrats on the success! This is only the beginning, and you need to prepare yourself for what comes next, because your environment is likely not ready for all the enthusiasm and traffic that come with it! This article is a high-level overview of things to consider when preparing for a high-traffic event. Truth be told, it won’t replace having good partners in this with you. If you don’t have a solid application/web developer and system admin, start with that!

Pre-event

  • Load balance your web and application servers and scale them horizontally
  • Scale your database vertically and add replicas horizontally
  • Every tier (web, app, database) needs to be highly available and redundant
  • Set up server-level and application-level monitoring
  • Minimize dynamic content on your website
  • Maximize static content delivery over CDN
  • Use content and DNS caching through a service such as Cloudflare or Incapsula
  • Create failover plans and understand your time to recovery
  • Load test using a provider like Loader.io, SOASTA, LoadView, etc.
  • Test some more, and at every tier (web, app, and database)! Avoid the pitfall of spending tons of money on resources and not enough on testing
  • Alert your Support team via ticket so they are aware of when you expect high traffic
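
To give a feel for what a load test measures, the sketch below fires a batch of concurrent requests and reports the error count and latency figures.  It is a minimal illustration, not a substitute for the hosted providers named above.  The run_load_test function and its send_request parameter are hypothetical; send_request is any callable you supply that performs a single request and raises an exception on failure.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import quantiles
from typing import Callable


def run_load_test(send_request: Callable[[], object],
                  total_requests: int, concurrency: int) -> dict:
    """Call send_request total_requests times across a pool of worker threads."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed_call(_: int) -> tuple[float, bool]:
        start = time.perf_counter()
        try:
            send_request()
            succeeded = True
        except Exception:
            # Any exception raised by send_request counts as a failed request.
            succeeded = False
        return time.perf_counter() - start, succeeded

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(duration for duration, _ in results)
    errors = sum(1 for _, succeeded in results if not succeeded)
    # quantiles(n=100) yields 99 cut points; index 94 is the 95th percentile.
    # It needs at least two samples, so a single sample is used as-is.
    p95 = (quantiles(latencies, n=100, method="inclusive")[94]
           if len(latencies) >= 2 else latencies[0])
    return {
        "requests": total_requests,
        "errors": errors,
        "p95_seconds": p95,
        "max_seconds": latencies[-1],
    }
```

In practice, send_request might wrap urllib.request.urlopen pointed at a staging copy of your site.  Only run load tests against systems you own, and raise the concurrency gradually so you can see where each tier starts to struggle.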

 

Mid-event

  • Actively watch your server and application monitoring metrics
  • Collect data throughout the event to analyze after the event ends
  • Have your teams on standby, ready to troubleshoot should you need them
  • Limiting which regions can access your site via ACLs can work in a pinch if your environment is struggling
  • Should your environment crash under immense pressure, act diligently and swiftly to determine the quickest path to recovery. Now is not the time to make major changes that can further delay getting back online. Do your best to get your website back online and accessible to your intended audience.
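
Region-based ACLs are normally applied at the Load Balancer, firewall, or CDN using geolocation data, but the underlying allow-list logic is simple.  The sketch below shows it using placeholder CIDR ranges reserved for documentation; ALLOWED_NETWORKS and is_allowed are hypothetical names, and you would substitute the networks for the regions you actually serve.

```python
import ipaddress

# Placeholder ranges reserved for documentation (RFC 5737); replace them
# with the networks that correspond to the regions you want to admit.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_allowed(client_ip: str, allowed=ALLOWED_NETWORKS) -> bool:
    """Return True if client_ip falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    # An address of a different IP version than the network never matches.
    return any(address in network for network in allowed)
```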

 

Post-event

  • Scale your environment back down to a baseline for normal operation
  • Analyze how your environment performed at every tier (web, application, database, CMS, etc)
  • Understand the root causes of any crashes or dips in performance
  • If you implemented emergency fixes during a crash, such as disabling search functions, limiting dynamic content, or using a queue service to throttle traffic, work to revert those changes
  • Strategize for your next event.
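
As a starting point for the per-tier analysis, the sketch below groups response-time samples by tier and reports a count, mean, and maximum for each.  The (tier, seconds) sample format and the summarize_by_tier name are assumptions made for this example; your monitoring tool will export data in its own format.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_tier(samples: list[tuple[str, float]]) -> dict[str, dict]:
    """Group (tier, response_seconds) samples and summarize each tier."""
    grouped = defaultdict(list)
    for tier, seconds in samples:
        grouped[tier].append(seconds)
    return {
        tier: {
            "count": len(values),
            "mean_seconds": mean(values),
            "max_seconds": max(values),
        }
        for tier, values in grouped.items()
    }
```

A tier whose maximum is far above its mean is a good place to begin looking for the root cause of a dip in performance.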

Conclusion

We hope this guide served as a high-level overview of things to consider when preparing for a high-traffic event, as well as during and after it. Knowledge is power, so the more you learn and take away from your high-traffic event, the better!