An unfortunate fact of computing is that hardware can fail and malicious actions by third parties can never be fully predicted. For this reason, we highly recommend preparing for these events with a disaster recovery (DR) plan. A DR plan consists of the steps you take during an incident, as well as the preventative measures you take beforehand to minimize its impact. Both are necessary, and they need to work in concert so that you can recover as quickly as possible after the worst happens.
This article serves as a starting point for putting together your own DR plan. Every environment is unique, so the DR plan for that environment must be tailored to match. You may therefore wish to take steps beyond those outlined here, or forgo steps that are unnecessary for your environment. Each suggestion here will also contain a link with more detail on how to accomplish that idea.
Disaster recovery plans only work if you prepare ahead of time and then use those preparations during an incident. With that in mind, set up the following items as soon as possible to help safeguard against future problems.
Create a golden image
The purpose of a ‘golden image’ is to preserve the baseline of the server. This allows you to spin up new servers or rebuild a damaged one with an image that already has your applications and settings on it, reducing the time to recreate lost resources.
Back up your data
Because a golden image takes care of your applications and settings, you only need to worry about backing up your data with file-level backups. This allows you to create small, efficient backups that can run more often and take up less storage space than backups of the entire server. Since file-level backups can run more frequently, you limit the amount of data that could be lost during an incident. Smaller backups also restore more quickly. Combined with a golden image, this can mean mere minutes to recover from a device failure instead of potentially hours spent configuring a new device from scratch.
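As a rough illustration of a file-level backup, the sketch below archives a single data directory into a timestamped .tar.gz file using only the Python standard library. The function name create_backup, the directory names, and the archive naming scheme are assumptions made for this example, not part of any particular backup product.

```python
import tarfile
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def create_backup(data_dir: Path, backup_dir: Path) -> Path:
    """Archive data_dir into a timestamped .tar.gz file inside backup_dir."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    # A UTC timestamp in the name keeps successive backups from overwriting each other.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive = backup_dir / f"{data_dir.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname stores paths relative to the data directory, not the full host path.
        tar.add(data_dir, arcname=data_dir.name)
    return archive


if __name__ == "__main__":
    # Demo with throwaway directories; real paths would point at your data.
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "site-data"
        data.mkdir()
        (data / "index.html").write_text("<h1>hello</h1>")
        print(create_backup(data, Path(tmp) / "backups").name)
```

Because only the data directory is archived, a job like this can reasonably run far more often than a full server image.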
Use a Load Balancer as the front-end to the environment
Load Balancers are assigned a static IP just like a Cloud Server, but they are highly available. This means that the Load Balancer remains available even if the underlying hardware suffers a physical device failure. For this reason, we recommend using a Load Balancer as your front end and pointing your DNS to the Load Balancer. This has two key advantages. The first is that the IP stays the same until you delete the Load Balancer, so once you set it, you do not have to change DNS and wait for the change to propagate again. The second is flexibility in what serves the site behind the Load Balancer. For example, say you have a problem with a web server. Simply spin up a new one from your golden image, copy your latest backup onto it to update the data, then swap it with the problem node behind the Load Balancer. The new node can continue to serve the site with little to no difference from the old one, which can then be examined further or deleted. No more waiting on DNS to propagate before your site is back up.
Build redundancy into your environment
Problems can occur at any time, and it is helpful to have some reinforcements to take up the slack while you troubleshoot. If you have two web servers behind a Load Balancer instead of one and a problem develops on one of them, you can simply take that server out of rotation and then repair or replace it. Apply this mentality to as many parts of your environment as possible, and eliminate single points of failure. For example, a simple master and slave DB setup allows for more functionality during a failure than a single-server database setup. Even better is a master-master setup, as this would mean no change in functionality during a failure.
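One simple way to look for single points of failure is to list each tier of the environment with the nodes that serve it, and flag any tier that has fewer than two. The sketch below does this over a plain dictionary; the tier names, node names, and the function single_points_of_failure are hypothetical and only illustrate the idea.

```python
def single_points_of_failure(tiers: dict[str, list[str]]) -> list[str]:
    """Return the names of tiers that are served by fewer than two nodes."""
    return sorted(name for name, nodes in tiers.items() if len(nodes) < 2)


if __name__ == "__main__":
    # Hypothetical inventory: two web servers, but only one database server.
    environment = {
        "web": ["web-1", "web-2"],
        "database": ["db-1"],
    }
    print(single_points_of_failure(environment))
```

A count of two nodes is only a minimum bar. It says nothing about whether the remaining node can carry the full load on its own.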
Test your backups and images
“A backup is only good if it’s been tested” is a phrase you should make a mantra. While testing every single backup you create is probably an unattainable goal, you should aim to test as often as time allows. This allows you to ensure your backups are viable and there were no issues in the backup process itself. It also allows you to test and practice your recovery methods, allowing for a faster recovery should the unthinkable happen. Finally, it also serves as a way to double-check that you have configured the backup to capture everything you need and trim out what you don’t need.
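A restore test can be partly automated. The sketch below restores an archive into a scratch directory and compares SHA-256 checksums against a reference copy of the data. It assumes the backup is a trusted .tar.gz file containing one top-level directory named after the source directory; the function names are illustrative. In practice the live data may have changed since the backup ran, so a checksum manifest saved at backup time is often a better reference than the live directory.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def checksums(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its SHA-256 digest."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(archive: Path, source_dir: Path) -> list[str]:
    """Restore archive to a scratch directory and return the relative paths
    that are missing from the restore or differ from source_dir."""
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive, "r:gz") as tar:
            # Assumption: the archive is one we created ourselves and trust.
            tar.extractall(scratch)
        restored = checksums(Path(scratch) / source_dir.name)
    expected = checksums(source_dir)
    return sorted(path for path, digest in expected.items() if restored.get(path) != digest)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "site-data"
        src.mkdir()
        (src / "index.html").write_text("<h1>hello</h1>")
        archive = Path(tmp) / "backup.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(src, arcname=src.name)
        problems = verify_backup(archive, src)
        print("backup OK" if not problems else f"problems: {problems}")
```

An empty result means every reference file was restored with identical contents. Any path in the list is a file to investigate before you rely on that backup.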
Have offsite or local backups
With the automation tools available for version control and continuous integration, offsite backups may be easier to set up for someone developing an application than for someone running a website on a CMS platform. Either way, it’s important to build redundancy into your backups, just as you would for important photos you don’t want to lose. Most common CMS platforms also have plugins that can automate local or offsite backups and take that work off your plate.
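As a minimal sketch of keeping a second copy, the function below copies a backup archive to another directory and confirms the copy by checksum. Here the "offsite" location is simply a second directory, which is an assumption for the example; it could be a mounted remote volume, or the copy step could be replaced by whatever upload tool your storage provider offers. The names sha256_of and copy_offsite are illustrative.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large archives do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_offsite(archive: Path, offsite_dir: Path) -> Path:
    """Copy archive into offsite_dir and verify that the copy matches."""
    offsite_dir.mkdir(parents=True, exist_ok=True)
    copy = Path(shutil.copy2(archive, offsite_dir / archive.name))
    if sha256_of(copy) != sha256_of(archive):
        copy.unlink()  # do not leave a corrupt copy behind
        raise OSError(f"checksum mismatch while copying {archive.name}")
    return copy


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        archive = Path(tmp) / "backup.tar.gz"
        archive.write_bytes(b"example archive contents")
        print(copy_offsite(archive, Path(tmp) / "offsite").name)
```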
What to do if your environment is down
This section assumes that you have, at the very least, implemented golden images, regular backups, and a Load Balancer front end.
If an issue knocks out one of your servers, all that is necessary is to spin up a new server from the golden image. Once the build completes, restore your latest available backup to the new server. Once the restore completes, give the server a quick once-over to make sure all services have started and that it responds to requests as expected. Then simply add the new server as a node to the Load Balancer, and remove the problem node. With practice, this process can take as little as a few minutes from start to end, meaning very little downtime even in the worst cases.
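The ordering of the final steps can be sketched in code: check the rebuilt server first, add it to the pool, and only then remove the problem node. The LoadBalancerPool class below is a hypothetical in-memory stand-in rather than a real load balancer API, and is_healthy is a placeholder for whatever check you run against the new server.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoadBalancerPool:
    """Hypothetical in-memory stand-in for a load balancer's node list."""
    nodes: list[str] = field(default_factory=list)

    def add(self, node: str) -> None:
        if node not in self.nodes:
            self.nodes.append(node)

    def remove(self, node: str) -> None:
        self.nodes.remove(node)


def replace_node(
    pool: LoadBalancerPool,
    failed: str,
    replacement: str,
    is_healthy: Callable[[str], bool],
) -> bool:
    """Swap failed for replacement, but only if replacement passes its check."""
    if not is_healthy(replacement):
        return False  # leave the pool untouched and keep troubleshooting
    pool.add(replacement)  # add first so the pool is never left empty
    if failed in pool.nodes:
        pool.remove(failed)
    return True


if __name__ == "__main__":
    pool = LoadBalancerPool(["web-1", "web-2"])
    # Placeholder check that always passes; a real one would send a request.
    swapped = replace_node(pool, "web-1", "web-3", is_healthy=lambda node: True)
    print(swapped, pool.nodes)
```

Adding the replacement before removing the failed node means that a pool with a single node still has something to serve traffic during the swap.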
Once the immediate concern of keeping your site or application running is addressed, you can investigate the problem node. We recommend doing this so you can determine whether any additional preventative measures should be added to your DR plan, or whether changes should be made to the server and thus to the golden image as well. In either case, once the server has been taken out of rotation and any investigation is complete, we recommend either bringing the repaired node back into rotation behind the Load Balancer or deleting it to reduce costs.
After the situation is resolved and the results of any investigation are in, make any necessary changes to your DR plan or environment based on the issues you ran into. It can be painful to extend a stressful situation, but implementing extra precautions while the issue is fresh in your mind reduces the chance that they are neglected and that you run into the same issue again in the future.