Welcome to the home page of Charles N Wyble. Charles is a 24-year-old systems guy, hacker and entrepreneur currently living in El Monte, CA with his wife of 3 years.

He is currently employed as a system engineer for Ripple TV with responsibility for a nationwide advertising network.

In his spare time he serves as Chief Technology Officer for the SoCalWiFI.net project, runs a hacker space in the San Gabriel Valley and tries to save the local economy.

Friday, December 12, 2008

Known Element Enterprises disaster recovery plan.

One of the items on my TODO list is the creation of a disaster recovery plan.

Here is what I have come up with.

Feel free to use this as a model for your disaster recovery plan. We have found it works well
for our firm, and we hope it works well for other firms.

Disaster prevention

Any time an organization experiences a disaster and is forced to activate a recovery plan, much time and money is spent. It is far cheaper to prevent a disaster than it is to recover from one.

In light of this we are taking the following steps to avoid disasters:

1) Multiple DNS providers (Zoneedit as our primary and Domainsite.com (our registrar) as secondary).

2) Nightly backups of the corporate VZ slice (which hosts our project wiki and various internal corporate applications such as CRM/ERP) to our El Monte and San Fernando Valley office servers, as well as S3.

3) Development/QA/load testing work done on EC2 instances and the El Monte development lab.

4) Standard security procedures such as IDS/log monitoring/strong passwords/VPN.

5) Usage of the Zoneedit.com failover service, with our disaster recovery page hosted on Amazon Cloudfront.

We have found the above methods provide substantial protection against risk to our firm at a very low cost.
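
A backup only helps if it actually ran. As a minimal sketch of how the nightly backups in step 2 could be verified, the Python below flags any destination whose most recent backup is missing or too old. The destination names, the function name and the 24 hour threshold are assumptions made for illustration, not a description of the scripts actually in use.

```python
from datetime import datetime, timedelta

# Hypothetical labels for the three backup destinations named in step 2.
DESTINATIONS = ("el-monte", "san-fernando-valley", "s3")


def stale_destinations(last_backup, now, max_age=timedelta(hours=24)):
    """Return the destinations whose newest backup is missing or older than max_age.

    last_backup maps a destination name to the datetime its latest backup finished.
    """
    stale = []
    for name in DESTINATIONS:
        finished = last_backup.get(name)
        # A destination with no recorded backup counts as stale.
        if finished is None or now - finished > max_age:
            stale.append(name)
    return stale


if __name__ == "__main__":
    now = datetime(2008, 12, 12, 6, 0)
    last_backup = {
        "el-monte": datetime(2008, 12, 12, 2, 0),
        "s3": datetime(2008, 12, 10, 2, 0),
    }
    print(stale_destinations(last_backup, now))
```

Run from cron each morning, a check like this would turn a silently failed backup into something a person sees the same day.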

However, it is necessary to go beyond those methods and have a plan in place for when they are not sufficient.

What scenarios does this plan cover and how do we handle them?

1) Inaccessibility of OpenVZ instance.
We handle this by restoring our S3 backups to an EC2 instance.
We would then select an alternative hosting provider and provision an instance there. We have extensive documentation and backups of our existing server, so restoration is straightforward, and we have gone through multiple migrations to test this.
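
The first step of such a restore is picking which backup to use. The sketch below chooses the newest archive from a list of S3 object keys. It assumes a hypothetical naming scheme of `vz-backup-YYYY-MM-DD.tar.gz`; the key format and the function name are illustrative only, and listing the bucket and copying the archive to the EC2 instance are left out.

```python
import re
from datetime import date

# Hypothetical key format: vz-backup-2008-12-11.tar.gz
KEY_PATTERN = re.compile(r"^vz-backup-(\d{4})-(\d{2})-(\d{2})\.tar\.gz$")


def newest_backup(keys):
    """Return the key with the latest date in its name, or None if none match."""
    dated = []
    for key in keys:
        match = KEY_PATTERN.match(key)
        if match:
            year, month, day = (int(part) for part in match.groups())
            dated.append((date(year, month, day), key))
    if not dated:
        return None
    # Tuples compare by date first, so max() picks the most recent archive.
    return max(dated)[1]


if __name__ == "__main__":
    keys = [
        "vz-backup-2008-12-09.tar.gz",
        "vz-backup-2008-12-11.tar.gz",
        "notes.txt",
    ]
    print(newest_backup(keys))
```

Keys that do not match the pattern are ignored, so stray files in the bucket cannot be mistaken for a backup.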

2) Inaccessibility of the El Monte or San Fernando Valley office servers.
This is not a great concern, as these servers host internal development resources, and a downtime of 24 to 72 hours (the average downtime necessary to repair a failed server) doesn't affect company operations. Any immediate needs are already handled via EC2.

3) Death or separation of executive leadership.
4) Death or separation of senior and middle management.
5) Death or separation of employees.

As everything the company does (both public and private operations) is on our corporate server, which is backed up in multiple geographically distributed locations, any remaining element or elements of the organization will be able to establish continuity of operations.
