2023-04-11 NOTICE: The following policy or plan is currently under internal review and may not be up-to-date or fully aligned with our organization's current practices or procedures. Please check back shortly, or contact us for more information.

Disaster Recovery Policy and Plan

Last updated: 2023-03-30

The Fox and Geese Disaster Recovery Plan establishes procedures to recover Fox and Geese services following a disruption caused by a disaster. This Disaster Recovery Policy and Plan is maintained by the Fox and Geese DevOps Team.

The following objectives have been established for this plan:

  1. Maximize the effectiveness of contingency operations through an established plan that consists of the following phases:
    • Notification/Activation phase to detect and assess damage and to activate the plan;
    • Recovery phase to restore temporary IT operations and recover damage done to the original system;
    • Reconstitution phase to restore IT system processing capabilities to normal operations.
  2. Identify the activities, resources, and procedures needed to carry out Fox and Geese processing requirements during prolonged interruptions to normal operations.
  3. Identify and define the impact of interruptions to Fox and Geese systems.
  4. Assign responsibilities to designated personnel and provide guidance for recovering Fox and Geese systems during prolonged periods of interruption to normal operations.
  5. Ensure coordination with other Fox and Geese staff who will participate in the contingency planning strategies.
  6. Ensure coordination with external points of contact and vendors who will participate in the contingency planning strategies.
  7. Establish a clear communication plan to keep employees, partners, customers, and stakeholders informed during a disaster.
  8. Define Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) for each critical system.
  9. Prioritize systems and applications to ensure the most important systems are restored first.
  10. Implement a training and awareness program for employees on disaster recovery procedures and their roles during a disaster.
  11. Consider alternate communication methods in case of widespread infrastructure disruptions.
  12. Establish a process for documenting lessons learned after tests or actual disaster recovery events.
  13. Ensure compliance with regulatory and legal requirements during a disaster.

This Disaster Recovery Plan has been developed as required under the Federal Information Security Modernization Act (FISMA) of 2014, the Health Insurance Portability and Accountability Act (HIPAA) Final Security Rule, §164.308(a)(7), the Cybersecurity and Infrastructure Security Agency (CISA) Cyber Essentials, the Cybersecurity Maturity Model Certification (CMMC), and the General Data Protection Regulation (GDPR). These requirements call for procedures to respond to events that damage systems containing electronic protected health information and to ensure the security and resilience of critical infrastructure.

The Disaster Recovery Plan is created in accordance with the guidelines established by the National Institute of Standards and Technology (NIST) Special Publication (SP) 800-34 Rev. 1, titled "Contingency Planning Guide for Federal Information Systems" dated May 2010, and the NIST Cybersecurity Framework.

The Disaster Recovery Plan also complies with the following federal and departmental policies:

  • The Federal Information Security Modernization Act (FISMA) of 2014;
  • Health Insurance Portability and Accountability Act (HIPAA) Final Security Rule, §164.308(a)(7);
  • Cybersecurity and Infrastructure Security Agency (CISA) Cyber Essentials;
  • Cybersecurity Maturity Model Certification (CMMC);
  • General Data Protection Regulation (GDPR);
  • The Computer Security Act of 1987;
  • OMB Circular A-130, Management of Federal Information Resources, Appendix III, November 2000;
  • Presidential Policy Directive (PPD) 21, Critical Infrastructure Security and Resilience, February 2013;
  • Executive Order 13800, Strengthening the Cybersecurity of Federal Networks and Critical Infrastructure, May 2017.

Examples of the types of disasters that would initiate this plan include natural disasters, political disturbances, man-made disasters, external human threats, and internal malicious activities.

Fox and Geese defines two categories of systems from a disaster recovery perspective:

  1. Critical Systems. These systems host application servers and database servers containing customer data, sensitive information, and other essential services. An interruption in the availability of these systems directly impacts our ability to provide services to our customers. Specific RTO and RPO should be established for each critical system. As a general guideline, critical systems have an RTO of 2 hours and an RPO of 1 hour.

  2. Non-critical Systems. These are all systems not considered critical under the definition above. While these systems may affect the performance and overall security of critical systems, they do not directly impact our ability to provide services to our customers. These systems are restored at a lower priority than critical systems. A specific RTO and RPO should also be established for each non-critical system. As a general guideline, non-critical systems have an RTO of 2 hours and an RPO of 6 hours.
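
As a rough illustration of how the guideline targets above could be checked after an incident, the following Python sketch encodes the two categories and compares a measured outage against them. The names Tier, TARGETS, and meets_objectives are hypothetical and not part of any Fox and Geese tooling. The hour values are the general guidelines stated above; per-system targets, once established, would replace them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum


class Tier(Enum):
    CRITICAL = "critical"
    NON_CRITICAL = "non-critical"


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerable time to restore service
    rpo: timedelta  # maximum tolerable window of lost data


# General guideline values from this plan; per-system values may differ.
TARGETS = {
    Tier.CRITICAL: Objectives(rto=timedelta(hours=2), rpo=timedelta(hours=1)),
    Tier.NON_CRITICAL: Objectives(rto=timedelta(hours=2), rpo=timedelta(hours=6)),
}


def meets_objectives(tier, last_backup, outage_start, service_restored):
    """Return (rto_met, rpo_met) for one system after an incident."""
    target = TARGETS[tier]
    downtime = service_restored - outage_start
    data_loss_window = outage_start - last_backup
    return downtime <= target.rto, data_loss_window <= target.rpo


# Illustrative timestamps only, not a real incident.
outage = datetime(2023, 1, 15, 12, 0)
result = meets_objectives(
    Tier.CRITICAL,
    last_backup=outage - timedelta(minutes=45),
    outage_start=outage,
    service_restored=outage + timedelta(hours=3),
)
print(result)  # (False, True): restored too slowly, data loss within RPO
```

Here the RPO is measured as the gap between the last usable backup and the start of the outage, and the RTO as the gap between the start of the outage and the restoration of service.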

Business Continuity Plan Integration

This Disaster Recovery Policy and Plan is integrated with the Fox and Geese Business Continuity Plan to ensure a comprehensive approach to disaster recovery and business continuity, addressing the recovery of datacenter and contracted support services.

Plan Update Frequency

This Disaster Recovery Policy and Plan is reviewed and updated at least once every 12 months to ensure its continued effectiveness and alignment with the organization's evolving needs.

Communication Plan

A clear communication plan is essential during a disaster to keep employees, partners, customers, and stakeholders informed about the situation and the steps being taken to restore services. The communication plan should include:

  1. Designated spokespersons for different stakeholder groups.
  2. Communication channels to be used, such as email, phone, social media, and the company website.
  3. Frequency of updates and the type of information to be shared.
  4. A process for addressing inquiries and concerns from stakeholders.

The Disaster Recovery Coordinator (DRC) is responsible for overseeing the communication plan and ensuring that all stakeholders are kept informed during a disaster.

Training and Awareness

A training and awareness program should be established to ensure that employees are aware of the disaster recovery procedures and their roles during a disaster. The program should include:

  1. Regular training sessions on the disaster recovery plan and procedures.
  2. Clear documentation of roles and responsibilities for each employee during a disaster.
  3. Periodic reviews and updates to the training program to ensure it remains relevant and effective.

The DRC is responsible for overseeing the training and awareness program and ensuring that all employees are prepared to respond effectively during a disaster.

Lessons Learned

After each test or actual disaster recovery event, lessons learned should be documented and the plan updated accordingly. This process includes:

  1. A thorough review of the event and the effectiveness of the disaster recovery procedures.
  2. Identification of areas for improvement or gaps in the plan.
  3. Updates to the plan and procedures based on the lessons learned.

The DRC is responsible for overseeing the lessons learned process and ensuring that the plan is updated as needed.

Regulatory and Legal Compliance

The disaster recovery plan should include procedures to ensure compliance with relevant regulations and legal requirements during a disaster. This includes:

  1. Ensuring that data protection and privacy requirements are met, even during a disaster.
  2. Coordinating with legal and compliance teams to address any potential issues that may arise during a disaster.
  3. Documenting compliance efforts and providing updates to relevant authorities as needed.

The DRC is responsible for overseeing regulatory and legal compliance during a disaster.

Applicable Standards

Applicable Standards from the HITRUST Common Security Framework

  • 12.c - Developing and Implementing Continuity Plans Including Information Security

Applicable Standards from the HIPAA Security Rule

  • 164.308(a)(7)(i) - Contingency Plan

Line of Succession

The following order of succession ensures that decision-making authority for this Disaster Recovery Plan is uninterrupted. The designated Disaster Recovery Coordinator (DRC) is responsible for ensuring the safety of personnel and the execution of procedures documented within this Disaster Recovery Plan. The Chief Technology Officer (CTO) typically serves as the DRC; however, if the CTO is unable to function as the overall authority or chooses to delegate this responsibility to a successor, the CEO or COO shall function as the DRC. If the contingency plan needs to be activated, initiate contact using the prioritized contact list below.

Responsibilities

The following teams have been established and trained to respond to a contingency event affecting the IT system.

  1. The Ops Team (referred to elsewhere in this plan as the DevOps Team) is responsible for recovery of the Fox and Geese hosted environment, network devices, and all servers. Members of the team include personnel who are also responsible for the daily operations and maintenance of Fox and Geese. The team leader is the DRC and directs the Ops Team.
  2. The Web Services Team is responsible for ensuring all application servers, web services, and platform add-ons are working. It is also responsible for testing redeployments and assessing damage to the environment. The team leader is the DRC and directs the Web Services Team.

Members of the Ops and Web Services teams must maintain local copies of the contact information from the documented line of succession. Additionally, the DRC must maintain a local copy of this policy in the event Internet access is not available during a disaster scenario.

Testing and Maintenance

The DRC shall establish criteria for validating and testing the Disaster Recovery Plan, set an annual test schedule, and ensure that the tests are carried out. This process also serves as training for personnel involved in the plan's execution. At a minimum, the Disaster Recovery Plan shall be tested annually (within 365 days). The types of validation/testing exercises include tabletop and technical testing. Disaster Recovery Plans for all application systems must, at a minimum, be tested using the tabletop testing process. However, if an application system's Disaster Recovery Plan is included in the technical testing of its support systems, that technical test satisfies the annual requirement.

Tabletop Testing

Tabletop Testing is conducted in accordance with the CMS Risk Management Handbook, Version 1.2, January 28, 2019. The primary objective of the tabletop test is to ensure that designated personnel are knowledgeable about, and capable of performing in a timely manner, the notification/activation requirements and procedures outlined in the Disaster Recovery Plan. The exercises include, but are not limited to:

  • Testing to validate the ability to respond to a crisis in a coordinated, timely, and effective manner, by simulating the occurrence of a specific crisis.

Technical Testing

The primary objective of the technical test is to ensure that communication, data storage, and recovery processes can run at an alternate site and deliver the functions and capabilities of the system within the designated requirements. Technical testing shall include, but is not limited to:

  • Process workloads from the backup system at the alternate site;
  • Restore system using backups; and
  • Switch compute and storage resources to alternate processing site.
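
One way to support the restore-from-backup exercise above is to compare checksums recorded at backup time with checksums of the restored data. The sketch below is a minimal, assumed approach using SHA-256. The function names and the manifest format (a mapping of object name to hex digest) are hypothetical, and in-memory bytes stand in for real files or database exports.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def build_manifest(objects: dict) -> dict:
    """Record a digest for every object at backup time."""
    return {name: sha256_hex(content) for name, content in objects.items()}


def verify_restore(manifest: dict, restored: dict) -> dict:
    """Compare restored objects against the manifest taken at backup time."""
    missing = sorted(name for name in manifest if name not in restored)
    corrupted = sorted(
        name
        for name, digest in manifest.items()
        if name in restored and sha256_hex(restored[name]) != digest
    )
    return {"missing": missing, "corrupted": corrupted}


# Placeholder data standing in for backed-up objects.
original = {"users.json": b'{"id": 1}', "orders.json": b'{"id": 7}'}
manifest = build_manifest(original)

# Simulate a restore in which one object came back altered.
restored = {"users.json": b'{"id": 1}', "orders.json": b'{"id": 8}'}
report = verify_restore(manifest, restored)
print(report)  # {'missing': [], 'corrupted': ['orders.json']}
```

An empty report on both keys indicates that every object in the manifest was restored with identical content.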

Annual Testing Frequency and Requirements

The Disaster Recovery Plan is tested at least once every 12 months, addressing datacenter recovery and contracted support services. This annual testing requirement is satisfied through the Tabletop and Technical Testing exercises mentioned above.
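
The 365-day testing window can be tracked with simple date arithmetic. The following sketch is illustrative only: the function names are hypothetical, and the dates are placeholders, not actual Fox and Geese test dates.

```python
from datetime import date, timedelta

MAX_TEST_INTERVAL = timedelta(days=365)  # "within 365 days" per this plan


def next_test_due(last_test: date) -> date:
    """Latest date by which the next test must be completed."""
    return last_test + MAX_TEST_INTERVAL


def is_test_overdue(last_test: date, today: date) -> bool:
    return today > next_test_due(last_test)


last = date(2023, 1, 15)  # placeholder date of the most recent test
print(next_test_due(last))                       # 2024-01-15
print(is_test_overdue(last, date(2024, 1, 10)))  # False
print(is_test_overdue(last, date(2024, 2, 1)))   # True
```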

Disaster Recovery Procedures

This section provides detailed procedures for recovering the application at an alternate site, covering datacenter recovery and contracted support services. These procedures are designed to ensure the quick and efficient restoration of Fox and Geese services after a disaster.


Notification and Activation Phase

The Notification and Activation phase detects and assesses damage and activates this plan, as described in the plan objectives above.

Recovery Phase

This section provides detailed procedures for recovering the Fox and Geese infrastructure at the alternate site. The procedures should be executed in the sequence in which they are presented to maintain efficient operations.

  1. Contact affected partners and customers - Web Services Team
    • Inform them of the disaster, its impact on services, the estimated downtime, and the recovery plan.
  2. Assess damage to the environment - Web Services Team
    • Determine the extent of damage to hardware, software, data, and network components.
    • Identify affected systems, applications, and services.
  3. Establish a Recovery Team - DRC
    • Assign roles and responsibilities for each team member during the recovery process.
    • Coordinate with external vendors, partners, and support services as needed.
  4. Set up the alternate site - DevOps Team
    • Prepare the alternate site with necessary hardware, software, and network components.
    • Configure the systems and network according to the pre-defined disaster recovery architecture.
  5. Begin rebuilding the environment using AWS CloudFormation templates and other AWS-specific tools - DevOps Team
    • Deploy AWS CloudFormation templates to recreate the infrastructure in the alternate site.
    • Configure and customize systems, applications, and services as required.
  6. Restore databases and data - DevOps Team
    • Restore databases using DynamoDB point-in-time recovery and AWS Backup.
    • Use S3 versioning to recover lost or corrupted data.
    • Verify the integrity of restored data.
  7. Implement security measures - DevOps Team
    • Ensure systems are appropriately patched and up to date.
    • Test logging, security, and alerting functionality.
    • Implement additional security measures as needed.
  8. Test the new environment - Web Services Team
    • Perform functional tests to ensure all systems, applications, and services are working as expected.
    • Identify and resolve any issues that arise during testing.
  9. Deploy the environment to production - Web Services Team
    • Transition the services to the newly established environment in the alternate site.
    • Monitor the performance of systems and applications to ensure stability.
  10. Update DNS to the new environment - DevOps Team
    • Configure DNS records to point to the new environment.
  11. Communicate the recovery status - DRC
    • Update employees, partners, customers, and stakeholders on the recovery progress and the resumption of services.
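
Because the steps above are meant to run in order, the sequence can be captured as data and executed by a small runner that stops at the first failure. This is only a sketch: Step, run_plan, and the placeholder actions are hypothetical, and the real actions (for example, deploying CloudFormation templates or restoring DynamoDB tables) are assumed to be supplied by the responsible team.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    number: int
    title: str
    owner: str
    action: Callable[[], bool]  # returns True on success


def run_plan(steps: List[Step]) -> Tuple[List[int], bool]:
    """Run steps in ascending order; stop at the first failed step."""
    completed = []
    for step in sorted(steps, key=lambda s: s.number):
        print(f"Step {step.number} [{step.owner}]: {step.title}")
        if not step.action():
            print(f"Step {step.number} failed; halting for DRC review.")
            return completed, False
        completed.append(step.number)
    return completed, True


def ok() -> bool:
    return True  # placeholder for a real recovery action


def fail() -> bool:
    return False  # placeholder simulating a failed restore


# A subset of the Recovery Phase steps, with a simulated failure in step 6.
plan = [
    Step(4, "Set up the alternate site", "DevOps Team", ok),
    Step(5, "Rebuild the environment", "DevOps Team", ok),
    Step(6, "Restore databases and data", "DevOps Team", fail),
    Step(7, "Implement security measures", "DevOps Team", ok),
]
completed, success = run_plan(plan)
print(completed, success)  # [4, 5] False
```

Halting at the first failure keeps later steps, such as the production deployment and DNS update, from running against an environment whose data has not been restored.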

Reconstitution Phase

This section describes the activities necessary to restore Fox and Geese operations at the original or new site. The goal is to restore full operations within 24 hours of a disaster or outage. Once the hosted data center at the original or new site has been restored, Fox and Geese operations at the alternate site may be transitioned back, with the aim of a seamless transition from the alternate site to the restored site.

  1. Original or New Site Restoration - DevOps and Web Services Teams
    • Follow the same procedures as in the Recovery Phase for rebuilding the environment, restoring databases and data, implementing security measures, testing the environment, and deploying to production.
  2. Switching back to the original or new site - DevOps Team
    • Update DNS records to point back to the original or new site.
    • Monitor the performance of systems and applications to ensure stability.
  3. Plan Deactivation - DRC
    • Coordinate with teams to decommission the alternate site, following the Fox and Geese Media Disposal Policy.
    • Document lessons learned from the disaster recovery process and update the Disaster Recovery Plan accordingly.
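
The DNS switch in step 2 (and in step 10 of the Recovery Phase) could be scripted. This plan does not name a DNS provider. Assuming Amazon Route 53, the sketch below builds the ChangeBatch document accepted by the ChangeResourceRecordSets API. The function build_dns_switch is hypothetical, and the record name, target hostname, and TTL are placeholders. Submitting the batch (for example, with the boto3 Route 53 client's change_resource_record_sets call) is deliberately left out.

```python
import json


def build_dns_switch(record_name: str, target: str, ttl: int = 60) -> dict:
    """Build a Route 53 ChangeBatch that repoints a CNAME record."""
    return {
        "Comment": f"DR switch: point {record_name} at {target}",
        "Changes": [
            {
                # UPSERT creates the record if absent, otherwise updates it.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target}],
                },
            }
        ],
    }


# Placeholder hostnames; real values depend on the hosted zone in use.
batch = build_dns_switch("app.example.com.", "primary.example.net.")
print(json.dumps(batch, indent=2))
```

Keeping the TTL short ahead of a planned switch reduces how long clients continue to resolve the previous site.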