Fox and Geese Software Development

Incident Response Process

Bugs happen. Servers crash. Databases rupture.

The following briefly highlights our incident response process for when bad things happen to good software.

This plan covers the products we own, including Versionista and Fluxguard. If we've done work for you, or if you have a custom engagement with us, please refer to our contract, which may extend, limit, or otherwise modify this plan.

General approach

Good software practices underlie everything we do. When we thoughtfully architect systems, we limit incidents that can arise from those systems. This means:

Moreover, our goal is to automate "everything." As such, we rely on self-healing systems and a "NoOps" approach.

In practice, this means reducing servers (and their maintenance) as much as possible via "serverless" solutions, such as AWS Lambda. This limits our need to manually rotate logs, update system software, patch bugs, and the like.

This also sees us favor task-specific cloud solutions, such as AWS API Gateway, SNS/SQS for queueing, and AWS RDS/Aurora for databases. When we favor task-specific cloud services with minimal configuration, we can leverage Amazon's patches, upgrades, and maintenance.

But good software development and automation can only get you so far. Bugs still happen. Servers still crash. When this happens,

For reported issues and bugs, we guarantee that within 12 hours we will investigate, assess, and classify each report into one of the following categories:

Feature request. For example: support monitoring of password-protected content; send screenshots of captured pages in email reports. These reports will go to the Product team for assessment and prioritization. As we have discussed, we guarantee a dedicated number of development hours for DK-requested features per month. While we will try to align with customers on feature prioritization, ultimately our Product team will determine the development focus and cadence for requested features.

Operator error. For example: emails are not received because of a user's spam filter; login does not work because the customer's admin removed access; the customer pauses monitoring yet says they are not getting reports. We will issue a brief memo highlighting the issue, and we will CC our Product team to consider UI/UX enhancements that minimize similar operator errors in future.

Acts of God. For example: Amazon AWS has a major outage; a monitored site is offline. We will issue a brief memo highlighting the issue and the lack of remediation options.

Core product bug. For example: a user cannot log in and receives a 503 error; email alerts are not being sent; a broken image appears in the console. These reports will go to DevOps for further assessment and remediation.

Classified bugs will also be provisionally categorized on two more axes: one for severity (Sev level) and one for resolution difficulty (DevOps level).

Sev levels are as follows:

Sev 1: Complete outage
Sev 2: Major functionality broken and revenue affected
Sev 3: Minor problem, bug
Sev 4: Redundant component failure
Sev 5: False alarm or alert for something you can’t fix

DevOps levels are as follows:

DevOps 1: Typically config level. e.g., a log file has grown too large. Can be investigated and resolved automatically or by a junior technician.
DevOps 2: Typically config level. e.g., wrong Lambda alias in use. Can be investigated and resolved by a junior technician.
DevOps 3: Typically CI/CD or code level. e.g., automated deployment process broken. Can be investigated and resolved by 1 - 2 developers.
DevOps 4: Code level. e.g., introduced bug in crawling system with limited reproducibility. Can be investigated and resolved by 1 - 2+ senior developers.

There are many combinations of severity and difficulty. Generally, we make the following guarantees IN ADDITION TO the initial 12 hour investigation period:

Sev 1 - 2, DevOps 1 - 2: Resolution within 12 hours.
Sev 1 - 2, DevOps 3: Resolution within 48 hours.
Sev 1 - 2, DevOps 4: Resolution within 1 week.
Sev 3 - 4, DevOps 1 - 2: Resolution within 2 weeks.
Sev 3 - 4, DevOps 3 - 4: Resolution within 3 months.

It's worth noting that the vast majority of issues fall in the Sev 1, DevOps 1 - 2 and Sev 3, DevOps 1 - 3 categories. DevOps 4 is rare; significant outages can occur, but they are usually the result of a basic error and trivially fixed. Our practices around automation and test-driven development (TDD) limit these as much as possible.
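
To make the guarantees concrete, here is a minimal Python sketch of how the Sev / DevOps resolution windows could be looked up. It is an illustration only, not our production tooling. The names `TRIAGE_WINDOW`, `RESOLUTION_WINDOWS`, `resolution_window`, and `resolution_deadline` are hypothetical. The sketch assumes that "3 months" means 90 days, and that "in addition to" means the resolution window starts once the 12 hour investigation period ends. Sev 5 has no listed guarantee, so it returns nothing.

```python
from datetime import datetime, timedelta
from typing import Optional

# Initial investigation period that applies to every reported issue.
TRIAGE_WINDOW = timedelta(hours=12)

# (Sev levels, DevOps levels, resolution window), taken from the guarantees above.
RESOLUTION_WINDOWS = [
    (range(1, 3), range(1, 3), timedelta(hours=12)),
    (range(1, 3), range(3, 4), timedelta(hours=48)),
    (range(1, 3), range(4, 5), timedelta(weeks=1)),
    (range(3, 5), range(1, 3), timedelta(weeks=2)),
    # Assumption: "3 months" is approximated as 90 days.
    (range(3, 5), range(3, 5), timedelta(days=90)),
]


def resolution_window(sev: int, devops: int) -> Optional[timedelta]:
    """Return the guaranteed resolution window, or None if none is listed."""
    for sev_levels, devops_levels, window in RESOLUTION_WINDOWS:
        if sev in sev_levels and devops in devops_levels:
            return window
    return None  # e.g. Sev 5 (false alarm): no resolution guarantee listed


def resolution_deadline(
    reported_at: datetime, sev: int, devops: int
) -> Optional[datetime]:
    """Latest resolution time: report time + triage period + resolution window."""
    window = resolution_window(sev, devops)
    if window is None:
        return None
    return reported_at + TRIAGE_WINDOW + window
```

For example, under these assumptions a Sev 1, DevOps 3 issue reported at midnight on 1 January would be due by noon on 3 January: 12 hours of investigation plus 48 hours of resolution time.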