Incident management and resolution process | Fluid Attacks Help

Incident management and resolution process


Definition

Confirmed issues include malfunctions, performance degradation, service disruptions, or errors within Fluid Attacks products or components. These issues must have already affected, or have the potential to affect, at least two users, whether or not they have been reported. This criterion applies universally, including to internal issues that affect hackers or team members.

Incident Manager

The Incident Manager is responsible for creating and managing incidents from the time of detection until the incident Postmortem is published.

Assigned developer

The assigned developer is responsible for resolving the Issue. They must address the case promptly and with the utmost priority, and keep the Incident Desk updated on all progress, obstacles, and any other eventuality that arises during the process. Furthermore, the assigned developer provides the Incident Manager with all the information needed to identify and adequately document the root cause and to write the Postmortem.

Detection

This occurs when an external or internal user identifies an incident and subsequently reports it for publication and resolution.

Incidents are usually reported through help@fluidattacks.com, but they can also be reported through an Issue, an internal chat, a direct email, or other means of contact.

Issue

When an incident is detected, the help area generates a detailed Issue describing the problem. It is the responsibility of the Incident Manager to assign a developer to resolve it. While most issues originate from the help area, they can also be created by other Fluid Attacks employees or external users.

Publication

After assigning a developer, the Incident Manager must promptly access the incident panel to create the publication, considering the following fields:

  1. Incident name: Reflect the nature of the problem clearly and concisely.

  2. Incident status: Upon publishing, the status should always be set to Identified.

  3. Message: Description of what is being affected and how.

    1. Workaround: Description of how to resolve the issue through alternative methods, accompanied by a link to these instructions within the documentation.

  4. Affected components: Use checkboxes to select the affected components, which include:

    1. Platform (Integrates): Responsible for the Fluid Attacks application and its API.
    2. Web (Airs): Contains our homepage and all information about Fluid Attacks and its products.
    3. Docs (Docs): Contains our documentation.
    4. Agent (Forces): Responsible for the client-side part of the DevSecOps agent.
    5. Cloning (Melts): Allows downloading the End Users' code repositories and a few other utilities that Fluid Attacks hackers require.
    6. Scanning (Machine): Security vulnerability detection tool that scans source code, infrastructure, and applications and reports its security problems.
    7. Extensions (Retrieves): Visual Studio Code extension to visualize reported vulnerabilities in the Platform by pointing to its specific file and line of code.
    8. Mailing: This component is responsible for sending notifications to users of Fluid Attacks products.

  5. Severity: Options include either major outage, partial outage, degraded performance, or operational.

  6. Notifications: Ensure the checkbox remains selected to keep subscribers informed.

  7. Long-running incident reminders: Keep this checkbox selected and set reminders at one-hour intervals. This ensures admins are regularly reminded that the incident is still ongoing.

Once the above fields are completed, click on the Create button. The incident will be immediately published on status.fluidattacks.com, and subscribers will be notified.
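
The fields above can be pictured as a small data structure with a pre-publication check. This is only an illustrative sketch: the class and method names (`IncidentPublication`, `problems`) are hypothetical and do not correspond to any real incident panel API; only the field names, component names, and severity options come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Component(Enum):
    PLATFORM = "Platform (Integrates)"
    WEB = "Web (Airs)"
    DOCS = "Docs (Docs)"
    AGENT = "Agent (Forces)"
    CLONING = "Cloning (Melts)"
    SCANNING = "Scanning (Machine)"
    EXTENSIONS = "Extensions (Retrieves)"
    MAILING = "Mailing"


class Severity(Enum):
    MAJOR_OUTAGE = "major outage"
    PARTIAL_OUTAGE = "partial outage"
    DEGRADED_PERFORMANCE = "degraded performance"
    OPERATIONAL = "operational"


@dataclass
class IncidentPublication:
    """Hypothetical model of the publication form (not a real API)."""

    name: str
    message: str
    affected_components: List[Component]
    severity: Severity
    status: str = "Identified"        # always Identified when publishing
    workaround: Optional[str] = None  # optional, part of the message
    notify_subscribers: bool = True   # checkbox should remain selected
    reminder_interval_hours: int = 1  # one-hour reminders

    def problems(self) -> List[str]:
        """Return the ways this draft deviates from the process above."""
        found = []
        if not self.name.strip():
            found.append("Incident name is empty.")
        if not self.message.strip():
            found.append("Message is empty.")
        if not self.affected_components:
            found.append("No affected component selected.")
        if self.status != "Identified":
            found.append("Status must be Identified upon publishing.")
        if not self.notify_subscribers:
            found.append("Notifications checkbox is not selected.")
        if self.reminder_interval_hours != 1:
            found.append("Reminders must be set at one-hour intervals.")
        return found
```

A draft whose `problems()` list is empty satisfies every field rule listed above; anything else should be corrected before clicking Create.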

Incident desk

The Incident Desk is an internal chat group consisting of at least four members with extensive knowledge of the business. Its purpose is to provide support to the developer tasked with incident remediation.

Whenever a developer is assigned to resolve an incident, the Incident Manager adds them to the group and posts a description of the problem, who the accountable developer is, the issue's URL, and the incident's URL. This initiates a discussion thread around the incident.
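
As a rough illustration, the opening message of that thread could be assembled from the four required items. The function name, the labels, and the example URLs below are hypothetical; the process only prescribes which pieces of information the message must contain.

```python
def incident_desk_message(
    description: str,
    developer: str,
    issue_url: str,
    incident_url: str,
) -> str:
    """Build the message that opens the Incident Desk thread.

    The label wording is an assumption made for this sketch.
    """
    return "\n".join(
        [
            f"Problem: {description}",
            f"Accountable developer: {developer}",
            f"Issue: {issue_url}",
            f"Incident: {incident_url}",
        ]
    )


# Example with placeholder values.
print(
    incident_desk_message(
        "Platform login fails for some users",
        "Jane Doe",
        "https://example.com/issues/1",
        "https://example.com/incidents/1",
    )
)
```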

The assigned developer is expected to dedicate 100% of their working time to resolving the incident. Additionally, they must promptly inform the Incident Desk of any mitigations, solutions, obstacles, or other relevant updates regarding the case.

Upon successful resolution of the incident, the Incident Manager removes the developer from the Incident Desk group.

Follow-up

It is the Incident Manager's responsibility to stay in contact with the developer responsible for the incident, track its progress, keep the Incident Desk updated on the incident's status, and provide support if any obstacles arise.

For long-running incidents, the Incident Manager must constantly report the incident status, either through the crisis desk, responses to status page interval notifications, or directly through incident updates. Additionally, a brief update is provided every hour through the status page, giving clear and relevant information about the incident's current status.

Closure

Once the incident is resolved, the Incident Manager concludes the process by updating the incident status to Resolved with a concise two-line message indicating the resolution and that the affected component is operating normally again. Clicking the Update button publishes the new incident state on the status page, sends notifications to subscribers, and unlocks the option to write a Postmortem.

Postmortem

A public report detailing the incident is provided through four specific sections: Impact, Cause, Solution, and Conclusion. This report must be generated when the incident is closed (Resolved) and can be published within the incident using the Write Postmortem option in the incident management panel.

The Incident Manager is responsible for writing and publishing the postmortem. This involves conducting an investigative process around the incident, seeking support from the developer who resolved it and other stakeholders who can provide necessary information.

The postmortem should be crafted for general understanding, avoiding technicalities, and striving for clarity and conciseness. Each of its sections is described below:

Impact

This section should be written with the following elements:

  1. How and how many users were affected.

  2. Timestamp of the incident's timeline (on UTC-5 [1] <yy-mm-dd hh:mm>) where:

    1. [1] The date of the merge request that caused the incident.

  3. How the incident was discovered, whether proactively (by someone on the Fluid Attacks team) or reactively (by a customer).

  4. Elapsed time of the incident's detection ([1] Time to detect was <elapsed_time>) where:

    1. [1] Defined as Time To Detect (TTD), it is the elapsed time from the date the failure reaches production to the date it is reported through any of our support channels. The unit of measurement for <elapsed_time> must be defined as follows:

      1. if elapsed_time < 1 hour, then X minutes.
      2. if 1 day > elapsed_time > 1 hour, then X hours.
      3. if 1 month > elapsed_time > 1 day, then X days.

      Always with a precision of up to one decimal place.

  5. Elapsed time of the incident's fix ([1] Time to fix was <elapsed_time>) where:

    1. [1] Defined as Time To Fix (TTF), it is the elapsed time from the date the failure is reported through any of our support channels to the date the incident is resolved. The unit of measurement for <elapsed_time> must be defined as follows:

      1. if elapsed_time < 1 hour, then X minutes.
      2. if 1 day > elapsed_time > 1 hour, then X hours.
      3. if 1 month > elapsed_time > 1 day, then X days.

      Always with a precision of up to one decimal place.

  6. Who reported the incident and through which medium it was communicated.

  7. Elapsed time of the incident's total impact ([1] Time to recover was <elapsed_time>) where:

    1. [1] Defined as Time To Recover (TTR), it is the elapsed time from the date the failure reaches production to the date the incident is resolved. The unit of measurement for <elapsed_time> must be defined as follows:

      1. if elapsed_time < 1 hour, then X minutes.
      2. if 1 day > elapsed_time > 1 hour, then X hours.
      3. if 1 month > elapsed_time > 1 day, then X days.

      Always with a precision of up to one decimal place.

  8. A reference to the Issue created to resolve the incident. Including it is essential.
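
The three elapsed-time metrics and the unit rules above can be sketched in a few lines. This is an illustrative helper, not official tooling: the function names are hypothetical, inputs are assumed to be timezone-aware datetimes, and values that fall exactly on a boundary (for example, exactly 1 hour) are assigned to the larger unit, which the rules above leave unspecified.

```python
from datetime import datetime, timedelta, timezone

UTC_MINUS_5 = timezone(timedelta(hours=-5))


def format_timestamp(moment: datetime) -> str:
    """Render a timezone-aware datetime as <yy-mm-dd hh:mm> on UTC-5."""
    return moment.astimezone(UTC_MINUS_5).strftime("%y-%m-%d %H:%M")


def format_elapsed(delta: timedelta) -> str:
    """Pick minutes, hours, or days and keep at most one decimal place."""
    seconds = delta.total_seconds()
    if seconds < 0:
        raise ValueError("elapsed time cannot be negative")
    if seconds < 3600:
        value, unit = seconds / 60, "minutes"
    elif seconds < 86400:  # assumption: exactly 1 hour counts as hours
        value, unit = seconds / 3600, "hours"
    else:  # assumption: exactly 1 day counts as days
        value, unit = seconds / 86400, "days"
    return f"{round(value, 1):g} {unit}"


def impact_metrics(
    reached_production: datetime,
    reported: datetime,
    resolved: datetime,
) -> dict:
    """Compute TTD, TTF, and TTR as defined in the list above."""
    return {
        "time_to_detect": format_elapsed(reported - reached_production),
        "time_to_fix": format_elapsed(resolved - reported),
        "time_to_recover": format_elapsed(resolved - reached_production),
    }


# Example with made-up times (UTC).
reached = datetime(2024, 3, 1, 14, 15, tzinfo=timezone.utc)
reported = datetime(2024, 3, 1, 14, 45, tzinfo=timezone.utc)
resolved = datetime(2024, 3, 1, 17, 15, tzinfo=timezone.utc)

print(format_timestamp(reached))  # 24-03-01 09:15
print(impact_metrics(reached, reported, resolved))
# {'time_to_detect': '30 minutes', 'time_to_fix': '2.5 hours',
#  'time_to_recover': '3 hours'}
```

Because TTD and TTF share the report time as an endpoint, TTR always equals TTD plus TTF, which gives a quick sanity check on the three figures before publishing.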

Cause

This section requires a thorough investigation to identify the root cause or causes behind the incident. At this point, it is essential to collaborate with the developer responsible for remediation and any other stakeholder who can provide the necessary information to identify and write up the cause successfully. By the end of this section, it should be clear what precisely the root cause of the incident was. For this section, it may be helpful to rely on the 5 Whys Technique.

If the cause was introduced in a Merge Request, its URL must be linked as a reference (always the Merge Request and not the commit).

Solution

This section requires collaboration with the developer who remediated the incident and aims to clearly explain the remediation process.

If code intervention was involved in resolving the incident, include the URL of the Merge Request that fixed the problem as a reference.

Conclusion

In this section, the Incident Manager should document the lessons learned during the incident process. This includes understanding what architectural aspects allowed the problem to reach the user (Production environment), detailing the actions taken by the responsible developer to remediate the root cause, and finally, adding a Taxonomy Term or Taxonomy Tree.

Taxonomy term

These terms serve as tags to categorize various incidents that may occur. Their purpose is to succinctly and systematically summarize and classify the nature of a problem or error. This provides a quick and structured understanding of why a particular issue reached the production environment.


| Term | Definition |
| --- | --- |
| COMMUNICATION_FAILURE | Occurs when it was an expected behavior, but it was not specified in the documentation, or there were insufficient instructions to perform the task correctly. |
| DATA_QUALITY | Refers to a situation in which the data used within a process is deemed inadequate or insufficient in accuracy, completeness, or relevancy, potentially leading to erroneous results or compromised outcomes. |
| FAILED_LINTER | Occurs when the linter and its configured rules do not identify a syntax error. |
| FAILED_MIGRATION | Denotes an error during a migration process, negatively impacting processes or data integrity. |
| IMPOSSIBLE_TO_TEST | Occurs when the flow or nature of the operation does not allow for testing. |
| INCOMPLETE_PERSPECTIVE | Refers to a situation where certain aspects were not considered during the planning or development of the functionality/process, resulting in failures or unexpected behavior. |
| INFRASTRUCTURE_ERROR | Refers to a situation when misconfigurations within the infrastructure as code setup result in component failures, reduced performance, or even complete service interruptions. |
| LACK_OF_TRACEABILITY | Occurs when insufficient loggers or tools are available to identify the root cause of the error. |
| MISSING_ALERT | Occurs when technologies in the development environment and the continuous integration do not trigger an alert on the error, allowing it to go unnoticed in production. |
| MISSING_TEST | Refers to the absence of specific tests, which can lead to undetected errors and result in product failures. |
| NO_SPECIFIED | Refers to an error that, regarding its nature, has no reasonable or clear explanation. |
| ROTATION_FAILURE | Occurs when failures in the credential rotation process lead to specific components losing access to third-party services, compromising the operability, performance, or availability of our products. |
| THIRD_PARTY_CHANGE | Refers to a situation wherein a third-party service or technology provider implements alterations in their infrastructure, potentially resulting in operational disruptions or unanticipated behaviors for their end-users. |
| THIRD_PARTY_ERROR | Refers to a disruption caused by the failure or downtime of a third-party service provider, leading to unavailability or reduced functionality. |
| UNHANDLED_EXCEPTION | Occurs when an exception happens without an associated error-handling mechanism. |

Taxonomy tree

A taxonomy tree is a hierarchical structure that describes the root cause of a complex problem. It is employed when the incident arises from a series of interconnected errors or failures rather than from a single clear cause. The hierarchical structure helps break down the problem into its parts and connections, making it easier to identify the main issue. The structure can be understood as follows:

Principal Failure < Primary Cause < Secondary Cause < N Cause

The structure can be interpreted as:

Failure Occurred Due to < Which Was Caused By < Which Was In Turn Caused By

Here is an example: DATA_QUALITY < FAILED_MIGRATION
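
A taxonomy tree written in this notation can be checked mechanically against the list of taxonomy terms. The following is a minimal sketch under the assumption that every node of the tree must be one of the terms defined in this document; the function names are hypothetical.

```python
from typing import List

# Terms copied from the Taxonomy term section of this document.
TAXONOMY_TERMS = frozenset(
    {
        "COMMUNICATION_FAILURE",
        "DATA_QUALITY",
        "FAILED_LINTER",
        "FAILED_MIGRATION",
        "IMPOSSIBLE_TO_TEST",
        "INCOMPLETE_PERSPECTIVE",
        "INFRASTRUCTURE_ERROR",
        "LACK_OF_TRACEABILITY",
        "MISSING_ALERT",
        "MISSING_TEST",
        "NO_SPECIFIED",
        "ROTATION_FAILURE",
        "THIRD_PARTY_CHANGE",
        "THIRD_PARTY_ERROR",
        "UNHANDLED_EXCEPTION",
    }
)


def parse_taxonomy_tree(tree: str) -> List[str]:
    """Split 'A < B < C' into its terms, principal failure first."""
    terms = [part.strip() for part in tree.split("<")]
    unknown = [term for term in terms if term not in TAXONOMY_TERMS]
    if unknown:
        raise ValueError(f"Unknown taxonomy terms: {unknown}")
    return terms


def explain(terms: List[str]) -> str:
    """Read the chain aloud: each term was caused by the next one."""
    return " was caused by ".join(terms)


chain = parse_taxonomy_tree("DATA_QUALITY < FAILED_MIGRATION")
print(chain)           # ['DATA_QUALITY', 'FAILED_MIGRATION']
print(explain(chain))  # DATA_QUALITY was caused by FAILED_MIGRATION
```

A single taxonomy term is simply a tree with one node, so the same check covers both cases.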

To ensure transparency and keep stakeholders informed, an Incidents Page is available.