Incident management and resolution process

Definition

An incident is a confirmed malfunction, performance degradation, service disruption, or error within a Fluid Attacks product or component. To qualify, the issue must have already affected, or have the potential to affect, at least two users, whether or not it has been reported. This criterion applies universally, including internal issues affecting hackers or other team members.

Incident Manager

The Incident Manager is responsible for creating and managing incidents from the time of detection until the incident postmortem is published.

Assigned developer

The assigned developer is responsible for resolving the issue. Their duty is to address the case promptly and with the utmost priority, keeping the Incident Desk updated on all progress, obstacles, or any eventuality that arises during the process. Furthermore, the assigned developer provides the Incident Manager with all the information needed to identify and adequately document the root cause and to write the postmortem.

Detection

Detection occurs when an external or internal user identifies an incident and subsequently reports it for publication and resolution.

Incidents are usually reported through help@fluidattacks.com, but they can also be communicated through an issue, internal chat, a direct email, or other means of contact.

Issue

When an incident is detected, the help area generates a detailed issue describing the problem. It is the responsibility of the Incident Manager to assign a developer to resolve it. While most issues originate from the help area, they can also be created by other Fluid Attacks employees or external users.

Publication

After assigning a developer, the Incident Manager must promptly access the incident panel to create the publication, considering the following fields:

  1. Incident name: Reflects the nature of the problem clearly and concisely

  2. Incident status: Upon publishing, the status should always be set to 'Identified'

  3. Message: Description of what is being affected and how

    1. Workaround: Description of how to avoid the issue through alternative methods, accompanied by a link to these instructions within the documentation

  4. Affected components: Use checkboxes to select the affected components, which include:

    1. Platform (Integrates): Responsible for the Fluid Attacks application and its API
    2. Web (Airs): Contains the Fluid Attacks homepage and all information about Fluid Attacks and its products
    3. Docs (Docs): Contains Fluid Attacks documentation
    4. Agent (Forces): Responsible for the client-side part of the DevSecOps agent
    5. Cloning (Melts): Allows downloading end users' code repositories and provides a few other utilities that Fluid Attacks hackers require
    6. Scanning (Machine): Security vulnerability detection tool that scans source code, infrastructure, and applications and reports their security problems
    7. Extensions (Retrieves): Visual Studio Code extension that visualizes vulnerabilities reported in the platform by pointing to each one's specific file and line of code
    8. Mailing: Responsible for sending notifications to users of Fluid Attacks products

  5. Severity: Options include major outage, partial outage, degraded performance, and operational

  6. Notifications: Keep subscribers informed (ensure this checkbox remains selected)

  7. Long-running incident reminders: Ensures admins are regularly reminded that the incident is still ongoing (keep this checkbox selected and set reminders at one-hour intervals)
Once the above fields are completed, click on the Create button. The incident will be immediately published on status.fluidattacks.com, and subscribers will be notified.
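For reference, the following sketch models the fields above as a small Python data structure with the defaults this process requires (initial status 'Identified', notifications on, one-hour reminders). The class, field, and value names are illustrative assumptions for this guide, not part of the incident panel or any Fluid Attacks tooling.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List

# Components selectable on the incident panel (see the list above)
COMPONENTS = {
    "Platform (Integrates)", "Web (Airs)", "Docs (Docs)", "Agent (Forces)",
    "Cloning (Melts)", "Scanning (Machine)", "Extensions (Retrieves)", "Mailing",
}


class Severity(Enum):
    MAJOR_OUTAGE = "major outage"
    PARTIAL_OUTAGE = "partial outage"
    DEGRADED_PERFORMANCE = "degraded performance"
    OPERATIONAL = "operational"


@dataclass
class IncidentPublication:
    """Fields filled in on the incident panel before clicking Create."""
    name: str                            # Reflects the nature of the problem
    message: str                         # What is affected, how, and the workaround
    affected_components: List[str]       # One or more entries from COMPONENTS
    severity: Severity
    status: str = "Identified"           # Always 'Identified' upon publishing
    notify_subscribers: bool = True      # Keep this checkbox selected
    long_running_reminders: bool = True  # Keep this checkbox selected
    reminder_interval_hours: int = 1     # Reminders at one-hour intervals

    def __post_init__(self) -> None:
        unknown = set(self.affected_components) - COMPONENTS
        if unknown:
            raise ValueError(f"Unknown components: {sorted(unknown)}")


# Hypothetical example publication
incident = IncidentPublication(
    name="Platform API returns intermittent errors",
    message="Users may see failed requests on the platform. Workaround: retry the operation.",
    affected_components=["Platform (Integrates)"],
    severity=Severity.PARTIAL_OUTAGE,
)
```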

Incident Desk

The Incident Desk is an internal chat group consisting of at least four members with extensive knowledge of the business. Its purpose is to provide support and guidance to the developer tasked with incident remediation.

Whenever a developer is assigned to resolve an incident, the Incident Manager adds them to the group and posts a description of the problem, who the accountable developer is, the issue's URL, and the incident's URL. This initiates a discussion thread around the incident.

The assigned developer is expected to dedicate 100% of their working time to resolving the incident. Additionally, they must promptly report any mitigations, solutions, obstacles, or other relevant updates regarding the case.

Upon successful resolution of the incident, the Incident Manager removes the developer from the Incident Desk group.

Follow-up

It is the Incident Manager's responsibility to stay in contact with the developer responsible for the incident, track its progress, and keep the Incident Desk updated on the incident's status, providing support if any obstacles arise.

For long-running incidents, the Incident Manager must report the incident status continually, whether through the Incident Desk, responses to the status page's interval notifications, or direct incident updates. Additionally, a brief update is posted on the status page every hour, providing clear and relevant information for understanding the incident's current status.

Closure

Once the incident is resolved, the Incident Manager concludes the process by updating the incident status to 'Resolved' with a concise two-line message indicating the resolution and that the affected component is operating normally again. Clicking the 'Update' button publishes the new incident state on the status page, sends notifications to subscribers, and unlocks the option to write a postmortem.
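As a small companion to the publication sketch above, the closing update could be modeled like this; the function name and dictionary keys are illustrative assumptions, not part of the status page tooling.

```python
from typing import Any, Dict


def build_resolution_update(component: str) -> Dict[str, Any]:
    """Compose the concise two-line 'Resolved' update described above."""
    message = (
        "The incident has been resolved.\n"
        f"{component} is operating normally again."
    )
    return {"status": "Resolved", "message": message, "notify_subscribers": True}


# Hypothetical usage for a platform incident
update = build_resolution_update("Platform (Integrates)")
print(update["message"])
```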

Postmortem

A public report detailing the incident is provided through four specific sections: Impact, Cause, Solution, and Conclusion. This report must be generated when the incident is closed ('Resolved') and can be published within the incident using the Write Postmortem option in the incident management panel.

The Incident Manager is responsible for writing and publishing the postmortem. This involves conducting an investigative process around the incident, seeking support from the developer who resolved it and other stakeholders who can provide necessary information.

The postmortem should be crafted for general understanding, avoiding technicalities, and striving for clarity and conciseness. Each of its sections is described below.

Impact

This section should include the following elements:

  1. How many users were affected and how
  2. Timestamp of the incident's timeline (in UTC-5, [1] <yy-mm-dd hh:mm>), where [1] is the date of the merge request that caused the incident
  3. How the incident was discovered, whether proactively (by someone on the Fluid Attacks team) or reactively (by a customer)
  4. Elapsed time until the incident's detection ([1] Time to detect was <elapsed_time>), where [1] is defined as Time To Detect (TTD), that is, the elapsed time from the date the failure reaches production to the date it is reported through any of Fluid Attacks' support channels
  5. Elapsed time until the incident's fix ([1] Time to fix was <elapsed_time>), where [1] is defined as Time To Fix (TTF), that is, the elapsed time from the date the failure is reported through any of Fluid Attacks' support channels to the date the incident is resolved
  6. Who reported the incident and through which medium it was communicated
  7. Elapsed time of the incident's total impact ([1] Time to recover was <elapsed_time>), where [1] is defined as Time To Recover (TTR), that is, the elapsed time from the date the failure reaches production to the date the incident is resolved
  8. The reference to the issue created to resolve the incident

For TTD, TTF, and TTR, the unit of measurement for <elapsed_time> is chosen as follows (see the sketch after this list):

  1. if elapsed_time < 1 hour, then X minutes
  2. if 1 hour < elapsed_time < 1 day, then X hours
  3. if 1 day < elapsed_time < 1 month, then X days

These three metrics are always expressed with a precision of up to one decimal place.
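As a minimal illustration of the unit rules above, the following Python function computes an elapsed time between two hypothetical timestamps and formats it with one decimal place; it is illustrative only and not part of any Fluid Attacks tooling.

```python
from datetime import datetime


def format_elapsed_time(start: datetime, end: datetime) -> str:
    """Format TTD/TTF/TTR with one decimal place, per the unit rules above."""
    minutes = (end - start).total_seconds() / 60
    if minutes < 60:           # elapsed_time < 1 hour
        return f"{minutes:.1f} minutes"
    hours = minutes / 60
    if hours < 24:             # 1 hour < elapsed_time < 1 day
        return f"{hours:.1f} hours"
    return f"{hours / 24:.1f} days"  # 1 day < elapsed_time (< 1 month per the rules)


# Hypothetical timestamps (UTC-5): failure reaches production, then is reported
deployed = datetime(2024, 5, 1, 9, 0)
reported = datetime(2024, 5, 1, 12, 30)
print(f"Time to detect was {format_elapsed_time(deployed, reported)}")  # 3.5 hours
```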

Cause

This section requires a thorough investigation to identify the root cause or causes behind the incident. At this point, it is essential to collaborate with the developer responsible for remediation and any other stakeholder who can provide the necessary information to identify and write up the cause successfully. By the end of this section, it should be clear what precisely the root cause of the incident was. For this section, it may be helpful to rely on the 5 Whys Technique.

If the cause was introduced in a merge request, this URL must be linked as a reference (it is always the merge request URL and not the commit URL).

Solution

This section requires collaboration with the developer who remediated the incident and aims to clearly explain the remediation process.

If code intervention was involved in resolving the incident, include the URL of the merge request (not the commit) that fixed the problem as a reference.

Conclusion

In this section, the Incident Manager should document the lessons learned during the incident process. This includes understanding what architectural aspects allowed the problem to reach the user (production environment), detailing the actions taken by the responsible developer to remediate the root cause, and finally, adding a Taxonomy term or Taxonomy tree.

Taxonomy term

These terms serve as tags to categorize various incidents that may occur. Their purpose is to succinctly and systematically summarize and classify the nature of a problem or error. This provides a quick and structured understanding of why a particular issue reached the production environment.

COMMUNICATION_FAILURE: Occurs when the behavior was expected but was not specified in the documentation, or there were insufficient instructions to perform the task correctly
DATA_QUALITY: Refers to a situation in which the data used within a process is inadequate or insufficient in accuracy, completeness, or relevance, potentially leading to erroneous results or compromised outcomes
FAILED_LINTER: Occurs when the linter and its configured rules do not identify a syntax error
FAILED_MIGRATION: Denotes an error during a migration process, negatively impacting processes or data integrity
IMPOSSIBLE_TO_TEST: Occurs when the flow or nature of the operation does not allow for testing
INCOMPLETE_PERSPECTIVE: Refers to a situation where certain aspects were not considered during the planning or development of the functionality or process, resulting in failures or unexpected behavior
INFRASTRUCTURE_ERROR: Refers to a situation where misconfigurations in the infrastructure-as-code setup result in component failures, reduced performance, or even complete service interruptions
LACK_OF_TRACEABILITY: Occurs when insufficient loggers or tools are available to identify the root cause of the error
MISSING_ALERT: Occurs when technologies in the development environment and continuous integration do not trigger an alert on the error, allowing it to go unnoticed in production
MISSING_TEST: Refers to the absence of specific tests, which can lead to undetected errors and result in product failures
NO_SPECIFIED: Refers to an error whose nature has no reasonable or clear explanation
PERFORMANCE_DEGRADATION: Refers to a decrease in system performance that impacts the user experience or overall operational efficiency
ROTATION_FAILURE: Occurs when failures in the credential rotation process cause specific components to lose access to third-party services, compromising the operability, performance, or availability of Fluid Attacks' products
THIRD_PARTY_CHANGE: Refers to a situation wherein a third-party service or technology provider implements changes in their infrastructure, potentially resulting in operational disruptions or unanticipated behaviors for end users
THIRD_PARTY_ERROR: Refers to a disruption caused by the failure or downtime of a third-party service provider, leading to unavailability or reduced functionality
UNHANDLED_EXCEPTION: Occurs when an exception happens without an associated error-handling mechanism


Taxonomy tree

A taxonomy tree is a hierarchical structure that describes a complex problem's root cause. It is employed when the incident arises from a series of interconnected errors or failures rather than having a single clear cause. The hierarchical structure helps break down the problem into its parts and connections, making it easier to identify the main issue. The structure can be understood as follows:

Principal Failure < Primary Cause < Secondary Cause < N Cause

The structure can be interpreted as:

Failure Occurred Due to < Which Was Caused By < Which Was In Turn Caused By

Here is an example: DATA_QUALITY < FAILED_MIGRATION
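A minimal sketch of this notation, assuming the terms from the table above as the allowed vocabulary; the names used here are illustrative and not part of any Fluid Attacks tooling.

```python
# Taxonomy terms defined in the table above
TAXONOMY_TERMS = frozenset({
    "COMMUNICATION_FAILURE", "DATA_QUALITY", "FAILED_LINTER", "FAILED_MIGRATION",
    "IMPOSSIBLE_TO_TEST", "INCOMPLETE_PERSPECTIVE", "INFRASTRUCTURE_ERROR",
    "LACK_OF_TRACEABILITY", "MISSING_ALERT", "MISSING_TEST", "NO_SPECIFIED",
    "PERFORMANCE_DEGRADATION", "ROTATION_FAILURE", "THIRD_PARTY_CHANGE",
    "THIRD_PARTY_ERROR", "UNHANDLED_EXCEPTION",
})


def taxonomy_tree(*terms: str) -> str:
    """Render 'Principal Failure < Primary Cause < ... < N Cause'."""
    for term in terms:
        if term not in TAXONOMY_TERMS:
            raise ValueError(f"Unknown taxonomy term: {term}")
    return " < ".join(terms)


# The example from this section
print(taxonomy_tree("DATA_QUALITY", "FAILED_MIGRATION"))  # DATA_QUALITY < FAILED_MIGRATION
```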

To ensure transparency and keep stakeholders informed, an incidents page is available.