Confirmed issues include malfunctions, performance degradation, service disruptions, or errors within Fluid Attacks products or components. These issues must have already affected or have the potential to impact a minimum of two users, regardless of whether they have been reported or not. This criterion applies universally, including internal issues related to hackers or team members.
The Incident Manager is responsible for creating and managing incidents from the time of detection until the incident Postmortem is published.
Assigned developer
The assigned developer is responsible for resolving the Issue. Their duty involves promptly addressing the case with the utmost priority and keeping the Incident Desk updated on all progress, obstacles, or any other eventuality that arises during the process. Furthermore, the assigned developer provides all necessary information for the Incident Manager to identify and adequately document the root cause and to develop the Postmortem.
Detection
This occurs when an external or internal user identifies an incident and subsequently reports it for publication and resolution.
Incidents are usually reported through help@fluidattacks.com, but can also be notified through an Issue, an internal chat, a direct email, or other means of contact.
Issue
When an incident is detected, the help area generates a detailed Issue describing the problem. It is the responsibility of the Incident Manager to assign a developer to resolve it. While most issues originate from the help area, they can also be created by other Fluid Attacks employees or external users.
Publication
After assigning a developer, the Incident Manager must promptly access the incident panel to create the publication, considering the following fields:
Whenever a developer is assigned to resolve an incident, the Incident Manager adds them to the group and posts a description of the problem, who the accountable developer is, the issue's URL, and the incident's URL. This initiates a discussion thread around the incident.
The assigned developer is expected to dedicate 100% of their working time to resolving the incident. Additionally, they must promptly inform about any mitigations, solutions, obstacles, or other relevant updates regarding the case.
Upon successful resolution of the incident, the Incident Manager removes the developer from the Incident Desk group.
Follow-up
It is the Incident Manager's responsibility to stay in contact with the developer responsible for the incident, tracking its progress and keeping the Incident Desk updated on the incident's status, or providing support if any obstacles arise.
For long-running incidents, the Incident Manager must continually report the incident's status, whether through the crisis desk, responses to status page interval notifications, or direct incident updates. Additionally, a brief update is provided every hour through the status page, ensuring clear and relevant information for understanding the incident's current status.
Closure
Once the incident is resolved, the Incident Manager concludes the process by updating the incident status to Resolved with a concise two-line message indicating the resolution and that the affected component is operating normally again. Clicking the Update button publishes the new incident state on the status page, sends notifications to subscribers, and unlocks the option to write a Postmortem.
Postmortem
A public report detailing the incident is provided through four specific sections: Impact, Cause, Solution, and Conclusion. This report must be generated when the incident is closed (Resolved) and can be published within the incident using the Write Postmortem option in the incident management panel.
The Incident Manager is responsible for writing and publishing the postmortem. This involves conducting an investigative process around the incident, seeking support from the developer who resolved it and other stakeholders who can provide necessary information.
The postmortem should be crafted for general understanding, avoiding technicalities, and striving for clarity and conciseness. Each of its sections is described below:
Impact
This section should be written with the following elements:
- How and how many users were affected.
- Timestamp of the incident's timeline (at UTC-5 [1] <yy-mm-dd hh:mm> to [2] <yy-mm-dd hh:mm> | [3] Time to recover was <elapsed_time>) where:
- [1] The date of the merge request that caused the incident.
- [2] The date of the merge request that fixed the incident.
- [3] Defined as Time To Recover (TTR), it is the elapsed time from the date the failure is reported through any of our support channels until it is resolved.
The unit of measurement for <elapsed_time> must be expressed as follows:
- if elapsed_time < 1 hour, then <X minutes>.
- if 1 day > elapsed_time >= 1 hour, then <X hours>.
- if 1 month > elapsed_time >= 1 day, then <X days>.
Always with a precision of up to one decimal place.
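The elapsed-time rules above (minutes under one hour, hours under one day, days under one month, always one decimal place) can be sketched as a small helper. This is an illustrative sketch, not an official Fluid Attacks tool; the function name and the use of 24-hour days are assumptions.

```python
from datetime import datetime


def format_elapsed(start: datetime, end: datetime) -> str:
    """Format an elapsed time per the postmortem rules (illustrative sketch):
    minutes if < 1 hour, hours if < 1 day, days otherwise,
    always with a precision of one decimal place."""
    minutes = (end - start).total_seconds() / 60
    if minutes < 60:
        return f"{minutes:.1f} minutes"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hours"
    days = hours / 24
    return f"{days:.1f} days"


# Example: a failure introduced at 10:00 and fixed at 11:30 the same day
start = datetime(2024, 1, 2, 10, 0)
end = datetime(2024, 1, 2, 11, 30)
print(format_elapsed(start, end))  # 1.5 hours
```

The same formatting applies to both Time To Recover (TTR) and Time To Detect (TTD); only the start and end timestamps differ.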
- How the incident was discovered, whether proactively (by someone on the Fluid Attacks team) or reactively (by a customer).
- Timestamp of the incident's detection (at UTC-5 [1] <yy-mm-dd hh:mm> |[2] Time to detect was <elapsed_time>) where:
- [1] The date when the failure was reported through any of our support channels.
- [2] Defined as Time To Detect (TTD), it is the elapsed time from the date the failure reaches production to the date it is reported through any of our support channels. The unit of measurement for <elapsed_time> must be defined as follows:
- if elapsed_time < 1 hour, then <X minutes>.
- if 1 day > elapsed_time >= 1 hour, then <X hours>.
- if 1 month > elapsed_time >= 1 day, then <X days>.
Always with a precision of up to one decimal place.
- Describe how the incident was discovered and reported. It is essential to include the reference to the Issue created to resolve it.
Cause
This section requires a thorough investigation to identify the root cause or causes behind the incident. At this point, it is essential to collaborate with the developer responsible for remediation and any other stakeholder who can provide the necessary information to identify and write up the cause successfully. By the end of this section, it should be clear what precisely the root cause of the incident was. For this section, it may be helpful to rely on the 5 Whys Technique.
If the cause was introduced in a Merge Request, its URL must be linked as a reference (always the Merge Request and not the commit).
Solution
This section requires collaboration with the developer who remediated the incident and aims to clearly explain the remediation process.
If code intervention was involved in resolving the incident, include the URL of the Merge Request that fixed the problem as a reference.
Conclusion
In this section, the Incident Manager should document the lessons learned during the incident process. This includes understanding what architectural aspects allowed the problem to reach the user (Production environment), detailing the actions taken by the responsible developer to remediate the root cause, and finally, adding a Taxonomy Term or Taxonomy Tree.
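As a rough illustration of the four sections described above, a postmortem draft could start from a skeleton like the one below. The layout and placeholder wording are assumptions for illustration only, not an official Fluid Attacks template.

```python
# Hypothetical postmortem skeleton; the four section names follow the
# process documentation, everything in <angle brackets> is a placeholder.
POSTMORTEM_TEMPLATE = """\
Impact
<who and how many users were affected; how the incident was discovered>
(at UTC-5 <yy-mm-dd hh:mm> to <yy-mm-dd hh:mm> | Time to recover was <elapsed_time>)
(at UTC-5 <yy-mm-dd hh:mm> | Time to detect was <elapsed_time>)

Cause
<root cause, e.g. via the 5 Whys Technique; link the Merge Request that introduced it>

Solution
<how the incident was remediated; link the Merge Request that fixed it>

Conclusion
<lessons learned, plus a Taxonomy Term or Taxonomy Tree>
"""

print(POSTMORTEM_TEMPLATE)
```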
Taxonomy term
These terms serve as tags to categorize various incidents that may occur. Their purpose is to succinctly and systematically summarize and classify the nature of a problem or error. This provides a quick and structured understanding of why a particular issue reached the production environment.
Term | Definition |
COMMUNICATION_FAILURE | Occurs when the behavior was expected but was not specified in the documentation, or when the instructions to perform the task correctly were insufficient. |
DATA_QUALITY | Refers to a situation in which the data used within a process is deemed inadequate or insufficient in accuracy, completeness, or relevancy, potentially leading to erroneous results or compromised outcomes. |
FAILED_LINTER | Occurs when the linter and its configured rules do not identify a syntax error. |
FAILED_MIGRATION | Denotes an error during a migration process, negatively impacting processes or data integrity. |
IMPOSSIBLE_TO_TEST | Occurs when the flow or nature of the operation does not allow for testing. |
INCOMPLETE_PERSPECTIVE | Refers to a situation where certain aspects were not considered during the planning or development of the functionality/process, resulting in failures or unexpected behavior. |
INFRASTRUCTURE_ERROR | Refers to a situation when misconfigurations within the infrastructure as code setup result in component failures, reduced performance, or even complete service interruptions. |
LACK_OF_TRACEABILITY | Occurs when insufficient loggers or tools are available to identify the root cause of the error. |
MISSING_ALERT | Occurs when technologies in the development environment and the continuous integration do not trigger an alert on the error, allowing it to go unnoticed in production. |
MISSING_TEST | Refers to the absence of specific tests, which can lead to undetected errors and result in product failures. |
NO_SPECIFIED | Refers to an error that, given its nature, has no reasonable or clear explanation. |
ROTATION_FAILURE | Occurs when failures in the credential rotation process lead to specific components losing access to third-party services, compromising the operability, performance, or availability of our products. |
THIRD_PARTY_CHANGE | Refers to a situation wherein a third-party service or technology provider implements alterations in their infrastructure, potentially resulting in operational disruptions or unanticipated behaviors for their end-users. |
THIRD_PARTY_ERROR | Refers to a disruption caused by the failure or downtime of a third-party service provider, leading to unavailability or reduced functionality. |
UNHANDLED_EXCEPTION | Occurs when an exception happens without an associated error-handling mechanism. |
Taxonomy tree
It is a hierarchical structure that describes a complex problem's root cause. It is employed when the incident arises from a series of interconnected errors or failures rather than having a clear cause. The hierarchical structure helps break down the problem into its parts and connections, making it easier to identify the main issue. The structure can be understood as follows:
Principal Failure < Primary Cause < Secondary Cause < N Cause
The structure can be interpreted as:
Failure Occurred Due to < Which Was Caused By < Which Was In Turn Caused By
Here is an example: DATA_QUALITY < FAILED_MIGRATION
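A taxonomy tree like the example above can be read mechanically: splitting on "<" yields the principal failure first, followed by successively deeper causes. The helper below is an illustrative sketch; the function name is an assumption.

```python
def parse_taxonomy_tree(tree: str) -> list[str]:
    """Split a 'Principal Failure < Primary Cause < ...' chain into its terms.

    Illustrative sketch: the first element is the principal failure,
    and each following element is the cause of the one before it.
    """
    return [term.strip() for term in tree.split("<")]


chain = parse_taxonomy_tree("DATA_QUALITY < FAILED_MIGRATION")
print(chain)  # ['DATA_QUALITY', 'FAILED_MIGRATION']
```

Read left to right: the data-quality failure occurred, and it was caused by a failed migration.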
To ensure transparency and keep stakeholders informed, an Incidents Page is available.