Introduction

Amazon CloudWatch serves as the central observability service we use at Fluid Attacks to monitor our services, collect and visualize real-time metrics, and respond to operational and security changes. This guide documents our specific implementation and best practices.

Why CloudWatch is Critical for Debugging

Debugging in cloud environments differs significantly from traditional debugging. It's no longer possible to simply launch a debugger or examine processes locally. Instead, we need tools that provide visibility into distributed and often ephemeral systems. CloudWatch offers:
  1. Centralized observability: A single point to investigate issues across multiple AWS services.
  2. Real-time and historical data: For analyzing both active incidents and recurring problems.
  3. Cross-service correlation: Ability to connect related events across different components.
  4. Powerful analysis tools: To identify patterns and trends in large volumes of data.
Essentially, CloudWatch provides the necessary tools to answer critical questions during debugging:
  1. What exactly is failing?
  2. When did the problem start?
  3. Is it affecting other components?
  4. Has it happened before?
  5. What changed before the problem appeared?

Use Cases

At Fluid Attacks, we use AWS CloudWatch primarily for:
  1. Production Debugging: Tracking execution flows and diagnosing errors in real time.
  2. Security Auditing: Recording sensitive actions performed by users for compliance and forensic purposes.
  3. Business Observability: Monitoring critical events in our service lifecycle.
  4. Incident Correlation: Linking logs, metrics, and traces for root cause analysis.

Logging Architecture

Core Components

  1. Log Groups: We use the integrates log group as our primary container.
  2. Log Streams: We separate by component or service to facilitate filtering.
  3. Metrics: We extract key metrics from logs for dashboard visualization.
  4. Alarms: We configure alerts based on specific patterns or metric thresholds.

Backend Implementation

Core Logging Utility

We've developed the cloudwatch_log function as a centralized point to send structured logs to CloudWatch. This utility ensures consistent formatting, proper context inclusion, and asynchronous operation to minimize performance impact:

# integrates/back/integrates/custom_utils/logs.py
TRANSACTIONS_LOGGER: logging.Logger = logging.getLogger("transactional")

def cloudwatch_log(
    request: Request | WebSocket | Any,
    msg: str,
    extra: dict,
    user_email: str = "",
) -> None:
    if user_email:
        TRANSACTIONS_LOGGER.info(
            msg,
            extra={
                "extra": {
                    "environment": FI_ENVIRONMENT,
                    "user_email": user_email,
                    **extra,
                }
            },
        )
    else:
        aio.to_background(cloudwatch_log_async(request, msg, extra))

When user_email is provided, the function logs synchronously with the specified user context. Otherwise, it delegates to an asynchronous function that extracts the user information from the request context without blocking the main execution thread.

The complementary asynchronous function automatically extracts the user context when not explicitly provided:

# integrates/back/integrates/custom_utils/logs.py
async def cloudwatch_log_async(
    request: Request | WebSocket | Any,
    msg: str,
    extra: dict,
) -> None:
    try:
        user_data = await sessions_domain.get_jwt_content(request)
    except (ExpiredToken, InvalidAuthorization):
        user_data = {"user_email": "unauthenticated"}

    TRANSACTIONS_LOGGER.info(
        msg,
        extra={
            "extra": {
                "environment": FI_ENVIRONMENT,
                "user_email": user_data["user_email"],
                **extra,
            }
        },
    )

The asynchronous approach is particularly important for our high-throughput APIs, as it prevents logging operations from blocking request processing.
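
The aio.to_background helper referenced in cloudwatch_log is an internal utility and is not reproduced in this guide. As a rough sketch of the idea only, not our actual implementation, a fire-and-forget dispatcher built directly on asyncio could look like this:

# Hypothetical sketch; the real aio.to_background utility differs, but the
# intent is the same: schedule the work and return without waiting for it.
import asyncio
from collections.abc import Coroutine
from typing import Any

def to_background(coroutine: Coroutine[Any, Any, Any]) -> None:
    # Submit the coroutine to the running event loop as a task so that
    # logging never blocks request handling.
    asyncio.get_running_loop().create_task(coroutine)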

Handler Configuration

We use Watchtower as a handler for Python logging, configured centrally through our logging configuration system. This handler efficiently transmits logs to CloudWatch while providing important features like queuing, batching, and throttling:

# integrates/back/integrates/settings/logging/handlers.py
def get_watchtower_handler(*, stream_name: str) -> dict:
    return {
        "boto3_client": BOTO3_SESSION.client("logs"),
        "class": "watchtower.CloudWatchLogHandler",
        "create_log_group": False,
        "create_log_stream": False,
        "filters": ["production_only"],
        "formatter": "json",
        "level": "INFO",
        "log_group_name": "integrates",
        "log_stream_name": stream_name,
        "use_queues": True,
    }

This configuration ensures:
  1. Structured JSON format: All logs are stored in a consistent JSON structure for easy querying and parsing.
  2. Automatic environment filtering: The production_only filter ensures development logs don't flood our production CloudWatch account.
  3. Asynchronous sending: The use_queues option enables background processing of log events, preventing application blocking.
  4. Separate streams by component: Each functional component uses a distinct stream for better organization.
  5. Consistent authentication: We use a pre-configured boto3 session with appropriate IAM credentials.
  6. Resource optimization: We don't automatically create log groups or streams, which are instead managed through infrastructure as code.
We have extended this basic configuration with additional features:
  1. Custom JSON formatters that properly handle complex Python objects.
  2. Throttling to prevent excessive API calls during log bursts.
  3. Retry logic for handling temporary AWS API failures.
  4. Graceful degradation when CloudWatch is unavailable.
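
For context, the sketch below shows how a handler definition like the one above might be wired into a standard logging.config.dictConfig setup. The stream name is an example, and the json formatter and production_only filter are reduced to trivial placeholders; the real settings module defines richer versions of both.

# Illustrative sketch only: wiring the Watchtower handler into dictConfig.
import logging.config

LOGGING_CONFIG = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Placeholder for the JSON formatter referenced by the handler.
        "json": {"format": "%(message)s"},
    },
    "filters": {
        # Placeholder for the production-only filter referenced by the handler.
        "production_only": {"()": "logging.Filter"},
    },
    "handlers": {
        "watchtower_api": get_watchtower_handler(stream_name="api"),
    },
    "loggers": {
        # Must match the logger name used by cloudwatch_log ("transactional").
        "transactional": {
            "handlers": ["watchtower_api"],
            "level": "INFO",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING_CONFIG)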

Usage Patterns

Security Audit Logging

We log access attempts to sensitive resources, especially for security operations. This allows us to maintain a comprehensive audit trail for regulatory compliance and security investigations:

# integrates/back/integrates/api/mutations/add_secret.py
logs_utils.cloudwatch_log(
    info.context,
    "Blocked attept to create ROOT type secret for GitRoot",    
    extra={
        "group_name": kwargs["group_name"],        
        "resource_type": resource_type,        
        "resource_id": kwargs["resource_id"],
        "root_type": "GitRoot",        
        "user_email": email,        
        "log_type": "Security",    
    },
)

In this example, we're documenting a security policy enforcement event where:
  1. A user attempted to create a sensitive resource (ROOT type secret).
  2. The system blocked this action based on security policies.
  3. We capture complete context, including the group, resource details, and user identity.
This type of logging is critical for security operations, as it allows us to:
  1. Detect potential security policy violations or misconfigurations.
  2. Provide evidence during security audits and compliance reviews.
  3. Create alerts for suspicious activity patterns.
  4. Support forensic investigations when needed.
Our security logs maintain a consistent structure with detailed contextual information, enabling the creation of comprehensive dashboards and reports for effective security oversight.
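
To turn events like the blocked attempt above into alerts, one option is a metric filter on the integrates log group combined with a CloudWatch alarm. The sketch below illustrates the approach; the filter, metric, namespace, and SNS topic names are hypothetical, and in practice the filter pattern would be narrowed beyond log_type to the specific events of interest.

# Illustrative sketch: derive a metric from Security logs and alarm on spikes.
import boto3

logs = boto3.client("logs")
cloudwatch = boto3.client("cloudwatch")

# Count Security-type log events; a real pattern would also match the
# specific blocked action (field names depend on the JSON formatter).
logs.put_metric_filter(
    logGroupName="integrates",
    filterName="security-events",
    filterPattern='{ $.extra.log_type = "Security" }',
    metricTransformations=[
        {
            "metricName": "SecurityEvents",
            "metricNamespace": "Integrates/Security",
            "metricValue": "1",
        }
    ],
)

# Notify the security team when the count spikes within a 5-minute window.
cloudwatch.put_metric_alarm(
    AlarmName="security-events-spike",
    Namespace="Integrates/Security",
    MetricName="SecurityEvents",
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:security-alerts"],
)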

State Change Logging

We document critical system changes for traceability, enabling us to understand what changed, when it changed, and who made the change:

# integrates/back/integrates/api/mutations/add_secret.py
logs_utils.cloudwatch_log(
    info.context,
    "Added secret",    extra={
        "group_name": kwargs["group_name"],
        "resource_type": resource_type,
        "resource_id": kwargs["resource_id"],
        "log_type": "Security",
    },
)

State change logs serve multiple critical purposes:
  1. Creating a chronological record of system modifications.
  2. Supporting rollback operations when issues arise.
  3. Providing context for troubleshooting application behavior.
  4. Enabling reconstruction of the system state at any point in time.
  5. Facilitating compliance with data governance requirements.
For particularly sensitive operations, we include additional context such as the previous state, specific fields that changed, and justification for the change.
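
As an illustration only, such a call might look like the sketch below; the previous_state, changed_fields, and justification keys are shown as examples, and the surrounding variables come from the mutation's scope.

# Hypothetical example of a state-change log that records the prior state.
logs_utils.cloudwatch_log(
    info.context,
    "Updated root state",
    extra={
        "group_name": group_name,
        "resource_type": "Root",
        "resource_id": root_id,
        "previous_state": old_state,          # state before the change
        "changed_fields": ["url", "branch"],  # fields modified by the operation
        "justification": "Repository migrated to a new origin",
        "log_type": "Security",
    },
)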

GraphQL Operation Logging

We maintain a record of all executed mutations to provide a complete picture of API activity and data modifications:

# integrates/back/integrates/api/__init__.py
logs_utils.cloudwatch_log(
    request,
    "GraphQL mutation executed",    extra={
        "operation": operation_name,        "log_type": "Audit",    },
)

GraphQL operation logging is particularly important because:
  1. It provides a high-level view of API usage patterns.
  2. It helps identify performance bottlenecks in specific operations.
  3. It creates an audit trail of data modifications across the system.
  4. It supports debugging complex GraphQL operations that touch multiple resolvers.
Our logging captures the operation name and type, along with authentication context, while carefully avoiding logging of sensitive parameter values that might contain PII or credentials.
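
The exact mechanism we use to keep sensitive values out of logs is not shown here. As a hedged sketch of the idea, with an illustrative helper name and key list, a filter over GraphQL variables could look like this:

# Illustrative sketch: redact potentially sensitive GraphQL variables before
# they are attached to a log event. The key list is an example only.
SENSITIVE_KEYS = {"password", "token", "secret", "authorization", "credentials"}

def sanitize_variables(variables: dict) -> dict:
    # Replace sensitive values with a fixed marker instead of logging them.
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in variables.items()
    }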

Field Taxonomy

To facilitate searches and analysis, we standardize the following fields in the extra structure. This consistent taxonomy makes it possible to build powerful queries, filters, and dashboards across our entire logging ecosystem:
Field           Description                                           Examples
environment     Execution environment identifier                      production, staging, development
user_email      Email of the user performing the action               user@fluidattacks.com, unauthenticated
log_type        Main event category for classification                Security, Audit, Business, Error, Performance
group_name      Affected group/project identifier                     group_123, acme_corp
resource_type   Type of resource involved in the operation            Secret, Root, Finding, Evidence, Organization
resource_id     Unique identifier for the affected resource           uuid, path/to/resource
operation       Name of the specific operation being performed        addSecret, updateFinding, deleteUser
status          Operation outcome status                              success, failure, blocked, partial
duration_ms     Time taken to complete the operation in milliseconds  127, 3542
error_code      Standardized error code, if applicable                AUTH_FAILED, RESOURCE_NOT_FOUND, VALIDATION_ERROR
service         Service or microservice handling the request          api, auth, notifications
version         Service version or deployment identifier              v2.3.1, build-5436
This taxonomy has evolved based on our operational needs and is consistently enforced through our centralized logging utilities. The standardized structure enables powerful analytical capabilities:
  1. We can track actions by specific users across multiple services.
  2. We can correlate errors with specific resource types or operations.
  3. We can measure performance characteristics by service or operation type.
  4. We can filter security events by group, resource type, or action outcome.
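
Putting the taxonomy together, a fully populated extra payload might look like the following example; all values are illustrative.

# Example `extra` payload following the field taxonomy; values are made up.
extra = {
    "environment": "production",
    "user_email": "user@fluidattacks.com",
    "log_type": "Security",
    "group_name": "acme_corp",
    "resource_type": "Secret",
    "resource_id": "9b1c2f04-0000-0000-0000-000000000000",
    "operation": "addSecret",
    "status": "blocked",
    "duration_ms": 127,
    "error_code": "VALIDATION_ERROR",
    "service": "api",
    "version": "v2.3.1",
}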

Monitored Services

CloudWatch is used in the following Fluid Attacks components:
  1. Integrates Backend: All GraphQL operations and critical domains, including user management, vulnerability findings, evidence handling, and security scanning operations. We maintain dedicated log streams for each major API component.
  2. Observes: ETL components and data analysis pipelines that process security findings and metrics. These components generate logs for data processing activities, extraction operations, and transformation steps.
  3. Infrastructure: Monitoring of ECS clusters, Lambda functions, and other AWS services supporting our platform. We collect both application-generated logs and AWS-provided infrastructure metrics to gain a complete view of system health.
  4. Security Services: Our security scanning engines, vulnerability verification systems, and remediation tracking tools all contribute logs to CloudWatch to provide complete visibility into the security assessment lifecycle.
  5. Client Applications: Frontend errors and performance metrics are captured and sent to CloudWatch, giving us insights into client-side operations and user experience.
Each service implements consistent logging patterns while adding domain-specific contextual information that enhances the value of the logs for that particular component.

Queries and Analysis with CloudWatch Logs Insights

Common Queries

# Security events by user in the last 24h
fields @timestamp, extra.user_email, message, extra.resource_type
| filter extra.log_type = "Security"
| sort @timestamp desc
| limit 100

This query helps our security team review all security-relevant events, organized by user, to identify potential suspicious patterns or unauthorized access attempts.

# Errors in critical operations
fields @timestamp, message, extra.operation
| filter level = "ERROR"
| sort @timestamp desc
| limit 50

This query surfaces the most recent errors and the operations that produced them, which is typically our starting point during incident triage.

Best Practices at Fluid Attacks

Through years of CloudWatch implementation experience, we've established the following best practices:
  1. Consistent Structure: Always use cloudwatch_log instead of direct logger calls to ensure standardized formatting and context inclusion. This consistent approach simplifies queries and analysis while ensuring that all necessary context is captured with every log event.
  2. Avoid Sensitive Information: Don't include passwords, tokens, or PII in logs to maintain compliance with data protection regulations. We've implemented automatic sanitization for common patterns like credit card numbers, social security numbers, and authentication tokens to prevent accidental exposure.
  3. Enriched Context: Always add relevant information in the extra field to make logs more valuable for analysis. Context-rich logs significantly reduce the time required for troubleshooting and provide better insights for operational and security analysis.
  4. Descriptive Messages: Use clear and concise messages that identify the action being performed or the event being recorded. Good message formatting makes logs more readable for humans while still supporting machine parsing.
  5. Optimal Log Levels: Use appropriate log levels (DEBUG, INFO, WARNING, ERROR) to differentiate between routine operations and exceptional conditions. This practice helps control log volume while ensuring that important events are properly highlighted.
  6. Correlation IDs: Include request IDs or correlation tokens in all logs related to a single transaction to enable end-to-end request tracking across services. These IDs allow us to reconstruct the complete flow of a request through our distributed system (a minimal sketch follows this list).
  7. Performance Awareness: Be mindful of the performance impact of logging, especially in high-throughput components. Our asynchronous logging approach and careful control of log verbosity help minimize this impact.
  8. Regular Log Analysis: Schedule regular reviews of log data to identify patterns, anomalies, or potential improvements. Proactive log analysis often reveals issues before they become critical incidents.
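
As a minimal sketch of practice 6, with hypothetical middleware and header names, a correlation ID can be generated once per request and then included in every extra payload:

# Illustrative sketch: attach a correlation ID to each request and reuse it in
# every log event. Middleware style and attribute names are examples only.
import uuid

async def correlation_id_middleware(request, call_next):
    # Reuse an incoming ID when the caller provides one; otherwise create one.
    request.state.correlation_id = request.headers.get(
        "x-correlation-id", str(uuid.uuid4())
    )
    return await call_next(request)

# Later, inside a resolver:
logs_utils.cloudwatch_log(
    info.context,
    "GraphQL mutation executed",
    extra={
        "operation": operation_name,
        "correlation_id": info.context.state.correlation_id,
        "log_type": "Audit",
    },
)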