API Gateway

API Gateway and Lambda APIs are critical components in our Universe ecosystem, serving as the interface for client communications and backend services:

| Service | Usage |
| --- | --- |
| Integrates | Main API for GraphQL operations and WebSocket |
| Tracks | API for events and auditing |
| Fixes | WebSocket API for real-time correction suggestions |
Note
Unlike DynamoDB, API Gateway and Lambda do generate detailed logs in CloudWatch. Debugging relies on:
  1. API Gateway Access Logs: HTTP/WebSocket requests with performance metrics.
  2. Lambda Execution Logs: Application logs, errors, and execution metrics.
  3. CloudWatch Metrics: Latency, errors, throttling, and throughput.

API Gateway Architecture 

Key API Services and Configurations

  1. Integrates API (GraphQL + WebSocket)
    1. Type: HTTP API v2 with WebSocket support
    2. Authentication: JWT tokens
    3. Logging: Detailed access logs in JSON format
    4. Metrics: Detailed metrics enabled
  2. Tracks API (REST)
    1. Type: HTTP API v2
    2. Authentication: AWS IAM
    3. Logging: Access logs with request/response information
    4. Throttling: Rate limit 10,000 req/s, burst 5,000
  3. Fixes API (WebSocket)
    1. Type: WebSocket API
    2. Logging: Connection and message logs
    3. Data tracing: Enabled for debugging

Implementation

API Gateway configurations are defined in Terraform:

resource "aws_apigatewayv2_stage" "tracks" {
  api_id      = aws_apigatewayv2_api.tracks.id
  auto_deploy = true
  name        = "prod"

  access_log_settings {
    destination_arn = aws_cloudwatch_log_group.tracks_access_logs.arn
    format = jsonencode({
      requestId      = "$context.requestId"
      httpMethod     = "$context.httpMethod"
      routeKey       = "$context.routeKey"
      status         = "$context.status"
      responseLength = "$context.responseLength"
      requestTime    = "$context.requestTime"
    })
  }

  default_route_settings {
    detailed_metrics_enabled = true
    throttling_burst_limit   = 5000
    throttling_rate_limit    = 10000
  }
}
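
The WebSocket settings listed for the Fixes API (connection and message logs, data tracing) map onto the same resource type. The following is a minimal sketch, not the actual Universe configuration: the resource names `aws_apigatewayv2_api.fixes` and `aws_cloudwatch_log_group.fixes_access_logs`, the stage name, and the logging level are assumptions made for illustration.

```hcl
# Hypothetical stage for the Fixes WebSocket API; all names are illustrative.
resource "aws_apigatewayv2_stage" "fixes" {
  api_id      = aws_apigatewayv2_api.fixes.id # assumed API resource
  auto_deploy = true
  name        = "prod" # assumed to mirror the tracks stage

  access_log_settings {
    destination_arn = aws_cloudwatch_log_group.fixes_access_logs.arn # assumed log group
    format = jsonencode({
      requestId    = "$context.requestId"
      connectionId = "$context.connectionId"
      eventType    = "$context.eventType"
      routeKey     = "$context.routeKey"
      status       = "$context.status"
      requestTime  = "$context.requestTime"
    })
  }

  default_route_settings {
    # Both settings below are supported for WebSocket APIs only.
    data_trace_enabled = true   # logs full message payloads, which may include sensitive data
    logging_level      = "INFO" # execution logging level: ERROR, INFO, or OFF
  }
}
```

Because data tracing writes message bodies to CloudWatch, it is usually worth confirming that payloads carry no secrets before enabling it on a production stage.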



Log groups for Lambda functions:

resource "aws_cloudwatch_log_group" "tracks_lambda_api" {
  name              = "/aws/lambda/tracks_lambda_api"
  retention_in_days = 90

  tags = {
    "fluidattacks:comp" = "tracks"
    "fluidattacks:line" = "cost"
    "Name"              = "tracks_lambda_api_function_log_group"
  }
}
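
The tracks stage references `aws_cloudwatch_log_group.tracks_access_logs`, which is not shown here. A minimal sketch of what that log group could look like follows; the log group name is an assumption based on the `/aws/api_gateway/[api-name]` path used elsewhere in this page, and the retention period is assumed to match the Lambda log group.

```hcl
# Sketch only: the name and retention are assumptions, not the actual definition.
resource "aws_cloudwatch_log_group" "tracks_access_logs" {
  name              = "/aws/api_gateway/tracks" # assumed from the /aws/api_gateway/[api-name] pattern
  retention_in_days = 90                        # assumed to mirror the Lambda log group

  # Tags are omitted because their real values are not given in this page.
}
```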


Monitoring with CloudWatch

Understanding API Monitoring Sources

API monitoring comes from three main sources:
  1. API Gateway Access Logs:
    1. HTTP/WebSocket requests with full details
    2. Access via: CloudWatch → Logs Insights → /aws/api_gateway/[api-name]
    3. Structured JSON format for analysis
  2. Lambda Execution Logs:
    1. Application logs and runtime errors
    2. Access via: CloudWatch → Logs Insights → /aws/lambda/[function-name]
    3. Includes custom logs and stack traces
  3. CloudWatch Metrics:
    1. API Gateway: Latency, errors, throttling
    2. Lambda: Duration, errors, concurrency
    3. Access via: CloudWatch → Metrics → AWS/ApiGateway or AWS/Lambda

Key Metrics to Monitor

  1. API Gateway Metrics (Namespace AWS/ApiGateway):
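    1. Latency - Time from API Gateway receiving a request to returning the response to the client
    2. IntegrationLatency - Time from API Gateway relaying the request to the backend until it receives the backend's response
    3. 4xx / 5xx - Client-side and server-side errors (REST APIs report these as 4XXError / 5XXError)
    4. Count - Number of API requests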
  2. Lambda Metrics (Namespace AWS/Lambda):
    1. Duration - Execution time
    2. Errors - Number of errors
    3. Throttles - Number of throttled invocations
    4. ConcurrentExecutions - Number of concurrent executions
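
To keep these metrics in one place, they can be grouped on a CloudWatch dashboard. The sketch below is an illustration rather than an existing Universe resource: the dashboard name, the region, and the widget layout are assumptions, and the function name `tracks_lambda_api` is inferred from the log group name `/aws/lambda/tracks_lambda_api`.

```hcl
# Hypothetical dashboard; name, region, and layout are illustrative.
resource "aws_cloudwatch_dashboard" "tracks_api" {
  dashboard_name = "tracks-api"

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = "metric"
        x      = 0
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "API Gateway latency (p95)"
          region = "us-east-1" # assumed region
          stat   = "p95"
          period = 60
          metrics = [
            ["AWS/ApiGateway", "Latency", "ApiId", aws_apigatewayv2_api.tracks.id],
            ["AWS/ApiGateway", "IntegrationLatency", "ApiId", aws_apigatewayv2_api.tracks.id],
          ]
        }
      },
      {
        type   = "metric"
        x      = 12
        y      = 0
        width  = 12
        height = 6
        properties = {
          title  = "Lambda errors and throttles"
          region = "us-east-1" # assumed region
          stat   = "Sum"
          period = 60
          metrics = [
            ["AWS/Lambda", "Errors", "FunctionName", "tracks_lambda_api"],    # inferred function name
            ["AWS/Lambda", "Throttles", "FunctionName", "tracks_lambda_api"], # inferred function name
          ]
        }
      },
    ]
  })
}
```

Plotting Latency next to IntegrationLatency makes it easier to tell whether slow responses come from the backend or from API Gateway itself.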

Debugging with CloudWatch Logs

Finding API Performance Issues

  1. Access logs:

    fields @timestamp, requestId, routeKey, status
    | filter status = 200
    | sort @timestamp desc

  2. API errors:

    fields @timestamp, routeKey, status, errorMessage
    | filter status >= 400
    | sort @timestamp desc
    | limit 20

  3. Connection events:

    fields @timestamp, connectionId, eventType, errorMessage
    | filter eventType = "CONNECT" or eventType = "DISCONNECT"
    | sort @timestamp desc

  4. Message failures:

    fields @timestamp, connectionId, eventType, errorMessage
    | filter eventType = "MESSAGE" and status >= 400
    | sort @timestamp desc
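
These queries can only return fields that the access log format actually emits. The tracks format shown under Implementation has no `errorMessage` or latency fields, so a format along the following lines would be needed for the error and performance queries to return data. This is a sketch under assumptions: the local name `tracks_access_log_format` and the JSON key names are hypothetical, while the `$context` variables are standard API Gateway HTTP API logging variables.

```hcl
locals {
  # Hypothetical extended access log format for the tracks stage.
  tracks_access_log_format = jsonencode({
    requestId          = "$context.requestId"
    routeKey           = "$context.routeKey"
    status             = "$context.status"
    requestTime        = "$context.requestTime"
    errorMessage       = "$context.error.message"      # queried as errorMessage above
    responseLatency    = "$context.responseLatency"    # total response time in ms
    integrationLatency = "$context.integrationLatency" # backend time in ms
  })
}
```

The stage would then use `format = local.tracks_access_log_format` inside `access_log_settings`. A WebSocket stage would additionally need `$context.connectionId` and `$context.eventType` for the connection and message queries.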

CloudWatch Alarms for API Gateway

Universe has centralized alarms for APIs defined in runs/vpc/infra/alarms.tf:

resource "aws_cloudwatch_metric_alarm" "lambda_error_alarm" {
  for_each = toset(var.lambda_names)

  alarm_name    = "${each.value}_Error_Alarm"
  metric_name   = "Errors"
  namespace     = "AWS/Lambda"
  period        = 60
  statistic     = "Sum"
  threshold     = 1
  alarm_actions = [aws_sns_topic.central_alarms.arn]
  # Excerpt: the provider also requires comparison_operator and
  # evaluation_periods, which are not shown here.
}




  1. API Gateway Alarms
    1. High Latency:
      1. Metric: Latency
      2. Threshold: > 5000ms
      3. Statistic: Average or p95
      4. Period: 1 minute
      5. Action: SNS notification
    2. High Error Rate:
      1. Metric: 5xx (5XXError on REST APIs)
      2. Threshold: > 10
      3. Statistic: Sum
      4. Period: 1 minute
      5. Action: SNS notification
    3. Throttling Events:
      1. Metric: Count of 429 responses (the 4xx metric is not broken down by status code, so this requires a metric filter on the access logs)
      2. Threshold: > 0
      3. Statistic: Sum
      4. Period: 1 minute
      5. Action: SNS notification
  2. Lambda API Alarms
    1. Function Errors:
      1. Metric: Errors
      2. Threshold: > 0
      3. Statistic: Sum
      4. Period: 1 minute
      5. Action: SNS notification
    2. High Duration:
      1. Metric: Duration
      2. Threshold: > 30000ms
      3. Statistic: p95
      4. Period: 1 minute
      5. Action: SNS notification
    3. Throttles:
      1. Metric: Throttles
      2. Threshold: > 0
      3. Statistic: Sum
      4. Period: 1 minute
      5. Action: SNS notification
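
Two of the API Gateway alarms listed above can be sketched in Terraform as follows. This is an illustration, not the content of alarms.tf: the alarm names, the custom namespace `Universe/ApiGateway`, the metric name `ThrottledRequests`, `evaluation_periods`, and `treat_missing_data` are all assumptions. The throttling alarm uses a log metric filter because CloudWatch's built-in 4xx metric cannot be narrowed to status 429.

```hcl
# High Latency: p95 above 5000 ms over a 1-minute period.
resource "aws_cloudwatch_metric_alarm" "tracks_api_high_latency" {
  alarm_name          = "tracks_api_High_Latency_Alarm" # hypothetical name
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  extended_statistic  = "p95" # percentiles use extended_statistic instead of statistic
  period              = 60
  evaluation_periods  = 1 # assumed; not stated in the list above
  comparison_operator = "GreaterThanThreshold"
  threshold           = 5000
  treat_missing_data  = "notBreaching" # assumed: no traffic should not alarm

  dimensions = {
    ApiId = aws_apigatewayv2_api.tracks.id
  }

  alarm_actions = [aws_sns_topic.central_alarms.arn]
}

# Throttling Events: count 429 responses found in the access logs.
resource "aws_cloudwatch_log_metric_filter" "tracks_api_throttled" {
  name           = "tracks_api_throttled_requests" # hypothetical name
  log_group_name = aws_cloudwatch_log_group.tracks_access_logs.name
  # status is written as a JSON string by the access log format, hence the quotes.
  pattern = "{ $.status = \"429\" }"

  metric_transformation {
    name      = "ThrottledRequests"   # hypothetical metric name
    namespace = "Universe/ApiGateway" # hypothetical custom namespace
    value     = "1"
  }
}

resource "aws_cloudwatch_metric_alarm" "tracks_api_throttling" {
  alarm_name          = "tracks_api_Throttling_Alarm" # hypothetical name
  namespace           = "Universe/ApiGateway"
  metric_name         = "ThrottledRequests"
  statistic           = "Sum"
  period              = 60
  evaluation_periods  = 1 # assumed
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  treat_missing_data  = "notBreaching" # assumed
  alarm_actions       = [aws_sns_topic.central_alarms.arn]
}
```

If the access log format ever emits `status` as an unquoted number, the filter pattern would need a numeric comparison instead of the quoted string.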