Active service disruptions, MTTR metrics, and post-mortem resolution workflows.
1 Active8 Resolved
Critical Outages
2
High Severity
3
Medium Severity
4
Mean Time to Resolve
45 min
INC-0891
critical
resolved
production
High CPU Usage on Production Cluster
2026-06-16 05:52:07
3h ago
CPU utilization exceeded 85% threshold on 3 of 5 nodes. Auto-scaling triggered but scale-out took 4 minutes due to AMI warm-up delay.
Owner / Assignee
john.doe@company.com
Affected Service
API Gateway
Outage Duration
52 minutes
Root Cause Diagnostic
Unoptimized DB query in /api/v2/reports endpoint executing full table scan. Query time spiked from 8ms to 4.2s under load.
Resolution Plan Notes
Added composite index on reports(created_at, user_id). Query time returned to 8ms. Added query timeout of 500ms as circuit breaker.
INC-0890
high
resolved
production
Container OOMKilled Restart Loop
2026-06-16 00:52:07
8h ago
prometheus container entered OOMKilled restart loop — 3 restarts within 10 minutes. Scrape targets were unreachable causing alerting gap.
Owner / Assignee
jane.smith@company.com
Affected Service
Prometheus
Outage Duration
1h 04m
Root Cause Diagnostic
Memory limit set to 256MB was insufficient for 47 scrape targets. Prometheus TSDB head block reached 280MB during retention compaction.
Resolution Plan Notes
Increased container memory limit to 1GB. Optimized scrape_interval from 15s to 30s for non-critical targets. Added --storage.tsdb.retention.size=800MB flag.
INC-0889
high
investigating
production
Database Replication Lag Spike
2026-06-16 08:22:07
30m ago
PostgreSQL replication lag on read replica reached 45 seconds during afternoon peak load. Reads routed to primary causing increased latency.
Owner / Assignee
ops-team
Affected Service
PostgreSQL Primary
Outage Duration
ongoing
Root Cause Diagnostic
Heavy analytical queries from BI dashboard executing on primary instance. WAL sender backlog accumulating faster than replica can apply.
Resolution Plan Notes
Under active mitigation. Moved BI queries to dedicated read replica. Increasing max_wal_senders and wal_keep_size on primary.
INC-0888
medium
resolved
production
CDN Cache Invalidation Failure
2026-06-15 20:52:07
12h ago
Stale assets served to ~12% of users in ap-southeast-1 region. Cache purge API call returned 504 timeout from CDN control plane.
Owner / Assignee
jane.smith@company.com
Affected Service
CDN Service
Outage Duration
18 minutes
Root Cause Diagnostic
CDN provider control plane API experienced elevated latency during a maintenance window that was not communicated via status page.
Resolution Plan Notes
Implemented retry logic with exponential backoff for cache invalidation calls. Added monitoring alert for CDN control plane API response time.
INC-0887
critical
resolved
production
Auth Service Token Validation Errors
2026-06-15 08:52:07
1d ago
JWT validation started failing for tokens issued before 08:30 UTC. 503 errors affecting approximately 800 concurrent users.
Owner / Assignee
john.doe@company.com
Affected Service
Auth Service
Outage Duration
42 minutes
Root Cause Diagnostic
Secret key rotation procedure did not complete gracefully. Old tokens signed with previous key were invalidated before TTL expiry.
Resolution Plan Notes
Rolled back secret key. Implemented dual-key verification window during rotation. Updated runbook for secret rotation procedure.
INC-0886
medium
resolved
production
Background Worker Queue Backup
2026-06-14 20:52:07
1d 12h ago
Email notification queue backed up to 24,000 pending jobs. Workers stopped processing due to Redis connection timeout configuration.
Owner / Assignee
ops-team
Affected Service
Background Worker
Outage Duration
28 minutes
Root Cause Diagnostic
Redis client connection_timeout set to 100ms was too aggressive. Network latency spike to 180ms in eu-west-1 caused all workers to drop connections.
Resolution Plan Notes
Increased connection_timeout to 2000ms. Added connection pool with health checks. Queue drained within 15 minutes after fix.
INC-0885
high
resolved
production
Elasticsearch Index Circuit Breaker Exception
2026-06-14 08:52:07
2d ago
Queries targeting logs index returned 429 Too Many Requests due to parent circuit breaker triggering (JVM memory pressure at 95.2%).
Owner / Assignee
ops-team
Affected Service
Background Worker
Outage Duration
1h 15m
Root Cause Diagnostic
Large wildcard aggregation query executed by automated reporting tool. JVM heap capacity exhausted trying to keep bucket states in memory.
Resolution Plan Notes
Configured indices.breaker.fielddata.limit to 40%. Added ES node JVM heap telemetry alert thresholds.
INC-0884
medium
resolved
staging
SSL Certificate Renewal Handshake Failure
2026-06-13 08:52:07
3d ago
Automated Let's Encrypt renewal agent failed authorization due to CAA record configuration mismatch on DNS provider.
Owner / Assignee
jane.smith@company.com
Affected Service
API Gateway
Outage Duration
45m
Root Cause Diagnostic
CAA record was restricting issuance to DigiCert only, blocking Let's Encrypt challenge validator.
Resolution Plan Notes
Added letsencrypt.org to CAA record set. Triggered renewal script manually.
INC-0883
medium
resolved
development
Docker Registry Connection Timeout
2026-06-12 08:52:07
4d ago
Internal runners failed to pull base alpine image. Registry authentication service experienced database deadlocks.
Owner / Assignee
john.doe@company.com
Affected Service
CDN Service
Outage Duration
32m
Root Cause Diagnostic
Database connection leakage in docker registry authentication handler under high concurrency.
Resolution Plan Notes
Patched docker registry token server dependencies. Restarted registry auth pods.