Incident Center

Active service disruptions, MTTR metrics, and post-mortem resolution workflows.

1 Active 8 Resolved
Critical Outages
2
High Severity
3
Medium Severity
4
Mean Time to Resolve
45 min
INC-0891 critical resolved production

High CPU Usage on Production Cluster

2026-06-16 05:52:07
3h ago

CPU utilization exceeded 85% threshold on 3 of 5 nodes. Auto-scaling triggered but scale-out took 4 minutes due to AMI warm-up delay.

Owner / Assignee
john.doe@company.com
Affected Service
API Gateway
Outage Duration
52 minutes
Root Cause Diagnostic
Unoptimized DB query in /api/v2/reports endpoint executing full table scan. Query time spiked from 8ms to 4.2s under load.
Resolution Plan Notes
Added composite index on reports(created_at, user_id). Query time returned to 8ms. Added query timeout of 500ms as circuit breaker.
INC-0890 high resolved production

Container OOMKilled Restart Loop

2026-06-16 00:52:07
8h ago

prometheus container entered OOMKilled restart loop — 3 restarts within 10 minutes. Scrape targets were unreachable causing alerting gap.

Owner / Assignee
jane.smith@company.com
Affected Service
Prometheus
Outage Duration
1h 04m
Root Cause Diagnostic
Memory limit set to 256MB was insufficient for 47 scrape targets. Prometheus TSDB head block reached 280MB during retention compaction.
Resolution Plan Notes
Increased container memory limit to 1GB. Optimized scrape_interval from 15s to 30s for non-critical targets. Added --storage.tsdb.retention.size=800MB flag.
INC-0889 high investigating production

Database Replication Lag Spike

2026-06-16 08:22:07
30m ago

PostgreSQL replication lag on read replica reached 45 seconds during afternoon peak load. Reads routed to primary causing increased latency.

Owner / Assignee
ops-team
Affected Service
PostgreSQL Primary
Outage Duration
ongoing
Root Cause Diagnostic
Heavy analytical queries from BI dashboard executing on primary instance. WAL sender backlog accumulating faster than replica can apply.
Resolution Plan Notes
Under active mitigation. Moved BI queries to dedicated read replica. Increasing max_wal_senders and wal_keep_size on primary.
INC-0888 medium resolved production

CDN Cache Invalidation Failure

2026-06-15 20:52:07
12h ago

Stale assets served to ~12% of users in ap-southeast-1 region. Cache purge API call returned 504 timeout from CDN control plane.

Owner / Assignee
jane.smith@company.com
Affected Service
CDN Service
Outage Duration
18 minutes
Root Cause Diagnostic
CDN provider control plane API experienced elevated latency during a maintenance window that was not communicated via status page.
Resolution Plan Notes
Implemented retry logic with exponential backoff for cache invalidation calls. Added monitoring alert for CDN control plane API response time.
INC-0887 critical resolved production

Auth Service Token Validation Errors

2026-06-15 08:52:07
1d ago

JWT validation started failing for tokens issued before 08:30 UTC. 503 errors affecting approximately 800 concurrent users.

Owner / Assignee
john.doe@company.com
Affected Service
Auth Service
Outage Duration
42 minutes
Root Cause Diagnostic
Secret key rotation procedure did not complete gracefully. Old tokens signed with previous key were invalidated before TTL expiry.
Resolution Plan Notes
Rolled back secret key. Implemented dual-key verification window during rotation. Updated runbook for secret rotation procedure.
INC-0886 medium resolved production

Background Worker Queue Backup

2026-06-14 20:52:07
1d 12h ago

Email notification queue backed up to 24,000 pending jobs. Workers stopped processing due to Redis connection timeout configuration.

Owner / Assignee
ops-team
Affected Service
Background Worker
Outage Duration
28 minutes
Root Cause Diagnostic
Redis client connection_timeout set to 100ms was too aggressive. Network latency spike to 180ms in eu-west-1 caused all workers to drop connections.
Resolution Plan Notes
Increased connection_timeout to 2000ms. Added connection pool with health checks. Queue drained within 15 minutes after fix.
INC-0885 high resolved production

Elasticsearch Index Circuit Breaker Exception

2026-06-14 08:52:07
2d ago

Queries targeting logs index returned 429 Too Many Requests due to parent circuit breaker triggering (JVM memory pressure at 95.2%).

Owner / Assignee
ops-team
Affected Service
Background Worker
Outage Duration
1h 15m
Root Cause Diagnostic
Large wildcard aggregation query executed by automated reporting tool. JVM heap capacity exhausted trying to keep bucket states in memory.
Resolution Plan Notes
Configured indices.breaker.fielddata.limit to 40%. Added ES node JVM heap telemetry alert thresholds.
INC-0884 medium resolved staging

SSL Certificate Renewal Handshake Failure

2026-06-13 08:52:07
3d ago

Automated Let's Encrypt renewal agent failed authorization due to CAA record configuration mismatch on DNS provider.

Owner / Assignee
jane.smith@company.com
Affected Service
API Gateway
Outage Duration
45m
Root Cause Diagnostic
CAA record was restricting issuance to DigiCert only, blocking Let's Encrypt challenge validator.
Resolution Plan Notes
Added letsencrypt.org to CAA record set. Triggered renewal script manually.
INC-0883 medium resolved development

Docker Registry Connection Timeout

2026-06-12 08:52:07
4d ago

Internal runners failed to pull base alpine image. Registry authentication service experienced database deadlocks.

Owner / Assignee
john.doe@company.com
Affected Service
CDN Service
Outage Duration
32m
Root Cause Diagnostic
Database connection leakage in docker registry authentication handler under high concurrency.
Resolution Plan Notes
Patched docker registry token server dependencies. Restarted registry auth pods.