Incident Center | DevOps Command Center

INC-0891 critical resolved production

High CPU Usage on Production Cluster

2026-06-16 05:52:07

3h ago

CPU utilization exceeded the 85% threshold on 3 of 5 nodes. Auto-scaling triggered, but scale-out took 4 minutes because of AMI warm-up delay.

Owner / Assignee

john.doe@company.com

Affected Service

API Gateway

Outage Duration

52 minutes

Root Cause Diagnostic

An unoptimized DB query in the /api/v2/reports endpoint was executing a full table scan. Query time spiked from 8ms to 4.2s under load.

Resolution Plan Notes

Added a composite index on reports(created_at, user_id); query time returned to 8ms. Added a 500ms query timeout as a safeguard against runaway queries.
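
The effect of the index can be reproduced in miniature. The sketch below uses Python's built-in sqlite3 purely as a stand-in, since this record does not state the production database engine. The `body` column, the index name `idx_reports_created_user`, the sample rows, and the query shape are all assumptions made for illustration.

```python
import sqlite3


def plan_for(conn, sql, params=()):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); the detail column is the text.
    return " | ".join(row[3] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reports ("
    "id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT, body TEXT)"
)
# Synthetic rows: 50 hypothetical users spread across 28 days.
conn.executemany(
    "INSERT INTO reports (user_id, created_at, body) VALUES (?, ?, ?)",
    [(i % 50, f"2026-06-{1 + i % 28:02d}", "x") for i in range(1000)],
)

# Hypothetical query shape; the real endpoint's SQL is not in this record.
QUERY = "SELECT id FROM reports WHERE created_at >= ? AND user_id = ?"
ARGS = ("2026-06-10", 7)

plan_before = plan_for(conn, QUERY, ARGS)  # no index yet: full table scan
conn.execute(
    "CREATE INDEX idx_reports_created_user ON reports (created_at, user_id)"
)
plan_after = plan_for(conn, QUERY, ARGS)  # planner can now use the index

print("before:", plan_before)
print("after: ", plan_after)
```

Whether (created_at, user_id) or the reverse column order serves the real query better depends on its predicates, which this record does not show.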

INC-0890 high resolved production

Container OOMKilled Restart Loop

2026-06-16 00:52:07

8h ago

The prometheus container entered an OOMKilled restart loop, with 3 restarts within 10 minutes. Targets went unscraped during the restarts, causing an alerting gap.

Owner / Assignee

jane.smith@company.com

Affected Service

Prometheus

Outage Duration

1h 04m

Root Cause Diagnostic

The 256MB memory limit was insufficient for 47 scrape targets. The Prometheus TSDB head block reached 280MB during retention compaction.

Resolution Plan Notes

Increased the container memory limit to 1GB. Relaxed scrape_interval from 15s to 30s for non-critical targets. Added the --storage.tsdb.retention.size=800MB flag.

INC-0889 high investigating production

Database Replication Lag Spike

2026-06-16 08:22:07

30m ago

PostgreSQL replication lag on the read replica reached 45 seconds during afternoon peak load. Reads were rerouted to the primary, increasing latency.

Owner / Assignee

ops-team

Affected Service

PostgreSQL Primary

Outage Duration

ongoing

Root Cause Diagnostic

Heavy analytical queries from BI dashboard executing on primary instance. WAL sender backlog accumulating faster than replica can apply.

Resolution Plan Notes

Under active mitigation. Moved BI queries to dedicated read replica. Increasing max_wal_senders and wal_keep_size on primary.

INC-0888 medium resolved production

CDN Cache Invalidation Failure

2026-06-15 20:52:07

12h ago

Stale assets served to ~12% of users in ap-southeast-1 region. Cache purge API call returned 504 timeout from CDN control plane.

Owner / Assignee

jane.smith@company.com

Affected Service

CDN Service

Outage Duration

18 minutes

Root Cause Diagnostic

CDN provider control plane API experienced elevated latency during a maintenance window that was not communicated via status page.

Resolution Plan Notes

Implemented retry logic with exponential backoff for cache invalidation calls. Added monitoring alert for CDN control plane API response time.
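
A minimal sketch of retry with exponential backoff, assuming the purge call is wrapped in a callable that raises on timeout. `PurgeTimeout`, `purge_with_retry`, and the default delays are hypothetical names and values, not the actual implementation or the CDN provider's API.

```python
import random
import time


class PurgeTimeout(Exception):
    """Hypothetical error raised when the purge call times out (e.g. HTTP 504)."""


def backoff_delays(retries, base=0.5, cap=8.0):
    """Upper bound on the wait before each retry: base * 2^n, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


def purge_with_retry(purge, retries=4, base=0.5, cap=8.0, sleep=time.sleep):
    """Call `purge()`; on PurgeTimeout retry up to `retries` times with backoff."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return purge()
        except PurgeTimeout:
            if attempt == retries:
                raise  # retries exhausted: surface the failure to the caller
            # Full jitter: wait a random time up to the current bound so that
            # many clients do not retry against the control plane in lockstep.
            sleep(random.uniform(0, delays[attempt]))
```

Passing `sleep` as a parameter keeps the function testable without real waiting; in production the default `time.sleep` applies.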

INC-0887 critical resolved production

Auth Service Token Validation Errors

2026-06-15 08:52:07

1d ago

JWT validation started failing for tokens issued before 08:30 UTC. 503 errors affecting approximately 800 concurrent users.

Owner / Assignee

john.doe@company.com

Affected Service

Auth Service

Outage Duration

42 minutes

Root Cause Diagnostic

The secret key rotation procedure did not complete gracefully. Tokens signed with the previous key were invalidated before their TTL expired.

Resolution Plan Notes

Rolled back secret key. Implemented dual-key verification window during rotation. Updated runbook for secret rotation procedure.
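
The dual-key verification window can be sketched as follows. This is a simplified stand-in that uses plain HMAC-SHA256 signatures from the standard library rather than a real JWT library, and the function names and key values are hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def sign(payload: bytes, key: bytes) -> str:
    """Sign a payload with HMAC-SHA256 (stand-in for JWT signing)."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(
    payload: bytes,
    signature: str,
    current_key: bytes,
    previous_key: Optional[bytes] = None,
) -> bool:
    """Accept a signature made with the current key or, while the rotation
    window is open, with the previous key."""
    candidates = [current_key]
    if previous_key is not None:
        candidates.append(previous_key)
    # compare_digest avoids leaking match position through timing.
    return any(
        hmac.compare_digest(sign(payload, key), signature) for key in candidates
    )


old_token_sig = sign(b"user=42", b"old-secret")
# During the window, tokens issued before the rotation still validate.
print(verify(b"user=42", old_token_sig, b"new-secret", b"old-secret"))
# Once the previous key is dropped, they are rejected.
print(verify(b"user=42", old_token_sig, b"new-secret"))
```

In this sketch the previous key would be dropped only after the longest token TTL has elapsed, so no valid token is rejected early.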

INC-0886 medium resolved production

Background Worker Queue Backup

2026-06-14 20:52:07

1d 12h ago

The email notification queue backed up to 24,000 pending jobs. Workers stopped processing because of an overly strict Redis connection timeout setting.

Owner / Assignee

ops-team

Affected Service

Background Worker

Outage Duration

28 minutes

Root Cause Diagnostic

The Redis client connection_timeout of 100ms was too aggressive. A network latency spike to 180ms in eu-west-1 caused all workers to drop their connections.

Resolution Plan Notes

Increased connection_timeout to 2000ms. Added connection pool with health checks. Queue drained within 15 minutes after fix.

INC-0885 high resolved production

Elasticsearch Index Circuit Breaker Exception

2026-06-14 08:52:07

2d ago

Queries targeting logs index returned 429 Too Many Requests due to parent circuit breaker triggering (JVM memory pressure at 95.2%).

Owner / Assignee

ops-team

Affected Service

Background Worker

Outage Duration

1h 15m

Root Cause Diagnostic

Large wildcard aggregation query executed by automated reporting tool. JVM heap capacity exhausted trying to keep bucket states in memory.

Resolution Plan Notes

Configured indices.breaker.fielddata.limit to 40%. Added ES node JVM heap telemetry alert thresholds.

INC-0884 medium resolved staging

SSL Certificate Renewal Handshake Failure

2026-06-13 08:52:07

3d ago

Automated Let's Encrypt renewal agent failed authorization due to CAA record configuration mismatch on DNS provider.

Owner / Assignee

jane.smith@company.com

Affected Service

API Gateway

Outage Duration

45 minutes

Root Cause Diagnostic

The CAA record restricted issuance to DigiCert only, so Let's Encrypt was not authorized to issue the certificate.

Resolution Plan Notes

Added letsencrypt.org to CAA record set. Triggered renewal script manually.
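
A small sketch of the CAA check that caused the failure, assuming records are given in zone-file text form such as `0 issue "digicert.com"`. The function name and the record strings are illustrative; the sketch ignores `issuewild` and does not climb the DNS tree the way a real CA does.

```python
def ca_may_issue(caa_records, ca_domain):
    """Return True if the CAA record set permits `ca_domain` to issue."""
    issuers = []
    for record in caa_records:
        # Zone-file form: <flags> <tag> <value>
        _flags, tag, value = record.split(None, 2)
        if tag.lower() == "issue":
            # Drop quotes and any ";"-separated parameters after the domain.
            issuers.append(value.strip('"').split(";")[0].strip().lower())
    if not issuers:
        # No "issue" records at all means issuance is unrestricted.
        return True
    return ca_domain.lower() in issuers


before = ['0 issue "digicert.com"']
after = before + ['0 issue "letsencrypt.org"']
print(ca_may_issue(before, "letsencrypt.org"))  # False: only DigiCert listed
print(ca_may_issue(after, "letsencrypt.org"))   # True once the record is added
```

Running a check like this against the live record set before each renewal would surface the mismatch ahead of the renewal agent's failure.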

INC-0883 medium resolved development

Docker Registry Connection Timeout

2026-06-12 08:52:07

4d ago

Internal runners failed to pull base alpine image. Registry authentication service experienced database deadlocks.

Owner / Assignee

john.doe@company.com

Affected Service

CDN Service

Outage Duration

32 minutes

Root Cause Diagnostic

Database connection leakage in docker registry authentication handler under high concurrency.

Resolution Plan Notes

Patched docker registry token server dependencies. Restarted registry auth pods.