A config rollout lowered the database connection-pool limit on the
sync-worker tier from 64 to 8, exhausting connections under
normal afternoon load. The sync API returned 502s for roughly 21% of
requests over 47 minutes. We mitigated by reverting the config and
cycling the worker fleet; no data was lost.
cfg-9a12 promoted to
production via the standard rollout pipeline.
sync_5xx_rate > 2% for 60s.
On-call (Devon) acknowledges.
cfg-9a12
reverted; worker fleet cycled. 5xx rate drops below 0.2% within
three minutes.
PR #4888 made connection-pool limits
configurable per worker tier. The default for the new
sync-worker key was meant to inherit the
global value (64) but was hard-coded to 8 during a local test and
committed. The config linter only validates type, not magnitude, so the
change passed CI.
Because config rollouts and code deploys go through separate pipelines, the on-call's first instinct — rolling back the most recent application deploys — had no effect and cost roughly 13 minutes of diagnosis time.
| Requests failed (502) | ~41,200 |
|---|---|
| Peak error rate | 21.4% |
| Users affected | ~2,300 workspaces |
| Data loss | None — clients retried |
| SLA breach | No (within monthly budget) |
max_connections (warn < 32)
Apr 18