# Slow DB Query Runbook
Owner: Platform team · Last updated: 2026-04-21 · Related tickets: TKT-005
## Overview

The Cloudflare services that currently emit DB query spans into GlitchTip are dashboard, worker-consumer-api, worker-ingestion-orchestrator, and worker-ingestion-process, all via the @sentry/cloudflare SDK. Each span carries the following attributes:
| Attribute | Description |
|---|---|
| `op` | Always `db.query` |
| `db.system` | Always `sqlite` (D1) |
| `db.operation` | SQL verb (`SELECT`, `INSERT`, `UPDATE`, `DELETE`, …) |
| `db.service` | Service that ran the query (`my-legaciti-dashboard`, `worker-consumer-api`, `worker-ingestion-orchestrator`, `worker-ingestion-process`) |
| `db.query_fingerprint` | Redacted hash of the SQL shape; safe to store and aggregate |
| `db.duration_ms` | Actual query wall-clock time in milliseconds |
| `db.slow_query` | `true` when `db.duration_ms` > `db.slow_query_threshold_ms` |
| `db.slow_query_threshold_ms` | Threshold used (default: 200 ms) |
| `db.shard` | D1 shard index (0–4), when known |
| `db.logic` | Human-readable label for the query's purpose |
| `db.row_count` | Number of rows returned, when available |
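The attribute table above can be illustrated with a small sketch that assembles a span's attribute object, including the `db.slow_query` derivation from the duration and threshold. This is a hypothetical helper for illustration only; the function name and input shape are assumptions, not the actual SDK instrumentation.

```typescript
type DbSpanAttributes = {
  op: "db.query";
  "db.system": "sqlite";
  "db.operation": string;
  "db.service": string;
  "db.query_fingerprint": string;
  "db.duration_ms": number;
  "db.slow_query": boolean;
  "db.slow_query_threshold_ms": number;
  "db.shard"?: number;
  "db.logic"?: string;
  "db.row_count"?: number;
};

// Hypothetical helper: builds the attribute set this runbook describes.
function buildDbSpanAttributes(input: {
  operation: string;
  service: string;
  fingerprint: string;
  durationMs: number;
  thresholdMs?: number; // defaults to 200 ms, per this runbook
  shard?: number;
  logic?: string;
  rowCount?: number;
}): DbSpanAttributes {
  const threshold = input.thresholdMs ?? 200;
  return {
    op: "db.query",
    "db.system": "sqlite",
    "db.operation": input.operation,
    "db.service": input.service,
    "db.query_fingerprint": input.fingerprint,
    "db.duration_ms": input.durationMs,
    // db.slow_query is derived, not measured separately
    "db.slow_query": input.durationMs > threshold,
    "db.slow_query_threshold_ms": threshold,
    ...(input.shard !== undefined ? { "db.shard": input.shard } : {}),
    ...(input.logic !== undefined ? { "db.logic": input.logic } : {}),
    ...(input.rowCount !== undefined ? { "db.row_count": input.rowCount } : {}),
  };
}
```

Optional attributes (`db.shard`, `db.logic`, `db.row_count`) are omitted entirely when unknown, matching the "when known / when available" wording above.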
## Operational Thresholds

| Severity | Condition | Action |
|---|---|---|
| Warning | p95 `db.duration_ms` > 200 ms, sustained for 10 min | Investigate; see [First Checks](#first-checks) |
| Critical | Any `db.duration_ms` > 1000 ms | Immediate triage |
| Budget alert | Event volume spike (> 2× baseline in 5 min) | Consider reducing the sample rate |
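The severity tiers above can be sketched as a small classifier. This is a minimal illustration only; the function names and the idea of passing a pre-collected window of samples are assumptions, not part of GlitchTip or any existing tooling.

```typescript
// Naive p95 over a sample window (nearest-rank method).
function p95(samples: number[]): number {
  const sorted = [...samples].sort((x, y) => x - y);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}

// Applies the runbook's thresholds to a window of db.duration_ms samples.
function classifyWindow(
  durationsMs: number[],
  sustainedMinutes: number,
): "critical" | "warning" | "ok" {
  // Critical: any single query over 1000 ms.
  if (durationsMs.some((d) => d > 1000)) return "critical";
  // Warning: p95 over 200 ms sustained for at least 10 minutes.
  if (p95(durationsMs) > 200 && sustainedMinutes >= 10) return "warning";
  return "ok";
}
```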
## GlitchTip Saved Views

Configure these views in GlitchTip → Performance:

### 1. Global: All Slow DB Queries

- Filter: `op = db.query AND db.slow_query = true`
- Group by: `db.query_fingerprint`
- Sort by: p95 duration, descending

### 2. Per-service: Dashboard

- Filter: `op = db.query AND db.service = my-legaciti-dashboard`
- Group by: `db.logic`

### 3. Per-service: Public API

- Filter: `op = db.query AND db.service = worker-consumer-api`
- Group by: `db.logic`

### 4. Per-service: Ingestion Orchestrator

- Filter: `op = db.query AND db.service = worker-ingestion-orchestrator`
- Group by: `db.logic`

### 5. Per-service: Ingestion Process

- Filter: `op = db.query AND db.service = worker-ingestion-process`
- Group by: `db.logic`
## Alerts

### Alert 1 — Sustained p95 Degradation

- Condition: p95 of `db.duration_ms` where `op = db.query` exceeds 200 ms for 10 consecutive minutes
- Channel: Email + Discord `#platform-alerts`
- Severity: Warning

### Alert 2 — Critical Single Query

- Condition: Any `db.duration_ms` > 1000 ms
- Channel: Discord `#platform-alerts` (immediate)
- Severity: Critical
## Triage Steps {#first-checks}

When a slow-query alert fires, run through these checks in order:

### 1. Was there a recent deploy?

- Check Cloudflare Workers deployments for the flagged `db.service` in the last 30 minutes.
- If yes: compare p95 before and after; roll back if query patterns changed unexpectedly.

### 2. Did the sample rate change?

- Check `SENTRY_TRACES_SAMPLE_RATE` in the affected worker's Cloudflare environment variables.
- A rate increase surfaces more slow queries without any of them being new; verify against the baseline.

### 3. Is the issue shard-specific?

- Filter GlitchTip by the `db.shard` attribute to isolate a single D1 shard.
- If one shard is hot, check for fan-out queries, lock contention, or large batch operations hitting it disproportionately.
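The shard check can be sketched as a quick aggregation over exported slow-query spans. Everything here (the `SlowSpan` shape, the function name, the 50% dominance ratio) is an illustrative assumption, not a GlitchTip feature.

```typescript
type SlowSpan = { shard?: number; durationMs: number };

// Returns the shard index that accounts for more than `dominanceRatio`
// of the slow spans, or null if load is spread out.
function hotShard(spans: SlowSpan[], dominanceRatio = 0.5): number | null {
  const counts = new Map<number, number>();
  let total = 0;
  for (const s of spans) {
    if (s.shard === undefined) continue; // db.shard is only set when known
    counts.set(s.shard, (counts.get(s.shard) ?? 0) + 1);
    total += 1;
  }
  for (const [shard, n] of counts) {
    if (total > 0 && n / total > dominanceRatio) return shard;
  }
  return null;
}
```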
### 4. Which endpoint is responsible?

- Look at `db.logic` and, for worker-consumer-api / dashboard, the `http.route` attribute.
- Cross-reference Cloudflare Workers analytics for the same route to confirm traffic volume.

### 5. Is the fingerprint new or recurring?

- Check the `db.query_fingerprint` in the global slow-query view.
- A new fingerprint after a deploy likely means a new query shape was introduced.
- A recurring fingerprint that recently spiked points to data volume growth or a missing index.
## Mitigation Steps

| Cause | Mitigation |
|---|---|
| Missing index | Add the index in `packages/db/src/schema.ts`; run `pnpm db:drizzle:generate`; deploy |
| Hot shard | Review sharding-key distribution; consider a re-sharding strategy |
| Fan-out N+1 | Consolidate multi-shard fan-out into a single per-shard batch; merge in memory |
| Large result set | Add a pagination limit or a tighter `WHERE` clause |
| Slow aggregate | Cache the result in KV (`PUB_CACHE`) with a short TTL |
| Temporary spike | Reduce `SENTRY_TRACES_SAMPLE_RATE` to `0.01` while investigating |
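The "Fan-out N+1" row deserves a concrete shape. The sketch below shows the batched pattern: one query per shard with all of that shard's IDs, results merged in memory, instead of one query per ID. `queryShard` stands in for a D1 prepared statement; all names here are illustrative assumptions.

```typescript
type Row = { id: string; value: string };

// One batched query per shard (N shards total), not one query per ID (N+1).
async function fetchAcrossShards(
  idsByShard: Map<number, string[]>,
  queryShard: (shard: number, ids: string[]) => Promise<Row[]>,
): Promise<Row[]> {
  const perShard = await Promise.all(
    [...idsByShard.entries()].map(([shard, ids]) => queryShard(shard, ids)),
  );
  // Merge the per-shard batches in memory.
  return perShard.flat();
}
```

With D1 this would typically translate to a single `WHERE id IN (...)` statement per shard, issued concurrently.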
## Escalation

| Condition | Owner |
|---|---|
| Dashboard query regressions | Dashboard team |
| Ingestion pipeline slowdowns | Ingestion / data team |
| Public API SLA breach | Platform team (primary), on-call |
| Persistent after mitigation | Platform lead |
## Post-Incident Checklist

- Root cause documented in incident log.
- Index or query fix merged and deployed.
- GlitchTip alert threshold adjusted if it was a false positive.
- Follow-up ticket created if a more significant refactor is needed.
- Slow fingerprint added to weekly review tracking list.
## Weekly Review

Once per week, inspect the top-10 slow fingerprints in the Global slow-query view. For each:

- Confirm the `db.logic` label and owning service.
- Verify the p95 trend is flat or improving.
- Create an optimization ticket if p95 > 500 ms for two consecutive weeks.
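The ticket rule above can be expressed as a one-line check over a fingerprint's weekly p95 history. A minimal sketch; the function name and the ordered-array input are assumptions for illustration.

```typescript
// weeklyP95Ms is ordered oldest to newest, one entry per week.
// Returns true when any two consecutive weeks both exceed 500 ms.
function needsOptimizationTicket(weeklyP95Ms: number[]): boolean {
  for (let i = 1; i < weeklyP95Ms.length; i++) {
    if (weeklyP95Ms[i - 1] > 500 && weeklyP95Ms[i] > 500) return true;
  }
  return false;
}
```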
## Disabling Traces

To stop emitting DB spans without a code deploy:

```sh
# Per-service via Cloudflare dashboard or wrangler secrets
wrangler secret put SENTRY_TRACES_SAMPLE_RATE --env production
# Enter: 0
```

This takes effect on the next worker request. No redeployment needed.
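Since the variable arrives as a string, each worker has to parse it when configuring Sentry. The sketch below is a hypothetical parser, not the services' actual initialization code; the 0.1 fallback is an assumed default, not documented anywhere in this runbook.

```typescript
// Hypothetical: parse SENTRY_TRACES_SAMPLE_RATE from worker env bindings.
// "0" disables tracing entirely; missing or malformed values fall back
// to an assumed default of 0.1.
function resolveSampleRate(env: { SENTRY_TRACES_SAMPLE_RATE?: string }): number {
  const parsed = Number(env.SENTRY_TRACES_SAMPLE_RATE);
  if (Number.isFinite(parsed) && parsed >= 0 && parsed <= 1) return parsed;
  return 0.1; // assumed default, not confirmed by this runbook
}
```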