# Slow DB Query Runbook
Owner: Platform team · Last updated: 2026-04-21 · Related tickets: TKT-005
## Overview

The Cloudflare services that currently emit DB query spans into GlitchTip are dashboard, worker-consumer-api, worker-ingestion-orchestrator, and worker-ingestion-process, all via the @sentry/cloudflare SDK. Each span carries the following attributes:
| Attribute | Description |
|---|---|
| `op` | Always `db.query` |
| `db.system` | Always `sqlite` (D1) |
| `db.operation` | SQL verb (`SELECT`, `INSERT`, `UPDATE`, `DELETE`, …) |
| `db.service` | Service that ran the query (`my-legaciti-dashboard`, `worker-consumer-api`, `worker-ingestion-orchestrator`, `worker-ingestion-process`) |
| `db.query_fingerprint` | Redacted hash of the SQL shape; safe to store and aggregate |
| `db.duration_ms` | Actual query wall-clock time in milliseconds |
| `db.slow_query` | `true` when `db.duration_ms` > `db.slow_query_threshold_ms` |
| `db.slow_query_threshold_ms` | Threshold used (default: 200 ms) |
| `db.shard` | D1 shard index (0–4), when known |
| `db.logic` | Human-readable label for the query's purpose |
| `db.row_count` | Number of rows returned, when available |
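The attribute table above can be illustrated with a small sketch that assembles a span's attribute object, including the `db.slow_query` derivation from the duration and threshold. This is a hypothetical helper for illustration only; the function name and input shape are assumptions, not the actual SDK instrumentation.

```typescript
type DbSpanAttributes = {
  op: "db.query";
  "db.system": "sqlite";
  "db.operation": string;
  "db.service": string;
  "db.query_fingerprint": string;
  "db.duration_ms": number;
  "db.slow_query": boolean;
  "db.slow_query_threshold_ms": number;
  "db.shard"?: number;
  "db.logic"?: string;
  "db.row_count"?: number;
};

// Hypothetical helper: builds the attribute set this runbook describes.
function buildDbSpanAttributes(input: {
  operation: string;
  service: string;
  fingerprint: string;
  durationMs: number;
  thresholdMs?: number; // defaults to 200 ms, per this runbook
  shard?: number;
  logic?: string;
  rowCount?: number;
}): DbSpanAttributes {
  const threshold = input.thresholdMs ?? 200;
  return {
    op: "db.query",
    "db.system": "sqlite",
    "db.operation": input.operation,
    "db.service": input.service,
    "db.query_fingerprint": input.fingerprint,
    "db.duration_ms": input.durationMs,
    // db.slow_query is derived, not measured separately
    "db.slow_query": input.durationMs > threshold,
    "db.slow_query_threshold_ms": threshold,
    ...(input.shard !== undefined ? { "db.shard": input.shard } : {}),
    ...(input.logic !== undefined ? { "db.logic": input.logic } : {}),
    ...(input.rowCount !== undefined ? { "db.row_count": input.rowCount } : {}),
  };
}
```

Optional attributes (`db.shard`, `db.logic`, `db.row_count`) are omitted entirely when unknown, matching the "when known / when available" wording above.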
## Operational Thresholds

| Severity | Condition | Action |
|---|---|---|
| Warning | p95 `db.duration_ms` > 200 ms, sustained for 10 min | Investigate; see [First Checks](#first-checks) |
| Critical | Any `db.duration_ms` > 1000 ms | Immediate triage |
| Budget alert | Event volume spike (> 2× baseline in 5 min) | Consider reducing the sample rate |
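The severity tiers above can be sketched as a small classifier. This is a minimal illustration only; the function names and the idea of passing a pre-collected window of samples are assumptions, not part of GlitchTip or any existing tooling.

```typescript
// Naive p95 over a sample window (nearest-rank method).
function p95(samples: number[]): number {
  const sorted = [...samples].sort((x, y) => x - y);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}

// Applies the runbook's thresholds to a window of db.duration_ms samples.
function classifyWindow(
  durationsMs: number[],
  sustainedMinutes: number,
): "critical" | "warning" | "ok" {
  // Critical: any single query over 1000 ms.
  if (durationsMs.some((d) => d > 1000)) return "critical";
  // Warning: p95 over 200 ms sustained for at least 10 minutes.
  if (p95(durationsMs) > 200 && sustainedMinutes >= 10) return "warning";
  return "ok";
}
```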
## GlitchTip Saved Views

Configure these views in GlitchTip → Performance:

### 1. Global: All Slow DB Queries

- Filter: `op = db.query AND db.slow_query = true`
- Group by: `db.query_fingerprint`
- Sort by: p95 duration, descending

### 2. Per-service: Dashboard

- Filter: `op = db.query AND db.service = my-legaciti-dashboard`
- Group by: `db.logic`

### 3. Per-service: Public API

- Filter: `op = db.query AND db.service = worker-consumer-api`
- Group by: `db.logic`

### 4. Per-service: Ingestion Orchestrator

- Filter: `op = db.query AND db.service = worker-ingestion-orchestrator`
- Group by: `db.logic`

### 5. Per-service: Ingestion Process

- Filter: `op = db.query AND db.service = worker-ingestion-process`
- Group by: `db.logic`
## Alerts

### Alert 1 — Sustained p95 Degradation

- Condition: p95 of `db.duration_ms` where `op = db.query` exceeds 200 ms for 10 consecutive minutes
- Channel: Email + Discord `#platform-alerts`
- Severity: Warning

### Alert 2 — Critical Single Query

- Condition: Any `db.duration_ms` > 1000 ms
- Channel: Discord `#platform-alerts` (immediate)
- Severity: Critical
## Triage Steps {#first-checks}

When a slow-query alert fires, run through these checks in order:

### 1. Was there a recent deploy?

- Check Cloudflare Workers deployments for the flagged `db.service` in the last 30 minutes.
- If yes: compare p95 before and after; roll back if query patterns changed unexpectedly.

### 2. Did the sample rate change?

- Check `SENTRY_TRACES_SAMPLE_RATE` in the affected worker's Cloudflare environment variables.
- A rate increase surfaces more slow queries without any of them being new; verify against the baseline.

### 3. Is the issue shard-specific?

- Filter GlitchTip by the `db.shard` attribute to isolate a single D1 shard.
- If one shard is hot, check for fan-out queries, lock contention, or large batch operations hitting it disproportionately.
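The shard check can be sketched as a quick aggregation over exported slow-query spans. Everything here (the `SlowSpan` shape, the function name, the 50% dominance ratio) is an illustrative assumption, not a GlitchTip feature.

```typescript
type SlowSpan = { shard?: number; durationMs: number };

// Returns the shard index that accounts for more than `dominanceRatio`
// of the slow spans, or null if load is spread out.
function hotShard(spans: SlowSpan[], dominanceRatio = 0.5): number | null {
  const counts = new Map<number, number>();
  let total = 0;
  for (const s of spans) {
    if (s.shard === undefined) continue; // db.shard is only set when known
    counts.set(s.shard, (counts.get(s.shard) ?? 0) + 1);
    total += 1;
  }
  for (const [shard, n] of counts) {
    if (total > 0 && n / total > dominanceRatio) return shard;
  }
  return null;
}
```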
### 4. Which endpoint is responsible?

- Look at `db.logic` and, for worker-consumer-api / dashboard, the `http.route` attribute.
- Cross-reference Cloudflare Workers analytics for the same route to confirm traffic volume.

### 5. Is the fingerprint new or recurring?

- Check the `db.query_fingerprint` in the global slow-query view.
- A new fingerprint after a deploy likely means a new query shape was introduced.
- A recurring fingerprint that recently spiked points to data volume growth or a missing index.
## Mitigation Steps

| Cause | Mitigation |
|---|---|
| Missing index | Add the index in `packages/db/src/schema.ts`; run `pnpm db:drizzle:generate`; deploy |
| Hot shard | Review sharding-key distribution; consider a re-sharding strategy |
| Fan-out N+1 | Consolidate multi-shard fan-out into a single per-shard batch; merge in memory |
| Large result set | Add a pagination limit or a tighter `WHERE` clause |
| Slow aggregate | Cache the result in KV (`PUB_CACHE`) with a short TTL |
| Temporary spike | Reduce `SENTRY_TRACES_SAMPLE_RATE` to `0.01` while investigating |
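The "Fan-out N+1" row deserves a concrete shape. The sketch below shows the batched pattern: one query per shard with all of that shard's IDs, results merged in memory, instead of one query per ID. `queryShard` stands in for a D1 prepared statement; all names here are illustrative assumptions.

```typescript
type Row = { id: string; value: string };

// One batched query per shard (N shards total), not one query per ID (N+1).
async function fetchAcrossShards(
  idsByShard: Map<number, string[]>,
  queryShard: (shard: number, ids: string[]) => Promise<Row[]>,
): Promise<Row[]> {
  const perShard = await Promise.all(
    [...idsByShard.entries()].map(([shard, ids]) => queryShard(shard, ids)),
  );
  // Merge the per-shard batches in memory.
  return perShard.flat();
}
```

With D1 this would typically translate to a single `WHERE id IN (...)` statement per shard, issued concurrently.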
## Escalation

| Condition | Owner |
|---|---|
| Dashboard query regressions | Dashboard team |
| Ingestion pipeline slowdowns | Ingestion / data team |
| Public API SLA breach | Platform team (primary), on-call |
| Persistent after mitigation | Platform lead |
## Post-Incident Checklist

- Root cause documented in incident log.
- Index or query fix merged and deployed.
- GlitchTip alert threshold adjusted if it was a false positive.
- Follow-up ticket created if a more significant refactor is needed.
- Slow fingerprint added to weekly review tracking list.
## Weekly Review

Once per week, inspect the top-10 slow fingerprints in the Global slow-query view. For each:

- Confirm the `db.logic` label and owning service.
- Verify the p95 trend is flat or improving.
- Create an optimization ticket if p95 > 500 ms for two consecutive weeks.
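The ticket rule above can be expressed as a one-line check over a fingerprint's weekly p95 history. A minimal sketch; the function name and the ordered-array input are assumptions for illustration.

```typescript
// weeklyP95Ms is ordered oldest to newest, one entry per week.
// Returns true when any two consecutive weeks both exceed 500 ms.
function needsOptimizationTicket(weeklyP95Ms: number[]): boolean {
  for (let i = 1; i < weeklyP95Ms.length; i++) {
    if (weeklyP95Ms[i - 1] > 500 && weeklyP95Ms[i] > 500) return true;
  }
  return false;
}
```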
## Disabling Traces

To stop emitting DB spans without a code deploy:

```sh
# Per-service via Cloudflare dashboard or wrangler secrets
wrangler secret put SENTRY_TRACES_SAMPLE_RATE --env production
# Enter: 0
```

This takes effect on the next worker request. No redeployment needed.
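Since the variable arrives as a string, each worker has to parse it when configuring Sentry. The sketch below is a hypothetical parser, not the services' actual initialization code; the 0.1 fallback is an assumed default, not documented anywhere in this runbook.

```typescript
// Hypothetical: parse SENTRY_TRACES_SAMPLE_RATE from worker env bindings.
// "0" disables tracing entirely; missing or malformed values fall back
// to an assumed default of 0.1.
function resolveSampleRate(env: { SENTRY_TRACES_SAMPLE_RATE?: string }): number {
  const parsed = Number(env.SENTRY_TRACES_SAMPLE_RATE);
  if (Number.isFinite(parsed) && parsed >= 0 && parsed <= 1) return parsed;
  return 0.1; // assumed default, not confirmed by this runbook
}
```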