
Slow DB Query Runbook

Migrated from root technical docs.

Owner: Platform team · Last updated: 2026-04-21 · Related tickets: TKT-005


The Cloudflare services that currently emit DB query spans into GlitchTip are the dashboard (my-legaciti-dashboard), worker-consumer-api, worker-ingestion-orchestrator, and worker-ingestion-process, all via the @sentry/cloudflare SDK. Each span carries the following attributes:

| Attribute | Description |
| --- | --- |
| op | Always db.query |
| db.system | Always sqlite (D1) |
| db.operation | SQL verb (SELECT, INSERT, UPDATE, DELETE, …) |
| db.service | Service that ran the query (my-legaciti-dashboard, worker-consumer-api, worker-ingestion-orchestrator, worker-ingestion-process) |
| db.query_fingerprint | Redacted hash of the SQL shape; safe to store and aggregate |
| db.duration_ms | Actual query wall-clock time in milliseconds |
| db.slow_query | true when db.duration_ms > db.slow_query_threshold_ms |
| db.slow_query_threshold_ms | Threshold used (default: 200ms) |
| db.shard | D1 shard index (0–4), when known |
| db.logic | Human-readable label for the query's purpose |
| db.row_count | Number of rows returned, when available |
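
For orientation, here is a minimal sketch of how a worker might attach these attributes around a D1 call, assuming the @sentry/cloudflare startSpan API. The helper name, the hard-coded service value, and the fingerprint handling are illustrative, not the actual instrumentation code:

```ts
import * as Sentry from "@sentry/cloudflare";
// D1Database/D1Result are ambient types from @cloudflare/workers-types.

const SLOW_QUERY_THRESHOLD_MS = 200; // default threshold from the table above

// Hypothetical wrapper; the real instrumentation may differ.
async function tracedQuery(db: D1Database, sql: string, logic: string) {
  return Sentry.startSpan({ name: logic, op: "db.query" }, async (span) => {
    const start = Date.now();
    const result = await db.prepare(sql).all();
    const durationMs = Date.now() - start;
    span.setAttributes({
      "db.system": "sqlite",
      "db.operation": sql.trim().split(/\s+/)[0].toUpperCase(),
      "db.service": "worker-consumer-api", // the emitting service
      // db.query_fingerprint would come from a redacting hash of the SQL
      // shape (see the fingerprint sketch later in this runbook), and
      // db.shard would be set when the shard index is known.
      "db.duration_ms": durationMs,
      "db.slow_query": durationMs > SLOW_QUERY_THRESHOLD_MS,
      "db.slow_query_threshold_ms": SLOW_QUERY_THRESHOLD_MS,
      "db.logic": logic,
      "db.row_count": result.results?.length ?? 0,
    });
    return result;
  });
}
```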

Alerting thresholds:

| Severity | Condition | Action |
| --- | --- | --- |
| Warning | p95 db.duration_ms > 200ms sustained for 10 min | Investigate (see First Checks) |
| Critical | Any db.duration_ms > 1000ms | Immediate triage |
| Budget alert | Event volume spike (> 2× baseline in 5 min) | Consider reducing the sample rate |

Configure these views in GlitchTip → Performance:

Global slow-query view:

  • Filter: op = db.query AND db.slow_query = true
  • Group by: db.query_fingerprint
  • Sort by: p95 duration descending

Per-service views, one per service:

my-legaciti-dashboard:
  • Filter: op = db.query AND db.service = my-legaciti-dashboard
  • Group by: db.logic

worker-consumer-api:
  • Filter: op = db.query AND db.service = worker-consumer-api
  • Group by: db.logic

worker-ingestion-orchestrator:
  • Filter: op = db.query AND db.service = worker-ingestion-orchestrator
  • Group by: db.logic

worker-ingestion-process:
  • Filter: op = db.query AND db.service = worker-ingestion-process
  • Group by: db.logic

Configure these alert rules:

  • Condition: p95 of db.duration_ms where op = db.query exceeds 200ms for 10 consecutive minutes
  • Channel: Email + Discord #platform-alerts
  • Severity: Warning

  • Condition: Any db.duration_ms > 1000ms
  • Channel: Discord #platform-alerts (immediate)
  • Severity: Critical

When a slow-query alert fires, run through these checks in order:

  1. Recent deploy? Check Cloudflare Workers deployments for the flagged db.service in the last 30 minutes. If yes, compare p95 before and after, and roll back if query patterns changed unexpectedly.
  2. Sample-rate change? Check SENTRY_TRACES_SAMPLE_RATE in the affected worker's Cloudflare environment variables. A rate increase surfaces more slow queries without any being new (doubling the rate from 0.05 to 0.1 roughly doubles reported slow-query events at constant traffic); verify against baseline.
  3. Hot shard? Filter GlitchTip by the db.shard attribute to isolate a single D1 shard. If one shard is hot, check for fan-out queries, lock contention, or large batch operations hitting it disproportionately.
  4. Traffic spike? Look at db.logic and, for worker-consumer-api and the dashboard, http.route attributes. Cross-reference Cloudflare Workers analytics for the same route to confirm traffic volume.
  5. Query shape? Check the db.query_fingerprint in the Global slow-query view. A new fingerprint after a deploy likely means a new query shape was introduced; a recurring fingerprint that recently spiked points to data-volume growth or a missing index. (A sketch of one way such fingerprints can be derived follows these steps.)
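
This runbook doesn't define the exact fingerprinting, only that it is a redacted hash of the SQL shape. A minimal sketch of one way to derive such a fingerprint, assuming literal-stripping plus a 32-bit FNV-1a hash (the production implementation may differ):

```ts
// Illustrative only: normalize the query shape, then hash it.
// Literals are replaced with placeholders so no data values are stored.
function queryFingerprint(sql: string): string {
  const shape = sql
    .replace(/'(?:[^']|'')*'/g, "?") // string literals -> ?
    .replace(/\b\d+(?:\.\d+)?\b/g, "?") // numeric literals -> ?
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();

  // 32-bit FNV-1a: stable, compact, and reveals nothing about row data.
  let h = 0x811c9dc5;
  for (let i = 0; i < shape.length; i++) {
    h ^= shape.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Queries that differ only in literals share a fingerprint:
// queryFingerprint("SELECT * FROM users WHERE id = 42") ===
// queryFingerprint("SELECT * FROM users WHERE id = 7")
```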

Common causes and mitigations (sketches for several of these follow the table):

| Cause | Mitigation |
| --- | --- |
| Missing index | Add the index in packages/db/src/schema.ts; run pnpm db:drizzle:generate; deploy |
| Hot shard | Review sharding-key distribution; consider a re-sharding strategy |
| Fan-out N+1 | Consolidate multi-shard fan-out into a single per-shard batch; merge in memory |
| Large result set | Add a pagination limit or a tighter WHERE clause |
| Slow aggregate | Cache the result in KV (PUB_CACHE) with a short TTL |
| Temporary spike | Reduce SENTRY_TRACES_SAMPLE_RATE to 0.01 while investigating |
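
For the missing-index row, the change lands in packages/db/src/schema.ts. A hypothetical Drizzle snippet with made-up table and column names (the real schema will differ):

```ts
import { index, integer, sqliteTable, text } from "drizzle-orm/sqlite-core";

// Hypothetical table; the index() call is the relevant part.
export const events = sqliteTable(
  "events",
  {
    id: integer("id").primaryKey(),
    accountId: text("account_id").notNull(),
    createdAt: integer("created_at").notNull(),
  },
  (t) => [
    // Compound index covering the slow "WHERE account_id = ? ORDER BY created_at" shape.
    index("events_account_created_idx").on(t.accountId, t.createdAt),
  ]
);
```

Then run pnpm db:drizzle:generate and deploy, as the table says.

For fan-out N+1, the idea is one batched statement per shard instead of one statement per row, merging in worker memory. A sketch assuming hypothetical shard bindings and an items table:

```ts
// Hypothetical: resolve many ids with a single IN (...) query per D1 shard.
async function fetchByIds(shards: D1Database[], ids: string[]) {
  const placeholders = ids.map(() => "?").join(", ");
  const perShard = await Promise.all(
    shards.map((db) =>
      db
        .prepare(`SELECT * FROM items WHERE id IN (${placeholders})`)
        .bind(...ids)
        .all()
    )
  );
  return perShard.flatMap((r) => r.results ?? []); // merge in memory
}
```

For slow aggregates, a read-through KV cache with a short TTL. PUB_CACHE is the binding named in the table, while the key, query, and TTL here are illustrative:

```ts
// Hypothetical read-through cache for an expensive aggregate.
async function cachedEventCount(env: { PUB_CACHE: KVNamespace; DB: D1Database }) {
  const cached = await env.PUB_CACHE.get<number>("agg:event_count", "json");
  if (cached !== null) return cached;

  const row = await env.DB
    .prepare("SELECT COUNT(*) AS n FROM events")
    .first<{ n: number }>();
  const n = row?.n ?? 0;

  // expirationTtl is in seconds; 60 is KV's minimum TTL.
  await env.PUB_CACHE.put("agg:event_count", JSON.stringify(n), { expirationTtl: 60 });
  return n;
}
```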

Escalation paths:

| Condition | Owner |
| --- | --- |
| Dashboard query regressions | Dashboard team |
| Ingestion pipeline slowdowns | Ingestion / data team |
| Public API SLA breach | Platform team (primary), on-call |
| Persistent after mitigation | Platform lead |

Resolution checklist:

  • Root cause documented in the incident log.
  • Index or query fix merged and deployed.
  • GlitchTip alert threshold adjusted if it was a false positive.
  • Follow-up ticket created if a more significant refactor is needed.
  • Slow fingerprint added to weekly review tracking list.

Once per week, inspect the top-10 slow fingerprints in the Global slow-query view. For each:

  1. Confirm db.logic label and owning service.
  2. Verify p95 trend is flat or improving.
  3. Create an optimization ticket if p95 > 500ms for two consecutive weeks.

To stop emitting DB spans without a code deploy:

```sh
# Per-service via the Cloudflare dashboard or wrangler secrets
wrangler secret put SENTRY_TRACES_SAMPLE_RATE --env production
# Enter: 0
```

This takes effect on the next worker request. No redeployment needed.
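
The no-redeploy behavior works because the SDK options are built from the environment on each request. A rough sketch of the wiring, assuming a standard @sentry/cloudflare setup (handler body and env names are illustrative):

```ts
import * as Sentry from "@sentry/cloudflare";

// Hypothetical worker entry point; the real init options will differ.
export default Sentry.withSentry(
  (env: { SENTRY_DSN: string; SENTRY_TRACES_SAMPLE_RATE?: string }) => ({
    dsn: env.SENTRY_DSN,
    // Evaluated per request, so an updated secret applies without a redeploy.
    tracesSampleRate: Number(env.SENTRY_TRACES_SAMPLE_RATE ?? "0"),
  }),
  {
    async fetch(_req, _env, _ctx): Promise<Response> {
      return new Response("ok");
    },
  }
);
```

Setting the secret to 0 therefore disables trace sampling, and with it the DB spans, on the next request.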