TKT-005: GlitchTip Dashboards, Alerts, and Runbook

Migrated from root technical docs.

Status: Todo
Priority: P1
Estimated effort: 1 day
Depends on: TKT-003, TKT-004

Set up actionable GlitchTip views and alerts for slow DB queries, and document the response workflow.

  • Saved views/filters for slow DB spans.
  • Alerts for sustained degradation.
  • On-call runbook for triage.
  • Slow query (default): db.duration_ms > 200
  • Critical query: db.duration_ms > 1000
  • Sustained issue: p95 of db.duration_ms stays above the threshold for at least 10 minutes
  • Create a saved performance view filtered by op=db.query and db.slow_query=true.
  • Create an endpoint-level view grouped by route and fingerprint.
  • Create service-level views for dashboard, ingestion, and consumer-api.
  • Configure alert: p95 db span duration above threshold for 10 minutes.
  • Configure alert: count of critical spans above threshold within a 5-minute window.
  • Route alerts to configured channel(s) (email/Discord/etc.).
  • Add runbook doc section: “How to triage slow DB query alerts” — see docs/runbooks/slow-db-query-runbook.md.
  • Define first checks (new deploys, sample-rate changes, shard imbalance, hot endpoint).
  • Define mitigation steps (throttle, cache, index, temporary feature flags).
  • Define escalation path and ownership per service.
  • Add post-incident checklist for query regression prevention.
  • At least 3 saved views exist (global + per-service) — configured in GlitchTip (manual).
  • At least 2 alerts are active and tested — configured in GlitchTip (manual).
  • Runbook is published and linked from team docs — docs/runbooks/slow-db-query-runbook.md.
  • Disable noisy alerts while retaining the baseline view.
  • Adjust thresholds/sample rates to reduce false positives.
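
The threshold defaults and alert conditions listed above can be sanity-checked offline before they are configured in GlitchTip. The following is a minimal sketch, not GlitchTip's own alert logic: it assumes span durations have already been exported as plain numbers in milliseconds, that "sustained" means every one-minute bucket in the last 10 minutes has a p95 above the slow-query threshold, and that p95 is computed with the nearest-rank method. The function names (classify, p95, sustained_breach, critical_count) are hypothetical and exist only for illustration.

```python
SLOW_MS = 200           # slow query default from this ticket
CRITICAL_MS = 1000      # critical query default from this ticket
SUSTAINED_MINUTES = 10  # sustained-issue window from this ticket


def classify(duration_ms: float) -> str:
    """Label one DB span duration using the ticket's default thresholds."""
    if duration_ms > CRITICAL_MS:
        return "critical"
    if duration_ms > SLOW_MS:
        return "slow"
    return "ok"


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile; returns 0.0 for an empty bucket."""
    if not values:
        return 0.0
    ordered = sorted(values)
    # 1-based nearest rank, ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def sustained_breach(
    per_minute: list[list[float]],
    threshold_ms: float = SLOW_MS,
    minutes: int = SUSTAINED_MINUTES,
) -> bool:
    """True if each of the last `minutes` one-minute buckets has p95 above threshold.

    Assumption: one bucket per minute, oldest first.
    """
    if len(per_minute) < minutes:
        return False
    return all(p95(bucket) > threshold_ms for bucket in per_minute[-minutes:])


def critical_count(durations_ms: list[float]) -> int:
    """Count spans above the critical threshold in one alert window."""
    return sum(1 for d in durations_ms if d > CRITICAL_MS)
```

Replaying a recent incident's durations through sustained_breach and critical_count is one way to check whether a proposed threshold would have fired, which may help when adjusting thresholds to reduce false positives.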