TKT-005: GlitchTip Dashboards, Alerts, and Runbook
- Status: Todo
- Priority: P1
- Estimated effort: 1 day
- Depends on: TKT-003, TKT-004
Objective
Create actionable GlitchTip observability for slow DB queries and document response workflows.
- Saved views/filters for slow DB spans.
- Alerts for sustained degradation.
- On-call runbook for triage.
Operational Definitions
- Slow query (default): `db.duration_ms > 200`
- Critical query: `db.duration_ms > 1000`
- Sustained issue: p95 over threshold for >= 10 minutes
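The definitions above can be read as a small classification rule. The following is a minimal sketch in Python, not GlitchTip configuration. The names `classify` and `is_sustained` are hypothetical, and the sketch assumes that "sustained" means the p95 of each of the most recent 10 one-minute buckets exceeds the threshold, which is only one possible reading of the definition.

```python
from typing import Sequence

SLOW_MS = 200            # slow query threshold (default) from the definitions
CRITICAL_MS = 1000       # critical query threshold from the definitions
SUSTAINED_MINUTES = 10   # minimum duration for a sustained issue


def classify(duration_ms: float) -> str:
    """Label a single DB span duration as 'ok', 'slow', or 'critical'."""
    if duration_ms > CRITICAL_MS:
        return "critical"
    if duration_ms > SLOW_MS:
        return "slow"
    return "ok"


def is_sustained(
    p95_per_minute: Sequence[float],
    threshold_ms: float = SLOW_MS,
    minutes: int = SUSTAINED_MINUTES,
) -> bool:
    """Return True when each of the last `minutes` per-minute p95 values
    exceeds the threshold (assumed interpretation of 'sustained')."""
    if len(p95_per_minute) < minutes:
        return False  # not enough history to call it sustained
    return all(value > threshold_ms for value in p95_per_minute[-minutes:])
```

Both comparisons are strict, so a span of exactly 200 ms is not slow and a span of exactly 1000 ms is slow but not critical.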
Setup Tasks
- Create a saved performance view filtered by `op=db.query` and `db.slow_query=true`.
- Create endpoint-level view grouped by route + fingerprint.
- Create service-level views for dashboard, ingestion, and consumer-api.
- Configure alert: p95 db span duration above threshold for 10 minutes.
- Configure alert: critical span count above threshold within a 5-minute window.
- Route alerts to configured channel(s) (email/Discord/etc.).
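Before enabling the two alerts, it can help to approximate their conditions offline against sample data to pick thresholds that are not too noisy. The sketch below assumes span data is available as `(timestamp_s, duration_ms)` pairs. The helper names, the nearest-rank percentile method, and the 1000 ms and 300 s defaults are assumptions for illustration. GlitchTip's own percentile computation may differ.

```python
from typing import Iterable, Sequence, Tuple


def p95(durations_ms: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of span durations."""
    if not durations_ms:
        raise ValueError("no samples")
    ordered = sorted(durations_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float error
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def critical_count(
    spans: Iterable[Tuple[float, float]],
    now_s: float,
    window_s: float = 300.0,      # 5-minute window
    critical_ms: float = 1000.0,  # assumed critical threshold
) -> int:
    """Count spans inside (now - window, now] whose duration is critical."""
    return sum(
        1
        for timestamp_s, duration_ms in spans
        if now_s - window_s < timestamp_s <= now_s and duration_ms > critical_ms
    )
```

Running these helpers over a known incident and a known quiet period gives a rough sense of where the alert thresholds should sit.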
Runbook Tasks
- Add runbook doc section: “How to triage slow DB query alerts” (see docs/runbooks/slow-db-query-runbook.md).
- Define first checks (new deploys, sample-rate changes, shard imbalance, hot endpoint).
- Define mitigation steps (throttle, cache, index, temporary feature flags).
- Define escalation path and ownership per service.
- Add post-incident checklist for query regression prevention.
Acceptance Criteria
- At least 3 saved views exist (global + per-service); configured manually in GlitchTip.
- At least 2 alerts are active and tested; configured manually in GlitchTip.
- Runbook is published and linked from team docs: docs/runbooks/slow-db-query-runbook.md.
Rollback Plan
- Disable noisy alerts while retaining the baseline view.
- Adjust thresholds/sample rates to reduce false positives.