TKT-006: Test Coverage, Rollout Strategy, and Ratchet
Status: Todo
Priority: P1
Estimated effort: 1-2 days
Depends on: TKT-001 through TKT-005
Objective
Finalize production readiness with test hardening, staged rollout controls, and a measurable ratchet plan for long-term query performance improvements.
- Test completion criteria.
- Progressive rollout matrix.
- KPI and ratchet checkpoints.
Test Tasks
- Add/expand unit tests for query span helper and threshold logic (slow/fast query tests in monitoring-cloudflare.test.ts).
- Add regression test to ensure sensitive query data is redacted (fingerprint high-cardinality test in monitoring-cloudflare.test.ts).
- Add service-level tests where helper integration is mocked/verified (error-path propagation test in monitoring-cloudflare.test.ts; traceServerDbQuery test in monitoring.test.ts).
- Ensure CI test commands pass for modified packages/apps: all 4 affected packages pass (packages/utils 34 tests, apps/dashboard 19 tests, workers/consumer-api 15 tests, workers/ingestion 4 monitoring tests); turbo type-check is clean on dashboard + utils; no new TS errors in worker-consumer-api or worker-ingestion.
- Document known test limitations and follow-up tasks: see the Known Test Limitations section.
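The redaction regression test listed above can be pictured with a small sketch. It assumes a hypothetical fingerprintQuery helper; the name and the regex rules are illustrative and are not taken from the monitoring package. The idea is that literal values are replaced before the query text is attached to a span, so two queries that differ only in their values share one low-cardinality fingerprint.

```typescript
// Hypothetical helper: the real fingerprint function in the codebase may differ.
function fingerprintQuery(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?") // single-quoted string literals, including '' escapes
    .replace(/\b\d+(?:\.\d+)?\b/g, "?") // standalone numeric literals
    .replace(/\s+/g, " ") // collapse runs of whitespace
    .trim();
}

// Two queries that differ only in their literal values.
const fingerprintA = fingerprintQuery(
  "SELECT * FROM users WHERE email = 'a@example.com' AND id = 42"
);
const fingerprintB = fingerprintQuery(
  "SELECT * FROM users  WHERE email = 'b@example.com' AND id = 7"
);

// Both collapse to: SELECT * FROM users WHERE email = ? AND id = ?
console.log(fingerprintA === fingerprintB, fingerprintA);
```

A regression test along these lines would assert both that the fingerprints match and that no literal (for example the email address) survives in the output.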
Known Test Limitations
- Author-matching cross-shard tracing (author-matching.ts): tests verify that the queryOrcidsAcrossShards helper compiles and type-checks, but no dedicated test exists for the shard-index-to-span-attribute flow across all 5 lookup strategies. Follow-up: add an author-matching.test.ts with a traceDbQuery mock verifying the shard attribute per strategy.
- Workspace-modules and site-tools DB tracing: instrumented but not unit-tested in isolation (only type-checked). Follow-up: add workspace-modules.test.ts and site-tools/queries.test.ts with a traceServerDbQuery mock.
- Date.now mock ordering: the slow-query test relies on a call-count pattern. If the implementation changes to a single Date.now call, the test will need updating.
- Worker deploy validation: span emission under the real Cloudflare runtime (not Vitest) can only be confirmed post-deploy with SENTRY_TRACES_SAMPLE_RATE=1.0 in preview.
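The Date.now call-count limitation is easier to see in a minimal sketch. The timeQuery helper and the withClock stub below are hypothetical and simplified to synchronous code; the real helper and the Vitest mock in monitoring-cloudflare.test.ts may look different. The sketch assumes the helper reads the clock exactly twice, once before and once after the query.

```typescript
// Hypothetical timing helper: reads Date.now() once before and once after the work.
function timeQuery<T>(run: () => T, slowMs: number): { result: T; durationMs: number; slow: boolean } {
  const start = Date.now();
  const result = run();
  const durationMs = Date.now() - start;
  return { result, durationMs, slow: durationMs >= slowMs };
}

// Replace Date.now with a fixed sequence of values (one per call), then restore it.
function withClock<T>(ticks: number[], body: () => T): T {
  const realNow = Date.now;
  let call = 0;
  Date.now = () => ticks[Math.min(call++, ticks.length - 1)];
  try {
    return body();
  } finally {
    Date.now = realNow;
  }
}

// First call returns 1000, second returns 1750: a 750 ms "query" against a 500 ms threshold.
const slowCase = withClock([1000, 1750], () => timeQuery(() => "rows", 500));
// First call returns 1000, second returns 1010: a 10 ms "query".
const fastCase = withClock([1000, 1010], () => timeQuery(() => "rows", 500));

console.log(slowCase.slow, fastCase.slow);
```

Because the stub hands out values by call order, any change in how many times the helper reads the clock shifts which value lands where, which is why the test would need updating.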
Rollout Tasks
- Define rollout matrix:
  - Stage 1: preview at 100% traces for 24h.
  - Stage 2: production at 2% traces for 48h.
  - Stage 3: production at 5% traces if volume/cost acceptable.
- Add explicit stop conditions (event budget, latency overhead, error spikes).
- Define rollback owner and SLA for disabling traces.
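One way to make the rollout matrix and stop conditions reviewable is to write them down as data. The sketch below is an assumption about shape only: the stage sample rates and durations come from the list above, while the type names, the stopLimits values, and the stopReasons function are hypothetical placeholders that the team would replace with agreed numbers.

```typescript
type Stage = {
  name: string;
  environment: "preview" | "production";
  tracesSampleRate: number; // fraction of transactions traced
  minHours: number | null; // null: no fixed duration stated in the ticket
};

const rolloutMatrix: Stage[] = [
  { name: "stage-1", environment: "preview", tracesSampleRate: 1.0, minHours: 24 },
  { name: "stage-2", environment: "production", tracesSampleRate: 0.02, minHours: 48 },
  { name: "stage-3", environment: "production", tracesSampleRate: 0.05, minHours: null },
];

// Placeholder limits: assumptions for illustration, not values from this ticket.
const stopLimits = {
  maxEventsPerDay: 100000,
  maxLatencyOverheadMs: 5,
  maxErrorRateIncrease: 0.01,
};

type Observed = { eventsPerDay: number; latencyOverheadMs: number; errorRateIncrease: number };

// Returns every stop condition that the observed metrics violate (empty means keep going).
function stopReasons(o: Observed): string[] {
  const reasons: string[] = [];
  if (o.eventsPerDay > stopLimits.maxEventsPerDay) reasons.push("event budget exceeded");
  if (o.latencyOverheadMs > stopLimits.maxLatencyOverheadMs) reasons.push("latency overhead too high");
  if (o.errorRateIncrease > stopLimits.maxErrorRateIncrease) reasons.push("error spike");
  return reasons;
}

console.log(stopReasons({ eventsPerDay: 150000, latencyOverheadMs: 2, errorRateIncrease: 0 }));
```

Keeping the limits in one object gives the rollback owner a single place to check when deciding whether to disable traces.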
KPI Tasks
- Baseline p50/p95/p99 DB span duration by service and endpoint.
- Track top 10 slow query fingerprints weekly.
- Track alert volume and false-positive rate.
- Create monthly review checkpoint to tune thresholds and sampling.
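For the baseline task, the percentiles can be computed offline from exported span durations. The sketch below uses the nearest-rank method and groups by service and endpoint; the DbSpan shape, the function names, and the sample numbers are hypothetical, and a monitoring backend may use a different percentile definition.

```typescript
type DbSpan = { service: string; endpoint: string; durationMs: number };
type Baseline = { p50: number; p95: number; p99: number };

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(durationsMs: number[], p: number): number {
  if (durationsMs.length === 0) throw new Error("no samples");
  const sorted = durationsMs.slice().sort((x, y) => x - y);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function summarize(durationsMs: number[]): Baseline {
  return {
    p50: percentile(durationsMs, 50),
    p95: percentile(durationsMs, 95),
    p99: percentile(durationsMs, 99),
  };
}

// Group spans by "service endpoint" and summarize each group.
function baselineByEndpoint(spans: DbSpan[]): Map<string, Baseline> {
  const groups = new Map<string, number[]>();
  spans.forEach((s) => {
    const key = s.service + " " + s.endpoint;
    const bucket = groups.get(key) || [];
    bucket.push(s.durationMs);
    groups.set(key, bucket);
  });
  const out = new Map<string, Baseline>();
  groups.forEach((durations, key) => out.set(key, summarize(durations)));
  return out;
}

// Made-up durations for illustration only.
const sampleBaseline = summarize([12, 15, 20, 22, 30, 45, 60, 120, 250, 900]);
console.log(sampleBaseline); // p50 is 30; p95 and p99 are both 900
```

With only ten samples, p95 and p99 both land on the maximum, which is a reminder that tail percentiles need enough spans per endpoint before they are used to tune thresholds.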
Definition of Done
- Tracing and DB instrumentation deployed to all target services.
- Alerts and dashboards are stable and useful for operations.
- Tests cover critical helper behaviors and integration paths.
- Team has a ratchet cadence (weekly/monthly) for performance improvement.
Exit Checklist
- All tickets marked complete in README.
- Follow-up optimization tickets created for top slow fingerprints (requires post-deploy data).
- Runbook linked in relevant service docs: linked from apps/dashboard/AGENTS.md and packages/platform-ingestion/docs/OBSERVABILITY.md.