TKT-006: Test Coverage, Rollout Strategy, and Ratchet

Migrated from root technical docs.

Status: Todo
Priority: P1
Estimated effort: 1-2 days
Depends on: TKT-001 through TKT-005

Objective

Finalize production readiness with test hardening, staged rollout controls, and a measurable ratchet plan for long-term query performance improvements.

Scope

Test completion criteria.
Progressive rollout matrix.
KPI and ratchet checkpoints.

Test Tasks

Add/expand unit tests for query span helper and threshold logic (slow/fast query tests in monitoring-cloudflare.test.ts).
Add regression test to ensure sensitive query data is redacted (fingerprint high-cardinality test in monitoring-cloudflare.test.ts).
Add service-level tests where helper integration is mocked/verified (error-path propagation test in monitoring-cloudflare.test.ts; traceServerDbQuery test in monitoring.test.ts).
Ensure CI test commands pass for modified packages/apps. All 4 affected packages pass: packages/utils (34 tests), apps/dashboard (19 tests), workers/consumer-api (15 tests), workers/ingestion (4 monitoring tests). turbo type-check is clean on dashboard and utils, with no new TS errors in worker-consumer-api or worker-ingestion.
Document known test limitations and follow-up tasks (see Known Test Limitations).
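
The threshold and redaction behaviors these tests target can be sketched as follows. This is a minimal illustration, not the helper under test: fingerprintQuery, isSlowQuery, and the 200 ms threshold are hypothetical names and values chosen for the example.

```typescript
// Hypothetical helpers for illustration; the real helper names and signatures may differ.
const SLOW_QUERY_THRESHOLD_MS = 200; // assumed value, not taken from the codebase

// Replace literals with placeholders so the fingerprint carries no user data
// and queries that differ only in their values share one fingerprint.
function fingerprintQuery(sql: string): string {
  return sql
    .replace(/'(?:[^']|'')*'/g, "?") // quoted string literals
    .replace(/\b\d+(?:\.\d+)?\b/g, "?") // numeric literals
    .replace(/\(\s*\?(?:\s*,\s*\?)*\s*\)/g, "(?)") // collapse placeholder lists such as IN (?, ?, ?)
    .replace(/\s+/g, " ")
    .trim();
}

// A query counts as slow when its duration reaches the threshold.
function isSlowQuery(durationMs: number, thresholdMs: number = SLOW_QUERY_THRESHOLD_MS): boolean {
  return durationMs >= thresholdMs;
}
```

A regression test in this style would assert that no literal from the input survives in the fingerprint, and that two queries differing only in literals or IN-list length map to the same string.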

Known Test Limitations

Author-matching cross-shard tracing (author-matching.ts): tests verify that the queryOrcidsAcrossShards helper compiles and type-checks, but no dedicated test exists for the shard-index-to-span-attribute flow across all 5 lookup strategies. Follow-up: add an author-matching.test.ts with a traceDbQuery mock that verifies the shard attribute per strategy.
Workspace-modules and site-tools DB tracing: instrumented but not unit-tested in isolation (only type-checked). Follow-up: add workspace-modules.test.ts and site-tools/queries.test.ts with a traceServerDbQuery mock.
Date.now mock ordering: the slow-query test relies on a call-count pattern. If the implementation changes to a single Date.now call, the test will need updating.
Worker deploy validation: span emission under the real Cloudflare runtime (not Vitest) can only be confirmed post-deploy with SENTRY_TRACES_SAMPLE_RATE=1.0 in preview.
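
The Date.now ordering limitation can be illustrated with a scripted clock. The sketch below is one possible way to make the dependency explicit, not the current test setup: measure and scriptedClock are hypothetical names, and the synchronous callback stands in for the query.

```typescript
// Hypothetical timing helper: reads the clock once before and once after the work.
function measure<T>(run: () => T, now: () => number = Date.now): { result: T; durationMs: number } {
  const start = now();
  const result = run();
  return { result, durationMs: now() - start };
}

// Fake clock that returns scripted values in call order and repeats the last one.
// A test built on it depends on how many times the implementation reads the clock.
function scriptedClock(values: number[]): () => number {
  let calls = 0;
  return () => values[Math.min(calls++, values.length - 1)] ?? 0;
}

// First read is the start time, second read is the end time: 350 ms elapsed.
const sample = measure(() => "rows", scriptedClock([1000, 1350]));
```

If the helper read the clock a different number of times, the scripted values would land on the wrong calls and the computed duration would change, which is why such a test needs updating alongside the implementation.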

Rollout Tasks

Define rollout matrix:
Stage 1: preview at 100% traces for 24h.
Stage 2: production at 2% traces for 48h.
Stage 3: production at 5% traces if volume/cost acceptable.
Add explicit stop conditions (event budget, latency overhead, error spikes).
Define rollback owner and SLA for disabling traces.
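
The rollout matrix and stop conditions above could be captured as data plus a small check. This is a sketch under assumptions: the type names, field names, and every limit value are placeholders, since the plan names the stop conditions but not their thresholds.

```typescript
type RolloutStage = {
  name: string;
  environment: "preview" | "production";
  tracesSampleRate: number; // fraction of requests traced, 0..1
  minDurationHours: number | null; // null where the plan gives no duration
};

// Stages as listed in the rollout matrix.
const rolloutMatrix: RolloutStage[] = [
  { name: "Stage 1", environment: "preview", tracesSampleRate: 1.0, minDurationHours: 24 },
  { name: "Stage 2", environment: "production", tracesSampleRate: 0.02, minDurationHours: 48 },
  { name: "Stage 3", environment: "production", tracesSampleRate: 0.05, minDurationHours: null },
];

// Hypothetical shapes for the three stop conditions.
type StopLimits = { maxEventsPerDay: number; maxAddedLatencyMs: number; maxErrorRate: number };
type Observed = { eventsPerDay: number; addedLatencyMs: number; errorRate: number };

// Returns every stop condition that is currently violated; an empty list means continue.
function stopReasons(observed: Observed, limits: StopLimits): string[] {
  const reasons: string[] = [];
  if (observed.eventsPerDay > limits.maxEventsPerDay) reasons.push("event budget exceeded");
  if (observed.addedLatencyMs > limits.maxAddedLatencyMs) reasons.push("latency overhead too high");
  if (observed.errorRate > limits.maxErrorRate) reasons.push("error spike");
  return reasons;
}
```

Keeping the limits in one typed object gives the rollback owner a single place to read when deciding whether to disable traces.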

KPI Tasks

Baseline p50/p95/p99 DB span duration by service and endpoint.
Track top 10 slow query fingerprints weekly.
Track alert volume and false-positive rate.
Create monthly review checkpoint to tune thresholds and sampling.
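
For the baseline and the weekly top-10 list, a minimal sketch of the arithmetic is shown below. It assumes span durations have already been exported as plain records; the SpanSample shape and function names are hypothetical, and the nearest-rank percentile is one common definition that may differ from what the monitoring backend reports.

```typescript
type SpanSample = { fingerprint: string; durationMs: number };

// Nearest-rank percentile over an ascending-sorted array.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) throw new Error("no samples");
  const rank = Math.max(1, Math.ceil((p * sortedAsc.length) / 100));
  return sortedAsc[rank - 1];
}

// p50/p95/p99 for one group of durations (for example one service and endpoint).
function baseline(durationsMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...durationsMs].sort((a, b) => a - b);
  return { p50: percentile(sorted, 50), p95: percentile(sorted, 95), p99: percentile(sorted, 99) };
}

// Fingerprints ranked by p95 duration, slowest first, limited to `limit` entries.
function topSlowFingerprints(samples: SpanSample[], limit: number = 10): string[] {
  const groups: Record<string, number[]> = {};
  for (const s of samples) {
    if (!groups[s.fingerprint]) groups[s.fingerprint] = [];
    groups[s.fingerprint].push(s.durationMs);
  }
  return Object.keys(groups)
    .map((fp) => ({ fp, p95: baseline(groups[fp]).p95 }))
    .sort((a, b) => b.p95 - a.p95)
    .slice(0, limit)
    .map((entry) => entry.fp);
}
```

Ranking by p95 rather than the mean keeps occasional very slow executions visible; the monthly review could revisit that choice along with thresholds and sampling.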

Definition of Done

Tracing and DB instrumentation deployed to all target services.
Alerts and dashboards are stable and useful for operations.
Tests cover critical helper behaviors and integration paths.
Team has a ratchet cadence (weekly/monthly) for performance improvement.

Exit Checklist

All tickets marked complete in README.
Follow-up optimization tickets created for top slow fingerprints (requires post-deploy data).
Runbook linked in relevant service docs: linked from apps/dashboard/AGENTS.md and packages/platform-ingestion/docs/OBSERVABILITY.md.
