Architecture Decision Record for the localized search strategy in the sharded D1 architecture — projection-table vs. FTS5.
| Field | Value |
|---|---|
| Status | Accepted |
| Deciders | API + data owners |
| Epic | E — Multi-Language Search And Fallback (Phase 4) |
| Depends on | A3 (load scenarios with latency baseline) |
| Supersedes | — |
## Context

The platform stores localizable text fields in JSON columns in Cloudflare D1 (SQLite):

- `publications.title TEXT` — `{ "en": "…", "pt": "…" }`
- `people.name TEXT` — `{ "en": "…", "pt": "…" }`
- `people.biography TEXT` — same
- `person_categories.name` / `description TEXT` — same
The data lives across 5 D1 shards (DB_0–DB_4) partitioned by a djb2 hash of the entity identifier. Any list or search operation fans out to all five shards in parallel and merges results in the Worker.
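For illustration, the routing rule described above can be sketched as follows. This is not the production code: the exact djb2 variant and constant names are assumptions.

```typescript
// Hypothetical sketch of the shard-routing rule described above.
const SHARD_COUNT = 5;

// Classic djb2: h = h * 33 + charCode, kept in the unsigned 32-bit range.
function djb2(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    h = ((h << 5) + h + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

// Point lookups hit exactly one shard; list/search must fan out to all five.
function shardFor(entityId: string): number {
  return djb2(entityId) % SHARD_COUNT;
}
```

Because the shard is a pure function of the entity identifier, projection rows written alongside an entity always land on the same shard as their source row, which is what makes the per-shard JOIN in the options below possible.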
The current search implementation is English-only and hardcoded:

```sql
-- publications/handlers.ts
AND LOWER(json_extract(p.title, '$.en')) LIKE ?

-- people/handlers.ts
json_extract(r.name, '$.en') LIKE ?
```

This breaks when the UI is in Portuguese (or any other locale): searching for “Investigador” returns nothing because the search only matches the English value.
The goal of E2–E3 is to make search locale-aware: a query in locale `pt` should match the `pt` value of a field (with fallback to `en` if the `pt` value is absent).
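The fallback rule stated above can be expressed as a tiny helper. The `LocalizedString` shape mirrors the JSON columns shown earlier; the helper itself and the `DEFAULT_LOCALE` constant are illustrative, not existing code.

```typescript
// Mirrors the JSON locale columns: either a plain string or a locale map.
type LocalizedString = string | Record<string, string | undefined>;

const DEFAULT_LOCALE = "en"; // assumption: en is the platform default

// Resolve a localized value: prefer the requested locale,
// fall back to the default locale when the translation is absent.
function resolveLocalized(value: LocalizedString, locale: string): string | undefined {
  if (typeof value === "string") return value;
  return value[locale] ?? value[DEFAULT_LOCALE];
}
```

Search must implement the same rule on the database side, which is what the two options below are about.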
## Decision Drivers

- Correctness — search must return records that match in the requested locale.
- Performance — no regression vs. the current single-`LIKE` path; target P95 < 200 ms on fan-out of 5 shards.
- Maintainability — new locales should require only data backfill, not schema changes.
- D1 constraints — D1 is SQLite-on-Workers: no custom tokenizers, no external search engine, Worker CPU budget ≤ 10 ms before streaming is forced.
- Operational simplicity — the backfill must be interruptible and resumable (large datasets).
- Future readiness — eventually support 100+ locales, including CJK languages that require different tokenization.
## Considered Options

### Option A — Projection Table

Introduce a `search_projections` table (one row per entity × locale × field):
```sql
CREATE TABLE search_projections (
  entity_type TEXT NOT NULL,    -- 'publication' | 'person'
  entity_id   TEXT NOT NULL,    -- DOI or ORCID, same shard as source
  locale      TEXT NOT NULL,    -- 'en', 'pt', …
  field       TEXT NOT NULL,    -- 'title', 'name', 'biography', …
  content     TEXT NOT NULL,    -- plain-text value for this locale/field
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX idx_sp_locale_field_content
  ON search_projections (entity_type, locale, field, content);
```

Search query (publications, locale=`pt`, fallback to `en`):

```sql
SELECT p.*
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
 AND sp.entity_id   = p.id
 AND sp.locale      = ?          -- 'pt'
 AND sp.field       = 'title'
 AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
ORDER BY p.publication_year DESC
LIMIT ? OFFSET ?;
```

Fallback (if no `pt` row) is handled by a second pass against `en`, or by always inserting a fallback row at write time (preferred — avoids the second query).
Write path: on every insert/update of a localizable field, upsert the projection rows for all locales present in the JSON value. Implemented in a shared `upsertSearchProjections(db, entityType, entityId, fields)` helper called from the ingestion worker and the PATCH handlers.
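The preferred “always insert a fallback row at write time” strategy could look like the sketch below: for each registered locale missing a translation, copy the default-locale value so the read path never needs a second pass. `REGISTERED_LOCALES` and the row shape are assumptions for illustration.

```typescript
const DEFAULT_LOCALE = "en";
const REGISTERED_LOCALES = ["en", "pt"]; // assumption: the platform's locale registry

interface ProjectionRow {
  locale: string;
  field: string;
  content: string;
}

// Materialize one projection row per registered locale for a field,
// copying the default-locale value where a translation is missing.
function projectionRowsFor(
  field: string,
  value: Record<string, string | undefined>,
): ProjectionRow[] {
  const fallback = value[DEFAULT_LOCALE];
  const rows: ProjectionRow[] = [];
  for (const locale of REGISTERED_LOCALES) {
    const content = value[locale] ?? fallback;
    if (content) rows.push({ locale, field, content });
  }
  return rows;
}
```

The trade-off: slightly more rows per entity in exchange for a read path that is always a single equality match on `locale`.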
### Option B — FTS5 Virtual Tables

Create an FTS5 virtual table per shard that indexes localized text with language tags embedded in the content:

```sql
CREATE VIRTUAL TABLE publications_fts USING fts5(
  id     UNINDEXED,
  locale UNINDEXED,
  title,
  content = publications,   -- external-content mode
  tokenize = 'unicode61'
);
```

Search:

```sql
SELECT p.*
FROM publications p
JOIN publications_fts fts
  ON fts.id = p.id
 AND fts.locale = ?
WHERE publications_fts MATCH ?
ORDER BY rank;
```

FTS5 with the `unicode61` tokenizer handles Latin-script languages (en, pt, es, fr) via Unicode case folding and diacritic removal (it does not stem; stemming requires the porter tokenizer). For CJK languages it would require the trigram tokenizer or pre-tokenized input.
## Comparison

| Criterion | A — Projection Table | B — FTS5 |
|---|---|---|
| Implementation effort | Low — standard SQL DML | Medium — FTS5 virtual-table wiring, external-content setup |
| Query complexity | Low — indexed JOIN with `LIKE` | Low — single `MATCH` clause |
| D1 support | Full — plain table and index | Supported, but no custom tokenizers; trigram not available in D1 |
| New-locale cost | Data-only backfill, no DDL | Requires `INSERT INTO fts(fts) VALUES ('rebuild')` on each shard |
| CJK / 100-language path | Pre-tokenize at write time, store tokens | FTS5 trigram unavailable; would need pre-tokenization anyway |
| Write amplification | ×N rows per entity (N = locale count) | Similar — FTS5 shadow tables have the same write cost |
| Storage overhead | ×N plain-text rows | FTS5 index is typically 1.5–2× content size per locale |
| `LIKE` semantics | Exact prefix/infix match | BM25 ranking, plus stemming via the porter tokenizer (better relevance, harder to control) |
| Backfill | Resumable with cursor on `(entity_id, locale)` | Resumable with batch `INSERT INTO fts SELECT …` |
| Ops on schema change | `ALTER TABLE` or new column | Drop and recreate the virtual table; must re-run a full rebuild |
| Observability | Row count per locale is directly queryable | Shadow-table internals are opaque |
| Risk | Low — no new SQLite features | Medium — FTS5 behavior on D1 needs validation; no community FTS examples at scale |
### Performance expectations

Both options add one index-assisted operation per shard per query. Given the current dataset size (< 10 k publications, < 500 people per shard):
| Metric | Current (English LIKE) | Option A (Projection JOIN) | Option B (FTS5 MATCH) |
|---|---|---|---|
| P50 single-shard (est.) | 2–5 ms | 3–8 ms | 3–6 ms |
| P95 single-shard (est.) | 8–15 ms | 10–25 ms | 8–20 ms |
| Fan-out overhead (5 shards) | ~15 ms | ~20 ms | ~18 ms |
| At 100 k rows / shard | Unknown | 15–40 ms (with index) | 10–30 ms (BM25 index) |
FTS5 has a slight edge on relevance ranking and scan speed at large scale. Projection tables have a slight edge on operational predictability and locale controllability.
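The fan-out overhead row above assumes the pattern the Worker already uses for lists: query all shards in parallel, then merge, re-sort, and truncate in the Worker. A minimal sketch, with the per-shard query function as a stand-in for a real D1 call:

```typescript
interface Publication {
  id: string;
  publication_year: number;
}

// Fan one search out to every shard in parallel, then re-apply the
// global sort and limit in the Worker. Total latency is bounded by the
// slowest shard, not the sum — hence the ~20 ms fan-out estimate above.
async function fanOutSearch(
  shards: Array<(query: string) => Promise<Publication[]>>,
  query: string,
  limit: number,
): Promise<Publication[]> {
  const perShard = await Promise.all(shards.map((run) => run(query)));
  return perShard
    .flat()
    .sort((a, b) => b.publication_year - a.publication_year)
    .slice(0, limit);
}
```

Note that per-shard `LIMIT`s must be at least the global limit, since the top results could all come from one shard.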
## Decision

Use Option A (Projection Table) as the primary localized search mechanism for E2.
## Rationale

1. **Lower risk on D1.** SQLite FTS5 is supported on D1, but there are no documented production examples of FTS5 at scale on D1 with multi-language content. Option A uses only an indexed `JOIN` + `LIKE` — a well-understood D1 workload.
2. **Simpler ops.** Adding a new locale requires only a backfill script. No DDL changes, no FTS rebuild, no shadow-table management across the 5 shard databases.
3. **CJK parity.** For Phase 5 (100-language support), both options require pre-tokenization for CJK. Option A stores the pre-tokenized tokens in the `content` column — the same tokenization work feeds any future FTS5 upgrade. There is no unique FTS5 advantage here.
4. **Deterministic fallback.** The projection table makes locale fallback explicit: the write path decides which locale values to store; the read path is a single JOIN. FTS5 `MATCH` semantics are harder to tune for “search in pt, fall back to en” without query manipulation.
5. **FTS5 as a future optimization layer.** If relevance ranking becomes a product requirement (beyond prefix/infix match), an external-content FTS5 table can be layered on top of the projection table with no changes to the projection write path. This preserves optionality without betting on FTS5 from the start.
## Consequences

### Positive

- Locale-aware search works for all registered locales from day one of E2.
- Adding a new locale is a data operation (backfill script) with no code change required in the query layer.
- Backfill is fully resumable: the script pages through `entity_id` with a cursor and can be restarted safely.
- Projection rows are directly auditable (`SELECT COUNT(*) FROM search_projections WHERE locale = 'pt'`).
### Negative / Trade-offs

- Write amplification: every publication or person save triggers ×N upserts into `search_projections` (N = number of locales). At 2 locales this is negligible; at 100 locales it adds ~100 ms to write latency unless batched.
- Storage cost: currently < 50 MB total across all shards. At 2 locales and 10 k publications with a 200-character average title, this adds ~4 MB per shard — acceptable.
- `LIKE` search does not rank results by relevance. This is acceptable for the current use case (researchers filtering their own institution’s publications). If relevance ranking is required, add an FTS5 layer in Phase 5.
### Neutral

- Query performance is equivalent to or marginally slower than the current English-only `LIKE`. The additional JOIN is index-assisted and the performance delta is within the A3 baseline margins.
## Phased Rollout Plan

### Phase 1 — Schema (E2, Sprint 1)

New migration: `022_search_projections.sql`
```sql
CREATE TABLE IF NOT EXISTS search_projections (
  entity_type TEXT NOT NULL CHECK (entity_type IN ('publication', 'person')),
  entity_id   TEXT NOT NULL,
  locale      TEXT NOT NULL,
  field       TEXT NOT NULL,
  content     TEXT NOT NULL DEFAULT '',
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX IF NOT EXISTS idx_sp_lookup
  ON search_projections (entity_type, locale, field, LOWER(content));
```

Applied to all 5 D1 shards. The table starts empty — search falls back to the current English `LIKE` path until backfill completes.
Dual-read guard in handlers: while `search_projections` is empty (or below a configurable coverage threshold), fall back to `json_extract(…, '$.en') LIKE ?`. This ensures zero search regression during rollout.
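The guard could be a simple coverage-ratio check fed by the Phase 4 coverage query. The threshold value and names below are illustrative, not decided.

```typescript
// Assumption: 95% coverage is "good enough" to switch paths;
// the real threshold would be configurable per environment.
const COVERAGE_THRESHOLD = 0.95;

type SearchPath = "projection" | "legacy-en-like";

// Decide which search path to use during rollout. Falls back to the
// legacy English-only LIKE path until the projection table covers
// enough of the corpus.
function chooseSearchPath(coveredEntities: number, totalEntities: number): SearchPath {
  if (totalEntities === 0) return "legacy-en-like";
  return coveredEntities / totalEntities >= COVERAGE_THRESHOLD
    ? "projection"
    : "legacy-en-like";
}
```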
### Phase 2 — Write Path (E2, Sprint 1–2)

Implement `upsertSearchProjections(db, entityType, entityId, fields)` in `packages/utils`:
```typescript
// DEFAULT_LOCALE and LocalizedString come from the shared i18n utils.
export async function upsertSearchProjections(
  db: D1Database,
  entityType: "publication" | "person",
  entityId: string,
  fields: Record<string, LocalizedString>, // field name → JSON value
): Promise<void> {
  const now = Date.now();
  const rows: { locale: string; field: string; content: string }[] = [];

  for (const [field, value] of Object.entries(fields)) {
    if (typeof value === "string") {
      rows.push({ locale: DEFAULT_LOCALE, field, content: value });
    } else if (value && typeof value === "object") {
      for (const [locale, text] of Object.entries(value)) {
        if (text) rows.push({ locale, field, content: String(text) });
      }
    }
  }

  if (rows.length === 0) return;

  // Batch upsert — D1 supports parameterized batch statements
  const stmts = rows.map(({ locale, field, content }) =>
    db
      .prepare(
        `INSERT INTO search_projections
           (entity_type, entity_id, locale, field, content, updated_at)
         VALUES (?, ?, ?, ?, ?, ?)
         ON CONFLICT (entity_type, entity_id, locale, field)
         DO UPDATE SET content = excluded.content, updated_at = excluded.updated_at`,
      )
      .bind(entityType, entityId, locale, field, content, now),
  );

  await db.batch(stmts);
}
```

Wire into:

- `@platform/ingestion`: after each publication upsert and each person upsert
- `dashboard-api` `PATCH /publications/:doi`: after applying the edit patch
- `dashboard-api` `PATCH /people/:orcid`: after saving name/biography edits
### Phase 3 — Read Path (E2, Sprint 2)

Replace the hardcoded `json_extract(…, '$.en') LIKE ?` in both handlers with a locale-aware query:

```sql
SELECT p.*, sp.content AS matched_content
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
 AND sp.entity_id   = p.id
 AND sp.locale IN (?, ?)     -- [requested_locale, DEFAULT_LOCALE]
 AND sp.field = 'title'
 AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
GROUP BY p.id                -- deduplicate when both locales match
ORDER BY …
```

The `IN (?, ?)` with `GROUP BY p.id` deduplication gives correct fallback behavior: a record matches if either its `pt` or `en` title contains the query string.
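One detail the query above leaves to the caller is the bind parameter: the user's input must be lowercased, wrapped in `%…%`, and have any literal `%` or `_` escaped so it cannot act as a wildcard. A sketch of that helper follows; note escaping only takes effect if an `ESCAPE '\'` clause is added to the `LIKE`, which the query above does not yet include.

```typescript
// Build the bind parameter for `LOWER(sp.content) LIKE ? ESCAPE '\'`.
// Escapes SQL LIKE wildcards in user input so "100%" matches literally.
function toLikePattern(userQuery: string): string {
  const escaped = userQuery.toLowerCase().replace(/[\\%_]/g, (c) => "\\" + c);
  return `%${escaped}%`;
}
```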
Add a `locale` query parameter (default: `en`) to `/api/publications` and `/api/people` — handled in the E3 schema work.
### Phase 4 — Backfill (E2, Sprint 2–3)

Backfill script: `scripts/db/backfill-search-projections.sh`

- Pages through all publications and people in each shard using a cursor on `(entity_type, entity_id)`.
- Writes projection rows via the same `upsertSearchProjections` helper (consistency).
- Tracks progress in a `backfill_state` KV key per shard (start cursor, last processed id, count).
- Safe to re-run: `ON CONFLICT DO UPDATE` is idempotent.
- Runs against production via `wrangler d1 execute` or as a scheduled Durable Object task.
- Estimated runtime: < 5 minutes per shard at current dataset size.
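The resumable paging described above amounts to keyset pagination on the primary key: fetch a page after the last seen id, process it, persist the cursor, repeat. A sketch with in-memory stand-ins for the shard query and KV state (function names are illustrative):

```typescript
interface Entity {
  id: string;
}

// Keyset-paginate a shard's entities, persisting the cursor after each
// page so an interrupted run can resume from the last processed id.
async function backfillShard(
  fetchPage: (afterId: string | null, limit: number) => Promise<Entity[]>,
  process: (batch: Entity[]) => Promise<void>,
  saveCursor: (lastId: string) => Promise<void>,
  startCursor: string | null = null,
  pageSize = 100,
): Promise<number> {
  let cursor = startCursor;
  let total = 0;
  for (;;) {
    const page = await fetchPage(cursor, pageSize);
    if (page.length === 0) return total;
    await process(page);      // e.g. upsertSearchProjections per entity
    cursor = page[page.length - 1].id;
    await saveCursor(cursor); // idempotent restart point
    total += page.length;
  }
}
```

Because the upsert is idempotent, replaying the last page after a crash is harmless; the cursor only needs to be at-least-once persisted.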
Coverage check:

```sql
SELECT
  (SELECT COUNT(DISTINCT entity_id)
     FROM search_projections
    WHERE entity_type = 'publication') AS covered,
  (SELECT COUNT(*)
     FROM publications
    WHERE deleted_at IS NULL) AS total;
```

### Phase 5 — FTS5 Upgrade Layer (Future, Phase 5+)

If product requirements include relevance ranking or stemming:
- Add a `publications_fts` external-content FTS5 virtual table pointing at `search_projections`.
- No changes to the projection write path — the FTS5 table uses `content=search_projections`.
- Add a `MATCH` query path gated behind a feature flag.
- A/B test relevance vs. `LIKE` results before full rollout.

This upgrade is entirely non-breaking for E2 consumers.
## Open Questions

| Question | Owner | Resolution target |
|---|---|---|
| Should `biography` be projected for people search, or only `name`? | Product | E2 kickoff |
| What is the max acceptable write latency for the publication PATCH endpoint? | API owners | E2 Sprint 1 |
| Should the backfill run as a Cloudflare Cron or a one-shot `wrangler d1 execute`? | Ops | E2 Sprint 2 |
| For Phase 5 (100 languages), what is the plan for CJK pre-tokenization? | Data owners | Phase 5 planning |
## References

- Performance And Localization Roadmap
- Performance And Localization Implementation Backlog
- Cloudflare D1 FTS documentation
- SQLite FTS5 documentation
- `migrations/schema.sql` — canonical schema file that includes the JSON locale columns
- `workers/my-api/src/features/publications/handlers.ts:161` — current English-only LIKE search
- `workers/my-api/src/features/people/handlers.ts:221` — current English-only LIKE search