
ADR E1: Localized Search Strategy

Architecture Decision Record for the localized search strategy in the sharded D1 architecture — projection-table vs. FTS5.

| Field | Value |
| --- | --- |
| Status | Accepted |
| Deciders | API + data owners |
| Epic | E — Multi-Language Search And Fallback (Phase 4) |
| Depends on | A3 (load scenarios with latency baseline) |
| Supersedes | — |

The platform stores localizable text fields in JSON columns in Cloudflare D1 (SQLite):

  • publications.title TEXT — { "en": "…", "pt": "…" }
  • people.name TEXT — { "en": "…", "pt": "…" }
  • people.biography TEXT — same
  • person_categories.name / description TEXT — same

The data lives across 5 D1 shards (DB_0–DB_4) partitioned by a djb2 hash of the entity identifier. Any list or search operation fans out to all five shards in parallel and merges results in the Worker.
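The routing described above can be sketched as below. The document does not show the exact djb2 variant or helper names, so `shardFor` and the unsigned-32-bit folding are illustrative assumptions:

```ts
// Hypothetical sketch of the shard routing described above:
// djb2 over the entity identifier, modulo the 5 shards (DB_0 … DB_4).
const SHARD_COUNT = 5;

function djb2(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    // hash * 33 + charCode, kept as an unsigned 32-bit value
    h = (Math.imul(h, 33) + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

function shardFor(entityId: string): number {
  return djb2(entityId) % SHARD_COUNT;
}
```

Because the projection rows (below) share the entity identifier with their source row, they hash to the same shard, which keeps the search JOIN shard-local.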

Current search implementation is English-only and hardcoded:

```sql
-- publications/handlers.ts
AND LOWER(json_extract(p.title, '$.en')) LIKE ?

-- people/handlers.ts
json_extract(r.name, '$.en') LIKE ?
```

This breaks when the UI is in Portuguese (or any other locale): searching for “Investigador” returns nothing because the search only matches the English value.

The goal of E2–E3 is to make search locale-aware: a query in locale pt should match the pt value of a field (with fallback to en if the pt value is absent).
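The fallback rule is small enough to state as code. A minimal sketch, assuming the localized JSON values are plain `{ locale: text }` objects (names are illustrative):

```ts
// A localized field value as stored in the JSON columns, e.g.
// { "en": "Researcher", "pt": "Investigador" }.
type LocalizedString = Record<string, string>;

// Resolve a value for the requested locale, falling back to "en"
// when the requested locale is absent.
function resolveLocalized(
  value: LocalizedString,
  locale: string,
  fallback = "en",
): string | undefined {
  return value[locale] ?? value[fallback];
}
```

A `pt` query against `{ "en": "Researcher" }` should therefore still match on the English value — this is the behavior the projection table encodes at write time.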


  1. Correctness — search must return records that match in the requested locale.
  2. Performance — no regression vs. the current single-LIKE path; target P95 < 200 ms on fan-out of 5 shards.
  3. Maintainability — new locales should require only data backfill, not schema changes.
  4. D1 constraints — D1 is SQLite-on-Workers: no custom tokenizers, no external search engine, Worker CPU budget ≤ 10 ms before streaming is forced.
  5. Operational simplicity — the backfill must be interruptible and resumable (large datasets).
  6. Future readiness — eventually support 100+ locales including CJK languages that require different tokenization.

Option A — Projection Table

Introduce a search_projections table (one row per entity × locale × field):

```sql
CREATE TABLE search_projections (
  entity_type TEXT NOT NULL,    -- 'publication' | 'person'
  entity_id   TEXT NOT NULL,    -- DOI or ORCID, same shard as source
  locale      TEXT NOT NULL,    -- 'en', 'pt', …
  field       TEXT NOT NULL,    -- 'title', 'name', 'biography', …
  content     TEXT NOT NULL,    -- plain-text value for this locale/field
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX idx_sp_locale_field_content
  ON search_projections (entity_type, locale, field, content);
```

Search query (publications, locale=pt, fallback to en):

```sql
SELECT p.*
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
  AND sp.entity_id = p.id
  AND sp.locale = ?            -- 'pt'
  AND sp.field = 'title'
  AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
ORDER BY p.publication_year DESC
LIMIT ? OFFSET ?;
```

Fallback (if no pt row) is handled by a second pass against en, or by always inserting a fallback row at write time (preferred — avoids the second query).
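The preferred write-time fallback can be sketched as follows. The registered-locale list and helper name are assumptions for illustration:

```ts
// Illustrative sketch of the "always insert a fallback row" strategy:
// for every registered locale, store that locale's value if present,
// otherwise store the English value, so reads never need a second pass.
const REGISTERED_LOCALES = ["en", "pt"]; // assumption: known at write time

function projectionRows(
  field: string,
  value: Record<string, string>,
): { locale: string; field: string; content: string }[] {
  return REGISTERED_LOCALES.flatMap((locale) => {
    const content = value[locale] ?? value["en"]; // fall back to en here
    return content ? [{ locale, field, content }] : [];
  });
}
```

The trade-off is extra rows (one per registered locale even when the value is just a copy of `en`) in exchange for a single-pass read path.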

Write path: on every insert/update of a localizable field, upsert the projection rows for all locales present in the JSON value. Implemented in a shared upsertProjections(db, entityType, entityId, fields) helper called from the ingestion worker and the PATCH handler.

Option B — FTS5

Create an FTS5 virtual table per shard that indexes localized text, with the locale stored as an unindexed column alongside the content:

```sql
CREATE VIRTUAL TABLE publications_fts USING fts5(
  id UNINDEXED,
  locale UNINDEXED,
  title,
  content='publications',   -- external-content mode
  tokenize='unicode61'
);
```

Search:

```sql
SELECT p.*
FROM publications p
JOIN publications_fts fts
  ON fts.id = p.id
  AND fts.locale = ?
WHERE fts MATCH ?
ORDER BY fts.rank;
```

FTS5 with the unicode61 tokenizer handles Latin-script languages (en, pt, es, fr) via Unicode case-folding and diacritic removal; it does not stem (the porter tokenizer adds stemming, but for English only). For CJK languages it would require the trigram tokenizer or pre-tokenized input.
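Both options converge on write-time pre-tokenization for CJK. A minimal illustration of what that step could look like (character bigrams — purely illustrative, not a production tokenizer):

```ts
// Split CJK text into overlapping character bigrams, the kind of
// write-time token stream either search option could index.
function cjkBigrams(text: string): string[] {
  const chars = Array.from(text); // code-point safe split
  if (chars.length < 2) return chars;
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}
```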


| Criterion | A — Projection Table | B — FTS5 |
| --- | --- | --- |
| Implementation effort | Low — standard SQL DML | Medium — FTS5 virtual-table wiring, external-content setup |
| Query complexity | Low — indexed JOIN with LIKE | Low — single MATCH clause |
| D1 support | Full — plain table and index | Supported, but no custom tokenizers; trigram not available in D1 |
| New-locale cost | Data-only backfill, no DDL | Requires `INSERT INTO fts(fts) VALUES ('rebuild')` on each shard |
| CJK / 100-language path | Pre-tokenize at write time, store tokens | FTS5 trigram unavailable; would need pre-tokenization anyway |
| Write amplification | ×N rows per entity (N = locale count) | Similar — FTS5 shadow tables carry comparable write cost |
| Storage overhead | ×N plain-text rows | FTS5 index is typically 1.5–2× content size per locale |
| Match semantics | Exact prefix/infix match | BM25 ranking + stemming (better relevance, harder to control) |
| Backfill | Resumable with cursor on (entity_id, locale) | Resumable with batch `INSERT INTO fts SELECT …` |
| Ops on schema change | ALTER TABLE or new column | DROP + recreate virtual table; must re-run full rebuild |
| Observability | Row count per locale is directly queryable | Shadow-table internals are opaque |
| Risk | Low — no new SQLite features | Medium — FTS5 behavior on D1 needs validation; no community FTS5 examples at scale |

Both options add one index-assisted operation per shard per query. Given current dataset size (< 10 k publications, < 500 people per shard):

| Metric | Current (English LIKE) | Option A (Projection JOIN) | Option B (FTS5 MATCH) |
| --- | --- | --- | --- |
| P50 single-shard (est.) | 2–5 ms | 3–8 ms | 3–6 ms |
| P95 single-shard (est.) | 8–15 ms | 10–25 ms | 8–20 ms |
| Fan-out overhead (5 shards) | ~15 ms | ~20 ms | ~18 ms |
| At 100 k rows / shard | Unknown | 15–40 ms (with index) | 10–30 ms (BM25 index) |

FTS5 has a slight edge on relevance ranking and scan speed at large scale. Projection tables have a slight edge on operational predictability and locale controllability.


Use Option A (Projection Table) as the primary localized search mechanism for E2.

  1. Lower risk on D1. SQLite FTS5 is supported on D1 but there are no documented production examples of FTS5 at scale on D1 with multi-language content. Option A uses only indexed JOIN + LIKE — a well-understood D1 workload.

  2. Simpler ops. Adding a new locale requires only a backfill script. No DDL changes, no FTS rebuild, no shadow-table management across the 5 shard databases.

  3. CJK parity. For Phase 5 (100-language support), both options require pre-tokenization for CJK. Option A stores pre-tokenized tokens in the content column — the same tokenization work feeds any future FTS5 upgrade. There is no unique FTS5 advantage here.

  4. Deterministic fallback. The projection table makes locale fallback explicit: the write path decides which locale values to store; the read path is a single JOIN. FTS5 MATCH semantics are harder to tune for “search in pt, fall back to en” without query manipulation.

  5. FTS5 as a future optimization layer. If relevance ranking becomes a product requirement (beyond prefix/infix match), an FTS5 EXTERNAL_CONTENT table can be layered on top of the projection table with no changes to the projection write path. This preserves optionality without betting on FTS5 from the start.


  • Locale-aware search works for all registered locales from day one of E2.
  • Adding a new locale is a data operation (backfill script) with no code change required in the query layer.
  • Backfill is fully resumable: the script pages through entity_id with a cursor and can be restarted safely.
  • Projection rows are directly auditable (SELECT COUNT(*) FROM search_projections WHERE locale='pt').
  • Write amplification: every publication or person save triggers ×N upserts into search_projections (N = number of locales). At 2 locales this is negligible; at 100 locales it adds ~100 ms to write latency unless batched.
  • Storage cost: currently < 50 MB total across all shards. At 2 locales and 10 k publications with 200-character average title this adds ~4 MB per shard — acceptable.
  • LIKE search does not rank results by relevance. This is acceptable for the current use case (researchers filtering their own institution’s publications). If relevance ranking is required, add an FTS5 layer in Phase 5.
  • Query performance is equivalent to or marginally slower than the current English-only LIKE. The additional JOIN is index-assisted and the performance delta is within the A3 baseline margins.

New migration: 022_search_projections.sql

```sql
CREATE TABLE IF NOT EXISTS search_projections (
  entity_type TEXT NOT NULL CHECK (entity_type IN ('publication', 'person')),
  entity_id   TEXT NOT NULL,
  locale      TEXT NOT NULL,
  field       TEXT NOT NULL,
  content     TEXT NOT NULL DEFAULT '',
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX IF NOT EXISTS idx_sp_lookup
  ON search_projections (entity_type, locale, field, LOWER(content));
```

Applied to all 5 D1 shards. Table starts empty — search falls back to the current English LIKE path until backfill completes.

Dual-read guard in handlers: while search_projections is empty (or below a configurable coverage threshold), fall back to json_extract(…, '$.en') LIKE ?. This ensures zero search regression during rollout.
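The guard reduces to a pure function over the coverage numbers; the threshold value and names below are assumptions:

```ts
// Hypothetical dual-read guard: use the projection-backed query only once
// backfill coverage crosses a configurable threshold; otherwise keep the
// legacy English json_extract LIKE path.
const COVERAGE_THRESHOLD = 0.95; // illustrative default

function chooseSearchPath(
  coveredEntities: number,
  totalEntities: number,
): "projection" | "legacy-en-like" {
  if (totalEntities === 0) return "legacy-en-like";
  return coveredEntities / totalEntities >= COVERAGE_THRESHOLD
    ? "projection"
    : "legacy-en-like";
}
```

The inputs map directly onto the `covered` / `total` columns of the coverage-check query in the backfill section.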

Implement upsertSearchProjections(db, entityType, entityId, fields) in packages/utils:

packages/utils/src/searchProjections.ts

```ts
// DEFAULT_LOCALE and LocalizedString are assumed to come from the
// package's existing i18n module; D1Database from @cloudflare/workers-types.
export async function upsertSearchProjections(
  db: D1Database,
  entityType: "publication" | "person",
  entityId: string,
  fields: Record<string, LocalizedString>, // field name → JSON value
): Promise<void> {
  const now = Date.now();
  const rows: { locale: string; field: string; content: string }[] = [];
  for (const [field, value] of Object.entries(fields)) {
    if (typeof value === "string") {
      rows.push({ locale: DEFAULT_LOCALE, field, content: value });
    } else if (value && typeof value === "object") {
      for (const [locale, text] of Object.entries(value)) {
        if (text) rows.push({ locale, field, content: String(text) });
      }
    }
  }
  if (rows.length === 0) return;

  // Batch upsert — D1 supports parameterized batch statements
  const stmts = rows.map(({ locale, field, content }) =>
    db
      .prepare(
        `INSERT INTO search_projections (entity_type, entity_id, locale, field, content, updated_at)
         VALUES (?, ?, ?, ?, ?, ?)
         ON CONFLICT (entity_type, entity_id, locale, field)
         DO UPDATE SET content = excluded.content, updated_at = excluded.updated_at`,
      )
      .bind(entityType, entityId, locale, field, content, now),
  );
  await db.batch(stmts);
}
```

Wire into:

  • @platform/ingestion: after each publication upsert and each person upsert
  • dashboard-api PATCH /publications/:doi: after applying the edit patch
  • dashboard-api PATCH /people/:orcid: after saving name/biography edits

Replace the hardcoded json_extract(…, '$.en') LIKE ? in both handlers with a locale-aware query:

```sql
SELECT p.*, sp.content AS matched_content
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
  AND sp.entity_id = p.id
  AND sp.locale IN (?, ?)      -- [requested_locale, DEFAULT_LOCALE]
  AND sp.field = 'title'
  AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
GROUP BY p.id                  -- deduplicate when both locales match
ORDER BY p.publication_year DESC
LIMIT ? OFFSET ?;
```

The IN (?, ?) with GROUP BY p.id deduplication gives correct fallback behavior: a record matches if either its pt or en title contains the query string.
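The LIKE parameter is typically `%query%` built from user input, so LIKE wildcards in the input should be escaped first. A sketch, assuming the SQL adds an `ESCAPE '\'` clause to match (helper name is illustrative):

```ts
// Build a case-insensitive infix LIKE pattern from raw user input,
// escaping the SQL LIKE wildcards % and _ (and the escape char itself).
// Assumption: the query uses `LIKE ? ESCAPE '\'`.
function toLikePattern(userQuery: string): string {
  const escaped = userQuery.replace(/[\\%_]/g, (ch) => "\\" + ch);
  return `%${escaped.toLowerCase()}%`;
}
```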

Add locale query parameter (default: en) to /api/publications and /api/people — handled in E3 schema work.

Backfill script: scripts/db/backfill-search-projections.sh

  • Pages through all publications and people in each shard using a cursor on (entity_type, entity_id).
  • Writes projection rows via the same upsertSearchProjections helper (consistency).
  • Tracks progress in a backfill_state KV key per shard (start cursor, last processed id, count).
  • Safe to re-run: ON CONFLICT DO UPDATE is idempotent.
  • Runs against production via wrangler d1 execute or as a scheduled Durable Object task.
  • Estimated runtime: < 5 minutes per shard at current dataset size.
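The cursor loop above can be sketched as follows; the paging callbacks are assumptions (the real script pages via SQL and persists its cursor in the `backfill_state` KV key, which is elided here):

```ts
// Resumable backfill loop: page through entities in ascending id order,
// project each batch, and return the processed count. Restarting with the
// last persisted cursor skips already-processed entities.
async function backfillShard(
  listEntities: (afterId: string | null, limit: number) => Promise<{ id: string }[]>,
  processBatch: (ids: string[]) => Promise<void>, // e.g. upsert projections
  startCursor: string | null = null,
): Promise<number> {
  let cursor = startCursor;
  let processed = 0;
  for (;;) {
    const page = await listEntities(cursor, 100);
    if (page.length === 0) return processed;
    await processBatch(page.map((e) => e.id));
    processed += page.length;
    cursor = page[page.length - 1].id; // persist this as the resume point
  }
}
```

Because the projection write is an idempotent upsert, re-processing a batch after a crash is harmless.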

Coverage check:

```sql
SELECT
  (SELECT COUNT(DISTINCT entity_id) FROM search_projections WHERE entity_type = 'publication') AS covered,
  (SELECT COUNT(*) FROM publications WHERE deleted_at IS NULL) AS total;
```

Phase 5 — FTS5 Upgrade Layer (Future, Phase 5+)


If product requirements include relevance ranking or stemming:

  1. Add a publications_fts external-content FTS5 virtual table pointing at search_projections.
  2. No changes to the projection write path — the FTS5 table uses content='search_projections'.
  3. Add MATCH query path gated behind a feature flag.
  4. A/B test relevance vs. LIKE results before full rollout.

This upgrade is entirely non-breaking for E2 consumers.


| Question | Owner | Resolution target |
| --- | --- | --- |
| Should biography be projected for people search, or only name? | Product | E2 kickoff |
| What is the max acceptable write latency for the publication PATCH endpoint? | API owners | E2 Sprint 1 |
| Should the backfill run as a Cloudflare Cron or a one-shot `wrangler d1 execute`? | Ops | E2 Sprint 2 |
| For Phase 5 (100 languages), what is the plan for CJK pre-tokenization? | Data owners | Phase 5 planning |