
ADR E1: Localized Search Strategy

Architecture Decision Record for the localized search strategy in the sharded D1 architecture — projection-table vs. FTS5.

| Field | Value |
| --- | --- |
| Status | Accepted |
| Deciders | API + data owners |
| Epic | E — Multi-Language Search And Fallback (Phase 4) |
| Depends on | A3 (load scenarios with latency baseline) |
| Supersedes | — |

The platform stores localizable text fields in JSON columns in Cloudflare D1 (SQLite):

  • publications.title TEXT — { "en": "…", "pt": "…" }
  • people.name TEXT — { "en": "…", "pt": "…" }
  • people.biography TEXT — same
  • person_categories.name / description TEXT — same

The data lives across 5 D1 shards (DB_0–DB_4) partitioned by a djb2 hash of the entity identifier. Any list or search operation fans out to all five shards in parallel and merges results in the Worker.
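The routing described above can be sketched as below. The document does not show the exact djb2 variant or helper names, so `shardFor` and the unsigned-32-bit folding are illustrative assumptions:

```ts
// Hypothetical sketch of the shard routing described above:
// djb2 over the entity identifier, modulo the 5 shards (DB_0 … DB_4).
const SHARD_COUNT = 5;

function djb2(s: string): number {
  let h = 5381;
  for (let i = 0; i < s.length; i++) {
    // hash * 33 + charCode, kept as an unsigned 32-bit value
    h = (Math.imul(h, 33) + s.charCodeAt(i)) >>> 0;
  }
  return h;
}

function shardFor(entityId: string): number {
  return djb2(entityId) % SHARD_COUNT;
}
```

Because the projection rows (below) share the entity identifier with their source row, they hash to the same shard, which keeps the search JOIN shard-local.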

Current search implementation is English-only and hardcoded:

```sql
-- publications/handlers.ts
AND LOWER(json_extract(p.title, '$.en')) LIKE ?

-- people/handlers.ts
json_extract(r.name, '$.en') LIKE ?
```

This breaks when the UI is in Portuguese (or any other locale): searching for “Investigador” returns nothing because the search only matches the English value.

The goal of E2–E3 is to make search locale-aware: a query in locale pt should match the pt value of a field (with fallback to en if the pt value is absent).
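The fallback rule is small enough to state as code. A minimal sketch, assuming the localized JSON values are plain `{ locale: text }` objects (names are illustrative):

```ts
// A localized field value as stored in the JSON columns, e.g.
// { "en": "Researcher", "pt": "Investigador" }.
type LocalizedString = Record<string, string>;

// Resolve a value for the requested locale, falling back to "en"
// when the requested locale is absent.
function resolveLocalized(
  value: LocalizedString,
  locale: string,
  fallback = "en",
): string | undefined {
  return value[locale] ?? value[fallback];
}
```

A `pt` query against `{ "en": "Researcher" }` should therefore still match on the English value — this is the behavior the projection table encodes at write time.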


  1. Correctness — search must return records that match in the requested locale.
  2. Performance — no regression vs. the current single-LIKE path; target P95 < 200 ms on fan-out of 5 shards.
  3. Maintainability — new locales should require only data backfill, not schema changes.
  4. D1 constraints — D1 is SQLite-on-Workers: no custom tokenizers, no external search engine, Worker CPU budget ≤ 10 ms before streaming is forced.
  5. Operational simplicity — the backfill must be interruptible and resumable (large datasets).
  6. Future readiness — eventually support 100+ locales including CJK languages that require different tokenization.

Option A — Projection Table

Introduce a search_projections table (one row per entity × locale × field):

```sql
CREATE TABLE search_projections (
  entity_type TEXT NOT NULL,    -- 'publication' | 'person'
  entity_id   TEXT NOT NULL,    -- DOI or ORCID, same shard as source
  locale      TEXT NOT NULL,    -- 'en', 'pt', …
  field       TEXT NOT NULL,    -- 'title', 'name', 'biography', …
  content     TEXT NOT NULL,    -- plain-text value for this locale/field
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX idx_sp_locale_field_content
  ON search_projections (entity_type, locale, field, content);
```

Search query (publications, locale=pt, fallback to en):

```sql
SELECT p.*
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
  AND sp.entity_id = p.id
  AND sp.locale = ?            -- 'pt'
  AND sp.field = 'title'
  AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
ORDER BY p.publication_year DESC
LIMIT ? OFFSET ?;
```

Fallback (if no pt row) is handled by a second pass against en, or by always inserting a fallback row at write time (preferred — avoids the second query).
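The preferred write-time fallback can be sketched as follows. The registered-locale list and helper name are assumptions for illustration:

```ts
// Illustrative sketch of the "always insert a fallback row" strategy:
// for every registered locale, store that locale's value if present,
// otherwise store the English value, so reads never need a second pass.
const REGISTERED_LOCALES = ["en", "pt"]; // assumption: known at write time

function projectionRows(
  field: string,
  value: Record<string, string>,
): { locale: string; field: string; content: string }[] {
  return REGISTERED_LOCALES.flatMap((locale) => {
    const content = value[locale] ?? value["en"]; // fall back to en here
    return content ? [{ locale, field, content }] : [];
  });
}
```

The trade-off is extra rows (one per registered locale even when the value is just a copy of `en`) in exchange for a single-pass read path.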

Write path: on every insert/update of a localizable field, upsert the projection rows for all locales present in the JSON value. Implemented in a shared upsertProjections(db, entityType, entityId, fields) helper called from the ingestion worker and the PATCH handler.

Option B — FTS5

Create an FTS5 virtual table per shard that indexes localized text, with the locale stored as an unindexed column alongside the content:

```sql
CREATE VIRTUAL TABLE publications_fts USING fts5(
  id UNINDEXED,
  locale UNINDEXED,
  title,
  content='publications',   -- external-content mode
  tokenize='unicode61'
);
```

Search:

```sql
SELECT p.*
FROM publications p
JOIN publications_fts fts
  ON fts.id = p.id
  AND fts.locale = ?
WHERE fts MATCH ?
ORDER BY fts.rank;
```

FTS5 with the unicode61 tokenizer handles Latin-script languages (en, pt, es, fr) via Unicode case-folding and diacritic removal; it does not stem (the porter tokenizer adds stemming, but for English only). For CJK languages it would require the trigram tokenizer or pre-tokenized input.
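Both options converge on write-time pre-tokenization for CJK. A minimal illustration of what that step could look like (character bigrams — purely illustrative, not a production tokenizer):

```ts
// Split CJK text into overlapping character bigrams, the kind of
// write-time token stream either search option could index.
function cjkBigrams(text: string): string[] {
  const chars = Array.from(text); // code-point safe split
  if (chars.length < 2) return chars;
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}
```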


| Criterion | A — Projection Table | B — FTS5 |
| --- | --- | --- |
| Implementation effort | Low — standard SQL DML | Medium — FTS5 virtual-table wiring, external-content setup |
| Query complexity | Low — indexed JOIN with LIKE | Low — single MATCH clause |
| D1 support | Full — plain table and index | Supported, but no custom tokenizers; trigram not available in D1 |
| New-locale cost | Data-only backfill, no DDL | Requires `INSERT INTO fts(fts) VALUES ('rebuild')` on each shard |
| CJK / 100-language path | Pre-tokenize at write time, store tokens | FTS5 trigram unavailable; would need pre-tokenization anyway |
| Write amplification | ×N rows per entity (N = locale count) | Similar — FTS5 shadow tables carry comparable write cost |
| Storage overhead | ×N plain-text rows | FTS5 index is typically 1.5–2× content size per locale |
| Match semantics | Exact prefix/infix match | BM25 ranking + stemming (better relevance, harder to control) |
| Backfill | Resumable with cursor on (entity_id, locale) | Resumable with batch `INSERT INTO fts SELECT …` |
| Ops on schema change | ALTER TABLE or new column | DROP + recreate virtual table; must re-run full rebuild |
| Observability | Row count per locale is directly queryable | Shadow-table internals are opaque |
| Risk | Low — no new SQLite features | Medium — FTS5 behavior on D1 needs validation; no community FTS5 examples at scale |

Both options add one index-assisted operation per shard per query. Given current dataset size (< 10 k publications, < 500 people per shard):

| Metric | Current (English LIKE) | Option A (Projection JOIN) | Option B (FTS5 MATCH) |
| --- | --- | --- | --- |
| P50 single-shard (est.) | 2–5 ms | 3–8 ms | 3–6 ms |
| P95 single-shard (est.) | 8–15 ms | 10–25 ms | 8–20 ms |
| Fan-out overhead (5 shards) | ~15 ms | ~20 ms | ~18 ms |
| At 100 k rows / shard | Unknown | 15–40 ms (with index) | 10–30 ms (BM25 index) |

FTS5 has a slight edge on relevance ranking and scan speed at large scale. Projection tables have a slight edge on operational predictability and locale controllability.


Use Option A (Projection Table) as the primary localized search mechanism for E2.

  1. Lower risk on D1. SQLite FTS5 is supported on D1 but there are no documented production examples of FTS5 at scale on D1 with multi-language content. Option A uses only indexed JOIN + LIKE — a well-understood D1 workload.

  2. Simpler ops. Adding a new locale requires only a backfill script. No DDL changes, no FTS rebuild, no shadow-table management across the 5 shard databases.

  3. CJK parity. For Phase 5 (100-language support), both options require pre-tokenization for CJK. Option A stores pre-tokenized tokens in the content column — the same tokenization work feeds any future FTS5 upgrade. There is no unique FTS5 advantage here.

  4. Deterministic fallback. The projection table makes locale fallback explicit: the write path decides which locale values to store; the read path is a single JOIN. FTS5 MATCH semantics are harder to tune for “search in pt, fall back to en” without query manipulation.

  5. FTS5 as a future optimization layer. If relevance ranking becomes a product requirement (beyond prefix/infix match), an FTS5 EXTERNAL_CONTENT table can be layered on top of the projection table with no changes to the projection write path. This preserves optionality without betting on FTS5 from the start.


  • Locale-aware search works for all registered locales from day one of E2.
  • Adding a new locale is a data operation (backfill script) with no code change required in the query layer.
  • Backfill is fully resumable: the script pages through entity_id with a cursor and can be restarted safely.
  • Projection rows are directly auditable (SELECT COUNT(*) FROM search_projections WHERE locale='pt').
  • Write amplification: every publication or person save triggers ×N upserts into search_projections (N = number of locales). At 2 locales this is negligible; at 100 locales it adds ~100 ms to write latency unless batched.
  • Storage cost: currently < 50 MB total across all shards. At 2 locales and 10 k publications with 200-character average title this adds ~4 MB per shard — acceptable.
  • LIKE search does not rank results by relevance. This is acceptable for the current use case (researchers filtering their own institution’s publications). If relevance ranking is required, add an FTS5 layer in Phase 5.
  • Query performance is equivalent to or marginally slower than the current English-only LIKE. The additional JOIN is index-assisted and the performance delta is within the A3 baseline margins.

New migration: 022_search_projections.sql

```sql
CREATE TABLE IF NOT EXISTS search_projections (
  entity_type TEXT NOT NULL CHECK (entity_type IN ('publication', 'person')),
  entity_id   TEXT NOT NULL,
  locale      TEXT NOT NULL,
  field       TEXT NOT NULL,
  content     TEXT NOT NULL DEFAULT '',
  updated_at  INTEGER NOT NULL,
  PRIMARY KEY (entity_type, entity_id, locale, field)
);

CREATE INDEX IF NOT EXISTS idx_sp_lookup
  ON search_projections (entity_type, locale, field, LOWER(content));
```

Applied to all 5 D1 shards. Table starts empty — search falls back to the current English LIKE path until backfill completes.

Dual-read guard in handlers: while search_projections is empty (or below a configurable coverage threshold), fall back to json_extract(…, '$.en') LIKE ?. This ensures zero search regression during rollout.
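The guard reduces to a pure function over the coverage numbers; the threshold value and names below are assumptions:

```ts
// Hypothetical dual-read guard: use the projection-backed query only once
// backfill coverage crosses a configurable threshold; otherwise keep the
// legacy English json_extract LIKE path.
const COVERAGE_THRESHOLD = 0.95; // illustrative default

function chooseSearchPath(
  coveredEntities: number,
  totalEntities: number,
): "projection" | "legacy-en-like" {
  if (totalEntities === 0) return "legacy-en-like";
  return coveredEntities / totalEntities >= COVERAGE_THRESHOLD
    ? "projection"
    : "legacy-en-like";
}
```

The inputs map directly onto the `covered` / `total` columns of the coverage-check query in the backfill section.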

Implement upsertSearchProjections(db, entityType, entityId, fields) in packages/utils:

packages/utils/src/searchProjections.ts

```ts
// DEFAULT_LOCALE and LocalizedString are assumed to come from the
// package's existing i18n module; D1Database from @cloudflare/workers-types.
export async function upsertSearchProjections(
  db: D1Database,
  entityType: "publication" | "person",
  entityId: string,
  fields: Record<string, LocalizedString>, // field name → JSON value
): Promise<void> {
  const now = Date.now();
  const rows: { locale: string; field: string; content: string }[] = [];
  for (const [field, value] of Object.entries(fields)) {
    if (typeof value === "string") {
      rows.push({ locale: DEFAULT_LOCALE, field, content: value });
    } else if (value && typeof value === "object") {
      for (const [locale, text] of Object.entries(value)) {
        if (text) rows.push({ locale, field, content: String(text) });
      }
    }
  }
  if (rows.length === 0) return;

  // Batch upsert — D1 supports parameterized batch statements
  const stmts = rows.map(({ locale, field, content }) =>
    db
      .prepare(
        `INSERT INTO search_projections (entity_type, entity_id, locale, field, content, updated_at)
         VALUES (?, ?, ?, ?, ?, ?)
         ON CONFLICT (entity_type, entity_id, locale, field)
         DO UPDATE SET content = excluded.content, updated_at = excluded.updated_at`,
      )
      .bind(entityType, entityId, locale, field, content, now),
  );
  await db.batch(stmts);
}
```

Wire into:

  • @platform/ingestion: after each publication upsert and each person upsert
  • dashboard-api PATCH /publications/:doi: after applying the edit patch
  • dashboard-api PATCH /people/:orcid: after saving name/biography edits

Replace the hardcoded json_extract(…, '$.en') LIKE ? in both handlers with a locale-aware query:

```sql
SELECT p.*, sp.content AS matched_content
FROM publications p
JOIN search_projections sp
  ON sp.entity_type = 'publication'
  AND sp.entity_id = p.id
  AND sp.locale IN (?, ?)      -- [requested_locale, DEFAULT_LOCALE]
  AND sp.field = 'title'
  AND LOWER(sp.content) LIKE ?
WHERE p.deleted_at IS NULL
  AND p.visible = 1
GROUP BY p.id                  -- deduplicate when both locales match
ORDER BY p.publication_year DESC
LIMIT ? OFFSET ?;
```

The IN (?, ?) with GROUP BY p.id deduplication gives correct fallback behavior: a record matches if either its pt or en title contains the query string.
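The LIKE parameter is typically `%query%` built from user input, so LIKE wildcards in the input should be escaped first. A sketch, assuming the SQL adds an `ESCAPE '\'` clause to match (helper name is illustrative):

```ts
// Build a case-insensitive infix LIKE pattern from raw user input,
// escaping the SQL LIKE wildcards % and _ (and the escape char itself).
// Assumption: the query uses `LIKE ? ESCAPE '\'`.
function toLikePattern(userQuery: string): string {
  const escaped = userQuery.replace(/[\\%_]/g, (ch) => "\\" + ch);
  return `%${escaped.toLowerCase()}%`;
}
```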

Add locale query parameter (default: en) to /api/publications and /api/people — handled in E3 schema work.

Backfill script: scripts/db/backfill-search-projections.sh

  • Pages through all publications and people in each shard using a cursor on (entity_type, entity_id).
  • Writes projection rows via the same upsertSearchProjections helper (consistency).
  • Tracks progress in a backfill_state KV key per shard (start cursor, last processed id, count).
  • Safe to re-run: ON CONFLICT DO UPDATE is idempotent.
  • Runs against production via wrangler d1 execute or as a scheduled Durable Object task.
  • Estimated runtime: < 5 minutes per shard at current dataset size.
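The cursor loop above can be sketched as follows; the paging callbacks are assumptions (the real script pages via SQL and persists its cursor in the `backfill_state` KV key, which is elided here):

```ts
// Resumable backfill loop: page through entities in ascending id order,
// project each batch, and return the processed count. Restarting with the
// last persisted cursor skips already-processed entities.
async function backfillShard(
  listEntities: (afterId: string | null, limit: number) => Promise<{ id: string }[]>,
  processBatch: (ids: string[]) => Promise<void>, // e.g. upsert projections
  startCursor: string | null = null,
): Promise<number> {
  let cursor = startCursor;
  let processed = 0;
  for (;;) {
    const page = await listEntities(cursor, 100);
    if (page.length === 0) return processed;
    await processBatch(page.map((e) => e.id));
    processed += page.length;
    cursor = page[page.length - 1].id; // persist this as the resume point
  }
}
```

Because the projection write is an idempotent upsert, re-processing a batch after a crash is harmless.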

Coverage check:

```sql
SELECT
  (SELECT COUNT(DISTINCT entity_id) FROM search_projections WHERE entity_type = 'publication') AS covered,
  (SELECT COUNT(*) FROM publications WHERE deleted_at IS NULL) AS total;
```

Phase 5 — FTS5 Upgrade Layer (Future, Phase 5+)


If product requirements include relevance ranking or stemming:

  1. Add a publications_fts external-content FTS5 virtual table pointing at search_projections.
  2. No changes to the projection write path — the FTS5 table uses content='search_projections'.
  3. Add MATCH query path gated behind a feature flag.
  4. A/B test relevance vs. LIKE results before full rollout.

This upgrade is entirely non-breaking for E2 consumers.


| Question | Owner | Resolution target |
| --- | --- | --- |
| Should biography be projected for people search, or only name? | Product | E2 kickoff |
| What is the max acceptable write latency for the publication PATCH endpoint? | API owners | E2 Sprint 1 |
| Should the backfill run as a Cloudflare Cron or a one-shot `wrangler d1 execute`? | Ops | E2 Sprint 2 |
| For Phase 5 (100 languages), what is the plan for CJK pre-tokenization? | Data owners | Phase 5 planning |