ADR-0011 — Catalog port semantics across backends

Status: Accepted
Date: 2026-05-12
Related specs: 01-architecture.md, 03-pipelines.md § Crash resume, 06-deployment.md, ADR-0006, ADR-0007

Context

The catalog (trait Catalog) is implemented against at least three backends across topologies:

Local SQLite — Topology 1 (CLI) and Topology 2 (single-VM daemon).
Durable Object SQLite — Topology 3 (Cloudflare), per-project DO (ADR-0006).
RDS Postgres — Topology 4 (AWS) and Topology 2.5 (multi-node self-hosted cluster) (ADR-0007).

DynamoDB is a planned M2.5 adapter; the same semantics apply.

The catalog carries load-bearing state: pipeline cursors, per-chunk leases (backfill), schema-evolution history, snapshot manifests, OTel-audit lineage edges. Semantic divergence across backends produces backend-specific bugs. Concrete examples:

SELECT FOR UPDATE SKIP LOCKED (the backfill chunk-claim primitive) does not exist in SQLite. The naïve port (BEGIN IMMEDIATE) has different contention semantics: it takes a database-wide write lock, so concurrent claimants serialize (or fail with SQLITE_BUSY) instead of skipping past locked rows.
DO SQLite enforces single-writer per DO (i.e., per project) by construction. Local SQLite under hakiri serve has multiple in-process writers contending for SQLite's single database-level write lock. RDS Postgres allows N-way write concurrency with row-level locks.
Postgres supports point-in-time recovery; SQLite recovery is only as good as the last on-disk snapshot.
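
The contention difference in the first example can be modeled in a few lines. This is an illustrative sketch only, using in-memory mutexes rather than a real database: ChunkRow, Busy, claim_skip_locked, and claim_global_lock are hypothetical names, and try_lock stands in for a connection that does not wait on a busy database.

```rust
use std::sync::Mutex;

/// Hypothetical in-memory stand-in for one `pipeline_chunks` row.
#[derive(Debug, Default)]
pub struct ChunkRow {
    pub holder: Option<u32>,
}

/// Returned when the table-wide lock is held by another writer.
#[derive(Debug, PartialEq)]
pub struct Busy;

/// Models `SKIP LOCKED`: a row whose lock is held elsewhere is skipped,
/// not waited on, so the claimant moves on to the next free row.
pub fn claim_skip_locked(rows: &[Mutex<ChunkRow>], worker: u32) -> Option<usize> {
    for (idx, row) in rows.iter().enumerate() {
        // `try_lock` fails immediately if the row lock is unavailable.
        if let Ok(mut guard) = row.try_lock() {
            if guard.holder.is_none() {
                guard.holder = Some(worker);
                return Some(idx);
            }
        }
    }
    None
}

/// Models the `BEGIN IMMEDIATE` port: one lock guards the whole table, so a
/// second claimant gets `Busy` even when free rows exist.
pub fn claim_global_lock(
    table: &Mutex<Vec<ChunkRow>>,
    worker: u32,
) -> Result<Option<usize>, Busy> {
    // Any failure to take the table lock is reported as `Busy` in this model.
    let mut rows = table.try_lock().map_err(|_| Busy)?;
    for (idx, row) in rows.iter_mut().enumerate() {
        if row.holder.is_none() {
            row.holder = Some(worker);
            return Ok(Some(idx));
        }
    }
    Ok(None)
}
```

With row 0 locked by another writer, claim_skip_locked proceeds to row 1, while claim_global_lock reports Busy until the table lock is released. The contract has to hide exactly this difference from callers.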

The M2 success criterion (“the same hakiri.toml produces byte-identical Parquet on both clouds after a 24h soak”) requires the catalog to behave identically across backends — not just expose the same Rust API.

Decision

We pin the trait Catalog contract as a small set of semantic invariants every backend must provide. Each backend implements the contract using its native primitive; the contract is verified by a single conformance test suite run against all backends in CI.

The contract

For every backend:

(1) Linearizability per pipeline_id. All writes to rows keyed by a single pipeline_id are linearizable. Two writes against the same pipeline_id appear in some total order to any reader.
(2) Monotonic schema-history reads. Once a schema_history(pipeline_id, version, schema_json, applied_at) row is written, every subsequent read includes it: no reader observes a history that ends at an older version. The version sequence within a pipeline_id is dense and monotonic.
(3) At-most-once chunk dispatch. A pipeline_chunks(chunk_id, attempt) → holder_node row can be held by at most one worker. Two workers attempting to claim the same chunk produce exactly one success.
(4) Append-only lineage edges. lineage(run_id, record_id, source_run_id, ...) rows are append-only. The catalog refuses UPDATE or DELETE on this table and permits only INSERT and SELECT.
(5) Atomic snapshot commit. A snapshot row (snapshots(table, snapshot_ts, includes_runs, indexes, ...)) becomes visible to readers only after all referenced sidecar manifests are durable. The contract requires a commit_snapshot() API that performs this atomically.
(6) Capability-token revocation epoch reads. revocation_epochs(project, tenant, principal_class) → epoch is read on every token verification. Reads must reflect writes within a bounded staleness window (default 60s; configurable). A stale read is tolerated because its worst case is bounded: a revoked token keeps access for at most the staleness window. The catalog never returns an epoch newer than the latest committed write, and never returns rows that were never written.
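
As a rough illustration of how two of these invariants (monotonic schema history and atomic snapshot commit) might surface in the Rust API, the sketch below defines a hypothetical slice of trait Catalog and an in-memory model of it. Only commit_snapshot is named in this ADR; every other method, type, and error variant here is an assumption made for the example, not the real trait.

```rust
use std::collections::{BTreeMap, HashSet};

#[derive(Debug, PartialEq)]
pub enum CatalogError {
    /// The proposed schema version is not exactly `latest + 1`.
    VersionGap { expected: u64, got: u64 },
    /// A referenced sidecar manifest is not yet durable.
    ManifestNotDurable(String),
}

/// Hypothetical slice of `trait Catalog`; signatures are illustrative.
pub trait Catalog {
    fn append_schema(&mut self, pipeline_id: &str, version: u64, schema_json: &str)
        -> Result<(), CatalogError>;
    fn latest_schema_version(&self, pipeline_id: &str) -> Option<u64>;
    fn commit_snapshot(&mut self, table: &str, manifests: &[&str])
        -> Result<(), CatalogError>;
    fn visible_snapshots(&self, table: &str) -> usize;
}

/// In-memory model used only to make the invariants concrete.
#[derive(Default)]
pub struct MemCatalog {
    /// Per pipeline: index `i` holds the schema for version `i + 1`.
    schema_history: BTreeMap<String, Vec<String>>,
    durable_manifests: HashSet<String>,
    /// Per table: each committed snapshot lists its manifests.
    snapshots: BTreeMap<String, Vec<Vec<String>>>,
}

impl MemCatalog {
    /// Test hook: record that a sidecar manifest has been durably written.
    pub fn mark_manifest_durable(&mut self, name: &str) {
        self.durable_manifests.insert(name.to_string());
    }
}

impl Catalog for MemCatalog {
    fn append_schema(&mut self, pipeline_id: &str, version: u64, schema_json: &str)
        -> Result<(), CatalogError> {
        let history = self.schema_history.entry(pipeline_id.to_string()).or_default();
        // Dense and monotonic: the only acceptable version is the next one.
        let expected = history.len() as u64 + 1;
        if version != expected {
            return Err(CatalogError::VersionGap { expected, got: version });
        }
        history.push(schema_json.to_string());
        Ok(())
    }

    fn latest_schema_version(&self, pipeline_id: &str) -> Option<u64> {
        self.schema_history
            .get(pipeline_id)
            .map(|h| h.len() as u64)
            .filter(|&n| n > 0)
    }

    fn commit_snapshot(&mut self, table: &str, manifests: &[&str])
        -> Result<(), CatalogError> {
        // Validate everything before mutating, so a failed commit leaves
        // no partially visible snapshot row.
        for m in manifests {
            if !self.durable_manifests.contains(*m) {
                return Err(CatalogError::ManifestNotDurable(m.to_string()));
            }
        }
        let row: Vec<String> = manifests.iter().map(|m| m.to_string()).collect();
        self.snapshots.entry(table.to_string()).or_default().push(row);
        Ok(())
    }

    fn visible_snapshots(&self, table: &str) -> usize {
        self.snapshots.get(table).map_or(0, |s| s.len())
    }
}
```

In this model a version skip is rejected rather than silently stored, and a snapshot whose manifests are not all durable never becomes visible. A real backend would enforce the same outcomes with its native primitive instead of in-process checks.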

Per-backend primitive mapping

Invariant	Local SQLite	DO SQLite	RDS Postgres	DynamoDB (M2.5)
(1) Linearizability per pipeline_id	`BEGIN IMMEDIATE` + per-pipeline row lock	DO actor scope (single writer per project DO)	`SELECT FOR UPDATE` on the pipeline row	Conditional update with `IF version = X`
(2) Monotonic schema-history	`INSERT` with WAL fsync; version is `PRIMARY KEY AUTOINCREMENT`	DO transactional write	Postgres `INSERT` with `version` UNIQUE constraint	Single-item conditional `INSERT` keyed by `(pipeline_id, version)`
(3) At-most-once chunk dispatch	`BEGIN IMMEDIATE` + `UPDATE WHERE status='pending'` + retry	DO actor scope	`SELECT FOR UPDATE SKIP LOCKED LIMIT 1`	Conditional update with `status = 'pending'`
(4) Append-only lineage	View-level constraint (`CREATE TRIGGER`)	View-level constraint	View-level constraint + revoked `UPDATE`/`DELETE` grants	Streams + immutable item attribute
(5) Atomic snapshot commit	Single SQLite transaction touching `snapshots` + each `indexes` row	DO transactional write across keys	Single Postgres transaction	`TransactWriteItems` across catalog tables
(6) Revocation epoch reads	Direct read (no replication, no staleness)	Direct read	Read with optional read-replica routing; staleness ≤ replica lag	Eventually-consistent read with bounded staleness
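
The "+ retry" in the Local SQLite column for chunk dispatch could take a shape like the following. This is a minimal sketch under assumptions: ClaimError and retry_on_busy are hypothetical names, the closure stands in for one BEGIN IMMEDIATE transaction attempt, and the linear backoff policy is chosen for brevity rather than taken from the adapter.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type: `Busy` models a database-busy result from a
/// contended write transaction; `Other` is any non-retryable failure.
#[derive(Debug, PartialEq)]
pub enum ClaimError {
    Busy,
    Other(String),
}

/// Runs `op` until it returns something other than `Busy`, or until
/// `max_attempts` tries have been made. `op` always runs at least once.
pub fn retry_on_busy<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, ClaimError>,
) -> Result<T, ClaimError> {
    let mut attempt = 0;
    loop {
        match op() {
            // Retry only on contention, and only while attempts remain.
            Err(ClaimError::Busy) if attempt + 1 < max_attempts => {
                // Linear backoff: base_delay, 2 * base_delay, ...
                thread::sleep(base_delay * (attempt + 1));
                attempt += 1;
            }
            // Success, a non-retryable error, or the final `Busy`.
            other => return other,
        }
    }
}
```

Bounding the retries keeps a stuck writer from hiding behind an endless loop: the caller still sees Busy once the budget is spent and can report it.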

Conformance test suite

A single test suite under crates/hakiri-context/tests/catalog_conformance/ exercises:

Concurrent chunk-claim under load (100 workers competing for 1000 chunks; expect every chunk to be claimed by exactly one worker).
Schema-history insert + read under contention (no torn reads, no version skips).
Lease acquisition with crash + recovery (simulated holder death, verify takeover after TTL).
Snapshot commit + read (commit not visible until all referenced sidecars are durable; visible immediately after).
Revocation epoch propagation (bump epoch, verify rejection within staleness window across N readers).

The suite runs against every backend in CI. A backend that fails any conformance test cannot be released. The local-SQLite backend is the reference implementation; backend-specific divergence (e.g., DynamoDB’s lack of strict linearizability across items in a transaction) requires either (a) the backend implementing a compensating pattern or (b) an explicit, documented exception to the contract scoped to that backend.
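
A backend-agnostic check for the concurrent chunk-claim case might look like the sketch below. It is not the suite's actual code: ChunkClaims, MemChunks, and claim_counts are hypothetical names, and the in-memory backend uses an atomic compare-and-swap where a real adapter would use its native claim primitive.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical narrow view of the catalog used only by this check.
pub trait ChunkClaims: Sync {
    fn chunk_count(&self) -> usize;
    /// Returns true if `worker` became the holder of `chunk`.
    fn try_claim(&self, chunk: usize, worker: usize) -> bool;
}

/// In-memory stand-in backend: 0 means pending, otherwise `worker + 1`.
pub struct MemChunks {
    holders: Vec<AtomicUsize>,
}

impl MemChunks {
    pub fn new(chunks: usize) -> Self {
        Self { holders: (0..chunks).map(|_| AtomicUsize::new(0)).collect() }
    }
}

impl ChunkClaims for MemChunks {
    fn chunk_count(&self) -> usize {
        self.holders.len()
    }

    fn try_claim(&self, chunk: usize, worker: usize) -> bool {
        // Compare-and-swap plays the role of a conditional
        // "claim only if still pending" update.
        self.holders[chunk]
            .compare_exchange(0, worker + 1, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}

/// Spawns `workers` threads that each try to claim every chunk, and returns
/// how many successful claims each chunk received.
pub fn claim_counts<C: ChunkClaims>(catalog: &C, workers: usize) -> Vec<usize> {
    let n = catalog.chunk_count();
    let wins: Vec<Vec<usize>> = thread::scope(|s| {
        // Collect the handles first so all workers run concurrently.
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                s.spawn(move || {
                    (0..n).filter(|&c| catalog.try_claim(c, w)).collect::<Vec<usize>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker panicked"))
            .collect()
    });
    let mut counts = vec![0; n];
    for chunk in wins.into_iter().flatten() {
        counts[chunk] += 1;
    }
    counts
}
```

A conformance test would then assert that claim_counts(&backend, 100) contains only ones for a 1000-chunk catalog. Writing the check against a trait rather than a concrete type is what lets the same test body run against every backend.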

Consequences

Positive

One contract, many backends — operators can swap topologies without code changes (only config).
The M2 byte-identical-Parquet soak test has a fighting chance because the catalog cannot silently produce different results on different backends.
New backends (DynamoDB, eventually FoundationDB or others) ship by passing the conformance suite; no surprise behavior.

Negative

DynamoDB’s consistency model requires more catalog code than the Postgres adapter (single-item conditional writes plus TransactWriteItems). The conformance suite is the discipline that exposes the gaps.
The conformance suite itself is non-trivial — easily 2k lines of Rust integration tests. Worth it.
Some Postgres-native conveniences (e.g., LISTEN/NOTIFY for lease expiration push) are not in the contract because SQLite/DO can’t provide them. Backends may use them internally as performance optimizations but not as semantic primitives.

Neutral

The contract is intentionally small. Features that don’t fit (full-text search, complex joins) live in DuckDB over Parquet, not in the catalog.

Alternatives considered

One backend (Postgres everywhere). Cleanest semantically, but requires running Postgres for the local CLI and for the Cloudflare topology — directly conflicts with Pillar 1 and the no-orchestrator promise.

Loose contract (“the implementations are similar enough”). What we started with. Rejected because, without a pinned contract, backend-specific bugs would surface only in the M2 soak test, a week before launch. The conformance suite is the only honest path.

Use a single embedded Rust database (redb, sled) and ignore Postgres. Doesn’t address DO SQLite or the AWS multi-node case where shared catalog state across Fargate tasks is required.

References

01-architecture.md § trait Catalog
03-pipelines.md § Crash resume
ADR-0006, ADR-0007
Jepsen analyses — the testing-shape reference for conformance suites of this kind