ADR-0011 — Catalog port semantics across backends

The catalog (trait Catalog) is implemented against at least three backends across topologies:

  • Local SQLite — Topology 1 (CLI) and Topology 2 (single-VM daemon).
  • Durable Object SQLite — Topology 3 (Cloudflare), per-project DO (ADR-0006).
  • RDS Postgres — Topology 4 (AWS) and Topology 2.5 (multi-node self-hosted cluster) (ADR-0007).

DynamoDB is a planned M2.5 adapter; the same semantics apply.

The catalog carries load-bearing state: pipeline cursors, per-chunk leases (backfill), schema-evolution history, snapshot manifests, OTel-audit lineage edges. Semantic divergence across backends produces backend-specific bugs. Concrete examples:

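  1. SELECT ... FOR UPDATE SKIP LOCKED (the backfill chunk-claim primitive) does not exist in SQLite. The naïve port (BEGIN IMMEDIATE) has different contention semantics: it takes a database-wide write lock, so competing claimers serialize behind one another instead of skipping rows that are already locked.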
  2. DO SQLite enforces single-writer per DO (i.e., per project) by construction. Local SQLite under hakiri serve allows multiple in-process writers. RDS Postgres allows N-way write concurrency with row-level locks.
  3. Postgres supports point-in-time recovery; SQLite point-in-time is only as good as the last on-disk snapshot.

The M2 success criterion (“the same hakiri.toml produces byte-identical Parquet on both clouds after a 24h soak”) requires the catalog to behave identically across backends — not just expose the same Rust API.

We pin the trait Catalog contract as a small set of semantic invariants every backend must provide. Each backend implements the contract using its native primitive; the contract is verified by a single conformance test suite run against all backends in CI.
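
As a rough illustration of "one contract, one suite", the sketch below writes a single check against a trait and runs it on an in-memory stand-in. Everything here is hypothetical: the method names (append_schema, latest_schema_version), the CatalogError type, and MemCatalog are invented for this example and are not the real trait Catalog surface.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CatalogError {
    /// The proposed version would leave a gap or repeat an existing version.
    VersionConflict { expected: u64, got: u64 },
}

/// A deliberately tiny, hypothetical slice of the contract.
pub trait Catalog: Send + Sync {
    /// Appends a schema version; must reject anything other than `latest + 1`.
    fn append_schema(&self, pipeline_id: &str, version: u64, schema_json: &str)
        -> Result<(), CatalogError>;
    /// Returns the latest schema version for a pipeline, if any was written.
    fn latest_schema_version(&self, pipeline_id: &str) -> Option<u64>;
}

/// In-memory stand-in backend, used only to exercise the check below.
#[derive(Default)]
pub struct MemCatalog {
    history: Mutex<HashMap<String, Vec<String>>>,
}

impl Catalog for MemCatalog {
    fn append_schema(&self, pipeline_id: &str, version: u64, schema_json: &str)
        -> Result<(), CatalogError> {
        // The mutex plays the role of the backend's native serialization primitive.
        let mut history = self.history.lock().unwrap();
        let rows = history.entry(pipeline_id.to_string()).or_default();
        let expected = rows.len() as u64 + 1;
        if version != expected {
            return Err(CatalogError::VersionConflict { expected, got: version });
        }
        rows.push(schema_json.to_string());
        Ok(())
    }

    fn latest_schema_version(&self, pipeline_id: &str) -> Option<u64> {
        let history = self.history.lock().unwrap();
        history
            .get(pipeline_id)
            .map(|rows| rows.len() as u64)
            .filter(|&n| n > 0)
    }
}

/// One conformance check, written once against the trait rather than a backend.
pub fn check_dense_schema_history<C: Catalog>(catalog: &C) -> bool {
    let first = catalog.append_schema("p1", 1, "{}").is_ok();
    let gap_rejected = catalog.append_schema("p1", 3, "{}").is_err();
    let second = catalog.append_schema("p1", 2, "{}").is_ok();
    let dup_rejected = catalog.append_schema("p1", 2, "{}").is_err();
    first && gap_rejected && second && dup_rejected
        && catalog.latest_schema_version("p1") == Some(2)
}
```

A real backend would implement the same trait over its native primitive, and the same generic check would run unchanged against it.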

For every backend:

  1. Linearizability per pipeline_id. All writes to rows keyed by a single pipeline_id are linearizable: any two writes against the same pipeline_id appear in the same total order to every reader.
  2. Monotonic schema-history reads. Once a schema_history(pipeline_id, version, schema_json, applied_at) row is written, no reader subsequently sees an older version of that row, and the version sequence is dense and monotonic.
  3. At-most-once chunk dispatch. A pipeline_chunks(chunk_id, attempt) → holder_node row can be held by at most one worker. When two workers race to claim the same chunk, exactly one succeeds and the other observes a failed claim.
  4. Append-only lineage edges. lineage(run_id, record_id, source_run_id, ...) rows are append-only. The catalog refuses UPDATE or DELETE on this table — only INSERT and SELECT.
  5. Atomic snapshot commit. A snapshot row (snapshots(table, snapshot_ts, includes_runs, indexes, ...)) becomes visible to readers only after all referenced sidecar manifests are durable. The contract requires a commit_snapshot() API that performs this atomically.
  6. Capability-token revocation epoch reads. revocation_epochs(project, tenant, principal_class) → epoch is read on every token verification. Reads must reflect writes within a bounded staleness window (default 60s; configurable). A stale read is tolerated only in that it can extend a revoked token's access until the window elapses; the catalog never returns an epoch newer than the latest written one, and never returns rows that were never written.
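
Invariant 5 can be pictured with a small in-memory model. This is a sketch under stated assumptions, not the actual commit_snapshot() signature: SnapshotStore, mark_manifest_durable, latest_visible, and CommitError are hypothetical names, and a mutex stands in for the backend's transaction.

```rust
use std::collections::{BTreeMap, HashSet};
use std::sync::Mutex;

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum CommitError {
    /// A referenced sidecar manifest has not been made durable yet.
    MissingManifest(String),
}

#[derive(Default)]
struct State {
    durable_manifests: HashSet<String>,
    /// snapshot_ts -> manifests referenced by the visible snapshot row.
    snapshots: BTreeMap<u64, Vec<String>>,
}

/// In-memory model; the mutex stands in for a single backend transaction.
#[derive(Default)]
pub struct SnapshotStore {
    state: Mutex<State>,
}

impl SnapshotStore {
    /// Records that a sidecar manifest has been durably written.
    pub fn mark_manifest_durable(&self, manifest: &str) {
        self.state
            .lock()
            .unwrap()
            .durable_manifests
            .insert(manifest.to_string());
    }

    /// Publishes the snapshot row only if every referenced manifest is durable.
    /// The check and the insert happen under one lock, so readers never see
    /// a snapshot whose manifests are missing.
    pub fn commit_snapshot(&self, snapshot_ts: u64, manifests: &[&str])
        -> Result<(), CommitError> {
        let mut state = self.state.lock().unwrap();
        for m in manifests {
            if !state.durable_manifests.contains(*m) {
                return Err(CommitError::MissingManifest(m.to_string()));
            }
        }
        let referenced = manifests.iter().map(|m| m.to_string()).collect();
        state.snapshots.insert(snapshot_ts, referenced);
        Ok(())
    }

    /// Returns the newest snapshot timestamp visible to readers, if any.
    pub fn latest_visible(&self) -> Option<u64> {
        self.state.lock().unwrap().snapshots.keys().next_back().copied()
    }
}
```

In this model a commit that references a non-durable manifest fails and leaves nothing visible; retrying after the manifest is durable makes the snapshot visible immediately.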

| Invariant | Local SQLite | DO SQLite | RDS Postgres | DynamoDB (M2.5) |
| --- | --- | --- | --- | --- |
| (1) Linearizability per pipeline_id | BEGIN IMMEDIATE + per-pipeline row lock | DO actor scope (single writer per project DO) | SELECT FOR UPDATE on the pipeline row | Conditional update with IF version = X |
| (2) Monotonic schema-history | INSERT with WAL fsync; version is PRIMARY KEY AUTOINCREMENT | DO transactional write | Postgres INSERT with version UNIQUE constraint | Single-item conditional INSERT keyed by (pipeline_id, version) |
| (3) At-most-once chunk dispatch | BEGIN IMMEDIATE + UPDATE WHERE status='pending' + retry | DO actor scope | SELECT FOR UPDATE SKIP LOCKED LIMIT 1 | Conditional update with status = 'pending' |
| (4) Append-only lineage | View-level constraint (CREATE TRIGGER) | View-level constraint | View-level constraint + revoked UPDATE/DELETE grants | Streams + immutable item attribute |
| (5) Atomic snapshot commit | Single SQLite transaction touching snapshots + each indexes row | DO transactional write across keys | Single Postgres transaction | TransactWriteItems across catalog tables |
| (6) Revocation epoch reads | Direct read (no replication, no staleness) | Direct read | Read with optional read-replica routing; staleness ≤ replica lag | Eventually-consistent read with bounded staleness |
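
The conditional-update pattern used for invariant 3 on SQLite and DynamoDB amounts to a compare-and-set on the chunk's status. The sketch below models it in memory; ChunkTable, ChunkStatus, and try_claim are hypothetical names, the SQL in the comments is illustrative rather than the real schema, and the mutex stands in for BEGIN IMMEDIATE.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
pub enum ChunkStatus {
    Pending,
    Claimed { holder_node: String },
}

/// In-memory stand-in for the pipeline_chunks table.
pub struct ChunkTable {
    rows: Mutex<HashMap<u64, ChunkStatus>>,
}

impl ChunkTable {
    /// Builds a table in which every given chunk starts out pending.
    pub fn with_pending(chunk_ids: &[u64]) -> Self {
        let rows = chunk_ids
            .iter()
            .map(|&id| (id, ChunkStatus::Pending))
            .collect();
        Self { rows: Mutex::new(rows) }
    }

    /// Models, roughly:
    ///   UPDATE pipeline_chunks SET status = 'claimed', holder_node = ?
    ///   WHERE chunk_id = ? AND status = 'pending'
    /// and returns the number of rows affected (0 or 1).
    pub fn try_claim(&self, chunk_id: u64, node: &str) -> usize {
        // Taking the lock plays the role of BEGIN IMMEDIATE: one writer at a time.
        let mut rows = self.rows.lock().unwrap();
        if let Some(status) = rows.get_mut(&chunk_id) {
            if *status == ChunkStatus::Pending {
                *status = ChunkStatus::Claimed { holder_node: node.to_string() };
                return 1;
            }
        }
        0 // unknown chunk, or the predicate status = 'pending' no longer holds
    }

    /// Returns the current holder of a chunk, if it has been claimed.
    pub fn holder(&self, chunk_id: u64) -> Option<String> {
        let rows = self.rows.lock().unwrap();
        match rows.get(&chunk_id) {
            Some(ChunkStatus::Claimed { holder_node }) => Some(holder_node.clone()),
            _ => None,
        }
    }
}
```

A worker that sees 0 rows affected treats the claim as lost and moves on to another chunk, which is the "retry" part of the SQLite column.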

A single test suite under crates/hakiri-context/tests/catalog_conformance/ exercises:

  • Concurrent chunk-claim under load (100 workers competing for 1000 chunks; expect every chunk to be claimed by exactly one worker).
  • Schema-history insert + read under contention (no torn reads, no version skips).
  • Lease acquisition with crash + recovery (simulated holder death, verify takeover after TTL).
  • Snapshot commit + read (commit not visible until all referenced sidecars are durable; visible immediately after).
  • Revocation epoch propagation (bump epoch, verify rejection within staleness window across N readers).
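
The first scenario above could be driven by a harness along these lines. This is a minimal sketch, not the suite's actual code: ChunkClaimer, AtomicBackend, and claim_storm are hypothetical names, and the atomic-flag backend is only a stand-in so the harness has something to run against.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical minimal interface the harness needs from a backend.
pub trait ChunkClaimer: Send + Sync {
    /// Returns true if this call won the claim for `chunk_id`.
    fn try_claim(&self, chunk_id: usize, worker: usize) -> bool;
}

/// Stand-in backend: one atomic flag per chunk.
pub struct AtomicBackend {
    claimed: Vec<AtomicBool>,
}

impl AtomicBackend {
    pub fn new(chunks: usize) -> Self {
        Self { claimed: (0..chunks).map(|_| AtomicBool::new(false)).collect() }
    }
}

impl ChunkClaimer for AtomicBackend {
    fn try_claim(&self, chunk_id: usize, _worker: usize) -> bool {
        // Compare-and-set: only the first caller flips false -> true.
        self.claimed[chunk_id]
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}

/// Every worker tries to claim every chunk; returns the number of
/// successful claims recorded per chunk.
pub fn claim_storm<C: ChunkClaimer + 'static>(
    backend: Arc<C>,
    workers: usize,
    chunks: usize,
) -> Vec<usize> {
    let handles: Vec<_> = (0..workers)
        .map(|w| {
            let backend = Arc::clone(&backend);
            thread::spawn(move || {
                // Collect the chunk ids this worker managed to claim.
                (0..chunks)
                    .filter(|&c| backend.try_claim(c, w))
                    .collect::<Vec<usize>>()
            })
        })
        .collect();

    let mut wins = vec![0usize; chunks];
    for handle in handles {
        for c in handle.join().expect("worker thread panicked") {
            wins[c] += 1;
        }
    }
    wins
}
```

The conformance assertion is then that every entry of the returned vector equals 1: a 0 means a chunk was never dispatched, and anything above 1 means two workers both believed they held it.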

The suite runs against every backend in CI. A backend whose conformance test fails cannot be released. The local-SQLite backend is the reference implementation; backend-specific divergence (e.g., DynamoDB’s lack of strict linearizability across items in a transaction) requires either (a) the backend implementing a compensating pattern or (b) an explicit divergence in the contract scoped to that backend.

Positive

  • One contract, many backends — operators can swap topologies without code changes (only config).
  • The M2 byte-identical-Parquet soak test has a fighting chance because the catalog cannot silently produce different results on different backends.
  • New backends (DynamoDB, eventually FoundationDB or others) ship by passing the conformance suite; no surprise behavior.

Negative

  • DynamoDB’s consistency model requires more catalog code than the Postgres adapter (single-item conditional writes plus TransactWriteItems). The conformance suite is the discipline that exposes the gaps.
  • The conformance suite itself is non-trivial — easily 2k lines of Rust integration tests. Worth it.
  • Some Postgres-native conveniences (e.g., LISTEN/NOTIFY for lease expiration push) are not in the contract because SQLite/DO can’t provide them. Backends may use them internally as performance optimizations but not as semantic primitives.

Neutral

  • The contract is intentionally small. Features that don’t fit (full-text search, complex joins) live in DuckDB over Parquet, not in the catalog.

One backend (Postgres everywhere). Cleanest semantically, but requires running Postgres for the local CLI and for the Cloudflare topology — directly conflicts with Pillar 1 and the no-orchestrator promise.

Loose contract (“the implementations are similar enough”). What we started with. Rejected because, without a pinned contract, the M2 soak test would surface a backend-specific bug a week before launch. The conformance suite is the only honest path.

Embed a single Rust embedded DB (redb, sled) and ignore Postgres. Doesn’t address DO SQLite or the AWS multi-node case where shared catalog state across Fargate tasks is required.