Context Store

The context store is Hakiri’s default destination. It is not a database; it is a documented directory layout you can read with duckdb or pyarrow even if Hakiri is uninstalled. The store is local-first: it exists fully on disk, and sync to an S3-compatible bucket is opt-in.

Goals

Be inspectable with stock tools: every file is SQLite, Parquet, JSON, or TOML.
Be duckdb-queryable with zero Hakiri runtime.
Be syncable to S3-compatible storage as the unit of “team context.”
Be safe under crash: any partially-written run is recoverable or droppable, never half-applied.

On-disk layout

.hakiri/context/<project>/
  meta.sqlite                       # catalog: schemas, runs, cursors, lineage
  config.toml                       # store config (version, partitioning)
  tables/
    github_issues/
      schema.json                   # latest Schema (Arrow JSON form)
      schema-history.jsonl          # append-only schema evolutions
      data/
        snapshot=2026-05-11T08-00-00Z/
          part-00000.parquet
          part-00001.parquet
          _manifest.json
        runs/
          run-01HXYZ.../
            <node-id>/                          # writer node id; one dir per replica
              part-00000.parquet
              _manifest.json
    shopify_orders/
      ...
  views/
    github_issues.sql               # DuckDB view def (PK-merged across runs)
    ...

Two file kinds matter:

Run files under tables/<t>/data/runs/<run-id>/ are the raw output of one pipeline run. Append-only, never edited.
Snapshot files under tables/<t>/data/snapshot=<ts>/ are compaction outputs — runs collapsed into a queryable, PK-deduped Parquet set. Snapshots are immutable; compaction creates new ones, never mutates old.

DuckDB views (views/<t>.sql) hide the run-vs-snapshot detail from queries:

-- views/github_issues.sql — generated by `hakiri schema regen-view github_issues`
-- Catalog schema fingerprint: sha256:abc123… (rev 12)
-- Explicit CASTs unify type-widened Parquet across runs. DuckDB does NOT auto-cast
-- (e.g. int32→int64) across files in a glob; without these CASTs queries silently
-- truncate or error when widened columns coexist with their original-width snapshots.
CREATE OR REPLACE VIEW github_issues AS
WITH all_rows AS (
  SELECT
    CAST(id            AS BIGINT)    AS id,
    CAST(title         AS VARCHAR)   AS title,
    CAST(author        AS VARCHAR)   AS author,
    CAST(created_at    AS TIMESTAMP) AS created_at,
    CAST(comment_count AS BIGINT)    AS comment_count,
    -- … one CAST per column; columns absent in older files become NULL via union_by_name
    _ingested_at,
    _run_id,
    '_run' AS source
  FROM read_parquet('tables/github_issues/data/runs/*/*/*.parquet', union_by_name=true)
  WHERE NOT EXISTS (
    SELECT 1 FROM snapshot_manifest sm WHERE sm.includes_run = _run_id
  )
  UNION ALL BY NAME
  SELECT
    CAST(id            AS BIGINT)    AS id,
    CAST(title         AS VARCHAR)   AS title,
    -- … same projection for snapshots …
    _ingested_at,
    NULL AS _run_id,
    '_snapshot' AS source
  FROM read_parquet('tables/github_issues/data/snapshot=*/*.parquet', union_by_name=true)
)
SELECT * EXCLUDE (source, _run_id, rn) FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY _ingested_at DESC) AS rn
  FROM all_rows
) WHERE rn = 1;

The view is regenerated by hakiri schema regen-view <table> whenever the catalog schema changes. The fingerprint comment in the SQL makes drift detectable. Note the run glob is runs/*/*/*.parquet — the extra * is the writer-node-id segment introduced by the conflict-resolution model.
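
The merge the view performs can be sketched outside SQL. The Python below is a minimal illustration, not Hakiri code: `merge_rows` and the dict-shaped rows are hypothetical, and it assumes `_ingested_at` values are same-format ISO-8601 UTC strings, so string order matches time order.

```python
def merge_rows(run_rows: list, snapshot_rows: list, folded_run_ids: set) -> list:
    """Keep the most recently ingested row per primary key `id`."""
    latest = {}
    # Runs already folded into a snapshot are skipped, mirroring the NOT EXISTS filter.
    live = [r for r in run_rows if r["_run_id"] not in folded_run_ids]
    for row in list(snapshot_rows) + live:
        current = latest.get(row["id"])
        if current is None or row["_ingested_at"] > current["_ingested_at"]:
            latest[row["id"]] = row
    return sorted(latest.values(), key=lambda r: r["id"])
```

A row from an unfolded run replaces the snapshot row with the same `id` only when it was ingested later, which is the last-write-wins rule the `ROW_NUMBER()` window encodes.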

Catalog schema (`meta.sqlite`)

CREATE TABLE schema_version (id INTEGER PRIMARY KEY, applied_at TEXT NOT NULL);

CREATE TABLE table_schema (
  table_name TEXT PRIMARY KEY,
  arrow_schema_json TEXT NOT NULL,
  primary_key TEXT,                  -- JSON array
  created_at TEXT NOT NULL,
  updated_at TEXT NOT NULL
);

CREATE TABLE schema_history (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  table_name TEXT NOT NULL,
  change_kind TEXT NOT NULL,         -- add_column | widen_type | rename | …
  before_json TEXT,
  after_json  TEXT NOT NULL,
  decided_by  TEXT NOT NULL,         -- 'auto' | 'operator:<id>' | 'agent:<id>'
  at TEXT NOT NULL
);

CREATE TABLE run (
  run_id TEXT PRIMARY KEY,
  pipeline_id TEXT NOT NULL,
  started_at TEXT NOT NULL,
  ended_at TEXT,
  status TEXT NOT NULL,
  row_count INTEGER,
  byte_count INTEGER,
  trace_id TEXT
);

CREATE TABLE snapshot (
  snapshot_id TEXT PRIMARY KEY,
  table_name TEXT NOT NULL,
  created_at TEXT NOT NULL,
  includes_runs TEXT NOT NULL,       -- JSON array of run_ids folded in
  row_count INTEGER NOT NULL
);

CREATE TABLE lineage (
  child_run_id TEXT NOT NULL,
  parent_run_id TEXT NOT NULL,
  PRIMARY KEY (child_run_id, parent_run_id)
);

The catalog is the only mutable component; Parquet files and JSON manifests are written once and never edited in place. The catalog is also recoverable: hakiri context rebuild-catalog can reconstruct meta.sqlite from the file tree.
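
Because the catalog is plain SQLite, it can be read with any SQLite client. As a hedged sketch, the helper below (the name `folded_run_ids` is hypothetical) answers “which runs has some snapshot already folded in?” by decoding the JSON array in `snapshot.includes_runs`. It uses only the `snapshot` table as defined above, against an in-memory database.

```python
import json
import sqlite3

def folded_run_ids(conn: sqlite3.Connection, table_name: str) -> set:
    """Return every run_id recorded in snapshot.includes_runs for one table."""
    rows = conn.execute(
        "SELECT includes_runs FROM snapshot WHERE table_name = ?", (table_name,)
    )
    folded = set()
    for (includes_runs,) in rows:
        folded.update(json.loads(includes_runs))  # the column holds a JSON array
    return folded

# Demo against an in-memory catalog with the same snapshot DDL as the spec.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE snapshot (
         snapshot_id TEXT PRIMARY KEY,
         table_name TEXT NOT NULL,
         created_at TEXT NOT NULL,
         includes_runs TEXT NOT NULL,
         row_count INTEGER NOT NULL)"""
)
conn.execute(
    "INSERT INTO snapshot VALUES (?, ?, ?, ?, ?)",
    ("snap-1", "github_issues", "2026-05-11T08:00:00Z",
     json.dumps(["run-a", "run-b"]), 10),
)
print(folded_run_ids(conn, "github_issues"))
```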

Querying

# Open an interactive shell
hakiri query

# Or one-shot
hakiri query "select author, count(*) from github_issues group by 1"

# Or just use duckdb directly
duckdb -c "ATTACH '.hakiri/context/oh/duck.db'; USE duck; SELECT * FROM github_issues LIMIT 10;"

DuckDB is the query engine. We ship a thin wrapper that:

Auto-ATTACHes the context store
Registers each view from views/*.sql
Exposes table-level row counts as a hakiri.tables system view

There is no separate “Hakiri SQL dialect”. You’re writing DuckDB SQL against Parquet files, so anything DuckDB can express over Parquet works unchanged.

Compaction

Compaction is how the context store stays fast to query as runs accumulate. Without it, a table with 15-minute pulls over a year has ~35,000 run-file directories, each holding a fragment of the table; DuckDB’s planner does fine but file-open overhead and small-file scans dominate at that scale. Compaction collapses runs into immutable, well-organized snapshots that agents and duckdb queries hit first.

What compaction does — atomic snapshot + sidecar commit

A background task (in hakiri serve) or a manual hakiri context compact performs the following, with all artifacts staged under snapshot=<ts>.staging/ and renamed atomically so that the snapshot becomes visible only when the Parquet and every declared sidecar index are durable:

Select inputs. All run files for table T newer than the latest snapshot (or all runs, for a full rebuild).
PK-dedup. Last-write-wins by _ingested_at for tables with a declared primary key; append-only union otherwise.
Schema reconcile. A column added in run N is backfilled as NULL (or its declared default) for rows from runs 1..N-1. Type-widened columns (e.g. int32 → int64) are cast up. The compaction never silently narrows or drops a column — divergent schema histories require explicit hakiri schema reconcile before compaction proceeds.
Sort + cluster. Rows are written sorted by the table’s declared cluster_by columns (default: PK if present, else _ingested_at). This produces effective Parquet row-group zone maps for the columns agents actually filter on.
Partition. Output is split by the table’s partition spec (default: a single partition; opt-in date / tenant / access_pattern partitioning, see below).
Write Parquet to staging. Files land under tables/<t>/data/snapshot=<ts>.staging/ with a partial _manifest.json.
Build sidecars in staging. Indexes (vector, FTS, PK lookup, bloom; see § Indexes for agent consumption) rebuild against the staged Parquet, writing to tables/<t>/indexes/<index-id>/snapshot=<ts>.staging/. Each sidecar finalizes its own _manifest.json only after its data is fully written and fsynced.
Finalize the snapshot manifest. The Parquet _manifest.json is rewritten to enumerate every required sidecar (indexes: ["pk", "vec-body-bge-large-en-v1.5", "fts-body"]). The manifest is fsynced.
Atomic rename. snapshot=<ts>.staging/ → snapshot=<ts>/, and each indexes/<index-id>/snapshot=<ts>.staging/ → indexes/<index-id>/snapshot=<ts>/. On local filesystems this is a rename(2); on object stores it is a content-hash-keyed copy + delete since object stores lack atomic rename — the catalog records the commit point and readers ignore any directory without a corresponding catalog row.
Update catalog. snapshot.includes_runs records which run-ids were folded; the view in views/<t>.sql regenerates to point at the new snapshot; the snapshot’s indexes array is recorded so readers know what sidecars to expect.
GC folded runs and staging directories after a retention window (default 7 days) preserving point-in-time recovery for that window.

Why staging + atomic rename: the invariant is “each index snapshot is tied to a data snapshot.” Building sidecars after a snapshot is already “current” leaves a window where queries find Parquet but no matching index — at best slow, at worst inconsistent. Writing Parquet + sidecars to staging, then committing them together as the readable snapshot, makes the invariant load-bearing. A crash mid-build leaves a staging directory the catalog will GC; no half-indexed snapshot is ever readable.

Compaction is non-blocking for reads. The view query reads the catalog’s current snapshot pointer; queries running mid-compaction continue against the prior snapshot until they finish, then naturally pick up the new one. There is no “snapshot promoted but indexes still building” state.
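
The staging-then-rename commit can be sketched for the local-filesystem case. This is an illustration under assumptions, not the compactor: `commit_snapshot` and `visible_snapshots` are hypothetical names, the manifest shape is invented, sidecar directories are left out, and the fsync of the parent directory that a production implementation would also want is omitted for brevity.

```python
import json
import os
import tempfile

def commit_snapshot(data_dir: str, ts: str, parts: dict, indexes: list) -> str:
    """Stage files under snapshot=<ts>.staging/, then rename into place."""
    staging = os.path.join(data_dir, f"snapshot={ts}.staging")
    final = os.path.join(data_dir, f"snapshot={ts}")
    os.makedirs(staging)
    for name, payload in parts.items():
        with open(os.path.join(staging, name), "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # data is durable before the commit point
    with open(os.path.join(staging, "_manifest.json"), "w") as f:
        json.dump({"parts": sorted(parts), "indexes": indexes}, f)
        f.flush()
        os.fsync(f.fileno())
    os.rename(staging, final)  # commit point on a local filesystem
    return final

def visible_snapshots(data_dir: str) -> list:
    """Readers ignore staging directories, including ones left by a crash."""
    return sorted(
        d for d in os.listdir(data_dir)
        if d.startswith("snapshot=") and not d.endswith(".staging")
    )

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "snapshot=crashed.staging"))  # crash leftover
    commit_snapshot(root, "2026-05-12T08-00-00Z", {"part-00000.parquet": b"PAR1"}, ["pk"])
    print(visible_snapshots(root))
```

The leftover `.staging` directory never appears to readers, which is the property the GC step relies on.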

Triggers

Three trigger kinds, configurable per table:

Trigger	Default	Rationale
Run-count	Every 50 runs	The most-correlated proxy for “too many small files”
Time	Every 6 hours	Bounds staleness of zone maps and sidecar indexes
Manual	`hakiri context compact <table>`	Operator-initiated for migrations and reshapes

[[pipeline.tables]]
name = "github_issues"
  [pipeline.tables.compaction]
  trigger_run_count = 50
  trigger_interval  = "6h"
  cluster_by        = ["repo", "updated_at"]
  partition_by      = "month(updated_at)"     # opt-in; default = no partitioning
  retain_runs       = "7d"                    # how long folded runs survive
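
How the two automatic triggers might combine can be sketched as a single predicate. `compaction_due` is a hypothetical helper. The rule that the time trigger only fires when at least one new run exists is an assumption made here, not something the spec states.

```python
from datetime import datetime, timedelta, timezone

def compaction_due(
    runs_since_snapshot: int,
    last_snapshot_at: datetime,
    now: datetime,
    trigger_run_count: int = 50,
    trigger_interval: timedelta = timedelta(hours=6),
) -> bool:
    """True when either the run-count trigger or the time trigger has fired."""
    if runs_since_snapshot >= trigger_run_count:
        return True
    # Assumption: with nothing new to fold in, the time trigger is a no-op.
    return runs_since_snapshot > 0 and now - last_snapshot_at >= trigger_interval

t0 = datetime(2026, 5, 12, 8, 0, tzinfo=timezone.utc)
print(compaction_due(10, t0, t0 + timedelta(hours=6)))
```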

Partitioning strategy

Partitioning is opt-in because most agent-context tables are small enough that a single Parquet snapshot is faster than partitioned scans. Besides the unpartitioned default, we support three patterns; the table says when to use each:

Pattern	When	Spec
None (default)	Tables under ~10 GB; query selectivity comes from row-group zone maps	—
Date	Time-series tables where most queries filter on a date column	`partition_by = "month(updated_at)"`
Tenant	Multi-tenant context where every agent’s read scopes to one tenant	`partition_by = "tenant_id"`
Access-pattern	Declared per PRD Pillar 5 (collocation) — `recent_90d`, `by_repo`, etc. — so replicas can pull just the partition they need	`access_pattern = "recent_90d"` (translated to a derived partition spec)

Over-partitioning is the classic mistake (too many tiny files); the runtime warns at hakiri plan if a partition spec is projected to produce > 1000 partitions or median partition size < 16 MB.
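
The two warning thresholds can be expressed as a small check over projected partition sizes. This is a sketch: `partition_warnings` is a hypothetical name, the message strings are invented, and “16 MB” is read as 16 MiB.

```python
from statistics import median

MAX_PARTITIONS = 1000
MIN_MEDIAN_BYTES = 16 * 1024 * 1024  # 16 MB, interpreted here as MiB

def partition_warnings(partition_sizes: list) -> list:
    """Return a warning string for each threshold the projection violates."""
    warnings = []
    if len(partition_sizes) > MAX_PARTITIONS:
        warnings.append(f"{len(partition_sizes)} partitions exceeds {MAX_PARTITIONS}")
    if partition_sizes and median(partition_sizes) < MIN_MEDIAN_BYTES:
        warnings.append("median partition size is below 16 MB")
    return warnings

print(partition_warnings([1024 * 1024] * 2000))
```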

Clustering vs sort order

cluster_by controls row order within a Parquet file. Choosing it well lets DuckDB skip row groups via zone maps without explicit indexes:

cluster_by = ["repo", "updated_at"] — efficient for WHERE repo = 'x' AND updated_at > '2026-01-01'.
cluster_by = ["account_id"] — efficient for per-tenant queries; pairs well with tenant partitioning.
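
To see why row order matters for zone maps, the toy model below treats a row group as a fixed-size slice of one column and counts how many groups an equality filter must open. It is an illustration only; real Parquet row groups are sized in bytes and the function names are hypothetical.

```python
def row_group_zone_maps(values: list, group_size: int) -> list:
    """Compute (min, max) for each consecutive slice of `group_size` values."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g)) for g in groups]

def groups_scanned(zone_maps: list, target) -> int:
    """A group can be skipped when target falls outside its [min, max] range."""
    return sum(1 for lo, hi in zone_maps if lo <= target <= hi)

repos = ["a", "b", "c", "d"] * 25  # 100 rows, values interleaved
unsorted_maps = row_group_zone_maps(repos, 10)
sorted_maps = row_group_zone_maps(sorted(repos), 10)

print(groups_scanned(unsorted_maps, "c"))  # every group spans a..d, so none are skipped
print(groups_scanned(sorted_maps, "c"))    # only groups whose range covers "c"
```

With interleaved values every row group’s range covers the target; after sorting, only the groups that actually contain it remain.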

Z-ordering (Morton-curve interleaving, like Delta Lake’s OPTIMIZE ZORDER BY) is a v2 candidate; v0 ships lexicographic sort, which covers the agent-query distribution we expect.

Compaction in clouds

Topology	Where compaction runs	Notes
Local CLI	In-process at `hakiri context compact`	One-shot
Daemon	Background task in `hakiri serve`	Soft-rate-limited so it doesn’t starve live runs
Cloudflare	A scheduled Worker dispatches a Container task per due table	Container does the work; Worker just decides
AWS	EventBridge Schedule → Fargate task	Same shape as the runtime’s normal pipelines

Compaction in the workflow-shaped clouds reuses the same chunk-claim queue as backfill (see 03-pipelines.md § Backfill orchestration) — a “compact this table” job is just another chunk for a worker to claim.

Cost model

Compaction reads N run files, writes one snapshot file, then deletes the run files after retention. Network cost (for cloud topologies) is 2× the table size per compaction: once to read, once to write back. Operators sensitive to egress can:

Compact in the same region as the bucket (the default for hakiri deploy).
Raise trigger_run_count to compact less often.
Skip partitioning if the table is small.
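
A back-of-the-envelope estimate under the 2× model above. The sketch assumes compaction frequency is set by whichever automatic trigger fires more often, which is an approximation, and `daily_transfer_gb` is a hypothetical helper.

```python
def daily_transfer_gb(
    table_gb: float,
    runs_per_day: float,
    trigger_run_count: int = 50,
    trigger_interval_hours: float = 6.0,
) -> float:
    """Estimate daily read+write traffic as 2x table size per compaction."""
    by_count = runs_per_day / trigger_run_count  # compactions/day from run-count
    by_time = 24 / trigger_interval_hours        # compactions/day from the timer
    return 2 * table_gb * max(by_count, by_time)

# 15-minute pulls are 96 runs/day; with the defaults the 6h timer dominates.
print(daily_transfer_gb(10, 96))
```

For a 10 GB table pulled every 15 minutes with default triggers, the timer yields four compactions a day, so relaxing `trigger_interval` saves more than raising `trigger_run_count`.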

The runtime emits hakiri.compaction.* OTel spans (input bytes, output bytes, rows read, rows after dedup, sidecar rebuild time) so the operator can tune from data, not guesses.

Indexes for agent consumption

The query engine choice (ADR-0004) commits us to DuckDB-over-Parquet. To make that pleasant for agent retrieval — which mixes structured filters with vector and full-text search — Hakiri ships indexes as sidecars next to Parquet, rebuildable from the canonical store and discoverable via MCP. The indexes are part of the on-disk layout, not a separate database.

This section covers what indexes exist, where they live, how they are built, how an agent finds them, and how they interact with the access-control layer.

Index kinds shipped in v0

Index	Engine	Built for	File layout
Zone map (min/max per row group)	Parquet native	Range filters, equality on clustered columns	Inline in Parquet footers — free with `cluster_by`
PK lookup	Sorted Parquet + sparse map	Point lookups by primary key	`tables/<t>/indexes/pk/snapshot=<ts>/`
Vector (HNSW)	DuckDB `vss` extension or LanceDB sidecar	Semantic similarity search	`tables/<t>/indexes/vec-<col>-<model>/snapshot=<ts>/`
Full-text (BM25)	Tantivy	Keyword search over text columns	`tables/<t>/indexes/fts-<col>/snapshot=<ts>/`
Bloom filter	Parquet native	Selective high-cardinality equality	Inline in Parquet — opt-in per column

The index format choice for vectors and FTS is itself an open question (see PRD § Open product questions); the storage location (sidecar next to the snapshot) is stable across format choices.

On-disk layout for sidecars

tables/github_issues/
  data/
    snapshot=2026-05-12T08-00-00Z/
      part-00000.parquet
      _manifest.json
  indexes/
    pk/
      snapshot=2026-05-12T08-00-00Z/
        index.dat
        _manifest.json
    vec-body-bge-large-en-v1.5/                 # column × embedding model
      snapshot=2026-05-12T08-00-00Z/
        hnsw.bin
        id_map.parquet                          # row_id ↔ parquet_offset
        _manifest.json
        _meta.json                              # model, dim, distance metric
    fts-body/
      snapshot=2026-05-12T08-00-00Z/
        tantivy/...
        _manifest.json

Two invariants:

Index version is identified by (column, builder, builder-version). A vector index on body built with bge-large-en-v1.5 is a different directory from one built with text-embedding-3-large. Both can coexist; the agent picks at query time.
Each index snapshot is tied to a data snapshot. When a snapshot is GC’d, its indexes are GC’d with it. There is never an index pointing at a deleted Parquet snapshot.

Manifest declaration

Indexes are declared per table in the manifest:

[[pipeline.tables]]
name = "github_issues"

  [[pipeline.tables.indexes]]
  kind   = "pk"
  on     = ["id"]

  [[pipeline.tables.indexes]]
  kind   = "vector"
  column = "body"
  model  = "bge-large-en-v1.5"      # operator's embedding choice — PRD Pillar 7
  dim    = 1024
  metric = "cosine"
  m      = 32                       # HNSW connectivity
  ef_construction = 200

  [[pipeline.tables.indexes]]
  kind     = "fts"
  columns  = ["title", "body"]
  analyzer = "english"

  [[pipeline.tables.indexes]]
  kind   = "bloom"
  on     = "author_id"              # high-cardinality equality lookups

The embedding model identifier travels with the index. Swapping models is a one-command rebuild that does not touch the canonical Parquet:

hakiri index rebuild github_issues vec-body --model text-embedding-3-large

This commits to the Pillar 7 (provider-agnostic) story: the catalog records the new model identifier alongside the old; both indexes can coexist during cutover; agents that haven’t switched yet keep using the old.

Build strategy

Indexes build in the same pass as compaction (preferred) or independently:

During compaction. The compactor writes the new Parquet snapshot, then immediately builds the declared indexes against it. Index files land in the same snapshot=<ts> directory namespace. This is the default — one pass, one I/O cost.
Incremental. For vector indexes specifically, embedding API costs make full rebuilds expensive. The runtime supports incremental = true on a vector index declaration: only rows new since the previous snapshot are embedded; the HNSW graph is extended in place. Tolerates drift (the graph becomes slightly suboptimal); a full rebuild is recommended every ~10 incremental rebuilds.
Standalone. hakiri index build <table> <index-id> builds a single index out-of-band — useful for adding a new index to an existing table without forcing a full compaction.

Index builds emit hakiri.index.* OTel spans recording rows indexed, builder duration, and (for vectors) the embedding-provider request count and cost-estimate hint. This is how operators answer “what did this index cost me.”

Agent discovery

The MCP server exposes context.describe(table):

{
  "table": "github_issues",
  "rows":  482910,
  "schema_fingerprint": "sha256:abc123…",
  "agent_description": "GitHub issues across tracked repos. Filter by `repo`, `state`. Use vector index `vec-body-*` for semantic search over issue bodies; FTS index `fts-body` for keyword matches. Recent issues live in the `recent_90d` partition.",
  "indexes": [
    { "id": "pk-id",                       "kind": "pk",     "on": ["id"] },
    { "id": "vec-body-bge-large-en-v1.5",  "kind": "vector", "column": "body", "model": "bge-large-en-v1.5", "dim": 1024, "rows": 482910 },
    { "id": "fts-body",                    "kind": "fts",    "columns": ["title","body"] }
  ],
  "partitions": { "scheme": "month(updated_at)", "count": 84 },
  "example_queries": [
    "SELECT * FROM github_issues WHERE state = 'open' AND repo = 'torvalds/linux' ORDER BY updated_at DESC LIMIT 20;",
    "context.query(table='github_issues', filter={state:'open'}, semantic={text:'kernel panic in driver', limit:20})"
  ]
}

The agent_description field, together with example_queries, is what makes a table self-orienting for a new agent. It is authored by humans, can be regenerated from a connector’s schema spec (a task an LLM does well), and is surfaced at MCP tools/list time so the agent knows what’s queryable before it queries.

Unified MCP query surface

Agents do not query “the SQL store” and “the vector store” as separate surfaces. One MCP tool — context.query — accepts a structured filter and a semantic component in the same call:

{
  "table":   "github_issues",
  "filter":  { "state": "open", "repo": ["torvalds/linux", "openhackersclub/gctrl"] },
  "semantic": {
    "text":  "kernel panic in network driver",
    "column": "body",
    "limit": 50
  },
  "project": ["id", "title", "url", "score"],
  "limit":   20
}

The runtime plans the execution:

Filter-then-ANN (default when the structured filter is highly selective): apply WHERE first, then run ANN over the filtered row id set using the vector index’s id_map.parquet.
ANN-then-filter (when the semantic query is the dominant predicate): get top-K from HNSW, then filter the K candidates.
Hybrid (when both are mid-selectivity): pull top-K-broad from HNSW, intersect with filter result, re-rank.

The planning heuristic in M2 uses declared statistics (rows, partition counts, declared selectivities); a cost-based optimizer is the v2 candidate flagged in Challenge 3.
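
A statistics-driven choice between the three plans could look like the sketch below. The thresholds (1% and 50%) and the name `choose_plan` are assumptions for illustration; the spec does not state the actual cut-offs.

```python
def choose_plan(
    filter_selectivity: float,
    selective_below: float = 0.01,  # assumed threshold, not from the spec
    broad_above: float = 0.5,       # assumed threshold, not from the spec
) -> str:
    """Pick a plan from the estimated fraction of rows passing the filter."""
    if not 0.0 <= filter_selectivity <= 1.0:
        raise ValueError("selectivity must be a fraction between 0 and 1")
    if filter_selectivity <= selective_below:
        return "filter-then-ann"  # few rows survive: search only those
    if filter_selectivity >= broad_above:
        return "ann-then-filter"  # filter removes little: search first
    return "hybrid"

print(choose_plan(0.001), choose_plan(0.2), choose_plan(0.9))
```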

Returned rows always carry a provenance edge:

{ "id": "issue-481", "title": "...", "score": 0.872,
  "_provenance": {
    "table":  "github_issues",
    "run":    "run-01HXYZ...",
    "connector": "github@0.4.2",
    "ingested_at": "2026-05-12T08:14:00Z",
    "authored_by": "agent://claude-connector-author"   // if the connector was agent-authored
  }
}

The agent can cite, not hallucinate. The provenance fields are also what an auditor needs to answer “where did this passage come from.”

Policy enforcement

Capability tokens (09-access-control.md) filter retrieval results before they reach the agent:

Row-level security predicates apply to the result set after retrieval.
Column masking applies to projected columns; if the agent retrieved a passage that contains customer.email and the token forbids reading customer.email, the passage is dropped or the email column is masked in the returned record (token-policy-dependent).
Vector matches against rows the token cannot read are filtered before the top-K is returned — an attacker with vector access but not row access cannot infer row presence by similarity.
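
The ordering that matters here is “filter unreadable rows, then truncate to top-K”. The sketch below shows it; `retrieve`, the `can_read` predicate, and the `"***"` mask value are hypothetical stand-ins for whatever the token policy defines.

```python
def retrieve(candidates: list, can_read, masked_columns: set, k: int) -> list:
    """candidates is a list of (score, row) pairs from the vector index."""
    # Drop rows the token cannot read BEFORE taking the top-K, so that
    # unreadable rows never occupy a slot or reveal their presence.
    readable = [(score, row) for score, row in candidates if can_read(row)]
    readable.sort(key=lambda pair: pair[0], reverse=True)
    results = []
    for score, row in readable[:k]:
        visible = {c: ("***" if c in masked_columns else v) for c, v in row.items()}
        visible["score"] = score
        results.append(visible)
    return results

hits = [
    (0.9, {"id": 1, "tenant": "b", "email": "x@b.example"}),
    (0.8, {"id": 2, "tenant": "a", "email": "y@a.example"}),
    (0.7, {"id": 3, "tenant": "a", "email": "z@a.example"}),
]
print(retrieve(hits, lambda r: r["tenant"] == "a", {"email"}, k=2))
```

Truncating first and filtering afterwards would return fewer than K rows whenever an unreadable row ranked highly, which is itself a signal about hidden data.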

This makes the retrieval surface policy-aware by construction, the property Challenge 1 and Challenge 3 jointly require.

Replicas (Pillar 5)

Pull-side replicas (PRD Pillar 5) materialize the table plus its declared indexes onto the agent’s host. A laptop running an agent over github_issues syncs the snapshot Parquet, the HNSW sidecar, and the FTS sidecar — agent queries hit local files, p99 reads in single-digit ms.

Replica refresh is per-snapshot, not per-row: when a new snapshot is committed centrally, the replica fetches the new snapshot directory and its sidecars, then atomically swaps the current symlink. Old snapshots stay around for the retention window.
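
On a POSIX filesystem the symlink swap can be done without a moment where `current` is missing: create the new link under a temporary name, then rename it over the old one. The sketch assumes a link literally named `current` inside the table directory, and `swap_current` is a hypothetical helper.

```python
import os
import tempfile

def swap_current(table_dir: str, snapshot_name: str) -> None:
    """Repoint <table_dir>/current at snapshot_name via link-then-rename."""
    tmp_link = os.path.join(table_dir, f".current.tmp-{os.getpid()}")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear a leftover from an interrupted swap
    os.symlink(snapshot_name, tmp_link)  # relative target inside table_dir
    # rename(2) replaces the old link itself; readers see old or new, never neither.
    os.replace(tmp_link, os.path.join(table_dir, "current"))

with tempfile.TemporaryDirectory() as root:
    for name in ("snapshot=2026-05-11T08-00-00Z", "snapshot=2026-05-12T08-00-00Z"):
        os.mkdir(os.path.join(root, name))
    swap_current(root, "snapshot=2026-05-11T08-00-00Z")
    swap_current(root, "snapshot=2026-05-12T08-00-00Z")
    print(os.readlink(os.path.join(root, "current")))
```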

Replicas can declare which indexes they want — a laptop with limited disk can pull the FTS index but skip the 4 GB HNSW; a Worker can pull only the partition shards it serves.

Sync protocol

Sync targets any S3-compatible bucket — with the honest caveat that S3-compatible is not S3-equivalent. The wire format mirrors the on-disk layout 1:1 with one addition: a top-level manifest.json at the bucket root that lists every project, table, and snapshot, plus their content hashes.

S3-compatible capability matrix

The local-first-no-server promise depends on atomic conditional puts for single-writer leases stored in the bucket itself (the Topology 2 sync model + Challenge 5). Conditional-put support varies:

Backend	`If-None-Match: *` (create-if-absent)	`If-Match: <etag>` (CAS update)	Multi-writer sync (lease-in-bucket) supported
Cloudflare R2	✓	✓	✓
AWS S3 (since Nov 2024)	✓	✓	✓
MinIO	partial (recent versions)	partial	⚠️ probe required
Garage	recent, behavior quirky	recent	⚠️ probe required
Backblaze B2 (S3 API)	✓	✗ no `If-Match` for PUT	✗
SeaweedFS	spotty across versions	spotty	✗

hakiri sync diagnose runs a capability probe against the configured bucket — writes test objects exercising If-None-Match/If-Match, observes the responses, and writes a report:

$ hakiri sync diagnose
bucket: r2://oh-context
  ✓ HEAD / GET / PUT
  ✓ conditional PUT (If-None-Match: *)
  ✓ conditional PUT (If-Match: <etag>)
  → multi-writer sync mode SUPPORTED

If the probe fails the conditional-put checks, Hakiri refuses to enable multi-writer mode and either:

Falls back to single-writer mode (one designated writer node; other nodes pull-only), or
Falls back to a catalog-backed lease (Postgres / SQLite holds the lease instead of the bucket), if a catalog is configured.

The fallback is explicit — the operator sees [sync] mode = "single-writer" in the runtime banner, not a silent degrade.
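
The decision the probe feeds can be written as a small function. The mode strings other than "single-writer" are hypothetical labels, and preferring the catalog-backed lease over single-writer mode when a catalog is configured is an assumption of this sketch.

```python
def choose_sync_mode(
    if_none_match_ok: bool, if_match_ok: bool, catalog_configured: bool
) -> str:
    """Map probe results to a sync mode; both conditional puts are required."""
    if if_none_match_ok and if_match_ok:
        return "multi-writer"   # lease-in-bucket is safe
    if catalog_configured:
        return "catalog-lease"  # lease held outside the bucket
    return "single-writer"      # one designated writer, others pull-only

print(choose_sync_mode(True, False, catalog_configured=False))
```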

Wire format

r2://<bucket>/
  manifest.json                     # top-level index, cached locally for fast diff
  <project>/
    config.toml
    meta.sqlite                     # uploaded as a single object on push
    tables/<t>/...                  # mirror of local layout

Push (`hakiri sync push`)

Walk local .hakiri/context/<project>/
For each file, compute SHA-256; compare to remote manifest.json
Upload new/changed objects (multipart for >5MB)
Upload updated manifest.json last (commit point — readers see consistent state)
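
The push steps reduce to a hash diff followed by an ordered upload list. In this sketch, files are modelled as in-memory bytes and the remote manifest as a mapping of key to SHA-256 hex digest; both shapes and the name `plan_push` are assumptions for illustration.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def plan_push(local_files: dict, remote_manifest: dict) -> list:
    """Return object keys to upload, with manifest.json last as the commit point."""
    changed = sorted(
        key for key, data in local_files.items()
        if remote_manifest.get(key) != sha256_hex(data)
    )
    # Nothing changed means nothing to commit, so the manifest is not re-uploaded.
    return changed + ["manifest.json"] if changed else []

local = {"tables/t/a.parquet": b"aaa", "tables/t/b.parquet": b"bbb"}
remote = {"tables/t/a.parquet": sha256_hex(b"aaa")}
print(plan_push(local, remote))
```

Uploading the manifest last means a reader that fetches it mid-push still sees only objects that were fully uploaded earlier.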

Pull (`hakiri sync pull`)

Fetch remote manifest.json
Diff against local file hashes
Download new/changed objects in parallel
Atomically replace local meta.sqlite only after all data files land

Cursor semantics

Cursors are opaque to the runtime, but their conflict-resolution behavior is not uniform — that depends on what the cursor represents. Each connector declares its cursor-kind in the WIT export:

`cursor-kind`	Semantics	Example sources	Multi-writer safety
`monotonic`	High-watermark (timestamp, autoincrement id). Reading past the watermark on another replica is harmless; missed records get picked up on the next run.	Postgres `updated_at`, GitHub `since`, generic timestamp APIs	LWW safe — the latest cursor wins; older replicas re-read briefly.
`opaque-token`	Vendor-supplied page/iterator token. May expire, may be stateful on the vendor side, may be impossible to interleave between replicas.	Stripe `starting_after`, Shopify `page_info`, GitHub Link headers, Google `nextPageToken`	Not LWW safe — two replicas advancing the same token can skip records or hit expired tokens. Requires a single-writer lease.
`snapshot-id`	Point-in-time identifier the source emits. Replicas must agree on which snapshot they’re consuming.	Postgres LSN, S3 version-id, Datomic `t`	Single-writer per pipeline. Leasing mandatory.

Single-writer leases

For opaque-token and snapshot-id cursors the catalog holds a lease record:

CREATE TABLE pipeline_lease (
  pipeline_id  TEXT PRIMARY KEY,
  holder_node  TEXT NOT NULL,
  expires_at   TEXT NOT NULL
);

A replica wanting to run such a pipeline first calls acquire_lease(pipeline_id, ttl=10min) — an atomic CAS against the catalog row. If another replica holds an unexpired lease, the run is skipped (the next reconciliation tick retries). Leases auto-expire so a crashed replica doesn’t lock out others forever.

For monotonic cursors no lease is required — concurrent runs are allowed; LWW on the cursor write is safe.

The host enforces leasing automatically based on the declared kind. Sources that fail to declare default to opaque-token — safe-by-default, slightly slower.
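
For a SQLite-backed catalog, the compare-and-swap in acquire_lease can be sketched as a single UPSERT whose update only applies when the existing lease is expired or already ours. This is an illustration rather than the host’s implementation; it assumes timestamps are stored as fixed-format UTC ISO-8601 text so that string comparison matches time order.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

LEASE_DDL = """CREATE TABLE pipeline_lease (
  pipeline_id  TEXT PRIMARY KEY,
  holder_node  TEXT NOT NULL,
  expires_at   TEXT NOT NULL
)"""

def _iso(ts: datetime) -> str:
    return ts.strftime("%Y-%m-%dT%H:%M:%SZ")

def acquire_lease(conn, pipeline_id: str, node: str, now: datetime,
                  ttl: timedelta = timedelta(minutes=10)) -> bool:
    """Take the lease if it is free, expired, or already held by `node`."""
    new_expiry = _iso(now + ttl)
    with conn:  # one transaction: upsert, then read back the winner
        conn.execute(
            """INSERT INTO pipeline_lease (pipeline_id, holder_node, expires_at)
               VALUES (?, ?, ?)
               ON CONFLICT(pipeline_id) DO UPDATE
                 SET holder_node = excluded.holder_node,
                     expires_at  = excluded.expires_at
                 WHERE pipeline_lease.expires_at <= ?
                    OR pipeline_lease.holder_node = excluded.holder_node""",
            (pipeline_id, node, new_expiry, _iso(now)),
        )
        row = conn.execute(
            "SELECT holder_node, expires_at FROM pipeline_lease WHERE pipeline_id = ?",
            (pipeline_id,),
        ).fetchone()
    return row == (node, new_expiry)

conn = sqlite3.connect(":memory:")
conn.execute(LEASE_DDL)
t0 = datetime(2026, 5, 12, 8, 0, tzinfo=timezone.utc)
print(acquire_lease(conn, "github_issues", "node-a", t0))
```

A second node calling within the TTL gets False and skips the run; once the TTL passes, the same call succeeds, which is how a crashed holder stops blocking others.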

Conflict resolution

Two replicas push concurrently → one wins, the other detects the manifest changed mid-push and retries with rebase semantics:

Run files: union. Run IDs are globally unique (UUIDv7), and the path layout includes the writer’s node id (tables/<t>/data/runs/<run-id>/<node-id>/part-NNNNN.parquet) so two replicas writing under the same logical run-id after a partial-sync replay produce disjoint files instead of colliding on part-00000.parquet.
Snapshot files: union (snapshot_id includes a timestamp + node id).
Cursors: behavior depends on cursor-kind (see above). Monotonic = LWW; non-monotonic = lease-gated.
config.toml: requires three-way merge; conflict markers like git, surfaced in hakiri sync status.
Schema: identical schemas merge silently; divergent schemas (e.g. one replica added a column, another renamed) require hakiri schema reconcile.

No CRDTs in v0. Append-only data + cursor-kind-aware metadata covers the common workflows. Simultaneous schema renames are documented as “use one writer at a time.” Rationale and alternatives: see ADR-0005.

Encryption

At rest: optional client-side encryption per project, with a key derived from a password or fetched from an external KMS. Parquet files encrypted with Parquet’s native modular encryption.
In transit: TLS to the bucket endpoint (R2/S3 always; MinIO depends on operator).

Encryption is off by default in v0 because most teams’ first use case is “sync our public-facing data”. Enabled via [encryption] key_source = "env:HAKIRI_KEY" in hakiri.toml.

Sidecar encryption — invariant: indexes leak no more than Parquet

When at-rest encryption is enabled, sidecar indexes (HNSW, Tantivy, Bloom, PK lookup) are encrypted with the same KMS key as the Parquet they index. This closes the gap where stolen bucket credentials would expose redacted-column content through index structure even though the Parquet itself is encrypted:

Tantivy FTS: the on-disk segment files are wrapped in age/AEAD blocks; the runtime decrypts in-memory on open. Builds in the staging directory under encryption from the first byte.
HNSW: the graph file and id_map.parquet (the row-id ↔ parquet-offset mapping) are written through the same Parquet modular encryption path as the data. The HNSW graph topology alone is enough to enable approximate-NN-search of the embedding space, so leaving it cleartext is not acceptable.
Bloom + zone maps: bloom filter sidecars and Parquet zone maps (inline in Parquet) inherit Parquet’s encryption.

For columns that are redacted at write time (09-access-control.md § Layer 1), the index over those columns does not exist on disk in any form. The manifest validator refuses to declare an index over a column marked redact = true at the project level. For replica-specific redaction (the column exists in the canonical store but is redacted at the sync edge for some replicas), the replica pulls only the encrypted Parquet without the column’s sidecar.

This makes the PRD Pillar 6 compliance-property claim honest: a stolen R2 credential reveals the encrypted Parquet (opaque ciphertext to the attacker) and encrypted sidecars (also opaque). It does not reveal redacted-column content via index structure.

Key rotation

Signing keys (token verification, audit log root signing) rotate via dual-key acceptance: the manifest declares signing_keys: [{kid: "k2", pub: "..."}, {kid: "k1", pub: "..."}]. Tokens carry a kid; the verifier checks the signature against the listed key with that kid and rejects tokens whose kid is not listed. To rotate: add k2, start signing with k2, wait out the longest in-flight token TTL (24h default), then remove k1.
KMS-held encryption keys rotate per-snapshot: new snapshots are written under the new key version, old snapshots remain readable under the old key version until GC’d. The catalog records which key version each snapshot uses.
Project pepper (CLS hash-with-bucket, hashed joins) rotates per-clean-room-pair via the escrow channel; static project-wide peppers are not supported for low-cardinality columns.

Operator runs hakiri keys status to see what’s due for rotation. CI-suggested rotation cadences live in specs/11-compliance.md.
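
The dual-key acceptance window for signing keys can be illustrated with a kid lookup. Note the substitution: to stay within the standard library, this sketch uses HMAC-SHA256 with shared secrets as a stand-in for the public-key signatures the manifest actually declares. The token shape and function names are hypothetical.

```python
import hashlib
import hmac

def sign(key: bytes, payload: bytes) -> bytes:
    # Stand-in for a real signature scheme; HMAC is used only for illustration.
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(token: dict, accepted_keys: dict) -> bool:
    """Accept a token only if its kid is listed and its signature matches."""
    key = accepted_keys.get(token["kid"])
    if key is None:
        return False  # kid was never added, or has been rotated out
    return hmac.compare_digest(sign(key, token["payload"]), token["sig"])

during_rotation = {"k1": b"old-secret", "k2": b"new-secret"}
after_rotation = {"k2": b"new-secret"}
old_token = {"kid": "k1", "payload": b"claims", "sig": sign(b"old-secret", b"claims")}

print(verify(old_token, during_rotation), verify(old_token, after_rotation))
```

Tokens signed under k1 keep verifying while both keys are listed and stop once k1 is removed, which is why the removal waits for the longest in-flight TTL.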

Query engine

DuckDB is the primary query face: it reads Parquet directly, embeds into every Hakiri runtime target, and speaks standard SQL. SurrealDB was evaluated and rejected as the core engine; it may earn an optional M3+ slot as a reproducible-from-Parquet sidecar for lineage-graph queries and LIVE SELECT subscriptions. Rationale, full comparison, and alternatives: see ADR-0004.

Open questions

Iceberg / Delta Lake compatibility? v0 ships vanilla Parquet + manifest (rationale: ADR-0002). M2+ may add Iceberg as an alternative layout once iceberg-rust stabilizes.
DuckDB extensions for spatial / vector data. A natural fit — the context store can hold embeddings as Parquet, queried with the vss extension. Worth a worked example.
Bucket-level multi-tenancy. A bucket with N projects works as written. Per-project IAM is the operator’s problem.