ADR-0002 — Parquet + JSON manifest layout, not Iceberg / Delta

The context store needs a documented on-disk layout that:

  • a developer can read with duckdb or pyarrow even if Hakiri is not installed,
  • syncs cleanly to any S3-compatible bucket as a unit of “team context,”
  • handles crash-safety (partial runs are recoverable or droppable, never half-applied),
  • supports schema evolution driven by API drift, and
  • replicates to laptops, Workers, Lambdas, and on-prem boxes (Pillar 5).

The candidates were vanilla Parquet files with a hand-rolled JSON manifest, Apache Iceberg, and Delta Lake.

The v0 layout is plain Parquet files plus a JSON manifest per snapshot/run, with a SQLite catalog (meta.sqlite) for cursors, lineage, and schema history. Tables are split into append-only run directories and compacted into immutable snapshot directories; DuckDB views unify them.
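
To make the "DuckDB views unify them" step concrete, here is a minimal sketch that builds the view DDL for one table. It assumes the caller already knows, for example from the manifest, which snapshot files are current and which run files have not yet been compacted. The function name and the example paths are hypothetical, not Hakiri's actual layout. `union_by_name` is the DuckDB `read_parquet` option that aligns files by column name, which is one way to tolerate columns that appear or disappear between runs.

```python
from typing import Iterable


def build_view_sql(
    view_name: str,
    snapshot_files: Iterable[str],
    pending_run_files: Iterable[str],
) -> str:
    """Return DuckDB DDL for a view over one table's current files.

    Hypothetical helper: the caller decides which files are "current".
    Passing compacted runs together with the snapshot that contains them
    would double-count rows, so that choice stays outside this function.
    """
    files = [*snapshot_files, *pending_run_files]
    if not files:
        raise ValueError("a view needs at least one Parquet file")
    # Quote each path as a SQL string literal, doubling embedded quotes.
    quoted = ", ".join("'" + str(f).replace("'", "''") + "'" for f in files)
    return (
        f'CREATE OR REPLACE VIEW "{view_name}" AS '
        f"SELECT * FROM read_parquet([{quoted}], union_by_name = true)"
    )
```

The resulting string can be passed to any DuckDB connection; a reader without Hakiri installed could run the same `read_parquet` call by hand.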

Iceberg / Delta compatibility is deferred. The “vanilla Parquet + manifest” shape gives us the invariants we need (immutable snapshots, atomic commit via manifest write, time travel via snapshot history) without taking a dependency on a richer table format.
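
The two invariants "atomic commit via manifest write" and "time travel via snapshot history" can be sketched on a local filesystem as follows. This is an illustration under stated assumptions, not the actual implementation: the manifest file name, the directory-per-snapshot shape, and lexicographically sortable snapshot ids are all hypothetical. The idea is that data files are written first and the manifest is published last with an atomic rename, so a directory without a manifest is by definition uncommitted and can be ignored or dropped.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import List, Optional

MANIFEST = "manifest.json"  # hypothetical manifest file name


def commit_manifest(snapshot_dir: Path, manifest: dict) -> Path:
    """Publish a snapshot by writing its manifest last, atomically."""
    final = snapshot_dir / MANIFEST
    fd, tmp = tempfile.mkstemp(dir=snapshot_dir, prefix=".manifest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(manifest, f, indent=2, sort_keys=True)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp, final)  # atomic when tmp and final share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return final


def committed_snapshots(table_dir: Path) -> List[Path]:
    """Snapshot directories that have a manifest, oldest first."""
    return sorted(d for d in table_dir.iterdir() if (d / MANIFEST).is_file())


def snapshot_as_of(table_dir: Path, snapshot_id: str) -> Optional[Path]:
    """Time travel: the newest committed snapshot with id <= snapshot_id."""
    candidates = [d for d in committed_snapshots(table_dir) if d.name <= snapshot_id]
    return candidates[-1] if candidates else None
```

On an S3-compatible bucket there is no rename, but a single object PUT is all-or-nothing, so uploading the manifest object last plays the same role as `os.replace` here.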

Positive

  • Everything is inspectable with stock tools: every file is plain Parquet, JSON, or SQLite, readable with duckdb, pyarrow, or the sqlite3 CLI. No proprietary metadata service is needed to read the store.
  • No runtime dependency on Iceberg or Delta libraries. The Rust ecosystem for both is still maturing; we’d be early adopters of a fast-moving API.
  • Sync protocol is trivial: copy the files. There is no metastore to keep in sync.
  • Pillar 5 (collocation) works because replicas can be materialized by s5cmd sync; the bucket layout is portable.
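
As a rough illustration of "copy the files," the sketch below mirrors a store directory to a replica using only the standard library. It is not how s5cmd works internally. It assumes a hypothetical `manifest.json` name and one design choice that goes beyond the text above: manifests are copied after all data files, so a replica never holds a manifest that points at files it has not received yet. Mutable files such as meta.sqlite would need a different rule and are out of scope here.

```python
import shutil
from pathlib import Path
from typing import List

MANIFEST = "manifest.json"  # hypothetical manifest file name


def sync_store(src: Path, dst: Path) -> List[str]:
    """Copy files missing from dst, data files first and manifests last.

    Returns the relative paths that were copied, in copy order.
    """
    files = [p for p in src.rglob("*") if p.is_file()]
    # False sorts before True, so every data file precedes every manifest.
    files.sort(key=lambda p: (p.name == MANIFEST, p.as_posix()))
    copied = []
    for path in files:
        rel = path.relative_to(src)
        target = dst / rel
        if target.exists():
            # Assumption: data files and manifests never change once written,
            # so an existing file does not need to be copied again.
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
        copied.append(rel.as_posix())
    return copied
```

Because files are assumed immutable, a second call with no new snapshots copies nothing, which is what makes replicas cheap to refresh.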

Negative

  • We re-implement a slice of what Iceberg/Delta already do: snapshot manifests, schema evolution log, hidden partitioning. The slice is small (Hakiri’s writes are agent-shaped, not warehouse-shaped) but it is duplication.
  • No engine besides DuckDB can query the layout as tables. Trino and Athena can only be pointed at the raw Parquet files, which loses snapshot semantics.
  • Time travel is bounded by however much snapshot history we retain, and we have to build the retention and expiry tooling ourselves; Iceberg/Delta ship it out of the box.
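
The retention cost in the last bullet amounts to a small pruning routine. The sketch below is one possible shape, assuming a hypothetical `manifest.json` per snapshot directory and snapshot ids that sort oldest to newest. It keeps the newest `keep` committed snapshots and deletes the rest, removing each manifest first so that a crash halfway through leaves an uncommitted directory, not a manifest that points at missing files.

```python
import shutil
from pathlib import Path
from typing import List

MANIFEST = "manifest.json"  # hypothetical manifest file name


def prune_snapshots(snapshots_dir: Path, keep: int) -> List[str]:
    """Delete all but the newest `keep` committed snapshots.

    Returns the names of the snapshot directories that were removed.
    Directories without a manifest are uncommitted and are left alone.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    committed = sorted(
        d for d in snapshots_dir.iterdir() if (d / MANIFEST).is_file()
    )
    expired = committed[:-keep]  # empty when there are at most `keep` snapshots
    for snapshot in expired:
        (snapshot / MANIFEST).unlink()  # un-publish before deleting data
        shutil.rmtree(snapshot)
    return [d.name for d in expired]
```

Whatever value `keep` takes is the time-travel horizon: anything older than the oldest retained snapshot can no longer be queried.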

Neutral

  • The catalog schema is ours to evolve. We’re not waiting on upstream features, but we also don’t get them.

Apache Iceberg. Richer table format with snapshot isolation, hidden partitioning, time travel, and broad engine support (Spark, Trino, Athena, Snowflake). Rejected for v0 because (a) the Rust client ecosystem (iceberg-rust) is pre-1.0 and the catalog API surface keeps changing, (b) Iceberg presumes a metastore (REST catalog, Glue, Nessie) that adds operational weight contrary to Pillar 1, and (c) Hakiri’s read workload is agent queries over modest table counts, not analyst joins across a lakehouse. Worth supporting as an alternate layout in M2+ when the Rust client stabilizes.

Delta Lake. Similar feature set to Iceberg, with delta-rs providing a Rust client. Rejected for the same reasons, plus narrower engine support than Iceberg.

Postgres or a custom database as the destination. Would simplify schema management but lose the “destination is files on object storage” property that makes sync trivial and replicas cheap. Off the table for the context-layer persona.