
ADR-0004 — DuckDB as the primary query face; SurrealDB rejected as core

Hakiri needs a query engine that:

  • reads the canonical Parquet + JSON manifest layout (ADR-0002) directly,
  • embeds into the Hakiri binary, a Cloudflare Worker, a Lambda, a Fargate task, and a WASI host,
  • handles OLAP-shaped reads (scan + filter + aggregate) at low latency,
  • supports vector and full-text search for agent retrieval,
  • adds no operational dependency for the local-first and clustered topologies,
  • replicates cheaply via Pillar 5 (collocation via object storage).

DuckDB and SurrealDB were the two serious candidates. SurrealDB is attractive on the surface because it is Rust-native, multi-model (document/graph/relational/KV), has built-in record-level permissions, supports LIVE SELECT subscriptions, and offers distributed mode via TiKV.

DuckDB is the primary query face. SurrealDB is rejected as the core engine.

SurrealDB may earn a slot in M3+ as an optional secondary index for two narrow use cases — lineage-graph queries and live agent subscriptions — under the hard rule that anything stored in SurrealDB must be reproducible from Parquet + the catalog. A SurrealDB corruption is “rebuild the cache,” never “lost data.”
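
The "rebuild the cache" rule can be illustrated with a minimal sketch. It assumes the canonical lineage edges are already available as rows read back from Parquet. The `LineageCache` class, its `rebuild` method, and the sample edges are hypothetical stand-ins for illustration, not a SurrealDB client or Hakiri code.

```python
from dataclasses import dataclass, field


@dataclass
class LineageCache:
    """Hypothetical stand-in for an optional secondary index.

    It holds only derived state: everything in it can be recomputed
    from the canonical edges, so losing it never loses data.
    """
    children: dict = field(default_factory=dict)

    @classmethod
    def rebuild(cls, canonical_edges):
        # Derive the whole index from canonical rows, nothing else.
        cache = cls()
        for parent, child in canonical_edges:
            cache.children.setdefault(parent, set()).add(child)
        return cache


# Illustrative rows, as they might be read back from the Parquet tables.
canonical_edges = [
    ("raw_events", "sessions"),
    ("sessions", "daily_rollup"),
]

cache = LineageCache.rebuild(canonical_edges)

# Simulate corruption or loss of the secondary index.
cache.children.clear()

# Recovery is a rebuild from canonical data, not a restore from backup.
cache = LineageCache.rebuild(canonical_edges)
```

The design point is that the index has no write path of its own: the only way state enters it is through `rebuild`, which keeps the "reproducible from Parquet + the catalog" rule easy to check.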

Positive

  • DuckDB reads Parquet natively. Zero ingest step between writer and reader; the canonical artifact is the query input. This is what makes ADR-0002 and Pillar 5 (collocation) tractable.
  • Smaller footprint: DuckDB links in at ~15–25 MB versus SurrealDB’s ~80–100 MB binary. This matters for Lambda cold start and the wasm32-wasip2 target.
  • Standard SQL — high LLM familiarity. Agents author and read SQL fluently; SurrealQL would require re-training every connector author and tool caller.
  • Best-in-class columnar performance on Parquet for the read shape we care about.
  • Vector (vss) and FTS extensions exist; sidecar indexes for Pillar 5 replicas use them.
  • MIT license. Friendly for embed-and-redistribute.
  • The query engine is stateless over Parquet — replication, collocation, and offline use all fall out of “copy the Parquet files.”
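
As a rough sketch of why replication reduces to "copy the Parquet files": because the engine keeps no state of its own, a replica is current once its files match the source. The example below uses local directories as a stand-in for object storage, and the `replicate` and `file_digest` helpers are hypothetical names, not part of Hakiri or DuckDB.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Content hash used to decide whether a file needs copying."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(source: Path, replica: Path) -> list:
    """Copy every file that is missing from, or differs in, the replica.

    Returns the relative paths that were copied. Local directories stand
    in for object-storage buckets (an assumption of this sketch).
    """
    copied = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = replica / rel
        if dst.exists() and file_digest(dst) == file_digest(src):
            continue  # already identical, nothing to do
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)
        copied.append(rel.as_posix())
    return copied
```

Running `replicate` a second time with no source changes copies nothing, which is the property that makes intermittent, offline-tolerant sync cheap.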

Negative

  • No native live subscriptions. Agents that want push notifications on context change must poll or wait for the M3+ SurrealDB sidecar.
  • No built-in record-level permissions. Capability tokens (PRD Pillar 3) enforce access at the sync edge, which is the right boundary anyway but is more work than borrowing SurrealDB’s permission system.
  • Graph queries are recursive CTEs, which are verbose. Acceptable for the ~80% of lineage queries Hakiri needs; the SurrealDB sidecar covers the rest.
  • Concurrent writers against a single DuckDB file are limited. Not an issue for Hakiri (the pipeline runtime writes Parquet, never DuckDB itself), but it would block a different use case.
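
To make the verbosity of recursive CTEs concrete, here is a minimal sketch of a "find everything downstream of a table" lineage query. The `lineage(parent, child)` table and its rows are hypothetical. The sketch runs on Python's built-in `sqlite3` only so that it has no dependencies; `WITH RECURSIVE` is standard SQL that DuckDB also accepts, but this exact query has not been run against DuckDB here.

```python
import sqlite3

# In-memory database with a hypothetical lineage edge table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage (parent TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO lineage VALUES (?, ?)",
    [
        ("raw_events", "sessions"),
        ("sessions", "daily_rollup"),
        ("daily_rollup", "dashboard"),
        ("raw_users", "sessions"),
    ],
)

# Walk the graph outward from one starting table.
# Assumes lineage is acyclic; a cycle would make this recursion unbounded.
DOWNSTREAM_SQL = """
WITH RECURSIVE downstream(name, depth) AS (
    SELECT child, 1 FROM lineage WHERE parent = ?
    UNION
    SELECT l.child, d.depth + 1
    FROM lineage AS l
    JOIN downstream AS d ON l.parent = d.name
)
SELECT name, depth FROM downstream ORDER BY depth, name
"""

rows = conn.execute(DOWNSTREAM_SQL, ("raw_events",)).fetchall()
for name, depth in rows:
    print(depth, name)
```

A single-direction traversal like this stays manageable. Queries that mix directions, path constraints, or multiple edge types are where the CTE form becomes hard to read.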

Neutral

  • We don’t get the multi-model surface SurrealDB offers. We don’t want it — the store is Parquet tables, full stop.

SurrealDB as the core engine. Owns its on-disk format (RocksDB / SurrealKV / TiKV), which breaks the “every file is cat-able” contract that makes Pillar 5 tractable. Distributed mode requires running a TiKV cluster — a second piece of infrastructure that directly contradicts ADR-0008 (no external coordinator required). SurrealQL is bespoke and adds friction for agents. Replication assumes always-connected nodes; Hakiri’s sync is “any S3 bucket between any two parties, possibly offline for hours.”

Polars / DataFusion. Both are excellent columnar engines in Rust. Rejected because DuckDB is more mature for the embedded-SQL use case, has a richer extension ecosystem (vector + FTS shipped), and is already the de facto query face for analyst-on-Parquet workloads. Worth revisiting if DuckDB’s WASM build or embed size becomes a problem.

ClickHouse. Excellent OLAP performance, and its deployment model shapes ours (ADR-0008). Rejected as the query face because it is server-shaped; embedding ClickHouse into the Hakiri binary or a Worker is impractical. We borrow ClickHouse’s deployment model, not its engine.

SQLite (as the query face, not the catalog). Lightweight, but row-oriented and uncompetitive for OLAP scans on Parquet-sized tables.