Comparisons

How Hakiri positions itself against its closest neighbors at the engineering level: feature matrix and head-to-head comparisons. Three categories get detailed treatment here:

ELT tools (OSS and SaaS): dlt, Airbyte, Meltano/Singer, Fivetran, Cloudflare Pipelines
Agent-memory products: Mem0, SurrealDB Spectron
Data architecture patterns: DuckLake, bauplan

Audience framing (“who Hakiri is for vs. who it isn’t”), broader competitive landscape (memory frameworks beyond the two above, vendor-locked memory, per-tool MCP servers, vector storage, embedded data stacks), and the “what makes Hakiri non-substitutable” argument live in PRD.md; see PRD § Target audience and PRD § Competitive landscape. This doc is the technical companion: feature-by-feature comparisons against the three categories above.

ELT feature matrix

	Hakiri	dlt	Airbyte (OSS)	Meltano (Singer)	Fivetran	Cloudflare Pipelines
Language	Rust binary	Python lib	Python + Java	Python	SaaS	SaaS (closed)
Distribution	single binary	pip package	docker-compose / k8s	pip package	hosted	CF account
Connector contract	WIT/WASM	Python `@dlt.source`	Java/Python container	Singer JSON spec	proprietary	proprietary
Agent authoring	first-class MCP	none	none	none	none	none
Sandboxing	WASM capabilities	trust the lib	container	trust the tap	n/a	n/a
Local-first	yes	yes	partially	yes	no	no
Default destination	local context store	warehouse of choice	warehouse	warehouse	warehouse	R2
Edge / Workers deploy	yes (WASM + Containers)	no	no	no	no	native
Schema evolution	declarative + agent-resolvable	inferred	per-connector	per-tap	inferred	n/a
OSS license	Apache-2.0 (proposed)	Apache-2.0	Elastic-2.0	MIT	proprietary	proprietary
Sync to S3-compatible	first-class	via destination	via destination	via target	via destination	native

vs. dlt

dlt is closest in spirit — also small, library-first, schema-inferring, destination-flexible. Differences:

Language: dlt is Python; Hakiri is Rust + WASM. dlt’s Python-native ergonomics are unmatched for ad-hoc analytics scripts. Hakiri pays a startup-tooling cost (you compile or download a binary) to get sandboxing and edge deployability.
Connectors are libraries vs. components: a dlt source is a Python function that imports the world. A Hakiri connector is a WASM component with declared capabilities. Hakiri’s contract is stricter; dlt’s is more flexible.
Local context store: dlt doesn’t ship one. You pick a destination (DuckDB, Postgres, BigQuery, …) and dlt writes to it. Hakiri’s default is the on-disk context store, syncable to R2 — a different mental model.
Agent authoring: dlt has no MCP surface today. The community is exploring it; Hakiri builds it in.
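To make the “libraries vs. components” contrast concrete, here is a minimal Python sketch of capability-declared host access. It is illustrative only: `Capabilities`, `host_http_get`, and the host names are hypothetical, and Hakiri's actual contract is expressed in WIT/WASM rather than Python.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Capabilities:
    """Hypothetical capability declaration that ships with a connector."""
    allowed_hosts: frozenset[str]


class CapabilityError(PermissionError):
    """Raised when a connector asks for something it did not declare."""


def host_http_get(caps: Capabilities, url: str) -> str:
    """Host-side entry point: the connector can only reach declared hosts."""
    host = urlparse(url).hostname
    if host not in caps.allowed_hosts:
        raise CapabilityError(f"connector did not declare access to {host!r}")
    # A real host would perform the request here; this sketch returns a marker.
    return f"GET {url} permitted"


# Assumed declaration for an example connector.
caps = Capabilities(allowed_hosts=frozenset({"api.example.com"}))
```

A library-style source could open any socket it likes; in this shape, every outbound call passes through a host function that checks the declaration first.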

Pick dlt when: your team is a Python shop, you want quick scripts that land in your existing warehouse, and you trust your connector code.

Pick Hakiri when: you want a single binary, you want agents to author connectors, you want WASM sandboxing, you want a local-first context store as the default.

vs. Airbyte

Airbyte is the breadth incumbent — hundreds of connectors, big UI, k8s-heavy deployment. Differences:

Operational weight: Airbyte OSS wants a Kubernetes cluster, a database, a UI. Hakiri wants a binary.
Connector authoring: Airbyte’s CDK is Python or Java; each connector is a container with its own image. Hakiri connectors are WASM components, no image build.
Marketplace: Airbyte’s marketplace is its moat. Hakiri starts with zero and earns connectors via the agent-authoring loop. If you need 200 connectors today, Airbyte wins.
License: Airbyte moved from MIT to Elastic-2.0 (ELv2). Hakiri targets Apache-2.0.

Pick Airbyte when: you have a long tail of obscure SaaS sources and value the existing connector inventory more than runtime ergonomics.

Pick Hakiri when: you want a small footprint, you’re OK with fewer-but-easier-to-write connectors, you want the agent loop.

vs. Meltano / Singer

Singer’s tap/target subprocess protocol is the OG declarative ELT model — JSON-lines over stdio between independent processes. Meltano is the orchestrator on top.

Inspiration: Singer’s “small interface, many implementations” philosophy is exactly Hakiri’s. The difference is we use WIT/WASM instead of JSON-over-stdio, which gives us typed schemas, sandboxing, and a faster wire format.
Compatibility: a `hakiri singer <tap>` shim is on the roadmap (see 03-pipelines). It would adapt a Singer tap as a Hakiri source by running it as a subprocess and translating the JSONL stream, which gets us the existing tap ecosystem for free.
State management: Meltano’s state is YAML+JSON files in the project dir. Hakiri’s catalog is SQLite + Parquet. Functionally equivalent; Hakiri’s is more queryable.
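The JSONL translation that such a shim would perform can be sketched as follows. This is a rough illustration, not the planned implementation: it handles the three Singer message types (`SCHEMA`, `RECORD`, `STATE`), and the function name and output shape are assumptions made for the example.

```python
import json
from collections import defaultdict
from typing import Iterable


def translate_singer_stream(lines: Iterable[str]) -> dict:
    """Group a Singer tap's stdout into per-stream schemas, records, and state."""
    schemas: dict[str, dict] = {}
    records: dict[str, list[dict]] = defaultdict(list)
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        kind = msg.get("type")
        if kind == "SCHEMA":
            schemas[msg["stream"]] = msg["schema"]
        elif kind == "RECORD":
            records[msg["stream"]].append(msg["record"])
        elif kind == "STATE":
            # The last STATE message wins; it is the bookmark for the next run.
            state = msg["value"]
        # Any other message type is ignored in this sketch.
    return {"schemas": schemas, "records": dict(records), "state": state}


# Sample tap output, one JSON message per line.
sample = [
    '{"type": "SCHEMA", "stream": "users", "schema": {"properties": {"id": {"type": "integer"}}}, "key_properties": ["id"]}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 1}}',
    '{"type": "RECORD", "stream": "users", "record": {"id": 2}}',
    '{"type": "STATE", "value": {"users": 2}}',
]
batch = translate_singer_stream(sample)
```

In a real shim the `lines` iterable would be the tap subprocess's stdout, and the grouped records would be converted into typed record batches.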

Pick Meltano when: you’re already invested in Singer taps and the Python ecosystem.

Pick Hakiri when: you want a similar declarative model but in a single binary with WASM-sandboxed connectors and an MCP server.

vs. Fivetran / Stitch (SaaS)

These are the buy-don’t-build options. Different category — they sell you “we operate the connectors”. Hakiri is the build-it-yourself option that doesn’t require building everything from scratch.

Pick Fivetran when: you have budget, you’d rather pay than operate, your sources are common SaaS apps.

Pick Hakiri when: you have non-standard sources, you have data residency constraints, you want to own the data plane, or you have an agent that should be authoring connectors.

vs. Cloudflare Pipelines

Cloudflare Pipelines (the newer product) is closed-source SaaS optimized for “stream into R2”. It overlaps with Hakiri’s edge deployment story but is opinionated about R2 and the Cloudflare runtime.

Hakiri runs on Cloudflare (Containers + Workflows) but isn’t only for Cloudflare.
Cloudflare Pipelines doesn’t have an MCP surface and doesn’t expose connector authoring to agents.
OSS vs. proprietary is the headline difference.

Pick CF Pipelines when: you’re all-in on Cloudflare, you want managed, your shape is “events → R2”.

Pick Hakiri when: you want OSS, you want to deploy outside Cloudflare too, you want agent authoring.

Agent-memory feature matrix

	Hakiri	Mem0	SurrealDB Spectron
Scope	Ingestion + memory + access control + MCP in one binary	Memory layer for agents (vector + graph)	Memory layer integrated into SurrealDB
Storage engine	Parquet files + SQLite catalog	Postgres + vector DB (e.g. Qdrant)	SurrealDB (RocksDB / SurrealKV / TiKV)
Storage format on disk	`cat`-able Parquet + JSON manifests	Postgres tables + vendor vector format	SurrealDB-owned, opaque without engine
Query surface	DuckDB SQL + MCP	Mem0 REST API + MCP-adjacent	SurrealQL (bespoke) + MCP
Ingestion	Pull from N source connectors; backfill + drift; agent-authored WASM	Caller pushes events / messages	Caller writes via MCP / SurrealDB drivers
Sandboxing	WASM Component with capability-declared host access	n/a — caller’s code	n/a — caller’s code
Access control	Capability tokens (subject tuple + DPoP) + RLS + CLS + write-time redaction	Token + project scoping	SurrealDB record-level permissions + RBAC
Distribution	Single binary, three profiles	Mem0 service + Postgres + vector DB	SurrealDB Cloud (managed) or self-hosted SurrealDB
Local-first sync	Any S3-compatible bucket; offline-capable; no relay	Hosted-first; self-hosted = three services running	Replication assumes connected SurrealDB cluster
Edge / WASM	Cloudflare Containers + Workers (WASM `core` profile)	Server-shaped	Server-shaped
License	Apache-2.0 (proposed)	Apache-2.0 + hosted	BSL 1.1 → Apache-2.0 (SurrealDB); Spectron license unannounced
Lock-in failure mode	Project stops → data is Parquet + JSON, readable by every analytics tool	Mem0 stops → Postgres dump is recoverable but schema is Mem0’s	SurrealDB stops or pivots → format is SurrealDB’s

vs. Mem0

Mem0 is the closest pure-memory competitor and the most likely “we already use Mem0, why do we need Hakiri?” objection. Both ship OSS + hosted; both target agent memory; both touch MCP-adjacent surfaces.

Ingestion: Mem0 is push-only — the caller writes events and messages into the memory API. Hakiri pulls from source connectors on a schedule with backfill and drift detection.
Storage: Mem0 owns its schema across Postgres + a vector DB. Hakiri’s storage is Parquet + JSON manifests + SQLite, readable by `duckdb`, `parquet-tools`, or `cat` without Hakiri running.
Deploy: Mem0’s hosted is one service; self-hosted requires Mem0 + Postgres + a vector DB — three processes minimum. Hakiri is one binary, no companion services.
Sandboxing: Mem0 trusts the caller’s code; Hakiri runs connectors as WASM components with capability-declared host access.
Access control: Mem0 has token + project scoping. Hakiri has capability tokens (subject tuple + DPoP), RLS, CLS, and write-time redaction enforced at the sync edge.
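The three access-control layers named above (write-time redaction, RLS, CLS) can be sketched in a few lines of Python. This is a toy model over in-memory dicts, with hypothetical function names and policies; it omits capability tokens and DPoP entirely and only shows where each layer applies.

```python
from typing import Callable, Iterable


def redact_on_write(row: dict, redact_columns: set[str]) -> dict:
    """Write-time redaction: sensitive values are masked before storage."""
    return {k: ("[REDACTED]" if k in redact_columns else v) for k, v in row.items()}


def read_with_policy(
    rows: Iterable[dict],
    row_filter: Callable[[dict], bool],
    visible_columns: set[str],
) -> list[dict]:
    """Row-level security (filter rows), then column-level security (drop columns)."""
    return [
        {k: v for k, v in row.items() if k in visible_columns}
        for row in rows
        if row_filter(row)
    ]


# Example data and policy; all names and values are made up.
incoming = [
    {"id": 1, "team": "eu", "ssn": "123-45-6789", "email": "a@example.com"},
    {"id": 2, "team": "us", "ssn": "987-65-4321", "email": "b@example.com"},
]
stored = [redact_on_write(r, {"ssn"}) for r in incoming]
visible = read_with_policy(stored, lambda r: r["team"] == "eu", {"id", "team", "ssn"})
```

The point of redacting at write time is that the raw value never lands in storage, so even a reader that is allowed to see the `ssn` column only ever sees the mask.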

Pick Mem0 when: you only need a memory store for an existing agent, you’re happy to push events at it, and you don’t want to run ingestion yourself. Hosted Mem0 is the fastest path to “my agent remembers things.”

Pick Hakiri when: the agent needs to pull from multiple SaaS silos and remember things across them; you cannot or will not run three services; you need a `cat`-able audit story or capability-token ACLs for an internal/regulated environment.

Honest convergence read: Mem0 may add ingestion connectors over time, and Hakiri may someday be a reasonable Mem0 substitute. Today they sit on different sides of “memory for agents” vs “data layer for agents,” and each is the wrong tool for the other’s primary use case.

vs. SurrealDB Spectron

Spectron is the closest architectural competitor in the agent-memory category: OSS-aligned vendor, MCP surface, knowledge-graph + vector + temporal memory, and a self-hosting story. The thesis Spectron leads with — “memory in the database, not above it” — is a real critique of the Mem0/middleware shape, and one Hakiri partly agrees with: a single binary owns the storage, the catalog, and the MCP surface, so there is no middleware tax.

The disagreement is about which database, and whether the answer should be a database at all.

Storage shape: Spectron stores entities, relationships, embeddings, and temporal metadata inside SurrealDB’s on-disk format (RocksDB / SurrealKV / TiKV). Hakiri stores them as Parquet files on a local FS or S3-compatible bucket, with a SQLite catalog as the index. Spectron’s storage is opaque without SurrealDB running; Hakiri’s is portable.
Query surface: Spectron speaks SurrealQL (bespoke, multi-model). Hakiri speaks DuckDB SQL (standard) with Polars for transforms. Agents write SQL fluently out of the box; SurrealQL is a re-training cost.
Memory model: Spectron makes knowledge graph, bi-temporal facts, entity disambiguation, and autonomous enrichment first-class storage primitives. Hakiri treats these as queries over Parquet tables backed by vector (HNSW) and full-text (Tantivy) sidecar indices — same capabilities, different abstraction level.
Ingestion: Spectron is caller-writes (via MCP or SurrealDB drivers). Hakiri pulls from source connectors, with sandboxed WASM connectors and agent authoring.
Access control: Spectron uses SurrealDB record-level permissions + RBAC inside the database. Hakiri uses capability tokens with composable subjects + RLS + CLS + write-time redaction enforced at the sync edge.
Deploy: Spectron is in waitlist preview; production path is SurrealDB Cloud (managed) or self-hosted SurrealDB. Hakiri is a single binary in three feature profiles — no DB to run alongside.
Local-first sync: SurrealDB’s replication assumes always-connected nodes (a cluster). Hakiri’s sync is “any S3 bucket between any two parties, possibly offline for hours” — file-shaped, no relay.
Edge / WASM: SurrealDB embeds, but Spectron itself is server-shaped. Hakiri runs on Cloudflare Containers + Workers (WASM `core` profile) by design.
Coupling: Spectron requires SurrealDB; SurrealDB’s distributed mode requires TiKV. Hakiri’s destinations are pluggable (local FS, R2, S3); no foreign cluster required.

Pick Spectron when: you have already committed to SurrealDB as your primary data store, you want memory writes in the same ACID transaction as the rest of your domain data, you value the knowledge-graph + bi-temporal model as a first-class storage primitive, and you are comfortable with the storage format being SurrealDB-owned.

Pick Hakiri when: you need to pull from multiple SaaS silos and remember things across them; you want Parquet storage (every analytics tool reads it, audit is `cat`-able, ADR-0002 explains why); you want to deploy on the edge as WASM; you want capability-token ACLs at the sync edge rather than RBAC inside a database; you want Pillar 3’s MCP-only / provider-agnostic guarantee to extend to the storage layer.

Where Spectron’s critique of “middleware over fragmented DBs” lands on Hakiri: it doesn’t, but for a different reason than Spectron’s own answer. Hakiri is not middleware — it owns its storage end to end. The difference is Hakiri’s storage is files on object storage, not a database. That gets us the same atomic-write story (a Parquet file is atomic to publish via manifest swap), no cross-DB round trips (one process writes the file and updates the catalog), and additionally: portability, cat-ability, replication-without-a-cluster, and no second piece of infrastructure to run. The underlying decision is in ADR-0004, which evaluated SurrealDB-as-engine and rejected it on the same axes that separate Hakiri from Spectron here.
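The “atomic to publish via manifest swap” idea can be illustrated with a short Python sketch. It assumes a local POSIX-style filesystem, where `os.replace` is an atomic rename; the file names, manifest shape, and `publish` helper are hypothetical, the payload is a placeholder rather than real Parquet, and an object store would need its own conditional-write mechanism.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(table_dir: Path, data_file_name: str, payload: bytes) -> None:
    """Write a data file, then make it visible by swapping the manifest."""
    table_dir.mkdir(parents=True, exist_ok=True)

    # 1. Write the data file under a name no reader references yet.
    (table_dir / data_file_name).write_bytes(payload)

    # 2. Build the next manifest from the current one.
    manifest_path = table_dir / "manifest.json"
    current = json.loads(manifest_path.read_text()) if manifest_path.exists() else {"files": []}
    updated = {"files": current["files"] + [data_file_name]}

    # 3. Write the new manifest to a temp file in the same directory,
    #    flush it to disk, then atomically replace the old manifest.
    fd, tmp_name = tempfile.mkstemp(dir=table_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(updated, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_name, manifest_path)


# Demo: publish two files and read back the manifest.
with tempfile.TemporaryDirectory() as tmp:
    table = Path(tmp) / "events"
    publish(table, "part-0001.parquet", b"placeholder")
    publish(table, "part-0002.parquet", b"placeholder")
    manifest = json.loads((table / "manifest.json").read_text())
    leftovers = [p.name for p in table.glob("*.tmp")]
```

Readers that only follow the manifest see either the old file list or the new one, never a half-written state. This sketch does not handle concurrent writers.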

Honest convergence read: if Spectron ships and proves out, it is the strongest evidence yet that “agent memory should own its storage” is the correct architectural intuition. The disagreement is narrower than it looks — Hakiri agrees with the conclusion, disagrees with the implementation (database vs. files), and disagrees with the implied lock-in (SurrealDB-or-nothing vs. portable-files-anywhere).

Data architecture patterns

Two recent entries, DuckLake (DuckDB team, 2025) and bauplan (founded 2023), share substrate or philosophy with Hakiri but sit at different levels of the stack. They are worth comparing because the substrate choices (Parquet, DuckDB-shaped compute, lakehouse-style storage, serverless execution, branching) overlap with Hakiri’s design space, and because the differences in level clarify what Hakiri is and isn’t.

Feature matrix

	Hakiri	DuckLake	bauplan
Category	Data movement runtime + context layer	Lakehouse table-format spec	Serverless data platform
Level in stack	Runtime + ingestion + agent surface	Storage format + catalog	Compute platform + storage + branching UX
Data files	Parquet + JSON manifests + sidecar indexes	Parquet	Apache Iceberg (Parquet underneath)
Catalog backend	SQLite / DO SQLite / RDS / Postgres	Any SQL DB (Postgres, MySQL, SQLite, DuckDB)	Bauplan-managed metastore
Compute / query	Rust + WASM runtime; DuckDB + Polars at query time	Any engine (DuckDB reference impl); SQL	Python serverless functions (“FaaS for data”)
Ingestion	Pull from N WASM connectors; backfill; drift detection	n/a — table format only	Python “loader” functions
Branching	CRDT history on config (not data); LWW + node-id paths on data	First-class snapshots + branches at the table level	First-class git-for-data on tables (“branch / merge / commit”)
Sandboxing	WASM Component with capability declarations	n/a	Container per function
Agent surface	MCP-native (in-tree)	none	none
Local-first	yes (single binary, offline-capable)	yes (DuckDB engine + SQLite catalog works locally)	hosted-first (some local dev tooling)
License	Apache-2.0 (proposed)	MIT (DuckLake spec); MIT (DuckDB reference impl)	Apache-2.0 OSS components; closed-source hosted SaaS
Lock-in failure mode	Project stops → Parquet + JSON readable by any tool	Spec stops → Parquet files still readable; catalog rows are plain SQL	bauplan stops → Iceberg tables remain readable; orchestration/branching is bauplan-owned

vs. DuckLake

DuckLake is a lakehouse table-format specification published by the DuckDB team in 2025. The core idea: replace the file-based catalog of Iceberg / Delta (manifest files, transaction logs, JSON metadata trees) with rows in a regular SQL database. Data stays as Parquet; the catalog is `CREATE TABLE ducklake_snapshots (...)` in Postgres / MySQL / SQLite / DuckDB itself. ACID transactions ride the SQL database’s existing transaction semantics; snapshots, branches, time-travel, and schema evolution are SQL rows. The reference implementation is a DuckDB extension.
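The “catalog as SQL rows” idea is easy to see in miniature. The sketch below uses Python's built-in `sqlite3` with a made-up two-table schema (`snapshots`, `data_files`); it is not DuckLake's actual catalog schema, only an illustration of how a commit and a time-travel read reduce to ordinary SQL inside one transaction.

```python
import sqlite3


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Create a toy catalog: snapshots plus the data files each one adds."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS snapshots (
            snapshot_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            committed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE IF NOT EXISTS data_files (
            path       TEXT NOT NULL,
            added_in   INTEGER NOT NULL REFERENCES snapshots(snapshot_id),
            removed_in INTEGER REFERENCES snapshots(snapshot_id)
        );
        """
    )
    return conn


def commit_files(conn: sqlite3.Connection, paths: list[str]) -> int:
    """Register new data files under a new snapshot, in one transaction."""
    with conn:  # commits on success, rolls back on exception
        snapshot_id = conn.execute("INSERT INTO snapshots DEFAULT VALUES").lastrowid
        conn.executemany(
            "INSERT INTO data_files (path, added_in) VALUES (?, ?)",
            [(p, snapshot_id) for p in paths],
        )
    return snapshot_id


def files_at(conn: sqlite3.Connection, snapshot_id: int) -> list[str]:
    """Time travel: list the files that were live as of a given snapshot."""
    rows = conn.execute(
        "SELECT path FROM data_files "
        "WHERE added_in <= ? AND (removed_in IS NULL OR removed_in > ?) "
        "ORDER BY path",
        (snapshot_id, snapshot_id),
    )
    return [r[0] for r in rows]


conn = open_catalog()
s1 = commit_files(conn, ["part-0001.parquet"])
s2 = commit_files(conn, ["part-0002.parquet"])
```

Here `files_at(conn, s1)` returns only the first file, while `files_at(conn, s2)` returns both, with no manifest tree to walk.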

Hakiri and DuckLake share substrate but sit at different levels of the stack — they’re complementary, not substitutes:

Level: DuckLake is just the storage layer. It says nothing about how data gets in, how schedules fire, how agents query, how access is controlled. Hakiri is the runtime above the storage. The honest framing: Hakiri could one day emit DuckLake-formatted tables and consume them through the same DuckDB query face it already uses.
Catalog shape: Both keep catalog state in a SQL store. Hakiri uses SQLite (or Postgres / RDS / DO SQLite) for pipeline state — cursors, run history, schema evolution decisions, capability tokens. DuckLake uses a SQL DB for table state — snapshot IDs, file lists, schema versions, branch heads. Different scopes, same instinct: “the catalog should be a database, not files.”
Manifest format: Hakiri’s ADR-0002 rejected Iceberg’s complexity for v0 in favor of a small Parquet + JSON manifest design. DuckLake lands in roughly the same place philosophically — keep the catalog readable, avoid file-based metadata trees — but with a more formal table-format spec and a community-adopted shape.
Branching: DuckLake ships snapshots + branches as first-class table operations (cheap to create, fork, merge). Hakiri does not have data-level branching today. Hakiri’s CRDT-on-config story (14-collab-config.md) gives multiplayer editing of pipeline definitions, but not “branch this dataset, experiment, merge back.”
Engine assumption: DuckLake’s reference implementation is DuckDB. Other engines (Spark, Trino, Polars) would need DuckLake readers/writers — work that’s in progress but not uniformly available. Hakiri’s runtime is Rust + WASM with DuckDB as the query face, not the data manipulation engine.

Pick DuckLake when: you want a lakehouse table format simpler than Iceberg / Delta; your compute is DuckDB-shaped (or you can wait for other engines to add support); you don’t need ingestion because the data is already produced by something upstream; and you want the catalog to be a regular database you can query with `SELECT * FROM ducklake_snapshots`.

Pick Hakiri when: you need the layer above DuckLake — connectors to pull data in, schedules, a context store with vector + FTS indexes, an MCP server, agent authoring, capability tokens at the sync edge.

Honest convergence read: if DuckLake stabilizes as a community spec, Hakiri’s on-disk format could become “Parquet + DuckLake-spec catalog” — replacing the bespoke manifest with a standard one. The bet against that today: DuckLake is new (months old at the time of writing), and Hakiri’s manifest is small enough to swap later. ADR-0002 is the right place to track this — when DuckLake reaches v1.0 and has non-DuckDB readers, that ADR gets re-opened. The two projects are aligned more than they are competing: a world where Hakiri pipelines write DuckLake tables that any DuckDB / Polars / Snowflake client can read is strictly better than today’s bespoke layout.

What’s not a comparison axis here: agent authoring, ingestion, scheduling, MCP. DuckLake doesn’t claim any of these; it’s the wrong question to ask.

vs. bauplan

bauplan is a serverless data platform aimed at data + ML workloads, co-founded by Jacopo Tagliabue (formerly Coveo). Compute is Python serverless functions (“FaaS for data”): each step in a pipeline is a function bauplan invokes against tables. Storage is Apache Iceberg with a bauplan-managed metastore. The headline feature is git-for-data: tables are branchable, mergeable, commit-shaped. You can run `bauplan branch create staging`, run experiments, and merge back, with snapshots that are cheap and atomic.

bauplan and Hakiri target overlapping pain (declarative data work, agent / ML-friendly compute, local-first dev loop) but make different bets on the shape of the runtime and the storage:

Compute model: bauplan compute is Python serverless functions. Hakiri compute is a Rust binary running WASM Component connectors and Polars transforms. bauplan optimizes for the data-science / ML-engineer ergonomic (`@bauplan.model` and `@bauplan.python` decorators, familiar pandas/pyarrow); Hakiri optimizes for sandboxing, deploy-anywhere, agent-authorable connectors. Different runtime philosophies.
Storage format: bauplan uses Iceberg. Hakiri uses Parquet + a small JSON manifest, per ADR-0002. The trade-off: Iceberg gives standardized snapshots, schema evolution, and a growing ecosystem; the manifest cost is real (metadata tree, vacuum, table-maintenance overhead). Hakiri bet on simpler + portable for v0; bauplan bet on standard + branchable. Both bets are defensible.
Branching: bauplan’s git-for-data is first-class on the data plane — you branch a table, mutate, merge. Hakiri’s branching story is only on the config plane (14-collab-config.md) — you branch a manifest, edit collaboratively, apply. Hakiri has no native answer to “branch this dataset for an experiment” today. This is a real gap.
Target user: bauplan’s marketing speaks to data scientists / ML engineers building feature pipelines and model workflows. Hakiri’s is agents + small teams of operators building a context layer over their SaaS silos. Different personas; overlapping tools.
Agent surface: bauplan has no MCP server and no notion of agent-authored functions (you write Python; bauplan runs it). Hakiri’s Pillar 2 is exactly this surface.
Distribution: bauplan is a hosted SaaS with some OSS components and a local-dev path. Hakiri is a single OSS binary with three deploy profiles — local CLI, daemon, CF / AWS / cluster. The deploy-anywhere story is meaningfully different.
Sandboxing: bauplan functions run in bauplan-managed containers. Hakiri connectors run as WASM Components with declared capabilities — finer-grained than container isolation.
License: bauplan has Apache-2.0 OSS pieces and a closed-source hosted product. Hakiri targets Apache-2.0 end to end.

Pick bauplan when: you’re building ML / data-science pipelines in Python; you want git-for-data branching on real tables (not just config); you’re happy on a hosted platform; Iceberg is a familiar or required substrate; your team thinks in pandas / pyarrow / Jupyter rather than agents and MCP.

Pick Hakiri when: the workload is agent-facing context (not ML feature engineering); you want a single OSS binary you can deploy on the edge; you want WASM-sandboxed connectors authored by agents; you need capability tokens and write-time redaction for regulated data; you want the data plane to never leave the customer environment.

What we’d happily steal from bauplan: the git-for-data branching model is genuinely useful, and the gap is real — Hakiri can edit config collaboratively but cannot branch a dataset for an experiment without copying it. A future feature: data branches as cheap Parquet copies + a branch pointer in the catalog, applying the same materialization-on-apply discipline as the CRDT config layer. Tracked as an open question below; not on the M1 roadmap.
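As a thought experiment only, the “Parquet copies + a branch pointer in the catalog” idea could look like the Python sketch below. Nothing here reflects a committed design: the `branches` table, the directory layout, and the function names are all assumptions, and the files are placeholders rather than real Parquet.

```python
import shutil
import sqlite3
import tempfile
from pathlib import Path


def init(conn: sqlite3.Connection, root: Path) -> None:
    """Create the branch-pointer table and register an empty `main` branch."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS branches (name TEXT PRIMARY KEY, data_dir TEXT NOT NULL)"
    )
    main_dir = root / "main"
    main_dir.mkdir(parents=True, exist_ok=True)
    with conn:
        conn.execute("INSERT OR IGNORE INTO branches VALUES ('main', ?)", (str(main_dir),))


def branch_dir(conn: sqlite3.Connection, name: str) -> Path:
    """Resolve a branch name to the directory its pointer refers to."""
    row = conn.execute("SELECT data_dir FROM branches WHERE name = ?", (name,)).fetchone()
    if row is None:
        raise KeyError(name)
    return Path(row[0])


def create_branch(conn: sqlite3.Connection, root: Path, name: str, source: str = "main") -> Path:
    """Copy the source branch's files and record a pointer to the copy."""
    target = root / name
    shutil.copytree(branch_dir(conn, source), target)
    with conn:
        conn.execute("INSERT INTO branches VALUES (?, ?)", (name, str(target)))
    return target


def apply_branch(conn: sqlite3.Connection, name: str, into: str = "main") -> None:
    """Materialize on apply: repoint `into` at the branch's files in one transaction."""
    source_dir = str(branch_dir(conn, name))
    with conn:
        conn.execute("UPDATE branches SET data_dir = ? WHERE name = ?", (source_dir, into))


# Demo: branch, add a file on the branch, then apply it to main.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    conn = sqlite3.connect(":memory:")
    init(conn, root)
    (branch_dir(conn, "main") / "part-0001.parquet").write_bytes(b"placeholder")
    exp = create_branch(conn, root, "experiment")
    (exp / "part-0002.parquet").write_bytes(b"placeholder")
    main_before = sorted(p.name for p in branch_dir(conn, "main").iterdir())
    apply_branch(conn, "experiment")
    main_after = sorted(p.name for p in branch_dir(conn, "main").iterdir())
    conn.close()
```

The sketch treats apply as a pointer swap, so it has no conflict handling if `main` changed after the branch was cut, which is one of the things a design spike would need to settle.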

Honest convergence read: bauplan and Hakiri end up in different worlds. bauplan ships a managed platform with strong Python ergonomics and Iceberg branching for data scientists. Hakiri ships an agent-shaped, edge-deployable, Parquet-portable context layer. The interesting overlap is the substrate question — “should the data plane be Iceberg or simpler Parquet?” — which both projects answer, differently, in good faith. If Iceberg’s catalog story converges with DuckLake’s (a real possibility given the catalog-in-SQL direction), the Parquet-vs-Iceberg debate at Hakiri’s level collapses, and bauplan’s strongest substrate differentiator becomes branching specifically — which is more reproducible by other tools than the format itself.

What Hakiri is not trying to be

A streaming SQL engine. That’s Materialize, Arroyo, RisingWave. Hakiri is batch-oriented (record batches, not row streams).
A workflow orchestrator. That’s Temporal, Restate, Step Functions, Cloudflare Workflows. Hakiri is a step inside a workflow.
A warehouse. That’s Snowflake, BigQuery, ClickHouse. The context store is a queryable cache, not a warehouse.
A vector database. Embeddings in Parquet work fine for analytics, but for online vector search you want Turbopuffer, Pinecone, or Postgres+pgvector.

Open questions

OpenLineage compatibility. Worth emitting OpenLineage events from runs? Free integration with downstream observability tools. Likely yes in M2.
dbt adapter. The context store + DuckDB is a natural dbt target. Easy to write an adapter; defer until requested.
DuckLake interop. If DuckLake reaches v1.0 with non-DuckDB readers, does Hakiri swap its bespoke Parquet + JSON manifest for the DuckLake catalog shape? Re-opens ADR-0002. Watch through 2026.
Data branching (bauplan-style). Hakiri has CRDT branching on config but no native answer for “branch this table, experiment, merge back.” Worth a design spike post-M1 — likely Parquet-copy + catalog branch-pointer, materialized-on-apply like the config layer.