Comparisons

How Hakiri positions itself against its closest neighbors at the engineering level: a feature matrix and head-to-head comparisons. Three categories get detailed treatment here:

  1. OSS ELT — dlt, Airbyte, Meltano/Singer, Fivetran, Cloudflare Pipelines
  2. Agent-memory products — Mem0, SurrealDB Spectron
  3. Data architecture patterns — DuckLake, bauplan

Audience framing (“who Hakiri is for vs. who it isn’t”), the broader competitive landscape (memory frameworks beyond Mem0 and Spectron, vendor-locked memory, per-tool MCP servers, vector storage, embedded data stacks), and the “what makes Hakiri non-substitutable” argument live in PRD.md; see PRD § Target audience and PRD § Competitive landscape. This doc is the technical companion: feature-by-feature comparisons against the three categories above.

| | Hakiri | dlt | Airbyte (OSS) | Meltano (Singer) | Fivetran | Cloudflare Pipelines |
|---|---|---|---|---|---|---|
| Language | Rust binary | Python lib | Python + Java | Python | SaaS | SaaS (closed) |
| Distribution | single binary | pip package | docker-compose / k8s | pip package | hosted | CF account |
| Connector contract | WIT/WASM | Python @dlt.source | Java/Python container | Singer JSON spec | proprietary | proprietary |
| Agent authoring | first-class MCP | none | none | none | none | none |
| Sandboxing | WASM capabilities | trust the lib | container | trust the tap | n/a | n/a |
| Local-first | yes | yes | partially | yes | no | no |
| Default destination | local context store | warehouse of choice | warehouse | warehouse | warehouse | R2 |
| Edge / Workers deploy | yes (WASM + Containers) | no | no | no | no | native |
| Schema evolution | declarative + agent-resolvable | inferred | per-connector | per-tap | inferred | n/a |
| OSS license | Apache-2.0 (proposed) | Apache-2.0 | Elastic-2.0 | MIT | proprietary | proprietary |
| Sync to S3-compatible | first-class | via destination | via destination | via target | via destination | native |

dlt is closest in spirit — also small, library-first, schema-inferring, destination-flexible. Differences:

  • Language: dlt is Python; Hakiri is Rust + WASM. dlt’s Python-native ergonomics are unmatched for ad-hoc analytics scripts. Hakiri pays a startup-tooling cost (you compile or download a binary) to get sandboxing and edge deployability.
  • Connectors are libraries vs. components: a dlt source is a Python function that imports the world. A Hakiri connector is a WASM component with declared capabilities. Hakiri’s contract is stricter; dlt’s is more flexible.
  • Local context store: dlt doesn’t ship one. You pick a destination (DuckDB, Postgres, BigQuery, …) and dlt writes to it. Hakiri’s default is the on-disk context store, syncable to R2 — a different mental model.
  • Agent authoring: dlt has no MCP surface today. The community is exploring it; Hakiri builds it in.
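
The “declared capabilities” contract can be pictured as a host-side allowlist that every outbound call passes through. The following is a minimal sketch in Python for readability (Hakiri itself is Rust + WASM); `ConnectorManifest`, `HostGate`, `CapabilityError`, and the `allowed_hosts` field are hypothetical names chosen for illustration, not Hakiri’s actual API or manifest format.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass(frozen=True)
class ConnectorManifest:
    """Hypothetical capability declaration shipped alongside a connector."""
    name: str
    allowed_hosts: frozenset = field(default_factory=frozenset)


class CapabilityError(PermissionError):
    """Raised when a connector asks for something it never declared."""


class HostGate:
    """Host-side gate: the runtime routes every outbound request through it."""

    def __init__(self, manifest: ConnectorManifest):
        self.manifest = manifest

    def check_http(self, url: str) -> str:
        # Compare the request's hostname against the declared allowlist.
        host = urlparse(url).hostname or ""
        if host not in self.manifest.allowed_hosts:
            raise CapabilityError(
                f"{self.manifest.name}: host {host!r} was not declared"
            )
        return host
```

With a manifest declaring only `api.github.com`, a call to `check_http("https://api.github.com/repos")` passes, while any other host raises `CapabilityError`. A plain Python library source has no equivalent checkpoint, which is the “trust the lib” cell in the matrix.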

Pick dlt when: your team is a Python shop, you want quick scripts that land in your existing warehouse, and you trust your connector code.

Pick Hakiri when: you want a single binary, you want agents to author connectors, you want WASM sandboxing, you want a local-first context store as the default.

Airbyte is the breadth incumbent — hundreds of connectors, big UI, k8s-heavy deployment. Differences:

  • Operational weight: Airbyte OSS wants a Kubernetes cluster, a database, a UI. Hakiri wants a binary.
  • Connector authoring: Airbyte’s CDK is Python or Java; each connector is a container with its own image. Hakiri connectors are WASM components, no image build.
  • Marketplace: Airbyte’s marketplace is its moat. Hakiri starts with zero and earns connectors via the agent-authoring loop. If you need 200 connectors today, Airbyte wins.
  • License: Airbyte moved from MIT to Elastic-2.0 (ELv2). Hakiri targets Apache-2.0.

Pick Airbyte when: you have a long tail of obscure SaaS sources and value the existing connector inventory more than runtime ergonomics.

Pick Hakiri when: you want a small footprint, you’re OK with fewer-but-easier-to-write connectors, you want the agent loop.

Singer’s tap/target subprocess protocol is the OG declarative ELT model — JSON-lines over stdio between independent processes. Meltano is the orchestrator on top.

  • Inspiration: Singer’s “small interface, many implementations” philosophy is exactly Hakiri’s. The difference is we use WIT/WASM instead of JSON-over-stdio, which gives us typed schemas, sandboxing, and a faster wire format.
  • Compatibility: a hakiri singer <tap> shim is on the roadmap (see 03-pipelines) — adapt a Singer tap as a Hakiri source by running it as a subprocess and translating the JSONL stream. That gets us the existing tap ecosystem for free.
  • State management: Meltano’s state is YAML+JSON files in the project dir. Hakiri’s catalog is SQLite + Parquet. Functionally equivalent; Hakiri’s is more queryable.
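
The translation step of such a shim might look like the sketch below. It assumes the standard Singer message types (`SCHEMA`, `RECORD`, `STATE`) arriving as JSON lines; the function name `translate_singer` and the output shape (`stream` / `schema` / `records` batches plus `state` checkpoints) are made up for illustration and are not Hakiri’s record-batch format.

```python
import json
from typing import Iterable, Iterator


def translate_singer(lines: Iterable[str], batch_size: int = 2) -> Iterator[dict]:
    """Turn Singer JSONL messages into per-stream batches and checkpoints."""
    schemas: dict = {}   # stream name -> last SCHEMA seen
    pending: dict = {}   # stream name -> records not yet emitted

    def flush(stream):
        records = pending.pop(stream, [])
        if records:
            yield {"stream": stream, "schema": schemas.get(stream), "records": records}

    for line in lines:
        line = line.strip()
        if not line:
            continue
        msg = json.loads(line)
        kind = msg.get("type")
        if kind == "SCHEMA":
            schemas[msg["stream"]] = msg["schema"]
        elif kind == "RECORD":
            stream = msg["stream"]
            pending.setdefault(stream, []).append(msg["record"])
            if len(pending[stream]) >= batch_size:
                yield from flush(stream)
        elif kind == "STATE":
            # Flush everything first, so a saved checkpoint never runs
            # ahead of the records that were actually delivered.
            for stream in list(pending):
                yield from flush(stream)
            yield {"state": msg["value"]}

    # End of input: emit whatever is still buffered.
    for stream in list(pending):
        yield from flush(stream)
```

In a real shim the `lines` argument would be the tap’s stdout, for example from `subprocess.Popen(..., stdout=subprocess.PIPE, text=True)`; here any iterable of strings works, which keeps the translation testable without a tap installed.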

Pick Meltano when: you’re already invested in Singer taps and the Python ecosystem.

Pick Hakiri when: you want a similar declarative model but in a single binary with WASM-sandboxed connectors and an MCP server.

Fivetran and similar hosted services are the buy-don’t-build options. They sit in a different category: they sell you “we operate the connectors”. Hakiri is the build-it-yourself option that doesn’t require building everything from scratch.

Pick Fivetran when: you have budget, you’d rather pay than operate, your sources are common SaaS apps.

Pick Hakiri when: you have non-standard sources, you have data residency constraints, you want to own the data plane, or you have an agent that should be authoring connectors.

Cloudflare Pipelines (the newer product) is closed-source SaaS optimized for “stream into R2”. It overlaps with Hakiri’s edge deployment story but is opinionated about R2 and the Cloudflare runtime.

  • Hakiri runs on Cloudflare (Containers + Workflows) but isn’t only for Cloudflare.
  • Cloudflare Pipelines doesn’t have an MCP surface and doesn’t expose connector authoring to agents.
  • OSS vs. proprietary is the headline difference.

Pick CF Pipelines when: you’re all-in on Cloudflare, you want managed, your shape is “events → R2”.

Pick Hakiri when: you want OSS, you want to deploy outside Cloudflare too, you want agent authoring.

| | Hakiri | Mem0 | SurrealDB Spectron |
|---|---|---|---|
| Scope | Ingestion + memory + access control + MCP in one binary | Memory layer for agents (vector + graph) | Memory layer integrated into SurrealDB |
| Storage engine | Parquet files + SQLite catalog | Postgres + vector DB (e.g. Qdrant) | SurrealDB (RocksDB / SurrealKV / TiKV) |
| Storage format on disk | cat-able Parquet + JSON manifests | Postgres tables + vendor vector format | SurrealDB-owned, opaque without engine |
| Query surface | DuckDB SQL + MCP | Mem0 REST API + MCP-adjacent | SurrealQL (bespoke) + MCP |
| Ingestion | Pull from N source connectors; backfill + drift; agent-authored WASM | Caller pushes events / messages | Caller writes via MCP / SurrealDB drivers |
| Sandboxing | WASM Component with capability-declared host access | n/a (caller’s code) | n/a (caller’s code) |
| Access control | Capability tokens (subject tuple + DPoP) + RLS + CLS + write-time redaction | Token + project scoping | SurrealDB record-level permissions + RBAC |
| Distribution | Single binary, three profiles | Mem0 service + Postgres + vector DB | SurrealDB Cloud (managed) or self-hosted SurrealDB |
| Local-first sync | Any S3-compatible bucket; offline-capable; no relay | Hosted-first; self-hosted = three services running | Replication assumes connected SurrealDB cluster |
| Edge / WASM | Cloudflare Containers + Workers (WASM core profile) | Server-shaped | Server-shaped |
| License | Apache-2.0 (proposed) | Apache-2.0 + hosted | BSL 1.1 → Apache-2.0 (SurrealDB); Spectron license unannounced |
| Lock-in failure mode | Project stops → data is Parquet + JSON, readable by every analytics tool | Mem0 stops → Postgres dump is recoverable but schema is Mem0’s | SurrealDB stops or pivots → format is SurrealDB’s |

Mem0 is the closest pure-memory competitor and the most likely “we already use Mem0, why do we need Hakiri?” objection. Both ship OSS + hosted; both target agent memory; both touch MCP-adjacent surfaces.

  • Ingestion: Mem0 is push-only — the caller writes events and messages into the memory API. Hakiri pulls from source connectors on a schedule with backfill and drift detection.
  • Storage: Mem0 owns its schema across Postgres + a vector DB. Hakiri’s storage is Parquet + JSON manifests + SQLite — readable by duckdb, parquet-tools, or cat without Hakiri running.
  • Deploy: Mem0’s hosted is one service; self-hosted requires Mem0 + Postgres + a vector DB — three processes minimum. Hakiri is one binary, no companion services.
  • Sandboxing: Mem0 trusts the caller’s code; Hakiri runs connectors as WASM components with capability-declared host access.
  • Access control: Mem0 has token + project scoping. Hakiri has capability tokens (subject tuple + DPoP), RLS, CLS, and write-time redaction enforced at the sync edge.

Pick Mem0 when: you only need a memory store for an existing agent, you’re happy to push events at it, and you don’t want to run ingestion yourself. Hosted Mem0 is the fastest path to “my agent remembers things.”

Pick Hakiri when: the agent needs to pull from multiple SaaS silos and remember things across them; you cannot or will not run three services; you need a cat-able audit story or capability-token ACLs for an internal/regulated environment.

Honest convergence read: Mem0 may add ingestion connectors over time, and Hakiri may someday be a reasonable Mem0 substitute. Today they sit on different sides of “memory for agents” vs “data layer for agents”, and each is the wrong tool for the other’s primary use case.

Spectron is the closest architectural competitor in the agent-memory category: OSS-aligned vendor, MCP surface, knowledge-graph + vector + temporal memory, and a self-hosting story. The thesis Spectron leads with — “memory in the database, not above it” — is a real critique of the Mem0/middleware shape, and one Hakiri partly agrees with: a single binary owns the storage, the catalog, and the MCP surface, so there is no middleware tax.

The disagreement is about which database, and whether the answer should be a database at all.

  • Storage shape: Spectron stores entities, relationships, embeddings, and temporal metadata inside SurrealDB’s on-disk format (RocksDB / SurrealKV / TiKV). Hakiri stores them as Parquet files on a local FS or S3-compatible bucket, with a SQLite catalog as the index. Spectron’s storage is opaque without SurrealDB running; Hakiri’s is portable.
  • Query surface: Spectron speaks SurrealQL (bespoke, multi-model). Hakiri speaks DuckDB SQL (standard) with Polars for transforms. Agents write SQL fluently out of the box; SurrealQL is a re-training cost.
  • Memory model: Spectron makes knowledge graph, bi-temporal facts, entity disambiguation, and autonomous enrichment first-class storage primitives. Hakiri treats these as queries over Parquet tables backed by vector (HNSW) and full-text (Tantivy) sidecar indices — same capabilities, different abstraction level.
  • Ingestion: Spectron is caller-writes (via MCP or SurrealDB drivers). Hakiri pulls from source connectors, with sandboxed WASM connectors and agent authoring.
  • Access control: Spectron uses SurrealDB record-level permissions + RBAC inside the database. Hakiri uses capability tokens with composable subjects + RLS + CLS + write-time redaction enforced at the sync edge.
  • Deploy: Spectron is in waitlist preview; production path is SurrealDB Cloud (managed) or self-hosted SurrealDB. Hakiri is a single binary in three feature profiles — no DB to run alongside.
  • Local-first sync: SurrealDB’s replication assumes always-connected nodes (a cluster). Hakiri’s sync is “any S3 bucket between any two parties, possibly offline for hours” — file-shaped, no relay.
  • Edge / WASM: SurrealDB embeds, but Spectron itself is server-shaped. Hakiri runs on Cloudflare Containers + Workers (WASM core profile) by design.
  • Coupling: Spectron requires SurrealDB; SurrealDB’s distributed mode requires TiKV. Hakiri’s destinations are pluggable (local FS, R2, S3); no foreign cluster required.

Pick Spectron when: you have already committed to SurrealDB as your primary data store, you want memory writes in the same ACID transaction as the rest of your domain data, you value the knowledge-graph + bi-temporal model as a first-class storage primitive, and you are comfortable with the storage format being SurrealDB-owned.

Pick Hakiri when: you need to pull from multiple SaaS silos and remember things across them; you want Parquet storage (every analytics tool reads it, audit is cat-able, ADR-0002 explains why); you want to deploy on the edge as WASM; you want capability-token ACLs at the sync edge rather than RBAC inside a database; you want Pillar 3’s MCP-only / provider-agnostic guarantee to extend to the storage layer.

Where Spectron’s critique of “middleware over fragmented DBs” lands on Hakiri: it doesn’t, but for a different reason than Spectron’s own answer. Hakiri is not middleware — it owns its storage end to end. The difference is Hakiri’s storage is files on object storage, not a database. That gets us the same atomic-write story (a Parquet file is atomic to publish via manifest swap), no cross-DB round trips (one process writes the file and updates the catalog), and additionally: portability, cat-ability, replication-without-a-cluster, and no second piece of infrastructure to run. The underlying decision is in ADR-0004, which evaluated SurrealDB-as-engine and rejected it on the same axes that separate Hakiri from Spectron here.
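
The “atomic to publish via manifest swap” claim can be illustrated on a local filesystem. This is a sketch under stated assumptions: the function name `publish`, the `manifest.json` file name, and the `version` / `files` manifest fields are hypothetical and not Hakiri’s actual layout, and the payload bytes stand in for a real Parquet file.

```python
import json
import os
import tempfile
from pathlib import Path


def publish(root: Path, file_name: str, payload: bytes) -> dict:
    """Write a data file, then atomically swap in a manifest that lists it."""
    root.mkdir(parents=True, exist_ok=True)
    manifest_path = root / "manifest.json"

    # 1. Write the data file. Readers only trust the manifest, so the file
    #    stays invisible until step 3 completes.
    (root / file_name).write_bytes(payload)

    # 2. Derive the next manifest from the current one.
    if manifest_path.exists():
        current = json.loads(manifest_path.read_text())
    else:
        current = {"version": 0, "files": []}
    nxt = {"version": current["version"] + 1,
           "files": current["files"] + [file_name]}

    # 3. Write the new manifest to a temp file in the same directory, fsync,
    #    then rename over the old one. os.replace is atomic when source and
    #    destination are on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=root, suffix=".tmp")
    with os.fdopen(fd, "w") as fh:
        json.dump(nxt, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, manifest_path)
    return nxt
```

A reader that loads `manifest.json` sees either the old file list or the new one, never a partial state. Object stores have no rename, so on an S3-compatible bucket the swap would need a different primitive (for example a conditional write); the sketch covers the local case only and does not handle concurrent writers.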

Honest convergence read: if Spectron ships and proves out, it is the strongest evidence yet that “agent memory should own its storage” is the correct architectural intuition. The disagreement is narrower than it looks — Hakiri agrees with the conclusion, disagrees with the implementation (database vs. files), and disagrees with the implied lock-in (SurrealDB-or-nothing vs. portable-files-anywhere).

Two recent entries, DuckLake (DuckDB team, 2025) and bauplan (founded 2023), share substrate or philosophy with Hakiri but sit at different levels of the stack. They are worth comparing because the substrate choices (Parquet, DuckDB-shaped compute, lakehouse-style storage, serverless execution, branching) overlap with Hakiri’s design space, and because the differences in level clarify what Hakiri is and isn’t.

| | Hakiri | DuckLake | bauplan |
|---|---|---|---|
| Category | Data movement runtime + context layer | Lakehouse table-format spec | Serverless data platform |
| Level in stack | Runtime + ingestion + agent surface | Storage format + catalog | Compute platform + storage + branching UX |
| Data files | Parquet + JSON manifests + sidecar indexes | Parquet | Apache Iceberg (Parquet underneath) |
| Catalog backend | SQLite / DO SQLite / RDS / Postgres | Any SQL DB (Postgres, MySQL, SQLite, DuckDB) | Bauplan-managed metastore |
| Compute / query | Rust + WASM runtime; DuckDB + Polars at query time | Any engine (DuckDB reference impl); SQL | Python serverless functions (“FaaS for data”) |
| Ingestion | Pull from N WASM connectors; backfill; drift detection | n/a (table format only) | Python “loader” functions |
| Branching | CRDT history on config (not data); LWW + node-id paths on data | First-class snapshots + branches at the table level | First-class git-for-data on tables (“branch / merge / commit”) |
| Sandboxing | WASM Component with capability declarations | n/a | Container per function |
| Agent surface | MCP-native (in-tree) | none | none |
| Local-first | yes (single binary, offline-capable) | yes (DuckDB engine + SQLite catalog works locally) | hosted-first (some local dev tooling) |
| License | Apache-2.0 (proposed) | MIT (DuckLake spec); MIT (DuckDB reference impl) | Apache-2.0 OSS components; closed-source hosted SaaS |
| Lock-in failure mode | Project stops → Parquet + JSON readable by any tool | Spec stops → Parquet files still readable; catalog rows are plain SQL | bauplan stops → Iceberg tables remain readable; orchestration/branching is bauplan-owned |

DuckLake is a lakehouse table-format specification published by the DuckDB team in 2025. The core idea: replace the file-based catalog of Iceberg / Delta (manifest files, transaction logs, JSON metadata trees) with rows in a regular SQL database. Data stays as Parquet; the catalog is CREATE TABLE ducklake_snapshot (...) in Postgres / MySQL / SQLite / DuckDB itself. ACID transactions ride the SQL database’s existing transaction semantics; snapshots, branches, time-travel, and schema evolution are SQL rows. The reference implementation is a DuckDB extension.
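
To make the “catalog as SQL rows” idea concrete, here is a minimal sketch using Python’s built-in `sqlite3`. The two tables (`snapshot`, `data_file`) and the helpers `commit` and `files_at` are deliberately simplified, hypothetical stand-ins; they are not the DuckLake schema, only an illustration of how snapshots and time travel reduce to ordinary rows and one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE snapshot (
    snapshot_id  INTEGER PRIMARY KEY,
    committed_at TEXT NOT NULL
);
CREATE TABLE data_file (
    path       TEXT NOT NULL,
    added_in   INTEGER NOT NULL REFERENCES snapshot(snapshot_id),
    removed_in INTEGER REFERENCES snapshot(snapshot_id)
);
""")


def commit(conn, add=(), remove=()):
    """Record one snapshot and its file changes in a single transaction."""
    with conn:  # commits on success, rolls back if anything raises
        cur = conn.execute(
            "INSERT INTO snapshot (committed_at) VALUES (datetime('now'))"
        )
        sid = cur.lastrowid
        conn.executemany(
            "INSERT INTO data_file (path, added_in) VALUES (?, ?)",
            [(p, sid) for p in add],
        )
        conn.executemany(
            "UPDATE data_file SET removed_in = ? "
            "WHERE path = ? AND removed_in IS NULL",
            [(sid, p) for p in remove],
        )
    return sid


def files_at(conn, sid):
    """Time travel: the Parquet files visible as of snapshot `sid`."""
    rows = conn.execute(
        "SELECT path FROM data_file "
        "WHERE added_in <= ? AND (removed_in IS NULL OR removed_in > ?) "
        "ORDER BY path",
        (sid, sid),
    )
    return [r[0] for r in rows]
```

Committing `a.parquet` and `b.parquet`, then committing a second snapshot that adds `c.parquet` and removes `a.parquet`, leaves both file lists queryable: `files_at` for the first snapshot still returns the original pair. The atomicity comes entirely from the SQL database’s transaction, which is the point the spec makes.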

Hakiri and DuckLake share substrate but sit at different levels of the stack — they’re complementary, not substitutes:

  • Level: DuckLake is just the storage layer. It says nothing about how data gets in, how schedules fire, how agents query, how access is controlled. Hakiri is the runtime above the storage. The honest framing: Hakiri could one day emit DuckLake-formatted tables and consume them through the same DuckDB query face it already uses.
  • Catalog shape: Both keep catalog state in a SQL store. Hakiri uses SQLite (or Postgres / RDS / DO SQLite) for pipeline state — cursors, run history, schema evolution decisions, capability tokens. DuckLake uses a SQL DB for table state — snapshot IDs, file lists, schema versions, branch heads. Different scopes, same instinct: “the catalog should be a database, not files.”
  • Manifest format: Hakiri’s ADR-0002 rejected Iceberg’s complexity for v0 in favor of a small Parquet + JSON manifest design. DuckLake lands in roughly the same place philosophically — keep the catalog readable, avoid file-based metadata trees — but with a more formal table-format spec and a community-adopted shape.
  • Branching: DuckLake ships snapshots + branches as first-class table operations (cheap to create, fork, merge). Hakiri does not have data-level branching today. Hakiri’s CRDT-on-config story (14-collab-config.md) gives multiplayer editing of pipeline definitions, but not “branch this dataset, experiment, merge back.”
  • Engine assumption: DuckLake’s reference implementation is DuckDB. Other engines (Spark, Trino, Polars) would need DuckLake readers/writers — work that’s in progress but not uniformly available. Hakiri’s runtime is Rust + WASM with DuckDB as the query face, not the data manipulation engine.

Pick DuckLake when: you want a lakehouse table format simpler than Iceberg / Delta; your compute is DuckDB-shaped (or you can wait for other engines to add support); you don’t need ingestion because the data is already produced by something upstream; and you want the catalog to be a regular database you can query with SELECT * FROM ducklake_snapshot.

Pick Hakiri when: you need the layer above DuckLake — connectors to pull data in, schedules, a context store with vector + FTS indexes, an MCP server, agent authoring, capability tokens at the sync edge.

Honest convergence read: if DuckLake stabilizes as a community spec, Hakiri’s on-disk format could become “Parquet + DuckLake-spec catalog” — replacing the bespoke manifest with a standard one. The bet against that today: DuckLake is new (months old at the time of writing), and Hakiri’s manifest is small enough to swap later. ADR-0002 is the right place to track this — when DuckLake reaches v1.0 and has non-DuckDB readers, that ADR gets re-opened. The two projects are aligned more than they are competing: a world where Hakiri pipelines write DuckLake tables that any DuckDB / Polars / Snowflake client can read is strictly better than today’s bespoke layout.

What’s not a comparison axis here: agent authoring, ingestion, scheduling, MCP. DuckLake doesn’t claim any of these; it’s the wrong question to ask.

bauplan is a serverless data platform aimed at data + ML workloads, co-founded by Jacopo Tagliabue (formerly Coveo). Compute is Python serverless functions (“FaaS for data”): each step in a pipeline is a function bauplan invokes against tables. Storage is Apache Iceberg with a bauplan-managed metastore. The headline feature is git-for-data: tables are branchable, mergeable, commit-shaped. You can bauplan branch create staging, run experiments, and merge back, with snapshots that are cheap and atomic.

bauplan and Hakiri target overlapping pain (declarative data work, agent / ML-friendly compute, local-first dev loop) but make different bets on the shape of the runtime and the storage:

  • Compute model: bauplan compute is Python serverless functions. Hakiri compute is a Rust binary running WASM Component connectors and Polars transforms. bauplan optimizes for the data-science / ML-engineer ergonomic (@bauplan.model and @bauplan.python decorators, familiar pandas/pyarrow); Hakiri optimizes for sandboxing, deploy-anywhere, agent-authorable connectors. Different runtime philosophies.
  • Storage format: bauplan uses Iceberg. Hakiri uses Parquet + a small JSON manifest, per ADR-0002. The trade-off: Iceberg gives standardized snapshots, schema evolution, and a growing ecosystem; the manifest cost is real (metadata tree, vacuum, table-maintenance overhead). Hakiri bet on simpler + portable for v0; bauplan bet on standard + branchable. Both bets are defensible.
  • Branching: bauplan’s git-for-data is first-class on the data plane — you branch a table, mutate, merge. Hakiri’s branching story is only on the config plane (14-collab-config.md) — you branch a manifest, edit collaboratively, apply. Hakiri has no native answer to “branch this dataset for an experiment” today. This is a real gap.
  • Target user: bauplan’s marketing speaks to data scientists / ML engineers building feature pipelines and model workflows. Hakiri’s is agents + small teams of operators building a context layer over their SaaS silos. Different personas; overlapping tools.
  • Agent surface: bauplan has no MCP server and no notion of agent-authored functions (you write Python; bauplan runs it). Hakiri’s pillar 2 is exactly this surface.
  • Distribution: bauplan is a hosted SaaS with some OSS components and a local-dev path. Hakiri is a single OSS binary with three deploy profiles — local CLI, daemon, CF / AWS / cluster. The deploy-anywhere story is meaningfully different.
  • Sandboxing: bauplan functions run in bauplan-managed containers. Hakiri connectors run as WASM Components with declared capabilities — finer-grained than container isolation.
  • License: bauplan has Apache-2.0 OSS pieces and a closed-source hosted product. Hakiri targets Apache-2.0 end to end.

Pick bauplan when: you’re building ML / data-science pipelines in Python; you want git-for-data branching on real tables (not just config); you’re happy on a hosted platform; Iceberg is a familiar or required substrate; your team thinks in pandas / pyarrow / Jupyter rather than agents and MCP.

Pick Hakiri when: the workload is agent-facing context (not ML feature engineering); you want a single OSS binary you can deploy on the edge; you want WASM-sandboxed connectors authored by agents; you need capability tokens and write-time redaction for regulated data; you want the data plane to never leave the customer environment.

What we’d happily steal from bauplan: the git-for-data branching model is genuinely useful, and the gap is real — Hakiri can edit config collaboratively but cannot branch a dataset for an experiment without copying it. A future feature: data branches as cheap Parquet copies + a branch pointer in the catalog, applying the same materialization-on-apply discipline as the CRDT config layer. Tracked as an open question below; not on the M1 roadmap.
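
As a rough picture of “Parquet copies + a branch pointer”, the sketch below keeps one directory of files per branch and a name-to-directory pointer map. Everything here is hypothetical: `BranchCatalog`, `create`, and `fast_forward` are illustrative names, the pointer map would live in the catalog rather than in memory, and the merge shown is a fast-forward only, with no conflict handling if `main` changed in the meantime.

```python
import shutil
from pathlib import Path


class BranchCatalog:
    """Hypothetical catalog: branch name -> directory holding its files."""

    def __init__(self, root: Path):
        self.root = root
        self.pointers = {"main": root / "main"}
        self.pointers["main"].mkdir(parents=True, exist_ok=True)

    def create(self, name: str, source: str = "main") -> Path:
        """Branch by copying the source's files and recording a new pointer."""
        if name in self.pointers:
            raise ValueError(f"branch {name!r} already exists")
        target = self.root / name
        shutil.copytree(self.pointers[source], target)
        self.pointers[name] = target
        return target

    def fast_forward(self, name: str, into: str = "main") -> None:
        """Merge by repointing `into` at the branch's files.

        Only safe when `into` has not changed since the branch was cut.
        """
        self.pointers[into] = self.pointers.pop(name)
```

The experiment writes into its own directory and `main` stays untouched until `fast_forward` moves the pointer. A full copy is the simplest possible mechanism; whether it is cheap enough in practice is part of the open question.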

Honest convergence read: bauplan and Hakiri end up in different worlds. bauplan ships a managed platform with strong Python ergonomics and Iceberg branching for data scientists. Hakiri ships an agent-shaped, edge-deployable, Parquet-portable context layer. The interesting overlap is the substrate question — “should the data plane be Iceberg or simpler Parquet?” — which both projects answer, differently, in good faith. If Iceberg’s catalog story converges with DuckLake’s (a real possibility given the catalog-in-SQL direction), the Parquet-vs-Iceberg debate at Hakiri’s level collapses, and bauplan’s strongest substrate differentiator becomes branching specifically — which is more reproducible by other tools than the format itself.

What Hakiri is not

  • A streaming SQL engine. That’s Materialize, Arroyo, RisingWave. Hakiri is batch-oriented (record batches, not row streams).
  • A workflow orchestrator. That’s Temporal, Restate, Step Functions, Cloudflare Workflows. Hakiri is a step inside a workflow.
  • A warehouse. That’s Snowflake, BigQuery, ClickHouse. The context store is a queryable cache, not a warehouse.
  • A vector database. Embeddings in Parquet work fine for analytics, but for online vector search you want Turbopuffer, Pinecone, or Postgres+pgvector.

Open questions

  • OpenLineage compatibility. Worth emitting OpenLineage events from runs? Free integration with downstream observability tools. Likely yes in M2.
  • dbt adapter. The context store + DuckDB is a natural dbt target. Easy to write an adapter; defer until requested.
  • DuckLake interop. If DuckLake reaches v1.0 with non-DuckDB readers, does Hakiri swap its bespoke Parquet + JSON manifest for the DuckLake catalog shape? Re-opens ADR-0002. Watch through 2026.
  • Data branching (bauplan-style). Hakiri has CRDT branching on config but no native answer for “branch this table, experiment, merge back.” Worth a design spike post-M1 — likely Parquet-copy + catalog branch-pointer, materialized-on-apply like the config layer.