Skip to content

Architecture

flowchart LR
  subgraph Author["Author (human or agent)"]
    Agent[Agent / Claude Code]
    Dev[Developer CLI]
  end

  subgraph Hakiri["hakiri (single binary)"]
    MCP[MCP server]
    Runtime[Pipeline runtime]
    WasmHost["Wasmtime Component host"]
    Catalog[(Catalog: SQLite)]
    Context[(Context store: Parquet + DuckDB)]
    Sync[Sync engine]
  end

  subgraph Connectors["Connectors"]
    Built["Built-in (native Rust)"]
    Wasm["User/agent-authored (WASM component)"]
  end

  subgraph Outside["Outside world"]
    Sources["Sources (Postgres, HTTP, GitHub, …)"]
    Bucket["S3-compatible bucket (R2)"]
  end

  Agent --> MCP
  Dev --> Runtime
  MCP --> Runtime
  Runtime --> WasmHost
  WasmHost --> Wasm
  Runtime --> Built
  Built --> Sources
  Wasm --> Sources
  Runtime --> Catalog
  Runtime --> Context
  Sync <--> Bucket
  Sync --> Catalog
  Sync --> Context
CrateResponsibilityKey typesDepends on
hakiri-coreDomain types and ports. No I/O.Schema, Record, Batch, Cursor, PipelineSpec, ConnectorRef, RunId
hakiri-runtimePipeline executor: schedules runs, orchestrates source→transform→destination, owns the WASM host.Runner, WasmHost, RunContextcore
hakiri-contextLocal-first storage: SQLite for metadata/state, Parquet for table data, DuckDB as the query face.ContextStore, TableRef, Snapshotcore
hakiri-syncPush/pull to S3-compatible bucket. Defines the wire format (manifest + parquet objects).SyncClient, Manifest, ConflictResolutioncore, context
hakiri-agentBuilt-in MCP server + connector scaffolder.McpServer, Scaffoldercore, runtime, context
hakiri-connectorsBuilt-in native connectors (postgres, http, github, openapi, file, s3, …). Each is a small sub-crate.PostgresSource, HttpSource, …core
hakiri-cliThe hakiri binary; wires everything via dependency injection.Cli, Commandall of the above

The domain (hakiri-core) is pure: no tokio, no wasmtime, no reqwest. Adapters live in the runtime/context/sync/connector crates. This is the hexagonal split from arch-taste.md applied to Rust:

hakiri-core/
src/
schema.rs # Arrow-compatible Schema, Field, DataType
record.rs # Batch (Arrow RecordBatch wrapper), Record
cursor.rs # opaque incremental-load token
pipeline.rs # PipelineSpec, SourceSpec, DestinationSpec
error.rs # thiserror enums
ports/
source.rs # trait Source
destination.rs# trait Destination
transform.rs # trait Transform
catalog.rs # trait Catalog (read/write pipeline metadata)
clock.rs # trait Clock (testable time)
sequenceDiagram
  participant CLI as hakiri-cli
  participant R as Runner (hakiri-runtime)
  participant Cat as Catalog (SQLite)
  participant Src as Source connector
  participant Ctx as ContextStore
  participant T as Tracer (OTel)

  CLI->>R: run(pipeline_id)
  R->>Cat: load PipelineSpec + last Cursor
  R->>T: start span run.<id>
  R->>Src: open(cursor)
  loop until exhausted
    Src-->>R: RecordBatch
    R->>Ctx: write(table, batch)
    Note over R,Ctx: each batch = one Parquet file under runs/<run_id>/
  end
  R->>Ctx: commit snapshot
  R->>Cat: persist new Cursor + RunStatus
  R->>T: end span

Properties:

  • One run = one atomic snapshot in the context store. Crash mid-run leaves partial Parquet under runs/<run_id>/ which the next run can either resume or discard.
  • Backpressure is in the Source. Sources return Stream<RecordBatch>; the runner pulls. No unbounded buffers.
  • Schema can evolve mid-run. New columns get appended (additive); type changes promote (e.g. int→long). Breaking changes are rejected and surfaced as SchemaIncompatible errors the agent can react to.
  • tokio multi-threaded runtime
  • One pipeline run = one async task; many pipelines can run concurrently
  • Connectors that yield Stream<RecordBatch> get N-way parallelism only if they explicitly partition (e.g. Postgres logical decoding slots, HTTP paginated fetches with shards)
  • WASM components run on wasmtime’s async support; each component gets a fresh Store per run for isolation
  • Errors are typed (thiserror): Transient, Permanent, SchemaIncompatible, AuthExpired, RateLimited
  • The runner has a single retry policy: exponential backoff with jitter, max 5 attempts, only for Transient and RateLimited
  • Permanent errors fail the run; the agent can read the trace and propose a fix
  • Partial writes are visible in the catalog as RunStatus::PartialFailure { written_batches: N }
  • Distributed execution. v0 is single-process. Multi-node sharding ships when there’s a concrete use case demanding it.
  • Reverse ETL connectors. A destination is a connector, so this works in principle, but the M0 destinations are limited to the context store + a few common sinks.
  • Real-time CDC. Postgres logical replication is on the roadmap; Kafka is not (use a Kafka→Hakiri shim).
  • Arrow as the canonical in-memory format? Arrow is the right call for interchange with DuckDB/Parquet. But Arrow’s Rust crates (arrow-rs) carry weight. Investigate arrow2 vs arrow-rs tradeoff for binary size.
  • Where does transformation live? Three options: (a) inside the source connector, (b) a separate Transform port like dlt’s @dlt.transformer, (c) push down into DuckDB after landing. Initial bias: (c) — land raw, transform in SQL. Revisit if perf demands.
  • Pipeline graph or linear? Singer treats pipelines as tap | target. dlt allows DAGs. Initial bias: linear source→[transform]*→destination; DAG support is a v2 concern.