Architecture

High-level shape

flowchart LR
  subgraph Author["Author (human or agent)"]
    Agent[Agent / Claude Code]
    Dev[Developer CLI]
  end

  subgraph Hakiri["hakiri (single binary)"]
    MCP[MCP server]
    Runtime[Pipeline runtime]
    WasmHost["Wasmtime Component host"]
    Catalog[(Catalog: SQLite)]
    Context[(Context store: Parquet + DuckDB)]
    Sync[Sync engine]
  end

  subgraph Connectors["Connectors"]
    Built["Built-in (native Rust)"]
    Wasm["User/agent-authored (WASM component)"]
  end

  subgraph Outside["Outside world"]
    Sources["Sources (Postgres, HTTP, GitHub, …)"]
    Bucket["S3-compatible bucket (R2)"]
  end

  Agent --> MCP
  Dev --> Runtime
  MCP --> Runtime
  Runtime --> WasmHost
  WasmHost --> Wasm
  Runtime --> Built
  Built --> Sources
  Wasm --> Sources
  Runtime --> Catalog
  Runtime --> Context
  Sync <--> Bucket
  Sync --> Catalog
  Sync --> Context

Components (bounded contexts → crates)

Crate	Responsibility	Key types	Depends on
`hakiri-core`	Domain types and ports. No I/O.	`Schema`, `Record`, `Batch`, `Cursor`, `PipelineSpec`, `ConnectorRef`, `RunId`	—
`hakiri-runtime`	Pipeline executor: schedules runs, orchestrates source→transform→destination, owns the WASM host.	`Runner`, `WasmHost`, `RunContext`	`core`
`hakiri-context`	Local-first storage: SQLite for metadata/state, Parquet for table data, DuckDB as the query face.	`ContextStore`, `TableRef`, `Snapshot`	`core`
`hakiri-sync`	Push/pull to S3-compatible bucket. Defines the wire format (manifest + parquet objects).	`SyncClient`, `Manifest`, `ConflictResolution`	`core`, `context`
`hakiri-agent`	Built-in MCP server + connector scaffolder.	`McpServer`, `Scaffolder`	`core`, `runtime`, `context`
`hakiri-connectors`	Built-in native connectors (postgres, http, github, openapi, file, s3, …). Each is a small sub-crate.	`PostgresSource`, `HttpSource`, …	`core`
`hakiri-cli`	The `hakiri` binary; wires everything via dependency injection.	`Cli`, `Command`	all of the above

The domain (hakiri-core) is pure: no tokio, no wasmtime, no reqwest. Adapters live in the runtime/context/sync/connector crates. This is the hexagonal split from arch-taste.md applied to Rust:

hakiri-core/
  src/
    schema.rs       # Arrow-compatible Schema, Field, DataType
    record.rs       # Batch (Arrow RecordBatch wrapper), Record
    cursor.rs       # opaque incremental-load token
    pipeline.rs     # PipelineSpec, SourceSpec, DestinationSpec
    error.rs        # thiserror enums
    ports/
      source.rs     # trait Source
      destination.rs # trait Destination
      transform.rs  # trait Transform
      catalog.rs    # trait Catalog (read/write pipeline metadata)
      clock.rs      # trait Clock (testable time)
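
As an illustration of what "pure" ports could look like, here is a minimal sketch using only the standard library. The trait names follow the file comments above; every method signature, the simplified `Cursor` and `Batch` types, and the `FixedClock` and `VecSource` helpers are assumptions made for this example, not the actual hakiri-core API. It uses a synchronous pull method where the real crate would presumably expose a stream.

```rust
use std::time::SystemTime;

/// Incremental-load token. Opaque in the real design; a plain counter here (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct Cursor(pub u64);

/// Stand-in for the Arrow RecordBatch wrapper: one column of strings (assumption).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Batch(pub Vec<String>);

/// Port: where records come from. Adapters implement this outside the domain crate.
pub trait Source {
    /// Position the source after `cursor`; `None` means a full load.
    fn open(&mut self, cursor: Option<Cursor>);
    /// Pull the next batch; `None` once the source is exhausted.
    fn next_batch(&mut self) -> Option<Batch>;
    /// Cursor covering everything yielded so far.
    fn cursor(&self) -> Cursor;
}

/// Port: where records land.
pub trait Destination {
    fn write(&mut self, table: &str, batch: Batch);
}

/// Port: testable time.
pub trait Clock {
    fn now(&self) -> SystemTime;
}

/// Deterministic clock for tests (hypothetical helper).
pub struct FixedClock(pub SystemTime);

impl Clock for FixedClock {
    fn now(&self) -> SystemTime {
        self.0
    }
}

/// In-memory adapter used only to exercise the Source port (hypothetical helper).
pub struct VecSource {
    batches: Vec<Batch>,
    pos: usize,
}

impl VecSource {
    pub fn new(batches: Vec<Batch>) -> Self {
        Self { batches, pos: 0 }
    }
}

impl Source for VecSource {
    fn open(&mut self, cursor: Option<Cursor>) {
        // Resume after the batches the cursor already covers.
        self.pos = cursor.map_or(0, |c| c.0 as usize);
    }

    fn next_batch(&mut self) -> Option<Batch> {
        let batch = self.batches.get(self.pos).cloned()?;
        self.pos += 1;
        Some(batch)
    }

    fn cursor(&self) -> Cursor {
        Cursor(self.pos as u64)
    }
}
```

Because the traits mention no runtime or network types, a test can drive a full incremental load with `VecSource` and `FixedClock` without pulling in any adapter crate.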

Data flow — one pipeline run

sequenceDiagram
  participant CLI as hakiri-cli
  participant R as Runner (hakiri-runtime)
  participant Cat as Catalog (SQLite)
  participant Src as Source connector
  participant Ctx as ContextStore
  participant T as Tracer (OTel)

  CLI->>R: run(pipeline_id)
  R->>Cat: load PipelineSpec + last Cursor
  R->>T: start span run.<id>
  R->>Src: open(cursor)
  loop until exhausted
    Src-->>R: RecordBatch
    R->>Ctx: write(table, batch)
    Note over R,Ctx: each batch = one Parquet file under runs/<run_id>/
  end
  R->>Ctx: commit snapshot
  R->>Cat: persist new Cursor + RunStatus
  R->>T: end span
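
The write-then-commit part of the sequence above can be sketched with an in-memory store. This is a simplified model under stated assumptions: `ContextStore` here is a toy stand-in for the real Parquet-backed store, `run`, `discard`, and `visible_batches` are hypothetical names, and cursor persistence and tracing are left out.

```rust
use std::collections::HashMap;

/// Simplified stand-ins for the domain types named in the diagram (assumptions).
pub type RunId = u32;
pub type Batch = Vec<&'static str>;

/// Toy context store: staged batches model files under runs/<run_id>/.
#[derive(Debug, Default)]
pub struct ContextStore {
    /// Written but not yet committed; invisible to readers.
    staged: HashMap<RunId, Vec<Batch>>,
    /// Committed snapshots, one per successful run.
    snapshots: Vec<(RunId, Vec<Batch>)>,
}

impl ContextStore {
    /// Stage one batch for `run`.
    pub fn write(&mut self, run: RunId, batch: Batch) {
        self.staged.entry(run).or_default().push(batch);
    }

    /// Publish everything staged for `run` in one step; returns the batch count.
    pub fn commit(&mut self, run: RunId) -> usize {
        let batches = self.staged.remove(&run).unwrap_or_default();
        let n = batches.len();
        self.snapshots.push((run, batches));
        n
    }

    /// Drop leftovers of a run that never committed; returns how many were dropped.
    pub fn discard(&mut self, run: RunId) -> usize {
        self.staged.remove(&run).map_or(0, |b| b.len())
    }

    /// Number of batches readers can see (committed only).
    pub fn visible_batches(&self) -> usize {
        self.snapshots.iter().map(|(_, b)| b.len()).sum()
    }
}

/// Pull batches one at a time, stage each, and commit only if the source
/// finished cleanly. An `Err` item models a failure mid-run.
pub fn run(
    store: &mut ContextStore,
    run_id: RunId,
    source: impl Iterator<Item = Result<Batch, String>>,
) -> Result<usize, String> {
    for item in source {
        let batch = item?; // abort before commit; staged batches stay behind
        store.write(run_id, batch);
    }
    Ok(store.commit(run_id))
}
```

A failed run leaves its staged batches in place but never touches the committed snapshots, which is the behaviour a later run relies on when it decides to resume or discard.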

Properties:

One run = one atomic snapshot in the context store. A crash mid-run leaves partial Parquet files under runs/<run_id>/, which the next run can either resume or discard.
Backpressure comes from the pull model. Sources return Stream<RecordBatch> and the runner pulls the next batch when it is ready for it, so there are no unbounded buffers.
Schema can evolve mid-run. New columns are appended (additive); compatible type changes are promoted (e.g. int→long). Breaking changes are rejected and surfaced as SchemaIncompatible errors the agent can react to.
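
The schema-evolution rule can be sketched as a small merge function. This is an illustration under assumptions: the three-variant `DataType`, the `Field` layout, the fields of `SchemaIncompatible`, and the `evolve` and `promote` names are hypothetical, and the only promotion modelled is the int→long example from the text.

```rust
/// Tiny subset of column types, enough to show the rule (assumption).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DataType {
    Int,
    Long,
    Utf8,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Field {
    pub name: String,
    pub ty: DataType,
}

/// Error surfaced when a change cannot be applied without breaking readers.
#[derive(Debug, PartialEq, Eq)]
pub struct SchemaIncompatible {
    pub column: String,
    pub from: DataType,
    pub to: DataType,
}

/// Identical types are kept, int and long unify to long, anything else is breaking.
fn promote(current: DataType, incoming: DataType) -> Option<DataType> {
    use DataType::*;
    match (current, incoming) {
        (a, b) if a == b => Some(a),
        (Int, Long) | (Long, Int) => Some(Long),
        _ => None,
    }
}

/// Merge an incoming batch schema into the current table schema.
pub fn evolve(current: &[Field], incoming: &[Field]) -> Result<Vec<Field>, SchemaIncompatible> {
    let mut merged = current.to_vec();
    for f in incoming {
        match merged.iter().position(|m| m.name == f.name) {
            // Known column: promote the type or reject the change.
            Some(i) => match promote(merged[i].ty, f.ty) {
                Some(ty) => merged[i].ty = ty,
                None => {
                    return Err(SchemaIncompatible {
                        column: f.name.clone(),
                        from: merged[i].ty,
                        to: f.ty,
                    })
                }
            },
            // New column: additive, appended at the end.
            None => merged.push(f.clone()),
        }
    }
    Ok(merged)
}
```

Returning the offending column and both types in the error gives an agent enough detail to propose a cast or a new column instead of retrying blindly.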

Concurrency model

tokio multi-threaded runtime
One pipeline run = one async task; many pipelines can run concurrently
Connectors that yield Stream<RecordBatch> get N-way parallelism only if they explicitly partition (e.g. Postgres logical decoding slots, HTTP paginated fetches with shards)
WASM components run on wasmtime’s async support; each component gets a fresh Store per run for isolation

Failure & retry

Errors are typed (thiserror): Transient, Permanent, SchemaIncompatible, AuthExpired, RateLimited
The runner has a single retry policy: exponential backoff with jitter, max 5 attempts, only for Transient and RateLimited
Permanent errors fail the run; the agent can read the trace and propose a fix
Partial writes are visible in the catalog as RunStatus::PartialFailure { written_batches: N }
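
The retry policy above could look roughly like the following sketch. The error variants and the five-attempt cap come from the list; the 100 ms base delay, the full-jitter formula, and the `retry`, `backoff`, and `is_retryable` names are assumptions. Randomness and sleeping are injected as closures so the example needs no external crates and stays testable.

```rust
use std::time::Duration;

/// Error classes from the list above (payloads omitted for brevity).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RunError {
    Transient,
    Permanent,
    SchemaIncompatible,
    AuthExpired,
    RateLimited,
}

impl RunError {
    /// Only these two classes are worth retrying.
    fn is_retryable(&self) -> bool {
        matches!(self, RunError::Transient | RunError::RateLimited)
    }
}

pub const MAX_ATTEMPTS: u32 = 5;
/// Assumed base delay; the source does not specify one.
const BASE_DELAY: Duration = Duration::from_millis(100);

/// Full-jitter backoff: `jitter` (clamped to 0.0..=1.0) times base * 2^(attempt - 1).
/// `attempt` is 1-based.
pub fn backoff(attempt: u32, jitter: f64) -> Duration {
    let ceiling = BASE_DELAY * 2u32.pow(attempt.saturating_sub(1));
    ceiling.mul_f64(jitter.clamp(0.0, 1.0))
}

/// Run `op` up to MAX_ATTEMPTS times, sleeping between retryable failures.
pub fn retry<T>(
    mut op: impl FnMut() -> Result<T, RunError>,
    mut jitter: impl FnMut() -> f64,
    mut sleep: impl FnMut(Duration),
) -> Result<T, RunError> {
    let mut attempt = 1;
    loop {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if e.is_retryable() && attempt < MAX_ATTEMPTS => {
                sleep(backoff(attempt, jitter()));
                attempt += 1;
            }
            // Non-retryable error, or attempts exhausted.
            Err(e) => return Err(e),
        }
    }
}
```

In production the `jitter` closure would draw from a random number generator and `sleep` would await a timer; in tests both can be replaced with deterministic stubs.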

What’s not in this picture (yet)

Distributed execution. v0 is single-process. Multi-node sharding ships when there’s a concrete use case demanding it.
Reverse ETL connectors. A destination is a connector, so this works in principle, but the M0 destinations are limited to the context store + a few common sinks.
Real-time CDC. Postgres logical replication is on the roadmap; Kafka is not (use a Kafka→Hakiri shim).

Open questions

Arrow as the canonical in-memory format? Arrow is the right call for interchange with DuckDB/Parquet, but its Rust crates (arrow-rs) are a heavy dependency. Investigate the arrow2 vs arrow-rs tradeoff for binary size.
Where does transformation live? Three options: (a) inside the source connector, (b) a separate Transform port like dlt’s @dlt.transformer, (c) push down into DuckDB after landing. Initial bias: (c) — land raw, transform in SQL. Revisit if perf demands.
Pipeline graph or linear? Singer treats pipelines as tap | target. dlt allows DAGs. Initial bias: linear source→[transform]*→destination; DAG support is a v2 concern.