Architecture
High-level shape
Section titled “High-level shape”flowchart LR
subgraph Author["Author (human or agent)"]
Agent[Agent / Claude Code]
Dev[Developer CLI]
end
subgraph Hakiri["hakiri (single binary)"]
MCP[MCP server]
Runtime[Pipeline runtime]
WasmHost["Wasmtime Component host"]
Catalog[(Catalog: SQLite)]
Context[(Context store: Parquet + DuckDB)]
Sync[Sync engine]
end
subgraph Connectors["Connectors"]
Built["Built-in (native Rust)"]
Wasm["User/agent-authored (WASM component)"]
end
subgraph Outside["Outside world"]
Sources["Sources (Postgres, HTTP, GitHub, …)"]
Bucket["S3-compatible bucket (R2)"]
end
Agent --> MCP
Dev --> Runtime
MCP --> Runtime
Runtime --> WasmHost
WasmHost --> Wasm
Runtime --> Built
Built --> Sources
Wasm --> Sources
Runtime --> Catalog
Runtime --> Context
Sync <--> Bucket
Sync --> Catalog
Sync --> Context
Components (bounded contexts → crates)
Section titled “Components (bounded contexts → crates)”| Crate | Responsibility | Key types | Depends on |
|---|---|---|---|
hakiri-core | Domain types and ports. No I/O. | Schema, Record, Batch, Cursor, PipelineSpec, ConnectorRef, RunId | — |
hakiri-runtime | Pipeline executor: schedules runs, orchestrates source→transform→destination, owns the WASM host. | Runner, WasmHost, RunContext | core |
hakiri-context | Local-first storage: SQLite for metadata/state, Parquet for table data, DuckDB as the query face. | ContextStore, TableRef, Snapshot | core |
hakiri-sync | Push/pull to S3-compatible bucket. Defines the wire format (manifest + parquet objects). | SyncClient, Manifest, ConflictResolution | core, context |
hakiri-agent | Built-in MCP server + connector scaffolder. | McpServer, Scaffolder | core, runtime, context |
hakiri-connectors | Built-in native connectors (postgres, http, github, openapi, file, s3, …). Each is a small sub-crate. | PostgresSource, HttpSource, … | core |
hakiri-cli | The hakiri binary; wires everything via dependency injection. | Cli, Command | all of the above |
The domain (hakiri-core) is pure: no tokio, no wasmtime, no reqwest. Adapters live in the runtime/context/sync/connector crates. This is the hexagonal split from arch-taste.md applied to Rust:
hakiri-core/ src/ schema.rs # Arrow-compatible Schema, Field, DataType record.rs # Batch (Arrow RecordBatch wrapper), Record cursor.rs # opaque incremental-load token pipeline.rs # PipelineSpec, SourceSpec, DestinationSpec error.rs # thiserror enums ports/ source.rs # trait Source destination.rs# trait Destination transform.rs # trait Transform catalog.rs # trait Catalog (read/write pipeline metadata) clock.rs # trait Clock (testable time)Data flow — one pipeline run
Section titled “Data flow — one pipeline run”sequenceDiagram
participant CLI as hakiri-cli
participant R as Runner (hakiri-runtime)
participant Cat as Catalog (SQLite)
participant Src as Source connector
participant Ctx as ContextStore
participant T as Tracer (OTel)
CLI->>R: run(pipeline_id)
R->>Cat: load PipelineSpec + last Cursor
R->>T: start span run.<id>
R->>Src: open(cursor)
loop until exhausted
Src-->>R: RecordBatch
R->>Ctx: write(table, batch)
Note over R,Ctx: each batch = one Parquet file under runs/<run_id>/
end
R->>Ctx: commit snapshot
R->>Cat: persist new Cursor + RunStatus
R->>T: end span
Properties:
- One run = one atomic snapshot in the context store. Crash mid-run leaves partial Parquet under
runs/<run_id>/which the next run can either resume or discard. - Backpressure is in the Source. Sources return
Stream<RecordBatch>; the runner pulls. No unbounded buffers. - Schema can evolve mid-run. New columns get appended (additive); type changes promote (e.g. int→long). Breaking changes are rejected and surfaced as
SchemaIncompatibleerrors the agent can react to.
Concurrency model
Section titled “Concurrency model”tokiomulti-threaded runtime- One pipeline run = one async task; many pipelines can run concurrently
- Connectors that yield
Stream<RecordBatch>get N-way parallelism only if they explicitly partition (e.g. Postgres logical decoding slots, HTTP paginated fetches with shards) - WASM components run on
wasmtime’s async support; each component gets a freshStoreper run for isolation
Failure & retry
Section titled “Failure & retry”- Errors are typed (
thiserror):Transient,Permanent,SchemaIncompatible,AuthExpired,RateLimited - The runner has a single retry policy: exponential backoff with jitter, max 5 attempts, only for
TransientandRateLimited Permanenterrors fail the run; the agent can read the trace and propose a fix- Partial writes are visible in the catalog as
RunStatus::PartialFailure { written_batches: N }
What’s not in this picture (yet)
Section titled “What’s not in this picture (yet)”- Distributed execution. v0 is single-process. Multi-node sharding ships when there’s a concrete use case demanding it.
- Reverse ETL connectors. A destination is a connector, so this works in principle, but the M0 destinations are limited to the context store + a few common sinks.
- Real-time CDC. Postgres logical replication is on the roadmap; Kafka is not (use a Kafka→Hakiri shim).
Open questions
Section titled “Open questions”- Arrow as the canonical in-memory format? Arrow is the right call for interchange with DuckDB/Parquet. But Arrow’s Rust crates (
arrow-rs) carry weight. Investigatearrow2vsarrow-rstradeoff for binary size. - Where does transformation live? Three options: (a) inside the source connector, (b) a separate
Transformport like dlt’s@dlt.transformer, (c) push down into DuckDB after landing. Initial bias: (c) — land raw, transform in SQL. Revisit if perf demands. - Pipeline graph or linear? Singer treats pipelines as
tap | target. dlt allows DAGs. Initial bias: linear source→[transform]*→destination; DAG support is a v2 concern.