Connector Model

The connector model is the load-bearing decision of the project. It determines:

  • whether agents can reliably author connectors,
  • whether user-authored code is safe to run,
  • whether connectors can ship out-of-band from the binary,
  • whether the same connector code can run on a laptop and on Cloudflare Workers.

All connectors, whether shipped in-tree or authored later, conform to a single WIT (Wasm Interface Type) world. Built-in connectors are still compiled as native Rust for speed, but they implement the same Rust trait that the WIT bindings generate, so the runtime treats them uniformly.

Rationale, alternatives considered, and trade-offs: see ADR-0001.

wit/connector.wit
package hakiri:connector@0.1.0;
interface types {
    record schema-field {
        name: string,
        data-type: data-type,
        nullable: bool,
    }
    variant data-type {
        boolean,
        int32,
        int64,
        float64,
        %string, // `string` is a WIT keyword, so the case name is escaped
        bytes,
        timestamp-millis,
        %list(data-type), // `list` is a WIT keyword, so the case name is escaped
        struct(list<schema-field>),
        %json, // opaque JSON, deferred typing
    }
    record schema {
        name: string,
        fields: list<schema-field>,
        primary-key: list<string>,
    }
    // Arrow IPC-encoded RecordBatch (host and guest agree on Arrow schema separately)
    type record-batch = list<u8>;
    type cursor = list<u8>; // opaque; connector owns the format
    variant error {
        transient(string),
        permanent(string),
        auth-expired(string),
        rate-limited(u32), // retry-after seconds
        schema-incompatible(string),
    }
}
interface source {
    use types.{schema, record-batch, cursor, error};
    /// Return the schemas this source can produce.
    discover: func() -> result<list<schema>, error>;
    /// Open a read stream against one table, optionally resuming.
    open: func(table: string, cursor: option<cursor>) -> result<read-handle, error>;
    resource read-handle {
        /// Pull the next batch. Returns none when exhausted.
        next: func() -> result<option<record-batch>, error>;
        /// Current cursor (callable any time; advances as next() yields batches).
        cursor: func() -> cursor;
    }
}
interface destination {
    use types.{schema, record-batch, error};
    prepare: func(table: string, schema: schema) -> result<_, error>;
    write: func(table: string, batch: record-batch) -> result<_, error>;
    commit: func() -> result<_, error>;
    abort: func() -> result<_, error>;
}
world connector {
    // Host capabilities granted to the component (each is opt-in per connector)
    import wasi:http/outgoing-handler@0.2.0;
    import wasi:logging/logging@0.1.0;
    import wasi:clocks/wall-clock@0.2.0;
    // No wasi:filesystem, no wasi:sockets; connectors talk to the world via host-mediated HTTP.
    export source;
    export destination;
}

The connector world is split in practice into two narrower worlds — source and destination — so a connector that’s only a source doesn’t have to stub out destination methods. Both worlds share the types interface.
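The pull contract of the source interface (next() until none, with a cursor that advances as batches are yielded) can be sketched in plain Rust. This is a minimal illustration, not the SDK's API: the ReadHandle trait, MemoryHandle, drain, and the 8-byte big-endian cursor encoding are all assumptions made up for the example, and batches are plain byte vectors rather than real Arrow IPC payloads.

```rust
/// Opaque resume token; mirrors `cursor` in the WIT (the connector owns the format).
type Cursor = Vec<u8>;
/// Stand-in for an Arrow IPC-encoded batch; mirrors `record-batch`.
type RecordBatch = Vec<u8>;

/// Reduced, hypothetical form of the WIT `error` variant.
#[derive(Debug, PartialEq)]
enum Error {
    Permanent(String),
}

/// Hypothetical Rust-side shape of the `read-handle` resource.
trait ReadHandle {
    fn next(&mut self) -> Result<Option<RecordBatch>, Error>;
    fn cursor(&self) -> Cursor;
}

/// Toy in-memory source: one row per batch. The cursor is the index of the
/// next unread row, encoded as 8 big-endian bytes (an assumption of this sketch).
struct MemoryHandle {
    rows: Vec<RecordBatch>,
    pos: usize,
}

impl MemoryHandle {
    fn open(rows: Vec<RecordBatch>, cursor: Option<Cursor>) -> Result<Self, Error> {
        let pos = match cursor {
            None => 0,
            Some(bytes) => {
                if bytes.len() != 8 {
                    return Err(Error::Permanent("malformed cursor".to_string()));
                }
                let mut raw = [0u8; 8];
                raw.copy_from_slice(&bytes);
                u64::from_be_bytes(raw) as usize
            }
        };
        Ok(MemoryHandle { rows, pos })
    }
}

impl ReadHandle for MemoryHandle {
    fn next(&mut self) -> Result<Option<RecordBatch>, Error> {
        match self.rows.get(self.pos) {
            Some(batch) => {
                let batch = batch.clone();
                self.pos += 1; // the cursor advances only after a batch is yielded
                Ok(Some(batch))
            }
            None => Ok(None), // exhausted
        }
    }

    fn cursor(&self) -> Cursor {
        (self.pos as u64).to_be_bytes().to_vec()
    }
}

/// Host-side loop: pull at most `max_batches`, then return the cursor to persist.
fn drain<H: ReadHandle>(
    handle: &mut H,
    max_batches: usize,
) -> Result<(Vec<RecordBatch>, Cursor), Error> {
    let mut out = Vec::new();
    while out.len() < max_batches {
        match handle.next()? {
            Some(batch) => out.push(batch),
            None => break,
        }
    }
    Ok((out, handle.cursor()))
}
```

Passing the cursor returned by drain back into open resumes at the first unread batch, which is the round-trip property a contract test would check.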

A connector’s manifest declares what host capabilities it needs; the host refuses to load it without explicit approval.

.hakiri/connectors/github/connector.toml
[connector]
name = "github"
version = "0.3.1"
kind = "source"
wasm = "./github.wasm"
[capabilities]
http = ["api.github.com"] # outbound HTTP allowlist
env = ["GITHUB_TOKEN"] # env vars the connector can read
clock = true

Wildcards (*.github.com) are supported. There is no escape hatch — a connector that needs unlisted access fails to load with a clear diagnostic.
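As an illustration of how a host could check an outbound request against the http allowlist, here is a small sketch. The function names are hypothetical, and the wildcard semantics are an assumption: the text only says wildcards are supported, so this sketch takes *.github.com to cover any subdomain but not the bare github.com.

```rust
/// Returns true when `host` is covered by a single allowlist `pattern`.
/// Assumed semantics: `*.example.com` matches any subdomain of example.com,
/// but not `example.com` itself and not look-alikes such as `evil-example.com`.
fn pattern_matches(pattern: &str, host: &str) -> bool {
    let pattern = pattern.to_ascii_lowercase();
    let host = host.to_ascii_lowercase();
    match pattern.strip_prefix("*.") {
        Some(suffix) => {
            // Require at least one label plus a dot in front of the suffix.
            host.len() > suffix.len() + 1
                && host.ends_with(suffix)
                && host.as_bytes()[host.len() - suffix.len() - 1] == b'.'
        }
        None => host == pattern,
    }
}

/// Hypothetical host-side gate: allow the request or return a diagnostic.
fn check_outbound(allowlist: &[&str], host: &str) -> Result<(), String> {
    if allowlist.iter().any(|p| pattern_matches(p, host)) {
        Ok(())
    } else {
        Err(format!(
            "host `{}` is not in the connector's http allowlist {:?}",
            host, allowlist
        ))
    }
}
```

Checking the byte before the suffix is what keeps evil-github.com from matching *.github.com.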

The same TOML/JSON duality from pipeline manifests applies (see 03-pipelines.md). connector.toml is the canonical hand-edited form; connector.json (validated against hakiri schema export connector) is the agent-authoring path. Both deserialize into the same ConnectorManifest struct.

Capability grants are themselves declarative: the manifest describes what access the connector needs; the host decides whether to grant it. There is no runtime request_permission() call, no imperative escalation path. This keeps capability review tractable for both humans (read the TOML) and agents (validate against the JSON Schema before install).

Capability allowlists prevent connectors from reaching the wrong things. Resource limits prevent them from consuming unbounded host resources — infinite loops, megabyte-per-second log floods, regex catastrophic backtracking, runaway allocations. wasmtime exposes the knobs (Store::limiter, fuel metering, async deadlines); the host configures defaults per call.

| Limit | Default | Override |
| --- | --- | --- |
| Linear memory | 256 MiB | per-connector via manifest, max 2 GiB |
| Fuel (instructions) | 10⁹ units per next() / discover() call | per-call via runtime config |
| Wall-clock deadline | 30s per next(), 60s per discover() | per-pipeline via manifest |
| Log volume | 1 MiB/sec per connector via wasi:logging | hard cap; no override |
| Outbound HTTP | 100 in-flight requests per connector | per-connector via manifest |
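The defaults in the table can be written down as data. The sketch below is only one way to model them: the Limits and ManifestOverrides types and the resolve function are hypothetical names, only the two manifest-level per-connector overrides are modelled, and rejecting (rather than clamping) a memory request above 2 GiB is an assumption. The log-volume cap has no override field because the table marks it as a hard cap.

```rust
use std::time::Duration;

const MIB: u64 = 1024 * 1024;
/// Ceiling for the per-connector memory override (2 GiB).
const MAX_MEMORY_BYTES: u64 = 2048 * MIB;

/// Hypothetical per-connector limits, populated from the defaults in the table.
#[derive(Debug, Clone, PartialEq)]
struct Limits {
    memory_bytes: u64,
    fuel_per_call: u64,
    next_deadline: Duration,
    discover_deadline: Duration,
    log_bytes_per_sec: u64,
    max_in_flight_http: u32,
}

impl Default for Limits {
    fn default() -> Self {
        Limits {
            memory_bytes: 256 * MIB,
            fuel_per_call: 1_000_000_000,
            next_deadline: Duration::from_secs(30),
            discover_deadline: Duration::from_secs(60),
            log_bytes_per_sec: MIB,
            max_in_flight_http: 100,
        }
    }
}

/// Hypothetical per-connector overrides read from the manifest.
#[derive(Debug, Default)]
struct ManifestOverrides {
    memory_bytes: Option<u64>,
    max_in_flight_http: Option<u32>,
}

/// Apply manifest overrides on top of the defaults.
/// Assumption: a request above the 2 GiB ceiling is rejected, not clamped.
fn resolve(overrides: &ManifestOverrides) -> Result<Limits, String> {
    let mut limits = Limits::default();
    if let Some(bytes) = overrides.memory_bytes {
        if bytes > MAX_MEMORY_BYTES {
            return Err(format!(
                "memory override of {} bytes exceeds the 2 GiB maximum",
                bytes
            ));
        }
        limits.memory_bytes = bytes;
    }
    if let Some(n) = overrides.max_in_flight_http {
        limits.max_in_flight_http = n;
    }
    Ok(limits)
}
```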

A connector hitting a limit returns a transient error to the runtime, which may retry with backoff. Persistent limit hits surface as Permanent::ResourceExhausted and fail the run fast. Limit decisions are recorded in OTel spans so agents and operators see why a run failed without grepping logs.
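A rough sketch of that retry decision follows. Everything here is illustrative: the Decision type and decide function are hypothetical, the 500 ms base delay and the cap on doubling are made-up parameters, and representing the exhausted case as a permanent error with a message is an assumption about how Permanent::ResourceExhausted maps onto the WIT error variant.

```rust
use std::time::Duration;

/// Rust mirror of the WIT `error` variant.
#[derive(Debug, Clone, PartialEq)]
enum Error {
    Transient(String),
    Permanent(String),
    AuthExpired(String),
    RateLimited(u32), // retry-after seconds
    SchemaIncompatible(String),
}

/// What the runtime does after a failed call (hypothetical type).
#[derive(Debug, PartialEq)]
enum Decision {
    RetryAfter(Duration),
    Fail(Error),
}

/// Decide how to react to `err` after `failures` consecutive failed attempts
/// (1 for the first failure). Base delay and cap are illustrative values.
fn decide(err: Error, failures: u32, max_attempts: u32) -> Decision {
    if failures >= max_attempts {
        return match err {
            // A limit that keeps being hit stops being treated as transient.
            Error::Transient(msg) => Decision::Fail(Error::Permanent(format!(
                "resource exhausted after {} attempts: {}",
                failures, msg
            ))),
            other => Decision::Fail(other),
        };
    }
    match err {
        Error::Transient(_) => {
            // Exponential backoff: 500 ms, 1 s, 2 s, ... capped at 32 s.
            let shift = failures.saturating_sub(1).min(6);
            Decision::RetryAfter(Duration::from_millis(500u64 << shift))
        }
        // Honor the retry-after hint carried by the error.
        Error::RateLimited(secs) => Decision::RetryAfter(Duration::from_secs(u64::from(secs))),
        // Auth, schema, and permanent errors are not retried.
        other => Decision::Fail(other),
    }
}
```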

The dep allowlist for in-tree connector authoring bans known footguns: fancy-regex (catastrophic backtracking) in favor of regex (linear-time); raw serde_json::Value for arbitrary input (depth-DoS) in favor of serde_json::de::from_slice with MaxDepth.

Three personas, same WIT contract:

hakiri agent scaffold-connector \
--spec ./openapi.json \
--kind source \
--name shopify

This walks an OpenAPI/AsyncAPI/SQL spec and emits:

  • A WIT-compliant Rust crate under connectors/shopify/
  • A connector.toml with inferred capabilities
  • A round-trip test that exercises discover() against a recorded fixture

The MCP server exposes this as a tool; an agent calls it, reviews the diff, runs hakiri connector build && hakiri connector test, and commits.

The host owns the WIT; the agent fills only the Rust impl. Scaffold emits WIT bindings from a fixed template (parameterized on spec + connector name); the agent’s surface area is the Rust trait methods, not the contract. This eliminates the most common failure mode in current frontier models — producing syntactically-plausible but semantically-wrong WIT (invented types, wrong package versions, mixed pre/post-0.2 import paths). WIT is immutable from the agent’s perspective; the host upgrades it across releases.

Direct use of hakiri-connector-sdk:

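use hakiri_connector_sdk::{source, Cursor, Error, RecordBatch, Schema, Source, Stream};
#[source]
struct GithubIssues {
    token: String,
}
impl Source for GithubIssues {
    // Return the schemas this source can produce.
    fn discover(&self) -> Result<Vec<Schema>, Error> {
        todo!() // body elided in the original
    }
    // Open a read stream against one table, optionally resuming from a cursor.
    fn open(&self, table: &str, cursor: Option<Cursor>) -> Result<Stream, Error> {
        todo!() // body elided in the original
    }
}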

The macro emits wit-bindgen glue and a cdylib target.

Use componentize-py, componentize-js, or TinyGo against the same WIT file. We ship example skeletons but won’t maintain non-Rust SDKs as first-class.

Compiled into the binary (no WASM round-trip) because they’re hot paths:

  • Sources: postgres (snapshot + logical decoding), http (REST, OpenAPI-driven), file (CSV/JSON/Parquet/NDJSON on local fs or S3), github (Issues/PRs/repos/comments), s3 (object listing + content)
  • Destinations: context (the local context store; this is the default), parquet (write Parquet to a path or S3), duckdb (write into an external DuckDB file), webhook (POST to a URL)
  • Transforms: kept minimal in M1 — select, rename, cast, filter. Anything more interesting belongs in DuckDB SQL post-landing.

Built-in connectors implement the same Rust trait as the WIT bindings emit; the runtime is agnostic to which one it’s calling.

Target connector inventory (agent-authored, M2+)

Per Challenge 2 in PRD.md: the connector-count race against Airbyte (300+) and Fivetran (500+) is unwinnable on person-hours and largely misleading — most catalog entries are mediocre or unmaintained. Hakiri’s win condition is the 25–35 sources a team actually queries all being authorable, drift-detectable, and maintainable via the M2 agent loop, not a 300-tile marketplace.

This is the working target list. Each row is a candidate for the M2 agent-authoring eval — given the public OpenAPI or API documentation URL, can an agent produce a working WASM connector with passing dry-runs in one MCP conversation? Selection criteria: high adoption among the agent-builder + small-team audience (PRD § Target audience); public API docs; OAuth or token auth; no SOAP, no proprietary binary protocols requiring vendor SDKs.

| Category | Source | Auth | OpenAPI | Pagination | Tier |
| --- | --- | --- | --- | --- | --- |
| Engineering / observability | GitHub | PAT / OAuth / App | | cursor + link-header | built-in (M1) |
| | GitLab | PAT / OAuth | | page + cursor | M2 |
| | Jira | PAT / OAuth (Atlassian) | ✓ (REST v3) | offset + cursor | M2 |
| | Linear | PAT / OAuth | partial (GraphQL SDL) | cursor | M2 |
| | PagerDuty | API key | | offset | M2 |
| | Datadog | API key + app key | | cursor + time-window | M2 |
| | Sentry | PAT / DSN | | link-header | M2 |
| Collaboration / docs | Slack | OAuth | partial (Web API method docs) | cursor | M2 |
| | Notion | OAuth | | cursor | M2 |
| | Google Drive | OAuth | ✓ (via Discovery) | page-token | M2 |
| | Confluence | PAT / OAuth (Atlassian) | | offset + cursor | M2 |
| Customer ops / support | Zendesk | OAuth / API token | | offset + cursor | M2 |
| | Intercom | OAuth | | cursor | M2 |
| Sales / CRM | Salesforce | OAuth (REST) | partial (REST only; SOQL custom) | offset + query locator | M2 |
| | HubSpot | OAuth / Private app | | cursor | M2 |
| | Pipedrive | API token | | offset | M2 |
| Product analytics | PostHog | API key | | offset + cursor | M2 |
| | Mixpanel | service account | ✓ partial | time-window export | M2 |
| | Amplitude | API key + secret | ✓ partial | time-window export | M2 |
| | Segment | API token | ✓ partial | cursor | M2 |
| Payments / billing | Stripe | API key | | cursor | M2 |
| | Chargebee | API key | | offset + cursor | M2 |
| | QuickBooks | OAuth (Intuit) | partial | offset | M2.5 (per-realm OAuth complexity) |
| Marketing | Mailchimp | OAuth / API key | | offset | M2 |
| | Marketo | OAuth | | offset + paging token | M2.5 |
| Recruiting / HR | Greenhouse | API key | | offset + link-header | M2 |
| | BambooHR | API key | partial | offset | M2 |
| Files / data | Airtable | PAT / OAuth | | cursor token | M2 |
| | Shopify | OAuth | | cursor + link-header | M2 |
| | Google Sheets | OAuth | ✓ (via Discovery) | range-read | M2 |
| Databases (as source) | Postgres | password / cert | n/a (wire protocol) | snapshot + logical decoding | built-in (M1) |
| | MySQL | password / cert | n/a | snapshot + binlog | M2 |
| | MongoDB | connection string | n/a | snapshot + change stream | M2 |
| | Snowflake | OAuth / key-pair | n/a (SQL API + JDBC) | snapshot + stream | M2.5 |
| | BigQuery | service account | n/a (REST Storage API) | partition + stream | M2.5 |
| Object storage (as source) | S3 | AWS creds | n/a | listing | built-in (M1) |
| | GCS | service account | n/a (XML/JSON) | listing | M2 |
| | Azure Blob | shared key / SAS | n/a | listing | M2 |

Five built-in (M1), ~25 agent-authored REST + OpenAPI (M2), ~5 partial-OpenAPI or auth-complex (M2.5), ~5 binary-protocol databases / object stores (M2 or M2.5 depending on driver maturity).

The inventory falls into four scaffolding shapes; the M2 agent scaffolder branches on shape, not on source:

  • REST + OpenAPI (Stripe, GitHub, Notion, Datadog, PostHog, Jira, Zendesk, …): the primary path. hakiri agent scaffold-connector --spec <openapi.json> produces ~90% of the connector. The agent fills the auth flow, pagination quirks, and the cursor-kind declaration.
  • REST without machine-readable OpenAPI (Slack Web API, parts of Mailchimp, BambooHR): the agent scaffolds from API documentation in HTML or Markdown; lower automation, more iteration. M2.5 target with a lower per-call success rate in the eval harness.
  • GraphQL (Linear primarily, Shopify alt-path): scaffolded from an SDL schema dump. Same WIT contract; different fetch shape, different pagination idioms (relay-style cursors).
  • Database / binary protocols (Postgres, MySQL, MongoDB): hand-written Rust against established drivers (tokio-postgres, mysql_async, mongodb). Not agent-scaffolded; ship as built-ins or as community Rust crates.

The M2 eval target — 60% reach discover() working, 30% reach full contract conformance (pm/roadmap.md M2) — is calibrated against the REST + OpenAPI subset. The other three shapes are tracked separately so a single weak shape doesn’t poison the headline metric.

| Shape | Count | Authoring path | Maintenance |
| --- | --- | --- | --- |
| Built-in Rust (hot paths, native trait) | 5–7 | Hakiri team | Code review on every change |
| Agent-authored REST + OpenAPI | ~25 | Agent scaffolds; human reviews diff | hakiri connector check-drift on schedule |
| Agent-authored without OpenAPI | ~5 | Agent scaffolds from docs / samples | Same; lower automation |
| Community-contributed Rust / WASM | open-ended | Third-party authors against hakiri-connector-sdk | Maintainer’s responsibility; provenance recorded in catalog |

Hakiri does not aim to be a connector marketplace (Challenge 2 § The trap to avoid). The inventory above is the operational target — what should exist by the end of M2 for the agent-builder + small-team audience to be productive. Anything beyond is community-contributed under the same WIT contract, with provenance and capability declarations recorded in the catalog like any other connector.

  • In-tree: built into the binary; resolved by name.
  • Local file: wasm = "./path/to/connector.wasm" in the project.
  • HTTPS URL: wasm = "https://hakiri.dev/connectors/shopify-0.3.1.wasm", with a SHA-256 pin.
  • OCI registry: wasm = "oci://ghcr.io/owner/shopify:0.3.1". Components are valid OCI artifacts.

Pin-by-hash is enforced; an unpinned URL fails CI.
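The four resolution forms and the pin rule can be sketched as a small classifier. This is an assumption-heavy illustration: the WasmRef type, the resolve function, and a separate sha256 manifest value holding the pin are all hypothetical (the text does not say how the pin is spelled), plain http:// is rejected here as a guess, and OCI references are passed through without modelling digest pinning.

```rust
/// Where a connector's component comes from (hypothetical type).
#[derive(Debug, PartialEq)]
enum WasmRef {
    InTree(String),
    LocalFile(String),
    Https { url: String, sha256: String },
    Oci(String),
}

/// A SHA-256 digest rendered as hex is exactly 64 hex characters.
fn is_sha256_hex(s: &str) -> bool {
    s.len() == 64 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// Classify the manifest's `wasm` value and enforce the pin on HTTPS URLs.
/// `sha256` is a hypothetical manifest field carrying the pin.
fn resolve(name: &str, wasm: Option<&str>, sha256: Option<&str>) -> Result<WasmRef, String> {
    let wasm = match wasm {
        // No `wasm` entry: treat the connector as in-tree, resolved by name.
        None => return Ok(WasmRef::InTree(name.to_string())),
        Some(w) => w,
    };
    if wasm.starts_with("oci://") {
        return Ok(WasmRef::Oci(wasm.to_string()));
    }
    if wasm.starts_with("https://") {
        return match sha256 {
            Some(pin) if is_sha256_hex(pin) => Ok(WasmRef::Https {
                url: wasm.to_string(),
                sha256: pin.to_ascii_lowercase(),
            }),
            Some(_) => Err("sha256 pin must be 64 hex characters".to_string()),
            None => Err(format!("unpinned URL `{}`: add a sha256 pin", wasm)),
        };
    }
    if wasm.starts_with("http://") {
        // Assumption: only HTTPS is accepted for remote components.
        return Err(format!("plain http is not accepted: `{}`", wasm));
    }
    Ok(WasmRef::LocalFile(wasm.to_string()))
}
```

A CI check along these lines only validates that a well-formed pin is present; verifying the downloaded bytes against the digest is a separate step not shown here.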

  • Contract tests ship with the SDK: feed every connector a synthetic WIT-conformance test (discover returns valid schemas, open returns a finite stream, cursor round-trips).
  • Fixture replay: record real HTTP interactions to .hakiri/cassettes/ (VCR-style), replay them in CI.
  • Property tests via proptest for cursor monotonicity and schema invariants.
  • WASI 0.2 vs 0.3. As of writing, wasmtime ships solid 0.2; 0.3 (with native async) is landing. Default to 0.2 with a migration plan.
  • Component pooling. Cold start takes on the order of milliseconds; for fast-firing pipelines we may want to pool component instances. Defer until measurement says it matters.
  • Connector signing. Sigstore-style transparency log for community-published connectors. Worth considering for the marketplace narrative.