Connector Model

The connector model is the load-bearing decision of the project. It determines:

whether agents can reliably author connectors,
whether user-authored code is safe to run,
whether connectors can ship out-of-band from the binary,
whether the same connector code can run on a laptop and on Cloudflare Workers.

The contract: WIT + WASM Component Model

All connectors — whether shipped in-tree or authored later — conform to a single WIT (WebAssembly Interface Type) world. Built-in connectors are still compiled as native Rust for speed, but they implement the same Rust trait that the WIT bindings generate, so the runtime treats them uniformly.

Rationale, alternatives considered, and trade-offs: see ADR-0001.

The WIT world (sketch)

package hakiri:connector@0.1.0;

interface types {
  record schema-field {
    name: string,
    data-type: data-type,
    nullable: bool,
  }

  variant data-type {
    boolean,
    int32,
    int64,
    float64,
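    %string,         // `string` is a WIT keyword, so the case name is escaped
    bytes,
    timestamp-millis,
    %list(data-type), // `list` is a keyword too; WIT currently rejects recursive types, so this nesting is illustrative
    struct(list<schema-field>),
    %json,           // opaque JSON, deferred typing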
  }

  record schema {
    name: string,
    fields: list<schema-field>,
    primary-key: list<string>,
  }

  // Arrow IPC-encoded RecordBatch (host and guest agree on Arrow schema separately)
  type record-batch = list<u8>;

  type cursor = list<u8>;       // opaque; connector owns the format

  variant error {
    transient(string),
    permanent(string),
    auth-expired(string),
    rate-limited(u32),           // retry-after seconds
    schema-incompatible(string),
  }
}

interface source {
  use types.{schema, record-batch, cursor, error};

  /// Return the schemas this source can produce.
  discover: func() -> result<list<schema>, error>;

  /// Open a read stream against one table, optionally resuming.
  open: func(table: string, cursor: option<cursor>) -> result<read-handle, error>;

  resource read-handle {
    /// Pull the next batch. Returns none when exhausted.
    next: func() -> result<option<record-batch>, error>;
    /// Current cursor (callable any time; advances as next() yields batches).
    cursor: func() -> cursor;
  }
}

interface destination {
  use types.{schema, record-batch, error};

  prepare: func(table: string, schema: schema) -> result<_, error>;
  write:   func(table: string, batch: record-batch) -> result<_, error>;
  commit:  func() -> result<_, error>;
  abort:   func() -> result<_, error>;
}

world connector {
  // Host capabilities granted to the component (each is opt-in per connector)
  import wasi:http/outgoing-handler@0.2.0;
  import wasi:logging/logging@0.1.0;
  import wasi:clocks/wall-clock@0.2.0;
  // No wasi:filesystem, no wasi:sockets — connectors talk to the world via host-mediated HTTP.

  export source;
  export destination;
}

The connector world is split in practice into two narrower worlds — source and destination — so a connector that’s only a source doesn’t have to stub out destination methods. Both worlds share the types interface.
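
How a host might drive these interfaces end to end can be sketched in plain Rust. The traits below are hand-written stand-ins for what the WIT bindings would generate, not the generated code itself; VecHandle, MemDest, drain, and run_pipeline are hypothetical names, prepare is omitted for brevity, and batches are plain byte vectors rather than Arrow IPC. The cursor encoding (big-endian u64 index) is likewise an assumption, since the contract leaves the format to the connector.

```rust
// Hand-written stand-ins for the generated bindings (assumption: the real
// trait names and signatures come from wit-bindgen and may differ).
type RecordBatch = Vec<u8>; // toy payload; the real one is Arrow IPC bytes
type Cursor = Vec<u8>;

#[derive(Debug, PartialEq)]
enum Error {
    Transient(String),
    Permanent(String),
}

trait ReadHandle {
    /// Pull the next batch; `Ok(None)` means the stream is exhausted.
    fn next(&mut self) -> Result<Option<RecordBatch>, Error>;
    /// Cursor covering every batch yielded so far.
    fn cursor(&self) -> Cursor;
}

trait Destination {
    fn write(&mut self, table: &str, batch: RecordBatch) -> Result<(), Error>;
    fn commit(&mut self) -> Result<(), Error>;
    fn abort(&mut self) -> Result<(), Error>;
}

/// Toy read handle over in-memory batches; the cursor encodes the next index.
struct VecHandle {
    batches: Vec<RecordBatch>,
    pos: usize,
}

impl VecHandle {
    /// Resume from `cursor` when it holds a valid 8-byte offset, else start at 0.
    fn open(batches: Vec<RecordBatch>, cursor: Option<Cursor>) -> Self {
        let mut raw = [0u8; 8];
        if let Some(c) = cursor.filter(|c| c.len() == 8) {
            raw.copy_from_slice(&c);
        }
        VecHandle { batches, pos: u64::from_be_bytes(raw) as usize }
    }
}

impl ReadHandle for VecHandle {
    fn next(&mut self) -> Result<Option<RecordBatch>, Error> {
        let batch = self.batches.get(self.pos).cloned();
        if batch.is_some() {
            self.pos += 1;
        }
        Ok(batch)
    }
    fn cursor(&self) -> Cursor {
        (self.pos as u64).to_be_bytes().to_vec()
    }
}

/// Toy destination that stages writes and only exposes them on commit.
#[derive(Default)]
struct MemDest {
    staged: Vec<RecordBatch>,
    committed: Vec<RecordBatch>,
    fail_on_write: bool,
}

impl Destination for MemDest {
    fn write(&mut self, _table: &str, batch: RecordBatch) -> Result<(), Error> {
        if self.fail_on_write {
            return Err(Error::Permanent("write rejected".to_string()));
        }
        self.staged.push(batch);
        Ok(())
    }
    fn commit(&mut self) -> Result<(), Error> { self.committed.append(&mut self.staged); Ok(()) }
    fn abort(&mut self) -> Result<(), Error> { self.staged.clear(); Ok(()) }
}

/// Pull every batch from the handle, write it, then commit.
fn drain(table: &str, handle: &mut dyn ReadHandle, dest: &mut dyn Destination) -> Result<(), Error> {
    while let Some(batch) = handle.next()? {
        dest.write(table, batch)?;
    }
    dest.commit()
}

/// The cursor is returned only after a successful commit, so a failed run
/// resumes from the previous checkpoint instead of skipping data.
fn run_pipeline(table: &str, handle: &mut dyn ReadHandle, dest: &mut dyn Destination) -> Result<Cursor, Error> {
    match drain(table, handle, dest) {
        Ok(()) => Ok(handle.cursor()),
        Err(e) => {
            let _ = dest.abort(); // best-effort rollback; the caller sees the original error
            Err(e)
        }
    }
}
```

Persisting the cursor only after commit is one reasonable ordering; a host that checkpoints mid-stream would need the destination to be idempotent on replay.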

Capability grants

A connector’s manifest declares what host capabilities it needs; the host refuses to load it without explicit approval.

[connector]
name = "github"
version = "0.3.1"
kind = "source"
wasm = "./github.wasm"

[capabilities]
http = ["api.github.com"]      # outbound HTTP allowlist
env = ["GITHUB_TOKEN"]          # env vars the connector can read
clock = true

Wildcards (*.github.com) are supported. There is no escape hatch — a connector that needs unlisted access fails to load with a clear diagnostic.

The same TOML/JSON duality from pipeline manifests applies (see 03-pipelines.md). connector.toml is the canonical hand-edited form; connector.json (validated against hakiri schema export connector) is the agent-authoring path. Both deserialize into the same ConnectorManifest struct.

Capability grants are themselves declarative: the manifest describes what access the connector needs; the host decides whether to grant it. There is no runtime request_permission() call, no imperative escalation path. This keeps capability review tractable for both humans (read the TOML) and agents (validate against the JSON Schema before install).
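
The HTTP allowlist check itself is small. A minimal sketch follows, assuming that a wildcard such as *.github.com covers any depth of subdomain but not the apex github.com (the manifest format above does not say either way), and that matching is case-insensitive. host_allowed is a hypothetical name, not part of any existing API.

```rust
/// Returns true if `host` is covered by any pattern in the allowlist.
/// Assumption: `*.example.com` matches subdomains of any depth but not the
/// apex `example.com`; the manifest spec does not state apex behaviour.
fn host_allowed(allowlist: &[&str], host: &str) -> bool {
    let host = host.to_ascii_lowercase();
    allowlist.iter().any(|pattern| {
        let pattern = pattern.to_ascii_lowercase();
        match pattern.strip_prefix("*.") {
            // Wildcard: the host must end in ".<suffix>" with a non-empty label before it.
            Some(suffix) => host
                .strip_suffix(suffix)
                .and_then(|rest| rest.strip_suffix('.'))
                .map_or(false, |label| !label.is_empty()),
            // No wildcard: exact match only.
            None => host == pattern,
        }
    })
}
```

Requiring the dot before the suffix is what keeps a host like evilgithub.com from matching *.github.com.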

Resource limits

Capability allowlists prevent connectors from reaching the wrong things. Resource limits prevent them from consuming unbounded host resources — infinite loops, megabyte-per-second log floods, regex catastrophic backtracking, runaway allocations. wasmtime exposes the knobs (Store::limiter, fuel metering, async deadlines); the host configures defaults per call.

Limit	Default	Override
Linear memory	256 MiB	per-connector via manifest, max 2 GiB
Fuel (instructions)	10⁹ units per `next()` / `discover()` call	per-call via runtime config
Wall-clock deadline	30s per `next()`, 60s per `discover()`	per-pipeline via manifest
Log volume	1 MiB/sec per connector via `wasi:logging`	hard cap; no override
Outbound HTTP	100 in-flight requests per connector	per-connector via manifest
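
The log-volume cap in the table could be enforced with a token bucket over bytes. The sketch below is one possible shape, not the host's actual limiter: LogBudget and try_log are hypothetical names, the burst size is assumed to equal one second of budget, and time is passed in explicitly so the behaviour is deterministic.

```rust
/// Token bucket over bytes: refills at `rate` bytes per second and holds at
/// most one second's worth (assumption: burst size equals the per-second rate).
struct LogBudget {
    rate: f64,
    available: f64,
    last_refill_secs: f64,
}

impl LogBudget {
    fn new(rate_bytes_per_sec: f64, now_secs: f64) -> Self {
        LogBudget {
            rate: rate_bytes_per_sec,
            available: rate_bytes_per_sec,
            last_refill_secs: now_secs,
        }
    }

    /// Debits the budget and returns true if `bytes` fits; otherwise returns
    /// false and leaves the budget untouched (the caller drops the message).
    fn try_log(&mut self, bytes: usize, now_secs: f64) -> bool {
        // Refill for the elapsed time, ignoring clocks that appear to run backwards.
        let elapsed = (now_secs - self.last_refill_secs).max(0.0);
        self.available = (self.available + elapsed * self.rate).min(self.rate);
        self.last_refill_secs = self.last_refill_secs.max(now_secs);

        let cost = bytes as f64;
        if cost <= self.available {
            self.available -= cost;
            true
        } else {
            false
        }
    }
}
```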

A connector hitting a limit returns a transient error to the runtime, which may retry with backoff. Persistent limit hits surface as Permanent::ResourceExhausted and fail the run fast. Limit decisions are recorded in OTel spans so agents and operators see why a run failed without grepping logs.
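
The retry-then-escalate behaviour can be sketched as a pure policy function. Everything here is an assumption for illustration: the name on_limit_hit, the exponential schedule, and the attempt threshold are not specified by this document, and the real runtime may use a different policy.

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Decision {
    /// Retry the call after waiting this long.
    RetryAfter(Duration),
    /// Give up: repeated limit hits are treated as permanent.
    ResourceExhausted,
}

/// Hypothetical policy: exponential backoff from `base`, capped at `max_delay`,
/// escalating once `attempt` (0-based count of prior limit hits) reaches `max_attempts`.
fn on_limit_hit(attempt: u32, max_attempts: u32, base: Duration, max_delay: Duration) -> Decision {
    if attempt >= max_attempts {
        return Decision::ResourceExhausted;
    }
    // 2^attempt, saturating so large attempt counts cannot overflow.
    let factor = 2u32.saturating_pow(attempt);
    // An overflowing product falls back to the cap.
    let delay = base.checked_mul(factor).unwrap_or(max_delay).min(max_delay);
    Decision::RetryAfter(delay)
}
```

With a 100 ms base and three allowed attempts, the waits are 100 ms, 200 ms, and 400 ms before the fourth hit is reported as exhausted.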

The dependency allowlist for in-tree connector authoring bans known footguns: fancy-regex (a backtracking engine, so catastrophic backtracking is possible) in favor of regex (linear-time matching); and raw serde_json::Value for arbitrary input (depth DoS) in favor of serde_json::de::from_slice with a MaxDepth limit.

Authoring a connector

Three personas, same WIT contract:

As an agent (primary path)

hakiri agent scaffold-connector \
  --spec ./openapi.json \
  --kind source \
  --name shopify

This walks an OpenAPI/AsyncAPI/SQL spec and emits:

A WIT-compliant Rust crate under connectors/shopify/
A connector.toml with inferred capabilities
A round-trip test that exercises discover() against a recorded fixture

The MCP server exposes this as a tool; an agent calls it, reviews the diff, runs hakiri connector build && hakiri connector test, and commits.

The host owns the WIT; the agent fills only the Rust impl. Scaffold emits WIT bindings from a fixed template (parameterized on spec + connector name); the agent’s surface area is the Rust trait methods, not the contract. This eliminates the most common failure mode in current frontier models: producing syntactically plausible but semantically wrong WIT (invented types, wrong package versions, mixed pre/post-0.2 import paths). WIT is immutable from the agent’s perspective; the host upgrades it across releases.

As a Rust author (power path)

Direct use of hakiri-connector-sdk:

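use hakiri_connector_sdk::{source, Cursor, Error, Schema, Source, Stream};

// `#[source]` generates the wit-bindgen glue for this type.
#[source]
struct GithubIssues { token: String }

impl Source for GithubIssues {
    // List the schemas (tables) this connector can produce.
    fn discover(&self) -> Result<Vec<Schema>, Error> { todo!() }
    // Open a read stream for one table, resuming from `cursor` if present.
    fn open(&self, table: &str, cursor: Option<Cursor>) -> Result<Stream, Error> { todo!() }
}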

The macro emits wit-bindgen glue and a cdylib target.

As a non-Rust author (escape hatch)

Use componentize-py, componentize-js, or TinyGo against the same WIT file. We ship example skeletons but will not maintain non-Rust SDKs as first-class deliverables.

Built-in connectors (M0/M1)

Compiled into the binary (no WASM round-trip) because they’re hot paths:

Sources: postgres (snapshot + logical decoding), http (REST, OpenAPI-driven), file (CSV/JSON/Parquet/NDJSON on local fs or S3), github (Issues/PRs/repos/comments), s3 (object listing + content)
Destinations: context (the local context store; this is the default), parquet (write Parquet to a path or S3), duckdb (write into an external DuckDB file), webhook (POST to a URL)
Transforms: kept minimal in M1 — select, rename, cast, filter. Anything more interesting belongs in DuckDB SQL post-landing.

Built-in connectors implement the same Rust trait as the WIT bindings emit; the runtime is agnostic to which one it’s calling.

Target connector inventory (agent-authored, M2+)

Per Challenge 2 in PRD.md: the connector-count race against Airbyte (300+) and Fivetran (500+) is unwinnable on person-hours and largely misleading, since most catalog entries are mediocre or unmaintained. Hakiri’s win condition is narrower: every one of the 25–35 sources a team actually queries should be authorable, drift-detectable, and maintainable via the M2 agent loop. A 300-tile marketplace is not the goal.

This is the working target list. Each row is a candidate for the M2 agent-authoring eval — given the public OpenAPI or API documentation URL, can an agent produce a working WASM connector with passing dry-runs in one MCP conversation? Selection criteria: high adoption among the agent-builder + small-team audience (PRD § Target audience); public API docs; OAuth or token auth; no SOAP, no proprietary binary protocols requiring vendor SDKs.

Inventory

Category	Source	Auth	OpenAPI	Pagination	Tier
Engineering / observability	GitHub	PAT / OAuth / App	✓	cursor + link-header	built-in (M1)
	GitLab	PAT / OAuth	✓	page + cursor	M2
	Jira	PAT / OAuth (Atlassian)	✓ (REST v3)	offset + cursor	M2
	Linear	PAT / OAuth	partial (GraphQL SDL)	cursor	M2
	PagerDuty	API key	✓	offset	M2
	Datadog	API key + app key	✓	cursor + time-window	M2
	Sentry	PAT / DSN	✓	link-header	M2
Collaboration / docs	Slack	OAuth	partial (Web API method docs)	cursor	M2
	Notion	OAuth	✓	cursor	M2
	Google Drive	OAuth	✓ (via Discovery)	page-token	M2
	Confluence	PAT / OAuth (Atlassian)	✓	offset + cursor	M2
Customer ops / support	Zendesk	OAuth / API token	✓	offset + cursor	M2
	Intercom	OAuth	✓	cursor	M2
Sales / CRM	Salesforce	OAuth (REST)	partial (REST only; SOQL custom)	offset + query locator	M2
	HubSpot	OAuth / Private app	✓	cursor	M2
	Pipedrive	API token	✓	offset	M2
Product analytics	PostHog	API key	✓	offset + cursor	M2
	Mixpanel	service account	✓ partial	time-window export	M2
	Amplitude	API key + secret	✓ partial	time-window export	M2
	Segment	API token	✓ partial	cursor	M2
Payments / billing	Stripe	API key	✓	cursor	M2
	Chargebee	API key	✓	offset + cursor	M2
	QuickBooks	OAuth (Intuit)	partial	offset	M2.5 (per-realm OAuth complexity)
Marketing	Mailchimp	OAuth / API key	✓	offset	M2
	Marketo	OAuth	✓	offset + paging token	M2.5
Recruiting / HR	Greenhouse	API key	✓	offset + link-header	M2
	BambooHR	API key	partial	offset	M2
Files / data	Airtable	PAT / OAuth	✓	cursor token	M2
	Shopify	OAuth	✓	cursor + link-header	M2
	Google Sheets	OAuth	✓ (via Discovery)	range-read	M2
Databases (as source)	Postgres	password / cert	n/a (wire protocol)	snapshot + logical decoding	built-in (M1)
	MySQL	password / cert	n/a	snapshot + binlog	M2
	MongoDB	connection string	n/a	snapshot + change stream	M2
	Snowflake	OAuth / key-pair	n/a (SQL API + JDBC)	snapshot + stream	M2.5
	BigQuery	service account	n/a (REST Storage API)	partition + stream	M2.5
Object storage (as source)	S3	AWS creds	n/a	listing	built-in (M1)
	GCS	service account	n/a (XML/JSON)	listing	M2
	Azure Blob	shared key / SAS	n/a	listing	M2

Five built-in (M1), ~25 agent-authored REST + OpenAPI (M2), ~5 partial-OpenAPI or auth-complex (M2.5), ~5 binary-protocol databases / object stores (M2 or M2.5 depending on driver maturity).

Scaffolding patterns

The inventory falls into four scaffolding shapes; the M2 agent scaffolder branches on shape, not on source:

REST + OpenAPI (Stripe, GitHub, Notion, Datadog, PostHog, Jira, Zendesk, …) — primary path. hakiri agent scaffold-connector --spec <openapi.json> produces ~90% of the connector. The agent fills the auth flow, pagination quirks, and the cursor-kind declaration.
REST without machine-readable OpenAPI (Slack Web API, parts of Mailchimp, BambooHR) — agent scaffolds from API documentation HTML / Markdown; lower automation, more iteration. M2.5 target with a lower per-call success rate in the eval harness.
GraphQL (Linear primarily, Shopify alt-path) — scaffolded from a SDL schema dump. Same WIT contract; different fetch shape, different pagination idioms (relay-style cursors).
Database / binary protocols (Postgres, MySQL, MongoDB) — hand-written Rust against established drivers (tokio-postgres, mysql_async, mongodb). Not agent-scaffolded; ship as built-ins or as community Rust crates.

The M2 eval target — 60% reach discover() working, 30% reach full contract conformance (pm/roadmap.md M2) — is calibrated against the REST + OpenAPI subset. The other three shapes are tracked separately so a single weak shape doesn’t poison the headline metric.

Distribution of authoring effort

Shape	Count	Authoring path	Maintenance
Built-in Rust (hot paths, native trait)	5–7	Hakiri team	Code review on every change
Agent-authored REST + OpenAPI	~25	Agent scaffolds; human reviews diff	`hakiri connector check-drift` on schedule
Agent-authored without OpenAPI	~5	Agent scaffolds from docs / samples	Same; lower automation
Community-contributed Rust / WASM	open-ended	Third-party authors against `hakiri-connector-sdk`	Maintainer’s responsibility; provenance recorded in catalog

Hakiri does not aim to be a connector marketplace (Challenge 2 § The trap to avoid). The inventory above is the operational target — what should exist by the end of M2 for the agent-builder + small-team audience to be productive. Anything beyond is community-contributed under the same WIT contract, with provenance and capability declarations recorded in the catalog like any other connector.

Distribution

In-tree — built into the binary; resolved by name.
Local file — wasm = "./path/to/connector.wasm" in the project.
HTTPS URL — wasm = "https://hakiri.dev/connectors/shopify-0.3.1.wasm", with a SHA-256 pin.
OCI registry — wasm = "oci://ghcr.io/owner/shopify:0.3.1". Components are valid OCI artifacts.

Pin-by-hash is enforced; an unpinned URL fails CI.
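
Resolution of the four forms, with the pin check applied to HTTPS references, might look like the sketch below. It rests on several assumptions that this document does not settle: the pin arrives as a separate hex-encoded value (shown here as a sha256 argument), a bare name means an in-tree connector, plain http:// is rejected, and OCI references are passed through without a digest check. WasmRef and resolve_wasm_ref are hypothetical names.

```rust
#[derive(Debug, PartialEq)]
enum WasmRef {
    InTree(String),
    LocalFile(String),
    Https { url: String, sha256: String },
    Oci(String),
}

/// Classify a manifest `wasm` value. `sha256` stands in for however the pin is
/// declared (assumption: a separate hex-encoded manifest field).
fn resolve_wasm_ref(wasm: &str, sha256: Option<&str>) -> Result<WasmRef, String> {
    if wasm.starts_with("https://") {
        // An HTTPS reference without a pin is rejected outright.
        let pin = sha256.ok_or_else(|| format!("unpinned URL: {}", wasm))?;
        let is_hex_digest = pin.len() == 64 && pin.bytes().all(|b| b.is_ascii_hexdigit());
        if !is_hex_digest {
            return Err(format!("malformed SHA-256 pin for {}", wasm));
        }
        Ok(WasmRef::Https { url: wasm.to_string(), sha256: pin.to_ascii_lowercase() })
    } else if wasm.starts_with("http://") {
        // Assumption: only HTTPS is listed as a distribution form, so HTTP is refused.
        Err(format!("plain HTTP is not accepted: {}", wasm))
    } else if let Some(rest) = wasm.strip_prefix("oci://") {
        Ok(WasmRef::Oci(rest.to_string()))
    } else if wasm.starts_with("./") || wasm.starts_with('/') || wasm.ends_with(".wasm") {
        Ok(WasmRef::LocalFile(wasm.to_string()))
    } else {
        // Bare name: resolved against connectors compiled into the binary.
        Ok(WasmRef::InTree(wasm.to_string()))
    }
}
```

This only validates the shape of the pin; verifying the downloaded bytes against it would happen after fetch and needs a SHA-256 implementation, which is left out here.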

Testing connectors

Contract tests ship with the SDK: every connector is run against a synthetic WIT-conformance suite (discover returns valid schemas, open returns a finite stream, cursor round-trips).
Fixture replay: record real HTTP interactions to .hakiri/cassettes/ (VCR-style), replay them in CI.
Property tests via proptest for cursor monotonicity and schema invariants.
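
The two cursor invariants named above can be stated without a test framework. The sketch below is deterministic rather than proptest-driven, and it assumes a hypothetical cursor encoding (a big-endian u64 offset), chosen because byte-wise ordering of the encoded cursors then agrees with numeric ordering of the offsets. Real connectors own their cursor format, so a property test would generate offsets and call checks of this shape.

```rust
/// Hypothetical cursor encoding: a big-endian u64 offset, so that comparing
/// cursors as byte strings agrees with comparing the offsets they encode.
fn encode_cursor(offset: u64) -> Vec<u8> {
    offset.to_be_bytes().to_vec()
}

/// Returns `None` for anything that is not exactly eight bytes.
fn decode_cursor(cursor: &[u8]) -> Option<u64> {
    if cursor.len() != 8 {
        return None;
    }
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(cursor);
    Some(u64::from_be_bytes(bytes))
}

/// Invariant 1: cursors observed after successive batches never go backwards.
fn is_monotonic(cursors: &[Vec<u8>]) -> bool {
    cursors.windows(2).all(|pair| pair[0] <= pair[1])
}

/// Invariant 2: decoding an encoded cursor returns the original offset.
fn round_trips(offset: u64) -> bool {
    decode_cursor(&encode_cursor(offset)) == Some(offset)
}
```

A little-endian encoding would still round-trip but would fail the monotonicity check at the 255 to 256 boundary, which is the kind of bug the property test is meant to catch.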

Open questions

WASI 0.2 vs 0.3. As of writing, wasmtime ships solid 0.2; 0.3 (with native async) is landing. Default to 0.2 with a migration plan.
Component pooling. Cold start is on the order of milliseconds; for fast-firing pipelines we may want to pool component instances. Defer until measurement says it matters.
Connector signing. Sigstore-style transparency log for community-published connectors. Worth considering for the marketplace narrative.