ADR-0014 — Cloudflare Workers + Workflows + Containers as the default team-execution surface

Status: Accepted (with self-hosted Rust hakiri-control as a concurrent parity path)
Date: 2026-05-12
Related specs: 06-deployment.md, 13-team-surfaces.md, 14-collab-config.md

Context

Hakiri’s day-1 team product (M1) needs reliable scheduled execution without requiring every team to provision and operate always-on infrastructure. The constraints:

Schedules must fire even when no teammate’s laptop is online. Laptop-only placement is acceptable per pipeline (“ingest from this folder on Alice’s Mac”) but not as the team’s reliability floor.
The substrate must scale from “team of 3 with 5 pipelines firing hourly” to “team of 50 with 200 pipelines firing every minute” without a re-architecture.
Idle cost must approach zero. A team that runs two pipelines a day should not pay for a 24/7 task.
The same product must work for air-gapped / on-prem teams who cannot use Cloudflare.

Three families of approach exist:

Always-on team worker (Fargate task, Mac mini in the office, Hetzner VM). Reliable, simple mental model, but kills constraint 3 (idle cost) and creates a “you must operate a VM” onboarding cliff.
Cron-driven serverless (Cloudflare Workers + Workflows + Containers, or AWS Lambda + Step Functions + Fargate). Zero idle cost, durable orchestration baked in, but ties team-mode reliability to a specific cloud’s availability.
Laptop-only with retries (whoever’s online runs it; if nobody is, retry later). Cheapest but unreliable. Fine for solo / hobby use; wrong for a team product.

ADR-0009 already commits Cloudflare and AWS as first-class deploy targets for the data plane. This ADR extends that commitment to the team-mode control plane.

Decision

Cloudflare Workers + Workflows + Containers + Durable Objects + R2 is the default team-mode execution surface, brought to M1 alongside the local CLI. Every team-mode primitive maps onto a CF resource:

Concern	CF resource
Cron tick (schedule firing)	Cron Trigger
Control-plane HTTP and WebSocket API	Worker + Durable Object
Team metadata, canonical Loro doc, awareness state	Durable Object SQLite
Long-running pipeline execution (>30s wall)	CF Workflow with `step.do` checkpoints
WASM connector execution	Container, invoked from Workflow steps
Manifest TOML snapshots, Parquet data	R2
Real-time sync to clients	DO hibernating WebSocket

For teams that cannot use Cloudflare (air-gapped, regulated, on-prem, multi-cloud-by-policy), the same hakiri-control binary ships as a Rust daemon with identical HTTP / WebSocket / sync semantics. The wire protocol is identical; only the substrate differs:

Concern	CF resource	Self-hosted Rust equivalent
Cron tick	Cron Trigger	In-process `tokio` scheduler
HTTP / WebSocket	Worker + DO	`axum` + `tokio-tungstenite`
Canonical Loro doc	DO SQLite	Embedded SQLite
Workflow durability	CF Workflow	Embedded queue with checkpointed retry — leverages the same crash-resume pattern as 04-context-store.md
Snapshots / Parquet	R2	Any S3-compatible bucket, or local filesystem

A CI acceptance suite verifies both implementations pass the same conformance tests on every PR.

Consequences

Positive

Zero idle cost. A team running pipelines hourly pays ~$5/month (DO storage + Cron + WebSocket frames). No always-on VM, no Fargate task, no Mac mini in the office.
Durable runs out of the box. CF Workflows checkpoint each step.do — mid-run crashes resume from the last completed step, no custom retry logic in our code.
Native long-running execution. Containers spin up when a workflow needs WASM connector execution; sleep when done. Cold-start (2–8s for non-trivial images) is amortized over the workflow run, not paid per request.
Real-time sync with hibernating WebSockets. CF DO supports hibernating WebSocket connections that persist across DO eviction without holding CPU. Per 14-collab-config.md, this is how the Loro sync channel scales to dozens of concurrent editors per team at near-zero cost.
Preserves “no orchestrator” pillar. Operators provision a Worker + DO with one hakiri team init --cloudflare command. No Kubernetes, no Terraform, no managed service mesh.
Same primitives as Topology 3. The CF data-plane primitives were already first-class; this decision reuses them for the control plane.

Negative

Team-mode reliability tied to Cloudflare for non-self-hosted teams. CF outages affect team-mode users; solo CLI users are unaffected (their daemon runs locally). Mitigation: the Rust hakiri-control parity path is a one-binary fallback for teams whose risk tolerance demands on-premise.
Workflow caps remain load-bearing. Topology 3 in 06-deployment.md documents the constraints — 1 MiB step result, 1024 steps per workflow, ~6h practical retry window. The reconciler honors them (step results are R2 keys, not payloads; >1024-page sources batch pages into chunks). Same constraints apply to team-mode pipelines.
DO single-thread execution. Per-team DO serializes mutations. A team with 10000 simultaneous Loro ops per second would queue; in practice teams operate at 1–10 ops/sec per team. Mitigation: shard by pipeline group across multiple DOs if a team grows past comfortable thresholds.
CF Worker CPU / wall-time limits. Reconciler logic must stay under the Worker budget (30s wall, 50ms CPU on free tier; higher on paid). Heavy work moves to Workflows + Containers.
Self-hosted Rust binary is real work. hakiri-control is not a thin shim; it implements WebSocket sync, embedded queue with checkpoints, Loro doc storage, OAuth callback handling. Estimated at 4–6 weeks of dedicated work to reach CF parity for M1.

Neutral

The control-plane API is identical between CF and self-hosted backends. Electron / web clients are unaware of the substrate.
This decision implies no analogous “AWS-first team-mode” target. AWS team-mode would require re-implementing the control plane on Lambda + Step Functions + DynamoDB, which we judge low-value (teams who prefer AWS for compliance tend to self-host). Revisit if AWS-native demand from team-mode buyers materializes.

Alternatives considered

Always-on Fargate task as default. Simple mental model (“there’s one worker, it runs your pipelines”). Rejected for:

$30–100/month idle baseline, regardless of usage.
Onboarding cliff: every team has to provision AWS or contemplate a Mac mini in the office.
Doesn’t change durability guarantees vs CF Workflows (which checkpoint natively).

AWS Lambda + Step Functions + Fargate as default for team mode. Parallels Topology 4. Rejected because:

Hosted control plane on AWS would require multi-tenant DB management (RDS); CF gives us per-tenant DO state with zero ops.
Step Functions Standard pricing scales aggressively (50 steps × 4 transitions × hourly × 100 pipelines ≈ $1.4k/mo). DO + Workflow is roughly 1/10 the cost at the same throughput.
AWS data-plane deploys remain first-class per ADR-0009. This decision is about the control plane only.

Laptop-only with retries. Rejected — see Context #1.

Hybrid: laptop-when-online + cron Worker as fallback. Considered. Net result is the same as making CF the default, with placement choices (placement = "any-mac" vs placement = "cf:auto") handling per-pipeline preference. Folded into the placement model in 03-pipelines.md.

Implementation guardrails

CF dependency must be optional. The OSS data plane and CLI work without any CF account. CF is the default for team mode, not the mandatory substrate.
Self-hosted parity is a release gate. M1 ships both the CF and the Rust hakiri-control paths. A team-mode feature available on CF must work on Rust within the same release; CI gates this via a shared acceptance suite.
Same hakiri.toml. The manifest is identical across CF, AWS-data-plane, and self-hosted control planes. Migration from “fractalbox-hosted CF” → “self-hosted Rust” is a one-command export plus redeploy.
No CF-specific code in hakiri-core. The CF integration lives in a hakiri-control-cf adapter crate; the core control-plane logic is substrate-agnostic.

References

Cloudflare Workflows
Cloudflare Durable Objects
Hibernating WebSockets
Topology 3 — Cloudflare in 06-deployment.md
ADR-0006 — DO SQLite as catalog backend (extends to control-plane state)
ADR-0008 — coordinator-bundling rationale (applies to self-hosted control plane)
ADR-0009 — first-class cloud commitment