ADR-0014 — Cloudflare Workers + Workflows + Containers as the default team-execution surface
- Status: Accepted (with self-hosted Rust
hakiri-controlas a concurrent parity path) - Date: 2026-05-12
- Related specs: 06-deployment.md, 13-team-surfaces.md, 14-collab-config.md
Context
Section titled “Context”Hakiri’s day-1 team product (M1) needs reliable scheduled execution without requiring every team to provision and operate always-on infrastructure. The constraints:
- Schedules must fire even when no teammate’s laptop is online. Laptop-only placement is acceptable per pipeline (“ingest from this folder on Alice’s Mac”) but not as the team’s reliability floor.
- The substrate must scale from “team of 3 with 5 pipelines firing hourly” to “team of 50 with 200 pipelines firing every minute” without a re-architecture.
- Idle cost must approach zero. A team that runs two pipelines a day should not pay for a 24/7 task.
- The same product must work for air-gapped / on-prem teams who cannot use Cloudflare.
Three families of approach exist:
- Always-on team worker (Fargate task, Mac mini in the office, Hetzner VM). Reliable, simple mental model, but kills constraint 3 (idle cost) and creates a “you must operate a VM” onboarding cliff.
- Cron-driven serverless (Cloudflare Workers + Workflows + Containers, or AWS Lambda + Step Functions + Fargate). Zero idle cost, durable orchestration baked in, but ties team-mode reliability to a specific cloud’s availability.
- Laptop-only with retries (whoever’s online runs it; if nobody is, retry later). Cheapest but unreliable. Fine for solo / hobby use; wrong for a team product.
ADR-0009 already commits Cloudflare and AWS as first-class deploy targets for the data plane. This ADR extends that commitment to the team-mode control plane.
Decision
Section titled “Decision”Cloudflare Workers + Workflows + Containers + Durable Objects + R2 is the default team-mode execution surface, brought to M1 alongside the local CLI. Every team-mode primitive maps onto a CF resource:
| Concern | CF resource |
|---|---|
| Cron tick (schedule firing) | Cron Trigger |
| Control-plane HTTP and WebSocket API | Worker + Durable Object |
| Team metadata, canonical Loro doc, awareness state | Durable Object SQLite |
| Long-running pipeline execution (>30s wall) | CF Workflow with step.do checkpoints |
| WASM connector execution | Container, invoked from Workflow steps |
| Manifest TOML snapshots, Parquet data | R2 |
| Real-time sync to clients | DO hibernating WebSocket |
For teams that cannot use Cloudflare (air-gapped, regulated, on-prem, multi-cloud-by-policy), the same hakiri-control binary ships as a Rust daemon with identical HTTP / WebSocket / sync semantics. The wire protocol is identical; only the substrate differs:
| Concern | CF resource | Self-hosted Rust equivalent |
|---|---|---|
| Cron tick | Cron Trigger | In-process tokio scheduler |
| HTTP / WebSocket | Worker + DO | axum + tokio-tungstenite |
| Canonical Loro doc | DO SQLite | Embedded SQLite |
| Workflow durability | CF Workflow | Embedded queue with checkpointed retry — leverages the same crash-resume pattern as 04-context-store.md |
| Snapshots / Parquet | R2 | Any S3-compatible bucket, or local filesystem |
A CI acceptance suite verifies both implementations pass the same conformance tests on every PR.
Consequences
Section titled “Consequences”Positive
- Zero idle cost. A team running pipelines hourly pays ~$5/month (DO storage + Cron + WebSocket frames). No always-on VM, no Fargate task, no Mac mini in the office.
- Durable runs out of the box. CF Workflows checkpoint each
step.do— mid-run crashes resume from the last completed step, no custom retry logic in our code. - Native long-running execution. Containers spin up when a workflow needs WASM connector execution; sleep when done. Cold-start (2–8s for non-trivial images) is amortized over the workflow run, not paid per request.
- Real-time sync with hibernating WebSockets. CF DO supports hibernating WebSocket connections that persist across DO eviction without holding CPU. Per 14-collab-config.md, this is how the Loro sync channel scales to dozens of concurrent editors per team at near-zero cost.
- Preserves “no orchestrator” pillar. Operators provision a Worker + DO with one
hakiri team init --cloudflarecommand. No Kubernetes, no Terraform, no managed service mesh. - Same primitives as Topology 3. The CF data-plane primitives were already first-class; this decision reuses them for the control plane.
Negative
- Team-mode reliability tied to Cloudflare for non-self-hosted teams. CF outages affect team-mode users; solo CLI users are unaffected (their daemon runs locally). Mitigation: the Rust
hakiri-controlparity path is a one-binary fallback for teams whose risk tolerance demands on-premise. - Workflow caps remain load-bearing. Topology 3 in 06-deployment.md documents the constraints — 1 MiB step result, 1024 steps per workflow, ~6h practical retry window. The reconciler honors them (step results are R2 keys, not payloads; >1024-page sources batch pages into chunks). Same constraints apply to team-mode pipelines.
- DO single-thread execution. Per-team DO serializes mutations. A team with 10000 simultaneous Loro ops per second would queue; in practice teams operate at 1–10 ops/sec per team. Mitigation: shard by pipeline group across multiple DOs if a team grows past comfortable thresholds.
- CF Worker CPU / wall-time limits. Reconciler logic must stay under the Worker budget (30s wall, 50ms CPU on free tier; higher on paid). Heavy work moves to Workflows + Containers.
- Self-hosted Rust binary is real work.
hakiri-controlis not a thin shim; it implements WebSocket sync, embedded queue with checkpoints, Loro doc storage, OAuth callback handling. Estimated at 4–6 weeks of dedicated work to reach CF parity for M1.
Neutral
- The control-plane API is identical between CF and self-hosted backends. Electron / web clients are unaware of the substrate.
- This decision implies no analogous “AWS-first team-mode” target. AWS team-mode would require re-implementing the control plane on Lambda + Step Functions + DynamoDB, which we judge low-value (teams who prefer AWS for compliance tend to self-host). Revisit if AWS-native demand from team-mode buyers materializes.
Alternatives considered
Section titled “Alternatives considered”Always-on Fargate task as default. Simple mental model (“there’s one worker, it runs your pipelines”). Rejected for:
- $30–100/month idle baseline, regardless of usage.
- Onboarding cliff: every team has to provision AWS or contemplate a Mac mini in the office.
- Doesn’t change durability guarantees vs CF Workflows (which checkpoint natively).
AWS Lambda + Step Functions + Fargate as default for team mode. Parallels Topology 4. Rejected because:
- Hosted control plane on AWS would require multi-tenant DB management (RDS); CF gives us per-tenant DO state with zero ops.
- Step Functions Standard pricing scales aggressively (50 steps × 4 transitions × hourly × 100 pipelines ≈ $1.4k/mo). DO + Workflow is roughly 1/10 the cost at the same throughput.
- AWS data-plane deploys remain first-class per ADR-0009. This decision is about the control plane only.
Laptop-only with retries. Rejected — see Context #1.
Hybrid: laptop-when-online + cron Worker as fallback. Considered. Net result is the same as making CF the default, with placement choices (placement = "any-mac" vs placement = "cf:auto") handling per-pipeline preference. Folded into the placement model in 03-pipelines.md.
Implementation guardrails
Section titled “Implementation guardrails”- CF dependency must be optional. The OSS data plane and CLI work without any CF account. CF is the default for team mode, not the mandatory substrate.
- Self-hosted parity is a release gate. M1 ships both the CF and the Rust
hakiri-controlpaths. A team-mode feature available on CF must work on Rust within the same release; CI gates this via a shared acceptance suite. - Same
hakiri.toml. The manifest is identical across CF, AWS-data-plane, and self-hosted control planes. Migration from “fractalbox-hosted CF” → “self-hosted Rust” is a one-command export plus redeploy. - No CF-specific code in
hakiri-core. The CF integration lives in ahakiri-control-cfadapter crate; the core control-plane logic is substrate-agnostic.
References
Section titled “References”- Cloudflare Workflows
- Cloudflare Durable Objects
- Hibernating WebSockets
- Topology 3 — Cloudflare in 06-deployment.md
- ADR-0006 — DO SQLite as catalog backend (extends to control-plane state)
- ADR-0008 — coordinator-bundling rationale (applies to self-hosted control plane)
- ADR-0009 — first-class cloud commitment