Compliance posture
The data-sovereignty / compliance dimension from the PRD’s “Properties that follow”. This spec is the substrate — the architectural floor that makes attestation possible. It is not itself a regulatory attestation. Attestation is the work of the customer plus the commercial layer at M3+ (PRD § Commercial layer).
Related:
- Access control mechanics: 09-access-control.md.
- Encryption at rest and sidecar encryption: 04-context-store.md § Encryption.
- Subject attestation: 09-access-control.md § Attestation.
- Audit log: 09-access-control.md § Audit trail.
Compliance reality-check: what v0 actually provides vs. what it doesn’t
| Regime | Hakiri provides (substrate) | Hakiri does not provide (operator’s work or commercial-layer work) |
|---|---|---|
| GDPR data-residency | Customer-controlled cloud region for first-class deploys (CF + AWS EU); air-gapped on-prem via Topology 2.5; no Hakiri-operated data path | Documented region pinning per customer; legal data-processing agreement (DPA) |
| GDPR right-to-erasure | Per-row _subject_id lineage tagging; hakiri context forget --subject <id> rewrites affected snapshots without the subject | Operator must declare which column is the subject identifier; M2 ships the forget tool — until then erasure is not supported |
| HIPAA — technical safeguards | Encryption at rest (Parquet modular encryption), encryption in transit (TLS), capability-token access control, OTel audit + hash-chained local audit, secrets-via-sandbox-only | Business Associate Agreement (BAA) — requires a commercial entity at M3.5+; risk assessment; access-review processes; incident-response runbook |
| HIPAA — PHI handling | Write-time redaction of declared pii_type = "phi" columns; “PHI must never leave the kernel unmasked” invariant via column tagging | Operator declares which columns are PHI; clinical workflow validation |
| EU AI Act | Provenance edges from row → connector → connector-author (agent or human); capability-token audit; documented agent reads | High-risk-system classification; human-oversight processes; risk-management documentation |
| SOC 2 | Manifest-as-code with PR review; immutable lineage table; OTel + chained audit; declared change-management surface in pm/roadmap.md | SOC 2 Type II audit on the commercial hosted control plane (M3.5+); change-management process; incident-response process |
| PCI DSS | TLS, encryption at rest, capability-token scoping, audit trail | Cardholder-data scoping (operator’s call); compensating-controls documentation; quarterly scan |
| FedRAMP | Air-gapped deployment, no telemetry-by-default, sovereign cloud regions | FedRAMP authorization on a specific deploy — not on Hakiri itself |
Honest read: v0 ships the substrate a regulated buyer’s compliance team needs to attest against. It does not ship the attestation. M3.5+ commercial tier is where attestations land (SOC 2 Type II on the hosted control plane; BAA-eligible variant for HIPAA customers; documented data-residency whitepaper).
Data-residency posture
- Data plane runs entirely in the customer’s environment. Always. There is no Hakiri-operated data path.
- First-class clouds (Cloudflare, AWS) both offer EU regions; the hakiri deploy <cloud> command takes --region and pins all resources to that region. The runtime fails to start if a configured resource is in a different region than declared.
- On-prem / air-gapped via Topology 2.5 (self-hosted cluster with bundled hakiri coord); no public-internet path required.
- Optional M3 hosted control plane stores only manifests and schedules, never data. Customer can run an in-region instance of the control plane or rely on the OHC-affiliated hosted instance per their residency obligations.
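
The region-pinning rule above (refuse to start when a configured resource sits outside the declared region) can be sketched as a startup check. This is a minimal illustration, not Hakiri's implementation: `Resource`, `RegionMismatchError`, and `check_region_pinning` are hypothetical names, and the region strings are examples only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """Hypothetical view of a deployed resource (bucket, queue, worker)."""
    name: str
    region: str


class RegionMismatchError(RuntimeError):
    """Raised when a resource lives outside the declared region."""


def check_region_pinning(declared_region: str, resources: list[Resource]) -> None:
    """Fail closed: raise if any resource is not in the declared region."""
    offenders = [r for r in resources if r.region != declared_region]
    if offenders:
        detail = ", ".join(f"{r.name} ({r.region})" for r in offenders)
        raise RegionMismatchError(
            f"declared region {declared_region!r} but found: {detail}"
        )
```

Raising instead of logging a warning mirrors the "fails to start" wording: a misplaced resource blocks startup rather than degrading silently.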
Encryption posture
See 04-context-store.md § Encryption for the mechanics. Summary for this spec:
- Parquet modular encryption with operator-supplied KMS keys (AWS KMS, GCP KMS, HashiCorp Vault, CF Workers Secrets, OS keychain).
- Sidecar indexes (HNSW, Tantivy, Bloom) encrypted under the same key. Indexes over redacted columns do not exist on disk.
- TLS in transit, non-negotiable for cloud sync backends.
- Key rotation: dual-key acceptance for signing keys (24h overlap); per-snapshot key versioning for encryption keys.
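
Dual-key acceptance with a 24h overlap can be illustrated as follows. This is a sketch under stated assumptions: HMAC-SHA256 stands in for whatever signature scheme the project signing key actually uses, and `sign`, `verify`, and their parameters are hypothetical names rather than Hakiri APIs.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone
from typing import Optional

OVERLAP = timedelta(hours=24)  # overlap window stated in the spec


def sign(key: bytes, payload: bytes) -> bytes:
    """Stand-in signature: HMAC-SHA256 (an assumption for illustration)."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def verify(
    payload: bytes,
    signature: bytes,
    current_key: bytes,
    previous_key: Optional[bytes],
    rotated_at: datetime,
    now: datetime,
) -> bool:
    """Accept the current key always; accept the previous key only in the overlap."""
    if hmac.compare_digest(sign(current_key, payload), signature):
        return True
    if previous_key is None or now - rotated_at > OVERLAP:
        return False
    return hmac.compare_digest(sign(previous_key, payload), signature)
```

A token signed just before rotation keeps verifying for 24 hours, then is rejected; tokens signed with the new key verify immediately.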
Recommended rotation cadence
| Key | Recommended rotation | Mandatory for |
|---|---|---|
| Project signing key (token verification) | 90 days | HIPAA, SOC 2 |
| Parquet encryption key (KMS-held) | 365 days (KMS-managed); immediately on suspected compromise | HIPAA, PCI DSS |
| Clean-room pair pepper | Per clean-room session | All |
| Sync bucket credentials | 90 days | SOC 2 |
hakiri keys status reports what’s due and how to rotate.
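
The kind of check a status report could run against the cadence table is sketched below. It does not describe the real output or internals of hakiri keys status: the key labels, `ROTATION_DAYS`, and `keys_due` are hypothetical, and the per-session clean-room pepper is left out because it has no calendar cadence.

```python
from datetime import date, timedelta

# Cadences copied from the table above; the key labels are hypothetical.
ROTATION_DAYS = {
    "project_signing_key": 90,
    "parquet_encryption_key": 365,
    "sync_bucket_credentials": 90,
}


def keys_due(last_rotated: dict[str, date], today: date) -> list[str]:
    """Return keys whose age meets or exceeds the recommended cadence.

    A key with no recorded rotation date is treated as due.
    """
    due = []
    for key, max_age_days in ROTATION_DAYS.items():
        rotated = last_rotated.get(key)
        if rotated is None or today - rotated >= timedelta(days=max_age_days):
            due.append(key)
    return due
```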
Right-to-erasure under append-only Parquet
GDPR Art. 17 (right to erasure) is non-trivial against an append-only store. The architectural answer:
- Operator declares the subject identifier column in the manifest:

      [[pipeline.tables]]
      name = "customer_events"
      subject_id = "user_id"

- The catalog maintains a forget_requests(subject_id, requested_at, completed_at) table.
- hakiri context forget --subject <id> triggers a forced compaction that rewrites every affected snapshot without the subject’s rows. The old snapshot’s runs are GC’d on an expedited schedule (≤24h vs the default 7d retention).
- The forget operation is itself audited: an OTel span and a chained-audit entry record what was forgotten, when, and by whom, with a hash of the forgotten subject id (not the cleartext) so the audit trail itself remains lawful under GDPR.
- Replicas pick up the new snapshot on next refresh; old snapshots GC after the retention window. Until refresh completes, replicas may still hold the subject’s data. The right-to-erasure SLA in v0 is “≤72h from request to last replica refresh”; operators with tighter SLAs use proxy-mode replicas or force-refresh.
Lands in M2.
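
The compaction and audit steps above can be sketched on in-memory rows. This is an illustration only: a snapshot is modelled as a list of dicts rather than Parquet files, and `forget_subject` and the audit field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def forget_subject(snapshot: list[dict], subject_column: str,
                   subject_id: str, actor: str) -> tuple[list[dict], dict]:
    """Rewrite a snapshot without the subject's rows and build an audit entry.

    The audit entry carries a SHA-256 hash of the subject id, never the
    cleartext, matching the rule described above.
    """
    kept = [row for row in snapshot if row.get(subject_column) != subject_id]
    audit_entry = {
        "event": "context.forget",  # hypothetical event name
        "subject_hash": hashlib.sha256(subject_id.encode("utf-8")).hexdigest(),
        "rows_removed": len(snapshot) - len(kept),
        "actor": actor,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    return kept, audit_entry
```

One design caveat: an unkeyed hash of a low-entropy identifier can be reversed by enumeration, so a keyed hash (for example HMAC with an operator-held secret) may be the safer choice in practice. The spec text does not say which is used.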
PHI / sensitive-column tagging
Declared pii_type on a column makes the column subject to extra rules:

    [[pipeline.tables]]
    name = "patient_visits"

    [pipeline.tables.policy.columns]
    patient_id = { pii_type = "phi", strategy = "tokenize" }
    visit_notes = { pii_type = "phi", strategy = "redact" }
    patient_name = { pii_type = "phi", strategy = "redact" }
    visit_date = { pii_type = "phi", strategy = "bucket:1m" }

The runtime enforces:
- No index exists on a column with strategy = "redact" and pii_type set (manifest validator refuses).
- No hash masking alone on pii_type ∈ {ssn, phone, email, mrn}; it must be combined with bucket or truncate (09-access-control.md § Hash strategy guardrails).
- No retrieval through MCP of unmasked pii_type columns unless the requesting token carries an explicit phi_access = true grant.
- Audit attribute hakiri.row.phi_columns_returned is logged on every read that returned PHI, for HIPAA accounting-of-disclosures.
- Inference-zone floor. Columns tagged pii_type = "phi" get an implicit inference_zone_allowed = ["local:device", "on-prem:*"] floor. The validator refuses a manifest that widens this without an explicit phi_inference_override = true flag. The mechanics, including the Incognito-mode UX customer-facing teams flip when handling PHI, live in 15-inference-placement.md.
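
Three of the rules above (no index on redacted PII, no bare hash on guarded types, and the PHI inference-zone floor) can be sketched as a per-column validator. Several details are assumptions made for illustration: the `indexed` key, the idea that a bare hash is spelled `strategy = "hash"`, and the placement of `inference_zone_allowed` and `phi_inference_override` on the column policy. The source does not specify any of them.

```python
from fnmatch import fnmatchcase

PHI_ZONE_FLOOR = ["local:device", "on-prem:*"]       # floor stated in the spec
HASH_GUARDED_TYPES = {"ssn", "phone", "email", "mrn"}


def validate_column(name: str, policy: dict) -> list[str]:
    """Return validation errors for one column policy (empty list means OK)."""
    errors = []
    pii_type = policy.get("pii_type")
    strategy = policy.get("strategy", "")

    # Rule: redacted PII columns must not be indexed ("indexed" is hypothetical).
    if pii_type and strategy == "redact" and policy.get("indexed", False):
        errors.append(f"{name}: index on a redacted PII column")

    # Rule: hash masking alone is refused for guarded PII types.
    if pii_type in HASH_GUARDED_TYPES and strategy == "hash":
        errors.append(f"{name}: hash alone is not allowed for {pii_type}")

    # Rule: PHI zones may not widen the floor without an explicit override.
    if pii_type == "phi" and not policy.get("phi_inference_override", False):
        zones = policy.get("inference_zone_allowed", PHI_ZONE_FLOOR)
        widened = [z for z in zones
                   if not any(fnmatchcase(z, pat) for pat in PHI_ZONE_FLOOR)]
        if widened:
            errors.append(f"{name}: inference zones {widened} widen the PHI floor")
    return errors
```

Returning a list of errors instead of raising on the first one lets a validator report every problem in a manifest in a single pass.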
Audit trail durability
Per 09-access-control.md § Tamper-evident audit log:
- OTel spans are the queryable projection (operator-configured sink: Honeycomb, Tempo, Grafana Cloud, self-hosted).
- A parallel append-only hash-chained log under .hakiri/audit/<project>/ is the attestable record.
- Signed roots are committed to the sync bucket on a configurable cadence (default every 10 min) so audit history survives a compromised local node.
- Optional: commit signed roots to an external transparency log (Sigstore Rekor) for operator-tamper-resistant audit.
If the audit-write path fails (disk full, permission error, bucket unreachable), reads are refused. No fail-open path returns rows without an audit entry.
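
A hash-chained log and the fail-closed read rule can be sketched together. This is a minimal in-memory model, not the on-disk format under .hakiri/audit/: `AuditLog`, `audited_read`, and the entry layout are hypothetical, and root signing is omitted.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev: str, event: dict) -> str:
    """Hash the previous entry's hash together with a canonical event body."""
    body = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


class AuditLog:
    """Append-only log where each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries: list[dict] = []

    def append(self, event: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        digest = _digest(prev, event)
        self.entries.append({"prev": prev, "event": event, "hash": digest})
        return digest

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
                return False
            prev = entry["hash"]
        return True


def audited_read(log: AuditLog, fetch_rows, event: dict):
    """Fail closed: write the audit entry first; if that raises, no rows return."""
    log.append(event)
    return fetch_rows()
```

Writing the audit entry before fetching rows is what makes the path fail-closed: an exception from `append` propagates before `fetch_rows` is ever called.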
Telemetry posture
Hakiri does not phone home. No telemetry-by-default; no anonymous usage ping; no auto-update probe; no license-server contact. The binary works fully air-gapped.
OTel export is opt-in and operator-controlled: the operator configures the endpoint, sampling rate, and attributes. The default OTEL_EXPORTER_OTLP_ENDPOINT is unset — the runtime emits spans to a no-op sink until the operator points them somewhere.
A weekly auto-update check (HTTP HEAD against a release feed) is off by default in v0 and opt-in via [update] check = true. The check never sends usage data; it only fetches a manifest of recent releases. Sovereign deploys leave it off.
What this spec deliberately leaves out
- Specific attestations (SOC 2 Type II reports, BAA templates, FedRAMP authorization documents). Those are deliverables of the commercial-layer entity, not the OSS data plane.
- Customer-side compliance processes (access reviews, risk assessments, incident response). Hakiri provides the substrate; the customer’s compliance team owns the process.
- Region-specific certifications beyond EU. APAC sovereignty (China, India), specific public-sector frameworks (UK G-Cloud, AU IRAP, Canada PBMM) — supported by the architecture but not formally attested in v0.
Open questions
- Commercial entity identification. Which entity holds the BAA, signs the DPA, undergoes the SOC 2 audit? FractalBox, an OHC-affiliated entity, or a separate commercial vehicle? Tracked in PRD § Open product questions.
- Audit log to transparency-log integration. Sigstore Rekor is the obvious choice but adds an external dependency. Is the value worth the dependency? Probably yes for HIPAA / SOC 2 customers; off by default for everyone else.
- EU AI Act high-risk-system classification. Does running an agent over customer data classify the customer’s deployment as a high-risk system, or only when the agent makes automated decisions? Tracking the EU AI Act’s regulatory guidance; the spec evolves with the guidance.
- Right-to-erasure SLA. Default ≤72h to last-replica refresh is the v0 commitment. Tighter SLAs require proxy-mode replicas. Whether to support an explicit “erasure pending” replica state where the replica reports the gap is an M3 question.
- Cross-tenant clean-room compliance. When two parties share a clean-room deployment, who is the data controller / processor for each party’s data? 09-access-control.md § Multi-tenant clean rooms covers the security model; the legal model is a per-deployment contract concern.