Formal verification of the read path
Status: outline / RFC. Open questions at the bottom.
Related specs: 09-access-control.md, 12-semantic-context.md, 15-inference-placement.md, 11-compliance.md.
The read path — the code that decides which rows and which columns reach the agent — is the load-bearing correctness surface in Hakiri. A bug there is not “wrong row counts in a dashboard”; it is “private data sent to a public-cloud LLM in a session the user thought was Incognito.” Tests + code review catch most of these. Formal verification of the policy algebra catches the rest — composition errors, off-by-one edge cases, accidental wildcard widening — by proving properties no test can exhaust.
This spec defines:
- What we prove, in machine-checked form, about the read path.
- How — a small Lean 4 kernel that models policy and runtime semantics, with theorems against it.
- The bridge between the Lean model and the Rust runtime: differential testing + property-based tests + an executable-specification mode.
- What we deliberately do not prove (and why that’s the honest scope).
The goal is proportional rigor: prove the algebra that’s small enough to model and load-bearing enough to matter; trust the rest to tests and review.
What we prove
Section titled “What we prove”The Lean 4 kernel models four objects and proves theorems relating them:
Row— a record with named columns and per-column data values.Subject— the tuple from09-access-control.md§ Subject model, extended withinference_zoneper15-inference-placement.md.Policy— the union of token grants (RLS predicate, CLS masking, action scope) and table policy (inference_zone_allowedper row and per column).Serve : Query → Subject → Catalog → List Row— the runtime’s response function, modeled as a pure total function over the above.
The four load-bearing theorems:
1. Soundness — no forbidden row ever appears in a response
Section titled “1. Soundness — no forbidden row ever appears in a response”theorem serve_sound (q : Query) (s : Subject) (cat : Catalog) (r : Row) (h : r ∈ Serve q s cat) : Policy.allows cat.policy s r := by -- every row served satisfies every layer of policy ...If the agent receives row r, then every layer of policy — token RLS, write-time redaction, sync-edge filter, inference zone — allowed it. There is no path through Serve that returns a row not authorized by Policy.allows.
2. Composition — order of filter application doesn’t matter
Section titled “2. Composition — order of filter application doesn’t matter”theorem filters_commute (rls cls zone : Row → Bool) (rs : List Row) : rs.filter rls |>.filter cls |>.filter zone = rs.filter zone |>.filter rls |>.filter clsRestrictive filters compose in any order. This rules out a class of bugs where the runtime applies RLS first and CLS later and an attacker exploits the intermediate state. Same property holds for the inference-zone pass.
3. Refinement under zone narrowing — Incognito strictly restricts
Section titled “3. Refinement under zone narrowing — Incognito strictly restricts”theorem incognito_refines (q : Query) (s s' : Subject) (cat : Catalog) (h : s'.inference_zone ⊑ s.inference_zone) -- s' is more restrictive : Serve q s' cat ⊆ Serve q s catIf the user flips into Incognito (local:device is ⊑ than public-cloud:*), the result set shrinks or stays the same — it never grows, and it never includes a row that wasn’t allowed under the looser zone. Symmetric theorem for incognito_strict proves that strict refinement excludes at least one row when the policy has any zone-tagged data.
4. Metadata non-leakage — counters don’t reveal forbidden content
Section titled “4. Metadata non-leakage — counters don’t reveal forbidden content”theorem metadata_aggregates (q : Query) (s : Subject) (cat cat' : Catalog) (h : redacted_equivalent cat cat' s) : (Serve q s cat).envelope = (Serve q s cat').envelopeIf two catalogs differ only in forbidden rows (rows s cannot see), the response envelope — rls_filtered_rows, cls_masked_columns, zone_filtered_rows, audit counters — is identical. The metadata cannot be used as an oracle to confirm row presence. This is what makes the honest-disclosure reporting in 12-semantic-context.md § Policy-aware retrieval actually honest.
A fifth theorem covers vector / FTS retrieval specifically:
5. Top-K stability — forbidden rows excluded before the cut
Section titled “5. Top-K stability — forbidden rows excluded before the cut”theorem topk_excludes_forbidden (q : SemanticQuery) (s : Subject) (cat : Catalog) (r : Row) (h : ¬ Policy.allows cat.policy s r) : r ∉ Serve q s cat ∧ r ∉ (Serve q s cat).candidatesA vector match against a forbidden row is dropped before the top-K cut, so an attacker with context.query access but not read:row access cannot infer row presence from displaced K-th results. This makes the 09-access-control.md and 12-semantic-context.md claim about “vector matches against forbidden rows are dropped before the top-K is returned” a checked theorem rather than a documentation promise.
Lean 4 kernel layout
Section titled “Lean 4 kernel layout”Lives at formal/ at the repo root:
formal/ lakefile.lean lean-toolchain HakiriPolicy/ Row.lean -- Row, Column, DataValue Subject.lean -- Subject tuple + zone lattice Policy.lean -- TokenGrant, TablePolicy, allows Serve.lean -- Serve function (pure model of the runtime) Theorems/ Soundness.lean -- theorem serve_sound Composition.lean -- theorem filters_commute Refinement.lean -- theorem incognito_refines Metadata.lean -- theorem metadata_aggregates TopK.lean -- theorem topk_excludes_forbiddenThe model uses only Lean 4 core + std4 — no mathlib. Reason: mathlib is a 1 GB+ dependency that doubles the CI build time and pulls a lot of mathematical infrastructure we don’t need (we never reason about real numbers, topology, or measure theory). The proofs we want are decidable Boolean algebra over finite row sets; the standard library is enough.
Zone lattice as the underlying algebra
Section titled “Zone lattice as the underlying algebra”InferenceZone is modeled as a join-semilattice with local:device as ⊥ (most restrictive) and * (any-zone) as ⊤:
inductive InferenceZone where | localDevice | onPrem (id : String) | privateCloud (account : String) | publicCloud (vendor : String) | any
instance : LE InferenceZone where le := fun a b => ... -- the lattice ordering
theorem zone_le_refl : ∀ z, z ≤ ztheorem zone_le_trans : ∀ a b c, a ≤ b → b ≤ c → a ≤ ctheorem zone_any_top : ∀ z, z ≤ .anyThe refinement theorems in § What we prove are statements over this lattice.
Bridge to the Rust runtime
Section titled “Bridge to the Rust runtime”The Lean kernel proves the algebra. The Rust runtime is the implementation. The two have to agree, or the proofs are decorative.
Three mechanisms keep them in step:
A. Differential testing — proptest + executable Lean model
Section titled “A. Differential testing — proptest + executable Lean model”The Lean model compiles to a small executable (lean --run) that takes a JSON description of (catalog, subject, query) and emits (rows, envelope). The Rust runtime exposes the same interface in test mode. A proptest harness in crates/hakiri-runtime/tests/policy_diff.rs generates random (catalog, subject, query) triples, runs both, and asserts equality:
proptest! { #[test] fn lean_and_rust_agree(input in arb_policy_input()) { let rust = hakiri_runtime::serve(&input); let lean = lean_reference::serve(&input); prop_assert_eq!(rust.rows, lean.rows); prop_assert_eq!(rust.envelope, lean.envelope); }}A Rust runtime divergence from the Lean model fails CI the same way a unit-test failure does. The Lean theorems are then statements about the Rust runtime by reduction — assuming the differential harness covers the input space the proptest shrinks down to.
B. Property-based tests as the executable form of each theorem
Section titled “B. Property-based tests as the executable form of each theorem”Every theorem in formal/HakiriPolicy/Theorems/ has a matching proptest in crates/hakiri-runtime/tests/:
| Theorem | proptest |
|---|---|
serve_sound | prop_no_forbidden_row_in_response |
filters_commute | prop_filter_order_invariant |
incognito_refines | prop_zone_narrowing_shrinks_result |
metadata_aggregates | prop_envelope_invariant_under_redacted_diff |
topk_excludes_forbidden | prop_topk_excludes_forbidden_rows |
The Lean side gives a proof over all inputs; the proptest side gives evidence on sampled inputs that the Rust implementation matches. Both run on every PR.
C. Documentation pinned to the kernel
Section titled “C. Documentation pinned to the kernel”The policy reference in 09-access-control.md and 15-inference-placement.md cite theorem names. When a theorem is renamed or weakened, the citation breaks at doc-build time — preventing silent drift between what the kernel proves and what the docs claim.
CI integration
Section titled “CI integration”A formal job on every PR:
formal: runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - uses: leanprover/lean-action@v1 - run: cd formal && lake build - run: cd formal && lake exe lean-reference --self-test - run: cargo test -p hakiri-runtime policy_diffWall-clock budget: under 90 seconds for lake build on a cold cache (achievable without mathlib); under 30 seconds on a warm cache. The proptest job runs in under 60 seconds at the default case count.
A PR that breaks a theorem fails CI. A PR that loosens a theorem (e.g. weakens serve_sound to accept a permitted-but-unaudited path) must explicitly delete the old theorem and explain why in the PR body — the same discipline as an ADR supersession.
What we deliberately do not prove
Section titled “What we deliberately do not prove”Formal verification is proportional, not total. The kernel covers the policy algebra; it does not cover:
- The cryptography. Biscuit signature verification, Parquet modular encryption, and TLS are trusted via their respective libraries’ audits. We do not re-prove them.
- The catalog backends. SQLite / RDS / DO SQLite linearizability is verified by their vendors and by
adr/0011-catalog-port-semantics.md’s conformance suite — not in Lean. - The WASM sandbox. Wasmtime’s capability enforcement is trusted; we do not re-prove component-model isolation.
- The full Rust runtime. Proving the entire
hakiri-runtimecrate in Lean would multiply effort by 50× and freeze the implementation. The Lean model is a reference semantics; the Rust runtime is checked against it differentially, not extracted from it. - Liveness. The kernel proves what is served, not that the runtime serves anything. A runtime bug that returns an empty result set still satisfies
serve_sound(the empty set is a subset of any allowed set). Liveness is a tests / soak-test concern. - Side channels. Timing attacks, cache occupancy, memory-pressure inference. Out of scope; mitigated by the architectural choices in
09-access-control.mdbut not proved.
These are honest scope limits. The verification claim is: “the policy algebra is sound, composition is order-independent, refinement is monotone, metadata does not leak content, and the Rust runtime matches the model under random testing” — not “Hakiri is bug-free.”
Where this matters most
Section titled “Where this matters most”The theorems’ practical value is highest at three points:
- Incognito-mode regression review. A change to the zone filter requires updating either the Lean kernel (if the semantics change) or just the Rust runtime (if it’s an internal refactor). The differential test catches drift; the kernel catches semantic-intent drift.
- Compliance reviews. When a HIPAA / SOC 2 auditor asks “how do you know PHI never reaches a public-cloud zone,” the answer is “soundness theorem
serve_soundinformal/HakiriPolicy/Theorems/Soundness.lean, machine-checked on every PR, paired with the differential property test.” That is a stronger answer than “we have unit tests.” - Cross-vendor swaps. When a team swaps Claude → OpenAI → local Llama (Pillar 3 / Challenge 4), the policy doesn’t change — the subject’s
inference_zonedoes. The refinement theorem proves that swapping to a more-restrictive zone never widens the result set; that property is portable across vendors by construction.
Open questions
Section titled “Open questions”- Lean 4 vs alternatives. Lean 4 is the lean (pun intended) choice: small standard library, strong dependent types, good tactic ergonomics. Alternatives considered: Coq / Rocq (longer history, larger learning curve, slower CI); TLA+ (great for liveness / temporal properties, weak for value-level theorems); Dafny (Microsoft-flavored, less community momentum); F* (strong for crypto, overkill here); Stainless (Scala-coupled). Lean 4 +
std4fits the proportional-rigor bar. Final decision lands when the kernel sketch builds end-to-end. - Who writes the proofs. Proof authoring is a specialized skill; the M2 spike includes proving the five theorems above with one engineer. If the wall-clock exceeds two weeks, we scope the kernel down (drop
metadata_aggregates, keepserve_sound) or bring in a consultant. The kernel must remain maintainable by the project team, not a one-time external artifact. - Extraction vs differential. An alternative to differential testing is code extraction — generating the Rust runtime’s policy filter from the Lean model. Lean → Rust extraction is immature compared to Coq → OCaml; we choose differential as the lower-risk path for v0. Revisit if a Lean-to-Rust toolchain matures.
- Quickcheck shrinking quality. The differential harness’s value depends on
proptest’s shrinking finding minimal counterexamples. CustomArbitraryinstances forPolicy,Catalog,Subjectare part of the M2 deliverable, not an afterthought. - Verifying the WASM connector path. A connector receives raw rows; its output flows back into the runtime. Whether to model the connector boundary in the Lean kernel (treating connectors as adversarial) is an open question — leans no for v0 because the sandbox is the trust boundary, not the policy algebra.
- Coverage of the catalog port semantics.
adr/0011-catalog-port-semantics.mdpins linearizability invariants across backends. Whether those invariants are stated in Lean (and then refined to per-backend implementations) or stay in the prose ADR + conformance suite is open. Leaning prose + conformance for v0; revisit if a catalog-backend bug ever bypasses policy. - Audit-log integrity proof. The hash-chained audit log in
09-access-control.mdhas a natural soundness theorem (a forged entry is detectable). Whether to add it as theorem #6 is open; it would be a small extension to the existing kernel.