Skip to content

Formal verification of the read path

Status: outline / RFC. Open questions at the bottom.

Related specs: 09-access-control.md, 12-semantic-context.md, 15-inference-placement.md, 11-compliance.md.

The read path — the code that decides which rows and which columns reach the agent — is the load-bearing correctness surface in Hakiri. A bug there is not “wrong row counts in a dashboard”; it is “private data sent to a public-cloud LLM in a session the user thought was Incognito.” Tests + code review catch most of these. Formal verification of the policy algebra catches the rest — composition errors, off-by-one edge cases, accidental wildcard widening — by proving properties no test can exhaust.

This spec defines:

  1. What we prove, in machine-checked form, about the read path.
  2. How — a small Lean 4 kernel that models policy and runtime semantics, with theorems against it.
  3. The bridge between the Lean model and the Rust runtime: differential testing + property-based tests + an executable-specification mode.
  4. What we deliberately do not prove (and why that’s the honest scope).

The goal is proportional rigor: prove the algebra that’s small enough to model and load-bearing enough to matter; trust the rest to tests and review.

The Lean 4 kernel models four objects and proves theorems relating them:

  • Row — a record with named columns and per-column data values.
  • Subject — the tuple from 09-access-control.md § Subject model, extended with inference_zone per 15-inference-placement.md.
  • Policy — the union of token grants (RLS predicate, CLS masking, action scope) and table policy (inference_zone_allowed per row and per column).
  • Serve : Query → Subject → Catalog → List Row — the runtime’s response function, modeled as a pure total function over the above.

The four load-bearing theorems:

1. Soundness — no forbidden row ever appears in a response

Section titled “1. Soundness — no forbidden row ever appears in a response”
theorem serve_sound
(q : Query) (s : Subject) (cat : Catalog)
(r : Row) (h : r ∈ Serve q s cat) :
Policy.allows cat.policy s r := by
-- every row served satisfies every layer of policy
...

If the agent receives row r, then every layer of policy — token RLS, write-time redaction, sync-edge filter, inference zone — allowed it. There is no path through Serve that returns a row not authorized by Policy.allows.

2. Composition — order of filter application doesn’t matter

Section titled “2. Composition — order of filter application doesn’t matter”
theorem filters_commute
(rls cls zone : Row → Bool) (rs : List Row) :
rs.filter rls |>.filter cls |>.filter zone =
rs.filter zone |>.filter rls |>.filter cls

Restrictive filters compose in any order. This rules out a class of bugs where the runtime applies RLS first and CLS later and an attacker exploits the intermediate state. Same property holds for the inference-zone pass.

3. Refinement under zone narrowing — Incognito strictly restricts

Section titled “3. Refinement under zone narrowing — Incognito strictly restricts”
theorem incognito_refines
(q : Query) (s s' : Subject) (cat : Catalog)
(h : s'.inference_zone ⊑ s.inference_zone) -- s' is more restrictive
: Serve q s' cat ⊆ Serve q s cat

If the user flips into Incognito (local:device is than public-cloud:*), the result set shrinks or stays the same — it never grows, and it never includes a row that wasn’t allowed under the looser zone. Symmetric theorem for incognito_strict proves that strict refinement excludes at least one row when the policy has any zone-tagged data.

4. Metadata non-leakage — counters don’t reveal forbidden content

Section titled “4. Metadata non-leakage — counters don’t reveal forbidden content”
theorem metadata_aggregates
(q : Query) (s : Subject) (cat cat' : Catalog)
(h : redacted_equivalent cat cat' s)
: (Serve q s cat).envelope = (Serve q s cat').envelope

If two catalogs differ only in forbidden rows (rows s cannot see), the response envelope — rls_filtered_rows, cls_masked_columns, zone_filtered_rows, audit counters — is identical. The metadata cannot be used as an oracle to confirm row presence. This is what makes the honest-disclosure reporting in 12-semantic-context.md § Policy-aware retrieval actually honest.

A fifth theorem covers vector / FTS retrieval specifically:

5. Top-K stability — forbidden rows excluded before the cut

Section titled “5. Top-K stability — forbidden rows excluded before the cut”
theorem topk_excludes_forbidden
(q : SemanticQuery) (s : Subject) (cat : Catalog)
(r : Row) (h : ¬ Policy.allows cat.policy s r) :
r ∉ Serve q s cat ∧ r ∉ (Serve q s cat).candidates

A vector match against a forbidden row is dropped before the top-K cut, so an attacker with context.query access but not read:row access cannot infer row presence from displaced K-th results. This makes the 09-access-control.md and 12-semantic-context.md claim about “vector matches against forbidden rows are dropped before the top-K is returned” a checked theorem rather than a documentation promise.

Lives at formal/ at the repo root:

formal/
lakefile.lean
lean-toolchain
HakiriPolicy/
Row.lean -- Row, Column, DataValue
Subject.lean -- Subject tuple + zone lattice
Policy.lean -- TokenGrant, TablePolicy, allows
Serve.lean -- Serve function (pure model of the runtime)
Theorems/
Soundness.lean -- theorem serve_sound
Composition.lean -- theorem filters_commute
Refinement.lean -- theorem incognito_refines
Metadata.lean -- theorem metadata_aggregates
TopK.lean -- theorem topk_excludes_forbidden

The model uses only Lean 4 core + std4 — no mathlib. Reason: mathlib is a 1 GB+ dependency that doubles the CI build time and pulls a lot of mathematical infrastructure we don’t need (we never reason about real numbers, topology, or measure theory). The proofs we want are decidable Boolean algebra over finite row sets; the standard library is enough.

InferenceZone is modeled as a join-semilattice with local:device as (most restrictive) and * (any-zone) as :

inductive InferenceZone where
| localDevice
| onPrem (id : String)
| privateCloud (account : String)
| publicCloud (vendor : String)
| any
instance : LE InferenceZone where
le := fun a b => ... -- the lattice ordering
theorem zone_le_refl : ∀ z, z ≤ z
theorem zone_le_trans : ∀ a b c, a ≤ b → b ≤ c → a ≤ c
theorem zone_any_top : ∀ z, z ≤ .any

The refinement theorems in § What we prove are statements over this lattice.

The Lean kernel proves the algebra. The Rust runtime is the implementation. The two have to agree, or the proofs are decorative.

Three mechanisms keep them in step:

A. Differential testing — proptest + executable Lean model

Section titled “A. Differential testing — proptest + executable Lean model”

The Lean model compiles to a small executable (lean --run) that takes a JSON description of (catalog, subject, query) and emits (rows, envelope). The Rust runtime exposes the same interface in test mode. A proptest harness in crates/hakiri-runtime/tests/policy_diff.rs generates random (catalog, subject, query) triples, runs both, and asserts equality:

proptest! {
#[test]
fn lean_and_rust_agree(input in arb_policy_input()) {
let rust = hakiri_runtime::serve(&input);
let lean = lean_reference::serve(&input);
prop_assert_eq!(rust.rows, lean.rows);
prop_assert_eq!(rust.envelope, lean.envelope);
}
}

A Rust runtime divergence from the Lean model fails CI the same way a unit-test failure does. The Lean theorems are then statements about the Rust runtime by reduction — assuming the differential harness covers the input space the proptest shrinks down to.

B. Property-based tests as the executable form of each theorem

Section titled “B. Property-based tests as the executable form of each theorem”

Every theorem in formal/HakiriPolicy/Theorems/ has a matching proptest in crates/hakiri-runtime/tests/:

Theoremproptest
serve_soundprop_no_forbidden_row_in_response
filters_commuteprop_filter_order_invariant
incognito_refinesprop_zone_narrowing_shrinks_result
metadata_aggregatesprop_envelope_invariant_under_redacted_diff
topk_excludes_forbiddenprop_topk_excludes_forbidden_rows

The Lean side gives a proof over all inputs; the proptest side gives evidence on sampled inputs that the Rust implementation matches. Both run on every PR.

The policy reference in 09-access-control.md and 15-inference-placement.md cite theorem names. When a theorem is renamed or weakened, the citation breaks at doc-build time — preventing silent drift between what the kernel proves and what the docs claim.

A formal job on every PR:

formal:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: leanprover/lean-action@v1
- run: cd formal && lake build
- run: cd formal && lake exe lean-reference --self-test
- run: cargo test -p hakiri-runtime policy_diff

Wall-clock budget: under 90 seconds for lake build on a cold cache (achievable without mathlib); under 30 seconds on a warm cache. The proptest job runs in under 60 seconds at the default case count.

A PR that breaks a theorem fails CI. A PR that loosens a theorem (e.g. weakens serve_sound to accept a permitted-but-unaudited path) must explicitly delete the old theorem and explain why in the PR body — the same discipline as an ADR supersession.

Formal verification is proportional, not total. The kernel covers the policy algebra; it does not cover:

  • The cryptography. Biscuit signature verification, Parquet modular encryption, and TLS are trusted via their respective libraries’ audits. We do not re-prove them.
  • The catalog backends. SQLite / RDS / DO SQLite linearizability is verified by their vendors and by adr/0011-catalog-port-semantics.md’s conformance suite — not in Lean.
  • The WASM sandbox. Wasmtime’s capability enforcement is trusted; we do not re-prove component-model isolation.
  • The full Rust runtime. Proving the entire hakiri-runtime crate in Lean would multiply effort by 50× and freeze the implementation. The Lean model is a reference semantics; the Rust runtime is checked against it differentially, not extracted from it.
  • Liveness. The kernel proves what is served, not that the runtime serves anything. A runtime bug that returns an empty result set still satisfies serve_sound (the empty set is a subset of any allowed set). Liveness is a tests / soak-test concern.
  • Side channels. Timing attacks, cache occupancy, memory-pressure inference. Out of scope; mitigated by the architectural choices in 09-access-control.md but not proved.

These are honest scope limits. The verification claim is: “the policy algebra is sound, composition is order-independent, refinement is monotone, metadata does not leak content, and the Rust runtime matches the model under random testing” — not “Hakiri is bug-free.”

The theorems’ practical value is highest at three points:

  1. Incognito-mode regression review. A change to the zone filter requires updating either the Lean kernel (if the semantics change) or just the Rust runtime (if it’s an internal refactor). The differential test catches drift; the kernel catches semantic-intent drift.
  2. Compliance reviews. When a HIPAA / SOC 2 auditor asks “how do you know PHI never reaches a public-cloud zone,” the answer is “soundness theorem serve_sound in formal/HakiriPolicy/Theorems/Soundness.lean, machine-checked on every PR, paired with the differential property test.” That is a stronger answer than “we have unit tests.”
  3. Cross-vendor swaps. When a team swaps Claude → OpenAI → local Llama (Pillar 3 / Challenge 4), the policy doesn’t change — the subject’s inference_zone does. The refinement theorem proves that swapping to a more-restrictive zone never widens the result set; that property is portable across vendors by construction.
  • Lean 4 vs alternatives. Lean 4 is the lean (pun intended) choice: small standard library, strong dependent types, good tactic ergonomics. Alternatives considered: Coq / Rocq (longer history, larger learning curve, slower CI); TLA+ (great for liveness / temporal properties, weak for value-level theorems); Dafny (Microsoft-flavored, less community momentum); F* (strong for crypto, overkill here); Stainless (Scala-coupled). Lean 4 + std4 fits the proportional-rigor bar. Final decision lands when the kernel sketch builds end-to-end.
  • Who writes the proofs. Proof authoring is a specialized skill; the M2 spike includes proving the five theorems above with one engineer. If the wall-clock exceeds two weeks, we scope the kernel down (drop metadata_aggregates, keep serve_sound) or bring in a consultant. The kernel must remain maintainable by the project team, not a one-time external artifact.
  • Extraction vs differential. An alternative to differential testing is code extraction — generating the Rust runtime’s policy filter from the Lean model. Lean → Rust extraction is immature compared to Coq → OCaml; we choose differential as the lower-risk path for v0. Revisit if a Lean-to-Rust toolchain matures.
  • Quickcheck shrinking quality. The differential harness’s value depends on proptest’s shrinking finding minimal counterexamples. Custom Arbitrary instances for Policy, Catalog, Subject are part of the M2 deliverable, not an afterthought.
  • Verifying the WASM connector path. A connector receives raw rows; its output flows back into the runtime. Whether to model the connector boundary in the Lean kernel (treating connectors as adversarial) is an open question — leans no for v0 because the sandbox is the trust boundary, not the policy algebra.
  • Coverage of the catalog port semantics. adr/0011-catalog-port-semantics.md pins linearizability invariants across backends. Whether those invariants are stated in Lean (and then refined to per-backend implementations) or stay in the prose ADR + conformance suite is open. Leaning prose + conformance for v0; revisit if a catalog-backend bug ever bypasses policy.
  • Audit-log integrity proof. The hash-chained audit log in 09-access-control.md has a natural soundness theorem (a forged entry is detectable). Whether to add it as theorem #6 is open; it would be a small extension to the existing kernel.