Formal verification of the read path

Status: outline / RFC. Open questions at the bottom.

Related specs: 09-access-control.md, 12-semantic-context.md, 15-inference-placement.md, 11-compliance.md.

The read path — the code that decides which rows and which columns reach the agent — is the load-bearing correctness surface in Hakiri. A bug there is not “wrong row counts in a dashboard”; it is “private data sent to a public-cloud LLM in a session the user thought was Incognito.” Tests + code review catch most of these. Formal verification of the policy algebra catches the rest — composition errors, off-by-one edge cases, accidental wildcard widening — by proving properties no test can exhaust.

This spec defines:

What we prove, in machine-checked form, about the read path.
How — a small Lean 4 kernel that models policy and runtime semantics, with theorems against it.
The bridge between the Lean model and the Rust runtime: differential testing + property-based tests + an executable-specification mode.
What we deliberately do not prove (and why that’s the honest scope).

The goal is proportional rigor: prove the algebra that’s small enough to model and load-bearing enough to matter; trust the rest to tests and review.

What we prove

The Lean 4 kernel models four objects and proves theorems relating them:

Row — a record with named columns and per-column data values.
Subject — the tuple from 09-access-control.md § Subject model, extended with inference_zone per 15-inference-placement.md.
Policy — the union of token grants (RLS predicate, CLS masking, action scope) and table policy (inference_zone_allowed per row and per column).
Serve : Query → Subject → Catalog → List Row — the runtime’s response function, modeled as a pure total function over the above.

The four load-bearing theorems:

1. Soundness — no forbidden row ever appears in a response

theorem serve_sound
    (q : Query) (s : Subject) (cat : Catalog)
    (r : Row) (h : r ∈ Serve q s cat) :
    Policy.allows cat.policy s r := by
  -- every row served satisfies every layer of policy
  ...

If the agent receives row r, then every layer of policy — token RLS, write-time redaction, sync-edge filter, inference zone — allowed it. There is no path through Serve that returns a row not authorized by Policy.allows.

2. Composition — order of filter application doesn’t matter

theorem filters_commute
    (rls cls zone : Row → Bool) (rs : List Row) :
    rs.filter rls |>.filter cls |>.filter zone =
    rs.filter zone |>.filter rls |>.filter cls

Restrictive filters compose in any order. This rules out a class of bugs where the runtime applies RLS first and CLS later and an attacker exploits the intermediate state. Same property holds for the inference-zone pass.

theorem incognito_refines
    (q : Query) (s s' : Subject) (cat : Catalog)
    (h : s'.inference_zone ⊑ s.inference_zone)  -- s' is more restrictive
    : Serve q s' cat ⊆ Serve q s cat

If the user flips into Incognito (local:device is ⊑ than public-cloud:*), the result set shrinks or stays the same — it never grows, and it never includes a row that wasn’t allowed under the looser zone. Symmetric theorem for incognito_strict proves that strict refinement excludes at least one row when the policy has any zone-tagged data.

4. Metadata non-leakage — counters don’t reveal forbidden content

theorem metadata_aggregates
    (q : Query) (s : Subject) (cat cat' : Catalog)
    (h : redacted_equivalent cat cat' s)
    : (Serve q s cat).envelope = (Serve q s cat').envelope

If two catalogs differ only in forbidden rows (rows s cannot see), the response envelope — rls_filtered_rows, cls_masked_columns, zone_filtered_rows, audit counters — is identical. The metadata cannot be used as an oracle to confirm row presence. This is what makes the honest-disclosure reporting in 12-semantic-context.md § Policy-aware retrieval actually honest.

A fifth theorem covers vector / FTS retrieval specifically:

5. Top-K stability — forbidden rows excluded before the cut

theorem topk_excludes_forbidden
    (q : SemanticQuery) (s : Subject) (cat : Catalog)
    (r : Row) (h : ¬ Policy.allows cat.policy s r) :
    r ∉ Serve q s cat ∧ r ∉ (Serve q s cat).candidates

A vector match against a forbidden row is dropped before the top-K cut, so an attacker with context.query access but not read:row access cannot infer row presence from displaced K-th results. This makes the 09-access-control.md and 12-semantic-context.md claim about “vector matches against forbidden rows are dropped before the top-K is returned” a checked theorem rather than a documentation promise.

Lean 4 kernel layout

Lives at formal/ at the repo root:

formal/
  lakefile.lean
  lean-toolchain
  HakiriPolicy/
    Row.lean              -- Row, Column, DataValue
    Subject.lean          -- Subject tuple + zone lattice
    Policy.lean           -- TokenGrant, TablePolicy, allows
    Serve.lean            -- Serve function (pure model of the runtime)
    Theorems/
      Soundness.lean      -- theorem serve_sound
      Composition.lean    -- theorem filters_commute
      Refinement.lean     -- theorem incognito_refines
      Metadata.lean       -- theorem metadata_aggregates
      TopK.lean           -- theorem topk_excludes_forbidden

The model uses only Lean 4 core + std4 — no mathlib. Reason: mathlib is a 1 GB+ dependency that doubles the CI build time and pulls a lot of mathematical infrastructure we don’t need (we never reason about real numbers, topology, or measure theory). The proofs we want are decidable Boolean algebra over finite row sets; the standard library is enough.

Zone lattice as the underlying algebra

InferenceZone is modeled as a join-semilattice with local:device as ⊥ (most restrictive) and * (any-zone) as ⊤:

inductive InferenceZone where
  | localDevice
  | onPrem    (id : String)
  | privateCloud (account : String)
  | publicCloud  (vendor : String)
  | any

instance : LE InferenceZone where
  le := fun a b => ...   -- the lattice ordering

theorem zone_le_refl : ∀ z, z ≤ z
theorem zone_le_trans : ∀ a b c, a ≤ b → b ≤ c → a ≤ c
theorem zone_any_top  : ∀ z, z ≤ .any

The refinement theorems in § What we prove are statements over this lattice.

Bridge to the Rust runtime

The Lean kernel proves the algebra. The Rust runtime is the implementation. The two have to agree, or the proofs are decorative.

Three mechanisms keep them in step:

A. Differential testing — proptest + executable Lean model

The Lean model compiles to a small executable (lean --run) that takes a JSON description of (catalog, subject, query) and emits (rows, envelope). The Rust runtime exposes the same interface in test mode. A proptest harness in crates/hakiri-runtime/tests/policy_diff.rs generates random (catalog, subject, query) triples, runs both, and asserts equality:

proptest! {
  #[test]
  fn lean_and_rust_agree(input in arb_policy_input()) {
    let rust = hakiri_runtime::serve(&input);
    let lean = lean_reference::serve(&input);
    prop_assert_eq!(rust.rows, lean.rows);
    prop_assert_eq!(rust.envelope, lean.envelope);
  }
}

A Rust runtime divergence from the Lean model fails CI the same way a unit-test failure does. The Lean theorems are then statements about the Rust runtime by reduction — assuming the differential harness covers the input space the proptest shrinks down to.

B. Property-based tests as the executable form of each theorem

Every theorem in formal/HakiriPolicy/Theorems/ has a matching proptest in crates/hakiri-runtime/tests/:

Theorem	proptest
`serve_sound`	`prop_no_forbidden_row_in_response`
`filters_commute`	`prop_filter_order_invariant`
`incognito_refines`	`prop_zone_narrowing_shrinks_result`
`metadata_aggregates`	`prop_envelope_invariant_under_redacted_diff`
`topk_excludes_forbidden`	`prop_topk_excludes_forbidden_rows`

The Lean side gives a proof over all inputs; the proptest side gives evidence on sampled inputs that the Rust implementation matches. Both run on every PR.

C. Documentation pinned to the kernel

The policy reference in 09-access-control.md and 15-inference-placement.md cite theorem names. When a theorem is renamed or weakened, the citation breaks at doc-build time — preventing silent drift between what the kernel proves and what the docs claim.

CI integration

A formal job on every PR:

formal:
  runs-on: ubuntu-latest
  steps:
    - uses: actions/checkout@v4
    - uses: leanprover/lean-action@v1
    - run: cd formal && lake build
    - run: cd formal && lake exe lean-reference --self-test
    - run: cargo test -p hakiri-runtime policy_diff

Wall-clock budget: under 90 seconds for lake build on a cold cache (achievable without mathlib); under 30 seconds on a warm cache. The proptest job runs in under 60 seconds at the default case count.

A PR that breaks a theorem fails CI. A PR that loosens a theorem (e.g. weakens serve_sound to accept a permitted-but-unaudited path) must explicitly delete the old theorem and explain why in the PR body — the same discipline as an ADR supersession.

What we deliberately do not prove

Formal verification is proportional, not total. The kernel covers the policy algebra; it does not cover:

The cryptography. Biscuit signature verification, Parquet modular encryption, and TLS are trusted via their respective libraries’ audits. We do not re-prove them.
The catalog backends. SQLite / RDS / DO SQLite linearizability is verified by their vendors and by adr/0011-catalog-port-semantics.md’s conformance suite — not in Lean.
The WASM sandbox. Wasmtime’s capability enforcement is trusted; we do not re-prove component-model isolation.
The full Rust runtime. Proving the entire hakiri-runtime crate in Lean would multiply effort by 50× and freeze the implementation. The Lean model is a reference semantics; the Rust runtime is checked against it differentially, not extracted from it.
Liveness. The kernel proves what is served, not that the runtime serves anything. A runtime bug that returns an empty result set still satisfies serve_sound (the empty set is a subset of any allowed set). Liveness is a tests / soak-test concern.
Side channels. Timing attacks, cache occupancy, memory-pressure inference. Out of scope; mitigated by the architectural choices in 09-access-control.md but not proved.

These are honest scope limits. The verification claim is: “the policy algebra is sound, composition is order-independent, refinement is monotone, metadata does not leak content, and the Rust runtime matches the model under random testing” — not “Hakiri is bug-free.”

Where this matters most

The theorems’ practical value is highest at three points:

Incognito-mode regression review. A change to the zone filter requires updating either the Lean kernel (if the semantics change) or just the Rust runtime (if it’s an internal refactor). The differential test catches drift; the kernel catches semantic-intent drift.
Compliance reviews. When a HIPAA / SOC 2 auditor asks “how do you know PHI never reaches a public-cloud zone,” the answer is “soundness theorem serve_sound in formal/HakiriPolicy/Theorems/Soundness.lean, machine-checked on every PR, paired with the differential property test.” That is a stronger answer than “we have unit tests.”
Cross-vendor swaps. When a team swaps Claude → OpenAI → local Llama (Pillar 3 / Challenge 4), the policy doesn’t change — the subject’s inference_zone does. The refinement theorem proves that swapping to a more-restrictive zone never widens the result set; that property is portable across vendors by construction.

Open questions

Lean 4 vs alternatives. Lean 4 is the lean (pun intended) choice: small standard library, strong dependent types, good tactic ergonomics. Alternatives considered: Coq / Rocq (longer history, larger learning curve, slower CI); TLA+ (great for liveness / temporal properties, weak for value-level theorems); Dafny (Microsoft-flavored, less community momentum); F* (strong for crypto, overkill here); Stainless (Scala-coupled). Lean 4 + std4 fits the proportional-rigor bar. Final decision lands when the kernel sketch builds end-to-end.
Who writes the proofs. Proof authoring is a specialized skill; the M2 spike includes proving the five theorems above with one engineer. If the wall-clock exceeds two weeks, we scope the kernel down (drop metadata_aggregates, keep serve_sound) or bring in a consultant. The kernel must remain maintainable by the project team, not a one-time external artifact.
Extraction vs differential. An alternative to differential testing is code extraction — generating the Rust runtime’s policy filter from the Lean model. Lean → Rust extraction is immature compared to Coq → OCaml; we choose differential as the lower-risk path for v0. Revisit if a Lean-to-Rust toolchain matures.
Quickcheck shrinking quality. The differential harness’s value depends on proptest’s shrinking finding minimal counterexamples. Custom Arbitrary instances for Policy, Catalog, Subject are part of the M2 deliverable, not an afterthought.
Verifying the WASM connector path. A connector receives raw rows; its output flows back into the runtime. Whether to model the connector boundary in the Lean kernel (treating connectors as adversarial) is an open question — leans no for v0 because the sandbox is the trust boundary, not the policy algebra.
Coverage of the catalog port semantics. adr/0011-catalog-port-semantics.md pins linearizability invariants across backends. Whether those invariants are stated in Lean (and then refined to per-backend implementations) or stay in the prose ADR + conformance suite is open. Leaning prose + conformance for v0; revisit if a catalog-backend bug ever bypasses policy.
Audit-log integrity proof. The hash-chained audit log in 09-access-control.md has a natural soundness theorem (a forged entry is detectable). Whether to add it as theorem #6 is open; it would be a small extension to the existing kernel.