
ADR-0010 — Polars as the transform engine; Python and TypeScript as authoring surfaces

Pipelines need a transformation layer between source connectors and destinations: column renames and casts, struct/list flattening, regex extraction, type-widening, filtering before write, lightweight enrichment. The v0 outline shipped a minimal four-op transform set (select, rename, cast, filter) and said “anything more belongs in SQL views after landing.” Field experience tells us this is too restrictive — pipeline authors want a real expression language, and they want to write it in the language they already know (Python or TypeScript).

The hard part: most ETL platforms answered this by embedding a runtime — dlt, Airbyte CDK, and many others embed Python; some embed JS. Embedding a runtime:

  • Blows the footprint budget (Pillar 1, Challenge 5: ≤ 50 MB compressed binary, ≤ 60 MB idle RSS).
  • Breaks sandboxing — embedded Python can import os and exfiltrate.
  • Couples performance to the embedded runtime — Python GIL, JS single-thread.
  • Adds a second installation problem (the user must also have a compatible Python, or Hakiri must ship one).

We need a transform model that is expressive for authors without making the runtime fat or unsandboxed.

The transform engine is Polars, used as a Rust library inside the Hakiri binary.

Authors write transforms as Polars expression trees, with three equivalent authoring surfaces:

  • Rust — Polars’s native API.
  • Python — Polars’s existing Python bindings, surfaced via a thin hakiri-py wrapper.
  • TypeScript / JavaScript — Polars’s existing Node bindings, surfaced via a thin hakiri-ts wrapper.

The Python and TypeScript authoring surfaces are build-time tools that compile to a serialized Polars LazyFrame plan, which is checked into the repo and read by the Hakiri runtime at pipeline time. The runtime executes the plan against Arrow record batches using Polars’s Rust engine. No Python interpreter or JavaScript runtime is embedded in the Hakiri binary.
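
The compile-then-execute split can be pictured with a toy model. The sketch below is not Polars's plan format or engine: the node shapes (`col`, `lit`, `gt`, `mul`) and the `filter`/`derive` keys are invented for illustration, and evaluation is row-by-row rather than vectorized. It only shows the shape of the idea: the authoring side emits data, and the runtime interprets a closed set of operators.

```python
import json
import operator

# Toy expression constructors. These node shapes are hypothetical,
# not Polars's serialized plan format.
def col(name):
    return {"op": "col", "name": name}

def lit(value):
    return {"op": "lit", "value": value}

def binary(op, left, right):
    return {"op": op, "left": left, "right": right}

# "Build time": the author builds a tree and serializes it to JSON.
plan_json = json.dumps(
    {
        "filter": binary("gt", col("amount"), lit(5)),
        "derive": {"amount_cents": binary("mul", col("amount"), lit(100))},
    },
    sort_keys=True,
)

# "Run time": a closed operator table. No node type can open a file or socket.
BINARY_OPS = {"gt": operator.gt, "mul": operator.mul, "add": operator.add}

def evaluate(node, row):
    op = node["op"]
    if op == "col":
        return row[node["name"]]
    if op == "lit":
        return node["value"]
    if op in BINARY_OPS:
        return BINARY_OPS[op](evaluate(node["left"], row), evaluate(node["right"], row))
    raise ValueError(f"unknown operator: {op!r}")

def run_plan(serialized, rows):
    plan = json.loads(serialized)
    out = []
    for row in rows:
        if evaluate(plan["filter"], row):
            derived = {name: evaluate(expr, row) for name, expr in plan["derive"].items()}
            out.append({**row, **derived})
    return out

result = run_plan(plan_json, [{"amount": 3}, {"amount": 7}])
print(result)
```

An operator the runtime does not know is rejected instead of executed, which is the sense in which a plan of this kind carries no arbitrary code.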

For per-batch logic that exceeds Polars’s expression algebra (calling out to an embedding model, hitting a sidecar service, keeping state across batches), authors write a WASM Component that implements a transform WIT interface, a sibling of the source and destination interfaces from ADR-0001. Python compiles to WASM via componentize-py; TypeScript compiles via jco.

Positive

  • Footprint stays bounded. Adding Polars to the Rust binary costs ~5 MB. No Python, no Node, no V8.
  • Sandboxed by construction. A Polars expression cannot open a socket or read the filesystem; the algebra is the sandbox. Imperative logic that needs host access lives in WASM Components, which already have a capability model.
  • Authors keep their language. Data engineers write Python; frontend-leaning engineers write TypeScript; both produce the same Polars plan.
  • One canonical execution path. The runtime has one transform engine, not “Polars for some things, Python interpreter for others.” Reasoning about correctness, replay, and provenance is straightforward.
  • Zero-copy through the pipeline. Connectors emit Arrow batches; Polars operates on Arrow; destinations consume Arrow. No serialization in the middle.
  • Performance is competitive without effort. Polars is one of the fastest dataframe engines on Arrow. Authors get vectorized execution and query optimization without writing them.
  • Replay-safe. The compiled plan is deterministic data, not arbitrary code. Two runs of the same plan against the same source produce byte-identical outputs.
  • Reviewable. PR review is on the compiled plan JSON (always) and optionally the source .py/.ts (if checked in). No hidden side effects from imported modules.
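
Because the plan is plain data, it can be fingerprinted for provenance. The sketch below is one way a team might do that, not something this ADR specifies: the plan dictionaries are invented stand-ins for a compiled plan, and `plan_digest` is a hypothetical helper.

```python
import hashlib
import json

def plan_digest(plan: dict) -> str:
    # Canonical form: sorted keys and fixed separators, so logically equal
    # plans serialize to the same bytes regardless of key order.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Two hypothetical plans with the same content but different key order.
plan_a = {"filter": {"op": "gt", "col": "amount", "value": 5}, "select": ["id", "amount"]}
plan_b = {"select": ["id", "amount"], "filter": {"value": 5, "col": "amount", "op": "gt"}}

digest_a = plan_digest(plan_a)
digest_b = plan_digest(plan_b)
print(digest_a == digest_b)
```

Recording such a digest next to each run's output would make it possible to tell later which exact plan produced a given batch.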

Negative

  • Compile step. Authors must run hakiri compile <file> to produce the plan. Acceptable: it has the same shape as cargo build or tsc, and we ship a watch mode.
  • Polars’s expression algebra is large but not infinite. Stateful streaming aggregations, fuzzy matching beyond what Polars provides, and bespoke logic require the WASM Component fallback. Two tiers, not one.
  • Polars version skew. Authors using hakiri-py must match the Polars version the Hakiri runtime ships. The CLI pins the version and the wrapper enforces it on compile.
  • Plan format is Polars’s, not ours. If Polars makes a breaking change to its serialized plan format, we follow. Mitigation: the engine version is recorded in the plan and the runtime refuses to run a plan from an incompatible engine version.
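
The version-recording mitigation might look like the sketch below. The envelope layout, the `engine_version` field name, the pinned version string, and the "major and minor must match" policy are all assumptions made for illustration; the ADR states only that the engine version is recorded and that incompatible plans are refused.

```python
import json

# Hypothetical: the engine version the runtime was built against.
RUNTIME_ENGINE_VERSION = "1.2.0"

class IncompatiblePlanError(RuntimeError):
    """Raised when a plan was compiled against an engine the runtime cannot run."""

def major_minor(version: str) -> tuple[int, int]:
    major, minor, *_ = version.split(".")
    return int(major), int(minor)

def load_plan(serialized: str, runtime_version: str = RUNTIME_ENGINE_VERSION) -> dict:
    envelope = json.loads(serialized)
    plan_version = envelope["engine_version"]
    # Assumed policy: refuse unless major and minor versions both match.
    if major_minor(plan_version) != major_minor(runtime_version):
        raise IncompatiblePlanError(
            f"plan compiled for engine {plan_version}, runtime ships {runtime_version}"
        )
    return envelope["plan"]

compatible = json.dumps({"engine_version": "1.2.3", "plan": {"op": "noop"}})
print(load_plan(compatible))
```

Failing at load time, before any batch is read, keeps a version mismatch from surfacing as a partial or corrupted write.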

Neutral

  • No support for “arbitrary Python at pipeline time.” This is a deliberate non-feature; teams that need it should use a different tool. We document the WASM Component path for legitimate imperative needs.

Embed Python via PyO3. The path most ETL tools have taken. Rejected because:

  • Binary balloons (CPython is ~30 MB; total install with NumPy/Pandas dependencies pushes past 200 MB).
  • Sandboxing is hard — Python’s import system is global and reaching for os is one line away.
  • Author experience requires the runtime user to have a matching Python; either Hakiri ships Python (huge) or fails on mismatched versions.
  • The GIL serializes Python execution within a process, so the parallelism story degrades.

Embed JavaScript via deno_core / Boa / QuickJS. Smaller than Python but still substantial:

  • deno_core adds ~15–25 MB and pulls in V8 (the largest contributor to binary size in the embedded-JS world).
  • QuickJS is small (~1 MB) but slow for data-shaped workloads.
  • Same sandboxing concerns as Python.
  • Author ecosystem is real but smaller than Python’s for data work.

DataFusion as the engine instead of Polars. Also a Rust-native Arrow-aware query engine. Genuinely viable:

  • DataFusion is more SQL-shaped; Polars is more dataframe-shaped. Our authors are dataframe-shaped (Python/Pandas/Polars community), not SQL-as-an-API.
  • Python and TypeScript bindings for Polars exist and are widely used; DataFusion’s Python binding (datafusion-python) is newer, and no TS binding exists.
  • DataFusion’s optimizer is excellent; Polars is closer to the user-authoring metaphor.
  • We can revisit if a substantial subset of authors prefers SQL transforms over expressions — DataFusion would slot in alongside Polars, not replace it.

Custom expression DSL. Build our own narrow expression language with a typed AST. Rejected:

  • Teaches a new language for marginal benefit.
  • Loses Python/TS ecosystem familiarity.
  • Doubles the documentation surface.

Stay minimal (just select/rename/cast/filter). Rejected because it pushes every team into writing connectors-that-transform or maintaining a parallel transformation tool — the worst of both worlds.