ariadne: a CI/CD Optimizing Compiler
Background
At work, I've been deep in CI refactors for the past few weeks. Simplifying control flow, better utilizing our runners, reducing the number of build steps, speeding up coverage, parallelizing jobs, etc.
Doing all of this, one thing became pretty clear: CI pipelines are deceptively hard to manage at scale. Almost every pipeline starts life simple. You have a build step, a test step, maybe a docs and format step, a publish. Clean. Easy to read, easy to change. But then you start mixing in container images, introducing multiple runner types, sharing binaries between jobs, gating steps on other steps, and suddenly you have a pile of YAML that only two people on the team fully understand.
The other thing that's changed the dynamic recently is LLMs. In a narrow sense, they help: even if you're fuzzy on CI specifics, you generally know what you want to happen. "Build the compiler, run the test suite, package it into a wheel." An LLM can take that and spit out reasonable YAML. The problem is it doesn't know the context your CI actually runs in. So you get output that's plausible-looking but subtly wrong, or suboptimal in ways you only realize after seeing your compute usage skyrocket.
There's also no good way to test any of this. To know if a CI change works, you push it to a branch and wait. If it runs on a different trigger or a different environment than you expected, it can still fail, and you won't find out until it already has. You can't write unit tests that run locally, and orchestrating a simulated run locally is non-trivial enough that most people just don't bother. So the iteration loop is: push, wait, read the error, push again.
At some point, staring at a screen full of YAML, I had a thought: this kinda reads like assembly. You're manually scheduling work, manually deciding where data lives, manually wiring up data movement between steps. And if programming languages taught us anything, it's that humans are bad at writing assembly at scale. We built compilers precisely because specifying WHAT should happen is tractable; specifying HOW is the wrong level of abstraction for a human.
So: what if you applied the same idea to CI? Treat artifacts as typed objects, let actions declare what they consume and produce, and let a compiler figure out the execution plan.
That's where ariadne comes in.
The Thesis
ariadne separates intent from execution.
User intent
↓
Thread IR
↓
ariadne planner
↓
Execution plan (+ optimizations)
↓
GitHub Actions YAML / local containers / ...
You describe the semantics. ariadne generates a correct execution plan, optimizes artifact movement and runner placement, and emits standard CI configuration. Same idea as a compiler: you write in a high-level language, the compiler figures out the machine code.
Why Not Another CI DSL
There are plenty of existing tools that take this angle: replace YAML with a proper language, get better semantics, type safety, real control flow. These are good ideas. The problem is that people are exhausted by new DSLs. If the recent GPU programming explosion taught us anything, it's that engineers have a finite tolerance for "please learn our bespoke language, it's better, we promise." They don't want to learn it. They don't want to set up the toolchain. They don't want to be stranded when the project stalls. And LLMs are going to be terrible at it. They're trained on the corpus of code that exists, not a language we made up.
More importantly, I realized the core thing I actually wanted to build was the compiler. And the compiler doesn't care about the frontend; it cares about the IR. If the IR is expressive enough to represent the full semantic model, the frontend is just a surface. It could be a Python package. It could be a language server plugin. It could be a Go library. It could be hand-written TIR JSON, which is legal, just unpleasant.
So ariadne is frontend-agnostic by design, with a reference frontend in Python because everyone already knows Python. The practical upside is that frontend authors get the full power of their host language for free. You can write functions, loops, conditionals, shared libraries of reusable action definitions, whatever you need to construct TIR. There's no syntax to invent, no parser to write, no standard library to bootstrap. The Python frontend ships as a shared object that runs ariadne in-process via PyO3; another frontend in Go or Rust can link the same library and call the same planning engine.
If we'd built a DSL, we'd be spending time on parsing and syntax and editor tooling instead of on the compiler. If we'd done a thin language wrapper, we'd have had to gut the host language down to whatever subset we could safely expose. Doing neither means frontend authors can choose how many or how few of ariadne's features they expose. A minimal frontend that only does build and test is fine. A full-featured one that exposes every consequence type, policy knob, and placement option is also fine. The compiler handles both correctly.
Thread IR
Thread IR (TIR) is ariadne's canonical semantic representation. It is both an in-memory model and a serialized interchange format (JSON or protobuf binary), which means any frontend that emits valid TIR can drive the planning engine, and any tool that reads TIR can validate, explain, or simulate a workflow without executing it.
TIR is a typed directed graph. Artifacts are the nodes; data flow between producer and consumer actions forms the edges. The planner derives execution order by topological sort: if action B consumes an artifact that action A produces, B runs after A. There are no manual needs: declarations and no explicit dependency wiring. You can insert ordering edges without data flow when you need them: .after() for a single dependency, barrier() for a synchronization point. But in the common case you don't have to think about it.
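To make the ordering rule concrete, here is a minimal sketch of deriving execution order from declared inputs and outputs. It is not ariadne's planner: the ACTIONS table, the execution_order function, and the after parameter are hypothetical stand-ins, and the sort comes from Python's standard graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical action table: name -> (artifacts consumed, artifacts produced).
ACTIONS = {
    "checkout": (set(), {"src"}),
    "build_loom": ({"src"}, {"loom"}),
    "test_workspace": ({"src", "loom"}, set()),
}


def execution_order(actions, after=()):
    """Return action names in an order that respects data flow."""
    # Map each artifact to the single action that produces it.
    producer = {art: name for name, (_, outs) in actions.items() for art in outs}
    # An action depends on the producer of every artifact it consumes.
    deps = {name: {producer[a] for a in ins} for name, (ins, _) in actions.items()}
    # Ordering-only edges (the role .after() plays) carry no data.
    for later, earlier in after:
        deps[later].add(earlier)
    return list(TopologicalSorter(deps).static_order())


order = execution_order(ACTIONS)
print(order)
```

Nothing in the table says "build before test"; the order falls out of who produces what.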
Five first-class concepts:
Artifacts are typed, immutable, logical data. The type system distinguishes things like SourceTree, Wheel, Binary, and ContainerImage, plus an open Custom variant. The key word is logical. An artifact is not a file path; it's a named, typed thing. The planner decides how it moves based on what the backend supports. SourceTree is never transferred between jobs. Any action that needs the repo gets a fresh checkout injected.
Actions are more than just computation nodes. An action call in TIR carries: typed artifact inputs and outputs, declared consequences, secrets by name, actor constraints, resource requirements, and an implementation. The implementation is what the action actually does: a raw shell script, a container exec, a VCS checkout, or a semantic operation. Semantic implementations carry no command; they name an operation like build.python_wheel and pass typed args, and the planner resolves the concrete tool from the inventory at plan time. That separation is what lets the same workflow compile to cargo on one inventory and pip on another.
Consequences are external mutations declared explicitly on actions: things like GitWrite, PublishRelease, or Deployment. A consequence can require approval before it fires. The planner gates or withholds consequences based on event context: fork PRs get no secrets, PR plans gate all external mutations, approval-required consequences block until approved.
Placements declare the storage strategy for an artifact: shared volumes, persistent caches, OCI registries, and so on. The baseline plan copies everything; placements are hints to the optimizer, not correctness requirements.
Actors are execution resources from the inventory. Each has runner labels (what ends up in runs-on:), capability strings, and an optional resource budget. Resource requirements on an action constrain which actor it can run on.
TIR also carries the inventory, triggers, and coordination settings at the workflow level. Triggers are distinct from the EventContext used at plan time: triggers control workflow entry, while event context controls which consequences fire once the workflow is running.
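The gating rules for consequences and secrets can be sketched as a pair of small functions. This is an illustration of the three rules described above, not ariadne's code: the Event and Consequence classes, the event kind strings, and the returned state names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str  # hypothetical labels: "push", "pr", or "fork-pr"


@dataclass(frozen=True)
class Consequence:
    name: str
    requires_approval: bool = False


def secrets_for(event, requested):
    """Fork PRs get no secrets; every other event gets what was requested."""
    return [] if event.kind == "fork-pr" else list(requested)


def consequence_state(event, consequence, approved=False):
    """Decide whether an external mutation may fire under this event."""
    if event.kind in ("pr", "fork-pr"):
        return "gated"  # PR plans gate all external mutations
    if consequence.requires_approval and not approved:
        return "blocked"  # waits for explicit approval
    return "allowed"


release = Consequence("publish_release", requires_approval=True)
print(consequence_state(Event("push"), release))
```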
The Python Frontend
The reference frontend is a Python package. You write @action-decorated functions that record semantic operations into TIR:
@action(outputs={"loom": Binary.file("target/release/loom", lifetime="1h")})
def build_loom(src: SourceTree):
    return build.binary(src=src, package="loom", release=True)
build.binary means "build a binary," not "run cargo." Which tool realizes it is decided at plan time, from the inventory. The inventory declares the available execution resources and implementation technologies:
(
    Inventory("ariadne-ci")
    .actor("runner", selector=["ubuntu-latest"], capabilities=["linux", "x86_64"])
    .use("git")
    .use("cargo")
    .use("maturin")
    .use("pytest")
)
The same test.unit() action becomes test.unit.cargo under .use("cargo") and test.unit.pytest under .use("pytest"). The workflow author picks semantics, the inventory author picks available implementations, and ariadne's lowering rules teach the compiler how each implementation realizes each action.
A @workflow ties it together:
@workflow(inventory=inventory(), triggers=[on.push(branches=["main"])])
def main_ci():
    install_dependencies()
    objectives("dollar_cost", "critical_path")
    src = checkout()
    rust_fmt(src)
    python_fmt(src)
    barrier()
    with impls(["cargo"]):
        loom = build_loom(src)
        test_workspace(src, loom)
    with impls(["maturin", "pytest"]):
        wheel = build_wheel(src)
        env = install_wheel(src, wheel)
        test_wheel(src, env)
Selection: Two Phases on One Engine
The inventory tells the planner what actors exist, what tools are available, and what it's allowed to use. It has three parts.
Actors are declared in the inventory. The planner assigns each action to an actor by checking a pinned constraint first, then label matching, then resource satisfaction. The actor's capability strings flow into specification selection as hard gates.
Implementations declare which tools exist and how the planner should rank them. .use("cargo") makes cargo available. .prefer("buildkit") biases toward buildkit when multiple candidates apply. .deny("docker") excludes docker from selection entirely. The ranking is: prefer (0) beats declared-use (1) beats undeclared default (2). A silent inventory still yields a working plan; ariadne always has a default. The planner also infers the system package manager from the inventory: if dnf is declared, system dependencies install via dnf; otherwise apt is assumed.
Placements declare where artifacts can live and how they can be accessed: cache volumes, shared storage, object storage. These feed the placement optimization pass, which uses them to decide whether an artifact can be mounted instead of downloaded.
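The implementation ranking above (prefer beats declared-use beats undeclared default, with deny excluding a candidate outright) is easy to sketch. The rank and select functions below are hypothetical, and breaking ties by candidate order is my assumption rather than something ariadne guarantees.

```python
def rank(impl, prefer=(), use=(), deny=()):
    """Rank one implementation; lower is better, None means excluded."""
    if impl in deny:
        return None
    if impl in prefer:
        return 0
    if impl in use:
        return 1
    return 2  # undeclared default


def select(candidates, **inventory):
    """Pick the best-ranked candidate; list order breaks ties (an assumption)."""
    ranked = [(rank(c, **inventory), i, c) for i, c in enumerate(candidates)]
    allowed = [r for r in ranked if r[0] is not None]
    return min(allowed)[2] if allowed else None


candidates = ["docker", "buildkit"]
print(select(candidates, prefer=["buildkit"], use=["docker"]))
```

With an empty inventory every candidate ranks 2, so the first one wins, which mirrors the "a silent inventory still yields a working plan" behavior.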
Lowering itself happens in two phases, both running on the same engine in select.rs.
Specification (plan-time, backend-agnostic): for each semantic action, the planner consults the inventory to pick an implementation and bind the action to it. build.binary with cargo in the inventory becomes build.binary.cargo. The selected specification produces a shell fallback command that any backend can run. The planner also resolves tool dependencies and toolchain versions from the inventory at this phase.
The inventory's non-denied implementations also surface as impl.<id> capabilities in the plan. The backend uses these at emit-time to decide whether a native step is available: impl.pypa-publish-action in the plan means the backend can upgrade package.publish.twine to uses: pypa/gh-action-pypi-publish instead of running twine upload in a shell.
Instruction selection (emit-time, backend-aware): lowers the specified op to a native step. scm.checkout.git upgrades to actions/checkout@v4 on GitHub and falls through to git checkout . on any backend that has no native mapping. build.binary.cargo becomes run: cargo build --release. Every semantic op carries a shell fallback; a backend without a matching instruction runs it. A legal plan is always available.
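Here is a rough sketch of emit-time selection with the shell fallback. The NATIVE and FALLBACK tables and the emit_step function are hypothetical; only the op names, the actions/checkout@v4 mapping, and the two shell commands come from the description above.

```python
# Hypothetical native-step catalogue keyed by (backend, specified op).
NATIVE = {
    ("github", "scm.checkout.git"): {"uses": "actions/checkout@v4"},
}

# Shell fallback carried by every specified op.
FALLBACK = {
    "scm.checkout.git": "git checkout .",
    "build.binary.cargo": "cargo build --release",
}


def emit_step(backend, op):
    """Use a native step when the backend has one, otherwise run the shell fallback."""
    native = NATIVE.get((backend, op))
    if native is not None:
        return native
    return {"run": FALLBACK[op]}


print(emit_step("github", "scm.checkout.git"))
print(emit_step("podman", "scm.checkout.git"))
```

Because the fallback lookup never depends on the backend, a backend with an empty catalogue still produces a runnable step for every op.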
At the Python frontend level, impls(["cargo", "maturin"]) and impl("pytest") are scoped preference blocks. They softly bias selection within their scope without modifying the inventory, which means you can have a mixed-toolchain workflow where different blocks prefer different implementations of the same action (test.unit via cargo in one block, pytest in another) without needing two separate inventories.
Optimization Passes
Optimizing CI is deceptively difficult. Coming up with ideas is easy: "why are we building the compiler nine times?" (true story). Actually getting it down to one is hard, because there's usually some non-obvious reason you can only get it to five. Artifact boundaries exist for permissions reasons. A job runs separately because it needs a different runner. A download happens redundantly because two jobs that could share a mount can't be colocated given the backend. The constraints compound in ways that are hard to reason about statically in a sufficiently complex pipeline, and even harder to reason about when you're staring at raw YAML.
The idea with ariadne is that you write the dumb pipeline: declare what each action needs and produces, and let the compiler figure out what's actually necessary. The correctness invariant holds regardless: ariadne will always produce a plan that runs. But with optimization enabled, it will try to fuse jobs that don't need to be separate, eliminate artifact boundaries that exist for no good reason, colocate consumers with their producers to avoid transfers, and place shared artifacts in cache mounts instead of uploading and downloading repeatedly.
Once a correct plan exists, ariadne runs passes:
- Deduplication: reuse identical pure actions
- Fusion: combine cheap adjacent actions to eliminate unnecessary artifact boundaries (big win when the artifact is an environment marker that never needs to move between runners)
- Parallelization: run independent work concurrently within policy limits
- Placement optimization: prefer mount or colocation over upload/download when the backend supports it; fall back to copy with a warning when not
- Consequence-aware ordering: deployments, releases, and other external mutations are never reordered ahead of the tests that precede them
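As an illustration of the simplest pass in the list above, here is a sketch of deduplication over a flat action list. The dict layout (name, op, args, inputs, pure) is a made-up stand-in for TIR, and a real pass would also rewire consumers of the removed action to the surviving one.

```python
def dedupe(actions):
    """Drop pure actions whose op, args, and inputs match an earlier pure action."""
    seen = {}   # signature -> name of the first action with that signature
    alias = {}  # removed action name -> surviving action name
    kept = []
    for action in actions:
        sig = (
            action["op"],
            tuple(sorted(action["args"].items())),
            tuple(action["inputs"]),
        )
        if action["pure"] and sig in seen:
            alias[action["name"]] = seen[sig]
            continue
        if action["pure"]:
            seen[sig] = action["name"]
        kept.append(action)
    return kept, alias


build = {"op": "build.binary", "args": {"package": "loom"}, "inputs": ["src"], "pure": True}
kept, alias = dedupe([dict(build, name="build_a"), dict(build, name="build_b")])
print([a["name"] for a in kept], alias)
```

Impure actions (anything carrying a consequence) are never merged, even when they look identical.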
Profile-Guided Planning
Once there's a correctness baseline, ariadne can optimize using data from previous runs. The profile tracks things like artifact sizes, action durations, cache hit rates, queue times, and runner setup costs (the fixed overhead of provisioning a job, paid once per execution unit).
The cost model uses this to estimate three things: wall-clock makespan (via a list-schedule of the dependency DAG onto N concurrency slots), total bytes transferred, and dollar cost. Optimization objectives are user-defined and ordered. objectives("dollar_cost", "critical_path") makes dollar cost the primary objective; critical path breaks ties. The first objective that differs between two plans decides.
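A small sketch of the two ideas in this paragraph: a greedy list schedule to estimate makespan, and lexicographic comparison of ordered objectives. Function names and the plan dict layout are hypothetical, and the scheduler is the simplest variant I could write, not necessarily the one ariadne uses.

```python
import heapq
from graphlib import TopologicalSorter


def list_schedule_makespan(durations, deps, slots):
    """Greedy list schedule of a dependency DAG onto `slots` concurrent slots."""
    free_at = [0.0] * slots  # min-heap of times at which each slot frees up
    finish = {}
    for task in TopologicalSorter(deps).static_order():
        # A task is ready once all of its dependencies have finished.
        ready = max((finish[d] for d in deps.get(task, ())), default=0.0)
        start = max(ready, heapq.heappop(free_at))
        finish[task] = start + durations[task]
        heapq.heappush(free_at, finish[task])
    return max(finish.values(), default=0.0)


def pick_plan(plans, objectives):
    """Lexicographic comparison: the first objective that differs decides."""
    return min(plans, key=lambda p: tuple(p[o] for o in objectives))


durations = {"a": 3, "b": 3, "c": 2}
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
print(list_schedule_makespan(durations, deps, slots=2))
```

Tuple comparison in Python is already lexicographic, so the ordered-objectives rule needs no extra machinery.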
The setup cost model is where PGO starts doing real work. Under a parallelism cap, two independent jobs that can't run concurrently anyway serialize regardless: merging them removes a redundant provisioning cycle for free. Without a cap, merging just lengthens the makespan, so it's not worth it. Without real profile data, the optimizer can only guess. With it, the cost model knows exactly how long provisioning takes on your runners, how large your artifacts actually are, and how much data movement each plan variant implies.
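The merge trade-off is easier to see with numbers. The sketch below uses invented figures (30 seconds of setup, two independent jobs of 60 seconds each) and two hypothetical helper functions; it only illustrates the arithmetic, not ariadne's cost model.

```python
def separate_makespan(setup, works, slots):
    """Each job pays its own setup; jobs go to the least-loaded slot."""
    loads = [0] * slots
    for work in sorted(works, reverse=True):
        i = loads.index(min(loads))
        loads[i] += setup + work
    return max(loads)


def merged_makespan(setup, works):
    """One fused job pays setup once and runs all the work back to back."""
    return setup + sum(works)


setup, works = 30, [60, 60]
print("cap of 1:", separate_makespan(setup, works, 1), "vs merged", merged_makespan(setup, works))
print("no cap:  ", separate_makespan(setup, works, 2), "vs merged", merged_makespan(setup, works))
```

Under a cap of one, merging drops the makespan from 180 to 150 seconds because one provisioning cycle disappears. With two slots the separate jobs finish in 90 seconds, so merging would lengthen the makespan to 150.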
ariadne's own CI closes the loop: a refresh_profile action aggregates metrics from recent runs and commits the result back to the repo, so the next generation plans with real timings rather than constants.
Testing CI Itself
Because workflows are programs, they're testable. This is the part I think is genuinely underexplored in CI tooling.
There are two levels. The first is structural: assert things about the plan without running any jobs. Does this artifact get produced? Does this event context trigger this consequence? Does this workflow stay within the runner budget?
@test_case(name="docs and coverage are produced for both languages")
def artifacts_produced():
    expect.artifact_produced("rust_docs-docs")
    expect.artifact_produced("python_docs-docs")

assert plan.on_event("fork-pr").does_not_access_secret("PROD_TOKEN")
assert plan.on_event("tag-release").has_consequence("publish_release")
assert plan.max_parallel_jobs <= 16
The event-context assertions catch a whole class of CI security bugs: "this fork PR ran with access to production secrets." Consequence types are first-class in TIR, so the planner knows statically which jobs carry them.
The second level is a full local execution run. ariadne runs each action in its own Podman container, in topological order, with the repo bind-mounted into /workspace. Artifacts move through a shared store on the host, mounted into each container at /loom-artifacts, so inter-job data movement works the same way it does in real CI. Secrets are spoofed. If the plan grants a job access to a secret, the executor injects spoofed-<SECRET_NAME> as the env var. If the event context withholds them (a fork PR, say), they're not injected at all. Consequences like deployments and releases are recorded and gated, never actually fired. Approval gates prompt interactively.
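To give a feel for what the local executor does per action, here is a sketch that builds (but does not run) a Podman command line. The secret_env and podman_argv helpers are hypothetical and the exact flags ariadne passes are an assumption; the mount points and the spoofed-<SECRET_NAME> convention come from the description above.

```python
def secret_env(granted, withheld):
    """Spoof every granted secret, or inject nothing when the event withholds them."""
    if withheld:
        return {}
    return {name: f"spoofed-{name}" for name in granted}


def podman_argv(image, script, repo, store, env):
    """Assemble a `podman run` invocation for one action (assumed flag layout)."""
    argv = [
        "podman", "run", "--rm",
        "-v", f"{repo}:/workspace",        # repo bind-mounted into the container
        "-v", f"{store}:/loom-artifacts",  # shared artifact store on the host
        "-w", "/workspace",
    ]
    for name, value in sorted(env.items()):
        argv += ["-e", f"{name}={value}"]
    return argv + [image, "sh", "-c", script]


env = secret_env(["PROD_TOKEN"], withheld=False)
print(podman_argv("ubuntu:24.04", "cargo test", "/src/repo", "/tmp/store", env))
```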
The result: you can run your full CI pipeline locally, end-to-end, with realistic artifact movement and realistic secret behavior, without pushing to a branch and waiting.
There's one more layer: testing changes to the CI definition itself. ariadne generates main.yml, which creates a bootstrap problem: if the generator is broken, the generated workflow is wrong, and using it as the correctness gate defeats the purpose. So ariadne's own CI keeps two workflow files. ci.yml is hand-written and independent of ariadne; it must keep running even when a change to the generator breaks the generated workflow. main.yml is the ariadne-generated product demo, committed alongside the TIR it came from.
Every PR runs three checks from ci.yml:
loom check ci/main.tir.json # TIR is structurally valid
loom explain ci/main.tir.json github -O3 --profile ci/profile.json # plan is producible
python ci/main.py && git diff --exit-code ci/main.tir.json # committed TIR is up to date
The first two validate that the committed TIR is sound and that a full optimized plan can be produced from it against the GitHub backend. The third re-runs the Python frontend and diffs the output against what's committed. If you edit main.py and forget to regenerate and commit the TIR, this fails.
One detail worth noting: the check diffs ci/main.tir.json, not main.yml. The YAML legitimately drifts on every run of the profile commit loop as fresh timing data gets committed with [skip ci]. TIR is profile-independent. It's the semantic graph the planner reads, not the backend-specific output. Diffing TIR means the freshness check only fails when the workflow definition actually changed.
Documentation
CI documentation is notoriously neglected. Someone writes the pipeline and doesn't write a README because it obviously just builds the thing. Except it doesn't: it also runs the test suite, generates docs, measures coverage, and on a tag push, publishes a release that requires manual approval. Six months later a new engineer asks what the pipeline does and the answer is: read the YAML.
loom docs fixes this. Because TIR is the source of truth for the pipeline, documentation can be generated directly from it. The output is structured Markdown that covers:
- Summary: action, artifact, and consequence counts, plus a human-readable description derived from the graph: "This workflow builds artifacts, runs tests, publishes releases, and writes to the repository."
- Artifact graph: a tree showing how data flows through the pipeline, from source inputs to final outputs.
- Actions table: every action with its inputs, outputs, and any consequences it carries.
- Artifacts table: every artifact with its type, the action that produces it, the actions that consume it, and its output path.
- Consequences table: every external mutation in the workflow: what kind it is, which action triggers it, whether it requires approval.
- Secrets table: which secrets the pipeline uses and exactly which actions have access to them.
- Release gates: any consequences that block on explicit approval before they can fire.
Pass a backend and you get a backend summary appended: which features the target backend supports (jobs, dependencies, conditions, matrices, secrets, approvals, runner selection).
The key property: this can't drift. The docs are generated from the same TIR that the planner reads and the backend emits from. If you add a consequence to an action, it shows up in the consequences table. If you add a secret, it shows up in the secrets table. There's no separate README to forget.
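As a sketch of why generated docs can't drift, here is a toy generator for the secrets table. The TIR layout used here (a list of actions, each with a name and an optional secrets list) is a simplification I made up for illustration; the real schema is richer.

```python
def secrets_table(tir):
    """Render a Markdown table of secrets and the actions that can read them."""
    access = {}
    for action in tir["actions"]:
        for secret in action.get("secrets", []):
            access.setdefault(secret, []).append(action["name"])
    lines = ["| Secret | Actions |", "| --- | --- |"]
    for secret in sorted(access):
        lines.append(f"| {secret} | {', '.join(access[secret])} |")
    return "\n".join(lines)


tir = {
    "actions": [
        {"name": "publish", "secrets": ["PYPI_TOKEN"]},
        {"name": "build"},
        {"name": "deploy", "secrets": ["PROD_TOKEN", "PYPI_TOKEN"]},
    ]
}
print(secrets_table(tir))
```

The table is a pure function of the graph, so adding a secret to an action changes the output on the next run with no README to update.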
loom explain covers the other side: not what the workflow does, but why the plan looks the way it does. It shows the full execution plan, per-unit dependencies and ops, an estimated cost breakdown (makespan, transfer bytes, dollar cost), and every optimization decision recorded by name: which pass ran, what changed, and the reason.
optimizations:
  [fusion] install_wheel+test_wheel: 2 units -> 1 (consumers colocated with producer; env marker never transferred)
  [placement] model: copy -> mount_read_only (shared volume available; 48 consumers)
  [actor] build_loom: ubuntu-latest -> arm64-runner (capability match; lower cost)
Between loom docs and loom explain, there's a complete answer to both what the pipeline does and why it's structured the way it is.
CI à la LLMs
The intro mentioned that LLMs are decent at writing CI YAML but don't know your context. ariadne helps with that in a few ways.
The first is the type system. When an LLM writes an @action that returns a Wheel but the downstream action expects a Binary, that's a type error at validation time, caught locally before anything runs. The artifact graph encodes what produces what and what consumes what; the planner checks it statically. Mismatches like this produce structured diagnostics rather than a cryptic CI failure on a remote runner.
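A static check like the Wheel-versus-Binary case can be sketched in a few lines. The check_artifact_types function, its input shapes, and the diagnostic format are hypothetical; ariadne's real diagnostics will look different.

```python
def check_artifact_types(produced, uses):
    """Compare each consumer's expected artifact type with what is produced.

    produced: artifact name -> produced type
    uses: iterable of (action, artifact, expected type)
    """
    diagnostics = []
    for action, artifact, expected in uses:
        actual = produced.get(artifact)
        if actual is None:
            diagnostics.append(
                {"action": action, "artifact": artifact, "error": "not produced"}
            )
        elif actual != expected:
            diagnostics.append(
                {
                    "action": action,
                    "artifact": artifact,
                    "error": f"expected {expected}, got {actual}",
                }
            )
    return diagnostics


produced = {"pkg": "Wheel"}
diags = check_artifact_types(produced, [("run_binary", "pkg", "Binary")])
print(diags)
```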
The second is loom check and loom explain. Before pushing anything, you can validate the full plan locally and get a structured explanation of what ariadne decided and why: which optimizations ran, what changed, and the reason for each decision. Every optimization decision is recorded per pass, so you can see not just what the plan looks like but why each action ended up where it did.
The third is TIR itself. It's serializable JSON, which means an LLM can inspect it, validate it, or hand it to ariadne to plan and get structured diagnostics back. The intended agent flow is that the LLM generates a workflow, ariadne validates and plans, diagnostics go back to the LLM, and it repairs and tries again. Because correctness is enforced at the compiler level and not inferred from YAML structure, the LLM's job narrows to describing what should happen rather than producing valid orchestration YAML for an environment it can't see.
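The agent flow amounts to a bounded repair loop. In the sketch below, generate and validate are placeholder callables standing in for the LLM and for ariadne's validation step; neither is a real API.

```python
def repair_loop(generate, validate, max_rounds=3):
    """Generate, validate, and feed diagnostics back until the workflow is clean."""
    diagnostics = []
    for _ in range(max_rounds):
        workflow = generate(diagnostics)    # the LLM sees the previous diagnostics
        diagnostics = validate(workflow)    # the compiler returns structured errors
        if not diagnostics:
            return workflow
    return None  # give up after max_rounds and hand control back to a human


def fake_generate(diagnostics):
    # Stand-in for an LLM: the first attempt is wrong, the repaired one is right.
    return "good" if diagnostics else "bad"


def fake_validate(workflow):
    return [] if workflow == "good" else ["type mismatch"]


print(repair_loop(fake_generate, fake_validate))
```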
Custom Instructions and Implementations
The built-in specification packs cover the common cases: cargo, pytest, maturin, and so on. But if your org has internal tooling that doesn't map to any of those, you can register your own.
There are two independent extension points, one per selection phase.
Specifications are plan-time, backend-agnostic. A SpecificationDef teaches ariadne how one implementation realizes one semantic action: it has an id, the action it covers, the implementation name, optional hard capability requirements, the tools it depends on, a stability marker, and a build function that takes call args and returns a SpecificationBody. Register it into the specification registry and it participates in selection like any built-in:
r.register(SpecificationDef {
    id: "build.python_wheel.company-builder",
    action: "build.python_wheel",
    implementation: "company-builder",
    requirements: vec![],
    dependencies: vec![],
    stability: Stability::Stable,
    build: |a| {
        ContainerExec {
            image: "registry.company.com/build/python:latest".into(),
            script: format!(
                "company-build-wheel --package {}",
                arg_str(a, "package").unwrap_or_default()
            ),
        }
    },
});
Prefer it in the inventory and it wins selection for that action:
Inventory("my-ci").prefer("company-builder")
That's it. The workflow author still writes build.python_wheel(...). They don't need to know which tool realized it.
Instructions are emit-time, backend-aware. If the specified op build.python_wheel.company-builder should emit a native step on GitHub rather than falling through to the shell command, you register an Instruction in the GitHub backend's catalogue:
Instruction {
    id: InstructionId("github.build.python_wheel.company-builder".into()),
    backend: BackendKind::Github,
    matcher: OpMatcher::for_action_impl("build.python_wheel", "company-builder"),
    requires: vec![Capability::new("impl.company-builder")],
    cost: CostHint { fixed: 3, per_mb: 0 },
    stability: Stability::Stable,
    implementation: json!({
        "kind": "github.uses",
        "ref": "company/build-wheel-action@v1"
    }),
    ..Default::default()
}
Because every semantic op carries a shell fallback, the instruction is an upgrade, not a requirement. A backend without a matching instruction falls through to the shell command and the plan stays correct.
The two layers are fully independent. You can add a new tool as a specification without touching any backend. You can add a native backend step for an existing specification without touching any frontend. And if you have a one-off tool that doesn't fit any semantic action at all, shell(...) in the Python frontend is always available: drop down to a raw shell command with explicit inputs, outputs, and consequences declared so the compiler still understands the boundaries.
Where Things Stand
The core pipeline is working: TIR, validation, planning, the full optimization pass suite, GitHub Actions emission, and a local Podman backend. ariadne is its own guinea pig: the project's CI is authored in Python, compiled by ariadne, and the output lives in .github/workflows/.
One extension I think is genuinely interesting: infrastructure provisioning. The inventory is already a complete specification of an environment: actors with their labels and capabilities, tools with their versions, resource budgets. That's enough information to generate the infrastructure itself. A provisioning backend could read the inventory and spin up a runner cluster with the right specs, build container images with the declared dependencies pre-installed, and emit the Terraform or whatever topology description your CI host needs. You'd write the inventory once and get both a correct execution plan and a fully bootstrapped environment.
This was largely an academic project, and I had a lot of fun applying classical compiler theory to something unconventional. That said, I do have real scalability concerns: providing a good abstraction over the combinatorial explosion of providers, runners, package managers, test frameworks, and everything else is actually quite hard. The specification and instruction layers help, but the design surface is enormous. Open to suggestions on that front.
If you're interested in the project, peep the repo at github.com/abm-77/ariadne. Otherwise, I'd consider my curiosity on this one sated unless there's actual demand for something like it.