Leanstral: Mistral’s open-source foundation for vibe-coding with formal proofs
AI-powered coding agents have already shown they can produce code at a pace that would have been unthinkable just a few years ago, but there is a bottleneck that keeps slowing everyone down: the need for human review. The more sensitive and critical the project, the greater the amount of time and specialized knowledge required to ensure every single line actually does what it promises. This is a long-standing problem, and until now the most common solution was simply throwing more hours of human labor at the machine-generated output, which obviously does not scale well.
Mistral’s vision for tackling this challenge is ambitious: build a new generation of coding agents that not only execute tasks but also formally prove that their implementations are correct against rigorous specifications. Instead of spending hours debugging machine-generated logic, the human simply states what they want, and the system handles both execution and verification. The first major step in that direction has just been taken with the launch of Leanstral.
What is Leanstral and why does it matter
Leanstral is the first open-source code agent specifically designed to work with Lean 4, a formal proof assistant capable of expressing complex mathematical objects like perfectoid spaces and software specifications such as properties of Rust fragments. Unlike other existing proof systems that function as wrappers around large generalist models or focus on isolated math problems, Leanstral was built to operate in realistic formal repositories — the kind you actually encounter day-to-day in serious verification projects.
The architecture behind it is sparse: there are 120 billion parameters total, but only 6 billion are active during inference. That means the model is expressive enough to handle complex formal proof problems without requiring the kind of absurd infrastructure that equivalently sized dense models would need. The practical result is a much lower operating cost and latency that allows real interactive use, without that eternal wait we all know from heavyweight models.
The model is available under an Apache 2.0 license, which means anyone or any company can download, modify, and redistribute it with no commercial restrictions. On top of that, Mistral has promised to publish a technical report detailing the training approach along with a new evaluation suite called FLTEval, designed to push benchmarks beyond the traditional focus on competition mathematics.
Three pillars of Leanstral
Mistral highlighted three core characteristics of the model in the official announcement:
- Open and accessible: the model weights are available under an Apache 2.0 license. It also works in agent mode within Mistral Vibe and has a free API endpoint. The team will also release the technical report and the FLTEval evaluation suite.
- Efficient and powerful: the highly sparse architecture was optimized for proof engineering tasks. By combining parallel inference with Lean as a perfect verifier, Leanstral delivers competitive performance against closed-source competitors at a much lower cost.
- Upgradeable via MCP: Leanstral supports arbitrary MCPs through Vibe and was specifically trained to achieve peak performance with lean-lsp-mcp, a tool frequently used by the Lean community.
How Leanstral works in practice
Lean 4 is a programming language and formal verification framework that lets you write mathematical propositions and then build proofs that those propositions are true. It is used in academic math formalization projects like Mathlib, but it also has direct applications in critical software verification — think embedded systems, cryptographic protocols, and smart contracts. The problem is that writing proofs in Lean requires a level of expertise that very few developers possess. The learning curve is steep, and the process of constructing a complete proof can be extremely time-consuming, even for people who already know the tool well.
This is exactly where Leanstral comes in. It was trained on data specific to the Lean 4 ecosystem so it can suggest tactics, complete intermediate steps, and even generate entire proofs for lemmas and theorems the user is trying to demonstrate. In practice, this transforms the Lean workflow from something purely manual and cerebral into an active collaboration between human and model, where Leanstral proposes paths and Lean’s formal verifier guarantees those paths are mathematically valid. If the model suggests something incorrect, Lean itself rejects the proof, so there is a built-in safety layer that does not depend on blind trust in the model.
Another important point is the native integration with Mistral Vibe, Mistral’s coding agent platform. With support for the MCP protocol — the Model Context Protocol — Leanstral can be plugged into existing development pipelines and function as a piece within a larger code generation and verification workflow. This opens up interesting possibilities, like having one agent generate functional code and then invoking Leanstral to produce proofs that specific properties of that code are respected.
Evaluation: the FLTEval benchmark
To reflect real-world usefulness in proof engineering scenarios, Mistral chose not to evaluate Leanstral solely on isolated math problems. Instead, they used the FLT (Fermat’s Last Theorem) project as a foundation and created the FLTEval benchmark, which measures the model’s ability to complete all formal proofs and correctly define new mathematical concepts in each pull request of the project. This is far more representative of the actual work a proof engineer does on a daily basis.
Leanstral was compared against the leading coding agents on the market, including models from the Claude family — Opus 4.6, Sonnet 4.6, and Haiku 4.5 — as well as large-scale open-source models like Qwen3.5 397B-A17B, Kimi-K2.5 1T-A32B, and GLM5 744B-A40B.
Leanstral vs. open-source models
Leanstral-120B-A6B demonstrated a significant efficiency advantage over its open-source peers, which are much larger in terms of active parameters. Models like GLM5-744B-A40B and Kimi-K2.5-1T-32B struggled to scale, with their FLTEval scores plateauing at roughly 16.6 and 20.1, respectively. Leanstral surpassed both with just a single inference pass.
Even Qwen3.5-397B-A17B, the strongest open-source competitor evaluated, needed 4 passes to reach a score of 25.4. In contrast, Leanstral hit 26.3 with half that investment using pass@2, and continued to scale linearly, reaching 29.3 at the same cost level.
Leanstral vs. the Claude family
The results against the Claude family are where Leanstral’s value proposition becomes most apparent. At pass@2, it reaches 26.3 points, beating Sonnet 4.6 by 2.6 points at a cost of just $36 versus $549 for Sonnet. At pass@16, Leanstral climbs to 31.9 points, beating Sonnet by a comfortable 8 points.
Claude Opus 4.6 remains the leader in absolute quality at 39.6 points, but it carries a staggering cost of $1,650 — that is 92 times more expensive than running Leanstral. For context, here is the full table:
- Haiku 4.5: $184 — 23.0 points
- Sonnet 4.6: $549 — 23.7 points
- Opus 4.6: $1,650 — 39.6 points
- Leanstral (pass@1): $18 — 21.9 points
- Leanstral (pass@2): $36 — 26.3 points
- Leanstral (pass@4): $72 — 29.3 points
- Leanstral (pass@8): $145 — 31.0 points
- Leanstral (pass@16): $290 — 31.9 points
One important detail: in the benchmarks, Mistral used Mistral Vibe as a scaffold with no evaluation-specific modifications whatsoever.
Case studies: Leanstral in action
Benchmark numbers are great, but what really counts is how the model behaves in real-world situations. Mistral presented two case studies that illustrate Leanstral’s capabilities nicely.
Solving real Lean migration problems
When breaking changes land in a new Lean version, migrating code can turn into a massive headache. The team fed Leanstral a real question from the Proof Assistants Stack Exchange about a script that mysteriously stopped compiling on Lean 4.29.0-rc6, a version so recent it was not even used during the model’s training.
The issue involved a rewrite tactic (rw) that suddenly started failing when trying to match patterns involving a simple type alias originally written as def T2 := List Bool.
Instead of blindly guessing at a fix, Leanstral was methodical. It built test code to recreate the failing environment and diagnosed the underlying problem related to definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the structure it needed for pattern matching.
The proposed fix was simple and elegant: swap def for abbrev. Since abbrev creates a transparent alias that is immediately definitionally equal to the original type, the rw tactic was able to match perfectly again. Beyond solving the problem, Leanstral clearly explained the rationale behind the fix to the user. 👏
Reasoning about programs
In the second case study, the team copied definitions in Rocq (formerly known as Coq) from a Princeton course on imperative languages and asked Leanstral to convert everything to Lean 4. The model successfully completed the conversion, including the implementation of custom notation. Even more impressive: when given only the statement of a proposition in Rocq — without the proof — Leanstral managed to translate it into Lean and then prove properties about programs in that language. This kind of cross-system translation and reasoning capability is extremely difficult and demonstrates a deep understanding of both the semantics and the mechanics of proofs.
What this changes for coding agents
The most significant impact of Leanstral may not be in pure mathematics, but in what it represents for the future of coding agents. Today, most AI agents that generate code work in a relatively straightforward way: receive an instruction, produce code, and hope the tests pass. When tests fail, the agent tries to fix the problem in a trial-and-error loop. This approach works reasonably well for simple tasks but becomes fragile in scenarios where code correctness is absolutely critical.
Introducing formal proofs into this pipeline fundamentally changes the equation, because the agent stops relying solely on empirical tests and starts offering mathematical guarantees about the code’s behavior. Instead of having 95% test coverage and praying the remaining 5% does not hide a catastrophic bug, the agent can formally demonstrate that certain properties always hold, regardless of the inputs.
With the integration into Mistral Vibe and support for the MCP protocol, you can imagine workflows where one coding agent generates an implementation, another agent translates the requirements into formal specifications in Lean 4, and Leanstral takes care of producing proofs that the implementation satisfies those specifications. If any proof fails, the system knows exactly where the problem is and can direct the fix far more precisely than a simple broken test log. This kind of pipeline is still in its early stages, but the fact that all the components already exist as accessible open-source tools is a huge step in the right direction.
How to use Leanstral right now
Leanstral is already available for immediate use through three paths:
- Mistral Vibe (zero setup): the model has been integrated directly into Mistral Vibe for vibe coding and instant proofing, with no configuration needed. Just use the
/leanstralcommand to get started. - Labs API: the model can be accessed through the free or near-free endpoint
labs-leanstral-2603. Mistral is keeping this endpoint highly accessible for a limited time to collect realistic feedback and observability data that will feed the next generation of verified code models. - Weight download: since the model is under an Apache 2.0 license, anyone can download it and run it on their own infrastructure.
The trend of AI specialization
The launch of Leanstral also signals a broader trend in the AI model market: specialization. Instead of trying to build a single model that does everything at a mediocre level, companies like Mistral are investing in smaller, more efficient models deeply specialized in specific domains. The results speak for themselves: with just 6 billion active parameters, Leanstral outperforms models with tens of times more active parameters on proof engineering tasks, and it does so at a fraction of the cost.
For anyone following the artificial intelligence ecosystem, this is a reminder that the biggest model is not always the best model. Sometimes what really makes the difference is having the right tool for the right problem. Leanstral is a clear example of how this philosophy can produce impressive results without requiring absurd computational resources. And the fact that it is completely open-source means the community can contribute, adapt, and build on top of this foundation, accelerating the path toward a future where AI-generated code comes with formal correctness guarantees 🚀
