When AI generates code, who makes sure it actually works?
AI models have become pretty famous for generating code fast, but there is a question nobody really likes to answer: does that code actually work?
Folks who work in development know the bottleneck is no longer writing the code. It is reviewing it. It is that moment when a human needs to sit down, analyze line by line, and make sure what the AI generated actually makes sense, especially when the project is critical, when a mistake is expensive, or when mathematical precision has zero room for negotiation. Mistral AI described this as the main barrier to modern engineering velocity: human review is the chokepoint that limits how much we can scale the use of AI agents in high-risk domains.
What if the software could prove itself, with the model mathematically verifying what it just generated? That is exactly what Mistral is proposing with Leanstral, the first open-source agent designed to work with Lean 4, a proof assistant capable of handling both high-level mathematics and real-world software specifications. The idea significantly changes how we think about AI code generation. Instead of debugging what the model created, you specify what you want, and the model proves that its output meets that specification. 🚀
What Leanstral is and why it matters now
Leanstral is a formal proof agent developed by Mistral AI and built on Lean 4, a functional programming language and interactive proof system. Lean 4 is not just any tool: it has already been used to express complex mathematical objects like perfectoid spaces and software specifications such as properties of Rust fragments, which shows how deep this platform can go when it comes to verification. Unlike other proof systems that work as thin layers on top of generalist models or that focus on isolated math problems, Leanstral was designed from the ground up to operate on realistic formal repositories, the kind that actually look like production software projects.
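To make the "programming language and proof system in one" point concrete, here is a minimal Lean 4 sketch. The function `double` and the theorem name are hypothetical, chosen only for illustration and not taken from Mistral's material, and the sketch assumes a recent Lean 4 toolchain in which the `omega` tactic ships with core.

```lean
-- Hypothetical example: an ordinary function definition...
def double (n : Nat) : Nat := n + n

-- ...and a statement about it that Lean checks mechanically.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double   -- expose the body: the goal becomes n + n = 2 * n
  omega           -- linear-arithmetic decision procedure closes the goal

-- The same file can also run the function as a program.
#eval double 21   -- 42
```

If the proof were wrong or incomplete, Lean would reject the file, which is the property an agent like Leanstral can lean on when it checks its own output.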
One technical detail that makes a big difference is the architecture. Leanstral uses a highly sparse architecture with only 6 billion active parameters out of 120 billion total, so roughly 5% of the model is active at a time. That keeps inference cheap relative to dense models of similar size, which is rare at this level of capability. Mistral optimized the model specifically for proof engineering tasks and relies on parallel inference with Lean as a perfect verifier: several candidate proofs can be sampled at once, and Lean either accepts or rejects each one, so a single accepted candidate is enough.
Historically, formal proof systems were confined to academic environments or sectors with extreme safety requirements, like aerospace and critical embedded systems. The cost of writing formal proofs by hand was just too high for most software projects. With the arrival of AI models capable of reasoning about mathematical logic, that barrier is starting to come down, and Leanstral is one of the first projects that tries to turn this into something concrete, functional, and available to the developer community under a permissive license.
The three pillars of Leanstral: open, efficient, and integrable
Mistral built Leanstral around three characteristics that reveal a lot about the project's strategy:
- Open and accessible: The model weights were released under the Apache 2.0 license, one of the most permissive licenses in the open-source world. On top of that, the model is available in agent mode within Mistral Vibe and through a free API endpoint. The company also promised to release a detailed technical report on the training approach and a new evaluation suite called FLTEval, which moves evaluations beyond the traditional focus on competition mathematics.
- Efficient and powerful: The sparse architecture allows Leanstral to deliver competitive results against much larger models while spending a fraction of the computational resources. By leveraging Lean as a perfect verifier, the system manages to be both performant and economical compared to closed-source competitors.
- Integrable via MCP: Leanstral supports arbitrary MCP (Model Context Protocol) servers through Vibe and was specifically trained to achieve peak performance with lean-lsp-mcp, which is the most widely used MCP server in the Lean community. This means the agent can be extended and customized without having to rewrite anything from scratch.
Benchmarks: numbers that speak for themselves
One of the strongest points of the Leanstral announcement is the benchmark results, which were not measured the traditional way. Instead of testing the model on isolated math problems, Mistral evaluated Leanstral on completing full formal proofs and correctly defining new mathematical concepts in each pull request of the FLT (Fermat's Last Theorem) project, a real mathematical formalization repository. This approach reflects real-world proof engineering utility much more accurately.
Leanstral vs. open-source models
Leanstral-120B-A6B demonstrated a significant efficiency advantage over its considerably larger open-source peers. While models like GLM5-744B-A40B and Kimi-K2.5-1T-32B plateaued with maximum FLTEval scores of approximately 16.6 and 20.1 respectively, Leanstral surpassed both with just a single inference pass.
Qwen3.5-397B-A17B, the strongest open-source competitor in the test, needed 4 passes to reach a score of 25.4. In contrast, Leanstral achieved a superior score of 26.3 with half that number of passes (pass@2) and kept improving with more passes, reaching 29.3 with 4 passes. This is especially impressive when you consider that Leanstral has only 6 billion active parameters while its competitors operate with tens of billions.
Leanstral vs. the Claude family
The comparison with Anthropic’s Claude models is where the numbers get even more revealing. Here are the results published by Mistral:
- Claude Haiku: $184, score 23.0
- Claude Sonnet: $549, score 23.7
- Claude Opus 4.6: $1,650, score 39.6 (the quality leader)
- Leanstral (pass@1): $18, score 21.9
- Leanstral (pass@2): $36, score 26.3 (2.6 points ahead of Sonnet)
- Leanstral (pass@4): $72, score 29.3
- Leanstral (pass@8): $145, score 31.0
- Leanstral (pass@16): $290, score 31.9 (8.2 points ahead of Sonnet)
The numbers show that Leanstral serves as a high-value alternative to the Claude suite. Pass@2 beats Sonnet while costing just $36, compared to $549 for the Anthropic model. And while Claude Opus 4.6 remains the absolute quality leader, its run cost $1,650, roughly 92 times the $18 of a single Leanstral pass. For anyone managing an AI infrastructure budget, that gap is massive. 💰
It is worth noting that in the benchmarks Mistral used Mistral Vibe as a scaffold without any evaluation-specific modifications, which reinforces that the results reflect the model's behavior under real-world usage conditions.
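As a quick sanity check on the cost comparison, the ratios can be bounded with plain natural-number arithmetic in Lean 4 itself. This is only an illustrative sketch: the dollar figures are copied from the list above, and nothing here comes from Mistral's evaluation code.

```lean
-- Costs in US dollars, copied from the published list.
-- Opus 4.6 ($1,650) vs. Leanstral pass@1 ($18): the ratio lies between 91 and 92.
example : 18 * 91 < 1650 ∧ 1650 < 18 * 92 := by decide

-- Sonnet ($549) vs. Leanstral pass@2 ($36): the ratio lies between 15 and 16.
example : 36 * 15 < 549 ∧ 549 < 36 * 16 := by decide

-- Floating-point versions of the same ratios, for a quick look.
#eval (1650.0 : Float) / 18.0
#eval (549.0 : Float) / 36.0
```

So the Opus run cost a little under 92 times a single Leanstral pass, and the Sonnet run cost a little over 15 times the pass@2 configuration that outscored it.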
Real-world use cases that prove the concept
Solving real problems from the Lean community
One of the most interesting case studies Mistral presented involves a situation any developer knows all too well: when an update breaks everything. The team fed Leanstral a real question from the Proof Assistants Stack Exchange about a script that mysteriously stopped compiling after updating to Lean 4.29.0-rc6, a version so recent that the model was not even trained on it.
The problem involved a rewrite tactic (rw) that suddenly failed to match patterns involving a simple type alias, originally written as def T2 := List Bool. Instead of guessing a solution, Leanstral built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that def creates a rigid definition that requires explicit unfolding, which was blocking the rw tactic from seeing the structure it needed for pattern matching.
The proposed fix was straightforward: swap def for abbrev. Because abbrev creates a transparent alias that is immediately definitionally equal to the original type, the rw tactic worked perfectly again. And Leanstral did not just solve the problem, it explained the full reasoning to the user. This demonstrates that the agent is not just capable of fixing code, but of communicating why the fix works, which is essential for team learning and trust.
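The difference between def and abbrev can be seen in a small Lean 4 sketch. It is not the script from the Stack Exchange question and does not reproduce the rw failure itself; the names `T1` and `T2` are hypothetical, and the example uses type class resolution as a simpler place where the same transparency difference shows up.

```lean
-- Hypothetical names; this is not the original failing script.
def T1 := List Bool      -- ordinary definition: not unfolded automatically
abbrev T2 := List Bool   -- reducible alias: Lean unfolds it automatically

-- Type class resolution only unfolds reducible definitions, so it finds
-- the `Inhabited (List Bool)` instance through `T2`.
example : Inhabited T2 := inferInstance

-- The same line with `T1` is expected to fail, because `T1` is not unfolded:
-- example : Inhabited T1 := inferInstance

-- Both aliases are still definitionally equal to `List Bool`; the
-- difference is only in how eagerly Lean unfolds them.
example : T1 = List Bool := rfl
example : T2 = List Bool := rfl
```

Tactics that match patterns syntactically up to reducible unfolding are affected in the same way, which is why switching the alias to abbrev can make a previously stuck rewrite go through.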
Reasoning about programs and translating between languages
Another demonstrated use case involved translating definitions written in Rocq (formerly Coq), based on material from the Princeton Computer Science course, into Lean 4. Leanstral performed the conversion successfully, including the implementation of custom notation. Even more impressive, the agent managed to translate and then prove properties about programs in that language when given only the statement in Rocq without the proof, showing the ability to reason abstractly about program behavior.
Open-source as strategy, not just philosophy
The decision to release Leanstral as open-source says a lot about the direction Mistral is heading. In a market where big AI companies lock their most powerful models behind paid APIs, Mistral has consistently bet on openness as a competitive advantage. This creates a virtuous cycle: the community contributes improvements, use cases, and integrations, the model evolves faster, and the company earns technical credibility organically.
For Leanstral specifically, open-source is even more strategic because the formal proof space is dominated by academic tools with steep learning curves and small communities. By releasing the weights and making integration with modern development workflows easy, Mistral is essentially expanding the potential market for formal proof beyond research labs. Developers who would never consider using Rocq or Isabelle can start experimenting with Lean 4 through a much friendlier, AI-driven interface.
Additionally, the open-source nature of the project allows researchers and companies to inspect and run the agent themselves, which is especially important when the whole point is to guarantee reliability. It would be contradictory to use a formal verification tool that you cannot verify yourself. By releasing the model weights under the Apache 2.0 license, Mistral reinforces the coherence between the product proposition and how it is distributed, something the AI market has been increasingly demanding from companies in the space. 🔍
How to access and start using Leanstral
Mistral made Leanstral available through three different channels, designed for different user profiles:
- Zero-Setup on Mistral Vibe: Leanstral is integrated directly into Mistral Vibe for immediate use, with no prior configuration needed. Just use the /leanstral command to activate the agent, then press Shift+Tab until the model appears as Leanstral. Alternatively, you can run vibe --agent lean directly from the CLI.
- Labs API: The model can be accessed through the labs-leanstral-2603 API endpoint, which is being kept free or at near-zero cost for a limited time. The idea is to collect realistic feedback and observability data to fuel the next generation of verified code models.
- Downloadable weights: Those who prefer can download the model licensed under Apache 2.0 and run it on their own infrastructure, with full autonomy over deployment.
What changes in the daily workflow of developers using AI
For anyone already using AI models in development, Leanstral represents an important evolution in the workflow. Today, the typical process is to generate code with a model, manually review it, write tests, and hope the coverage is good enough. With a software proving tool integrated into the generation process, part of that cycle can be replaced by automated mathematical verification, which does not eliminate the developer role but focuses their attention on the points that genuinely need human judgment.
The most immediate impact is on trust in generated code. One of the biggest friction points in adopting AI for code generation in critical projects is precisely the difficulty of auditing what was produced. Formal proofs address this problem directly: if Leanstral delivers a proof verified by Lean 4, there is no subjective interpretation about whether the code meets its specification. It does, within the specified premises, so what remains for humans to review is whether the specification itself says the right thing. This makes code reviews, security audits, and certification approvals that require formal evidence of correctness significantly easier.
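The phrase "within the specified premises" matters, and a minimal Lean 4 sketch shows why. The function `decrement` and its theorem are hypothetical examples, not output from Leanstral, and the sketch assumes a toolchain where `omega` is available in core.

```lean
-- Hypothetical example of a guarantee that holds only under a stated premise.
def decrement (n : Nat) : Nat := n - 1

-- With the premise `0 < n`, the result is strictly smaller than the input.
theorem decrement_lt (n : Nat) (h : 0 < n) : decrement n < n := by
  unfold decrement   -- the goal becomes n - 1 < n
  omega              -- uses the hypothesis h to close the goal

-- Without the premise the claim is false: natural-number subtraction
-- truncates at zero, so `decrement 0` is `0`, which is not less than `0`.
example : decrement 0 = 0 := rfl
```

The proof is airtight, but only for callers that satisfy `0 < n`. Deciding whether that premise is the right one for the application is still a human judgment call.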
The most exciting scenario, however, is the long-term one. As AI models become more capable of reasoning about formal logic and proof systems become faster and more expressive, the boundary between writing software and specifying software keeps getting thinner. Leanstral is not just a new product. It is an indicator of where AI-assisted software engineering is heading, and Mistral is positioning the agent as a central piece in that transition. 💡
Leanstral is already publicly available and can be accessed directly through Mistral AI's official channels, including Mistral Vibe, the Labs API, and the model weights download, along with full documentation for anyone who wants to start exploring formal proof with AI without having to start from scratch.
