AI-generated code that works — and proves it

How do you know that code behaves as expected?

For decades, the standard answer in software engineering has been straightforward: write tests, run them, and check that they pass. But testing only tells you that a program behaved correctly on the inputs that were tried. It says nothing about all the inputs that were not.

As Edsger Dijkstra famously observed more than fifty years ago:

“Program testing can be used to show the presence of bugs, but never to show their absence.”

While testing remains one of the most effective tools for discovering bugs, safety- and mission-critical systems cannot rely on post-deployment patches when failures occur. A bug in avionics software, semiconductor tooling, embedded systems, financial infrastructure, or autonomous systems can have severe consequences. In these environments, the question is not merely whether software appears to work during testing, but whether we can establish stronger guarantees about all possible behaviors of the system.

That is the role of formal methods.

Formal methods are mathematically rigorous techniques for proving that a program satisfies a specification. Rather than reasoning only about observed executions, formal methods reason about every execution permitted by the system.

For decades, these techniques have been used in some of the world’s most demanding software environments, including aerospace, rail infrastructure, operating systems, and cryptographic systems [1,2,3,4,5,11]. The challenge in scaling formal methods to general software has historically been cost: building machine-checked proofs traditionally required deep expertise and enormous engineering effort, limiting the use of formal methods to domains where correctness was absolutely critical.

Today, however, the equation is beginning to change.

Trusted code translation at Code Metal

At Code Metal, this is the central problem we are trying to solve: building AI-driven code translation systems for domains where correctness is non-negotiable. Whether translating CUDA kernels into OpenCL, converting M files into VHDL, or modernizing legacy C++ into Rust, the challenge is not merely generating code that compiles, but proving that the translated system preserves the behavior and guarantees of the original one. This is exactly the kind of setting where formal methods can provide uniquely strong guarantees.

The companies we work with in aerospace, defense, semiconductors, and automotive operate under strict correctness requirements where failures can lead to recalls, certification issues, security vulnerabilities, or mission failure.

In these settings, checking that the code compiles is not enough.

The real question is:

Does the translated code preserve the behavior and guarantees of the original one?

That is a fundamentally difficult problem.

Two programs written in different languages and targeting different architectures will almost never be identical internally. A Rust implementation and its C++ counterpart may differ substantially in memory layout, optimization strategies, execution order, and low-level behavior. What matters instead is whether they agree at the level of observable behavior:

Do the same inputs produce the same outputs?
Are the same invariants preserved?
Are the same safety and compliance guarantees maintained?
Does the translated system preserve the intended semantics of the original?

Code Metal’s approach uses a full spectrum of software-assurance techniques throughout the translation pipeline. Lightweight validation techniques such as differential and property-based testing, type and compatibility analysis, and static analysis continuously guide and validate the AI system during generation, helping uncover behavioral discrepancies, semantic mismatches, vulnerabilities, and undefined behavior early in the process.

At the stronger end of the spectrum, formal verification is used whenever possible to establish rigorous guarantees about behavioral equivalence between the original and translated systems.

Validation and verification are integrated throughout the entire pipeline rather than treated as a final post-processing step. Source programs are analyzed, translated, continuously checked for semantic consistency, and optimized for the target hardware architecture while preserving intended behavior.

The goal is not merely to generate code quickly, but to generate software that can be trusted.

Why now: AI and formal methods

Historically, formal methods were powerful but difficult to scale. Building machine-checked proofs often required years of specialized work. Systems such as the formally verified seL4 microkernel [6] and the CompCert verified C compiler [7] demonstrated that full verification was possible, but also highlighted how expensive it could be.

AI is beginning to change that equation. Modern language models are remarkably effective at navigating large search spaces. They can assist with tasks that previously required substantial formal-methods expertise: generating candidate implementations, proposing invariants, drafting proof sketches, searching for useful intermediate lemmas, and synthesizing verification artifacts alongside generated code.

Crucially, this use of AI does not require trusting the model itself. Formal verification systems rely on small trusted kernels whose sole responsibility is to independently check correctness. The model may generate many incorrect proof attempts, but the checker remains authoritative: the proof is either accepted or rejected.

That distinction is fundamental. AI scales generation; formal methods scale trust.

This combination is becoming increasingly important as engineers rely more heavily on AI systems to generate large amounts of software. As models begin producing hundreds of thousands — or even millions — of lines of code, traditional review processes become increasingly infeasible.

At Code Metal, we use AI not only to generate candidate translations, but also to help construct the formal artifacts that justify them. The same systems that synthesize code can assist with generating invariants, certificates, proof obligations, and verification scaffolding alongside the generated programs. The entire process is fully automatic, no human intervention is required.

The goal is not merely:

“Trust the model.”

The goal is:

“Here is the evidence. Verify it yourself.”

That is what “AI-generated code that works — and proves it” means at Code Metal.

Welcoming Prof. Loris D’Antoni

Today, we are excited to announce that Professor Loris D’Antoni is joining Code Metal as our first Code Metal Scholar.

Prof. D’Antoni joins Code Metal’s Formal Methods group, a team composed of researchers and engineers with backgrounds spanning NASA, Galois, Bloomberg, SRI International, and Amazon Web Services.

Prof. D’Antoni will continue serving as a Jacobs Faculty Scholar and Full Professor of Computer Science at the University of California San Diego. For more than a decade, he has been a leading researcher in formal methods and their applications to program verification and program synthesis, the science of generating programs guaranteed to satisfy a specification. More recently, his work has focused on an increasingly important question: how can we ensure that the outputs of large language models are provably correct with respect to a specification?

Two of his current research directions are especially aligned with Code Metal’s mission.

The first direction is specification-aligned LLMs: techniques for ensuring that large language models generate code that satisfies formal correctness constraints by construction. Projects such as ChopChop [8] and Constrained Adaptive Rejection Sampling [9] explore how formal methods and probabilistic reasoning can be combined to steer model generation toward verifiable outputs. The second direction is synthesizing personalized compilers: using program synthesis and LLMs to automatically construct compiler components [10] tailored to a developer’s codebase, invariants, and optimization goals.

Before joining Code Metal, Prof. D’Antoni also served as an Amazon Scholar at Amazon Web Services, where he designed formal methods for verifying and synthesizing access control policies.

“My goal as a researcher has always been to help people write software they can trust. Code Metal is combining formal methods and AI to build a new generation of trusted software translation systems that can serve as the foundation for critical software infrastructure.

Beyond the importance of the problem itself and the great people involved in this work, what drew me to Code Metal is the team’s goal to bring the full arsenal of formal methods to bear on it, with the ambition of giving genuinely new meaning to the word ‘trusted’ in AI-driven software systems.”

— Loris D’Antoni

Welcome, Loris!

Advancing trusted AI-driven code generation at Code Metal

The arrival of Professor Loris D’Antoni at Code Metal further strengthens the company’s formal methods team and reflects a continued investment in bringing rigorous software assurance techniques into AI-driven code generation.

At Code Metal, we believe the next generation of code translation systems will require not only advances in AI, but more importantly, a deep integration of formal methods technologies to ensure that generated software can be trusted in high-assurance environments.

If you are a researcher, engineer, or practitioner working at the intersection of AI and formal methods, we invite you to follow Code Metal and check out our open positions as we build the next generation of trusted AI-driven code generation.

References

[1] A brief overview of NASA Langley’s Research Program in Formal Methods, System Validation Methods Branch, NASA LaRC, 1992

[2] L. Titolo, M. Moscato, C. Muñoz, A. Dutle, and F. Bobot. A Formally Verified Floating-Point Implementation of the Compact Position Reporting Algorithm, 22nd International Symposium on Formal Methods (FM 2018).

[3] C. Muñoz. Formal Methods in Air Traffic Management: The Case of Unmanned Aircraft Systems, 12th International Colloquium on Theoretical Aspects of Computing (ICTAC 2015) (Invited Lecture).

[4] Formal Methods in Action in the Railways, ClearSy, White Paper, April 2022.

[5] D. Delmas and J. Souyris. Astrée: from Research to Industry, 14th International Static Analysis Symposium (SAS 2007).

[6] C. Klein, J. Andronick, K. Elphinstone, T. Murray, T. Sewell, R. Kolanski, and G. Heiser. Comprehensive Formal Verification of an OS Microkernel, ACM Transactions on Computer Systems, Vol. 32, No. 1 (2014).

[7] X. Leroy, S. Blazy, D. Kästner, B. Schommer, M. Pister, C. Ferdinand. CompCert — A Formally Verified Optimizing Compiler, 8th European Congress on Embedded Real Time Software and Systems (ERTS 2016).

[8] S. Nagy, T. Zhou, N. Polikarpova, L. D’Antoni. ChopChop: a Programmable Framework for Semantically Constraining the Output of Language Models, 53rd ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2026).

[9] P. Parys, S. Vaidya, T. Berg-Kirkpatrick, L. D’Antoni. Constrained Adaptive Rejection Sampling, 43rd International Conference on Machine Learning.

[10] X. Peng, D. Kennedy, Y. Fan, B. Greenman, J. Regehr, L. D’Antoni. Nice to Meet You: Synthesizing Practical MLIR Abstract Transformers, 53rd ACM SIGPLAN Symposium on Principles of Programming Languages (POPL 2026).

[11] A. Appel. Verification of a Cryptographic Primitive: SHA-256, ACM Transactions on Programming Languages and Systems (TOPLAS 2015), Volume 37, Issue 2.

Related Research

The Real Cost of Leaving NVIDIA

What Automated Transpilation Actually Costs, and What It Doesn't

UniPar: A Unified LLM-Based Framework for Parallel and Accelerated Code Translation in HPC

Introduces UniPar, an evaluation framework for assessing how large language models translate between parallel programming languages, achieving 69% compilation success and 33% functional correctness through fine-tuning and compiler-guided repair.