Who's going to review AI code?
This question popped into my head back in November 2025 when I noticed that more and more of the code I reviewed and created was being generated with the help of AI tools. While four months feels like an eternity in this day and age, the question still remains: who's going to review AI code?
Disclaimer
These days, writing about AI without a two-week expiration date attached is quite the challenge. However, I believe there are some more stable topics and questions we can explore related to the subject, without the risk of becoming outdated by the time we finish this post. I use "AI-assisted programming," "AI," and "agents" somewhat interchangeably, and in all cases, I'm considering the LLM type of AI tooling and not others.
Each generation raises the floor
Grady Booch observes that the history of software is a story of rising levels of abstraction, and I find this quite an accurate description. Seen through that lens, AI-assisted programming represents yet another jump in abstraction.
The abstraction framing is excellent because it gives us a basis for comparing how previous jumps played out. When transitioning from assembly language to C, from C to C++, or even from punched cards to magnetic memory, there's a period where the current technology and the future abstraction coexist. This transition period isn't just about porting existing systems; it also provides enough time for the new technology to be adjusted and proven viable.
Who's on the hook?
Before diving into viability, let me jot down a few lines about liability. I'm not referring to it in the strict legal sense but in the context of "Who did this?" Most software development isn't highly risky, so I wouldn't assume that all software engineers out there have had the chance to think deeply about responsibility. However, I think I can frame it in a way that most software engineers can relate to.
A new library has been released for whatever purpose you find most interesting. The library promises to solve common issues that arise from not using it, and it also makes the case that it will abstract parts of the routines you need to work on, freeing up time for more important tasks. When you start experimenting with this library, there's an implicit expectation that it does what it claims, at least in terms of functionality. This isn't about subtle bugs; the library works. After experimenting with it for some time, you decide to bring it into your codebase.
This whole process depends on the cost of integrating the library into your codebase, making it worth the hassle of learning it, as well as the risks that come with it. If the library is not stable from the start, you can still include it, but you're taking on a larger risk since you might not be sure about its overall stability or where it could leave you stranded. Note, you are, as the engineer responsible for the system, ultimately responsible for the overall codebase, even if you didn't review every line of the library.
Over time, the library has proven to be useful, stable, and reliable, so at this point, there are few concerns about any defects arising from it. The library's abstraction was successful, but more importantly, the liability of having it was balanced well enough that its use doesn't create a sense of imminent failure in the system. Adding a new tool, language, or build system, whatever it may be, all of these have a common trait: they absorb the liability of what they were designed for. If they don’t, your options are to either choose another technology or create one yourself.
The trust problem
I’ve chosen a liability framing for this argument, but we can ultimately reframe it as a trust challenge. Every time the topic of trusting what an AI agent says comes up, this old joke always pops into my mind:
All the professors from a university's engineering department win an all-expenses-paid trip to the Bahamas. They board the plane, and the flight attendant announces a warm welcome to the professors, mentioning that this airplane was designed by their very own engineering students. All of the professors but one stand up and leave the plane. The flight attendant asks the remaining professor, "Aren't you afraid of flying in this aircraft?" He replies, "If it were my students who built and designed this thing, I'm confident the engines won't even start!"
Some of you might know that for certain applications, software can't be designed and programmed in the same way most systems are created. From formal specifications to considering bit flips caused by cosmic rays, minimizing uncertainty is crucial depending on the required reliability of the system. This process involves specific development patterns and verifiable testing gauntlets that ensure the system is within its risk margin.
Within our enterprise corporate systems, the bar is far lower, but it isn't non-existent. One cannot break production every other day because changes went unreviewed. Which brings me back to the key point of this post:
The adoption of AI-assisted programming depends on how well it can absorb the liability of its use in the first place.
The point is that it's not about whether AI-assisted programming can be applied, but rather what degree of risk it can absorb. If we can develop processes and tools that create a better liability shift, then its adoption can grow until those techniques are no longer sufficient. In other words, the adoption and use of AI-assisted programming hinge on our ability to trust the artifacts produced; otherwise, the "gains" in productivity will be limited by our capacity to verify. More succinctly, if you have to review every line an AI agent changes, then its throughput is capped by your review capacity.
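The cap can be made concrete with a back-of-the-envelope model: if every generated line must pass human review, shipped output is bounded by review speed, not generation speed. A minimal sketch, where all the figures are illustrative assumptions rather than measurements:

```python
# Back-of-the-envelope: effective throughput when every AI-generated
# line must pass human review. All numbers are illustrative assumptions.

def effective_loc_per_day(gen_loc_per_day: float, review_loc_per_day: float) -> float:
    """Shipped lines per day is the minimum of what can be generated
    and what can be verified: the slower stage is the bottleneck."""
    return min(gen_loc_per_day, review_loc_per_day)

# An agent might draft 5,000 lines a day, while a careful reviewer
# might get through 500. Verification, not generation, sets the pace.
print(effective_loc_per_day(5_000, 500))  # -> 500
```

Making the tools generate faster only moves the first number; the liability shift this post argues for is what moves the second.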
The verification gap
There are several efforts underway to create "better" ways to provide context to AI tools. However, all of them face one or more significant limitations of the current technology, along with another constraint: complex systems can seldom be specified without incurring unacceptable costs for most organizations.
Back at my previous job at Thoughtworks, we used to say that our consultants wrote significantly more test code than production code. This was a direct result of Agile software development, the iterative process, eXtreme programming, and the fact that consultants would join ongoing projects with tight deadlines to modify complex systems, sometimes decades old, without breaking everything else.
Testing frameworks, continuous integration, continuous delivery, and other techniques stem from our understanding that larger systems are complex. Software is mostly grown rather than designed, and situational awareness of the current state of the system under change is a key factor in most software development.
I believe the pressure AI-assisted programming puts on the input side, by generating and altering larger amounts of code, combined with the non-determinism of these technologies, heightens the need to verify these changes. Unfortunately, I still find that not all software engineers are adequately trained in testing, and there have been few innovations in verification technologies. This often results in complex test codebases, slow builds, and high maintenance costs.
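One verification technique that scales with larger diffs is property-based checking: rather than reading every generated line, you assert invariants that must hold for any input. A minimal stdlib-only sketch, where the hypothetical `merge_sorted` stands in for any AI-generated function:

```python
import random

def merge_sorted(a, b):
    """Hypothetical AI-generated function under verification:
    merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:] + b[j:]

def check_merge_properties(trials=1000):
    """Property check: for many random inputs, the output must be
    exactly the sorted multiset union of both inputs. We never read
    the implementation; we only constrain its observable behavior."""
    for _ in range(trials):
        a = sorted(random.choices(range(100), k=random.randint(0, 20)))
        b = sorted(random.choices(range(100), k=random.randint(0, 20)))
        assert merge_sorted(a, b) == sorted(a + b), (a, b)
    return True

print(check_merge_properties())  # -> True
```

Mature libraries such as Hypothesis automate this pattern, including shrinking failing inputs to minimal counterexamples; the point here is only that stating invariants lets verification effort grow with the number of properties, not with the number of generated lines.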
So, while innovation is advancing on one side, investment in verification technologies, techniques, processes, and other methods that engineers can rely on is essential for successfully shifting the liability of adopting these tools.
So, who reviews the code?
I hope I've shown that successful adoption of these new AI tools into software engineering isn't just a matter of how well they perform. Given the non-determinism and complexity of software-intensive systems, we won't be able to harness all this innovation without also advancing our verification technologies and modernizing our testing techniques. An engineer can't simply overlook the risks of changing a system; doing so isn't engineering, it's gambling. Engineers use tools and processes to build safely.