I’ve had a hypothesis I’ve been turning over in my head for a while. The current version goes something like this:
By default, AI tools increase speed and throughput but damage software quality.
The marketing for the tools boasts about speed and throughput if you just use them everywhere…and that leads to them being used everywhere, even if they shouldn’t be.
So I talked to some folks and did some research around the topic, and it turns out that other people have found the same thing:
The 2024 DORA report found that while individual productivity is up, delivery stability suffers and the delivery cycle is actually negatively affected.
Erik Doernenburg has a great example where his agent made some changes that would have led to code quality issues and defects had they not been caught.
While AI can be used to create tests, trusting the agent completely can be fraught with peril. It can make a lot of tests quickly, but quantity tends to be the focus, not quality, and it can enshrine bugs as expected behavior.
Interestingly, I’ve seen this multiple times on a project. Here’s just one example:
I was working on a code review where a developer had let the AI write unit tests. Those tests were written quickly, but they contained duplicates, some were incorrect (they would have passed without the code actually working correctly), and on top of that, one requirement was completely untested!
The developer got to move a ticket across their board quickly, but trusting the AI agent led to a slowdown: I and others had to go through the review carefully, then the developer had to go back and fix things, then there was a second round of reviews. This doubled the amount of time the ticket took to complete.
I’ve seen and heard other examples as well.
Tests are a way that we prove that we’ve created quality software. If those tests are made poorly, we can’t have confidence in the quality feedback they give.
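To make that concrete, here is a minimal Python sketch of the kinds of tests I mean. Everything in it is hypothetical and made up for illustration: `apply_discount` is not the code from that review, and its bug is deliberate. It shows one test that passes without the code working, one that enshrines the bug as expected behavior, and one derived from the requirement.

```python
def apply_discount(price: float, percent: float) -> float:
    """Hypothetical function under test, with a deliberate bug:
    it subtracts the percentage as a flat amount instead of a fraction."""
    return price - percent  # bug: should be price * (1 - percent / 100)


def weak_test() -> bool:
    # Only checks the result's type, so it passes whether or not the math is right.
    return isinstance(apply_discount(200.0, 10.0), float)


def enshrining_test() -> bool:
    # Expected value copied from the buggy output, so the bug becomes "expected".
    return apply_discount(200.0, 10.0) == 190.0


def requirement_test() -> bool:
    # Expected value derived from the requirement: 10% off 200 is 180.
    return apply_discount(200.0, 10.0) == 180.0
```

Against the buggy function, the first two checks pass and only the third fails. A suite made of the first two kinds would be green while the feature is broken, which is exactly the feedback you can't trust.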
So what do we do about this?
I don’t think it’s feasible to not use AI agents, and honestly, there have been some very strong indicators of value from them!
For example, Matteo Vaccari gave the agent a structure to build tests from, leading to cleaner and more consistent tests being generated.
At SEP, Steven Mills had success with the idea of building up a structure for our testing (including examples), then having the agent use that structure to build more tests. Crucially, the way he crafted that infrastructure meant that code reviews were simpler for the humans doing the reviewing. Some files were expected to be updated or created, but others were not. So if the agent made changes in that latter group, a reviewer knew to give those changes even more scrutiny.
The key difference is that, instead of sending off an agent to build a grand testing structure from whole cloth, Steven sat down with other developers and had a plan for what he wanted as well as how to make sure that the review process would be safer.
Yes, the process was slower to start, but reviews are faster and reviewers are more confident in the results, which means that, over time, this structure gains ground over generating lots of code quickly and dealing with the consequences later.
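The expected-versus-unexpected file split can also be checked mechanically. What follows is only a sketch of the idea in Python, not Steven's actual setup: the glob patterns and the `split_by_expectation` helper are assumptions I invented for illustration, and the list of changed paths would come from whatever your version control reports for the change.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns: files the agent is expected to create or update.
EXPECTED_PATTERNS = ["tests/cases/*.py", "tests/fixtures/*.json"]


def split_by_expectation(changed_paths, expected_patterns=EXPECTED_PATTERNS):
    """Split changed paths into (expected, needs_scrutiny) lists."""
    expected, needs_scrutiny = [], []
    for path in changed_paths:
        # Note: fnmatch-style "*" also matches "/" characters, so these
        # patterns cover nested directories as well.
        if any(fnmatchcase(path, pattern) for pattern in expected_patterns):
            expected.append(path)
        else:
            needs_scrutiny.append(path)
    return expected, needs_scrutiny


changed = ["tests/cases/test_login.py", "tests/helpers/builder.py", "src/app.py"]
expected, flagged = split_by_expectation(changed)
print("Expected changes:", expected)
print("Give extra scrutiny:", flagged)
```

Here only `tests/cases/test_login.py` lands in the expected group. The other two paths are flagged, which tells the reviewer where to slow down.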
My takeaway from this is that you need to set up your AI Agent for success so that it can’t help but write the right code.
When you plan the work through your prompts, you have to include not just the why of the project and the how of the application code, but also the how of verification.