Codeball AI Code Review, a deep dive into the numbers

Codeball is an AI trained to perform code review, and the first available version, CODEBALL-1, is able to accurately identify PRs that will be approved.

What happens when you add Codeball to your project, is it safe, and can it even save you money? Let's dig into the numbers from the default Codeball demo from the index page.

Supabase is a well-known open-source project mainly built and maintained by employees of the company with the same name. We've used Supabase as the default example for the Codeball demo site, because it is an open-source project that's largely run like any project at closed-source companies (business === money) (and because the results are pretty good).

The Codeball demo runs Codeball on the latest 50 merged PRs (in this case, created between 2022-06-02 and 2022-06-13), and compares the outcome with what Codeball would have done if it had been run on the PRs originally.

For the latest 50 Pull Requests in supabase/supabase:

  • 32 were approved without any further feedback or modifications; the only comment on these PRs is the comment to approve the PR.
  • 4 PRs were modified by the author after opening the PR, triggered by failing tests etc, and then instantly approved (in the demo, these are labeled as "Gave feedback").
  • 14 PRs received feedback or "requested changes" and were updated at least once before getting merged.

[Chart: what actually happened to the 50 PRs. Approved (72%) vs. Got feedback]

Codeball correctly predicted the instant-YOLO-merge for 14 out of the 36 Pull Requests, and did not approve a single PR that was rejected or further iterated on.

[Chart: what Codeball would have done with the 50 PRs. Approved (28%) vs. Not Approved]

Codeball was unsure about 18 PRs that ended up getting merged directly, and did not approve them.

The Codeball result can be expressed with a confusion matrix like so:

                Predicted ✅    Predicted ❌
Reality ✅           14              22
Reality ❌            0              14

This result gives Codeball a recall of roughly 39% (14 out of 36) and a precision of 100%!

Aka, it didn't approve everything that it could have approved, but when it did approve something, it was never wrong.
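
If you want to check the math yourself, here is a minimal sketch that recomputes both numbers from the confusion matrix above. The variable and function names are made up for this post and have nothing to do with how Codeball works internally.

```python
# Confusion-matrix counts, copied from the table above.
true_positives = 14   # Codeball approved, and the PR was approved as-is
false_negatives = 22  # Codeball held back, but the PR was approved as-is
false_positives = 0   # Codeball approved, but the PR needed changes
true_negatives = 14   # Codeball held back, and the PR needed changes


def precision(tp: int, fp: int) -> float:
    """Share of Codeball's approvals that turned out to be correct."""
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of the approvable PRs that Codeball actually approved."""
    return tp / (tp + fn)


print(f"precision: {precision(true_positives, false_positives):.1%}")
print(f"recall:    {recall(true_positives, false_negatives):.1%}")
```

Precision only looks at the PRs Codeball approved (the first column), while recall only looks at the PRs that were approved in reality (the first row). That's why a cautious bot can have perfect precision and a modest recall at the same time.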

This is a very good thing. Codeball is able to flag safe PRs with extremely high precision, and when it's unsure about what to do, it leaves the remaining work to the humans (just like before).

"So Codeball is like a strict bartender who only serves you when they are absolutely sure you're old enough. You may still be overage but Codeball's not serving you."
- apugoneappu on Hacker News

What can we tell about Codeball's performance?

Codeball calculates many different characteristics ("features") for a PR. This includes things like the diff itself, but also the size and shape of the diffs in the PR, the author's previous contributions to the modified files, the author's history with the repository and nearby directories, as well as the health of the files and the repository itself (+ a few other things, but what they are is apparently a business secret).

Almost all Pull Requests ever created end up getting merged sooner or later, but some require one or many rounds of feedback and discussion before they're ready.

In this blog post, "merged" and "approved" mean that the first version of the Pull Request was approved and merged, and that the only feedback it got was some form of "LGTM 🚀🌕".

Honestly it's a bit of a black box, and I'm just here to write a blog post. So this will be a bit of guessing, and me going through all 50 PRs manually to try to characterize the PRs and see if I can find any patterns (this was a very intimate exercise, it feels like I've gotten to know all core contributors).

  • 11 PRs changed files that were entirely or almost entirely created by the PR author. These are often smaller improvements or fixes, and generally safer than the average Pull Request, as the author is already familiar with the code. 9 out of 11 PRs (82%) in this category were approved without further feedback.
  • The average number of files modified per PR is 4.6. The PRs that were approved on the first iteration had an average of 3.79 files per PR.
  • 26 PRs changed only a single file, 24 of which were approved, including 6 PRs from external and first-time contributors.
  • 19 PRs changed more than 1 file. 12 of these were YOLO-approved by the team (being YOLO is a great thing).

There is a clear trend that "smaller is better", which makes sense: a smaller PR has fewer lines of code that could potentially screw something up. And a bugfix is probably smaller (measured in LOC) than a feature addition.
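
To put numbers on that trend, here is a small sketch that turns the counts from the list above into approval rates per category. The category labels and the helper function are my own naming, not anything from Codeball.

```python
# (approved on the first iteration, total PRs) per category,
# copied from the list above.
categories = {
    "self-authored files": (9, 11),
    "single file changed": (24, 26),
    "more than one file changed": (12, 19),
}


def approval_rate(approved: int, total: int) -> float:
    """Fraction of PRs in a category that were approved without changes."""
    return approved / total


rates = {name: approval_rate(a, t) for name, (a, t) in categories.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.0%}")
```

Single-file PRs come out at about 92% approved on the first try, against about 63% for PRs that touch more than one file.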

Codeball seems to agree, and picks up on this trend as well. Codeball has a higher-than-average recall for Pull Requests in the self-authored category (45%) and single-file-changed category (81%).

What's cool is that while there is a strong indicator that "smaller is better", not all small PRs were approved, and Codeball was able to identify the PRs that should not be approved as-is. For this demo, it did not make a single mistake.

For the latest 50 PRs to supabase/supabase the precision is 100%. Internally however, with a much larger test set, there are of course some false positives, and the precision is around 99%.

Codeball has a higher precision (~99%) than the efficacy of condoms (98%). And just like with condoms, there is a "Plan B" readily available [1] (identify the oopsie-doopsie and revert).

Can Codeball save money for Supabase?

In this world, time is money, and money is time. And it's popular to track metrics like "change lead time" etc etc.

Out of the 50 latest PRs in the repository:

  • 16 (32%) PRs were approved and merged within 1 hour
  • 22 (44%) PRs had to wait between 1 and 24 hours to receive their first review
  • 12 (24%) PRs waited for more than 24 hours to receive their first review

The slowest PR was open for 5 days before getting the first review (which Codeball correctly approved btw 😉).

The time-to-merge for the Pull Requests that Codeball would have approved was spread out in buckets of roughly the same size (25% / 50% / 25%).

It's hard to measure what the associated cost is to merge a PR after 24 hours instead of 15 minutes, but it's safe to say that it's above 0.

What's easier to argue for is that, for each of these PRs, someone had to interact with it at least three times: the author to create it, the reviewer to review it, and then the author again to go back and merge it.

Each of these interactions takes a small amount of time, maybe just a minute or two, but they all add up, and then we haven't even included the cost of context switching.

With an epic AI such as Codeball, it's possible to get the number of interactions required to merge a PR down to just one, and to reduce the "time to merge" to just a few seconds.
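
As a back-of-the-envelope sketch (not a measurement), here is what that could add up to for the 14 PRs Codeball would have approved in this demo. It assumes each interaction takes the "minute or two" mentioned above, and it ignores context switching entirely, so treat the result as a floor. All the names are made up for this example.

```python
# Inputs taken from this post; the constant names are illustrative only.
PRS_CODEBALL_APPROVED = 14        # PRs Codeball would have approved in the demo
INTERACTIONS_TODAY = 3            # create, review, come back to merge
INTERACTIONS_WITH_CODEBALL = 1    # create only
MINUTES_PER_INTERACTION = (1, 2)  # assumption: "a minute or two" each

# Interactions that no longer need a human, across the approved PRs.
saved_interactions = PRS_CODEBALL_APPROVED * (
    INTERACTIONS_TODAY - INTERACTIONS_WITH_CODEBALL
)

# Low and high estimate of the minutes saved.
low_minutes, high_minutes = (
    saved_interactions * minutes for minutes in MINUTES_PER_INTERACTION
)

print(f"interactions saved: {saved_interactions}")
print(f"minutes saved: {low_minutes} to {high_minutes} per 50 PRs")
```

That is 28 fewer interactions for every 50 PRs, before counting the time those PRs spent waiting in the review queue.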

So yes, there is money to be saved here.


With 🧠 from the Codeballers,
17th of June 2022


  • Codeball is not affiliated with Supabase, we're just fans. Or is this an elaborate attempt to try to get Supabase to install Codeball?
  • In some calculations above, some PRs from dependabot have been excluded, as those generally are not that interesting. This is why the number of PRs does not always add up to 50.
  • [1] Depending on whether you live in the modern world or not.