Measuring productivity in an AI world

Tabnine Team / 3 minutes / April 17, 2022

“This is really cool! It’s going to help me tremendously.” These are words I often hear from folks using Tabnine for the first time. They were the words I said the first time I tried it, too. Still, in my heart I’m an engineer: a mathematician by training and a metallurgist, aerodynamicist, and structural engineer by hobby. So when a product or tool claims to help by XX%, to be “the best,” to wash your dishes and make your teeth whiter, I’m always the one looking for the asterisk after the claim.

And so it goes for software tools as well. There is no doubt that there are tools that help coders work better, faster, and more accurately. Some of these are full-fledged tools, and some are simply tweaks to IDE color schemes and working environments that contribute to better code. But quantitatively measuring productivity in the world of software engineering is actually quite difficult. It’s something of a mythical beast in the industry, and substantial investment and a fair number of companies have grown up around software productivity metrics. Each of these companies’ tools measures productivity differently, with various levels of statistical measurement and ML built into the software.

At the heart of these products is the desire to measure a developer’s time spent in adherence to best practices and the code generated to address product issues prioritized by the team. To a certain extent the industry has some basic metrics that are standardizing the way team performance is measured, though these can still fall short when it comes to an individual developer’s contribution. And looking at software development purely in terms of inputs and outputs gives an overly simplified view of productivity: hours worked in front of the IDE and lines of code produced out the back guarantee nothing when it comes to good software. Over-reliance on such metrics risks falling foul of Goodhart’s law: “When a measure becomes a target, it ceases to be a good measure.”
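To make that concrete, here is a deliberately naive sketch of such an input/output metric in Python. Everything in it, from the field names to the numbers, is invented for illustration; it simply shows how a lines-per-hour figure can crown the “wrong” engineer.

```python
# A deliberately naive "productivity" metric: it rewards exactly the
# inputs (hours) and outputs (lines) the paragraph above warns about.
from dataclasses import dataclass

@dataclass
class DevWeek:
    hours_in_ide: float   # input: time spent in the editor
    lines_of_code: int    # output: lines produced
    pr_issues: int        # quality signal the naive metric ignores

def naive_productivity(week: DevWeek) -> float:
    # Lines per hour: easy to compute, and easy to game (Goodhart's law).
    return week.lines_of_code / week.hours_in_ide

careful = DevWeek(hours_in_ide=20, lines_of_code=400, pr_issues=1)
verbose = DevWeek(hours_in_ide=20, lines_of_code=2000, pr_issues=14)

print(naive_productivity(careful))  # 20.0
print(naive_productivity(verbose))  # 100.0 -- "wins", despite 14 PR issues
```

The moment a number like this becomes a target, engineers can optimize it by writing more lines, not better software.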

When I worked at Google, one of the early projects I enjoyed was working with the Air Force Agency for Modeling and Simulation (AFAMS). A goal of the project was to help modernize pilot training infrastructure: make it more efficient, more flexibly deployable, and faster at integrating modern simulation technology. During the project I was able to talk with fighter pilots about what information they wanted, and were capable of digesting, in high-stress environments. Graphics presented to the pilot had to balance information density against intuition and reaction speed. The cognitive function of the human brain in various states of concentration, and its ability to process information and react, is something that has also been studied in the heads-up display (HUD) arena.

However, no two pilots are the same. One fighter pilot wants different information from another, and one software engineer wants different tools than the next. There is a trend in AI pair programming to generate and suggest increasingly large pieces of code, up to and including DeepMind’s AlphaCode, which seeks to generate complete programs. Is this direction helpful to software engineers? Consider the other extreme: suggesting a single word to an engineer writing code is probably equally unhelpful. With large chunks, the engineer has to stop and “debug” the code before accepting it; with small suggestions, the engineer is likely annoyed, since they already knew what they were going to type next, and a stream of tiny suggestions is simply noise. So there is a balance to be found with AI code tools: suggestions meaningful enough to help complete code faster, while being neither too long nor too short. And that balance will look different for each individual. Our CTO Eran Yahav talked about these nuances in a recent podcast on the Tabnine blog: AI-assisted development is here to stay.
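One way to picture that balance is as a toy cost/benefit model. To be clear, this is purely illustrative: the weights and the shape of the curve are assumptions, not measurements from Tabnine or any other assistant.

```python
# Toy model: net value of a code suggestion as a function of its length.
# The benefit is typing saved; the costs are reviewing ("debugging") long
# snippets and the interruption noise of trivially short ones.

def suggestion_value(tokens: int,
                     review_cost: float = 0.08,
                     noise_penalty: float = 2.0) -> float:
    typing_saved = tokens                          # keystrokes avoided
    review = review_cost * tokens ** 1.5           # grows faster than the benefit
    interruption = noise_penalty / max(tokens, 1)  # worst for tiny suggestions
    return typing_saved - review - interruption

# The value is negative at both extremes and peaks somewhere in the middle;
# where that peak sits would differ per engineer.
for n in (1, 5, 20, 80, 200):
    print(n, round(suggestion_value(n), 1))
```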

So let’s tie fighter pilot cognition together with measuring software performance. If a tool like Tabnine is to be used in a large company across many software engineering teams, its ROI does need to be evaluated. In essence, any CTO is going to ask, “How is Tabnine going to make my engineers more productive?” Giving engineers access to a company’s code best practices, as well as shortening code review times, would certainly lift productivity metrics by helping generate code with fewer PR issues in the same or less time. And Tabnine can do this in different ways for different engineers: some junior engineers might appreciate longer code snippets, while others, like Tabnine’s own Amir Bilu, would prefer shorter suggestions that don’t break their cognitive flow.
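As a sketch of what answering that ROI question could look like in practice, here is a simple baseline-versus-trial comparison. All metric names and numbers are hypothetical; the point is the method, not the figures.

```python
# Hypothetical baseline vs. trial comparison for one team piloting an
# AI assistant: capture the same metrics for the same team before and
# after the trial, then compare.

baseline = {"prs_merged": 42, "review_rounds_per_pr": 2.8, "hours_to_merge": 31.0}
with_ai  = {"prs_merged": 47, "review_rounds_per_pr": 2.1, "hours_to_merge": 24.5}

for metric, before in baseline.items():
    after = with_ai[metric]
    change = (after - before) / before * 100
    print(f"{metric}: {before} -> {after} ({change:+.0f}%)")
```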

Productivity is a complex question in software engineering. AI tools like Tabnine can certainly improve productivity, but how that happens is both highly individual and complex. Keeping tabs on current development metrics, as sketched above, and then giving Tabnine a try across the same teams would be a great way to understand the best way to deploy these new tools. One thing is for certain: AI pair programming tools are here to stay, and if used correctly they might even make your teeth whiter too!