"You don't care about quality" A story of single metric bias.

This was not a high point in my career. It's a story of single metric bias: how I let one measure become a 'source of truth', failed to manage up, and ended up yelling at one of the most respected engineers in my team.

10 minute read

The stage

As an organisation we were in the process of 'adopting agile' and heading roughly in the right direction. We created cross-functional ‘2 pizza teams’ focused on delivering value to customers and aligned with business strategy. Engineering teams were implementing team practices like test-driven development, pairing, mobbing, working out loud, taking ownership of deployments, and monitoring funnel throughput and performance. They spent time understanding business value by working closely with stakeholders, keeping changes small, frequent and rapid, with regular feedback loops. We spent time bringing together UX and frontend, and implemented style guides and design systems that improved consistency and performance. Outside engineering we created communities of best practice that worked together across teams. They were actively investigating and reducing complexity, and sharing openly both the positive and negative effects of the changes they were making. Our infrastructure teams were engaged too. They’d brought together new ideas from around the evolving cloud market to enable early automated deployment systems, taking on feedback from engineers on environments and pipelines, and enabling the teams to deliver quicker and understand costs.

So why on earth was I standing next to an engineering leader who absolutely cared about quality, accusing him and his team, loudly and in public, of not caring?!

We had settled on an approach to monitor the quality of our code by using test coverage, a number showing the percentage of code exercised by unit tests. In theory, from a management view, this felt like a good number. We knew the teams had time and space to move towards a test-driven model and we could measure the positive impact of that. The number of rollbacks had dropped, we’d had fewer issues from customers, code was being deployed more frequently in smaller chunks and the feedback from the team (and the business) was positive. It felt quite straightforward to ask each team for the test coverage figures each week and report them on a simple graph. The hypothesis was that the graph would go up, and we all love showing the board an upward-pointing graph.

After a few months of chasing this down it started to get a little fractious. I had to chase more and more, asking around the team to get an answer and spotting glances between engineers when I arrived. On one occasion I approached one of the engineers (a senior consultant) and received open disdain. He produced the number, which had not changed for a number of weeks. I queried it and watched as he returned to his desk with a sigh, typed a few things and ran a script that displayed a slightly higher number, then presented it back: “make you happy?”. I probably didn't need this interaction to tell me that the team was not really aligned with my theory, but the almost brazen nature of it left me suspicious that they weren't taking it seriously at all. It wasn’t really personal, but I was right, they weren't.

With frustration building I returned later in the day and confronted the lead engineer. I was wrong. I knew they cared about quality, and I let my frustration bubble over because what they really didn’t care about was my upward-pointing graph. I was going to have to face another 1:1 where the number wasn’t accurate, and I felt like a failure.

My mistakes

It’s not really rocket science: I was failing to manage up. Firstly, I’d agreed to the idea of using test coverage as a single measurement of quality. The board wanted a simple way to demonstrate a return on the investment and I’d acquiesced when it was tabled. My second mistake was not correcting this when I knew it wasn't working, even though the team had talked to me about it. My final and worst mistake was letting my frustration, my fear of the consequences of underperforming and my failure to manage up boil over into a rage that I threw at the team.

Single metric bias

Single metric bias can really hurt. There are plenty of examples, some of them catastrophic: organisations fail and people lose their lives when it becomes endemic. In most software projects the potential fallout of ‘Single Metric Bias’ is not catastrophic, but here are a few well known examples where it was.

Both Kodak and Blockbuster could arguably attribute their downfall to single metric bias. Kodak was once a dominant player in the film photography industry, but its focus on maximising short-term profits by selling more film led to a failure to recognise and invest in the emerging digital photography market. Blockbuster suffered a similar fate. Its reluctance to let go of the revenue generated by late fees caused it to hold onto its analogue video hire business while the world around it went digital.

It gets worse. In 2010 the offshore oil platform Deepwater Horizon exploded, causing the world's worst oil spill. There was a long judicial process to establish the cause, but ultimately it came down to a drive to minimise cost which led to a catalogue of failures. On 9 November 2010, a report by the Oil Spill Commission said that there had been "a rush to completion" on the well and criticised poor management decisions. "There was not a culture of safety on that rig," the co-chair said. Quality suffered in favour of a focus on delivery speed, and over time that contributed to the culture which led to catastrophic failure.

Closer to home, we all hate it when we’re judged only by the results of exams. Good or bad, a person is more than the sum of an exam. Thankfully most modern recruitment has moved on from this.

Test coverage is an important measure, but it’s only ONE metric of quality and it’s arguably not the most important one. It can be gamed (one test to cover it all!) and the number says nothing about the quality of the tests! Google's DORA report from 2022 highlights:

high-trust, low-blame cultures — as defined by Westrum — focused on performance were significantly more likely to adopt emerging security practices

The calculation of performance behind this finding on security practices used 13 separate measures from two separate frameworks.

Back at my coal face, it didn't take long for the teams I was managing to realise I wasn’t really asking about quality; I was chasing the upward-trending graph. We had built the right culture. THEY cared about quality and had lots of proof the product was improving. They did care about coverage, but they also cared about other measures of software quality (readability, performance, maintainability, reliability etc.), as well as customer experience, product performance and security. I wasn't asking for any of these figures. I was chasing the graph, and it was this that led to their disdain.

A better response

Firstly, I should never have fallen into the ‘single metric bias’ trap. It's an easy management mistake to make, especially when under pressure or coping with non-technical leadership looking to simplify results. Here are some ideas on what I’d be counselling myself with now. When thinking of quality I believe there are two key concerns: customers and risk.

Customers

Technology exists to deliver value to help achieve a ‘goal’. That value can manifest itself in a number of ways, but there is usually a service or product that is generating it. Customers and their behaviours are a great indicator of success and also a great place to measure quality. If they are frustrated by poor performance while using your product, you can usually find out through a range of metrics. NPS, customer support queries, bug reports, product tracking data, reviews and even AARRR metrics can help identify bottlenecks caused by product engineering. Customers will tell you when the quality is poor, and you can measure the change through these same metrics when it improves.

Risk & Blast Radius

Every organisation has an appetite for risk. Broadly this is a feeling or sentiment that usually comes from senior leaders, but it manifests itself all over the organisation and dictates how comfortable people are with failure. Appetite for risk is a significant factor in an organisation's culture, processes and compliance regime. It can vary across teams and can also change overnight, and getting it wrong can cripple an organisation. One of the ways to help establish an organisation's appetite for risk is to consider blast radius. This analogy works on the simple idea that when something goes wrong we can imagine a bomb going off. At the centre is the initial damage: the immediate issues that caused the problem. Around it is a radiating set of things that are affected, and the further something is from the blast centre, the less affected it usually is.

When talking about the size of a problem's impact and what else might be affected, we often start to get a feel for where a team is with risk. It's worth pointing out that well engineered solutions will be designed to keep 'fallout' away from sensitive or critical business areas. Brittle platforms that allow single points of failure often lead to a low appetite for risk. It's a great debate to have with your team!

It’s generally assumed that small startups with low numbers of customers have a high appetite for risk. ‘Move fast and break things’ was a motto for Facebook, especially in the early days, and Google too has a tradition of ‘fast failure’ in public. They are not small anymore, but their success using that model set a generational trend. It's a great environment to work in, and is in part designed to remove the fear of risk and create high trust, helping people try out new ideas and learn rapidly from their mistakes. We might all be understandably more concerned if development teams working on flight control systems or missile technology took the same approach. These teams spend a lot of time working out how to move fast without ‘breaking’ things, and we’re glad they do!

‘Engineering for risk’ is a great way to help a business stay agile. Poorly engineered software increases the blast radius, while well engineered software can build in firebreaks that stop the impact zone of an issue from radiating.
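As a rough illustration of what a firebreak can look like in code, here is a minimal sketch of a circuit-breaker style wrapper in Python. It is an assumption-laden toy rather than a production pattern: the `Firebreak` class, its `max_failures` parameter and the recommendations scenario are all invented for this example, and a real breaker would also need some way to reset and retry after a cooling-off period.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class Firebreak:
    """Hypothetical wrapper that contains failures of a non-critical dependency."""

    def __init__(self, max_failures: int = 3) -> None:
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    def call(self, action: Callable[[], T], fallback: T) -> T:
        # Once the limit is reached, stop calling the dependency at all.
        if self.failures >= self.max_failures:
            return fallback
        try:
            result = action()
        except Exception:
            # Contain the failure: count it and hand back the safe default.
            self.failures += 1
            return fallback
        self.failures = 0  # a success resets the count
        return result


def fetch_recommendations() -> list[str]:
    # Stand-in for a flaky, non-critical service (invented for this sketch).
    raise RuntimeError("recommendations service is down")


breaker = Firebreak(max_failures=2)
# The page still renders, just without recommendations.
recommendations = breaker.call(fetch_recommendations, fallback=[])
```

The design choice is the interesting part: the failure of a nice-to-have feature stays inside its own small blast radius instead of taking down the page, or the checkout, that depends on it.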

Appetite for risk can be a guide to the type of quality you need to consider. For example, in most financial trading platforms speed is essential: transactions happen in milliseconds, prices fluctuate rapidly and there is a low appetite for risk. In this case performance and load testing are critical, and application design can also help to increase the reliability of time-based transactions.

Conclusion

I learnt the hard way, and in the intervening years I've stayed true to my promise never to let down a team I manage like that again.

If you are responsible for measuring the quality of the code in your teams, it's important to look at code coverage alongside all of the other software development metrics, but it’s also helpful to consider speed of delivery, performance and appetite for risk. ‘Fail fast’ and 'fast rollback' sometimes work better than lots of exhaustive focus on quality upfront. I’ve encountered opinions in the wider engineering community that treat code quality as a fixed target. It's not. There are best practices for many facets of writing code, but you need to assess which of these are important to your team, your organisation and your customers before you agree to ‘show’ evidence that it’s improving.

As Kent Beck puts it "Being proud of 100% test coverage is like being proud of reading every word in the newspaper. Some are more important than others."

https://twitter.com/kentbeck/status/812703192437981184?lang=en

https://en.wikipedia.org/wiki/Deepwater_Horizon_oil_spill