Key Performance Indicators (KPIs) are a useful way to measure how well your organization performs. DevOps accelerates the automation of (almost) everything within the organization, so you need meaningful KPIs to measure the success of your DevOps initiatives. Common KPIs in the DevOps world are deployment frequency, lead time of new features, mean time to recovery, etc. Powerful tools help to implement those. If the metrics for the common KPIs are satisfactory, you have a solid base to further improve, accelerate and optimize. But what if the results are a bit on the negative side? Dealing with DevOps metrics and KPIs can be hard. In this article, I present considerations for when this is the case. They act as a starting point to help you change and improve your organizational KPIs.
Well-crafted KPIs focus on the power and strengths of DevOps. They share characteristics of modern software development processes and do not focus on old habits and “old business processes” like ITIL or Waterfall. A quick investigation reveals that popular KPIs aim to measure lead time and development activities, learning and recovering from failures, and dealing with defects and complaints.
Number of tickets
Let’s start with the end-user in mind. End-users, whether internal or external, complain when your system does not work as expected. If the number of tickets in your ticket system goes up, you have a clear indication that something is wrong. The key is to find patterns in those tickets and understand how they correlate.
Suppose you receive a bunch of complaints about a report which does not show the information that the customer really needs. And at the same time, you face a lot of tickets that mention the slowness of a specific section of the system. You have to estimate the perceived loss of business opportunity for them. To do this, you need to step into the shoes of your customers. This might sound easy, but it’s not. Try to answer questions like:
- What is the impact of the incorrect report for the customer? What is it that they can’t do? Are they stopped in a critical phase of making an important decision? Perhaps, they can’t place an order for the products you sell online, a significant impediment that requires attention.
- What is the consequence of the slowness of the system? Does it just hamper customers from easily navigating around or are they completely blocked? If it’s the latter, you are more likely to miss out on valuable opportunities.
But which of those problems gets the highest priority? One option is to start with the fix that takes the least effort to roll out. Once it is rolled out, measure the impact over a period that is comparable to the period you measured before the fix.
To measure the effect of the system’s speed-up, you might use the blue/green deployment pattern, keeping both versions live, to split your customers into two groups. Redirect 50% of them to the improved version and the other 50% to the old version. If you measure a clear decline in tickets from customers who were served by the new version, you have an indication that the fix works.
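The comparison between the two groups can be sketched in a few lines of Python. This is only an illustration: the `Cohort` type, the function names, and the numbers are made up, and a real analysis should also check whether the difference is statistically meaningful.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    name: str
    customers: int  # customers routed to this version
    tickets: int    # tickets they raised during the measurement window


def tickets_per_100(cohort: Cohort) -> float:
    """Tickets per 100 customers, so cohorts of different sizes are comparable."""
    if cohort.customers == 0:
        raise ValueError(f"cohort {cohort.name!r} has no customers")
    return 100.0 * cohort.tickets / cohort.customers


def relative_change(old: Cohort, new: Cohort) -> float:
    """Relative change in ticket rate from old to new (negative means fewer tickets)."""
    old_rate = tickets_per_100(old)
    if old_rate == 0:
        raise ValueError("old cohort has no tickets; relative change is undefined")
    return (tickets_per_100(new) - old_rate) / old_rate


# Made-up numbers for illustration only.
old_version = Cohort("old", customers=500, tickets=40)
new_version = Cohort("new", customers=500, tickets=25)
print(f"ticket rate change: {relative_change(old_version, new_version):+.1%}")
```

Normalizing to tickets per 100 customers keeps the comparison fair when the split is not exactly 50/50.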
Deployment frequency
Increased deployment frequency is one of the key pillars of DevOps. Automation, speed of development, and deployment all help to deploy more frequently. The faster a new feature is deployed, the faster your customers can use it and the more competitive you are as an organization. Ideal values for deployment frequency are steady and consistent. If the numbers are bumpy and differ greatly per team or department, this is an indication that something might be wrong.
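One way to make “steady” versus “bumpy” measurable is to count deployments per week and look at how much those counts vary. The sketch below is an assumption on my side, not a standard: it uses the coefficient of variation as the bumpiness score, and the team names and dates are invented.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def weekly_counts(deploy_dates: list[date]) -> list[int]:
    """Deployments per 7-day bucket, counted from the first deployment.

    Buckets without deployments are kept as zeros so gaps stay visible.
    """
    if not deploy_dates:
        return []
    start = min(deploy_dates)
    per_week = Counter((d - start).days // 7 for d in deploy_dates)
    return [per_week.get(week, 0) for week in range(max(per_week) + 1)]


def variation(counts: list[int]) -> float:
    """Coefficient of variation (standard deviation / mean); higher is bumpier."""
    if not counts:
        return 0.0
    avg = mean(counts)
    return pstdev(counts) / avg if avg else 0.0


# Invented deployment dates for two hypothetical teams.
team_steady = [date(2024, 1, d) for d in (1, 3, 8, 10, 15, 17)]
team_bumpy = [date(2024, 1, d) for d in (1, 22, 23, 24, 25, 26)]

for name, dates in (("steady", team_steady), ("bumpy", team_bumpy)):
    counts = weekly_counts(dates)
    print(name, counts, round(variation(counts), 2))
```

Both teams deploy six times in the same month, but only the per-week view shows that one of them does almost everything in a single week.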
When troubleshooting this KPI, think of the following:
- Do all of the teams have the proper knowledge of what they are trying to achieve?
- Be on the look-out for brittle and/or unreliable infrastructure.
- Features (user stories) are too large and too complex.
- Pipelines that are not set up according to industry and/or organizational standards (everyone invents the wheel again).
- The thresholds for quality gates are unrealistic. It’s okay to break a pipeline for critical issues, but give teams the time and opportunity to improve their security skills. Therefore, don’t break their pipelines for medium and low-risk issues; that only causes frustration on all ends. Instead, monitor those risks and take appropriate action.
- Be aware of too many deployments in a short period of time: tests might take longer than the interval between pushes, so builds queue up and this hampers team members working on different features. Once this happens, consider picking up fewer user stories at the same time. This also improves collaboration and spreads the (domain) knowledge across the team. Domain knowledge adds context to applications and business processes.
As you can see, most of these are organizational aspects that need to be addressed by different departments. It’s a solid reason to share KPIs plus the outcome of the metrics (facts and figures) across the entire organization. Transparency wins so everyone can help to improve.
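The quality-gate advice from the list above (break the pipeline for critical issues only, monitor the rest) could look roughly like this. The severity labels and the `evaluate_gate` function are hypothetical; real scanners use their own severity scales and report formats.

```python
from collections import Counter

# Assumption: only findings with these severities break the pipeline.
BLOCKING_SEVERITIES = {"critical"}


def evaluate_gate(severities: list[str]) -> tuple[bool, Counter]:
    """Return (passed, counts of non-blocking findings to monitor)."""
    normalized = [s.lower() for s in severities]
    passed = not any(s in BLOCKING_SEVERITIES for s in normalized)
    monitored = Counter(s for s in normalized if s not in BLOCKING_SEVERITIES)
    return passed, monitored


# Hypothetical scan result: no critical findings, so the pipeline continues.
passed, monitored = evaluate_gate(["medium", "low", "low"])
print("gate passed:", passed, "| to monitor:", dict(monitored))
```

The non-blocking counts can be sent to a dashboard so the trend stays visible without stopping the team.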
Lead time
Another popular KPI is lead time: the average time between the initial idea and the actual rollout of a feature. From a DevOps perspective, this starts when an issue is put on the backlog and continues as it moves through the sprint(s), is deployed to production, and is showcased in the sprint review. It’s important to have short lead times so you can seize (market) opportunities before they are no longer relevant.
Especially in large organizations, there is a lot more to take into account. Basically, the DevOps team “just implements” an idea that originated higher up in the “organizational stack”. Don’t forget that great ideas also come from the bottom. However, by the time an idea trickles down to the development teams, a lot of discussion may already have taken place before it ended up on a backlog.
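As a minimal sketch, lead time can be computed from two timestamps per feature. I assume here that the clock starts when the issue lands on the backlog and stops at the production deployment; your organization may agree on different start and end events. The dates are invented.

```python
from datetime import datetime
from statistics import mean, median


def lead_time_days(created: datetime, deployed: datetime) -> float:
    """Days between backlog creation and production deployment."""
    if deployed < created:
        raise ValueError("deployment cannot precede creation")
    return (deployed - created).total_seconds() / 86400


# Invented (created, deployed) pairs for three features.
features = [
    (datetime(2024, 1, 2), datetime(2024, 1, 12)),
    (datetime(2024, 1, 5), datetime(2024, 1, 19)),
    (datetime(2024, 1, 8), datetime(2024, 2, 19)),
]

lead_times = [lead_time_days(created, deployed) for created, deployed in features]
print("mean:", mean(lead_times), "days | median:", median(lead_times), "days")
```

Reporting the median next to the average helps, because a single slow feature can pull the average up considerably.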
Derived from this topic, consider the following aspects:
- A business manager who has a great idea does not always know that the same idea might already exist somewhere else in the organization.
- The majority of business representatives lack the detailed cloud knowledge to validate their idea early on in the process. This might not be a problem, but if too many discussions revolve around a sub-optimal cloud solution, a lot of time and energy is wasted.
- Sometimes, budgets need to be allocated very early on in the process. What if the idea is not as good as it sounds – budget is allocated and other initiatives need to be skipped or postponed. The central question: who to involve to validate these kinds of ideas?
- You need to define a proper baseline when you measure the lead time of business features. If you do not have this, metrics are not reliable. First of all, you need to have an agreement on the baseline itself.
- How to tackle problems with regard to long lead times in an efficient way? Collect and aggregate similar problems to pinpoint the problem. It might not be the DevOps team itself that is the bottleneck. Think of other departments as well, like the pipeline team or First Line Risk.
Change failure rate
Eager developers write unit tests before the actual code of a feature. Fail fast: fail locally before you push your source code to Git. A push in turn triggers a CI/CD pipeline, and you want to avoid breaking that pipeline on every single change. You need to deploy fast, but at the same time you want a low change failure rate. The percentage of failed builds compared to the total number of builds should be as low as possible. Trial and error need to be balanced here.
Rather inexperienced teams that forget this paradigm tend to commit before they properly validate their source code. Syntax errors should never be checked in; even an IDE can reveal those. Rule number one for infrastructure-related source code: validate your infrastructure components, like your Terraform modules, your CloudFormation templates, and your Helm charts, before you commit them.
As said, test-driven development helps to keep the change failure rate low. In addition, keep your commits small so you can pinpoint a problem quickly and efficiently. This also helps other team members to focus on what has actually changed and to stay aligned. It’s bad if they are overwhelmed by a big bunch of changes that they did not account for.
Don’t forget to review every merge to the master branch, since you want to keep your “ultimate source of truth” from breaking.
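Expressed as a calculation, the change failure rate is simply failed builds divided by total builds. A small sketch with made-up numbers (the function name is mine):

```python
def change_failure_rate(failed_builds: int, total_builds: int) -> float:
    """Fraction of pipeline builds that failed (0.0 to 1.0)."""
    if total_builds <= 0:
        raise ValueError("total_builds must be positive")
    if not 0 <= failed_builds <= total_builds:
        raise ValueError("failed_builds must be between 0 and total_builds")
    return failed_builds / total_builds


# Made-up example: 6 failed builds out of 120 in one sprint.
rate = change_failure_rate(failed_builds=6, total_builds=120)
print(f"change failure rate: {rate:.1%}")
```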
Change volume
DevOps encourages making changes fast and often. But this can be misleading. Frequent but small changes do not impact the user experience (or the underlying system) as much as a change that touches a large number of lines. The impact of 100+ changed lines of code is usually larger than that of a 2-line change to a title. Therefore, keep an eye on the change volume and take it into account when setting up KPIs for this category.
Commits are implicitly grouped if you prefix every commit message with the ticket number. Pre-commit hooks can help to enforce this. Use this practice for all of your source code repositories: it also helps to track down the right person in case an error occurs.
And last but not least, avoid long-running feature branches: they make merging later in the process very hard and thus increase the risk of a broken pipeline. Similar to the statement above, it’s harder to find a problem in 30+ commits than in 1 commit.
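A hook that checks the prefix could be sketched as follows. The ticket format (`SHOP-42: ...`, Jira-like) is an assumption, so adapt the pattern to your own ticket system. Note that in plain Git the hook that receives the commit message is `commit-msg`: Git passes it the path of the file that holds the proposed message, and a non-zero exit code aborts the commit.

```python
import re
import sys

# Assumption: tickets look like "SHOP-42" and are followed by ": " and a summary.
TICKET_PREFIX = re.compile(r"^[A-Z][A-Z0-9]+-\d+: \S")


def has_ticket_prefix(message: str) -> bool:
    """True if the first line of the commit message starts with a ticket number."""
    first_line = message.splitlines()[0] if message.strip() else ""
    return bool(TICKET_PREFIX.match(first_line))


def main(argv: list[str]) -> int:
    """Entry point for a commit-msg hook; returns the process exit code."""
    if len(argv) != 2:
        print("usage: commit-msg <path-to-message-file>", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        message = handle.read()
    if has_ticket_prefix(message):
        return 0
    print("commit message must start with a ticket number, "
          "e.g. 'SHOP-42: fix title'", file=sys.stderr)
    return 1
```

To use it as a hook, save the script as `.git/hooks/commit-msg`, make it executable, and end it with `sys.exit(main(sys.argv))`.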
Defect escape rate
The defect escape rate is an interesting KPI that is not so common. It measures the (relative) number of errors that were caught in pre-production environments versus the ones that occurred in the real production environments. Once you have the numbers for this KPI, you have better insight into how effective your activities in the pre-production environments are.
A simple example: if you conduct lengthy integration or performance tests in your pre-production environment and you capture problems before they end up in production, you get an indication of the overall quality of your software releases.
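One common way to express this KPI, which I assume here, is the share of all defects that were only found in production. The function name and the numbers are illustrative.

```python
def defect_escape_rate(found_pre_production: int, found_production: int) -> float:
    """Share of all defects that escaped to production (0.0 to 1.0)."""
    if found_pre_production < 0 or found_production < 0:
        raise ValueError("defect counts cannot be negative")
    total = found_pre_production + found_production
    if total == 0:
        return 0.0  # no defects recorded at all
    return found_production / total


# Made-up example: 45 defects caught before release, 5 found in production.
escape_rate = defect_escape_rate(found_pre_production=45, found_production=5)
print(f"defect escape rate: {escape_rate:.0%}")
```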
The above-mentioned example might sound easy, but it’s not so trivial since there will almost always be subtle differences between a pre-production environment and the actual production environment.
- Production environments need to take into account the actual workload of real end-users – it’s difficult to predict their behavior.
- Pre-production systems rarely hold the same amount of data, so it’s difficult to determine how well an application behaves under production-sized data volumes.
- Using IaC helps to set up similar environments, but this needs to be done consistently. Zero-touch platforms keep humans from making manual changes and thereby help to avoid configuration drift.
Great KPIs help to measure how effectively DevOps teams bring their applications to production. With every common KPI, there are a number of considerations to take into account. Focus on these to get the most out of them. I hope this article gave you some insights into the details so your KPIs will definitely work for you in the best possible manner.