With the ever-increasing popularity of containerized microservices in the public cloud, deployments have become more complex. Because changes are deployed quickly and often, how these services behave in production is not always predictable. Recently, we highlighted ways to properly monitor these applications. But what about the business perspective? CI/CD-related metrics provide valuable feedback that any DevOps team can use to monitor their applications and further improve their product. Read on to find out which CI/CD metrics matter to your business.
Before we dive deeper into this topic, let’s do a quick overview of typical infrastructure-related metrics.
Assume you have a Kubernetes cluster running in production. You have deployed your applications and end-users can consume your services. Typical metrics on an infrastructure layer are:
- What is the health of your worker nodes? Is there any resource oversubscription or resource contention? Usually, physical or virtual servers are memory-constrained, but processor usage or storage performance may be bottlenecks, too.
- Does the log show any critical errors? And if yes, how many in a given period of time?
- What is the status and the latency of connections to other systems?
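As a rough illustration, such infrastructure checks can be expressed in a few lines. This is a minimal sketch, assuming the metrics have already been collected (for example, scraped from the cluster's metrics endpoint); the node data and thresholds below are invented:

```python
# Minimal sketch: flag worker nodes under resource pressure.
# The node data and the 90%/85% thresholds are illustrative only.

def node_alerts(nodes, mem_limit=0.90, cpu_limit=0.85):
    """Return (node, resource) pairs for nodes over the limits.

    Memory is checked first, since servers are usually memory-constrained.
    """
    alerts = []
    for node in nodes:
        if node["mem_used"] / node["mem_total"] > mem_limit:
            alerts.append((node["name"], "memory"))
        elif node["cpu_used"] / node["cpu_total"] > cpu_limit:
            alerts.append((node["name"], "cpu"))
    return alerts

nodes = [
    {"name": "worker-1", "mem_used": 29, "mem_total": 32, "cpu_used": 4.0, "cpu_total": 8},
    {"name": "worker-2", "mem_used": 12, "mem_total": 32, "cpu_used": 7.5, "cpu_total": 8},
]
print(node_alerts(nodes))  # -> [('worker-1', 'memory'), ('worker-2', 'cpu')]
```

The same idea extends to storage latency or critical log-error counts per time window.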
Metrics on the most critical issues
Companies avoid the biggest risks in order to survive and grow their business. Software development companies face security risks at every step of their Software Development Life Cycle. Capturing security flaws at an early stage avoids big risks in production. Preventing re-work, or catching it early, is far cheaper and less risky than fixing a flaw after it has made it all the way to production.
What happens when critical security issues are found in a system that is already running in production? Let’s assume you collect these metrics:
- The number of critical vulnerabilities across all of your applications. This gives an idea of how vulnerable the organization is, but there is no context: everything sits in one big pile, and security teams will be overwhelmed by the issues. Categorization and filtering are needed to set the right priorities.
- The number of failed login attempts. Imagine a hacker tries to gain access to the infrastructure or an application. What is the (security) blast radius when a single application is hacked? Try to answer that question from an enterprise-wide perspective, not just a technological one. Hint: think of the loss of valuable data, the cost of deploying an emergency patch in the middle of the night, and the damage to the company's reputation.
These metrics make sense when viewed in isolation. They make even more sense when viewed as a total set. You need to know the context of the applications, their place in the application landscape, as well as what business functions they perform. A service transforming data from one format to another is less risky than a service processing personally identifiable information or payments.
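That contextual weighting can be sketched in a few lines: rank findings by severity combined with the sensitivity of the data the application handles. The severity scores and sensitivity weights below are invented for illustration:

```python
# Illustrative sketch: rank vulnerabilities by severity *and* business
# context, so the security team is not staring at one undifferentiated pile.
# The weights and sample findings are assumptions, not real data.

SENSITIVITY_WEIGHT = {"pii": 3, "payments": 3, "internal": 1}

def prioritize(findings):
    """Sort findings so the highest combined risk comes first."""
    return sorted(
        findings,
        key=lambda f: f["cvss"] * SENSITIVITY_WEIGHT[f["data_class"]],
        reverse=True,
    )

findings = [
    {"app": "format-converter", "cvss": 9.1, "data_class": "internal"},
    {"app": "checkout",         "cvss": 6.5, "data_class": "payments"},
]
print(prioritize(findings)[0]["app"])  # -> checkout
```

Note how the payment service with a medium CVSS score outranks a data converter with a higher raw score, exactly because of its business context.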
All of this shows the need for more contextual information. Every organization needs it before it can make meaningful decisions.
Feedback from a DevOps perspective
Useful metrics also matter from a DevOps process perspective. The following metrics act as a starting point to get your mind pointed in the right direction:
Production downtime during deployment
Do you know what the duration of your change window is? Perhaps you need a couple of hours to roll out a new version to production. This window also includes time for some quick sanity tests, and for rolling back to the previous version in case the deployment goes wrong.
Useful metrics here include the downtime itself, weighed against how critical an application is. Two hours of downtime in the middle of the night can be acceptable if no customers are using the application; the same number can be unacceptable when thousands of users rely on it. The length of each individual window matters, too. Many customers don’t mind multiple short outages but become grumpy when their workflow is interrupted for too long.
Given these examples, determine how much (and how often) downtime is acceptable. This results in an error budget: knowing how much downtime you can afford. This budget varies per application, or even per service within an application, and depends on technical dependencies, importance to revenue and other factors.
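A simple error-budget calculation makes this concrete. The 99.9% availability target below is only an example; every service's SLO will differ:

```python
# Sketch of an error-budget calculation. The 99.9% target and the
# 30-day period are examples, not recommendations.

def error_budget_minutes(slo, period_days=30):
    """Allowed downtime per period for a given availability target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo)

def budget_remaining(slo, downtime_minutes, period_days=30):
    """How much of the budget is left after the downtime already spent."""
    return error_budget_minutes(slo, period_days) - downtime_minutes

print(round(error_budget_minutes(0.999), 1))    # -> 43.2 minutes per 30 days
print(round(budget_remaining(0.999, 30.0), 1))  # -> 13.2 minutes left
```

With 13 minutes of budget left, a two-hour change window is clearly not affordable this month; that is exactly the kind of fact-based trade-off the budget enables.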
Number of code branches
At first glance, this might look like a very technical metric. It’s not about putting a version control system in place (having one is a hard requirement), nor about starting the Trunk Based Development versus Git flow discussion. It’s about code quality and the effort it takes to maintain it.
Having a lot of feature branches gives you an indication of how many features are in progress. The more components in progress, the more risk when all of them are merged back for the final release. If developers forget to delete feature branches after merging, that technical debt will slow the team down. Unattended code poses another risk: what if all feature branches are scanned for security issues, and the security team spends time fixing issues in branches that are no longer needed? You lose valuable time fixing code that isn’t even in use.
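As a sketch, counting open branches and flagging stale ones takes only a few lines, assuming the branch names and last-commit dates have been exported from your version control system (the sample data here is invented):

```python
# Hypothetical sketch: report open feature branches and flag stale ones.
# Input is a mapping of branch name -> last-commit date, e.g. exported
# from the VCS; the 30-day staleness threshold is an assumption.
from datetime import date

def branch_report(branches, today, stale_after_days=30):
    """Return (open_count, [branches untouched for too long])."""
    stale = [
        name for name, last_commit in branches.items()
        if (today - last_commit).days > stale_after_days
    ]
    return len(branches), stale

branches = {
    "feature/login": date(2024, 5, 1),
    "feature/search": date(2024, 6, 10),
}
print(branch_report(branches, today=date(2024, 6, 15)))
# -> (2, ['feature/login'])
```

The stale list is a good starting point for a weekly clean-up, so nobody wastes time scanning or patching dead code.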
Lead time to production
Perhaps a very obvious one. If you measure the lead time to production for all of your applications, you get an indication of how smoothly each DevOps team can ship. If it takes one team two months between each new version and another team just two days, a knowledge-sharing session between those teams can be useful. It can also indicate whether there are many (cultural, political or technical) impediments. It does not necessarily say anything about the maturity or technical prowess of a team, however.
You can only compare this objectively if the variables are the same across teams. And they are usually not. Technical debt, risk associated with applications, availability, adoption and familiarity with monitoring, CI/CD and other tools and many other factors impact this lead time.
Nor should lead time be a single measuring stick for all applications. Instead, measuring the relative change in lead time per team over time is a more sensible approach that does away with most of the variance between teams.
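Measuring that relative change is straightforward. A minimal sketch, with invented sample numbers:

```python
# Sketch: compare each team's lead time against its *own* history,
# not against other teams. Sample lead times (in days) are invented.

def relative_change(history):
    """Percentage change between the first and last measured lead time."""
    first, last = history[0], history[-1]
    return (last - first) / first * 100

lead_times = {
    "team-a": [60, 45, 30],  # improving release by release
    "team-b": [2, 2, 3],     # regressing slightly
}
for team, history in lead_times.items():
    print(team, round(relative_change(history), 1))
# -> team-a -50.0
# -> team-b 50.0
```

Note that team-b still ships thirty times faster than team-a in absolute terms; the relative view is what tells you team-a is the one actually improving.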
Regression test duration
One of the key aspects of DevOps is continuous testing. This includes testing in different phases and with different purposes: from unit and integration testing on specific components, to end-to-end functionality, load and performance testing, as well as security testing throughout the software pipeline.
Regression tests should be executed and their results analyzed before pushing to production. What if the regression test for one application takes about 6 hours while another application’s takes just 15 minutes? It helps to compare the duration of the tests to the size, complexity and number of tests in the application. The test framework plays a vital role here, as does the amount of resources used to perform the tests; auto-scaling and resource optimization can help. All of these can be measured, and metadata adds the right context. And again, don’t forget to look only at relative improvements, not to compare across teams.
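One way to add that context is to normalize suite duration by the number of tests. A sketch with invented numbers:

```python
# Sketch: normalize regression-test duration by suite size, so a 6-hour
# suite and a 15-minute suite can be compared fairly. Sample data invented.

def seconds_per_test(duration_minutes, test_count):
    """Average wall-clock seconds spent per test case."""
    return duration_minutes * 60 / test_count

suites = {
    "app-a": {"duration_minutes": 360, "tests": 4800},  # the "slow" suite
    "app-b": {"duration_minutes": 15,  "tests": 150},   # the "fast" suite
}
for app, s in suites.items():
    print(app, seconds_per_test(s["duration_minutes"], s["tests"]))
# -> app-a 4.5
# -> app-b 6.0
```

In this (invented) example the 6-hour suite is actually the more efficient one per test; the raw wall-clock comparison would have pointed you at the wrong team.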
Broken build time and first-time right
Teams practicing continuous delivery strive for successful builds, preventing re-work. But what if something in the CI/CD pipeline goes wrong and the pipeline breaks? You can measure the time it takes the team to fix the issue and get the pipeline running again. Several indicators can be measured to give you valuable insights:
- Is the pipeline being fixed by the person who broke it or does the entire team help? This indicates the level of collaboration on pushing the product forward. You don’t want to rely on just a single person in case it breaks at a critical moment in time.
- How long does it take to fix a broken pipeline? Perhaps the team is too busy on Tuesdays, so fixes take too long: attention has to be shifted and other tasks have to wait. That is a bad sign if the team pushes to production every Wednesday. Stress builds up, which can result in more errors.
- How many commits are needed to fix the broken build? If the team needs too many commits, risk increases: the chance of another broken build grows and the application becomes harder to maintain. Quick fixes often mean cutting corners, causing a cascading effect in the near future and resulting in even more broken builds.
- How often does the build break on the same issue? Recurrence indicates that the team needs to find the root cause. That root cause might be outside the team’s immediate control, in which case they have to collaborate with another team or department. Finding it improves the health of the organization as a whole, since other teams might face the same problem. Look for patterns here and prioritize cross-team issues over individual ones.
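The signals above can be aggregated from a simple list of pipeline-failure records. The record shape here is an assumption for illustration, not any real CI system's API:

```python
# Sketch: aggregate broken-build signals from failure records.
# The record fields (fix_minutes, fix_commits, cause) are assumptions.

def build_health(failures):
    """Average time-to-fix, average commits-to-fix, and repeat root causes."""
    avg_fix = sum(f["fix_minutes"] for f in failures) / len(failures)
    avg_commits = sum(f["fix_commits"] for f in failures) / len(failures)
    causes = {}
    for f in failures:
        causes[f["cause"]] = causes.get(f["cause"], 0) + 1
    repeats = [cause for cause, count in causes.items() if count > 1]
    return round(avg_fix, 1), round(avg_commits, 1), repeats

failures = [
    {"fix_minutes": 20, "fix_commits": 1, "cause": "flaky-test"},
    {"fix_minutes": 90, "fix_commits": 4, "cause": "flaky-test"},
    {"fix_minutes": 40, "fix_commits": 2, "cause": "bad-config"},
]
print(build_health(failures))  # -> (50.0, 2.3, ['flaky-test'])
```

Here the repeated "flaky-test" cause is the cross-team pattern worth prioritizing over the individual fixes.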
More powerful metrics
As the previous sections show, operational metrics and DevOps-related metrics are both important. Combining them, however, creates even more powerful insights, and enables organizations to take actual business decisions.
The following example helps to understand the relationship and shows the true power of the combined metrics.
Imagine you run a complex application that is difficult to update with new features and that requires a lot of complex configuration before it can be deployed to your production systems. If, at the same time, the number of commits in the new release is high, and thus the number of “moving parts” is high as well, you need to be extra careful. Add to that the relative number of vulnerabilities compared to the total lines of code.
Everyone can see that deployments of these kinds of applications involve a lot of risk. Things can go wrong very quickly. Perhaps the sales team also measures the revenue per API request (for instance, a webshop that interacts with an external payment provider). If this is your number-one application, the one that brings in the most revenue, you need to reduce those risks.
It’s obvious this application should not suffer from infrastructure issues like CPU overload or low memory, and that downtime when pushing a new release to production should be strictly limited. Management should prioritize helping the team improve ‘first time right’, reducing errors and re-work. Perhaps the project temporarily needs more budget to fix technical debt. Without the combined information it is hard to take such decisions, since there are no facts. Metrics should produce the facts you steer your business on.
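As a sketch, such combined signals could be folded into a single deployment-risk score per application. The inputs, weights and normalization below are pure assumptions; every organization would tune its own:

```python
# Illustrative combination of signals from earlier sections into one
# deployment-risk score. Weights (0.4/0.3/0.3) and the normalization
# caps (100 commits, 5 vulns per KLOC) are invented assumptions.

def deployment_risk(commits, vulns_per_kloc, revenue_share):
    """Higher score = riskier deployment for the business (0..1)."""
    change_risk = min(commits / 100, 1.0)        # many moving parts
    security_risk = min(vulns_per_kloc / 5, 1.0) # vulnerabilities per KLOC
    return 0.4 * change_risk + 0.3 * security_risk + 0.3 * revenue_share

# A high-revenue webshop release with many commits scores far higher
# than a small release of an internal tool:
webshop = deployment_risk(commits=120, vulns_per_kloc=2.0, revenue_share=0.8)
internal = deployment_risk(commits=10, vulns_per_kloc=0.5, revenue_share=0.05)
print(round(webshop, 2), round(internal, 2))
```

The absolute numbers mean little on their own; what matters is that the score surfaces which deployments deserve extra care, budget or a smaller change window.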
Operational metrics collected from the infrastructure layer help an organization with the stability of their applications. CI/CD related metrics add more context to how well the DevOps teams perform, and where they need additional time and budget to improve.
Combining the two adds valuable insight into which business decisions the organization should take to grow and sustain its future, and gives it useful information to initiate new projects.