DevOps teams are responsible for their workloads from the early stages of development through to production. All team members now face what sysadmins already know: being on call. In a true DevOps team, they need to be available after working hours to solve critical disruptions in a timely manner. Most people on call don’t like this, since it interferes with their personal lives. When having fun, no one wants to be interrupted by annoying or invalid notifications. However, it is a consequence of DevOps, in which teams are responsible for both the Dev and the Ops work. In this article, I will share some tips and tricks to get the most out of your alerts.
A good alerting system rests on a few fundamental considerations. Without them, there is no solid base on which to design and implement the rules. Besides the technical aspects of creating and maintaining the rules, there is a social aspect to take into account. Keep your team happy and use the following guidelines as a frame of reference:
- Alerts should trigger an appropriate action. An alert that reads like “this fails again” does not trigger any action. It just annoys the person who receives it, because he or she does not know exactly what went wrong.
- Scripted / robotic alerts should be avoided. Alerts should include proper context and intelligence so the person in charge can deal with them.
- It’s important that the person on call can determine the priority and react with the right sense of urgency. This helps them spend their valuable time well, so no time and energy are wasted on low-priority pages.
Keeping these fundamentals in place makes it easier to implement the actual rules.
With that foundation in place, make sure you create rules that are truly meaningful and that really help make the system more stable in the long run. This might sound easy, but it is not always obvious.
Suppose you have a rule which states: storage device X is almost full. What does that mean? For sure, it means device X is running out of free space. But the person on call might not know how quickly it fills up. Simply said: it can be anywhere between 1 minute and 1 week. It depends on the actual size of the storage device, the speed with which it fills up and also on the actual cause. Besides this, you don’t know the impact of the alert. What happens if the storage device is actually full? You need more context to make an informed decision.
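To make the "1 minute or 1 week" question concrete, here is a minimal sketch of an alert message that includes an estimated time until the device is full. It assumes you can obtain the capacity, the current usage and a recent growth rate from your monitoring system; the function names and the linear-growth assumption are illustrative only.

```python
from typing import Optional


def hours_until_full(capacity_gb: float, used_gb: float,
                     growth_gb_per_hour: float) -> Optional[float]:
    """Estimate the hours left until the device is full.

    Assumes linear growth (a simplification). Returns None when usage
    is not growing, because no meaningful estimate exists in that case.
    """
    if growth_gb_per_hour <= 0:
        return None
    free_gb = max(capacity_gb - used_gb, 0.0)
    return free_gb / growth_gb_per_hour


def describe_disk_alert(device: str, capacity_gb: float, used_gb: float,
                        growth_gb_per_hour: float) -> str:
    """Build an alert text that tells the person on call how urgent it is."""
    percent = 100 * used_gb / capacity_gb
    eta = hours_until_full(capacity_gb, used_gb, growth_gb_per_hour)
    if eta is None:
        return f"{device} is {percent:.0f}% full but usage is not growing."
    return (f"{device} is {percent:.0f}% full and will be full in about "
            f"{eta:.1f} hours at the current growth rate.")
```

For example, `describe_disk_alert("disk-x", 500, 450, 5)` reports a device that is 90% full with roughly 10 hours left, which is far more actionable than "almost full".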
Try to answer the following questions which lead to a context-rich rule:
- What happens if you ignore this alert, now or in the future? What would be the ultimate reason to rewrite or completely discard this rule, and when is the right moment to do so?
- When do I need to jump into action? If an alert does not tell you that, you don’t know whether you need to rush home to fix things or whether you can finish your evening drink with your friends.
- Who else is being notified? If more than one person gets the alert: will that person fix it or should you do it yourself? Perhaps you can work together to handle two topics at the same time: the actual problem and the underlying cause of the problem. If multiple people are alerted, each of them needs to know about the others to avoid confusion and conflicts. In the end, you don’t want your fix to conflict with the work of your colleague, making things even worse.
- Does the problem hurt your end-users? For example, a website is down and your customers cannot order any products from your company. If so, there is a real sense of urgency. In the case of an internal reporting application that is used only one day of the month, it might not hurt your users at all.
This list of questions might take a little while to handle, but in the end, it can save you a lot of headaches and frustration. Well begun is half done!
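One way to enforce the questions above is to give every alert explicit fields for the answers and to flag alerts that leave them blank. The sketch below is only an illustration; the `Alert` class and its field names are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Alert:
    """Hypothetical alert record with one field per context question."""
    summary: str
    impact_if_ignored: str = ""            # what happens if nobody reacts?
    act_within: str = ""                   # e.g. "15 minutes" or "next working day"
    also_notified: List[str] = field(default_factory=list)
    hurts_end_users: Optional[bool] = None  # None means "not answered yet"


def missing_context(alert: Alert) -> List[str]:
    """Return the names of the context fields that are still unanswered."""
    missing = []
    if not alert.impact_if_ignored:
        missing.append("impact_if_ignored")
    if not alert.act_within:
        missing.append("act_within")
    if alert.hurts_end_users is None:
        missing.append("hurts_end_users")
    # An empty also_notified list is a valid answer: nobody else was paged.
    return missing
```

An alert such as `Alert("this fails again")` would be flagged with all three fields missing, which is a signal to rewrite the rule before it pages anyone.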
Think from the end-user’s perspective
End users are not interested in technical messages which they cannot or will not understand. They are not the system administrators nor the developers of the application.
They only care about what is visible to them and what really affects the way they work with and perceive your application. So it’s not wise to bother them with messages like “the database is down” or “The file system driver cannot save file ABC to location XYZ”. From a security perspective, don’t log any sensitive details such as “Cannot make a database connection with user ‘john’ to DB server 127.0.0.1”. That would reveal too much information and make your system vulnerable to attackers.
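As a rough illustration of keeping such details out of messages, the sketch below masks user names and IPv4 addresses before a message is logged or shown. The `redact` helper and its patterns are assumptions made for this example; a real system would need patterns that match its own message formats.

```python
import re


def redact(message: str) -> str:
    """Mask user names and IPv4 addresses in a message.

    The patterns are deliberately simple and only cover the style of
    message used as an example in this article.
    """
    # Replace 'user "john"' or 'user john' with a neutral placeholder.
    message = re.sub(r'user\s+["“‘\']?[\w.-]+["”’\']?', "user [redacted]", message)
    # Replace anything that looks like an IPv4 address.
    message = re.sub(r"\b\d{1,3}(?:\.\d{1,3}){3}\b", "[redacted-host]", message)
    return message
```

Applied to the example message above, the user name and the server address are both replaced by placeholders while the rest of the text stays readable.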
A couple of things that are really important to keep in mind when thinking from your end-users’ perspective:
- Core services should always be available to end-users. Those services should work properly 100% of the time. Pages and media such as images and videos should load correctly. No CSS or JS errors are allowed. Be sure to serve your website over SSL/TLS with a valid certificate so users won’t get a security warning as soon as they open your website address.
- Besides the core services, all of the related features should work as well. Monitor any feature that supports the core service, since a failure there can have a negative effect on the core service itself, even if the feature is not clearly visible to the end-user.
- End-users care about latency. They want a fast and responsive system. Always and everywhere. No compromise on this.
- Their data should be handled in a trustworthy manner. So process their data with the utmost care: store it correctly, refresh it when needed, and handle errors gracefully if it is not available immediately.
Be faster than the system
One thing to keep in mind is to be faster than the system itself. By this I mean that you need to think ahead: what could possibly go wrong with (core) service X or Y that poses a threat to system Z, and what is the impact on your end-users? It’s not enough to react to things as they happen; you also need to act on things that are (very) likely to happen and that would really hurt your end-users’ experience or destabilize your system.
When thinking from the end-users’ perspective and keeping an eye on the future, you have a clear path to start defining the rules that really matter.
Where to focus
Many sysops tend to focus their alerts on symptoms rather than on the causes that pose a problem for the system. They raise an alert if their (critical) application server logs a 500 error (a server-side problem). There can be numerous reasons for a server to generate a 500 error: a full disk, excessive memory consumption, network latency, a database error, etc. Basically, it could be anything, and thus the system can create too many alerts that do not point to the actual (root) cause.
It would be much better to aim alerts at the actual cause. To do so, answer questions like the following:
- What eats away my disk space so quickly? (e.g. is log rotation enabled and are temporary files cleaned up regularly?)
- Does my application or another system process have any memory limits? (e.g. configure containers to limit the amount of memory they can use so your system won’t exhaust its resources.)
- Why does network latency hamper your application? (e.g. is the network connection too slow or unpredictable, or is your application not resilient enough?)
If you do not allocate time to investigate these issues, you end up with a lot of redundant alerts that never address the underlying problem.
In contrast to the above, sometimes you do need to look at the causes themselves to see whether alerting on them is still relevant. Suppose you think you need to send out an alert when a server is down. If you plan to implement a high-availability solution, this alert is no longer relevant unless both servers are down. Perhaps you can tweak the alert to check the availability of both servers instead of one. If your uptime figures are good and the system has automatic fail-over and self-healing (a new server spins up as soon as one crashes beyond repair), you might not even need this alert. In that case, you might instead need an alert that warns the person on call when the self-healing feature stops working.
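The reasoning above can be sketched as a small decision function. It assumes you already have a per-server health status and a flag that tells whether the self-healing mechanism is working; both inputs and the function name are hypothetical.

```python
from typing import Dict, Optional


def availability_alert(server_up: Dict[str, bool],
                       self_healing_ok: bool) -> Optional[str]:
    """Decide whether to page someone in a high-availability setup.

    A single failed server is tolerated as long as self-healing works.
    """
    down = sorted(name for name, up in server_up.items() if not up)
    if down and len(down) == len(server_up):
        # Every server is down: the service itself is unavailable.
        return "page: all servers down (" + ", ".join(down) + ")"
    if not self_healing_ok:
        # Redundancy is no longer being restored automatically.
        return "page: self-healing is not working"
    return None  # nothing worth waking someone up for
```

With `{"a": False, "b": True}` and working self-healing, the function returns `None`, so nobody is paged for a failure the system can recover from on its own.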
Client / server perspective
Many modern systems are based on the client/server model. Front-ends like websites act as clients, whereas application or back-end services act as their server counterparts.
Great alerts can be written when keeping the client’s perspective in mind. An example:
It’s the client that reports latency reliably, not the server. This is because clients sit at the beginning of the chain of (server/network) hops. Perhaps the application server sits next to the database server. The connection between them is rock solid, so latency is not an issue. However, the front-end server might be somewhere else (loose coupling of applications), and thus the latency between the front-end and the back-end can be a problem for the client / end-user who interacts (through a website) with the front-end.
Besides the above, keep in mind that clients actually aggregate results from various other systems beyond the application server and database server. They might integrate with third-party services like message queues, account authorization services, etc. If you only monitor these services individually, you miss out on the customer experience. Aggregation is king here, since it provides a simpler view of things that are otherwise difficult to capture.
Of course, you can set alerts on all of these services to ensure a certain level of quality from the clients’ perspective. But if these services are not yours (another team or even another company is responsible for them), this is not in your hands. Therefore, keep the client’s perspective in mind to define reliable and truly useful rules.
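A minimal sketch of an alert based on latency as observed by the client: it takes end-to-end response times measured at the front-end and checks a high percentile against a threshold. The nearest-rank percentile, the 95th percentile and the 500 ms threshold are assumptions chosen for this example, not recommendations.

```python
import math
from typing import Optional, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of the samples (pct between 0 and 100)."""
    if not samples:
        raise ValueError("no latency samples available")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_alert(client_latencies_ms: Sequence[float],
                  threshold_ms: float = 500.0,
                  pct: float = 95.0) -> Optional[str]:
    """Return an alert text when client-observed latency is too high."""
    observed = percentile(client_latencies_ms, pct)
    if observed > threshold_ms:
        return (f"p{pct:.0f} client latency is {observed:.0f} ms "
                f"(threshold {threshold_ms:.0f} ms)")
    return None
```

Because the measurement is taken at the client, it already aggregates every hop and third-party service in the chain, so a single rule covers the experience the end-user actually has.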
Alerts are not always needed. I’ve seen a lot of DevOps teams that report every “successful build” to their team’s Slack channel. Question the value of that. If 75% of your builds are okay because developers test properly locally, you don’t need this kind of information. It would be much more useful to report only the 25% of builds that actually break. This is also a shift-left approach: it puts more responsibility into the hands of developers, who commit as soon as they are confident about their change and thus limit the risk of introducing a broken build.
Perhaps you don’t even need to alert on things. Does it really matter if you don’t receive an alert? Sometimes you can just generate a report based on daily or weekly statistics. Based on the example above, the report can contain the following:
- The total number of builds
- The # and % of successful builds
- The # and % of broken builds
Advanced statistics might reveal:
- An aggregation of similar reasons why a build breaks: for example, a syntax error in module XYZ or every critical vulnerability that was introduced last week
- New dependencies on an external system: call the team who maintains it!
- An infrastructure-related error: fix it yourself if you are responsible for it (DevOps: you build it, you run it).
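Such a report is easy to generate. The sketch below assumes the build history is available as a simple list with one entry per build: `None` for a successful build and a short failure reason otherwise. That input format and the `build_report` name are made up for this example.

```python
from collections import Counter
from typing import Dict, List, Optional


def build_report(builds: List[Optional[str]]) -> Dict[str, object]:
    """Summarize build results for a daily or weekly report.

    Each entry is None for a successful build, or a failure reason.
    """
    total = len(builds)
    failures = [reason for reason in builds if reason is not None]
    broken = len(failures)
    successful = total - broken

    def pct(count: int) -> float:
        # Avoid dividing by zero when there were no builds at all.
        return round(100 * count / total, 1) if total else 0.0

    return {
        "total": total,
        "successful": successful,
        "successful_pct": pct(successful),
        "broken": broken,
        "broken_pct": pct(broken),
        # Group identical failure reasons to reveal recurring problems.
        "top_reasons": Counter(failures).most_common(3),
    }
```

Grouping the failure reasons is what turns the report into something actionable: a reason that shows up every week deserves a ticket, not another notification.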
For every item that needs to be monitored and for which someone might need to receive some kind of alert, think about whether a direct alert is really needed or whether you can find another way to inform the person or team in charge.
Whatever your solution is, be sure to track every (unique) alert through a ticket system. Every alert requires proper follow-up, since it eats into time that could be spent on business features. Use your time wisely.
People on call need to make sure they react to alerts in a proper way. Smart and proper alerts help to keep the person on call happy, and they also help to judge the issue in a meaningful way. In this article, I highlighted tips and tricks to define proper rules when it comes to alerts and other ways to report issues. Hopefully, this inspires you to think about it when you are working on this topic in your organization.