Cloud adoption keeps growing rapidly. Development teams use on-demand infrastructure that scales up and down. They benefit from multiple datacenters and regions, as well as dedicated connections from their on-premises datacenter. Although this is great, cloud infrastructure can still fail. Teams need to be aware of this and should build resilient infrastructure and applications. It’s important to build self-healing systems that avoid downtime.
In this article, I will highlight some potential failures that can occur and how your application and infrastructure should cope with them. In the end, a reliable service is what makes customers want to do business with you.
Load balancing
Load balancing ensures traffic is distributed across many instances of application services. A very basic form of load balancing is to spread the load evenly across your servers in a round-robin fashion. Another method is latency-based routing, where the load balancer directs traffic to the member of the pool that currently responds with the lowest latency. All of these services help to keep your application operating as much as possible. But what if they fail for whatever reason? The following examples help to make your application (and its infrastructure) more robust against different kinds of failures:
- Fail-over. A traditional load balancer sits in front of several virtual machines. If the health check for one of those VMs fails, traffic is sent to the remaining VMs. At the same time, traffic to the failed VM is cut off and an alert can be raised.
- Another pattern that is closely related to traditional load balancing is the “Queue-Based Load Leveling” pattern. This pattern smooths out sudden spikes in traffic that would otherwise overload the system. To do so, it queues requests asynchronously to be processed later. Think of it as a buffer of requests that are kept until the system is under lower load.
- Not a specific type of load balancing, but related to it, is a “hot standby” database. The system switches over to this database if the main database fails. It’s like an extra guardrail that secures you while walking a steep path. Bear in mind that a hot standby database increases your cloud costs.
- In Kubernetes, applications run in so-called “Pods”. A typical deployment pattern is to run multiple Pods as a fail-over. If the desired number of Pods is 3, Kubernetes ensures there are always 3 Pods running; in case of a Pod failure, it automatically starts a new one. With a DaemonSet, you can enforce that a copy of a Pod runs on every worker node. This is another layer of protection, since an entire worker node can fail as well.
- Besides this, keep in mind that Kubernetes depends on infrastructure that is provisioned upfront. In AWS you can use a so-called auto-scaling group, which makes sure a worker node is replaced when one fails. Azure offers Virtual Machine Scale Sets for this purpose, and Google uses node pools, so all major cloud providers offer this kind of service.
The examples mentioned above are all configured on the infrastructure layer. Be sure to create them using Infrastructure as Code principles so you can re-provision them quickly and reliably. The biggest benefit: you don’t need to change the application itself.
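Of these patterns, queue-based load leveling is the one you are most likely to also touch in application code. The following is a minimal, single-threaded Python sketch of the idea, not a production design; the `LevelingQueue` name and the fixed per-tick drain rate are illustrative assumptions. A burst of requests is buffered and then processed at a steady rate:

```python
from collections import deque

class LevelingQueue:
    """Buffer incoming requests and process them at a steady rate."""

    def __init__(self, max_per_tick):
        self.buffer = deque()
        self.max_per_tick = max_per_tick  # steady processing capacity

    def enqueue(self, request):
        # Accept bursts immediately; work is deferred, not dropped.
        self.buffer.append(request)

    def drain(self, handler):
        # Process at most max_per_tick requests per scheduling tick,
        # smoothing a spike into several smaller batches.
        processed = 0
        while self.buffer and processed < self.max_per_tick:
            handler(self.buffer.popleft())
            processed += 1
        return processed

# A burst of 10 requests arrives at once...
q = LevelingQueue(max_per_tick=3)
for i in range(10):
    q.enqueue(i)

# ...but the back end only ever sees 3 per tick.
handled = []
while q.buffer:
    q.drain(handled.append)
```

In a real system the queue would be a durable message broker (e.g. a cloud queue service) and the drain loop a separate consumer process, so the buffer survives restarts.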
Network failures and transactions
Since you are running your applications in the cloud, you neither own nor control the network between resources and applications. Although cloud providers have very strong and reliable network infrastructure, failures can and will happen. You need to be prepared for them.
From the application perspective, these tips can help you:
- Retrying failed operations helps you overcome a temporary network or database connection failure. Build retry logic into the source code that attempts the operation up to X times. Azure provides a clear pattern and guidelines for handling these kinds of failures. One of the tips is to use idempotent services to keep your infrastructure resources consistent.
- The circuit breaker pattern prevents a failing service from “hammering” another service by retrying over and over again. You have to make sure your application doesn’t cause other (micro)services to fail.
- There can be situations in which you run redundant services that perform the same calculation or return the same response. Send a request to all of those services and compare the outcomes. If one service’s outcome differs from the others, shut down the faulty service. Using this pattern, you can verify the reliability of individual services.
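A retry with exponential backoff, as described in the first bullet, can look roughly like this in Python. It is a sketch, not a library recommendation; the function names and the choice to retry only on `ConnectionError` are illustrative assumptions:

```python
import time

def retry(operation, attempts=3, base_delay=0.1):
    """Call operation(); on transient failure, wait and try again
    with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure
            time.sleep(base_delay * 2 ** attempt)  # 0.1s, 0.2s, 0.4s, ...

# A service that fails twice with a transient error, then recovers.
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary network failure")
    return "ok"

result = retry(flaky_fetch)
```

Combining this with idempotent operations matters: a retried request must be safe to execute more than once.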
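The circuit breaker from the second bullet can be sketched as a small state machine. This is a deliberately simplified illustration (real implementations, such as Polly or resilience4j, add a timed “half-open” state); the class and threshold names are assumptions:

```python
class CircuitBreaker:
    """After too many consecutive failures, stop calling the
    downstream service instead of hammering it with retries."""

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failures = 0
        self.is_open = False  # open circuit = calls are short-circuited

    def call(self, operation):
        if self.is_open:
            # Fail fast without touching the struggling service.
            raise RuntimeError("circuit open")
        try:
            result = operation()
        except ConnectionError:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.is_open = True  # trip the breaker
            raise
        self.failures = 0  # a success resets the failure count
        return result

    def reset(self):
        # In a real breaker this happens automatically after a
        # cool-down period, via a "half-open" trial request.
        self.failures = 0
        self.is_open = False
```

Once the breaker is open, callers get an immediate error instead of piling more load onto a service that is already struggling.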
Long-running transactions or transactions with a large payload are more likely to be delayed or to fail than quick, small transactions. Split larger transactions into smaller ones and aggregate the responses in your application. This is certainly complex, and you also face the risk that one transaction in the chain fails…and then your entire transaction chain fails.
If this happens, be sure to undo your (partial) transaction, since your data might no longer be consistent. In case a long-running transaction really can’t be avoided, it’s wise to checkpoint it so it can be picked up (by a new system) at the point where it failed.
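Checkpointing can be sketched in a few lines. This toy example assumes progress is persisted to a simple JSON file (the function and file names are illustrative, and a real system would use durable storage); the point is that a restart resumes where the previous run failed instead of starting over:

```python
import json
import os
import tempfile

def process_with_checkpoints(items, work, checkpoint_path):
    """Process items one by one, persisting progress after each step
    so a restarted run resumes at the point of failure."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]  # resume from the last checkpoint
    for index in range(done, len(items)):
        work(items[index])
        with open(checkpoint_path, "w") as f:
            json.dump({"done": index + 1}, f)  # record progress

# Simulate a crash halfway through, then a restart.
path = os.path.join(tempfile.mkdtemp(), "progress.json")
processed = []
crashed = {"already": False}

def step(item):
    if item == "c" and not crashed["already"]:
        crashed["already"] = True
        raise RuntimeError("simulated failure")  # first run dies here
    processed.append(item)

items = ["a", "b", "c", "d"]
try:
    process_with_checkpoints(items, step, path)  # first run: a, b, crash
except RuntimeError:
    pass
process_with_checkpoints(items, step, path)  # restart: resumes at "c"
```

Note that each step must be idempotent or the checkpoint must be written atomically with the work, otherwise a crash between the two can process an item twice.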
Handling client-side issues
Clients can be a great pain for back-end systems. If clients are not very intelligent, you need to limit the impact their bad behavior has on other services:
- Some client-side services generate a huge number of (small) requests, which puts a heavy load on the services they call. If you don’t want to (or can’t) auto-scale, you can throttle the clients to avoid overloading the called service. For example: reject a client that has already made a similar API call 10 times in a row within a specific amount of time, or put those requests in a queue that gets lower priority than other requests.
- Some clients need to be blocked because they create real problems for other services. For example, block clients that generate a large number of faulty requests. If you know the owner of such a client, talk to them to resolve the problem on their side.
- Some clients cannot handle all of the new and fancy features your service provides. Use the graceful degradation pattern to gracefully reduce the features they can use. This pattern works much like websites did in the past, when many browsers could not yet render specific features: the website would not fail, but the functionality was gradually degraded without a big “red flag” to the end users.
- Clients can also cause your system to fail when they take a different route than expected. Be prepared for non-happy flows by including them in your integration tests. Sometimes non-happy flows are best defined by people outside of the project, since they look at things with fresh eyes. It’s then up to the development team to include those scenarios in their integration tests.
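The throttling idea in the first bullet (rejecting a client that made too many similar calls within a time window) can be sketched as a sliding-window rate limiter. The class and parameter names here are illustrative assumptions, and a real deployment would use a shared store such as Redis rather than in-process memory:

```python
import time
from collections import defaultdict, deque

class Throttle:
    """Reject a client that sends more than `limit` requests
    within `window` seconds (a sliding-window rate limiter)."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self.history[client_id]
        # Drop timestamps that have fallen out of the window.
        while timestamps and now - timestamps[0] > self.window:
            timestamps.popleft()
        if len(timestamps) >= self.limit:
            return False  # throttled: reject, or queue with lower priority
        timestamps.append(now)
        return True
```

A throttled request can be answered with an HTTP 429 status so well-behaved clients know to back off.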
Introduce random failures
Perhaps one of the most disruptive tests: introduce random failures into your system. Sometimes you want to test a disaster recovery plan. You can plan everything upfront or just pull the plug and see what happens. Of course, this is a very aggressive action. However, it gives you a very good practical view of how quickly you can recover from a true failure, and a good clue about what you need to do to limit the damage. For sure, you will get the attention of your department if you do this without up-front notice…
Did you already meet the chaos monkey? 😉 Netflix is famous for its initiatives on chaos engineering. In short, chaos engineering:
comprises causing deliberate faults to distributed software systems in production to test resilience in the face of turbulent or unexpected conditions.
They use a tool called “Chaos Monkey” to randomly unplug servers and other infrastructure components (in the cloud) during office hours. Engineers were given time to react and adapt their application services to this new phenomenon. Sooner rather than later, this led to applications that were resilient enough to overcome those kinds of problems. Resiliency was “built in” to every service they built.
Another goal of Chaos Monkey was to minimize the “blast radius”: the impact on a service in case of an issue. In the end, downtime should be kept to a minimum, since it greatly affects the customer experience.
In case you want to try out chaos engineering yourself (be very very careful and make sure your DevOps team is at a very high level of the DevOps maturity model), take a look at Gremlin.
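You don’t need Chaos Monkey to get a first taste of the idea. As a toy illustration only (the wrapper name and failure rate are assumptions, and this is in-process fault injection rather than the infrastructure-level chaos Netflix practices), you can wrap any call so that it fails at random and observe whether its callers survive:

```python
import random

def chaotic(operation, failure_rate=0.2, rng=None):
    """Wrap an operation so it randomly raises, forcing callers
    to prove they handle unexpected failures."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("chaos: injected failure")
        return operation(*args, **kwargs)

    return wrapped

# Wrap a dependency and observe how often callers must cope with errors.
fetch = chaotic(lambda: "data", failure_rate=0.5, rng=random.Random(42))
outcomes = []
for _ in range(100):
    try:
        outcomes.append(fetch())
    except ConnectionError:
        outcomes.append(None)
```

If a test suite keeps passing with a wrapper like this around its external dependencies, that is a small piece of evidence the error-handling paths actually work.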
Resilient infrastructure and applications do not create themselves. You have to take into account the infrastructure, the application, and the client side of things. With the tips and tricks in this article, you have a starting point to make your application more resilient and to keep your services up and running when failures occur. Good luck.