Chaos engineering: the steps to apply it to your application
In recent years, hosting methods and application development practices (micro-services) have led us to rethink the way our applications communicate with each other and the way we serve our customers.
The multiplication of services makes applications easier to control in terms of development and business, but it also brings its share of problems: sources of error are multiplied as well.
Chaos testing, or chaos engineering, is a philosophy that requires developers to take into account the failures that may occur in an application and thus prepare themselves to face a chaotic situation, namely:
- Application errors,
- Infrastructure errors,
- Network errors,
- More generally, any unexpected error.
The practice was initiated in 2011 by Netflix, whose teams described a set of principles to follow, called the [Principles of Chaos](http://principlesofchaos.org/). I invite you to read them before reading this article. We will see the steps to introduce chaos engineering in your teams and thus obtain an application that is resilient to failure.
All the failures mentioned above can (and should) be anticipated by developers when building new applications, but existing applications in production today are sometimes far from being resilient to all these sources of error. That's why I thought it would be interesting to share with you the steps that will, I hope, lead to a more resilient application.
The first piece of advice I'd like to give is: do not break your services in your production environment right away. You have pre-production or acceptance-testing environments, use them first!
There is no need to damage production when you can anticipate these failures elsewhere; your users or customers should not have to suffer from this.
Then, as specified above, errors can come from the infrastructure, the network and also the application! It is up to you, the developers, to implement services that can withstand these failures, sometimes even infrastructure or network problems.
Finally, proceed step by step. We will now see the different steps you can take, not necessarily in the order mentioned.
Some information before starting
I am deliberately not naming tools before starting because everyone is free to implement their own chaos rules. The important thing is to observe and react to a failure you have caused; thus, a few simple command lines can be used to start playing on your environments.
To kill a process, a simple kill can be used:
$ kill -9 <pid>
Also, if your applications are deployed on a Kubernetes cluster, to delete a pod at random:
$ kubectl delete pod/`kubectl get pods | cut -d' ' -f1 | sed 1d | shuf -n1`
As you can see, you can write your chaos rules quite simply.
Now let's look at the different cases you can start observing.
Add latency
To start gently, you can simply add latency to your servers; you should then begin to see various timeout problems on internal and external services that can damage your application.
These errors are very frequent and are caused by high load or poor connectivity at your provider. You must take them into account.
To add network latency, you can connect to a machine and simply play with the "tc" command (for traffic control):
# Add 500ms of latency
$ tc qdisc add dev eth0 root netem delay 500ms
# Verify that the rule has been applied
$ tc -s qdisc
qdisc netem 8002: dev eth0 root refcnt 2 limit 1000 delay 500.0ms
# Delete the rule
$ tc qdisc del dev eth0 root netem
This is an easy and effective way to start testing and observing the behavior of your application under latency!
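Once latency is injected, what matters is how the calling code reacts. Here is a minimal Python sketch, assuming a hypothetical helper named `call_with_timeout` (it does not come from any library) that bounds a slow call and returns a fallback value instead of hanging:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def call_with_timeout(func, timeout_s, fallback):
    """Run func() in a worker thread; return fallback if it takes too long."""
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # The dependency is too slow: degrade instead of blocking the caller.
        return fallback
    finally:
        # Do not wait for the slow call; its thread finishes in the background.
        executor.shutdown(wait=False)
```

In a real service you would more likely rely on the timeout options of your HTTP or database client; the point is only that every remote call should have an upper bound and a defined behavior when that bound is exceeded.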
Cut off your scheduled tasks
Without directly breaking your application, you can start by thinking about what would happen if your asynchronous jobs (sending emails, data synchronization, etc.) stopped working. These jobs are not directly visible to your users, and it sometimes doesn't matter if they run a little late, as long as they run.
Let's take the example of a data denormalization job that feeds a front end: when indexing new data, remember to suffix your index name with a timestamp so that it doesn't impact your current data. Also, your current data must be served from a dedicated index (or better: an alias pointing to an index), so that you can switch at the end of the denormalization job. This ensures that if the job fails, the live index is not affected and the "old" data is still displayed on your front end.
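The alias switch described above can be sketched as follows. This is an in-memory illustration only: two dictionaries stand in for the search engine, and the names `indices`, `aliases` and `rebuild_index` are hypothetical.

```python
import time

indices = {}  # index name -> list of documents (stand-in for a search engine)
aliases = {}  # alias name -> name of the index currently served


def rebuild_index(alias, load_documents):
    """Build a new timestamped index, then switch the alias only on success."""
    new_index = f"{alias}_{int(time.time() * 1000)}"
    try:
        indices[new_index] = list(load_documents())
    except Exception:
        # The job failed: drop the partial index and keep serving the old one.
        indices.pop(new_index, None)
        return aliases.get(alias)
    aliases[alias] = new_index  # single switch at the very end of the job
    return new_index
```

The front end only ever reads through the alias, so a crash in the middle of the job leaves the previously indexed data visible.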
Of course, if these are jobs that absolutely must run on time (granting access rights after an upstream order, for example), make sure that they execute correctly and that you have alerting and retries on them.
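As an illustration of retries combined with alerting, here is a minimal sketch. The function `run_with_retry` and its parameters are assumptions made for this example; `alert` stands for whatever notification channel you use.

```python
import time


def run_with_retry(job, attempts=3, delay_s=1.0, alert=print):
    """Run job(); retry on failure and raise an alert if every attempt fails."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return job()
        except Exception as error:  # broad on purpose: any job failure counts
            last_error = error
            if attempt < attempts:
                time.sleep(delay_s * attempt)  # wait a bit longer each time
    alert(f"job failed after {attempts} attempts: {last_error!r}")
    raise last_error
```

Stopping the job on a test environment then lets you check two things: that the retries actually happen, and that the alert reaches someone.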
Cut your event publisher/subscriber server
When your applications exchange data with a pub/sub server (publisher / subscriber), you should also expect that it may be unavailable.
Even when your server runs in cluster mode, you are unfortunately not protected from a crash.
You must therefore ensure that all event notifications that could not be sent to the pub/sub server are stored so that they can be sent again as soon as it is available.
It is indeed much better to be able to catch up later than to lose data that is important for your business.
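One possible way to keep unsent events is a small buffer in front of the publisher. The sketch below is an assumption-laden illustration: `BufferedPublisher` is a hypothetical class, `send` is whatever function pushes an event to your broker, and the buffer lives in memory, whereas a real implementation would persist it (database table, local file) so that it survives a restart.

```python
from collections import deque


class BufferedPublisher:
    """Keep events that could not be sent and replay them in order later."""

    def __init__(self, send):
        self._send = send       # callable that raises ConnectionError when down
        self.pending = deque()  # events waiting to be delivered

    def publish(self, event):
        # Queue behind any backlog so that ordering is preserved.
        self.pending.append(event)
        return self.flush()

    def flush(self):
        """Try to deliver pending events; return how many were sent."""
        sent = 0
        while self.pending:
            try:
                self._send(self.pending[0])
            except ConnectionError:
                break  # broker still unavailable, keep the rest for later
            self.pending.popleft()
            sent += 1
        return sent
```

Cutting the pub/sub server on a test environment and then bringing it back is a simple way to verify that nothing is lost and that the backlog is replayed.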
Cut your database
We reach a critical point here: in general, when a database becomes unavailable, many applications are affected because they can no longer read or write their data. In addition to advising you to set up a cluster for your databases, I would also suggest that you expect them to become unavailable (corrupted data, failed network connection, etc.).
The most important thing here is to try to reassure your users and show them something nice. If you have some information cached in the user's local storage, take advantage of it and display it, for lack of anything better.
Delete your data
Your database may still be available, but the problem could very well come from your data becoming corrupted or being erased as a result of a flaw in your application. In this case, you must make sure that you can detect this and quickly re-import a recent, stable backup.
In the same way, it is very important that you test your backups! They can be faulty at any time, and it would be a pity to be unable to restore a recent backup after data loss because you never checked that it works.
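What "testing a backup" means depends on your database. As a purely illustrative sketch, assume the backup is a SQLite file: the hypothetical function `backup_is_restorable` opens it, runs SQLite's built-in integrity check and verifies that an expected table is not empty.

```python
import sqlite3


def backup_is_restorable(backup_path, table, min_rows=1):
    """Return True if the backup opens, passes integrity_check and has data."""
    conn = sqlite3.connect(backup_path)
    try:
        status = conn.execute("PRAGMA integrity_check").fetchone()[0]
        # The table name is interpolated: only pass trusted, known names here.
        rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return status == "ok" and rows >= min_rows
    except sqlite3.DatabaseError:
        # Unreadable file, missing table, corrupted pages, and so on.
        return False
    finally:
        conn.close()
```

For other databases the equivalent is usually to restore the dump into a throwaway instance on a schedule and run a few sanity queries against it.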
Cut off your micro-services
Your micro-services are most certainly contacted in one of the following two ways: through an upstream API gateway (GraphQL?) over HTTP or gRPC, or they only respond to events (consumers / producers). In any case, you must expect outages on these applications and make sure, at a minimum, that they do not jeopardize your entire application. Thus, the part that depends on the micro-service in question could become unavailable (favorites management, for example) while the other features continue to work. Better still, in this case, you can tell your users that you are having trouble retrieving their favorites and take the opportunity to push them the latest available content, for lack of anything better.
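The favorites example can be sketched like this. The function `build_home_page` and the two fetch callables are hypothetical; they stand for calls to the favorites micro-service and to the service that returns the latest content.

```python
def build_home_page(fetch_favorites, fetch_latest_content):
    """Assemble the page; a favorites outage must not break everything else."""
    page = {
        "latest": fetch_latest_content(),
        "favorites": None,
        "notice": None,
    }
    try:
        page["favorites"] = fetch_favorites()
    except Exception:
        # Isolate the failure: keep the page and tell the user what is missing.
        page["notice"] = "We are having trouble retrieving your favorites."
    return page
```

The important design choice is that the failure of an optional feature is caught where that feature is called, not at the top of the request.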
If your infrastructure allows it, a possible solution would also be to serve a fallback cache to users, using an LRU (Least Recently Used) or LFU (Least Frequently Used) eviction strategy depending on the case.
Thus, the data would not necessarily be completely up to date, but the user would have at least some available content and, in most cases, would not notice anything.
Of course, the fallback cache can represent a large volume; that's why it's important to estimate the amount of data that could be stored in it and thus control what you cache.
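A bounded LRU fallback cache can be sketched in a few lines. The names `LRUFallbackCache` and `fetch_with_fallback` are hypothetical, and the cache is in memory; a shared cache server would play this role in a real deployment.

```python
from collections import OrderedDict


class LRUFallbackCache:
    """Keep at most max_entries values; evict the least recently used one."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)  # mark as most recently used
        while len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # drop the least recently used

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)
        return self._data[key]


def fetch_with_fallback(cache, key, fetch):
    """Serve fresh data when possible, otherwise the last known value."""
    try:
        value = fetch(key)
    except Exception:
        return cache.get(key)  # possibly stale, or None if never seen
    cache.put(key, value)
    return value
```

The `max_entries` bound is what keeps the volume under control: it forces you to decide how much stale content you are willing to store.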
Increase the complexity of chaos
Once you can handle most of the failures that can happen to your infrastructure, it is time to scale up and prepare to handle multiple failures in parallel. That's chaos.
Indeed, several micro-services can become unavailable simultaneously. If, for example, you had planned for your product recommendations to fall back on another micro-service that returns the latest products in your catalogue, and that one is down too, you must be able to find another solution.
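One way to prepare for several simultaneous outages is to chain fallbacks and always end with a static default. A minimal sketch, where `first_available` is a hypothetical helper and each provider is a callable wrapping one micro-service:

```python
def first_available(providers, default):
    """Try each provider in order; return the first result that succeeds."""
    for provider in providers:
        try:
            return provider()
        except Exception:
            continue  # this service is down too, try the next one
    # Everything failed: serve a static default rather than an error page.
    return default
```

For the recommendation example, the list would contain the recommendation service first, then the latest-products service, with something like an editorially chosen static list as the default.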
If your application is available in multiple geographical areas, make one area completely unavailable to ensure that your customers are redirected to the second one. You will then start playing with tools such as [Chaos Monkey](https://netflix.github.io/chaosmonkey/), initially developed for Netflix's needs on these topics.
You are ready to bring chaos to your production
Remember, until now we were on non-production environments. There will come a time when you need to test these different cases in your production environment.
A good practice here is to set up "game days": a full day devoted to putting chaos on your infrastructure, with your teams mobilized and ready to step in, either to test their fallback solution or to restore service in case of failure.
Even if something fails, it will only benefit your project because it will allow you to improve the resilience of your application as you go along, so don't be afraid to get there.
Conclusion
Chaos engineering is approached step by step because, in the life cycle of a project, it is usually one of the last steps, once the application is stable and in production.
It allows you to make your application resilient to failures, but also to prepare your teams to respond to these incidents, which can be really frustrating when they occur. The goal is both to resolve them quickly and to minimize the chances of them happening.