Why Resiliency Matters
Resilience is the ability of a distributed system to continue operating despite
the occurrence of a partition. A partition is a situation where some components of the system cannot
communicate with each other because of a network or infrastructure fault. In terms of the CAP theorem, resilience is what we
refer to as partition tolerance.
Imagine that you and your friends are working on a group project.
You have to share some files and data with each other over the internet. However, one day, there is a
problem with the internet connection, and some of you cannot access the shared files or data. This means
that there is a partition between you and your friends. You are now in separate subnets, or groups, and can't
communicate with each other. This partition can cause the group leader to submit the wrong version of the project
to the lecturer, or worse, lose your work altogether.
Most of today’s software applications are distributed. The work of the entire system is split among several services:
at a minimum, one that handles UI logic and another that facilitates data exchange between the database and the UI.
If any of these services fails or becomes unavailable, the whole system can be affected.
Therefore, it is essential to design and develop resilient systems that can withstand failures and recover quickly.
Below are several things you can look at and implement to keep your systems functioning correctly and consistently.
Timeouts
Timeouts play a crucial role in building resilient systems, especially when dealing with communication
between services. Integrating timeouts into network calls serves as a trigger for failover mechanisms
and allows for graceful degradation when a service's latency exceeds acceptable levels.
Here's an example to illustrate this concept:
Scenario: Communication between Services
Suppose you have a microservices architecture where Service A communicates with Service B to retrieve data.
To make this communication resilient to potential issues and prevent prolonged delays, timeouts are introduced.
1. Normal Scenario
- Service A sends a request to Service B.
- Service A waits for a response within a predefined timeout period (e.g., 2 seconds).
2. Successful Response
- If Service B responds within the timeout, the operation proceeds as usual, and Service A continues its processing.
3. Timeout Exceeded
- If the timeout is exceeded, Service A treats this as a potential issue. Instead of waiting indefinitely, it triggers a failover mechanism. Depending on the nature of the task, you may want to implement a retry mechanism before you turn to fallbacks.
4. Failover Mechanism
- Service A initiates a fallback or alternative action, such as using a cached response, employing a backup service, or providing a default response to the user. On a different note, API calls are resource intensive, so you may want to consider caching data to files in a root folder or to Redis (for advanced scenarios). This not only provides a seamless experience but also saves your users' data credits. More on this in an upcoming blog.
5. Graceful Degradation
- By setting reasonable timeouts, you ensure that even if Service B is available but experiencing higher latency than your system can tolerate, Service A gracefully degrades its performance by switching to alternative strategies without affecting the entire system's responsiveness.
In Python-style pseudocode, where call_service_b, process, and use_cached_result_or_alternative_service stand in for your own functions:

try:
    # Call Service B with a 2-second timeout
    response = call_service_b(timeout=2)
    # Process the response
    process(response)
except TimeoutError:
    # Timeout exceeded, so we initiate a fallback
    fallback_response = use_cached_result_or_alternative_service()
By incorporating timeouts, your system becomes more resilient to variations in service responsiveness,
ensuring that even during periods of high latency or unavailability, it can gracefully handle the situation
and maintain a responsive and reliable user experience.
But what happens when your service decides to take a pause?
Fallbacks
To achieve true resilience, you cannot assume that external services will always be available. Implement a fallback and ensure
that even during transient errors or unavailability, your app can seamlessly switch to an alternative service or approach
to achieve the same outcome. This aligns well with the circuit breaker pattern: when the calling service detects that the
primary is unavailable, for example through a timeout, it switches to a fallback mode. The fallback can be an alternative
service, a cached response stored in a JSON file in the root folder, or just a default value. For example, depending on how
closely the information needs to reflect the actual values, if the external service provides weather data you can use a
cached response from the last successful call, or a generic message like “Weather information is currently unavailable”. This way,
you can still provide some service to your users, even if it is degraded, and avoid cascading failures in your system.
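Here is a minimal sketch of that fallback chain in Python; the weather endpoint, the use of the requests library, and the cache file name are assumptions for illustration, not a prescription:

import json
import requests

CACHE_FILE = "weather_cache.json"  # hypothetical cache location in the project root
DEFAULT_MESSAGE = {"status": "Weather information is currently unavailable"}

def get_weather(city: str) -> dict:
    """Fetch weather data, falling back to a cached copy or a default message."""
    try:
        # Hypothetical external weather endpoint; the 2-second timeout keeps us from hanging
        response = requests.get(
            "https://api.example.com/weather", params={"city": city}, timeout=2
        )
        response.raise_for_status()
        data = response.json()
        # Cache the last successful result so future failures have something to fall back on
        with open(CACHE_FILE, "w") as f:
            json.dump(data, f)
        return data
    except (requests.Timeout, requests.ConnectionError, requests.HTTPError):
        # Primary is unavailable: try the cached response, then a default value
        try:
            with open(CACHE_FILE) as f:
                return json.load(f)
        except (FileNotFoundError, json.JSONDecodeError):
            return DEFAULT_MESSAGE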
Graceful Degradation
Think about Netflix. If the network is congested or unstable, the video streaming service can lower the bitrate to reduce
the buffering time and avoid interruptions. This way, the service can still deliver the video content, even if the quality
is reduced. It's an example of how graceful degradation can improve your users' experience and the reliability
of your system by handling stress gracefully and providing the best possible service under any circumstances.
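To make the idea concrete, here is a toy sketch of adaptive bitrate selection; the bitrate ladder and safety factor are made-up values, and real players use far more sophisticated logic:

# Lowest to highest quality, in kilobits per second (illustrative values only)
BITRATE_LADDER_KBPS = [400, 1200, 3000, 6000]

def pick_bitrate(measured_bandwidth_kbps: float, safety_factor: float = 0.8) -> int:
    """Pick the highest bitrate that fits comfortably within the measured bandwidth."""
    budget = measured_bandwidth_kbps * safety_factor
    affordable = [b for b in BITRATE_LADDER_KBPS if b <= budget]
    # Degrade gracefully: if even the lowest rung doesn't fit, serve it anyway
    # rather than failing outright.
    return affordable[-1] if affordable else BITRATE_LADDER_KBPS[0]

print(pick_bitrate(5000))  # 3000: the network is congested, quality drops but playback continues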
Use a Distributed Database
I personally recommend Apache Cassandra. Its query language, CQL, is not far from SQL, but Cassandra is a fully
distributed database built from the ground up to be highly resilient. Cassandra uses a query-first design approach, which means
that tables are created based on the queries the application needs to perform, rather than the structure of the data.
This allows for denormalization of data, which improves the availability and performance of the system. It has no
single point of failure and does not suffer from bottlenecks, as every node in the cluster can perform any operation.
It achieves this by replicating data across multiple nodes and data centers, ensuring that the data is always available
and durable. The replication also allows for load balancing and data locality, improving the performance and efficiency of the system.
You also have the flexibility to choose the level of consistency you want for your reads and writes, which helps you balance
latency against accuracy and handle network partitions gracefully. Imagine you have a cluster of Cassandra nodes distributed
across different regions, and a network partition prevents some nodes in one region from
communicating with the rest of the cluster. How would Cassandra handle this situation? It would still be able to serve
requests from the nodes that are available, even if they are not the most up to date. This means that your web application would
not experience any downtime or errors, and your users would still be able to access your service. However, you might see some
data inconsistency or stale reads, as some nodes might have different versions of the data than others, but consistency is a
topic for another day. Let me know in the comments if I should write an article about it.
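To make the tunable consistency part concrete, here is a minimal sketch using the DataStax Python driver; the contact points, keyspace, and table are assumptions for illustration:

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Hypothetical contact points and keyspace; adjust to your own cluster.
cluster = Cluster(["10.0.0.1", "10.0.1.1"])
session = cluster.connect("shop")

# Writes demand stronger agreement: a quorum of replicas in the local data center.
insert = SimpleStatement(
    "INSERT INTO orders (id, status) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.LOCAL_QUORUM,
)
session.execute(insert, (42, "paid"))

# Reads favour availability and low latency: a single replica is enough,
# at the cost of possibly reading a slightly stale value during a partition.
select = SimpleStatement(
    "SELECT status FROM orders WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(select, (42,)).one()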
Message Queues
When users upload files for complex processing, involving steps like content extraction, casting, cleaning, analysis,
persistence, and display, employing an asynchronous, queue-based system becomes paramount for both resilience and a
seamless user experience. As soon as a user uploads a file, instead of processing each step synchronously and risking
potential delays or failures, a message queue is employed to orchestrate the workflow. The uploaded file triggers a
message on the queue, each step represented by a separate process or worker. For instance, the content extraction process
can be handled by one worker, casting by another, and so forth. If a particular service is momentarily unavailable or
experiences a slowdown, the other processes can continue to operate independently, ensuring that the overall workflow
is not entirely disrupted. This asynchronous, queue-driven architecture introduces a resilient system where each processing
step is decoupled, and potential issues in one step do not halt the entire process. The user, meanwhile, experiences a
smooth interaction, as the upload operation is not hindered by the intricacies of processing each step in real-time.
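A rough sketch of this pattern with RabbitMQ via the pika library; the queue name, message format, and extract_content step are assumptions:

import json
import pika

# Producer: runs in the upload handler, so the user's request returns immediately.
def enqueue_upload(file_path: str) -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="uploads", durable=True)  # survive broker restarts
    channel.basic_publish(
        exchange="",
        routing_key="uploads",
        body=json.dumps({"file_path": file_path, "step": "extract"}),
    )
    connection.close()

# Worker: one of several independent processes, each handling a single step.
def run_extraction_worker() -> None:
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="uploads", durable=True)

    def handle(ch, method, properties, body):
        task = json.loads(body)
        extract_content(task["file_path"])  # hypothetical processing step
        ch.basic_ack(delivery_tag=method.delivery_tag)  # only acknowledge once the work is done

    channel.basic_consume(queue="uploads", on_message_callback=handle)
    channel.start_consuming()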
You could achieve all of this asynchronously by spawning threads within a single process, but you will eventually run
into its limits. You don't want this data processing to start affecting other services as well once it begins to fail.
This is why you need the Bulkhead architectural pattern.
Bulkhead Pattern
The Bulkhead pattern, when applied to a system, involves isolating critical services to prevent the failure of one from affecting
others, thus containing and minimizing the impact of failures. This pattern is particularly crucial for
services handling sensitive operations like payment processing. You usually find systems that have one big
API project encapsulating tens of unrelated service endpoints. By treating each critical API endpoint as its
own independent service, you create isolated compartments or "bulkheads" for functionalities. For instance,
payment processing, data processing, and other critical functions are treated as separate, independently
scalable services. If one service encounters issues or fails, the impact is confined to that specific area,
and other parts of the system remain unaffected. This approach not only enhances fault tolerance but also
allows for targeted optimizations and scalability adjustments based on the unique requirements of each critical service.
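Even within a single codebase you can apply the same idea with isolated resource pools. A minimal in-process sketch, where charge_via_provider and generate_report are hypothetical functions:

from concurrent.futures import ThreadPoolExecutor

# One small, dedicated pool per critical dependency, so a slow or failing
# payment provider can exhaust its own pool without starving everything else.
PAYMENT_POOL = ThreadPoolExecutor(max_workers=4, thread_name_prefix="payments")
REPORTING_POOL = ThreadPoolExecutor(max_workers=2, thread_name_prefix="reports")

def charge_customer(order_id: str):
    # charge_via_provider is a hypothetical blocking call to the payment gateway
    return PAYMENT_POOL.submit(charge_via_provider, order_id)

def build_report(report_id: str):
    # generate_report is a hypothetical slow, non-critical operation
    return REPORTING_POOL.submit(generate_report, report_id)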
Make your Infrastructure Immutable
Instead of modifying existing components, you replace them entirely with new instances. This ensures consistency
and predictability in your deployments. Infrastructure Immutability contributes to resilience by simplifying
rollback and recovery processes, minimizing security vulnerabilities, and facilitating efficient scaling. For example,
consider a web application deployed using immutable infrastructure principles. When an update is needed, rather than
applying changes to existing servers, a new instance is created with the updated configuration, and the traffic is
gradually shifted to the new instance(s). If an issue arises, the system can quickly revert to the previous version
by redirecting traffic to the untouched, stable instance. Downtime is minimized and operations become reliable.
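A highly simplified sketch of that rollout flow; every function here is a hypothetical placeholder for your own provisioning and load balancer tooling:

def deploy_new_version(image_tag: str) -> None:
    """Blue-green style rollout: provision, verify, shift traffic, keep the old version for rollback."""
    # All of these calls are hypothetical wrappers around your infrastructure tooling.
    new_instances = provision_instances(image=image_tag)    # fresh instances, never patched in place
    if not health_check(new_instances):
        terminate(new_instances)                             # nothing was touched, so nothing to roll back
        raise RuntimeError("New version failed health checks")
    shift_traffic(to=new_instances, percent=10)              # gradual cut-over
    if error_rate(new_instances) > 0.01:
        shift_traffic(to=previous_instances(), percent=100)  # instant rollback to the untouched version
    else:
        shift_traffic(to=new_instances, percent=100)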
Capacity
Capacity considerations are critical for maintaining a resilient system, particularly when dealing with components like queues.
When the volume of messages exceeds the consumers' consumption capacity, a backlog builds up in the queue, and since the queue
itself also has limits, this eventually poses challenges to producers as well.
The solution lies in scaling out, adding more consumers and adopting the competing consumers pattern. By introducing additional consumer
processes, you enhance message concurrency and increase throughput. However, it's crucial to recognize that scaling out might shift the
bottleneck to other downstream services like a database with differing capacities. To address this, limiting
concurrent processing based on the external system's capacity becomes essential. For example, when interacting with a database,
indiscriminate scaling may overload it. Employing asynchronous processing is beneficial, but understanding and managing capacity at
various points in the system, including external dependencies, is key. Implementing load leveling techniques, prioritized queues, and
considering factors like API handling capacity and database call limits are integral to preventing overload and ensuring a resilient
architecture that adapts to diverse workloads.
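One way to keep a scaled-out consumer fleet from overwhelming a downstream database is to cap concurrency explicitly. A minimal asyncio sketch, where write_to_database and the limit of 20 are assumptions:

import asyncio

# Cap concurrent database calls to what the database can actually absorb,
# even if you scale out to many more queue consumers. The limit of 20 is
# illustrative; size it from your own measurements.
DB_CONCURRENCY_LIMIT = asyncio.Semaphore(20)

async def persist(record: dict) -> None:
    async with DB_CONCURRENCY_LIMIT:
        await write_to_database(record)  # hypothetical async database call

async def handle_batch(records: list[dict]) -> None:
    # Many messages may arrive at once, but at most 20 writes run concurrently.
    await asyncio.gather(*(persist(r) for r in records))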
It’s not just queues; it’s any part of your system. How many requests can your HTTP APIs handle, and how many database calls? All of these
vary with the workload being performed and at what volume. Not all work is created equal.
So, how do you know what your capacity is? I’m glad you asked.
Metrics and Telemetry
In a resilient architecture, having solid metrics and telemetry means understanding your system's normal behavior and setting up
alarms to detect deviations before they become problems. When dealing with message queues, monitor the inflow and outflow rates –
the messages produced and consumed. Keep an eye on queue depth over time to catch any potential backlogs. If your system relies on
HTTP calls to third-party services, establish metrics for the usual duration of these calls. Set alarms to trigger when these durations
exceed your defined thresholds. Being proactive with metrics allows you to adjust your system before disruptions occur,
ensuring a smoother and more reliable operation.
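As a small illustration of that kind of instrumentation; the threshold value and the logging-based alarm stand in for whatever metrics backend you actually use:

import logging
import time
import requests

logger = logging.getLogger("telemetry")
CALL_DURATION_THRESHOLD_S = 2.0  # assumed target for third-party calls; tune per dependency

def call_third_party(url: str) -> requests.Response:
    start = time.monotonic()
    response = requests.get(url, timeout=5)
    duration = time.monotonic() - start
    # Record every call's duration; in practice this would go to your metrics
    # backend (Prometheus, CloudWatch, etc.) rather than a log line.
    logger.info("third_party_call_duration_seconds=%.3f url=%s", duration, url)
    if duration > CALL_DURATION_THRESHOLD_S:
        # In a real setup this would fire an alarm or increment an alert counter.
        logger.warning("Third-party call exceeded threshold: %.3fs > %.1fs",
                       duration, CALL_DURATION_THRESHOLD_S)
    return response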
I hope you enjoyed these tips and learned something you can start looking at to make your system more resilient. I would love to hear
your feedback and opinions on this topic. Please leave a comment below and let me know what you think.