Exponential Backoff and Retry Patterns in Mobile

Retry Pattern

This article describes a simple, easy-to-understand pattern that can add a useful degree of resilience to a design against error scenarios, transient failures, temporary service unavailability, and even unavailable resources.


Although I will describe this design pattern from the point of view of mobile clients, it is a reasonably versatile pattern that applies to any component that needs to communicate with other components or services. For example, in Serverless or Microservices architectures, and specifically in the integration of cloud services, this pattern finds multiple uses. I will say more about this later.

Context

The Retry pattern is essentially a mechanism for reacting to a communication failure between two elements, where each element may be a software component or a service. A communication failure can have multiple causes, such as an overloaded destination service, loss of connectivity between the components, or a temporary network failure. The pattern is not limited to communication failures: for example, a result that requires multiple verification attempts could be implemented through this pattern, and it can also add a certain level of protection against attacks, for instance by spacing out repeated authentication attempts.


There are multiple strategies for implementing this pattern, directly related to the algorithm chosen to calculate the wait time before each retry. The basic algorithm can be expressed with the following equation:

t = interval * rate^n

where:

  • t: Waiting time between each attempt, that is, the delay.
  • n: A variable that increases by 1 with each attempt. Its initial value can be 0.
  • interval: A constant time window in seconds. It can be, for example, 1 second.
  • rate: The exponential rate constant. It is usually 2.

The previous algorithm can be improved to avoid scenarios where multiple components using it generate their retry requests at the same time and end up causing an adverse effect. Adding a random component (jitter) spreads the retries out:

t = interval * rate^n + random

  • random: A random value calculated on each retry, less than or equal to 1000 milliseconds.

And an even more sophisticated algorithm, recommended on GCP, truncates the delay so it never exceeds a maximum value:

t = min(interval * rate^n + random, max)

  • min: A function that returns the smaller of its two arguments.
  • max: A constant for the maximum wait time allowed between retries.
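
A minimal sketch of these calculations in plain Java (the class and method names are my own, not from any library, and the 32-second cap is an assumed value):

import java.util.concurrent.ThreadLocalRandom;

// Minimal sketch of the delay calculations described above (illustration only).
final class BackoffCalculator {

    private static final long INTERVAL_MS = 1000L;        // base time window: 1 second
    private static final double RATE = 2.0;               // exponential rate constant
    private static final long MAX_BACKOFF_MS = 32_000L;   // assumed cap for the truncated variant

    // Basic exponential backoff: t = interval * rate^n
    static long exponentialDelayMs(int attempt) {
        return (long) (INTERVAL_MS * Math.pow(RATE, attempt));
    }

    // Exponential backoff plus jitter: t = interval * rate^n + random
    static long exponentialDelayWithJitterMs(int attempt) {
        long jitterMs = ThreadLocalRandom.current().nextLong(0, 1001); // up to 1000 ms
        return exponentialDelayMs(attempt) + jitterMs;
    }

    // Truncated variant recommended on GCP: t = min(interval * rate^n + random, max)
    static long truncatedExponentialDelayMs(int attempt) {
        return Math.min(exponentialDelayWithJitterMs(attempt), MAX_BACKOFF_MS);
    }
}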

These algorithms determine the time to wait to retry the operation. Depending on the requirements and the algorithm used, you could have the following types of implementation of the pattern:

Immediate retry: In this case, the delay is zero, and the retry is performed immediately.

With constant time window: The retry is performed after the same delay every time.

With constant incremental time window: Each retry adds a fixed increment to the previous delay.

With exponential increment time window: Each retry increases the delay exponentially. This more sophisticated strategy is known as exponential backoff.

With exponential and random increment time window: It is similar to the exponential strategy, with the difference of a random component that allows a better distribution of retries (see the sketch after this list).
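
To make the differences concrete, here is a minimal sketch that prints the first few delays produced by each strategy, assuming a 1-second base interval, a rate of 2, and a 500 ms fixed increment for the incremental strategy:

import java.util.concurrent.ThreadLocalRandom;

// Prints sample delay sequences for each retry strategy (illustration only).
final class RetryStrategyDemo {

    public static void main(String[] args) {
        long intervalMs = 1000L;   // constant time window: 1 second
        long incrementMs = 500L;   // assumed fixed increment for the incremental strategy
        for (int attempt = 0; attempt < 5; attempt++) {
            long constant = intervalMs;
            long incremental = intervalMs + attempt * incrementMs;
            long exponential = (long) (intervalMs * Math.pow(2, attempt));
            long exponentialRandom = exponential + ThreadLocalRandom.current().nextLong(0, 1001);
            System.out.printf("attempt %d -> constant: %d ms, incremental: %d ms, "
                    + "exponential: %d ms, exponential+random: %d ms%n",
                    attempt, constant, incremental, exponential, exponentialRandom);
        }
    }
}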

The following image is intended to illustrate these strategies:


Best practices in implementation


To apply this design pattern correctly, the following good practices are recommended.

Apply the pattern on transient faults

As described in the context, this pattern is designed to handle communication failure scenarios, but not every communication failure is appropriate for it. It is recommended to use the pattern on transient communication failures, that is, temporary failures from which the system recovers after some time.
For example, some failures for which the pattern should not be used are:

  • The failure is caused by a bug in the component or service and prevents establishing communication.
  • An internal or fatal error is generated on the server with which you are trying to communicate.

How could you verify what type of communication failure you are dealing with? In elaborate implementations of this pattern, you could add verification of the response and determine whether or not to retry the communication.


Response codes such as NOT_FOUND, INTERNAL, INVALID_ARGUMENT could be used, as shown in the following table.

This example of a table is taken from "Developing Applications with Google Cloud Platform Specialization - Section: Replication, Query Types, Transactions, and Handling Errors" on Coursera.
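
A minimal sketch of this kind of check, using gRPC-style status names; the set of retryable codes below is an assumption and should be adjusted to the documentation of the API you are calling:

// Decides whether an error response is worth retrying (illustration only;
// the retryable set is an assumption, adjust it to your API's documentation).
final class RetryDecision {

    static boolean isRetryable(String statusCode) {
        switch (statusCode) {
            case "UNAVAILABLE":        // transient: the service may recover shortly
            case "DEADLINE_EXCEEDED":  // transient: the call timed out
            case "ABORTED":            // transient: a conflicting operation, usually safe to retry
                return true;
            case "NOT_FOUND":          // permanent: retrying will not create the resource
            case "INVALID_ARGUMENT":   // permanent: the request itself is wrong
            case "INTERNAL":           // usually permanent or limited to a single retry
            default:
                return false;
        }
    }
}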


Use conditions that avoid infinite loops

In each of the implementation strategies, there must be a mechanism that allows the retries to conclude; that is, there must be a fixed maximum number of attempts. It is also recommended to be careful not to create infinite cycles when integrating microservices in a serverless architecture style. The best way to prevent this is always to have a maximum retry limit.
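
A minimal sketch of such a guard, assuming a generic Callable operation and reusing the delay helper sketched earlier:

import java.util.concurrent.Callable;

// Retries an operation a bounded number of times, so it can never loop forever (illustration only).
final class BoundedRetry {

    private static final int MAX_ATTEMPTS = 5;

    static <T> T call(Callable<T> operation) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                last = e;
                // Wait before the next attempt, using the truncated exponential delay sketched above.
                Thread.sleep(BackoffCalculator.truncatedExponentialDelayMs(attempt));
            }
        }
        throw last; // give up after the maximum number of attempts
    }
}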

Use idempotent functions

When this pattern is applied to microservices in an event-driven style through FaaS (Function as a Service), it is advisable to preserve the idempotency property, that is, to ensure that executing the function several times with the same input produces the same result and side effects as executing it once, so retries do not duplicate work.
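
A minimal sketch of one common way to achieve this, deduplicating by an event identifier; the in-memory set is an assumption for illustration, and a real function would use durable storage:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Skips events that were already processed, so a retried delivery has no additional effect.
final class IdempotentHandler {

    private final Set<String> processedEventIds = ConcurrentHashMap.newKeySet();

    void handle(String eventId, Runnable sideEffect) {
        // add() returns false if the id was already present, meaning the event was handled before.
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery caused by a retry; do nothing
        }
        sideEffect.run();
    }
}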

Use monitoring tools

Retry pattern implementations could also rely on monitoring services that notify clients when a service has been restored. That information can help determine in real time whether the retry mechanism should be maintained or suspended, making the implementation even more sophisticated.

Combine with other design patterns

One design pattern that combines well with the Retry pattern is the Circuit Breaker pattern, which I introduced in a previous article. A suitable combination of the two can provide a highly resilient solution to failures or errors.
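
As a rough sketch of how the two patterns could cooperate; the CircuitBreaker interface here is hypothetical and stands in for whatever implementation you use:

import java.util.concurrent.Callable;

// Hypothetical circuit breaker contract; a real one would open and close itself based on failures.
interface CircuitBreaker {
    boolean allowsRequests();   // false while the circuit is open
    void recordSuccess();
    void recordFailure();
}

// Retries only while the circuit breaker allows requests, combining both patterns (illustration only).
final class RetryWithCircuitBreaker {

    static <T> T call(Callable<T> operation, CircuitBreaker breaker, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts && breaker.allowsRequests(); attempt++) {
            try {
                T result = operation.call();
                breaker.recordSuccess();
                return result;
            } catch (Exception e) {
                last = e;
                breaker.recordFailure();
                Thread.sleep(BackoffCalculator.truncatedExponentialDelayMs(attempt));
            }
        }
        throw last != null ? last : new IllegalStateException("circuit open, no attempt made");
    }
}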

Implementation

For both Android and iOS applications, the implementation of this pattern is shown with the help of the Rx extensions: RxJava for Android clients and RxSwift for iOS clients. The recommendations of The Clean Way to Use Rx manual are used as a reference for good practices, so that the implementation of the pattern stays elegant.

Implementation in Android



With exponential and random increment time window approach:

private void exponentialWithRandomRetry(final int maxiTimeRetry) {
   compositeDisposable.add(
        this.operationWithPossibleFailure()
            .retryWhen(errors -> errors
                // Count the attempts: emit 1 per error and accumulate the total.
                .map(throwable -> 1)
                .scan((attempt, next) -> attempt + next)
                // Pair each attempt number with its exponential (plus random) delay.
                .map(attempt -> new Pair<>(attempt, this.exponentialBackoff(attempt, true)))
                .doOnNext(pair -> this.printCurrentTime(pair.first, pair.second))
                .map(pair -> pair.second)
                // Wait for the calculated delay before resubscribing to the source.
                .flatMap(delayTime ->
                    Observable.just(delayTime)
                            .delay(delayTime, TimeUnit.SECONDS))
                // Stop after the maximum number of retries and surface an error.
                .take(maxiTimeRetry)
                .concatWith(Observable.error(new Throwable("unexpected error in service"))))
            .onErrorResumeNext(Observable::error)
            .subscribeOn(Schedulers.computation())
            .observeOn(AndroidSchedulers.mainThread())
            .subscribe(user -> logs(user, true),
                    throwable -> logs(throwable, false),
                    () -> Log.i(TAG, "completed")));
}

Function to calculate the delay time used in the retry pattern:

private long exponentialBackoff(final int attempt, final boolean withRandom) {
    final long min = -1000L;
    final long max = 1000L;
    if (withRandom) {
        // Random jitter between -1000 and 1000 ms, converted to a fraction of a second.
        long random_number = min + (long) (Math.random() * (max - min));
        double random_number_milliseconds = random_number * 0.001;
        // Delay in seconds: 2^attempt plus the random fraction.
        return (long) (Math.pow(2, attempt) + random_number_milliseconds);
    } else {
        return (long) Math.pow(2, attempt);
    }
}

In my repository on Github, you can find the full implementation.


Implementation in iOS




With exponential and random increment time window approach:


private func exponentialWithRandomRetry(_ maxiTimeRetry: Int) {
        operationWithPossibleFailure()
            .retry { errors in errors
                // Count the attempts: emit 1 per error and accumulate the total.
                .map { _ in 1 }
                .scan(0) { attempt, next in attempt + next }
                // Pair each attempt number with its exponential (plus random) delay.
                .map { attempt in [attempt, self.exponentialBackoff(attempt, true)] }
                .do(onNext: { pair in self.printCurrentTime(pair[0], pair[1]) })
                .map { pair in pair[1] }
                // Wait for the calculated delay before resubscribing to the source.
                .flatMap { delayTime in Observable.just(delayTime)
                    .delay(.seconds(delayTime), scheduler: self.serialScheduler) }
                // Stop after the maximum number of retries and surface an error.
                .take(maxiTimeRetry)
                .concat(Observable.error(SampleError())) }
            .catch { error in Observable.error(error) }
            .subscribe(on: serialScheduler)
            .observe(on: MainScheduler.instance)
            .subscribe(onNext: { user in self.logs(user, true) },
                       onError: { error in self.logs(error, false) },
                       onCompleted: { print("completed") })
            .disposed(by: disposeBag)
    }

Function to calculate the delay time used in the retry pattern:

private func exponentialBackoff(_ attempt: Int, _ withRandom: Bool) -> Int {
        let min = -1000
        let max = 1000
        if (withRandom) {
            // Random jitter between -1000 and 1000 ms, rounded to a fraction of a second.
            let random_number = Int.random(in: min..<max)
            let random_number_milliseconds = round(Double(random_number) * 0.001)
            // Delay in seconds: 2^attempt plus the random fraction.
            return Int((pow(Double(2), Double(attempt))) + random_number_milliseconds)
        } else {
            return Int(pow(Double(2), Double(attempt)))
        }
    }

In my repository on Github, you can find the full implementation.


Conclusion

Design patterns such as exponential backoff and circuit breaker have been useful tools for a long time and are now heavily involved in microservices and serverless architectures. Designs based on cloud services such as Cloud Storage, Cloud IoT, and Cloud Functions in GCP, as well as equivalent AWS services, benefit from applying these patterns.
This article has taken a solution-oriented focus on the client layer, specifically on mobile clients. The application of this pattern is not limited to integrating microservices or external persistence; it extends to local tasks such as checking status or the responses of internal libraries, including communication with internal components.
To mention some mobile use cases that come to mind:

  • Active or inactive connection events.
  • GPS data reader.
  • Barcode scanning with ML Kit, which sometimes requires several attempts to read a code.
  • Queries to local persistence mechanisms.
