September 2nd, 2021

Introducing xbcloud: Exponential Backoff Feature in Percona XtraBackup

https://www.percona.com/blog/introducing-xbcloud-exponential-backoff-feature-in-percona-xtrabackup/

Storing your data locally can pose security and availability risks. Major cloud providers offer object storage services that let you upload and distribute data across different regions using various retention and restore policies.

Percona XtraBackup ships with the xbcloud binary, an auxiliary tool that lets users upload backups directly to different cloud providers.

Today we are glad to announce the introduction of the Exponential Backoff feature to xbcloud.

In short, this new feature makes backup uploads and downloads more resilient to unstable network connections. Each failed chunk is retried, with an exponentially increasing wait time between retries, which increases the chances of completion despite an unstable connection or network glitch.

This new functionality is available in today’s release of Percona XtraBackup 8.0.26 and will also be available in Percona XtraBackup 2.4.24.

How it Works – in General

Whenever a chunk upload or download fails, xbcloud checks the reason for the failure. It can be a CURL error, an HTTP error, or a client-specific error. If the error is listed as retriable (more about that later in this post), xbcloud backs off (sleeps) for a certain amount of time before trying again. By default, it retries the same chunk up to 10 times before aborting the whole process. The number of retries can be configured via the --max-retries parameter.
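The control flow described above can be sketched in a few lines of Python. This is only an illustration, not xbcloud's actual code: ChunkError, transfer_chunk_with_retries, and the wait_seconds callback are hypothetical names introduced for the example, and the wait is a constant placeholder because the real wait time is computed by the backoff algorithm.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ChunkError(Exception):
    """Hypothetical error type: a failed chunk transfer plus a retriable flag."""

    def __init__(self, message: str, retriable: bool) -> None:
        super().__init__(message)
        self.retriable = retriable


def transfer_chunk_with_retries(
    transfer: Callable[[], T],
    max_retries: int = 10,  # mirrors the default of --max-retries
    wait_seconds: Callable[[int], float] = lambda retry: 1.0,  # placeholder wait
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run transfer(), retrying retriable failures up to max_retries times."""
    retries = 0
    while True:
        try:
            return transfer()
        except ChunkError as err:
            # Abort on non-retriable errors or once the retry budget is spent.
            if not err.retriable or retries >= max_retries:
                raise
            retries += 1
            sleep(wait_seconds(retries))
```

Passing sleep and wait_seconds as parameters keeps the sketch easy to test without actually waiting.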

How it Works – Backoff Algorithm

Network glitches and instabilities usually last for a short period of time. To make the xbcloud tool more reliable and increase the chances of a backup upload/download completing during those instabilities, we pause for a certain period of time before retrying the same chunk. The algorithm chosen is known as exponential backoff.

In the case of a retry, we calculate a power of two, using the number of retries already made for that specific chunk as the exponent. Since xbcloud issues multiple asynchronous requests in parallel, we add a random number of milliseconds between 1 and 1000 to the wait time of each chunk. This prevents all asynchronous requests from backing off for the same amount of time and then retrying all at once, which could cause network congestion.

The backoff time keeps increasing as the same chunk keeps failing to upload/download. Taking the default --max-retries of 10 as an example, the last backoff would be 2^10 = 1024 seconds, or around 17 minutes.

To limit this, we have implemented the --max-backoff parameter. This parameter defines the maximum time, in milliseconds, that the program can sleep between chunk retries. It defaults to 300000 (5 minutes).
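A minimal sketch of this calculation in Python follows. It is not taken from the xbcloud source: the function name backoff_ms is hypothetical, and the sketch assumes the power of two is expressed in seconds and that the retry counter starts at 1, which is what the 17-minute figure and the example log later in this post suggest.

```python
import random


def backoff_ms(retry: int, max_backoff_ms: int = 300_000, rng=random) -> int:
    """Return the sleep time in milliseconds before retry number `retry`.

    Assumption: the exponential part is 2**retry seconds.
    """
    base_ms = (2 ** retry) * 1000      # exponential part, converted to ms
    jitter_ms = rng.randint(1, 1000)   # random 1..1000 ms, de-synchronizes parallel requests
    return min(base_ms + jitter_ms, max_backoff_ms)  # cap, as --max-backoff does


if __name__ == "__main__":
    for retry in range(1, 11):
        print(retry, backoff_ms(retry))
```

With the default cap of 300000 ms, every retry from the eighth onward (2^8 = 256 seconds is still below the cap, 2^9 = 512 seconds is not) sleeps for at most five minutes.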

How it Works – Retriable Errors

There is a set of errors for which we know the operation should be retried. For CURL, we retry on:

CURLE_GOT_NOTHING
CURLE_OPERATION_TIMEDOUT
CURLE_RECV_ERROR
CURLE_SEND_ERROR
CURLE_SEND_FAIL_REWIND
CURLE_PARTIAL_FILE
CURLE_SSL_CONNECT_ERROR

For HTTP, we retry the operation in case of the following status codes:

503
500
504
408

Each cloud provider might return a different CURL or HTTP error depending on the issue. So that users do not have to wait for us to release a new version of xbcloud, we created a mechanism that lets them extend this list themselves.

New errors can be added by setting --curl-retriable-errors for CURL errors and --http-retriable-errors for HTTP status codes.

On top of that, we have enhanced the error handling when using --verbose output so that it states which error caused xbcloud to fail and which parameter a user has to add to retry on that error. Here is one example:

210701 14:34:23 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210701 14:34:23 /work/pxb/ins/8.0/bin/xbcloud: Curl error (52) Server returned nothing (no headers, no data) is not configured as retriable. You can allow it by adding --curl-retriable-errors=52 parameter

Both options accept a comma-separated list of error codes.
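To illustrate how a default set combined with a user-supplied comma-separated list might behave, here is a small Python sketch. The helper names are hypothetical and this is not how xbcloud is implemented. The numeric CURL values are the ones libcurl assigns to the error names listed above (for example, 52 is CURLE_GOT_NOTHING), and 6 and 7 in the usage lines are CURLE_COULDNT_RESOLVE_HOST and CURLE_COULDNT_CONNECT, chosen only as examples of codes outside the default list.

```python
from typing import Iterable, Set

# Default retriable codes from the lists above, as libcurl numeric values:
# 18 PARTIAL_FILE, 28 OPERATION_TIMEDOUT, 35 SSL_CONNECT_ERROR, 52 GOT_NOTHING,
# 55 SEND_ERROR, 56 RECV_ERROR, 65 SEND_FAIL_REWIND
DEFAULT_CURL_RETRIABLE = {18, 28, 35, 52, 55, 56, 65}
DEFAULT_HTTP_RETRIABLE = {408, 500, 503, 504}


def parse_code_list(value: str) -> Set[int]:
    """Parse a comma-separated list such as '6,7' into a set of integers."""
    return {int(part) for part in value.split(",") if part.strip()}


def build_retriable(defaults: Iterable[int], extra: str = "") -> Set[int]:
    """Extend the default set with user-supplied codes (hypothetical helper)."""
    return set(defaults) | parse_code_list(extra)


# Usage: what passing --curl-retriable-errors=6,7 would amount to in this sketch.
curl_codes = build_retriable(DEFAULT_CURL_RETRIABLE, "6,7")
http_codes = build_retriable(DEFAULT_HTTP_RETRIABLE)
print(7 in curl_codes, 502 in http_codes)
```

In this sketch the user-supplied codes extend the defaults rather than replace them, which matches the description of the options as a way to add new errors.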

Example

Below is one example of xbcloud exponential backoff in practice, used with --max-retries=5 --max-backoff=10000:

210702 10:07:05 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210702 10:07:05 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 2384 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [1]

. . .

210702 10:07:23 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Server returned nothing (no headers, no data)
210702 10:07:23 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 4387 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [2]

. . .

210702 10:07:52 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Failed sending data to the peer
210702 10:07:52 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 8691 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [3]

. . .

210702 10:08:47 /work/pxb/ins/8.0/bin/xbcloud: Operation failed. Error: Failed sending data to the peer
210702 10:08:47 /work/pxb/ins/8.0/bin/xbcloud: Sleeping for 10000 ms before retrying backup3/xtrabackup_logfile.00000000000000000006 [4]

. . .

210702 10:10:12 /work/pxb/ins/8.0/bin/xbcloud: successfully uploaded chunk: backup3/xtrabackup_logfile.00000000000000000006, size: 8388660

Let’s analyze the log snippet above:

  1. Chunk xtrabackup_logfile.00000000000000000006 failed to upload for the first time (as seen in the [1] above), and xbcloud slept for 2384 milliseconds.
  2. Then the same chunk failed for the second time (as seen by the number within []), and the sleep time roughly doubled, to 4387 milliseconds.
  3. When the chunk failed for the third time, the sleep time continued to increase exponentially, to around 8 seconds.
  4. On the fourth failure, the sleep time would normally have increased to around 16 seconds; however, we used --max-backoff=10000, which sets the maximum sleep time between retries, so the program waited 10 seconds before trying the same chunk again.
  5. In the end, the chunk xtrabackup_logfile.00000000000000000006 was uploaded successfully.
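The sleep times in the log can also be checked with a little arithmetic. The Python sketch below splits each logged value into an exponential part and a random part. It assumes the exponential part is 2 to the power of the retry number, in seconds, which is inferred from the log rather than confirmed from the xbcloud source, and decompose is a hypothetical helper name.

```python
MAX_BACKOFF_MS = 10_000  # --max-backoff=10000 from the example run

# Retry number -> sleep time in ms, copied from the log above.
LOGGED_SLEEPS_MS = {1: 2384, 2: 4387, 3: 8691, 4: 10000}


def decompose(retry, slept_ms, max_backoff_ms=MAX_BACKOFF_MS):
    """Return (retry, exponential part in ms, random part in ms or None, capped?)."""
    base_ms = (2 ** retry) * 1000
    capped = base_ms >= max_backoff_ms
    # Once the cap applies, the random part can no longer be recovered from the log.
    jitter_ms = None if capped else slept_ms - base_ms
    return (retry, base_ms, jitter_ms, capped)


rows = [decompose(retry, ms) for retry, ms in sorted(LOGGED_SLEEPS_MS.items())]
for row in rows:
    print(row)
# (1, 2000, 384, False)
# (2, 4000, 387, False)
# (3, 8000, 691, False)
# (4, 16000, None, True)
```

Each random part falls between 1 and 1000 milliseconds, as described in the backoff algorithm section, and the fourth retry is the first one where the cap takes effect.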

Summary

Best practices recommend distributing your backups to different locations, and cloud providers have dedicated services for this purpose. Together, xbcloud and Percona XtraBackup give you the tools to meet this requirement for MySQL backups. On the other hand, we know that network connectivity can be unstable at the worst times. With the new version of xbcloud, those instabilities are less likely to stop your backups from completing, as it is more resilient to them and offers a variety of options to tune the network transfer.
