Skip to main content

Bin Convergence

Aerospike version 5.4 introduced the bin convergence feature which will help with mesh/active-active topologies. Get an overview of it in the architecture guide.

Configuring this feature is a bit more involved than usual and some important things need to be observed to make sure bin convergence works fine. The configurations are spread in multiple sections of the configuration file. Here is the summary of configs that need to be used.

Config NameSectionDescription
conflict-resolve-writesnamespaceThis should be set to true for the receiving namespace to store the bin-level LUTs to resolve conflicts. Not allowed when single-bin is true.
src-idxdrA non-zero unique ID should be selected for each DC and the same ID should be given to all the nodes in the DC.
ship-bin-lutsxdr->dc->namespaceThis should be set to true for the sender to send the bin-level LUTs when shipping the record.
bin-policyxdr->dc->namespaceBin shipping should be enabled by setting this value to one of: only-changed, changed-and-specified, changed-or-specified.

A sample configuration will look as below:

namespace someNameSpaceName {
...
conflict-resolve-writes true
}

xdr {
src-id 1

dc dataCenter1 {
node-address-port someIpAdress1 somePort1
namespace someNameSpaceName {
bin-policy only-changed
ship-bin-luts true
}
}
}

Refer to the config example for an example in a 3-DC mesh topology.

Overhead

1-byte per bin for src-id in addition to the 6-byte overhead based on the bin-policy.

Restriction on client writes

Record replace not allowed

When bin-convergence is enabled (conflict-resolve-writes is set for the namespace), the client cannot do writes with record replace policy. The record replace policy will simply overwrite whatever is already present (without even reading the old version of the record on the disk). If some of the old bins are not part of new the record, we will lose last-update-time (LUT) of those bins. Without the LUT of the deleted bins during the replace, we will not be able to provide proper bin-convergence.

To overcome this restriction, the client should use the operate API in the clients instead of put. The operate API provides a way to perform multiple operations in a single call. The application can delete all the bins first (without needing to know the bin names) and then write the desired new bins. The delete all operation will convert all the existing bins to bin-tombstones with LUTs which will help with bin-convergence.

Durable Deletes

Starting Aerospike version 5.5, bin convergence is supported even with record deletes as long as they are durable deletes. When bin-convergence is enabled for a namespace and a record is durably deleted, the delete will convert the record to a tombstone but also maintains bin-tombstones along with the necessary LUTs and src-id. We call them bin-cemeteries and they are available as a metric named xdr_bin_cemeteries. Any future updates to the record coming from the local client or a remote XDR write will be evaluated against the LUT & src-id of the bin-tombstones. The update will succeed if it comes with a future LUT.

Internally, these durable deletes are converted to writes which delete all the bins. So, this will have an impact on the statistics. These operations will not be counted under client_delete_success but will be counted under client_write_success. Similarly, these deletes will show up as writes in the write latency histogram.

As mentioned above, a durable delete will leave a bunch of bin-tombstones with LUTs and src-id. So, these records will occupy more space than regular record-tombstones which do not have any bins. They should be taken into account while doing capacity planning.

These tombstones will be removed from the system by the regular tomb-raider. If you change the default behavior of the tomb-raider which will delete tombstones beyond 1-day, you should not set the value for tomb-raider-eligible-age too low. You should set this to a value which is higher than the potential conflict window + lag.

Dependencies

System Clock

As described above and in the architecture doc, this feature relies heavily on the timestamps. The timestamps are based on the system clock. For this feature to give the desired results the system clocks should be setup properly. We expect that the system clocks are setup with the Network Time Protocol (NTP).

Tie-break

The bin-level LUT has a millisecond resolution. This should give a good level of resolution in choosing a winner. But there can be a tie at the millisecond level. The src-id is used to break the tie in that case. When there is a tie, the highest src-id wins. That is why it is important to make sure that the src-id is unique across all the clusters connected via XDR.

Client write behavior

In spite of the best efforts, there may be cases where the system clock may fall behind or go ahead compared to the other DCs. If a node's system clock falls behind, the client writes may fail with error code 28 (LOST CONFLICT). This may happen because the incoming XDR writes would have written the bins with a higher timestamp. A subsequent local client write will use the local time and may be behind the XDR write's timestamp. In this case, the client write will be rejected as we cannot allow the bin-level LUT to go back in time.

Forwarding from the active-active mesh

There are use cases where the writes happening in the active-active mesh DCs needs to be forwarded to one or more read-only DCs. For example, assuming 3 DCs A, B and C setup as an active-active mesh and assuming DC A requiring to forward all the writes to a read-only DC D. While this is supported, we cannot use bin-shipping from A to D. When forwarding the writes from A to D, the bin-policy should be set to all (default).

Bin shipping is based on bin LUTs. With bin-convergence, XDR writes may bring in bins with older LUTs than the last-ship-time of the forwarded DC. If that is the case, the bins will not be shipped to the forwarded DC. Hence the need to set the bin-policy to all where we ship the entire record irrespective of the bin LUTs.