Bin Convergence
Aerospike version 5.4 introduced the bin convergence feature, which helps achieve convergence in mesh/active-active XDR topologies. Get an overview of it in the architecture guide.
Configuring this feature is more involved than usual, and several settings must be coordinated for bin convergence to work correctly. The relevant parameters are spread across multiple sections of the configuration file. The following table summarizes them.
This feature does not apply to connectors.
Config Name | Section | Description
---|---|---
conflict-resolve-writes | namespace | Set to true on the receiving namespace to store the bin-level LUTs used to resolve conflicts. Not allowed when single-bin is true.
src-id | xdr | A non-zero ID unique to each DC; the same ID must be given to all nodes in the DC.
ship-bin-luts | xdr->dc->namespace | Set to true on the sender so the bin-level LUTs are shipped along with the record.
bin-policy | xdr->dc->namespace | Enable bin shipping by setting this to one of: only-changed, changed-and-specified, changed-or-specified.
A sample configuration looks like this:

```
namespace someNameSpaceName {
    ...
    conflict-resolve-writes true
}

xdr {
    src-id 1

    dc dataCenter1 {
        node-address-port someIpAddress1 somePort1

        namespace someNameSpaceName {
            bin-policy only-changed
            ship-bin-luts true
        }
    }
}
```
Refer to the config example for a 3-DC mesh topology.
Overhead
There is a 1-byte-per-bin overhead for the src-id, in addition to the 6-byte overhead incurred by the bin-policy. This overhead applies to all records in the namespace once conflict-resolve-writes is set to true, not only to the sets for which bin convergence is utilized.
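To make the math concrete, here is a rough sizing sketch. It assumes the 6-byte bin-policy overhead applies per bin, like the 1-byte src-id overhead; the exact on-disk layout may differ by server version.

```python
# Rough per-record storage overhead estimate for bin convergence (a sketch,
# not a statement of the exact on-disk format).
SRC_ID_BYTES_PER_BIN = 1   # src-id stored for each bin
LUT_BYTES_PER_BIN = 6      # bin-level LUT overhead from the bin-policy

def convergence_overhead(num_bins: int) -> int:
    """Extra bytes per record once conflict-resolve-writes is true."""
    return num_bins * (SRC_ID_BYTES_PER_BIN + LUT_BYTES_PER_BIN)

# A 10-bin record carries roughly 70 extra bytes.
print(convergence_overhead(10))
```

Multiply by the record count of the namespace (not just the converging sets) when doing capacity planning.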
Restriction on client writes
Record replace not allowed
When bin convergence is enabled (conflict-resolve-writes is set for the namespace), clients cannot perform writes with the record replace policy. Record replace simply overwrites whatever is already present, without even reading the old version of the record on disk. If some of the old bins are not part of the new record, the last-update-times (LUTs) of those bins are lost. Without the LUTs of the bins deleted during the replace, proper bin convergence cannot be guaranteed.
To overcome this restriction, the client should use the operate API instead of put. The operate API can perform multiple operations in a single call: the application can first delete all the bins (without needing to know the bin names) and then write the desired new bins. The delete-all operation converts the existing bins to bin-tombstones with LUTs, which preserves bin convergence.
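To illustrate why the two approaches differ, here is a toy in-memory model (plain Python, not actual Aerospike client code) of a record as a mapping from bin name to a (value, LUT) pair:

```python
# Toy model of a record: bin name -> (value, last-update-time in ms).

def replace_record(record, new_bins, lut):
    """A blind record replace: old bins vanish along with their LUTs."""
    return {name: (val, lut) for name, val in new_bins.items()}

def delete_all_then_write(record, new_bins, lut):
    """Operate-style emulation: every existing bin first becomes a
    bin-tombstone (value None) with an updated LUT, then the new bins
    are written on top."""
    result = {name: (None, lut) for name in record}  # tombstones keep LUTs
    for name, val in new_bins.items():
        result[name] = (val, lut)
    return result

old = {"a": (1, 100), "b": (2, 100)}
# Replace loses bin "b" entirely: no LUT left to resolve future conflicts.
print(replace_record(old, {"a": 10}, 200))         # {'a': (10, 200)}
# Delete-all-then-write keeps "b" as a bin-tombstone with a LUT.
print(delete_all_then_write(old, {"a": 10}, 200))  # {'a': (10, 200), 'b': (None, 200)}
```

The second form mirrors what the operate-based workaround achieves on the server: deleted bins leave behind timestamped bin-tombstones instead of disappearing silently.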
Durable Deletes
Starting with Aerospike version 5.5, bin convergence is supported even with record deletes, as long as they are durable deletes. When bin convergence is enabled for a namespace and a record is durably deleted, the delete converts the record to a tombstone, but also maintains bin-tombstones along with the necessary LUTs and src-id. These are called bin-cemeteries and are tracked by the metric named xdr_bin_cemeteries. Any future update to the record, whether from a local client or a remote XDR write, is evaluated against the LUT and src-id of the bin-tombstones. The update succeeds only if it carries a later LUT.
Internally, these durable deletes are converted to writes that delete all the bins, which affects the statistics: the operations are counted under client_write_success rather than client_delete_success. Similarly, these deletes show up as writes in the write latency histogram.
As mentioned above, a durable delete leaves behind bin-tombstones with LUTs and src-id, so these records occupy more space than regular record tombstones, which have no bins. Take them into account when doing capacity planning.
These tombstones are removed from the system by the regular tomb-raider. If you change the tomb-raider's default behavior of deleting tombstones older than one day, do not set tomb-raider-eligible-age too low: it should be higher than the potential conflict window plus the XDR lag.
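As a rough sizing sketch, the floor for tomb-raider-eligible-age follows from that rule. The window and lag figures below are hypothetical placeholders, not recommendations:

```python
# Hypothetical figures - substitute values observed in your deployment.
conflict_window_secs = 12 * 3600  # longest period in which DCs may write the same record
max_xdr_lag_secs = 2 * 3600       # worst-case XDR lag between DCs

# tomb-raider-eligible-age must exceed conflict window + lag;
# treat the sum as a floor and add headroom on top.
min_eligible_age_secs = conflict_window_secs + max_xdr_lag_secs
print(min_eligible_age_secs)  # 50400 seconds (14 hours) in this example
```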
Dependencies
System Clock
As described above and in the architecture doc, this feature relies heavily on timestamps, and the timestamps are based on the system clock. For this feature to give the desired results, the system clocks must be set up properly; they are expected to be synchronized via the Network Time Protocol (NTP).
Tie-break
The bin-level LUT has millisecond resolution, which usually gives enough precision to choose a winner. When there is a tie at the millisecond level, the src-id is used to break it: the highest src-id wins. That is why it is important to make sure the src-id is unique across all the clusters connected via XDR.
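The winner selection can be sketched as a lexicographic comparison of (LUT, src-id) pairs. This is a simplification of the server's per-bin logic, not its actual implementation:

```python
def bin_write_wins(incoming_lut, incoming_src_id, current_lut, current_src_id):
    """Return True if the incoming bin write should replace the current bin.

    The higher LUT wins; on an exact millisecond tie, the higher src-id wins.
    """
    return (incoming_lut, incoming_src_id) > (current_lut, current_src_id)

# A write that is 1 ms newer wins regardless of src-id.
print(bin_write_wins(1001, 1, 1000, 9))  # True
# On a millisecond tie, the higher src-id breaks it.
print(bin_write_wins(1000, 2, 1000, 1))  # True
print(bin_write_wins(1000, 1, 1000, 2))  # False
```

Because the src-id is the only tiebreaker, duplicate src-ids across DCs would make ties unresolvable in a consistent direction, which is why uniqueness is mandatory.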
Client write behavior
Despite best efforts, there may be cases where a node's system clock falls behind or runs ahead of the other DCs'. If a node's system clock falls behind, client writes may fail with error code 28 (LOST CONFLICT). This can happen because incoming XDR writes have already written the bins with a higher timestamp: a subsequent local client write uses the local time, which may be behind the XDR write's timestamp. In this case the client write is rejected, because the bin-level LUT cannot be allowed to go back in time.
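A hedged sketch of that failure mode (a simplified model, not the server's actual code path): a local client write must move the bin LUT forward, otherwise it is rejected.

```python
LOST_CONFLICT = 28  # Aerospike error code returned to the client

def apply_local_write(bin_lut_ms, local_clock_ms):
    """Simplified model: a client write stamps bins with the local clock.

    If an XDR write has already set a higher bin LUT (e.g. because the
    local clock lags the remote DC), the write is rejected rather than
    moving the LUT backwards. Ties are rejected too in this toy model,
    ignoring the src-id tiebreaker for brevity.
    """
    if local_clock_ms <= bin_lut_ms:
        return ("error", LOST_CONFLICT)
    return ("ok", local_clock_ms)

# XDR wrote the bin at t=5000 ms, but the lagging local clock says 4990 ms.
print(apply_local_write(5000, 4990))  # ('error', 28)
# Once the local clock passes the bin LUT, writes succeed again.
print(apply_local_write(5000, 5010))  # ('ok', 5010)
```

In practice the condition clears on its own once the local clock catches up to the bin LUTs, which is another reason tight NTP synchronization matters.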
Forwarding from the active-active mesh
There are use cases where writes happening in the active-active mesh DCs need to be forwarded to one or more read-only DCs. For example, assume three DCs (A, B, and C) are set up as an active-active mesh, and DC A must forward all writes to a read-only DC D. Because bin shipping is based on bin LUTs, with bin convergence XDR writes may bring in bins with older LUTs than the last-ship-time of the forwarded DC. This configuration is not supported: forwarding cannot be turned on if bin convergence is being used.