Capacity Planning Guide
For an Aerospike Database, you can size your hardware by calculating the capacity requirements as described below. Then, depending on whether you want an in-memory database, a hybrid SSD/memory database, an in-memory cache with no persistence, or a database deployed to the cloud (such as on Amazon EC2), you can provision your SSDs and RAM appropriately.
Required Memory
In Aerospike's Hybrid Memory architecture, the indexes are always stored in RAM. There must be sufficient RAM for the primary index, secondary indexes, and set indexes. Provision enough RAM to leave headroom for the unexpected loss of a node, and so that memory usage does not exceed the high water mark, if one is configured. A common best practice is to keep memory_used_bytes below 60% of the namespace's memory-size, though it can be higher depending on the cluster size and use case.
Starting with Aerospike Enterprise Edition 4.3, the primary index of a namespace can be stored on a dedicated device. In this configuration, known as Aerospike All Flash, memory consumption is reduced to a bare minimum.
Make sure that the combined memory-size of your namespaces does not exceed the available RAM on the machine. Enough memory should be reserved for the OS, namespace overhead, and other software running on the machine.
Memory Required for Primary Index
The primary index of each namespace is partitioned into 4096 partitions, and each partition is structured as a group of sprigs (shallow red-black trees). The sprigs point to record metadata, 64 bytes per record.
Calculating byte size of primary index
In Aerospike's Hybrid Memory configuration, the memory consumed by the primary index is:
64 bytes × (replication factor) × (number of records)
Replication factor is the number of copies of each record within the namespace. The default replication factor for a namespace is 2: a master copy and a single replica copy for each record.
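For illustration only, here is a minimal Python sketch of this calculation; the function name is hypothetical, and the 64-byte entry size and default replication factor of 2 come from the text above:

```python
# Hybrid Memory: cluster-wide RAM consumed by the primary index.
PRIMARY_INDEX_ENTRY_BYTES = 64  # per-record metadata entry held in RAM

def primary_index_ram_bytes(num_records, replication_factor=2):
    """Cluster-wide primary index RAM, in bytes."""
    return PRIMARY_INDEX_ENTRY_BYTES * replication_factor * num_records

# Example: 4 billion records, replication factor 2.
print(primary_index_ram_bytes(4_000_000_000) / 2**30)  # ~476.8 GiB
```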
Aerospike All Flash
Refer to the Primary Index Configuration page for further details.
It is important to understand the subtleties of All Flash sizing, as scaling up an All Flash namespace may require an increase of partition-tree-sprigs, which would require a rolling cold restart. Adding nodes will add more capacity, but as sprigs fill up and overflow their initial 4KiB disk allocation, performance will be impacted.
When a namespace uses an All Flash configuration, the 64 bytes of record metadata are not stored in memory, but rather as part of a 4KiB block on an index device. Only the sprig portion of the primary index consumes RAM, at 10 bytes per sprig. (Note: this overhead is 13 bytes per sprig in server versions prior to 5.7.)
To reduce the number of read operations to the index device, consider the 'fill fraction' of an index block. You should ensure that each sprig contains fewer than 64 records (as 64 x 64B is 4KiB).
If the namespace is projected to grow rapidly, use a lower fill fraction to leave room for future records. Full sprigs will span more than a single 4KiB index block and will likely require more than a single index device read. Modifying the number of sprigs to mitigate such a situation requires a cold start to rebuild the primary index, so it's better to determine the fill fraction in advance.
sprigs per partition = (total unique records / (64 x fill fraction)) / 4096
partition-tree-sprigs must be a power of 2, so whatever the above calculation yields, pick the nearest appropriate power of 2.
For example, with 4 billion unique objects and a fill fraction of 1/2, the sprigs per partition should be:
(4 x 10^9) / (64 x 0.5) / 4096 = ~ 30,517 -> nearest power of 2 = 32,768
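As an illustrative sketch (not an official tool), the same calculation and rounding step in Python; rounding up to the next power of 2 matches the example above:

```python
import math

PARTITIONS = 4096
RECORDS_PER_INDEX_BLOCK = 64  # 64 records x 64B of metadata = one 4KiB index block

def partition_tree_sprigs(total_unique_records, fill_fraction):
    """Suggested sprigs per partition, rounded up to a power of 2."""
    raw = total_unique_records / (RECORDS_PER_INDEX_BLOCK * fill_fraction) / PARTITIONS
    return 2 ** math.ceil(math.log2(raw))

print(partition_tree_sprigs(4_000_000_000, 0.5))  # 32768
```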
Each sprig requires 10 bytes of RAM overhead, so:
total sprigs overhead = 10 bytes x total unique sprigs x replication factor
= 10 bytes x ((number of records x replication factor) / (64 x fill fraction))
The total sprigs are then divided evenly over the number of nodes.
For our previous example, the amount of memory consumed by the primary index sprigs is:
10 bytes x 32768 x 4096 x 2 = 2.5GiB
Or for server versions prior to 5.7:
13 bytes x 32768 x 4096 x 2 = 3.25GiB
So with 4 billion objects and a replication factor of 2, the RAM consumed in association with the primary index (across the cluster) in All Flash is 2.5GiB, instead of 476.8GiB of RAM that would be used by the same example in a Hybrid Memory configuration.
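The same figures can be reproduced with a short sketch; the helper below is illustrative, with the 10-byte (or 13-byte pre-5.7) per-sprig overhead taken from the text above:

```python
def sprig_ram_bytes(sprigs_per_partition, replication_factor=2,
                    partitions=4096, bytes_per_sprig=10):
    """Cluster-wide RAM used by primary index sprigs in All Flash.

    Use bytes_per_sprig=13 for server versions prior to 5.7.
    """
    return bytes_per_sprig * sprigs_per_partition * partitions * replication_factor

print(sprig_ram_bytes(32768) / 2**30)                      # 2.5 GiB
print(sprig_ram_bytes(32768, bytes_per_sprig=13) / 2**30)  # 3.25 GiB
```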
When calculating the number of required sprigs, you must also ensure that the correct amount of space is provided on the disk for the Primary Indexes, and adjust mounts-size-limit accordingly. The following formula gives the minimum size needed for this configuration parameter.
mounts-size-limit = (((4096 × replication-factor) ÷ min-cluster-size) × partition-tree-sprigs) × 4KiB
To explain the above, the mounts-size-limit should be 4096 (the number of master partitions), multiplied by replication-factor to get the total number of partitions, divided by the minimum cluster size that you will have, to get the maximum number of partitions per node, multiplied by the number of partition tree sprigs to get the maximum sprigs per node, and then multiplied by 4KiB (as each sprig occupies a minimum of 4KiB). This is the minimum usable mount size for your Primary Indexes.
Please also take into account the file system overhead when partitioning the disk for the All Flash mounts.
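The formula can be checked with a minimal sketch (illustrative helper; file system overhead, mentioned above, is not included):

```python
SPRIG_BLOCK_BYTES = 4 * 1024  # each sprig occupies a minimum of 4KiB on the index device

def min_mounts_size_limit_bytes(replication_factor, min_cluster_size, partition_tree_sprigs):
    """Minimum mounts-size-limit per node, in bytes."""
    max_partitions_per_node = 4096 * replication_factor / min_cluster_size
    return max_partitions_per_node * partition_tree_sprigs * SPRIG_BLOCK_BYTES

# Example: replication factor 2, min-cluster-size 3, 32768 sprigs per partition.
print(min_mounts_size_limit_bytes(2, 3, 32768) / 2**30)  # ~341 GiB
```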
In addition, when shutting down, the sprig roots (5 bytes per sprig) are rewritten to optimize the subsequent fast restart. You must allow sufficient disk space for this, or the node will not shut down cleanly. The following formula calculates this extra disk space requirement for the cluster:
5 bytes x partition-tree-sprigs x 4096 x replication-factor
Using this example, this would be:
5 x 32,768 x 4096 x 2 = 1,342,177,280 bytes, or 1.25 GiB
This space does not have to be included within mounts-size-limit.
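As a quick check of the shutdown overhead above (illustrative sketch):

```python
def sprig_root_shutdown_bytes(partition_tree_sprigs, replication_factor=2, partitions=4096):
    """Extra cluster-wide disk space needed at shutdown for rewritten sprig roots (5 bytes each)."""
    return 5 * partition_tree_sprigs * partitions * replication_factor

print(sprig_root_shutdown_bytes(32768) / 2**30)  # 1.25 GiB
```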
If the size of the primary index exceeds 2TiB, you must change index-stage-size from the default value of 1GiB. Index space is allocated in slices (arenas), the size of which is defined by the index-stage-size configuration parameter. The maximum number of arenas is 2048, so if the index needs to be bigger than 2TiB, the index-stage-size must be increased.
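For illustration, under the 2048-arena limit you can estimate the smallest index-stage-size that accommodates a projected index size; the helper and the 5 TiB example are assumptions, not values from this guide:

```python
import math

MAX_ARENAS = 2048

def min_index_stage_size_gib(projected_index_tib):
    """Smallest index-stage-size (in GiB) that keeps the arena count within 2048."""
    return math.ceil(projected_index_tib * 1024 / MAX_ARENAS)

print(min_index_stage_size_gib(2))  # 1 GiB (the default) covers up to 2 TiB
print(min_index_stage_size_gib(5))  # a 5 TiB index needs at least 3 GiB stages
```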
Capacity Planning for Set Indexes
- For each set index, 4MiB x replication factor of RAM overhead is used, distributed across the cluster nodes.
- Each record in an indexed set costs 16 bytes of RAM, multiplied by the namespace replication factor.
- 16MiB x replication factor of RAM is pre-allocated for each set index, divided across the cluster nodes, as soon as the set index is created. This allocation is reserved for the first million records in the set.
- Memory for indexing additional records in the set is allocated in 4KiB micro-arena stage increments. Each additional 4KiB micro-arena stage enables set-indexing of 256 records in a specific partition.
If a namespace has 1000 sets, each with a set index, and a replication factor of 2:
- The overhead is 4MiB x 1000 x 2 = 8000MiB (~7.8GiB), divided across the nodes of the cluster.
- The initial stage pre-allocates 16MiB x 1000 x 2 = 31.25GiB, also divided across the nodes of the cluster.
- Once the number of records in an indexed set passes one million, an additional 4KiB (holding up to 256 records) is allocated in the partition that's being written to.
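The example can be reproduced with a small sketch (illustrative helper; the 4MiB and 16MiB per-set-index figures come from the list above):

```python
def set_index_ram_bytes(num_indexed_sets, replication_factor=2):
    """Fixed cluster-wide RAM cost of set indexes: (overhead, initial pre-allocation)."""
    overhead = 4 * 2**20 * num_indexed_sets * replication_factor       # 4MiB per set index
    preallocated = 16 * 2**20 * num_indexed_sets * replication_factor  # covers first ~1M records per set
    return overhead, preallocated

overhead, prealloc = set_index_ram_bytes(1000, replication_factor=2)
print(overhead / 2**30, prealloc / 2**30)  # ~7.8 GiB and 31.25 GiB, divided across the nodes
```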
Capacity Planning for Secondary Indexes
Secondary indexes can be optionally built over your data. See the Secondary Index Capacity Planning for more details.
Memory Required for Data in Memory
If a namespace is configured to store data in memory, the RAM requirement for that storage can be calculated as the sum of:
- Overhead for each record: 2 bytes
- If the key is saved for the record: + 12 bytes overhead + 1 byte (key type overhead) + (8 bytes for an integer key, OR the length of the string/blob for a string/blob key)
- General overhead for each bin:
  - either + 12 bytes (Aerospike database version prior to 5.4, or XDR bin-policy set so as to incur overhead) or + 11 bytes (Aerospike database version 5.4 or later and XDR bin-policy not incurring overhead)
  - and + 6 bytes for LUT if XDR bin-policy is set so as to incur overhead
  - and + 1 byte for src-id if XDR bin convergence is enabled
- Type-dependent overhead for each bin:
  - + 0 bytes for bin tombstone (see bin-policy), or
  - + 0 bytes for integer, double or boolean data, or
  - + 5 bytes for string, blob, list/map, or geojson data
- Data: + data size, the size of data in all the record's bins (0 bytes for integer, double and boolean data, which is stored by replacing some of the general overhead). Please see Data Size.
For example, for a record with two bins containing an integer and a string of length 20 characters, and database version prior to 5.4, we find:
2 + (2 × 12) + (0 + 0) + (5 + 20) = 51 bytes.
Or for the same type of record, and database version 5.4 or later (and bin-policy not incurring overhead), we find:
2 + (2 × 11) + (0 + 0) + (5 + 20) = 49 bytes.
This memory is actually split into different allocations — the record overhead plus (all) general bin overhead are in one allocation, and the type-dependent bin overhead plus data are in separate allocations per bin. Note that integer data does not need the per-bin allocation. The system heap will round allocation sizes, so there may be a few more bytes used than the above calculation implies.
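Here is a sketch of the two worked examples, covering only simple integer and string bins; it ignores stored keys, XDR overhead, bin tombstones, and heap rounding:

```python
def in_memory_record_bytes(bin_values, server_54_or_later=True):
    """Approximate in-memory size of a record containing only integer/string bins."""
    total = 2                                        # record overhead
    bin_overhead = 11 if server_54_or_later else 12  # general overhead per bin
    for value in bin_values:
        total += bin_overhead
        if isinstance(value, int):
            total += 0                               # integers live in the general overhead
        else:
            total += 5 + len(value)                  # strings: 5 bytes type overhead + data
    return total

print(in_memory_record_bytes([7, "a" * 20], server_54_or_later=False))  # 51
print(in_memory_record_bytes([7, "a" * 20]))                            # 49
```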
Memory Required for Data in Single-Bin Namespaces
If a namespace configured to store data in memory is also configured as single-bin true, the record overhead and the general bin overhead (the first allocation) described above are not needed; this overhead is stored in the index. The only allocation needed is for the type-dependent overhead plus data. Therefore, numeric data (integer, double) and booleans have no memory storage cost: both the overhead and data are stored in the index. If it is known that all the data in a single-bin namespace is a numeric data type, the namespace can be configured to indicate this by setting data-in-index true. This will enable fast restart for this namespace, despite the fact that it is configured to store data in memory.
Data Storage Size
The storage requirement for a single record is the sum of the following:
- Overhead for each record: 39 bytes in server 6.0 and later (a four-byte record end mark was added in a format change), or 35 bytes in server versions prior to 6.0
- If using a non-zero void-time (TTL); note that tombstones have no expiration: + 4 bytes
- If using a set name: + 1 byte overhead + set name length in bytes
- If storing the record's key (flat key size is the exact opaque bytes sent by the client): + 1-3 bytes overhead (1 byte for key size < 128 bytes, 2 bytes for < 16K, 3 bytes for >= 16K) + 1 byte (key type overhead) + flat key size
- Bin count overhead (no overhead for single-bin and tombstone records): + 1 byte for count < 128, + 2 bytes for < 16K, or + 3 bytes for >= 16K
- General overhead for each bin (no overhead for single-bin): + 1 byte + bin name length in bytes, and + 6 bytes for LUT depending on XDR bin-policy, and + 1 byte for src-id if XDR bin convergence is enabled
- Type-dependent overhead for each bin:
  - + 1 byte for bin tombstone (see bin-policy), or
  - + 2 bytes + (1, 2, 4, or 8 bytes) for integer data values 0-255, 256-64K, 64K-4B, or bigger, or
  - + 1 byte + 1 byte for boolean data, or
  - + 1 byte + 8 bytes for double data, or
  - + 5 bytes + data size for all other data types. See Data Size.
This resulting storage size should then be rounded up to a multiple of 16 bytes. For example, for a tombstone record with a set name 10 characters long and no stored key, we need:
35 + (1 + 10) = 46 -> rounded up = 48 bytes
Or for a record in the same set, no TTL, two bins (8 character names) containing an integer and a string of 20 characters:
35 + (1 + 10) + 1 + (2 × (1 + 8)) + (2 + 8) + (5 + 20) = 100 -> rounded up = 112 bytes
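A sketch of the second worked example, limited to the fields it uses (no TTL, no stored key, integer and string bins, fewer than 128 bins); the helper is illustrative:

```python
def device_record_bytes(set_name, bins, server_6_or_later=False):
    """Approximate on-device record size for simple integer/string bins, rounded up to 16 bytes."""
    total = 39 if server_6_or_later else 35    # record overhead
    total += 1 + len(set_name)                 # set name overhead
    total += 1                                 # bin count overhead (< 128 bins)
    for name, value in bins:
        total += 1 + len(name)                 # general bin overhead
        if isinstance(value, int):
            total += 2 + 8                     # integer, sized here at the full 8 data bytes
        else:
            total += 5 + len(value)            # string: 5 bytes overhead + data
    return (total + 15) // 16 * 16             # round up to a multiple of 16

print(device_record_bytes("abcdefghij", [("name0001", 7), ("name0002", "a" * 20)]))  # 112
```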
All Flash Index Device Space
Refer to the Primary Index Configuration page for further details.
In the Aerospike All Flash configuration, every sprig requires 4KiB of index device space, assuming that each index block holds metadata for no more than 64 records.
The index device space needed if each sprig has one index block is
4 KiB x total unique sprigs x replication factor
In the earlier example we had 4 billion unique records, a fill fraction of 1/2, and replication factor 2. We calculated the sprigs per partition to be 32,768.
4 KiB x 32768 x 4096 x 2 = 1 TiB index device space needed for the cluster
You can now calculate the index device space per node.
You want to tolerate cluster splits and nodes going down. Use the minimal size of the sub-cluster you want functioning in case of a cluster split or missing nodes.
device size per node = cluster-wide index device space needed / minimal number of nodes
For the example above, assume a min-cluster-size of 3. The index device space per node will need to be:
1 TiB / 3 = ~341GiB index device space per node
Similar to high-water-memory-pct for namespace RAM and high-water-disk-pct for SSD data storage, there is a mounts-high-water-pct configuration parameter. By default it is disabled (set to 0), but a typical value for use cases leveraging evictions is 80% of the mounts-size-limit; the namespace will evict from the index device when this high water mark is breached.
In our example, we will take this high water mark into account and provision the index device per node accordingly:
341GiB / 0.8 = 427GiB index device space per node
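Putting the pieces together, a sketch of the per-node index device sizing used in this example (illustrative helper; the 80% high water mark is the typical value mentioned above):

```python
SPRIG_BLOCK_BYTES = 4 * 1024  # one 4KiB index block per sprig

def index_device_bytes_per_node(sprigs_per_partition, replication_factor,
                                min_cluster_size, mounts_high_water_pct=80):
    """Per-node All Flash index device space, with headroom for the high water mark."""
    cluster_bytes = SPRIG_BLOCK_BYTES * sprigs_per_partition * 4096 * replication_factor
    per_node = cluster_bytes / min_cluster_size
    return per_node / (mounts_high_water_pct / 100)

print(index_device_bytes_per_node(32768, 2, 3) / 2**30)  # ~427 GiB
```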
Total Storage Required for Cluster
(size per record as calculated above) x (Number of records) x (replication factor)
Data can be stored in RAM or on flash storage (SSD). You should not exceed 50-60% capacity on your SSDs. You can use one of our recommended SSDs or test/certify your own SSD using the Aerospike Certification Tool.
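For example, a sketch that combines the per-record storage size with the 50-60% utilization guidance; the 50% target and the record size below are assumptions for illustration:

```python
def cluster_ssd_bytes(record_bytes, num_records, replication_factor=2, target_utilization=0.5):
    """Raw SSD capacity to provision across the cluster, keeping usage at the target level."""
    stored = record_bytes * num_records * replication_factor
    return stored / target_utilization

# 4 billion records of 112 bytes each, replication factor 2, 50% target utilization.
print(cluster_ssd_bytes(112, 4_000_000_000) / 2**40)  # ~1.6 TiB of raw SSD across the cluster
```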
Throughput (bytes)
(number of records to be accessed in a second) × (the size of each record)
Calculate your desired throughput so that the cluster would continue to work even if one node goes down (i.e., make sure that the remaining nodes can handle the full traffic load).
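As a sketch, the per-node throughput target when planning for one node to be down; the traffic numbers and cluster size are assumptions for illustration:

```python
def per_node_throughput_bytes(records_per_second, record_bytes, num_nodes):
    """Bytes per second each remaining node must sustain if one node is down."""
    cluster_bytes_per_second = records_per_second * record_bytes
    return cluster_bytes_per_second / (num_nodes - 1)

# 200,000 operations/sec on 112-byte records, 4-node cluster planned for 3 surviving nodes.
print(per_node_throughput_bytes(200_000, 112, 4) / 2**20)  # ~7.1 MiB/s per node
```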
Provisioning
See Provisioning a Cluster for examples.