Capacity Planning Guide

Overview

Aerospike Database Enterprise Edition (EE) and Standard Edition (SE) store namespace data and metadata in several areas: the data storage engine, the primary index, and optional secondary indexes. Each of these can be configured independently to use solid state drives (SSD), shared memory (shmem), or Intel Optane Persistent Memory (PMem). The optional set indexes are always stored in memory.

Aerospike Database Community Edition (CE) is limited to storing namespace data on SSD or in volatile process memory; CE primary and secondary indexes store their metadata in process memory.

This page describes how to calculate the capacity requirements of your namespaces.

Version changes

  • Since server 7.0 in-memory data is stored in shmem.
  • Server 6.4 added the ability to store secondary indexes in flash (SSD).
  • Server 6.3 added the ability to store secondary indexes in PMem.
  • Since server 6.1 secondary indexes are stored in shmem.
  • Server 4.8 added the ability to store data in PMem.
  • Server 4.5 added the ability to store the primary index in PMem.
  • Server 4.3 added the ability to store the primary index in flash (SSD), known as Aerospike All Flash.

Required memory

You must provision enough memory to avoid losing a node to an out-of-memory (OOM) crash. To help preserve memory, Aerospike provides a configurable stop-writes threshold for its namespaces, as well as an optional eviction threshold. See configuring namespace data retention for more details.

Reserve enough memory for the OS, namespace overhead, and any other software running on the machine.

caution

For versions prior to server 7.0, verify that the combined memory-size of your namespaces does not exceed the available RAM on the machine.

Calculating primary index storage

info

See Primary Index Configuration for more configuration details.

The primary index of each namespace is partitioned into 4096 partitions, and each partition is structured as a group of shallow red-black trees called sprigs.

Each record has 64 bytes of record metadata in the primary index.

    64 bytes × (replication factor) × (number of records)

The replication factor is the number of copies each record has within the namespace. The default replication-factor for a namespace is 2 - a master copy and a single replica copy for each record.
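
As a quick sanity check, the calculation above can be sketched in a few lines of Python; the function name and the 1-billion-record example are illustrative only:

    # Estimate of cluster-wide primary index memory when the index is stored in memory.
    RECORD_METADATA_BYTES = 64   # bytes of primary index metadata per record copy

    def primary_index_bytes(num_records, replication_factor=2):
        return RECORD_METADATA_BYTES * replication_factor * num_records

    # Illustrative example: 1 billion unique records, replication-factor 2
    print(primary_index_bytes(1_000_000_000) / 2**30)   # ~119.2 GiB across the cluster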

Primary index on flash

When you configure a namespace with index-type flash (also known as All Flash), the 64 bytes of record metadata are stored as part of a 4KiB block on an index device. Only the sprig portion of the primary index consumes RAM: 10 bytes per sprig (13 bytes per sprig in versions prior to server 5.7).

caution

It is important to understand the subtleties of All Flash sizing. Scaling up an All Flash namespace may require an increase of partition-tree-sprigs, which would require a rolling cold restart. Additional nodes increase capacity, but performance can be impacted as sprigs fill up and overflow their initial 4KiB disk allocation.

To reduce the number of read operations to the index device, consider the 'fill fraction' of an index block. Aim for each sprig to contain fewer than 64 records (as 64 x 64B is 4KiB).

If the namespace is projected to grow rapidly, use a lower fill fraction to leave room for future records. Full sprigs span more than a single 4KiB index block, and likely require more than a single index device read. Modifying the number of sprigs to mitigate such a situation requires a cold start to rebuild the primary index, so it's better to determine the fill fraction in advance.

    sprigs per partition = (total unique records / (64 x fill fraction)) / 4096

partition-tree-sprigs must be a power of 2, so whatever the above calculation yields, pick the nearest power of 2.

For example, with 4 billion unique objects, and a fill factor of 1/2, the sprigs per partition should be:

    (4 x 10^9) / (64 x 0.5) / 4096 = ~ 30,517 -> nearest power of 2 = 32,768
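
The same calculation can be written as a minimal Python sketch; the helper name and the power-of-2 rounding implementation are illustrative:

    import math

    # Sprigs per partition for an All Flash namespace, per the formula above.
    def sprigs_per_partition(total_unique_records, fill_fraction=0.5, partitions=4096):
        raw = total_unique_records / (64 * fill_fraction) / partitions
        return 2 ** round(math.log2(raw))   # nearest power of 2

    print(sprigs_per_partition(4_000_000_000, 0.5))   # 32768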

Each sprig requires 10 bytes of RAM overhead, so:

    total sprigs overhead = 10 bytes x total unique sprigs x replication factor
                          = 10 bytes x ((number of records x replication factor) / (64 x fill fraction))

The total sprigs are then divided evenly over the number of nodes.

For our previous example, the amount of memory consumed by the primary index sprigs is:

    10 bytes x 32768 x 4096 x 2 = 2.5GiB

Or for server versions prior to 5.7:

    13 bytes x 32768 x 4096 x 2 = 3.25GiB

So with 4 billion objects and a replication factor of 2, the memory consumed in association with the primary index (across the cluster) in All Flash is 2.5GiB, instead of 476.8GiB of memory that would be used by the same example in a Hybrid Memory configuration, where the primary index is in memory.
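
The sprig RAM overhead can likewise be sketched in Python; the helper name and its default arguments are illustrative:

    # Cluster-wide RAM consumed by primary index sprigs in an All Flash namespace.
    def sprig_ram_bytes(partition_tree_sprigs, replication_factor=2,
                        bytes_per_sprig=10, partitions=4096):
        return bytes_per_sprig * partition_tree_sprigs * partitions * replication_factor

    print(sprig_ram_bytes(32_768) / 2**30)                       # 2.5 GiB
    print(sprig_ram_bytes(32_768, bytes_per_sprig=13) / 2**30)   # 3.25 GiB (pre-5.7)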

When calculating the number of required sprigs, also ensure that enough space is provided on disk for the primary index. This primary index size is configured with mounts-budget in server 7.0 (previously mounts-size-limit). Apply the following formula to get the minimum value for this configuration parameter.

    primary index size = ((4096 x replication-factor / min-cluster-size) x partition-tree-sprigs) x 4KiB

To explain the equation above: take 4096 (the number of master partitions), multiply by replication-factor to get the total number of partitions, divide by the minimum cluster size (min-cluster-size) you will have to get the maximum partitions per node, multiply by the number of sprigs per partition (partition-tree-sprigs) to get the maximum number of sprigs per node, and then multiply by 4KiB (as each sprig occupies a minimum of 4KiB). The result is the minimum usable mount size for your primary indexes. Also take into account the file system overhead when partitioning the disk for the All Flash mounts.
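
A hedged Python sketch of this minimum mounts-budget calculation; the min-cluster-size of 5 in the example is an assumption, and file system overhead is not included:

    # Minimum usable index device size per node (mounts-budget, formerly
    # mounts-size-limit), following the formula above.
    def min_mounts_budget_bytes(partition_tree_sprigs, replication_factor=2,
                                min_cluster_size=1, partitions=4096, block_bytes=4096):
        max_partitions_per_node = partitions * replication_factor / min_cluster_size
        return max_partitions_per_node * partition_tree_sprigs * block_bytes

    print(min_mounts_budget_bytes(32_768, 2, 5) / 2**30)   # ~204.8 GiB per node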

In addition, when shutting down, the sprig roots (5 bytes per sprig) are rewritten to optimize the subsequent fast restart. You must allow sufficient disk space for this, or the node will not shut down cleanly. The following formula calculates this extra disk space requirement for the cluster:

    5 bytes x partition-tree-sprigs x 4096 x replication-factor

Using this example, this would be:

    5 x 32,768 x 4096 x 2 = 1342177280 bytes or 1.25GiB

This space does not have to be included within mounts-budget (or mounts-size-limit).
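
For completeness, the shutdown space calculation as a small Python sketch (illustrative helper only):

    # Extra cluster-wide device space needed at shutdown to rewrite sprig roots.
    def shutdown_sprig_root_bytes(partition_tree_sprigs, replication_factor=2,
                                  partitions=4096, bytes_per_root=5):
        return bytes_per_root * partition_tree_sprigs * partitions * replication_factor

    print(shutdown_sprig_root_bytes(32_768) / 2**30)   # 1.25 GiB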

If the size of the primary index exceeds 2TiB, you must change index-stage-size from its default value of 1GiB. Index space is allocated in arenas, whose size is defined by the index-stage-size configuration parameter. The maximum number of arenas is 2048, so if the index needs to be larger than 2TiB, index-stage-size must be increased.
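
A small sketch for checking whether the default stage size is large enough, assuming index-stage-size is expressed in whole GiB:

    import math

    # The 2048-arena limit caps the addressable index at 2048 x index-stage-size,
    # so a larger index needs a larger stage size.
    MAX_ARENAS = 2048

    def min_index_stage_size_gib(index_bytes):
        return math.ceil(index_bytes / MAX_ARENAS / 2**30)

    print(min_index_stage_size_gib(3 * 2**40))   # a 3 TiB index needs at least 2 GiB stages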

Calculating set index storage

info

Set indexes are always stored in memory.

See Adding and Removing a Set Index for more configuration details.

  1. For each set index, 4MiB x replication factor of memory overhead is used, distributed across the cluster nodes.
  2. Each record in an indexed set costs 16 bytes of memory, multiplied by the namespace replication factor.
    • 16MiB x replication factor of memory is pre-allocated for each set index, divided across the cluster nodes, as soon as the set index is created. This allocation is reserved for the first million records in the set.
    • Memory for indexing additional records in the set is allocated in 4KiB micro-arena stage increments. Each additional 4KiB micro-arena stage enables set-indexing of 256 records in a specific partition.

Example

If a namespace has 1000 sets, each with a set index, and a replication factor of 2:

  • The overhead is 4MiB x 1000 x 2 = 8000MiB (about 7.8GiB), divided across the nodes of the cluster.
  • The initial stage pre-allocates 16MiB x 1000 x 2 = 31.25GiB, also divided across the nodes of the cluster.
  • Once the number of records in an indexed set passes one million, an additional 4KiB (holding up to 256 records) is allocated in the partition that's being written to.
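
A rough Python sketch of this estimate; it approximates growth beyond the first million records per set with a flat 16 bytes per record rather than modelling per-partition 4KiB micro-arena stages:

    # Rough cluster-wide memory estimate for set indexes, per the rules above.
    def set_index_bytes(num_sets, records_per_set, replication_factor=2):
        overhead = 4 * 2**20 * num_sets * replication_factor        # 4MiB per set index
        preallocated = 16 * 2**20 * num_sets * replication_factor   # first 1M records per set
        extra_records = max(0, records_per_set - 1_000_000) * num_sets
        return overhead + preallocated + extra_records * 16 * replication_factor

    print(set_index_bytes(1000, 1_000_000) / 2**30)   # ~39.1 GiB across the cluster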

Calculating secondary index storage

info

See Secondary Index Configuration for more configuration details.

See the separate Secondary Index Capacity Planning page for how to calculate their storage needs.

Calculating data storage

note

Starting in server 7.0, in-memory data storage is pre-allocated and static. Prior to server 7.0, data storage grew progressively and was bound by the now obsolete memory-size configuration parameter.

Prior to server 7.0, in-memory namespaces had a distinct storage format. To calculate the memory size associated with a pre-7.0 in-memory namespace see below.

The storage requirement for a single record is the sum of the following:

  • Overhead for each record:

    39 bytes in server 6.0 and later (a 4-byte record end mark was added). Prior to server 6.0, the overhead was 35 bytes.

  • If using a non-zero void-time (TTL). Note that tombstones have no expiration:

    + 4 bytes

  • If using a set name:

    + 1 byte overhead + set name length in bytes

  • If storing the record's key. The flat key size is the exact opaque bytes sent by the client:

    • 1-3 bytes overhead
      • 1 byte for key size <128 bytes
      • 2 bytes if 128 bytes <= key size < 16KB
      • 3 bytes if key size >= 16KB
    • 1 byte (key type overhead) + flat key size
  • Bin count overhead. No overhead for single-bin and tombstone records:

    +1 byte for count < 128, +2 bytes for < 16K, or +3 bytes for >= 16K

  • General overhead for each bin. No overhead for single-bin:

    + 1 byte + bin name length in bytes and

    + 6 bytes for LUT depending on XDR bin-policy and

    + 1 byte for src-id if XDR bin convergence is enabled.

  • Type-dependent overhead for each bin:

    + 1 byte for bin tombstone (see bin-policy) or

    + 2 bytes + (1, 2, 4, or 8 bytes) for integer data values of 0-255, 256-64K, 64K-4B, or larger, respectively, or

    + 1 byte + 1 byte for boolean data values or

    + 1 byte + 8 bytes for double data or

    + 5 bytes + data size for all other data types. See Data Size

The resulting storage size should then be rounded up to a multiple of 16 bytes. For example, a tombstone record with a 10-character set name and no stored key (using the pre-server 6.0 record overhead of 35 bytes) requires:

    35 + (1 + 10) = 46 -> rounded up = 48 bytes

Or for a record in the same set, with no TTL and two bins (8-character names) containing an integer and a 20-character string:

    35 + (1 + 10) + 1 + (2 × (1 + 8)) + (2 + 8) + (5 + 20) = 100 -> rounded up = 112 bytes
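
The per-record device size rules above can be captured in a short Python sketch; it models only set name, TTL, and bin sizes passed in directly, and the function name is illustrative:

    # Per-record device storage, following the rules above. Each bin is given as
    # (name length, type-dependent overhead plus data size).
    def record_device_bytes(set_name_len=0, has_ttl=False, bins=(), record_overhead=39):
        size = record_overhead              # 39 bytes (35 bytes before server 6.0)
        if has_ttl:
            size += 4
        if set_name_len:
            size += 1 + set_name_len
        if bins:
            size += 1                       # bin count overhead (fewer than 128 bins)
        for name_len, data_bytes in bins:
            size += 1 + name_len            # general bin overhead
            size += data_bytes              # type-dependent overhead + data
        return -(-size // 16) * 16          # round up to a multiple of 16

    # The two worked examples above, using the pre-6.0 overhead of 35 bytes:
    print(record_device_bytes(set_name_len=10, record_overhead=35))   # 48 (tombstone)
    print(record_device_bytes(set_name_len=10, record_overhead=35,
                              bins=[(8, 2 + 8), (8, 5 + 20)]))        # 112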

Defragmentation considerations

Your storage engine needs a portion of the total storage space available to the namespace for defragmentation, as determined by the defrag-lwm-pct configuration parameter.

By default, you should plan to use no more than 50% of your storage space. Raising defrag-lwm-pct makes more space available for data storage, at the cost of more CPU (when using an in-memory namespace in server >= 7.0) or device IO (for data on SSD or PMem).
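
A simplified sizing sketch based on this guidance; it assumes usable data space scales directly with defrag-lwm-pct, which is an approximation:

    # Raw device capacity needed so that data stays within the planned fraction.
    def required_device_bytes(data_bytes, defrag_lwm_pct=50):
        return data_bytes / (defrag_lwm_pct / 100)

    print(required_device_bytes(2 * 2**40) / 2**40)   # 2 TiB of data -> 4.0 TiB of devices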

Calculating in-memory data storage prior to server 7.0

Prior to server 7.0, the memory size of a record in a namespace configured to store data in memory was calculated as follows:

  • Overhead for each record:

    2 bytes

  • If the key is saved for the record:

    + 12 bytes overhead + 1 byte (key type overhead) + (8 bytes (integer key) OR length of string/blob (string/blob key))

  • General overhead for each bin:

    + 12 bytes (Aerospike database version prior to 5.4, or XDR bin-policy set so as to incur overhead) or

    + 11 bytes (Aerospike database version 5.4 or later and XDR bin-policy not incurring overhead) and

    + 6 bytes for LUT if XDR bin-policy is set so as to incur overhead and

    + 1 byte for src-id if XDR bin convergence is enabled.

  • Type-dependent overhead for each bin:

    + 0 bytes for bin tombstone (see bin-policy) or

    + 0 bytes for integer, double or boolean data or

    + 5 bytes for string, blob, list/map, geojson data

  • Data: size of data in all the record's bins (0 bytes for integer, double and boolean data, which is stored by replacing some of the general overhead). Please see Data Size

    + data size

For example, for a record with two bins containing an integer and a string of length 20 characters, and database version prior to 5.4, we find:

    2 + (2 × 12) + (0 + 0) + (5 + 20) = 51 bytes.

Or for the same type of record, and database version 5.4 or later (and bin-policy not incurring overhead), we find:

    2 + (2 × 11) + (0 + 0) + (5 + 20) = 49 bytes.

This memory is actually split into different allocations — the record overhead plus all general bin overhead are in one allocation, and the type-dependent bin overhead plus data are in separate allocations per bin.

note

Integer data does not need the per-bin allocation. The system heap rounds allocation sizes, so there may be a few more bytes used than the above calculation implies.
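
A minimal Python sketch of the pre-7.0 in-memory calculation, ignoring saved keys and XDR bin overhead; the function name is illustrative:

    # Pre-7.0 in-memory record size, following the rules above. Pass 0 for
    # integer/double/boolean bins and the data length for string/blob/list/map/
    # geojson bins.
    def in_memory_record_bytes(bin_data_sizes, pre_5_4=True):
        per_bin_overhead = 12 if pre_5_4 else 11
        size = 2                            # record overhead
        for data_bytes in bin_data_sizes:
            size += per_bin_overhead        # general bin overhead
            if data_bytes > 0:
                size += 5 + data_bytes      # type-dependent overhead + data
        return size

    print(in_memory_record_bytes([0, 20]))                  # 51 (prior to 5.4)
    print(in_memory_record_bytes([0, 20], pre_5_4=False))   # 49 (5.4 or later)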

Memory Required for Data in Single-Bin Namespaces

caution

Single-bin namespaces were removed in server 6.4. See the special upgrade instructions.

If a namespace configured to store data in memory is also configured as single-bin true, the record overhead and the general bin overhead (the first allocation) described above are not needed — this overhead is stored in the index. The only allocation needed is for the type-dependent overhead plus data. Therefore, numeric data (integer, double) and booleans have no memory storage cost — both the overhead and data are stored in the index. If it is known that all the data in a single-bin namespace is a numeric data type, the namespace can be configured to indicate this by setting data-in-index true. This will enable fast restart for this namespace, despite the fact that it is configured to store data in memory.

Throughput (bytes)

    (number of records to be accessed in a second) × (the size of each record)

Calculate your desired throughput so that the cluster continues to work even if one node goes down (that is, the remaining nodes must be able to absorb the failed node's share of the traffic).
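
A small sketch of one common way to size per-node throughput, assuming the surviving nodes share the failed node's load evenly; the example numbers are illustrative:

    # Per-node storage throughput needed with one node down.
    def required_node_throughput_bps(records_per_second, record_bytes, nodes):
        cluster_bps = records_per_second * record_bytes
        return cluster_bps / (nodes - 1)    # surviving nodes absorb the load

    print(required_node_throughput_bps(100_000, 1536, 5) / 2**20)   # ~36.6 MiB/s per node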

Provisioning

See Provisioning a Cluster for examples.