Configuring the primary index

Aerospike's primary index can be stored in three different ways: RAM, persistent memory (PM), and flash (normally NVMe SSDs). Different namespaces within the same cluster can use different index storage methods.

index-type

To specify an index storage method, use the namespace context configuration item index-type.

The default index-type is shmem, with the index stored in RAM (Linux shared memory). To specify a persistent memory index, use index-type pmem, and to specify an index in flash, use index-type flash.

Cautions for Systemd

In a systemd environment you might need to increase TimeoutSec from its default of 15s in /usr/lib/systemd/system/aerospike.service. The primary index clean-up that runs during shutdown can take longer than 15s, and a short timeout allows systemd to kill the asd process prematurely while the service is shutting down. As of Aerospike version 4.6.0.2, this default has been increased to 10 minutes.
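Rather than editing the unit file under /usr/lib directly (where a package upgrade may overwrite it), the timeout can be raised with a systemd drop-in. A minimal sketch, assuming the service is named aerospike.service; the 600s value mirrors the 10-minute default mentioned above:

```ini
# /etc/systemd/system/aerospike.service.d/timeout.conf (illustrative path)
[Service]
TimeoutSec=600
```

After creating the file, run sudo systemctl daemon-reload for the override to take effect.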

Persistent memory index

Aerospike's Persistent memory (PM) index feature allows primary indexes to be stored in persistent memory (for example, Intel Optane DC NVDIMMs) instead of RAM-based shared memory segments.

Unlike a RAM-based index, a PM index is preserved across reboots of a cluster node's OS, which allows for fast restarts of Aerospike after a reboot.

Aerospike requires the persistent memory to be accessible via fsdax, that is, via block devices such as /dev/pmem0:

  • The NVDIMM regions must have been configured as AppDirect regions, as in the following example from a machine with a 750-GiB AppDirect region:
$ sudo ipmctl show -region
SocketID  ISetID              PersistentMemoryType  Capacity   FreeCapacity  HealthState
0         0x59727f4821b32ccc  AppDirect             750.0 GiB  0.0 GiB      Healthy
  • The NVDIMM regions must have been turned into fsdax namespaces, as in the following example from the same machine:
$ sudo ndctl list
[
  {
    "dev":"namespace0.0",
    "mode":"fsdax",
    "blockdev":"pmem0",
    ...
  }
]

Filesystem configuration

The PM block device must contain a filesystem that is capable of DAX (Direct Access), such as EXT4 or XFS. On the machine in the above example, this could be accomplished in the usual way:

EXT4 filesystem:

$ sudo mkfs.ext4 /dev/pmem0

XFS filesystem:

$ sudo mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0

Finally, the filesystem must be mounted with the dax mount option. Without this option, the Linux page cache is involved in all I/O to and from persistent memory, which drastically reduces performance.

In the following example, we use /mnt/pmem0 as the mount point.

$ sudo mount -o dax /dev/pmem0 /mnt/pmem0

To make the mount persist across system reboots, add it to /etc/fstab. The mount point's configuration line can be copied from /etc/mtab to /etc/fstab.
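For example, a matching /etc/fstab entry for the mount above might look like the following (a sketch; verify the device name and options against your /etc/mtab):

```
/dev/pmem0  /mnt/pmem0  ext4  dax  0  0
```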

index-type for PM

The primary index type is configured per namespace. To enable a PM index for a namespace, add an index-type subsection with an index type of pmem to its namespace section. The added index-type subsection must contain:

  • One or more mount directives to indicate the mount points of the persistent memory to be used for the PM index.

    A single namespace can use persistent memory across multiple mount points and will evenly distribute allocations across all of them.

    Conversely, mount points can be shared across multiple namespaces. The file names underlying namespaces' persistent memory allocations are namespace-specific, which avoids file name clashes between namespaces when they share mount points.

  • A mounts-size-limit directive to indicate this namespace's share of the space available across the given mount points.

    When multiple namespaces share mount points, this configuration directive tells Aerospike how much of the total available memory across mount points each namespace is expected to use.

    The specified value, along with the configuration item mounts-high-water-pct (default 80), forms the basis for calculations such as the high-water mark for evictions.

    If mount points are not shared between namespaces, then simply specify the total available space.

    Ensure mounts-size-limit is less than or equal to the size of the filesystem.

The following configuration snippet extends the above example and makes all of /mnt/pmem0 memory (i.e., 750 GiB) available to the namespace:

namespace test {
    ...
    index-type pmem {
        mount /mnt/pmem0
        mounts-size-limit 750G
    }
    ...
}

Flash Index

The Aerospike All Flash feature allows primary indexes to be stored on flash memory devices, typically NVMe SSDs.

This index storage method is typically used for extremely large primary indexes with relatively small records. Accurate capacity planning and configuration are critical for this storage method.

Cautions

It is important to understand the subtleties of All Flash sizing: scaling up an All Flash namespace may require an increase in partition-tree-sprigs, which in turn requires a rolling Cold Restart.

While it is advisable to adjust the kernel's min_free_kbytes parameter in any configuration, it is especially important to do so when using All Flash. The Linux kernel attempts to use all free memory by caching disk writes; with an All Flash configuration, this can result in an OOM kill if not enough free RAM is left for normal system operations. Aerospike recommends setting min_free_kbytes=1153434 (about 1.1 GiB) for this reason.
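The setting can be made persistent with a sysctl configuration file. A sketch using the recommended value above (the file name is illustrative):

```
# /etc/sysctl.d/99-aerospike.conf
vm.min_free_kbytes = 1153434
```

Apply it without rebooting by running sudo sysctl --system.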

Enable Flash index for a namespace

To enable a Flash index for a namespace, in the configuration file, add an index-type subsection with an index type of flash to its namespace section. The added index-type subsection must contain:

  • One or more mount directives to indicate the mount points on the flash storage to be used for the flash index.

    A single namespace can use flash index storage across multiple mount points and will evenly distribute allocations across all of them.

    Conversely, mount points can be shared across multiple namespaces. The file names underlying namespaces' flash index allocations are namespace-specific, which avoids file name clashes between namespaces when they share mount points.

  • A mounts-size-limit directive to indicate this namespace's share of the space available across the given mount points.

    When multiple namespaces share mount points, this configuration directive tells Aerospike how much of the total available space across mount points each namespace is expected to use.

    The specified value, along with the configuration item mounts-high-water-pct (default 80), forms the basis for calculations such as the high-water mark for evictions.

    If mount points are not shared between namespaces, then specify the total available space.

An XFS file system is recommended because it has been shown to provide better concurrent access to files compared to EXT4.

Recommendation for multiple physical devices

Using more physical devices improves performance by increasing parallelism across them; adding more partitions per physical device does not necessarily help. Aerospike instantiates at least 4 separate arena allocations (files) and allocates more when more devices (logical partitions or physical devices) are present. Having multiple arenas reduces contention on any single arena, which is important during heavy insertion loads.

Sample configuration snippet

namespace test {
    ...
    partition-tree-sprigs 1M # Typically very large for flash index - see sizing guide.
    ...
    index-type flash {
        mount /mnt/nvme0
        mount /mnt/nvme1
        mount /mnt/nvme2
        mount /mnt/nvme3
        mounts-size-limit 1T
    }
    ...
}

Flash index calculations summary

For more information, see Linux capacity planning.

Here is a summary for calculating the disk space and memory required for a 4 billion records namespace with a replication factor of 2.

Number of sprigs required

  • 4 billion records ÷ 4096 partitions ÷ 32 records per sprig (to retain the half fill factor) = ~30,517
  • Round up to power of 2: 32,768 sprigs per partition

Disk space required

  • 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 4KiB size of each block = 1TiB for the whole cluster
  • 1TiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-high-water-pct at 80% = 427 GiB per node

Because All Flash uses a filesystem with multiple files, the mount point should be slightly larger than 427 GiB to accommodate filesystem overhead, which is filesystem-dependent. The 427 GiB figure refers to usable space inside the files themselves.
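The sprig and disk arithmetic above can be reproduced with a short shell sketch (variable names are illustrative):

```shell
RECORDS=4000000000   # records in the namespace
PARTITIONS=4096      # fixed number of Aerospike partitions
PER_SPRIG=32         # records per sprig, to retain the half fill factor
REPL=2               # replication factor

RAW=$(( RECORDS / PARTITIONS / PER_SPRIG ))   # ~30,517 sprigs needed per partition
SPRIGS=1                                      # round up to the next power of 2
while [ "$SPRIGS" -lt "$RAW" ]; do SPRIGS=$(( SPRIGS * 2 )); done

# 4 KiB block per sprig, converted KiB -> GiB
DISK_GIB=$(( SPRIGS * PARTITIONS * REPL * 4 / 1024 / 1024 ))
# divide across 3 nodes, at the 80% high-water mark
PER_NODE_GIB=$(awk -v g="$DISK_GIB" 'BEGIN { printf "%.0f", g / 3 / 0.8 }')

echo "$RAW sprigs -> $SPRIGS rounded; $DISK_GIB GiB total; $PER_NODE_GIB GiB per node"
```

Running it reproduces the figures above: 30,517 sprigs rounded up to 32,768; 1024 GiB (1 TiB) for the cluster; 427 GiB per node.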

RAM required

With server version 5.7 or later, where 10 bytes are required per sprig:

  • 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 10 bytes memory required per sprig = 2,560 MiB for the whole cluster
  • 2,560 MiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-high-water-pct at 80% = 1,066 MiB per node

Or with server versions prior to 5.7, where 13 bytes are required per sprig:

  • 32,768 sprigs per partition × 4096 partitions × 2 replication factor × 13 bytes memory required per sprig = 3,328 MiB for the whole cluster
  • 3,328 MiB required for the whole cluster ÷ 3 minimal number of nodes ÷ 0.8 with mounts-high-water-pct at 80% = 1,387 MiB per node