# HyperLogLog Data Type and Probabilistic Data

The HyperLogLog bin data type gives you *estimated* counts of members in a large
dataset for your application to form fast, reasonable approximations of members
in the union or intersection between multiple HyperLogLog bins. HyperLogLog’s
estimates are a balance between complete accuracy and efficient savings in space
and speed in dealing with extremely large datasets.

In this discussion, the words "set" and "dataset" are used with the meaning from mathematical set theory, not the Aerospike concept of sets.

## Set theory fundamental background

The HyperLogLog data returned to your application is based on some basic ideas from set theory. The article Probabilistic: Definition, Models, and Theory Explained gives some brief background information.

### Union of sets

### Intersection of sets

## Some business use cases

The HyperLogLog data type is useful for any problem where your underlying data cannot give you exact answers. Some business use cases include deriving probable answers for the following needs.

### Bank fraud

Count the number of suspicious indicators related to an account and its transactions to determine in real-time the probability of fraud of an incoming transaction.

### Ad campaign scope

Given the user segments an ad is targeting, what is the approximate number of people who will see the ad?

### Online sales conversion rate

By comparing user cohorts and their interest in adjacent items, how many possible customers who login to the website will finally end up purchasing something in the same user session? In multiple sessions?

## HyperLogLog data and data modeling

As with all Aerospike data types, your own values define the underlying HyperLogLog data. You need to model your data on values that have some intrinsic relationship so that you can derive counts of membership in individual data sets or the union or intersection of multiple datasets. In colloquial words, you must “compare apples to apples but not apples to oranges”.

Continuing the example of user segmentation in online ad campaigns, you might have records representing user segments, each with a HyperLogLog bin to represent the membership in the segment. With the data returned by batch reading the bins of multiple segments, you can ask the server for the count of the members in an intersection of multiple segments. For example:

- All the people who love basketball. This is one segment, represented in an HLL bin of one record.
- All the people who like the Golden State Warriors, another segment.
- All the people who also like hats as a fashion accessory.

The meaning of the HLL bins is specific to the use case of identifying users as parts of audience segments, and counting those segments and intersections and unions between them for ad campaigns.

As a baseline for comparison, you can compare between records with HyperLogLog bins representing the same kind of data set specific to your use case. At a minimum, you need to define two HyperLogLog data sets so that their relationships can be explored. The exact meaning of the data is up to you.

- Continuing the example of bank fraud in Some business use cases, you have records that you are certain show fraud. With the data returned by HyperLogLog, your application can find the probability that other records match the pattern of the identified records.
- Continuing the example of online sales conversion rate in
Some business use cases,
the following need to be defined:
- Date/time of login to the website
- Date/time of purchase

## What the HyperLogLog data type returns to your application

The HyperLogLog returns the following information to your application:

- The estimated size of a set.
- The estimated cardinality of the union of multiple sets.
- The estimated similarity of multiple sets.
- The estimated cardinality of the intersection of multiple sets.
- The estimated union of multiple sets. This estimate is returned as a HyperLogLog data type.

### Calculating the size of the returned data

Because the HyperlogLog data type is for working with large data sets, the data it returns can also be large. You need to be aware of the storage cost of a HyperLogLog bin. For small sets a data type other than HyperLogLog might suffice, but for large data sets the unchanging storage size is extremely advantageous.

Each HLL contains 11 bytes of metadata and an array of 2

^{n_index_bits}registers.Each register contains 6 bits of hll_val and n_minhash_bits bits of minhash_val. The size of the registers is rounded up to the nearest byte.

sizeof(HLL) = 11 + roundUpToByte(2^{n_index_bits} × (6 + n_minhash_bits))

For general guidelines on bin overheads, see Linux Capacity Planning.

## Operations

HyperLogLog relies on the following APIs the clients applications use:

Name | Value | Description |
---|---|---|

create_only | 0x01 | Disallow updating an existing value of this bin. |

update_only | 0x02 | Disallow creation of a new Blob bin. |

no_fail | 0x04 | Allow a set of operations to proceed if an individual operation would fail due to a policy violation. |

allow_fold | 0X08 | Allow the resulting set to be the minimum of provided n_index_bits. For intersect_counts and similarity, allow the usage of less precise HLL algorithms when n_minhash_bits of all participating sets do not match. |

## Modify Operations

### init

`init(policy, bin_name, n_index_bits)`

Initializes or resets a standard HyperLogLog.

`init(policy, bin_name, n_index_bits, n_minhash_bits)`

Initializes or resets a HyperLogLog with minhash information (see HyperMinHash) bits to improve accuracy of intersection and similarity estimates.

**Flags:** create_only, update_only, no_fail

**Arguments:**

**policy**(library_specific): HLL modify policy.**bin_name**(string): Name of bin.**n_index_bits**(integer): Number of index bits. It must be between 4 and 16 inclusive.**n_minhash_bits**(integer): Number of minhash bits. It must be between 4 and 51 inclusive.**hll_bits + minhash_bits**(integer): The sum of index and minhash bits, it must be less than or equal to 64 bits inclusive.

**Returns:** (none)

### add

`add(policy, bin_name, items, n_index_bits)`

Adds values to HLL set. If HLL bin does not exist, use n_index_bits to create HLL bin.

`add_mh(policy, bin_name, items, n_index_bits, n_minhash_bits)`

Adds values to HLL set. If HLL bin does not exist, use n_index_bits and n_minhash_bits to create HLL bin.

`update(policy, bin_name, items)`

Adds values to HLL set. The HLL bin must already exist.

**Flags:** create_only, no_fail

**Arguments:**

**policy**(library_specific): HLL modify policy.**bin_name**(string): Name of bin.**n_index_bits**(integer): Number of index bits. Must be between 4 and 16 inclusive.**n_minhash_bits**(integer): Number of minhash bits. Must be between 4 and 51 inclusive.

**Returns:** (integer)

### set_union

`set_union(policy, bin_name, hlls)`

Sets union of specified list of HLLs with HLL bin.

**Flags:** create_only, update_only, allow_fold, no_fail

**Arguments:**

**policy**(library_specific): HLL modify policy.**bin_name**(string): Name of bin.**hlls**(list): List of HLL objects.

**Returns:** (none)

### refresh_count

`refresh_count(bin_name)`

Updates the cached count (if stale) and returns the count. For relative error see Error Bounds.

**Arguments:**

**bin_name**(string): Name of bin.

**Returns:** (integer)

### fold

`fold(bin_name, n_index_bits)`

Folds the HLL bin to the specified n_index_bits.

**Note**: Fails if existing HLL has n_minhash_bits set to non-zero.

**Arguments:**

**bin_name**(string): Name of bin.**n_index_bits**(integer): Number of index bits. Must be between 4 and 16 inclusive.

**Returns:** (none)

## Read Operations

### get_count

`get_count(bin_name)`

Estimate of the number of unique entries in the HLL set. For relative error see Error Bounds.

**Arguments:**

**bin_name**(string): Name of bin.

**Returns:** (integer)

### get_union

`get_union(bin_name, hlls)`

Returns an HLL object that is the union of all specified HLL objects in the hlls list with the HLL bin.

**Arguments:**

**bin_name**(string): Name of bin.**hlls**(list): List of HLL objects.

**Returns:** (HLL)

### get_union_count

`get_union_count(bin_name, hlls)`

Estimate of the number of elements that would be contained by the union of these HLL objects and the HLL bin. For relative error see Error Bounds.

**Arguments:**

**bin_name**(string): Name of bin.**hlls**(list): List of HLL objects.

**Returns:** (integer)

### get_intersect_count

`get_intersect_count(bin_name, hlls)`

Estimate of the number of elements that would be contained by the intersection of these HLL objects and the HLL bin. For relative error see Error Bounds.

**Arguments:**

**bin_name**(string): Name of bin.**hlls**(list): List of HLL objects.

**Returns:** (integer)

### get_similarity

`get_similarity(bin_name, hlls)`

Estimate of the similarity (or Jaccard Index) of these HLL objects and the HLL bin. For absolute error see Error Bounds.

**Arguments:**

**bin_name**(string): Name of bin.**hlls**(list): List of HLL objects.

**Returns:** (float)

### describe

`describe(bin_name)`

List containing the HLL bin's configured **n_index_bits** and **n_minhash_bits**.

**Arguments:**

**bin_name**(string): Name of bin.

**Returns:** (list)

## Error Bounds

Operation | Error | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

refresh_count | Relative Error: 1.04 / sqrt(2^{n_index_bits})
| |||||||||||||

get_count | ||||||||||||||

get_union_count | ||||||||||||||

get_intersect_count (relative error) | Absolute or Relative Error:When n_minhash_bits is zero, the error is unstable.When n_minhash_bits is non-zero, the error can be found with the following:Where e is the target error of similarity estimate and t is the minimum similarity of interest; choose n_minhash_bits and n_index_bitssuch that:n_minhash_bits ≥ log_{2}(6/et)n_index_bits ≥ log_{2}(e^{-2}) | |||||||||||||

get_similarity (absolute error) |

## Performance

Symbol | Description |
---|---|

N | Number of index bits. |

M | Number of minhash bits. |

S | Size of a HLL in bytes (M + N) × 2^{N} |

K | Number of HLLs being operated on. |

C | Cost of memcpy (added to all modifies). |

R | Cost of storage read (applies to any transaction - only once per transaction). |

W | Cost of writing to storage (applies to any modify transaction - only once per transaction). |

Operation | HyperLogLog (n_minhash_bits = 0) | HyperMinHash (n_minhash_bits > 0) |
---|---|---|

init | S | S |

add | 1 | 1 |

set_union | K × S | K × S |

refresh_count | S | S |

fold | K × S | K × S |

get_count | S | S |

get_union | K × S | K × S |

get_union_count | K × S | K × S |

get_intersect_count | K! × S 0 ≤ K ≤ 2 | K × S + S |

get_similarity | K! × S 0 ≤ K ≤ 2 | K × S + S |

describe | 1 | 1 |