
Creating Spark Applications to Access Aerospike Database

Creating a Spark application that can access an Aerospike database requires downloading the Spark connector's package and then adding that to your application's environment.

Prerequisites for using the Spark connector

Ensure that you meet these prerequisites before installing Aerospike Connect for Spark:

  • Your Spark cluster must be at version 2.4.x¹, 3.0.x, 3.1.x, or 3.2.x.

Versions of Spark that are supported by Aerospike Connect for Spark (aka the Spark connector)

| Aerospike Connect for Spark version | Supported Apache Spark versions | JFrog Artifactory versions |
| --- | --- | --- |
| 3.5.5 | 3.0.x, 3.1.x, 3.2.x | 3.5.5-spark3.0-allshaded, 3.5.5-spark3.0-clientunshaded, 3.5.5-spark3.1-allshaded, 3.5.5-spark3.1-clientunshaded, 3.5.5-spark3.2-allshaded, 3.5.5-spark3.2-clientunshaded |
| 3.5.4 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.3 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.2 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.1 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.0 | 3.0.x, 3.1.x, 3.2.x | |
| 3.4.2 | 3.0.x, 3.1.x | |
| 3.4.1 | 3.0.x, 3.1.x | |
| 3.4.0 | 3.0.x, 3.1.x | |
| 3.3.1_spark3.1 | 3.1.x | |
| 3.3.1_spark3.0 | 3.0.x | |
| 3.2.2 | 3.0.x | |
| 3.2.1 | 3.0.x | |
| 3.2.0 | 3.0.x | |
| 3.1.1 | 3.0.x | |
| 3.1.0 | 3.0.x | |
| 3.0.3 | 3.0.x | |
| 3.0.2 | 3.0.x | |
| 3.0.1 | 3.0.x | |
| 3.0.0 | 3.0.x | |
| 2.9.0 | 2.4.x¹ | |
| 2.8.1 | 2.4.x¹ | |

¹ Apache Spark version 2.4.8 is not supported.
caution

2.9.0 is likely to be the last release compatible with the Apache Spark 2.4.7 binary. Aerospike has ceased developing new features for Spark 2.x.y, but will make bug fixes available until October 12, 2023. Plan to move to Apache Spark 3.0.x and use Aerospike Connect for Spark version 3.x.y.

JAR naming convention:

To support multiple Spark versions, the JAR naming convention has changed. Starting with the 3.3.0 release, all binaries are named aerospike-spark-<x>_spark<y>_<z>.jar, where <x> is the connector version, <y> is the Spark version, and <z> is either allshaded or clientunshaded. For example, the name aerospike-spark-3.3.0_spark3.1_allshaded.jar indicates release version 3.3.0, built for Spark 3.1.x, with all internal libraries shaded. Similarly, aerospike-spark-3.3.0_spark3.1_clientunshaded.jar indicates that all libraries except the Aerospike Java client are shaded.

To find out when these different versions of the Spark connector were released, see the "Aerospike Connect for Spark Release Notes".

info

Although Aerospike Connect for Spark was tested against the Apache Spark versions listed above, it should also work with the Spark versions available in Dataproc on Google Cloud Platform (GCP) and EMR on Amazon Web Services (AWS).

  • The Java 8 SDK must be installed on the system on which you plan to run Aerospike Connect for Spark. (Tip: If you want to test with different versions of the Java 8 SDK, consider using sdkman to help you manage those versions.)
  • Your Aerospike Database Enterprise Edition cluster must be at version 5.0 or later if you plan to use Aerospike Connect for Spark version 2.0 or later.
  • The connector does not bundle Spark- or Hadoop-related binaries within its JAR, so Spark and Hadoop must be installed on your production system.

Download and install the Spark connector's .jar file

Install using JFrog Artifactory

  • In your build.sbt file, add the JFrog repository resolver: resolvers += "Artifactory Realm" at "https://aerospike.jfrog.io/artifactory/spark-connector"
  • Specify the dependency as "com.aerospike" %% "aerospike-spark" % <<version>>, where the version takes values such as 3.5.5-spark3.0-allshaded, which indicates the 3.5.5 release built for Spark 3.0 with all dependencies shaded. Both steps are combined in the sketch below.
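
A minimal build.sbt sketch combining both steps (the project name and Scala version are assumptions; substitute the connector version that matches your Spark cluster):

    // build.sbt: a sketch assuming Scala 2.12 and a Spark 3.0.x cluster
    name := "my-aerospike-spark-app"  // hypothetical project name
    scalaVersion := "2.12.15"

    // Resolve the connector from the Aerospike JFrog repository
    resolvers += "Artifactory Realm" at "https://aerospike.jfrog.io/artifactory/spark-connector"

    // The 3.5.5 release built for Spark 3.0 with all dependencies shaded
    libraryDependencies += "com.aerospike" %% "aerospike-spark" % "3.5.5-spark3.0-allshaded"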

Download

Download the appropriate 3.x version of the connector if you are using Apache Spark 3.0.x, 3.1.x, or 3.2.x, and a 2.x version if you are using Apache Spark 2.4.x (not including 2.4.8).

You can download the .jar package from the Aerospike Enterprise Downloads site.

Add the .jar package to your application's environment

You can do this in either of these ways:

  • If you plan to create a batch job, or to gain real-time business insights with a streaming job, write a Scala, Java, or Python application by following the interactive code in the Jupyter notebooks, and specify the downloaded JAR as a dependency. Once your Spark application is ready, submit it to the Spark cluster using either spark-submit or spark-shell. See Submitting Applications in the Spark documentation for detailed information.

Example using spark-submit

spark-submit --jars path-to-aerospike-spark-connector-jar --class application-entrypoint application.jar
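
For reference, here is a minimal Scala application sketch that reads an Aerospike set into a DataFrame through the connector's aerospike data source. The seed host, namespace, set, and class names are illustrative assumptions:

import org.apache.spark.sql.SparkSession

object AerospikeReadExample {  // hypothetical entry point; pass it to --class
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("aerospike-read-example")
      .getOrCreate()

    // Load the "testSet" set from the "test" namespace into a DataFrame
    val df = spark.read
      .format("aerospike")
      .option("aerospike.seedhost", "127.0.0.1") // a seed node of your Aerospike cluster
      .option("aerospike.port", "3000")
      .option("aerospike.namespace", "test")
      .option("aerospike.set", "testSet")
      .load()

    df.show()
    spark.stop()
  }
}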

  • If you plan to create a Jupyter notebook that uses the Spark connector, add the JAR path to the environment variables.

    Example using Python

    import os
    os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars aerospike-spark-assembly-2.7.0.jar pyspark-shell'

    Example using Scala

    launcher.jars = ["aerospike-spark-assembly-2.7.0.jar"]  
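
    Once the JAR is on the notebook's classpath, writing a DataFrame back to Aerospike follows the same data-source pattern. The following is a sketch, assuming a DataFrame df whose id column should serve as the record key; the namespace, set, and column names are illustrative:

    import org.apache.spark.sql.SaveMode

    // Write df to the "test" namespace, keying records by the "id" column
    df.write
      .mode(SaveMode.Overwrite)
      .format("aerospike")
      .option("aerospike.seedhost", "127.0.0.1")
      .option("aerospike.port", "3000")
      .option("aerospike.namespace", "test")
      .option("aerospike.set", "testSet")
      .option("aerospike.updateByKey", "id")
      .save()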

    See our notebooks for other examples.