Creating Spark Applications to Access Aerospike Database
Creating a Spark application that can access an Aerospike database requires downloading the appropriate Spark connector's jar and then adding that to your application's environment.
Prerequisites for using the Spark connector
Ensure that you meet these prerequisites before installing Aerospike Connect for Spark:
- Your Spark cluster must be at version 2.4.x (see note below), 3.0.x, 3.1.x, 3.2.x, 3.3.x, or 3.4.x.
- From release 4.0.0 onward, the connector supports multiple Scala versions (where supported by the corresponding Apache Spark version).
| Aerospike Connect for Spark version | Supported Apache Spark versions | JFrog Artifactory version |
|---|---|---|
| 4.0.0 | 3.0.x, 3.1.x, 3.2.x, 3.3.x, 3.4.x | 4.0.0-spark3.4-scala2.13-allshaded |
| 3.5.5 | 3.0.x, 3.1.x, 3.2.x | 3.5.5_spark_3.0_allshaded |
| 3.5.4 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.3 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.2 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.1 | 3.0.x, 3.1.x, 3.2.x | |
| 3.5.0 | 3.0.x, 3.1.x, 3.2.x | |
Note: 2.9.0 is expected to be the last release compatible with the Apache Spark 2.4.7 binary. Aerospike has stopped developing new features for Spark 2.x, but will make bug fixes available until October 12, 2023. Plan to move to Apache Spark 3.x and the corresponding latest connector version.
Jar naming convention:
To support multiple Spark versions, the jar naming convention has changed.
- Prior to the 4.0.0 release, binaries are named aerospike-spark_x_spark_y_z.jar, where x is the connector version, y is the Spark version, and z is either allshaded or clientunshaded. For example, the binary name aerospike-spark-3.3.0_spark3.1_allshaded.jar indicates release version 3.3.0, supported Spark version 3.1.x, and that all internal libraries are shaded. Similarly, aerospike-spark-3.3.0_spark3.1_clientunshaded.jar indicates that all libraries except the Aerospike Java client are shaded.
- To accommodate support for multiple Scala versions, from the 4.0.0 release onward all binaries follow the general format [connector-version]-[spark-version]-[supported-scala-version]-[allshaded/clientunshaded]. Refer to the table above for all supported versions.
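As an illustration, the 4.0.0+ format can be sketched as a small Python helper. The function name and version strings below are illustrative only (they are not part of the connector), and the layout assumes the dot/dash style shown for 4.0.0 in the table above:

```python
def artifactory_version(connector: str, spark: str, scala: str, shading: str) -> str:
    """Compose a 4.0.0+ style connector version string:
    [connector-version]-[spark-version]-[supported-scala-version]-[allshaded/clientunshaded].

    Illustrative helper only; the components must correspond to a
    published artifact (see the version table above).
    """
    if shading not in ("allshaded", "clientunshaded"):
        raise ValueError("shading must be 'allshaded' or 'clientunshaded'")
    return f"{connector}-spark{spark}-scala{scala}-{shading}"

# Reproduces the 4.0.0 Artifactory version from the table above.
print(artifactory_version("4.0.0", "3.4", "2.13", "allshaded"))
```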
To find out when these different versions of the Spark connector were released, see the "Aerospike Connect for Spark Release Notes".
Although Aerospike Connect for Spark was tested only against the Apache Spark versions listed above, it should also work with the Spark versions available in Dataproc on Google Cloud Platform (GCP) and EMR on Amazon Web Services (AWS).
- The Java 8 SDK must be installed on the system on which you plan to run Aerospike Connect for Spark. (Tip: If you want to test with different versions of the Java 8 SDK, consider using sdkman to help you manage those versions.)
- Your Aerospike Database Enterprise Edition cluster must be at version 5.0 or later if you plan to use Aerospike Connect for Spark version 2.0 or later.
- The connector does not bundle Spark or Hadoop binaries within its jar, so Spark and Hadoop must be installed on your production system.
Spark connector installation
Install using JFrog Artifactory
- In your build.sbt file, add the JFrog repository resolver:
resolvers += "Artifactory Realm" at "https://aerospike.jfrog.io/artifactory/spark-connector"
- Specify the dependency as "com.aerospike" %% "aerospike-spark" % <<version>>, where version is the JFrog Artifactory version listed in the table above.
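Putting both steps together, a minimal build.sbt might look like the following. This is a sketch; the Scala, Spark, and connector versions shown are placeholders, so pick the Artifactory version that matches your Spark/Scala combination from the table above:

```scala
// Minimal sbt build for a Spark application using the Aerospike connector.
// All version numbers below are illustrative placeholders.
scalaVersion := "2.13.10"

resolvers += "Artifactory Realm" at "https://aerospike.jfrog.io/artifactory/spark-connector"

libraryDependencies ++= Seq(
  // Spark is "provided" because the cluster supplies it at runtime
  // (the connector does not bundle Spark or Hadoop binaries).
  "org.apache.spark" %% "spark-sql" % "3.4.0" % "provided",
  // Substitute the Artifactory version from the table above.
  "com.aerospike" %% "aerospike-spark" % "4.0.0-spark3.4-scala2.13-allshaded"
)
```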
- Download the appropriate version of the connector for the Apache Spark version (2.x or 3.x) you are using. Apache Spark version 2.4.8 is not supported.
- You can download the .jar package from the Aerospike Downloads site.
Add the .jar package to your application's environment
You can do this in either of these ways:
- If you plan to create a batch or streaming job, write a Scala, Java, or Python application by following the interactive code in the Jupyter notebooks, and specify the downloaded JAR as a dependency. Once your Spark application is ready, submit it to the Spark cluster using either spark-submit or spark-shell. See Submitting Applications in the Spark documentation for detailed information.
spark-submit --jars path-to-aerospike-spark-connector-jar --class application-entrypoint application.jar
If you plan to create a Jupyter notebook that uses the Spark connector, add the JAR path to the environment variables.
Example using Python
os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars aerospike-spark-assembly-2.7.0.jar pyspark-shell'
Example using Scala
launcher.jars = ["aerospike-spark-assembly-2.7.0.jar"]
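For instance, a notebook cell that wires the connector jar into PySpark before the session is created might look like this. The jar filename is the one from the example above; substitute the jar that matches your Spark and Scala versions, and note that the SparkSession lines are commented out here because they assume PySpark is installed:

```python
import os

# Path to the downloaded connector jar (substitute the jar matching
# your Spark/Scala versions; see the naming convention above).
connector_jar = "aerospike-spark-assembly-2.7.0.jar"

# PySpark reads this variable when it launches the JVM, so it must be
# set before the first SparkSession is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {connector_jar} pyspark-shell"

# Afterwards, creating the session picks up the connector automatically:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("aerospike-example").getOrCreate()
```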
See our notebooks for other examples.