Creating Spark Applications to Access Aerospike Database
Creating a Spark application that can access an Aerospike database requires downloading the Spark connector's package and then adding it to your application's environment.
Prerequisites for using the Spark connector
Ensure that you meet these prerequisites before installing Aerospike Connect for Spark:
- Your Spark cluster must be at version 2.4.x¹, 3.0.x, 3.1.x, or 3.2.x.
Versions of Spark that are supported by Aerospike Connect for Spark (aka the Spark connector)
Aerospike Connect for Spark version | Supported Apache Spark versions | JFrog Artifactory versions |
---|---|---|
3.5.5 | 3.0.x, 3.1.x, 3.2.x | 3.5.5-spark3.0-allshaded, 3.5.5-spark3.0-clientunshaded, 3.5.5-spark3.1-allshaded, 3.5.5-spark3.1-clientunshaded, 3.5.5-spark3.2-allshaded, 3.5.5-spark3.2-clientunshaded |
3.5.4 | 3.0.x, 3.1.x, 3.2.x | |
3.5.3 | 3.0.x, 3.1.x, 3.2.x | |
3.5.2 | 3.0.x, 3.1.x, 3.2.x | |
3.5.1 | 3.0.x, 3.1.x, 3.2.x | |
3.5.0 | 3.0.x, 3.1.x, 3.2.x | |
3.4.2 | 3.0.x, 3.1.x | |
3.4.1 | 3.0.x, 3.1.x | |
3.4.0 | 3.0.x, 3.1.x | |
3.3.1_spark3.1 | 3.1.x | |
3.3.1_spark3.0 | 3.0.x | |
3.2.2 | 3.0.x | |
3.2.1 | 3.0.x | |
3.2.0 | 3.0.x | |
3.1.1 | 3.0.x | |
3.1.0 | 3.0.x | |
3.0.3 | 3.0.x | |
3.0.2 | 3.0.x | |
3.0.1 | 3.0.x | |
3.0.0 | 3.0.x | |
2.9.0 | 2.4.x¹ | |
2.8.1 | 2.4.x¹ | |
¹ 2.9.0 is likely to be the last release that is compatible with the Apache Spark 2.4.7 binary. Aerospike has ceased developing new features to support Spark 2.x.y. However, we will make bug fixes available until October 12, 2023. Please plan to move to Apache Spark 3.0.x and use Aerospike Connect for Spark version 3.x.y.
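The compatibility table above can be turned into a small pre-flight check. The following is a minimal sketch, not part of the connector: the names `SUPPORTED` and `is_supported` are hypothetical, the rows are transcribed by hand from the table, and the exclusion of Spark 2.4.8 follows the download guidance later on this page.

```python
# Rows transcribed from the compatibility table: connector versions -> Spark release lines.
_ROWS = [
    (("3.5.0", "3.5.1", "3.5.2", "3.5.3", "3.5.4", "3.5.5"), ("3.0", "3.1", "3.2")),
    (("3.4.0", "3.4.1", "3.4.2"), ("3.0", "3.1")),
    (("3.3.1_spark3.1",), ("3.1",)),
    (("3.3.1_spark3.0",), ("3.0",)),
    (("3.2.0", "3.2.1", "3.2.2", "3.1.0", "3.1.1",
      "3.0.0", "3.0.1", "3.0.2", "3.0.3"), ("3.0",)),
    (("2.8.1", "2.9.0"), ("2.4",)),
]
SUPPORTED = {conn: lines for conns, lines in _ROWS for conn in conns}

# Spark 2.4.8 is called out on this page as not covered by the 2.x connector.
EXCLUDED_SPARK = {"2.4.8"}


def is_supported(connector_version: str, spark_version: str) -> bool:
    """Return True if the table lists this connector for the given Spark version."""
    lines = SUPPORTED.get(connector_version)
    if lines is None:
        raise KeyError(f"connector version not in table: {connector_version!r}")
    if spark_version in EXCLUDED_SPARK:
        return False
    # Reduce e.g. "3.2.1" to its release line "3.2" before comparing.
    spark_line = ".".join(spark_version.split(".")[:2])
    return spark_line in lines
```

For example, `is_supported("3.4.2", "3.2.0")` is false because the 3.4.x connectors are listed only for Spark 3.0.x and 3.1.x.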
Jar naming convention:
To support multiple Spark versions, we have changed the jar naming convention. Starting with the 3.3.0 release, all binaries are named aerospike-spark_x_spark_y_z.jar, where x is the connector version, y is the Spark version, and z is either allshaded or clientunshaded.
The binary name aerospike-spark-3.3.0_spark3.1_allshaded.jar indicates that the release version is 3.3.0, the supported Spark version is 3.1.x, and all internal libraries are shaded. Similarly, aerospike-spark-3.3.0_spark3.1_clientunshaded.jar indicates that all libraries except the Aerospike Java client are shaded.
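To make the naming convention concrete, here is a minimal sketch that splits a jar file name into its three parts. The helper `parse_jar_name` and the `JarInfo` type are hypothetical names introduced for illustration, and the pattern is modeled on the two example file names above rather than on an official specification.

```python
import re
from typing import NamedTuple


class JarInfo(NamedTuple):
    connector_version: str  # e.g. "3.3.0"
    spark_version: str      # e.g. "3.1" (meaning Spark 3.1.x)
    shading: str            # "allshaded" or "clientunshaded"


# Pattern modeled on the example names shown above (an assumption, not a spec).
_JAR_RE = re.compile(
    r"^aerospike-spark-(?P<connector>\d+(?:\.\d+)*)"
    r"_spark(?P<spark>\d+\.\d+)"
    r"_(?P<shading>allshaded|clientunshaded)\.jar$"
)


def parse_jar_name(name: str) -> JarInfo:
    """Split a 3.3.0-or-later connector jar name into its components."""
    match = _JAR_RE.match(name)
    if match is None:
        raise ValueError(f"not a 3.3.0+ connector jar name: {name!r}")
    return JarInfo(match["connector"], match["spark"], match["shading"])
```

Older names such as aerospike-spark-assembly-2.7.0.jar predate the convention, so the sketch rejects them with a `ValueError` instead of guessing.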
To find out when these different versions of the Spark connector were released, see the "Aerospike Connect for Spark Release Notes".
Although Aerospike Connect for Spark was tested only with Apache Spark releases, it should also work with the Spark versions available in DataProc in Google Cloud Platform (GCP) and EMR in Amazon Web Services (AWS).
- The Java 8 SDK must be installed on the system on which you plan to run Aerospike Connect for Spark. (Tip: If you want to test with different versions of the Java 8 SDK, consider using sdkman to help you manage those versions.)
- Your Aerospike Database Enterprise Edition cluster must be at version 5.0 or later if you plan to use Aerospike Connect for Spark version 2.0 or later.
- The connector does not bundle Spark- and Hadoop-related binaries within its jar, so your production system must have Spark and Hadoop installed.
Download and install the Spark connector's .jar file
Install using JFrog Artifactory
- In your build.sbt file, add the JFrog repository resolver: resolvers += "Artifactory Realm" at "https://aerospike.jfrog.io/artifactory/spark-connector"
- Specify the dependency as "com.aerospike" %% "aerospike-spark" % <<version>>, where the version is a value such as 3.5.5-spark3.0-allshaded. That value indicates that the dependency comes from the 3.5.5 release built for Spark 3.0 with all dependencies shaded.
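If you are not building with sbt, the same repository and dependency can in principle be passed to spark-submit through its standard `--repositories` and `--packages` options. The sketch below only assembles those arguments. It rests on two assumptions that you should verify: that the Artifactory repository resolves as a Maven-style repository, and that the artifact carries the Scala binary suffix `_2.12` (sbt's `%%` appends the Scala binary version to the artifact name). The function name `package_options` is hypothetical.

```python
from typing import List

# Repository URL taken from the resolver line above.
ARTIFACTORY_URL = "https://aerospike.jfrog.io/artifactory/spark-connector"


def package_options(version: str, scala_binary: str = "2.12") -> List[str]:
    """Build spark-submit options that fetch the connector from Artifactory.

    scala_binary defaults to "2.12", which is an assumption; use the Scala
    version your Spark distribution was built with.
    """
    # sbt's %% expands "aerospike-spark" to "aerospike-spark_<scala binary version>".
    coordinate = f"com.aerospike:aerospike-spark_{scala_binary}:{version}"
    return ["--repositories", ARTIFACTORY_URL, "--packages", coordinate]
```

Calling `package_options("3.5.5-spark3.0-allshaded")` yields a list that can be spliced into a spark-submit command line ahead of the application jar.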
Download
Download the appropriate version of the connector: version 3.x if you are using Apache Spark 3.0.x, and version 2.x if you are using Apache Spark 2.4.x (not including 2.4.8).
You can download the .jar package from the Aerospike Enterprise Downloads site.
Add the .jar package to your application's environment
You can do this in either of these ways:
- If you plan to create a batch job or a streaming job for real-time insights, write a Scala, Java, or Python application by following the interactive code in the Jupyter notebooks. Specify the downloaded JAR as a dependency. Once your Spark application is ready, submit it to the Spark cluster using either spark-submit or spark-shell. See Submitting Applications in the Spark documentation for detailed information.
Example using spark-submit
spark-submit --jars path-to-aerospike-spark-connector-jar --class application-entrypoint application.jar
- If you plan to create a Jupyter notebook that uses the Spark connector, add the JAR path to the environment variables.
Example using Python
import os
os.environ["PYSPARK_SUBMIT_ARGS"] = '--jars aerospike-spark-assembly-2.7.0.jar pyspark-shell'
Example using Scala
launcher.jars = ["aerospike-spark-assembly-2.7.0.jar"]
See our notebooks for other examples.
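When jobs are launched from a script rather than typed by hand, the spark-submit example above can be assembled as an argument list. This is a minimal sketch: the function name `spark_submit_command` and the paths and class name in the demo are hypothetical placeholders, and only the `--jars` and `--class` options shown earlier are used.

```python
import shlex
from typing import List


def spark_submit_command(connector_jar: str, entrypoint: str, application_jar: str) -> List[str]:
    """Mirror: spark-submit --jars <connector jar> --class <entrypoint> <application jar>."""
    return [
        "spark-submit",
        "--jars", connector_jar,   # path to the Aerospike Spark connector jar
        "--class", entrypoint,     # fully qualified main class of the application
        application_jar,
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    cmd = spark_submit_command(
        "/opt/jars/aerospike-spark-3.3.0_spark3.1_allshaded.jar",
        "com.example.Main",
        "application.jar",
    )
    # Print a shell-safe rendering of the command instead of running it.
    print(shlex.join(cmd))
```

Keeping the command as a list avoids shell-quoting problems when a path contains spaces. To run it, the list can be passed to `subprocess.run(cmd, check=True)` on a machine where spark-submit is on the PATH.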