Querying Presto By Using Jupyter Notebook
Jupyter Notebook is an open source, interactive, and web-based notebook that is used for data analysis and visualization.
You can now analyze data stored in the Aerospike database via Presto using the Jupyter Notebook with the PyHive Presto Python library.
The advantage of using a Jupyter Presto notebook is that you can not only explore data stored in Aerospike, but also use that data to create AI/ML models using popular Python libraries such as Pandas, NumPy, Scikit-learn, etc.
Setting up
Populate your Aerospike database with the data that you would like to explore. You can do this by using one of the Aerospike clients.
Ensure that Java 11 or later, Python 3.7 or later, and the Presto interface for PyHive 0.6.3 or later are installed on the system where you plan to run Jupyter Notebook.
Install Jupyter Notebook 6.1.4 or later.
Ensure that your Presto cluster is at version 340 of Presto or later, but not 351 or later.
Install and configure Aerospike Connect for Presto on the worker nodes of your Presto cluster.
If you know the schema of your data, provide it in a JSON file stored where the Presto connector can access it, as described in step 3 of the procedure for configuring.
Restart Presto.
Creating a notebook for analysis
At a command-line prompt on the system where Jupyter Notebook is installed, issue this command to start Jupyter Notebook:
jupyter notebook
In the Jupyter Notebook webpage, create a new Python3 notebook.
- Use the following configuration to connect to your Aerospike database by using Presto and the Presto connector:
from pyhive import presto
presto_conn = presto.connect(
host='<hostname>',
port=<port>,
catalog='<name>',
schema='<namespace>'
)
presto_cur = presto_conn.cursor()
where:
host
is the hostname or IP address of the Presto coordinator that you specified in the.properties
file for configuring the Presto connector.port
is the number of the port to use when connecting to the Presto coordinator.catalog
is the name of the.properties
file, without the file extension.schema
is the namespace to run queries against Aerospike.Run a SQL query. This example runs a SELECT statement:
presto_cur.execute('SELECT * FROM <schema>.<set>')
print(presto_cur.fetchall())
where:
<schema>
is the name of the Aerospike namespace.<set>
is the Aerospike set in which the data resides.
Refer to the section "Supported SQL statements" here for a complete list of the types of SQL statements that are supported by the the Presto connector.
Example
In this example, you use one notebook to populate your Aerospike database with data. Then, you provide a schema of that data to the Presto connector. Finally, you query the data by using another notebook.
Load data into your Aerospike database by using this notebook.
Copy this JSON into a file, and then specify the path to the file as the value of the property aerospike.table-desc-dir
in the configuration file for the Presto connector.
{
"schemaName": "test",
"tableName": "write_set",
"columns":[
{
"name": "id",
"type": "bigint",
"hidden": false
},
{
"name": "name",
"type": "varchar",
"hidden": false
},
{
"name": "age",
"type": "double",
"hidden": false
},
{
"name": "salary",
"type": "bigint",
"hidden": false
}
]
}
Restart your Presto cluster.
Run queries against the data by using this notebook.