Macs in Chemistry

Insanely great science

 

Using JDBC Driver to access Athena from Vortex

This tutorial was created by Jackson Pullman (jackson.pullman@yale.edu)

Getting Started

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL. Most results are delivered within seconds. With Athena, there’s no need for complex ETL jobs to prepare your data for analysis. This makes it easy for anyone with SQL skills to quickly analyze large-scale datasets.

This guide will show you how to create a connection to Athena from Vortex and pull in your desired data through an sql query. Before we get started here are a few prerequisites.

Make sure you have downloaded the JDBC Driver here: https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html.

You may have to upgrade your JDK to the newest version as well. If you do not have Java installed on your system you can do so here: https://www.oracle.com/java/technologies/javase-downloads.html.

Move this newly downloaded Athena jar file to a 'libs' folder in your Vortex Folder. Create a new document called AthenaQuery and save it as a .vpy file in your \vortex\scripts folder. Vortex script extensions must be saved in your \vortex\scripts folder to work properly. The body of your AthenaQuery.vpy file should be copy and pasted from below. You will need to have your Amazon Public and Private Access keys handy, and an s3 location where your output result set can be stored. The script will first ask the user to input this information as well as a few other parameters such as what Athena database you would like to connect to and your SQL query, and then create a connection to Athena to properly query the data. The remainder of the script properly formats the result set so that it can be displayed in Vortex.

Athena_Query.vpy

#Import the JDBC Driver jar file from your libs folder into your current Vortex Folder
import sys
from os import sep
import glob
libsfolder = vortex.getVortexFolder() + sep + "libs"
files = [f for f in glob.glob(libsfolder + "**/*.jar")]
for i in files:
    sys.path.append(i)

#Import Required Packages
import com
import com.simba.athena
import javax.sql.DataSource
import javax.sql.ConnectionPoolDataSource
import javax.swing.JOptionPane

#Accept user input on what Athena source you're pulling from, your access keys, query, and other parameters
public = javax.swing.JOptionPane.showInputDialog(None, “Enter your public Athena access key:”)
private = javax.swing.JOptionPane.showInputDialog(None, “Enter your private Athena access key:”)
Data_Source = javax.swing.JOptionPane.showInputDialog(None, "What Data Source (Catalog) would you like to query from? (Leave blank for AwsDataCatalog)")
Database = javax.swing.JOptionPane.showInputDialog(None, "What Database (schema) would you like to query from? (Leave blank for default)")
query = javax.swing.JOptionPane.showInputDialog(None, "Enter your full SQL Query:”)
s3Loc = javax.swing.JOptionPane.showInputDialog(None, “Enter an s3 path to store the data:”)
workspace = javax.swing.JOptionPane.showInputDialog(None, "What would you like to name your workspace:”)

#Build the Connection Url which will query the correct source
Connection_Url = "jdbc:awsathena://AwsRegion=us-east-1;User=" + public + ";Password=“ + private + ";S3OutputLocation=“ + s3Loc + ";ConnectionTest=1;Schema=" + Database + ";Catalog=" + Data_Source + ";"

# Create the connection to the Athena DataSource
ds = com.simba.athena.jdbc.DataSource()
ds.setURL(Connection_Url)
connection = ds.getConnection()
statement = connection.createStatement()

#Execute the query and save the result set and metadata
rs = statement.executeQuery(query)
rsMetaData = rs.getMetaData()

#Get the names and number of retrieved columns from the metadata file
count = rsMetaData.getColumnCount()
colNames = []
for x in range(1,count + 1):
    colNames = colNames + [str(rsMetaData.getColumnName(x))]

# Sequentially parse through the data in the result set, adding it to a 2d Vortex Array
AllData = []
while(rs.next()):
    arr = []
    for x in range(count):
        arr = arr + [str(rs.getString(colNames[x]))]
    AllData = AllData + [arr]

#Output column names and the 2d array of results into a vortex workspace.
arrayToWorkspace(AllData,colNames, workspace)

Running the script

From Vortex, click the scripts tab from the header menu. If you have saved your Athena_Query.vpy file in the correct \vortex\scripts folder you should be able to select the script and it will automatically run. You will be prompted with input boxes to run your query, and depending on the size of your set, should see your data imported soon.

Last updated 14 June 2021