A Comparison of JDBC & ODBC Drivers for Amazon Athena



The metrics in this article are from the most up-to-date drivers available as of August 2019.

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With the CData Drivers for Amazon Athena, you get top-of-the-line performance with those queries through standards-based interfaces such as JDBC and ODBC. In this article, we compare the read performance and client-side resource usage of the CData Drivers and the Amazon-supported drivers. We find that the CData Drivers consistently retrieve large data sets nearly three times faster than the Amazon Drivers and make better use of client-side resources to drive that performance.

Preparation

This article will compare two sets of drivers:

  1. The Amazon-supported ODBC Driver for Amazon Athena 1.0.5 [1] and the CData Software ODBC Driver for Amazon Athena [3]
  2. The Amazon-supported JDBC Driver for Amazon Athena 2.0.7 [2] and the CData Software JDBC Driver for Amazon Athena [4]

To provide a reproducible comparison, we copied the Amazon Customer Reviews dataset [5] to a bucket in our S3 instance and created a partitioned table in Athena (amazon_reviews_parquet) based on the S3 bucket.

Since the drivers are being compared side-by-side, the performance of the machine itself is relatively unimportant; what matters is how the drivers compare relative to one another.


Comparison



The relevant details for the table are below:

Table Size    Number of Rows    Number of Columns
2.74 GB       160,796,570       16

We compared the relative performance of the drivers by running the same queries with each driver. The queries are listed below:

  1. SELECT * FROM amazon_reviews_parquet LIMIT 250000
  2. SELECT * FROM amazon_reviews_parquet LIMIT 10000000


Results



For the JDBC drivers, we connected to Athena using the java.sql library in a basic Java application. For the ODBC drivers, we connected to Athena using a DSN from a sample C program. In both cases, every field of every row was read and stored. The times in the table below are averages over multiple runs, which should level out any outliers caused by spikes in network traffic and similar effects.
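The JDBC side of this procedure can be sketched roughly as follows. This is a minimal illustration, not the application used for the measurements: the class name ReadBenchmark, the helpers drain and averageSeconds, and the default of three runs are assumptions made for the example, and the JDBC URL is taken from the command line because its format depends on the driver being tested.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical harness: times a full read of a query result over several runs.
class ReadBenchmark {

    // Reads and stores every field of every row; returns the number of rows read.
    static long drain(ResultSet rs) throws SQLException {
        int columns = rs.getMetaData().getColumnCount();
        Object[] row = new Object[columns];
        long rows = 0;
        while (rs.next()) {
            for (int i = 1; i <= columns; i++) {
                row[i - 1] = rs.getObject(i); // JDBC column indexes start at 1
            }
            rows++;
        }
        return rows;
    }

    // Runs the workload the given number of times and returns the mean
    // wall-clock time per run, in seconds.
    static double averageSeconds(int runs, Runnable work) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            work.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / runs;
    }

    public static void main(String[] args) {
        if (args.length < 1) {
            System.err.println("usage: java ReadBenchmark <jdbc-url> [runs]");
            return;
        }
        String url = args[0]; // driver-specific connection string (assumption)
        int runs = args.length > 1 ? Integer.parseInt(args[1]) : 3;
        String sql = "SELECT * FROM amazon_reviews_parquet LIMIT 250000";

        double avg = averageSeconds(runs, () -> {
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                drain(rs);
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        });
        System.out.printf("average over %d runs: %.2f s%n", runs, avg);
    }
}
```

Each run opens its own connection, so the measured time covers connecting, query execution on Athena, and the client-side read. Whether the original tests timed the connection step is not stated, so treat that choice as part of the sketch.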

Query Times by Driver (in seconds)

Query               CData JDBC                Amazon JDBC    CData ODBC                Amazon ODBC
25,000 rows         15.43 (105.1% faster)     31.67          18.14 (161.8% faster)     47.50
10,000,000 rows     333.66 (179.1% faster)    931.23         353.03 (311.3% faster)    1,451.99
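The "% faster" figures follow from the ratio of the two query times. The sketch below shows that arithmetic; the class name Speedup and the method percentFaster are hypothetical names chosen for the example. Because it works from the rounded times printed in the table, its output can differ from the published percentages in the last digit.

```java
// Hypothetical helper reproducing the "% faster" arithmetic from the table.
class Speedup {

    // Percent faster = (slower time / faster time - 1) * 100.
    static double percentFaster(double slowerSeconds, double fasterSeconds) {
        return (slowerSeconds / fasterSeconds - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        String[] labels = {
            "JDBC, 25,000 rows", "ODBC, 25,000 rows",
            "JDBC, 10,000,000 rows", "ODBC, 10,000,000 rows"
        };
        // Each pair is {Amazon driver seconds, CData driver seconds} from the table.
        double[][] times = {
            {31.67, 15.43}, {47.50, 18.14},
            {931.23, 333.66}, {1451.99, 353.03}
        };
        for (int i = 0; i < labels.length; i++) {
            double pct = percentFaster(times[i][0], times[i][1]);
            System.out.printf("%-24s %.1f%% faster%n", labels[i], pct);
        }
    }
}
```

Note that "179% faster" means the query finishes in a little over a third of the time, that is, about 2.8 times as fast, not 1.79 times.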

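As the results show, the CData Drivers significantly outperformed the Amazon Drivers when working with large result sets. For the 10-million-row query, the CData JDBC Driver finished in roughly 36% of the Amazon JDBC Driver's time (about 2.8 times as fast), and the CData ODBC Driver finished in roughly 24% of the Amazon ODBC Driver's time (about 4.1 times as fast).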

JDBC Driver Resource Usage



While testing the read performance of the JDBC drivers, we also measured client-side resource usage, looking specifically at memory and CPU usage. The charts below were produced by running a sample Java program and capturing its CPU and memory usage with Java VisualVM. We used Java version 8 update 211 with a maximum heap size of 4.27 gigabytes.
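The measurements in this article were taken with Java VisualVM. As a rough illustration of what the heap figures represent, the same values can also be sampled from inside the JVM with the standard java.lang.management API. The sketch below is not part of the original test setup; the class name HeapSampler, the five samples, and the one-second interval are assumptions for the example, and it does not cover CPU usage.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical sampler: prints the JVM's heap usage once per second.
class HeapSampler {

    static final double BYTES_PER_MB = 1024.0 * 1024.0;

    // Heap currently in use by this JVM, in megabytes.
    static double usedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / BYTES_PER_MB;
    }

    // Maximum heap the JVM will attempt to use (typically set with -Xmx), in megabytes.
    static double maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / BYTES_PER_MB;
    }

    public static void main(String[] args) throws InterruptedException {
        int samples = 5; // assumed sample count for the demo
        System.out.printf("max heap: %.0f MB%n", maxHeapMb());
        for (int i = 0; i < samples; i++) {
            System.out.printf("used heap: %.1f MB%n", usedHeapMb());
            Thread.sleep(1000); // assumed one-second sampling interval
        }
    }
}
```

In a real measurement the sampling loop would run on a background thread while the query result is being read.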

For this comparison, we ran a query for 10 million rows: SELECT * FROM amazon_reviews_parquet LIMIT 10000000

CData Driver

Amazon Driver*

* Note the change in scale for the Heap graph.

Based on the graphs, we can infer that both drivers send a request to Amazon Athena and then wait for the response. Once Athena begins returning the results of the query, the drivers begin to process them.

The CData Driver retrieves and processes the bulk of the data quickly, averaging close to 600 MB of heap usage for the first minute and around 150 MB once the bulk of the data has been processed. The CPU usage graph shows the CData Driver using around 25% of the available CPU for most of the transaction before dropping to an average of around 15%, processing the data as fast as it is returned.

The Amazon driver also appears to start processing data as soon as Athena responds, but it uses less CPU and less heap throughout the transaction (averaging around 17% of CPU capacity and around 150 MB of heap usage). This lower CPU and heap usage translates into a longer time to process the results.

ODBC Driver Resource Usage



We also measured client-side resource usage while testing the ODBC drivers, looking specifically at network and CPU usage. The charts below were produced by running a sample C program and capturing its activity with Windows Resource Monitor.

For this comparison, we ran a query for 10 million rows: SELECT * FROM amazon_reviews_parquet LIMIT 10000000

CData Driver

Amazon Driver*

* Note the change in scale for the Network graph.

The graphs capture the same window of the sample C application for each driver: the period after Athena has started returning the results of the query. Based on the graphs, the CData ODBC Driver retrieves and processes the data simultaneously, retrieving data at around 200 Mbps. The Amazon driver appears to retrieve and process the data simultaneously as well, but at only about 50 Mbps, roughly one quarter of the CData Driver's rate.
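To put the network rates in more familiar units, megabits per second can be converted to megabytes per second by dividing by 8. The short sketch below does that for the two approximate rates read from the graphs; the class name Throughput and its method are hypothetical names for the example.

```java
// Hypothetical helper converting the observed network rates to megabytes per second.
class Throughput {

    // 8 bits per byte: megabits per second divided by 8 gives megabytes per second.
    static double mbpsToMegabytesPerSecond(double mbps) {
        return mbps / 8.0;
    }

    public static void main(String[] args) {
        double cdataMbps = 200.0;  // approximate rate read from the CData graph
        double amazonMbps = 50.0;  // approximate rate read from the Amazon graph
        System.out.printf("CData:  %.2f MB/s%n", mbpsToMegabytesPerSecond(cdataMbps));
        System.out.printf("Amazon: %.2f MB/s%n", mbpsToMegabytesPerSecond(amazonMbps));
        System.out.printf("Amazon / CData rate: %.2f%n", amazonMbps / cdataMbps);
    }
}
```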


Conclusion



The CData Drivers' performance far exceeds that of the Amazon-supported drivers. Our developers have spent countless hours optimizing how the drivers process the results returned by Amazon, to the point that the drivers appear to be limited only by web traffic and server processing times. This advantage is most pronounced when a driver has to process large amounts of data.

References



  1. https://docs.aws.amazon.com/athena/latest/ug/connect-with-odbc.html
  2. https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html
  3. https://www.cdata.com/drivers/athena/odbc
  4. https://www.cdata.com/drivers/athena/jdbc
  5. https://s3.amazonaws.com/amazon-reviews-pds/readme.html