A Comparison of JDBC & ODBC Drivers for Amazon Athena
The metrics in this article are from the most up-to-date drivers available as of August 2019.
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. With the CData Drivers for Amazon Athena, you get top-of-the-line performance for those queries through standards-based interfaces such as JDBC and ODBC. In this article, we compare read performance and client-side resource usage. We find that the CData Drivers consistently retrieve large data sets nearly three times faster than the Amazon Drivers and make better use of client-side resources to drive that performance.
Preparation
This article will compare two sets of drivers:
- The Amazon-supported ODBC Driver for Amazon Athena 1.0.5 [1] versus the CData Software ODBC Driver for Amazon Athena [3]
- The Amazon-supported JDBC Driver for Amazon Athena 2.0.7 [2] versus the CData Software JDBC Driver for Amazon Athena [4]
To provide a reproducible comparison, we copied the Amazon Customer Reviews dataset [5] to a bucket in our S3 instance and created a partitioned table in Athena (amazon_reviews_parquet) based on the S3 bucket.
Since the drivers are being compared side-by-side, the performance of the machine itself is relatively unimportant; what matters is how the drivers compare relative to one another.
Comparison
The relevant details for the table are below:
Table Size | Number of Rows | Number of Columns |
---|---|---|
2.74 GB | 160,796,570 | 16 |
We compared the relative performance of the drivers by running the same queries with each driver. The queries are listed below:
- SELECT * FROM amazon_reviews_parquet LIMIT 250000
- SELECT * FROM amazon_reviews_parquet LIMIT 10000000
Results
For the JDBC drivers, we connected to Athena using the java.sql library in a basic Java application. For the ODBC drivers, we connected to Athena using a DSN from a sample C program. In both cases, every field of every row was read and stored. The times in the table below are averages over multiple runs, which should level out outliers caused by spikes in network traffic and similar effects.
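The JDBC side of this procedure can be sketched as follows. This is a minimal illustration and not the test program used for the article: the class name `QueryTimer`, its method names, and the number of runs are assumptions, and the JDBC URL is a placeholder that depends on the driver being tested.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.concurrent.Callable;

// Hypothetical helper: the names here are illustrative, not from either driver.
final class QueryTimer {

    /** Reads every field of every row and returns the number of rows seen. */
    static long readAllFields(ResultSet rs) throws SQLException {
        int columns = rs.getMetaData().getColumnCount();
        Object[] row = new Object[columns];
        long rows = 0;
        while (rs.next()) {
            for (int i = 1; i <= columns; i++) {
                row[i - 1] = rs.getObject(i); // JDBC columns are 1-indexed
            }
            rows++;
        }
        return rows;
    }

    /** Runs the query once over a fresh connection and returns the row count. */
    static long runQuery(String jdbcUrl, String sql) throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            return readAllFields(rs);
        }
    }

    /** Average wall-clock seconds over the given number of executions of task. */
    static double averageSeconds(Callable<?> task, int runs) throws Exception {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.call();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e9 / runs;
    }
}
```

A call such as `QueryTimer.averageSeconds(() -> QueryTimer.runQuery(url, "SELECT * FROM amazon_reviews_parquet LIMIT 10000000"), 5)` would then time five full reads, where `url` is the driver-specific connection string. Timing the whole connect, execute, and read cycle keeps the comparison fair across drivers that buffer results differently.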
Query Times by Driver (in seconds)

Query | CData JDBC | Amazon JDBC | CData ODBC | Amazon ODBC |
---|---|---|---|---|
25,000 rows | 15.43 (105.1% faster) | 31.67 | 18.14 (161.8% faster) | 47.50 |
10,000,000 rows | 333.66 (179.1% faster) | 931.23 | 353.03 (311.3% faster) | 1,451.99 |
As the results show, the CData Drivers significantly outperformed the Amazon Drivers when working with large result sets. For the 10-million-row query, the CData JDBC Driver finished in just over a third of the time taken by the Amazon JDBC Driver, and the CData ODBC Driver finished in about a quarter of the time taken by the Amazon ODBC Driver.
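The "% faster" figures in the table appear to follow the usual convention of (slower time / faster time - 1) x 100. The sketch below assumes that convention; the class and method names are illustrative. Applied to the rounded times in the table, it reproduces the published percentages to within a few tenths of a percentage point, with the small differences presumably due to rounding of the underlying measurements.

```java
// Hypothetical helper for the speedup arithmetic; not part of either driver.
final class Speedup {

    /** How many times as fast the quicker driver is: slow time / fast time. */
    static double ratio(double slowSeconds, double fastSeconds) {
        if (slowSeconds <= 0 || fastSeconds <= 0) {
            throw new IllegalArgumentException("times must be positive");
        }
        return slowSeconds / fastSeconds;
    }

    /** "Percent faster" under the assumed convention: (ratio - 1) * 100. */
    static double percentFaster(double slowSeconds, double fastSeconds) {
        return (ratio(slowSeconds, fastSeconds) - 1.0) * 100.0;
    }
}
```

For example, `Speedup.percentFaster(931.23, 333.66)` evaluates the 10-million-row JDBC comparison, and `Speedup.ratio(1451.99, 353.03)` gives the corresponding ODBC speedup as a multiple.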
JDBC Driver Resource Usage
While testing the read performance of the JDBC drivers, we also measured client-side resource usage, looking specifically at memory and CPU usage. The charts below were produced by running a sample Java program and capturing its CPU and memory usage with Java VisualVM. We used Java 8 update 211 with a maximum heap size of 4.27 GB.
For this comparison, we ran a query for 10 million rows: SELECT * FROM amazon_reviews_parquet LIMIT 10000000
CData Driver
Amazon Driver*
* Note the change in scale for the Heap graph.
Based on the graphs, we can infer that both drivers send a request to Amazon Athena and then wait for the response. Once Athena begins returning the results of the query, the drivers begin to process them.
The CData Driver retrieves and processes the bulk of the data quickly, averaging close to 600 MB of heap usage for the first minute and then around 150 MB once the bulk of the data has been processed. The CPU usage graph shows the CData Driver using around 25% of the available CPU for most of the transaction before dropping to an average of around 15%, processing the data as fast as it is returned.
The Amazon Driver also appears to start processing data as soon as Athena responds, but it uses less CPU capacity and less heap throughout the transaction, averaging around 17% of CPU capacity and around 150 MB of heap. This lower CPU and heap usage corresponds to a longer time to process the results.
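The heap figures above were captured with Java VisualVM. As a rough in-process alternative, heap usage can also be sampled from inside the test program with the standard `Runtime` API. The sketch below is an assumption-laden illustration, not the method used for the charts: the class name `HeapSampler` and the sampling interval are hypothetical, and it reports heap only, not CPU.

```java
// Hypothetical sampler; VisualVM, not this class, produced the article's charts.
final class HeapSampler {

    private static final double BYTES_PER_MB = 1024.0 * 1024.0;

    /** Heap currently in use by this JVM, in megabytes. */
    static double usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        return (rt.totalMemory() - rt.freeMemory()) / BYTES_PER_MB;
    }

    /** Maximum heap the JVM will attempt to use (the -Xmx limit), in megabytes. */
    static double maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / BYTES_PER_MB;
    }

    /** Starts a daemon thread that prints one sample per interval until interrupted. */
    static Thread start(long intervalMillis) {
        Thread sampler = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    System.out.printf("heap used: %.1f MB of %.1f MB max%n",
                            usedHeapMb(), maxHeapMb());
                    Thread.sleep(intervalMillis);
                }
            } catch (InterruptedException e) {
                // Interrupted while sleeping: stop sampling.
            }
        }, "heap-sampler");
        sampler.setDaemon(true);
        sampler.start();
        return sampler;
    }
}
```

Calling `HeapSampler.start(1000)` before running the query and `interrupt()` on the returned thread afterwards yields a once-per-second trace. Because the figure includes garbage not yet collected, it is best read as a trend rather than an exact measure of live data.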
ODBC Driver Resource Usage
We also measured client-side resource usage while testing the ODBC drivers, looking specifically at network and CPU usage. The charts below were produced by running a sample C program and capturing its usage with Windows Resource Monitor.
For this comparison, we ran a query for 10 million rows: SELECT * FROM amazon_reviews_parquet LIMIT 10000000
CData Driver
Amazon Driver*
* Note the change in scale for the Network graph.
The graphs capture the same window of the sample C application for each driver: the period after Athena has started returning the results of the query. Based on the graphs, the CData ODBC Driver retrieves and processes the data simultaneously, pulling data at around 200 Mbps. The Amazon Driver also appears to retrieve and process the data simultaneously, but does so at only around 50 Mbps, approximately one quarter of the CData Driver's rate.
Conclusion
The performance of the CData Drivers far exceeds that of the Amazon-supported drivers. Our developers have spent countless hours optimizing how the drivers process the results returned by Amazon, to the point that the drivers seem to be limited only by network traffic and server processing times. The difference is most pronounced when a driver is required to process large amounts of data.