A Comparison of JDBC & ODBC Drivers for BigQuery



BigQuery is "Google's fully managed, petabyte scale, low cost enterprise data warehouse for analytics" and provides a robust, widely-used way to store and access your data. For many companies, Google BigQuery is the first choice for a cloud-based analytics platform. Because BigQuery is cloud-based, there is no need for a database administrator and there is no infrastructure to maintain.

Preparation

This article will serve to compare the Google-supported Simba ODBC Driver for Google BigQuery 2.0.6.10111 to the CData Software ODBC Driver for Google BigQuery 20162 and the Simba JDBC Driver for Google BiqQuery 1.0.6.10081 to the CData Software JDBC Driver for Google BigQuery 20163. In order to provide a reproducible comparison, we copied the trips table from the yellow dataset in the public nyc-tlc project4 (table ID: nyc-tlc:yellow.trips) to a private dataset. For clarity, we renamed the test table to nyc_yellow_trips.

The test machine specifications are as follows:
Operating System: Windows 7 Ultimate, SP1
Processor: Intel® CoreTM i3-2120 CPU @ 3.30GHz
Installed Memory (RAM): 8.00 GB
System type: 64-bit Operating System

Since the drivers are being compared side-by-side, the performance of the machine itself is relatively unimportant; what matters is how the drivers compare relative to one another.


Comparison



The relevant details for the table are below:

Table Size     Table Number of Rows     Number of Columns
130 GB 1,108,779,463 19

The main goal of this investigation was to compare the related performance of the drivers. We did this by running the same queries with each driver. The queries are listed below:

  1. SELECT * FROM nyc_yellow_trips LIMIT 100000
  2. SELECT * FROM nyc_yellow_trips LIMIT 1000000
  3. SELECT * FROM nyc_yellow_trips LIMIT 10000000


Results



For the ODBC Drivers, we connected to BigQuery using a DSN from an ADO.NET console application and executed the above queries repeatedly. The results were read and stored in a new string variable for each row. The times you see in the chart below are based on averages, which should serve to level out any outliers due to spike in network traffic, etc.

Query Times by Driver (in seconds)
Query Simba ODBC CData ODBC Simba JDBC CData JDBC
1 (100,000 rows) 34.07 16.68 (+104%) 71.63 18.99 (+277%)
2 (1,000,000 rows) 461.63 233.56 (+98%) 318.04 149.11 (+113%)
3 (10,000,000 rows) * 1,748.94 ** 1,771.29

* We were unable to retrieve 10 million rows using the Simba ODBC Driver without receiving a System.StackOverflowException
** We were unable to get consistent results for 10 million rows using the Simba JDBC Driver as we regularly encountered errors like "Error fetching results from server." and "Error trying to obtain Google BigQuery object."

As can be seen in the results, the CData drivers significantly outperformed the Simba drivers when working with large result sets, regularly retrieving and processing results twice as fast. It is noteworthy that the CData drivers were consistently able to process 10 million rows, whereas the Simba ODBC driver was unable to process such a large result set and the Simba JDBC driver was only able to do so sparingly.

The average runtime for each query is compared in the charts below:

Results for 100,000 Rows

Results for 1,000,000 Rows


Conclusion



The CData driver's performance far exceeds that of the Google-supported Simba driver. Our developers have spent countless hours optimizing the performance in processing the results returned by Google to the point that the drivers seem to only be hindered by web traffic and server processing times. This performance is particularly highlighted when the driver is required to process large amounts of data.

References



  1. https://cloud.google.com/bigquery/partners/simba-drivers/
  2. https://www.cdata.com/drivers/bigquery/odbc
  3. https://www.cdata.com/drivers/bigquery/jdbc
  4. https://cloud.google.com/bigquery/public-data/nyc-tlc-trips

Related Articles