A Comparison of JDBC Drivers for BigQuery



The metrics in this article were found using the most up-to-date drivers available as of June 2019.

BigQuery is "Google's fully managed, petabyte scale, low cost enterprise data warehouse for analytics" and provides a robust, widely-used way to store and access your data. For many companies, Google BigQuery is the first choice for a cloud-based analytics platform. Because BigQuery is cloud-based, there is no need for a database administrator and there is no infrastructure to maintain.

Preparation

This article will compare the performance of the Google-supported JDBC Driver for Google BigQuery1 to the CData Software JDBC Driver for Google BigQuery2 when querying larger datasets. In order to provide a reproducible comparison, we queried the tlc_yellow_trips_2018 table from the new_york_taxi_trips dataset in the public bigquery-public-data project3.


Comparison



The details for the tlc_yellow_trips_2018 table are below:

Table Size     Table Number of Rows     Number of Columns
18.1 GB 112,234,626 17

The main goal of this investigation was to compare the related performance of the drivers. We did this by running the same set of queries with each JDBC driver. We used LIMIT clauses to change the size of the dataset returned in each query, but we requested the same columns for each query.

Base Query

SELECT 
  vendor_id, 
  passenger_count, 
  trip_distance, 
  rate_code, 
  store_and_fwd_flag, 
  payment_type, 
  fare_amount, 
  extra, 
  mta_tax, 
  tip_amount, 
  tolls_amount, 
  imp_surcharge, 
  total_amount, 
  pickup_location_id, 
  dropoff_location_id 
FROM 
  `bigquery-public-data.new_york_taxi_trips.tlc_yellow_trips_2018`

Limit by Query

  1. LIMIT 100000
  2. LIMIT 1000000
  3. LIMIT 10000000


Results



To test the drivers, we connected to BigQuery using a basic Java application and executed the above queries repeatedly. The results were read and stored in a new string variable for each row. The times you see in the chart below are the average times for each query, in seconds.

Query Times by Driver (in seconds)
Query CData JDBC Google JDBC
1 (100,000 rows) 10.80 (+17%) 12.99
2 (1,000,000 rows) 103.49 (+17%) 125.17
3 (10,000,000 rows) 1,047.33 (+17%) 1,266.90

As can be seen in the results, the CData drivers regularly outperformed the Google drivers, retrieving and processing results 17% faster, regardless of the size of the result.

The average runtime for each query is compared in the chart below:

Results for All Queries


Conclusion



The CData driver's performance has improved since our last comparison and continues to offer improvement over the Google-supported driver. Our developers have labored to optimize the performance in processing the results returned by Google to the point that the drivers seem to only be hindered by web traffic and server processing times.

References



  1. Google Drivers
  2. CData JDBC Drivers
  3. City of New York Taxi Trip Public Dataset

Related Articles