Work with HPCC Systems Data in Apache Spark Using SQL

Access and process HPCC Systems Data in Apache Spark using the CData JDBC Driver.

Apache Spark is a fast and general engine for large-scale data processing. When paired with the CData JDBC Driver for HPCC Systems, Spark can work with live HPCC Systems data. This article describes how to connect to and query HPCC Systems data from a Spark shell.

The CData JDBC Driver offers unmatched performance for interacting with live HPCC Systems data due to optimized data processing built into the driver. When you issue complex SQL queries to HPCC Systems, the driver pushes supported SQL operations, like filters and aggregations, directly to HPCC Systems and utilizes the embedded SQL engine to process unsupported operations (often SQL functions and JOIN operations) client-side. With built-in dynamic metadata querying, you can work with and analyze HPCC Systems data using native data types.

Install the CData JDBC Driver for HPCC Systems

Download the CData JDBC Driver for HPCC Systems installer, unzip the package, and run the JAR file to install the driver.

Start a Spark Shell and Connect to HPCC Systems Data

  1. Open a terminal and start the Spark shell, passing the CData JDBC Driver for HPCC Systems JAR file as the --jars parameter. The path contains spaces, so it must be quoted:

    $ spark-shell --jars "/CData/CData JDBC Driver for HPCC Systems/lib/cdata.jdbc.hpcc.jar"
  2. With the shell running, you can connect to HPCC Systems with a JDBC URL and use the SQL Context load() function to read a table.

    To connect, set the following connection properties. Set URL to the machine name or IP address of the server and the port the server is running on, for example, https://server:port. Set User and Password to the credentials used to authenticate to the HPCC Systems cluster specified in the URL. Note that LDAP authentication is not currently supported.

    Set Version to the WsSQL Web server version. Note that if you have not already done so, you will need to install the WsSQL service on the HPCC Systems server. The WsSQL Web service is used to interact with the underlying HPCC Systems platform.

    Set Cluster to the target cluster.

    Built-in Connection String Designer

    For assistance in constructing the JDBC URL, use the connection string designer built into the HPCC Systems JDBC Driver. Either double-click the JAR file or execute it from the command line.

    java -jar cdata.jdbc.hpcc.jar

    Fill in the connection properties and copy the connection string to the clipboard.

    Configure the connection to HPCC Systems, using the connection string generated above.

    scala> val hpcc_df = spark.sqlContext.read.format("jdbc").option("url", "jdbc:hpcc:URL=;User=test;password=xA123456;Version=1;Cluster=hthor;").option("dbtable","hpcc::test::orders").option("driver","cdata.jdbc.hpcc.HPCCDriver").load()
  3. Once you connect and the data is loaded, you will see the table schema displayed.
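  4. Register the HPCC Systems data as a temporary view. The view name used here, orders, is a simple identifier, since the colons in hpcc::test::orders are not valid in an unquoted Spark SQL name:

    scala> hpcc_df.createOrReplaceTempView("orders")
  5. Perform custom SQL queries against the data using commands like the one below. String literals such as the city name must be enclosed in single quotes:

    scala> hpcc_df.sqlContext.sql("SELECT CustomerName, Price FROM orders WHERE ShipCity = 'New York'").collect.foreach(println)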

    The matching rows are printed to the console, one per line.
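
As a supplementary sketch, the connection properties described in step 2 can be assembled into the JDBC URL and the reader options programmatically instead of being typed as one long string. The property names (URL, User, Password, Version, Cluster), the driver class, and the semicolon-delimited jdbc:hpcc: format are taken from the example in this article. The object HpccJdbc and its methods buildUrl and readerOptions are hypothetical helpers introduced here for illustration and are not part of the CData driver or of Spark. The values are placeholders, and values that themselves contain semicolons are not handled.

```scala
// Hypothetical helper (not part of the CData driver or Spark): assembles the
// semicolon-delimited JDBC URL and the option map for Spark's JDBC reader.
object HpccJdbc {
  // Driver class name as given in this article's example.
  val DriverClass = "cdata.jdbc.hpcc.HPCCDriver"

  // Builds "jdbc:hpcc:Key=Value;Key=Value;" preserving the order of the pairs.
  def buildUrl(props: Seq[(String, String)]): String =
    props.map { case (k, v) => s"$k=$v;" }.mkString("jdbc:hpcc:", "", "")

  // Options expected by the JDBC data source: url, dbtable, and driver.
  def readerOptions(url: String, table: String): Map[String, String] =
    Map("url" -> url, "dbtable" -> table, "driver" -> DriverClass)
}

// Placeholder values; replace them with the details of your own cluster.
val url = HpccJdbc.buildUrl(Seq(
  "URL" -> "https://server:port",
  "User" -> "test",
  "Password" -> "xA123456",
  "Version" -> "1",
  "Cluster" -> "hthor"
))
val opts = HpccJdbc.readerOptions(url, "hpcc::test::orders")
println(url)
```

Pasted into the Spark shell, the resulting map can be handed to the reader in one call, for example spark.read.format("jdbc").options(opts).load(), which is equivalent to chaining the individual option() calls shown in step 2.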

Using the CData JDBC Driver for HPCC Systems in Apache Spark, you can perform fast and complex analytics on HPCC Systems data, combining the power and utility of Spark with your data. Download a free, 30-day trial of any of the 200+ CData JDBC Drivers and get started today.