by Jerod Johnson | April 14, 2016

SQL Access to Apache Cassandra

The Apache Cassandra database provides its users scalability, high availability, fault tolerance, and linear scalability without sacrificing performance. By using any of the CData Drivers for Apache Cassandra, you can access your Apache Cassandra data using SQL to read from and write to your data in your own applications or in 3rd party applications such as Tableau or MS Power BI.

By translating SQL queries into standard CQL queries, we allow you to consume your Apache Cassandra data as if it were a relational database, a format that is familiar and easy to use. In addition to traditional SELECT, INSERT, UPDATE, and DELETE operations, the CData drivers support flexible interpretation of your Cassandra data and have built-in client-side SQL functions that allow you to better drill down to the data that you need.

We have highlighted the different ways to interpret your Cassandra data below.

  • Free-Form Queries: Learn to make queries based on what you understand to be in the database, with or without table schema.
  • Flattening Objects & Arrays: Learn about the connection properties that allow you to dynamically drill down into your Cassandra data, regardless of the complexity of the documents (and their internal objects) stored in Cassandra.
  • Custom Schema Definitions: Learn to define a schema for your data, allowing you to expose exactly the data you want in exactly the way that you want to BI, reporting, and ETL tools.
  • Client-Side JSON Functions: Learn to use the built-in query functions that allow you to aggregate the data from your Cassandra documents to expose the information that you want and perform calculations on your data.

Free-Form Queries

By allowing for free-form queries, the CData Drivers give you the flexibility to work with exactly the data that you want, regardless of the existence of a strict table schema. You can freely reference any field in you Cassandra data, regardless of its complexity. Consider the following document:

{
  id: 1754,
  name: "Joe Smith",
  address: {street: "9th Street", city: "New York", state: "NY"},
  children: ["John", "Sarah", "Elizabeth"],
  birthday: "1960-01-01"
}

If you know that you want the id, address.street, and children[0] fields from each document, you can freely query that data from the database:

SELECT
  [id],
  [address.street],
  [children.0],
FROM Company;

Any items that contain data in those fields will have the respective values returned. If the field does not exist in a given document, the driver will simply return a NULL for the field. This feature is useful whenever you have control over the SQL query being submitted to the CData driver, but obviously cannot be used when the opposite is true. There are other options available in that case.

Flattening Arrays & Objects

When it comes to working with your Cassandra data, you can configure the driver to dynamically interpret a table schema, based on the type of information stored in the database. In the event that your Cassandra data has JSON objects or arrays, the Flatten Arrays and Flatten Objects Connection properties control how the schema is determined.

Consider the following document:

{
  id: 1754,
  name: "Joe Smith",
  address: {street: "9th Street", city: "New York", state: "NY"},
  children: ["John", "Sarah", "Elizabeth"],
  birthday: "1960-01-01"
}

In the examples below, we display the expected results, based on various values for Flatten Arrays and Flatten Objects, for the following query:

SELECT *
FROM People

FlattenArrays=0;FlattenObjects=False;

Without any flattening, the drivers would discover five columns for the table: id, name, address, children, and birthday. Any objects and arrays in the document are returned in aggregate form.

Result

id name address children birthday
1754 Joe Smith {street: "9th Street", city: "New York", state: "NY"} ["John", "Sarah", "Elizabeth"] 1960-01-01

FlattenArrays=0;FlattenObjects=True;

If you set Flatten Objects to "true", the number of columns expands as the "address" object is flattened.

Result

id name address_street address_city address_state children birthday
1754 Joe Smith 9th Street New York NY ["John", "Sarah", "Elizabeth"] 1960-01-01

FlattenArrays=2;FlattenObjects=False;

The Flatten Arrays property determines how many items in an array to treat as individual columns. By setting Flatten Arrays to "2" (while leaving Flatten Objects = "false"), we will extract the first two items in the arrays of a document.

Result

id name address children_0 children_1 birthday
1754 Joe Smith {street: "9th Street", city: "New York", state: "NY"} John Sarah 1960-01-01

These columns are extended through INSERT and UPDATE statements, allowing you to add or update entries that contain objects and arrays.

Custom Schema Definitions

In order to treat your Apache Cassandra data as a relational database, a table schema must exist. By default, the schema will be determined dynamically, based on Connection properties, but you can also create a custom for your data yourself. Consider the following data:

{
  id: 1754,
  name: "Joe Smith",
  address: {street: "9th Street", city: "New York", state: "NY"},
  children: ["John", "Sarah", "Elizabeth"],
  birthday: "1960-01-01"
}

Given the document above, you could expose the id (as the primary key), name, address.state, and the first entry in the children fields by creating the following schema:

  

    
    
    <attr  name="id"      xs:type="int32"  iskey="true" other:jsonpath="$.id" />
    <attr  name="name"    xs:type="string"  other:jsonpath="$.id"             />
    <attr  name="state"   xs:type="string"  other:jsonpath="$.address.state"  />
    <attr  name="child_1"  xs:type="string"  other:jsonpath="$.children.0"    />
    

  


By creating a custom schema for your Cassandra data, you are able to specifically select your data in a way that is not commonly supported in BI, reporting, and ETL tools when it comes to NoSQL data. With the CData drivers, you will be able to utilize the powerful features of your favorite tools to work with your data in the way that works best for you.

Client-Side JSON Functions

The entries in Apache Cassandra data sets are essentially JSON structures. The CData Drivers support using standard SQL functions to work with JSON structures. These functions work by pulling in the Apache Cassandra data and parsing the relevant information in the client. While there are many functions supported, we will only highlight a few here. The examples below use the following array:

[
  { "day": "Mon", "hours": 2 },
  { "day": "Tue", "hours": 6 },
  { "day": "Wed", "hours": 10 },
  { "day": "Thu", "hours": 9 },
  { "day": "Fri", "hours": 8 }
]

JSON_EXTRACT

The JSON_EXTRACT function can extract individual values from a JSON object. The following query returns the values shown below based on the JSON path passed as the second argument to the function:

SELECT 
  JSON_EXTRACT(days,'[0].day') AS Day,
  JSON_EXTRACT(days,'[0].hours') AS Hours
FROM Employees;

This query would return the following data:

Day Hours
Mon 2

JSON_SUM

The JSON_SUM function returns the sum of the numeric values of a JSON array within a JSON object. The following query returns the total of the values specified by the JSON path passed as the second argument to the function:

SELECT 
  Name, 
  JSON_SUM(hours,'[x].hours') AS TotalHours 
FROM Employees;

This query would return the following data:

TotalHours
35

JSON

The JSON function can be used to retrieve the entire table as a JSON string. See the following query and its result as an example:

SELECT 
  JSON(*) 
FROM Employees;

The query above will return the entire table as a single JSON string.

With these features at your disposal you get to combine the flexibility of NoSQL data with the industry-proven functionality of a relational database.

More Information & Free Trial

Check out our Apache Cassandra product page for more information. Download a free trial and get started working with your Apache Cassandra data today!