by Tomas Restrepo | November 14, 2018

Key Considerations for Driver Development

This is the first in a series of blog posts where I want to take a look at our driver technology and how we build our products. CData offers drivers for 100+ data sources across technologies such as ODBC, JDBC, and ADO.NET.

A common assumption that comes up in conversations about our driver model is that building a full-fledged connectivity platform for all of these data sources should be easy. After all, many drivers are based on standard protocols such as HTTP and OData. As usual, however, a lot of complexity is hidden from view at first glance.

There are several interesting challenges that come up when building connectors to such a broad list of data sources. In this first post in the series, I'd like to touch on some of these and discuss how CData has approached them. To do this, I'll start by covering the different perspectives we use to understand the level of effort involved in building a driver for a potential new data source.

Protocol Considerations

One of the first aspects that we examine when evaluating a new data source is the protocol requirements.

  • OData: Many data sources expose an OData-based interface. We have a very strong OData client implementation at the core of all these drivers, which substantially simplifies the work and reduces the time it takes to build a driver for a new OData-based data source.
  • REST: Many typical SaaS applications expose a custom REST API exchanging JSON/XML content over HTTP. While our core includes a strong foundation for building these, they still require more effort because each data source's API models common patterns (such as paging and filtering) in its own way; two common paging styles are sketched after this list.
  • SOAP: Some data sources expose Web Service interfaces based on SOAP/WS-* protocols.
  • Other Protocols: These are data sources based on other protocols, such as TDS, LDAP, IMAP, and others. This category usually includes drivers for full database engines, which often have more extensive query capabilities.
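
As a small illustration of how the same pattern varies across APIs, here is a sketch (in Java) of two common REST paging styles behind a single contract. This is hypothetical code written for this post; none of the names come from the CData codebase:

    import java.util.List;

    // Two common REST paging styles behind one contract (illustrative only).
    interface PageSource {
        List<String> nextPage() throws Exception;   // empty list = no more data
    }

    // Style 1: offset/limit parameters, e.g. GET /records?offset=200&limit=100
    class OffsetPaging implements PageSource {
        private static final int LIMIT = 100;
        private int offset = 0;

        public List<String> nextPage() throws Exception {
            List<String> rows = httpGet("/records?offset=" + offset + "&limit=" + LIMIT);
            offset += rows.size();                  // a short page signals the end
            return rows;
        }
        private List<String> httpGet(String path) { return List.of(); } // HTTP call elided
    }

    // Style 2: an opaque continuation token returned with every response
    class TokenPaging implements PageSource {
        record Page(List<String> rows, String nextToken) {}
        private String token = "";                  // empty token = first page

        public List<String> nextPage() throws Exception {
            Page page = httpGet("/records?pageToken=" + token);
            token = page.nextToken();               // the provider hands back the cursor
            return page.rows();
        }
        private Page httpGet(String path) { return new Page(List.of(), ""); } // elided
    }

A driver core can consume either style uniformly; the per-source work is the small adapter that knows the API's particular dialect.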

This distinction is important because it highlights where the complexity of building a driver for each technology lies, and it allows us to evaluate how much foundational work will be needed to create a basic driver. For example, for a protocol driver, the complexity often lies in building the networking stack and supporting the query language dialect of the data source.

The complexity for REST and SOAP drivers often lies in the query capabilities. Some data sources, like Salesforce, have strong APIs capable of relatively complex queries. Others barely have the capability to return all entities in a collection or to query a single entity by a key property. Much of the complexity here is in supporting extensive query capabilities while 'pushing down' as much of the query as possible to the data source for efficient processing.
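
To make 'pushing down' concrete, here is a minimal sketch of deciding whether a simple equality filter can be evaluated by the remote API or must fall back to client-side filtering inside the driver. The capability set and request formats are assumptions for illustration, not actual driver internals:

    import java.util.Set;

    // Hypothetical filter-pushdown decision; not actual driver internals.
    public class PushdownExample {
        // Fields the remote API is assumed (per its docs) to filter server-side.
        static final Set<String> REMOTE_FILTERABLE = Set.of("id", "status", "created");

        // Decide where a simple "field = value" predicate is evaluated.
        static String plan(String field, String value) {
            if (REMOTE_FILTERABLE.contains(field)) {
                // Push the predicate into the API request for efficient processing.
                return "GET /records?$filter=" + field + " eq '" + value + "'";
            }
            // Otherwise fetch everything and filter rows inside the driver.
            return "GET /records   -- then filter " + field + " = '" + value + "' client-side";
        }

        public static void main(String[] args) {
            System.out.println(plan("status", "open"));     // pushed down
            System.out.println(plan("nickname", "bob"));    // client-side fallback
        }
    }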

Relational vs. Non-relational Data Sources

All of our drivers expose a relational model. That is, we expose tables and columns for a data source and a relational query language (based on SQL-92). However, some of the data sources we support are non-relational.
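
Regardless of what the backend looks like, the application-facing experience stays standard. As a minimal JDBC illustration (the connection URL and table are placeholders, not a documented CData connection string):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class DriverUsage {
        public static void main(String[] args) throws Exception {
            // Placeholder URL: each driver documents its own connection string format.
            try (Connection conn = DriverManager.getConnection("jdbc:example:Server=host;Database=db");
                 Statement stmt = conn.createStatement();
                 // Plain SQL-92, even if the source underneath is not relational.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT Name, City FROM Customers WHERE City = 'Raleigh'")) {
                while (rs.next()) {
                    System.out.println(rs.getString("Name") + " | " + rs.getString("City"));
                }
            }
        }
    }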

Examples of non-relational data sources include:

  • Document Databases such as MongoDB, CosmosDB, Cloudant, and others, which are JSON document stores.
  • Key/Value Stores such as Redis.
  • Hierarchical Stores such as LDAP / Active Directory.
  • Analytical Stores such as Google AdWords, Bing Ads, and others that are more closely aligned with analytical processing.
  • Raw Stores such as JSON/XML/CSV files.

Each of these has unique challenges. The two most common are:

  • The query capabilities of each data source: some allow more expressive queries, while others are more limited. Some, like raw stores, have no built-in query capability at all.
  • How to model the result sets into tables and columns.

Exposing data through a relational model is very useful for a lot of use cases. One key scenario for our customers is being able to integrate their data sources into Business Intelligence, Analytics, and Data Visualization tools, all of which have extensive support for SQL as a query language and understand the relational model. We leverage our extensive horizontal and vertical flattening capabilities to model results so that these tools can query a wide variety of data sources.
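
To make the flattening idea concrete, here is a deliberately simplified sketch of both forms: horizontal flattening maps a nested object to a dotted column name, and vertical flattening expands an array into one row per element. The rules shown are illustrative; the real drivers expose much richer controls:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Simplified horizontal + vertical flattening of a JSON-like document.
    public class FlattenExample {
        public static void main(String[] args) {
            // { "name": "Ann", "address": { "city": "Raleigh" }, "phones": ["111", "222"] }
            Map<String, Object> doc = new LinkedHashMap<>();
            doc.put("name", "Ann");
            doc.put("address", Map.of("city", "Raleigh"));
            doc.put("phones", List.of("111", "222"));

            // Horizontal: the nested object becomes the column "address.city".
            String city = (String) ((Map<?, ?>) doc.get("address")).get("city");

            // Vertical: the "phones" array becomes one output row per element.
            List<Map<String, String>> rows = new ArrayList<>();
            for (Object phone : (List<?>) doc.get("phones")) {
                Map<String, String> row = new LinkedHashMap<>();
                row.put("name", (String) doc.get("name"));
                row.put("address.city", city);
                row.put("phones", (String) phone);
                rows.add(row);
            }
            rows.forEach(System.out::println);
            // {name=Ann, address.city=Raleigh, phones=111}
            // {name=Ann, address.city=Raleigh, phones=222}
        }
    }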

Metadata Discovery

Discovering the metadata of the relational model exposed by a driver is also an important factor in judging the implementation complexity of a new data source:

  • Static Metadata Discovery: Some data sources have a fixed, well-known, documented schema. Implementing metadata discovery for such a data source usually involves building that schema into the driver, which we do using our proprietary DataScript language.
  • Dynamic Metadata Discovery: Some data sources have fully dynamic metadata that is not known ahead of time, and they offer introspection capabilities for discovering that schema. OData sources are an obvious example: you can query the $metadata document to list entities, properties, and relationships. Another example is Salesforce, which has APIs for discovering tables and fields.
  • Hybrid Metadata Discovery: Some data sources are a hybrid of the dynamic/static models. For example, our Email driver (IMAP) dynamically discovers tables (folders), but column metadata is static. Other data sources have a known, static list of tables, but columns can be dynamic and need to be discovered at runtime.
  • RowScan Metadata Discovery: Some data sources have completely dynamic schemas, with no way to query a fixed shape for the stored data. JSON document databases such as MongoDB are an example: two JSON documents in the same collection might have completely different fields, and they don't need to match a single schema. For these, we usually support dynamic metadata discovery by examining the stored data directly. Effectively, metadata is constructed by 'scanning the rows' up to a certain RowScanDepth to derive a working schema, as sketched below.
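
A rough sketch of the row-scan idea; apart from the RowScanDepth concept itself, the names and logic here are illustrative simplifications:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Derive a working schema by scanning the first N documents and taking
    // the union of their fields (toy version of row-scan discovery).
    public class RowScanExample {
        static Map<String, String> deriveSchema(List<Map<String, Object>> docs, int rowScanDepth) {
            Map<String, String> columns = new LinkedHashMap<>();
            for (Map<String, Object> doc : docs.subList(0, Math.min(rowScanDepth, docs.size()))) {
                for (Map.Entry<String, Object> field : doc.entrySet()) {
                    // First type seen wins here; a real implementation must
                    // reconcile conflicting types across documents.
                    columns.putIfAbsent(field.getKey(),
                            field.getValue() instanceof Number ? "NUMERIC" : "VARCHAR");
                }
            }
            return columns;
        }

        public static void main(String[] args) {
            List<Map<String, Object>> docs = List.of(
                    Map.of("id", 1, "name", "Ann"),
                    Map.of("id", 2, "email", "b@example.com"));  // different fields
            System.out.println(deriveSchema(docs, 100));
            // e.g. {id=NUMERIC, name=VARCHAR, email=VARCHAR} (order may vary)
        }
    }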

Metadata, regardless of how it is discovered, also poses interesting challenges of its own. For some data sources, discovering metadata is an expensive and slow process, so caching metadata reliably is key to ensuring the driver performs well. Other data sources have very large metadata models, which impacts memory usage. This is an area we have invested in heavily during the past few months, and one we continue to improve.
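
As a toy illustration of the caching side, here is a minimal time-bounded metadata cache. A production version would also need invalidation, size limits, and persistence across connections:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Supplier;

    // Minimal time-bounded cache for expensive-to-discover metadata.
    public class MetadataCache<T> {
        private record Entry<V>(V value, long loadedAtMillis) {}
        private final Map<String, Entry<T>> cache = new ConcurrentHashMap<>();
        private final long ttlMillis;

        public MetadataCache(long ttlMillis) { this.ttlMillis = ttlMillis; }

        // Return cached metadata if still fresh; otherwise reload via the (slow) loader.
        public T get(String key, Supplier<T> loader) {
            Entry<T> e = cache.get(key);
            long now = System.currentTimeMillis();
            if (e == null || now - e.loadedAtMillis() > ttlMillis) {
                e = new Entry<>(loader.get(), now);
                cache.put(key, e);
            }
            return e.value();
        }
    }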

Authentication Model

How a client authenticates to a data source can also be a source of complexity when building a new driver. Some of the authentication models that come up include:

  • Username/Password/Tokens: Many data sources support some level of basic authentication based on username/password or additional secrets.
  • OAuth: Some data sources (usually REST-based) support OAuth 1.0/2.0. OAuth support adds complexity to driver development because each OAuth implementation raises its own questions (a schematic token-refresh sketch follows this list), such as:
    • Are client id/secrets required? If so, where are clients registered?
    • How are access tokens refreshed?
    • Is the token duration fixed or dynamic?
    • Are Token signing/encryption keys published or do they need to be provided by the user?
    • Are multiple sessions supported or not?
  • SSO: Some data sources support single sign-on as an authentication mechanism. This is usually simple for OAuth-based sources, but can be more complex if SAML or WS-Federation is used.
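
To illustrate just one of the questions above, here is a schematic OAuth 2.0 refresh-token exchange. The endpoint URL and credentials are placeholders, values should be URL-encoded in real code, and providers differ in exactly the ways listed:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    // Schematic OAuth 2.0 refresh-token request (standard form-encoded flow).
    public class TokenRefresh {
        static String refreshAccessToken(String tokenUrl, String clientId,
                                         String clientSecret, String refreshToken) throws Exception {
            String body = "grant_type=refresh_token"
                    + "&refresh_token=" + refreshToken
                    + "&client_id=" + clientId
                    + "&client_secret=" + clientSecret;
            HttpRequest request = HttpRequest.newBuilder(URI.create(tokenUrl))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            // The JSON response carries a new access_token (and sometimes a new
            // refresh_token); parsing is elided here.
            return response.body();
        }
    }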

Query Capabilities

The biggest effort when building a driver often lies in the query capabilities of the data source. Some common features we look at when evaluating a potential data source are:

  • Does the data source support aggregations and grouping? If so, what aggregation functions are supported?
  • Are joins supported? Notice that often joins are supported in specific directions, or between specific entities only. For example, OData-based sources often support some level of join capabilities through the $expand feature.
  • Is ordering the results supported? Often, support for ordering is partial. For example, CosmosDB cannot order query results by all fields. Some other sources support ordering in one direction only.
  • Is paging of results supported? Some data sources only support limiting the number of results returned, some support full paging, and some don't support anything at all.
  • What filtering capabilities are supported? Some data sources only support filtering results on specific fields. Others support a limited set of comparison operators.
  • Are bulk operations supported? If so, for which operations? Some will support inserts/updates/deletes, while others just inserts, for example.
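
In practice, the answers to these questions become capability flags that determine how a query is split between the data source and the driver. A simplified sketch with illustrative names and a hard-coded fetch:

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;

    // Capability flags decide what is pushed to the source and what the
    // driver evaluates itself (illustrative, not actual driver internals).
    public class QuerySplitExample {
        static final boolean SOURCE_SUPPORTS_FILTER = true;    // assumed per API docs
        static final boolean SOURCE_SUPPORTS_ORDER_BY = false;

        public static void main(String[] args) {
            // SELECT * FROM tickets WHERE status = 'open' ORDER BY created
            String remoteCall = "GET /tickets" + (SOURCE_SUPPORTS_FILTER ? "?status=open" : "");
            List<Map<String, String>> rows = new ArrayList<>(fetch(remoteCall));

            if (!SOURCE_SUPPORTS_FILTER) {                     // client-side filter fallback
                rows.removeIf(r -> !"open".equals(r.get("status")));
            }
            if (!SOURCE_SUPPORTS_ORDER_BY) {                   // client-side sort fallback
                rows.sort(Comparator.comparing(r -> r.get("created")));
            }
            rows.forEach(System.out::println);
        }

        static List<Map<String, String>> fetch(String call) {  // HTTP call elided
            return List.of(Map.of("status", "open", "created", "2018-11-02"),
                           Map.of("status", "open", "created", "2018-11-01"));
        }
    }

The more of these capabilities a source supports natively, the less post-processing the driver has to perform itself.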

Conclusion

There is a lot of complexity in building fully capable data access drivers based on standard technologies for a large number of data sources. In this initial post, I've covered some of the questions and aspects we look at when evaluating the complexity and level of technical effort involved in adding support for a new data source. In a follow-up post, I will cover our product architecture and the technology we use to deliver 100+ data sources across every major data access technology.