Integrate Lakebase Data in Pentaho Data Integration
The CData JDBC Driver for Lakebase enables access to live Lakebase data from data pipelines. Pentaho Data Integration is an Extraction, Transformation, and Loading (ETL) engine that collects data, cleanses it, and stores it in a uniform, accessible format. This article shows how to connect to Lakebase data as a JDBC data source and build jobs and transformations based on Lakebase data in Pentaho Data Integration.
Configure Connectivity to Lakebase
To connect to Databricks Lakebase, start by setting the following properties:
- DatabricksInstance: The Databricks instance or server hostname, provided in the format instance-abcdef12-3456-7890-abcd-abcdef123456.database.cloud.databricks.com.
- Server: The host name or IP address of the server hosting the Lakebase database.
- Port (optional): The port of the server hosting the Lakebase database, set to 5432 by default.
- Database (optional): The database to connect to after authenticating to the Lakebase Server, set to the authenticating user's default database by default.
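For example, a connection URL built from just these base properties (the values here are placeholders, and the authentication properties covered below still need to be added) might look like the following:
jdbc:lakebase:DatabricksInstance=instance-abcdef12-3456-7890-abcd-abcdef123456.database.cloud.databricks.com;Port=5432;Database=my_database;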
OAuth Client Authentication
To authenticate using OAuth client credentials, you need to configure an OAuth client for your service principal. In short, you need to do the following:
- Create and configure a new service principal
- Assign permissions to the service principal
- Create an OAuth secret for the service principal
For more information, refer to the Setting Up OAuth Client Authentication section in the Help documentation.
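As a rough sketch, a connection string using the service principal's OAuth client credentials might look like the following. The property names AuthScheme, OAuthClientId, and OAuthClientSecret follow common CData driver conventions and are assumptions here; confirm the exact names and values in the Help documentation.
jdbc:lakebase:DatabricksInstance=instance-abcdef12-3456-7890-abcd-abcdef123456.database.cloud.databricks.com;Database=my_database;AuthScheme=OAuthClient;OAuthClientId=my-service-principal-id;OAuthClientSecret=my-oauth-secret;InitiateOAuth=GETANDREFRESH;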
OAuth PKCE Authentication
To authenticate using the OAuth code type with PKCE (Proof Key for Code Exchange), set the following properties:
- AuthScheme: OAuthPKCE.
- User: The authenticating user's user ID.
For more information, refer to the Help documentation.
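For example, a connection string using the two properties above might look like this (all values are placeholders):
jdbc:lakebase:DatabricksInstance=instance-abcdef12-3456-7890-abcd-abcdef123456.database.cloud.databricks.com;Database=my_database;AuthScheme=OAuthPKCE;User=my_user_id;InitiateOAuth=GETANDREFRESH;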
Built-in Connection String Designer
For assistance in constructing the JDBC URL, use the connection string designer built into the Lakebase JDBC Driver. Either double-click the JAR file or execute it from the command line.
java -jar cdata.jdbc.lakebase.jar
Fill in the connection properties and copy the connection string to the clipboard.
When you configure the JDBC URL, you may also want to set the Max Rows connection property. This will limit the number of rows returned, which is especially helpful for improving performance when designing reports and visualizations.
Below is a typical JDBC URL:
jdbc:lakebase:DatabricksInstance=lakebase;Server=127.0.0.1;Port=5432;Database=my_database;InitiateOAuth=GETANDREFRESH;
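For example, assuming the driver follows the usual CData MaxRows property naming, the same URL capped at 100 rows would be:
jdbc:lakebase:DatabricksInstance=lakebase;Server=127.0.0.1;Port=5432;Database=my_database;InitiateOAuth=GETANDREFRESH;MaxRows=100;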
Save your connection string for use in Pentaho Data Integration.
Connect to Lakebase from Pentaho DI
Open Pentaho Data Integration and select "Database Connection" to configure a connection to the CData JDBC Driver for Lakebase.
- Click "General"
- Set Connection name (e.g. Lakebase Connection)
- Set Connection type to "Generic database"
- Set Access to "Native (JDBC)"
- Set Custom connection URL to your Lakebase connection string (e.g.
jdbc:lakebase:DatabricksInstance=lakebase;Server=127.0.0.1;Port=5432;Database=my_database;InitiateOAuth=GETANDREFRESH;)
- Set Custom driver class name to "cdata.jdbc.lakebase.LakebaseDriver"
- Test the connection and click "OK" to save.
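If the in-tool test fails, it can help to verify the URL and driver class directly with a short standalone JDBC program. This is a minimal sketch, assuming cdata.jdbc.lakebase.jar is on the classpath and reusing the example URL from above:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class LakebaseConnectionTest {
    public static void main(String[] args) throws Exception {
        // The same custom connection URL configured in the Pentaho connection dialog
        String url = "jdbc:lakebase:DatabricksInstance=lakebase;Server=127.0.0.1;"
                + "Port=5432;Database=my_database;InitiateOAuth=GETANDREFRESH;";

        // The driver class entered as "Custom driver class name"
        Class.forName("cdata.jdbc.lakebase.LakebaseDriver");

        try (Connection conn = DriverManager.getConnection(url)) {
            // List the tables the driver exposes to confirm connectivity
            try (ResultSet tables = conn.getMetaData()
                    .getTables(null, null, "%", new String[] { "TABLE" })) {
                while (tables.next()) {
                    System.out.println(tables.getString("TABLE_NAME"));
                }
            }
        }
    }
}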
Create a Data Pipeline for Lakebase
Once the connection to Lakebase is configured using the CData JDBC Driver, you are ready to create a new transformation or job.
- Click "File" >> "New" >> "Transformation/job"
- Drag a "Table input" object into the workflow panel and select your Lakebase connection.
- Click "Get SQL select statement" and use the Database Explorer to view the available tables and views.
- Select a table and optionally preview the data for verification.
At this point, you can continue your transformation or job by selecting a suitable destination and adding any transformations to modify, filter, or otherwise alter the data during replication.
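Outside Pentaho, the same kind of preview the "Table input" step performs can be reproduced with plain JDBC. The following sketch uses the example URL from above, and the table name public.my_table is a placeholder; substitute a table reported by the Database Explorer.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class LakebasePreview {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:lakebase:DatabricksInstance=lakebase;Server=127.0.0.1;"
                + "Port=5432;Database=my_database;InitiateOAuth=GETANDREFRESH;";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {
            stmt.setMaxRows(10); // preview only a handful of rows
            // "public.my_table" is a placeholder table name
            try (ResultSet rs = stmt.executeQuery("SELECT * FROM public.my_table")) {
                ResultSetMetaData meta = rs.getMetaData();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= meta.getColumnCount(); i++) {
                        row.append(meta.getColumnName(i)).append('=')
                           .append(rs.getString(i)).append("  ");
                    }
                    System.out.println(row);
                }
            }
        }
    }
}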
Free Trial & More Information
Download a free, 30-day trial of the CData JDBC Driver for Lakebase and start working with your live Lakebase data in Pentaho Data Integration today.