Integrate Google Cloud Storage Data in Pentaho Data Integration



Build ETL pipelines based on Google Cloud Storage data in the Pentaho Data Integration tool.

The CData JDBC Driver for Google Cloud Storage enables access to live data from data pipelines. Pentaho Data Integration is an Extraction, Transformation, and Loading (ETL) engine that extracts data, cleanses it, and stores it in a uniform, accessible format. This article shows how to connect to Google Cloud Storage data as a JDBC data source and build jobs and transformations based on Google Cloud Storage data in Pentaho Data Integration.

Configure Connectivity to Google Cloud Storage

Authenticate with a User Account

You can connect without setting any connection properties for your user credentials. After setting InitiateOAuth to GETANDREFRESH, you are ready to connect.

When you connect, the Google Cloud Storage OAuth endpoint opens in your default browser. Log in and grant permissions to complete the OAuth process.

Authenticate with a Service Account

Service accounts authenticate silently, with no user authentication in the browser. You can also use a service account to delegate enterprise-wide access scopes.

You need to create an OAuth application in this flow. See the Help documentation for more information. After setting the following connection properties, you are ready to connect:

  • InitiateOAuth: Set this to GETANDREFRESH.
  • OAuthJWTCertType: Set this to "PFXFILE".
  • OAuthJWTCert: Set this to the path to the .p12 file you generated.
  • OAuthJWTCertPassword: Set this to the password of the .p12 file.
  • OAuthJWTCertSubject: Set this to "*" to pick the first certificate in the certificate store.
  • OAuthJWTIssuer: In the service accounts section, click Manage Service Accounts and set this field to the email address displayed in the service account Id field.
  • OAuthJWTSubject: Set this to your enterprise Id if your subject type is set to "enterprise" or your app user Id if your subject type is set to "user".
  • ProjectId: Set this to the Id of the project you want to connect to.

The OAuth flow for a service account then completes.
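
The service account properties above end up as name=value pairs in the JDBC URL. The following is a minimal sketch of that assembly, assuming the semicolon-separated format shown in the typical JDBC URL in this article. The class name, the method names, and every property value (file path, password, issuer email) are hypothetical placeholders, and the sketch does no escaping of special characters in values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class ServiceAccountUrl {
    // Prefix taken from the typical JDBC URL shown in this article.
    static final String PREFIX = "jdbc:googlecloudstorage:";

    // Joins name=value pairs with semicolons, preserving insertion order.
    static String build(Map<String, String> props) {
        return PREFIX + props.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(";"));
    }

    // Placeholder values only; replace them with your own service account details.
    static Map<String, String> exampleProps() {
        Map<String, String> p = new LinkedHashMap<>();
        p.put("InitiateOAuth", "GETANDREFRESH");
        p.put("OAuthJWTCertType", "PFXFILE");
        p.put("OAuthJWTCert", "/path/to/key.p12");           // hypothetical path
        p.put("OAuthJWTCertPassword", "changeit");           // hypothetical password
        p.put("OAuthJWTCertSubject", "*");
        p.put("OAuthJWTIssuer", "service-account@example.com"); // hypothetical email
        p.put("ProjectId", "project1");
        return p;
    }

    public static void main(String[] args) {
        System.out.println(build(exampleProps()));
    }
}
```

OAuthJWTSubject can be added to the map in the same way if your setup requires it. In practice the connection string designer described in the next section produces the same kind of string without hand assembly.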

Built-in Connection String Designer

For assistance in constructing the JDBC URL, use the connection string designer built into the Google Cloud Storage JDBC Driver. Either double-click the JAR file or execute the JAR file from the command line:

java -jar cdata.jdbc.googlecloudstorage.jar

Fill in the connection properties and copy the connection string to the clipboard.

When you configure the JDBC URL, you may also want to set the Max Rows connection property. This will limit the number of rows returned, which is especially helpful for improving performance when designing reports and visualizations.

Below is a typical JDBC URL:

jdbc:googlecloudstorage:ProjectId='project1';InitiateOAuth=GETANDREFRESH

Save your connection string for use in Pentaho Data Integration.
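
As an illustration of the row limit mentioned above, the sketch below appends a row cap to an existing connection string. It assumes the property is spelled MaxRows in the URL (this article refers to it as "Max Rows") and that it is added as one more semicolon-separated pair; the class and method names are hypothetical.

```java
class MaxRowsUrl {
    // Appends MaxRows=<limit> to a JDBC URL, adding a separator only when needed.
    // Assumption: the driver reads the property under the name "MaxRows".
    static String withMaxRows(String jdbcUrl, int limit) {
        if (limit <= 0) {
            throw new IllegalArgumentException("limit must be positive");
        }
        boolean needsSeparator = !(jdbcUrl.endsWith(";") || jdbcUrl.endsWith(":"));
        return jdbcUrl + (needsSeparator ? ";" : "") + "MaxRows=" + limit;
    }

    public static void main(String[] args) {
        String base = "jdbc:googlecloudstorage:ProjectId='project1';InitiateOAuth=GETANDREFRESH";
        System.out.println(withMaxRows(base, 100));
    }
}
```

A small limit is useful while designing a transformation; remove or raise it before running the pipeline against the full data set.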

Connect to Google Cloud Storage from Pentaho DI

Open Pentaho Data Integration and select "Database Connection" to configure a connection to the CData JDBC Driver for Google Cloud Storage.

  1. Click "General"
  2. Set Connection name (e.g. Google Cloud Storage Connection)
  3. Set Connection type to "Generic database"
  4. Set Access to "Native (JDBC)"
  5. Set Custom connection URL to your Google Cloud Storage connection string (e.g.
    jdbc:googlecloudstorage:ProjectId='project1';InitiateOAuth=GETANDREFRESH)
  6. Set Custom driver class name to "cdata.jdbc.googlecloudstorage.GoogleCloudStorageDriver"
  7. Test the connection and click "OK" to save.
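
If the connection test fails inside Pentaho, it can help to try the same URL and driver class from a small standalone program. The sketch below uses only the standard java.sql API and assumes the driver JAR is on the classpath; the class name ConnectionCheck and the 10-second validity timeout are arbitrary choices for illustration.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class ConnectionCheck {
    // Driver class name as given in step 6 above.
    static final String DRIVER = "cdata.jdbc.googlecloudstorage.GoogleCloudStorageDriver";

    // Returns true if the driver class loads and a connection opens and reports valid.
    static boolean canConnect(String jdbcUrl) {
        try {
            Class.forName(DRIVER); // fails fast if the driver JAR is not on the classpath
        } catch (ClassNotFoundException e) {
            System.err.println("Driver class not found: " + DRIVER);
            return false;
        }
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            return conn.isValid(10); // timeout in seconds
        } catch (SQLException e) {
            System.err.println("Connection failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0
                ? args[0]
                : "jdbc:googlecloudstorage:ProjectId='project1';InitiateOAuth=GETANDREFRESH";
        System.out.println(canConnect(url) ? "OK" : "FAILED");
    }
}
```

A "Driver class not found" message points at a classpath problem, while a "Connection failed" message points at the URL or the credentials.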

Create a Data Pipeline for Google Cloud Storage

Once the connection to Google Cloud Storage is configured using the CData JDBC Driver, you are ready to create a new transformation or job.

  1. Click "File" >> "New" >> "Transformation" (or "Job").
  2. Drag a "Table input" object into the workflow panel and select your Google Cloud Storage connection.
  3. Click "Get SQL select statement" and use the Database Explorer to view the available tables and views.
  4. Select a table and optionally preview the data for verification.
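
The browsing and preview steps above correspond roughly to standard JDBC metadata and query calls. The sketch below lists tables and views and prints a few rows from one of them, assuming the driver exposes its tables through DatabaseMetaData. The class name and the table name Buckets used in the usage line are hypothetical; use a name that the Database Explorer actually shows.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class TablePreview {
    // Lists table and view names reported by the driver's JDBC metadata.
    static List<String> listTables(Connection conn) throws SQLException {
        List<String> names = new ArrayList<>();
        try (ResultSet rs = conn.getMetaData()
                .getTables(null, null, "%", new String[] {"TABLE", "VIEW"})) {
            while (rs.next()) {
                names.add(rs.getString("TABLE_NAME"));
            }
        }
        return names;
    }

    // Builds the preview query; accepts only simple identifiers to avoid SQL injection.
    static String previewSql(String table) {
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        return "SELECT * FROM " + table;
    }

    // Prints up to maxRows rows, with column values separated by " | ".
    static void preview(Connection conn, String table, int maxRows) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.setMaxRows(maxRows);
            try (ResultSet rs = st.executeQuery(previewSql(table))) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) row.append(" | ");
                        row.append(rs.getString(i));
                    }
                    System.out.println(row);
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 2) {
            System.err.println("usage: TablePreview <jdbc-url> <table>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0])) {
            System.out.println("Tables: " + listTables(conn));
            preview(conn, args[1], 10);
        }
    }
}
```

With the driver JAR on the classpath, a run might look like: java TablePreview "jdbc:googlecloudstorage:ProjectId='project1';InitiateOAuth=GETANDREFRESH" Buckets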

At this point, you can continue your transformation or job by selecting a suitable destination and adding any transformations to modify, filter, or otherwise alter the data during replication.

Free Trial & More Information

Download a free, 30-day trial of the CData JDBC Driver for Google Cloud Storage and start working with your live Google Cloud Storage data in Pentaho Data Integration today.
