Build Amazon Athena-Connected ETL Processes in Google Data Fusion



Load the CData JDBC Driver into Google Data Fusion and create ETL processes with access live Amazon Athena data.

Google Data Fusion allows users to perform self-service data integration to consolidate disparate data. Uploading the CData JDBC Driver for Amazon Athena enables users to access live Amazon Athena data from within their Google Data Fusion pipelines. While the CData JDBC Driver enables piping Amazon Athena data to any data source natively supported in Google Data Fusion, this article walks through piping data from Amazon Athena to Google BigQuery,

About Amazon Athena Data Integration

CData provides the easiest way to access and integrate live data from Amazon Athena. Customers use CData connectivity to:

  • Authenticate securely using a variety of methods, including IAM credentials, access keys, and Instance Profiles, catering to diverse security needs and simplifying the authentication process.
  • Streamline their setup and quickly resolve issue with detailed error messaging.
  • Enhance performance and minimize strain on client resources with server-side query execution.

Users frequently integrate Athena with analytics tools like Tableau, Power BI, and Excel for in-depth analytics from their preferred tools.

To learn more about unique Amazon Athena use cases with CData, check out our blog post: https://www.cdata.com/blog/amazon-athena-use-cases.


Getting Started


Upload the CData JDBC Driver for Amazon Athena to Google Data Fusion

Upload the CData JDBC Driver for Amazon Athena to your Google Data Fusion instance to work with live Amazon Athena data. Due to the naming restrictions for JDBC drivers in Google Data Fusion, create a copy or rename the JAR file to match the following format driver-version.jar. For example: cdataamazonathena-2020.jar

  1. Open your Google Data Fusion instance
  2. Click the to add an entity and upload a driver
  3. On the "Upload driver" tab, drag or browse to the renamed JAR file.
  4. On the "Driver configuration" tab:
    • Name: Create a name for the driver (cdata.jdbc.amazonathena) and make note of the name
    • Class name: Set the JDBC class name: (cdata.jdbc.amazonathena.AmazonAthenaDriver)
  5. Click "Finish"

Connect to Amazon Athena Data in Google Data Fusion

With the JDBC Driver uploaded, you are ready to work with live Amazon Athena data in Google Data Fusion Pipelines.

  1. Navigate to the Pipeline Studio to create a new Pipeline
  2. From the "Source" options, click "Database" to add a source for the JDBC Driver
  3. Click "Properties" on the Database source to edit the properties

    NOTE: To use the JDBC Driver in Google Data Fusion, you will need a license (full or trial) and a Runtime Key (RTK). For more information on obtaining this license (or a trial), contact our sales team.

    • Set the Label
    • Set Reference Name to a value for any future references (i.e.: cdata-amazonathena)
    • Set Plugin Type to "jdbc"
    • Set Connection String to the JDBC URL for Amazon Athena. For example:

      jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';

      Authenticating to Amazon Athena

      To authorize Amazon Athena requests, provide the credentials for an administrator account or for an IAM user with custom permissions: Set AccessKey to the access key Id. Set SecretKey to the secret access key.

      Note: Though you can connect as the AWS account administrator, it is recommended to use IAM user credentials to access AWS services.

      Obtaining the Access Key

      To obtain the credentials for an IAM user, follow the steps below:

      1. Sign into the IAM console.
      2. In the navigation pane, select Users.
      3. To create or manage the access keys for a user, select the user and then select the Security Credentials tab.

      To obtain the credentials for your AWS root account, follow the steps below:

      1. Sign into the AWS Management console with the credentials for your root account.
      2. Select your account name or number and select My Security Credentials in the menu that is displayed.
      3. Click Continue to Security Credentials and expand the Access Keys section to manage or create root account access keys.

      Authenticating from an EC2 Instance

      If you are using the CData Data Provider for Amazon Athena 2018 from an EC2 Instance and have an IAM Role assigned to the instance, you can use the IAM Role to authenticate. To do so, set UseEC2Roles to true and leave AccessKey and SecretKey empty. The CData Data Provider for Amazon Athena 2018 will automatically obtain your IAM Role credentials and authenticate with them.

      Authenticating as an AWS Role

      In many situations it may be preferable to use an IAM role for authentication instead of the direct security credentials of an AWS root user. An AWS role may be used instead by specifying the RoleARN. This will cause the CData Data Provider for Amazon Athena 2018 to attempt to retrieve credentials for the specified role. If you are connecting to AWS (instead of already being connected such as on an EC2 instance), you must additionally specify the AccessKey and SecretKey of an IAM user to assume the role for. Roles may not be used when specifying the AccessKey and SecretKey of an AWS root user.

      Authenticating with MFA

      For users and roles that require Multi-factor Authentication, specify the MFASerialNumber and MFAToken connection properties. This will cause the CData Data Provider for Amazon Athena 2018 to submit the MFA credentials in a request to retrieve temporary authentication credentials. Note that the duration of the temporary credentials may be controlled via the TemporaryTokenDuration (default 3600 seconds).

      Connecting to Amazon Athena

      In addition to the AccessKey and SecretKey properties, specify Database, S3StagingDirectory and Region. Set Region to the region where your Amazon Athena data is hosted. Set S3StagingDirectory to a folder in S3 where you would like to store the results of queries.

      If Database is not set in the connection, the data provider connects to the default database set in Amazon Athena.

      Built-in Connection String Designer

      For assistance in constructing the JDBC URL, use the connection string designer built into the Amazon Athena JDBC Driver. Either double-click the JAR file or execute the jar file from the command-line.

      java -jar cdata.jdbc.amazonathena.jar

      Fill in the connection properties and copy the connection string to the clipboard.

    • Set Import Query to a SQL query that will extract the data you want from Amazon Athena, i.e.:
      SELECT * FROM Customers
  4. From the "Sink" tab, click to add a destination sink (we use Google BigQuery in this example)
  5. Click "Properties" on the BigQuery sink to edit the properties
    • Set the Label
    • Set Reference Name to a value like amazonathena-bigquery
    • Set Project ID to a specific Google BigQuery Project ID (or leave as the default, "auto-detect")
    • Set Dataset to a specific Google BigQuery dataset
    • Set Table to the name of the table you wish to insert Amazon Athena data into

With the Source and Sink configured, you are ready to pipe Amazon Athena data into Google BigQuery. Save and deploy the pipeline. When you run the pipeline, Google Data Fusion will request live data from Amazon Athena and import it into Google BigQuery.

While this is a simple pipeline, you can create more complex Amazon Athena pipelines with transforms, analytics, conditions, and more. Download a free, 30-day trial of the CData JDBC Driver for Amazon Athena and start working with your live Amazon Athena data in Google Data Fusion today.

Ready to get started?

Download a free trial of the Amazon Athena Driver to get started:

 Download Now

Learn more:

Amazon Athena Icon Amazon Athena JDBC Driver

Rapidly create and deploy powerful Java applications that integrate with Amazon Athena.