Build Amazon Athena-Connected ETL Processes in Google Data Fusion
Load the CData JDBC Driver into Google Data Fusion and create ETL processes with access to live Amazon Athena data.
Google Data Fusion allows users to perform self-service data integration to consolidate disparate data. Uploading the CData JDBC Driver for Amazon Athena enables users to access live Amazon Athena data from within their Google Data Fusion pipelines. While the CData JDBC Driver enables piping Amazon Athena data to any destination natively supported in Google Data Fusion, this article walks through piping data from Amazon Athena to Google BigQuery.
Upload the CData JDBC Driver for Amazon Athena to Google Data Fusion
Upload the CData JDBC Driver for Amazon Athena to your Google Data Fusion instance to work with live Amazon Athena data. Due to the naming restrictions for JDBC drivers in Google Data Fusion, create a copy of (or rename) the JAR file to match the format driver-version.jar. For example: cdataamazonathena-2020.jar
- Open your Google Data Fusion instance
- Click the + button to add an entity and upload a driver
- On the "Upload driver" tab, drag or browse to the renamed JAR file.
- On the "Driver configuration" tab:
- Name: Create a name for the driver (cdata.jdbc.amazonathena) and make note of the name
- Class name: Set the JDBC class name (cdata.jdbc.amazonathena.AmazonAthenaDriver); an optional way to verify the class name is sketched after this list
- Click "Finish"
Connect to Amazon Athena Data in Google Data Fusion
With the JDBC Driver uploaded, you are ready to work with live Amazon Athena data in Google Data Fusion Pipelines.
- Navigate to the Pipeline Studio to create a new Pipeline
- From the "Source" options, click "Database" to add a source for the JDBC Driver
- Click "Properties" on the Database source to edit the properties
NOTE: To use the JDBC Driver in Google Data Fusion, you will need a license (full or trial) and a Runtime Key (RTK). For more information on obtaining this license (or a trial), contact our sales team.
- Set the Label
- Set Reference Name to a value for any future references (for example: cdata-amazonathena)
- Set Plugin Type to "jdbc"
- Set Connection String to the JDBC URL for Amazon Athena. For example:
jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';
Authenticating to Amazon Athena
To authorize Amazon Athena requests, provide the credentials for an administrator account or for an IAM user with custom permissions: set AccessKey to the access key ID and SecretKey to the secret access key.
Note: Though you can connect as the AWS account administrator, it is recommended to use IAM user credentials to access AWS services.
Obtaining the Access Key
To obtain the credentials for an IAM user, follow the steps below:
- Sign into the IAM console.
- In the navigation pane, select Users.
- To create or manage the access keys for a user, select the user and then select the Security Credentials tab.
To obtain the credentials for your AWS root account, follow the steps below:
- Sign into the AWS Management console with the credentials for your root account.
- Select your account name or number and select My Security Credentials in the menu that is displayed.
- Click Continue to Security Credentials and expand the Access Keys section to manage or create root account access keys.
Authenticating from an EC2 Instance
If you are using the CData JDBC Driver for Amazon Athena from an EC2 instance and have an IAM Role assigned to the instance, you can use the IAM Role to authenticate. To do so, set UseEC2Roles to true and leave AccessKey and SecretKey empty. The driver will automatically obtain your IAM Role credentials and authenticate with them.
Authenticating as an AWS Role
In many situations it may be preferable to use an IAM role for authentication instead of the direct security credentials of an AWS root user. An AWS role may be used instead by specifying the RoleARN. This will cause the CData JDBC Driver for Amazon Athena to attempt to retrieve credentials for the specified role. If you are connecting to AWS (instead of already being connected, such as on an EC2 instance), you must additionally specify the AccessKey and SecretKey of an IAM user that is allowed to assume the role. Roles may not be used when specifying the AccessKey and SecretKey of an AWS root user.
Authenticating with MFA
For users and roles that require multi-factor authentication, specify the MFASerialNumber and MFAToken connection properties. This will cause the CData JDBC Driver for Amazon Athena to submit the MFA credentials in a request to retrieve temporary authentication credentials. Note that the duration of the temporary credentials may be controlled via the TemporaryTokenDuration property (default: 3600 seconds).
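For reference, each of these authentication schemes is configured entirely through the connection string. The URLs below are illustrative sketches only: they reuse the placeholder region, database, staging directory, and RTK from the example earlier in this article, and the access keys and ARNs are made-up values intended only to show the format of the UseEC2Roles, RoleARN, MFASerialNumber, and MFAToken properties described above.

EC2 IAM role (no keys needed):
jdbc:amazonathena:RTK=5246...;UseEC2Roles=true;AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';

Assume an AWS role with IAM user keys:
jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';RoleARN='arn:aws:iam::123456789012:role/AthenaRole';AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';

With MFA:
jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';MFASerialNumber='arn:aws:iam::123456789012:mfa/user';MFAToken='123456';AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';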
Connecting to Amazon Athena
In addition to the AccessKey and SecretKey properties, specify Database, S3StagingDirectory and Region. Set Region to the region where your Amazon Athena data is hosted. Set S3StagingDirectory to a folder in S3 where you would like to store the results of queries.
If Database is not set in the connection, the data provider connects to the default database set in Amazon Athena.
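Before wiring the URL into the Data Fusion source, it can help to confirm the connection string works on its own. The sketch below is a minimal, standalone Java check (not part of the Data Fusion setup) that assumes the driver JAR is on the classpath and reuses the placeholder credentials from the example connection string above; it simply connects and lists the available tables.

// ConnectTest.java -- minimal standalone check of the Amazon Athena JDBC URL.
// Run with the driver JAR on the classpath, e.g.:
//   javac ConnectTest.java
//   java -cp .:cdata.jdbc.amazonathena.jar ConnectTest
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class ConnectTest {
    public static void main(String[] args) throws Exception {
        // Register the driver explicitly (class name from the driver configuration above).
        Class.forName("cdata.jdbc.amazonathena.AmazonAthenaDriver");
        // Same placeholder values as the example connection string above.
        String url = "jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';"
                   + "AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';";
        try (Connection conn = DriverManager.getConnection(url);
             ResultSet tables = conn.getMetaData().getTables(null, null, "%", null)) {
            // Listing tables confirms the credentials, region, and staging directory work.
            while (tables.next()) {
                System.out.println(tables.getString("TABLE_NAME"));
            }
        }
    }
}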
Built-in Connection String Designer
For assistance in constructing the JDBC URL, use the connection string designer built into the Amazon Athena JDBC Driver. Either double-click the JAR file or execute it from the command line.
java -jar cdata.jdbc.amazonathena.jar
Fill in the connection properties and copy the connection string to the clipboard.
- Set Import Query to a SQL query that will extract the data you want from Amazon Athena (a way to preview the query outside Data Fusion is sketched after this list), for example:
SELECT * FROM Customers
- From the "Sink" tab, click to add a destination sink (we use Google BigQuery in this example)
- Click "Properties" on the BigQuery sink to edit the properties
- Set the Label
- Set Reference Name to a value like amazonathena-bigquery
- Set Project ID to a specific Google BigQuery Project ID (or leave as the default, "auto-detect")
- Set Dataset to a specific Google BigQuery dataset
- Set Table to the name of the table you wish to insert Amazon Athena data into
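As an optional sanity check before deploying, you can run the same Import Query through the JDBC driver outside of Data Fusion and confirm it returns the rows you expect to land in BigQuery. The sketch below is illustrative only; it reuses the placeholder connection string and the example Customers query from earlier in this article.

// QueryPreview.java -- optional preview of the Import Query used by the Database source.
// Run with the driver JAR on the classpath, e.g.:
//   javac QueryPreview.java
//   java -cp .:cdata.jdbc.amazonathena.jar QueryPreview
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

public class QueryPreview {
    public static void main(String[] args) throws Exception {
        Class.forName("cdata.jdbc.amazonathena.AmazonAthenaDriver");
        String url = "jdbc:amazonathena:RTK=5246...;AWSAccessKey='a123';AWSSecretKey='s123';"
                   + "AWSRegion='IRELAND';Database='sampledb';S3StagingDirectory='s3://bucket/staging/';";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT * FROM Customers")) {
            ResultSetMetaData meta = rs.getMetaData();
            int rows = 0;
            // Print only the first few rows as a preview of what the pipeline will move.
            while (rs.next() && rows < 5) {
                StringBuilder row = new StringBuilder();
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.append(meta.getColumnName(i)).append("=").append(rs.getString(i)).append(" ");
                }
                System.out.println(row.toString().trim());
                rows++;
            }
        }
    }
}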
With the Source and Sink configured, you are ready to pipe Amazon Athena data into Google BigQuery. Save and deploy the pipeline. When you run the pipeline, Google Data Fusion will request live data from Amazon Athena and import it into Google BigQuery.
While this is a simple pipeline, you can create more complex Amazon Athena pipelines with transforms, analytics, conditions, and more. Download a free, 30-day trial of the CData JDBC Driver for Amazon Athena and start working with your live Amazon Athena data in Google Data Fusion today.