Connect to Amazon Athena Data from AWS Glue



Use CData Connect Cloud to gain access to live Amazon Athena data from your AWS Glue jobs.

AWS Glue is an ETL service that simplifies preparing and loading data for storage and analytics. With Glue Studio and CData Connect Cloud, you can build ETL jobs with little or no code, and those jobs work with data through the CData Glue Connector. This article shows how to connect to Amazon Athena through CData Connect Cloud and how to use the CData Glue Connector to create and run an AWS Glue job that works with live Amazon Athena data.

CData Connect Cloud provides a cloud-to-cloud interface for Amazon Athena, giving AWS Glue jobs direct access to live Amazon Athena data. You use the AWS Glue Connector and select a table (or write a custom SQL query). Connect Cloud pushes all supported query operations, including filters and JOINs, down to Amazon Athena, so server-side processing returns Amazon Athena data to your ETL jobs quickly.

This setup requires a CData Connect Cloud instance and the CData AWS Glue Connector. To get started, sign up for a free trial of Connect Cloud and subscribe to the free Glue Connector for Connect Cloud.


Configure Amazon Athena Connectivity for AWS Glue

Connectivity to Amazon Athena from AWS Glue is made possible through CData Connect Cloud. To work with Amazon Athena data from AWS Glue, we start by creating and configuring an Amazon Athena connection.

  1. Log into Connect Cloud, click Connections and click Add Connection
  2. Select "Amazon Athena" from the Add Connection panel
  3. Enter the necessary authentication properties to connect to Amazon Athena.

    Authenticating to Amazon Athena

    To authorize Amazon Athena requests, provide the credentials for an administrator account or for an IAM user with custom permissions. Set AccessKey to the access key ID and SecretKey to the secret access key.

    Note: Though you can connect as the AWS account administrator, it is recommended to use IAM user credentials to access AWS services.

    Obtaining the Access Key

    To obtain the credentials for an IAM user, follow the steps below:

    1. Sign into the IAM console.
    2. In the navigation pane, select Users.
    3. To create or manage the access keys for a user, select the user and then select the Security Credentials tab.

    To obtain the credentials for your AWS root account, follow the steps below:

    1. Sign into the AWS Management console with the credentials for your root account.
    2. Select your account name or number and select My Security Credentials in the menu that is displayed.
    3. Click Continue to Security Credentials and expand the Access Keys section to manage or create root account access keys.

    Authenticating from an EC2 Instance

    If you are using the CData Data Provider for Amazon Athena 2018 from an EC2 Instance and have an IAM Role assigned to the instance, you can use the IAM Role to authenticate. To do so, set UseEC2Roles to true and leave AccessKey and SecretKey empty. The CData Data Provider for Amazon Athena 2018 will automatically obtain your IAM Role credentials and authenticate with them.

    Authenticating as an AWS Role

    In many situations it may be preferable to authenticate with an IAM role instead of the direct security credentials of an AWS root user. To use a role, specify the RoleARN. This will cause the CData Data Provider for Amazon Athena 2018 to attempt to retrieve credentials for the specified role. If you are connecting to AWS from outside (rather than already being connected, such as on an EC2 instance), you must additionally specify the AccessKey and SecretKey of the IAM user that will assume the role. Roles cannot be used with the AccessKey and SecretKey of an AWS root user.

    Authenticating with MFA

    For users and roles that require multi-factor authentication, specify the MFASerialNumber and MFAToken connection properties. This will cause the CData Data Provider for Amazon Athena 2018 to submit the MFA credentials in a request to retrieve temporary authentication credentials. The duration of the temporary credentials can be controlled with the TemporaryTokenDuration property (default 3600 seconds).

    Connecting to Amazon Athena

    In addition to the AccessKey and SecretKey properties, specify Database, S3StagingDirectory and Region. Set Region to the region where your Amazon Athena data is hosted. Set S3StagingDirectory to a folder in S3 where you would like to store the results of queries.

    If Database is not set in the connection, the data provider connects to the default database set in Amazon Athena.

  4. Click Create & Test
  5. Navigate to the Permissions tab in the Add Amazon Athena Connection page and update the User-based permissions.
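
The property combinations described in step 3 can be summarized as a small check. This is only an illustrative sketch: the function name `missing_athena_properties` and the plain-dictionary representation are assumptions made for this example, not part of Connect Cloud, which collects these values through its connection form.

```python
def missing_athena_properties(props):
    """Return the names of properties that still need a value.

    `props` is a plain dict keyed by the property names used in this article.
    (Hypothetical helper for illustration; not a CData API.)
    """
    missing = []

    # Region and S3StagingDirectory are specified in every case. Database is
    # optional: when unset, the default Amazon Athena database is used.
    for name in ("Region", "S3StagingDirectory"):
        if not props.get(name):
            missing.append(name)

    # With UseEC2Roles=true the IAM role assigned to the EC2 instance supplies
    # the credentials, so AccessKey and SecretKey are left empty.
    uses_ec2_role = str(props.get("UseEC2Roles", "")).lower() == "true"
    if not uses_ec2_role:
        for name in ("AccessKey", "SecretKey"):
            if not props.get(name):
                missing.append(name)

    # MFA needs both the serial number and the token.
    if props.get("MFASerialNumber") and not props.get("MFAToken"):
        missing.append("MFAToken")
    if props.get("MFAToken") and not props.get("MFASerialNumber"):
        missing.append("MFASerialNumber")

    return missing
```

An empty list means the combination matches one of the authentication modes described above; it does not verify that the credentials themselves are valid.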

With the connection configured, you are ready to connect to Amazon Athena data from AWS Glue.

Add a Personal Access Token

If you are connecting from a service, application, platform, or framework that does not support OAuth authentication, you can create a Personal Access Token (PAT) to use for authentication. As a best practice, create a separate PAT for each service so that access stays granular.

  1. Click on your username at the top right of the Connect Cloud app and click User Profile.
  2. On the User Profile page, scroll down to the Personal Access Tokens section and click Create PAT.
  3. Give your PAT a name and click Create.
  4. The personal access token is only visible at creation, so be sure to copy it and store it securely for future use.

Update Permissions for your IAM Role

When you create the AWS Glue job, you specify an AWS Identity and Access Management (IAM) role for the job to use. The role must grant access to all resources used by the job, including Amazon S3 for any sources, targets, scripts, temporary directories, and AWS Glue Data Catalog objects. The role must also grant access to the CData Glue Connector for Amazon Athena from the AWS Glue Marketplace.

The following policies should be added to the IAM role for the AWS Glue job, at a minimum:

  • AWSGlueServiceRole (For accessing Glue Studio and Glue Jobs)
  • AmazonEC2ContainerRegistryReadOnly (For accessing the CData AWS Glue Connector for Amazon Athena)

If you will be accessing data found in Amazon S3, add:

  • AmazonS3FullAccess (For reading from and writing to Amazon S3)

And lastly, if you will be using AWS Secrets Manager to store confidential connection properties (see more below), you will need to add an inline policy similar to the following, granting access to the specific secrets needed for the Glue Job:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": [
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes128-1a2b3c",
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes192-4D5e6F",
        "arn:aws:secretsmanager:us-west-2:111122223333:secret:aes256-7g8H9i"
      ]
    }
  ]
}
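
If you prefer to generate the inline policy rather than edit it by hand, a minimal sketch follows. It reuses the four actions from the policy above and takes the secret ARNs as input; the function name `build_secrets_policy` and the example ARN are hypothetical.

```python
import json


def build_secrets_policy(secret_arns):
    """Build an inline IAM policy that allows reading only the listed secrets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetResourcePolicy",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:ListSecretVersionIds",
                ],
                # Restrict access to the specific secrets the Glue job needs.
                "Resource": list(secret_arns),
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical ARN; replace it with the ARN of your own secret.
    policy = build_secrets_policy(
        ["arn:aws:secretsmanager:us-west-2:111122223333:secret:connect-cloud-1a2b3c"]
    )
    print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console as an inline policy for the role used by the Glue job.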

For more information about granting access to AWS Glue Studio and Glue Jobs, see Setting up IAM Permissions for AWS Glue in the AWS Glue documentation.

For more information about granting access to the Amazon S3 buckets, see Identity and access management in the Amazon Simple Storage Service Developer Guide.

For more information on setting up access control for your secrets, see Authentication and Access Control for AWS Secrets Manager in the AWS Secrets Manager documentation and Limiting Access to Specific Secrets in the AWS Secrets Manager User Guide. The credential retrieved from AWS Secrets Manager (a string of key-value pairs) is used in the JDBC URL that the CData Glue Connector uses to connect to the data source; the JDBC URL format appears in the activation section below.

(Optional) Store Amazon Athena Connection Properties in AWS Secrets Manager

To safely store and use your connection properties, you can save them in AWS Secrets Manager.

Note: You must host your AWS Glue ETL job and secret in the same region. Cross-region secret retrieval is not supported currently.

  1. Sign in to the AWS Secrets Manager console.
  2. On either the service introduction page or the Secrets list page, choose Store a new secret.
  3. On the Store a new secret page, choose Other type of secret. This option means you must supply the structure and details of your secret.
  4. You can read more about the required properties to connect to Amazon Athena in the "Activate" section below. Once you know which properties you wish to store, create a key-value pair for each property. For example:
    • Username: CData Connect Cloud user (for example, user@example.com)
    • Password: CData Connect Cloud user PAT

    For more information about creating secrets, see Creating and Managing Secrets with AWS Secrets Manager in the AWS Secrets Manager User Guide.

  5. Record the secret name, which is used when configuring the connection in AWS Glue Studio.
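
The same secret can also be created programmatically. The sketch below assumes the boto3 library and AWS credentials are available; the helper names and the idea of passing the region explicitly are choices made for this example. The key names match the Username and Password pairs from step 4.

```python
import json


def build_secret_string(username, pat):
    """Serialize the Connect Cloud credentials as key-value pairs."""
    return json.dumps({"Username": username, "Password": pat})


def store_secret(secret_name, secret_string, region_name):
    """Create the secret with boto3 (requires boto3 and AWS credentials).

    Use the same region as the AWS Glue job, since cross-region secret
    retrieval is not supported.
    """
    import boto3  # imported lazily so build_secret_string works without boto3

    client = boto3.client("secretsmanager", region_name=region_name)
    return client.create_secret(Name=secret_name, SecretString=secret_string)
```

Avoid hard-coding the PAT in source files; read it from a prompt or an environment variable before calling `build_secret_string`.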

Subscribe to the CData Glue Connector for Amazon Athena

To work with the CData Glue Connector for Amazon Athena in AWS Glue Studio, you need to subscribe to the Connector from the AWS Marketplace. If you have already subscribed to the CData Glue Connector for Amazon Athena, you can jump to the next section.

  1. Navigate to the AWS Marketplace listing for the CData AWS Glue Connector for Connect Cloud
  2. Click "Continue to Subscribe"
  3. Accept the terms for the Connector and wait for the request to be processed
  4. Click "Continue to Configuration"

Activate the CData Glue Connector for Connect Cloud in Glue Studio

To use the CData Glue Connector for Amazon Athena in AWS Glue, you need to activate the subscribed connector in AWS Glue Studio. The activation process creates a connector object and connection in your AWS account.

  1. Once you subscribe to the connector, a new Configure tab shows up in the AWS Marketplace connector page.
  2. Choose the delivery options and click the "Continue to Launch" button.
  3. On the launch tab, click "Usage Instructions" and follow the link that appears to create and configure the connection.
  4. Under Connection access, select the JDBC URL format and configure the connection. Below you will find sample connection string(s) for the JDBC URL format(s) available for Amazon Athena. You can read more about authenticating with Amazon Athena in the Help documentation for the Connector.

    If you opted to store properties in AWS Secrets Manager, leave the placeholder values (e.g. ${Property1}) in place. Otherwise, the values you enter in the AWS Glue Connection interface will appear in the (read-only) JDBC URL below the properties.

    Connect Cloud

    jdbc:cdata:Connect:AuthScheme=Basic;User=${Username};Password=${Password};defaultCatalog=${defaultCatalog}
    1. ${Username}: set this to your Connect Cloud user
    2. ${Password}: set this to your Connect Cloud PAT
    3. ${defaultCatalog}: set this to the name of the connection you configured (e.g. AmazonAthena1)
  5. (Optional): Enable logging for the Connector.

    To log activity from the CData Glue Connector for Amazon Athena, append two properties to the JDBC URL:

    • Logfile: Set this to "STDOUT://"
    • Verbosity: Set this to an integer (1-5) for varying depths of logging. 1 is the default; 3 is recommended for most debugging scenarios.
  6. Configure the Network options and click "Create Connection."
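
To see what the resolved JDBC URL looks like once the placeholders are filled in, here is a small sketch. AWS Glue performs the actual substitution; this example only mimics it with Python's `string.Template`, whose `${name}` syntax happens to match the placeholders above. The function name `resolve_jdbc_url` is hypothetical.

```python
from string import Template

# URL format taken from this article; the ${...} placeholders are left as-is.
JDBC_URL_FORMAT = (
    "jdbc:cdata:Connect:AuthScheme=Basic;User=${Username};"
    "Password=${Password};defaultCatalog=${defaultCatalog}"
)


def resolve_jdbc_url(values, verbosity=None):
    """Fill in the placeholders and optionally append the logging properties."""
    # substitute() raises KeyError if any placeholder has no value.
    url = Template(JDBC_URL_FORMAT).substitute(values)
    if verbosity is not None:
        if not 1 <= verbosity <= 5:
            raise ValueError("Verbosity must be between 1 and 5")
        url += f";Logfile=STDOUT://;Verbosity={verbosity}"
    return url


if __name__ == "__main__":
    print(
        resolve_jdbc_url(
            {
                "Username": "user@example.com",
                "Password": "<your PAT>",
                "defaultCatalog": "AmazonAthena1",
            },
            verbosity=3,
        )
    )
```

Because the resolved URL contains the PAT, treat it like a credential and keep it out of logs and source control.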

Configure the AWS Glue Job

Once you have configured a Connection, you can build a Glue Job.

Create a Job that Uses the Connection

  1. In Glue Studio, under "Your connections," select the connection you created
  2. Click "Create job"

    The visual job editor appears. A new Source node, derived from the connection, is displayed on the Job graph. In the node details panel on the right, the Source Properties tab is selected for user input.

Configure the Source Node properties:

You can configure the access options for your connection to the data source in the Source properties tab. Refer to the AWS Glue Studio documentation for more information. Here we provide a simple walk-through.

  1. In the visual job editor, make sure the Source node for your connector is selected. Choose the Source properties tab in the node details panel on the right, if it is not already selected.
  2. The Connection field is populated automatically with the name of the connection associated with the marketplace connector.
  3. Enter information about the data location in the data source. Provide either a source table name or a query to use to retrieve data from the data source. For example: SELECT Name, TotalDue FROM AmazonAthena1.AmazonAthena.Customers WHERE CustomerId = 12345

    NOTE: Use the fully qualified name for the source table, where the name of the connection in CData Connect Cloud is the catalog name and the name of the data source is the schema. For example: AmazonAthena1.AmazonAthena.Customers.

  4. To pass information from the data source to the transformation nodes, AWS Glue Studio must know the schema of the data. Select "Use Schema Builder" to specify the schema interactively.
  5. Configure the remaining optional fields as needed. You can configure the following:
    • Partitioning information - for parallelizing the read operations from the data source
    • Data type mappings - to convert data types used in the source data to the data types supported by AWS Glue
    • Filter predicate - to select a subset of the data from the data source

    See "Use the Connection in a Glue job using Glue Studio" for more information about these options.

  6. You can view the schema generated by this node by choosing the Output schema tab in the node properties panel.
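
The source node settings above end up as `connection_options` in the generated script. The following sketch shows one way the table name, query, and partitioning choices might map to those options. The helper names are hypothetical, and the option keys (`dbTable`, `query`, `partitionColumn`, `lowerBound`, `upperBound`, `numPartitions`) follow my understanding of the AWS Glue documentation for marketplace JDBC connections, so verify them against the script Glue Studio generates for your job. The script Glue Studio generates may also include a `tableName` key, which this sketch omits.

```python
def qualified_table(connection, data_source, table):
    """Join catalog (Connect Cloud connection name), schema (data source), and table."""
    return ".".join([connection, data_source, table])


def source_options(connection_name, table=None, query=None, partition=None):
    """Build connection_options for a marketplace.jdbc source node.

    partition, if given, is a tuple of (column, lower_bound, upper_bound, count)
    used to parallelize reads from the data source.
    """
    # The source accepts either a table name or a query, not both.
    if (table is None) == (query is None):
        raise ValueError("Provide exactly one of table or query")

    options = {"connectionName": connection_name}
    if table is not None:
        options["dbTable"] = table
    else:
        options["query"] = query

    if partition is not None:
        column, lower, upper, count = partition
        options.update(
            {
                "partitionColumn": column,
                "lowerBound": str(lower),
                "upperBound": str(upper),
                "numPartitions": str(count),
            }
        )
    return options
```

For example, `source_options("cdata-cloud-connector", table=qualified_table("AmazonAthena1", "AmazonAthena", "Customers"))` produces options for the table used throughout this article.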

Edit, Save, & Run the Job

Edit the job by adding and editing the nodes in the job graph. See Editing ETL jobs in AWS Glue Studio for more information.

After you complete editing the job, enter the job properties.

  1. Select the Job properties tab above the visual graph editor.
  2. Configure the following job properties when using custom connectors:
    • Name: Provide a job name.
    • IAM Role: Choose (or create) an IAM role with the necessary permissions, as described previously.
    • Type: Choose "Spark."
    • Glue version: Choose "Glue 3.0 - Supports spark 3.1, Scala 2, Python 3."
    • Language: Choose "Python 3."
    • Use the default values for the other parameters. For more information about job parameters, see "Defining Job Properties" in the AWS Glue Developer Guide.
  3. At the top of the page, choose "Save."
  4. A green top banner appears with the message "Successfully created Job."
  5. After you successfully save the job, you can choose "Run" to run the job.
  6. To view the generated script for the job, choose the "Script" tab at the top of the visual editor. The "Job runs" tab shows the job run history for the job. For more information about job run details, see "View information for recent job runs."
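
For reference, the job properties listed above correspond roughly to the arguments of the AWS Glue `create_job` API. The sketch below only builds the argument dictionary; it is not a replacement for saving the job in Glue Studio. The function name and the script location are hypothetical, and the mapping of the "Spark" type to the `glueetl` command name is stated here from my reading of the boto3 documentation.

```python
def job_definition(name, role_arn, script_location, connection_name):
    """Build keyword arguments for boto3's glue create_job call."""
    return {
        "Name": name,
        "Role": role_arn,
        # "glueetl" is the command name for Spark ETL jobs; Python 3 scripts.
        "Command": {
            "Name": "glueetl",
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "3.0",
        # The Glue connection created when activating the connector.
        "Connections": {"Connections": [connection_name]},
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**job_definition(
#       "athena-to-s3", "arn:aws:iam::111122223333:role/GlueJobRole",
#       "s3://my-bucket/scripts/job.py", "cdata-cloud-connector"))
```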

Review the Generated Script

At any point in the job creation, you can click on the Script tab to review the script being created by Glue Studio. If you create a simple job to write Amazon Athena data to an Amazon S3 bucket, your script will look similar to the following:

Sample Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node CData AWS Glue Connector for CData Connect
CDataAWSGlueConnectorforCDataConnect_node1 = (
    glueContext.create_dynamic_frame.from_options(
        connection_type="marketplace.jdbc",
        connection_options={
            "tableName": "AmazonAthena1.AmazonAthena.Customers",
            "dbTable": "AmazonAthena1.AmazonAthena.Customers",
            "connectionName": "cdata-cloud-connector",
        },
        transformation_ctx="CDataAWSGlueConnectorforCDataConnect_node1",
    )
)

job.commit()
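
The sample script shows only the source node. When you add an S3 target in Glue Studio, the generated script gains a write step; a minimal sketch of such a step is shown here. The function name, the JSON output format, the `S3bucket_node2` context name, and the bucket path are assumptions for illustration, and the code Glue Studio generates for your job may differ.

```python
def write_to_s3(glue_context, frame, s3_path):
    """Write a DynamicFrame to Amazon S3 as JSON using the GlueContext writer."""
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        format="json",
        connection_options={"path": s3_path},
        transformation_ctx="S3bucket_node2",  # hypothetical node name
    )


# Inside the Glue job, call this before job.commit(), for example:
#   write_to_s3(glueContext, CDataAWSGlueConnectorforCDataConnect_node1,
#               "s3://my-bucket/athena-output/")
```

The IAM role for the job needs write access to the target bucket, as described in the permissions section.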

Using CData Connect Cloud and AWS Glue Connector for Connect Cloud in AWS Glue Studio, you can easily create ETL jobs to load Amazon Athena data into an S3 bucket or any other destination. You can also use the Glue Connector to add, update, or delete Amazon Athena data in your Glue Jobs.

To get live data access to 100+ SaaS, Big Data, and NoSQL sources directly from your cloud applications, try CData Connect Cloud today!