Integrating Dataiku with Amazon Athena Data via CData Connect AI

Yazhini G
Yazhini G
Technical Marketing Engineer
Leverage the CData Connect AI Remote MCP Server to enable Dataiku Agents to securely query and act on live Amazon Athena data.

Dataiku is a collaborative data science and AI platform that enables teams to design, deploy, and manage machine learning and generative AI projects within a governed environment. It's Agent and GenAI framework allows users to build intelligent agents that can analyze, generate, and act on data through custom workflows and model orchestration.

By integrating Dataiku with CData Connect AI through the built-in MCP (Model Context Protocol) Server, these agents gain secure, real-time access to live Amazon Athena data. The integration bridges Dataiku's agent execution environment with CData's governed enterprise connectivity layer, allowing every query or instruction to run safely against authorized data sources without manual exports or staging.

This article demonstrates how to configure Amazon Athena connectivity in Connect AI, prepare a Python code environment in Dataiku with MCP support, and create an agent that queries and interacts with live Amazon Athena data directly from within Dataiku.

About Amazon Athena Data Integration

CData provides the easiest way to access and integrate live data from Amazon Athena. Customers use CData connectivity to:

  • Authenticate securely using a variety of methods, including IAM credentials, access keys, and Instance Profiles, catering to diverse security needs and simplifying the authentication process.
  • Streamline their setup and quickly resolve issue with detailed error messaging.
  • Enhance performance and minimize strain on client resources with server-side query execution.

Users frequently integrate Athena with analytics tools like Tableau, Power BI, and Excel for in-depth analytics from their preferred tools.

To learn more about unique Amazon Athena use cases with CData, check out our blog post: https://www.cdata.com/blog/amazon-athena-use-cases.


Getting Started


Step 1: Configure Amazon Athena Connectivity for Dataiku

Connectivity to Amazon Athena from Dataiku is made possible through CData Connect AI's Remote MCP Server. To interact with Amazon Athena data from Dataiku, you start by creating and configuring a Amazon Athena connection in CData Connect AI.

  1. Log into Connect AI, click Sources, then click Add Connection
  2. Select "Amazon Athena" from the Add Connection panel
  3. Enter the necessary authentication properties to connect to Amazon Athena.

    Authenticating to Amazon Athena

    To authorize Amazon Athena requests, provide the credentials for an administrator account or for an IAM user with custom permissions: Set AccessKey to the access key Id. Set SecretKey to the secret access key.

    Note: Though you can connect as the AWS account administrator, it is recommended to use IAM user credentials to access AWS services.

    Obtaining the Access Key

    To obtain the credentials for an IAM user, follow the steps below:

    1. Sign into the IAM console.
    2. In the navigation pane, select Users.
    3. To create or manage the access keys for a user, select the user and then select the Security Credentials tab.

    To obtain the credentials for your AWS root account, follow the steps below:

    1. Sign into the AWS Management console with the credentials for your root account.
    2. Select your account name or number and select My Security Credentials in the menu that is displayed.
    3. Click Continue to Security Credentials and expand the Access Keys section to manage or create root account access keys.

    Authenticating from an EC2 Instance

    If you are using the CData Data Provider for Amazon Athena 2018 from an EC2 Instance and have an IAM Role assigned to the instance, you can use the IAM Role to authenticate. To do so, set UseEC2Roles to true and leave AccessKey and SecretKey empty. The CData Data Provider for Amazon Athena 2018 will automatically obtain your IAM Role credentials and authenticate with them.

    Authenticating as an AWS Role

    In many situations it may be preferable to use an IAM role for authentication instead of the direct security credentials of an AWS root user. An AWS role may be used instead by specifying the RoleARN. This will cause the CData Data Provider for Amazon Athena 2018 to attempt to retrieve credentials for the specified role. If you are connecting to AWS (instead of already being connected such as on an EC2 instance), you must additionally specify the AccessKey and SecretKey of an IAM user to assume the role for. Roles may not be used when specifying the AccessKey and SecretKey of an AWS root user.

    Authenticating with MFA

    For users and roles that require Multi-factor Authentication, specify the MFASerialNumber and MFAToken connection properties. This will cause the CData Data Provider for Amazon Athena 2018 to submit the MFA credentials in a request to retrieve temporary authentication credentials. Note that the duration of the temporary credentials may be controlled via the TemporaryTokenDuration (default 3600 seconds).

    Connecting to Amazon Athena

    In addition to the AccessKey and SecretKey properties, specify Database, S3StagingDirectory and Region. Set Region to the region where your Amazon Athena data is hosted. Set S3StagingDirectory to a folder in S3 where you would like to store the results of queries.

    If Database is not set in the connection, the data provider connects to the default database set in Amazon Athena.

  4. Click Save & Test
  5. Open the Permissions tab and set user-based permissions

Add a Personal Access Token

A Personal Access Token (PAT) is used to authenticate the connection to Connect AI from Dataiku. It is best practice to create a separate PAT for each integration to maintain granular access control

  1. Click the gear icon () at the top right of the Connect AI app to open Settings
  2. On the Settings page, go to the Access Tokens section and click Create PAT
  3. Give the PAT a descriptive name and click Create
  4. Copy the token when displayed and store it securely. It will not be shown again

With the Amazon Athena connection configured and a PAT generated, Dataiku can now connect to Amazon Athena data through the CData MCP Server.

Step 2: Prepare Dataiku and the Code Environment

A dedicated python code environment in Dataiku provides the runtime support needed for MCP-based communication. To enable Dataiku Agents to connect to CData Connect AI, create a Python environment and install the MCP client dependencies required for agent-to-server interaction.

  1. In Dataiku Cloud, open Code Envs
  2. Click Add a code env to open the DSS settings window
  3. In DSS, click New Python env. Name it (for example, MCP_Package) and choose Python 3.10 (3.10 to 3.13 supported)
  4. Open Packages to install and add the following pip packages:
    • httpx
    • anyio
    • langchain-mcp-adapters
  5. Open Containerized execution and under Container runtime additions select Agent tool MCP servers support
  6. Check Rebuild env and click Save and update to install packages
  7. Back in Dataiku Cloud, open Overview and click Open instance
  8. Click + New project and select Blank project. Name the project

Step 3: Create a Dataiku Agent and connect to the MCP server

The Dataiku Agent serves as the bridge between the Dataiku workspace and the CData MCP Server. To enable this connection, create a custom code-based agent, assign it the configured Python environment, and embed your Connect AI credentials to allow the agent to query and interact with live Amazon Athena data.

  1. Go to Agents & GenAI Models and click Create your first agent
  2. Choose Code agent, name it, and for Agent version select Asynchronous agent without streaming
  3. From the tab above select Settings. In Code env selection set Default Python code env to the environment you created (for example, MCP_Package)
  4. Return to the Agent Design tab and paste the following code. Replace EMAIL, and PAT with your values
  5. 
    
    import os
    import base64
    from typing import Dict, Any, List
     
    from dataiku.llm.python import BaseLLM
    from langchain_mcp_adapters.client import MultiServerMCPClient
     
    # ---------- Persistent MCP client (cached between calls) ----------
    _MCP_CLIENT = None
     
    def _get_mcp_client() -> MultiServerMCPClient:
        """Create (or reuse) a MultiServerMCPClient to CData Cloud MCP."""
        global _MCP_CLIENT
        if _MCP_CLIENT is not None:
            return _MCP_CLIENT
     
        # Set creds via env/project variables ideally
        EMAIL = os.getenv("CDATA_EMAIL", "YOUR_EMAIL") 
        PAT   = os.getenv("CDATA_PAT",   "YOUR_PAT")        
        BASE_URL = "https://mcp.cloud.cdata.com/mcp"
     
        if not EMAIL or PAT == "YOUR_PAT":
            raise ValueError("Set CDATA_EMAIL and CDATA_PAT as env variables or inline in the code.")
     
        token = base64.b64encode(f"{EMAIL}:{PAT}".encode()).decode()
        headers = {"Authorization": f"Basic {token}"}
     
        _MCP_CLIENT = MultiServerMCPClient(
            connections={
                "cdata": {
                    "transport": "streamable_http",
                    "url": BASE_URL,
                    "headers": headers,
                }
            }
        )
        return _MCP_CLIENT
     
     
    def _pick_tool(tools, names: List[str]):
        L = [n.lower() for n in names]
        return next((t for t in tools if t.name.lower() in L), None)
     
     
    async def _route(prompt: str) -> str:
        """
        Simple intent router:
          - 'list connections' / 'list catalogs' -> getCatalogs
          - 'sql: ...' or 'query: ...' -> queryData
          - otherwise -> help text
        """
        client = _get_mcp_client()
        tools = await client.get_tools()
     
        p = prompt.strip()
        low = p.lower()
     
        # 1) List connections (catalogs)
        if "list connections" in low or "list catalogs" in low:
            t = _pick_tool(tools, ["getCatalogs", "listCatalogs"])
            if not t:
                return "No 'getCatalogs' tool found on the MCP server."
            res = await t.ainvoke({})
            return str(res)[:4000]
     
        # 2) Run SQL
        if low.startswith("sql:") or low.startswith("query:"):
            sql = p.split(":", 1)[1].strip()
            t = _pick_tool(tools, ["queryData", "sqlQuery", "runQuery", "query"])
            if not t:
                return "No query-capable tool (queryData/sqlQuery) found on the MCP server."
            try:
                res = await t.ainvoke({"query": sql})
                return str(res)[:4000]
            except Exception as e:
                return f"Query failed: {e}"
     
        # 3) Help
        return (
            "Connected to CData MCP
    
    "
            "Say **'list connections'** to view available sources, or run a SQL like:
    "
            "  sql: SELECT * FROM [Salesforce1].[SYS].[Connections] LIMIT 5
    
    "
            "Remember to use bracket quoting for catalog/schema/table names."
        )
     
     
    class MyLLM(BaseLLM):
        async def aprocess(self, query: Dict[str, Any], settings: Dict[str, Any], trace: Any):
            # Extract last user message from the Quick Test payload
            prompt = ""
            try:
                prompt = (query.get("messages") or [])[-1].get("content", "")
            except Exception:
                prompt = ""
     
            try:
                reply = await _route(prompt)
            except Exception as e:
                reply = f"Error: {e}"
     
            # The template expects a dict with a 'text' key
            return {"text": reply}
    
    

    Run a Quick Test

    1. Open Quick Test on the right side panel
    2. Paste the JSON code and click Run test
    3. 
      {
         "messages": [
            {
               "role": "user",
               "content": "list connections"
            }
         ],
         "context": {}
      }
      
      

    Chat with your Agent

    Switch to the Chat tab and try prompting like, "List all connections". The chat output will show a list of connection catalogs.

    Get CData Connect AI

    To access 300+ SaaS, Big Data, and NoSQL sources from your AI agents, try CData Connect AI today.

Ready to get started?

Learn more about CData Connect AI or sign up for free trial access:

Free Trial