How to work with Outlook Data in Apache Spark using SQL
Apache Spark is a fast and general engine for large-scale data processing. When paired with the CData JDBC Driver for Outlook, Spark can work with live Outlook data. This article describes how to connect to and query Outlook data from a Spark shell.
The CData JDBC Driver offers unmatched performance for interacting with live Outlook data due to optimized data processing built into the driver. When you issue complex SQL queries to Outlook, the driver pushes supported SQL operations, like filters and aggregations, directly to Outlook and utilizes the embedded SQL engine to process unsupported operations (often SQL functions and JOIN operations) client-side. With built-in dynamic metadata querying, you can work with and analyze Outlook data using native data types.
Install the CData JDBC Driver for Outlook
Download the CData JDBC Driver for Outlook installer, unzip the package, and run the JAR file to install the driver.
Start a Spark Shell and Connect to Outlook Data
- Open a terminal and start the Spark shell with the CData JDBC Driver for Outlook JAR file as the jars parameter:
$ spark-shell --jars "/CData/CData JDBC Driver for Outlook/lib/cdata.jdbc.api.jar"
- With the shell running, you can connect to Outlook with a JDBC URL and use the SQL Context load() function to read a table.
Using OAuth Authentication
Microsoft Graph API uses OAuth 2.0 for authentication. You must register an application in the Microsoft Azure Portal to obtain OAuth credentials (Client ID and Client Secret).
Obtaining OAuth Credentials
- Log in to the Azure Portal.
- Navigate to Azure Active Directory > App registrations.
- Click New registration to create a new application.
- Enter an application name and select the appropriate account types.
- Set the Redirect URI to your application's callback URL (e.g., http://localhost:33333 for desktop apps).
- Click Register to create the application.
- On the application overview page, copy the Application (client) ID - this is your OAuthClientId.
- Navigate to Certificates & secrets and create a new client secret.
- Copy the client secret value - this is your OAuthClientSecret.
- Navigate to API permissions and add the required Microsoft Graph API permissions:
  - Mail.Read - For accessing email messages
  - Contacts.Read - For accessing contacts
  - Calendars.Read - For accessing calendar events
  - Tasks.Read - For accessing To Do tasks
  - offline_access - For obtaining refresh tokens
- Click Grant admin consent to grant these permissions.
Connecting with OAuth
After setting the following connection properties, you are ready to connect:
- AuthScheme: Set this to OAuth.
- InitiateOAuth: Set this to GETANDREFRESH. With this setting, the driver walks through the OAuth process automatically to obtain the access token.
- OAuthClientId: Set this to the Application (client) ID from Azure Portal.
- OAuthClientSecret: Set this to the client secret value from Azure Portal.
- TenantId: Set this to your Azure AD tenant identifier (GUID or domain name like 'contoso.onmicrosoft.com').
- CallbackURL: Set this to the Redirect URI you specified in your app registration (e.g., http://localhost:33333 for desktop apps).
Example connection string
Profile=C:\profiles\Outlook.apip;AuthScheme=OAuth;InitiateOAuth=GETANDREFRESH;OAuthClientId=your_client_id;OAuthClientSecret=your_client_secret;TenantId=your_tenant_id;CallbackUrl=http://localhost:33333;
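The connection string above is a sequence of Key=Value; pairs, and the JDBC URL used later in this article is the same string prefixed with jdbc:api:. A minimal sketch of assembling it in Scala follows. The OutlookJdbcUrl object and its build method are hypothetical helpers written for illustration, not part of the driver, and the values are the placeholders from the example.

```scala
// Hypothetical helper (not part of the CData driver): builds a jdbc:api: URL
// from an ordered list of key/value connection properties.
object OutlookJdbcUrl {
  // Emits each property as "Key=Value;" after the "jdbc:api:" prefix.
  def build(props: Seq[(String, String)]): String =
    props.map { case (key, value) => s"$key=$value;" }.mkString("jdbc:api:", "", "")

  // Placeholder values copied from the example connection string.
  val example: String = build(Seq(
    "Profile"           -> "C:\\profiles\\Outlook.apip", // backslashes escaped for Scala
    "AuthScheme"        -> "OAuth",
    "InitiateOAuth"     -> "GETANDREFRESH",
    "OAuthClientId"     -> "your_client_id",
    "OAuthClientSecret" -> "your_client_secret",
    "TenantId"          -> "your_tenant_id",
    "CallbackUrl"       -> "http://localhost:33333"
  ))
}
```

Pasted into the Spark shell, OutlookJdbcUrl.example can be passed as the url option. Keeping the properties as separate pairs makes it easier to swap in real credentials without editing one long string.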
Built-in Connection String Designer
For assistance in constructing the JDBC URL, use the connection string designer built into the Outlook JDBC Driver. Either double-click the JAR file or run it from the command line.
java -jar cdata.jdbc.api.jar
Fill in the connection properties and copy the connection string to the clipboard.
Configure the connection to Outlook, using the connection string generated above.
scala> val api_df = spark.sqlContext.read.format("jdbc").option("url", "jdbc:api:Profile=C:\\profiles\\Outlook.apip;AuthScheme=OAuth;InitiateOAuth=GETANDREFRESH;OAuthClientId=your_client_id;OAuthClientSecret=your_client_secret;TenantId=your_tenant_id;CallbackUrl=http://localhost:33333;").option("dbtable","CalendarGroupCalendars").option("driver","cdata.jdbc.api.APIDriver").load()
The backslashes in the Windows profile path are doubled because Scala string literals treat a backslash as an escape character.
- Once you connect and the data is loaded, the table schema is displayed.
Register the Outlook data as a temporary view:
scala> api_df.createOrReplaceTempView("calendargroupcalendars")
Perform custom SQL queries against the data using commands like the one below, replacing group_id with an actual calendar group ID and * with the columns you need:
scala> api_df.sqlContext.sql("SELECT * FROM CalendarGroupCalendars WHERE CalendarGroupId = 'group_id'").collect.foreach(println)
The results are displayed in the console.
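Because the query is plain SQL text, a value substituted into the WHERE clause is worth checking before it is interpolated. The sketch below is one way to do that, assuming group IDs contain only letters, digits, underscores, hyphens, and equals signs. That character set is an assumption and should be adjusted to the IDs you actually see. The CalendarQuery object is a hypothetical helper, not part of Spark or the driver.

```scala
// Hypothetical helper: builds the filter query only for IDs that match an allowlist.
object CalendarQuery {
  // Assumption: valid IDs use only these characters; adjust to your data.
  private val SafeIdPattern = "[A-Za-z0-9_=-]+"

  def byGroup(groupId: String): String = {
    // String.matches requires the whole value to match the pattern.
    require(groupId.matches(SafeIdPattern), s"unexpected characters in group id: $groupId")
    s"SELECT * FROM calendargroupcalendars WHERE CalendarGroupId = '$groupId'"
  }
}
```

In the Spark shell, the resulting text can be run with spark.sql(CalendarQuery.byGroup("group_id")).collect.foreach(println). An ID containing a quote or whitespace is rejected with an IllegalArgumentException rather than being passed into the query.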
Using the CData JDBC Driver for Outlook in Apache Spark, you can run fast, complex analytics on Outlook data, combining the power and utility of Spark with your data. Download a free, 30-day trial of any of the hundreds of CData JDBC Drivers and get started today.