Replicate Data from Multiple Files in an Amazon S3 Bucket Using CData Sync



Amazon S3 is internet storage designed to enable and empower web-scale computing. With S3, you can store and retrieve any amount of data, from anywhere, at any time. Developers gain access to scalable, reliable, fast, cost-effective storage exactly like the infrastructure Amazon uses.

CData Sync is able to point to either a directory or a folder, pull all files that are similarly formatted, and replicate them into a single destination table. After the initial replication, CData Sync checks the last modified date of each file in the directory and pulls in only the new files, preventing the replication of duplicate data.

Organizations regularly produce related files for periodic data (like receipts, invoices, etc.). For this article, we presume that there is an S3 bucket, which itself contains CSV files, and we replicate related data from these files into a Snowflake data warehouse.

To begin, log into your CData Sync instance, so that we may set up a CSV file as a source connector from CData Sync.

Configuring a CSV Source to Read Data Across Multiple Files

To replicate data from CSV files, first set up a CSV Source connection in CData Sync.

  1. Log into CData Sync and navigate to the Connections tab. Click + Add Connection, then select Sources, and search for and select CSV.
  2. Enter the connection properties, including Connection Type, AWS Region, Auth Scheme, AWS Access Key, AWS Secret Key, and the URI. The URI can be accessed within your S3 bucket folder by clicking "Copy S3 URI".
  3. To replicate data across multiple files, set Aggregate Files to True. Click Save and Test.

Configuring a Snowflake Destination

After testing and saving our CSV connection, we need to connect to the destination, in this case, Snowflake.

  1. Click the Connections tab, select Destinations, and choose Snowflake.
  2. Enter your Snowflake connection properties, including Warehouse, URL, Auth Scheme, User, Password, Database, and Schema and click Save and Test.

Replicating Data from Multiple Files into Snowflake

Now, we are ready to create a job to replicate multiple files from an S3 bucket to our Snowflake data warehouse.

  1. Click Jobs in the Sync interface and then select + Add Job.
  2. Give your job a name and select the CSV source and Snowflake destination connections that we just created. Click Add Job.
  3. After the job has been created, select the job and click on the Task tab. Click + Add Tasks and select the AggregatedFiles table. Click Add Tasks (1).
  4. You may replicate the data to an existing table or create a new one (to change the destination table, click the edit button for the Destination Information on the Task's Overview tab).
  5. Navigate to the Task's Column table to ensure the column mapping is correct.
  6. On the Job screen, click Run.
  7. Next, check your Snowflake instance to verify that indeed the data from your CSV files was replicated to there.

Free Trial & More Information

Now that you have seen how to replicate data from multiple files stored in Amazon S3, visit our CData Sync page to read more information about CData Sync and download a free trial. Start consolidating your enterprise data today! As always, our world-class Support Team is ready to answer any questions you may have.