Amazon S3 Connector for Data Prep¶
User Persona: Data Prep User, Data Prep Admin, or Data Source Admin
This document covers all configuration fields available during connector setup. Some fields may have already been filled out by your Administrator at an earlier step of configuration and may not be visible to you. For more information on Data Prep's connector framework, see Data Prep Connector setup. Also, your Admin may have named this connector something else in the list of Data Sources.
Configure Data Prep¶
This connector lets you import data from, and export data to, Amazon S3 object storage. The following fields define the connection parameters.
Name: Name of the data source as it will appear to users in the UI.
Description: Description of the data source as it will appear to users in the UI.
You can connect Data Prep to multiple S3 buckets. Using a descriptive name can be a big help to users in identifying the appropriate data source. If you are a Data Prep SaaS customer, please inform Data Prep DevOps how you would like this set.
Amazon S3 Client Configuration¶
Bucket name: An S3 bucket represents a collection of objects stored in Amazon S3. The connector requires the following permissions: s3:ListBucket, s3:GetObject, and (for export only) s3:PutObject. In addition, if there is a SourceIP condition block specified in your bucket policy, then you must include the IP addresses for your Main Core Server and Automation Core Server (if you have one).
See AWS S3 Bucket Permission/Policy Details at the bottom of this article for more details.
Prefix: Limits results to only those keys that begin with the specified prefix.
Encryption type: Server-side encryption type to be used. See AWS Encryption Types for more information.
Bucket region: Specifies the region in which the S3 bucket is hosted. Alternatively, you can have the connector determine the region automatically.
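To illustrate what automatic region detection can involve, here is a minimal Python sketch. It is not the connector's implementation. It assumes a boto3-style S3 client is passed in, and the function name `resolve_bucket_region` and its parameters are hypothetical.

```python
def resolve_bucket_region(s3_client, bucket_name, configured_region=None):
    """Return the configured region, or look it up when none is given.

    `s3_client` is assumed to be a boto3 S3 client (or any object that
    exposes the same get_bucket_location call).
    """
    if configured_region:
        # The user picked a region explicitly, so no lookup is needed.
        return configured_region

    response = s3_client.get_bucket_location(Bucket=bucket_name)
    location = response.get("LocationConstraint")

    # S3 reports us-east-1 buckets with an empty LocationConstraint and
    # some older eu-west-1 buckets with the legacy value "EU".
    if not location:
        return "us-east-1"
    if location == "EU":
        return "eu-west-1"
    return location
```

With boto3 installed, a call might look like `resolve_bucket_region(boto3.client("s3"), "example-bucket")`, where `example-bucket` is a placeholder name.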
Amazon S3 Authentication¶
These options specify how to authenticate with S3.
AWS Credentials: The Access Key ID and Secret Key associated with the user’s AWS Access Key. This is the default setting.
See AWS Security Credentials for more details.
Instance Profile (IAM Role): Enables all users in this tenant to access AWS without needing to individually authenticate.
See Using Instance Profile (IAM Role) to Grant Access to AWS Resources on Amazon EC2 for more details.
This connector will automatically retrieve credentials from the EC2 server instance.
IAM Cross Account: Enables access to S3 by assuming a role in another AWS account that has access to the configured S3 bucket.
See Cross Account Access for more details.
For the Instance Profile (IAM Role) and IAM Cross Account options, Data Prep must be installed on your Amazon EC2 hosts.
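For readers unfamiliar with cross-account access, the following Python sketch shows the general role-assumption flow using the AWS STS `AssumeRole` call. It illustrates the mechanism only and does not show how the connector does it internally. The function name, the default session name, and the role ARN in the usage note are hypothetical.

```python
def assume_cross_account_role(sts_client, role_arn, session_name="data-prep-s3"):
    """Assume a role in another account and return temporary credentials.

    `sts_client` is assumed to be a boto3 STS client. The returned dict
    uses the keyword names that boto3.client("s3", ...) accepts.
    """
    response = sts_client.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,  # hypothetical label for the session
    )
    creds = response["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }
```

With boto3 installed, the result could be passed on as `boto3.client("s3", **assume_cross_account_role(boto3.client("sts"), "arn:aws:iam::111122223333:role/example-role"))`. On an EC2 host with an instance profile, `boto3.client("sts")` picks up the instance credentials without any keys being supplied.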
If you connect to Amazon S3 through a proxy server, these fields define the proxy details.
- Web Proxy: Select 'None' if no proxy is required, or 'Proxied' if the connection to the Amazon S3 REST endpoint should be made via a proxy server. When 'Proxied' is selected, the following fields define the proxied connection.
- Proxy host: The host name or IP address of the web proxy server.
- Proxy port: The port on the proxy server that the data source connects to.
- Proxy username: The username for the proxy server.
- Proxy password: The password for the proxy server. Leave the username and password blank for an unauthenticated proxy connection.
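Many S3 client libraries take proxy settings as a single URL rather than as separate fields. As a rough illustration of how the four proxy fields above map onto such a URL, here is a small Python sketch. The function name `build_proxy_url` is hypothetical, and the `http://` scheme is an assumption about the proxy.

```python
from urllib.parse import quote


def build_proxy_url(host, port, username=None, password=None):
    """Combine proxy host, port, and optional credentials into one URL.

    Username and password are percent-encoded so that characters such as
    '@' or ':' do not break the URL. Leaving both blank yields an
    unauthenticated proxy URL.
    """
    if username:
        userinfo = quote(username, safe="")
        if password:
            userinfo += ":" + quote(password, safe="")
        return f"http://{userinfo}@{host}:{port}"
    return f"http://{host}:{port}"
```

For example, `build_proxy_url("proxy.example.com", 8080)` produces `http://proxy.example.com:8080`. With boto3, a URL like this can be supplied through the `proxies` option of `botocore.config.Config`.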
Socket Timeout Seconds: The number of seconds to wait for a response from Amazon S3 on an established connection. The default value is 300 seconds (5 minutes). Increase the value when exporting large files.
Data Import and Export Information¶
The Connector will present a browsable directory hierarchy starting at the location defined in the Prefix field.
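S3 has no real directories. A browsable hierarchy is typically derived from key prefixes and the `/` delimiter. The Python sketch below lists one level of such a hierarchy with the `ListObjectsV2` API. It illustrates the general technique and is not the connector's code. It assumes a boto3-style S3 client is passed in, and the function name `browse_level` is hypothetical.

```python
def browse_level(s3_client, bucket_name, prefix=""):
    """List the 'folders' and files directly under `prefix`.

    `s3_client` is assumed to be a boto3 S3 client. Using Delimiter="/"
    makes S3 group deeper keys into CommonPrefixes, which act as folders.
    """
    paginator = s3_client.get_paginator("list_objects_v2")
    folders, files = [], []
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix, Delimiter="/"):
        # Keys nested below this level are rolled up into common prefixes.
        folders.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
        # Keys directly at this level are returned as objects.
        files.extend(obj["Key"] for obj in page.get("Contents", []))
    return folders, files
```

Calling the function again with one of the returned folder prefixes descends one level, which mirrors the way a user clicks through folders in a browser view.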
The Connector also supports Wildcard and Glob importing which enables users to import multiple S3 data files into Data Prep as a single Dataset.
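To give a feel for how a wildcard pattern selects several files at once, here is a short Python sketch that filters a list of object keys. It uses Python's `fnmatch` rules, which may differ from the pattern syntax the connector accepts. The function name and the sample keys are hypothetical.

```python
from fnmatch import fnmatchcase


def match_keys(keys, pattern):
    """Return the keys that match a shell-style wildcard pattern.

    Note: with fnmatch, '*' also matches '/' characters, so a pattern
    can span several path segments.
    """
    return [key for key in keys if fnmatchcase(key, pattern)]


sample_keys = [
    "sales/2023/jan.csv",
    "sales/2023/feb.csv",
    "sales/2023/readme.txt",
    "sales/2024/jan.csv",
]
# Selects the two CSV files under sales/2023/.
csv_2023 = match_keys(sample_keys, "sales/2023/*.csv")
```

All keys selected this way would then be combined into a single Dataset on import.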
Via SQL Query¶
As S3 is a file store, SQL Queries are not supported for this data source. If you would like to directly query AWS S3 data, please reach out to your Customer Success contact regarding Data Prep’s AWS Athena Connector.
AWS S3 Bucket Permission/Policy Details¶
This section reviews the permissions that must be assigned in your S3 bucket policy and what you are required to do if you have a SourceIP condition block specified in your bucket policy.
The AWS S3 connector requires specific permissions in your S3 bucket policy to ensure that you can successfully import data from S3, publish to S3, and automate importing from an S3 source. In summary:
- The connector requires the s3:ListBucket permission on the bucket for browsing.
- For importing the bucket contents, Data Prep requires the s3:GetObject permission.
- For exporting to the bucket, Data Prep requires the s3:PutObject permission.
Sample bucket policy¶
Minimum policy permissions¶
The minimum policy permissions for reading from an S3 bucket are s3:ListBucket on the bucket and s3:GetObject on its objects.
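As a hedged sketch, the read permissions named in this article can be assembled into a bucket policy document as follows. The bucket name, the principal ARN, the `Sid` labels, and the function name are placeholders, not values from Data Prep. Adapt them to your own account before use.

```python
import json


def build_read_policy(bucket_name, principal_arn):
    """Build a bucket policy granting browse and import (read) access."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # s3:ListBucket applies to the bucket itself (browsing).
                "Sid": "ListBucketForBrowsing",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # s3:GetObject applies to the objects inside the bucket.
                "Sid": "GetObjectForImport",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:GetObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


# Placeholder bucket and principal, for illustration only.
read_policy = build_read_policy(
    "example-bucket", "arn:aws:iam::111122223333:user/example-user"
)
print(json.dumps(read_policy, indent=2))
```

The printed JSON can be pasted into the bucket policy editor. Note that the bucket-level and object-level actions need different `Resource` values.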
The minimum policy permissions for writing to an S3 bucket are s3:PutObject on its objects, in addition to s3:ListBucket on the bucket for browsing.
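A corresponding sketch for export is shown below. It assumes that browsing the bucket is still needed when exporting, so it keeps s3:ListBucket alongside s3:PutObject. The bucket name, principal ARN, `Sid` labels, and function name are again placeholders.

```python
import json


def build_write_policy(bucket_name, principal_arn):
    """Build a bucket policy granting browse and export (write) access."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Assumption: browsing is still required when exporting.
                "Sid": "ListBucketForBrowsing",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:ListBucket",
                "Resource": bucket_arn,
            },
            {
                # s3:PutObject applies to the objects written on export.
                "Sid": "PutObjectForExport",
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "s3:PutObject",
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


# Placeholder bucket and principal, for illustration only.
write_policy = build_write_policy(
    "example-bucket", "arn:aws:iam::111122223333:user/example-user"
)
print(json.dumps(write_policy, indent=2))
```

If the same data source is used for both import and export, the s3:GetObject and s3:PutObject actions can be listed together in one statement on the `/*` resource.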
For a detailed explanation of S3 buckets, refer to Working with Amazon S3 Buckets.
SourceIP condition block¶
If there is a SourceIP condition block specified in your bucket policy, then you must include the IP addresses of your Data Prep cloud servers or Data Prep Core Server (depending on your Data Prep deployment) in the SourceIP Condition block. In addition, if you have a dedicated Data Prep server for automation, you must also include the automation server IP addresses in the SourceIP Condition block.
Please consult with Data Prep's Customer Success team to obtain the list of IP addresses for Data Prep cloud servers.
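The following Python sketch shows one way to add server IP addresses to an existing SourceIP condition. It assumes the policy statement restricts access with an `IpAddress` condition on the `aws:SourceIp` key, spelled exactly that way. The function name is hypothetical, and the addresses are documentation placeholders, not real Data Prep addresses.

```python
import copy


def include_source_ips(statement, extra_cidrs):
    """Return a copy of a policy statement with extra CIDRs allowed.

    Existing aws:SourceIp entries are kept. A single string value is
    converted to a list, and duplicates are skipped.
    """
    updated = copy.deepcopy(statement)
    ip_block = updated.setdefault("Condition", {}).setdefault("IpAddress", {})
    existing = ip_block.get("aws:SourceIp", [])
    if isinstance(existing, str):
        existing = [existing]
    merged = list(existing)
    for cidr in extra_cidrs:
        if cidr not in merged:
            merged.append(cidr)
    ip_block["aws:SourceIp"] = merged
    return updated


# Placeholder statement and addresses, for illustration only.
original = {
    "Effect": "Allow",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-bucket/*",
    "Condition": {"IpAddress": {"aws:SourceIp": "203.0.113.0/24"}},
}
patched = include_source_ips(original, ["192.0.2.10/32", "198.51.100.20/32"])
```

Policies that restrict access with a `Deny` statement and `NotIpAddress` need the same addresses added to that block instead. This sketch does not cover that case.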
For details on the condition block element and examples, see Specifying Conditions in a Policy and Identity and Access Management (IAM) Policy Elements Reference.