Provide a fully managed data infrastructure stack with privileges on AWS resources while:
- never accessing customer data
- allowing a full audit of all operations performed in a Datacoral installation
- Data Security
- No passwords/tokens of customer systems (including databases and SaaS products) should be readable by Datacoral
- Customer Data in S3 or a data warehouse (Redshift, Snowflake, etc) should not be readable by Datacoral
- Network Security
- Customer should be able to control which data and systems are exposed to the internet
- Keep track of external IPs/endpoints accessed by resources within the Datacoral installation
- Keep track of all AWS API calls made by Datacoral DevOps personnel
- All audit logs should be available for analysis by customer
- Identity and Access Management
- Datacoral requires the creation of roles with minimum privileges for running the installation successfully.
- Manageability and Monitorability
- To provide a fully managed service, Datacoral needs permissions within a VPC to create and delete resources that belong to the Datacoral data infrastructure stack using a Cross Account Role
- Datacoral needs read-only privileges to monitor the installation
- AWS CloudWatch logs and metrics
- Data Warehouse system and metadata tables (like pg_catalog.* in Redshift) to support materialized views and for monitoring purposes
Datacoral Architecture Diagram
All customer data is encrypted at rest as well as in motion using customer managed KMS keys. Datacoral’s cross account role does not have decrypt permissions on the KMS keys, so Datacoral cannot read any customer data. The credentials that the ingest connectors need to connect to SaaS products and databases are also stored in DynamoDB within the customer AWS account, encrypted using customer managed KMS keys.
- Data source credentials, such as database connection strings and API keys for SaaS products, are stored in DynamoDB encrypted using customer managed KMS keys.
- Credentials for data warehouses like Redshift and Snowflake are also stored encrypted in DynamoDB.
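The statement that the cross account role has no decrypt permission on the KMS keys can be spot-checked against the role's policy documents. The following is a minimal sketch, not Datacoral tooling: the function name and the sample policy are hypothetical, and the check only inspects `Allow` statements and their `Action` patterns. It ignores `NotAction`, `Resource`, conditions, explicit denies, and KMS key policies or grants, all of which full IAM evaluation takes into account.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """IAM allows a single item or a list in most policy fields; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def statements_allowing_kms_decrypt(policy):
    """Return the Allow statements whose Action patterns match kms:Decrypt.

    Simplified check: action names are compared case-insensitively and
    '*' / '?' wildcards are honoured, nothing else is evaluated.
    """
    hits = []
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        for pattern in _as_list(statement.get("Action")):
            if fnmatchcase("kms:decrypt", pattern.lower()):
                hits.append(statement)
                break
    return hits


# Hypothetical policy document for illustration only.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": ["s3:ListBucket", "kms:Encrypt"], "Resource": "*"},
    ],
}

if statements_allowing_kms_decrypt(sample_policy):
    print("policy grants kms:Decrypt")
else:
    print("no Allow statement matches kms:Decrypt")
```

An empty result only shows that this particular document does not grant the action; every policy attached to the role, plus the key policy itself, would need the same review.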
All customer data is encrypted in motion via SSL and at rest using server side encryption in S3 and Redshift with keys owned by the customer. All efforts are made to provide the Datacoral fully managed service without any Datacoral personnel having access to customer data.
That said, on the rare occasions when data quality issues or other errors need to be debugged, some data may be revealed to qualified Datacoral personnel as part of the debugging process. Below are a few (but not all) instances when Datacoral personnel may have access:
- If there are errors while loading to Redshift, there's an entry in the stl_load_errors table with the errored out row.
- If there's an error in processing an event sent to API Gateway, the event is logged for debugging.
In such cases, Datacoral will create a security incident with an audit trail of all actions performed as part of the debugging process and the specific data items that were leaked in the logs.
Identity and Access Management
Datacoral uses AWS Identity and Access Management (IAM) features to apply the principle of least privilege. Prior to setting up a Datacoral installation, the following role and user have to be created in the customer AWS account.
Administrative Role and User
- Cross Account Role - This is used to deploy and monitor your stack in your AWS account.
- Read-only AWS console user - This user is used to monitor your stack using the AWS console. Datacoral by default turns on Multi-Factor Authentication (MFA) for the console user.
The above role and user are used to provide a fully managed data infrastructure as a service in the customer AWS account.
- For the most part, Datacoral software running in the Datacoral AWS Account with a dedicated IAM role assumes the customer cross account role for administration functionality.
- Only qualified employees are allowed to access customer accounts using the cross account role and console user for ad hoc administrative and debugging tasks.
- Each qualified employee who is going to access customer accounts using the cross account role has a separate IAM role in Datacoral, so any API/CLI use of the cross account role is tagged with the specific employee making those calls.
- Each employee has a datacoral.co account that is used as the AWS user in the Datacoral account.
Roles used by Datacoral Installation
Different roles are created for use within the Datacoral installation. Note that the cross account role will not be able to assume these roles. Only specific services within the Datacoral installation will use them. We create all of these roles separately in order to follow the Principle of Least Privilege.
All of the roles below are created through the Datacoral Role CloudFormation templates. You can click on each of the roles below to see the CloudFormation template with the details of all the permissions.
- APILambdaExecRole - Role used by Lambda functions that get triggered via API Gateway
- ApplicationAutoscalingRole - Role used to monitor and auto-scale DynamoDB tables
- LambdaExecRole - Role used by all Lambda functions called as part of all the connectors
- FirehoseRole - Role used by Firehose to write data to S3 and Redshift.
- RedshiftCopyRole - Role used to load data from S3 to Redshift.
- VPCFlowLogsRole - Role used to collect VPC Flow Logs.
- BeanstalkRole - Role used in Beanstalk apps that are used for long running processes like Query Executor, Metabase and Jupyterhub.
- InstallationRole - Role used to initialize a Datacoral installation.
- BatchRole - Role used to support execution of Python Compute, UDFs and AWS Batch processes.
The Datacoral installation creates all resources within a VPC in the customer AWS account. The only exceptions are S3 buckets, DynamoDB tables, and roles, which cannot be created within VPCs.
Public and private subnets
Most AWS resources managed by Datacoral will be in private subnets in the customer VPC, not accessible from the internet. Exceptions include:
- HTTPS endpoints - either for receiving data (events endpoints) or for making data available (DaaS endpoints)
- Redshift - if customer wants to connect to it using third party clients and SaaS software
- Metabase, Jupyterhub
All Lambda functions access external networks and endpoints via one Elastic IP. To give them read access to databases, that external IP needs to be added to the security group of the databases.
Datacoral provides a VPN connector using OpenVPN, which can be used to manage the customer VPN. Tools and Redshift can be accessed over the VPN if needed.
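Adding the external IP to a database security group amounts to one inbound rule scoped to a single address. The sketch below builds such a rule in the shape of an `IpPermissions` entry, as accepted by the EC2 AuthorizeSecurityGroupIngress API. The function name, the default port 5432 (PostgreSQL), the description text, and the sample address are assumptions for illustration; the actual IP and port depend on the installation and the database.

```python
import ipaddress


def build_db_ingress_rule(elastic_ip, port=5432, description="Datacoral Lambda egress IP"):
    """Build a single-address TCP ingress rule for a database security group.

    The returned dict follows the IpPermissions structure used by the EC2
    security group APIs. Nothing is sent to AWS here.
    """
    address = ipaddress.ip_address(elastic_ip)  # raises ValueError on malformed input
    if address.version != 4:
        raise ValueError("this sketch only handles IPv4 Elastic IPs")
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid TCP port: {port}")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # /32 restricts the rule to exactly one address.
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
    }


# 203.0.113.10 is a documentation-range placeholder, not a real Elastic IP.
rule = build_db_ingress_rule("203.0.113.10")
print(rule)
```

Keeping the rule at /32 and a single port means the database is opened only to that one address, in line with the least-privilege approach described earlier.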
Endpoint API Keys
Events and DaaS endpoints only accept requests from allowed origins with allowed API Keys. New API keys can be requested at any time by the customer.
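The combined origin and API key check can be pictured as follows. This is an illustrative sketch only and does not describe how Datacoral's endpoints are implemented: the function name, the sample origins and keys are hypothetical, and the `x-api-key` header name is assumed from the usual API Gateway convention.

```python
import hmac


def is_request_allowed(headers, allowed_origins, allowed_api_keys):
    """Accept a request only if both its Origin and its API key are allowed."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    origin = normalized.get("origin", "")
    api_key = normalized.get("x-api-key", "")

    if origin not in allowed_origins:
        return False

    # compare_digest avoids leaking key contents through comparison timing.
    return any(
        hmac.compare_digest(api_key.encode(), candidate.encode())
        for candidate in allowed_api_keys
    )


# Hypothetical configuration for illustration.
origins = {"https://app.example.com"}
keys = {"key-abc123"}

print(is_request_allowed(
    {"Origin": "https://app.example.com", "X-Api-Key": "key-abc123"}, origins, keys
))
```

Requiring both conditions means that a leaked key alone is not sufficient for a browser-based caller on another origin, although the `Origin` header offers no protection against non-browser clients that can set it freely.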
For each of the following, we recommend setting up logging to an S3 bucket. Ideally, this S3 Bucket will have been configured to not allow any deletes and should be in a separate AWS account.
All API actions in your AWS account can be captured in CloudTrail, so you can look at CloudTrail for an audit log of all actions performed by Datacoral in your AWS account. To set up CloudTrail in your AWS account, click here.
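Once CloudTrail records are delivered to S3, they can be filtered for uses of the cross account role. The following minimal sketch works on already-parsed CloudTrail records (Python dicts) and picks out STS `AssumeRole` events for one role. The function name, the role ARN, and the sample records are hypothetical, and the identity fields that are actually present vary with the type of caller, so the code falls back from `arn` to `principalId`.

```python
def assume_role_events(records, role_arn):
    """Return who assumed `role_arn`, and when, from parsed CloudTrail records."""
    matches = []
    for record in records:
        if record.get("eventSource") != "sts.amazonaws.com":
            continue
        if record.get("eventName") != "AssumeRole":
            continue
        params = record.get("requestParameters") or {}
        if params.get("roleArn") != role_arn:
            continue
        identity = record.get("userIdentity") or {}
        matches.append({
            "time": record.get("eventTime"),
            # Cross-account callers may be logged without an ARN.
            "caller": identity.get("arn") or identity.get("principalId"),
            "session": params.get("roleSessionName"),
        })
    return matches


# Hypothetical records, trimmed to the fields this sketch reads.
ROLE = "arn:aws:iam::111122223333:role/example-cross-account-role"
sample_records = [
    {
        "eventTime": "2020-01-01T12:00:00Z",
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "userIdentity": {"type": "AWSAccount", "principalId": "AROAEXAMPLE:alice"},
        "requestParameters": {"roleArn": ROLE, "roleSessionName": "alice"},
    },
    {
        "eventTime": "2020-01-01T12:05:00Z",
        "eventSource": "s3.amazonaws.com",
        "eventName": "ListBuckets",
        "userIdentity": {"type": "AssumedRole"},
        "requestParameters": None,
    },
]

for event in assume_role_events(sample_records, ROLE):
    print(event)
```

Because each qualified employee uses a separate IAM role on the Datacoral side, the caller and session fields are what tie an `AssumeRole` event back to an individual.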
VPC Flow Logs
VPC Flow Logs capture all outbound and inbound traffic from/to resources within the VPC. See the steps here for how to set this up yourself.
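Flow log records can be used to list the external addresses that resources in the installation talked to. The sketch below assumes records in the default (version 2) space-separated flow log format and a known VPC CIDR; the function names, the CIDR `10.0.0.0/16`, and the sample lines are hypothetical. Custom log formats would need a different field list.

```python
import ipaddress

# Field order of the default (version 2) VPC flow log format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log_line(line):
    """Split one default-format flow log record into a dict keyed by field name."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def external_peers(lines, vpc_cidr):
    """Return the set of addresses outside `vpc_cidr` seen as source or destination."""
    vpc = ipaddress.ip_network(vpc_cidr)
    peers = set()
    for line in lines:
        record = parse_flow_log_line(line)
        if record["log_status"] != "OK":
            continue  # NODATA / SKIPDATA records carry '-' instead of addresses
        for key in ("srcaddr", "dstaddr"):
            address = ipaddress.ip_address(record[key])
            if address not in vpc:
                peers.add(str(address))
    return peers


# Hypothetical sample records for illustration.
sample_lines = [
    "2 123456789010 eni-1235b8ca123456789 10.0.1.5 203.0.113.12 49152 443 6 10 840 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 10.0.1.5 10.0.2.9 49153 5439 6 8 640 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]

print(external_peers(sample_lines, "10.0.0.0/16"))
```

Comparing against the VPC CIDR, rather than relying on a generic "private address" test, keeps the result tied to what is actually inside the installation's network.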
Redshift Query Logs
All Redshift queries that are used for monitoring by Datacoral, and their corresponding outputs, can be logged in an S3 bucket. If Datacoral is setting up the Redshift cluster for you, this will be set up automatically. See the steps here for how to set this up yourself if you are using an existing Redshift cluster.
S3 Access Logs
S3 Server Access Logs can be set up to get detailed records of every request made to the S3 bucket in which customer data is saved. Click here to see how you can set this up.
Manageability and Monitorability
During configuration of the Datacoral installation
- Users securely provide credentials to access services such as databases, Salesforce, and Zendesk
- Credentials are written to the customer's DynamoDB instance, encrypted using the customer KMS key
During deployment of the Datacoral installation
- Datacoral uses the AWS cross account role to deploy services corresponding to the Datacoral stack in the customer account. CloudTrail should show all API calls made while deploying the stack
- All deployment is done automatically
- No Datacoral employee needs to assume the AWS cross account role
- Each Datacoral employee has a separate AWS IAM role so that the logs contain exactly who in Datacoral assumed the customer AWS cross account role
Day-to-day operations and monitoring
- Day-to-day operation of the Datacoral stack does not require any involvement of Datacoral employees
- All data is encrypted in motion as well as at rest using customer managed KMS keys
There are 2 interfaces between the customer’s Datacoral stack and Datacoral owned systems:
- Periodic heartbeats (sent to specific Datacoral IPs) to make sure that the customer account is active, i.e., all bills are paid
- System and processing metadata to monitor for usage and errors. This data is written to a specific S3 bucket that belongs to the customer, and is readable by Datacoral's AWS account. This S3 bucket is used for:
- Capturing summaries of Lambda execution
- Capturing CloudWatch metrics