Provide a fully managed data infrastructure stack with privileges on AWS resources while:
- providing a full audit of all operations performed in a Datacoral installation
- never accessing customer data
- Data security
- No passwords/tokens of customer systems (including databases and SaaS products) should be readable by datacoral
- Customer Data in S3/Redshift should not be readable by datacoral
- Network security
- Customer should be able to control which data and systems are exposed to the internet
- Keep track of external IPs/endpoints accessed by resources within the datacoral installation
- Keep track of all AWS API calls made by datacoral software and datacoral devops personnel
- All audit logs should be available for analysis by customer
- Manageability and Monitorability
- For Datacoral to provide a fully managed service, Datacoral needs the permissions within a VPC to create and delete resources that belong to the Datacoral data infrastructure stack.
- Datacoral needs read-only privileges to monitor the installation and provide functionality
- AWS Cloudwatch logs and metrics
- redshift system and metadata tables like pg_catalog.* to support materialized views and for monitoring purposes
- hive system and metadata tables to support materialized views
Datacoral Architecture Diagram
All customer data is encrypted at rest as well as in motion using customer managed KMS keys. Datacoral’s cross account role does not have decrypt permissions on the KMS keys. This means that datacoral cannot read any customer data. The credentials needed by the collect slices to connect to SaaS products and databases are also stored encrypted using customer managed KMS keys within the customer AWS account in dynamodb.
- Data Source Credentials like database connection strings and API keys for SaaS products are stored in dynamodb encrypted using customer managed KMS keys.
- Credentials for analytics databases like hive and redshift are also stored encrypted in dynamodb
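As a rough illustration of the separation described above, the sketch below builds a KMS key policy in which a role running inside the installation may use the key while the cross account role is explicitly denied decryption. The ARNs, the statement IDs, and the helpers `build_key_policy` and `is_allowed` are hypothetical; this is one way to express the idea, not Datacoral's actual key policy, and the checker is far simpler than real IAM evaluation.

```python
import fnmatch
import json


def build_key_policy(data_role_arn: str, cross_account_role_arn: str) -> dict:
    """Build a KMS key policy document (hypothetical layout, for illustration)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # A role running inside the installation may use the key.
                "Sid": "AllowUseByInstallationRole",
                "Effect": "Allow",
                "Principal": {"AWS": data_role_arn},
                "Action": ["kms:Encrypt", "kms:Decrypt", "kms:GenerateDataKey*"],
                "Resource": "*",
            },
            {
                # The cross account role is explicitly refused decryption.
                "Sid": "DenyDecryptToCrossAccountRole",
                "Effect": "Deny",
                "Principal": {"AWS": cross_account_role_arn},
                "Action": ["kms:Decrypt"],
                "Resource": "*",
            },
        ],
    }


def is_allowed(policy: dict, principal_arn: str, action: str) -> bool:
    """Simplified check: an explicit Deny wins, otherwise an Allow is required.

    This is not the full IAM evaluation logic. It only compares a single
    principal ARN and matches actions with shell-style wildcards.
    """
    allowed = False
    for statement in policy["Statement"]:
        if statement["Principal"]["AWS"] != principal_arn:
            continue
        if not any(fnmatch.fnmatchcase(action, p) for p in statement["Action"]):
            continue
        if statement["Effect"] == "Deny":
            return False
        allowed = True
    return allowed


# Hypothetical ARNs; real account IDs and role names will differ.
LAMBDA_ROLE = "arn:aws:iam::111111111111:role/LambdaExecRole"
CROSS_ACCOUNT_ROLE = "arn:aws:iam::111111111111:role/ExampleCrossAccountRole"

policy = build_key_policy(LAMBDA_ROLE, CROSS_ACCOUNT_ROLE)
print(json.dumps(policy, indent=2))
print(is_allowed(policy, LAMBDA_ROLE, "kms:Decrypt"))         # True
print(is_allowed(policy, CROSS_ACCOUNT_ROLE, "kms:Decrypt"))  # False
```

In practice the absence of an Allow statement is already enough to block decryption; the explicit Deny in this sketch only makes the intent visible in the policy itself.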
All customer data is encrypted in motion via SSL and at rest using server side encryption in S3 and Redshift with keys owned by the customer. All efforts are made to provide the Datacoral fully managed service without any Datacoral personnel having access to customer data.
That said, on the rare occasions when data quality issues or other errors need to be debugged, some data may be revealed to qualified Datacoral personnel as part of the debugging process. Below are a few (but not all) instances when Datacoral personnel may have access:
- while loading to redshift, when there are errors, there's an entry in stl_load_errors table with the errored out row.
- when there's an error in processing an event sent to api gateway, the event is logged for debugging.
In such cases, Datacoral will create a security incident with an audit trail of all actions performed as part of the debugging process and the specific data items that were leaked in the logs.
Identity and Access Management
Datacoral uses AWS Identity and Access Management (IAM) features to apply the principle of least privilege. Prior to setting up a Datacoral installation, the following role and user have to be created in the customer AWS account.
Administrative Role and User
- Cross Account Role - This is used to deploy and monitor your stack in your AWS account.
- Read-only AWS console user - This user is used to monitor your stack using the AWS console. Datacoral by default turns on Multi-Factor Authentication (MFA) for the console user.
The above role and user are used to provide a fully managed data infrastructure as a service in the customer AWS account.
- For the most part, Datacoral software running in the Datacoral AWS Account with a dedicated IAM role assumes the customer cross account role for administration functionality.
- Only qualified employees are allowed to access customer accounts using the cross account role and console user for ad hoc administrative and debugging tasks.
- Each qualified employee who will access customer accounts using the cross account role has a separate IAM role in the datacoral account, so any API/CLI use of the cross account role is tagged with the specific employee making those calls.
- Each employee has a datacoral.co account that is used as the aws user in the datacoral account.
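The arrangement above, where a dedicated software role and per-employee roles in the Datacoral account assume the customer cross account role, is typically expressed as a trust policy on the cross account role. The following is a minimal sketch under stated assumptions: the role ARNs, the helper `build_trust_policy`, and the optional use of the `sts:ExternalId` condition are illustrative and are not taken from Datacoral's templates.

```python
import json
from typing import Iterable, Optional


def build_trust_policy(trusted_role_arns: Iterable[str],
                       external_id: Optional[str] = None) -> dict:
    """Build an IAM trust policy that lets the listed roles call sts:AssumeRole."""
    statement = {
        "Effect": "Allow",
        # Listing individual roles (not the whole account) keeps each
        # caller separately identifiable.
        "Principal": {"AWS": sorted(trusted_role_arns)},
        "Action": "sts:AssumeRole",
    }
    if external_id is not None:
        # Optional extra check; whether it is used here is an assumption.
        statement["Condition"] = {"StringEquals": {"sts:ExternalId": external_id}}
    return {"Version": "2012-10-17", "Statement": [statement]}


# Hypothetical ARNs in a hypothetical vendor account 222222222222.
trusted = [
    "arn:aws:iam::222222222222:role/example-automation",
    "arn:aws:iam::222222222222:role/example-employee-alice",
]
print(json.dumps(build_trust_policy(trusted, external_id="example-id"), indent=2))
```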
Roles used by Datacoral Installation
Create different roles needed within the datacoral installation. Note that the cross account role will not be able to assume these roles. Only specific services within the datacoral installation will use them. We create all of these roles separately in order to follow the Principle of Least Privilege.
All of the roles below are created through one of the Datacoral Role CloudFormation Templates. You can click on each of the roles below to see the CloudFormation Template with the details of all the permissions.
- APILambdaExecRole - Role used by lambda functions that get triggered via API Gateway
- ApplicationAutoscalingRole - Role used to monitor and auto-scale DynamoDB tables
- LambdaExecRole - Role used by all lambda functions called as part of all the slices
- FirehoseRole - Role used by Firehose to write data to S3 and Redshift.
- RedshiftCopyRole - Role used to load data from S3 to Redshift.
- VPCFlowLogsRole - Role used to collect VPC Flow Logs.
- BeanstalkRole - Role used in Beanstalk apps that are used for long running processes like Query Executor, Metabase and Jupyterhub.
- InstallationRole - Role used to initialize a Datacoral installation.
- BatchRole - Role used to support execution of Python Compute, UDFs and AWS Batch processes.
The datacoral installation creates all resources within a VPC in the customer AWS account. The only exceptions are S3 buckets, dynamodb tables, and roles, which cannot be created within VPCs.
Public and private subnets
Most AWS resources managed by datacoral will be in private subnets in the customer VPC and are not accessible from the internet. Exceptions include:
- HTTPS endpoints - either for receiving data (events endpoints) or for making data available (DaaS endpoints)
- Redshift - if customer wants to connect to it using third party clients and SaaS software
- Metabase, Jupyterhub
All lambdas access external networks and endpoints via one elastic IP. In order to provide read access to databases, the external IP needs to be added to the security group of the databases.
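Adding the elastic IP to a database security group amounts to one ingress rule scoped to that single address. The sketch below builds such a rule in the shape accepted by the `IpPermissions` parameter of boto3's `authorize_security_group_ingress`; the IP address, port, and helper name `build_ingress_rule` are placeholders chosen for illustration.

```python
import ipaddress


def build_ingress_rule(elastic_ip: str, port: int, description: str) -> dict:
    """Return one IpPermissions entry allowing TCP from a single IPv4 address."""
    # Raises ValueError (AddressValueError) if the string is not a valid IPv4 address.
    address = ipaddress.IPv4Address(elastic_ip)
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        # /32 restricts the rule to exactly one address.
        "IpRanges": [{"CidrIp": f"{address}/32", "Description": description}],
    }


# 203.0.113.10 is a documentation-only address used as a stand-in.
rule = build_ingress_rule("203.0.113.10", 5432, "datacoral collect (example)")
print(rule)

# With boto3 installed and credentials configured, the rule could be applied as:
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-...", IpPermissions=[rule])
```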
Datacoral provides a VPN slice using openvpn which can be used to manage the customer VPN. Tools and Redshift can be accessed over the VPN if needed.
Endpoint API Keys
Events and DaaS endpoints only accept requests from allowed origins with allowed API Keys. New API keys can be requested at any time by the customer.
All API actions in your AWS account are captured in CloudTrail, so you can use CloudTrail as an audit log of everything that we have done in your AWS account. In addition, any query we run on your redshift cluster is tracked in redshift itself. You should therefore have full visibility into all the actions we take in your account and on your data.
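To make the audit trail concrete, the sketch below scans CloudTrail records for `AssumeRole` calls against a given role and reports who made them. The field names (`eventSource`, `eventName`, `userIdentity`, `requestParameters`) follow the CloudTrail record format, but the sample records, the role ARN, and the helper `assume_role_callers` are invented for illustration, and which `userIdentity` fields are present varies with the identity type.

```python
from typing import Iterable, List, Optional, Tuple


def assume_role_callers(records: Iterable[dict],
                        role_arn: str) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (eventTime, caller) pairs for AssumeRole calls on role_arn."""
    callers = []
    for record in records:
        if record.get("eventSource") != "sts.amazonaws.com":
            continue
        if record.get("eventName") != "AssumeRole":
            continue
        params = record.get("requestParameters") or {}
        if params.get("roleArn") != role_arn:
            continue
        identity = record.get("userIdentity") or {}
        # Cross-account callers may carry only a principalId, so fall back to it.
        caller = identity.get("arn") or identity.get("principalId")
        callers.append((record.get("eventTime"), caller))
    return callers


# Made-up records shaped like entries in a CloudTrail log file's "Records" list.
ROLE = "arn:aws:iam::111111111111:role/ExampleCrossAccountRole"
sample_records = [
    {
        "eventTime": "2020-01-01T10:00:00Z",
        "eventSource": "sts.amazonaws.com",
        "eventName": "AssumeRole",
        "userIdentity": {"type": "AWSAccount",
                         "principalId": "AROAEXAMPLEID:alice",
                         "accountId": "222222222222"},
        "requestParameters": {"roleArn": ROLE, "roleSessionName": "debug"},
    },
    {
        "eventTime": "2020-01-01T10:05:00Z",
        "eventSource": "ec2.amazonaws.com",
        "eventName": "DescribeInstances",
        "userIdentity": {"type": "AssumedRole"},
        "requestParameters": None,
    },
]

print(assume_role_callers(sample_records, ROLE))
# [('2020-01-01T10:00:00Z', 'AROAEXAMPLEID:alice')]
```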
VPC Flow logs
Capture all outbound and inbound traffic from/to resources within the VPC
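One use of these logs is tracking which external IPs are reached from inside the installation. The sketch below parses flow log lines in the default (version 2) space-separated format and lists accepted destinations outside the VPC. The sample lines, the VPC CIDR, and the helper names are assumptions for illustration; custom log formats would need a different field list.

```python
import ipaddress
from typing import Iterable, List

# Field order of the default VPC Flow Logs record format.
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
]


def parse_flow_log_line(line: str) -> dict:
    """Split one default-format flow log line into a field dictionary."""
    parts = line.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))


def external_destinations(lines: Iterable[str], vpc_cidr: str) -> List[str]:
    """Return sorted destination IPs outside the VPC for accepted outbound flows."""
    vpc = ipaddress.ip_network(vpc_cidr)
    found = set()
    for line in lines:
        record = parse_flow_log_line(line)
        # NODATA / SKIPDATA records carry "-" placeholders instead of addresses.
        if record["log_status"] != "OK" or record["action"] != "ACCEPT":
            continue
        src = ipaddress.ip_address(record["srcaddr"])
        dst = ipaddress.ip_address(record["dstaddr"])
        if src in vpc and dst not in vpc:
            found.add(record["dstaddr"])
    return sorted(found)


# Made-up sample lines; 198.51.100.7 is a documentation-only address.
sample_lines = [
    "2 123456789010 eni-1235b8ca123456789 10.0.1.5 198.51.100.7 49152 443 6 10 840 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 10.0.1.5 10.0.2.9 49153 5439 6 12 900 1418530010 1418530070 ACCEPT OK",
    "2 123456789010 eni-1235b8ca123456789 - - - - - - - 1431280876 1431280934 - NODATA",
]

print(external_destinations(sample_lines, "10.0.0.0/16"))  # ['198.51.100.7']
```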
Redshift query logs
All redshift queries that are used for monitoring by datacoral and their corresponding outputs are stored in the s3 bucket dedicated to monitoring.
During configuration of the datacoral installation
- Users provide credentials to access services like databases/salesforce/zendesk securely
- Credentials are written to customer dynamodb db instance encrypted using customer KMS key
During deployment of the datacoral installation
- Datacoral uses the AWS cross account role to deploy services corresponding to datacoral stack in the customer account - cloudtrail should show all api calls made while deploying the stack
- All deployment is done automatically
- No datacoral employee needs to assume the AWS cross account role
- Each datacoral employee has a separate AWS IAM role so that the logs contain exactly who in datacoral assumed the customer AWS cross account role
Day-to-day operations and monitoring
- Day-to-day operation of the datacoral stack does not require any involvement of datacoral employees
- All data is encrypted in motion as well as at rest using customer managed KMS keys
There are two interfaces between the customer's datacoral stack and datacoral owned systems:
- Periodic heartbeats (sent to specific datacoral IPs) to make sure that the customer account is active, i.e., all bills are paid
- System and processing metadata to monitor for usage and errors. This data is written to a specific S3 bucket that belongs to the customer, and is readable by datacoral’s AWS account. This S3 bucket is used for
- Capturing summaries of lambda execution
- Capturing cloudwatch metrics