Redshift WLM Configuration

Workload Management for Redshift (WLM)

Redshift, like many other database engines, has infrastructure for managing resources and workloads. Since resources are often scarce or costly, it makes sense to have infrastructure that lets users govern resource usage and prioritize the different types of tasks competing for those resources. In the case of Redshift, additional considerations arise from Redshift's architecture as an MPP database and from the implications of Redshift being a managed service.

The Redshift architecture

Redshift's fundamental MPP architecture means that the data stored in a Redshift cluster is distributed over all the cluster nodes, ideally somewhat evenly. But distributing the data of a table over all the nodes, e.g., by using a hash function on some column, means that any query accessing the table requires work to be performed on all the nodes – you can't run one particular query on one specific node and another (presumably more important) query on four other specific nodes. Because queries run on all the cluster nodes, they tend to be very fast, but only a limited number can run concurrently without risking overloading the system. Redshift resolves this issue with a queueing mechanism that makes newly submitted queries wait if the system is fully loaded. The WLM functionality provides a means for controlling the behavior of this queueing mechanism, including setting priorities for queries from different users or groups of users.

Redshift as a managed service

The managed service aspect of Redshift also has an impact on resource management in the area of concurrency. In order to properly monitor what happens on the system and alert users of problems, the Redshift control plane constantly issues SQL queries. Those queries tend to go against system tables rather than user data, but since the data sources for many Redshift system tables are spread out over all the nodes, these monitoring queries may have some impact on the number of user queries that can be executed concurrently.

The role of Concurrency Scaling

The need for WLM may be diminished if Redshift's Concurrency Scaling functionality is used. WLM is used to govern the usage of scarce resources and prioritize certain activities over others. Rather than restricting activity, Concurrency Scaling is meant to add resources elastically as needed so as to avoid scarcity issues. Concurrency Scaling can be configured as part of the WLM functionality.
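As an illustration only (not part of the Datacoral defaults), the AWS WLM documentation describes a per-queue concurrency_scaling setting in the wlm_json_configuration parameter: a queue whose mode is set to auto can offload eligible queries to a Concurrency Scaling cluster instead of making them wait. The query group name and numbers below are assumptions for the sketch:

[
  {
    "query_group": ["adhoc"],
    "query_concurrency": 5,
    "memory_percent_to_use": 40,
    "concurrency_scaling": "auto"
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 60
  }
]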

Changing WLM settings

The WLM settings can be modified away from the defaults provided by Datacoral. The Datacoral defaults should provide adequate out-of-the-box performance without any need for user intervention, but the configuration can be changed if need be; such modifications are not recommended in normal cases. The Redshift WLM has two fundamental modes, automatic and manual. The automatic mode provides some tuning functionality, like setting priority levels for different queues, but Redshift tries to automate the processing characteristics of workloads as much as possible. The manual mode provides rich functionality for controlling workloads. See the AWS documentation for WLM.
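For orientation, the two modes take different shapes in the wlm_json_configuration parameter. The snippets below are minimal sketches of that general shape, not recommended settings; consult the AWS documentation for the exact schema.

Automatic mode (Redshift manages concurrency and memory per queue):

[
  {
    "auto_wlm": true
  }
]

Manual mode (concurrency and memory are set explicitly per queue):

[
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 100
  }
]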

Generally, fine-tuning workload settings requires a good understanding of Redshift performance characteristics and the factors that affect them, as well as an understanding of the specifics of the workloads to be tuned. As such, manual efforts can be nontrivial and time consuming – insofar as the Datacoral default behavior or the Redshift automatic mode provides adequate performance, a manual tuning effort might not be cost effective.

Another factor to take into account is that there are two types of changes to the WLM configuration, dynamic and static. Static changes, which include switching between the automatic and manual WLM modes, require a cluster reboot to take effect. The AWS documentation lists which changes are dynamic and which are static.

A cluster reboot can be done through the Redshift console, CLI, or API. It may also occur as a result of Redshift scheduled maintenance during the maintenance window. However, any cluster reboot should preferably be done when other activities are quiesced – a reboot in the middle of a Datacoral load could cause the load to fail and require a retry.

General guidelines for tuning WLM settings

There is no simple formula for WLM settings given that each environment may have its own trade-offs and priorities. At the highest level, the most crucial WLM tuning is defining queues and setting their priorities and degrees of concurrency. Queues are used to group tasks of similar nature or importance. For example, it could make sense to have one queue for the tasks associated with ETL loads and another queue for ad-hoc queries. In some cases it might make sense to distinguish short-running tasks from long-running batch jobs and to put them in different queues. Certain tasks may belong in special high-priority queues. The options depend on whether WLM is used in the automatic or manual mode.
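As a sketch of that kind of separation in manual mode – the group names, concurrency values, and memory percentages below are assumptions for illustration, not recommendations – an ETL queue, a short-running ad-hoc queue with a time limit, and a default queue might look roughly like this:

[
  {
    "query_group": ["etl"],
    "query_concurrency": 3,
    "memory_percent_to_use": 40
  },
  {
    "user_group": ["analysts"],
    "query_concurrency": 8,
    "memory_percent_to_use": 40,
    "max_execution_time": 300000
  },
  {
    "query_concurrency": 3,
    "memory_percent_to_use": 20
  }
]

Here max_execution_time (in milliseconds) caps ad-hoc queries at five minutes, and the total concurrency across the queues stays well below the 50-slot ceiling discussed below.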

Setting priorities is typically a policy decision. For instance, if there is one queue for ETL jobs and one for ad-hoc queries, the relative priority may depend on SLAs. If an ETL load has an SLA to finish within a tight load window, it could make sense to give its queue a higher priority.
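In automatic mode, that policy would be expressed through queue priorities rather than concurrency and memory numbers. A hedged sketch, with the user group name assumed purely for illustration:

[
  {
    "auto_wlm": true,
    "user_group": ["etl_users"],
    "priority": "high"
  },
  {
    "auto_wlm": true,
    "priority": "normal"
  }
]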

The degree of concurrency is set to Auto in the automatic WLM mode. In manual mode it defaults to 5 per queue, with an upper limit of 50 per queue and 50 in total across all user-defined queues, excluding the superuser queue. As a rule of thumb, it is probably safer to limit the total degree of concurrency on the system in order to minimize the risk of performance degradation due to memory pressure, which could lead to excessive spilling to disk.

Given the richness of the Redshift WLM functionality, users who want to configure their own custom settings will likely need to study the AWS documentation very carefully for details about the available options.

Current default settings for Datacoral-generated Redshift clusters

Datacoral can generate a new Redshift cluster or use an existing one. When Datacoral generates the cluster, it uses a set of default values for the WLM settings. These defaults may be subject to change in response to technological changes, such as improvements in the Redshift automatic algorithms. It is recommended that the Datacoral defaults be used unless there is good evidence that they are suboptimal for the workload. The current defaults use manual-mode WLM with three query groups.

The first group is used for queries executed for Datacoral loads and other admin tasks such as table rotation and view management.

query_group: datacoral_admin
query_concurrency: 3
memory_percent_to_use: 10

The second one is used for queries against materialized views.

query_group: datacoral_queryexecutor
query_concurrency: 3
memory_percent_to_use: 10

The third one is the default queue, which would typically be used for COPY statements.

query_concurrency: 3
memory_percent_to_use: 10

In addition, Datacoral creates user groups for user cluster activities, such as analytical queries, that are not generated by Datacoral. Each non-Datacoral Redshift user is assigned a user_group with the name of the user. The memory_percent_to_use is divided equally among these user groups after deducting the 30 percent allocated to the Datacoral queues. The query_concurrency for these users is set to 5.

An example WLM config can be seen below (here, there are two user groups called datacoral and customer):

[
  {
    "query_group": [
      "datacoral_admin"
    ],
    "query_concurrency": 3,
    "memory_percent_to_use": 10
  },
  {
    "query_group": [
      "datacoral_queryexecutor"
    ],
    "query_concurrency": 3,
    "memory_percent_to_use": 10
  },
  {
    "query_concurrency": 3,
    "memory_percent_to_use": 10
  },
  {
    "user_group": [
      "datacoral"
    ],
    "query_concurrency": 5,
    "memory_percent_to_use": 35
  },
  {
    "user_group": [
      "customer"
    ],
    "query_concurrency": 5,
    "memory_percent_to_use": 35
  }
]