Fixing Your Poor Performance of Redshift with Vacuum

As also you know about Redshift is too much fast when you have to read or select some data, but of course you have to do something for like theoretical white paper performance. Even if you’ve carefully planned out your schema, sortkeys, distkeys and compression encodings, your Redshift queries may still be awfully slow if you have long-running vacuums taking place in the background.

The number one enemy for query performance is the vacuum—it can slow down your ETL jobs and analytical queries by as much as 80%. It is an I/O intensive process that sorts the table, reclaims unused disk space, and impacts all other I/O bound processes (such as queries against large tables). This guide can help you cut down the time it takes to vacuum your cluster (these steps lowered our vacuum time from 10–30 hours to less than 1 hour).

This guide assumes you’ve chosen sortkeys and distkeys for your table, and are vacuuming regularly. If you are not doing these things, use this guide and this guide to get them set up (the flow charts are quite helpful).

 Contents

  • 0- What is the Vacuum?
  • 1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)
  • 2 – Use Compression Encodings (for Large Tables)
  • 3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)
  • 4 – Call ANALYZE After Vacuuming
  • 5 – VACUUM to 99% on Large Tables
  • 6 – Keep Your Tables Skinny

The vacuum is a process that carries out one or both of the following two steps: sorting tables and reclaiming unused disk blocks. Lets talk about sorting,

VACUUM SORT ONLY;

The first time you insert data into the table, it will land sorted according to its sortkey (if one exists), and this data will make up the “sorted” section of the table. Note the unsorted percentage on the newly populated table below.

COPY INTO my_table FROM s3://my-bucket/csv;
SELECT "table", unsorted FROM svv_table_info;
  table   | unsorted
----------+------------
 my_table |     0

Subsequent inserts are appended to a completely different section on disk called the “unsorted” section of the table. Calling VACUUM SORT ONLY initiates two processes,

  1. a sorting of the unsorted section,
  2. a merging of the sorted and unsorted sections;

both of these steps can be costly, but there are simple ways to cut down that cost, which we’ll discuss below.

Now onto deleting,

VACUUM DELETE ONLY;

If you called DELETE on any rows from your table since the last vacuum, they were merely marked for deletion. A vacuum operation is necessary to actually reclaim that disk space.

These two steps, sorting tables and reclaiming disk space, can be run together efficiently,

VACUUM FULL;

This command simply runs both a sort only and a delete only operation, but there are advantages to doing them concurrently. If you have deleted andinserted new data, always do a “full” vacuum. It will be faster than a manual vacuum sort only followed by a manual vacuum delete only.

Now that we have described the vacuum, lets talk about how to make it faster. I’ll describe each tip, then describe why it matters.

1 -Insert Data in Sortkey Order (for Tables that are Updated Regularly)

If you have a monotonically increasing sortkey like date, timestamp or auto-incrementing id, make that the first column of your (compound) sortkey. This will cause your inserts to conform to your sortkey configuration, and drastically reduce the merging Redshift needs to do when the vacuum is invoked. If you do one thing in this guide, do this.

Why?

If the unsorted section fully belongs at the end of the sorted sectionalready (say, because time is an arrow, and you’re sorting by timestamp), then the merge step is over almost immediately.

Meanwhile, if you have two sorted sections, and you wish to merge them, but the sort order is interleaved between the two tables (say, because you’re sorting by customer), you will likely have to rewrite the entire table. This will cost you dearly!

Furthermore, if in general if you do queries like,

SELECT 
   AGGREGATE(column)
FROM my_table
   WHERE date = '2018–01–01'
   AND action = 'message_clicked'
   AND customer = 'taco-town';

then a compound key, sorted by date first, will be both performant in terms of query speed and in terms of vacuum time. You may also consider sorting by customer or action, but these must be subsequent keys in the sortkey, not the first. It will be difficult to optimize your sortkey selection for every query pattern your cluster might see, but you can target and optimize the most likely patterns. Furthermore, by avoiding long vacuums, you are in effect improving query performance.

2- Use Compression Encodings (for Large Tables)

Compression encodings will give you 2–4x compression on disk. Almost always use Zstandard encoding. But you may use the following command to get compression encoding recommendations on a column-by-column basis,

ANALYZE COMPRESSION my_table;

This command will lock the table for the duration of the analysis, so often you need to take a small copy of your table and run the analysis on it separately.

CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (
    -- Generate a pseudo-random filter of ~100,000 rows.
    -- This works for a table with ~10e9 rows.
    SELECT * FROM my_table
    WHERE ABS(STRTOL(LEFT(MD5('seed' || id), 15), 16)) < POW(2, 59)
);
-- Recreate my_table with these recommendations.
ANALYZE COMPRESSION my_table_tmp;

Alternatively, you may apply compression encoding recommendations automatically during a COPY (but only on the first insert to an empty table).

COPY INTO my_table FROM s3://bucket COMPUPDATE ON;

If your tables are small enough to fit into memory without compression, then do not bother encoding them. If your tables are very small, and very low read latency is a requirement, get them out of Redshift altogether.

Why?

The smaller your data, the more data you can fit into memory, the faster your queries will be. So compression helps in both keeping disk space down and reducing the I/O cost of querying against tables that are much larger than memory.

For small tables, the calculus changes. We generally accept a small decompression cost over an I/O cost, but when there is no I/O cost because the table is small, then the decompression cost makes up a significant portion of the total query cost and is no longer worth it. Cutting down on disk space usage frees up the overhead to do deep copies if necessary (see point 3).

3- Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)

If for some reason your table ends up at more than 20% unsorted, you may be better off copying it than vacuuming it. Bear in mind that Redshift will require 2–3x the table size in free disk space to complete the copy.

Why?

On the first insert to an empty table, Redshift will sort the data according to the sortkey, on subsequent inserts it will not. So a deep copy is identical to a vacuum in this way (as long as the copy takes place in one step). It will likely complete much faster as well (and tie up less resources), but you may not have the 2–3x disk space overhead to complete the copy operation. That is why you should be using appropriate compression encodings (see point 2).

Your deep copy code:

BEGIN;
CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (SELECT * FROM my_table);
DROP TABLE my_table;
ALTER TABLE my_table_tmp RENAME TO my_table;
COMMIT;

4-Call ANALYZE After Vacuuming

This is basic, but it gets left out. Call ANALYZE to update the query planner after you vacuum. The vacuum may have significantly reorganized the table, and you should update the planner stats. This can create a performance increase for reads, and the analyze process itself is typically quite fast.

5- VACUUM to 99% on Large Tables

Push the vacuum to 99% if you have daily insert volume less than 5% of the existing table. The syntax for doing so is,

VACUUM FULL my_table TO 99;

You must specify a table in order to use the TO clause. Therefore, you probably have to write code like this:

for table in tables:
    cursor.execute('VACUUM FULL {} TO 99;'.format(table))

Why?

This one may seem counterintuitive. Many teams might clean up their redshift cluster by calling VACUUM FULL. This conveniently vacuums every table in the cluster. But, if a table’s unsorted percentage is less than 5%, Redshift skips the vacuum on that table. This process continues for every vacuum call until the table finally tops 5% unsorted, at which point the sorting will take place.

This is fine if the table is small, and resorting 5% of the table is a modest job. But if the table is very large, resorting and merging 5% of the table may be a significant time cost (it was for us).

Vacuuming more thoroughly on each call spreads the vacuum cost evenly across the events, instead of saving up unsorted rows, then running long vacuums to catch up.

You may wonder if this causes more total vacuum time. The answer is no, if you are following step 1, and inserting in sortkey order. The vacuum call amounts to a sorting of the unsorted section and a quick merge step. Sorting 5% of the table will take 5x the time that sorting 1% of the table does, and the merge step will always be fast if you are inserting new data in sortkey order.

Note: the Most important thing is when etl works kind of insert delete etch, Redshift stored data if you didn’t vacuum etc. So on performance and prices is going up. Keep clean your tables after every execution or with daily crons.

How to use dbt in python environment

Dbt is usefull library for dwh to create a datamart or datamarts. You can find all details in dbt official pages.

I used a few times, so i can clarify for you how you can create a dbt models and dbt configs in your own project, you can do that like below step by steps;

1 – Create a profiles.yml file for DBT Profile. Specify your db connection information etc.
2 – Create a data_model folder like project_dir
3 – Create a .yml file for main project .yml file and you will call it like project_file
4 – Create your own dbt_runner file like dbt_runner.py and set it your execution configs
5 – Create a model folder, you will put your models in that folder
6 – Create a schema or model for yourself and put into that folder a xxxx.schema.yml file
6.1 – Put some table value like below;
bietl_patch:
constraints:
unique:
– somthng_id
not_null:
– somthng_id
– xxx_id

In the end, you will have like below folder and schema;
# DBT Profile. Specify your DB connection information etc.
profiles.yml on the root directory
bietl_data_model folder
bietl_datamarts.yml file
dbt_runner.py python file

bietl_data_model
> models
> specification of your models bietl
> bietl_datamarts.schema.yml
> sql files for using.sql

I’m executing that dbt in airflow das but I didn’t mention it, maybe in next post.