Fixing Poor Redshift Performance with VACUUM

As you know, Redshift is extremely fast when you need to read or select data, but getting the performance promised in the white papers takes some work. Even if you’ve carefully planned out your schema, sortkeys, distkeys and compression encodings, your Redshift queries may still be awfully slow if you have long-running vacuums taking place in the background.

The number one enemy for query performance is the vacuum—it can slow down your ETL jobs and analytical queries by as much as 80%. It is an I/O intensive process that sorts the table, reclaims unused disk space, and impacts all other I/O bound processes (such as queries against large tables). This guide can help you cut down the time it takes to vacuum your cluster (these steps lowered our vacuum time from 10–30 hours to less than 1 hour).

This guide assumes you’ve chosen sortkeys and distkeys for your table, and are vacuuming regularly. If you are not doing these things, use this guide and this guide to get them set up (the flow charts are quite helpful).

 Contents

  • 0 – What is the Vacuum?
  • 1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)
  • 2 – Use Compression Encodings (for Large Tables)
  • 3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)
  • 4 – Call ANALYZE After Vacuuming
  • 5 – VACUUM to 99% on Large Tables
  • 6 – Keep Your Tables Skinny

0 – What is the Vacuum?

The vacuum is a process that carries out one or both of the following two steps: sorting tables and reclaiming unused disk blocks. Let’s talk about sorting first,

VACUUM SORT ONLY;

The first time you insert data into the table, it will land sorted according to its sortkey (if one exists), and this data will make up the “sorted” section of the table. Note the unsorted percentage on the newly populated table below.

COPY my_table FROM 's3://my-bucket/csv';
SELECT "table", unsorted FROM svv_table_info;
  table   | unsorted
----------+------------
 my_table |     0

Subsequent inserts are appended to a completely different section on disk called the “unsorted” section of the table. Calling VACUUM SORT ONLY initiates two processes,

  1. a sorting of the unsorted section,
  2. a merging of the sorted and unsorted sections;

both of these steps can be costly, but there are simple ways to cut down that cost, which we’ll discuss below.

Now onto deleting,

VACUUM DELETE ONLY;

If you called DELETE on any rows from your table since the last vacuum, they were merely marked for deletion. A vacuum operation is necessary to actually reclaim that disk space.

These two steps, sorting tables and reclaiming disk space, can be run together efficiently,

VACUUM FULL;

This command simply runs both a sort only and a delete only operation, but there are advantages to doing them concurrently. If you have deleted and inserted new data, always do a “full” vacuum. It will be faster than a manual vacuum sort only followed by a manual vacuum delete only.

Now that we have described the vacuum, let’s talk about how to make it faster. I’ll describe each tip, then describe why it matters.

1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)

If you have a monotonically increasing sortkey like date, timestamp or auto-incrementing id, make that the first column of your (compound) sortkey. This will cause your inserts to conform to your sortkey configuration, and drastically reduce the merging Redshift needs to do when the vacuum is invoked. If you do one thing in this guide, do this.
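
As a concrete sketch, a table like the one queried later in this section might be declared with date leading the compound sortkey. The column types and the distkey choice here are just assumptions for illustration:

CREATE TABLE my_table (
    date     DATE,
    action   VARCHAR(64),
    customer VARCHAR(64),
    value    DOUBLE PRECISION
)
DISTKEY (customer)
COMPOUND SORTKEY (date, customer, action);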

Why?

If the unsorted section fully belongs at the end of the sorted section already (say, because time is an arrow, and you’re sorting by timestamp), then the merge step is over almost immediately.

Meanwhile, if you have two sorted sections, and you wish to merge them, but the sort order is interleaved between the two tables (say, because you’re sorting by customer), you will likely have to rewrite the entire table. This will cost you dearly!

Furthermore, if in general you do queries like,

SELECT 
   AGGREGATE(column)
FROM my_table
   WHERE date = '2018-01-01'
   AND action = 'message_clicked'
   AND customer = 'taco-town';

then a compound key, sorted by date first, will be both performant in terms of query speed and in terms of vacuum time. You may also consider sorting by customer or action, but these must be subsequent keys in the sortkey, not the first. It will be difficult to optimize your sortkey selection for every query pattern your cluster might see, but you can target and optimize the most likely patterns. Furthermore, by avoiding long vacuums, you are in effect improving query performance.

2 – Use Compression Encodings (for Large Tables)

Compression encodings will give you 2–4x compression on disk. Almost always use Zstandard encoding. But you may use the following command to get compression encoding recommendations on a column-by-column basis,

ANALYZE COMPRESSION my_table;

This command will lock the table for the duration of the analysis, so often you need to take a small copy of your table and run the analysis on it separately.

CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (
    -- Keep a pseudo-random sample: the first 15 hex digits of the MD5 hash
    -- form a 60-bit number, so a threshold of POW(2, 59) keeps roughly half
    -- the rows. Lower the exponent to shrink the sample further.
    SELECT * FROM my_table
    WHERE ABS(STRTOL(LEFT(MD5('seed' || id), 15), 16)) < POW(2, 59)
);
-- Use these recommendations when you recreate my_table.
ANALYZE COMPRESSION my_table_tmp;
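
Once you have the recommendations, you apply them by recreating the table with explicit encodings. A minimal sketch, reusing the hypothetical table from tip 1 and leaving the leading sortkey column unencoded (a common recommendation so that range-restricted scans stay cheap):

CREATE TABLE my_table_encoded (
    date     DATE             ENCODE RAW,
    action   VARCHAR(64)      ENCODE ZSTD,
    customer VARCHAR(64)      ENCODE ZSTD,
    value    DOUBLE PRECISION ENCODE ZSTD
)
DISTKEY (customer)
COMPOUND SORTKEY (date, customer, action);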

Alternatively, you may apply compression encoding recommendations automatically during a COPY (but only on the first insert to an empty table).

COPY my_table FROM 's3://bucket' COMPUPDATE ON;

If your tables are small enough to fit into memory without compression, then do not bother encoding them. If your tables are very small, and very low read latency is a requirement, get them out of Redshift altogether.

Why?

The smaller your data, the more data you can fit into memory, the faster your queries will be. So compression helps in both keeping disk space down and reducing the I/O cost of querying against tables that are much larger than memory.

For small tables, the calculus changes. We generally accept a small decompression cost over an I/O cost, but when there is no I/O cost because the table is small, then the decompression cost makes up a significant portion of the total query cost and is no longer worth it. Cutting down on disk space usage frees up the overhead to do deep copies if necessary (see point 3).

3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)

If for some reason your table ends up at more than 20% unsorted, you may be better off copying it than vacuuming it. Bear in mind that Redshift will require 2–3x the table size in free disk space to complete the copy.
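
You can spot tables that have crossed that threshold with the same system view used earlier:

SELECT "table", unsorted, size
FROM svv_table_info
WHERE unsorted > 20
ORDER BY size DESC;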

Why?

On the first insert to an empty table, Redshift will sort the data according to the sortkey; on subsequent inserts it will not. So a deep copy is identical to a vacuum in this way (as long as the copy takes place in one step). It will likely complete much faster as well (and tie up fewer resources), but you may not have the 2–3x disk space overhead to complete the copy operation. That is why you should be using appropriate compression encodings (see point 2).

Your deep copy code:

BEGIN;
CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (SELECT * FROM my_table);
DROP TABLE my_table;
ALTER TABLE my_table_tmp RENAME TO my_table;
COMMIT;

4 – Call ANALYZE After Vacuuming

This is basic, but it gets left out. Call ANALYZE to update the query planner after you vacuum. The vacuum may have significantly reorganized the table, and you should update the planner stats. This can create a performance increase for reads, and the analyze process itself is typically quite fast.
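
For example, using the hypothetical table from earlier:

VACUUM FULL my_table;
ANALYZE my_table;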

5 – VACUUM to 99% on Large Tables

Push the vacuum to 99% if you have daily insert volume less than 5% of the existing table. The syntax for doing so is,

VACUUM FULL my_table TO 99 PERCENT;

You must specify a table in order to use the TO clause. Therefore, you probably have to write code like this:

# VACUUM cannot run inside a transaction block, so the connection
# (e.g. a psycopg2 connection) needs autocommit enabled.
for table in tables:
    cursor.execute('VACUUM FULL {} TO 99 PERCENT;'.format(table))

Why?

This one may seem counterintuitive. Many teams clean up their Redshift cluster by calling a bare VACUUM FULL, which conveniently vacuums every table in the current database. But if a table’s unsorted percentage is less than 5%, Redshift skips the vacuum on that table. This repeats on every vacuum call until the table finally tops 5% unsorted, at which point the sorting takes place.

This is fine if the table is small, and resorting 5% of the table is a modest job. But if the table is very large, resorting and merging 5% of the table may be a significant time cost (it was for us).

Vacuuming more thoroughly on each call spreads the vacuum cost evenly across the events, instead of saving up unsorted rows, then running long vacuums to catch up.

You may wonder if this causes more total vacuum time. The answer is no, if you are following step 1, and inserting in sortkey order. The vacuum call amounts to a sorting of the unsorted section and a quick merge step. Sorting 5% of the table will take 5x the time that sorting 1% of the table does, and the merge step will always be fast if you are inserting new data in sortkey order.

Note: the most important thing to remember is that ETL work built around inserts and deletes leaves deleted rows and unsorted data lying around if you don’t vacuum, so performance degrades and storage costs go up. Keep your tables clean after every ETL run, or with a daily cron.

How to schedule a BigQuery ETL job with Dataprep

As you know, the BigQuery user interface lets you do all kinds of things: run an interactive or batch query, save results as a table, export to a table, and so on. But there is no scheduler yet to run a query at a specific time or on a recurring schedule.

To be clear: once BigQuery has scheduled queries, you will want to use that feature, so that you can keep your data in BigQuery and take advantage of its power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View (see the SQL sketch after this list).
  2. In Cloud Dataprep, write a new recipe with a BigQuery source. Optionally, add some transforms to your recipe: formulas, de-duplication, other transformations, etc.
  3. Export the result of the transformation to a BigQuery table or to a CSV file on Cloud Storage.
  4. Schedule the Dataprep flow to run periodically.
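
If you prefer to create the view in step 1 with SQL rather than saving it through the UI, a minimal sketch in BigQuery standard SQL might look like this (the project, dataset, view and column names here are hypothetical):

CREATE VIEW `my-project.my_dataset.my_saved_view` AS
SELECT
  customer,
  COUNT(*) AS events
FROM `my-project.my_dataset.raw_events`
GROUP BY customer;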

Go to the “Flows” section of the Dataprep UI and click the three-dot menu next to your new flow. You’ll see an option to add a schedule.

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Options include daily, weekly, etc., but there is also a crontab format for further flexibility. That’s it.

Happy querying.

 

How to append your data from Cloud Storage to BigQuery with Python (ETL)

Hello Everyone,

BigQuery is a fully-managed enterprise data warehouse for analytics. It is cheap and highly scalable. In this article, I would like to share a basic tutorial on using Google Cloud Storage and BigQuery with Python.

Installation
pip install google-cloud-bigquery

Create credentials

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Make sure this path is exported as an environment variable in the shell where your script runs; alternatively, you can load the key file explicitly in code, as shown below.

Read from Cloud Storage and Append to BigQuery

# Import libraries
from google.cloud import bigquery
from google.oauth2 import service_account

# Set credentials: create your own service-account key file in the
# Google Cloud console and point at it here.
credentials = service_account.Credentials.from_service_account_file(
    'C:\\Users\\talih\\Desktop\\BigQuerytoTableau-6d00b31bb9ab.json')
project_id = 'bigquery-to-tableau'

# Build the client and a reference to the destination table.
client = bigquery.Client(credentials=credentials, project=project_id)
table_ref = client.dataset('BigTableau').table('dataflowbasics_schemas')

# Configure the load job.
job_config = bigquery.LoadJobConfig()
# WRITE_APPEND adds the new rows to the existing table; the original
# WRITE_EMPTY setting would fail once the table already holds data.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1          # skip the CSV header row
job_config.autodetect = True              # not strictly needed: an explicit schema is set below
job_config.allow_jagged_rows = True       # tolerate missing trailing columns
job_config.ignore_unknown_values = True   # tolerate extra, unknown columns
job_config.max_bad_records = 1000         # tolerate up to 1000 bad rows
schema = [
    bigquery.SchemaField('bietl_id', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('bietltools_name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('bietl_usage', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('bietl_salary', 'INTEGER', mode='NULLABLE'),
]
job_config.schema = schema

# The CSV file sitting in your Cloud Storage bucket.
uri = 'gs://desctinations3tostorage/bietl20180829.csv'

# Start the load job (this is the API request).
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)

assert load_job.job_type == 'load'

load_job.result()  # Waits for the table load to complete.

assert load_job.state == 'DONE'

Now you can access your data through the BigQuery interfaces.
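
As a quick sanity check, you can count the loaded rows from the BigQuery console (the table name matches the one used in the script above):

SELECT COUNT(*) AS loaded_rows
FROM `bigquery-to-tableau.BigTableau.dataflowbasics_schemas`;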

ETL Tools – General Information

ETL tools are designed to save time and money by eliminating the need for hand-coding when a new data warehouse is developed. They are also used to facilitate the work of the database administrators who connect different branches of databases as well as integrate or change the existing databases.

    The main purposes of an ETL tool are:

  • extraction of the data from legacy sources (usually heterogeneous)
  • data transformation (data optimized for transactions –> data optimized for analysis)
  • synchronization and cleansing of the data
  • loading the data into the data warehouse.

There are several requirements that ETL tools must meet in order to deliver optimal value to users, supporting a full range of possible scenarios.

Those are:
– data delivery and transformation capabilities
– data and metadata modelling capabilities
– data source and target support
– data governance capability
– runtime platform capabilities
– operations and administration capabilities
– service-enablement capability.

ETL TOOLS COMPARISON CRITERIA

The bietltools.com portal is not affiliated with any of the companies listed below in the comparison.

The research inclusion and exclusion criteria are as follows:
– range and mode of connectivity/adapter support
– data transformation and delivery modes support
– metadata and data modelling support
– design, development and data governance support
– runtime platform support
– enablement of services; plus three additional requirements for vendors:
– $20 million or more of software revenue from data integration tools every year, or not less than 300 production customers
– support of customers in not less than two major geographic regions
– customer implementations at the cross-departmental and multi-project level.

A few ETL tools are missing from this comparison. There are, of course, many more ETL tools in the world, and we have not yet been able to investigate every one we may have missed.

ETL TOOLS COMPARISON

The information provided below lists major strengths and weaknesses of the most popular ETL vendors.

IBM (Information Server Infosphere platform)

    Advantages:

  • strongest vision on the market, flexibility
  • progress towards common metadata platform
  • high level of satisfaction from clients and a variety of initiatives
    Disadvantages:

  • difficult learning curve
  • long implementation cycles
  • became very heavy (lots of GBs) with version 8.x and requires a lot of processing power

Informatica PowerCenter

    Advantages:

  • most substantial size and resources on the market of data integration tools vendors
  • consistent track record, solid technology, straightforward learning curve, ability to address real-time data integration schemes
  • Informatica is highly specialized in ETL and Data Integration and focuses on those topics, not on BI as a whole
  • focus on B2B data exchange
    Disadvantages:

  • several partnerships diminishing the value of technologies
  • limited experience in the field.

Microsoft (SQL Server Integration Services)

    Advantages:

  • broad documentation and support, best practices to data warehouses
  • ease and speed of implementation
  • standardized data integration
  • real-time, message-based capabilities
  • relatively low cost – excellent support and distribution model
    Disadvantages:

  • problems in non-Windows environments; inherits all Microsoft Windows limitations
  • unclear vision and strategy

Oracle (OWB and ODI)

    Advantages:

  • based on Oracle Warehouse Builder and Oracle Data Integrator – two very powerful tools;
  • tight connection to all Oracle datawarehousing applications;
  • tendency to integrate all tools into one application and one environment.
    Disadvantages:

  • focus on ETL solutions, rather than in an open context of data management;
  • tools are used mostly for batch-oriented work, transformation rather than real-time processes or federation data delivery;
  • long-awaited bond between OWB and ODI brought only promises – customers confused in the functionality area and the future is uncertain

SAP BusinessObjects (Data Integrator / Data Services)

    Advantages:

  • integration with SAP
  • SAP Business Objects created a firm company determined to stir the market;
  • Good data modeling and data-management support;
  • SAP Business Objects provides tools for data mining, data quality and profiling, thanks to many acquisitions of other companies.
  • Quick learning curve and ease of use
    Disadvantages:

  • SAP Business Objects is seen as two different companies
  • Uncertain future. Controversy over deciding which method of delivering data integration to use (SAP BW or BODI).
  • BusinessObjects Data Integrator (Data Services) may not be seen as a stand-alone capable application to some organizations.

SAS

    Advantages:

  • experienced company, great support and most of all very powerful data integration tool with lots of multi-management features
  • can work on many operating systems and gather data through number of sources – very flexible
  • great support for the business-class companies as well for those medium and minor ones
    Disadvantages:

  • misplaced sales force, company is not well recognized
  • SAS has to extend influences to reach non-BI community
  • Costly

Sun Microsystems

    Advantages:

  • Data integration tools are a part of huge Java Composite Application Platform Suite – very flexible with ongoing development of the products
  • ‘Single-view’ services draw together data from variety of sources; small set of vendors with a strong vision
    Disadvantages:

  • relative weakness in bulk data movement
  • limited mindshare in the market
  • support and services rated below adequate

Sybase

    Advantages:

  • assembled a range of capabilities to be able to address a multitude of data delivery styles
  • size and global presence of Sybase create opportunities in the market
  • pragmatic near-term strategy that is a good fit for current market demand
  • broad partnerships with other data quality and data integration tools vendors
    Disadvantages:

  • falls behind market leaders and large vendors
  • gaps in many aspects of data management

Syncsort

    Advantages:

  • functionality; well-known brand on the market (40 years’ experience); loyal customer and experience base;
  • easy implementation, strong performance, targeted functionality and lower costs
    Disadvantages:

  • struggle with gaining mind share in the market
  • lack of support for other than ETL delivery styles
  • unsatisfactory with lack of capability of professional services

Tibco Software

    Advantages:

  • message-oriented application integration; capabilities based on common SOA structures;
  • support for federated views; easy implementation, support and performance
    Disadvantages:

  • scarce references from customers; not widely enough recognised for data integration competencies
  • lacking in data quality capabilities.

ETI

    Advantages:

  • proven and mature code-generating architecture
  • one of the earliest vendors on the data integration market; support for SOA service-oriented deployments;
  • successfully deals with large data volumes and a high degree of complexity, extension of the range of data platforms and data sources;
  • customers’ positive responses to ETI technology
    Disadvantages:

  • relatively slow growth of customer base
  • technology seen as neither especially attractive nor inventive.

iWay Software

    Advantages:

  • offers physical data movement and delivery; support of wide range of adapters and access to numerous sources;
  • well integrated, standard tools;
  • reasonable ease of implementation effort
    Disadvantages:

  • gaps in specific capabilities
  • relatively costly – not competitive versus market leaders

Pervasive Software

    Advantages:

  • many customers, years of experience, solid applications and support;
  • good use of metadata
  • upgrade from older versions into newer is straightforward.
    Disadvantages:

  • inconsistency in defining the target for their applications;
  • no federation capability;
  • limited presence due to poor marketing.

Open Text

    Advantages

  • Simplicity of use in less-structured sources
  • Easy licensing for business solutions
  • cooperates with a wide range of sources and targets
  • increasingly high functionality
    Disadvantages:

  • limited federation, replication and data quality support; rare upgrades due to its simplicity;
  • weak real-time support due to the use of third-party solutions and other database utilities.

Pitney Bowes Software

    Advantages:

  • Data Flow concentrates on data integrity and quality;
  • supports mainly ETL patterns; can be used for other purposes too;
  • ease of use, fast implementation, specific ETL functionality.
    Disadvantages:

  • rarely competes directly with the other major companies; repeated rebranding triggers suspicion among customers.
  • narrow vision of possibilities even though Data Flow comes with variety of applications.
  • weak support, inexperienced service.

Reference: the ETL tools comparison above is based on the original article at etltools.net. Special thanks to its authors; you can read the original post at the link below:

http://www.etltools.net/etl-tools-comparison.html