How to reduce disk usage in Redshift?

Hello Everyone,

As you know, Redshift is used as a DWH, and for one reason or another the disk can fill up 🙂

Create a superuser for these kinds of queries, or log in with the Redshift superuser.

Let's start;
Execute the query below to list tables sorted by size, together with their schema, in Redshift.

SELECT TRIM(pgn.nspname) AS SCHEMA,
       TRIM(a.name) AS TABLE,
       id AS TableId,
       decode(pgc.reldiststyle,
             0, 'EVEN',
             1,det.distkey ,
             8,'ALL'
       ) AS DistKey,
       decode(pgc.reldiststyle,
             8,NULL,
             dist_ratio.ratio::DECIMAL(20,4)
       ) AS Skew,
       det.head_sort AS "SortKey",
       det.n_sortkeys AS "#SKs",
       CASE WHEN pgc.reldiststyle = 8 THEN a.rows_all_dist ELSE a.rows END AS rows,
       b.mbytes,
       decode(det.max_enc,
             0,'N',
             'Y'
       ) AS Enc,
       det.pct_enc,
       decode(b.mbytes,
             0,0,
             ((b.mbytes/part.total::DECIMAL)*100)::DECIMAL(20,2)
       ) AS pct_of_total,
       (CASE WHEN a.rows = 0 THEN NULL ELSE
          CASE WHEN pgc.reldiststyle = 8 THEN ((a.rows_all_dist - pgc.reltuples)::DECIMAL(20,3) / a.rows_all_dist::DECIMAL(20,3)*100)::DECIMAL(20,2)
                ELSE ((a.rows - pgc.reltuples)::DECIMAL(20,3) / a.rows::DECIMAL(20,3)*100)::DECIMAL(20,2) END END
       ) AS pct_stats_off,
       CASE WHEN pgc.reldiststyle = 8
          THEN decode( det.n_sortkeys,0, NULL,DECODE( a.rows_all_dist,0,0, (a.unsorted_rows_all_dist::DECIMAL(32)/a.rows_all_dist)*100))::DECIMAL(20,2)
          ELSE decode( det.n_sortkeys,0, NULL,DECODE( a.rows,0,0, (a.unsorted_rows::DECIMAL(32)/a.rows)*100))::DECIMAL(20,2) END
        AS pct_unsorted
FROM (SELECT db_id,
             id,
             name,
             SUM(ROWS) AS ROWS,
             MAX(ROWS) AS rows_all_dist,
             SUM(ROWS) - SUM(sorted_rows) AS unsorted_rows,
             MAX(ROWS) - MAX(sorted_rows) AS unsorted_rows_all_dist
      FROM stv_tbl_perm a
      GROUP BY db_id,
               id,
               name) AS a
  JOIN pg_class AS pgc ON pgc.oid = a.id
  JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
  LEFT OUTER JOIN (SELECT tbl, COUNT(*) AS mbytes FROM stv_blocklist GROUP BY tbl) b ON a.id = b.tbl
  INNER JOIN (SELECT attrelid,
                     MIN(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS "distkey",
                     MIN(CASE attsortkeyord WHEN 1 THEN attname ELSE NULL END) AS head_sort,
                     MAX(attsortkeyord) AS n_sortkeys,
                     MAX(attencodingtype) AS max_enc,
                     SUM(case when attencodingtype <> 0 then 1 else 0 end)::DECIMAL(20,3)/COUNT(attencodingtype)::DECIMAL(20,3)  *100.00 as pct_enc
              FROM pg_attribute
              GROUP BY 1) AS det ON det.attrelid = a.id
  INNER JOIN (SELECT tbl,
                     MAX(Mbytes)::DECIMAL(32) /MIN(Mbytes) AS ratio
              FROM (SELECT tbl,
                           TRIM(name) AS name,
                           slice,
                           COUNT(*) AS Mbytes
                    FROM svv_diskusage
                    GROUP BY tbl,
                             name,
                             slice)
              GROUP BY tbl,
                       name) AS dist_ratio ON a.id = dist_ratio.tbl
  JOIN (SELECT SUM(capacity) AS total
        FROM stv_partitions
        WHERE part_begin = 0) AS part ON 1 = 1
WHERE mbytes IS NOT NULL
AND   pgc.relowner > 1
-- and pgn.nspname = 'schema' -- schemaname
-- and a.name like 'table%' -- tablename
-- and det.max_enc = 0 -- non-compressed tables
ORDER BY mbytes DESC;

First, find the tables that are not being used and remove them.

drop table bietltools.orders; -- or truncate table bietltools.orders;

Second, vacuum the tables that need it; you can start from the largest ones.

vacuum full bietltools.customer to 100 percent;

Third, the difficult part: encoding.

ANALYZE COMPRESSION
bietltools.bietltools_reporting
COMPROWS 10000

COMPROWS is important, otherwise your disk usage will hit 100% 🙂

When you get the results, the steps are like below:

1 - alter table bietltools.bietltools_reporting add column new_tooladded_time timestamp ENCODE zstd;

2 - update bietltools.bietltools_reporting set new_tooladded_time = tooladded_time;

3 - alter table bietltools.bietltools_reporting drop column tooladded_time;

4 - alter table bietltools.bietltools_reporting rename column new_tooladded_time to tooladded_time;

I wrote zstd in bold; Redshift has encoding types for every data type, and you have to figure out and set the best encoding type for each column.
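If you have many columns to re-encode, you can script the four steps above. Below is a minimal sketch, not production code: it assumes psycopg2, placeholder connection details, and that the ANALYZE COMPRESSION report comes back as (table, column, encoding, est_reduction_pct) rows; the column-to-type mapping is something you still supply yourself.

# Minimal sketch: generate the add/copy/drop/rename statements for the
# columns you want to re-encode, based on ANALYZE COMPRESSION suggestions.
# Connection details are placeholders; verify the report columns on your cluster.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.example.eu-west-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="dwh", user="admin", password="***",
)
conn.autocommit = True  # ANALYZE COMPRESSION cannot run inside a transaction block

TABLE = "bietltools.bietltools_reporting"
COLUMN_TYPES = {"tooladded_time": "timestamp"}  # columns you want to re-encode, with their types

with conn.cursor() as cur:
    cur.execute(f"ANALYZE COMPRESSION {TABLE} COMPROWS 10000;")
    report = cur.fetchall()  # assumed: (table, column, suggested encoding, est. reduction %)

for _tbl, column, encoding, _pct in report:
    if column not in COLUMN_TYPES:
        continue
    col_type = COLUMN_TYPES[column]
    # the same four steps as above: add, copy, drop, rename
    print(f"alter table {TABLE} add column new_{column} {col_type} ENCODE {encoding};")
    print(f"update {TABLE} set new_{column} = {column};")
    print(f"alter table {TABLE} drop column {column};")
    print(f"alter table {TABLE} rename column new_{column} to {column};")

conn.close()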

One more useful query below, as a bonus:

Show database sizes

select sum(mbytes)/1024 as db_size_in_gb, database from (
select trim(pgdb.datname) as Database,
trim(a.name) as Table, b.mbytes
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
where a.slice=0
order by db_id, name)
group by database;

 

Welcome to the black box of AWS

How to append your data from Cloud Storage to BigQuery with Python (ETL)

Hello Everyone,

BigQuery is a fully managed enterprise data warehouse for analytics. It is cheap and highly scalable. In this article, I would like to share a basic tutorial for Google Cloud Storage and BigQuery with Python.

Installation
pip install google-cloud-bigquery

Create credentials

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Additionally, make sure the variable is set in the environment of whatever process runs your script (for example, add the export line to your shell profile).
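If you prefer to do this from Python instead of the shell (for example on Windows), a minimal sketch is below; it only sets the same environment variable before the client is created, and the key file path is the placeholder from above.

# Minimal sketch: point the Google client libraries at the key file from Python.
# [FILE_NAME].json is the placeholder key file from the export line above.
import os
from google.cloud import bigquery

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "/home/user/Downloads/[FILE_NAME].json"

# With the variable set, the client picks up the credentials automatically.
client = bigquery.Client(project="bigquery-to-tableau")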

Read from Cloud Storage and append to BigQuery

#Import libraries
from google.cloud import bigquery
from google.oauth2 import service_account

#Set credentials (create your own credential file on your Google Cloud account)
credentials = service_account.Credentials.from_service_account_file(
    'C:\\Users\\talih\\Desktop\\BigQuerytoTableau-6d00b31bb9ab.json')
project_id = 'bigquery-to-tableau'

#Set table_ref, project_id and credentials for the load request
client = bigquery.Client(credentials=credentials, project=project_id)
table_ref = client.dataset('BigTableau').table('dataflowbasics_schemas')

#Specify your load job with a few parameters
job_config = bigquery.LoadJobConfig()
#WRITE_APPEND appends to the existing table; WRITE_EMPTY would only load into an empty table
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1
job_config.autodetect = False  # an explicit schema is provided below
job_config.allow_jagged_rows = True
job_config.ignore_unknown_values = True
job_config.max_bad_records = 1000
schema = [
    bigquery.SchemaField('bietl_id', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('bietltools_name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('bietl_usage', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('bietl_salary', 'INTEGER', mode='NULLABLE'),
]
job_config.schema = schema

#Set your Cloud Storage source file
uri = 'gs://desctinations3tostorage/bietl20180829.csv'

#Post your request to the Google API
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for the table load to complete.

assert load_job.state == 'DONE'

Now you can access your data in the BigQuery interface.
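If you want to double-check the load from code rather than the UI, a short sketch is below; it reuses client, table_ref, and load_job from the script above.

# Minimal verification sketch, reusing `client`, `table_ref` and `load_job` from above.
destination = client.get_table(table_ref)  # fetch the table metadata
print("Rows written by this load job:", load_job.output_rows)
print("Total rows now in the table:", destination.num_rows)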

How to use dbt in a Python environment

dbt is a useful library for building data marts on top of your DWH. You can find all the details in the official dbt documentation.

I have used it a few times, so I can show you how to create dbt models and configs in your own project. You can do that step by step, like below:

1 – Create a profiles.yml file for the DBT profile. Specify your DB connection information, etc.
2 – Create a data_model folder as the project dir
3 – Create a .yml file as the main project .yml file; you will call it the project file
4 – Create your own dbt_runner file, e.g. dbt_runner.py, and set your execution configs in it (a minimal sketch of this runner follows the folder layout below)
5 – Create a models folder; you will put your models in that folder
6 – Create a schema or model for yourself and put an xxxx.schema.yml file into that folder
6.1 – Put some table values in it, like below:
bietl_patch:
  constraints:
    unique:
      - somthng_id
    not_null:
      - somthng_id
      - xxx_id

In the end, you will have a folder layout and schema like below:

# DBT profile. Specify your DB connection information etc.
profiles.yml on the root directory
bietl_data_model folder
bietl_datamarts.yml file
dbt_runner.py python file

bietl_data_model
> models
> specification of your bietl models
> bietl_datamarts.schema.yml
> the .sql files used by the models
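As mentioned in step 4, here is a minimal sketch of such a dbt_runner.py. It assumes dbt is installed and on the PATH and that the layout above is used; flag names can differ slightly between dbt versions.

# Minimal sketch of dbt_runner.py: it just shells out to the dbt CLI from
# inside the project folder described above.
import subprocess
import sys

PROJECT_DIR = "bietl_data_model"  # the project folder from the layout above


def run_dbt(models=None):
    # profiles.yml sits on the root directory, one level above the project folder
    cmd = ["dbt", "run", "--profiles-dir", ".."]
    if models:
        cmd += ["--models", models]  # optionally run a single model / datamart
    return subprocess.call(cmd, cwd=PROJECT_DIR)


if __name__ == "__main__":
    # optional first argument: the name of the model to run
    sys.exit(run_dbt(sys.argv[1] if len(sys.argv) > 1 else None))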

I'm executing dbt from Airflow DAGs, but I won't cover that here; maybe in the next post.

Automate update of data marts to BigQuery

Hey Everyone,

For various reasons I'm using AWS (Redshift and S3), so the implementation is a twice-daily batch ingestion of data from S3 to Cloud Storage.

1 – Create a DAG in Airflow

2 – Execute an UNLOAD query in Redshift to write the data to S3 with no header (see the DAG sketch after this list)

3 – Open the Google Cloud console and use the Cloud Storage Transfer Service

Cloud Storage Transfer Service supports one-time transfers or recurring transfers. It provides advanced filters based on file creation dates, filename filters, and the times of day you prefer to import data. It also supports the deletion of the source data after it’s been copied.

3.1 – Create a GCS bucket for storing data in Cloud Storage

3.2 – Select the source: for Amazon S3 you provide the S3 bucket name, an access key, and a secret key for a user authorized on that bucket to list, read, and (if you want source deletion) delete objects.

3.3 – Select the destination: provide the GCS bucket name and the synchronization options (overwrite, delete from the destination, delete from the source)

3.4 – Set the schedule time and a description for the task

4 – Now it's time to create a table for querying; open the Google BigQuery interface

4.1 – Create a dataset in the BigQuery interface

4.2 – Create a table in BigQuery with the steps below

4.2.1 – Create a table from Google Cloud Storage

4.2.2 – Select the file from the GCS bucket you created

4.2.3 – Select the file format

4.2.4 – Specify the destination table information; the table type must be external (so the data appended daily from S3 to GCS is picked up)

4.2.5 – Don't specify a schema; we will provide it later.
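As referenced in step 2, here is a minimal Airflow DAG sketch for the scheduled UNLOAD. It assumes Airflow 1.x with the PythonOperator, psycopg2, and placeholder connection details; the UNLOAD statement itself is the one shown in the export post further below.

# Minimal sketch: run the Redshift UNLOAD twice a day from Airflow.
# Connection details are placeholders; the UNLOAD text is from the export post below.
from datetime import datetime

import psycopg2
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

UNLOAD_SQL = """
    unload ('select * from test.bietltools where somefields is not null')
    to 's3://bietltools-external/2018-08-29/bietltools_reports_dm_dt.csv'
    iam_role 'arn:aws:iam::123456789101:role/RedshiftS3Access';
"""  # in a real DAG you would template the date part of the S3 prefix per run


def unload_to_s3():
    # UNLOAD writes the files to S3 with no header row by default,
    # which is what the transfer job and the external table expect.
    conn = psycopg2.connect(
        host="my-cluster.example.eu-west-1.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="dwh", user="etl_user", password="***",
    )
    try:
        conn.autocommit = True
        with conn.cursor() as cur:
            cur.execute(UNLOAD_SQL)
    finally:
        conn.close()


dag = DAG(
    dag_id="redshift_unload_to_s3",
    start_date=datetime(2018, 8, 29),
    schedule_interval="0 */12 * * *",  # twice daily, as described above
    catchup=False,
)

unload_task = PythonOperator(
    task_id="unload_bietltools",
    python_callable=unload_to_s3,
    dag=dag,
)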

Now you have a table backed by the data flowing from S3 into the GCS bucket, but you have to specify the headers and column types before you can select your data in BQ.

You can then select from your table in BQ with a command like the one below:

SELECT * FROM `dataset.bi_1234__test.table` LIMIT 10;

In the table details you can see the schema menu, where you can add fields with the types of your columns.

Export Data from Redshift to S3 Bucket, Load Cloud Storage and Query on BQ

Hello Everyone,

My post title is going to the moon, sorry for that 🙂

Last week I spent my time on GC, but if your data is in RS you have to unload it from Redshift to S3 first, because GC cannot load data directly from Redshift. So you have to use a bridge for that; if you don't want to script it yourself in some scripting or OOP language, you can do it like below:

1 – Create a scheduled task to execute the code below:

unload ('select * from (select * from test.bietltools where somefields is not null limit 100)')
to 's3://bietltools-external/2018-08-29/bietltools_reports_dm_dt.csv'
iam_role 'arn:aws:iam::123456789101:role/RedshiftS3Access';

That query must be executed in Redshift. Once everything is fine and the files are on S3:

2 – Create a transfer task in Google Cloud Storage

Get your credentials, go to the Cloud Storage interface, and create a transfer task from S3 to GCS, filling in your own credentials, bucket names, etc.

You can find the details on the Cloud Storage page once your data starts moving from S3 to Cloud Storage.

3 – Create an external table in BigQuery

Create your BigQuery table as an external table in the BigQuery interface; you can create your first table with or without a schema.

If you care about the header in your file, no worries: you can add the fields on the BigQuery table options side.
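If you prefer to do step 3 from code instead of the interface, here is a minimal sketch using the google-cloud-bigquery client from the earlier Python post; the dataset, table, and GCS bucket names below are placeholders to replace with your own.

# Minimal sketch: define an external (federated) BigQuery table over CSV files in GCS.
# Dataset, table, and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="bigquery-to-tableau")

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-gcs-bucket/redshift-export/*.csv"]  # placeholder bucket
external_config.autodetect = True  # or add the fields later in the table options, as above

table_ref = client.dataset("BigTableau").table("bietltools_external")  # placeholder table name
table = bigquery.Table(table_ref)
table.external_data_configuration = external_config

client.create_table(table)  # every query on this table now reads straight from GCS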

And now you are ready to query your data, enjoy querying 🙂

Installation of Tableau Server 2018.2 on Ubuntu (On-Premises)

Hi Everyone,

In my last post, I mentioned easy DWH and easy dashboarding; this post is related to that one.

Let's start;

Create a folder and store the downloaded file in that path.

mkdir tableauinst

wget https://www.tableau.com/downloads/server/deb

Step 1: Install Tableau Server package and start Tableau Services Manager

Install Tableau Server with your distribution’s package manager, then run a script to initialize Tableau Services Manager (TSM). Tableau Services Manager is the management toolset used to install, configure, and manage Tableau services.

The initialize script is included with the installed package.

1 – Log on as a user with sudo access to the computer where you want to install Tableau Server.

2 – Navigate to the directory where you copied the Tableau Server installation package.

3 – Use the package manager to install the Tableau Server package.

sudo apt-get update
sudo apt-get -y install gdebi-core
sudo gdebi -n tableau-server-<version>_amd64.deb

4 – Run the following script to start TSM:

sudo ./initialize-tsm --accepteula

5 – After initialization is complete, close the terminal session:
logout

6 – Navigate to the scripts directory:

cd /opt/tableau/tableau_server/packages/scripts.<version>

7 – If your organization uses a forward proxy solution to access the internet, then configure Tableau Server to use the proxy server. Tableau Server must access the internet for map data and for default licensing functionality.

Step 2: Activate and register Tableau Server

Before you can configure Tableau Server you must activate a license and register.

Begin by logging on to the TSM web UI. See Sign in to Tableau Services Manager Web UI.

Step 3: Configure general server settings

The most important configuration on this Setup page is the identity store option.

I used local authentication; you can specify other options, check the Tableau website.

Step 4: Create the Tableau Server administrator account

Create the Tableau Server administrator account.

  • If you are using LDAP for authentication, then the account that you specify here must be a user in the directory.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' --username '<AD-user-name>'

  • On the other hand, if you are running Tableau Server with local authentication, the username and password that you specify here will be used to create the administrative account. Enter a strong password for this account.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' --username 'admin'

Use this account to access the Tableau Server admin web pages.

Step 5: Configure local firewall (optional)

We recommend that you run a local firewall on the computer that is running Tableau Server. This is a security best practice. By default, Linux distributions do not enable firewall during standard installations.

If you have installed or enabled a local firewall then you must open two ports for Tableau Server. These ports are the gateway port (TCP 80) and the tabadmincontroller port (TCP 8850). The following procedure shows an example of how to open these ports using Firewalld, which is the default firewall on CentOS. If you are using a different firewall then you’ll need to determine the right commands to run to open these ports.

  1. Start firewalld:

    sudo systemctl start firewalld

  2. Set default zone to public. Run the following command:

    sudo firewall-cmd --set-default-zone=public

  3. Add ports for the gateway port and the tabadmincontroller port. Run the following commands:

    sudo firewall-cmd --permanent --add-port=80/tcp

    sudo firewall-cmd --permanent --add-port=8850/tcp

  4. Reload the firewall and verify the settings. Run the following commands:

    sudo firewall-cmd --reload

    sudo firewall-cmd --list-all

    Bingo, Tableau is working now 🙂

Analytics on GC – BQ

Hello Everyone,

Over the last few days I spent some time finding solutions for easy DWH and easy dashboarding.

Let's start;

 

1 – Create a GC account

There is a promotion of $300 for the first year.

https://cloud.google.com/gcp/

2 – Create a project and enable billing for that project.

Just enter your credit card information for a $1 sample payment; Google will send it back to your bank account.

If you have some data in your current DWH, or some files sitting somewhere:

3 – Create a Cloud Storage bucket for BQ and set it up to receive data from your sources

4 – Create a sync job for S3, or wherever you want to get your source data from

I'm getting my data from S3. You have to set a name for the source and one for the destination, so I named them s3toGCstorage for the source and destinations3toGCstorage for the destination.

And now you have your data in GC Storage; it is scheduled and it works fine.

5 – Create a dataset on BigQuery

bq mk BigTableau

6 – Create a table and load data from GC Storage
On the web UI, or activate Cloud Shell and run:

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

In the end, you have scheduled a task on GC Storage that pulls your data from your sources, and you have created a BQ table to structure that data; now we have to create a dashboard or whatever else you need.

I will install Tableau Server for my case, but you can use Data Studio on GC or whatever you want; in the last few years lots of dashboard tools have added BQ support as a source.

I will mention that in another post.

Tableau-Server Version Upgrade on Ubuntu

Tableau Server Pre-Upgrade

Create an image of the server instance, e.g. backup_tableau-1_24082018

Connect to the Tableau server over SSH and log in to TSM on the command line.

tsm login -u XXXXX (which is your admin tsm user)

Perform a Full Backup and Restore of Tableau Server
There are two types of backup data that Tableau Server can generate. We recommend performing regular backups of each type in case you must restore a server in a recovery scenario:

1 – Repository data: Tableau Server data consists of the Tableau PostgreSQL database or repository, which contains workbook and user metadata, data extract files, and site configuration data. When you use TSM to create a backup, all of this data is saved in a single file with a .tsbak extension.

tsm maintenance backup -f tableau_server<version>.tsbak -d

2 – Configuration and Topology data: includes most of the server configuration information required to fully recover a server. SMTP, alerting, authentication, are all examples of configuration data that are exportable for backup. Topology data defines how your Tableau Server processes are configured in both single-server and multiple node deployments.

tsm settings export -f configuration.json

Start to Upgrade

1 – Log on as a user with sudo access to the computer you are upgrading.

Download the version you want from the Tableau website:

1.a) wget https://downloads.tableau.com/esdalt/2018.2.0/tableau-server-2018-2-0_amd64.deb

2 – Use the package manager to install the Tableau Server package

2.a) sudo gdebi -n tableau-server-2018-2-0_amd64.deb

3 – After you have installed the new package on every node in your cluster, stop the server:

3.a) tsm stop

4 – With Tableau Server stopped, run the following command on any one node.

4.a) sudo /opt/tableau/tableau_server/packages/scripts.20182.18.0627.2230/upgrade-tsm --accepteula -u XXXXX (which is your admin tsm user)

5 – The upgraded Tableau binary directory will be added to PATH for new shells. To get the updated path, either start a new session, or for bash users run:

5.a) source /etc/profile.d/tableau_server.sh

6 – Start Tableau Server:

6.a) tsm start -u XXXXX (which is your admin tsm user)

Confirm that everything works as expected; these are some areas of testing to consider:

Server processes:
Sign in to Tableau Server as a server administrator, and then open the Server Status page to confirm that all services and processes are running as expected (including on all additional nodes if this is a distributed installation).

User access:
Confirm that Tableau Server users can sign in. Test your normal user sign in process. Have some of your users participate in the testing to make sure they are able to sign in as expected, and that they can get to the same content that they have access to in your production environment.

Publishing workbooks and data sources:
Have users publish workbooks and data sources from Tableau Desktop to make sure this goes as you expect.

Viewing published workbooks:
Have users who are familiar with the content try to view published workbooks to make sure they appear as expected. Test views embedded in web pages (for example, in SharePoint pages).

Subscriptions and extract refreshes:
Manually run some extract refreshes to confirm that they complete successfully. Run some key scheduled extract refreshes to confirm that they complete as expected.

Permissions:
Confirm that permissions are still set as expected for users and content.

Command-line utilities and APIs:
If applicable, test the command line utilities (tsm and tabcmd) and programmatic access via APIs.

Run tsm status -v; if something is going wrong, try the tsm restart command.

If TSM does not start, check the logs: the upgrade log, app-upgrade.log, is located in /var/opt/tableau/tableau_server/logs.

Why is MySQL not used for big data?

Hi everyone,

I tried to find out which database management systems are the most used, or the most valued, in the world. I found a website that compares the databases by general structure.

I compared them at the link below, but I didn't understand why MySQL is not used for big data. If anyone can answer this question, you can contact me via the About-Contact page.

http://db-engines.com/en/system/DB2%3BMicrosoft+SQL+Server%3BMySQL%3BOracle%3BPostgreSQL

By the way, if you want to follow the ranking of databases, you can find it at the link below:

http://db-engines.com/en/ranking/relational+dbms

Cheers,

Talih Bayram

Career Roadmapping About Data

In following up on Dave Wells’ recent piece titled The Evolution (and Opportunity) of IT Careers, Jennifer takes a different look at the challenges of trying to understand why some people are happy and successful in their careers, while others just continue to struggle.


The concept of program, project, and operations as a significant career influence is one that I’ve worked with for years, and over that time it has inspired quite a bit of discussion. While Dave’s ideas about information, data, and systems are interesting, I think that they need to be slightly adjusted from a career perspective. Instead of information, data, and systems, let’s look at information, data, and technology. Today’s reality is that information and technology are diverging. So what makes sense from a career perspective is data, information, and technology roles.

[Image: jhay_jul01]

Now let’s intersect that with program, project, and operations to adjust the grid from the previous article. It now looks like the image in Figure 1.

In my career guidance role, I see a lot of potential value in using this view as a framework to build career roadmaps.

A career roadmap is a navigational concept that shows not only where you’ve been in your career but more interestingly where you aspire to go as your career unfolds into the future. At a high level, the roadmap looks at progression through the nine tiles illustrated in the diagram.

[Image: jhay_jul02]

Let’s look at the career roadmap for a persona that we’ll call Raoul. He started his career as a maintenance programmer, which clearly places him in the bottom right corner. Within a few years he moved from maintenance programming to software development, placing him at the intersection of project work and technology. His project work has sparked an interest in data and he now aspires to become a data architect at the intersection of data program work. Raoul’s career roadmap options are illustrated in Figure 2.

Raoul has a number of options to consider. He could stay at the project level and move horizontally into data as a data modeler (that’s path 1 in red), then move from data modeler to data architect.

[Image: jhay_jul03]

He could stay in the technology space, moving from project to program level by becoming a systems architect followed by a shift to data architect (that’s path 2 in purple). He could move diagonally from technology/project to data/program, but that’s a pretty aggressive move and likely more difficult to achieve. The edge-to-edge moves illustrated by paths 1 and 2 tend to be easier because the gaps between tiles are not as great as when attempting a corner-to-corner move such as illustrated by path 3.

So which path makes best sense for Raoul? Following path 1, he gets advantage from his project achievements and relationships, and begins to develop data experience. Following path 2 he gets an advantage from his technology and programming achievements and relationships and shifts from developer to architect. Each works as a step along the path to his ultimate goal of becoming a data architect. The best path depends on a combination of his interests and the job opportunities that are available to him.

Now let’s look at Lucy, another persona. At various times throughout her career, Lucy has worked as a business data analyst (project/information), a DBA (operations/data), and a data architect. Lucy’s data architect role was especially interesting because some of the most important work that she performed was as a liaison between the architecture group and project team. She was working not in a single tile but in two adjacent tiles – program/data and project/data. Lucy’s career roadmap is illustrated in figure 3.

[Image: jhay_jul04]

Today, combining her business, technical, and data experience she believes that she is a natural fit for a lead data steward role. But where does data steward fit in the framework? It doesn’t really fit into any of the tiles, nor is it represented by two or more adjacent tiles. The data steward role is an example of working in the “white space” that separates all of the tiles. White space jobs are often the most interesting of all, and they’re certainly important as essential roles that connect all of the pieces.

Summary

I’m in absolute agreement with Dave Wells in that the use of roadmaps in career planning will continue to grow as the field expands to include big data, analytics, and other advances. There will be interesting times ahead of us as technology demands increase and the IT field diversifies with business units assuming many roles that have traditionally existed in IT departments. Rapid evolution of both technology and skills will continue to be the norm as abundant opportunities emerge for every data, information, and systems professional.

 

Thank you so much to Jennifer Hay for this amazing article.

I'm glad to have read Jennifer's suggestions in this article and to share them.

You can see the original article on tdan.com:

http://tdan.com/career-roadmapping/20012