How to get data from the Twitch API with Python

Hello Everyone,

This is a small, quick Python script showing beginners how to call an API and how to parse JSON into CSV with pandas.

1 – Install the libraries below (pandas and requests via pip; json ships with Python) and import them:

import pandas as pd
import requests
import json

2 – Set your static values; by the way, you can also load these values from a .yml file (see the sketch after this block).

url = "https://wind-bow.glitch.me/twitch-api/channels/"
# List of channels we want to access
channels = ["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
            "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", "tsm_bjergsen", "boxbox", "a_seagull",
            "kinggothalion", "jahrein", "thenadeshot", "sivhd", "kingrichard"]

file_name = "talih.csv"
location = "C:\\Users\\talih\\Desktop\\TwitchPy\\TwitchAPI\\"
""" These values could also be loaded from a .yml file """
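Since the post mentions keeping these values in a .yml file, here is a minimal sketch of how that could look (it assumes PyYAML is installed; the file name and keys are only examples, not part of the original script):

import yaml

# config.yml would hold url, channels, file_name and location as top-level keys
with open("config.yml") as config_file:
    config = yaml.safe_load(config_file)

url = config["url"]
channels = config["channels"]
file_name = config["file_name"]
location = config["location"]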

3 – Set up your class and its functions:

class apicrawler:

    def __init__(self, url, channels, file_name, location):
        self.url = url
        self.channels = channels
        self.file_name = file_name
        self.location = location

    def selectedchannelcrawler(self):
        channels_list = []
        for channel in self.channels:
            json_content = requests.get(self.url + channel).json()
            channels_list.append([json_content['_id'], json_content['display_name'],
                                  json_content['status'], json_content['followers'],
                                  json_content['views']])

        dataset = pd.DataFrame(channels_list)
        dataset.columns = ['Id', 'Name', 'Status', 'Followers', 'Views']
        dataset.dropna(axis=0, how='any', inplace=True)
        dataset.index = pd.RangeIndex(len(dataset.index))
        dataset.to_csv(self.location + self.file_name, sep=',', encoding='utf-8')

4 – Instantiate the class with your own values and call the crawler:

crawler = apicrawler(url, channels, file_name, location)
crawler.selectedchannelcrawler()

Enjoy

Introduction to Apache Pulsar

Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo! that is part of the Apache Software Foundation.

Pulsar is a multi-tenant, high-performance solution for server-to-server messaging.

Pulsar's key features include multi-tenancy, geo-replication across clusters, very low publish and end-to-end latency, seamless scalability, multiple subscription modes (exclusive, shared and failover) and durable message storage backed by Apache BookKeeper.

Architecture Overview

At the highest level, a Pulsar instance is composed of one or more Pulsar clusters. Clusters within an instance can replicate data amongst themselves.

The diagram below provides an illustration of a Pulsar cluster:

Pulsar Comparison With Apache Kafka

The comparison below lists the similarities and differences between Apache Kafka and Apache Pulsar:

Concepts
  • Kafka: Producer - topic - consumer group - consumer
  • Pulsar: Producer - topic - subscription - consumer

Consumption
  • Kafka: More focused on streaming; exclusive messaging on partitions, with no shared consumption.
  • Pulsar: Unified messaging model and API: streaming via exclusive or failover subscriptions, queuing via shared subscriptions.

Acking
  • Kafka: Simple offset management: prior to Kafka 0.8, offsets are stored in ZooKeeper; after Kafka 0.8, offsets are stored in offset topics.
  • Pulsar: Unified messaging model and API: streaming via exclusive or failover subscriptions, queuing via shared subscriptions.

Retention
  • Kafka: Messages are deleted based on retention. If a consumer doesn't read messages before the retention period expires, it loses that data.
  • Pulsar: Messages are only deleted after all subscriptions have consumed them, so there is no data loss even if the consumers of a subscription are down for a long time. Messages can also be kept for a configured retention period even after all subscriptions have consumed them.

TTL
  • Kafka: No TTL support.
  • Pulsar: Supports message TTL.
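To make the producer-topic-subscription-consumer model concrete, here is a minimal sketch using the pulsar-client Python package; it assumes a broker running locally at pulsar://localhost:6650, and the topic and subscription names are just examples:

import pulsar

client = pulsar.Client('pulsar://localhost:6650')

# Producer publishes to a topic
producer = client.create_producer('my-topic')
producer.send('hello pulsar'.encode('utf-8'))

# A shared subscription gives queuing semantics: several consumers on the same
# subscription split the messages between them. Use ConsumerType.Exclusive or
# ConsumerType.Failover for streaming semantics instead.
consumer = client.subscribe('my-topic', subscription_name='my-subscription',
                            consumer_type=pulsar.ConsumerType.Shared)
msg = consumer.receive()
consumer.acknowledge(msg)

client.close()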

Conclusion

Apache Pulsar is an effort undergoing incubation at The Apache Software Foundation (ASF) sponsored by the Apache Incubator PMC. It seems that it will be a competitive alternative to Apache Kafka due to its unique features.

References

  1. Apache Pulsar homepage
  2. Yahoo! Open Source homepage
  3. Apache homepage
  4. Pulsar concepts and architecture documentation
  5. Comparing Pulsar and Kafka: Unified queueing and streaming
  6. Apache-Pulsar Distributed Pub-Sub Messaging System

Fixing Poor Redshift Performance with Vacuum

As you know, Redshift is very fast for reads and selects, but you still have to do some work to get close to the theoretical, white-paper performance. Even if you've carefully planned out your schema, sortkeys, distkeys and compression encodings, your Redshift queries may still be awfully slow if you have long-running vacuums taking place in the background.

The number one enemy for query performance is the vacuum—it can slow down your ETL jobs and analytical queries by as much as 80%. It is an I/O intensive process that sorts the table, reclaims unused disk space, and impacts all other I/O bound processes (such as queries against large tables). This guide can help you cut down the time it takes to vacuum your cluster (these steps lowered our vacuum time from 10–30 hours to less than 1 hour).

This guide assumes you’ve chosen sortkeys and distkeys for your table, and are vacuuming regularly. If you are not doing these things, use this guide and this guide to get them set up (the flow charts are quite helpful).

 Contents

  • 0 – What is the Vacuum?
  • 1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)
  • 2 – Use Compression Encodings (for Large Tables)
  • 3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)
  • 4 – Call ANALYZE After Vacuuming
  • 5 – VACUUM to 99% on Large Tables
  • 6 – Keep Your Tables Skinny

The vacuum is a process that carries out one or both of the following two steps: sorting tables and reclaiming unused disk blocks. Let's talk about sorting first,

VACUUM SORT ONLY;

The first time you insert data into the table, it will land sorted according to its sortkey (if one exists), and this data will make up the “sorted” section of the table. Note the unsorted percentage on the newly populated table below.

COPY my_table FROM 's3://my-bucket/csv';  -- plus your credentials or IAM role and format options
SELECT "table", unsorted FROM svv_table_info;
  table   | unsorted
----------+------------
 my_table |     0

Subsequent inserts are appended to a completely different section on disk called the “unsorted” section of the table. Calling VACUUM SORT ONLY initiates two processes,

  1. a sorting of the unsorted section,
  2. a merging of the sorted and unsorted sections;

both of these steps can be costly, but there are simple ways to cut down that cost, which we’ll discuss below.

Now onto deleting,

VACUUM DELETE ONLY;

If you called DELETE on any rows from your table since the last vacuum, they were merely marked for deletion. A vacuum operation is necessary to actually reclaim that disk space.

These two steps, sorting tables and reclaiming disk space, can be run together efficiently,

VACUUM FULL;

This command simply runs both a sort only and a delete only operation, but there are advantages to doing them together. If you have deleted and inserted new data, always do a "full" vacuum. It will be faster than a manual vacuum sort only followed by a manual vacuum delete only.

Now that we have described the vacuum, let's talk about how to make it faster. I'll describe each tip, then explain why it matters.

1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)

If you have a monotonically increasing sortkey like date, timestamp or auto-incrementing id, make that the first column of your (compound) sortkey. This will cause your inserts to conform to your sortkey configuration, and drastically reduce the merging Redshift needs to do when the vacuum is invoked. If you do one thing in this guide, do this.

Why?

If the unsorted section already belongs entirely at the end of the sorted section (say, because time is an arrow, and you're sorting by timestamp), then the merge step is over almost immediately.

Meanwhile, if you have two sorted sections, and you wish to merge them, but the sort order is interleaved between the two tables (say, because you’re sorting by customer), you will likely have to rewrite the entire table. This will cost you dearly!

Furthermore, if you generally run queries like,

SELECT 
   AGGREGATE(column)
FROM my_table
   WHERE date = '2018-01-01'
   AND action = 'message_clicked'
   AND customer = 'taco-town';

then a compound key, sorted by date first, will be both performant in terms of query speed and in terms of vacuum time. You may also consider sorting by customer or action, but these must be subsequent keys in the sortkey, not the first. It will be difficult to optimize your sortkey selection for every query pattern your cluster might see, but you can target and optimize the most likely patterns. Furthermore, by avoiding long vacuums, you are in effect improving query performance.
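As an illustration (the table, columns and connection details below are hypothetical, not from the original post), a table built for the query above would lead its compound sortkey with date; here it is created through psycopg2, in the same spirit as the vacuum loop later in this post:

import psycopg2

# Hypothetical cluster credentials and table; the point is the sortkey order.
conn = psycopg2.connect(host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
cursor = conn.cursor()
cursor.execute("""
    CREATE TABLE my_table (
        date     DATE        NOT NULL,
        action   VARCHAR(64),
        customer VARCHAR(64),
        metric   BIGINT
    )
    COMPOUND SORTKEY (date, action, customer);
""")
conn.commit()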

2 – Use Compression Encodings (for Large Tables)

Compression encodings will give you 2–4x compression on disk. Almost always use Zstandard encoding. But you may use the following command to get compression encoding recommendations on a column-by-column basis,

ANALYZE COMPRESSION my_table;

This command will lock the table for the duration of the analysis, so often you need to take a small copy of your table and run the analysis on it separately.

CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (
    -- Generate a pseudo-random filter of ~100,000 rows.
    -- This works for a table with ~10e9 rows.
    SELECT * FROM my_table
    WHERE ABS(STRTOL(LEFT(MD5('seed' || id), 15), 16)) < POW(2, 59)
);
-- Recreate my_table with these recommendations.
ANALYZE COMPRESSION my_table_tmp;

Alternatively, you may apply compression encoding recommendations automatically during a COPY (but only on the first insert to an empty table).

COPY my_table FROM 's3://bucket' COMPUPDATE ON;  -- plus your credentials or IAM role

If your tables are small enough to fit into memory without compression, then do not bother encoding them. If your tables are very small, and very low read latency is a requirement, get them out of Redshift altogether.

Why?

The smaller your data, the more data you can fit into memory, the faster your queries will be. So compression helps in both keeping disk space down and reducing the I/O cost of querying against tables that are much larger than memory.

For small tables, the calculus changes. We generally accept a small decompression cost over an I/O cost, but when there is no I/O cost because the table is small, then the decompression cost makes up a significant portion of the total query cost and is no longer worth it. Cutting down on disk space usage frees up the overhead to do deep copies if necessary (see point 3).

3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)

If for some reason your table ends up at more than 20% unsorted, you may be better off copying it than vacuuming it. Bear in mind that Redshift will require 2–3x the table size in free disk space to complete the copy.

Why?

On the first insert to an empty table, Redshift will sort the data according to the sortkey; on subsequent inserts it will not. So a deep copy is identical to a vacuum in this way (as long as the copy takes place in one step). It will likely complete much faster as well (and tie up fewer resources), but you may not have the 2–3x disk space overhead to complete the copy operation. That is why you should be using appropriate compression encodings (see point 2).

Your deep copy code:

BEGIN;
CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (SELECT * FROM my_table);
DROP TABLE my_table;
ALTER TABLE my_table_tmp RENAME TO my_table;
COMMIT;

4 – Call ANALYZE After Vacuuming

This is basic, but it gets left out. Call ANALYZE to update the query planner after you vacuum. The vacuum may have significantly reorganized the table, and you should update the planner stats. This can create a performance increase for reads, and the analyze process itself is typically quite fast.

5 – VACUUM to 99% on Large Tables

Push the vacuum to 99% if you have daily insert volume less than 5% of the existing table. The syntax for doing so is,

VACUUM FULL my_table TO 99 PERCENT;

You must specify a table in order to use the TO clause. Therefore, you probably have to write code like this:

for table in tables:
    cursor.execute('VACUUM FULL {} TO 99 PERCENT;'.format(table))

Why?

This one may seem counterintuitive. Many teams might clean up their redshift cluster by calling VACUUM FULL. This conveniently vacuums every table in the cluster. But, if a table’s unsorted percentage is less than 5%, Redshift skips the vacuum on that table. This process continues for every vacuum call until the table finally tops 5% unsorted, at which point the sorting will take place.

This is fine if the table is small, and resorting 5% of the table is a modest job. But if the table is very large, resorting and merging 5% of the table may be a significant time cost (it was for us).

Vacuuming more thoroughly on each call spreads the vacuum cost evenly across the events, instead of saving up unsorted rows, then running long vacuums to catch up.

You may wonder if this causes more total vacuum time. The answer is no, if you are following step 1, and inserting in sortkey order. The vacuum call amounts to a sorting of the unsorted section and a quick merge step. Sorting 5% of the table will take 5x the time that sorting 1% of the table does, and the merge step will always be fast if you are inserting new data in sortkey order.

Note: the most important thing is that when your ETL does inserts and deletes, Redshift keeps the dead and unsorted rows around until you vacuum, so performance degrades and costs go up. Keep your tables clean after every execution, or with daily crons.
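Putting tips 4 and 5 together with this note, a nightly maintenance job could look roughly like the sketch below (it assumes psycopg2; the connection details and table list are placeholders). VACUUM cannot run inside a transaction block, hence the autocommit:

import psycopg2

tables = ["my_table", "my_other_table"]

conn = psycopg2.connect(host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="etl_user", password="...")
conn.autocommit = True
cursor = conn.cursor()

for table in tables:
    cursor.execute("VACUUM FULL {} TO 99 PERCENT;".format(table))
    cursor.execute("ANALYZE {};".format(table))

conn.close()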

How to schedule a BigQuery ETL job with Dataprep

As you know, the BigQuery user interface lets you do all kinds of things: run an interactive or batch query, save results as a table, export to a table, etc. But there is no scheduler yet to run a query at a specific time or periodicity.

To be clear: once BigQuery has scheduled queries, you will want to use that, so that you can keep your data in BigQuery and take advantage of its power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View (see the sketch after this list for a programmatic alternative).
  2. In Cloud Dataprep, write a new recipe, with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, de-duplication, transformations, etc.
  3. Export the result of the transformation to a BigQuery table or a CSV file on Cloud Storage.
  4. Schedule the Dataprep flow to run periodically.
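For step 1, if you prefer to create the view programmatically rather than through the UI, a minimal sketch with the google-cloud-bigquery client could look like this (the project, dataset and query are hypothetical):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")
client.query("""
    CREATE OR REPLACE VIEW `my-project.my_dataset.my_view` AS
    SELECT user_id, COUNT(*) AS events
    FROM `my-project.my_dataset.raw_events`
    GROUP BY user_id
""").result()  # waits for the DDL job to finish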

Go to the “Flows” section of the Dataprep UI and click on the three buttons next to your new Flow. You’ll see an option to add a schedule:

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Options include daily, weekly, etc., but also a crontab format for further flexibility. That's it.

Happy querying.

 

How to reduce disk usage in Redshift?

Hello Everyone,

As you know, Redshift is used for DWH workloads, and for one reason or another the disk can fill up 🙂

Create a superuser for these kinds of queries, or log in with the Redshift superuser.

Let's start.
Execute the query below to list your tables (with their schemas) ordered by size in Redshift.

SELECT TRIM(pgn.nspname) AS SCHEMA,
       TRIM(a.name) AS TABLE,
       id AS TableId,
       decode(pgc.reldiststyle,
             0, 'EVEN',
             1,det.distkey ,
             8,'ALL'
       ) AS DistKey,
       decode(pgc.reldiststyle,
             8,NULL,
             dist_ratio.ratio::DECIMAL(20,4)
       ) AS Skew,
       det.head_sort AS "SortKey",
       det.n_sortkeys AS "#SKs",
       CASE WHEN pgc.reldiststyle = 8 THEN a.rows_all_dist ELSE a.rows END AS rows,
       b.mbytes,
       decode(det.max_enc,
             0,'N',
             'Y'
       ) AS Enc,
       det.pct_enc,
       decode(b.mbytes,
             0,0,
             ((b.mbytes/part.total::DECIMAL)*100)::DECIMAL(20,2)
       ) AS pct_of_total,
       (CASE WHEN a.rows = 0 THEN NULL ELSE
          CASE WHEN pgc.reldiststyle = 8 THEN ((a.rows_all_dist - pgc.reltuples)::DECIMAL(20,3) / a.rows_all_dist::DECIMAL(20,3)*100)::DECIMAL(20,2)
                ELSE ((a.rows - pgc.reltuples)::DECIMAL(20,3) / a.rows::DECIMAL(20,3)*100)::DECIMAL(20,2) END END
       ) AS pct_stats_off,
       CASE WHEN pgc.reldiststyle = 8
          THEN decode( det.n_sortkeys,0, NULL,DECODE( a.rows_all_dist,0,0, (a.unsorted_rows_all_dist::DECIMAL(32)/a.rows_all_dist)*100))::DECIMAL(20,2)
          ELSE decode( det.n_sortkeys,0, NULL,DECODE( a.rows,0,0, (a.unsorted_rows::DECIMAL(32)/a.rows)*100))::DECIMAL(20,2) END
        AS pct_unsorted
FROM (SELECT db_id,
             id,
             name,
             SUM(ROWS) AS ROWS,
             MAX(ROWS) AS rows_all_dist,
             SUM(ROWS) - SUM(sorted_rows) AS unsorted_rows,
             MAX(ROWS) - MAX(sorted_rows) AS unsorted_rows_all_dist
      FROM stv_tbl_perm a
      GROUP BY db_id,
               id,
               name) AS a
  JOIN pg_class AS pgc ON pgc.oid = a.id
  JOIN pg_namespace AS pgn ON pgn.oid = pgc.relnamespace
  LEFT OUTER JOIN (SELECT tbl, COUNT(*) AS mbytes FROM stv_blocklist GROUP BY tbl) b ON a.id = b.tbl
  INNER JOIN (SELECT attrelid,
                     MIN(CASE attisdistkey WHEN 't' THEN attname ELSE NULL END) AS "distkey",
                     MIN(CASE attsortkeyord WHEN 1 THEN attname ELSE NULL END) AS head_sort,
                     MAX(attsortkeyord) AS n_sortkeys,
                     MAX(attencodingtype) AS max_enc,
                     SUM(case when attencodingtype <> 0 then 1 else 0 end)::DECIMAL(20,3)/COUNT(attencodingtype)::DECIMAL(20,3)  *100.00 as pct_enc
              FROM pg_attribute
              GROUP BY 1) AS det ON det.attrelid = a.id
  INNER JOIN (SELECT tbl,
                     MAX(Mbytes)::DECIMAL(32) /MIN(Mbytes) AS ratio
              FROM (SELECT tbl,
                           TRIM(name) AS name,
                           slice,
                           COUNT(*) AS Mbytes
                    FROM svv_diskusage
                    GROUP BY tbl,
                             name,
                             slice)
              GROUP BY tbl,
                       name) AS dist_ratio ON a.id = dist_ratio.tbl
  JOIN (SELECT SUM(capacity) AS total
        FROM stv_partitions
        WHERE part_begin = 0) AS part ON 1 = 1
WHERE mbytes IS NOT NULL
AND   pgc.relowner > 1
-- and pgn.nspname = 'schema' -- schemaname
-- and a.name like 'table%' -- tablename
-- and det.max_enc = 0 -- non-compressed tables
ORDER BY mbytes DESC;

First, find the tables that are not being used and remove them.

DROP TABLE bietltools.orders;  -- or: TRUNCATE TABLE bietltools.orders;

Second, vacuum the tables that need it; you can prioritize by the size of the data.

vacuum full bietltools.customer to 100 percent;

Third, the difficult part: encoding.

ANALYZE COMPRESSION
bietltools.bietltools_reporting
COMPROWS 10000

COMPROWS (the number of rows sampled) is important, otherwise the analysis works over far more data than necessary and your disk usage can hit 100% 🙂

When you get the results, the steps are like below;

1 - alter table bietltools.bietltools_reporting add column new_tooladded_time timestamp ENCODE zstd;

2 - update bietltools.bietltools_reporting  set new_tooladded_time =tooladded_time;

3 - alter table bietltools.bietltools_reporting drop column tooladded_time;

4 - alter table bietltools.bietltools_reporting rename column new_tooladded_time to tooladded_time;

I wrote zstd in bold for a reason: Redshift has several encoding types for each datatype, and you have to figure out and set the best encoding for each column.
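If you have to repeat those four statements for several columns, you can script them; below is a minimal sketch with psycopg2 (the connection details are placeholders, the table and column are the examples from this post):

import psycopg2

conn = psycopg2.connect(host="my-cluster.example.eu-west-1.redshift.amazonaws.com",
                        port=5439, dbname="dev", user="bietl", password="...")
conn.autocommit = True
cursor = conn.cursor()

cursor.execute("ALTER TABLE bietltools.bietltools_reporting "
               "ADD COLUMN new_tooladded_time TIMESTAMP ENCODE zstd;")
cursor.execute("UPDATE bietltools.bietltools_reporting "
               "SET new_tooladded_time = tooladded_time;")
cursor.execute("ALTER TABLE bietltools.bietltools_reporting DROP COLUMN tooladded_time;")
cursor.execute("ALTER TABLE bietltools.bietltools_reporting "
               "RENAME COLUMN new_tooladded_time TO tooladded_time;")

conn.close()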

One more useful query below, as a bonus;

Show database sizes

select sum(mbytes)/1024 as db_size_in_gb, database from (
select trim(pgdb.datname) as Database,
trim(a.name) as Table, b.mbytes
from stv_tbl_perm a
join pg_database as pgdb on pgdb.oid = a.db_id
join (select tbl, count(*) as mbytes
from stv_blocklist group by tbl) b on a.id=b.tbl
where a.slice=0
order by db_id, name)
group by database;

 

Welcome to the black box of AWS

How to append your data from Cloud Storage to BigQuery with Python (ETL)

Hello Everyone,

BigQuery is a fully managed enterprise data warehouse for analytics. It is cheap and highly scalable. In this article, I would like to share a basic tutorial for Google Cloud Storage and BigQuery with Python.

Installation
pip install google-cloud-bigquery

Create credentials

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Additionally, make sure this environment variable is set in the environment where your script runs.

Read from Cloud Storage, append into BigQuery

# Import libraries
from google.cloud import bigquery
from google.oauth2 import service_account

# Set credentials ("Create your own credential files on your Google Cloud account")
credentials = service_account.Credentials.from_service_account_file(
    'C:\\Users\\talih\\Desktop\\BigQuerytoTableau-6d00b31bb9ab.json')
project_id = 'bigquery-to-tableau'

# Set table_ref, project_id and credentials for the POST request
client = bigquery.Client(credentials=credentials, project=project_id)
table_ref = client.dataset('BigTableau').table('dataflowbasics_schemas')

# Specify your API post request with a few parameters
job_config = bigquery.LoadJobConfig()
# WRITE_EMPTY only loads into an empty table; use WriteDisposition.WRITE_APPEND
# if you want to append to an existing table.
job_config.write_disposition = bigquery.WriteDisposition.WRITE_EMPTY
job_config.skip_leading_rows = 1
job_config.autodetect = True
job_config.allow_jagged_rows = True
job_config.ignore_unknown_values = True
job_config.max_bad_records = 1000
schema = [
    bigquery.SchemaField('bietl_id', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('bietltools_name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('bietl_usage', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('bietl_salary', 'INTEGER', mode='NULLABLE'),
]
job_config.schema = schema

# Set your Cloud Storage bucket and object name
uri = 'gs://desctinations3tostorage/bietl20180829.csv'

# Post your request to the Google API
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for the table load to complete.

assert load_job.state == 'DONE'
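Optionally, you can add a quick sanity check at the end of the script above (it reuses the client and table_ref variables defined there) to confirm the rows landed:

destination_table = client.get_table(table_ref)  # API request
print("Loaded {} rows into {}.".format(destination_table.num_rows,
                                       destination_table.table_id))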

Now you can access your data in the BigQuery interface.

How to use dbt in a Python environment

dbt is a useful tool for building a data mart (or several) on top of your DWH. You can find all the details in the official dbt pages.

I have used it a few times, so I can show you how to create dbt models and dbt configs in your own project. You can do that with the steps below;

1 – Create a profiles.yml file for the dbt profile. Specify your DB connection information etc.
2 – Create a data_model folder as your project directory.
3 – Create a .yml file as the main project file; you will reference it as the project file.
4 – Create your own dbt runner file, e.g. dbt_runner.py, and set your execution configs in it (a minimal sketch follows the layout below).
5 – Create a models folder; you will put your models in that folder.
6 – Create a schema or model for yourself and put a xxxx.schema.yml file into that folder.
6.1 – Put some table values in it, like below;

bietl_patch:
  constraints:
    unique:
      - somthng_id
    not_null:
      - somthng_id
      - xxx_id

In the end, you will have a folder layout and schema like below;

profiles.yml            # dbt profile: your DB connection information etc. (on the root directory)
dbt_runner.py           # your Python runner file
bietl_data_model/       # project folder
    bietl_datamarts.yml             # main project file
    models/                         # the specification of your bietl models
        bietl_datamarts.schema.yml
        *.sql                       # the SQL files for your models
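A minimal sketch of what the dbt_runner.py from step 4 could look like is below; it simply shells out to the dbt CLI (it assumes dbt is installed in the same environment and that profiles.yml sits in the root directory, one level above the project folder, as in the layout above):

import subprocess

# Run `dbt run` from inside the project folder, pointing dbt at the profiles.yml
# that lives one directory up (the root directory in the layout above).
subprocess.run(
    ["dbt", "run", "--profiles-dir", ".."],
    cwd="bietl_data_model",
    check=True,
)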

I'm executing dbt from Airflow DAGs, but I didn't cover that here; maybe in the next post.

Automate update of data marts to BigQuery

Hey Everyone,

For various reasons I'm using AWS (Redshift and an S3 bucket), so the implementation is a twice-daily batch ingestion of data from S3 to Cloud Storage.

1 – Create a DAG in Airflow (a minimal sketch is shown after these steps)

2 – Execute an UNLOAD query in Redshift to export to S3 with no header

3 – Open Google Cloud interface and use Cloud Storage Transfer Service

Cloud Storage Transfer Service supports one-time transfers or recurring transfers. It provides advanced filters based on file creation dates, filename filters, and the times of day you prefer to import data. It also supports the deletion of the source data after it’s been copied.

3.1 – Create a GCS bucket for storing data in Cloud Storage

3.2 – Select the source: Amazon S3. You will provide the S3 bucket name, an access key and a secret key for a user authorized on that bucket to list objects, read and delete them.

3.3 – Select the destination: provide the GCS bucket name and the synchronization options (overwrite, delete the destination, delete the source).

3.4 – Set the scheduled time and a description for the task.

4 – Now it's time to create a table for querying; open the Google BigQuery interface.

4.1 – Create a dataset in the BigQuery interface.

4.2 – Create a table in BigQuery following the steps below:

4.2.1 – Create a table from Google Cloud Storage

4.2.2 – Select the file from the GCS bucket which you created

4.2.3 – Select the file format

4.2.4 – Specify the destination table information; the table type must be External (so your daily appends from S3 to GCS show up without reloading)

4.2.5 – Don't specify a schema; we will provide it later.
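For steps 1 and 2, a minimal Airflow DAG sketch is below (it assumes Airflow 1.x with a Redshift connection configured as 'redshift'; the schedule, query, bucket and IAM role are placeholders):

from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

dag = DAG(
    dag_id="redshift_unload_to_s3",
    start_date=datetime(2018, 8, 1),
    schedule_interval="0 */12 * * *",  # twice daily
    catchup=False,
)

# UNLOAD writes no header row by default, which is what we want here.
unload_to_s3 = PostgresOperator(
    task_id="unload_to_s3",
    postgres_conn_id="redshift",
    sql="""
        UNLOAD ('select * from my_schema.my_datamart')
        TO 's3://my-bucket/datamart/export_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftS3Role'
        ALLOWOVERWRITE;
    """,
    dag=dag,
)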

Now you have a table serving data from the S3-to-GCS bucket, but you still have to specify headers and column types before you can select your data in BQ.

When you query your table in BQ with a command like the one below;

SELECT * FROM `dataset.bi_1234__test.table` LIMIT 10;

On the screen below, you can see the schema menu, where you can add fields with the types of your columns.

Export Data from Redshift to S3 Bucket, Load Cloud Storage and Query on BQ

Hello Everyone,

My post header is going to the moon sorry for that 🙂

Last week I spent my time on GC, but if your data is in RS you have to unload it from Redshift to S3, because GC cannot load data directly from Redshift. So you need a bridge; if your programming skills in a scripting or OOP language aren't enough, you can do it like below;

1 – Create a scheduled task to execute the code below;

unload ('select * from test.bietltools where somefields is not null limit 100')
to 's3://bietltools-external/2018-08-29/bietltools_reports_dm_dt.csv'
iam_role 'arn:aws:iam::123456789101:role/RedshiftS3Access';

That query must be executed in Redshift; if everything looks fine on S3, move on.

2 – Create a transfer task in Google Cloud Storage

Get your credentials, go to the Cloud Storage interface and create a transfer task from S3 to GCS, filling in your own credentials, bucket names, etc.

You can find the details on the Cloud Storage page as your data moves from S3 to Cloud Storage.

3 – Create an external table in BigQuery

Create your BigQuery table as an external table in the BigQuery interface; you can create your first table with or without a schema.

If you care about the header in your file, no worries: you can add the fields later in the BigQuery table options.
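If you later want to create the same external table programmatically instead of through the UI, a minimal sketch with the google-cloud-bigquery client could look like this (the dataset, table and bucket names are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="bigquery-to-tableau")

# External data configuration pointing at the CSV files in the GCS bucket
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-gcs-bucket/2018-08-29/*.csv"]
external_config.autodetect = True  # or define the schema/header fields explicitly

table = bigquery.Table(client.dataset("BigTableau").table("bietltools_external"))
table.external_data_configuration = external_config
client.create_table(table)  # API request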

And now you are ready to query your data, enjoy querying 🙂

Installation of Tableau Server on Ubuntu On-Premise 2018.2

Hi Everyone,

In my last post, I mentioned easy DWH and easy dashboarding; this post is related to that one.

Lets Start;

Create a folder and store the downloaded file in that path.

mkdir tableauinst

wget https://www.tableau.com/downloads/server/deb

Step 1: Install Tableau Server package and start Tableau Services Manager

Install Tableau Server with your distribution’s package manager, then run a script to initialize Tableau Services Manager (TSM). Tableau Services Manager is the management toolset used to install, configure, and manage Tableau services.

The initialize script is included with the installed package.

1 – Log on as a user with sudo access to the computer where you want to install Tableau Server.

2 – Navigate to the directory where you copied the Tableau Server installation package.

3 – Use the package manager to install the Tableau Server package.

sudo apt-get update
sudo apt-get -y install gdebi-core
sudo gdebi -n tableau-server-<version>_amd64.deb

4 – Run the following script to start TSM:

sudo ./initialize-tsm --accepteula

5 – After initialization is complete, close the terminal session:
logout

6 – Navigate to the scripts directory:

cd /opt/tableau/tableau_server/packages/scripts.<version>

7 – If your organization uses a forward proxy solution to access the internet, then configure Tableau Server to use the proxy server. Tableau Server must access the internet for map data and for default licensing functionality.

Step 2: Activate and register Tableau Server

Before you can configure Tableau Server you must activate a license and register.

Begin by logging on to the TSM web UI. See "Sign in to Tableau Services Manager Web UI".

Step 3: Configure general server settings

The most important configuration on this Setup page is the identity store option.

I chose local authentication; you can specify other options, check the Tableau website.

Step 4: Create the Tableau Server administrator account

Create the Tableau Server administrator account.

  • If you are using LDAP for authentication, then the account that you specify here must be a user in the directory.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' --username '<AD-user-name>'

  • On the other hand, if you are running Tableau Server with local authentication, the username and password that you specify here will be used to create the administrative account. Enter a strong password for this account.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' --username 'admin'

Use this account to access the Tableau Server admin web pages.

Step 5: Configure local firewall (optional)

We recommend that you run a local firewall on the computer that is running Tableau Server. This is a security best practice. By default, Linux distributions do not enable firewall during standard installations.

If you have installed or enabled a local firewall then you must open two ports for Tableau Server. These ports are the gateway port (TCP 80) and the tabadmincontroller port (TCP 8850). The following procedure shows an example of how to open these ports using Firewalld, which is the default firewall on CentOS. If you are using a different firewall then you’ll need to determine the right commands to run to open these ports.

  1. Start firewalld:

    sudo systemctl start firewalld

  2. Set default zone to public. Run the following command:

    sudo firewall-cmd --set-default-zone=public

  3. Add ports for the gateway port and the tabadmincontroller port. Run the following commands:

    sudo firewall-cmd --permanent --add-port=80/tcp

    sudo firewall-cmd --permanent --add-port=8850/tcp

  4. Reload the firewall and verify the settings. Run the following commands:

    sudo firewall-cmd --reload

    sudo firewall-cmd --list-all

    Bingo, Tableau is working now; here is just a sample from a Tableau dashboard 🙂