How to append your data from Cloud Storage to BigQuery with Python (ETL)

Hello Everyone,

BigQuery is a fully-managed enterprise data warehouse for analytics. It is cheap and high-scalable. In this article, I would like to share a basic tutorial for Google Cloud Storage and  BigQuery with Python.

Installation
pip install google-cloud-bigquery

Create credentials

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json

Additionally, please set the PATH to environment variables.

Read from Cloud Storage Append on Big Query

#Import libraries
from google.cloud import bigquery
from google.oauth2 import service_account

#Set Credentials “Create your own credential files on google cloud account”
credentials = service_account.Credentials.from_service_account_file(
‘C:\\Users\\talih\Desktop\\BigQuerytoTableau-6d00b31bb9ab.json’)
project_id = ‘bigquery-to-tableau’

#Set table_ref,project_id and credentials for POST request
client = bigquery.Client(credentials= credentials,project=project_id)
table_ref = client.dataset(‘BigTableau’).table(‘dataflowbasics_schemas’)

#Specify your api post requests with  a few parameter
job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_EMPTY
job_config.skip_leading_rows = 1
job_config.autodetect = True
job_config.allow_jagged_rows = True
job_config.ignore_unknown_values = True
job_config.max_bad_records = 1000
schema = [
bigquery.SchemaField(‘bietl_id’, ‘INTEGER’, mode=’REQUIRED’),
bigquery.SchemaField(‘bietltools_name’, ‘STRING’, mode=’REQUIRED’),
bigquery.SchemaField(‘bietl_usage’, ‘FLOAT’, mode=’NULLABLE’),
bigquery.SchemaField(‘bietl_salary’, ‘INTEGER’, mode=’NULLABLE’),
]
job_config.schema = schema

#Set your cloud storage bucket name
uri = ‘gs://desctinations3tostorage/bietl20180829.csv’

#Post your request to Google Api
load_job = client.load_table_from_uri(
uri,
table_ref,
job_config=job_config) # API request

assert load_job.job_type == ‘load’

load_job.result() # Waits for table load to complete.

assert load_job.state == ‘DONE’

Now you can access your own data on big query interfaces.

How to use dbt in python environment

Dbt is usefull library for dwh to create a datamart or datamarts. You can find all details in dbt official pages.

I used a few times, so i can clarify for you how you can create a dbt models and dbt configs in your own project, you can do that like below step by steps;

1 – Create a profiles.yml file for DBT Profile. Specify your db connection information etc.
2 – Create a data_model folder like project_dir
3 – Create a .yml file for main project .yml file and you will call it like project_file
4 – Create your own dbt_runner file like dbt_runner.py and set it your execution configs
5 – Create a model folder, you will put your models in that folder
6 – Create a schema or model for yourself and put into that folder a xxxx.schema.yml file
6.1 – Put some table value like below;
bietl_patch:
constraints:
unique:
– somthng_id
not_null:
– somthng_id
– xxx_id

In the end, you will have like below folder and schema;
# DBT Profile. Specify your DB connection information etc.
profiles.yml on the root directory
bietl_data_model folder
bietl_datamarts.yml file
dbt_runner.py python file

bietl_data_model
> models
> specification of your models bietl
> bietl_datamarts.schema.yml
> sql files for using.sql

I’m executing that dbt in airflow das but I didn’t mention it, maybe in next post.

Automate update of data marts to BigQuery

Hey Everyone,

Some reason I’m using AWS – RS and S3 bucket, so for the implementation is 2x daily means batch ingestion of data S3 to Cloud Storage.

1 – Create a dag in Airflow

2 – Execute query Unload in Redshift to S3 with no header

3 – Open Google Cloud interface and use Cloud Storage Transfer Service

Cloud Storage Transfer Service supports one-time transfers or recurring transfers. It provides advanced filters based on file creation dates, filename filters, and the times of day you prefer to import data. It also supports the deletion of the source data after it’s been copied.

3.1 – Create a GCS bucket for storing data in Cloud Storage

3.2 – Select source: GCS you will provide s3 bucket name, access key, a secret key which user authorized s3 bucket to list object, read and delete buckets.

3.3 – Select destination: Provide GCS bucket name and synchronization options (overwrite, delete the destination, delete the source)

3.4 – Set scheduled time and set a description for the task

4 – Now time to Create a table for Querying, open Google Big Query interface

4.1 – Create a DataSet in Big Query interfaces

4.2 – Create a table in Big Query like below steps

4.2.1 – Create a table from Google Cloud Storage

4.2.2 – Select file from GCS bucket which was created by you

4.2.3 – Select file format

4.2.4 – Specify Destination table information with Table type must be external (for daily append your data from s3 to GCS)

4.2.5 – Don’t specify schema we will provide it later.

Now you have a table which providing data from s3 to GCS bucket, but you have to specify headers and column type then you can select your data in BQ.

When you select your table on BQ like below command;

SELECT * FROM `dataset.bi_1234__test.table ` LIMIT 10;

On the below screen, you can see the schema menu than you can add fields with types of your column.

Export Data from Redshift to S3 Bucket, Load Cloud Storage and Query on BQ

Hello Everyone,

My post header is going to the moon sorry for that 🙂

Last week I spent my time on GC but, if your data is in RS you have to unload your data from redshift to s3 cause of GC is not loaded data directly in redshift. So you have to use a bridge for that issue if your programming skills not enough in some scripting or oop languages you can do like below;

1 – Create a scheduled task for execution of below code;

unload (‘select * from test.bietltools where somefields is not null limit 100;’)
to ‘s3://bietltools-external/2018-08-29/bietltools_reports_dm_dt.csv’
iam_role ‘arn:aws:iam::123456789101:role/RedshiftS3Access’;

That query must be executed in Redshift, so if everything is fine on S3;

2 – Create a transfer task in Google Cloud Storage

Get your credentials and go to the cloud storage interface and create a transfer task in GCS from s3, fill text with your own credentials and bucket name etc.

You can find details on cloud storage page when you get your data from s3 to cloud storage.

3 – Create an external table in BigQuery

Create your Big Query table like an external table on BigQuery interface and create your first table with the schema or without the schema.

If you care about header in your file, no worry you can add the fields in big query table options side.

And now you are ready to query your data, enjoy querying 🙂

Installation of Tableau Server Ubuntu On Promise 2018.2

Hi Everyone,

On my last post, I mentioned to Easy DWH and Easy Dashboarding that post is related with my last post.

Lets Start;

Create a folder and store downloaded file in that path.

mkdir tableauinst

wget https://www.tableau.com/downloads/server/deb

Step 1: Install Tableau Server package and start Tableau Services Manager

Install Tableau Server with your distribution’s package manager, then run a script to initialize Tableau Services Manager (TSM). Tableau Services Manager is the management toolset used to install, configure, and manage Tableau services.

The initialize script is included with the installed package.

1 – Log on as a user with sudo access to the computer where you want to install Tableau Server.

2 – Navigate to the directory where you copied the Tableau Server installation package.

3 – Use the package manager to install the Tableau Server package.

sudo apt-get update
sudo apt-get -y install gdebi-core
sudo gdebi -n tableau-server-<version>_amd64.deb

4 – Run the following script to start TSM:

sudo ./initialize-tsm –accepteula

5 – After initialization is complete, close the terminal session:
logout

6 – Navigate to the scripts directory:

cd /opt/tableau/tableau_server/packages/scripts.<version>

7 – If your organization uses a forward proxy solution to access the internet, then configure Tableau Server to use the proxy server. Tableau Server must access the internet for map data and for default licensing functionality.

Step 2: Activate and register Tableau Server

Before you can configure Tableau Server you must activate a license and register.

Beginning by logging on to the TSM web UI. See Sign in to Tableau Services Manager Web UI.

Step 3: Configure general server settings

The most important configuration on this Setup page is the identity store option.

I did Local Auth, so you can specify others, check on Tableau website.

Step 4: Create the Tableau Server administrator account

Create the Tableau Server administrator account.

  • If you are using LDAP for authentication, then the account that you specify here must be a user in the directory.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' –username '<AD-user-name>'

  • On the other hand, if you are running Tableau Server with local authentication, the username and password that you specify here will be used to create the administrative account. Enter a strong password for this account.

    Run the following command:

    tabcmd initialuser --server 'localhost:80' --username 'admin'

Use this account to access the Tableau Server admin web pages.

Step 5: Configure local firewall (optional)

We recommend that you run a local firewall on the computer that is running Tableau Server. This is a security best practice. By default, Linux distributions do not enable firewall during standard installations.

If you have installed or enabled a local firewall then you must open two ports for Tableau Server. These ports are the gateway port (TCP 80) and the tabadmincontroller port (TCP 8850). The following procedure shows an example of how to open these ports using Firewalld, which is the default firewall on CentOS. If you are using a different firewall then you’ll need to determine the right commands to run to open these ports.

  1. Start firewalld:

    sudo systemctl start firewalld

  2. Set default zone to public. Run the following command:

    sudo firewall-cmd --set-default-zone=public

  3. Add ports for the gateway port and the tabadmincontroller port. Run the following commands:

    sudo firewall-cmd --permanent --add-port=80/tcp

    sudo firewall-cmd --permanent --add-port=8850/tcp

  4. Reload the firewall and verify the settings. Run the following commands:

    sudo firewall-cmd --reload

    sudo firewall-cmd --list-all

    Bingo, Tableau is working now, just a sample from tableau dashboard 🙂

Analytics on GC – BQ

Hello Everyone,

Last a few days I spent the time to find some solutions for easy DWH and easy Dashboarding.

Lets Start;

 

1Create a GC account

 

Which is promoted 300 $ per 1st year.

https://cloud.google.com/gcp/

2 – Create A project and enable billing for that project.

 

Just write your credit card information for 1$ sample payment, Google will send again your bank account.

If you have some data from your current DWH or some files on somewhere;

3Create a Cloud Storage for BQ
set getting data from some sources

4Create a Sync Job for s3 or wherever you want to get data sources

I’m getting my data from s3, you have to set a name for source and one name for a destination, so I mention that like s3toGCstorage source and for destination destinations3toGCstorage.

And now you have your data on GC Storage, it scheduled and it works fine.

5Create a dataset on BigQuery

bq mk BigTableau

6Create a Table and Load Data from GCStorage
On webui or activate cloud shell

bq –location=[LOCATION] load –source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]

In the end, you scheduled your task on GC Storage, that task getting your data from your sources, you created a BQ table to the struct that your data, and now we have to create a dashboard or whatever.

I will install tableau server for my issue, but you can use data studio in GC or whatever you want, in last a few years lots of Dashboard tools support to BQ for sourcing.

I will mention that in another post.

Tableau-Server Version Upgrade on Ubuntu

Tableau Server Pre-Upgrade

Create an image of Instance of the server, backup_tableau-1_24082018

Connect to tableau servers with ssh and login TSM on the command line.

tsm login -u XXXXX (which is your admin tsm user)

Perform a Full Backup and Restore of Tableau Server
There are two types of backup data that Tableau Server can generate. We recommend performing regular backups of each type in case you must restore a server in a recovery scenario:

1 – Repository data: Tableau Server data consists of the Tableau PostgreSQL database or repository, which contains workbook and user metadata, data extract files, and site configuration data. When you use TSM to create a backup, all of this data is saved in a single file with a .tsbak extension.

tsm maintenance backup -f tableau_server<version>.tsbak -d

2 – Configuration and Topology data: includes most of the server configuration information required to fully recover a server. SMTP, alerting, authentication, are all examples of configuration data that are exportable for backup. Topology data defines how your Tableau Server processes are configured in both single-server and multiple node deployments.

tsm settings export -f configuration.json

Start to Upgrade

1 – Log on as a user with sudo access to the computer you are upgrading.

Download which version you want to get on Tableau Web site:

1.a) wget https://downloads.tableau.com/esdalt/2018.2.0/tableau-server-2018-2-0_amd64.deb

2 – Use the package manager to install the Tableau Server package

2.a) sudo gdebi -n tableau-server-2018-2-0_amd64.deb

3 – After you have installed the new package on every node in your cluster

3.a) tsm stop

4 – With Tableau Server stopped, run the following command on any one node.

4.a) sudo /opt/tableau/tableau_server/packages/scripts.20182.18.0627.2230/upgrade-tsm –accepteula -u XXXXX (which is your admin tsm user)

5 – The upgraded Tableau binary directory will be added to PATH for new shells. To get the updated path, either start a new session, or for bash users run:

5.a) source /etc/profile.d/tableau_server.sh

6 – Start Tableau Server:

6.a) tsm start -u XXXXX (which is your admin tsm user)

Confirm that everything works as expected, these are some areas of testing to consider:

Server processes:
Sign in to Tableau Server as a server administrator, and then open the Server Status page to confirm that all services and processes are running as expected (including on all additional nodes if this is a distributed installation).

User access:
Confirm that Tableau Server users can sign in. Test your normal user sign in process. Have some of your users participate in the testing to make sure they are able to sign in as expected, and that they can get to the same content that they have access to in your production environment.

Publishing workbooks and data sources:
Have users publish workbooks and data sources from Tableau Desktop to make sure this goes as you expect.

Viewing published workbooks:
Have users who are familiar with the content try to view published workbooks to make sure they appear as expected. Test views embedded in web pages (for example, in SharePoint pages).

Subscriptions and extract refreshes :
Manually run some extract refreshes to confirm that they complete successfully. Run some key scheduled extract refreshes to confirm that they complete as expected.

Permissions:
Confirm that permissions are still set as expected for users and content.

Command-line utilities and APIs. If applicable, test the command line utilities (tsm and tabcmd) and programmatic access via APIs.

tsm status -v if something going wrong try to figure out tsm restart command.

If tsm is not started try to check logs with below command;
The upgrade log, app-upgrade.log, is located in /var/opt/tableau/tableau_server/logs

Why mysql is not used on big data?

Hi everyone,

I try to check database management system which is most used or which is most valueable in the world. I found a website like compare the databases for general structure.

I compared them on a link, you can check below, but i didnt understood to why mysql is not used in big data. If anyone can answer this question, you can contact me on about-contact page.

http://db-engines.com/en/system/DB2%3BMicrosoft+SQL+Server%3BMySQL%3BOracle%3BPostgreSQL

By the way, if you want to follow rank of databases you can catch on below link;

http://db-engines.com/en/ranking/relational+dbms

Cheers,

Talih Bayram

Career Roadmapping About Data

In following up on Dave Wells’ recent piece titled The Evolution (and Opportunity) of IT Careers, Jennifer takes a different look at the challenges of trying to understand why some people are happy and successful in their careers, while others just continue to struggle.


The concept of program, project, and operations as a significant career influence is one that I’ve worked with for years, and over that time it has inspired quite a bit of discussion. While Dave’s ideas about information, data, and systems are interesting, I think that they need to be slightly adjusted from a career perspective. Instead of information, data, and systems, let’s look at information, data, and technology. Today’s reality is that information and technology are diverging. So what makes sense from a career perspective is data, information, and technology roles.

jhay_jul01

Now let’s intersect that with program, project, and operations to adjust the grid from the previous article. It now looks like the image in Figure 1.

In my career guidance role, I see a lot of potential value to use this view as a framework to build career roadmaps.

A career roadmap is a navigational concept that shows not only where you’ve been in your career but more interestingly where you aspire to go as your career unfolds into the future. At a high level, the roadmap looks at progression through the nine tiles illustrated in the diagram.

jhay_jul02

Let’s look at the career roadmap for a persona that we’ll call Raoul. He started his career as a maintenance programmer, which clearly places him in the bottom right corner. Within a few years he moved from maintenance programming to software development, placing him at the intersection of project work and technology. His project work has sparked an interest in data and he now aspires to become a data architect at the intersection of data program work. Raoul’s career roadmap options are illustrated in Figure 2.

Raoul has a number of options to consider. He could stay at the project level and move horizontally into data as a data modeler (that’s path 1 in red), then move from data modeler to data architect.

jhay_jul03

He could stay in the technology space, moving from project to program level by becoming a systems architect followed by a shift to data architect (that’s path 2 in purple). He could move diagonally from technology/project to data/program, but that’s a pretty aggressive move and likely more difficult to achieve. The edge-to-edge moves illustrated by paths 1 and 2 tend to be easier because the gaps between tiles are not as great as when attempting a corner-to-corner move such as illustrated by path 3.

So which path makes best sense for Raoul? Following path 1, he gets advantage from his project achievements and relationships, and begins to develop data experience. Following path 2 he gets an advantage from his technology and programming achievements and relationships and shifts from developer to architect. Each works as a step along the path to his ultimate goal of becoming a data architect. The best path depends on a combination of his interests and the job opportunities that are available to him.

Now let’s look at Lucy, another persona. At various times throughout her career, Lucy has worked as a business data analyst (project/information), a DBA (operations/data), and a data architect. Lucy’s data architect role was especially interesting because some of the most important work that she performed was as a liaison between the architecture group and project team. She was working not in a single tile but in two adjacent tiles – program/data and project/data. Lucy’s career roadmap is illustrated in figure 3.

jhay_jul04.png

Today, combining her business, technical, and data experience she believes that she is a natural fit for a lead data steward role. But where does data steward fit in the framework? It doesn’t really fit into any of the tiles, nor is it represented by two or more adjacent tiles. The data steward role is an example of working in the “white space” that separates all of the tiles. White space jobs are often the most interesting of all, and they’re certainly important as essential roles that connect all of the pieces.

Summary

I’m in absolute agreement with Dave Wells in that the use of roadmaps in career planning will continue to grow as the field expands to include big data, analytics, and other advances. There will be interesting times ahead of us as technology demands increase and the IT field diversifies with business units assuming many roles that have traditionally existed in IT departments. Rapid evolution of both technology and skills will continue to be the norm as abundant opportunities emerge for every data, information, and systems professional.

 

Thank you so much for this amazing article to Jennifer Hay.

I’m glad to read Jennifer’s suggestions on this article and so sharing.

You can see original article on tdan.com

http://tdan.com/career-roadmapping/20012

It’s all in the Data: Everybody is a Data Steward

A Data Steward is someone that has formal accountability for data in the organization. I say that everybody in the organization is a Data Steward. You may disagree with me or think that this idea is preposterous; however, I hope to change your mind by the end of this short column. Please give me five minutes.

Data-Column

My premise is based on the fact that everybody that comes in contact with data should have formal accountability for that contact. In other words, people that define, produce, and use data must be held accountable for how they define, produce, and use the data. This may be common sense, but the truth is that this is not taking place. Formalizing accountability to execute and enforce authority over data is the essence of using stewardship to govern data.

Most people agree that everybody that uses sensitive data must protect that data. The sensitive data may contain PII data (personally identifiable information) or PHI data (personal health information) or even IP data (intellectual property) that has a clear set of rules associated with how that data can be shared and who can have access to that data. The rules may be external as in the case of PII and PHI data, or the rules can be internal as in the case of IP data. But one thing is for certain: there are rules associated with at least some of your data.

The truth is that the rules for protecting sensitive data must 1) apply equally to everybody that comes in contact with sensitive data, 2) everybody must know and live the rules, 3) the rules must be formally enforced, and 4) the ability to demonstrate that people are following the rules must be auditable. This, my friends, is what I am proving in this column. Everybody that uses sensitive data must be held formally accountable for how they use the data. Therefore, they are, by my definition, a Data Steward. A Non-Invasive Data Governance™ program focuses on formalizing that level of data usage accountability.

Data Usage is only one facet of the Everybody is a Data Steward notion. What about people that define or produce data? Shouldn’t they also have formal accountability for their actions? The answer to that question is ‘Yes.’

People that define data – either by entering the data or finding new data sources, creating new systems, creating new databases, or propagating new spread-marts that will be used for decision making – should be held formally accountable for checking to see what already exists before producing, as an example, another version of the customer. People that define the ‘golden record’ or system-of-record or master data resources for your organization should be held formally accountable for the quality and value of the definition of that data.

Non-Invasive Data Governance™ recognizes the data producers as stewards of the data as well. If you produce data one of the ways mentioned previously, it is important that you understand the impact you have on the value of that data to the organization. Accepting default values may or may not be a good thing. Entering dummy data where real data is required is never a good thing. Allowing data that is not up to standards to enter your data resources may wreak havoc on decision-making. Calculating profitability may be inconsistent from product to product. People that produce data – through their functions and processes – should be held accountable for how they produce that data including the quality, accuracy, and value of the data they produce.

It all boils down to whether or not you believe that everybody with a relationship to the data should be held formally accountable for that relationship. Basically, every person in your organization has a relationship to the data. Therefore,Everybody is a Data Steward.

The idea that Everybody is a Data Steward may scare you a smidge. Most data governance programs do not follow the thinking that everybody in the organization is a data steward. In fact, most programs assign or hire people to be data stewards. The Non-Invasive Data Governance™ approach allows for certain people to be stewards at a more tactical level (subject matter experts), but the approach calls for identifying or recognizing these people based on their existing levels of authority associated with their data domains.

Are you convinced yet that Everybody is a Data Steward? Does this concept mean that your data governance program will become, in some way, more complex?  From my experience the answer is ‘not necessarily’. It depends on how you communicate and address this main tenet of data stewardship. The Everybody is a Data Steward notion guarantees that accountability for data is consistent across the organization for everybody.

Thanks for tdan community, for this article.

You can see original article in this link;

http://tdan.com/its-all-in-the-data-everybody-is-a-data-steward-get-over-it/20003