How to schedule a BigQuery ETL job with Dataprep

As you know, the BigQuery user interface lets you do all kinds of things, like run an interactive or batch query, save results as a table, export to a table, etc., but there is no scheduler yet to run a query at a specific time or on a recurring basis.

To be clear: once BigQuery has scheduled queries, you will want to use that, so that you can keep your data in BigQuery and take advantage of its power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View.
  2. In Cloud Dataprep, write a new recipe with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, deduplication, other transformations, etc.
  3. Export the result of the transformation to a BigQuery table or a CSV file on Cloud Storage.
  4. Schedule the Dataprep flow to run periodically.
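
Step 1 can also be scripted instead of clicked through. Here is a minimal sketch, assuming BigQuery standard SQL DDL and a client object from the google-cloud-bigquery library. The helper names view_ddl and save_as_view, and every project, dataset, and view name in the usage comment, are hypothetical.

```python
def view_ddl(view_id: str, sql: str) -> str:
    """Build a standard SQL DDL statement that saves `sql` as a view.

    `view_id` is a fully qualified name such as "project.dataset.view".
    """
    return f"CREATE OR REPLACE VIEW `{view_id}` AS\n{sql.strip().rstrip(';')}"


def save_as_view(client, view_id: str, sql: str) -> None:
    """Run the DDL with a BigQuery client (any object exposing .query())."""
    client.query(view_ddl(view_id, sql)).result()  # wait for the DDL to finish


# Usage (assumes credentials are configured; all names are placeholders):
#   from google.cloud import bigquery
#   save_as_view(bigquery.Client(), "my-project.my_dataset.daily_view",
#                "SELECT * FROM `my-project.my_dataset.events`")
```

The view can then be picked as the BigQuery source of the Dataprep recipe in step 2.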

Go to the “Flows” section of the Dataprep UI and click on the three buttons next to your new Flow. You’ll see an option to add a schedule:

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Options include daily, weekly, etc., as well as a crontab format for further flexibility. That’s it.
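
If you have not used crontab syntax before, the sketch below shows how an expression breaks down. It assumes the common five-field Unix crontab layout; the exact dialect Dataprep accepts is not stated here and may differ, so check it in the scheduling dialog before relying on this. The helper name split_cron is hypothetical.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Split a five-field crontab expression into named fields."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected {len(CRON_FIELDS)} fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


# "0 6 * * 1" reads as: minute 0, hour 6, any day of month, any month, Monday.
weekly = split_cron("0 6 * * 1")
# "0 6,18 * * *" reads as: 06:00 and 18:00 every day.
twice_daily = split_cron("0 6,18 * * *")
```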

Happy querying.

 

How to append your data from Cloud Storage to BigQuery with Python (ETL)

Hello Everyone,

BigQuery is a fully managed enterprise data warehouse for analytics. It is cheap and highly scalable. In this article, I would like to share a basic tutorial for Google Cloud Storage and BigQuery with Python.

Installation
pip install google-cloud-bigquery

Create credentials

export GOOGLE_APPLICATION_CREDENTIALS="/home/user/Downloads/[FILE_NAME].json"

Additionally, make sure this path is set as an environment variable wherever your script runs.
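
A quick way to confirm the variable is visible to Python before running a load is sketched below. It uses only the standard library; the helper name credentials_path is hypothetical, and the check only confirms that the file exists, not that the key is valid.

```python
import os


def credentials_path(env=os.environ) -> str:
    """Return the key file path from GOOGLE_APPLICATION_CREDENTIALS.

    Raises RuntimeError if the variable is unset or the file is missing.
    """
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    if not os.path.isfile(path):
        raise RuntimeError(f"credentials file not found: {path}")
    return path
```

Calling credentials_path() at the top of a script makes a missing or mistyped path fail early with a clear message.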

Read from Cloud Storage and append to BigQuery

# Import libraries
from google.cloud import bigquery
from google.oauth2 import service_account

# Set credentials (create your own credential file in your Google Cloud account)
credentials = service_account.Credentials.from_service_account_file(
    'C:\\Users\\talih\\Desktop\\BigQuerytoTableau-6d00b31bb9ab.json')
project_id = 'bigquery-to-tableau'

# Set project_id, credentials, and table_ref for the API request
client = bigquery.Client(credentials=credentials, project=project_id)
table_ref = bigquery.DatasetReference(project_id, 'BigTableau').table('dataflowbasics_schemas')

# Configure the load job with a few parameters
job_config = bigquery.LoadJobConfig()
# WRITE_APPEND adds rows to an existing table; WRITE_EMPTY would fail once the table has data
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.skip_leading_rows = 1  # skip the CSV header row
job_config.autodetect = True
job_config.allow_jagged_rows = True
job_config.ignore_unknown_values = True
job_config.max_bad_records = 1000
schema = [
    bigquery.SchemaField('bietl_id', 'INTEGER', mode='REQUIRED'),
    bigquery.SchemaField('bietltools_name', 'STRING', mode='REQUIRED'),
    bigquery.SchemaField('bietl_usage', 'FLOAT', mode='NULLABLE'),
    bigquery.SchemaField('bietl_salary', 'INTEGER', mode='NULLABLE'),
]
job_config.schema = schema

# Set your Cloud Storage bucket name and file
uri = 'gs://desctinations3tostorage/bietl20180829.csv'

# Send your request to the Google API
load_job = client.load_table_from_uri(
    uri,
    table_ref,
    job_config=job_config)  # API request

assert load_job.job_type == 'load'

load_job.result()  # Waits for the table load to complete.

assert load_job.state == 'DONE'

Now you can access your data in the BigQuery interface.
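
If the load runs on a schedule, the file name usually changes from run to run. The sketch below builds the URI for a given day, assuming the digits in the file name above are a date in YYYYMMDD form. That is a guess based on the example file name, and the helper name daily_uri is hypothetical.

```python
from datetime import date


def daily_uri(bucket: str, prefix: str, day: date) -> str:
    """Build the Cloud Storage URI of the CSV file for `day`."""
    return f"gs://{bucket}/{prefix}{day:%Y%m%d}.csv"


uri = daily_uri("desctinations3tostorage", "bietl", date(2018, 8, 29))
# uri == "gs://desctinations3tostorage/bietl20180829.csv"
```

In a scheduled job you would pass date.today() and hand the result to the load request.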

Automate update of data marts to BigQuery

Hey Everyone,

For some reason I’m using AWS (Redshift and an S3 bucket), so the implementation is a twice-daily batch ingestion of data from S3 to Cloud Storage.

1 – Create a DAG in Airflow

2 – Execute an UNLOAD query in Redshift to S3 with no header

3 – Open the Google Cloud interface and use the Cloud Storage Transfer Service

Cloud Storage Transfer Service supports one-time transfers or recurring transfers. It provides advanced filters based on file creation dates, filename filters, and the times of day you prefer to import data. It also supports the deletion of the source data after it’s been copied.

3.1 – Create a GCS bucket for storing data in Cloud Storage

3.2 – Select source: provide the S3 bucket name, plus the access key and secret key of a user who is authorized to list, read, and delete objects in the S3 bucket.

3.3 – Select destination: Provide GCS bucket name and synchronization options (overwrite, delete the destination, delete the source)

3.4 – Set scheduled time and set a description for the task

4 – Now it is time to create a table for querying: open the Google BigQuery interface

4.1 – Create a dataset in the BigQuery interface

4.2 – Create a table in BigQuery with the steps below

4.2.1 – Create a table from Google Cloud Storage

4.2.2 – Select the file from the GCS bucket you created

4.2.3 – Select file format

4.2.4 – Specify the destination table information; the table type must be external (so that the data appended daily from S3 to GCS is picked up)

4.2.5 – Don’t specify a schema; we will provide it later.

Now you have a table that serves the data transferred from S3 to the GCS bucket, but you have to specify the headers and column types before you can select your data in BQ.
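
The headers and column types can also be declared in code rather than in the UI. The following is a minimal sketch, assuming BigQuery standard SQL DDL for external tables; the helper name external_table_ddl and all table, column, and bucket names are hypothetical placeholders.

```python
def external_table_ddl(table_id: str, columns: list, uri: str) -> str:
    """Build DDL for a CSV-backed external table with an explicit schema.

    `columns` is a list of (name, standard SQL type) pairs.
    """
    cols = ",\n".join(f"  `{name}` {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE OR REPLACE EXTERNAL TABLE `{table_id}` (\n{cols}\n)\n"
        f"OPTIONS (format = 'CSV', uris = ['{uri}'])"
    )


# Hypothetical columns and bucket; replace them with your own.
ddl = external_table_ddl(
    "my-project.my_dataset.my_external_table",
    [("id", "INT64"), ("name", "STRING"), ("amount", "FLOAT64")],
    "gs://my-bucket/unload/*",
)
# Run it with a configured client, e.g. bigquery.Client().query(ddl).result()
```

Because the files were unloaded without a header row, the sketch does not skip any leading rows.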

Select from your table in BQ with a command like the one below:

SELECT * FROM `dataset.bi_1234__test.table` LIMIT 10;

In the table’s schema menu, you can then add fields with the types of your columns.
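
Going back to step 2, the Redshift UNLOAD statement can be generated in Python before the Airflow task sends it to Redshift. This is a minimal sketch: the helper name unload_sql is hypothetical, and authorizing with an IAM role is an assumption (the post does not say how the unload is authorized). The role ARN, bucket, and query are placeholders.

```python
def unload_sql(query: str, s3_prefix: str, iam_role: str) -> str:
    """Build a Redshift UNLOAD statement that writes CSV files without a header.

    UNLOAD writes no header row unless the HEADER option is given, so it is
    simply left out. Single quotes inside the query are doubled because the
    query itself is passed as a quoted string.
    """
    escaped = query.replace("'", "''")
    return (
        f"UNLOAD ('{escaped}')\n"
        f"TO '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS CSV\n"
        "ALLOWOVERWRITE"
    )


# Placeholder values; replace them with your own query, bucket, and role.
sql = unload_sql(
    "SELECT * FROM my_mart WHERE region = 'EU'",
    "s3://my-s3-bucket/unload/my_mart_",
    "arn:aws:iam::123456789012:role/my-unload-role",
)
```

ALLOWOVERWRITE lets each twice-daily run replace the previous files under the same prefix; drop it if you would rather keep every export.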

Analytics on GC – BQ

Hello Everyone,

Over the last few days I spent time looking for solutions for an easy DWH and easy dashboarding.

Let’s start:

 

1 – Create a GC account

It comes with a promotional credit of $300 for the first year.

https://cloud.google.com/gcp/

2 – Create a project and enable billing for that project.

Just enter your credit card information for a $1 sample payment; Google will send it back to your bank account.

If you have some data from your current DWH or some files somewhere:

3 – Create a Cloud Storage bucket for BQ
Set it up to receive data from your sources.

4 – Create a sync job for S3 or wherever you want to get your data from

I’m getting my data from S3. You have to set one name for the source and one for the destination; I named the source s3toGCstorage and the destination destinations3toGCstorage.

Now you have your data on GC Storage; the job is scheduled and works fine.

5 – Create a dataset on BigQuery

bq mk BigTableau

6 – Create a table and load data from GC Storage
Use the web UI or activate Cloud Shell:

bq --location=[LOCATION] load --source_format=[FORMAT] [DATASET].[TABLE] [PATH_TO_SOURCE] [SCHEMA]
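
If you want to launch the same load from a script, the placeholders in the command above can be filled in from Python. This is a minimal sketch; the helper name bq_load_args is hypothetical, and the location, table name, source path, and inline schema are illustrative values, not ones taken from this setup.

```python
def bq_load_args(location, source_format, dataset, table, source, schema):
    """Build the argument list for `bq load`, mirroring the template above."""
    return [
        "bq",
        f"--location={location}",
        "load",
        f"--source_format={source_format}",
        f"{dataset}.{table}",
        source,
        schema,
    ]


args = bq_load_args(
    "US",                          # assumed location
    "CSV",
    "BigTableau",
    "my_table",                    # hypothetical table name
    "gs://my-bucket/my_file.csv",  # hypothetical source path
    "id:INTEGER,name:STRING",      # hypothetical inline schema
)
# To run it (requires the bq tool on PATH, e.g. in Cloud Shell):
#   import subprocess
#   subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the schema string.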

In the end, you have scheduled a task on GC Storage that gets your data from your sources, and you have created a BQ table that matches the structure of your data. Now we have to create a dashboard or whatever.

I will install Tableau Server for my use case, but you can use Data Studio in GC or whatever you want; in the last few years, lots of dashboard tools have added support for BQ as a source.

I will cover that in another post.