Automate updates of data marts to BigQuery

Hey Everyone,

For some reason I’m using AWS (Redshift and an S3 bucket), so the implementation is a twice-daily batch ingestion of data from S3 to Cloud Storage.

1 – Create a DAG in Airflow

2 – Execute an UNLOAD query in Redshift to write the data to S3 with no header row (a rough sketch covering both steps follows)
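
Here is a minimal sketch of such a DAG, assuming Airflow 1.x and an Airflow connection named redshift_default pointing at the Redshift cluster (both names are my own placeholders); the table, bucket and IAM role are the ones used later in this post, so swap in yours.

# Minimal sketch of the DAG; dag_id, connection name and schedule are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

# Same UNLOAD as later in this post; {{ ds }} puts the run date into the S3 path.
UNLOAD_SQL = """
unload ('select * from test.bietltools where somefields is not null')
to 's3://bietltools-external/{{ ds }}/bietltools_reports_dm_dt.csv'
iam_role 'arn:aws:iam::123456789101:role/RedshiftS3Access';
"""

dag = DAG(
    dag_id="redshift_unload_to_s3",
    start_date=datetime(2018, 8, 1),
    schedule_interval="0 6,18 * * *",  # twice a day, matching the 2x daily batch
    catchup=False,
)

# UNLOAD writes its output to S3 without a header row by default.
unload_task = PostgresOperator(
    task_id="unload_bietltools_to_s3",
    postgres_conn_id="redshift_default",
    sql=UNLOAD_SQL,
    dag=dag,
)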

3 – Open the Google Cloud console and use the Cloud Storage Transfer Service

Cloud Storage Transfer Service supports one-time and recurring transfers. It provides advanced filters based on file creation dates and filenames, lets you choose the time of day to import data, and can delete the source data after it has been copied.

3.1 – Create a GCS bucket for storing data in Cloud Storage

3.2 – Select the source: Amazon S3. Provide the S3 bucket name plus an access key and secret key for a user authorized to list, read and (optionally) delete objects in that bucket.

3.3 – Select the destination: provide the GCS bucket name and the synchronization options (overwrite the destination, delete from the destination, delete the source after transfer)

3.4 – Set the scheduled time and a description for the task
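
If you would rather not click through the console each time, the same job can also be created through the Storage Transfer Service API. This is a rough sketch with the Python Google API client, assuming Application Default Credentials are set up; the project ID, schedule, bucket names and AWS keys are placeholders for your own.

# Rough sketch using the Storage Transfer Service v1 API; all IDs are placeholders.
import googleapiclient.discovery

storagetransfer = googleapiclient.discovery.build("storagetransfer", "v1")

transfer_job = {
    "description": "Recurring S3 -> GCS copy of the Redshift unload files",
    "status": "ENABLED",
    "projectId": "my-gcp-project",
    "schedule": {
        "scheduleStartDate": {"year": 2018, "month": 8, "day": 29},
        "startTimeOfDay": {"hours": 7, "minutes": 0},
    },
    "transferSpec": {
        "awsS3DataSource": {
            "bucketName": "bietltools-external",
            "awsAccessKey": {
                "accessKeyId": "YOUR_AWS_ACCESS_KEY_ID",
                "secretAccessKey": "YOUR_AWS_SECRET_ACCESS_KEY",
            },
        },
        "gcsDataSink": {"bucketName": "my-gcs-landing-bucket"},
        # Mirrors the console's synchronization options.
        "transferOptions": {"deleteObjectsFromSourceAfterTransfer": False},
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print("Created transfer job:", result["name"])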

4 – Now it’s time to create a table for querying, so open the Google BigQuery interface

4.1 – Create a dataset in the BigQuery interface

4.2 – Create a table in BigQuery by following the steps below

4.2.1 – Create a table from Google Cloud Storage

4.2.2 – Select the file from the GCS bucket you created

4.2.3 – Select file format

4.2.4 – Specify the destination table information; the table type must be external (so the table keeps reflecting the data appended daily from S3 to GCS)

4.2.5 – Don’t specify a schema; we will provide it later.
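
For reference, the same external table can also be defined with the google-cloud-bigquery client library. A minimal sketch, assuming made-up project, dataset, table and bucket names:

# Minimal sketch; project, dataset, table and bucket names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-gcs-landing-bucket/*"]
# Redshift UNLOAD writes pipe-delimited files with no header unless told otherwise.
external_config.options.field_delimiter = "|"
external_config.options.skip_leading_rows = 0
external_config.autodetect = True  # or skip this and attach a schema later

table = bigquery.Table("my-gcp-project.my_dataset.bietltools_external")
table.external_data_configuration = external_config
client.create_table(table)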

Now you have a table serving the data that flows from S3 into the GCS bucket, but you still have to specify the column names and types before you can select your data in BigQuery.

You can select from your table in BigQuery with a command like the one below:

SELECT * FROM `dataset.bi_1234__test.table` LIMIT 10;

On the screen below, you can see the schema menu, where you can add fields with the types of your columns.
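
The schema can also be attached programmatically instead of through the schema menu. A small sketch against the external table from the previous snippet; the column names and types here are made up, so use your own:

# Small sketch for attaching an explicit schema; fields are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table = client.get_table("my-gcp-project.my_dataset.bietltools_external")
external_config = table.external_data_configuration
external_config.autodetect = False
external_config.schema = [
    bigquery.SchemaField("tool_name", "STRING"),
    bigquery.SchemaField("report_date", "DATE"),
    bigquery.SchemaField("row_count", "INTEGER"),
]
table.external_data_configuration = external_config
client.update_table(table, ["external_data_configuration"])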

Export Data from Redshift to an S3 Bucket, Load It into Cloud Storage and Query It on BigQuery

Hello Everyone,

My post header is going to the moon, sorry for that 🙂

Last week I spent my time on Google Cloud, but if your data is in Redshift you have to unload it from Redshift to S3 first, because Google Cloud cannot load data directly from Redshift. So you need a bridge for that, and even if your skills in scripting or OOP languages aren’t strong, you can do it like below:

1 – Create a scheduled task that executes the code below:

-- LIMIT isn't allowed in the outer SELECT of an UNLOAD, so it is wrapped in a subquery
unload ('select * from (select * from test.bietltools where somefields is not null limit 100)')
to 's3://bietltools-external/2018-08-29/bietltools_reports_dm_dt.csv'
iam_role 'arn:aws:iam::123456789101:role/RedshiftS3Access';

That query must be executed in Redshift; if everything went fine, the file will be waiting on S3.
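
If you want to confirm from code that the unload really landed before the transfer runs, here is a quick boto3 sketch; the bucket and key prefix match the unload target above.

# Quick sanity check that the unloaded file(s) are on S3.
import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket="bietltools-external",
    Prefix="2018-08-29/bietltools_reports_dm_dt.csv",
)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])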

2 – Create a transfer task in Google Cloud Storage

Get your credentials, go to the Cloud Storage interface and create a transfer task from S3 to GCS, filling in the fields with your own credentials, bucket names, etc.

You can find the details on the Cloud Storage page once your data has arrived from S3 in Cloud Storage.

3 – Create an external table in BigQuery

Create your BigQuery table as an external table in the BigQuery interface; you can create this first table with or without a schema.

If you care about the headers in your file, no worries: you can add the fields on the BigQuery table options side.

And now you are ready to query your data, enjoy querying 🙂