Which is the fastest web framework?

While researching for a project, I was reading some benchmarks on GitHub and Google and came across an article whose results surprised me. I found them very interesting; you can check the results below.

Results

| Language | Framework | Average | 50th percentile | 90th percentile | Standard deviation | Requests / s | Throughput |
| --- | --- | --- | --- | --- | --- | --- | --- |
| rust (1.38) | nickel (0.11) | 0.24 ms | 0.20 ms | 0.39 ms | 199.33 | 37636.67 | 4.97 Mb |
| ruby (2.6) | syro (3.1) | 2.70 ms | 0.63 ms | 7.94 ms | 4143.67 | 46844.33 | 1.80 Mb |
| ruby (2.6) | roda (3.25) | 2.79 ms | 0.67 ms | 8.10 ms | 4188.67 | 45494.67 | 2.88 Mb |
| rust (1.38) | iron (0.6) | 3.04 ms | 2.93 ms | 4.49 ms | 1368.67 | 21289.67 | 1.76 Mb |
| ruby (2.6) | cuba (3.9) | 3.06 ms | 0.57 ms | 9.41 ms | 5064.33 | 41756.67 | 3.27 Mb |
| ruby (2.6) | rack-routing (0.0) | 3.84 ms | 0.73 ms | 11.38 ms | 5596.00 | 33197.00 | 1.27 Mb |
| c (11) | agoo-c (0.7) | 4.59 ms | 4.28 ms | 8.86 ms | 3377.00 | 209196.00 | 8.02 Mb |
| ruby (2.6) | camping (2.1) | 4.67 ms | 0.49 ms | 16.08 ms | 8578.67 | 27517.67 | 1.74 Mb |
| node (12.11) | sifrr (0.0) | 4.79 ms | 4.32 ms | 9.82 ms | 4054.33 | 203003.67 | 11.84 Mb |
| nim (1.0) | httpbeast (0.2) | 5.02 ms | 4.54 ms | 9.61 ms | 3651.00 | 192873.33 | 18.20 Mb |
| python (3.7) | japronto (0.1) | 5.13 ms | 4.63 ms | 9.96 ms | 3909.00 | 189149.67 | 15.00 Mb |
| cpp (11) | drogon (1.0) | 5.42 ms | 4.90 ms | 9.91 ms | 3546.00 | 177889.33 | 11.44 Mb |
| ruby (2.6) | flame (4.18) | 5.47 ms | 0.49 ms | 19.16 ms | 10784.33 | 23469.67 | 0.90 Mb |
| swift (5.1) | swifter (1.4) | 5.56 ms | 0.85 ms | 14.54 ms | 86445.00 | 11871.00 | 1.01 Mb |
| cpp (11) | evhtp (1.2) | 5.90 ms | 5.24 ms | 9.56 ms | 2929.33 | 160195.67 | 10.30 Mb |
| go (1.13) | gorouter-fasthttp (4.2) | 6.04 ms | 5.38 ms | 9.31 ms | 3625.67 | 156380.00 | 16.64 Mb |
| go (1.13) | fasthttprouter (0.1) | 6.18 ms | 5.11 ms | 9.14 ms | 9601.00 | 161432.00 | 17.23 Mb |
| go (1.13) | atreugo (8.2) | 6.26 ms | 5.19 ms | 9.27 ms | 8758.00 | 159818.33 | 21.31 Mb |
| go (1.13) | fasthttp (1.5) | 6.33 ms | 4.87 ms | 8.99 ms | 12920.33 | 168733.33 | 18.05 Mb |
| crystal (0.31) | router.cr (0.2) | 6.46 ms | 5.66 ms | 10.55 ms | 3220.00 | 149292.33 | 9.31 Mb |
| crystal (0.31) | toro (0.4) | 6.47 ms | 5.67 ms | 10.55 ms | 3222.00 | 148845.33 | 9.28 Mb |
| ruby (2.6) | hanami (1.3) | 6.66 ms | 0.60 ms | 22.99 ms | 11912.33 | 19246.00 | 9.67 Mb |
| crystal (0.31) | raze (0.3) | 6.75 ms | 5.91 ms | 10.91 ms | 3312.00 | 143044.33 | 8.91 Mb |
| java (8) | rapidoid (5.5) | 6.80 ms | 5.09 ms | 11.01 ms | 13146.00 | 163718.00 | 19.53 Mb |
| crystal (0.31) | kemal (0.28) | 7.09 ms | 6.44 ms | 10.91 ms | 3221.67 | 135783.67 | 14.71 Mb |
| nim (1.0) | jester (0.4) | 7.29 ms | 6.65 ms | 11.72 ms | 3980.00 | 145089.00 | 19.33 Mb |
| c (11) | kore (3.3) | 7.41 ms | 5.96 ms | 13.15 ms | 9455.67 | 161905.67 | 29.17 Mb |
| crystal (0.31) | amber (0.3) | 7.67 ms | 7.04 ms | 12.12 ms | 3559.67 | 126369.67 | 15.33 Mb |
| ruby (2.6) | sinatra (2.0) | 7.97 ms | 0.68 ms | 26.40 ms | 13397.67 | 16038.67 | 2.76 Mb |
| crystal (0.31) | orion (1.7) | 8.35 ms | 7.80 ms | 13.19 ms | 3888.33 | 116803.00 | 12.65 Mb |
| ruby (2.6) | grape (1.2) | 9.51 ms | 0.80 ms | 31.25 ms | 15583.67 | 13576.00 | 0.51 Mb |
| java (8) | act (1.8) | 9.73 ms | 7.65 ms | 13.12 ms | 15954.67 | 121682.33 | 13.92 Mb |
| go (1.13) | gorouter (4.2) | 9.88 ms | 8.05 ms | 16.18 ms | 9871.67 | 105554.67 | 9.31 Mb |
| go (1.13) | rte (0.0) | 9.89 ms | 7.84 ms | 15.73 ms | 13656.33 | 107851.67 | 9.58 Mb |
| rust (1.38) | actix-web (1.0) | 10.23 ms | 9.78 ms | 13.79 ms | 3258.67 | 105115.33 | 10.09 Mb |
| go (1.13) | echo (4.1) | 10.69 ms | 8.56 ms | 19.02 ms | 8175.67 | 96770.67 | 11.26 Mb |
| go (1.13) | violetear (7.0) | 10.72 ms | 8.90 ms | 16.17 ms | 12167.00 | 97186.00 | 8.55 Mb |
| go (1.13) | gin (1.4) | 10.99 ms | 8.60 ms | 19.20 ms | 10553.00 | 96706.33 | 11.25 Mb |
| go (1.13) | goroute (0.0) | 11.04 ms | 8.56 ms | 19.05 ms | 13101.00 | 96836.00 | 11.27 Mb |
| go (1.13) | chi (4.0) | 11.16 ms | 8.27 ms | 18.36 ms | 17216.00 | 101013.67 | 8.96 Mb |
| go (1.13) | beego (1.12) | 11.22 ms | 8.77 ms | 19.63 ms | 10964.00 | 95757.67 | 8.54 Mb |
| go (1.13) | kami (2.2) | 11.31 ms | 8.63 ms | 17.37 ms | 18533.00 | 98168.67 | 8.66 Mb |
| go (1.13) | webgo (3.0) | 11.62 ms | 9.32 ms | 18.84 ms | 12303.33 | 91533.00 | 8.08 Mb |
| python (3.7) | falcon (2.0) | 12.45 ms | 10.15 ms | 20.53 ms | 7317.67 | 80978.00 | 12.58 Mb |
| go (1.13) | air (0.13) | 12.70 ms | 9.58 ms | 23.38 ms | 12702.00 | 84610.00 | 11.70 Mb |
| swift (5.1) | perfect (3.1) | 12.99 ms | 13.07 ms | 15.50 ms | 4827.33 | 74517.00 | 4.64 Mb |
| go (1.13) | gorilla-mux (1.7) | 13.51 ms | 8.56 ms | 19.96 ms | 33361.33 | 95175.67 | 8.43 Mb |
| csharp (7.3) | aspnetcore (2.2) | 13.52 ms | 9.69 ms | 16.21 ms | 33059.33 | 87579.33 | 9.45 Mb |
| node (12.11) | polkadot (1.0) | 13.54 ms | 9.24 ms | 17.63 ms | 33076.33 | 93298.67 | 9.27 Mb |
| go (1.13) | gf (1.9) | 13.58 ms | 10.90 ms | 23.38 ms | 11764.67 | 77603.00 | 8.73 Mb |
| php (7.3) | one (1.8) | 13.74 ms | 12.47 ms | 23.63 ms | 8067.67 | 73012.00 | 11.12 Mb |
| ruby (2.6) | agoo (2.11) | 13.96 ms | 13.56 ms | 18.36 ms | 3473.67 | 70218.33 | 2.69 Mb |
| node (12.11) | 0http (1.2) | 15.87 ms | 9.40 ms | 17.70 ms | 45955.33 | 91692.00 | 9.11 Mb |
| php (7.3) | hyperf (1.0) | 16.97 ms | 14.55 ms | 32.28 ms | 11795.33 | 61592.33 | 8.70 Mb |
| rust (1.38) | gotham (0.4) | 17.32 ms | 16.37 ms | 24.25 ms | 18398.00 | 59471.00 | 8.00 Mb |
| node (12.11) | rayo (1.3) | 17.50 ms | 10.49 ms | 20.18 ms | 48826.33 | 80132.33 | 7.96 Mb |
| node (12.11) | polka (0.5) | 17.51 ms | 10.14 ms | 19.82 ms | 50286.67 | 82259.33 | 8.17 Mb |
| ruby (2.6) | plezi (0.16) | 17.71 ms | 16.45 ms | 23.31 ms | 8987.67 | 55487.33 | 7.84 Mb |
| python (3.7) | bottle (0.12) | 18.49 ms | 15.86 ms | 30.21 ms | 10116.67 | 54975.33 | 8.98 Mb |
| php (7.3) | sw-fw-less (preview) | 19.53 ms | 17.97 ms | 31.21 ms | 9924.00 | 50967.00 | 7.76 Mb |
| python (3.7) | blacksheep (0.2) | 19.53 ms | 17.34 ms | 32.78 ms | 10283.33 | 51572.67 | 6.88 Mb |
| kotlin (1.3) | ktor (1.2) | 19.86 ms | 12.19 ms | 29.10 ms | 52312.67 | 72213.00 | 7.46 Mb |
| python (3.7) | asgineer (0.7) | 20.12 ms | 17.88 ms | 33.06 ms | 10311.67 | 49954.00 | 5.91 Mb |
| node (12.11) | restana (3.3) | 20.18 ms | 9.67 ms | 19.11 ms | 67392.00 | 88524.00 | 8.79 Mb |
| node (12.11) | muneem (2.4) | 21.71 ms | 11.72 ms | 22.32 ms | 65248.00 | 72124.00 | 7.16 Mb |
| python (3.7) | hug (2.6) | 21.91 ms | 17.88 ms | 35.48 ms | 13931.67 | 46597.00 | 7.66 Mb |
| python (3.7) | starlette (0.12) | 22.08 ms | 18.90 ms | 37.56 ms | 11508.33 | 45531.33 | 6.50 Mb |
| node (12.11) | foxify (0.1) | 22.35 ms | 12.59 ms | 23.78 ms | 62906.67 | 68189.00 | 9.50 Mb |
| clojure (1.10) | coast (1.0) | 22.53 ms | 19.66 ms | 21.77 ms | 36208.00 | 48332.00 | 5.76 Mb |
| php (7.3) | swoft (2.0) | 23.21 ms | 22.53 ms | 30.76 ms | 6650.33 | 41957.67 | 7.31 Mb |
| node (12.11) | iotjs-express (0.0) | 24.19 ms | 14.39 ms | 26.72 ms | 64464.67 | 59753.67 | 16.09 Mb |
| swift (5.1) | kitura-nio (2.8) | 25.57 ms | 20.16 ms | 23.38 ms | 61899.00 | 47177.00 | 5.82 Mb |
| php (7.3) | imi (1.0) | 26.63 ms | 25.65 ms | 33.72 ms | 6709.67 | 36652.00 | 5.58 Mb |
| swift (5.1) | kitura (2.8) | 27.10 ms | 20.72 ms | 23.75 ms | 66177.67 | 46502.67 | 5.73 Mb |
| node (12.11) | restify (8.4) | 28.43 ms | 19.08 ms | 31.25 ms | 59532.00 | 45832.33 | 5.33 Mb |
| node (12.11) | koa (2.8) | 28.76 ms | 13.84 ms | 26.71 ms | 85051.00 | 60916.33 | 8.55 Mb |
| node (12.11) | express (4.17) | 29.69 ms | 15.76 ms | 29.73 ms | 82691.33 | 53886.00 | 8.75 Mb |
| java (8) | spring-boot (2.1) | 29.70 ms | 16.10 ms | 36.25 ms | 86239.67 | 47422.33 | 2.52 Mb |
| node (12.11) | fastify (2.8) | 32.84 ms | 15.00 ms | 27.96 ms | 105969.00 | 60039.33 | 10.56 Mb |
| ruby (2.6) | rails (6.0) | 33.29 ms | 2.49 ms | 110.27 ms | 63071.67 | 3850.33 | 1.61 Mb |
| python (3.7) | fastapi (0.42) | 36.39 ms | 32.15 ms | 60.51 ms | 19152.00 | 27792.00 | 3.98 Mb |
| python (3.7) | responder (2.0) | 36.72 ms | 34.41 ms | 58.42 ms | 16339.33 | 27054.33 | 3.91 Mb |
| crystal (0.31) | spider-gazelle (1.6) | 37.63 ms | 35.73 ms | 45.36 ms | 15535.33 | 26014.67 | 1.84 Mb |
| python (3.7) | clastic (19.9) | 40.15 ms | 33.18 ms | 65.65 ms | 19619.00 | 24875.67 | 4.09 Mb |
| python (3.7) | molten (0.27) | 40.30 ms | 33.85 ms | 66.82 ms | 19592.00 | 25365.67 | 3.13 Mb |
| fsharp (7.3) | suave (2.5) | 40.58 ms | 24.44 ms | 100.56 ms | 50407.67 | 24596.33 | 3.30 Mb |
| python (3.7) | flask (1.1) | 41.74 ms | 36.29 ms | 63.43 ms | 16853.67 | 23598.00 | 3.85 Mb |
| crystal (0.31) | lucky (0.18) | 42.86 ms | 40.21 ms | 52.32 ms | 14370.33 | 22844.33 | 1.87 Mb |
| node (12.11) | turbo_polka (2.0) | 44.09 ms | 42.05 ms | 48.93 ms | 22248.33 | 22322.00 | 1.39 Mb |
| python (3.7) | aiohttp (3.6) | 44.17 ms | 42.06 ms | 69.09 ms | 19059.33 | 22607.00 | 3.40 Mb |
| python (3.7) | bocadillo (0.18) | 52.16 ms | 45.99 ms | 87.88 ms | 30216.00 | 19473.67 | 2.50 Mb |
| java (8) | micronaut (1.2) | 52.78 ms | 23.20 ms | 97.08 ms | 128239.67 | 24704.67 | 3.43 Mb |
| swift (5.1) | vapor (3.3) | 53.36 ms | 17.34 ms | 32.92 ms | 219012.33 | 48897.33 | 5.54 Mb |
| php (7.3) | lumen (6.2) | 53.83 ms | 18.27 ms | 114.41 ms | 112798.67 | 43774.33 | 14.42 Mb |
| php (7.3) | slim (4.3) | 54.73 ms | 18.51 ms | 112.98 ms | 117455.33 | 43835.00 | 14.43 Mb |
| php (7.3) | zend-expressive (3.2) | 55.56 ms | 18.58 ms | 123.67 ms | 115767.33 | 43418.33 | 14.30 Mb |
| python (3.7) | sanic (19.9) | 57.11 ms | 51.21 ms | 96.65 ms | 38019.00 | 18091.00 | 2.14 Mb |
| php (7.3) | basicphp (0.9) | 58.93 ms | 19.67 ms | 119.40 ms | 123055.33 | 40573.00 | 13.40 Mb |
| php (7.3) | spiral (2.2) | 59.60 ms | 59.64 ms | 66.25 ms | 8035.67 | 16327.33 | 1.88 Mb |
| php (7.3) | symfony (4.3) | 59.85 ms | 19.16 ms | 119.08 ms | 133181.33 | 40957.33 | 13.49 Mb |
| php (7.3) | zend-framework (3.1) | 59.94 ms | 19.09 ms | 127.72 ms | 129446.67 | 41951.33 | 13.82 Mb |
| scala (2.12) | http4s (0.18) | 65.22 ms | 19.31 ms | 45.01 ms | 257133.67 | 45286.00 | 5.27 Mb |
| node (12.11) | hapi (18.4) | 66.34 ms | 24.26 ms | 46.49 ms | 204446.67 | 35301.33 | 6.07 Mb |
| crystal (0.31) | athena (0.7) | 67.28 ms | 48.80 ms | 180.47 ms | 84270.67 | 24144.67 | 2.01 Mb |
| php (7.3) | laravel (6.4) | 81.27 ms | 22.88 ms | 171.46 ms | 188420.33 | 35942.33 | 11.89 Mb |
| node (12.11) | moleculer (0.13) | 85.99 ms | 27.18 ms | 60.05 ms | 254224.67 | 30295.33 | 3.46 Mb |
| python (3.7) | quart (0.10) | 88.90 ms | 75.34 ms | 156.00 ms | 46756.33 | 11209.00 | 1.48 Mb |
| python (3.7) | cherrypy (18.3) | 89.78 ms | 73.66 ms | 79.81 ms | 233050.00 | 1373.67 | 0.21 Mb |
| go (1.13) | gramework (1.6) | 96.01 ms | 97.62 ms | 102.01 ms | 18890.33 | 10148.00 | 1.72 Mb |
| python (3.7) | tornado (5.1) | 101.51 ms | 100.19 ms | 126.70 ms | 34740.67 | 9525.33 | 1.87 Mb |
| python (3.7) | django (2.2) | 105.60 ms | 93.93 ms | 163.05 ms | 36832.33 | 9189.33 | 1.77 Mb |
| java (8) | javalin (3.5) | 125.41 ms | 11.66 ms | 290.35 ms | 362928.33 | 56370.67 | 6.67 Mb |
| python (3.7) | masonite (2.2) | 138.54 ms | 129.50 ms | 179.52 ms | 53997.67 | 6988.67 | 1.14 Mb |
| perl (5.3) | dancer2 (2.0) | 162.32 ms | 58.98 ms | 364.69 ms | 338036.67 | 1492.00 | 0.22 Mb |
| crystal (0.31) | onyx (0.5) | 193.90 ms | 193.25 ms | 226.42 ms | 28283.67 | 5066.00 | 0.87 Mb |
| scala (2.12) | akkahttp (10.1) | 220.51 ms | 7.35 ms | 96.23 ms | 873321.67 | 65406.00 | 9.38 Mb |
| python (3.7) | cyclone (1.3) | 399.23 ms | 351.41 ms | 445.11 ms | 460804.67 | 2202.67 | 0.37 Mb |
| python (3.7) | nameko (2.12) | 655.34 ms | 551.15 ms | 613.68 ms | 770037.67 | 1278.00 | 0.18 Mb |

You can find the original page at https://github.com/the-benchmarker/web-frameworks#results
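One thing the table makes clear is that the lowest latency does not imply the highest requests per second. As a quick sanity check, here is a small Python sketch that sorts a handful of rows copied from the table above by requests per second:

```python
# A few (language, framework, requests/s) rows copied from the table above
results = [
    ("rust", "nickel", 37636.67),
    ("c", "agoo-c", 209196.00),
    ("node", "sifrr", 203003.67),
    ("nim", "httpbeast", 192873.33),
    ("python", "japronto", 189149.67),
    ("ruby", "rails", 3850.33),
]

# Sort descending by requests per second
by_throughput = sorted(results, key=lambda row: row[2], reverse=True)

for lang, framework, rps in by_throughput:
    print(f"{framework:10} ({lang}): {rps:>10.2f} req/s")
```

nickel has the best average latency in the whole table, yet agoo-c and sifrr serve several times more requests per second.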

How to install Docker for Windows

Docker is a full development platform for creating containerized applications. Docker Desktop is the best way to get started with Docker on Windows.

1 – Download and install Docker Desktop from the URL below

https://docs.docker.com/docker-for-windows/install/

2 – Open a terminal window (Command Prompt or PowerShell, but not PowerShell ISE).

3 – Run docker --version to ensure that you have a supported version of Docker:

4 – Pull the hello-world image from Docker Hub and run a container:

> docker run hello-world

docker : Unable to find image 'hello-world:latest' locally
...

latest:
Pulling from library/hello-world
ca4f61b1923c:
Pulling fs layer
ca4f61b1923c:
Download complete
ca4f61b1923c:
Pull complete
Digest: sha256:97ce6fa4b6cdc0790cda65fe7290b74cfebd9fa0c9b8c38e979330d547d22ce1
Status: Downloaded newer image for hello-world:latest

Hello from Docker!
This message shows that your installation appears to be working correctly.
...

5 – List the hello-world image that was downloaded from Docker Hub:

> docker image ls

6 – Explore the Docker help pages by running some help commands:

> docker --help
> docker container --help
> docker container ls --help
> docker run --help

Sample Complex Applications

In this section, we demonstrate the ease and power of Dockerized applications by running something more complex, such as an OS and a web server.

  1. Pull an image of the Ubuntu OS and run an interactive terminal inside the spawned container:

> docker run --interactive --tty ubuntu bash

 docker : Unable to find image 'ubuntu:latest' locally
 ...

 latest:
 Pulling from library/ubuntu
 22dc81ace0ea:
 Pulling fs layer
 1a8b3c87dba3:
 Pulling fs layer
 91390a1c435a:
 Pulling fs layer
 ...
 Digest: sha256:e348fbbea0e0a0e73ab0370de151e7800684445c509d46195aef73e090a49bd6
 Status: Downloaded newer image for ubuntu:latest

You are now in the container. At the root # prompt, check the hostname of the container:

 root@8aea0acb7423:/# hostname
 8aea0acb7423

Notice that the hostname is assigned as the container ID (and is also used in the prompt).

Exit the shell with the exit command (which also stops the container).

List containers with the --all option (because no containers are running).

The hello-world container (randomly named, relaxed_sammet) stopped after displaying its message. The ubuntu container (randomly named, laughing_kowalevski) stopped when you exited the container.

docker container ls --all

 CONTAINER ID    IMAGE          COMMAND     CREATED          STATUS                      PORTS    NAMES
 8aea0acb7423    ubuntu         "bash"      2 minutes ago    Exited (0) 2 minutes ago             laughing_kowalevski
 45f77eb48e78    hello-world    "/hello"    3 minutes ago    Exited (0) 3 minutes ago             relaxed_sammet

Pull and run a Dockerized nginx web server, which we name webserver:

docker run --detach --publish 80:80 --name webserver nginx

 Unable to find image 'nginx:latest' locally
 latest: Pulling from library/nginx

 fdd5d7827f33: Pull complete
 a3ed95caeb02: Pull complete
 716f7a5f3082: Pull complete
 7b10f03a0309: Pull complete
 Digest: sha256:f6a001272d5d324c4c9f3f183e1b69e9e0ff12debeb7a092730d638c33e0de3e
 Status: Downloaded newer image for nginx:latest
 dfe13c68b3b86f01951af617df02be4897184cbf7a8b4d5caf1c3c5bd3fc267f

Point your web browser at http://localhost to display the nginx start page. (You don’t need to append :80 because you specified the default HTTP port in the docker command.)
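The --publish 80:80 option is a HOST:CONTAINER pair: the part before the colon is the host port, the part after is the container port. A tiny Python helper, purely for illustration (it is not part of Docker), that splits such a mapping:

```python
def parse_publish(mapping):
    """Split a Docker-style HOST:CONTAINER port mapping into a pair of ints."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)

print(parse_publish("80:80"))    # (80, 80)
print(parse_publish("8080:80"))  # (8080, 80): host port 8080 forwarded to container port 80
```

With a mapping like 8080:80, you would browse to http://localhost:8080 instead, because the host side of the pair is what your browser talks to.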

List only your running containers:

docker container ls

 CONTAINER ID    IMAGE    COMMAND                   CREATED          STATUS          PORTS                 NAMES
 0e788d8e4dfd    nginx    "nginx -g 'daemon of…"    2 minutes ago    Up 2 minutes    0.0.0.0:80->80/tcp    webserver

Stop the running nginx container by the name we assigned it, webserver:

 >  docker container stop webserver

Remove all three containers by their names — the latter two names will differ for you:

> docker container rm webserver laughing_kowalevski relaxed_sammet

How to get data from SQL Server to Elasticsearch using LogStash

As a developer working with SQL Server, I needed to import data from the database into Elasticsearch and analyze it in Kibana.

As Elasticsearch is an open-source project built with Java and its ecosystem caters mostly to other open-source projects, documentation on importing data from SQL Server to ES using LogStash is hard to come by.

I’d like to share how to import SQL Server data to Elasticsearch (version 6.2) using LogStash, and how to verify the result in Kibana.

Assumption

I will skip installing the ELK (Elasticsearch, LogStash, and Kibana) stack, as it’s outside the scope of this article.
Please refer to the installation steps on the Elastic download pages.

Overview

Here are the steps required to import SQL Server data to Elasticsearch.

  1. Install Java Development Kit (JDK)
  2. Install JDBC Driver for SQL Server
  3. Set CLASSPATH for the driver
  4. Create an Elasticsearch Index to Import Data to
  5. Configure LogStash configuration file
  6. Run LogStash
  7. Verify in Kibana

Step 1 – Install Java SE Development Kit 8

One of the gotchas is that you might install the latest version of the JDK, which is version 9, but the Elasticsearch documentation requires you to install JDK 8.

At the time of writing, the latest JDK 8 version is 8u162, which can be downloaded here.

Download “JDK 8u162”, install it on your machine, and make sure that “java” is in the PATH variable so that it can be called from any directory on the command line.

Step 2 – Install JDBC Driver for SQL Server

You need to download and install Microsoft JDBC Driver 4.2 for SQL Server, not the latest version.

As Elasticsearch is built with JDK 8, you can’t use the latest version of the JDBC Driver for SQL Server (version 6.2), as it does not support JDK 8.

Step 3 – Set CLASSPATH for the JDBC Driver

We need to set the path so that Java can find the JDBC driver.

📝 Note: I am working on Windows 10 machine.

1. Go to the directory under which you have installed SQL Server JDBC.

2. Navigate to the JAR file named sqljdbc42.jar, which is found under <<JDBC installation folder>>\sqljdbc_4.2\enu\jre8.

3. And then copy the full path to the JAR file.

A cool trick on Windows 7/8/10: when you Shift+right-click a file, you get a “Copy as path” option.

4. Go to Windows Start button and type “environment” and click on “Edit the system environment variables”.

5. Add a CLASSPATH environment variable with the following values (if you don’t already have one).

  1. “.” – for the current directory to search.
  2. The JAR file path copied previously (e.g. “C:\talih\Java\MicrosoftJDBCDriversSQLServer\sqljdbc_4.2\enu\jre8\sqljdbc42.jar”).

Gotcha: If you have a space in the path to the JDBC JAR file, make sure to put double quotes around it.

Not doing so will result in one of the following error messages when you start the LogStash service in a later step.

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class JDBC

 - Or -

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class File\Microsoft
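To make the quoting rule concrete, here is a small hypothetical Python helper that assembles a Windows CLASSPATH value and wraps any entry containing a space in double quotes (the “Program Files” path below is made up for illustration):

```python
def build_classpath(entries):
    """Join CLASSPATH entries with ';' (the Windows separator),
    wrapping any entry that contains a space in double quotes."""
    quoted = [f'"{entry}"' if " " in entry else entry for entry in entries]
    return ";".join(quoted)

classpath = build_classpath([
    ".",  # current directory
    r"C:\talih\Java\MicrosoftJDBCDriversSQLServer\sqljdbc_4.2\enu\jre8\sqljdbc42.jar",
    r"C:\Program Files\Microsoft JDBC Driver\sqljdbc42.jar",  # hypothetical path with spaces
])
print(classpath)
```

Without the quotes, Windows treats everything after the space as a separate token, which is exactly how “Could not find or load main class File\Microsoft” comes about.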

Let’s now move on to creating an Elasticsearch index to import data into.

Step 4 – Create an Elasticsearch Index to Import Data to

You can use cURL or Postman to create an index, but I will use the Kibana console to create an index named “cs_users”, which is roughly equivalent to a database in relational database terminology.

Before we start the Kibana service, we need to start Elasticsearch so that Kibana would not whine about Elasticsearch not being present.

Kibana warnings on lines 12~21 due to Elasticsearch being unavailable

Go to the Elasticsearch installation directory and start the service.

talih@CC c:\talih\elasticco\elasticsearch-6.2.2
> bin\elasticsearch.bat

And then go to the Kibana installation directory to start Kibana service.

talih@CC c:\talih\elasticco\kibana-6.2.2-windows-x86_64 
> bin\kibana.bat

If Kibana started without an issue, you will see an output similar to the following.

Kibana started successfully

On line 9, Kibana reports that it is running on http://localhost:5601.
Open the URL in a browser of your choice.

Now go to “Dev Tools” link on the bottom left of the page.

Click on Kibana Dev Tools Link

Once you see the Console, create a new index with the following command.

PUT cs_users
{
        "settings" : {
              "index" : {
                      "number_of_shards" : 3,
                      "number_of_replicas" : 0
              }
        }
}

Run the command in the left panel of the Kibana Dev Tools Console.

Create a new Elasticsearch index named “cs_users”

I won’t go into details on “shards” and “replicas” since it’s outside the scope of this article. For more information on the syntax, refer to the official Elasticsearch documentation.

You will then see Elasticsearch’s response, confirming the index creation, in the right panel.

A new index “cs_users” is created on Elasticsearch successfully

OK, now we are finally ready to move onto creating a configuration file for LogStash to actually import data.

Step 5 – Configure LogStash configuration file

Go to the LogStash installation folder and create a file named “sql.conf” (the name doesn’t really matter).
Here is the LogStash configuration I will be using.

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "xxx"

    statement => "SELECT * FROM Users"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

Let me break down “input” and “output” configurations.

Input

There are three required fields you need to specify for the “jdbc” input plugin.

jdbc_connection_string – This field tells LogStash how to connect to SQL Server.

"jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"

LogStash will connect to the server named “cc” on port 1433 and use the database named “StackExchangeCS” with the integrated security authentication method.
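The connection string is a semicolon-separated list: a server part (host and optional port) followed by key=value properties. A small Python sketch that pulls it apart, for illustration only (the real JDBC driver does its own parsing):

```python
def parse_jdbc_url(url):
    """Split a SQL Server JDBC URL into host, port, and property pairs (sketch)."""
    prefix = "jdbc:sqlserver://"
    assert url.startswith(prefix), "not a SQL Server JDBC URL"
    rest = url[len(prefix):]
    server, *props = [part for part in rest.split(";") if part]
    host, _, port = server.partition(":")
    info = {"host": host, "port": int(port) if port else 1433}  # 1433 is the SQL Server default
    for prop in props:
        key, _, value = prop.partition("=")
        info[key] = value
    return info

print(parse_jdbc_url(
    "jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"
))
```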

jdbc_driver_class – This is the driver class contained within the JDBC JAR file.
The JDBC JAR file contains a driver of type “com.microsoft.sqlserver.jdbc.SQLServerDriver” according to the documentation.

If you have an inquisitive mind, you can confirm this by opening the JAR file with your ZIP program of choice, as a JAR is simply a ZIP file.

Unzip JAR to verify JDBC driver name

jdbc_user – If you are using “Integrated Security” as an authentication option, this can be any string (I just entered “xxx” since that’s the easiest thing I can type 😉).

Output

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

SQL Server data (all cs.stackexchange.com users) will be sent to Elasticsearch running on the local machine port 9200 and will be indexed under “cs_users” index created in “Step 4 – Create an Elasticsearch Index to Import Data to”.
There are quite a few Elasticsearch configuration options, so please refer to the official LogStash documentation for more “elasticsearch” output plugin options.

Step 6 – Import Data with LogStash

With prerequisites out of the way, we are now ready to import data to Elasticsearch from SQL Server.
Go to the LogStash installation location under which you should have created “sql.conf” and run LogStash service.

bin\logstash -f sql.conf

The -f flag specifies the configuration file to use.
In our case, that is the “sql.conf” we created in the previous step.

The result of successful LogStash run will look similar to the following output.

Step 7 – Verify in Kibana

Wow, we have finally imported the data. Now let’s do a quick check on whether the number of records in the database matches the number of records in Elasticsearch.

Verifying result of data import

The “Users” table in SQL Server has 59394 records, and Elasticsearch returns the same number.
📝 Note: You can use the following command to get the count of records in the “cs_users” index.

GET cs_users/_count

For more information on how “_count” works, refer to Count API documentation.
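Since the _count response is plain JSON, the verification can also be scripted. A sketch in Python, with the response pasted in as an abbreviated literal (so no running cluster is required) and the expected count taken from the SQL side:

```python
import json

# Abbreviated _count response, as returned by GET cs_users/_count
response = json.loads('{"count": 59394, "_shards": {"total": 3, "successful": 3, "failed": 0}}')

sql_count = 59394  # result of SELECT COUNT(*) FROM Users on the SQL Server side
assert response["count"] == sql_count, "record counts differ"
print("counts match:", response["count"])
```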

Conclusion

Congratulations for getting this far 👏👏👏.

How To Install and Configure Elasticsearch on Ubuntu 16.04 + Bonus (Nifi ^^)

Step 1 — Downloading and Installing Elasticsearch

Elasticsearch can be downloaded directly from elastic.co as zip, tar.gz, deb, or rpm packages. For Ubuntu, it’s best to use the deb (Debian) package, which installs everything you need to run Elasticsearch.

First, update your package index.

sudo apt-get update

Download the latest Elasticsearch version, which is 2.3.1 at the time of writing.

wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.1/elasticsearch-2.3.1.deb

Then install it in the usual Ubuntu way with dpkg.

sudo dpkg -i elasticsearch-2.3.1.deb

This results in Elasticsearch being installed in /usr/share/elasticsearch/ with its configuration files placed in /etc/elasticsearch and its init script added in /etc/init.d/elasticsearch.

To make sure Elasticsearch starts and stops automatically with the server, add its init script to the default runlevels.

sudo systemctl enable elasticsearch.service

Before starting Elasticsearch for the first time, please check the next section about the recommended minimum configuration.

Step 2 — Configuring Elasticsearch

Now that Elasticsearch and its Java dependencies have been installed, it is time to configure Elasticsearch. The Elasticsearch configuration files are in the /etc/elasticsearch directory. There are two files:

  • elasticsearch.yml configures the Elasticsearch server settings. This is where all options, except those for logging, are stored, which is why we are mostly interested in this file.
  • logging.yml provides configuration for logging. In the beginning, you don’t have to edit this file. You can leave all default logging options. You can find the resulting logs in /var/log/elasticsearch by default.

The first variables to customize on any Elasticsearch server are node.name and cluster.name in elasticsearch.yml. As their names suggest, node.name specifies the name of the server (node), and cluster.name the name of the cluster the node belongs to.

If you don’t customize these variables, node.name will be assigned automatically based on the server hostname, and cluster.name will be set to the name of the default cluster.

The cluster.name value is used by the auto-discovery feature of Elasticsearch to automatically discover and associate Elasticsearch nodes to a cluster. Thus, if you don’t change the default value, you might have unwanted nodes, found on the same network, in your cluster.

Start editing the main elasticsearch.yml configuration file with nano or your favorite text editor:

sudo nano /etc/elasticsearch/elasticsearch.yml

Remove the # character at the beginning of the lines for cluster.name and node.name to uncomment them, and then update their values. Your first configuration changes in the /etc/elasticsearch/elasticsearch.yml file should look like this:

/etc/elasticsearch/elasticsearch.yml

. . .
cluster.name: mycluster1
node.name: "My First Node"
. . .

These are the minimum settings you can start with to use Elasticsearch. However, it’s recommended to continue reading the configuration part for a more thorough understanding and fine-tuning of Elasticsearch.

One especially important setting of Elasticsearch is the role of the server, which is either master or slave. Master servers are responsible for the cluster health and stability. In large deployments with a lot of cluster nodes, it’s recommended to have more than one dedicated master. Typically, a dedicated master will not store data or create indexes. Thus, there should be no chance of being overloaded, by which the cluster health could be endangered.

Slave servers are used as workhorses which can be loaded with data tasks. Even if a slave node is overloaded, the cluster health shouldn’t be affected seriously, provided there are other nodes to take additional load.

The setting which determines the role of the server is called node.master. By default, a node is a master. If you have only one Elasticsearch node, you should leave this option at the default true value, because at least one master is always needed. Alternatively, if you wish to configure the node as a slave, assign a false value to the variable node.master like this:

/etc/elasticsearch/elasticsearch.yml

. . .
node.master: false
. . .

Another important configuration option is node.data, which determines whether a node will store data or not. In most cases this option should be left at its default value (true), but there are two cases in which you might wish not to store data on a node. One is when the node is a dedicated master, as previously mentioned. The other is when a node is used only for fetching data from nodes and aggregating results. In the latter case the node acts as a search load balancer.

Again, if you have only one Elasticsearch node, you should not change this value. Otherwise, to disable storing data locally, specify node.data as false like this:

/etc/elasticsearch/elasticsearch.yml

. . .
node.data: false
. . .

In larger Elasticsearch deployments with many nodes, two other important options are index.number_of_shards and index.number_of_replicas. The first determines how many pieces, or shards, the index will be split into. The second defines the number of replicas which will be distributed across the cluster. Having more shards improves the indexing performance, while having more replicas makes searching faster.

By default, the number of shards is 5 and the number of replicas is 1. Assuming that you are still exploring and testing Elasticsearch on a single node, you can start with only one shard and no replicas. Thus, their values should be set like this:

/etc/elasticsearch/elasticsearch.yml

. . .
index.number_of_shards: 1
index.number_of_replicas: 0
. . .
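A quick way to reason about these two settings: the cluster must allocate number_of_shards primary shards, plus one full copy of each primary per replica. A small Python sketch of the arithmetic:

```python
def total_shards(primaries, replicas):
    """Total shard copies a cluster must allocate for one index:
    primaries plus one replica copy of each primary per replica."""
    return primaries * (1 + replicas)

print(total_shards(5, 1))  # the defaults above: 10 shard copies
print(total_shards(1, 0))  # single-node testing setup: 1 shard copy
```

On a single node, replicas can never be allocated anyway (a replica is only useful on a different node than its primary), which is why 1 shard and 0 replicas is a sensible testing configuration.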

One final setting which you might be interested in changing is path.data, which determines the path where data is stored. The default path is /var/lib/elasticsearch. In a production environment, it’s recommended that you use a dedicated partition and mount point for storing Elasticsearch data. In the best case, this dedicated partition will be on separate storage media, which provides better performance and data isolation. You can specify a different path.data like this:

/etc/elasticsearch/elasticsearch.yml

. . .
path.data: /media/different_media
. . .

Once you make all the changes, save and exit the file. Now you can start Elasticsearch for the first time.

sudo systemctl start elasticsearch

Give Elasticsearch a few moments to fully start before you try to use it. Otherwise, you may get errors about not being able to connect.

Step 3 — Securing Elasticsearch

By default, Elasticsearch has no built-in security and can be controlled by anyone who can access the HTTP API. This is not always a security risk because Elasticsearch listens only on the loopback interface (i.e., 127.0.0.1) which can be accessed only locally. Thus, no public access is possible and your Elasticsearch is secure enough as long as all server users are trusted or this is a dedicated Elasticsearch server.

Still, if you wish to harden the security, the first thing to do is to enable authentication. Authentication is provided by the commercial Shield plugin. Unfortunately, this plugin is not free but there is a free 30 day trial you can use to test it. Its official page has excellent installation and configuration instructions. The only thing you may need to know in addition is that the path to the Elasticsearch plugin installation manager is /usr/share/elasticsearch/bin/plugin.

If you don’t want to use the commercial plugin but you still have to allow remote access to the HTTP API, you can at least limit the network exposure with Ubuntu’s default firewall, UFW (Uncomplicated Firewall). By default, UFW is installed but not enabled. If you decide to use it, follow these steps:

First, create a rule to allow any needed services. You will need at least SSH allowed so that you can log in to the server. To allow world-wide access to SSH, whitelist port 22.

sudo ufw allow 22

Then allow access to the default Elasticsearch HTTP API port (TCP 9200) for the trusted remote host, e.g.TRUSTED_IP, like this:

sudo ufw allow from TRUSTED_IP to any port 9200

Only after that enable UFW with the command:

sudo ufw enable

Finally, check the status of UFW with the following command:

sudo ufw status

If you have specified the rules correctly, the output should look like this:

Status: active

To                         Action      From
--                         ------      ----
9200                       ALLOW       TRUSTED_IP
22                         ALLOW       Anywhere
22 (v6)                    ALLOW       Anywhere (v6)

Once you have confirmed UFW is enabled and protecting Elasticsearch port 9200, then you can allow Elasticsearch to listen for external connections. To do this, open the elasticsearch.yml configuration file again.

sudo nano /etc/elasticsearch/elasticsearch.yml

Find the line that contains network.host, uncomment it by removing the # character at the beginning of the line, and change the value to 0.0.0.0 so it looks like this:

/etc/elasticsearch/elasticsearch.yml

. . .
network.host: 0.0.0.0
. . .

We have specified 0.0.0.0 so that Elasticsearch listens on all interfaces and bound IPs. If you want it to listen only on a specific interface, you can specify its IP in place of 0.0.0.0.

To make the above setting take effect, restart Elasticsearch with the command:

sudo systemctl restart elasticsearch

After that try to connect from the trusted host to Elasticsearch. If you cannot connect, make sure that the UFW is working and the network.host variable has been correctly specified.

Step 4 — Testing Elasticsearch

By now, Elasticsearch should be running on port 9200. You can test it with curl, the command-line tool for client-side URL transfers, and a simple GET request.

curl -X GET 'http://localhost:9200'

You should see the following response:

Output:
{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

If you see a response similar to the one above, Elasticsearch is working properly. If not, make sure that you have correctly followed the installation instructions and have allowed some time for Elasticsearch to fully start.
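The same check can be scripted instead of eyeballed. Here is a Python sketch that parses the response above (pasted in as a trimmed literal, so no running cluster is needed) and asserts the fields we configured earlier:

```python
import json

# Trimmed version of the curl response shown above
response = json.loads("""
{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : { "number" : "2.3.1", "lucene_version" : "5.5.0" },
  "tagline" : "You Know, for Search"
}
""")

# These should match the node.name and cluster.name we set in elasticsearch.yml
assert response["name"] == "My First Node"
assert response["cluster_name"] == "mycluster1"
print("node", response["name"], "is up, version", response["version"]["number"])
```

In a real health check you would fetch the JSON with an HTTP GET against http://localhost:9200 instead of pasting it in.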

To perform a more thorough check of Elasticsearch, execute the following command:

curl -XGET 'http://localhost:9200/_nodes?pretty'

In the output from the above command you can see and verify all the current settings for the node, cluster, application paths, modules, etc.

Step 5 — Using Elasticsearch

To start using Elasticsearch, let's add some data first. As already mentioned, Elasticsearch uses a RESTful API, which responds to the usual CRUD commands: create, read, update, and delete. We'll again use curl to work with it.

You can add your first entry with the command:

curl -X POST 'http://localhost:9200/tutorial/helloworld/1' -d '{ "message": "Hello World!" }'

You should see the following response:

Output
{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

With curl, we have sent an HTTP POST request to the Elasticsearch server. The URI of the request was /tutorial/helloworld/1, with several parts:

  • tutorial is the index of the data in Elasticsearch.
  • helloworld is the type.
  • 1 is the id of our entry under the above index and type.
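The same calls are easy to script from Python. Here is a sketch using only the standard library; it assumes Elasticsearch is reachable on localhost:9200 and reuses the tutorial index and helloworld type from above (the network calls are left commented so nothing runs without a node):

```python
import json
from urllib.request import Request, urlopen

ES = "http://localhost:9200"  # assumes a local Elasticsearch node

def doc_url(index, doc_type, doc_id):
    """Build a document URI such as http://localhost:9200/tutorial/helloworld/1"""
    return "{}/{}/{}/{}".format(ES, index, doc_type, doc_id)

def es_request(method, url, body=None):
    """Send an HTTP request with an optional JSON body and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    return json.load(urlopen(Request(url, data=data, method=method)))

url = doc_url("tutorial", "helloworld", 1)
# es_request("POST", url, {"message": "Hello World!"})   # create
# print(es_request("GET", url)["_source"]["message"])    # read
# es_request("PUT", url, {"message": "Hello People!"})   # update
```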

You can retrieve this first entry with an HTTP GET request.

curl -X GET 'http://localhost:9200/tutorial/helloworld/1'

The result should look like:

Output
{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"found":true,"_source":{ "message": "Hello World!" }}

To modify an existing entry, you can use an HTTP PUT request.

curl -X PUT 'localhost:9200/tutorial/helloworld/1?pretty' -d '
{
  "message": "Hello People!"
}'

Elasticsearch should acknowledge successful modification like this:

Output
{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : false
}

In the above example we have modified the message of the first entry to “Hello People!”. With that, the version number has been automatically increased to 2.

You may have noticed the extra argument pretty in the above request. It enables a human-readable format, printing each data field on its own row. You can also "prettify" your results when retrieving data and get much nicer output, like this:

curl -X GET 'http://localhost:9200/tutorial/helloworld/1?pretty'

Now the response will be in a much better format:

Output
{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "message" : "Hello People!"
  }
}

So far, we have added data to Elasticsearch and queried it. To learn about the other operations, please check the API documentation.

Last Step – Get Data from SQL to Elasticsearch with NiFi

Conclusion

That’s how easy it is to install, configure, and begin using Elasticsearch. Once you have played enough with manual queries, your next task will be to start using it from your applications.

Install MongoDB 4.0.5 on Ubuntu 16.04

Add the key: (without the key, the repository will not load)

1 – sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv E52529D4

Now, create a new MongoDB repository list file:

2 – sudo bash -c 'echo "deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse" > /etc/apt/sources.list.d/mongodb-org-4.0.list'

Update the repositories, then complete the installation:

3 – sudo apt update
sudo apt install mongodb-org

Enable the mongod service and start it up:

4 – sudo systemctl enable mongod.service
sudo systemctl start mongod.service

Check your mongodb version:

5 – mongo --version
MongoDB shell version v4.0.5
git version: 3739429dd92b92d1b0ab120911a23d50bf03c412
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1604
distarch: x86_64
target_arch: x86_64

Check if the service is running:

6 – systemctl status mongod.service
mongod.service – MongoDB Database Server
Loaded: loaded (/lib/systemd/system/mongod.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-01-21 00:33:51 MST; 7s ago
Docs: https://docs.mongodb.org/manual
Main PID: 2906 (mongod)
CGroup: /system.slice/mongod.service
└─2906 /usr/bin/mongod --config /etc/mongod.conf
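You can run the same sanity check from application code with pymongo (the driver is an assumption here; install it with pip install pymongo). The connection lines are commented so the sketch stands alone:

```python
def mongo_uri(host="localhost", port=27017):
    """Build a MongoDB connection URI for the default local deployment."""
    return "mongodb://{}:{}".format(host, port)

# from pymongo import MongoClient   # assumes: pip install pymongo
# client = MongoClient(mongo_uri(), serverSelectionTimeoutMS=3000)
# print(client.server_info()["version"])   # should report 4.0.5 for this install
```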

How to install Scala and create a class on Windows & Linux

1 – Verify the JDK installation on your machine. Open the shell/terminal and type java -version and javac -version.

2 – Download the Scala binaries from http://www.scala-lang.org/download/. As of this writing, the Scala version is 2.11.6, so the downloaded file will be scala-2.11.6.tgz. Extract it using the following command:

3 – tar -xvzf scala-2.11.6.tgz

4 – After extracting, change into the resulting directory using the cd command.

5 – For instance, my Scala binaries were extracted into the Downloads directory.

6 – From there, go to the bin directory:

7 – cd scala-2.11.6/bin

8 – Run ./scala to start the Scala REPL, in which we can type programs and see the outcome right in the shell.

Scala Hello World Example

class Student() {
  var id: Int = 0
  var age: Int = 0
  def studentDetails(i: Int, a: Int) {
    id = i
    age = a
    println("Student Id is: " + id)
    println("Student Age is: " + age)
  }
}

Output: defined class Student

Here we create a Student class and print the student details in the studentDetails method, passing the student id and age as parameters. If there are no errors in the code, the message "defined class Student" is displayed.

Create the student object and invoke the studentDetails method by passing the student id and age.

object Stud {
  def main(args: Array[String]) {
    val stu = new Student()
    stu.studentDetails(10, 8)
  }
}

Returns: defined object Stud

How to get data from the Twitch API with Python

Hello Everyone,

This is a quick little Python script for beginners, showing how to call an API and how to parse JSON into CSV with pandas.

1 – Import the libraries below (install pandas and requests with pip if needed):

import pandas as pd
import requests
import json

2 – Set your static values (these could also be loaded from a .yml file):

url = "https://wind-bow.glitch.me/twitch-api/channels/"
# List of channels we want to access
channels = ["ESL_SC2", "OgamingSC2", "cretetion", "freecodecamp", "storbeck", "habathcx", "RobotCaleb", "noobs2ninjas",
            "ninja", "shroud", "Dakotaz", "esltv_cs", "pokimane", "tsm_bjergsen", "boxbox", "a_seagull",
            "kinggothalion", "jahrein", "thenadeshot", "sivhd", "kingrichard"]

file_name = "talih.csv"
location = "C:\\Users\\talih\\Desktop\\TwitchPy\\TwitchAPI\\"
"""These values could be loaded from a .yml file instead."""

3 – Set up your class and functions:

class apicrawler:

    def __init__(self, url, channels, file_name, location):
        self.url = url
        self.channels = channels
        self.file_name = file_name
        self.location = location

    @staticmethod
    def selectedchannelcrawler(url, channels, location, file_name):
        channels_list = []
        for channel in channels:
            JSONContent = requests.get(url + channel).json()
            channels_list.append([JSONContent['_id'], JSONContent['display_name'], JSONContent['status'],
                                  JSONContent['followers'], JSONContent['views']])

        dataset = pd.DataFrame(channels_list)
        dataset.columns = ['Id', 'Name', 'Status', 'Followers', 'Views']
        dataset.dropna(axis=0, how='any', inplace=True)
        dataset.index = pd.RangeIndex(len(dataset.index))
        dataset.to_csv(location + file_name, sep=',', encoding='utf-8')

4 – Call the crawler with your own values:
apicrawler.selectedchannelcrawler(url, channels, location, file_name)

Enjoy

Introduction to Apache Pulsar

Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo! that is part of the Apache Software Foundation.

Pulsar is a multi-tenant, high-performance solution for server-to-server messaging.

Pulsar's key features include:

  • Native support for multiple clusters, with geo-replication of messages across clusters
  • Very low publish and end-to-end latency
  • Seamless scalability to over a million topics
  • A simple client API with bindings for Java, Python, and C++
  • Guaranteed message delivery, with persistent storage provided by Apache BookKeeper

Architecture Overview

At the highest level, a Pulsar instance is composed of one or more Pulsar clusters. Clusters within an instance can replicate data amongst themselves.

The diagram below provides an illustration of a Pulsar cluster:

Pulsar Comparison With Apache Kafka

The table below lists the similarities and differences between Apache Pulsar and Apache Kafka:

  • Concepts. Kafka: producer – topic – consumer group – consumer. Pulsar: producer – topic – subscription – consumer.
  • Consumption. Kafka: more focused on streaming, with exclusive messaging on partitions and no shared consumption. Pulsar: unified messaging model and API, with streaming via exclusive and failover subscriptions and queuing via shared subscriptions.
  • Acking. Kafka: simple offset management (prior to Kafka 0.8, offsets are stored in ZooKeeper; after 0.8, offsets are stored in offset topics). Pulsar: flexible, cursor-based acknowledgment; messages can be acked individually or cumulatively.
  • Retention. Kafka: messages are deleted based on the retention period; if a consumer doesn't read messages before the retention period expires, it loses data. Pulsar: messages are only deleted after all subscriptions have consumed them, so there is no data loss even if the consumers of a subscription are down for a long time; messages can also be kept for a configured retention period after all subscriptions have consumed them.
  • TTL. Kafka: no TTL support. Pulsar: supports message TTL.

Conclusion

Apache Pulsar is an effort undergoing incubation at The Apache Software Foundation (ASF) sponsored by the Apache Incubator PMC. It seems that it will be a competitive alternative to Apache Kafka due to its unique features.

References

  1. Apache Pulsar homepage
  2. Yahoo! Open Source homepage
  3. Apache homepage
  4. Pulsar concepts and architecture documentation
  5. Comparing Pulsar and Kafka: Unified queueing and streaming
  6. Apache-Pulsar Distributed Pub-Sub Messaging System

Fixing Your Poor Performance of Redshift with Vacuum

As you know, Redshift is very fast when you have to read or select data, but getting anywhere near the theoretical, white-paper performance takes some work on your part. Even if you've carefully planned out your schema, sortkeys, distkeys and compression encodings, your Redshift queries may still be awfully slow if you have long-running vacuums taking place in the background.

The number one enemy for query performance is the vacuum—it can slow down your ETL jobs and analytical queries by as much as 80%. It is an I/O intensive process that sorts the table, reclaims unused disk space, and impacts all other I/O bound processes (such as queries against large tables). This guide can help you cut down the time it takes to vacuum your cluster (these steps lowered our vacuum time from 10–30 hours to less than 1 hour).

This guide assumes you’ve chosen sortkeys and distkeys for your table, and are vacuuming regularly. If you are not doing these things, use this guide and this guide to get them set up (the flow charts are quite helpful).

 Contents

  • 0- What is the Vacuum?
  • 1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)
  • 2 – Use Compression Encodings (for Large Tables)
  • 3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)
  • 4 – Call ANALYZE After Vacuuming
  • 5 – VACUUM to 99% on Large Tables
  • 6 – Keep Your Tables Skinny

0 - What is the Vacuum?

The vacuum is a process that carries out one or both of the following two steps: sorting tables and reclaiming unused disk blocks. Let's talk about sorting first,

VACUUM SORT ONLY;

The first time you insert data into the table, it will land sorted according to its sortkey (if one exists), and this data will make up the “sorted” section of the table. Note the unsorted percentage on the newly populated table below.

COPY my_table FROM 's3://my-bucket/csv';
SELECT "table", unsorted FROM svv_table_info;
  table   | unsorted
----------+------------
 my_table |     0

Subsequent inserts are appended to a completely different section on disk called the “unsorted” section of the table. Calling VACUUM SORT ONLY initiates two processes,

  1. a sorting of the unsorted section,
  2. a merging of the sorted and unsorted sections;

both of these steps can be costly, but there are simple ways to cut down that cost, which we’ll discuss below.
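The shape of those two steps can be sketched in a few lines of Python (illustrative only; Redshift's internals are far more involved). Sorting the small unsorted section is cheap, and merging two sorted runs is a linear pass:

```python
import heapq

# The "sorted" section of the table, plus rows that arrived afterwards:
sorted_section = [10, 20, 30, 40]
unsorted_section = [35, 5, 25]

newly_sorted = sorted(unsorted_section)                   # step 1: sort
merged = list(heapq.merge(sorted_section, newly_sorted))  # step 2: merge
print(merged)  # [5, 10, 20, 25, 30, 35, 40]
```

Note that when the unsorted rows all belong after the end of the sorted section, the merge barely has to move anything; that is the intuition behind tip 1 below.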

Now onto deleting,

VACUUM DELETE ONLY;

If you called DELETE on any rows from your table since the last vacuum, they were merely marked for deletion. A vacuum operation is necessary to actually reclaim that disk space.

These two steps, sorting tables and reclaiming disk space, can be run together efficiently,

VACUUM FULL;

This command simply runs both a sort only and a delete only operation, but there are advantages to doing them concurrently. If you have deleted and inserted new data, always do a "full" vacuum. It will be faster than a manual vacuum sort only followed by a manual vacuum delete only.

Now that we have described the vacuum, let's talk about how to make it faster. I'll describe each tip, then describe why it matters.

1 - Insert Data in Sortkey Order (for Tables that are Updated Regularly)

If you have a monotonically increasing sortkey like date, timestamp or auto-incrementing id, make that the first column of your (compound) sortkey. This will cause your inserts to conform to your sortkey configuration, and drastically reduce the merging Redshift needs to do when the vacuum is invoked. If you do one thing in this guide, do this.

Why?

If the unsorted section already belongs entirely at the end of the sorted section (say, because time is an arrow, and you're sorting by timestamp), then the merge step is over almost immediately.

Meanwhile, if you have two sorted sections, and you wish to merge them, but the sort order is interleaved between the two tables (say, because you’re sorting by customer), you will likely have to rewrite the entire table. This will cost you dearly!

Furthermore, if you generally run queries like,

SELECT
   AGGREGATE(column)
FROM my_table
   WHERE date = '2018-01-01'
   AND action = 'message_clicked'
   AND customer = 'taco-town';

then a compound key, sorted by date first, will be both performant in terms of query speed and in terms of vacuum time. You may also consider sorting by customer or action, but these must be subsequent keys in the sortkey, not the first. It will be difficult to optimize your sortkey selection for every query pattern your cluster might see, but you can target and optimize the most likely patterns. Furthermore, by avoiding long vacuums, you are in effect improving query performance.

2 - Use Compression Encodings (for Large Tables)

Compression encodings will give you 2–4x compression on disk. Almost always use Zstandard encoding. But you may use the following command to get compression encoding recommendations on a column-by-column basis,

ANALYZE COMPRESSION my_table;

This command will lock the table for the duration of the analysis, so often you need to take a small copy of your table and run the analysis on it separately.

CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (
    -- Generate a pseudo-random filter of ~100,000 rows.
    -- This works for a table with ~10e9 rows.
    SELECT * FROM my_table
    WHERE ABS(STRTOL(LEFT(MD5('seed' || id), 15), 16)) < POW(2, 59)
);
-- Recreate my_table with these recommendations.
ANALYZE COMPRESSION my_table_tmp;

Alternatively, you may apply compression encoding recommendations automatically during a COPY (but only on the first insert to an empty table).

COPY my_table FROM 's3://bucket' COMPUPDATE ON;

If your tables are small enough to fit into memory without compression, then do not bother encoding them. If your tables are very small, and very low read latency is a requirement, get them out of Redshift altogether.

Why?

The smaller your data, the more data you can fit into memory, the faster your queries will be. So compression helps in both keeping disk space down and reducing the I/O cost of querying against tables that are much larger than memory.

For small tables, the calculus changes. We generally accept a small decompression cost over an I/O cost, but when there is no I/O cost because the table is small, then the decompression cost makes up a significant portion of the total query cost and is no longer worth it. Cutting down on disk space usage frees up the overhead to do deep copies if necessary (see point 3).

3 - Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)

If for some reason your table ends up at more than 20% unsorted, you may be better off copying it than vacuuming it. Bear in mind that Redshift will require 2–3x the table size in free disk space to complete the copy.

Why?

On the first insert to an empty table, Redshift will sort the data according to the sortkey; on subsequent inserts it will not. So a deep copy is identical to a vacuum in this way (as long as the copy takes place in one step). It will likely complete much faster as well (and tie up fewer resources), but you may not have the 2–3x disk space overhead to complete the copy operation. That is why you should be using appropriate compression encodings (see point 2).

Your deep copy code:

BEGIN;
CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (SELECT * FROM my_table);
DROP TABLE my_table;
ALTER TABLE my_table_tmp RENAME TO my_table;
COMMIT;

4 - Call ANALYZE After Vacuuming

This is basic, but it gets left out. Call ANALYZE to update the query planner after you vacuum. The vacuum may have significantly reorganized the table, and you should update the planner stats. This can create a performance increase for reads, and the analyze process itself is typically quite fast.
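In a maintenance job, the pairing might look like this (a sketch; the table list and the DB-API cursor, e.g. from psycopg2, are assumptions):

```python
def maintenance_statements(tables):
    """Build a VACUUM FULL followed by an ANALYZE for each table."""
    stmts = []
    for t in tables:
        stmts.append('VACUUM FULL {};'.format(t))
        stmts.append('ANALYZE {};'.format(t))
    return stmts

# for stmt in maintenance_statements(['my_table', 'my_other_table']):
#     cursor.execute(stmt)   # cursor from e.g. psycopg2 (assumption)
```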

5 - VACUUM to 99% on Large Tables

Push the vacuum to 99% if you have daily insert volume less than 5% of the existing table. The syntax for doing so is,

VACUUM FULL my_table TO 99;

You must specify a table in order to use the TO clause. Therefore, you probably have to write code like this:

for table in tables:
    cursor.execute('VACUUM FULL {} TO 99;'.format(table))

Why?

This one may seem counterintuitive. Many teams might clean up their Redshift cluster by calling VACUUM FULL. This conveniently vacuums every table in the cluster. But if a table's unsorted percentage is less than 5%, Redshift skips the vacuum on that table. This repeats on every vacuum call until the table finally tops 5% unsorted, at which point the sorting will take place.

This is fine if the table is small, and resorting 5% of the table is a modest job. But if the table is very large, resorting and merging 5% of the table may be a significant time cost (it was for us).

Vacuuming more thoroughly on each call spreads the vacuum cost evenly across the events, instead of saving up unsorted rows, then running long vacuums to catch up.

You may wonder if this causes more total vacuum time. The answer is no, if you are following step 1, and inserting in sortkey order. The vacuum call amounts to a sorting of the unsorted section and a quick merge step. Sorting 5% of the table will take 5x the time that sorting 1% of the table does, and the merge step will always be fast if you are inserting new data in sortkey order.

Note: this matters most when your ETL does lots of inserts and deletes. If you don't vacuum, Redshift keeps the dead rows around, so performance degrades and costs go up. Keep your tables clean after every execution, or with daily cron jobs.

How to schedule a BigQuery ETL job with Dataprep

As you know, the BigQuery user interface lets you do all kinds of things: run an interactive or batch query, save results as a table, export to a table, and so on. But there is no scheduler yet to run a query at a specific time or periodicity.

To be clear: once BigQuery has scheduled queries, you will want to use that, so that you can keep your data in BigQuery and take advantage of its power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View.
  2. In Cloud Dataprep, write a new recipe with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, deduplication, or other transformations.
  3. Export the result of the transformation to a BigQuery table or a CSV file on Cloud Storage.
  4. Schedule the Dataprep flow to run periodically.

Go to the “Flows” section of the Dataprep UI and click on the three buttons next to your new Flow. You’ll see an option to add a schedule:

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Options include daily, weekly, and so on, but also a crontab format for further flexibility. That's it!

Happy querying!