How to get data from SQL Server to Elasticsearch using LogStash

As a developer working with SQL Server there was a need to import data from the database to Elasticsearch and analyze data in Kibana.

As Elasticsearch is an open-source project built with Java and handles most other open-source projects, documentation on importing data from SQL Server to ES using LogStash.

I’d like to share how to import SQL Server data to Elasticsearch (version 6.2) using LS and verify the result on Kibana.

Assumption

I will skip on installing ELK (ElasticSearch, LogStash, and Kibana) stack as it’s outside the scope of this article.
Please refer to installation steps on Elastic download pages.

Overview

Here are the steps required to import SQL Server data to Elasticsearch.

  1. Install Java Development Kit (JDK)
  2. Install JDBC Driver for SQL Server
  3. Set CLASSPATH for the driver
  4. Create an Elasticsearch Index to Import Data to
  5. Configure LogStash configuration file
  6. Run LogStash
  7. Verify in Kibana

Step 1 – Install Java SE Development Kit 8

One of the gotchas is that you might install the latest version of JDK, which is version 9 but Elasticsearch documentation requires you to install JDK 8.

At the time of writing, the latest JDK 8 version is 8u162, which can be downloaded here.

Download “JDK8 8u162” and install it on your machine and make sure that “java” is in the PATH variable so that it can be called in any directory within a command line.

Step 2 – Install JDBC Driver for SQL Server

You need to download and install Microsoft JDBC Driver 4.2 for SQL Server, not the latest version.

As Elasticsearch is built with JDK 8, you can’t use the latest version of JDBC Driver (version 6.2) for SQL Server as it does not support JDK 8.

Step 3 – Set CLASSPATH for the JDBC Driver

We need to set the path so that Java can find the JDBC driver.

📝 Note: I am working on Windows 10 machine.

1. Go to the directory under which you have installed SQL Server JDBC.

2. Now you need to navigate to find a JAR file named sqljdbc42.jar, which is found under<<JDBC installation folder>>\sqljdbc_4.2\enu\jre8

3. And then copy the full path to the JAR file.

A cool trick on Windows 7/8/10 is that, when shift+right click on a file, it gives you a “Copy as Path” option.

4. Go to Windows Start button and type “environment” and click on “Edit the system environment variables”.

5. Add a CLASSPATH environment variable with following values (if you don’t already have one).

  1. “.” – for the current directory to search.
  2. And the JAR file path copied in previously (e.g. “C:\talih\Java\MicrosoftJDBCDriversSQLServer\sqljdbc_4.2\enu\jre8\sqljdbc42.jar”).

Gotcha: If you have a space in the path for JDBC JAR file, make sure to put double quotes around it.

Not doing so will result in either of following error messages when you start LogStash service in later step. 

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class JDBC

 - Or -

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class File\Microsoft

Let’s now move onto to create an Elasticsearch index to import data to.

Step 4 – Create an Elasticsearch Index to Import Data to

You can use cURL or Postman to create an Index but I will use Kibana console to create an index named “cs_users”, which is equivalent to a database in relational database terminology.

Before we start the Kibana service, we need to start Elasticsearch so that Kibana would not whine about Elasticsearch not being present.

Kibana warnings on lines 12~21 due to Elasticsearch being unavailable

Go to the Elasticsearch installation and start the service.

talih@CC c:\talih\elasticco\elasticsearch-6.2.2
> bin\elasticsearch.bat

And then go to the Kibana installation directory to start Kibana service.

talih@CC c:\talih\elasticco\kibana-6.2.2-windows-x86_64 
> bin\kibana.bat

If Kibana started without an issue, you will see an output similar to the following.

Kibana started successfully

On line 9, Kibana reports that it is running on http://localhost:5601.
Open the URL in a browser of your choice.

Now go to “Dev Tools” link on the bottom left of the page.

Click on Kibana Dev Tools Link

Once you see the Console, create a new index with the following command.

PUT cs_users
{
        "settings" : {
              "index" : {
                      "number_of_shards" : 3,
                      "number_of_replicas" : 0
              }
        }
}

on the left panel of the Kibana Dev Tools Console.

Create a new Elasticsearch index named “cs_users”

I won’t go into details on “shards” and “replicas” since it’s outside the scope of this article. For more information on the syntax, refer to the official Elasticsearch documentation.

And you will see the response from Elasticsearch with index creation confirmation on the panel right.

A new index “cs_users” is created on Elasticsearch successfully

OK, now we are finally ready to move onto creating a configuration file for LogStash to actually import data.

Step 5 – Configure LogStash configuration file

Go to the LogStash installation folder and create a file named “sql.conf” (name doesn’t really matter).
Here is the LogStash configuration I will be using.

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "xxx"

    statement => "SELECT * FROM Users"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

Let me break down “input” and “output” configurations.

Input

There are three required fields you need to specify for “jdbc” input plugin.

jdbc_connection_string – This field instructs LogStash information on SQL Server.

"jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"

Elasticsearch will connect to the server named “cc” running on port 1433 to connect to a database named “StackExchangeCS” with integrated security authentication method.

jdbc_driver_class – This is the driver class contained within the JDBC JAR file.
The JDBC JAR file contains a driver of type “com.microsoft.sqlserver.jdbc.SQLServerDriver” according to the documentation.

If you have an inquisitive mind, you can confirm it by opening the JAR file with your choice of ZIP program as JAR is a simple ZIP file.

Unzip JAR to verify JDBC driver name

jdbc_user – If you are using “Integrated Security” as an authentication option, this can be any string (I just entered “xxx” since that’s the easiest thing I can type 😉).

Output

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

SQL Server data (all cs.stackexchange.com users) will be sent to Elasticsearch running on the local machine port 9200 and will be indexed under “cs_users” index created in “Step 4 – Create an Elasticsearch Index to Import Data to”.
There are quite a bit of Elasticsearch configuration options so please refer to the official LogStash documentation for more “elasticsearch” output plugin options.

Step 6 – Import Data with LogStash

With prerequisites out of the way, we are now ready to import data to Elasticsearch from SQL Server.
Go to the LogStash installation location under which you should have created “sql.conf” and run LogStash service.

bin\logstash -f sql.conf

-f flag specifies the configuration file to use.
In our case, “sql.conf” we created in the previous step.

The result of successful LogStash run will look similar to the following output.

Step 7 – Verify in Kibana

Wow, we have finally imported data. Now let’s do a quick check whether the number of records in the database matches the records in Elasticsearch.

Verifying result of data import

“User” table in the SQL Server has 59394 records and Elasticsearch returns the same number as well.
📝 Note: You can use following command to get the list of all records in “cs_users” index.

GET cs_users/_count

For more information on how “_count” works, refer to Count API documentation.

Conclusion

Congratulations for getting this far 👏👏👏.

How To Install and Configure Elasticsearch on Ubuntu 16.04 + Bonus (Nifi ^^)

Step 1 — Downloading and Installing Elasticsearch

Elasticsearch can be downloaded directly from elastic.co in ziptar.gzdeb, or rpm packages. For Ubuntu, it’s best to use the deb (Debian) package which will install everything you need to run Elasticsearch.

First, update your package index.

sudo apt-get update

Download the latest Elasticsearch version, which is 2.3.1 at the time of writing.

wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.1/elasticsearch-2.3.1.deb

Then install it in the usual Ubuntu way with dpkg.

sudo dpkg -i elasticsearch-2.3.1.deb

This results in Elasticsearch being installed in /usr/share/elasticsearch/ with its configuration files placed in /etc/elasticsearch and its init script added in /etc/init.d/elasticsearch.

To make sure Elasticsearch starts and stops automatically with the server, add its init script to the default runlevels.

sudo systemctl enable elasticsearch.service

Before starting Elasticsearch for the first time, please check the next section about the recommended minimum configuration.

Step 2 — Configuring Elasticsearch

Now that Elasticsearch and its Java dependencies have been installed, it is time to configure Elasticsearch. The Elasticsearch configuration files are in the /etc/elasticsearch directory. There are two files:

  • elasticsearch.yml configures the Elasticsearch server settings. This is where all options, except those for logging, are stored, which is why we are mostly interested in this file.
  • logging.yml provides configuration for logging. In the beginning, you don’t have to edit this file. You can leave all default logging options. You can find the resulting logs in /var/log/elasticsearch by default.

The first variables to customize on any Elasticsearch server are node.name and cluster.name in elasticsearch.yml. As their names suggest, node.name specifies the name of the server (node) and the cluster to which the latter is associated.

If you don’t customize these variable, a node.name will be assigned automatically in respect to the Droplet hostname. The cluster.name will be automatically set to the name of the default cluster.

The cluster.name value is used by the auto-discovery feature of Elasticsearch to automatically discover and associate Elasticsearch nodes to a cluster. Thus, if you don’t change the default value, you might have unwanted nodes, found on the same network, in your cluster.

To start editing the main elasticsearch.yml configuration file with nano or your favorite text editor.

sudo nano /etc/elasticsearch/elasticsearch.yml

Remove the # character at the beginning of the lines for cluster.name and node.name to uncomment them, and then update their values. Your first configuration changes in the /etc/elasticsearch/elasticsearch.yml file should look like this:

/etc/elasticsearch/elasticsearch.yml

. . .
cluster.name: mycluster1
node.name: "My First Node"
. . .

These the minimum settings you can start with using Elasticsearch. However, it’s recommended to continue reading the configuration part for more thorough understanding and fine-tuning of Elasticsearch.

One especially important setting of Elasticsearch is the role of the server, which is either master or slave. Master servers are responsible for the cluster health and stability. In large deployments with a lot of cluster nodes, it’s recommended to have more than one dedicated master. Typically, a dedicated master will not store data or create indexes. Thus, there should be no chance of being overloaded, by which the cluster health could be endangered.

Slave servers are used as workhorses which can be loaded with data tasks. Even if a slave node is overloaded, the cluster health shouldn’t be affected seriously, provided there are other nodes to take additional load.

The setting which determines the role of the server is called node.master. By default, a node is a master. If you have only one Elasticsearch node, you should leave this option to the default true value because at least one master is always needed. Alternatively, if you wish to configure the node as a slave, assign a false value to the variable node.master like this:/etc/elasticsearch/elasticsearch.yml

. . .
node.master: false
. . .

Another important configuration option is node.data, which determines whether a node will store data or not. In most cases this option should be left to its default value (true), but there are two cases in which you might wish not to store data on a node. One is when the node is a dedicated master” as previously mentioned. The other is when a node is used only for fetching data from nodes and aggregating results. In the latter case the node will act up as a search load balancer.

Again, if you have only one Elasticsearch node, you should not change this value. Otherwise, to disable storing data locally, specify node.data as false like this:/etc/elasticsearch/elasticsearch.yml

. . .
node.data: false
. . .

In larger Elasticsearch deployments with many nodes, two other important options are index.number_of_shards and index.number_of_replicas. The first determines how many pieces, or shards, the index will be split into. The second defines the number of replicas which will be distributed across the cluster. Having more shards improves the indexing performance, while having more replicas makes searching faster.

By default, the number of shards is 5 and the number of replicas is 1. Assuming that you are still exploring and testing Elasticsearch on a single node, you can start with only one shard and no replicas. Thus, their values should be set like this:/etc/elasticsearch/elasticsearch.yml

. . .
index.number_of_shards: 1
index.number_of_replicas: 0
. . .

One final setting which you might be interested in changing is path.data, which determines the path where data is stored. The default path is /var/lib/elasticsearch. In a production environment, it’s recommended that you use a dedicated partition and mount point for storing Elasticsearch data. In the best case, this dedicated partition will be a separate storage media which will provide better performance and data isolation. You can specify a different path.data path by specifying it like this:/etc/elasticsearch/elasticsearch.yml

. . .
path.data: /media/different_media
. . .

Once you make all the changes, save and exit the file. Now you can start Elasticsearch for the first time.

sudo systemctl start elasticsearch

Give Elasticsearch a few to fully start before you try to use it. Otherwise, you may get errors about not being able to connect.

Step 3 — Securing Elasticsearch

By default, Elasticsearch has no built-in security and can be controlled by anyone who can access the HTTP API. This is not always a security risk because Elasticsearch listens only on the loopback interface (i.e., 127.0.0.1) which can be accessed only locally. Thus, no public access is possible and your Elasticsearch is secure enough as long as all server users are trusted or this is a dedicated Elasticsearch server.

Still, if you wish to harden the security, the first thing to do is to enable authentication. Authentication is provided by the commercial Shield plugin. Unfortunately, this plugin is not free but there is a free 30 day trial you can use to test it. Its official page has excellent installation and configuration instructions. The only thing you may need to know in addition is that the path to the Elasticsearch plugin installation manager is /usr/share/elasticsearch/bin/plugin.

If you don’t want to use the commercial plugin but you still have to allow remote access to the HTTP API, you can at least limit the network exposure with Ubuntu’s default firewall, UFW (Uncomplicated Firewall). By default, UFW is installed but not enabled. If you decide to use it, follow these steps:

First, create a rule to allow any needed services. You will need at least SSH allowed so that you can log in the server. To allow world-wide access to SSH, whitelist port 22.

sudo ufw allow 22

Then allow access to the default Elasticsearch HTTP API port (TCP 9200) for the trusted remote host, e.g.TRUSTED_IP, like this:

sudo ufw allow from TRUSTED_IP to any port 9200

Only after that enable UFW with the command:

sudo ufw enable

Finally, check the status of UFW with the following command:

sudo ufw status

If you have specified the rules correctly, the output should look like this:

Output of java -versionStatus: active

To                         Action      From
--                         ------      ----
9200                       ALLOW       TRUSTED_IP
22                         ALLOW       Anywhere
22 (v6)                    ALLOW       Anywhere (v6)

Once you have confirmed UFW is enabled and protecting Elasticsearch port 9200, then you can allow Elasticsearch to listen for external connections. To do this, open the elasticsearch.yml configuration file again.

sudo nano /etc/elasticsearch/elasticsearch.yml

Find the line that contains network.bind_host, uncomment it by removing the # character at the beginning of the line, and change the value to 0.0.0.0 so it looks like this:/etc/elasticsearch/elasticsearch.yml

. . .
network.host: 0.0.0.0
. . .

We have specified 0.0.0.0 so that Elasticsearch listens on all interfaces and bound IPs. If you want it to listen only on a specific interface, you can specify its IP in place of 0.0.0.0.

To make the above setting take effect, restart Elasticsearch with the command:

sudo systemctl restart elasticsearch

After that try to connect from the trusted host to Elasticsearch. If you cannot connect, make sure that the UFW is working and the network.host variable has been correctly specified.

Step 4 — Testing Elasticsearch

By now, Elasticsearch should be running on port 9200. You can test it with curl, the command line client-side URL transfers tool and a simple GET request.

curl -X GET 'http://localhost:9200'

You should see the following response:

Output of curl{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

If you see a response similar to the one above, Elasticsearch is working properly. If not, make sure that you have followed correctly the installation instructions and you have allowed some time for Elasticsearch to fully start.

To perform a more thorough check of Elasticsearch execute the following command:

curl -XGET 'http://localhost:9200/_nodes?pretty'

In the output from the above command you can see and verify all the current settings for the node, cluster, application paths, modules, etc.

Step 5 — Using Elasticsearch

To start using Elasticsearch, let’s add some data first. As already mentioned, Elasticsearch uses a RESTful API, which responds to the usual CRUD commands: create, read, update, and delete. For working with it, we’ll use again curl.

You can add your first entry with the command:

curl -X POST 'http://localhost:9200/tutorial/helloworld/1' -d '{ "message": "Hello World!" }'

You should see the following response:

Output{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

With cuel, we have sent an HTTP POST request to the Elasticsearch server. The URI of the request was /tutorial/helloworld/1 with several parameters:

  • tutorial is the index of the data in Elasticsearch.
  • helloworld is the type.
  • 1 is the id of our entry under the above index and type.

You can retrieve this first entry with an HTTP GET request.

curl -X GET 'http://localhost:9200/tutorial/helloworld/1'

The result should look like:

Output{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"found":true,"_source":{ "message": "Hello World!" }}

To modify an existing entry, you can use an HTTP PUT request.

curl -X PUT 'localhost:9200/tutorial/helloworld/1?pretty' -d '
{
  "message": "Hello People!"
}'

Elasticsearch should acknowledge successful modification like this:

Output{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : false
}

In the above example we have modified the message of the first entry to “Hello People!”. With that, the version number has been automatically increased to 2.

You may have noticed the extra argument pretty in the above request. It enables human readable format so that you can write each data field on a new row. You can also “prettify” your results when retrieving data and get much nicer output like this:

curl -X GET 'http://localhost:9200/tutorial/helloworld/1?pretty'

Now the response will be in a much better format:

Output{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "message" : "Hello People!"
  }
}

So far we have added to and queried data in Elasticsearch. To learn about the other operations please check the API documentation.

Last Step – Get Data from SQL to Elastich With Nifi

Conclusion

That’s how easy it is to install, configure, and begin using Elasticsearch. Once you have played enough with manual queries, your next task will be to start using it from your applications.

Fixing Your Poor Performance of Redshift with Vacuum

As also you know about Redshift is too much fast when you have to read or select some data, but of course you have to do something for like theoretical white paper performance. Even if you’ve carefully planned out your schema, sortkeys, distkeys and compression encodings, your Redshift queries may still be awfully slow if you have long-running vacuums taking place in the background.

The number one enemy for query performance is the vacuum—it can slow down your ETL jobs and analytical queries by as much as 80%. It is an I/O intensive process that sorts the table, reclaims unused disk space, and impacts all other I/O bound processes (such as queries against large tables). This guide can help you cut down the time it takes to vacuum your cluster (these steps lowered our vacuum time from 10–30 hours to less than 1 hour).

This guide assumes you’ve chosen sortkeys and distkeys for your table, and are vacuuming regularly. If you are not doing these things, use this guide and this guide to get them set up (the flow charts are quite helpful).

 Contents

  • 0- What is the Vacuum?
  • 1 – Insert Data in Sortkey Order (for Tables that are Updated Regularly)
  • 2 – Use Compression Encodings (for Large Tables)
  • 3 – Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)
  • 4 – Call ANALYZE After Vacuuming
  • 5 – VACUUM to 99% on Large Tables
  • 6 – Keep Your Tables Skinny

The vacuum is a process that carries out one or both of the following two steps: sorting tables and reclaiming unused disk blocks. Lets talk about sorting,

VACUUM SORT ONLY;

The first time you insert data into the table, it will land sorted according to its sortkey (if one exists), and this data will make up the “sorted” section of the table. Note the unsorted percentage on the newly populated table below.

COPY INTO my_table FROM s3://my-bucket/csv;
SELECT "table", unsorted FROM svv_table_info;
  table   | unsorted
----------+------------
 my_table |     0

Subsequent inserts are appended to a completely different section on disk called the “unsorted” section of the table. Calling VACUUM SORT ONLY initiates two processes,

  1. a sorting of the unsorted section,
  2. a merging of the sorted and unsorted sections;

both of these steps can be costly, but there are simple ways to cut down that cost, which we’ll discuss below.

Now onto deleting,

VACUUM DELETE ONLY;

If you called DELETE on any rows from your table since the last vacuum, they were merely marked for deletion. A vacuum operation is necessary to actually reclaim that disk space.

These two steps, sorting tables and reclaiming disk space, can be run together efficiently,

VACUUM FULL;

This command simply runs both a sort only and a delete only operation, but there are advantages to doing them concurrently. If you have deleted andinserted new data, always do a “full” vacuum. It will be faster than a manual vacuum sort only followed by a manual vacuum delete only.

Now that we have described the vacuum, lets talk about how to make it faster. I’ll describe each tip, then describe why it matters.

1 -Insert Data in Sortkey Order (for Tables that are Updated Regularly)

If you have a monotonically increasing sortkey like date, timestamp or auto-incrementing id, make that the first column of your (compound) sortkey. This will cause your inserts to conform to your sortkey configuration, and drastically reduce the merging Redshift needs to do when the vacuum is invoked. If you do one thing in this guide, do this.

Why?

If the unsorted section fully belongs at the end of the sorted sectionalready (say, because time is an arrow, and you’re sorting by timestamp), then the merge step is over almost immediately.

Meanwhile, if you have two sorted sections, and you wish to merge them, but the sort order is interleaved between the two tables (say, because you’re sorting by customer), you will likely have to rewrite the entire table. This will cost you dearly!

Furthermore, if in general if you do queries like,

SELECT 
   AGGREGATE(column)
FROM my_table
   WHERE date = '2018–01–01'
   AND action = 'message_clicked'
   AND customer = 'taco-town';

then a compound key, sorted by date first, will be both performant in terms of query speed and in terms of vacuum time. You may also consider sorting by customer or action, but these must be subsequent keys in the sortkey, not the first. It will be difficult to optimize your sortkey selection for every query pattern your cluster might see, but you can target and optimize the most likely patterns. Furthermore, by avoiding long vacuums, you are in effect improving query performance.

2- Use Compression Encodings (for Large Tables)

Compression encodings will give you 2–4x compression on disk. Almost always use Zstandard encoding. But you may use the following command to get compression encoding recommendations on a column-by-column basis,

ANALYZE COMPRESSION my_table;

This command will lock the table for the duration of the analysis, so often you need to take a small copy of your table and run the analysis on it separately.

CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (
    -- Generate a pseudo-random filter of ~100,000 rows.
    -- This works for a table with ~10e9 rows.
    SELECT * FROM my_table
    WHERE ABS(STRTOL(LEFT(MD5('seed' || id), 15), 16)) < POW(2, 59)
);
-- Recreate my_table with these recommendations.
ANALYZE COMPRESSION my_table_tmp;

Alternatively, you may apply compression encoding recommendations automatically during a COPY (but only on the first insert to an empty table).

COPY INTO my_table FROM s3://bucket COMPUPDATE ON;

If your tables are small enough to fit into memory without compression, then do not bother encoding them. If your tables are very small, and very low read latency is a requirement, get them out of Redshift altogether.

Why?

The smaller your data, the more data you can fit into memory, the faster your queries will be. So compression helps in both keeping disk space down and reducing the I/O cost of querying against tables that are much larger than memory.

For small tables, the calculus changes. We generally accept a small decompression cost over an I/O cost, but when there is no I/O cost because the table is small, then the decompression cost makes up a significant portion of the total query cost and is no longer worth it. Cutting down on disk space usage frees up the overhead to do deep copies if necessary (see point 3).

3- Deep Copy Instead Of Vacuuming (When the Unsorted Section is Large)

If for some reason your table ends up at more than 20% unsorted, you may be better off copying it than vacuuming it. Bear in mind that Redshift will require 2–3x the table size in free disk space to complete the copy.

Why?

On the first insert to an empty table, Redshift will sort the data according to the sortkey, on subsequent inserts it will not. So a deep copy is identical to a vacuum in this way (as long as the copy takes place in one step). It will likely complete much faster as well (and tie up less resources), but you may not have the 2–3x disk space overhead to complete the copy operation. That is why you should be using appropriate compression encodings (see point 2).

Your deep copy code:

BEGIN;
CREATE TABLE my_table_tmp (LIKE my_table);
INSERT INTO my_table_tmp (SELECT * FROM my_table);
DROP TABLE my_table;
ALTER TABLE my_table_tmp RENAME TO my_table;
COMMIT;

4-Call ANALYZE After Vacuuming

This is basic, but it gets left out. Call ANALYZE to update the query planner after you vacuum. The vacuum may have significantly reorganized the table, and you should update the planner stats. This can create a performance increase for reads, and the analyze process itself is typically quite fast.

5- VACUUM to 99% on Large Tables

Push the vacuum to 99% if you have daily insert volume less than 5% of the existing table. The syntax for doing so is,

VACUUM FULL my_table TO 99;

You must specify a table in order to use the TO clause. Therefore, you probably have to write code like this:

for table in tables:
    cursor.execute('VACUUM FULL {} TO 99;'.format(table))

Why?

This one may seem counterintuitive. Many teams might clean up their redshift cluster by calling VACUUM FULL. This conveniently vacuums every table in the cluster. But, if a table’s unsorted percentage is less than 5%, Redshift skips the vacuum on that table. This process continues for every vacuum call until the table finally tops 5% unsorted, at which point the sorting will take place.

This is fine if the table is small, and resorting 5% of the table is a modest job. But if the table is very large, resorting and merging 5% of the table may be a significant time cost (it was for us).

Vacuuming more thoroughly on each call spreads the vacuum cost evenly across the events, instead of saving up unsorted rows, then running long vacuums to catch up.

You may wonder if this causes more total vacuum time. The answer is no, if you are following step 1, and inserting in sortkey order. The vacuum call amounts to a sorting of the unsorted section and a quick merge step. Sorting 5% of the table will take 5x the time that sorting 1% of the table does, and the merge step will always be fast if you are inserting new data in sortkey order.

Note: the Most important thing is when etl works kind of insert delete etch, Redshift stored data if you didn’t vacuum etc. So on performance and prices is going up. Keep clean your tables after every execution or with daily crons.

How to schedule a BigQuery ETL job with Dataprep

As you know BigQuery user interface lets you do all kinds of things like run an interactive query or batch, save as Table, export to table, etc. — but there is no scheduler yet to schedule a query to run at a specific time or periodicity.

To be clear: once BigQuery has scheduled queries, you want to use that, so that you can keep your data in BigQuery and take advantage of power. However, if you are doing transformations (the T in ETL), then consider this approach:

  1. In the BigQuery UI, save the desired query as a View.
  2. In Cloud Dataprep, write a new recipe, with a BigQuery source. Optionally, add some transforms to your recipe. For example, you might want to add some formulas, de-deduplication, transformations, etc.
  3. An export result of the transformation to a BigQuery table or CSV file on Cloud Storage
  4. Schedule the Dataprep flow to run periodically

Go to the “Flows” section of the Dataprep UI and click on the three buttons next to your new Flow. You’ll see an option to add a schedule:

If the UI is different when you try to replicate the steps, just hunt around a bit. The functionality is likely to be there, just in a different place.

Options include daily, weekly, etc. but also a crontab format for further flexibility that’s it.

Have a nice querying.

 

What is Hadoop?

Apache™ Hadoop® is a highly scalable open-source storage platform designed for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks i.e. it can process very large data sets across hundreds and thousands of computing nodes that operate in parallel. It provides a cost effective storage solution for large data volumes with no format requirements.

Hadoop is an ecosystem of open source components that fundamentally changed the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware.

What are those terms mean?

Open-source software: Open-source software is created and maintained by a network of developers from around the globe. It’s free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.

Framework: In this case, it means that everything you need to develop and run software applications is provided – programs, connections, etc.

Massive storage: The Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.

Processing power: Hadoop concurrently processes large amounts of data using multiple low-cost computers for fast results.

History

It all started with the World Wide Web. As the web grew in the late 1900s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results really were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).

One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.

desc
Mike Cafarella and Doug Cutting

In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided. The web crawler portion remained as Nutch. The distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.

Components of Hadoop

Currently, four core modules are included in the basic framework from the Apache Foundation:

1. Hadoop Common

The libraries and utilities used by other Hadoop modules.

2. Hadoop Distributed File System (HDFS)

The Java-based scalable system that stores data across multiple machines without prior organization.

Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.

hadoopdistrubuted
3. Processing with Hadoop MapReduce and YARN

MapReduce – a software programming model for processing large sets of data in parallel.

process

YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator)

yarn

Hadoop MapReduce, an execution engine in Hadoop, processes workloads using the MapReduce framework which breaks down jobs into smaller pieces of work that can be distributed across nodes in your Amazon EMR cluster. The Hadoop MapReduce engine is built with the expectation that any given machine in your cluster could fail at any time and is designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.

You can write MapReduce programs in Java, or use Hadoop Streaming to execute custom scripts in a parallel fashion, Hive and Pig (if you choose to install these applications on your Amazon EMR cluster) for higher level abstractions over MapReduce, or other tools to interact with Hadoop.

Starting with Hadoop 2, resource management is managed by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster, and it ensures that these resources are dynamically allocated to accomplish the tasks in your processing job. YARN is able to manage Hadoop MapReduce workloads as well as other distributed frameworks such as Apache Spark, Apache Tez, and more.

Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:

masasdasd

Pig

A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.

Hive

A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)

HBase

A nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.

HCatalog

A table and storage management layer that helps users share and access data.

Ambari

A web interface for managing, configuring and testing Hadoop services and components.

Cassandra

A distributed database system.

Chukwa

A data collection system for monitoring large distributed systems.

Flume

Software that collects, aggregates and moves large amounts of streaming data into HDFS.

Oozie

A Hadoop job scheduler.

Sqoop

A connection and transfer mechanism that moves data between Hadoop and relational databases.

Spark

An open-source cluster computing framework with in-memory analytics.

Solr

An scalable search tool that includes indexing, reliability, central configuration, failover and recovery.

Zookeeper

An application that coordinates distributed processes.

In addition, there are commercial distributions of Hadoop, including Cloudera, Hortonworks and MapR. With distributions from software vendors, you pay for their version of the framework and receive additional software components, tools, training, documentation and other services.

Uses of Hadoop

1. Store

Hadoop’s infinitely scalable flexible architecture (based on the HDFS filesystem) allows organizations to store and analyze unlimited amounts and types of data—all in a single, open source platform on industry-standard hardware.

The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later.

2. Data lake

Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.

3. Process

Quickly integrate with existing systems or applications to move data into and out of Hadoop through bulk load processing (Apache Sqoop) or streaming (Apache Flume, Apache Kafka).?

Transform complex data, at scale, using multiple data access options (Apache Hive, Apache Pig) for batch (MR2) or fast in-memory (Apache Spark) processing. Process streaming data as it arrives in your cluster via Spark Streaming.

4. Discover

Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.

Analysts interact with full-fidelity data on the fly with Apache Impala (incubating), the analytic database for Hadoop. With Impala, analysts experience BI-quality SQL performance and functionality plus compatibility with all the leading BI tools.

Using Cloudera Search, an integration of Hadoop and Apache Solr, analysts can accelerate the process of discovering patterns in data in all amounts and formats, especially when combined with Impala.

5. Recommendation systems

One of the most popular analytical uses by some of Hadoop’s largest adopters is for web-based recommendation systems. Facebook – people you may know. LinkedIn – jobs you may be interested in. Netflix, eBay, Hulu – items you may be interested in. These systems analyze huge amounts of data in real time to quickly predict preferences before customers leave the web page.

6. Model

With Hadoop, analysts and data scientists have the flexibility to develop and iterate on advanced statistical models using a mix of partner technologies as well as open source frameworks like Apache Spark.

7. Serve

The distributed data store for Hadoop, Apache HBase, supports the fast, random reads/writes (“fast data”) required for online applications.

Difficulties in Hadoop Adoption

The scale-out potential of Apache Hadoop is impressive. However, while Hadoop offers the advantage of using low-cost commodity servers, extending this scale-out potential to thousands of nodes can translate to a true expense. As the demand for compute and analytic capacity grows, so can the machine costs. This has an equal effect on storage since Hadoop spreads out data, and companies must have equal space for increased data storage repositories, including all the indices, and for all the acquired raw data.

Integrating and processing all of this diverse data can be costly in terms of both infrastructure and personnel. While traditional BI relies on evaluating transactional and historical data, today’s analytics require more skill in iterative analysis and the ability to recognize patterns. When dealing with big data, an advanced skillset that goes beyond RDBMS capabilities-both in terms of analysis and programming-is essential. Not only is there need for advanced systems administration and analyst capabilities when working with Hadoop, but learning the MapReduce programming unique to this framework represents a significant hurdle.

In terms of relational databases, moving and modifying large volumes of unstructured data into the necessary form for Extraction, Transformation, Loading (ETL) can be both costly and time-consuming. That’s a key reason why Hadoop seems so attractive. One could argue that the ongoing growth in data volume, velocity, and variety has made the traditional data warehousing architecture less and less viable. However, it is still easier to find experienced RDBMS programmers and developers than those with MapReduce programming capabilities. Part of the difficulty lies in just learning the language beyond having the skills to install and maintain the Hadoop platform.

MapReduce uses a computational approach that employs a Map pre-processing function and a Reduce data aggregation/distillation step. However, when it comes to real-time transactional data analysis, the low latency reads and writes characteristic of RDBMS structured data processing are simply not possible with HDFS and MapReduce.

Of course, as the platform matures, more features will continue to be added to it. While add-on products make Hadoop easier to use, they also present a learning challenge that requires constantly expanding one’s expertise. For example:

Hive is the data warehousing component of Hadoop, and it functions well with structured data, enabling ad-hoc queries against large transactional datasets. On the other hand, though workarounds do exist, the absence of any ETL-style tool makes HiveQL, the SQL-like programming dialect, problematic when working with unprocessed, unstructured data.

HBase, the column-based storage system, enables users to employ Hadoop datasets as though they’re indices in any conventional RDBMS. It typically allows easy column creation and lets the user store virtually any structure within a data element.

PIG represents the high-level dataflow language, Pig Latin, and requires quite advanced training. It provides easier access to data held in Hadoop clusters and offers a means for analyzing large datasets. In part, PIG enables the implementation of simple or complex workflows and the designation of multiple data inputs where data can then be processed by multiple operators.

As IT organizations consider wholesale adoption of the Hadoop platform for analytics, they must carefully strategize their approach. The platform’s specialized methodology, scale-out storage, and powerful processing capacity make it optimal for analytical data loads. However, the dedication in training competent personnel and machine costs, as well as the framework’s inability to function as an RDBMS replacement, should prompt careful consideration.

Hardware and Software for Hadoop

Hardware Requirements

Hadoop runs on commodity hardware. That doesn’t mean it runs on cheapo hardware. Hadoop runs on decent server class machines.

Here are some possibilities of hardware for Hadoop nodes.

hedm

So the high end machines have more memory. Plus, newer machines are packed with a lot more disks (e.g. 36 TB) — high storage capacity.

Examples of Hadoop servers

  1. HP ProLiant DL380
  2. Dell C2100 series
  3. Supermicro Hadoop series

So how does a large hadoop cluster looks like? Here is a picture of Yahoo’s Hadoop cluster.

dc

Software Requirements

Operating System

Hadoop runs well on Linux. The operating systems of choice are:

RedHat Enterprise Linux (RHEL)

This is a well tested Linux distro that is geared for Enterprise. Comes with RedHat support

CentOS

Source compatible distro with RHEL. Free. Very popular for running Hadoop. Use a later version (version 6.x).

Ubuntu

The Server edition of Ubuntu is a good fit — not the Desktop edition. Long Term Support (LTS) releases are recommended, because they continue to be updated for at least 2 years.

Java

Hadoop is written in Java. The recommended Java version is Oracle JDK 1.6 release and the recommended minimum revision is 31 (v 1.6.31).

So what about OpenJDK? At this point the Sun JDK is the ‘official’ supported JDK. You can still run Hadoop on OpenJDK (it runs reasonably well) but you are on your own for support 🙂

Business Intelligence Tools For Hadoop and Big Data

The case for BI Tools

Analytics for Hadoop can be done by the following:

  • Writing custom Map Reduce code using Java, Python, R ..etc
  • Using high level Pig scripts
  • Using SQL using Hive

How ever doing analytics like this can feel a little pedantic and time consuming. Business INtelligence tools (BI tools for short) can address this problem.

BI tools have been around since before Hadoop. Some of them are generic, some are very specific towards a certain domain (e.g. Telecom, Health Care ..etc). BI tools provide rich, user friendly environment to slice and dice data. Most of them have nice GUI environments as well.

BI Tools Feature Matrix Comparison

Since Hadoop is gaining popularity as a data silo, a lot of BI tools have added support to Hadoop. In this chapter we will look into some BI tools that work with Hadoop.

We are trying to present capabilities of BI tools in an easy to compare feature matrix format. This is a ‘living’ document. We will keep it updated as new versions and new features surface.

This matrix is under construction

How to read the matrix?

  • Y – feature is supported
  • N – feature is NOT supported
  • ? or empty – unknown

Table: BI Tools Comparison : Data Access and Management

b

Table: BI Tools Comparison : Analytics

a

Table: BI Tools Comparison : Visualizing

10fb15c77258a991b0028080a64fb42d (19)

Table: BI Tools Comparison : Connectivity

8266e4bfeda1bd42d8f9794eb4ea0a13 (16)

Glossary of terms

Data Validataion

Can validate data confirms to certain limits, can do cleansing and de-duping.

Share with others

Can share the results with others within or outside organization easily. (Think like sharing a document on DropBox or Google Drive)

Local Rendering

You can slice and dice data on locally on a computer or tablet. This uses the CPU power of the device and doesn’t need a round-trip to a ‘server’ to process results. This can speed up ad-hoc data exploration

Analytics ‘app store’

The platform allows customers to buy third party analytics app. Think like APple App Store.

Hadoop For Executives

This section is a quick ‘fact sheet’ in a Q&A format.

What is Hadoop?

Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.

What is the license of Hadoop?

Hadoop is open source software. It is an Apache project released under Apache Open Source License v2.0. This license is very commercial friendly.

Who contributes to Hadoop?

Originally Hadoop was developed and open sourced by Yahoo. Now Hadoop is developed as an Apache Software Foundation project and has numerous contributors from Cloudera, Horton Works, Facebook, etc.

Isn’t Hadoop used by foo-foo social media companies and not by enterprises

Hadoop had its start in a Web company. It was adopted pretty early by social media companies because the companies had Big Data problems and Hadoop offered a solution.

However, Hadoop is now making inroads into Enterprises.

I am not sure my company has a big data problem

Hadoop is designed to deal with Big Data. So if you don’t have a ‘Big Data Problem’, then Hadoop probably isn’t the best fit for your company. But before you stop reading right here, please read on 🙂

How much data is considered Big Data, differs from company to company. For some companies, 10 TB of data would be considered Big Data; for others 1 PB would be ‘Big Data’. So only you can determine how much is Big Data.

Also, if you don’t have a ‘Big Data problem’ now, is that because you are not capturing some data? In some scenarios, companies chose to forgo capturing data, because there wasn’t a feasible way to store and process it. Now that Hadoop can help with Big Data, it may be possible to start capturing data that wasn’t captured before.

How much does it cost to adopt Hadoop?

Hadoop is open source. The software is free. However running Hadoop does have other cost components.

Cost of hardware : Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to 1000s of nodes. For a large cluster, the hardware costs will be significant.

The cost of IT / OPS for standing up a large Hadoop cluster and supporting it will need to be factored in.

Since Hadoop is a newer technology, finding people to work on this ecosystem is not easy.

Hadoop for Developers

This section is a quick ‘fact sheet’ in a Q&A format.

What is Hadoop?

Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.

Is Hadoop a fad or here to stay?

Sure, Hadoop and Big Data are all the rage now. But Hadoop does solve a real problem and it is a safe bet that it is here to stay.

Below is a graph of Hadoop job trends from Indeed.com. As you can see, demand for Hadoop skills has been up and up since 2009. So Hadoop is a good skill to have!

 Hadoop Job Trends

23d2cdfe8faaf0b4d47d114070470283

What skills do I need to learn Hadoop?

A hands-on developer or admin can learn Hadoop. The following list is a start – in no particular order

  • Hadoop is written in Java. So knowing Java helps
  • Hadoop runs on Linux, so you should know basic Linux command line navigation skills
  • Some Linux scripting skills will go a long way

What kind of technical roles are available in Hadoop?

The following should give you an idea of the kind of technical roles in Hadoop.

1c82a3a7-21ed-431b-aef2-8cc72f90a472

I am not a programmer, can I still use Hadoop?

Yes, you don’t need to write Java Map Reduce code to extract data out of Hadoop. You can use Pig and Hive. Both Pig and Hive offer ‘high level’ Map Reduce. For example you can query Hadoop using SQL in Hive.

What kind of development tools are available for Hadoop?

Hadoop development tools are still evolving. Here are a few:

  • Karmasphere IDE : tuned for developing for Hadoop
  • Eclipse and other Java IDEs : When writing Java code
  • Command line editor like VIM : No matter what editor you use, you will be editing a lot of files / scripts. So familiarity with CLI editors is essential.

Where can I learn more?

Books

Tom White’s Hadoop Book : This is the ‘Hadoop Bible’

Hadoop

 

The Commercial Platform Approach to Apache Hadoop

As mentioned above, businesses dealing with increasing masses of data are looking for a distributed computing analytic solution that provides comprehensive administration and management, easy deployment, and support for effective business continuity and high availability.

Today, commercial open-source models that incorporate MapReduce along with a built-in framework and infrastructure offer another means for avoiding the learning curve and burdens associated with Apache Hadoop deployment. These commercial players ease skillset acquisition by providing key management tools that interface with Hadoop processes. The value of technical support, services, and training cannot be overstated when it comes to Hadoop implementation.

Commercial vendors offer a means by which these high-level analysis tools can be accessed and used by a wide variety of users, not just those with engineering or BI capabilities. They provide the support that ensures Hadoop users can undertake complex data analysis projects.

As open-source tools proliferate and their increasing importance to big data analytics continues to grow, a need for streamlined administration and support will expand as well. While commercial Hadoop providers offer the necessary support, there is no alternative to learning its platform-specific language. Adequate knowledge of MapReduce represents an intrinsic component to working with Hadoop. Moreover, in order for users to install, configure, and use the code, thorough training is fundamental.

Hadoop integration with current BI analytics remains a key goal along with the development of analytic tools that can be employed by a wide range of users. Commercial vendors, such as Cloudera, Hortonworks, and MapR, may eventually provide the necessary connectivity between common BI analysis methodology and NoSQL. Since Apache Hadoop, as a stand-alone, open-source deployment, doesn’t contain internal manageability controls or high-level performance monitors, Cloudera offers a number of management tools that make analysis easier to implement for a range of users.

Cloudera’s Apache-licensed open source software, Cloudera’s Distribution Including Apache Hadoop (CDH), is in its fourth generation (CDH4). The offering includes a hot failover for the metadata function, NameNode. This is a critical contribution since NameNode is considered a single point of failure, essentially an Achille’s heel for Hadoop. The latest version of Cloudera’s Enterprise subscription offers a comprehensive package: high availability, improved security, Cloudera Manager for end-to-end Hadoop administration as well as long-term support.

Since part of the promise of big data requires getting past the hype and understanding appropriate applications of Hadoop, Hortonworks has created the Hortonwork Data Platform, version 1.0, which combines HA and failover requirements using VMware virtualization tools. The software’s approach relies on running NameNode and Hadoop’s JobTracker on virtual machines (VMs). This aspect helps to double up Hadoop’s fault tolerance through the automation of VM replacement for failed servers. The software also includes a GUI for dataset integration and for composing workflows as well as HCatalog that enables connectivity with RDBMS products.

MapR has chosen to solve the data volume issue via its replacement of Hadoop’s HDFS with a derivative of the UNIX-based file system, NFS. This helps to do away with the NameNode function altogether as a single point of failure. By swapping out HDFS, the company’s proprietary components claim to offer improved HA as well as higher scalability and performance.

Commercial Hadoop providers play a critical role in enabling wider platform adoption, and their support services allow the technology to be accessed by those organizations that might otherwise have difficulties around implementation. While these companies represent key players in the ongoing commercialization of Hadoop, they also offer an important function through their training and certification courses-a value that cannot be understated.

References:

SAS, IBM, Cloudera, AWS, Hadoop Illuminated

I used referenced web site for this post : gladwinanalytics.com

We have a referrer link, for original article of this post, if you want you can follow gladwinanalytics.com

Special thanks for Anandh Shanmugaraj and gladwinanalytics team, for this article, and you can see the post at below link;

http://www.gladwinanalytics.com/what-is/hadoop

Etl Tools – General Information

ETL tools are designed to save time and money by eliminating the need of ‘hand-coding’ when a new data warehouse is developed. They are also used to facilitate the work of the database administrators who connect different branches of databases as well as integrate or change the existing databases.

    The main purpose of the ETL tool is:

  • extraction of the data from legacy sources (usually heterogenous)
  • data transformation (data optimized for transaction –> data optimized for analysis)
  • synchronization and cleansing of the data
  • loading the data into data warehouse.

There are several requirements that must be had by ETL tools in order to deliver an optimal value to users, supporting a full range of possible scenarios.

Those are:
– data delivery and transformation capabilities
– data and metadata modelling capabilities
– data source and target support
– data governance capability
– runtime platform capabilities
– operations and administration capabilities
– service-enablements capability.

ETL TOOLS COMPARISON CRITERIA

The bietltools.com portal is not affiliated with any of the companies listed below in the comparison.

The research inclusion and exclusion criteria are as follows:
– range and mode of connectivity/adapter support
– data transformation and delivery modes support
– metadata and data modelling support
– design, development and data governance support
– runtime platform support
– enablement of service and three additional requirements for vendors:
– $20 milion or more of software revenue from data integration tools every year or not less than 300 production customers
– support of customers in not less than two major geographic regions
– have customer implementations at crossdepartamental and multiproject level.

We miss a few etl tools, but think generally. Of course in world we have lots of etl tools, but for now we couldnt investigate which one is we miss.

ETL TOOLS COMPARISON

The information provided below lists major strengths and weaknesses of the most popular ETL vendors.

IBM (Information Server Infosphere platform)

    Advantages:

  • strongest vision on the market, flexibility
  • progress towards common metadata platform
  • high level of satisfaction from clients and a variety of initiatives
    Disadvantages:

  • difficult learning curve
  • long implementation cycles
  • became very heavy (lots of GBs) with version 8.x and requires a lot of processing power

Informatica PowerCenter

    Advantages:

  • most substantial size and resources on the market of data integration tools vendors
  • consistent track record, solid technology, straightforward learning curve, ability to address real-time data integration schemes
  • Informatica is highly specialized in ETL and Data Integration and focuses on those topics, not on BI as a whole
  • focus on B2B data exchange
    Disadvantages:

  • several partnerships diminishing the value of technologies
  • limited experience in the field.

Microsoft (SQL Server Integration Services)

    Advantages:

  • broad documentation and support, best practices to data warehouses
  • ease and speed of implementation
  • standardized data integration
  • real-time, message-based capabilities
  • relatively low cost – excellent support and distribution model
    Disadvantages:

  • problems in non-Windows environments. Takes over all Microsoft Windows limitations.
  • unclear vision and strategy

Oracle (OWB and ODI)

    Advantages:

  • based on Oracle Warehouse Builder and Oracle Data Integrator – two very powerful tools;
  • tight connection to all Oracle datawarehousing applications;
  • tendency to integrate all tools into one application and one environment.
    Disadvantages:

  • focus on ETL solutions, rather than in an open context of data management;
  • tools are used mostly for batch-oriented work, transformation rather than real-time processes or federation data delivery;
  • long-awaited bond between OWB and ODI brought only promises – customers confused in the functionality area and the future is uncertain

SAP BusinessObjects (Data Integrator / Data Services)

    Advantages:

  • integration with SAP
  • SAP Business Objects created a firm company determined to stir the market;
  • Good data modeling and data-management support;
  • SAP Business Objects provides tools for data mining and quality; profiling due to many acquisitions of other companies.
  • Quick learning curve and ease of use
    Disadvantages:

  • SAP Business Objects is seen as two different companies
  • Uncertain future. Controversy over deciding which method of delivering data integration to use (SAP BW or BODI).
  • BusinessObjects Data Integrator (Data Services) may not be seen as a stand-alone capable application to some organizations.

SAS

    Advantages:

  • experienced company, great support and most of all very powerful data integration tool with lots of multi-management features
  • can work on many operating systems and gather data through number of sources – very flexible
  • great support for the business-class companies as well for those medium and minor ones
    Disadvantages:

  • misplaced sales force, company is not well recognized
  • SAS has to extend influences to reach non-BI community
  • Costly

Sun Microsystems

    Advantages:

  • Data integration tools are a part of huge Java Composite Application Platform Suite – very flexible with ongoing development of the products
  • ‘Single-view’ services draw together data from variety of sources; small set of vendors with a strong vision
    Disadvantages:

  • relative weakness in bulk data movement
  • limited mindshare in the market
  • support and services rated below adequate

Sybase

    Advantages:

  • assembled a range of capabilities to be able to address a mulitude of data delivery styles
  • size and global presence of Sybase create opportunities in the market
  • pragmatic near-term strategy – better of current market demand
  • broad partnerships with other data quality and data integration tools vendors
    Disadvantages:

  • falls behind market leaders and large vendors
  • gaps in many aspects of data management

Syncsort

    Advantages:

  • functionality; well-known brand on the market (40 years experience); loyal customer and experience base;
  • easy implementation, strong performance, targeted functionality and lower costs
    Disadvantages:

  • struggle with gaining mind share in the market
  • lack of support for other than ETL delivery styles
  • unsatisfactory with lack of capability of professional services

Tibco Software

    Advantages:

  • message-oriented application integration; capabilities based on common SOA structures;
  • support for federated views; easy implementation, support andperformance
    Disadvantages:

  • scarce references from customers; not widely enough recognised for data integration competencies
  • lacking in data quality capabilities.

ETI

    Advantages:

  • proven and mature code-generating architecture
  • one of the earliest vendors on the data integration market; support for SOA service-oriented deployments;
  • successfully deals with large data volumes and a high degree of complexity, extension of the range of data platforms and data sources;
  • customers’ positive responses to ETI technology
    Disadvantages:

  • relatively slow growth of customer base
  • rather not attractive and inventive technology.

iWay Software

    Advantages:

  • offers physical data movement and delivery; support of wide range of adapters and access to numerous sources;
  • well integrated, standard tools;
  • reasonable ease of implementation effort
    Disadvantages:

  • gaps in specific capabilities
  • relatively costly – not competitive versus market leaders

Pervasive Software

    Advantages:

  • many customers, years of experience, solid applications and support;
  • good use of metadata
  • upgrade from older versions into newer is straightforward.
    Disadvantages:

  • inconsistency in defining the target for their applications;
  • no federation capability;
  • limitated presence due to poor marketing.

Open Text

    Advantages

  • Simplicity of use in less-structured sources
  • Easy licensing for business solutions
  • cooperates with a wide range of sources and targets
  • increasingly high functionality
    Disadvantages:

  • limited federation, replication and data quality support; rare upgrades due to its simplicity;
  • weak real-time support due to use third party solutions and other database utilities.

Pitney Bowes Software

    Advantages:

  • Data Flow concentrates on data integrity and quality;
  • supports mainly ETL patterns; can be used for other purposes too;
  • ease of use, fast implementation, specific ETL functionality.
    Disadvantages:

  • rare competition with other major companies, repeated rebranding trigger suspicions among customers.
  • narrow vision of possibilities even though Data Flow comes with variety of applications.
  • weak support, unexperienced service.

I used referenced web site : etltools.net

We have a referrer link, for original article of this post, if you want you can follow etltools.net

Special thanks for this article, and you can see the post at below link;

http://www.etltools.net/etl-tools-comparison.html

Why Data Integration ? – The Importance of Data Integration

Almost every Chief Information Officer (CIO) has the goal of integrating their organization’s data. In fact the issue of data integration has risen all the way to the Chief Financial Officer
(CFO) and Chief Executive Officer (CEO) level of a corporation. A key question is why is data integration becoming so important to so many C-level executives? There are several key reasons driving
this desire:

  • Provide IT Portfolio Management
  • Reduce IT Redundancy
  • Prevent IT Applications Failure

 
Provide IT Portfolio Management

Over the years I have had the opportunity to perform dozens of data warehousing assessments. During these assessments I always ask the client how much they spend annually on data warehousing. The
majority of companies and government organizations cannot give a relatively good estimate on what they actually spend. In order to manage these and any other costly information technology (IT)
initiatives it is critical to measure each one of them. However, it is impossible to measure them when most companies do not understand them (see Figure 1: “How To Manage IT”). This is
where IT Portfolio Management enters the picture.

 

i029fe0401

 

Figure 1: How To Manage IT

 

IT portfolio management refers to the formal process for managing IT assets. An IT asset is software, hardware, middleware, IT projects, internal staff, applications and external consulting. Like
every newer discipline, many companies that have started their IT portfolio management efforts have not done so correctly. I would like to list out some of the keys to building successful IT
portfolio management applications.

By properly managing their IT portfolio it allows the corporation to see which projects are proceeding well and which are lagging behind. In my experience, almost every large company has a great
deal of duplicate IT effort occurring (see later section on “Reduce IT Redundancy”). This happens because the meta data is not accessible. At my company we have a couple of large
clients whose primary goal is to remove these tremendous redundancies, which translates into tremendous initial and ongoing IT costs.
Reduce IT Redundancy

CIO is commonly defined as Chief Information Officer; however, there is another possible meaning to this acronym; Career Is Over. One of the chief reasons for this is that most IT departments are
“handcuffed” in needless IT redundancy that too few CIOs are willing and capable of fixing.

There are several CIO surveys that are conducted annually. These surveys ask “what are your top concerns for the upcoming year”. Regardless of the survey you look at “data
integration” will be high on the list. Now data integration has two facets to it. One is the integration of data across disparate systems for enterprise applications. The second is the
integration/removal of IT redundancies. Please understand that some IT redundancy is a good thing. For example, when there is a power outage and one of your data centers is non-operational you need
to have a backup of these systems/data. However, when I talk about IT redundancies I am addressing “needless” IT redundancy. Meaning, IT redundancy that only exists because of
insufficient management of our IT systems. I was working with a Midwestern insurance company that, over a four year span had initiated various decision support efforts. After this four year period
they took the time to map out the flow of data from their operational systems, to their data staging areas and finally to their data mart structures. What they discovered was Figure 2:
“Typical IT Architecture”.

i029fe0402

Figure 2: Typical IT Architecture

What is enlightening about Figure 2 is that when I show this illustration during a client meeting or at a conference keynote address the typical response that I receive from the people is
“Where did you get a copy of our IT architecture?” If you work at a Global 2000 company or any large government entity, Figure 2 represents an overly simplified version of your IT
architecture. These poor architecture habits create a litany of problems including:

  • Redundant Applications/Processes/Data
  • Needless IT Rework
  • Redundant Hardware/Software

Redundant Applications/Processes/Data

It has been my experience working with large government agencies and Global 2000 companies that needlessly duplicate data is running rampant throughout our industry. In my experience the typical
large organization has between 3 – 4 fold needless data redundancy. Moreover, I can name multiple organizations that have literally hundreds of “independent” data mart
applications spread all over the company. Each one of these data marts is duplicating the extraction, transformation and load (ETL) that is typically done centrally in a data warehouse. This
greatly increases the number of support staff required to maintain the data warehousing system as these tasks are the largest and most costly data warehousing activities. Besides duplicating this
process, each data mart will also copy the data as well requiring further IT resources. It is easy to see why IT budgets are straining under the weight of all of this needless redundancy.

Needless IT Rework

During the requirements gathering portion of one of our meta data management initiatives I had an IT project manager discuss the challenges that he is facing in analyzing one of the
mission-critical legacy applications that will feed the data warehousing application that his team has been tasked to build. During our interview he stated, “This has to be the twentieth time
that our organization is analyzing this system to understand the business rules around the data.” This person’s story is an all too common one as almost all organizations reinvent the
IT wheel on every project. This situation occurs because usually separate teams will typically build each of the IT systems and since they don’t have a Managed Meta Data Environment (MME),
these teams do not leverage the other’s standards, processes, knowledge, and lessons learned. This results in a great deal of rework and reanalysis.

Redundant Hardware/Software

I have discussed a great deal about the redundant application and IT work that occurs in the industry. All of this redundancy also generates a great deal of needless hardware and software
redundancy. This situation forces the enterprise to retain skilled employees to support each of these technologies. In addition, a great deal of financial savings is lost, as standardization on
these tools doesn’t occur. Often a software, hardware, or tool contract can be negotiated to provide considerable discounts for enterprise licenses, which can be phased into. These economies
of scale can provide tremendous cost savings to the organization.

In addition, the hardware and software that is purchased is not used in an optimal fashion. For example, I have a client that has each one of their individual IT projects buy their own hardware. As
a result, they are infamous for having a bunch of servers running at 25% capacity.

From the software perspective the problem only gets worse. While analyzing a client of mine I had asked their IT project leaders what software vendors have you standardized on? They answered
“all of them!” This leads to the old joke “What is the most popular form of software on the market? Answer…Shelfware!” Shelfware is software that a company purchases
and winds up never using and it just sits on the shelf collecting dust.

Prevent IT Applications Failure

When a corporation looks to undertake a major IT initiative, like a customer relationship management (CRM), enterprise resource planning (ERP), data warehouse, or e-commerce solution their
likelihood of project failure is between 65% – 80%, depending on the study referenced. This is especially alarming when we consider that these same initiatives traditionally have executive
management support and cost many millions of dollars. For example, I have one large client that is looking to roll out a CRM system (e.g. Siebel, Oracle) and an ERP system (e.g. SAP, PeopleSoft)
globally in the next four years. Their initial project budget is over $125 million! In my opinion they have a 0% probability of delivering all of these systems on-time and on-budget. Consider this,
when was that last time that you’ve seen an ERP or CRM initiative being delivered on time or on budget?

When we examine the causes for these projects failure several themes become apparent. First, these projects did not address a definable and measurable business need. This is the number one reason
for project failure, data warehouse, CRM, MME, or otherwise. As IT professionals we must always be looking to solve business problems or capture business opportunities. Second, the projects that
fail have a very difficult time understanding their company’s existing IT environment and business rules. This includes custom applications, vendor applications, data elements, entities, data
flows, data heritage and data lineage.

MME’s Focus On Data Integration

Many of these Global 2000 companies and large government organizations are targeting MME technology to assist them in identifying and removing existing application and data redundancy. Moreover,
many companies are actively using their MME to identify redundant applications through analysis of the data. These same companies are starting IT application integration projects to merge these
overlapping systems and to ensure that future IT applications do not proliferate needless redundancy.

If your organization can reduce their applications, processes, data, software and hardware, lowers the likelihood for IT project failure and speeds up the IT development life-cycle, then clearly it
will greatly reduce a company’s IT expenditures. For example, I have a large banking client that asked my company to analyze their IT environment. During this analysis we discovered that they
have a tremendous amount of application and data redundancy. Moreover, I had figured out that they have over 700 unique applications. I then compared this client to a bank that is more than twice
there size; however, this larger bank has a world class MME and uses it to properly manage their systems. As a result, they have less than 250 unique applications. Clearly the bank with more than
700 applications has a great deal of needless redundancy as compared to a bank that is more than twice their size and has less than 250 applications. Interestingly enough the bank that has less
than 250 applications and has a world-class MME is also 14 times more profitable than the bank maintaining over 700 applications. It doesn’t seem like a very far stretch to see that the less
profitable bank would become much more profitable if they removed this redundancy.

I used referenced web site : tdan.com

We have a referrer link, for original article of this post, if you want you can follow TDAN (The Data Administration Newsletter)

Special thanks for this tdan comunity, and you can see the post at below link;

http://tdan.com/the-importance-of-data-integration/5198