How to get data from SQL Server to Elasticsearch using LogStash

As a developer working with SQL Server there was a need to import data from the database to Elasticsearch and analyze data in Kibana.

As Elasticsearch is an open-source project built with Java and handles most other open-source projects, documentation on importing data from SQL Server to ES using LogStash.

I’d like to share how to import SQL Server data to Elasticsearch (version 6.2) using LS and verify the result on Kibana.

Assumption

I will skip on installing ELK (ElasticSearch, LogStash, and Kibana) stack as it’s outside the scope of this article.
Please refer to installation steps on Elastic download pages.

Overview

Here are the steps required to import SQL Server data to Elasticsearch.

  1. Install Java Development Kit (JDK)
  2. Install JDBC Driver for SQL Server
  3. Set CLASSPATH for the driver
  4. Create an Elasticsearch Index to Import Data to
  5. Configure LogStash configuration file
  6. Run LogStash
  7. Verify in Kibana

Step 1 – Install Java SE Development Kit 8

One of the gotchas is that you might install the latest version of JDK, which is version 9 but Elasticsearch documentation requires you to install JDK 8.

At the time of writing, the latest JDK 8 version is 8u162, which can be downloaded here.

Download “JDK8 8u162” and install it on your machine and make sure that “java” is in the PATH variable so that it can be called in any directory within a command line.

Step 2 – Install JDBC Driver for SQL Server

You need to download and install Microsoft JDBC Driver 4.2 for SQL Server, not the latest version.

As Elasticsearch is built with JDK 8, you can’t use the latest version of JDBC Driver (version 6.2) for SQL Server as it does not support JDK 8.

Step 3 – Set CLASSPATH for the JDBC Driver

We need to set the path so that Java can find the JDBC driver.

? Note: I am working on Windows 10 machine.

1. Go to the directory under which you have installed SQL Server JDBC.

2. Now you need to navigate to find a JAR file named sqljdbc42.jar, which is found under<<JDBC installation folder>>\sqljdbc_4.2\enu\jre8

3. And then copy the full path to the JAR file.

A cool trick on Windows 7/8/10 is that, when shift+right click on a file, it gives you a “Copy as Path” option.

4. Go to Windows Start button and type “environment” and click on “Edit the system environment variables”.

5. Add a CLASSPATH environment variable with following values (if you don’t already have one).

  1. “.” – for the current directory to search.
  2. And the JAR file path copied in previously (e.g. “C:\talih\Java\MicrosoftJDBCDriversSQLServer\sqljdbc_4.2\enu\jre8\sqljdbc42.jar”).

Gotcha: If you have a space in the path for JDBC JAR file, make sure to put double quotes around it.

Not doing so will result in either of following error messages when you start LogStash service in later step. 

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class JDBC

 - Or -

c:\talih\elasticco\logstash-6.2.2>bin\logstash -f sql.conf

Error: Could not find or load main class File\Microsoft

Let’s now move onto to create an Elasticsearch index to import data to.

Step 4 – Create an Elasticsearch Index to Import Data to

You can use cURL or Postman to create an Index but I will use Kibana console to create an index named “cs_users”, which is equivalent to a database in relational database terminology.

Before we start the Kibana service, we need to start Elasticsearch so that Kibana would not whine about Elasticsearch not being present.

Kibana warnings on lines 12~21 due to Elasticsearch being unavailable

Go to the Elasticsearch installation and start the service.

talih@CC c:\talih\elasticco\elasticsearch-6.2.2
> bin\elasticsearch.bat

And then go to the Kibana installation directory to start Kibana service.

talih@CC c:\talih\elasticco\kibana-6.2.2-windows-x86_64 
> bin\kibana.bat

If Kibana started without an issue, you will see an output similar to the following.

Kibana started successfully

On line 9, Kibana reports that it is running on http://localhost:5601.
Open the URL in a browser of your choice.

Now go to “Dev Tools” link on the bottom left of the page.

Click on Kibana Dev Tools Link

Once you see the Console, create a new index with the following command.

PUT cs_users
{
        "settings" : {
              "index" : {
                      "number_of_shards" : 3,
                      "number_of_replicas" : 0
              }
        }
}

on the left panel of the Kibana Dev Tools Console.

Create a new Elasticsearch index named “cs_users”

I won’t go into details on “shards” and “replicas” since it’s outside the scope of this article. For more information on the syntax, refer to the official Elasticsearch documentation.

And you will see the response from Elasticsearch with index creation confirmation on the panel right.

A new index “cs_users” is created on Elasticsearch successfully

OK, now we are finally ready to move onto creating a configuration file for LogStash to actually import data.

Step 5 – Configure LogStash configuration file

Go to the LogStash installation folder and create a file named “sql.conf” (name doesn’t really matter).
Here is the LogStash configuration I will be using.

input {
  jdbc {
    jdbc_connection_string => "jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"
    jdbc_driver_class => "com.microsoft.sqlserver.jdbc.SQLServerDriver"
    jdbc_user => "xxx"

    statement => "SELECT * FROM Users"
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

Let me break down “input” and “output” configurations.

Input

There are three required fields you need to specify for “jdbc” input plugin.

jdbc_connection_string – This field instructs LogStash information on SQL Server.

"jdbc:sqlserver://cc:1433;databaseName=StackExchangeCS;integratedSecurity=true;"

Elasticsearch will connect to the server named “cc” running on port 1433 to connect to a database named “StackExchangeCS” with integrated security authentication method.

jdbc_driver_class – This is the driver class contained within the JDBC JAR file.
The JDBC JAR file contains a driver of type “com.microsoft.sqlserver.jdbc.SQLServerDriver” according to the documentation.

If you have an inquisitive mind, you can confirm it by opening the JAR file with your choice of ZIP program as JAR is a simple ZIP file.

Unzip JAR to verify JDBC driver name

jdbc_user – If you are using “Integrated Security” as an authentication option, this can be any string (I just entered “xxx” since that’s the easiest thing I can type ?).

Output

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "cs_users"
  }
}

SQL Server data (all cs.stackexchange.com users) will be sent to Elasticsearch running on the local machine port 9200 and will be indexed under “cs_users” index created in “Step 4 – Create an Elasticsearch Index to Import Data to”.
There are quite a bit of Elasticsearch configuration options so please refer to the official LogStash documentation for more “elasticsearch” output plugin options.

Step 6 – Import Data with LogStash

With prerequisites out of the way, we are now ready to import data to Elasticsearch from SQL Server.
Go to the LogStash installation location under which you should have created “sql.conf” and run LogStash service.

bin\logstash -f sql.conf

-f flag specifies the configuration file to use.
In our case, “sql.conf” we created in the previous step.

The result of successful LogStash run will look similar to the following output.

Step 7 – Verify in Kibana

Wow, we have finally imported data. Now let’s do a quick check whether the number of records in the database matches the records in Elasticsearch.

Verifying result of data import

“User” table in the SQL Server has 59394 records and Elasticsearch returns the same number as well.
? Note: You can use following command to get the list of all records in “cs_users” index.

GET cs_users/_count

For more information on how “_count” works, refer to Count API documentation.

Conclusion

Congratulations for getting this far ???.

How To Install and Configure Elasticsearch on Ubuntu 16.04 + Bonus (Nifi ^^)

Step 1 — Downloading and Installing Elasticsearch

Elasticsearch can be downloaded directly from elastic.co in ziptar.gzdeb, or rpm packages. For Ubuntu, it’s best to use the deb (Debian) package which will install everything you need to run Elasticsearch.

First, update your package index.

sudo apt-get update

Download the latest Elasticsearch version, which is 2.3.1 at the time of writing.

wget https://download.elastic.co/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.3.1/elasticsearch-2.3.1.deb

Then install it in the usual Ubuntu way with dpkg.

sudo dpkg -i elasticsearch-2.3.1.deb

This results in Elasticsearch being installed in /usr/share/elasticsearch/ with its configuration files placed in /etc/elasticsearch and its init script added in /etc/init.d/elasticsearch.

To make sure Elasticsearch starts and stops automatically with the server, add its init script to the default runlevels.

sudo systemctl enable elasticsearch.service

Before starting Elasticsearch for the first time, please check the next section about the recommended minimum configuration.

Step 2 — Configuring Elasticsearch

Now that Elasticsearch and its Java dependencies have been installed, it is time to configure Elasticsearch. The Elasticsearch configuration files are in the /etc/elasticsearch directory. There are two files:

  • elasticsearch.yml configures the Elasticsearch server settings. This is where all options, except those for logging, are stored, which is why we are mostly interested in this file.
  • logging.yml provides configuration for logging. In the beginning, you don’t have to edit this file. You can leave all default logging options. You can find the resulting logs in /var/log/elasticsearch by default.

The first variables to customize on any Elasticsearch server are node.name and cluster.name in elasticsearch.yml. As their names suggest, node.name specifies the name of the server (node) and the cluster to which the latter is associated.

If you don’t customize these variable, a node.name will be assigned automatically in respect to the Droplet hostname. The cluster.name will be automatically set to the name of the default cluster.

The cluster.name value is used by the auto-discovery feature of Elasticsearch to automatically discover and associate Elasticsearch nodes to a cluster. Thus, if you don’t change the default value, you might have unwanted nodes, found on the same network, in your cluster.

To start editing the main elasticsearch.yml configuration file with nano or your favorite text editor.

sudo nano /etc/elasticsearch/elasticsearch.yml

Remove the # character at the beginning of the lines for cluster.name and node.name to uncomment them, and then update their values. Your first configuration changes in the /etc/elasticsearch/elasticsearch.yml file should look like this:

/etc/elasticsearch/elasticsearch.yml

. . .
cluster.name: mycluster1
node.name: "My First Node"
. . .

These the minimum settings you can start with using Elasticsearch. However, it’s recommended to continue reading the configuration part for more thorough understanding and fine-tuning of Elasticsearch.

One especially important setting of Elasticsearch is the role of the server, which is either master or slave. Master servers are responsible for the cluster health and stability. In large deployments with a lot of cluster nodes, it’s recommended to have more than one dedicated master. Typically, a dedicated master will not store data or create indexes. Thus, there should be no chance of being overloaded, by which the cluster health could be endangered.

Slave servers are used as workhorses which can be loaded with data tasks. Even if a slave node is overloaded, the cluster health shouldn’t be affected seriously, provided there are other nodes to take additional load.

The setting which determines the role of the server is called node.master. By default, a node is a master. If you have only one Elasticsearch node, you should leave this option to the default true value because at least one master is always needed. Alternatively, if you wish to configure the node as a slave, assign a false value to the variable node.master like this:/etc/elasticsearch/elasticsearch.yml

. . .
node.master: false
. . .

Another important configuration option is node.data, which determines whether a node will store data or not. In most cases this option should be left to its default value (true), but there are two cases in which you might wish not to store data on a node. One is when the node is a dedicated master” as previously mentioned. The other is when a node is used only for fetching data from nodes and aggregating results. In the latter case the node will act up as a search load balancer.

Again, if you have only one Elasticsearch node, you should not change this value. Otherwise, to disable storing data locally, specify node.data as false like this:/etc/elasticsearch/elasticsearch.yml

. . .
node.data: false
. . .

In larger Elasticsearch deployments with many nodes, two other important options are index.number_of_shards and index.number_of_replicas. The first determines how many pieces, or shards, the index will be split into. The second defines the number of replicas which will be distributed across the cluster. Having more shards improves the indexing performance, while having more replicas makes searching faster.

By default, the number of shards is 5 and the number of replicas is 1. Assuming that you are still exploring and testing Elasticsearch on a single node, you can start with only one shard and no replicas. Thus, their values should be set like this:/etc/elasticsearch/elasticsearch.yml

. . .
index.number_of_shards: 1
index.number_of_replicas: 0
. . .

One final setting which you might be interested in changing is path.data, which determines the path where data is stored. The default path is /var/lib/elasticsearch. In a production environment, it’s recommended that you use a dedicated partition and mount point for storing Elasticsearch data. In the best case, this dedicated partition will be a separate storage media which will provide better performance and data isolation. You can specify a different path.data path by specifying it like this:/etc/elasticsearch/elasticsearch.yml

. . .
path.data: /media/different_media
. . .

Once you make all the changes, save and exit the file. Now you can start Elasticsearch for the first time.

sudo systemctl start elasticsearch

Give Elasticsearch a few to fully start before you try to use it. Otherwise, you may get errors about not being able to connect.

Step 3 — Securing Elasticsearch

By default, Elasticsearch has no built-in security and can be controlled by anyone who can access the HTTP API. This is not always a security risk because Elasticsearch listens only on the loopback interface (i.e., 127.0.0.1) which can be accessed only locally. Thus, no public access is possible and your Elasticsearch is secure enough as long as all server users are trusted or this is a dedicated Elasticsearch server.

Still, if you wish to harden the security, the first thing to do is to enable authentication. Authentication is provided by the commercial Shield plugin. Unfortunately, this plugin is not free but there is a free 30 day trial you can use to test it. Its official page has excellent installation and configuration instructions. The only thing you may need to know in addition is that the path to the Elasticsearch plugin installation manager is /usr/share/elasticsearch/bin/plugin.

If you don’t want to use the commercial plugin but you still have to allow remote access to the HTTP API, you can at least limit the network exposure with Ubuntu’s default firewall, UFW (Uncomplicated Firewall). By default, UFW is installed but not enabled. If you decide to use it, follow these steps:

First, create a rule to allow any needed services. You will need at least SSH allowed so that you can log in the server. To allow world-wide access to SSH, whitelist port 22.

sudo ufw allow 22

Then allow access to the default Elasticsearch HTTP API port (TCP 9200) for the trusted remote host, e.g.TRUSTED_IP, like this:

sudo ufw allow from TRUSTED_IP to any port 9200

Only after that enable UFW with the command:

sudo ufw enable

Finally, check the status of UFW with the following command:

sudo ufw status

If you have specified the rules correctly, the output should look like this:

Output of java -versionStatus: active

To                         Action      From
--                         ------      ----
9200                       ALLOW       TRUSTED_IP
22                         ALLOW       Anywhere
22 (v6)                    ALLOW       Anywhere (v6)

Once you have confirmed UFW is enabled and protecting Elasticsearch port 9200, then you can allow Elasticsearch to listen for external connections. To do this, open the elasticsearch.yml configuration file again.

sudo nano /etc/elasticsearch/elasticsearch.yml

Find the line that contains network.bind_host, uncomment it by removing the # character at the beginning of the line, and change the value to 0.0.0.0 so it looks like this:/etc/elasticsearch/elasticsearch.yml

. . .
network.host: 0.0.0.0
. . .

We have specified 0.0.0.0 so that Elasticsearch listens on all interfaces and bound IPs. If you want it to listen only on a specific interface, you can specify its IP in place of 0.0.0.0.

To make the above setting take effect, restart Elasticsearch with the command:

sudo systemctl restart elasticsearch

After that try to connect from the trusted host to Elasticsearch. If you cannot connect, make sure that the UFW is working and the network.host variable has been correctly specified.

Step 4 — Testing Elasticsearch

By now, Elasticsearch should be running on port 9200. You can test it with curl, the command line client-side URL transfers tool and a simple GET request.

curl -X GET 'http://localhost:9200'

You should see the following response:

Output of curl{
  "name" : "My First Node",
  "cluster_name" : "mycluster1",
  "version" : {
    "number" : "2.3.1",
    "build_hash" : "bd980929010aef404e7cb0843e61d0665269fc39",
    "build_timestamp" : "2016-04-04T12:25:05Z",
    "build_snapshot" : false,
    "lucene_version" : "5.5.0"
  },
  "tagline" : "You Know, for Search"
}

If you see a response similar to the one above, Elasticsearch is working properly. If not, make sure that you have followed correctly the installation instructions and you have allowed some time for Elasticsearch to fully start.

To perform a more thorough check of Elasticsearch execute the following command:

curl -XGET 'http://localhost:9200/_nodes?pretty'

In the output from the above command you can see and verify all the current settings for the node, cluster, application paths, modules, etc.

Step 5 — Using Elasticsearch

To start using Elasticsearch, let’s add some data first. As already mentioned, Elasticsearch uses a RESTful API, which responds to the usual CRUD commands: create, read, update, and delete. For working with it, we’ll use again curl.

You can add your first entry with the command:

curl -X POST 'http://localhost:9200/tutorial/helloworld/1' -d '{ "message": "Hello World!" }'

You should see the following response:

Output{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"_shards":{"total":2,"successful":1,"failed":0},"created":true}

With cuel, we have sent an HTTP POST request to the Elasticsearch server. The URI of the request was /tutorial/helloworld/1 with several parameters:

  • tutorial is the index of the data in Elasticsearch.
  • helloworld is the type.
  • 1 is the id of our entry under the above index and type.

You can retrieve this first entry with an HTTP GET request.

curl -X GET 'http://localhost:9200/tutorial/helloworld/1'

The result should look like:

Output{"_index":"tutorial","_type":"helloworld","_id":"1","_version":1,"found":true,"_source":{ "message": "Hello World!" }}

To modify an existing entry, you can use an HTTP PUT request.

curl -X PUT 'localhost:9200/tutorial/helloworld/1?pretty' -d '
{
  "message": "Hello People!"
}'

Elasticsearch should acknowledge successful modification like this:

Output{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "created" : false
}

In the above example we have modified the message of the first entry to “Hello People!”. With that, the version number has been automatically increased to 2.

You may have noticed the extra argument pretty in the above request. It enables human readable format so that you can write each data field on a new row. You can also “prettify” your results when retrieving data and get much nicer output like this:

curl -X GET 'http://localhost:9200/tutorial/helloworld/1?pretty'

Now the response will be in a much better format:

Output{
  "_index" : "tutorial",
  "_type" : "helloworld",
  "_id" : "1",
  "_version" : 2,
  "found" : true,
  "_source" : {
    "message" : "Hello People!"
  }
}

So far we have added to and queried data in Elasticsearch. To learn about the other operations please check the API documentation.

Last Step – Get Data from SQL to Elastich With Nifi

Conclusion

That’s how easy it is to install, configure, and begin using Elasticsearch. Once you have played enough with manual queries, your next task will be to start using it from your applications.

Install MongoDB 4.0.5 on Ubuntu 16.04

Add the key: (without the key, the repository will not load)

1 – sudo apt-key adv –keyserver hkp://keyserver.ubuntu.com:80 –recv E52529D4

Now, create a new MongoDB repository list file:

2 – sudo bash -c ‘echo “deb http://repo.mongodb.org/apt/ubuntu xenial/mongodb-org/4.0 multiverse” > /etc/apt/sources.list.d/mongodb-org-4.0.list’

Complete the installation with update of repositories then install:

3 – sudo apt update
sudo apt install mongodb-org

Enable the mongod service and start it up:

4 – systemctl enable mongod.service
systemctl start mongod.service

Check your mongodb version:

5 – ~$ mongo –version
MongoDB shell version v4.0.5
git version: 3739429dd92b92d1b0ab120911a23d50bf03c412
OpenSSL version: OpenSSL 1.0.2g 1 Mar 2016
allocator: tcmalloc
modules: none
build environment:
distmod: ubuntu1604
distarch: x86_64
target_arch: x86_64

Check if the service is running:

6 – systemctl status mongod.service
mongod.service – MongoDB Database Server
Loaded: loaded (/lib/systemd/system/mongod.service; enabled; vendor preset: enabled)
Active: active (running) since Mon 2019-01-21 00:33:51 MST; 7s ago
Docs: https://docs.mongodb.org/manual
Main PID: 2906 (mongod)
CGroup: /system.slice/mongod.service
└─2906 /usr/bin/mongod –config /etc/mongod.conf

Why mysql is not used on big data?

Hi everyone,

I try to check database management system which is most used or which is most valueable in the world. I found a website like compare the databases for general structure.

I compared them on a link, you can check below, but i didnt understood to why mysql is not used in big data. If anyone can answer this question, you can contact me on about-contact page.

http://db-engines.com/en/system/DB2%3BMicrosoft+SQL+Server%3BMySQL%3BOracle%3BPostgreSQL

By the way, if you want to follow rank of databases you can catch on below link;

http://db-engines.com/en/ranking/relational+dbms

Cheers,

Talih Bayram

Different Types of Databases Top 12

1.0 Relational Databases

This is the most common of all the different types of databases. In this, the data in a relational database is stored in various data tables. Each table has a key field which is used to connect it to other tables. Hence all the tables are related to each other through several key fields. These databases are extensively used in various industries and will be the one you are most likely to come across when working in IT.

Examples of relational databases are Oracle, Sybase and Microsoft SQL Server and they are often key parts of the process of software development. Hence you should ensure you include any work required on the database as part of your project when creating a project plan and estimating project costs.

2.0 Operational Databases

In its day to day operation, an organisation generates a huge amount of data. Think of things such as inventory management, purchases, transactions and financials. All this data is collected in a database which is often known by several names such as operational/ production database, subject-area database (SADB) or transaction databases.

An operational database is usually hugely important to Organisations as they include the customer database, personal database and inventory database ie the details of how much of a product the company has as well as information on the customers who buy them. The data stored in operational databases can be changed and manipulated depending on what the company requires.

3.0 Database Warehouses

Organisations are required to keep all relevant data for several years. In the UK it can be as long as 6 years. This data is also an important source of information for analysing and comparing the current year data with that of the past years which also makes it easier to determine key trends taking place. All this data from previous years are stored in a database warehouse. Since the data stored has gone through all kinds of screening, editing and integration it does not need any further editing or alteration.

With this database ensure that the software requirements specification (SRS) is formally approved as part of the project quality plan.

4.0 Distributed Databases

Many organisations have several office locations, manufacturing plants, regional offices, branch offices and a head office at different geographic locations. Each of these work groups may have their own database which together will form the main database of the company. This is known as a distributed database.

5.0 End-User Databases

There is a variety of data available at the workstation of all the end users of any organisation. Each workstation is like a small database in itself which includes data in spreadsheets, presentations, word files, note pads and downloaded files. All such small databases form a different type of database called the end-user database.

6.0 External Database

There is a sea of information available outside world which is required by an organisation. They are privately-owned data for which one can have conditional and limited access for a fortune. This data is meant for commercial usage. All such databases outside the organisation which are of use and limited access are together called external database.

7.0 Hypermedia Database

Most websites have various interconnected multimedia pages which might include text, video clips, audio clips, photographs and graphics. These all need to be stored and “called” from somewhere when the webpage if created. All of them together form the hypermedia database.

Please note that if you are creating such a database from scratch to be generous when creating a project plan, detailed when defining the business requirements documentation (BRD) and meticulous in your project cost controls. I have seen too many projects where the creation of one of these databases has caused scope creep and an out of control budget for a project.

8.0 Navigational Database

Navigational database has all the items which are references from other objects. In this, one has to navigate from one reference to other or one object to other. It might be using modern systems like XPath. One of its applications is the air flight management systems.

9.0 In-Memory Database

An in-memory databases stores data in a computer’s main memory instead of using a disk-based storage system. It is faster and more reliable than that in a disk. They find their application in telecommunications network equipments.

10.0 Document-Oriented Database

A document oriented database is a different type of database which is used in applications which are document oriented. The data is stored in the form of text records instead of being stored in a data table as usually happens.

11.0 Real-Time Database

A real-time database handles data which constantly keep on changing. An example of this is a stock market database where the value of shares change every minute and need to be updated in the real-time database. This type of database is also used in medical and scientific analysis, banking, accounting, process control, reservation systems etc. Essentially anything which requires access to fast moving and constantly changing information.

Assume that this will require much more time than a normal relational database when it comes to the software testing life cycle, as these are much more complicated to efficiently test within normal time frames.

12.0 Analytical Database

An analytical database is used to store information from different types of databases such as selected operational databases and external databases. Other names given to analytical databases are information databases, management databases or multi-dimensional databases. The data stored in an analytical database is used by the management for analysis purposes, hence the name. The data in an analytical database cannot be changed or manipulated.

Different Types of Databases Top 12 – Tip

Of the different types of databases, relational is the most common and includes such well known names as Oracle, No-SQL, Couchbase, Hadoop, Sybase and SQL Server. However as a project manager you need to be prepared for anything, hence why having a high level view of the different databases is useful particularly when managing a software development life cycle. Regarding the remainder, you will hear a great deal about database warehouses. This is a highly specialized area which involves mining the data produced to generate meaningful trends and reports for senior management to act upon.

I used referenced web site for this post : my-project-management-expert.com

We have a referrer link, for original article of this post, if you want you can follow my-project-management-expert.com

Special thanks for my-project-management-expert team, for this article, and you can see the post at below link;

http://www.my-project-management-expert.com/different-types-of-databases.html

What are the different types of databases?

QUICK ANSWER

The different types of databases include operational databases, end-user databases, distributed databases, analytical databases, relational databases, hierarchical databases and database models. Databases are classified according to their type of content, application area and technical aspect. For instance, a deductive database combines logic programming with a relational database, while a graph database uses graph structures to represent and store information.

FULL ANSWER

Other types of databases include hypertext databases, mobile databases, parallel databases, active databases, cloud databases, in-memory databases, spatial databases, temporal databases, real-time databases, probabilistic databases and embedded databases.

A database is an organized collection of data. Its primary function is to interact with a database management system to capture and analyze data. A database management system is a software system designed to allow the creation, querying and administration of databases. Some popular database management systems include PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2 and SAP.

Databases are designed to operate large amounts of information by inputting, storing, retrieving and managing it. They are set up in a way that allows users to easily and intuitively gain access to all the information. A database management maintains the integrity and security of stored data. It is also used for data recovery, in case of system failure.

 

What is Database (DB) ?

(1) Often abbreviated DB, a database is basically a collection of information organized in such a way that a computer program can quickly select desired pieces of data. You can think of a database as an electronic filing system.

Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.

DATABASE

An alternative concept in database design is known as Hypertext. In a Hypertext database, any object, whether it be a piece of text, a picture, or a film, can be linked to any other object. Hypertext databases are particularly useful for organizing large amounts of disparate information, but they are not designed for numerical analysis.

To access information from a database, you need a database management system (DBMS). This is a collection of programs that enables you to enter, organize, and select data in a database.

(2) Increasingly, the term database is used as shorthand for database management system. There are many different types of DBMSs, ranging from small systems that run on personal computers to huge systems that run on mainframes.

I used referenced web site for this post : webopedia.com

We have a referrer link, for original article of this post, if you want you can follow webopedia.com

Special thanks for webopedia team, for this article, and you can see the post at below link;

http://www.webopedia.com/TERM/D/database.html

Why Data Integration ? – The Importance of Data Integration

Almost every Chief Information Officer (CIO) has the goal of integrating their organization’s data. In fact the issue of data integration has risen all the way to the Chief Financial Officer
(CFO) and Chief Executive Officer (CEO) level of a corporation. A key question is why is data integration becoming so important to so many C-level executives? There are several key reasons driving
this desire:

  • Provide IT Portfolio Management
  • Reduce IT Redundancy
  • Prevent IT Applications Failure

 
Provide IT Portfolio Management

Over the years I have had the opportunity to perform dozens of data warehousing assessments. During these assessments I always ask the client how much they spend annually on data warehousing. The
majority of companies and government organizations cannot give a relatively good estimate on what they actually spend. In order to manage these and any other costly information technology (IT)
initiatives it is critical to measure each one of them. However, it is impossible to measure them when most companies do not understand them (see Figure 1: “How To Manage IT”). This is
where IT Portfolio Management enters the picture.

 

i029fe0401

 

Figure 1: How To Manage IT

 

IT portfolio management refers to the formal process for managing IT assets. An IT asset is software, hardware, middleware, IT projects, internal staff, applications and external consulting. Like
every newer discipline, many companies that have started their IT portfolio management efforts have not done so correctly. I would like to list out some of the keys to building successful IT
portfolio management applications.

By properly managing their IT portfolio it allows the corporation to see which projects are proceeding well and which are lagging behind. In my experience, almost every large company has a great
deal of duplicate IT effort occurring (see later section on “Reduce IT Redundancy”). This happens because the meta data is not accessible. At my company we have a couple of large
clients whose primary goal is to remove these tremendous redundancies, which translates into tremendous initial and ongoing IT costs.
Reduce IT Redundancy

CIO is commonly defined as Chief Information Officer; however, there is another possible meaning to this acronym; Career Is Over. One of the chief reasons for this is that most IT departments are
“handcuffed” in needless IT redundancy that too few CIOs are willing and capable of fixing.

There are several CIO surveys that are conducted annually. These surveys ask “what are your top concerns for the upcoming year”. Regardless of the survey you look at “data
integration” will be high on the list. Now data integration has two facets to it. One is the integration of data across disparate systems for enterprise applications. The second is the
integration/removal of IT redundancies. Please understand that some IT redundancy is a good thing. For example, when there is a power outage and one of your data centers is non-operational you need
to have a backup of these systems/data. However, when I talk about IT redundancies I am addressing “needless” IT redundancy. Meaning, IT redundancy that only exists because of
insufficient management of our IT systems. I was working with a Midwestern insurance company that, over a four year span had initiated various decision support efforts. After this four year period
they took the time to map out the flow of data from their operational systems, to their data staging areas and finally to their data mart structures. What they discovered was Figure 2:
“Typical IT Architecture”.

i029fe0402

Figure 2: Typical IT Architecture

What is enlightening about Figure 2 is that when I show this illustration during a client meeting or at a conference keynote address the typical response that I receive from the people is
“Where did you get a copy of our IT architecture?” If you work at a Global 2000 company or any large government entity, Figure 2 represents an overly simplified version of your IT
architecture. These poor architecture habits create a litany of problems including:

  • Redundant Applications/Processes/Data
  • Needless IT Rework
  • Redundant Hardware/Software

Redundant Applications/Processes/Data

It has been my experience working with large government agencies and Global 2000 companies that needlessly duplicate data is running rampant throughout our industry. In my experience the typical
large organization has between 3 – 4 fold needless data redundancy. Moreover, I can name multiple organizations that have literally hundreds of “independent” data mart
applications spread all over the company. Each one of these data marts is duplicating the extraction, transformation and load (ETL) that is typically done centrally in a data warehouse. This
greatly increases the number of support staff required to maintain the data warehousing system as these tasks are the largest and most costly data warehousing activities. Besides duplicating this
process, each data mart will also copy the data as well requiring further IT resources. It is easy to see why IT budgets are straining under the weight of all of this needless redundancy.

Needless IT Rework

During the requirements gathering portion of one of our meta data management initiatives I had an IT project manager discuss the challenges that he is facing in analyzing one of the
mission-critical legacy applications that will feed the data warehousing application that his team has been tasked to build. During our interview he stated, “This has to be the twentieth time
that our organization is analyzing this system to understand the business rules around the data.” This person’s story is an all too common one as almost all organizations reinvent the
IT wheel on every project. This situation occurs because usually separate teams will typically build each of the IT systems and since they don’t have a Managed Meta Data Environment (MME),
these teams do not leverage the other’s standards, processes, knowledge, and lessons learned. This results in a great deal of rework and reanalysis.

Redundant Hardware/Software

I have discussed a great deal about the redundant application and IT work that occurs in the industry. All of this redundancy also generates a great deal of needless hardware and software
redundancy. This situation forces the enterprise to retain skilled employees to support each of these technologies. In addition, a great deal of financial savings is lost, as standardization on
these tools doesn’t occur. Often a software, hardware, or tool contract can be negotiated to provide considerable discounts for enterprise licenses, which can be phased into. These economies
of scale can provide tremendous cost savings to the organization.

In addition, the hardware and software that is purchased is not used in an optimal fashion. For example, I have a client that has each one of their individual IT projects buy their own hardware. As
a result, they are infamous for having a bunch of servers running at 25% capacity.

From the software perspective the problem only gets worse. While analyzing a client of mine I had asked their IT project leaders what software vendors have you standardized on? They answered
“all of them!” This leads to the old joke “What is the most popular form of software on the market? Answer…Shelfware!” Shelfware is software that a company purchases
and winds up never using and it just sits on the shelf collecting dust.

Prevent IT Applications Failure

When a corporation looks to undertake a major IT initiative, like a customer relationship management (CRM), enterprise resource planning (ERP), data warehouse, or e-commerce solution their
likelihood of project failure is between 65% – 80%, depending on the study referenced. This is especially alarming when we consider that these same initiatives traditionally have executive
management support and cost many millions of dollars. For example, I have one large client that is looking to roll out a CRM system (e.g. Siebel, Oracle) and an ERP system (e.g. SAP, PeopleSoft)
globally in the next four years. Their initial project budget is over $125 million! In my opinion they have a 0% probability of delivering all of these systems on-time and on-budget. Consider this,
when was that last time that you’ve seen an ERP or CRM initiative being delivered on time or on budget?

When we examine the causes for these projects failure several themes become apparent. First, these projects did not address a definable and measurable business need. This is the number one reason
for project failure, data warehouse, CRM, MME, or otherwise. As IT professionals we must always be looking to solve business problems or capture business opportunities. Second, the projects that
fail have a very difficult time understanding their company’s existing IT environment and business rules. This includes custom applications, vendor applications, data elements, entities, data
flows, data heritage and data lineage.

MME’s Focus On Data Integration

Many of these Global 2000 companies and large government organizations are targeting MME technology to assist them in identifying and removing existing application and data redundancy. Moreover,
many companies are actively using their MME to identify redundant applications through analysis of the data. These same companies are starting IT application integration projects to merge these
overlapping systems and to ensure that future IT applications do not proliferate needless redundancy.

If your organization can reduce their applications, processes, data, software and hardware, lowers the likelihood for IT project failure and speeds up the IT development life-cycle, then clearly it
will greatly reduce a company’s IT expenditures. For example, I have a large banking client that asked my company to analyze their IT environment. During this analysis we discovered that they
have a tremendous amount of application and data redundancy. Moreover, I had figured out that they have over 700 unique applications. I then compared this client to a bank that is more than twice
there size; however, this larger bank has a world class MME and uses it to properly manage their systems. As a result, they have less than 250 unique applications. Clearly the bank with more than
700 applications has a great deal of needless redundancy as compared to a bank that is more than twice their size and has less than 250 applications. Interestingly enough the bank that has less
than 250 applications and has a world-class MME is also 14 times more profitable than the bank maintaining over 700 applications. It doesn’t seem like a very far stretch to see that the less
profitable bank would become much more profitable if they removed this redundancy.

I used referenced web site : tdan.com

We have a referrer link, for original article of this post, if you want you can follow TDAN (The Data Administration Newsletter)

Special thanks for this tdan comunity, and you can see the post at below link;

http://tdan.com/the-importance-of-data-integration/5198