Why Data Integration? – The Importance of Data Integration

Almost every Chief Information Officer (CIO) has the goal of integrating their organization’s data. In fact, the issue of data integration has risen all the way to the Chief Financial Officer
(CFO) and Chief Executive Officer (CEO) level of a corporation. A key question is: why is data integration becoming so important to so many C-level executives? Several key reasons drive
this desire:

  • Provide IT Portfolio Management
  • Reduce IT Redundancy
  • Prevent IT Applications Failure

 
Provide IT Portfolio Management

Over the years I have had the opportunity to perform dozens of data warehousing assessments. During these assessments I always ask the client how much they spend annually on data warehousing. The
majority of companies and government organizations cannot give even a reasonably good estimate of what they actually spend. To manage these and other costly information technology (IT)
initiatives, it is critical to measure each one of them. However, it is impossible to measure them when most companies do not understand them (see Figure 1: “How To Manage IT”). This is
where IT Portfolio Management enters the picture.

 


Figure 1: How To Manage IT

 

IT portfolio management refers to the formal process for managing IT assets. IT assets include software, hardware, middleware, IT projects, internal staff, applications and external consulting. As with
any newer discipline, many companies that have started IT portfolio management efforts have not done so correctly. I would like to outline some of the keys to building successful IT
portfolio management applications.

Properly managing its IT portfolio allows a corporation to see which projects are proceeding well and which are lagging behind. In my experience, almost every large company has a great
deal of duplicate IT effort occurring (see the later section on “Reduce IT Redundancy”). This happens because the meta data is not accessible. At my company we have a couple of large
clients whose primary goal is to remove these tremendous redundancies, which translate into tremendous initial and ongoing IT costs.

Reduce IT Redundancy

CIO is commonly defined as Chief Information Officer; however, there is another possible meaning of this acronym: Career Is Over. One of the chief reasons for this is that most IT departments are
“handcuffed” by needless IT redundancy that too few CIOs are willing and able to fix.

Several CIO surveys are conducted annually. These surveys ask, “What are your top concerns for the upcoming year?” Regardless of the survey you look at, “data
integration” will be high on the list. Data integration has two facets. One is the integration of data across disparate systems for enterprise applications. The second is the
integration/removal of IT redundancies. Please understand that some IT redundancy is a good thing. For example, when there is a power outage and one of your data centers is non-operational, you need
to have a backup of these systems and data. However, when I talk about IT redundancies I am addressing “needless” IT redundancy, meaning IT redundancy that exists only because of
insufficient management of our IT systems. I was working with a Midwestern insurance company that, over a four-year span, had initiated various decision support efforts. After this four-year period
they took the time to map out the flow of data from their operational systems to their data staging areas and finally to their data mart structures. What they discovered was Figure 2:
“Typical IT Architecture”.

Figure 2: Typical IT Architecture

What is enlightening about Figure 2 is that when I show this illustration during a client meeting or at a conference keynote address, the typical response I receive from the audience is
“Where did you get a copy of our IT architecture?” If you work at a Global 2000 company or any large government entity, Figure 2 represents an overly simplified version of your IT
architecture. These poor architecture habits create a litany of problems, including:

  • Redundant Applications/Processes/Data
  • Needless IT Rework
  • Redundant Hardware/Software

Redundant Applications/Processes/Data

It has been my experience working with large government agencies and Global 2000 companies that needlessly duplicated data is running rampant throughout our industry. In my experience the typical
large organization has three- to fourfold needless data redundancy. Moreover, I can name multiple organizations that have literally hundreds of “independent” data mart
applications spread all over the company. Each one of these data marts duplicates the extraction, transformation and load (ETL) that is typically done centrally in a data warehouse. This
greatly increases the number of support staff required to maintain the data warehousing system, as these tasks are the largest and most costly data warehousing activities. Besides duplicating this
process, each data mart also copies the data, requiring further IT resources. It is easy to see why IT budgets are straining under the weight of all of this needless redundancy.

Needless IT Rework

During the requirements gathering portion of one of our meta data management initiatives, I had an IT project manager discuss the challenges he was facing in analyzing one of the
mission-critical legacy applications that would feed the data warehousing application his team had been tasked to build. During our interview he stated, “This has to be the twentieth time
that our organization is analyzing this system to understand the business rules around the data.” This person’s story is an all too common one, as almost all organizations reinvent the
IT wheel on every project. This situation occurs because separate teams typically build each of the IT systems, and since they don’t have a Managed Meta Data Environment (MME),
these teams do not leverage each other’s standards, processes, knowledge, and lessons learned. This results in a great deal of rework and reanalysis.

Redundant Hardware/Software

I have discussed at length the redundant application and IT work that occurs in the industry. All of this redundancy also generates a great deal of needless hardware and software
redundancy. This situation forces the enterprise to retain skilled employees to support each of these technologies. In addition, a great deal of financial savings is lost, as standardization on
these tools doesn’t occur. Often a software, hardware, or tool contract can be negotiated to provide considerable discounts for enterprise licenses, which can be phased in. These economies
of scale can provide tremendous cost savings to the organization.

In addition, the hardware and software that is purchased is not used in an optimal fashion. For example, I have a client that has each of its individual IT projects buy its own hardware. As
a result, they are infamous for having a bunch of servers running at 25% capacity.

From the software perspective the problem only gets worse. While analyzing a client of mine, I asked their IT project leaders, “What software vendors have you standardized on?” They answered,
“All of them!” This leads to the old joke: “What is the most popular form of software on the market? Answer… Shelfware!” Shelfware is software that a company purchases
and winds up never using, so it just sits on the shelf collecting dust.

Prevent IT Applications Failure

When a corporation looks to undertake a major IT initiative, like a customer relationship management (CRM), enterprise resource planning (ERP), data warehouse, or e-commerce solution, its
likelihood of project failure is between 65% and 80%, depending on the study referenced. This is especially alarming when we consider that these same initiatives traditionally have executive
management support and cost many millions of dollars. For example, I have one large client that is looking to roll out a CRM system (e.g. Siebel, Oracle) and an ERP system (e.g. SAP, PeopleSoft)
globally in the next four years. Their initial project budget is over $125 million! In my opinion they have a 0% probability of delivering all of these systems on time and on budget. Consider this:
when was the last time you saw an ERP or CRM initiative delivered on time or on budget?

When we examine the causes of these project failures, several themes become apparent. First, these projects did not address a definable and measurable business need. This is the number one reason
for project failure, whether data warehouse, CRM, MME, or otherwise. As IT professionals we must always be looking to solve business problems or capture business opportunities. Second, the projects that
fail have a very difficult time understanding their company’s existing IT environment and business rules. This includes custom applications, vendor applications, data elements, entities, data
flows, data heritage and data lineage.

MME’s Focus On Data Integration

Many of these Global 2000 companies and large government organizations are targeting MME technology to assist them in identifying and removing existing application and data redundancy. Moreover,
many companies are actively using their MME to identify redundant applications through analysis of the data. These same companies are starting IT application integration projects to merge these
overlapping systems and to ensure that future IT applications do not proliferate needless redundancy.
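
As a rough illustration of how meta data could surface overlapping applications, the sketch below compares the sets of data elements that each application maintains and flags pairs with a high degree of overlap. The application names, data elements, and the 0.5 threshold are all hypothetical; a real MME would draw this information from its own repository and would likely use richer matching than exact element names.

```python
from itertools import combinations

# Hypothetical metadata catalog: application name -> data elements it maintains.
catalog = {
    "loan_origination": {"customer_id", "customer_name", "address", "loan_amount"},
    "branch_crm": {"customer_id", "customer_name", "address", "phone"},
    "card_services": {"card_number", "credit_limit", "customer_id"},
}


def jaccard(a: set, b: set) -> float:
    """Share of data elements two applications have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def redundancy_candidates(catalog: dict, threshold: float = 0.5) -> list:
    """Return (app_a, app_b, overlap) for pairs at or above the threshold."""
    pairs = []
    for (name_a, elems_a), (name_b, elems_b) in combinations(sorted(catalog.items()), 2):
        overlap = jaccard(elems_a, elems_b)
        if overlap >= threshold:
            pairs.append((name_a, name_b, round(overlap, 2)))
    return pairs


# Pairs worth a closer look by an integration team (assumed threshold of 0.5).
candidates = redundancy_candidates(catalog)
```

In this made-up catalog, only the two applications that both maintain customer name and address data are flagged; the pair would still need human review before anyone decides to merge them.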

If your organization can reduce its applications, processes, data, software and hardware, lower the likelihood of IT project failure and speed up the IT development life-cycle, then clearly it
will greatly reduce its IT expenditures. For example, I have a large banking client that asked my company to analyze their IT environment. During this analysis we discovered that they
have a tremendous amount of application and data redundancy. Moreover, I found that they have over 700 unique applications. I then compared this client to a bank that is more than twice
their size; however, this larger bank has a world-class MME and uses it to properly manage its systems. As a result, it has fewer than 250 unique applications. Clearly the bank with more than
700 applications has a great deal of needless redundancy compared to a bank that is more than twice its size and has fewer than 250 applications. Interestingly enough, the bank that has fewer
than 250 applications and a world-class MME is also 14 times more profitable than the bank maintaining over 700 applications. It doesn’t seem like a far stretch to conclude that the less
profitable bank would become much more profitable if it removed this redundancy.

Source: tdan.com

This post is based on an article from TDAN (The Data Administration Newsletter), which you can follow if you want.

Special thanks to the TDAN community. You can see the original article at the link below:

http://tdan.com/the-importance-of-data-integration/5198

What is Unstructured Data?

The phrase unstructured data usually refers to information that doesn’t reside in a traditional row-column database. As you might expect, it’s the opposite of structured data, which is the data stored in fields in a database.

Examples of Unstructured Data

Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn’t fit neatly in a database.

Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly, often many times faster than structured databases are growing.

Mining Unstructured Data

Many organizations believe that their unstructured data stores include information that could help them make better business decisions. Unfortunately, it’s often very difficult to analyze unstructured data. To help with the problem, organizations have turned to a number of different software solutions designed to search unstructured data and extract important information. The primary benefit of these tools is the ability to glean actionable information that can help a business succeed in a competitive environment.

Because the volume of unstructured data is growing so rapidly, many enterprises also turn to technological solutions to help them better manage and store their unstructured data. These can include hardware or software solutions that enable them to make the most efficient use of their available storage space.

Unstructured Data and Big Data

As mentioned above, unstructured data is the opposite of structured data. Structured data generally resides in a relational database, and as a result, it is sometimes called relational data. This type of data can be easily mapped into pre-designed fields. For example, a database designer may set up fields for phone numbers, zip codes and credit card numbers that accept a certain number of digits. Structured data has been or can be placed in fields like these. By contrast, unstructured data is not relational and doesn’t fit into these sorts of pre-defined data models.

Semi-Structured Data

In addition to structured and unstructured data, there’s also a third category: semi-structured data. Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze. Examples of semi-structured data might include XML documents and NoSQL databases.

The term big data is closely associated with unstructured data. Big data refers to extremely large datasets that are difficult to analyze with traditional tools. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big data is unstructured data. Many of the tools designed to analyze big data can handle unstructured data.

Unstructured Data Management

Organizations use a variety of different software tools to help them organize and manage unstructured data. These can include the following:

Big data tools

Software like Hadoop can process stores of both unstructured and structured data that are extremely large, very complex and changing rapidly.

Business intelligence software

Also known as BI, business intelligence is a broad category of analytics, data mining, dashboards and reporting tools that help companies make sense of their structured and unstructured data for the purpose of making better business decisions.

Data integration tools

These tools combine data from disparate sources so that they can be viewed or analyzed from a single application. They sometimes include the capability to unify structured and unstructured data.

Document management systems

Also called enterprise content management systems, a DMS can track, store and share unstructured data that is saved in the form of document files.

Information management solutions

This type of software tracks structured and unstructured enterprise data throughout its lifecycle.

Search and indexing tools

These tools retrieve information from unstructured data files such as documents, Web pages and photos.
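
To make the idea behind search and indexing tools a little more concrete, here is a minimal sketch of an inverted index over plain text, which maps each word to the documents that contain it. The file names and contents are hypothetical, and the sketch covers text only; real tools also handle stemming, ranking, and non-text formats such as photos, none of which is attempted here.

```python
import re
from collections import defaultdict

# Hypothetical documents standing in for unstructured text files.
documents = {
    "memo.txt": "Quarterly sales grew in the Midwest region.",
    "email.txt": "Please review the Midwest insurance claims report.",
    "notes.txt": "Sales training is scheduled for next quarter.",
}


def build_index(docs: dict) -> dict:
    """Map each lowercase word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict, query: str) -> list:
    """Return documents containing every word in the query, sorted by name."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)


index = build_index(documents)
midwest_sales = search(index, "Midwest sales")
```

Building the index once up front is the design choice that matters: each later query becomes a few set lookups instead of a scan through every file.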

Unstructured Data Technology

A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. The UIMA “defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information.”

Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data. This open source project is managed by the Apache Software Foundation.

Source: webopedia.com

The original post includes a few reference links on topics such as big data, Hadoop, and business intelligence.

You can see the original post at the link below:

http://www.webopedia.com/TERM/U/unstructured_data.html

Thanks to the author of this post: Vangie Beal

What is Structured Data?

Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.

Characteristics of Structured Data

Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. This includes defining what fields of data will be stored and how that data will be stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input (number of characters; restricted to certain terms such as Mr., Ms. or Dr.; M or F).
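
A minimal sketch of such a data model, written in Python, might look like the following. The record layout, the 40-character name limit, and the choice to store currency as whole cents are assumptions made for illustration; the allowed titles and the M/F restriction mirror the examples in the paragraph above.

```python
from dataclasses import dataclass
from datetime import date

ALLOWED_TITLES = {"Mr.", "Ms.", "Dr."}
ALLOWED_GENDERS = {"M", "F"}
MAX_NAME_LENGTH = 40  # hypothetical character limit


@dataclass(frozen=True)
class CustomerRecord:
    """One structured record; every field has a fixed type and input rule."""
    title: str
    name: str
    gender: str
    birth_date: date
    balance_cents: int  # currency stored as whole cents (an assumption)

    def __post_init__(self):
        # Reject any input that breaks the restrictions defined by the model.
        if self.title not in ALLOWED_TITLES:
            raise ValueError(f"title must be one of {sorted(ALLOWED_TITLES)}")
        if not self.name or len(self.name) > MAX_NAME_LENGTH:
            raise ValueError(f"name must be 1 to {MAX_NAME_LENGTH} characters")
        if self.gender not in ALLOWED_GENDERS:
            raise ValueError("gender must be 'M' or 'F'")
        if not isinstance(self.birth_date, date):
            raise ValueError("birth_date must be a date")
        if not isinstance(self.balance_cents, int):
            raise ValueError("balance_cents must be an integer")


record = CustomerRecord("Dr.", "Ada Lovelace", "F", date(1815, 12, 10), 125_000)
```

Because every record is checked on creation, anything that reaches storage is guaranteed to fit the model, which is what makes structured data easy to query later.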

Structured data has the advantage of being easily entered, stored, queried and analyzed. At one time, because of the high cost and performance limitations of storage, memory and processing, relational databases and spreadsheets using structured data were the only way to effectively manage data. Anything that couldn’t fit into a tightly organized structure would have to be stored on paper in a filing cabinet.

 

Managing Structured Data

Structured data is often managed using Structured Query Language (SQL) – a programming language created for managing and querying data in relational database management systems. SQL was originally developed by IBM in the early 1970s and later developed commercially by Relational Software, Inc. (now Oracle Corporation).
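
For readers who have not seen SQL in action, the short sketch below uses Python's built-in sqlite3 module to create a table, load a few rows, and run a query. The customers table, its columns, and the sample rows are hypothetical, and the statements follow SQLite's dialect, which may differ in small ways from other database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# Define the fields up front, including a restriction on zip code length.
conn.execute(
    """
    CREATE TABLE customers (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        zip_code TEXT NOT NULL CHECK (length(zip_code) = 5),
        balance REAL NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO customers (name, zip_code, balance) VALUES (?, ?, ?)",
    [("Alice", "60601", 250.0), ("Bob", "60601", 75.5), ("Carol", "53202", 410.0)],
)

# Summarize customers and balances per zip code.
rows = conn.execute(
    "SELECT zip_code, COUNT(*), SUM(balance) FROM customers "
    "GROUP BY zip_code ORDER BY zip_code"
).fetchall()
conn.close()
```

The query describes what result is wanted (a count and a total per zip code) and leaves it to the database to work out how to compute it.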

Structured data was a huge improvement over strictly paper-based unstructured systems, but life doesn’t always fit into neat little boxes. As a result, structured data always had to be supplemented by paper or microfilm storage. As technology performance has continued to improve and prices have dropped, it has become possible to bring unstructured and semi-structured data into computing systems.

Unstructured and Semi-Structured Data

Unstructured data is all those things that can’t be so readily classified and fit into a neat box: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.

Semi-structured data is a cross between the two. It is a type of structured data but lacks a strict data model. With semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data doesn’t have a rigid structure. For example, word processing software now can include metadata showing the author’s name and the date created, with the bulk of the document just being unstructured text. Emails have the sender, recipient, date, time and other fixed fields added to the unstructured data of the email message content and any attachments. Photos or other graphics can be tagged with metadata such as the creator, date, location and keywords, making it possible to organize and locate graphics. XML and other markup languages are often used to manage semi-structured data.
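
As a small illustration of tags marking elements inside otherwise free-form content, the sketch below parses a made-up XML document with Python's standard xml.etree.ElementTree module. The element names (author, created, keyword, body) are hypothetical and do not follow any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured document: tagged metadata plus free text.
xml_text = """
<document>
  <author>J. Smith</author>
  <created>2014-03-01</created>
  <keywords>
    <keyword>integration</keyword>
    <keyword>metadata</keyword>
  </keywords>
  <body>Free-form notes about the project go here, with no fixed layout.</body>
</document>
"""

root = ET.fromstring(xml_text.strip())

# Tagged elements can be pulled out by name, like fields in a record.
metadata = {
    "author": root.findtext("author"),
    "created": root.findtext("created"),
    "keywords": [k.text for k in root.iter("keyword")],
}

# The body remains unstructured text that needs other tools to analyze.
body = root.findtext("body")
```

The tagged parts can be organized and searched much like structured data, while the body still has to be treated as plain text.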

Structured Data Technology Standards

SQL has been a standard of the American National Standards Institute since 1986. It is managed by InterNational Committee for Information Technology Standards (INCITS) Technical Committee DM 32 – Data Management and Interchange. The committee has two task groups, one for databases and the other for metadata. HP, CA, IBM, Microsoft, Oracle, Sybase (SAP) and Teradata all participate, as well as several federal government agencies. Both of the committee project documents have links to further information on each project. SQL became an International Organization for Standardization (ISO) standard in 1987. The published standards are available for purchase from the ANSI eStandards Store, under the INCITS/ISO/IEC 9075 classification.

Source: webopedia.com

The original post includes a few reference links on topics such as the data model.

You can see the original post at the link below:

http://www.webopedia.com/TERM/S/structured_data.html

Thanks to the author of this post: Vangie Beal

What’s the Difference Between Structured & Unstructured Data

If left unmanaged, your data can become overwhelming, making it difficult to procure the information you need when you need it. While software is designed to address archiving, e-discovery, compliance, etc., the overarching goal is almost always the same: to make managing and maintaining data a feasible task. In this post, you’ll see two types of data you’re accustomed to working with, paying close attention to the differences between structured and unstructured data.


What is Structured Data?

Before getting into unstructured data, you need to have an understanding of its structured counterpart. Structured data (as explained succinctly in Big Data Republic’s video) is information, usually text files, displayed in titled columns and rows which can easily be ordered and processed by data mining tools. This could be visualized as a perfectly organized filing cabinet where everything is identified, labeled and easy to access. Most organizations are likely to be familiar with this form of data and already using it effectively, so let’s move on to the hotter question.

What is Unstructured Data?

Believe it or not, your database of structured information doesn’t even contain half of the information available for your use! Seth Grimes, a leading industry analyst on the confluence of structured and unstructured data sources, published an article that stated, “80% of business-relevant information originates in unstructured form, primarily text.”  This may seem like an outlandish percentage, but don’t jump to conclusions too fast. We’re just getting started.

Now that you have a grasp on structured data, it will be much easier to understand what unstructured data is. Unstructured data, usually binary data that is proprietary, is that which has no identifiable internal structure. It can be visualized as a level 5 hoarder’s living room; it’s a massive unorganized conglomerate of various objects that are worthless until identified and stored in an organized fashion. Once this organization process has taken place (through the use of specialized software), the items can then be searched through and categorized (to an extent) for obtaining insights. While data mining tools might not be equipped to parse information in email messages (however organized it may be), you may have very good reason to collect and categorize data from this source. This illustrates the importance and plausible breadth of unstructured data.

Email Has Structure, Right?

The term “unstructured” has faced major scrutiny for several reasons. One argument is that although some form of structure is not formally identified, it can still be implied and therefore should not be labeled as “unstructured.” The counter-point states that if data has some form of structure but is not helpful to the processing task at hand, it may still be characterized as “unstructured.” So, while email messages may contain information that does have some implied structure, we can label the information as “unstructured” because normal data mining tools aren’t equipped to parse it. Alas, both sides of the argument persist.

Unstructured Data Types

Unstructured data is raw and unorganized, and organizations store it all. Ideally, all of this information would be converted into structured data; however, this would be costly and time consuming. Also, not all types of unstructured data can easily be converted into a structured model. For example, an email holds information such as the time sent, subject, and sender (all uniform fields), but the content of the message is not so easily broken down and categorized. This can introduce some compatibility issues with the structure of a relational database system.
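
The email example can be sketched with Python's standard email package. The message below is made up, and the dictionary keys (sender, recipient, subject, sent_at) are hypothetical column names chosen for illustration.

```python
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical message; the headers are uniform fields, the body is free text.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Claims backlog\n"
    "Date: Mon, 03 Mar 2014 09:15:00 +0000\n"
    "\n"
    "Hi Bob, the backlog is growing again. Can we talk Thursday?\n"
)

msg = message_from_string(raw)

# Header fields map cleanly onto columns in a relational table.
row = {
    "sender": msg["From"],
    "recipient": msg["To"],
    "subject": msg["Subject"],
    "sent_at": parsedate_to_datetime(msg["Date"]).isoformat(),
}

# The body has no fixed fields; it can only be stored as a block of text.
body = msg.get_payload()
```

The header fields drop straight into a table row, but deciding what the body is about, or what action it requests, takes a different kind of analysis.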

In case you’re still not quite sure what we mean, here is a limited list of types of unstructured data:

  • Emails
  • Word Processing Files
  • PDF files
  • Spreadsheets
  • Digital Images
  • Video
  • Audio
  • Social Media Posts

Looking at the list, you may be wondering what these files have in common. The files listed above can be stored and managed without the format of the file being understood by the system. This allows them to be stored in an unstructured fashion because the contents of the files are unorganized.

The big data industry is growing, and organizations have recognized the problem of unstructured data going unused. Better yet, technologies and services are being developed in response. Darin Stewart of InformationWeek said in a recent article about big data, “The age of information overload is slowly drawing to a close. Enterprises are finally getting comfortable with managing massive amounts of data, content and information. The pace of information creation continues to accelerate, but the ability of infrastructure and information management to keep pace is coming within sight. Big Data is now considered a blessing rather than a curse.”

Source: Sherpa’s website.

You can see this post on their website:

What’s the Difference Between Structured & Unstructured Data?