A Data Steward is someone that has formal accountability for data in the organization. I say that everybody in the organization is a Data Steward. You may disagree with me or think that this idea is preposterous; however, I hope to change your mind by the end of this short column. Please give me five minutes.
My premise is based on the fact that everybody that comes in contact with data should have formal accountability for that contact. In other words, people that define, produce, and use data must be held accountable for how they define, produce, and use the data. This may be common sense, but the truth is that this is not taking place. Formalizing accountability to execute and enforce authority over data is the essence of using stewardship to govern data.
Most people agree that everybody that uses sensitive data must protect that data. The sensitive data may contain PII data (personally identifiable information) or PHI data (personal health information) or even IP data (intellectual property) that has a clear set of rules associated with how that data can be shared and who can have access to that data. The rules may be external as in the case of PII and PHI data, or the rules can be internal as in the case of IP data. But one thing is for certain: there are rules associated with at least some of your data.
The truth is that the rules for protecting sensitive data must 1) apply equally to everybody that comes in contact with sensitive data, 2) everybody must know and live the rules, 3) the rules must be formally enforced, and 4) the ability to demonstrate that people are following the rules must be auditable. This, my friends, is what I am proving in this column. Everybody that uses sensitive data must be held formally accountable for how they use the data. Therefore, they are, by my definition, a Data Steward. A Non-Invasive Data Governance™ program focuses on formalizing that level of data usage accountability.
Data Usage is only one facet of the Everybody is a Data Steward notion. What about people that define or produce data? Shouldn’t they also have formal accountability for their actions? The answer to that question is ‘Yes.’
People that define data – either by entering the data or finding new data sources, creating new systems, creating new databases, or propagating new spread-marts that will be used for decision making – should be held formally accountable for checking to see what already exists before producing, as an example, another version of the customer. People that define the ‘golden record’ or system-of-record or master data resources for your organization should be held formally accountable for the quality and value of the definition of that data.
Non-Invasive Data Governance™ recognizes the data producers as stewards of the data as well. If you produce data one of the ways mentioned previously, it is important that you understand the impact you have on the value of that data to the organization. Accepting default values may or may not be a good thing. Entering dummy data where real data is required is never a good thing. Allowing data that is not up to standards to enter your data resources may wreak havoc on decision-making. Calculating profitability may be inconsistent from product to product. People that produce data – through their functions and processes – should be held accountable for how they produce that data including the quality, accuracy, and value of the data they produce.
It all boils down to whether or not you believe that everybody with a relationship to the data should be held formally accountable for that relationship. Basically, every person in your organization has a relationship to the data. Therefore,Everybody is a Data Steward.
The idea that Everybody is a Data Steward may scare you a smidge. Most data governance programs do not follow the thinking that everybody in the organization is a data steward. In fact, most programs assign or hire people to be data stewards. The Non-Invasive Data Governance™ approach allows for certain people to be stewards at a more tactical level (subject matter experts), but the approach calls for identifying or recognizing these people based on their existing levels of authority associated with their data domains.
Are you convinced yet that Everybody is a Data Steward? Does this concept mean that your data governance program will become, in some way, more complex? From my experience the answer is ‘not necessarily’. It depends on how you communicate and address this main tenet of data stewardship. The Everybody is a Data Steward notion guarantees that accountability for data is consistent across the organization for everybody.
This is the most common of all the different types of databases. In this, the data in a relational database is stored in various data tables. Each table has a key field which is used to connect it to other tables. Hence all the tables are related to each other through several key fields. These databases are extensively used in various industries and will be the one you are most likely to come across when working in IT.
Examples of relational databases are Oracle, Sybase and Microsoft SQL Server and they are often key parts of the process of software development. Hence you should ensure you include any work required on the database as part of your project when creating a project plan and estimating project costs.
2.0 Operational Databases
In its day to day operation, an organisation generates a huge amount of data. Think of things such as inventory management, purchases, transactions and financials. All this data is collected in a database which is often known by several names such as operational/ production database, subject-area database (SADB) or transaction databases.
An operational database is usually hugely important to Organisations as they include the customer database, personal database and inventory database ie the details of how much of a product the company has as well as information on the customers who buy them. The data stored in operational databases can be changed and manipulated depending on what the company requires.
3.0 Database Warehouses
Organisations are required to keep all relevant data for several years. In the UK it can be as long as 6 years. This data is also an important source of information for analysing and comparing the current year data with that of the past years which also makes it easier to determine key trends taking place. All this data from previous years are stored in a database warehouse. Since the data stored has gone through all kinds of screening, editing and integration it does not need any further editing or alteration.
With this database ensure that the software requirements specification (SRS) is formally approved as part of the project quality plan.
4.0 Distributed Databases
Many organisations have several office locations, manufacturing plants, regional offices, branch offices and a head office at different geographic locations. Each of these work groups may have their own database which together will form the main database of the company. This is known as a distributed database.
5.0 End-User Databases
There is a variety of data available at the workstation of all the end users of any organisation. Each workstation is like a small database in itself which includes data in spreadsheets, presentations, word files, note pads and downloaded files. All such small databases form a different type of database called the end-user database.
6.0 External Database
There is a sea of information available outside world which is required by an organisation. They are privately-owned data for which one can have conditional and limited access for a fortune. This data is meant for commercial usage. All such databases outside the organisation which are of use and limited access are together called external database.
7.0 Hypermedia Database
Most websites have various interconnected multimedia pages which might include text, video clips, audio clips, photographs and graphics. These all need to be stored and “called” from somewhere when the webpage if created. All of them together form the hypermedia database.
Please note that if you are creating such a database from scratch to be generous when creating a project plan, detailed when defining the business requirements documentation (BRD) and meticulous in your project cost controls. I have seen too many projects where the creation of one of these databases has caused scope creep and an out of control budget for a project.
8.0 Navigational Database
Navigational database has all the items which are references from other objects. In this, one has to navigate from one reference to other or one object to other. It might be using modern systems like XPath. One of its applications is the air flight management systems.
9.0 In-Memory Database
An in-memory databases stores data in a computer’s main memory instead of using a disk-based storage system. It is faster and more reliable than that in a disk. They find their application in telecommunications network equipments.
10.0 Document-Oriented Database
A document oriented database is a different type of database which is used in applications which are document oriented. The data is stored in the form of text records instead of being stored in a data table as usually happens.
11.0 Real-Time Database
A real-time database handles data which constantly keep on changing. An example of this is a stock market database where the value of shares change every minute and need to be updated in the real-time database. This type of database is also used in medical and scientific analysis, banking, accounting, process control, reservation systems etc. Essentially anything which requires access to fast moving and constantly changing information.
Assume that this will require much more time than a normal relational database when it comes to the software testing life cycle, as these are much more complicated to efficiently test within normal time frames.
12.0 Analytical Database
An analytical database is used to store information from different types of databases such as selected operational databases and external databases. Other names given to analytical databases are information databases, management databases or multi-dimensional databases. The data stored in an analytical database is used by the management for analysis purposes, hence the name. The data in an analytical database cannot be changed or manipulated.
Different Types of Databases Top 12 – Tip
Of the different types of databases, relational is the most common and includes such well known names as Oracle, No-SQL, Couchbase, Hadoop, Sybase and SQL Server. However as a project manager you need to be prepared for anything, hence why having a high level view of the different databases is useful particularly when managing a software development life cycle. Regarding the remainder, you will hear a great deal about database warehouses. This is a highly specialized area which involves mining the data produced to generate meaningful trends and reports for senior management to act upon.
I used referenced web site for this post : my-project-management-expert.com
We have a referrer link, for original article of this post, if you want you can follow my-project-management-expert.com
Special thanks for my-project-management-expert team, for this article, and you can see the post at below link;
The different types of databases include operational databases, end-user databases, distributed databases, analytical databases, relational databases, hierarchical databases and database models. Databases are classified according to their type of content, application area and technical aspect. For instance, a deductive database combines logic programming with a relational database, while a graph database uses graph structures to represent and store information.
Other types of databases include hypertext databases, mobile databases, parallel databases, active databases, cloud databases, in-memory databases, spatial databases, temporal databases, real-time databases, probabilistic databases and embedded databases.
A database is an organized collection of data. Its primary function is to interact with a database management system to capture and analyze data. A database management system is a software system designed to allow the creation, querying and administration of databases. Some popular database management systems include PostgreSQL, MySQL, Microsoft SQL Server, Oracle, IBM DB2 and SAP.
Databases are designed to operate large amounts of information by inputting, storing, retrieving and managing it. They are set up in a way that allows users to easily and intuitively gain access to all the information. A database management maintains the integrity and security of stored data. It is also used for data recovery, in case of system failure.
(1) Often abbreviated DB, a database is basically a collection of information organized in such a way that a computer program can quickly select desired pieces of data. You can think of a database as an electronic filing system.
Traditional databases are organized by fields, records, and files. A field is a single piece of information; a record is one complete set of fields; and a file is a collection of records. For example, a telephone book is analogous to a file. It contains a list of records, each of which consists of three fields: name, address, and telephone number.
An alternative concept in database design is known as Hypertext. In a Hypertext database, any object, whether it be a piece of text, a picture, or a film, can be linked to any other object. Hypertext databases are particularly useful for organizing large amounts of disparate information, but they are not designed for numerical analysis.
To access information from a database, you need a databasemanagementsystem (DBMS). This is a collection of programs that enables you to enter, organize, and select data in a database.
(2) Increasingly, the term database is used as shorthand for database management system. There are many different types of DBMSs, ranging from small systems that run on personal computers to huge systems that run on mainframes.
I used referenced web site for this post : webopedia.com
We have a referrer link, for original article of this post, if you want you can follow webopedia.com
Special thanks for webopedia team, for this article, and you can see the post at below link;
Apache™ Hadoop® is a highly scalable open-source storage platform designed for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks i.e. it can process very large data sets across hundreds and thousands of computing nodes that operate in parallel. It provides a cost effective storage solution for large data volumes with no format requirements.
Hadoop is an ecosystem of open source components that fundamentally changed the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware.
What are those terms mean?
Open-source software: Open-source software is created and maintained by a network of developers from around the globe. It’s free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.
Framework: In this case, it means that everything you need to develop and run software applications is provided – programs, connections, etc.
Massive storage: The Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.
Processing power: Hadoop concurrently processes large amounts of data using multiple low-cost computers for fast results.
It all started with the World Wide Web. As the web grew in the late 1900s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results really were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided. The web crawler portion remained as Nutch. The distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Components of Hadoop
Currently, four core modules are included in the basic framework from the Apache Foundation:
1. Hadoop Common
The libraries and utilities used by other Hadoop modules.
2. Hadoop Distributed File System (HDFS)
The Java-based scalable system that stores data across multiple machines without prior organization.
Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.
MapReduce – a software programming model for processing large sets of data in parallel.
YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator)
Hadoop MapReduce, an execution engine in Hadoop, processes workloads using the MapReduce framework which breaks down jobs into smaller pieces of work that can be distributed across nodes in your Amazon EMR cluster. The Hadoop MapReduce engine is built with the expectation that any given machine in your cluster could fail at any time and is designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.
You can write MapReduce programs in Java, or use Hadoop Streaming to execute custom scripts in a parallel fashion, Hive and Pig (if you choose to install these applications on your Amazon EMR cluster) for higher level abstractions over MapReduce, or other tools to interact with Hadoop.
Starting with Hadoop 2, resource management is managed by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster, and it ensures that these resources are dynamically allocated to accomplish the tasks in your processing job. YARN is able to manage Hadoop MapReduce workloads as well as other distributed frameworks such as Apache Spark, Apache Tez, and more.
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)
A nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
A table and storage management layer that helps users share and access data.
A web interface for managing, configuring and testing Hadoop services and components.
A distributed database system.
A data collection system for monitoring large distributed systems.
Software that collects, aggregates and moves large amounts of streaming data into HDFS.
A Hadoop job scheduler.
A connection and transfer mechanism that moves data between Hadoop and relational databases.
An open-source cluster computing framework with in-memory analytics.
An scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
An application that coordinates distributed processes.
In addition, there are commercial distributions of Hadoop, including Cloudera, Hortonworks and MapR. With distributions from software vendors, you pay for their version of the framework and receive additional software components, tools, training, documentation and other services.
Uses of Hadoop
Hadoop’s infinitely scalable flexible architecture (based on the HDFS filesystem) allows organizations to store and analyze unlimited amounts and types of data—all in a single, open source platform on industry-standard hardware.
The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later.
2. Data lake
Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.
Quickly integrate with existing systems or applications to move data into and out of Hadoop through bulk load processing (Apache Sqoop) or streaming (Apache Flume, Apache Kafka).?
Transform complex data, at scale, using multiple data access options (Apache Hive, Apache Pig) for batch (MR2) or fast in-memory (Apache Spark) processing. Process streaming data as it arrives in your cluster via Spark Streaming.
Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.
Analysts interact with full-fidelity data on the fly with Apache Impala (incubating), the analytic database for Hadoop. With Impala, analysts experience BI-quality SQL performance and functionality plus compatibility with all the leading BI tools.
Using Cloudera Search, an integration of Hadoop and Apache Solr, analysts can accelerate the process of discovering patterns in data in all amounts and formats, especially when combined with Impala.
5. Recommendation systems
One of the most popular analytical uses by some of Hadoop’s largest adopters is for web-based recommendation systems. Facebook – people you may know. LinkedIn – jobs you may be interested in. Netflix, eBay, Hulu – items you may be interested in. These systems analyze huge amounts of data in real time to quickly predict preferences before customers leave the web page.
With Hadoop, analysts and data scientists have the flexibility to develop and iterate on advanced statistical models using a mix of partner technologies as well as open source frameworks like Apache Spark.
The distributed data store for Hadoop, Apache HBase, supports the fast, random reads/writes (“fast data”) required for online applications.
Difficulties in Hadoop Adoption
The scale-out potential of Apache Hadoop is impressive. However, while Hadoop offers the advantage of using low-cost commodity servers, extending this scale-out potential to thousands of nodes can translate to a true expense. As the demand for compute and analytic capacity grows, so can the machine costs. This has an equal effect on storage since Hadoop spreads out data, and companies must have equal space for increased data storage repositories, including all the indices, and for all the acquired raw data.
Integrating and processing all of this diverse data can be costly in terms of both infrastructure and personnel. While traditional BI relies on evaluating transactional and historical data, today’s analytics require more skill in iterative analysis and the ability to recognize patterns. When dealing with big data, an advanced skillset that goes beyond RDBMS capabilities-both in terms of analysis and programming-is essential. Not only is there need for advanced systems administration and analyst capabilities when working with Hadoop, but learning the MapReduce programming unique to this framework represents a significant hurdle.
In terms of relational databases, moving and modifying large volumes of unstructured data into the necessary form for Extraction, Transformation, Loading (ETL) can be both costly and time-consuming. That’s a key reason why Hadoop seems so attractive. One could argue that the ongoing growth in data volume, velocity, and variety has made the traditional data warehousing architecture less and less viable. However, it is still easier to find experienced RDBMS programmers and developers than those with MapReduce programming capabilities. Part of the difficulty lies in just learning the language beyond having the skills to install and maintain the Hadoop platform.
MapReduce uses a computational approach that employs a Map pre-processing function and a Reduce data aggregation/distillation step. However, when it comes to real-time transactional data analysis, the low latency reads and writes characteristic of RDBMS structured data processing are simply not possible with HDFS and MapReduce.
Of course, as the platform matures, more features will continue to be added to it. While add-on products make Hadoop easier to use, they also present a learning challenge that requires constantly expanding one’s expertise. For example:
Hive is the data warehousing component of Hadoop, and it functions well with structured data, enabling ad-hoc queries against large transactional datasets. On the other hand, though workarounds do exist, the absence of any ETL-style tool makes HiveQL, the SQL-like programming dialect, problematic when working with unprocessed, unstructured data.
HBase, the column-based storage system, enables users to employ Hadoop datasets as though they’re indices in any conventional RDBMS. It typically allows easy column creation and lets the user store virtually any structure within a data element.
PIG represents the high-level dataflow language, Pig Latin, and requires quite advanced training. It provides easier access to data held in Hadoop clusters and offers a means for analyzing large datasets. In part, PIG enables the implementation of simple or complex workflows and the designation of multiple data inputs where data can then be processed by multiple operators.
As IT organizations consider wholesale adoption of the Hadoop platform for analytics, they must carefully strategize their approach. The platform’s specialized methodology, scale-out storage, and powerful processing capacity make it optimal for analytical data loads. However, the dedication in training competent personnel and machine costs, as well as the framework’s inability to function as an RDBMS replacement, should prompt careful consideration.
Hardware and Software for Hadoop
Hadoop runs on commodity hardware. That doesn’t mean it runs on cheapo hardware. Hadoop runs on decent server class machines.
Here are some possibilities of hardware for Hadoop nodes.
So the high end machines have more memory. Plus, newer machines are packed with a lot more disks (e.g. 36 TB) — high storage capacity.
Examples of Hadoop servers
HP ProLiant DL380
Dell C2100 series
Supermicro Hadoop series
So how does a large hadoop cluster looks like? Here is a picture of Yahoo’s Hadoop cluster.
Hadoop runs well on Linux. The operating systems of choice are:
RedHat Enterprise Linux (RHEL)
This is a well tested Linux distro that is geared for Enterprise. Comes with RedHat support
Source compatible distro with RHEL. Free. Very popular for running Hadoop. Use a later version (version 6.x).
The Server edition of Ubuntu is a good fit — not the Desktop edition. Long Term Support (LTS) releases are recommended, because they continue to be updated for at least 2 years.
Hadoop is written in Java. The recommended Java version is Oracle JDK 1.6 release and the recommended minimum revision is 31 (v 1.6.31).
So what about OpenJDK? At this point the Sun JDK is the ‘official’ supported JDK. You can still run Hadoop on OpenJDK (it runs reasonably well) but you are on your own for support 🙂
Business Intelligence Tools For Hadoop and Big Data
The case for BI Tools
Analytics for Hadoop can be done by the following:
Writing custom Map Reduce code using Java, Python, R ..etc
Using high level Pig scripts
Using SQL using Hive
How ever doing analytics like this can feel a little pedantic and time consuming. Business INtelligence tools (BI tools for short) can address this problem.
BI tools have been around since before Hadoop. Some of them are generic, some are very specific towards a certain domain (e.g. Telecom, Health Care ..etc). BI tools provide rich, user friendly environment to slice and dice data. Most of them have nice GUI environments as well.
BI Tools Feature Matrix Comparison
Since Hadoop is gaining popularity as a data silo, a lot of BI tools have added support to Hadoop. In this chapter we will look into some BI tools that work with Hadoop.
We are trying to present capabilities of BI tools in an easy to compare feature matrix format. This is a ‘living’ document. We will keep it updated as new versions and new features surface.
This matrix is under construction
How to read the matrix?
Y – feature is supported
N – feature is NOT supported
? or empty – unknown
Table: BI Tools Comparison : Data Access and Management
Table: BI Tools Comparison : Analytics
Table: BI Tools Comparison : Visualizing
Table: BI Tools Comparison : Connectivity
Glossary of terms
Can validate data confirms to certain limits, can do cleansing and de-duping.
Share with others
Can share the results with others within or outside organization easily. (Think like sharing a document on DropBox or Google Drive)
You can slice and dice data on locally on a computer or tablet. This uses the CPU power of the device and doesn’t need a round-trip to a ‘server’ to process results. This can speed up ad-hoc data exploration
Analytics ‘app store’
The platform allows customers to buy third party analytics app. Think like APple App Store.
Hadoop For Executives
This section is a quick ‘fact sheet’ in a Q&A format.
What is Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.
What is the license of Hadoop?
Hadoop is open source software. It is an Apache project released under Apache Open Source License v2.0. This license is very commercial friendly.
Who contributes to Hadoop?
Originally Hadoop was developed and open sourced by Yahoo. Now Hadoop is developed as an Apache Software Foundation project and has numerous contributors from Cloudera, Horton Works, Facebook, etc.
Isn’t Hadoop used by foo-foo social media companies and not by enterprises
Hadoop had its start in a Web company. It was adopted pretty early by social media companies because the companies had Big Data problems and Hadoop offered a solution.
However, Hadoop is now making inroads into Enterprises.
I am not sure my company has a big data problem
Hadoop is designed to deal with Big Data. So if you don’t have a ‘Big Data Problem’, then Hadoop probably isn’t the best fit for your company. But before you stop reading right here, please read on 🙂
How much data is considered Big Data, differs from company to company. For some companies, 10 TB of data would be considered Big Data; for others 1 PB would be ‘Big Data’. So only you can determine how much is Big Data.
Also, if you don’t have a ‘Big Data problem’ now, is that because you are not capturing some data? In some scenarios, companies chose to forgo capturing data, because there wasn’t a feasible way to store and process it. Now that Hadoop can help with Big Data, it may be possible to start capturing data that wasn’t captured before.
How much does it cost to adopt Hadoop?
Hadoop is open source. The software is free. However running Hadoop does have other cost components.
Cost of hardware : Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to 1000s of nodes. For a large cluster, the hardware costs will be significant.
The cost of IT / OPS for standing up a large Hadoop cluster and supporting it will need to be factored in.
Since Hadoop is a newer technology, finding people to work on this ecosystem is not easy.
Hadoop for Developers
This section is a quick ‘fact sheet’ in a Q&A format.
What is Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.
Is Hadoop a fad or here to stay?
Sure, Hadoop and Big Data are all the rage now. But Hadoop does solve a real problem and it is a safe bet that it is here to stay.
Below is a graph of Hadoop job trends from Indeed.com. As you can see, demand for Hadoop skills has been up and up since 2009. So Hadoop is a good skill to have!
Hadoop Job Trends
What skills do I need to learn Hadoop?
A hands-on developer or admin can learn Hadoop. The following list is a start – in no particular order
Hadoop is written in Java. So knowing Java helps
Hadoop runs on Linux, so you should know basic Linux command line navigation skills
Some Linux scripting skills will go a long way
What kind of technical roles are available in Hadoop?
The following should give you an idea of the kind of technical roles in Hadoop.
I am not a programmer, can I still use Hadoop?
Yes, you don’t need to write Java Map Reduce code to extract data out of Hadoop. You can use Pig and Hive. Both Pig and Hive offer ‘high level’ Map Reduce. For example you can query Hadoop using SQL in Hive.
What kind of development tools are available for Hadoop?
Hadoop development tools are still evolving. Here are a few:
Karmasphere IDE : tuned for developing for Hadoop
Eclipse and other Java IDEs : When writing Java code
Command line editor like VIM : No matter what editor you use, you will be editing a lot of files / scripts. So familiarity with CLI editors is essential.
Where can I learn more?
Tom White’s Hadoop Book : This is the ‘Hadoop Bible’
The Commercial Platform Approach to Apache Hadoop
As mentioned above, businesses dealing with increasing masses of data are looking for a distributed computing analytic solution that provides comprehensive administration and management, easy deployment, and support for effective business continuity and high availability.
Today, commercial open-source models that incorporate MapReduce along with a built-in framework and infrastructure offer another means for avoiding the learning curve and burdens associated with Apache Hadoop deployment. These commercial players ease skillset acquisition by providing key management tools that interface with Hadoop processes. The value of technical support, services, and training cannot be overstated when it comes to Hadoop implementation.
Commercial vendors offer a means by which these high-level analysis tools can be accessed and used by a wide variety of users, not just those with engineering or BI capabilities. They provide the support that ensures Hadoop users can undertake complex data analysis projects.
As open-source tools proliferate and their increasing importance to big data analytics continues to grow, a need for streamlined administration and support will expand as well. While commercial Hadoop providers offer the necessary support, there is no alternative to learning its platform-specific language. Adequate knowledge of MapReduce represents an intrinsic component to working with Hadoop. Moreover, in order for users to install, configure, and use the code, thorough training is fundamental.
Hadoop integration with current BI analytics remains a key goal along with the development of analytic tools that can be employed by a wide range of users. Commercial vendors, such as Cloudera, Hortonworks, and MapR, may eventually provide the necessary connectivity between common BI analysis methodology and NoSQL. Since Apache Hadoop, as a stand-alone, open-source deployment, doesn’t contain internal manageability controls or high-level performance monitors, Cloudera offers a number of management tools that make analysis easier to implement for a range of users.
Cloudera’s Apache-licensed open source software, Cloudera’s Distribution Including Apache Hadoop (CDH), is in its fourth generation (CDH4). The offering includes a hot failover for the metadata function, NameNode. This is a critical contribution since NameNode is considered a single point of failure, essentially an Achille’s heel for Hadoop. The latest version of Cloudera’s Enterprise subscription offers a comprehensive package: high availability, improved security, Cloudera Manager for end-to-end Hadoop administration as well as long-term support.
Since part of the promise of big data requires getting past the hype and understanding appropriate applications of Hadoop, Hortonworks has created the Hortonwork Data Platform, version 1.0, which combines HA and failover requirements using VMware virtualization tools. The software’s approach relies on running NameNode and Hadoop’s JobTracker on virtual machines (VMs). This aspect helps to double up Hadoop’s fault tolerance through the automation of VM replacement for failed servers. The software also includes a GUI for dataset integration and for composing workflows as well as HCatalog that enables connectivity with RDBMS products.
MapR has chosen to solve the data volume issue via its replacement of Hadoop’s HDFS with a derivative of the UNIX-based file system, NFS. This helps to do away with the NameNode function altogether as a single point of failure. By swapping out HDFS, the company’s proprietary components claim to offer improved HA as well as higher scalability and performance.
Commercial Hadoop providers play a critical role in enabling wider platform adoption, and their support services allow the technology to be accessed by those organizations that might otherwise have difficulties around implementation. While these companies represent key players in the ongoing commercialization of Hadoop, they also offer an important function through their training and certification courses-a value that cannot be understated.
SAS, IBM, Cloudera, AWS, Hadoop Illuminated
I used referenced web site for this post : gladwinanalytics.com
We have a referrer link, for original article of this post, if you want you can follow gladwinanalytics.com
Special thanks for Anandh Shanmugaraj and gladwinanalytics team, for this article, and you can see the post at below link;
ETL tools are designed to save time and money by eliminating the need of ‘hand-coding’ when a new data warehouse is developed. They are also used to facilitate the work of the database administrators who connect different branches of databases as well as integrate or change the existing databases.
The main purpose of the ETL tool is:
extraction of the data from legacy sources (usually heterogenous)
data transformation (data optimized for transaction –> data optimized for analysis)
synchronization and cleansing of the data
loading the data into data warehouse.
There are several requirements that must be had by ETL tools in order to deliver an optimal value to users, supporting a full range of possible scenarios.
– data delivery and transformation capabilities
– data and metadata modelling capabilities
– data source and target support
– data governance capability
– runtime platform capabilities
– operations and administration capabilities
– service-enablements capability.
ETL TOOLS COMPARISON CRITERIA
The bietltools.com portal is not affiliated with any of the companies listed below in the comparison.
The research inclusion and exclusion criteria are as follows:
– range and mode of connectivity/adapter support
– data transformation and delivery modes support
– metadata and data modelling support
– design, development and data governance support
– runtime platform support
– enablement of service and three additional requirements for vendors:
– $20 milion or more of software revenue from data integration tools every year or not less than 300 production customers
– support of customers in not less than two major geographic regions
– have customer implementations at crossdepartamental and multiproject level.
We miss a few etl tools, but think generally. Of course in world we have lots of etl tools, but for now we couldnt investigate which one is we miss.
ETL TOOLS COMPARISON
The information provided below lists major strengths and weaknesses of the most popular ETL vendors.
IBM (Information Server Infosphere platform)
strongest vision on the market, flexibility
progress towards common metadata platform
high level of satisfaction from clients and a variety of initiatives
difficult learning curve
long implementation cycles
became very heavy (lots of GBs) with version 8.x and requires a lot of processing power
most substantial size and resources on the market of data integration tools vendors
consistent track record, solid technology, straightforward learning curve, ability to address real-time data integration schemes
Informatica is highly specialized in ETL and Data Integration and focuses on those topics, not on BI as a whole
focus on B2B data exchange
several partnerships diminishing the value of technologies
limited experience in the field.
Microsoft (SQL Server Integration Services)
broad documentation and support, best practices to data warehouses
ease and speed of implementation
standardized data integration
real-time, message-based capabilities
relatively low cost – excellent support and distribution model
problems in non-Windows environments. Takes over all Microsoft Windows limitations.
unclear vision and strategy
Oracle (OWB and ODI)
based on Oracle Warehouse Builder and Oracle Data Integrator – two very powerful tools;
tight connection to all Oracle datawarehousing applications;
tendency to integrate all tools into one application and one environment.
focus on ETL solutions, rather than in an open context of data management;
tools are used mostly for batch-oriented work, transformation rather than real-time processes or federation data delivery;
long-awaited bond between OWB and ODI brought only promises – customers confused in the functionality area and the future is uncertain
SAP BusinessObjects (Data Integrator / Data Services)
integration with SAP
SAP Business Objects created a firm company determined to stir the market;
Good data modeling and data-management support;
SAP Business Objects provides tools for data mining and quality; profiling due to many acquisitions of other companies.
Quick learning curve and ease of use
SAP Business Objects is seen as two different companies
Uncertain future. Controversy over deciding which method of delivering data integration to use (SAP BW or BODI).
BusinessObjects Data Integrator (Data Services) may not be seen as a stand-alone capable application to some organizations.
experienced company, great support and most of all very powerful data integration tool with lots of multi-management features
can work on many operating systems and gather data through number of sources – very flexible
great support for the business-class companies as well for those medium and minor ones
misplaced sales force, company is not well recognized
SAS has to extend influences to reach non-BI community
Data integration tools are a part of huge Java Composite Application Platform Suite – very flexible with ongoing development of the products
‘Single-view’ services draw together data from variety of sources; small set of vendors with a strong vision
relative weakness in bulk data movement
limited mindshare in the market
support and services rated below adequate
assembled a range of capabilities to be able to address a mulitude of data delivery styles
size and global presence of Sybase create opportunities in the market
pragmatic near-term strategy – better of current market demand
broad partnerships with other data quality and data integration tools vendors
falls behind market leaders and large vendors
gaps in many aspects of data management
functionality; well-known brand on the market (40 years experience); loyal customer and experience base;
easy implementation, strong performance, targeted functionality and lower costs
struggle with gaining mind share in the market
lack of support for other than ETL delivery styles
unsatisfactory with lack of capability of professional services
message-oriented application integration; capabilities based on common SOA structures;
support for federated views; easy implementation, support andperformance
scarce references from customers; not widely enough recognised for data integration competencies
lacking in data quality capabilities.
proven and mature code-generating architecture
one of the earliest vendors on the data integration market; support for SOA service-oriented deployments;
successfully deals with large data volumes and a high degree of complexity, extension of the range of data platforms and data sources;
customers’ positive responses to ETI technology
relatively slow growth of customer base
rather not attractive and inventive technology.
offers physical data movement and delivery; support of wide range of adapters and access to numerous sources;
well integrated, standard tools;
reasonable ease of implementation effort
gaps in specific capabilities
relatively costly – not competitive versus market leaders
many customers, years of experience, solid applications and support;
good use of metadata
upgrade from older versions into newer is straightforward.
inconsistency in defining the target for their applications;
no federation capability;
limitated presence due to poor marketing.
Simplicity of use in less-structured sources
Easy licensing for business solutions
cooperates with a wide range of sources and targets
increasingly high functionality
limited federation, replication and data quality support; rare upgrades due to its simplicity;
weak real-time support due to use third party solutions and other database utilities.
Pitney Bowes Software
Data Flow concentrates on data integrity and quality;
supports mainly ETL patterns; can be used for other purposes too;
ease of use, fast implementation, specific ETL functionality.
rare competition with other major companies, repeated rebranding trigger suspicions among customers.
narrow vision of possibilities even though Data Flow comes with variety of applications.
weak support, unexperienced service.
I used referenced web site : etltools.net
We have a referrer link, for original article of this post, if you want you can follow etltools.net
Special thanks for this article, and you can see the post at below link;
Almost every Chief Information Officer (CIO) has the goal of integrating their organization’s data. In fact the issue of data integration has risen all the way to the Chief Financial Officer
(CFO) and Chief Executive Officer (CEO) level of a corporation. A key question is why is data integration becoming so important to so many C-level executives? There are several key reasons driving
Provide IT Portfolio Management
Reduce IT Redundancy
Prevent IT Applications Failure
Provide IT Portfolio Management
Over the years I have had the opportunity to perform dozens of data warehousing assessments. During these assessments I always ask the client how much they spend annually on data warehousing. The
majority of companies and government organizations cannot give a relatively good estimate on what they actually spend. In order to manage these and any other costly information technology (IT)
initiatives it is critical to measure each one of them. However, it is impossible to measure them when most companies do not understand them (see Figure 1: “How To Manage IT”). This is
where IT Portfolio Management enters the picture.
Figure 1: How To Manage IT
IT portfolio management refers to the formal process for managing IT assets. An IT asset is software, hardware, middleware, IT projects, internal staff, applications and external consulting. Like
every newer discipline, many companies that have started their IT portfolio management efforts have not done so correctly. I would like to list out some of the keys to building successful IT
portfolio management applications.
By properly managing their IT portfolio it allows the corporation to see which projects are proceeding well and which are lagging behind. In my experience, almost every large company has a great
deal of duplicate IT effort occurring (see later section on “Reduce IT Redundancy”). This happens because the meta data is not accessible. At my company we have a couple of large
clients whose primary goal is to remove these tremendous redundancies, which translates into tremendous initial and ongoing IT costs.
Reduce IT Redundancy
CIO is commonly defined as Chief Information Officer; however, there is another possible meaning to this acronym; Career Is Over. One of the chief reasons for this is that most IT departments are
“handcuffed” in needless IT redundancy that too few CIOs are willing and capable of fixing.
There are several CIO surveys that are conducted annually. These surveys ask “what are your top concerns for the upcoming year”. Regardless of the survey you look at “data
integration” will be high on the list. Now data integration has two facets to it. One is the integration of data across disparate systems for enterprise applications. The second is the
integration/removal of IT redundancies. Please understand that some IT redundancy is a good thing. For example, when there is a power outage and one of your data centers is non-operational you need
to have a backup of these systems/data. However, when I talk about IT redundancies I am addressing “needless” IT redundancy. Meaning, IT redundancy that only exists because of
insufficient management of our IT systems. I was working with a Midwestern insurance company that, over a four year span had initiated various decision support efforts. After this four year period
they took the time to map out the flow of data from their operational systems, to their data staging areas and finally to their data mart structures. What they discovered was Figure 2:
“Typical IT Architecture”.
Figure 2: Typical IT Architecture
What is enlightening about Figure 2 is that when I show this illustration during a client meeting or at a conference keynote address the typical response that I receive from the people is
“Where did you get a copy of our IT architecture?” If you work at a Global 2000 company or any large government entity, Figure 2 represents an overly simplified version of your IT
architecture. These poor architecture habits create a litany of problems including:
Needless IT Rework
It has been my experience working with large government agencies and Global 2000 companies that needlessly duplicate data is running rampant throughout our industry. In my experience the typical
large organization has between 3 – 4 fold needless data redundancy. Moreover, I can name multiple organizations that have literally hundreds of “independent” data mart
applications spread all over the company. Each one of these data marts is duplicating the extraction, transformation and load (ETL) that is typically done centrally in a data warehouse. This
greatly increases the number of support staff required to maintain the data warehousing system as these tasks are the largest and most costly data warehousing activities. Besides duplicating this
process, each data mart will also copy the data as well requiring further IT resources. It is easy to see why IT budgets are straining under the weight of all of this needless redundancy.
Needless IT Rework
During the requirements gathering portion of one of our meta data management initiatives I had an IT project manager discuss the challenges that he is facing in analyzing one of the
mission-critical legacy applications that will feed the data warehousing application that his team has been tasked to build. During our interview he stated, “This has to be the twentieth time
that our organization is analyzing this system to understand the business rules around the data.” This person’s story is an all too common one as almost all organizations reinvent the
IT wheel on every project. This situation occurs because usually separate teams will typically build each of the IT systems and since they don’t have a Managed Meta Data Environment (MME),
these teams do not leverage the other’s standards, processes, knowledge, and lessons learned. This results in a great deal of rework and reanalysis.
I have discussed a great deal about the redundant application and IT work that occurs in the industry. All of this redundancy also generates a great deal of needless hardware and software
redundancy. This situation forces the enterprise to retain skilled employees to support each of these technologies. In addition, a great deal of financial savings is lost, as standardization on
these tools doesn’t occur. Often a software, hardware, or tool contract can be negotiated to provide considerable discounts for enterprise licenses, which can be phased into. These economies
of scale can provide tremendous cost savings to the organization.
In addition, the hardware and software that is purchased is not used in an optimal fashion. For example, I have a client that has each one of their individual IT projects buy their own hardware. As
a result, they are infamous for having a bunch of servers running at 25% capacity.
From the software perspective the problem only gets worse. While analyzing a client of mine I had asked their IT project leaders what software vendors have you standardized on? They answered
“all of them!” This leads to the old joke “What is the most popular form of software on the market? Answer…Shelfware!” Shelfware is software that a company purchases
and winds up never using and it just sits on the shelf collecting dust.
Prevent IT Applications Failure
When a corporation looks to undertake a major IT initiative, like a customer relationship management (CRM), enterprise resource planning (ERP), data warehouse, or e-commerce solution their
likelihood of project failure is between 65% – 80%, depending on the study referenced. This is especially alarming when we consider that these same initiatives traditionally have executive
management support and cost many millions of dollars. For example, I have one large client that is looking to roll out a CRM system (e.g. Siebel, Oracle) and an ERP system (e.g. SAP, PeopleSoft)
globally in the next four years. Their initial project budget is over $125 million! In my opinion they have a 0% probability of delivering all of these systems on-time and on-budget. Consider this,
when was that last time that you’ve seen an ERP or CRM initiative being delivered on time or on budget?
When we examine the causes for these projects failure several themes become apparent. First, these projects did not address a definable and measurable business need. This is the number one reason
for project failure, data warehouse, CRM, MME, or otherwise. As IT professionals we must always be looking to solve business problems or capture business opportunities. Second, the projects that
fail have a very difficult time understanding their company’s existing IT environment and business rules. This includes custom applications, vendor applications, data elements, entities, data
flows, data heritage and data lineage.
MME’s Focus On Data Integration
Many of these Global 2000 companies and large government organizations are targeting MME technology to assist them in identifying and removing existing application and data redundancy. Moreover,
many companies are actively using their MME to identify redundant applications through analysis of the data. These same companies are starting IT application integration projects to merge these
overlapping systems and to ensure that future IT applications do not proliferate needless redundancy.
If your organization can reduce their applications, processes, data, software and hardware, lowers the likelihood for IT project failure and speeds up the IT development life-cycle, then clearly it
will greatly reduce a company’s IT expenditures. For example, I have a large banking client that asked my company to analyze their IT environment. During this analysis we discovered that they
have a tremendous amount of application and data redundancy. Moreover, I had figured out that they have over 700 unique applications. I then compared this client to a bank that is more than twice
there size; however, this larger bank has a world class MME and uses it to properly manage their systems. As a result, they have less than 250 unique applications. Clearly the bank with more than
700 applications has a great deal of needless redundancy as compared to a bank that is more than twice their size and has less than 250 applications. Interestingly enough the bank that has less
than 250 applications and has a world-class MME is also 14 times more profitable than the bank maintaining over 700 applications. It doesn’t seem like a very far stretch to see that the less
profitable bank would become much more profitable if they removed this redundancy.
I used referenced web site : tdan.com
We have a referrer link, for original article of this post, if you want you can follow TDAN (The Data Administration Newsletter)
Special thanks for this tdan comunity, and you can see the post at below link;
The phrase unstructured data usually refers to information that doesn’t reside in a traditional row-column database. As you might expect, it’s the opposite of structured data — the data stored in fields in a database.
Examples of Unstructured Data
Unstructured data files often include text and multimedia content. Examples include e-mail messages, word processing documents, videos, photos, audio files, presentations, webpages and many other kinds of business documents. Note that while these sorts of files may have an internal structure, they are still considered “unstructured” because the data they contain doesn’t fit neatly in a database.
Experts estimate that 80 to 90 percent of the data in any organization is unstructured. And the amount of unstructured data in enterprises is growing significantly — often many times faster than structured databases are growing.
Mining Unstructured Data
Many organizations believe that their unstructured data stores include information that could help them make better business decisions. Unfortunately, it’s often very difficult to analyze unstructured data. To help with the problem, organizations have turned to a number of different software solutions designed to search unstructured data and extract important information. The primary benefit of these tools is the ability to glean actionable information that can help a business succeed in a competitive environment.
Because the volume of unstructured data is growing so rapidly, many enterprises also turn to technological solutions to help them better manage and store their unstructured data. These can include hardware or software solutions that enable them to make the most efficient use of their available storage space.
Unstructured Data and Big Data
As mentioned above, unstructured data is the opposite of structured data. Structured data generally resides in a relational database, and as a result, it is sometimes called relational data. This type of data can be easily mapped into pre-designed fields. For example, a database designer may set up fields for phone numbers, zip codes and credit card numbers that accept a certain number of digits. Structured data has been or can be placed in fields like these. By contrast, unstructured data is not relational and doesn’t fit into these sorts of pre-defined data models.
In addition to structured and unstructured data, there’s also a third category: semi-structured data. Semi-structured data is information that doesn’t reside in a relational database but that does have some organizational properties that make it easier to analyze. Examples of semi-structured data might include XML documents and NoSQL databases.
The term big data is closely associated with unstructured data. Big data refers to extremely large datasets that are difficult to analyze with traditional tools. Big data can include both structured and unstructured data, but IDC estimates that 90 percent of big data is unstructured data. Many of the tools designed to analyze big data can handle unstructured data.
Unstructured Data Management
Organizations use of variety of different software tools to help them organize and manage unstructured data. These can include the following:
Big data tools
Software like Hadoop can process stores of both unstructured and structured data that are extremely large, very complex and changing rapidly.
Business intelligence software
Also known as BI, business intelligence is a broad category of analytics, data mining, dashboards and reporting tools that help companies make sense of their structured and unstructured data for the purpose of making better business decisions.
Data integration tools
These tools combine data from disparate sources so that they can be viewed or analyzed from a single application. They sometimes include the capability to unify structured and unstructured data.
Document management systems
Also called enterprise content management systems, a DMS can track, store and share unstructured data that is saved in the form of document files.
Information management solutions
This type of software tracks structured and unstructured enterprise data throughout its lifecycle.
Search and indexing tools
These tools retrieve information from unstructured data files such as documents, Web pages and photos.
Unstructured Data Technology
A group called the Organization for the Advancement of Structured Information Standards (OASIS) has published the Unstructured Information Management Architecture (UIMA) standard. The UIMA “defines platform-independent data representations and interfaces for software components or services called analytics, which analyze unstructured information and assign semantics to regions of that unstructured information.”
Many industry watchers say that Hadoop has become the de facto industry standard for managing Big Data. This open source project is managed by the Apache Software Foundation.
I used referenced web site : webopedia.com
We have a few referrer link, like big data, hadoop, business intelligence etch.
Structured data refers to any data that resides in a fixed field within a record or file. This includes data contained in relational databases and spreadsheets.
Characteristics of Structured Data
Structured data first depends on creating a data model – a model of the types of business data that will be recorded and how they will be stored, processed and accessed. This includes defining what fields of data will be stored and how that data will be stored: data type (numeric, currency, alphabetic, name, date, address) and any restrictions on the data input (number of characters; restricted to certain terms such as Mr., Ms. or Dr.; M or F).
Structured data has the advantage of being easily entered, stored, queried and analyzed. At one time, because of the high cost and performance limitations of storage, memory and processing, relational databases and spreadsheets using structured data were the only way to effectively manage data. Anything that couldn’t fit into a tightly organized structure would have to be stored on paper in a filing cabinet.
Managing Structured Data
Structured data is often managed using Structured Query Language (SQL) – a programming language created for managing and querying data in relational database management systems. Originally developed by IBM in the early 1970s and later developed commercially by Relational Software, Inc. (now Oracle Corporation).
Structured data was a huge improvement over strictly paper-based unstructured systems, but life doesn’t always fit into neat little boxes. As a result, the structured data always had to be supplemented by paper or microfilm storage. As technology performance has continued to improve, and prices have dropped, it was possible to bring into computing systems unstructured and semi-structured data.
Unstructured and Semi-Structured Data
Unstructured data is all those things that can’t be so readily classified and fit into a neat box: photos and graphic images, videos, streaming instrument data, webpages, PDF files, PowerPoint presentations, emails, blog entries, wikis and word processing documents.
Semi-structured data is a cross between the two. It is a type of structured data, but lacks the strict data model structure. With semi-structured data, tags or other types of markers are used to identify certain elements within the data, but the data doesn’t have a rigid structure. For example, word processing software now can include metadata showing the author’s name and the date created, with the bulk of the document just being unstructured text. Emails have the sender, recipient, date, time and other fixed fields added to the unstructured data of the email message content and any attachments. Photos or other graphics can be tagged with keywords such as the creator, date, location and keywords, making it possible to organize and locate graphics. XML and other markup languages are often used to manage semi-structured data.
Structured Data Technology Standards
SQL has been a standard of the American National Standards Institute since 1986. It is managed by InterNational Committee for Information Technology Standards (INCITS) Technical Committee DM 32 – Data Management and Interchange. The committee has two task groups, one for databases and the other for metadata. HP, CA, IBM, Microsoft, Oracle, Sybase (SAP) and Teradata all participate, as well as several federal government agencies. Both of the committee project documents have links to further information on each project. SQL became an International Organization for Standards (ISO) standard in 1987. The published standards are available for purchase from the ANSI eStandards Store, under the INCITS/ISO/IEC 9075 classification.
I used referenced web site : webopedia.com
We have a few referrer link, like data model etch.
If left unmanaged, your data can become overwhelming, making it difficult to procure information you need when you need it. While software is designed to address archiving, e-discovery, compliance, etc., the overarching goal is most always the same: to make managing and maintaining data a feasible task. In this post, you’ll see two types of data you’re accustomed to working with, paying close attention to the differences between structured and unstructured data.
What is Structured Data?
Before getting into unstructured data, you need to have an understanding for its structured counterpart. Structured data (as explained succinctly in Big Data Republic’s video) is information, usually text files, displayed in titled columns and rows which can easily be ordered and processed by data mining tools. This could be visualized as a perfectly organized filing cabinet where everything is identified, labeled and easy to access. Most organizations are likely to be familiar with this form of data and already using it effectively, so let’s move on to the hotter question.
What is Unstructured Data?
Believe it or not, your database of structured information doesn’t even contain half of the information available for your use! Seth Grimes, a leading industry analyst on the confluence of structured and unstructured data sources, published an article that stated, “80% of business-relevant information originates in unstructured form, primarily text.” This may seem like an outlandish percentage, but don’t jump to conclusions too fast. We’re just getting started.
Now that you have a grasp on structured data, it will be much easier to understand what unstructured data is. Unstructured data, usually binary data that is proprietary, is that which has no identifiable internal structure. It can be visualized as a level 5 hoarder’s living room; it’s a massive unorganized conglomerate of various objects that are worthless until identified and stored in an organized fashion. Once this organization process has taken place (through the use of specialized software), the items can then be searched through and categorized (to an extent) for obtaining insights. While data mining tools might not be equipped to parse information in email messages (however organized it may be), you may have very good reason to collect and categorize data from this source. This illustrates the importance and plausible breadth of unstructured data.
Email Has Structure, Right?
The term “unstructured” has faced major scrutiny for several reasons. One argument is that although some form of structure is not formally identified, it can still be implied and therefore should not be labeled as “unstructured.” The counter-point states that if data has some form of structure but is not helpful to the processing task at hand, it may still be characterized as “unstructured.” So, while email messages may contain information that does have some implied structure, we can label the information as “unstructured” because normal data mining tools aren’t equipped to parse it. Alas, both sides of the argument persist.
Unstructured Data Types
Unstructured data is raw and unorganized and organizations store it all. Ideally, all of this information would be converted into structured data however, this would be costly and time consuming. Also, not all types of unstructured data can easily be converted into a structured model. For example, an email holds information such as the time sent, subject, and sender (all uniform fields), but the content of the message is not so easily broken down and categorized. This can introduce some compatibility issues with the structure of a relational database system.
In case you’re still not quite sure what we mean, here is a limited list of types of unstructured data:
Word Processing Files
Social Media Posts
Looking at the list, you may be wondering what these files have in common. The files listed above can be stored and managed without the format of the file being understood by the system. This allows them to be stored in an unstructured fashion because the contents of the files are unorganized.
The big data industry is growing but the problem of unstructured data going unused has been identified by organizations. Better yet, technologies and services are being developed in reaction. Darin Stewart of InformationWeek said in a recent article about big data, “The age of information overload is slowly drawing to a close. Enterprises are finally getting comfortable with managing massive amounts of data, content and information. The pace of information creation continues to accelerate, but the ability of infrastructure and information management to keep pace is coming within sight. Big Data is now considered a blessing rather than a curse.”