Apache™ Hadoop® is a highly scalable open-source storage platform designed for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks i.e. it can process very large data sets across hundreds and thousands of computing nodes that operate in parallel. It provides a cost effective storage solution for large data volumes with no format requirements.
Hadoop is an ecosystem of open source components that fundamentally changed the way enterprises store, process, and analyze data. Unlike traditional systems, Hadoop enables multiple types of analytic workloads to run on the same data, at the same time, at massive scale on industry-standard hardware.
What are those terms mean?
Open-source software: Open-source software is created and maintained by a network of developers from around the globe. It’s free to download, use and contribute to, though more and more commercial versions of Hadoop are becoming available.
Framework: In this case, it means that everything you need to develop and run software applications is provided – programs, connections, etc.
Massive storage: The Hadoop framework breaks big data into blocks, which are stored on clusters of commodity hardware.
Processing power: Hadoop concurrently processes large amounts of data using multiple low-cost computers for fast results.
It all started with the World Wide Web. As the web grew in the late 1900s and early 2000s, search engines and indexes were created to help locate relevant information amid the text-based content. In the early years, search results really were returned by humans. But as the web grew from dozens to millions of pages, automation was needed. Web crawlers were created, many as university-led research projects, and search engine start-ups took off (Yahoo, AltaVista, etc.).
One such project was an open-source web search engine called Nutch – the brainchild of Doug Cutting and Mike Cafarella. They wanted to invent a way to return web search results faster by distributing data and calculations across different computers so multiple tasks could be accomplished simultaneously. During this time, another search engine project called Google was in progress. It was based on the same concept – storing and processing data in a distributed, automated way so that relevant web search results could be returned faster.
In 2006, Cutting joined Yahoo and took with him the Nutch project as well as ideas based on Google’s early work with automating distributed data storage and processing. The Nutch project was divided. The web crawler portion remained as Nutch. The distributed computing and processing portion became Hadoop (named after Cutting’s son’s toy elephant). In 2008, Yahoo released Hadoop as an open-source project. Today, Hadoop’s framework and ecosystem of technologies are managed and maintained by the non-profit Apache Software Foundation (ASF), a global community of software developers and contributors.
Components of Hadoop
Currently, four core modules are included in the basic framework from the Apache Foundation:
1. Hadoop Common
The libraries and utilities used by other Hadoop modules.
2. Hadoop Distributed File System (HDFS)
The Java-based scalable system that stores data across multiple machines without prior organization.
Hadoop also includes a distributed storage system, the Hadoop Distributed File System (HDFS), which stores data across local disks of your cluster in large blocks. HDFS has a configurable replication factor (with a default of 3x), giving increased availability and durability. HDFS monitors replication and balances your data across your nodes as nodes fail and new nodes are added.
MapReduce – a software programming model for processing large sets of data in parallel.
YARN – resource management framework for scheduling and handling resource requests from distributed applications. (YARN is an acronym for Yet Another Resource Negotiator)
Hadoop MapReduce, an execution engine in Hadoop, processes workloads using the MapReduce framework which breaks down jobs into smaller pieces of work that can be distributed across nodes in your Amazon EMR cluster. The Hadoop MapReduce engine is built with the expectation that any given machine in your cluster could fail at any time and is designed for fault tolerance. If a server running a task fails, Hadoop reruns that task on another machine until completion.
You can write MapReduce programs in Java, or use Hadoop Streaming to execute custom scripts in a parallel fashion, Hive and Pig (if you choose to install these applications on your Amazon EMR cluster) for higher level abstractions over MapReduce, or other tools to interact with Hadoop.
Starting with Hadoop 2, resource management is managed by Yet Another Resource Negotiator (YARN). YARN keeps track of all the resources across your cluster, and it ensures that these resources are dynamically allocated to accomplish the tasks in your processing job. YARN is able to manage Hadoop MapReduce workloads as well as other distributed frameworks such as Apache Spark, Apache Tez, and more.
Other software components that can run on top of or alongside Hadoop and have achieved top-level Apache project status include:
A platform for manipulating data stored in HDFS that includes a compiler for MapReduce programs and a high-level language called Pig Latin. It provides a way to perform data extractions, transformations and loading, and basic analysis without having to write MapReduce programs.
A data warehousing and SQL-like query language that presents data in the form of tables. Hive programming is similar to database programming. (It was initially developed by Facebook.)
A nonrelational, distributed database that runs on top of Hadoop. HBase tables can serve as input and output for MapReduce jobs.
A table and storage management layer that helps users share and access data.
A web interface for managing, configuring and testing Hadoop services and components.
A distributed database system.
A data collection system for monitoring large distributed systems.
Software that collects, aggregates and moves large amounts of streaming data into HDFS.
A Hadoop job scheduler.
A connection and transfer mechanism that moves data between Hadoop and relational databases.
An open-source cluster computing framework with in-memory analytics.
An scalable search tool that includes indexing, reliability, central configuration, failover and recovery.
An application that coordinates distributed processes.
In addition, there are commercial distributions of Hadoop, including Cloudera, Hortonworks and MapR. With distributions from software vendors, you pay for their version of the framework and receive additional software components, tools, training, documentation and other services.
Uses of Hadoop
Hadoop’s infinitely scalable flexible architecture (based on the HDFS filesystem) allows organizations to store and analyze unlimited amounts and types of data—all in a single, open source platform on industry-standard hardware.
The modest cost of commodity hardware makes Hadoop useful for storing and combining data such as transactional, social media, sensor, machine, scientific, click streams, etc. The low-cost storage lets you keep information that is not deemed currently critical but that you might want to analyze later.
2. Data lake
Hadoop is often used to store large amounts of data without the constraints introduced by schemas commonly found in the SQL-based world. It is used as a low-cost compute-cycle platform that supports processing ETL and data quality jobs in parallel using hand-coded or commercial data management technologies. Refined results can then be passed to other systems (e.g., EDWs, analytic marts) as needed.
Quickly integrate with existing systems or applications to move data into and out of Hadoop through bulk load processing (Apache Sqoop) or streaming (Apache Flume, Apache Kafka).?
Transform complex data, at scale, using multiple data access options (Apache Hive, Apache Pig) for batch (MR2) or fast in-memory (Apache Spark) processing. Process streaming data as it arrives in your cluster via Spark Streaming.
Because Hadoop was designed to deal with volumes of data in a variety of shapes and forms, it can run analytical algorithms. Big data analytics on Hadoop can help your organization operate more efficiently uncover new opportunities and derive next-level competitive advantage. The sandbox approach provides an opportunity to innovate with minimal investment.
Analysts interact with full-fidelity data on the fly with Apache Impala (incubating), the analytic database for Hadoop. With Impala, analysts experience BI-quality SQL performance and functionality plus compatibility with all the leading BI tools.
Using Cloudera Search, an integration of Hadoop and Apache Solr, analysts can accelerate the process of discovering patterns in data in all amounts and formats, especially when combined with Impala.
5. Recommendation systems
One of the most popular analytical uses by some of Hadoop’s largest adopters is for web-based recommendation systems. Facebook – people you may know. LinkedIn – jobs you may be interested in. Netflix, eBay, Hulu – items you may be interested in. These systems analyze huge amounts of data in real time to quickly predict preferences before customers leave the web page.
With Hadoop, analysts and data scientists have the flexibility to develop and iterate on advanced statistical models using a mix of partner technologies as well as open source frameworks like Apache Spark.
The distributed data store for Hadoop, Apache HBase, supports the fast, random reads/writes (“fast data”) required for online applications.
Difficulties in Hadoop Adoption
The scale-out potential of Apache Hadoop is impressive. However, while Hadoop offers the advantage of using low-cost commodity servers, extending this scale-out potential to thousands of nodes can translate to a true expense. As the demand for compute and analytic capacity grows, so can the machine costs. This has an equal effect on storage since Hadoop spreads out data, and companies must have equal space for increased data storage repositories, including all the indices, and for all the acquired raw data.
Integrating and processing all of this diverse data can be costly in terms of both infrastructure and personnel. While traditional BI relies on evaluating transactional and historical data, today’s analytics require more skill in iterative analysis and the ability to recognize patterns. When dealing with big data, an advanced skillset that goes beyond RDBMS capabilities-both in terms of analysis and programming-is essential. Not only is there need for advanced systems administration and analyst capabilities when working with Hadoop, but learning the MapReduce programming unique to this framework represents a significant hurdle.
In terms of relational databases, moving and modifying large volumes of unstructured data into the necessary form for Extraction, Transformation, Loading (ETL) can be both costly and time-consuming. That’s a key reason why Hadoop seems so attractive. One could argue that the ongoing growth in data volume, velocity, and variety has made the traditional data warehousing architecture less and less viable. However, it is still easier to find experienced RDBMS programmers and developers than those with MapReduce programming capabilities. Part of the difficulty lies in just learning the language beyond having the skills to install and maintain the Hadoop platform.
MapReduce uses a computational approach that employs a Map pre-processing function and a Reduce data aggregation/distillation step. However, when it comes to real-time transactional data analysis, the low latency reads and writes characteristic of RDBMS structured data processing are simply not possible with HDFS and MapReduce.
Of course, as the platform matures, more features will continue to be added to it. While add-on products make Hadoop easier to use, they also present a learning challenge that requires constantly expanding one’s expertise. For example:
Hive is the data warehousing component of Hadoop, and it functions well with structured data, enabling ad-hoc queries against large transactional datasets. On the other hand, though workarounds do exist, the absence of any ETL-style tool makes HiveQL, the SQL-like programming dialect, problematic when working with unprocessed, unstructured data.
HBase, the column-based storage system, enables users to employ Hadoop datasets as though they’re indices in any conventional RDBMS. It typically allows easy column creation and lets the user store virtually any structure within a data element.
PIG represents the high-level dataflow language, Pig Latin, and requires quite advanced training. It provides easier access to data held in Hadoop clusters and offers a means for analyzing large datasets. In part, PIG enables the implementation of simple or complex workflows and the designation of multiple data inputs where data can then be processed by multiple operators.
As IT organizations consider wholesale adoption of the Hadoop platform for analytics, they must carefully strategize their approach. The platform’s specialized methodology, scale-out storage, and powerful processing capacity make it optimal for analytical data loads. However, the dedication in training competent personnel and machine costs, as well as the framework’s inability to function as an RDBMS replacement, should prompt careful consideration.
Hardware and Software for Hadoop
Hadoop runs on commodity hardware. That doesn’t mean it runs on cheapo hardware. Hadoop runs on decent server class machines.
Here are some possibilities of hardware for Hadoop nodes.
So the high end machines have more memory. Plus, newer machines are packed with a lot more disks (e.g. 36 TB) — high storage capacity.
Examples of Hadoop servers
HP ProLiant DL380
Dell C2100 series
Supermicro Hadoop series
So how does a large hadoop cluster looks like? Here is a picture of Yahoo’s Hadoop cluster.
Hadoop runs well on Linux. The operating systems of choice are:
RedHat Enterprise Linux (RHEL)
This is a well tested Linux distro that is geared for Enterprise. Comes with RedHat support
Source compatible distro with RHEL. Free. Very popular for running Hadoop. Use a later version (version 6.x).
The Server edition of Ubuntu is a good fit — not the Desktop edition. Long Term Support (LTS) releases are recommended, because they continue to be updated for at least 2 years.
Hadoop is written in Java. The recommended Java version is Oracle JDK 1.6 release and the recommended minimum revision is 31 (v 1.6.31).
So what about OpenJDK? At this point the Sun JDK is the ‘official’ supported JDK. You can still run Hadoop on OpenJDK (it runs reasonably well) but you are on your own for support 🙂
Business Intelligence Tools For Hadoop and Big Data
The case for BI Tools
Analytics for Hadoop can be done by the following:
Writing custom Map Reduce code using Java, Python, R ..etc
Using high level Pig scripts
Using SQL using Hive
How ever doing analytics like this can feel a little pedantic and time consuming. Business INtelligence tools (BI tools for short) can address this problem.
BI tools have been around since before Hadoop. Some of them are generic, some are very specific towards a certain domain (e.g. Telecom, Health Care ..etc). BI tools provide rich, user friendly environment to slice and dice data. Most of them have nice GUI environments as well.
BI Tools Feature Matrix Comparison
Since Hadoop is gaining popularity as a data silo, a lot of BI tools have added support to Hadoop. In this chapter we will look into some BI tools that work with Hadoop.
We are trying to present capabilities of BI tools in an easy to compare feature matrix format. This is a ‘living’ document. We will keep it updated as new versions and new features surface.
This matrix is under construction
How to read the matrix?
Y – feature is supported
N – feature is NOT supported
? or empty – unknown
Table: BI Tools Comparison : Data Access and Management
Table: BI Tools Comparison : Analytics
Table: BI Tools Comparison : Visualizing
Table: BI Tools Comparison : Connectivity
Glossary of terms
Can validate data confirms to certain limits, can do cleansing and de-duping.
Share with others
Can share the results with others within or outside organization easily. (Think like sharing a document on DropBox or Google Drive)
You can slice and dice data on locally on a computer or tablet. This uses the CPU power of the device and doesn’t need a round-trip to a ‘server’ to process results. This can speed up ad-hoc data exploration
Analytics ‘app store’
The platform allows customers to buy third party analytics app. Think like APple App Store.
Hadoop For Executives
This section is a quick ‘fact sheet’ in a Q&A format.
What is Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.
What is the license of Hadoop?
Hadoop is open source software. It is an Apache project released under Apache Open Source License v2.0. This license is very commercial friendly.
Who contributes to Hadoop?
Originally Hadoop was developed and open sourced by Yahoo. Now Hadoop is developed as an Apache Software Foundation project and has numerous contributors from Cloudera, Horton Works, Facebook, etc.
Isn’t Hadoop used by foo-foo social media companies and not by enterprises
Hadoop had its start in a Web company. It was adopted pretty early by social media companies because the companies had Big Data problems and Hadoop offered a solution.
However, Hadoop is now making inroads into Enterprises.
I am not sure my company has a big data problem
Hadoop is designed to deal with Big Data. So if you don’t have a ‘Big Data Problem’, then Hadoop probably isn’t the best fit for your company. But before you stop reading right here, please read on 🙂
How much data is considered Big Data, differs from company to company. For some companies, 10 TB of data would be considered Big Data; for others 1 PB would be ‘Big Data’. So only you can determine how much is Big Data.
Also, if you don’t have a ‘Big Data problem’ now, is that because you are not capturing some data? In some scenarios, companies chose to forgo capturing data, because there wasn’t a feasible way to store and process it. Now that Hadoop can help with Big Data, it may be possible to start capturing data that wasn’t captured before.
How much does it cost to adopt Hadoop?
Hadoop is open source. The software is free. However running Hadoop does have other cost components.
Cost of hardware : Hadoop runs on a cluster of machines. The cluster size can be anywhere from 10 nodes to 1000s of nodes. For a large cluster, the hardware costs will be significant.
The cost of IT / OPS for standing up a large Hadoop cluster and supporting it will need to be factored in.
Since Hadoop is a newer technology, finding people to work on this ecosystem is not easy.
Hadoop for Developers
This section is a quick ‘fact sheet’ in a Q&A format.
What is Hadoop?
Hadoop is an open source software stack that runs on a cluster of machines. Hadoop provides distributed storage and distributed processing for very large data sets.
Is Hadoop a fad or here to stay?
Sure, Hadoop and Big Data are all the rage now. But Hadoop does solve a real problem and it is a safe bet that it is here to stay.
Below is a graph of Hadoop job trends from Indeed.com. As you can see, demand for Hadoop skills has been up and up since 2009. So Hadoop is a good skill to have!
Hadoop Job Trends
What skills do I need to learn Hadoop?
A hands-on developer or admin can learn Hadoop. The following list is a start – in no particular order
Hadoop is written in Java. So knowing Java helps
Hadoop runs on Linux, so you should know basic Linux command line navigation skills
Some Linux scripting skills will go a long way
What kind of technical roles are available in Hadoop?
The following should give you an idea of the kind of technical roles in Hadoop.
I am not a programmer, can I still use Hadoop?
Yes, you don’t need to write Java Map Reduce code to extract data out of Hadoop. You can use Pig and Hive. Both Pig and Hive offer ‘high level’ Map Reduce. For example you can query Hadoop using SQL in Hive.
What kind of development tools are available for Hadoop?
Hadoop development tools are still evolving. Here are a few:
Karmasphere IDE : tuned for developing for Hadoop
Eclipse and other Java IDEs : When writing Java code
Command line editor like VIM : No matter what editor you use, you will be editing a lot of files / scripts. So familiarity with CLI editors is essential.
Where can I learn more?
Tom White’s Hadoop Book : This is the ‘Hadoop Bible’
The Commercial Platform Approach to Apache Hadoop
As mentioned above, businesses dealing with increasing masses of data are looking for a distributed computing analytic solution that provides comprehensive administration and management, easy deployment, and support for effective business continuity and high availability.
Today, commercial open-source models that incorporate MapReduce along with a built-in framework and infrastructure offer another means for avoiding the learning curve and burdens associated with Apache Hadoop deployment. These commercial players ease skillset acquisition by providing key management tools that interface with Hadoop processes. The value of technical support, services, and training cannot be overstated when it comes to Hadoop implementation.
Commercial vendors offer a means by which these high-level analysis tools can be accessed and used by a wide variety of users, not just those with engineering or BI capabilities. They provide the support that ensures Hadoop users can undertake complex data analysis projects.
As open-source tools proliferate and their increasing importance to big data analytics continues to grow, a need for streamlined administration and support will expand as well. While commercial Hadoop providers offer the necessary support, there is no alternative to learning its platform-specific language. Adequate knowledge of MapReduce represents an intrinsic component to working with Hadoop. Moreover, in order for users to install, configure, and use the code, thorough training is fundamental.
Hadoop integration with current BI analytics remains a key goal along with the development of analytic tools that can be employed by a wide range of users. Commercial vendors, such as Cloudera, Hortonworks, and MapR, may eventually provide the necessary connectivity between common BI analysis methodology and NoSQL. Since Apache Hadoop, as a stand-alone, open-source deployment, doesn’t contain internal manageability controls or high-level performance monitors, Cloudera offers a number of management tools that make analysis easier to implement for a range of users.
Cloudera’s Apache-licensed open source software, Cloudera’s Distribution Including Apache Hadoop (CDH), is in its fourth generation (CDH4). The offering includes a hot failover for the metadata function, NameNode. This is a critical contribution since NameNode is considered a single point of failure, essentially an Achille’s heel for Hadoop. The latest version of Cloudera’s Enterprise subscription offers a comprehensive package: high availability, improved security, Cloudera Manager for end-to-end Hadoop administration as well as long-term support.
Since part of the promise of big data requires getting past the hype and understanding appropriate applications of Hadoop, Hortonworks has created the Hortonwork Data Platform, version 1.0, which combines HA and failover requirements using VMware virtualization tools. The software’s approach relies on running NameNode and Hadoop’s JobTracker on virtual machines (VMs). This aspect helps to double up Hadoop’s fault tolerance through the automation of VM replacement for failed servers. The software also includes a GUI for dataset integration and for composing workflows as well as HCatalog that enables connectivity with RDBMS products.
MapR has chosen to solve the data volume issue via its replacement of Hadoop’s HDFS with a derivative of the UNIX-based file system, NFS. This helps to do away with the NameNode function altogether as a single point of failure. By swapping out HDFS, the company’s proprietary components claim to offer improved HA as well as higher scalability and performance.
Commercial Hadoop providers play a critical role in enabling wider platform adoption, and their support services allow the technology to be accessed by those organizations that might otherwise have difficulties around implementation. While these companies represent key players in the ongoing commercialization of Hadoop, they also offer an important function through their training and certification courses-a value that cannot be understated.
SAS, IBM, Cloudera, AWS, Hadoop Illuminated
I used referenced web site for this post : gladwinanalytics.com
We have a referrer link, for original article of this post, if you want you can follow gladwinanalytics.com
Special thanks for Anandh Shanmugaraj and gladwinanalytics team, for this article, and you can see the post at below link;