Thursday, May 22, 2014

A Glance at Hadoop :)

Hadoop is an Apache open source project that provides a parallel storage and processing framework. Its primary purpose is to run MapReduce batch programs in parallel on tens to thousands of server nodes.
MapReduce refers to the application modules written by a programmer that run in two phases: first mapping the data (extract) then reducing it (transform).
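To make the two phases concrete, here is a minimal sketch in plain Python that counts words. It runs on a single machine and only imitates what Hadoop does across many nodes; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, imitating what the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Calling `word_count(["big data", "Big cluster"])` groups both spellings of "big" under one key and sums their counts. On a real cluster the map calls run in parallel on the nodes that hold the data, and the grouping step moves pairs between nodes.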
Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers.
One of Hadoop’s greatest benefits is the ability of programmers to write application modules in almost any language and run them in parallel on the same cluster that stores the data. This is a profound change! With Hadoop, any programmer can harness the power and capacity of thousands of CPUs and hard drives simultaneously.
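One way the "almost any language" claim works in practice is Hadoop Streaming, where the mapper and reducer are ordinary programs that read lines of text and write tab-separated key/value lines. The sketch below follows that line-based convention in Python, under some loud assumptions: the input format (`sensor_id,reading` per line) is hypothetical, and the functions take iterables of lines so they can be tried locally; in a real streaming job each would be wrapped in a script that reads standard input and prints its output.

```python
from itertools import groupby

def mapper(lines):
    """Turn each 'sensor_id,reading' line (hypothetical format) into 'sensor_id<TAB>reading'."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        sensor_id, reading = line.split(",", 1)
        yield f"{sensor_id}\t{reading}"

def reducer(sorted_lines):
    """Emit the maximum reading per sensor; expects input already sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for sensor_id, group in groupby(parsed, key=lambda pair: pair[0]):
        highest = max(float(reading) for _, reading in group)
        yield f"{sensor_id}\t{highest}"

def run_locally(lines):
    """Imitate the job on one machine: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))
```

The `sorted` call in `run_locally` stands in for the sort that the framework performs between the two phases, which is why the reducer can rely on all lines for one key arriving together.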
More advantages of Hadoop include affordability (it runs on commodity hardware), open source licensing (it is a free download from Apache, or from distributors such as Cloudera), and agility (store any data, run any analysis).

Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.

Examples include Web logs; RFID; sensor networks; social networks; Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, biological, genomics, biochemical, medical records; scientific research; military surveillance; photography archives; video archives; and large scale eCommerce.


Hadoop as the Repository and Refinery

As volumes of big data arrive from sources such as sensors, machines, social media, and click stream interactions, the first step is to capture all the data reliably and cost-effectively. When data volumes are huge, the traditional single-server strategy does not work for long. Pouring the data into the Hadoop Distributed File System (HDFS) gives architects much-needed flexibility. Not only can they capture tens of terabytes in a day, they can also scale the Hadoop cluster up or down to meet surges and lulls in data ingestion. This is accomplished at a very low cost per gigabyte thanks to open source economics and commodity hardware.
Since the data is stored on local storage instead of SANs, Hadoop data access is often much faster, and it does not clog the network with terabytes of data movement.
Once the raw data is captured, Hadoop is used to refine it. Hadoop can act as a parallel “ETL engine on steroids,” leveraging handwritten or commercial data transformation technologies.

MapReduce in Hadoop – It’s too OLD now
The Mahout community decided to move its codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. The widely used MapReduce algorithms will, however, stay in the codebase and continue to be maintained.
The Mahout community is building future implementations on top of a DSL for linear algebraic operations, which has been developed over the last few months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.
Furthermore, an experimental contribution is under way that aims to integrate the H2O platform into Mahout.

1 February 2014 - Apache Mahout 0.9 released

New and improved Mahout Website based on Apache CMS
Early implementation of a Multi-Layer Perceptron (MLP) classifier
Scala DSL Bindings for Mahout Math Linear Algebra
Support for easy functional Matrix views and derivatives
JSON output format for ClusterDumper
Enabled randomised testing for all Mahout Modules using Carrot RandomizedRunner
Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering
Upgrade to Lucene 4.6.1
