Hadoop is an Apache open source project that provides a parallel storage and
processing framework. Its primary purpose is to run MapReduce batch
programs in parallel on tens to thousands of server nodes.
MapReduce refers to the application modules written by a programmer that run in
two phases: first mapping the data (extracting key-value pairs from each
input record), then reducing it (combining the values collected for each
key into a result).
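To make the two phases concrete, here is a minimal single-process sketch of a word count in Python. It only imitates what the framework does across many nodes; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not part of any Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big clusters", "data everywhere"]
    print(word_count(sample))
```

On a real cluster, the map calls run in parallel on the nodes that hold each block of input, and the grouping step moves data between nodes before the reduce calls run.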
Hadoop scales out to large clusters of servers and storage using the Hadoop
Distributed File System (HDFS) to manage huge data sets and spread
them across the servers.
One of Hadoop’s greatest benefits is the ability of programmers to
write application modules in almost any language and run them in
parallel on the same cluster that stores the data. This is a profound
change! With Hadoop, any programmer can harness the power and
capacity of thousands of CPUs and hard drives simultaneously.
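The "almost any language" part is usually achieved with Hadoop Streaming, which runs any executable as a mapper or reducer and exchanges tab-separated key/value lines over standard input and output. The sketch below assumes that mechanism (the post does not name it) and shows what such a script could look like in Python. The `map`/`reduce` command-line argument is a convention invented for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__":
    # Hypothetical convention: pass "map" or "reduce" as the only argument.
    if len(sys.argv) == 2 and sys.argv[1] in ("map", "reduce"):
        step = mapper if sys.argv[1] == "map" else reducer
        for out in step(sys.stdin):
            print(out)
```

The reducer relies on its input being sorted by key, which is what the framework provides between the two phases. Locally, the same flow can be imitated with `reducer(sorted(mapper(lines)))`.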
Other advantages of Hadoop include affordability (it runs on commodity
hardware), open source licensing (it is free to download from Apache or
from vendors such as Cloudera), and agility (store any data, run any
analysis).
“Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.
Examples include Web logs; RFID; sensor networks; social networks; Internet
text and documents; Internet search indexing; call detail records;
astronomy, atmospheric science, biological, genomics, biochemical,
medical records; scientific research; military surveillance;
photography archives; video archives; and large scale eCommerce.”
Hadoop as the Repository and Refinery
As volumes of big data arrive from sources such as sensors, machines,
social media, and clickstream interactions, the first step is to
capture all the data reliably and cost-effectively. When data volumes
are huge, the traditional single-server strategy does not work for
long. Pouring the data into the Hadoop Distributed File System (HDFS)
gives architects much-needed flexibility. Not only can they capture
tens of terabytes in a day, they can also scale the Hadoop cluster
up or down to meet surges and lulls in data ingestion. This is
accomplished at a low cost per gigabyte thanks to open source
economics and commodity hardware.
Since the data is stored on local disks instead of SANs, and processing runs
on the nodes that hold the data, Hadoop data access is often much faster,
and it does not clog the network with terabytes of data movement.
Once the raw data is captured, Hadoop is used to refine it. Hadoop can act
as a parallel “ETL engine on steroids,” leveraging handwritten or
commercial data transformation technologies.
MapReduce in Hadoop – It’s too OLD now
The Mahout community decided to move its codebase onto modern data
processing systems that offer a richer programming model and more
efficient execution than Hadoop MapReduce. Mahout will therefore
reject new MapReduce algorithm implementations from now on. The
widely used MapReduce algorithms will, however, stay in the codebase
and continue to be maintained.
The Mahout community is building future implementations on top of a DSL for
linear algebraic operations, which has been developed over the last few
months. Programs written in this DSL are automatically optimized and
executed in parallel on Apache Spark.
Furthermore, an experimental contribution is under way that aims to
integrate the H2O platform into Mahout.
1 February 2014 - Apache Mahout 0.9 released
New and improved Mahout Website based on Apache CMS
Early implementation of a Multi-Layer Perceptron (MLP) classifier
Scala DSL Bindings for Mahout Math Linear Algebra
Support for easy functional Matrix views and derivatives
JSON output format for ClusterDumper
Enabled randomised testing for all Mahout Modules using Carrot RandomizedRunner
Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering
Upgrade to Lucene 4.6.1