Sunday, June 1, 2014
Friday, May 30, 2014
What is Data Governance ?
Data
governance is the specification of decision rights and an
accountability framework to encourage desirable behavior in the
valuation, creation, storage, use, archival and deletion of data and
information. It includes the processes, roles, standards and metrics
that ensure the effective and efficient use of data and information
in enabling an organization to achieve its goals.
Data
governance is one of the least visible aspects of Web analytics, it’s
easily one of the most impactful. Companies that get it right
generate “Ferrari” benefits – significant savings of time and
money, plus greatly improved confidence in their data. In other
words, data governance is actually a “must have.” Without it, an
organization’s entire data strategy and online marketing approach
may rest on a shaky foundation.
In
particular, data governance is not:
- Tactical management
- Technology & IT Department alone.
Why
Data Governance is important?
The word “governance”
determines quality because that is the fundamental aim of data
governance. “Why should we look at data quality?” For Example, If
you are doing any online purchase and due to lack of data quality the
purchased goods is dispatched to wrong address or you will receive
goods that don’t match with the proper description and so on.
However, there are other more subtle effects of poor data quality –
i.e. missing the opportunity to upsell to a customer because you
can’t accurately identify the product categories that they
purchased, not being able to negotiate purchasing discounts because
the supplier is duplicated so many times that we can’t say what our
total spend is, losing web sales because your inaccurate sizing data
makes you look bad on comparison sites.
It’s likely that there are
already some people in the organization who are checking data quality
as part of their regular job. For example the accountants are
probably ensuring that postings are made to the correct ledger codes,
your accounts payable department is ensuring that invoices are sent
and matching payments are received. Much of your operational data is
already part of an active management process but to a large extent
their interest is in quantities and values. The areas that get less
quality checking are the reference data (or master data) that drive
many of your business processes. Data Governance aims to put in place
formal management responsibilities for the quality of this data.
One of the changes in attitude
that is driven by data governance is to move away from a reactive
approach to quality into a more proactive approach. Often poor data
quality is only found when a business process fails – when a
delivery can’t be made or when your IT system stops working – and
there are few instances where that is the best way to find problems.
It is also common when disasters occur through poor data quality that
nobody can be found to take responsibility! Data Governance ensures
that somebody is clearly responsible – not just for fixing the
disasters but also for reducing the likelihood of one occurring.
Below are the few points which
briefly determine why Data Governance is required:
A. You
will get consistent, reliable and repeatable data.
A central role of data governance is to ensure that metrics are
defined consistently across the organization. So when managers or
analysts talk about “conversion rates” or “unique visitors,”
everyone else knows precisely what they’re talking about. Without
clearly documented standards around metrics, decisions may be made
around false assumptions. Obviously, communication and reporting
suffer in such situations, an especially important consideration in
environments with multiple analytics tools.
B. Analysis
and reporting issues are most often data governance problems.
Many organizations are quick to blame their tools or technology when
there is confusion about the meaning of Web analytics data or lack of
clarity in reports. Typically, the tools and reports have not been
configured to clarify what various metrics mean, how they align to
specific goals, or where they may vary from data provided by
different systems. Companies end up ripping and replacing perfectly
good systems before doing the necessary governance work to ensure
they work properly.
C.
To Enhance the
efficiency of process. If
your team is spending a lot of time checking and rechecking your
reports, it can be quite inefficient. When a report generated
conflicts with another report, it may bring some doubt to the
validity of all reports. There is likely a data quality issue is
behind it. The problem manifests itself as a huge time-suck on
monthly and quarterly closes. Data champions must point to this
inefficiency in order to put in place a solid data management
strategy.
D. It saves money. Having a firm grip on how you define “page views” or other core metrics can help you when you’re negotiating to buy or sell advertising. You won’t be low-balled because an external organization (like an ad network) claims your traffic is too low or theirs is too high. And you’ll be able to probe the metrics used by others to price their own inventory and determine if their information is credible and accurate. Even if you’re not in the advertising market, data governance can save you money in other ways. For example, it may help you avoid the premature and costly “flipping” of systems.
D. It saves money. Having a firm grip on how you define “page views” or other core metrics can help you when you’re negotiating to buy or sell advertising. You won’t be low-balled because an external organization (like an ad network) claims your traffic is too low or theirs is too high. And you’ll be able to probe the metrics used by others to price their own inventory and determine if their information is credible and accurate. Even if you’re not in the advertising market, data governance can save you money in other ways. For example, it may help you avoid the premature and costly “flipping” of systems.
How
to achieve Data Governance?
Data governance is not just a
collection of ad-hoc data quality projects, but the development and
integration of a set of rules - policies, guidelines, and standards -
for managing the corporation's data. It is implemented by a data
governance management team of information technology and business
associates who are unified by a common goal to ensure that:
- data is what it is supposed to be (Data Quality)
- data is in the correct context (Data Integrity)
- data and its associated metadata are accessible (Data Usability)
There
are several tools which are already there in market to achieve Data
Governance such as “Information Analyzer”. When we say “How to
achieve Data Governance” it never means that we are going to use
any tools for that or any overall separate process we need to follow
to achieve Data Governance, We just want to ensure that the Data
which is going to get used/ observed by our processing system is in
proper format and deliver the exact quality for which it meant as
well as we are making sure that our data is ready for audit purpose
and its satisfy all the respective protocols. Achieving data
governance is very important when we are dealing with critical data
of banking Industries or any financial sector.
Monday, May 26, 2014
Cross Platform ETL Migration through AMM
Hello
Friends, in today’s era where technologies are upgrading in every
single minute, everyone is interested in upgrading for one either
from one version to another or one ETL tool to another.
Let’s
have a chat on Datawarehousing Domain, The most used tools for
transformations are Datastage, Informatica, Ab-Initio and Pentaho.
It’s
quite easy when we talk about migration from one version of ETL tool
to another upgraded version because this new version is capable to
handle all the issues which were there in previous version and loaded
with new feature as well as all fixes. This same tool allows the same
format and extensions which has been used in the earlier version. So
this would generally an EXPORT and IMPORT of code from older version
to newer version with a small effort of sanity check.
The
actual pain of migration comes into the picture we have requirement
of migrating from one tool to another because here we have to deal
with different syntaxes, different components as well as overall
differences in Design and Development Environments.
What
is the panic to migrate from one ETL tool to another??
When
we try to migrate code from one ETL tool to another we have deal with
so many of things, few of them are as below:
- Understanding and Expertise in both the Environments.
- Re-creation of same logic in target tool (Efforts required is more the fresh development)
- Differences in the functionalities of components in both the tools.
- Most of the Components are not re-usable.
- Efforts to Analyze, Develop and Test
There
are many more challenges which we have to face apart from the above.
So what’s the efficient SOLUTION??
Now
when we are aware of all the possible issues which we could face then
we can understand the significance of the solution which I want to
discuss here. Here’s I want to introduce one tool named as “AMM”
which is capable enough to do the cross migration from “ETL
Datastage to ETL Informatica” for us. This is an automated tool
which needs a very small intervention from human.
What are the other ways to do it??
There
are several ways to do the cross migration in-fact there are many
methods such as PCaaS, Cloud Deployment where either you don’t even
need to migrate the code in new environment or you just need to
deploy the code once in the cloud and use its services, Cloud
computing a costly operation for the industry. Let’s talk about the
PCaaS (PC as Service), it is also a costly operation but better then
cloud but I would suggest AMM.
Let’s
not go much deep on other ways to do this, I’ll share another blog
which will have more descriptive information regarding this.
Why AMM??
Now
that’s the genuine question, it is FREE
Tool by Informatica Corp.
Well
its first quality is enough for us to give a try to this tool but
let’s go in some more details about the tool from the below
paragraphs. This tool is capable to do Data Lineage, Impact Analysis
and Version control. It is capable to create jobs just by taking the
mapping specification as source. We can directly feed the mapping
specifications to this tool and get the job design.
There
is one IBM tool which is capable to perform similar operation named
“Fast Track” but this tools have its own additional cost as well
as this tool is not capable to handle complex operations such as
multiple joins and other complex transformation. It’s good for
small transformations as well as look-ups.
AMM
is capable to handle complex operation and joins. We can feed
thousands of mapping documents as input to this tool and this tool
can handle that as well.
This
tool can be used to perform the reverse engineering,
Is
Reverse Engineering, Possible?
As
I told we can directly feed mapping document as source to this tool,
same way if we feed already designed jobs as input to this tool then
AMM is capable enough to generate the mapping specifications on the
basis of that job. We can say this is an accelerator of for
development of our jobs.
What
makes AMM to perform Cross Platform Operations??
A
robust XML parser, this tool is capable to generate jobs through XML
feed, we can take export from any ETL tool in XML format and feed
that XML to AMM and get the job generated. AMM is robust enough to
generate the jobs from XML’s. So there is no any platform dependent
code export required for it. This makes this tool to work efficiently
for codes from multiple environments.
Deep dive in AMM
AnalytiX
Mapping Manager (AMM)
is a robust solution for Solving
all your data mapping needs.
It is a part of the “Agile Tools Movement” for the data
integration industry, enabling customers to accelerate delivery by
automating manual “pre- ETL” Source to Target Mapping (STM)
processes of data integration projects. AMM enables organizations to
“break free” of managing hundreds to thousands of MS Excel based
data mappings across the organization.
AMM
enables organizations to quickly consolidate metadata using its
built-in scanners, manage data dictionaries and automates the STM
process by allowing mappings to be created using “dragging
-and-dropping” techniques and storing mappings in central data
mapping repository.
AnalytiX
Mapping Manager (AMM) is a complement to existing data integration
products and allows the data integration and management professionals
to manage enterprise data dictionaries and build source to target
mapping specifications using drag and drop techniques to accelerate
the mapping process.
The
mapping specifications are then versionable, trackable, auditable,
and repeatable throughout lifecycle of data integration and Master
Data Management (MDM) projects. AMM auto-generates ETL Jobs for
Informatica.
- It enables forward / reverse Engineering as well.
- Scans and consolidates metadata and manages enterprise data dictionaries (metadata repository)
- Automates source to target mapping (pre-etl) process using drag n drop. You can then version them through the change process and manage them in our data mapping repository.
- Once the mapping are ready for development they can then be printed or exported as coding requirements or exported in XML to auto-generate ETL jobs for leading ETL tool providers like Informatica, datastage and SQL Server SSIS Pkgs.
- The code generation is available out of the box, however if you want “multiple code generation templates” to be created we can generate these templates based on your own best practices and standards for “different job types.
- Advanced analysis features include: Impact Analysis. Data Lineage Analyzer, Gap Analysis and Mapping Status Reporting.
Feature
described by Informatica Corporation (AMM Developers)
- Customizable Export of Mapping Specifications to Excel.
- Multiple Exports of Legacy Excel Mappings.
- Project Specific Transformations (Business rules).
- Mass update of source/target data across Mapping Specification.
- Multi Browser Compatibility.
- Full Enterprise Metadata Repository with built-in data preview capabilities.
- Promotes Excellence in STM Mapping. Standards (Metadata & Standards Based).
- Accelerate Delivery & Improve Quality.
- Governance. Best Practices. Standards. Versioning. Traceability. Audit-ability. Impact Analysis. Data Lineage.
- Collaboration & Programmatic Excellence.
- Cross Team Collaboration & Planning.
- Canned Reporting & Open-Source DBMS for mining the repositories.
- New improved web user interface with performance improvements.
- Extended the Integration with ETL Tools (Informatica, DataStage and SQL Server SSIS).
- Certified on internal repository Oracle DBMS and new versions of SQL Server.
- Certified to run on Linux based environments (Ubuntu and Red Hat).
- Enhanced change log tracking other features.
- ETL Tool Conversion Capabilities. Ability to quickly convert legacy ETL Jobs from other vendors to Informatica.
There are many more feature of AMM which are described properly by Informatica Corporation, please follow the reference links as below.
Sample
GUI of the Tool
References
and Free Download Links:
Thursday, May 22, 2014
What is JSON ?
Recent
update in Information Server 9.2 containing JSON parsing/composing
(added into Assembly Editor for hierarchical transformation of new
JSON docs). JSON (JavaScript Object Notation) prominent is an
alternative to XML in the world of data interchange. JSON is a text
format that is completely language independent but uses conventions
that are familiar to programmers of the C-family of languages,
including C, C++, C#, Java, JavaScript, Perl, Python, and many
others. These properties make JSON an ideal data-interchange
language. JSON
(JavaScript Object Notation) is a lightweight data-interchange
format.
More
about JSON
JSON is the preferred format in “NoSQL” databases. These databases are intended to accommodate massive scalability, designed to deal with data that often does not seamlessly conform to a columnar/relational model, and to be web-oriented at their very core. The most well-known examples of databases of this sort are MongoDB, CouchDB, and Riak. These three are JSON-based, horizontally scalable, and deeply web-driven databases. Also Amazon’s DynamoDB is REST/JSON architecture. Neo4J data base has a REST/JSON API, with no corresponding XML support.
There
are a few databases that are XML-based such as MarkLogic, but JSON is
more in demand because of its versatile features. There are rapid
adoptions of JSON-based storage models.
JSON
is specifically for serializing data structures. XML can also
represent data structures, JSON is much better at describing data
structures. JSON on the other hand is mismatched to describe semantic
meaning behind text in documents, which is where XML excels. XML
originated as a way to give semantic definition to text in documents.
JSON
is built on two structures:
- A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
- An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.
Selection
between JSON and XML
The requirements of your connection client and the type of data you need to serialize are the main aspect that drives the decision between using XML and JSON. The web service’s APIs have become very important to the web. On top of that, REST is replacing SOAP as a data transfer protocol. XML is not compatible with REST, so if SOAP continues its decline, then XML usage will shrink along with it. REST and SOAP both are different protocols for “How to access Web-Services”
(Differences
in REST and SOAP:
http://blog.smartbear.com/apis/understanding-soap-and-rest-basics/
)
JSON with Web-Services
As
we all know JavaScript is robust and most accepted technology for
almost all web services. JSON is better adapted than XML to devices
with limited capabilities. As JSON is derived from javascript, it is
ideal for integration into web as per our current “Internet
Architecture”. Of course there’s an XML parser for node, but it’s
largely geared toward dealing with legacy XML-based endpoints. There
are few limitations with JSON such as It’s essentially restricted
to null, Booleans, numeric, strings, arrays, and dictionaries. It
doesn’t even have a Date data type. JSON is thus not only generally
less verbose than XML it is more parsimonious in its use of data
types. Restricting itself to primitive data types makes JSON
interoperable with pretty much any programming language that exists.
A Glance to Hadoop :)
Hadoop
is an Apache open source project that provides a parallel storage and
processing framework. Its primary purpose is to run MapReduce batch
programs in parallel on tens to thousands of server nodes.
MapReduce
refers to the application modules written by a programmer that run in
two phases: first mapping the data (extract) then reducing it
(transform).
Hadoop
scales out to large clusters of servers and storage using the Hadoop
Distributed File System (HDFS) to manage huge data sets and spread
them across the servers.
One
of Hadoop’s greatest benefits is the ability of programmers to
write application modules in almost any language and run them in
parallel on the same cluster that stores the data. This is a profound
change! With Hadoop, any programmer can harness the power and
capacity of thousands of CPUs and hard drives simultaneously.
More
advantages of Hadoop include affordability (it runs on commodity
hardware), open source (free download from Cloudera), and agility
(store any data, run any analysis).
“Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.
Examples
include Web logs; RFID; sensor networks; social networks; Internet
text and documents; Internet search indexing; call detail records;
astronomy, atmospheric science, biological, genomics, biochemical,
medical records; scientific research; military surveillance;
photography archives; video archives; and large scale eCommerce.”
Hadoop
as the Repository and Refinery
As
volumes of big data arrive from sources such as sensors, machines,
social media, and click stream interactions, the first step is to
capture all the data reliably and cost effectively. When data volumes
are huge, the traditional single server strategy does not work for
long. Pouring the data into the Hadoop Distributed File System (HDFS)
gives architects much needed flexibility. Not only can they capture
10s of terabytes in a day, they can adjust the Hadoop configuration
up or down to meet surges and lulls in data ingestion. This is
accomplished at the lowest possible cost per gigabyte due to open
source economics and leveraging commodity hardware.
Since
the data is stored on local storage instead of SANs, Hadoop data
access is often much faster, and it does not clog the network with
terabytes of data movement.
Once
the raw data is captured, Hadoop is used to refine it. Hadoop can act
as a parallel “ETL engine on steroids,” leveraging handwritten or
commercial data transformation technologies.
MapReduce in Hadoop – It’s too OLD now
The
Mahout community decided to move it’s codebase onto modern data
processing systems that offer a richer programming model and more
efficient execution than Hadoop MapReduce. Mahout will therefore
reject new MapReduce algorithm implementations from now on. They
however keep their widely used MapReduce algorithms in codebase and
maintain them.
Mahout
community is building future implementations on top of a DSL for
linear algebraic operations which has been developed over the last
months. Programs written in this DSL are automatically optimized and
executed in parallel on Apache Spark.
Furthermore,
there is an experimental contribution undergoing which aims to
integrate the h20 platform into Mahout.
1
February 2014 - Apache Mahout 0.9 released
New
and improved Mahout Website based on Apache CMS
Early
implementation of a Multi-Layer Perceptron (MLP) classifier
Scala
DSL Bindings for Mahout Math Linear Algebra
Support
for easy functional Matrix views and derivatives
JSON
output format for ClusterDumper -
Enabled
randomised testing for all Mahout Modules using Carrot
RandomizedRunner
Online
Algorithm for computing accurate Quantiles using 1-dimensional
Clustering
Upgrade
to Lucene 4.6.1
Wednesday, April 9, 2014
Datastage – Change Capture Stage
What is Change
Capture Stage?
The Change Capture Stage is a processing stage. The stage
compares two data sets and makes a record of the differences.
The Change Capture stage takes two input data sets, denoted before
and after, and outputs a single data set whose records represent the changes
made to the before data set to obtain the after data set. The stage produces a
change data set, whose table definition is transferred from the after data
set's table definition with the addition of one column: a change code with
values encoding the four actions: insert, delete, copy, and edit. The
preserve-partitioning flag is set on the change data set.
The compare is based on a set a set of key columns, rows
from the two data sets are assumed to be copies of one another if they have the
same values in these key columns. You can also optionally specify change
values. If two rows have identical key columns, you can compare the value
columns in the rows to see if one is an edited copy of the other.
The stage assumes that the incoming data is key-partitioned
and sorted in ascending order. The columns the data is hashed on should be the
key columns used for the data compare. You can achieve the sorting and
partitioning using the Sort stage or by using the built-in sorting and
partitioning abilities of the Change Capture stage.
You can use the companion Change Apply stage to combine the
changes from the Change Capture stage with the original before data set to
reproduce the after data set.
To use a Change
Capture stage:
·
In the
Stage page Properties Tab:
o
Specify the key column. You can repeat this
property to specify a composite key. Before and after rows are considered to be
the same if they have the same value in the key column or columns.
o
Optionally specify one or more Value columns.
This enables you to determine if an after row is an edited version of a before
row.
§
(You can also set the Change Mode property to
have DataStage treat all columns not defined as keys treated as values, or all
columns not defined as values treated as keys.)
o
Specify whether the stage will output the
changed row or drop it. You can specify this individually for each type of
change (copy, delete, edit, or insert).
·
In the
Stage page Link Ordering Tab:
o
specify which of the two links carries the
before data set and which carries the after data set.
o
If the two incoming data sets aren't already key
partitioned on the key columns and sorted, set DataStage to do this on the
Input Page Partitioning Tab.
·
In the
Output Page Mapping Tab:
o
specify how the change data columns are mapped
onto the output link columns.
Change Capture stage:
NLS Locale tab
This appears if you have NLS enabled on your system. It lets
you view the current default collate convention, and select a different one for
this stage if required. You can also use a job parameter to specify the locale,
or browse for a file that defines custom collate rules. The collate convention
defines the order in which characters are collated. The Change Capture stage
uses this when it is determining the sort order for key columns. Select a
locale from the list, or click the arrow button next to the list to use a job
parameter or browse for a collate file
Change Capture stage:
Input page
The Input page allows you to specify details about the
incoming data sets. The Change Capture expects two incoming data sets: a before
data set and an after data set.
The General tab allows you to specify an optional
description of the input link. The Partitioning tab allows you to specify how
incoming data is partitioned before being compared. The Columns tab specifies
the column definitions of incoming data. The Advanced tab allows you to change
the default buffering settings for the input link.
Change Capture stage:
Output page
The Output page allows you to specify details about data
output from the Change Capture stage. The Change Capture stage can have only
one output link.
The General tab allows you to specify an optional
description of the output link. The Columns tab specifies the column
definitions of the data. The Mapping tab allows you to specify the relationship
between the columns being input to the Change Capture stage and the Output
columns. The Advanced tab allows you to change the default buffering settings
for the output link.
Change Capture stage:
Mapping tab
For the Change Capture stage the Mapping tab allows you to
specify how the output columns are derived, that is, what input columns map
onto them and which column carries the change code data.
The left pane shows the columns from the before/after data
sets plus the change code column. These are read only and cannot be modified on
this tab.
The right pane shows the output columns for each link. This
has a Derivations field where you can specify how the column is derived. You
can fill it in by dragging input columns over, or by using the Auto-match
facility. By default the data set columns are mapped automatically. You need to
ensure that there is an output column to carry the change code and that this is
mapped to the Change_code column.
Examples:
Change_before_data
e_id,e_name,e_add
e_id,e_name,e_add
=============================
Change_after_datae
Change_after_datae
id,e_name,e_add
In Change Capture Stage Select Key column as e_idand value
column as e_add.Because e_add is the column we are going to update.
In the above CDC configuration the main option we use is
Change Mode.
In this scenario we are using All Keys,Explicit Values.
All Keys, Explicit
Values:In this mode we don’t need to specify key columns.
We need only newly inserted or edited record and hence we
set Drop output for insert = FALSE,
Drop Output for Edit = FALSE and the remaining all three are
set to FALSE because we don’t want copy, edit or delete records.
Copy code
Allows you to specify an alternative value for the code that
indicates the after record is a copy of thebefore record. By default this code
is 0.
Deleted code
Allows you to specify an alternative value for the code that
indicates that a record in the before set hasbeen deleted from the after set.
By default this code is 2.
Edit code
Allows you to specify an alternative value for the code that
indicates the after record is an edited versionof the before record. By default
this code is 3.
Insert Code
Allows you to specify an alternative value for the code that
indicates a new record has been inserted inthe after set that did not exist in
the before set. By default this code is 1.
Usage of
Change Capture stage:
We are comparing the data from both the Environments PROD
and QA. These validation jobs provide us the differences in data from both the
environments.
Subscribe to:
Posts (Atom)











