Manmohan Mishra: 2014

Sunday, June 1, 2014

Friday, May 30, 2014

Data governance is the specification of decision rights and an accountability framework to encourage desirable behavior in the valuation, creation, storage, use, archival and deletion of data and information. It includes the processes, roles, standards and metrics that ensure the effective and efficient use of data and information in enabling an organization to achieve its goals.

Data governance is one of the least visible aspects of Web analytics, it’s easily one of the most impactful. Companies that get it right generate “Ferrari” benefits – significant savings of time and money, plus greatly improved confidence in their data. In other words, data governance is actually a “must have.” Without it, an organization’s entire data strategy and online marketing approach may rest on a shaky foundation.

In particular, data governance is not:

Tactical management
Technology & IT Department alone.

Why Data Governance is important?

The word “governance” determines quality because that is the fundamental aim of data governance. “Why should we look at data quality?” For Example, If you are doing any online purchase and due to lack of data quality the purchased goods is dispatched to wrong address or you will receive goods that don’t match with the proper description and so on. However, there are other more subtle effects of poor data quality – i.e. missing the opportunity to upsell to a customer because you can’t accurately identify the product categories that they purchased, not being able to negotiate purchasing discounts because the supplier is duplicated so many times that we can’t say what our total spend is, losing web sales because your inaccurate sizing data makes you look bad on comparison sites.

It’s likely that there are already some people in the organization who are checking data quality as part of their regular job. For example the accountants are probably ensuring that postings are made to the correct ledger codes, your accounts payable department is ensuring that invoices are sent and matching payments are received. Much of your operational data is already part of an active management process but to a large extent their interest is in quantities and values. The areas that get less quality checking are the reference data (or master data) that drive many of your business processes. Data Governance aims to put in place formal management responsibilities for the quality of this data.

One of the changes in attitude that is driven by data governance is to move away from a reactive approach to quality into a more proactive approach. Often poor data quality is only found when a business process fails – when a delivery can’t be made or when your IT system stops working – and there are few instances where that is the best way to find problems. It is also common when disasters occur through poor data quality that nobody can be found to take responsibility! Data Governance ensures that somebody is clearly responsible – not just for fixing the disasters but also for reducing the likelihood of one occurring.

Below are the few points which briefly determine why Data Governance is required:

A. You will get consistent, reliable and repeatable data. A central role of data governance is to ensure that metrics are defined consistently across the organization. So when managers or analysts talk about “conversion rates” or “unique visitors,” everyone else knows precisely what they’re talking about. Without clearly documented standards around metrics, decisions may be made around false assumptions. Obviously, communication and reporting suffer in such situations, an especially important consideration in environments with multiple analytics tools.

B. Analysis and reporting issues are most often data governance problems. Many organizations are quick to blame their tools or technology when there is confusion about the meaning of Web analytics data or lack of clarity in reports. Typically, the tools and reports have not been configured to clarify what various metrics mean, how they align to specific goals, or where they may vary from data provided by different systems. Companies end up ripping and replacing perfectly good systems before doing the necessary governance work to ensure they work properly.

C. To Enhance the efficiency of process. If your team is spending a lot of time checking and rechecking your reports, it can be quite inefficient. When a report generated conflicts with another report, it may bring some doubt to the validity of all reports. There is likely a data quality issue is behind it. The problem manifests itself as a huge time-suck on monthly and quarterly closes. Data champions must point to this inefficiency in order to put in place a solid data management strategy.
D. It saves money. Having a firm grip on how you define “page views” or other core metrics can help you when you’re negotiating to buy or sell advertising. You won’t be low-balled because an external organization (like an ad network) claims your traffic is too low or theirs is too high. And you’ll be able to probe the metrics used by others to price their own inventory and determine if their information is credible and accurate. Even if you’re not in the advertising market, data governance can save you money in other ways. For example, it may help you avoid the premature and costly “flipping” of systems.

How to achieve Data Governance?

Data governance is not just a collection of ad-hoc data quality projects, but the development and integration of a set of rules - policies, guidelines, and standards - for managing the corporation's data. It is implemented by a data governance management team of information technology and business associates who are unified by a common goal to ensure that:

data is what it is supposed to be (Data Quality)
data is in the correct context (Data Integrity)
data and its associated metadata are accessible (Data Usability)

There are several tools which are already there in market to achieve Data Governance such as “Information Analyzer”. When we say “How to achieve Data Governance” it never means that we are going to use any tools for that or any overall separate process we need to follow to achieve Data Governance, We just want to ensure that the Data which is going to get used/ observed by our processing system is in proper format and deliver the exact quality for which it meant as well as we are making sure that our data is ready for audit purpose and its satisfy all the respective protocols. Achieving data governance is very important when we are dealing with critical data of banking Industries or any financial sector.

Monday, May 26, 2014

Cross Platform ETL Migration through AMM

Hello Friends, in today’s era where technologies are upgrading in every single minute, everyone is interested in upgrading for one either from one version to another or one ETL tool to another.

Let’s have a chat on Datawarehousing Domain, The most used tools for transformations are Datastage, Informatica, Ab-Initio and Pentaho.

It’s quite easy when we talk about migration from one version of ETL tool to another upgraded version because this new version is capable to handle all the issues which were there in previous version and loaded with new feature as well as all fixes. This same tool allows the same format and extensions which has been used in the earlier version. So this would generally an EXPORT and IMPORT of code from older version to newer version with a small effort of sanity check.

The actual pain of migration comes into the picture we have requirement of migrating from one tool to another because here we have to deal with different syntaxes, different components as well as overall differences in Design and Development Environments.

What is the panic to migrate from one ETL tool to another??

When we try to migrate code from one ETL tool to another we have deal with so many of things, few of them are as below:

Understanding and Expertise in both the Environments.
Re-creation of same logic in target tool (Efforts required is more the fresh development)
Differences in the functionalities of components in both the tools.
Most of the Components are not re-usable.
Efforts to Analyze, Develop and Test

There are many more challenges which we have to face apart from the above.

So what’s the efficient SOLUTION??

Now when we are aware of all the possible issues which we could face then we can understand the significance of the solution which I want to discuss here. Here’s I want to introduce one tool named as “AMM” which is capable enough to do the cross migration from “ETL Datastage to ETL Informatica” for us. This is an automated tool which needs a very small intervention from human.

What are the other ways to do it??

There are several ways to do the cross migration in-fact there are many methods such as PCaaS, Cloud Deployment where either you don’t even need to migrate the code in new environment or you just need to deploy the code once in the cloud and use its services, Cloud computing a costly operation for the industry. Let’s talk about the PCaaS (PC as Service), it is also a costly operation but better then cloud but I would suggest AMM.

Let’s not go much deep on other ways to do this, I’ll share another blog which will have more descriptive information regarding this.

Why AMM??

Now that’s the genuine question, it is FREE Tool by Informatica Corp.

Well its first quality is enough for us to give a try to this tool but let’s go in some more details about the tool from the below paragraphs. This tool is capable to do Data Lineage, Impact Analysis and Version control. It is capable to create jobs just by taking the mapping specification as source. We can directly feed the mapping specifications to this tool and get the job design.

There is one IBM tool which is capable to perform similar operation named “Fast Track” but this tools have its own additional cost as well as this tool is not capable to handle complex operations such as multiple joins and other complex transformation. It’s good for small transformations as well as look-ups.

AMM is capable to handle complex operation and joins. We can feed thousands of mapping documents as input to this tool and this tool can handle that as well.

This tool can be used to perform the reverse engineering,

Is Reverse Engineering, Possible?

As I told we can directly feed mapping document as source to this tool, same way if we feed already designed jobs as input to this tool then AMM is capable enough to generate the mapping specifications on the basis of that job. We can say this is an accelerator of for development of our jobs.

What makes AMM to perform Cross Platform Operations??

A robust XML parser, this tool is capable to generate jobs through XML feed, we can take export from any ETL tool in XML format and feed that XML to AMM and get the job generated. AMM is robust enough to generate the jobs from XML’s. So there is no any platform dependent code export required for it. This makes this tool to work efficiently for codes from multiple environments.

Deep dive in AMM

AnalytiX Mapping Manager (AMM) is a robust solution for Solving all your data mapping needs. It is a part of the “Agile Tools Movement” for the data integration industry, enabling customers to accelerate delivery by automating manual “pre- ETL” Source to Target Mapping (STM) processes of data integration projects. AMM enables organizations to “break free” of managing hundreds to thousands of MS Excel based data mappings across the organization.

AMM enables organizations to quickly consolidate metadata using its built-in scanners, manage data dictionaries and automates the STM process by allowing mappings to be created using “dragging -and-dropping” techniques and storing mappings in central data mapping repository.

AnalytiX Mapping Manager (AMM) is a complement to existing data integration products and allows the data integration and management professionals to manage enterprise data dictionaries and build source to target mapping specifications using drag and drop techniques to accelerate the mapping process.

The mapping specifications are then versionable, trackable, auditable, and repeatable throughout lifecycle of data integration and Master Data Management (MDM) projects. AMM auto-generates ETL Jobs for Informatica.

It enables forward / reverse Engineering as well.
Scans and consolidates metadata and manages enterprise data dictionaries (metadata repository)
Automates source to target mapping (pre-etl) process using drag n drop. You can then version them through the change process and manage them in our data mapping repository.
Once the mapping are ready for development they can then be printed or exported as coding requirements or exported in XML to auto-generate ETL jobs for leading ETL tool providers like Informatica, datastage and SQL Server SSIS Pkgs.
The code generation is available out of the box, however if you want “multiple code generation templates” to be created we can generate these templates based on your own best practices and standards for “different job types.
Advanced analysis features include: Impact Analysis. Data Lineage Analyzer, Gap Analysis and Mapping Status Reporting.

Feature described by Informatica Corporation (AMM Developers)

Customizable Export of Mapping Specifications to Excel.
Multiple Exports of Legacy Excel Mappings.
Project Specific Transformations (Business rules).
Mass update of source/target data across Mapping Specification.
Multi Browser Compatibility.
Full Enterprise Metadata Repository with built-in data preview capabilities.
Promotes Excellence in STM Mapping. Standards (Metadata & Standards Based).
Accelerate Delivery & Improve Quality.
Governance. Best Practices. Standards. Versioning. Traceability. Audit-ability. Impact Analysis. Data Lineage.
Collaboration & Programmatic Excellence.
Cross Team Collaboration & Planning.
Canned Reporting & Open-Source DBMS for mining the repositories.
New improved web user interface with performance improvements.
Extended the Integration with ETL Tools (Informatica, DataStage and SQL Server SSIS).
Certified on internal repository Oracle DBMS and new versions of SQL Server.
Certified to run on Linux based environments (Ubuntu and Red Hat).
Enhanced change log tracking other features.
ETL Tool Conversion Capabilities. Ability to quickly convert legacy ETL Jobs from other vendors to Informatica.

There are many more feature of AMM which are described properly by Informatica Corporation, please follow the reference links as below.

Sample GUI of the Tool

References and Free Download Links:

https://community.informatica.com/mpresources/docs/AnalytiX%20Mapping%20Manager%20release_notes_4.2.pdf

https://community.informatica.com/solutions/analytix_mapping_manager

Thursday, May 22, 2014

What is JSON ?

Recent update in Information Server 9.2 containing JSON parsing/composing (added into Assembly Editor for hierarchical transformation of new JSON docs). JSON (JavaScript Object Notation) prominent is an alternative to XML in the world of data interchange. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language. JSON (JavaScript Object Notation) is a lightweight data-interchange format.

More about JSON

JSON is the preferred format in “NoSQL” databases. These databases are intended to accommodate massive scalability, designed to deal with data that often does not seamlessly conform to a columnar/relational model, and to be web-oriented at their very core. The most well-known examples of databases of this sort are MongoDB, CouchDB, and Riak. These three are JSON-based, horizontally scalable, and deeply web-driven databases. Also Amazon’s DynamoDB is REST/JSON architecture. Neo4J data base has a REST/JSON API, with no corresponding XML support.

There are a few databases that are XML-based such as MarkLogic, but JSON is more in demand because of its versatile features. There are rapid adoptions of JSON-based storage models.

JSON is specifically for serializing data structures. XML can also represent data structures, JSON is much better at describing data structures. JSON on the other hand is mismatched to describe semantic meaning behind text in documents, which is where XML excels. XML originated as a way to give semantic definition to text in documents.

JSON is built on two structures:

A collection of name/value pairs. In various languages, this is realized as an object, record, struct, dictionary, hash table, keyed list, or associative array.
An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

Selection between JSON and XML

The requirements of your connection client and the type of data you need to serialize are the main aspect that drives the decision between using XML and JSON. The web service’s APIs have become very important to the web. On top of that, REST is replacing SOAP as a data transfer protocol. XML is not compatible with REST, so if SOAP continues its decline, then XML usage will shrink along with it. REST and SOAP both are different protocols for “How to access Web-Services”

(Differences in REST and SOAP: http://blog.smartbear.com/apis/understanding-soap-and-rest-basics/ )

JSON with Web-Services

As we all know JavaScript is robust and most accepted technology for almost all web services. JSON is better adapted than XML to devices with limited capabilities. As JSON is derived from javascript, it is ideal for integration into web as per our current “Internet Architecture”. Of course there’s an XML parser for node, but it’s largely geared toward dealing with legacy XML-based endpoints. There are few limitations with JSON such as It’s essentially restricted to null, Booleans, numeric, strings, arrays, and dictionaries. It doesn’t even have a Date data type. JSON is thus not only generally less verbose than XML it is more parsimonious in its use of data types. Restricting itself to primitive data types makes JSON interoperable with pretty much any programming language that exists.

A Glance to Hadoop :)

Hadoop is an Apache open source project that provides a parallel storage and processing framework. Its primary purpose is to run MapReduce batch programs in parallel on tens to thousands of server nodes.

MapReduce refers to the application modules written by a programmer that run in two phases: first mapping the data (extract) then reducing it (transform).

Hadoop scales out to large clusters of servers and storage using the Hadoop Distributed File System (HDFS) to manage huge data sets and spread them across the servers.

One of Hadoop’s greatest benefits is the ability of programmers to write application modules in almost any language and run them in parallel on the same cluster that stores the data. This is a profound change! With Hadoop, any programmer can harness the power and capacity of thousands of CPUs and hard drives simultaneously.

More advantages of Hadoop include affordability (it runs on commodity hardware), open source (free download from Cloudera), and agility (store any data, run any analysis).

“Big Data is a term applied to data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big Data sizes are a constantly moving target currently ranging from a few dozen terabytes to many petabytes in a single data set.

Examples include Web logs; RFID; sensor networks; social networks; Internet text and documents; Internet search indexing; call detail records; astronomy, atmospheric science, biological, genomics, biochemical, medical records; scientific research; military surveillance; photography archives; video archives; and large scale eCommerce.”

Hadoop as the Repository and Refinery

As volumes of big data arrive from sources such as sensors, machines, social media, and click stream interactions, the first step is to capture all the data reliably and cost effectively. When data volumes are huge, the traditional single server strategy does not work for long. Pouring the data into the Hadoop Distributed File System (HDFS) gives architects much needed flexibility. Not only can they capture 10s of terabytes in a day, they can adjust the Hadoop configuration up or down to meet surges and lulls in data ingestion. This is accomplished at the lowest possible cost per gigabyte due to open source economics and leveraging commodity hardware.

Since the data is stored on local storage instead of SANs, Hadoop data access is often much faster, and it does not clog the network with terabytes of data movement.

Once the raw data is captured, Hadoop is used to refine it. Hadoop can act as a parallel “ETL engine on steroids,” leveraging handwritten or commercial data transformation technologies.

MapReduce in Hadoop – It’s too OLD now

The Mahout community decided to move it’s codebase onto modern data processing systems that offer a richer programming model and more efficient execution than Hadoop MapReduce. Mahout will therefore reject new MapReduce algorithm implementations from now on. They however keep their widely used MapReduce algorithms in codebase and maintain them.

Mahout community is building future implementations on top of a DSL for linear algebraic operations which has been developed over the last months. Programs written in this DSL are automatically optimized and executed in parallel on Apache Spark.

Furthermore, there is an experimental contribution undergoing which aims to integrate the h20 platform into Mahout.

1 February 2014 - Apache Mahout 0.9 released

New and improved Mahout Website based on Apache CMS

Early implementation of a Multi-Layer Perceptron (MLP) classifier

Scala DSL Bindings for Mahout Math Linear Algebra

Support for easy functional Matrix views and derivatives

JSON output format for ClusterDumper -

Enabled randomised testing for all Mahout Modules using Carrot RandomizedRunner

Online Algorithm for computing accurate Quantiles using 1-dimensional Clustering

Upgrade to Lucene 4.6.1

Wednesday, April 9, 2014

Datastage – Change Capture Stage

What is Change Capture Stage?

The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of the differences.

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set's table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.

The compare is based on a set a set of key columns, rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns in the rows to see if one is an edited copy of the other.

The stage assumes that the incoming data is key-partitioned and sorted in ascending order. The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built-in sorting and partitioning abilities of the Change Capture stage.

You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set.

To use a Change Capture stage:

· In the Stage page Properties Tab:

o Specify the key column. You can repeat this property to specify a composite key. Before and after rows are considered to be the same if they have the same value in the key column or columns.

o Optionally specify one or more Value columns. This enables you to determine if an after row is an edited version of a before row.

§ (You can also set the Change Mode property to have DataStage treat all columns not defined as keys treated as values, or all columns not defined as values treated as keys.)

o Specify whether the stage will output the changed row or drop it. You can specify this individually for each type of change (copy, delete, edit, or insert).

· In the Stage page Link Ordering Tab:

o specify which of the two links carries the before data set and which carries the after data set.

o If the two incoming data sets aren't already key partitioned on the key columns and sorted, set DataStage to do this on the Input Page Partitioning Tab.

· In the Output Page Mapping Tab:

o specify how the change data columns are mapped onto the output link columns.

Change Capture stage: NLS Locale tab

This appears if you have NLS enabled on your system. It lets you view the current default collate convention, and select a different one for this stage if required. You can also use a job parameter to specify the locale, or browse for a file that defines custom collate rules. The collate convention defines the order in which characters are collated. The Change Capture stage uses this when it is determining the sort order for key columns. Select a locale from the list, or click the arrow button next to the list to use a job parameter or browse for a collate file

Change Capture stage: Input page

The Input page allows you to specify details about the incoming data sets. The Change Capture expects two incoming data sets: a before data set and an after data set.

The General tab allows you to specify an optional description of the input link. The Partitioning tab allows you to specify how incoming data is partitioned before being compared. The Columns tab specifies the column definitions of incoming data. The Advanced tab allows you to change the default buffering settings for the input link.

Change Capture stage: Output page

The Output page allows you to specify details about data output from the Change Capture stage. The Change Capture stage can have only one output link.

The General tab allows you to specify an optional description of the output link. The Columns tab specifies the column definitions of the data. The Mapping tab allows you to specify the relationship between the columns being input to the Change Capture stage and the Output columns. The Advanced tab allows you to change the default buffering settings for the output link.

Change Capture stage: Mapping tab

For the Change Capture stage the Mapping tab allows you to specify how the output columns are derived, that is, what input columns map onto them and which column carries the change code data.

The left pane shows the columns from the before/after data sets plus the change code column. These are read only and cannot be modified on this tab.

The right pane shows the output columns for each link. This has a Derivations field where you can specify how the column is derived. You can fill it in by dragging input columns over, or by using the Auto-match facility. By default the data set columns are mapped automatically. You need to ensure that there is an output column to carry the change code and that this is mapped to the Change_code column.

Examples:

Change_before_data
e_id,e_name,e_add

11,kim,syd
22,jim,mel
33,pim,perth
44,lim,canb
55,pom,adeliade
66,shila,bris

=============================
Change_after_datae

id,e_name,e_add

11,kim,syd
22,jim,canb
33,pim,syd
44,lim,canb
55,pom,perth

In Change Capture Stage Select Key column as e_idand value column as e_add.Because e_add is the column we are going to update.

Configuring CDC stage to get only new records

In the above CDC configuration the main option we use is Change Mode.

In this scenario we are using All Keys,Explicit Values.

All Keys, Explicit Values:In this mode we don’t need to specify key columns.

We need only newly inserted or edited record and hence we set Drop output for insert = FALSE,

Drop Output for Edit = FALSE and the remaining all three are set to FALSE because we don’t want copy, edit or delete records.

Note:Always make sure the link ordering is corrected otherwise you will get wrong results

We need to configure Code Column Name in the options tab:

Copy code

Allows you to specify an alternative value for the code that indicates the after record is a copy of thebefore record. By default this code is 0.

Deleted code

Allows you to specify an alternative value for the code that indicates that a record in the before set hasbeen deleted from the after set. By default this code is 2.

Edit code

Allows you to specify an alternative value for the code that indicates the after record is an edited versionof the before record. By default this code is 3.

Insert Code

Allows you to specify an alternative value for the code that indicates a new record has been inserted inthe after set that did not exist in the before set. By default this code is 1.

Usage of Change Capture stage:

We have used the concept of Change Capture in our Project for the purpose of Data validation.

We are comparing the data from both the Environments PROD and QA. These validation jobs provide us the differences in data from both the environments.

Manmohan Mishra

Pages