Saturday, February 20, 2016

Datastage Job Optimization - (Using Config File)


The configuration file is one of the most important components in Datastage.
It defines the following:

1. Degree of Data Partitioning to Scale processing
2. System resources like “Temp Storage”, “Scratch Disk”
3. Resources for Database and Buffer Storage

Tips for Job Optimization:

1. The number of nodes should equal the number of CPUs available
2. Use multiple configuration files
              a. For low-volume data, a single-node configuration
              b. For large-volume data, a multi-node configuration
3. Span processing across multiple machines by adding nodes from different machines
              a. Avoid re-partitioning data in such scenarios, because re-partitioning across the network is a costly operation
4. To maximize I/O, use multiple, different “resource disk” locations (where datasets are stored) on each node
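
Tip 2 can be sketched in a few lines of Python. This is only an illustration and not part of Datastage: the two file paths, the row-count threshold, and the function name choose_config_file are all hypothetical. The chosen path would then be handed to the job, typically through the APT_CONFIG_FILE environment variable.

```python
# Hypothetical paths: replace with the configuration files on your own server.
SINGLE_NODE_CONFIG = "/opt/configs/one_node.apt"
MULTI_NODE_CONFIG = "/opt/configs/four_node.apt"


def choose_config_file(row_count, threshold=1_000_000):
    """Return the config file to use for an expected input volume.

    The threshold is an assumed cut-off, not an IBM recommendation;
    tune it for your own jobs.
    """
    if row_count < threshold:
        return SINGLE_NODE_CONFIG
    return MULTI_NODE_CONFIG


if __name__ == "__main__":
    for rows in (50_000, 25_000_000):
        print(rows, "->", choose_config_file(rows))
```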

Sample Configuration file (Config File):

{
    node "node1" {
        fastname "machine1"
        pools "" "oraclewrite"
        resource disk "/Local_1/mypath" {pools ""}
        resource disk "/Local_2/mypath" {pools ""}
        resource scratchdisk "/Local_3_1/mypath" {pools ""}
        resource scratchdisk "/Local_3_2/mypath" {pools ""}
    }
    node "node2" {
        fastname "machine2"
        pools "" "oraclewrite"
        resource disk "/Local_4/mypath" {pools ""}
        resource scratchdisk "/Local_5_1/mypath" {pools ""}
        resource scratchdisk "/Local_5_2/mypath" {pools ""}
    }
    node "node3" {
        fastname "machine2"
        pools "oraclewrite"
        resource disk "/Local_4/mypath" {pools ""}
        resource scratchdisk "/Local_5_1/mypath" {pools ""}
        resource scratchdisk "/Local_5_2/mypath" {pools ""}
    }
}

The config file is explained below:

1. We are running on two different machines, i.e. "machine1" and "machine2"
2. Multiple "resource disk" and "resource scratchdisk" entries give each node more disk space and spread I/O across file systems
3. "oraclewrite" is assigned to all three nodes for parallel execution when reading from and writing to the Oracle DB
4. Other stages will run on only two nodes, because the default pool ("") is defined only on "node1" and "node2"

For the Netezza Connector, please enable “Partitioned Read”; the number of reads equals the number of data partitions in the config file.
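
To double-check which nodes belong to which pool, the sample can be summarized with a small Python script. This is a rough sketch: the line-based parsing assumes one statement per line, as in the sample above, and is not a full parser for the config file grammar. The function name summarize and the shortened copy of the sample (one resource line per node) are my own.

```python
import re
from collections import defaultdict

# Shortened copy of the sample config (resource lines are ignored by the parser).
SAMPLE_CONFIG = '''
{
node "node1" {
fastname "machine1"
pools "" "oraclewrite"
resource disk "/Local_1/mypath" {pools ""}
}
node "node2" {
fastname "machine2"
pools "" "oraclewrite"
resource disk "/Local_4/mypath" {pools ""}
}
node "node3" {
fastname "machine2"
pools "oraclewrite"
resource disk "/Local_4/mypath" {pools ""}
}
}
'''

QUOTED = re.compile(r'"([^"]*)"')


def summarize(config_text):
    """Return (machine per node, nodes per pool) for a simple config file."""
    machines = {}
    pool_nodes = defaultdict(list)
    current = None
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("node "):
            current = QUOTED.findall(line)[0]
        elif line.startswith("fastname") and current:
            machines[current] = QUOTED.findall(line)[0]
        elif line.startswith("pools") and current:
            # Node-level pools only; "resource ..." lines never start with "pools".
            for pool in QUOTED.findall(line):
                pool_nodes[pool].append(current)
    return machines, dict(pool_nodes)


if __name__ == "__main__":
    machines, pools = summarize(SAMPLE_CONFIG)
    print("machines:", sorted(set(machines.values())))
    for pool, nodes in pools.items():
        print("pool", repr(pool), "->", nodes)
```

The default pool ("") comes out with two nodes and "oraclewrite" with three, which matches points 3 and 4.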

Wednesday, February 17, 2016

Big Data File Stage in Datastage 11.x


Hi mates. As we all know, in this era of Hadoop and Big Data, everyone is moving towards working with HDFS. IBM has also introduced a Datastage component that lets Datastage developers and designers access the Hadoop Distributed File System (HDFS) from IBM Datastage.

To access HDFS via InfoSphere, we first have to create the ishdfs.config file with the required classpath details. The HDFS client .jar files and configuration file directories
must be accessible to the InfoSphere Server Engine.
If you are using InfoSphere BigInsights HDFS and use the syncbi.sh tool to obtain the .jar files,
the ishdfs.config file is created for you automatically from the ishdfs.config.biginsights file.
This ishdfs.config file points to the .jar files that are downloaded and unpacked in the $DSHOME/../biginsights directory.

Contents of the ishdfs.config file:
CLASSPATH= $DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpclient-4.2.1.jar:$DSHOME/../../ASBNode/eclipse/plugins/com.ibm.iis.client/httpcore-4.2.1.jar:$DSHOME/../PXEngine/java/biginsights-restfs-1.0.0.jar:$DSHOME/../PXEngine/java/cc-http-api.jar:$DSHOME/../PXEngine/java/cc-http-impl.jar:/opt/IBM/biginsights/IHC/lib/*:/opt/IBM/biginsights/IHC/*:/opt/IBM/biginsights/lib/JSON4J.jar:/opt/IBM/biginsights/hadoop-conf

Location to save the ishdfs.config file:
/opt/IBM/InformationServer/Server/DSEngine
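
Since every classpath entry must be accessible to the engine, a quick check before running a job can save time. Here is a minimal Python sketch, assuming the file has a single CLASSPATH= line with colon-separated entries, as shown above. The function name missing_classpath_entries is hypothetical, and the DSHOME fallback is simply the directory mentioned above.

```python
import glob
import os


def missing_classpath_entries(config_text, dshome):
    """List CLASSPATH entries in ishdfs.config text that do not exist on disk."""
    missing = []
    for line in config_text.splitlines():
        if not line.strip().startswith("CLASSPATH="):
            continue
        value = line.split("=", 1)[1].strip()
        for entry in value.split(":"):
            if not entry:
                continue
            path = entry.replace("$DSHOME", dshome)
            # Entries ending in /* stand for every file in a directory.
            found = glob.glob(path) if "*" in path else os.path.exists(path)
            if not found:
                missing.append(entry)
    return missing


if __name__ == "__main__":
    # Assumed default; use the DSHOME of your own installation.
    dshome = os.environ.get("DSHOME", "/opt/IBM/InformationServer/Server/DSEngine")
    config_path = os.path.join(dshome, "ishdfs.config")
    if os.path.exists(config_path):
        with open(config_path) as handle:
            for entry in missing_classpath_entries(handle.read(), dshome):
                print("not found:", entry)
    else:
        print("no ishdfs.config in", dshome)
```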

Apart from this configuration, the other options and operations are much the same as for a normal file stage in Datastage: you select the partitioning method,
file delimiter, and so on.

Monday, February 1, 2016

Data Science - Data Mining Algorithms


As we all know, R is gaining ground in analytics, so why not talk about a few widely used data mining algorithms in R.

While working with R, I found the algorithms below very useful for data mining; it's a personal choice, though. Plenty of tools are also available for mining data, and they produce respectable results, but as a programmer it's always great to design the algorithm the way you want. Let's not waste any more time and go through a few data mining algorithms that I found best while working on a data analytics project.

1. Decision Tree

Decision tree builds classification or regression models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

The core algorithm for building decision trees is ID3, developed by J. R. Quinlan, which employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses entropy and information gain to construct a decision tree.

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values.
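
To make entropy and information gain concrete, here is a small Python sketch (Python rather than R, simply to keep the example dependency-free). The toy weather-style rows and the function names are made up for illustration. It shows only the attribute-selection step that ID3 repeats at each node, not a full tree builder.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


# Hypothetical toy data: "outlook" separates the classes, "windy" does not.
ROWS = [
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
]

if __name__ == "__main__":
    for attribute in ("outlook", "windy"):
        print(attribute, information_gain(ROWS, attribute, "play"))
    # ID3 greedily splits on the attribute with the highest gain.
    print("split on:", max(("outlook", "windy"),
                           key=lambda a: information_gain(ROWS, a, "play")))
```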


2. Random Forest

Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest. 

Single decision trees often have high variance or high bias. Random Forests attempt to mitigate both problems by averaging over many trees to find a natural balance between the two extremes.
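
The averaging idea can be sketched with a deliberately simplified ensemble in Python. Note the swap: instead of full trees, each "tree" here is a one-split decision stump trained on a bootstrap sample and a randomly chosen feature, and the ensemble predicts by majority vote. The data and all names are hypothetical; a real Random Forest grows deeper trees and samples features at every split.

```python
import random
from collections import Counter


def train_stump(rows, labels, feature):
    """Pick the threshold on one feature that misclassifies the fewest rows."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [l for row, l in zip(rows, labels) if row[feature] <= threshold]
        right = [l for row, l in zip(rows, labels) if row[feature] > threshold]
        if not right:
            continue  # threshold at the maximum value does not split anything
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # all values identical: fall back to the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (feature, float("inf"), majority, majority)
    return (feature, best[1], best[2], best[3])


def train_forest(rows, labels, n_trees=25, seed=0):
    """Train stumps on bootstrap samples, each using one random feature."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))
        forest.append(train_stump([rows[i] for i in picks],
                                  [labels[i] for i in picks], feature))
    return forest


def predict(forest, row):
    """Majority vote over all stumps."""
    votes = Counter()
    for feature, threshold, left_label, right_label in forest:
        votes[left_label if row[feature] <= threshold else right_label] += 1
    return votes.most_common(1)[0][0]


# Hypothetical two-feature data with two well-separated classes.
ROWS = [(1, 1), (2, 2), (3, 1), (8, 9), (9, 8), (7, 9)]
LABELS = ["a", "a", "a", "b", "b", "b"]

if __name__ == "__main__":
    forest = train_forest(ROWS, LABELS)
    print(predict(forest, (1, 1)), predict(forest, (9, 9)))
```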


3. Association Rule Mining (e.g. Market Basket Analysis)

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using some measures of interestingness. 
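
The usual measures of interestingness are support, confidence, and lift. Here is a small Python sketch of how they are computed for a single rule. The shopping baskets are invented for illustration, and a real miner (Apriori, for example) would also search for candidate itemsets rather than score one rule.

```python
# Hypothetical market baskets.
TRANSACTIONS = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """How often the consequent appears in baskets that contain the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's baseline support (1.0 = independent)."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


if __name__ == "__main__":
    rule = ({"bread"}, {"butter"})
    print("support   ", support(rule[0] | rule[1], TRANSACTIONS))
    print("confidence", confidence(*rule, TRANSACTIONS))
    print("lift      ", lift(*rule, TRANSACTIONS))
```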


4. Regression Analysis – Linear Regression (Remember Ohm's Law)

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.
Regression analysis generates an equation to describe the statistical relationship between one or more predictor variables and the response variable. 
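
Sticking with Ohm's Law (V = I × R), the fitted slope of voltage against current is an estimate of the resistance. Here is a minimal least-squares sketch in Python for one explanatory variable. The measurements are made-up numbers and fit_line is my own helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Hypothetical noisy readings across a resistor of roughly 10 ohms.
    current_amps = [0.5, 1.0, 1.5, 2.0]
    voltage_volts = [5.1, 9.9, 15.2, 19.8]
    slope, intercept = fit_line(current_amps, voltage_volts)
    print("estimated resistance (ohms):", round(slope, 2))
    print("intercept (volts):", round(intercept, 2))
```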


5. K-means Clustering

Clustering is the process of partitioning a group of data points into a small number of clusters. A quantitative approach would be to measure certain features of the products. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions $\mu_i$, $i = 1, \dots, k$, of the cluster centers that minimize the squared distance from the data points to their cluster center. K-means clustering solves