Manmohan Mishra: Data Science - Data Mining Algorithms

As we all know, R is gaining ground in analytics, so why not talk about a few widely used data mining algorithms in R?

While working with R, I found the algorithms below very useful for data mining, though that is a personal choice. Plenty of tools are also available for mining data, and they produce respectable results, but as a programmer it is always great to design the algorithm the way you want. Let's not waste any more time and go through a few data mining algorithms that I found best while working on a data analytics project.

1. Decision Tree

A decision tree builds classification or regression models in the form of a tree structure. It breaks a dataset down into smaller and smaller subsets while, at the same time, an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

The core algorithm for building decision trees is ID3, developed by J. R. Quinlan, which employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 uses entropy and information gain to construct a decision tree.

A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar (homogeneous) values.
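To make the entropy and information gain idea concrete, here is a minimal Python sketch of how an ID3-style split could be scored. The toy rows and the attribute names (`outlook`, `windy`, `play`) are made up for illustration, and this only covers choosing the best attribute for one node, not building the whole tree.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy data: outlook fully determines play, windy tells us nothing.
rows = [
    {"outlook": "sunny", "windy": False, "play": "no"},
    {"outlook": "sunny", "windy": True, "play": "no"},
    {"outlook": "rainy", "windy": False, "play": "yes"},
    {"outlook": "rainy", "windy": True, "play": "yes"},
]

# ID3 greedily picks the attribute with the highest information gain.
best = max(["outlook", "windy"], key=lambda a: information_gain(rows, a, "play"))
print(best)
```

The greedy part is the `max` call: once an attribute is chosen for a node, ID3 never revisits that choice, which is what "no backtracking" means.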

2. Random Forest

Random Forests are a combination of tree predictors where each tree depends on the values of a random vector sampled independently with the same distribution for all trees in the forest.

Single decision trees often have high variance or high bias. Random Forests attempt to mitigate both problems by averaging the predictions of many trees, which finds a natural balance between the two extremes.
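The averaging idea can be sketched in a few lines of Python. This is a deliberately tiny stand-in, not a full Random Forest: each "tree" is a one-split stump, trained on a bootstrap sample and on one randomly chosen feature, and for classification the averaging becomes a majority vote. The function names and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def majority(labels):
    """Most common label in a list."""
    return Counter(labels).most_common(1)[0][0]

def fit_stump(rows, feature):
    """Fit a one-split tree on a single feature by minimizing training errors."""
    best = None
    for threshold in sorted({x[feature] for x, _ in rows}):
        left = [y for x, y in rows if x[feature] <= threshold]
        right = [y for x, y in rows if x[feature] > threshold]
        left_label = majority(left)
        right_label = majority(right) if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x[feature] <= threshold else right_label

def fit_forest(rows, n_trees=25, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)        # the per-tree random choice
        trees.append(fit_stump(sample, feature))
    return trees

def predict(trees, x):
    """Combine the individual trees by majority vote."""
    return majority([tree(x) for tree in trees])

# Hypothetical toy data: (features, label)
data = [((1.0, 1.2), "low"), ((1.5, 0.8), "low"), ((0.9, 1.1), "low"), ((1.2, 1.4), "low"),
        ((5.0, 5.5), "high"), ((6.0, 4.8), "high"), ((5.5, 6.1), "high"), ((4.9, 5.2), "high")]
forest = fit_forest(data)
print(predict(forest, (0.5, 0.5)), predict(forest, (5.5, 5.5)))
```

Any single stump can be thrown off by an unlucky bootstrap sample, but the vote across 25 of them is much more stable, and that is the point of the method.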

3. Association Rule Mining (as used in Market Basket Analysis)

Association rule learning is a method for discovering interesting relations between variables in large databases. It is intended to identify strong rules discovered in databases using measures of interestingness such as support, confidence, and lift.
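Three commonly used measures of interestingness are support, confidence, and lift. Here is a minimal Python sketch of how they could be computed for a single rule. The baskets are invented for illustration, and a real miner (Apriori, for example) would also search for candidate itemsets, which this sketch does not do.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """How often the consequent appears in transactions containing the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

def lift(transactions, antecedent, consequent):
    """Confidence relative to how common the consequent is on its own."""
    return confidence(transactions, antecedent, consequent) / support(transactions, consequent)

# Hypothetical market baskets
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

# Score the rule {bread} -> {milk}
print(support(baskets, {"bread", "milk"}))
print(confidence(baskets, {"bread"}, {"milk"}))
print(lift(baskets, {"bread"}, {"milk"}))
```

A lift above 1 suggests the items are bought together more often than chance would predict, and a lift below 1 suggests the opposite.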

4. Regression Analysis – Linear Regression (Remember Ohm's Law)

Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data. One variable is considered to be an explanatory variable, and the other is considered to be a dependent variable.

Regression analysis generates an equation to describe the statistical relationship between one or more predictor variables and the response variable.
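Sticking with the Ohm's Law hint (V = I·R), here is a minimal Python sketch of ordinary least squares with one explanatory variable. The current and voltage readings are hypothetical, and the function name `fit_line` is made up for this example. If the readings follow Ohm's Law, the fitted slope should land close to the resistance.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy readings across a resistor of about 2 ohms
current = [1.0, 2.0, 3.0, 4.0]   # explanatory variable (amperes)
voltage = [2.1, 3.9, 6.2, 7.8]   # dependent variable (volts)

slope, intercept = fit_line(current, voltage)
print(f"V = {intercept:.2f} + {slope:.2f} * I")
```

The slope is the estimated resistance, and the small non-zero intercept is just the measurement noise showing up in the fit.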

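5. K-means Clustering

Clustering is the process of partitioning a group of data points into a small number of clusters. A quantitative approach is to measure certain features of each item (products, for example) and group the items whose measurements are close together. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions μ_i, i = 1...k, of the cluster centers that minimize the squared distance from the data points to their cluster center. K-means clustering solves

$$\underset{c}{\arg\min} \sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^2$$

where c_i is the set of points assigned to cluster i.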
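This minimization is usually approached iteratively with Lloyd's algorithm: assign each point to its nearest center, then move each center to the mean of its points, and repeat. Here is a minimal Python sketch under some simplifying assumptions: the starting centers are passed in by hand, the number of iterations is fixed, and the points are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def kmeans(points, centers, iterations=10):
    """Alternate the assignment step and the update step a fixed number of times."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [mean_point(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical 2-D points forming two obvious groups
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.0), (8.0, 10.0)]
centers, clusters = kmeans(points, centers=[(1.0, 1.0), (9.0, 9.0)])
print(centers)
```

The result depends on the starting centers, so in practice it is common to run the algorithm several times from different random starts and keep the best result.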


Monday, February 1, 2016
