Before applying an algorithm, we have to know what kind of problem it is being applied to. There are four classic scenarios in machine learning.
Regression: use this when you want to predict values, for example to estimate product amounts, predict sales figures, or analyze marketing returns. What you are trying to predict is a continuous variable, and that is when regression algorithms apply.
Anomaly detection: use this when you want to find unusual occurrences, for example to predict credit risk, detect fraud, or catch abnormal equipment readings.
Clustering: use this when you want to discover structure in your data, for example to perform customer segmentation, predict customer tastes, or determine market price.
Classification: use this when you want to predict categories, that is, to identify which category new information belongs to.
Each of these four types includes several specific algorithms, chosen according to your requirements.
Regression: if you want to predict rank-ordered categories, use ordinal regression.
If you want to predict event counts, use Poisson regression.
If you want to predict a distribution, use fast forest quantile regression.
For fast training and a linear model, use linear regression.
For a linear model and small data sets, use Bayesian linear regression.
For accuracy when long training time is acceptable, use neural network regression.
For accuracy and fast training, use decision forest regression.
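As a minimal sketch of the linear regression option above, the slope and intercept of a best-fit line can be computed in closed form by least squares. The function name `fit_line` and the sample data are assumptions made for illustration and do not come from any particular library.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b


# Hypothetical data: the points lie exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

Because there is nothing to iterate over, this kind of model trains very quickly, which is the trade-off the guideline above refers to.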
Clustering: for clustering, use K-means.
Anomaly detection: if you have more than 100 features and want an aggressive boundary, use one-class SVM.
For fast training, use PCA-based anomaly detection.
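One common way to turn PCA into an anomaly detector is to score each point by how poorly it is reconstructed from the leading principal component. The sketch below assumes that formulation; it is not necessarily what any specific product implements. The function names and the data are hypothetical, and power iteration is used only to keep the example free of external libraries.

```python
import math


def first_component(rows, iterations=100):
    """Approximate the top principal direction of centered rows by power iteration."""
    dim = len(rows[0])
    # Unnormalized covariance matrix; scaling does not change its eigenvectors
    cov = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def anomaly_scores(data):
    """Score each row by its squared distance from the first principal axis."""
    n, dim = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in data]
    v = first_component(centered)
    scores = []
    for r in centered:
        proj = sum(r[j] * v[j] for j in range(dim))  # coordinate along the axis
        residual = [r[j] - proj * v[j] for j in range(dim)]  # part PCA cannot reconstruct
        scores.append(sum(x * x for x in residual))
    return scores


# Hypothetical data: ten points close to the line y = x, plus one point far from it
data = [[0, 0.1], [1, 0.9], [2, 2.1], [3, 2.9], [4, 4.1],
        [5, 4.9], [6, 6.1], [7, 6.9], [8, 8.1], [9, 8.9], [2, 8.0]]
scores = anomaly_scores(data)
most_anomalous = scores.index(max(scores))
print(most_anomalous)
```

In this data the last row is the one that deviates from the overall trend, so it receives the largest reconstruction error.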
Classification: it is divided into two types:
- Two-class classification
- Multiclass classification
For more than 100 features and a linear model, use two-class SVM.
For fast training and a linear model, use two-class averaged perceptron.
For accuracy and fast training, use two-class decision forest.
For accuracy, fast training, and a large memory footprint, use two-class boosted decision tree.
For accuracy and a small memory footprint, use two-class decision jungle.
For fast training and a linear model, use multiclass logistic regression.
For accuracy when long training time is acceptable, use a multiclass neural network.
For accuracy and fast training, use multiclass decision forest.
For accuracy and a small memory footprint, use multiclass decision jungle.
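To make one of the fast linear options above concrete, here is a rough sketch of a two-class averaged perceptron in its usual textbook form: run the ordinary perceptron updates, but return the average of the weights seen over all steps. The function names, the label encoding (-1 and +1), and the data are assumptions made for illustration.

```python
def train_averaged_perceptron(samples, labels, epochs=10):
    """Train on labels in {-1, +1}; return the averaged weights and bias."""
    dim = len(samples[0])
    w, b = [0.0] * dim, 0.0
    w_sum, b_sum, steps = [0.0] * dim, 0.0, 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified: standard perceptron update
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
            # Accumulate after every example so long-lived weights count more
            w_sum = [s + wi for s, wi in zip(w_sum, w)]
            b_sum += b
            steps += 1
    return [s / steps for s in w_sum], b_sum / steps


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the linear boundary x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Hypothetical, linearly separable data
samples = [[2.0, 1.0], [3.0, 2.0], [4.0, 3.0], [-2.0, -1.0], [-3.0, -2.0], [-4.0, -1.0]]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_averaged_perceptron(samples, labels)
print([predict(w, b, x) for x in samples])
```

Each pass costs only one dot product per example, which is why this family of models is listed under fast training.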
Here is a list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:
- Linear Regression
- Logistic Regression
- Decision Tree
- SVM
- Naive Bayes
- KNN
- K-Means
- Random Forest
- Dimensionality Reduction Algorithms
- Gradient Boosting & AdaBoost
- Linear Regression: It is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish a relationship between the independent and dependent variables by fitting a best-fit line. This line is known as the regression line and is represented by the linear equation Y = a * X + b, where Y is the dependent variable, a is the slope, X is the independent variable, and b is the intercept.
- Logistic Regression: Despite its name, it is a classification algorithm, not a regression algorithm. It is used to estimate discrete values (binary values like 0/1, yes/no, true/false) based on a given set of independent variables. In simple words, it predicts the probability of occurrence of an event by fitting data to a logistic function. Hence, it is also known as logit regression. Since it predicts a probability, its output values lie between 0 and 1.
- Decision Tree: It is a type of supervised learning algorithm that is mostly used for classification problems. It works for both categorical and continuous dependent variables.
- SVM (Support Vector Machine): It is a classification method. In this algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
- Naive Bayes: It is a classification technique based on Bayes’ theorem with an assumption of independence between predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.
- KNN: It can be used for both classification and regression problems. However, it is more widely used for classification problems in industry. K-nearest neighbors is a simple algorithm that stores all available cases and classifies new cases by a majority vote of their k nearest neighbors. A new case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function. The distance function can be Euclidean, Manhattan, Minkowski, or Hamming distance. The first three are used for continuous variables and the fourth (Hamming) for categorical variables. If K = 1, the case is simply assigned to the class of its nearest neighbor. Choosing K can be a challenge when building a KNN model.
Things to consider before selecting KNN:
- KNN is computationally expensive
- Variables should be normalized, otherwise variables with a larger range can bias it
- Spend more effort on the pre-processing stage, such as outlier and noise removal, before applying KNN
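The KNN description and the normalization advice above can be sketched together as follows. This is a minimal illustration, not a tuned implementation: it uses min-max scaling, Euclidean distance, and a plain majority vote, and the function names and the age/income data are hypothetical.

```python
import math
from collections import Counter


def min_max_normalize(rows):
    """Rescale every feature to [0, 1] so large-range features do not dominate."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [(max(c) - min(c)) or 1.0 for c in cols]  # avoid division by zero
    scaled = [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]
    return scaled, lows, spans


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest (Euclidean) neighbors."""
    # Every stored case is compared with the query, which is why KNN is expensive
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical data: [age, income]; income has a much larger range than age
raw = [[25, 30000], [30, 32000], [28, 31000], [55, 90000], [60, 95000], [58, 88000]]
labels = ["low", "low", "low", "high", "high", "high"]
train_x, lows, spans = min_max_normalize(raw)

# The query must be scaled with the same minimums and ranges as the training data
query = [(27 - lows[0]) / spans[0], (30500 - lows[1]) / spans[1]]
print(knn_predict(train_x, labels, query, k=3))
```

Without the scaling step, the income column would decide almost every distance on its own, which is the bias the second consideration warns about.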
- K-Means: It is a type of unsupervised algorithm that solves the clustering problem. Its procedure follows a simple way to partition a given data set into a certain number of clusters (assume k clusters). Data points inside a cluster are homogeneous with one another and heterogeneous to points in other clusters.
- Random Forest: Random Forest is a trademark term for an ensemble of decision trees. In Random Forest, we have a collection of decision trees (hence the name “Forest”). To classify a new object based on its attributes, each tree gives a classification and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest).
- Dimensionality Reduction Algorithms: In the last 4-5 years, there has been an exponential increase in data capture at every possible stage. Corporations, government agencies, and research organizations are not only adding new data sources but also capturing data in great detail. For example, e-commerce companies capture more details about customers, such as their demographics, web crawling history, what they like or dislike, purchase history, feedback, and many others, to give them more personalized attention than the nearest grocery shopkeeper could. With so many variables being captured, dimensionality reduction algorithms help reduce them to a smaller set that retains most of the useful information.
- Gradient Boosting & AdaBoost: GBM and AdaBoost are boosting algorithms used when we deal with plenty of data and want predictions with high predictive power. Boosting is an ensemble learning technique that combines the predictions of several base estimators in order to improve robustness over a single estimator. It combines multiple weak or average predictors to build a strong predictor. These boosting algorithms often perform well in data science competitions such as Kaggle, AV Hackathon, and CrowdANALYTIX.
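The idea of combining weak predictors into a strong one can be illustrated with a small AdaBoost sketch. It assumes the classic formulation with one-dimensional threshold "stumps" as weak learners and labels encoded as -1 and +1; the function names and the data are hypothetical, and a real project would normally rely on an established library instead.

```python
import math


def best_stump(xs, ys, weights):
    """Find the threshold/polarity stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            error = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, rounds=3):
    """Return a list of (alpha, threshold, polarity) weak learners."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the logarithm finite
        alpha = 0.5 * math.log((1 - error) / error)  # more accurate stumps get more say
        model.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points, lower the rest, then renormalize
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model


def predict(model, x):
    """Weighted vote of all stumps."""
    score = sum(a * (p if x >= t else -p) for a, t, p in model)
    return 1 if score >= 0 else -1


# Hypothetical 1-D data that no single threshold can separate
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1]
model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

Each individual stump misclassifies a third of this data when the points are weighted equally, yet the weighted vote of three stumps reproduces all nine labels, which is the sense in which boosting builds a strong predictor from weak ones.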