Machine Learning Algorithms as a Whole

HITARTH SHAH
16 min read · Mar 22, 2021

In this article, I will try to cover the theory and practical implementation of the most widely used machine learning algorithms, so it becomes easy to read or revise them in one place at any time.

1. Linear Regression

Linear regression is probably one of the most important and widely used regression techniques. It’s among the simplest regression methods. One of its main advantages is the ease of interpreting results.

Simple Linear Regression

Simple or single-variate linear regression is the simplest case of linear regression with a single independent variable, 𝐱 = 𝑥.
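A minimal sketch of simple linear regression with scikit-learn, assuming a tiny made-up one-feature dataset (the numbers are purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: a single feature x (as a column) and a response y
x = np.array([5, 15, 25, 35, 45, 55]).reshape(-1, 1)  # shape (n_samples, 1)
y = np.array([5, 20, 14, 32, 22, 38])

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_)  # fitted intercept and slope
print(model.predict(x))               # predictions for the training inputs
```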

Multiple Linear Regression

In multiple linear regression, x is a two-dimensional array with at least two columns, while y is usually a one-dimensional array.
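Along the same lines, a hedged sketch with a two-column x (two independent variables), again on made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# x has two columns (two features); y is one-dimensional
x = np.array([[0, 1], [5, 1], [15, 2], [25, 5], [35, 11], [45, 15]])
y = np.array([4, 5, 20, 14, 32, 22])

model = LinearRegression().fit(x, y)
print(model.intercept_)  # single intercept
print(model.coef_)       # one coefficient per column of x
```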

Polynomial Regression

Implementing polynomial regression with scikit-learn is very similar to linear regression. There is only one extra step: you need to transform the array of inputs to include non-linear terms such as 𝑥².

The next step in polynomial regression is transforming the data. You need to include 𝑥² (and perhaps other terms) as additional features, so you should transform the input array x to contain the additional column(s) with the values of 𝑥² (and possibly more features).

If you call PolynomialFeatures with the default parameter include_bias=True (or if you just omit it), you’ll obtain the new input array x_ with the additional leftmost column containing only ones.

The first column of x_ contains ones, the second has the values of x, while the third holds the squares of x.

The intercept is already included with the leftmost column of ones, and you don’t need to include it again when creating the instance of LinearRegression. Thus, you can provide fit_intercept=False.
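Putting those steps together, a minimal sketch on toy data (since include_bias=True adds the column of ones, fit_intercept=False avoids fitting the intercept twice):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

x = np.array([5, 15, 25, 35, 45, 55]).reshape(-1, 1)
y = np.array([15, 11, 2, 8, 25, 32])

# Transform x into columns [1, x, x**2]
transformer = PolynomialFeatures(degree=2, include_bias=True)
x_ = transformer.fit_transform(x)

# The leftmost column of ones already plays the role of the intercept
model = LinearRegression(fit_intercept=False).fit(x_, y)
print(model.coef_)        # [intercept, linear term, quadratic term]
print(model.predict(x_))
```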

2. Logistic Regression

Logistic Regression is a supervised machine learning model in which the dependent variable is binary or multi-categorical. In other words, it is a classification algorithm: it separates observations into two or more discrete classes.
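A minimal classification sketch with scikit-learn's LogisticRegression, assuming a small made-up binary dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: one feature, labels 0/1
X = np.arange(10).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict(X))        # predicted class labels
print(clf.predict_proba(X))  # class probabilities for each sample
```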

3. K-Nearest Neighbors

The K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.

Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we load the training data as well as the test data.

Step 2 − Next, we choose the value of K, i.e. the number of nearest data points to consider. K can be any positive integer.

Step 3 − For each point in the test data, do the following −

  • 3.1 − Calculate the distance between the test point and each row of the training data using a distance measure such as Euclidean, Manhattan or Hamming distance. Euclidean distance is the most commonly used.
  • 3.2 − Now, based on the distance values, sort them in ascending order.
  • 3.3 − Next, choose the top K rows from the sorted array.
  • 3.4 − Now, assign a class to the test point based on the most frequent class among these K rows.

Step 4 − End
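A hedged sketch of these steps using scikit-learn's KNeighborsClassifier on the Iris dataset (Euclidean distance is the default metric; K=5 is just an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Step 1: load the data and split it into training and test sets
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 2: choose K
knn = KNeighborsClassifier(n_neighbors=5)  # Euclidean distance by default

# Step 3: distance computation, sorting and majority voting happen inside fit/predict
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print(accuracy_score(y_test, y_pred))
```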

4. Decision Tree

  • The decision tree algorithm falls under the category of supervised learning. It can be used to solve both regression and classification problems.
  • A decision tree uses a tree representation to solve the problem, in which each leaf node corresponds to a class label and attributes are represented on the internal nodes of the tree.
  • We can represent any boolean function on discrete attributes using a decision tree.

How does Decision Tree algorithm work ?

In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (from the real dataset) and, based on the comparison, follows the branch and jumps to the next node.

At the next node, the algorithm again compares the attribute value with those of the sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree. The complete process can be better understood using the below algorithm:

  • Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
  • Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
  • Step-3: Divide S into subsets that contain the possible values of the best attribute.
  • Step-4: Generate the decision tree node, which contains the best attribute.
  • Step-5: Recursively make new decision tree nodes using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified any further; these final nodes are called leaf nodes.
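A minimal sketch of training a decision tree with scikit-learn; the attribute selection measure (discussed next) is chosen through the criterion parameter, and the dataset and depth here are only illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion='entropy' uses information gain; criterion='gini' uses the Gini index
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print(tree.score(X_test, y_test))  # accuracy on the held-out data
```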

Attribute Selection Measures

  1. Information Gain
  2. Gini Index

Information Gain : Information gain is the measure of the change in entropy after a dataset is segmented on an attribute.

  • It calculates how much information a feature provides us about a class.
  • According to the value of information gain, we split the node and build the decision tree.
  • A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first.

Information Gain = Entropy(S) - [(Weighted Avg) * Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data. Entropy can be calculated as:

Entropy(S) = -P(yes) log2 P(yes) - P(no) log2 P(no)

Where,

  • S = the set of samples
  • P(yes)= probability of yes
  • P(no)= probability of no

Gini Index : Gini index is a measure of impurity or purity used while creating a decision tree in the CART(Classification and Regression Tree) algorithm.

  • An attribute with a low Gini index should be preferred over one with a high Gini index.
  • It only creates binary splits, and the CART algorithm uses the Gini index to create binary splits.
  • Gini index can be calculated using the below formula:
Gini Index = 1 - Σⱼ Pⱼ²
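These measures are easy to compute directly. A small NumPy sketch for a column of labels (the toy labels and the split used at the end are purely illustrative):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_j P_j * log2(P_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    """Gini = 1 - sum_j P_j**2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def information_gain(parent_labels, subsets):
    """Entropy(S) minus the weighted average entropy of the subsets."""
    n = len(parent_labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent_labels) - weighted

y = np.array(["yes", "yes", "no", "no", "yes", "no"])
print(entropy(y), gini(y))
print(information_gain(y, [y[:3], y[3:]]))  # gain from splitting into two halves
```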

5. Random Forest

Random forest is a type of supervised machine learning algorithm based on ensemble learning. Ensemble learning is a type of learning where you join different types of algorithms, or the same algorithm multiple times, to form a more powerful prediction model. The random forest algorithm combines multiple algorithms of the same type, i.e. multiple decision trees, resulting in a forest of trees, hence the name “Random Forest”. The random forest algorithm can be used for both regression and classification tasks.

Advantages of Random Forest :

As with any algorithm, there are advantages and disadvantages to using it. In the next two sections we’ll take a look at the pros and cons of using random forest for classification and regression.

  1. The random forest algorithm is less prone to bias, since there are multiple trees and each tree is trained on a subset of the data. Basically, the random forest algorithm relies on the power of “the crowd”, so the overall bias of the algorithm is reduced.
  2. This algorithm is very stable. Even if a new data point is introduced in the dataset the overall algorithm is not affected much since new data may impact one tree, but it is very hard for it to impact all the trees.
  3. The random forest algorithm works well when you have both categorical and numerical features.
  4. The random forest algorithm also works well when the data has missing values or has not been scaled well.

Disadvantages of Random Forest :

  1. A major disadvantage of random forests lies in their complexity. They require much more computational resources, owing to the large number of decision trees joined together.
  2. Due to their complexity, they require much more time to train than other comparable algorithms.

Random Forest for Regression :
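A hedged sketch using scikit-learn's RandomForestRegressor on a synthetic regression dataset (the data and parameter values are only for illustration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic regression data, just for illustration
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, y_train)
print(mean_squared_error(y_test, reg.predict(X_test)))
```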

Random Forest for Classification :
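And a matching sketch for classification with RandomForestClassifier, again on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic classification data, just for illustration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```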

6. Support Vector Machines

A Support Vector Machine (SVM) is a supervised machine learning algorithm that can be employed for both classification and regression purposes. SVMs are more commonly used in classification problems and as such, this is what we will focus on in this post.

SVMs are based on the idea of finding a hyperplane that best divides a dataset into two classes.

Support Vectors

Support vectors are the data points nearest to the hyperplane, the points of a data set that, if removed, would alter the position of the dividing hyperplane. Because of this, they can be considered the critical elements of a data set.

What is a hyperplane?

As a simple example, for a classification task with only two features, you can think of a hyperplane as a line that linearly separates and classifies a set of data.

Intuitively, the further from the hyperplane our data points lie, the more confident we are that they have been correctly classified. We therefore want our data points to be as far away from the hyperplane as possible, while still being on the correct side of it.

So when new test data is added, whichever side of the hyperplane it lands on decides the class that we assign to it.

But what happens when there is no clear hyperplane?

When there is no clear hyperplane, or it is not possible to fit one, we use SVM kernels. Kernels map low-dimensional data (e.g. 2D) into a higher-dimensional space (e.g. 3D) where a separating hyperplane can be found.
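A minimal sketch with scikit-learn's SVC: a linear kernel when the data is linearly separable, and an RBF kernel when it is not (the two-moons toy data here is just an illustration):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

print("linear kernel:", linear_svm.score(X_test, y_test))
print("RBF kernel:   ", rbf_svm.score(X_test, y_test))
```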

Pros & Cons of Support Vector Machines

Pros

  • Accuracy
  • Works well on smaller cleaner datasets
  • It can be more efficient because it uses a subset of training points

Cons

  • Isn’t suited to larger datasets as the training time with SVMs can be high
  • Less effective on noisier datasets with overlapping classes

SVM Uses

SVM is used for text classification tasks such as category assignment, detecting spam and sentiment analysis. It is also commonly used for image recognition challenges, performing particularly well in aspect-based recognition and color-based classification. SVM also plays a vital role in many areas of handwritten digit recognition, such as postal automation services.

7. Naive Bayes :

  • Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem and used for solving classification problems.
  • It is mainly used in text classification that includes a high-dimensional training dataset.
  • The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
  • It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
  • Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

Naïve Bayes rests on Bayes’ theorem: P(A|B) = P(B|A) * P(A) / P(B)

Where,

P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.

P(B|A) is Likelihood probability: Probability of the evidence given that hypothesis A is true.

P(A) is Prior Probability: Probability of hypothesis before observing the evidence.

P(B) is Marginal Probability: Probability of Evidence.
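A tiny worked example of the rule with made-up numbers, just to show how the four quantities combine:

```python
# Hypothetical numbers, purely for illustration
p_a = 0.3          # prior probability of hypothesis A
p_b_given_a = 0.8  # likelihood of observing evidence B if A is true
p_b = 0.5          # marginal probability of B

p_a_given_b = p_b_given_a * p_a / p_b  # posterior: P(A|B)
print(p_a_given_b)                     # 0.48
```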

Advantages of Naïve Bayes Classifier:

  • Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
  • It can be used for Binary as well as Multi-class Classifications.
  • It performs well in Multi-class predictions as compared to the other Algorithms.
  • It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier:

  • Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

Applications of Naïve Bayes Classifier:

  • It is used for Credit Scoring.
  • It is used in medical data classification.
  • It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
  • It is used in Text classification such as Spam filtering and Sentiment analysis.
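A minimal text-classification sketch with scikit-learn's MultinomialNB; the tiny "spam" corpus is made up purely for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 1 = spam, 0 = not spam
texts = ["win a free prize now", "meeting at noon tomorrow",
         "free cash offer, click now", "project report attached"]
labels = [1, 0, 1, 0]

# Bag-of-words counts feed a multinomial Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free prize", "see you at the meeting"]))
```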

8. K-Means Algorithm :

K-Means Clustering is an unsupervised learning algorithm which groups an unlabeled dataset into different clusters. Here K defines the number of pre-defined clusters that need to be created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters; and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group, and the points in each group share similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover the categories of groups in an unlabeled dataset on its own, without the need for any labeled training data.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.

The k-means clustering algorithm mainly performs two tasks:

  • Determines the best values for the K center points, or centroids, by an iterative process.
  • Assigns each data point to its closest k-center. The data points that are nearest to a particular k-center form a cluster.

Hence each cluster contains data points with some commonalities and is away from the other clusters.
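A minimal sketch with scikit-learn's KMeans on synthetic blobs (the data and K=3 are only illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)  # cluster index for each point
print(kmeans.cluster_centers_)  # the 3 learned centroids
print(kmeans.inertia_)          # sum of squared distances to the nearest centroid
```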

9. PCA :

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.
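A minimal sketch with scikit-learn's PCA, reducing the Iris features from 4 dimensions to 2; standardizing first is a common but optional choice:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)    # shape (150, 2)
print(pca.explained_variance_ratio_)  # fraction of variance kept by each component
```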

10. AdaBoost Algorithm :

AdaBoost, short for “Adaptive Boosting”, is the first practical boosting algorithm proposed by Freund and Schapire in 1996. It focuses on classification problems and aims to convert a set of weak classifiers into a strong one.

How To Train One Model

A weak classifier (decision stump) is prepared on the training data using the weighted samples. Only binary (two-class) classification problems are supported, so each decision stump makes one decision on one input variable and outputs a +1.0 or -1.0 value for the first or second class value.

The misclassification rate is calculated for the trained model. Traditionally, this is calculated as:

error = (N - correct) / N

Where error is the misclassification rate, correct is the number of training instances predicted correctly by the model and N is the total number of training instances. For example, if the model predicted 78 of 100 training instances correctly, the error or misclassification rate would be (100 - 78) / 100, or 0.22.

This is modified to use the weighting of the training instances:

error = sum(w(i) * terror(i)) / sum(w)

This is the weighted misclassification rate, where w(i) is the weight for training instance i and terror(i) is the prediction error for training instance i, which is 1 if misclassified and 0 if correctly classified.

For example, suppose we had 3 training instances with the weights 0.01, 0.5 and 0.2. If the predicted values were -1, -1 and -1, and the actual output variables in the instances were -1, 1 and -1, then the terrors would be 0, 1, and 0. The misclassification rate would be calculated as:

error = (0.01*0 + 0.5*1 + 0.2*0) / (0.01 + 0.5 + 0.2)

or

error = 0.704

A stage value is calculated for the trained model which provides a weighting for any predictions that the model makes. The stage value for a trained model is calculated as follows:

stage = ln((1-error) / error)

Where stage is the stage value used to weight predictions from the model, ln() is the natural logarithm and error is the misclassification error for the model. The effect of the stage weight is that more accurate models have more weight or contribution to the final prediction.

The training weights are updated giving more weight to incorrectly predicted instances, and less weight to correctly predicted instances.

For example, the weight of one training instance (w) is updated using:

w = w * exp(stage * terror)

Where w is the weight for a specific training instance, exp() raises Euler’s number e to the given power, stage is the stage value for the weak classifier and terror is the error the weak classifier made predicting the output variable for the training instance, evaluated as:

terror = 0 if(y == p), otherwise 1

Where y is the output variable for the training instance and p is the prediction from the weak learner.

This has the effect of not changing the weight if the training instance was classified correctly and making the weight slightly larger if the weak learner misclassified the instance.
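A small NumPy sketch of one boosting round using the numbers from the example above (weights 0.01, 0.5 and 0.2; predictions all -1; actuals -1, 1 and -1):

```python
import numpy as np

w = np.array([0.01, 0.5, 0.2])  # current instance weights
y = np.array([-1, 1, -1])       # actual labels
p = np.array([-1, -1, -1])      # weak learner's predictions

terror = (y != p).astype(float)         # 1 if misclassified, else 0
error = np.sum(w * terror) / np.sum(w)  # weighted misclassification rate
stage = np.log((1 - error) / error)     # stage value for this weak learner
w_new = w * np.exp(stage * terror)      # weight update rule (grows misclassified weights when error < 0.5)

print(round(error, 3))  # ~0.704
print(w_new)            # only the misclassified (second) instance's weight changes
```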

11. XGBoost Algorithm :

XGBoost is one of the leading models for working with standard tabular data (the type of data you store in Pandas DataFrames, as opposed to more exotic types of data like images and videos). XGBoost models dominate many Kaggle competitions.

To reach peak accuracy, XGBoost models require more knowledge and model tuning than techniques like Random Forest.
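A minimal sketch with the xgboost package's scikit-learn-style wrapper; the synthetic data and parameter values are illustrative starting points, not tuned settings:

```python
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=1000, n_features=20, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Illustrative hyperparameters; tuning them is where XGBoost needs the extra effort
model = XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4)
model.fit(X_train, y_train)
print(mean_absolute_error(y_test, model.predict(X_test)))
```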

So, this brings us to the end of our tour of ML algorithms. Forgive me if I have missed any important algorithm. I hope this article will be helpful whenever you want to read or revise any ML algorithm. Any feedback is welcome. Thank You!

Happy Learning !!!!
