Families of algorithms
There are tons of algorithms in the machine learning universe, and more are devised each year. Tremendous research is happening in this space, hence the ever-increasing list of algorithms. It is also a fact that the more these algorithms are used, the more improvements to them are discovered. Machine learning is one space where industry and academia run hand in hand.
But, as Spider-Man was told, with great power comes great responsibility, and the reader should understand the responsibility at hand. With so many algorithms available, it is necessary to understand what they are and where they fit. It can feel overwhelming and confusing at first, but that is where categorizing them into families helps.
Machine learning algorithms can be categorized in many ways. The most common way is to group them into supervised learning algorithms and unsupervised learning algorithms.
Supervised learning algorithms
Supervised learning refers to algorithms which are trained on a predefined data set called the training data set. The training data set is usually a two element tuple consisting of an input element and a desired output element or signal. In general, the input element is a vector. The supervised learning algorithm uses the training data set to produce the desired function. The function so produced (or rather inferred) is then utilized to correctly map new data, better termed as the test data.
An algorithm which has learnt well will be able to correctly determine the outputs for unseen data in a reasonable way. This brings in the concepts of generalization and overfitting.
Briefly, generalization refers to the concept wherein an algorithm generalizes the desired function based upon the (limited) training data to handle unseen data in a correct manner. Overfitting is exactly the opposite concept of generalization, wherein an algorithm infers a function such that it maps exactly to the training data set (including the noise). This may result in huge errors when the function learnt by the algorithm is checked against new/unseen data.
Both generalization and overfitting revolve around the random errors or noise in the input data. While generalization tries to minimize the effect of noise, overfitting does the opposite by fitting in noise as well.
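The difference is easy to see with a small simulation. The sketch below is our own illustration on synthetic noisy data, not an example from the text: a simple straight-line fit generalizes to held-out points, while a deliberately over-flexible polynomial chases the training noise and typically does worse on the held-out points.

# synthetic data: a straight-line trend plus random noise
set.seed(42)
x <- seq(1, 20, by = 0.5)
y <- 2 * x + 3 + rnorm(length(x), sd = 4)
# hold out every fourth point as "unseen" data
test_idx <- seq(1, length(x), by = 4)
train <- data.frame(x = x[-test_idx], y = y[-test_idx])
test  <- data.frame(x = x[test_idx],  y = y[test_idx])
# a simple model vs. a high-degree polynomial that overfits
simple_fit  <- lm(y ~ x, data = train)
complex_fit <- lm(y ~ poly(x, 15), data = train)
# training error is lower for the complex model...
mean((train$y - predict(simple_fit))^2)
mean((train$y - predict(complex_fit))^2)
# ...but its error on the held-out points is typically higher
mean((test$y - predict(simple_fit,  newdata = test))^2)
mean((test$y - predict(complex_fit, newdata = test))^2)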
Problems to be solved using supervised methods can be divided into the following steps:
- Prepare training data: Data preparation is the most important step for all machine learning algorithms. Since supervised learning utilizes labeled input data sets (data sets which consist of corresponding outputs for given inputs), this step becomes even more important. This data is usually labeled by human experts or from measurements.
- Prepare the model: The model is the representation of the input data set and the learnt pattern. The model representation is affected by factors such as input features and the learning algorithm itself. The accuracy of the inferred function also depends on how this representation is formed.
- Choose an algorithm: Based on the problem being solved and the input information, an algorithm is then chosen to learn and solve the problem.
- Examine and fine tune: This is an iterative step where the algorithm is run on the input data set and the parameters are fine tuned to achieve the desired level of output. The algorithm is then tested on the test data set to evaluate its performance and measure the error.
Under supervised learning, two major sub categories are:
- Regression based machine learning: Learning algorithms which help us answer quantitative questions such as how many? or how much? More formally, these algorithms predict continuous output values for unseen/new data based on the training data and the model formed. Linear regression, multivariate regression, regression trees, and so on are a few supervised regression algorithms.
- Classification based machine learning: Learning algorithms which help us answer objective questions or yes-or-no predictions. For example, questions such as is this component faulty? or can this tumor cause cancer? More formally, these algorithms predict the class labels for unseen or new data based upon the training data and model formed. Support Vector Machines (SVM), decision trees, random forests, and so on are a few commonly used supervised classification algorithms.
Let us look at some supervised learning algorithms in detail.
Linear regression
Regression, as mentioned previously, helps us answer quantitative questions. Regression has its roots in the statistics domain. Researchers use a linear relationship to predict the output value Y for a given input value X. This linear relationship is called a linear regression or regression line.
Mathematically, linear regression is represented as:

Y = b0 + b1X

Where b0 is the intercept, or the point where the line crosses the y axis, and b1 is the slope of the line, that is, the change in y over the change in x.
The preceding equation is pretty similar to how a straight line is represented and hence the name linear regression.
Now, how do we decide which line we fit for our input so that it predicts well for unknown data? Well, for this we need an error measure. There can be various error measures; the most commonly used is the least squares method.
Before we define the least squares method, we first need to understand the term residual. A residual is simply the deviation of an observed value of y from its fitted value. Mathematically:

ei = yi − ŷi

Where ŷi is the fitted value of y.
The least squares method states that the optimal fit of a model to the data occurs when the sum of the squares of the residuals is minimum. Mathematically, we choose b0 and b1 to minimize:

Σei² = Σ(yi − ŷi)²

We use calculus to minimize this sum of squared residuals and find the corresponding coefficients.
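For a single predictor, the calculus yields closed-form expressions for the slope and intercept. The small helper below is our own illustration of those formulas; its result should agree with R's lm(), which we use in the example that follows.

# b1 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
# b0 = mean(y) - b1 * mean(x)
least_squares_fit <- function(x, y) {
  b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
  b0 <- mean(y) - b1 * mean(x)
  c(intercept = b0, slope = b1)
}
# quick check on a toy data set: should match coef(lm(y ~ x))
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
least_squares_fit(x, y)
coef(lm(y ~ x))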
Now that we understand linear regression, let us take a real world example to see it in action.
Suppose we have data related to the height and weight of school children. The data scientist in you suddenly starts thinking about whether there is any relation between the weights and heights of these children. Formally, could the weight of a child be predicted based upon his/her given height?
To fit in linear regression, the first step is to understand the data and see whether there is a correlation between the two variables (weight and height). Since in this case we are dealing with just two dimensions, visualizing the data using a scatter plot will help us understand it quickly. This will also enable us to determine whether the variables have a linear relationship or not.
Let us prepare our data first and visualize it on a scatter plot along with the correlation coefficient.
#Height and weight vectors for 19 children
height <- c(69.1,56.4,65.3,62.8,63,57.3,59.8,62.5,62.5,59.0,51.3,64,56.4,66.5,72.2,65.0,67.0,57.6,66.6)
weight <- c(113,84,99,103,102,83,85,113,84,99,51,90,77,112,150,128,133,85,112)
plot(height,weight)
cor(height,weight)
Output:
[1] 0.8848454
The scatter plot looks like this:
Figure displaying data points across weight and height dimensions
The preceding scatter plot confirms that our intuition about weight and height having a linear relationship was correct. This is further supported by the correlation function, which gives us a value of 0.88.
Time to prepare the model for our data set! We use the inbuilt lm (linear model) utility to find the coefficients b0 and b1.
#Fitting the linear model
model <- lm(weight ~ height)  # weight = slope*height + intercept
#get the intercept(b0) and the slope(b1) values
model
The output looks like this:
You can experiment a bit more to find out more details calculated by the lm utility using the following commands. We encourage you to go ahead and try these out.
#check all attributes calculated by lm
attributes(model)
#getting only the intercept
model$coefficients[1] #or model$coefficients[[1]]
#getting only the slope
model$coefficients[2] #or model$coefficients[[2]]
#checking the residuals
residuals(model)
#predicting the weight for a given height, say 60 inches
model$coefficients[[2]]*60 + model$coefficients[[1]]
#detailed information about the model
summary(model)
As the final piece, let us visualize the regression line on our scatter plot itself.
#plot data points
plot(height,weight)
#draw the regression line
abline(model)
The scatter plot looks like the following figure:
Scatter plot with the calculated regression line
Thus, we saw how the relationship between two variables can be identified and predictions can be made using a few lines of code. But we are not done yet. There are a couple of caveats which the reader must understand before deciding whether to use linear regression or not.
Linear regression can be used to predict the output values for given inputs, if and only if:
- The scatter plot forms a linear pattern
- The correlation between them is moderate to strong (beyond 0.5 or -0.5)
Cases where only one of the previous two conditions is met can lead to incorrect predictions or altogether invalid models. For example, if we only check the correlation, find it to be strong, and skip looking at the scatter plot, we may end up with invalid predictions by trying to fit a straight line to data that actually follows a curved shape (note that curved data sets can also have high correlation values, hence the mistake).
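A tiny simulation makes the caution concrete: data generated from a pure curve still produces a high Pearson correlation, and only the scatter plot gives the game away. The data here is synthetic, for demonstration only.

# data that follows a curve, not a line
x <- 1:100
y <- x^2
cor(x, y)    # roughly 0.97 -- "strong", despite the curvature
plot(x, y)   # the scatter plot immediately reveals the curved shape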
It is important to remember that correlation does not imply causation. In simple words, correlation between two variables does not necessarily imply that one causes the other. There might be a case where cause and effect are indirectly related through a third variable, termed the confounding variable. The most common example used to describe this issue is the relationship between shoe size and reading ability. From survey data (if one existed!) it could be inferred that larger shoe sizes relate to higher reading ability, but this clearly does not mean that large feet cause good reading skills to develop. It is rather that young children have small feet and have not yet been taught to read. In such a case, the two variables are more accurately correlated with age.
You should now be coming to a similar conclusion about the weight and height example we used earlier. Well yes, the earlier example also suffers from a similar fallacy, yet it served as an easy to use scenario. Feel free to look around and model cases which do not suffer from this.
Linear regression finds application in the field of finance where it is used for things such as quantification of risks on investments. It is also used widely in the field of economics for trendline analysis and so on.
Apart from linear regression, logistic regression, stepwise regression, Multivariate Adaptive Regression Splines (MARS), and others are a few more supervised regression learning algorithms.
K-Nearest Neighbors (KNN)
K-Nearest Neighbors or KNN algorithms are amongst the simplest algorithms from an implementation and understanding point of view. They are yet another type of supervised learning algorithm which helps us classify data.
KNN can be easily described using the quote "like and like strike together" (Plato), that is, similar things are likely to have similar properties. KNN utilizes this very concept to label a data point based upon its similarity to its neighbors.
Formally, KNN can be described as the process to classify unlabeled (or unseen) data points by assigning them the class of the most similar labeled data points (or training samples).
KNN is a supervised learning algorithm. Hence, it begins with a training data set of examples which are classified into different classes. The algorithm then picks each data point in the test data set and, based upon a chosen similarity measure, identifies its k nearest neighbors (where k is specified in advance). The data point is then assigned the class of the majority of the k nearest neighbors.
The trick up KNN's sleeve is the similarity measure. There are various similarity measures available to us. The decision to choose one is based on the complexity of the problem, the type of data, and so on. Euclidean distance is one such measure which is widely used. Euclidean distance is the length of the shortest direct route between two points. Mathematically, for two points p and q it is given as:

d(p, q) = √((p1 − q1)² + (p2 − q2)² + … + (pn − qn)²)
Manhattan distance, Cosine distance, and Minkowski distance are some other types of distance measures which can be used for finding the nearest neighbors.
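As a quick illustration, each of these measures is only a line or two of R. The helper functions below are our own, written for demonstration.

# Euclidean distance: square root of the summed squared differences
euclidean_distance <- function(p, q) sqrt(sum((p - q)^2))
# Manhattan distance: sum of the absolute differences
manhattan_distance <- function(p, q) sum(abs(p - q))
# example: distance between two 4-dimensional observations
a <- c(5.1, 3.5, 1.4, 0.2)
b <- c(6.7, 3.1, 4.4, 1.4)
euclidean_distance(a, b)
manhattan_distance(a, b)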
The next parameter for the KNN algorithm is the k in K-Nearest Neighbors. The value of k determines how well the KNN model generalizes to the test data. The balance between overfitting and underfitting the training data depends upon the value of k. With slight deliberation, it is easy to understand that a large k will minimize the impact of variance caused by noisy data, but at the same time it will also undermine small but important patterns in the data. This problem is called the bias-variance tradeoff.
The optimal value of k, even though hard to determine, lies between the extremes of k=1 and k equal to the total number of training samples. A common practice is to set the value of k to the square root of the number of training instances, usually between 3 and 10. Though a common practice, the best value of k depends on the complexity of the concept to be learned and the number of training examples.
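In practice, a simple way to settle on k is to try a few candidate values and compare accuracy on held-out data. The sketch below does that with the class package and the built-in iris data that we use later in this section; the split and the candidate values are arbitrary choices of ours.

library(class)
set.seed(1234)
train_idx <- sample(nrow(iris), 100)   # 100 rows for training, the rest for testing
train_x <- iris[train_idx, 1:4]; train_y <- iris[train_idx, 5]
test_x  <- iris[-train_idx, 1:4]; test_y <- iris[-train_idx, 5]
# accuracy on the held-out rows for a handful of candidate k values
for (k in c(1, 3, 5, 7, 9, 11)) {
  predictions <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  cat("k =", k, " accuracy =", round(mean(predictions == test_y), 3), "\n")
}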
The next step in pursuit of the KNN algorithm is preparation of data. Features used to prepare the input vectors should be on similar scales. The rationale for this step is that the distance formula is dependent on how the features are measured. For example, if certain features have a much larger range of values than others, the distance measurements will be dominated by those features. The method of scaling features to a similar scale is called normalization. Much like the distance measure, there are various normalization methods available. One such method is min-max normalization, given mathematically as:

x_normalized = (x − min(x)) / (max(x) − min(x))
Before we begin with our example to understand KNN, let us outline the steps to be performed to execute KNN:
- Collect data and explore data: We need to collect data relevant to the concept to be learnt. We also need to explore the data to understand various features, know the range of their values, and determine the class labels.
- Normalize data: As discussed previously, KNN's dependence on distance measure makes it very important that we normalize the data to remove any inconsistency or bias in calculations.
- Create training and test data sets: Since it is important to learn a concept and prepare a model which generalizes to acceptable levels for unseen data, we need to prepare training and test data sets. The test data set, even though labeled, is used to determine the accuracy and the ability of the model to generalize the concept learnt. A usual practice is to divide the input sample into two-thirds and one-third portions for the training and test data sets respectively. It is equally important that the two data sets are a good mix of all the class labels and data points, that is, both data sets should be representative subsets of the full data.
- Train the model: Now that we have all the things in place, we can use the training data set, the test data set, the labels, and the value of k to train our model and label the data points in the test data set.
- Evaluate the model: The final step is to evaluate the learnt pattern. In this step, we determine how well the algorithm has predicted the class labels of the test data set as compared to their known labels. Usually a confusion matrix is prepared for this.
Now, let us see KNN in action. The problem at hand is to classify different species of flowers based upon certain features. For this particular example, we will be using the Iris data set. This data set comes built in with the default installation of R.
Collecting and exploring data
Step one is to collect and explore the data. Let us first gather the data.
To check if your system has the required dataset, type in just the name:
iris #this should print the contents of data set onto the console.
If you do not have the data set available, no worries! You can download it as follows:
#skip these steps if you already have iris on your system
iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)
#assign proper headers
names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
Now that we have the data, it is time to explore and understand it. For exploring the data set and its attributes, we use the following commands:
#to view top few rows of data
head(iris)
Output:
#to view data types, sample values, categorical values, etc
str(iris)
Output:
#detailed view of the data set
summary(iris)
Output:
The summary command helps us understand the data in a better way. It clearly shows the different attributes along with min, max, median, and other such statistics. These help us in the coming steps where we might have to scale or normalize the data or features.
Step one is usually where we label our input data. Since our current data set is already labeled, we can skip that for this example problem. Let us visually see how the species are spread. We take the help of the famous scatter plot again, but this time we use a package called ggvis.

You can install ggvis as:
install.packages("ggvis")
For visualizing petal widths and lengths for all 3 species, we use the following code snippet:
#load the package
library(ggvis)
#plot the species
iris %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~factor(Species)) %>% layer_points()
Note
The ggvis package is an interactive graphics package in R. It follows a unique way of expressing inputs to generate visualizations. The preceding snippet of code uses the pipe operator, %>%, to pass input data to ggvis and again uses the pipe operator to pass the output on to layer_points for final plotting. The ~ operator signifies to ggvis that Petal.Length is a variable in the input data set (iris). Read more about ggvis at http://ggvis.rstudio.com/ggvis-basics.html.
The preceding plot clearly shows that there is a high correlation between petal widths and lengths for Iris-setosa flowers, while it is a little less for the other two species.
Note
Try visualizing sepal width versus sepal length as well and see if you can spot any correlation.
Normalizing data
The next step is to normalize the data so that all the features are on the same scale. As seen from the data exploration step, the values of all the attributes are more or less in a comparable range. But, for the sake of this example, let us write a min-max normalization function:
#normalization function
min_max_normalizer <- function(x) {
  num <- x - min(x)
  denom <- max(x) - min(x)
  return (num/denom)
}
Remember, normalization does not distort the data, it simply rescales it. Therefore, even though our data does not strictly require normalization, applying it will not cause any harm.
Note
In the following steps, we will be using un-normalized data for clarity of output.
#normalizing iris data set
normalized_iris <- as.data.frame(lapply(iris[1:4], min_max_normalizer))
#viewing normalized data
summary(normalized_iris)
The following is a summary of the normalized data frame:
Creating training and test data sets
Now that we have our data normalized, we can divide it into training and test data sets. We will follow the usual two-thirds/one-third rule for splitting the data in two. As mentioned earlier, both data sets should be representative of the whole data and hence we need to pick proper samples. We will utilize R's sample() function to prepare our samples.
#checking the data constituency
table(iris$Species)
Output:
#set seed for randomization
set.seed(1234)
# setting the training-test split to 67% and 33% respectively
random_samples <- sample(2, nrow(iris), replace=TRUE, prob=c(0.67, 0.33))
# training data set
iris.training <- iris[random_samples == 1, 1:4]
#training labels
iris.trainLabels <- iris[random_samples == 1, 5]
# test data set
iris.test <- iris[random_samples == 2, 1:4]
#testing labels
iris.testLabels <- iris[random_samples == 2, 5]
Learning from data/training the model
Once we have the data ready in our training and test data sets, we can proceed to the next step and learn from the data using KNN. The KNN implementation in R is present in the class library. The knn function takes the following inputs:

- train: The data frame containing the training data.
- test: The data frame containing the test data.
- cl: The factor vector containing the class labels of the training data.
- k: The value of k, that is, the number of nearest neighbors.
For the current case, let us assume the value of k to be 3. Odd numbers are usually good at breaking ties. KNN is executed as:
#setting library
library(class)
#executing knn for k=3
iris_model <- knn(train = iris.training, test = iris.test, cl = iris.trainLabels, k=3)
#summary of the model learnt
iris_model
Output:
A quick scan of the output shows everything correct except one versicolor label amongst the virginicas. Though this one was easy to spot, there are better ways to evaluate the model.
Evaluating the model
This brings us to the last step, where we evaluate the model. We do this by preparing a confusion matrix, or cross table, which helps us understand how the predicted labels compare with the known labels of the test data. R provides us with yet another utility function called CrossTable(), which is present in the gmodels library. Let us see the output:
#setting library
library(gmodels)
#Preparing cross table
CrossTable(x = iris.testLabels, y = iris_model, prop.chisq=FALSE)
Output:
From the preceding output we can conclude that the model has labeled one instance of virginica as versicolor, while all the other test data points have been correctly labeled. This also helps us infer that the choice of k=3 was indeed good enough. We urge the reader to try the same example with different values of k and see how the results change.
KNN is a simple yet powerful algorithm which makes no assumptions about the underlying data distribution and hence can be used in cases where relationships between features and classes are complex or difficult to understand.
On the downside, KNN is a resource intensive algorithm as it requires a large amount of memory to process the data. Dependency on distance measures and missing data requires additional processing which is another overhead with this algorithm.
Despite its limitations, KNN is used in a number of real life applications such as text mining, predicting heart attacks, predicting cancer, and so on. KNN also finds application in the fields of finance and agriculture.
Decision trees, random forests, and support vector machines are some of the most popular and widely used supervised classification algorithms.
Unsupervised learning algorithms
Unsupervised learning refers to algorithms which learn a concept(s) on their own. Now that we are familiar with the concept of supervised learning, let us utilize our knowledge to understand unsupervised learning.
Unlike supervised learning algorithms, which require a labeled input training data set, unsupervised learning algorithms are tasked with finding relationships and patterns in the data without any labeled training data set. These algorithms process the input data to mine for rules, detect patterns, summarize and group the data points which helps in deriving meaningful insights, and describing the data to the users. In the case of unsupervised learning algorithms, there is no concept of training and test data sets. Rather, as mentioned before, input data is analyzed and used to derive patterns and relationships.
Similar to supervised learning, unsupervised learning algorithms can also be divided into two main categories:
- Association rule based machine learning: These algorithms mine the input data to identify patterns and rules. The rules explain interesting relationships between the variables in the data set to depict frequent itemsets and patterns which occur in the data. These rules in turn help discover useful insights for any business or organization from their huge data repositories. Popular algorithms include Apriori and FP-Growth.
- Clustering based machine learning: Similar to supervised learning based classification algorithms, the main objective of these algorithms is to cluster or group the input data points into different classes or categories using just features derived from the input data alone and no other external information. Unlike classification, the output labels are not known beforehand in clustering. Some popular clustering algorithms include k-means, k-medoids, and hierarchical clustering.
Let us look at some unsupervised learning algorithms.
Apriori algorithm
This algorithm, which took the world by storm, was proposed by Agrawal and Srikant in 1994. The algorithm is designed to handle transactional data, where each transaction is a set of items, or itemset. In short, the algorithm identifies the itemsets which occur as subsets of at least a user-specified minimum number of transactions in the data set.
Formally, let I be a set of items and D be a set of transactions, where each transaction T is a subset of I, that is, T ⊆ I. An association rule is then an implication of the form X → Y, where X and Y are itemsets with X ⊂ I, Y ⊂ I, and X ∩ Y = ∅; a transaction T is said to contain X if X ⊆ T.
The implication X → Y holds in the transaction set D with a confidence factor c if c% of the transactions in D that contain X also contain Y. The association rule X → Y is said to have a support factor of s if s% of the transactions in D contain X ∪ Y. Hence, given a set of transactions D, the task of identifying association rules means generating all rules whose confidence and support are greater than user defined thresholds called minsup (the minimum support threshold) and minconf (the minimum confidence threshold).
Broadly, the algorithm works in two steps. The first one being identification of the itemsets whose occurrence exceeds a predefined threshold. Such itemsets are called Frequent Itemsets. The second step is to generate association rules from the identified frequent itemsets which satisfy the constraints of minimum confidence and support.
The same two steps can be explained better with the following simplified sketch:
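The sketch is our own illustrative R rendition of the first phase (frequent itemset generation), not the optimized algorithm: it brute-forces candidate generation from the previous level's frequent itemsets and re-checks support at every level, and the function and variable names are ours.

# transactions: a list of character vectors, e.g. list(c("a","b"), c("a","c"))
# minsup: minimum support as a fraction of all transactions
frequent_itemsets <- function(transactions, minsup) {
  support <- function(itemset) {
    mean(sapply(transactions, function(t) all(itemset %in% t)))
  }
  items <- sort(unique(unlist(transactions)))
  # level 1: frequent single items
  frequent <- Filter(function(s) support(s) >= minsup, as.list(items))
  result <- frequent
  # level k: extend frequent (k-1)-itemsets by one item and re-check support
  while (length(frequent) > 0) {
    candidates <- unlist(lapply(frequent, function(s) {
      lapply(setdiff(items, s), function(i) sort(c(s, i)))
    }), recursive = FALSE)
    candidates <- unique(candidates)
    frequent <- Filter(function(s) support(s) >= minsup, candidates)
    result <- c(result, frequent)
  }
  result
}
# tiny example: itemsets appearing in at least two of the three transactions
tx <- list(c("bread", "milk"), c("bread", "butter"), c("bread", "milk", "butter"))
frequent_itemsets(tx, minsup = 2/3)
# the second phase enumerates rules X -> Y from each frequent itemset and
# keeps those whose confidence exceeds minconf
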
Now let us see the algorithm in action. The data set in consideration is the UCI machine learning repository's Adult
data set. The data set contains census data with attributes such as gender, age, marital status, native country, and occupation, along with economic attributes such as work class, income, and so on. We will use this data set to identify if there are association rules between census information and the income of the person.
The Apriori algorithm is present in the arules library, and the data set in consideration, named Adult, also ships with that package.
# setting the apriori library
library(arules)
# loading data
data("Adult");
Time to explore our data set and see a few sample records:
# summary of data set
summary(Adult);
# Sample 5 records
inspect(Adult[0:5]);
We get to know that the data set contains some 48k transactions with 115 columns. We also get information regarding the distribution of itemsets based on their sizes. The inspect function gives us a peek into sample transactions and the values each of the columns holds.
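Before mining rules, it can also help to see which individual items dominate the transactions. The arules package includes itemFrequencyPlot() for exactly this; the choice of topN = 15 below is an arbitrary one of ours.

# bar chart of the 15 most frequent items across all transactions
itemFrequencyPlot(Adult, topN = 15)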
Now, let us build some relationships:
# executing apriori with support=50% confidence=80%
rules <- apriori(Adult, parameter=list(support=0.5, confidence=0.8, target="rules"));
# view a summary
summary(rules);
#view top 3 rules
as(head(sort(rules, by = c("confidence", "support")), n=3), "data.frame")
The Apriori algorithm uses the Adult data set as input to identify rules and patterns in the transactional data. On viewing the summary, we can see that the algorithm successfully identified 84 rules meeting the support and confidence constraints of 50% and 80% respectively. Now that we have identified the rules, let us see what they are:
The rules are of the form X → Y, where X is the lhs or left-hand side and Y is the rhs or right-hand side. The preceding output displays the corresponding confidence and support values as well. From the output we can infer that if people are working full time then their chances of facing a capital loss are almost none (confidence factor 95.8%). Another rule helps us infer that people who work for a private employer also have close to no chance of facing a capital loss. Such rules can be used in preparing policies or schemes for social welfare, economic reforms, and so on.
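When the rule set is large, it is usually more interesting to filter it down to rules whose right-hand side mentions the attribute we care about. Below is a short sketch using standard arules helpers; the partial label "income" is an assumption about how that attribute appears in the item labels, so check itemLabels(Adult) or summary(rules) first.

# keep only rules whose right-hand side mentions income (partial label match),
# then view the strongest ones by lift
income_rules <- subset(rules, subset = rhs %pin% "income")
inspect(head(sort(income_rules, by = "lift"), n = 3))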
Apart from Apriori, there are other association rule mining algorithms such as FP Growth, ECLAT, and many others which have been used for various applications over the years.
K-Means
In the world of unsupervised clustering algorithms, the most simple and widely used algorithm is K-Means. As we have seen recently, unsupervised learning algorithms process the input data without any prior labels or training to come up with patterns and relationships. Clustering algorithms in particular help us cluster or partition the data points.
By definition, clustering refers to the task of grouping objects into groups such that elements of a group are more similar to each other than those in other groups. K-Means does the same in an unsupervised manner.
Mathematically, given a set of n observations {x1, x2, …, xn}, where each observation is a d-dimensional vector, the algorithm tries to partition these n observations into k (≤ n) sets by minimizing an objective function.
As with the other algorithms, there can be different objective functions. For the sake of simplicity, we will use the most widely used one, the within-cluster sum of squares or WCSS function:

WCSS = Σ (i = 1 to k) Σ (x in Si) ‖x − μi‖²

Here μi is the mean of the points in the partition Si.
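Written out in R, the WCSS objective is only a few lines. The helper below is our own illustration; as a sanity check, its result should closely match the tot.withinss value reported by R's built-in kmeans().

# within-cluster sum of squares for numeric data and a vector of cluster labels
wcss <- function(data, cluster_labels) {
  sum(sapply(unique(cluster_labels), function(cl) {
    cluster_points <- data[cluster_labels == cl, , drop = FALSE]
    centroid <- colMeans(cluster_points)
    sum(sweep(cluster_points, 2, centroid)^2)
  }))
}
# example: compare against kmeans on the iris measurements
km <- kmeans(iris[, 1:4], centers = 3)
wcss(iris[, 1:4], km$cluster)
km$tot.withinss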
The algorithm follows a simple two step iterative process, where the first step is called the assignment step, followed by the update step.
- Initialize by setting the means for the k partitions: m1, m2, …, mk.
- Until the means do not change, or the change is lower than a certain threshold:
  - Assignment step: Assign each observation to the partition for which the within-cluster sum of squares value is minimum, that is, assign the observation to the partition whose mean is closest to it.
  - Update step: For i in 1 to k, update each mean mi based on all the observations currently assigned to that partition (a bare-bones sketch of this loop follows the list).
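The following is our own bare-bones rendition of that loop in R. It uses a random initialization, assumes no cluster ever ends up empty, and is meant for illustration rather than as a substitute for the built-in kmeans(); all names are ours.

simple_kmeans <- function(data, k, max_iter = 100) {
  data <- as.matrix(data)
  # initialization: pick k random observations as the starting means
  centers <- data[sample(nrow(data), k), , drop = FALSE]
  clusters <- rep(1, nrow(data))
  for (iter in 1:max_iter) {
    # assignment step: squared distance of every observation to every center
    dists <- sapply(1:k, function(i) rowSums(sweep(data, 2, centers[i, ])^2))
    clusters <- max.col(-dists)   # index of the nearest center
    # update step: each center becomes the mean of its assigned observations
    new_centers <- t(sapply(1:k, function(i) colMeans(data[clusters == i, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = clusters, centers = centers)
}
# example run on the iris measurements
set.seed(42)
table(iris$Species, simple_kmeans(iris[, 1:4], k = 3)$cluster)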
The algorithm can use different initialization methods; the most common ones are the Forgy and Random Partition methods. We encourage you to read more on these. Also, apart from the input data set, the algorithm requires the value of k, that is, the number of clusters to be formed. The optimal value may depend on various factors and is generally decided based on the use case.
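A common heuristic for picking k is the elbow method: run k-means for a range of k values, plot the total within-cluster sum of squares, and look for the point where adding more clusters stops paying off. Here is a quick sketch using the built-in kmeans(); the range 1 to 8 and nstart = 10 are arbitrary choices of ours.

# total within-cluster sum of squares for k = 1..8 on the iris measurements
wcss_by_k <- sapply(1:8, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wcss_by_k, type = "b", xlab = "number of clusters k", ylab = "total within-cluster sum of squares")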
Let us see the algorithm in action.
We will again use the Iris flower data set which we already used for the KNN algorithm. For KNN we already had the species labeled and then tried to learn and classify the data points in the test data set into correct classes.
With K-Means, we also aim to achieve the same partitioning of the data but without any labeled training data set (or supervision).
# prepare a copy of iris data set
kmean_iris <- iris
#Erase/ Nullify species labels
kmean_iris$Species <- NULL
#apply k-means with k=3
(clusters <- kmeans(kmean_iris, 3))
Now that we have the output from k-means, let us see how well it has partitioned the various species. Remember, k-means does not have partition labels and simply groups the data points.
# comparing cluster labels with actual iris species labels
table(iris$Species, clusters$cluster)
Output:
The output shows that the species setosa matches cluster label 2, versicolor matches label 3, and so on. Visually, it is easy to make out how the data points are clustered:
# plot the clustered points along sepal length and width
plot(kmean_iris[c("Sepal.Length", "Sepal.Width")], col=clusters$cluster, pch = c(15,16,17)[as.numeric(clusters$cluster)])
points(clusters$centers[,c("Sepal.Length", "Sepal.Width")], col=1:3, pch=8, cex=2)
K-Means finds wide usage in areas like computer graphics for color quantization; it is combined with other algorithms and used for natural language processing, computer vision, and so on.
There are different variations of k-means (R itself provides three different variations). Apart from k-means, other unsupervised clustering algorithms include k-medoids, hierarchical clustering, and others.