- Applied Supervised Learning with R
- Karthik Ramasubramanian Jojo Moolayil
- 2021-06-11 13:22:32
Univariate Analysis
Univariate analysis is the study of a single feature/variable. Here, we describe the data to get an overall view of how it is organized. For numeric features, such as age, duration, nr.employed, and many others in the dataset, we look at summary statistics such as min, max, mean, standard deviation, and percentile distribution. Together, these measures help us understand the center and spread of the data. Similarly, for categorical features such as job, marital, and education, we need to study the distinct values in the feature and the frequency of these values. To accomplish this, we can use a few analytical, visual, and statistical techniques. Let's take a look at the analytical and visual techniques for exploring numeric features.
Exploring Numeric/Continuous Features
If you explored the previous output snippet, you might have noted that we have a mix of numeric and categorical features in the dataset. Let's start by looking at the first feature in the dataset, which is a numeric feature named age. As the name suggests, it denotes the age of the customer being targeted. Let's take a look at the summary statistics of the feature and visualize it using a simple boxplot.
Exercise 19: Visualizing Data Using a Box Plot
In this exercise, we will explore using a boxplot for univariate analysis, explain how to interpret the boxplot, and walk through the code.
Perform the following steps to visualize the data using a boxplot:
- First, import the ggplot2 package using the following command:
library(ggplot2)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Print the age data, such as mean and max, using the following command:
print(summary(df$age))
The output is as follows:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  17.00   32.00   38.00   40.02   47.00   98.00
- Next, print the standard deviation of age as follows:
print(paste("Std.Dev:",round(sd(df$age),2)))
The output is as follows:
[1] "Std.Dev: 10.42"
- Now, plot the boxplot of age with the following parameters:
ggplot(data=df,aes(y=age)) + geom_boxplot(outlier.colour="black")
The output is as follows:
Figure 2.3: Boxplot of age.
We first load the ggplot2 library, which provides handy functions for visualizing the data. R provides a simple function called summary, which prints summary statistics: the min, 25th percentile (1st quartile), median, mean, 75th percentile (3rd quartile), and max. The next line uses the sd function to compute the standard deviation, and the final line uses ggplot2 to plot the boxplot for the data.
Looking at the summary statistics, we can see that age has a minimum value of 17, a max of 98, and a mean of about 40. If you take a close look at the gap between the 75th percentile (3rd quartile, 47) and the 100th percentile (max, 98), you can see a huge jump: the top quarter of the records spans 51 years, while the middle half spans only 15 (32 to 47). This suggests that there are outliers in the age variable. Outliers can distort the conclusions you draw from the analysis. For example, a single data point with a value 1000x the 75th percentile pulls the mean sharply to the right. If you use just the mean as a ballpark estimate of the variable, your understanding of the feature might be misleading.
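The effect of a single extreme value on the mean can be illustrated with a minimal sketch. The ages below are hypothetical values made up for illustration; they are not taken from the bank dataset.

```r
# Hypothetical ages, chosen only for illustration (not from the bank dataset)
ages <- c(25, 31, 38, 40, 47, 52)

# Add one extreme record, for example a data-entry error
ages_with_outlier <- c(ages, 4700)

mean_before   <- mean(ages)
mean_after    <- mean(ages_with_outlier)
median_before <- median(ages)
median_after  <- median(ages_with_outlier)

# The mean moves by orders of magnitude; the median barely changes
print(c(mean_before = mean_before, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```

This is why it helps to read the median and the quartiles alongside the mean whenever outliers are suspected.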
The boxplot, on the other hand, helps us visually consume this information in a simple and lucid way. The box spans the 25th percentile (1st quartile) to the 75th percentile (3rd quartile), and the line inside the box marks the median (50th percentile). The whiskers, that is, the lines extending below and above the box, reach to the smallest and largest values that lie within 1.5 times the interquartile range (IQR) of the box, which is the default rule used by geom_boxplot. Points beyond the whiskers are drawn as individual dots and are treated as outliers. As we can see, the observations from the summary statistics are in line with the boxplot: the outliers appear as dots above the upper whisker.
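As a worked example of the 1.5 x IQR rule, the following sketch computes the outlier fences from the quartiles of age reported by summary above. The second part applies the same rule to a small hypothetical vector using base R's boxplot.stats, which uses Tukey's hinges and can therefore differ slightly from quantile-based quartiles.

```r
# Quartiles of age as reported by summary(df$age) in the exercise
q1 <- 32
q3 <- 47

iqr <- q3 - q1                   # interquartile range
upper_fence <- q3 + 1.5 * iqr    # values above this are drawn as dots
lower_fence <- q1 - 1.5 * iqr    # values below this are drawn as dots
print(c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence))

# The same rule on a small hypothetical vector (illustration only)
x <- c(17, 25, 32, 38, 40, 47, 60, 98)
flagged <- boxplot.stats(x)$out
print(flagged)
```

With these quartiles, any age above 69.5 falls beyond the upper whisker, which is consistent with the dots at the top of the boxplot; the lower fence of 9.5 sits below the minimum age of 17, so no low outliers are expected.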
In the next exercise, we will perform an EDA on the age variable using a histogram. Let's see what insight we can get from the histogram plot.
Exercise 20: Visualizing Data Using a Histogram
In this exercise, we will discuss how to interpret the histogram and outliers. Let's continue from the previous exercise.
To get a more detailed view of the data and understand how the age variable is distributed, we can use histograms. A histogram is a special bar plot in which the data is grouped into sequential, equal-width intervals called bins, and the frequency of data points in each bin is plotted. This exercise plots a histogram of age to help us see the shape of its distribution.
Perform the following steps:
- First, import the ggplot2 package using the following command:
library(ggplot2)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Now, use the following command to plot the histogram for age using the provided parameters:
ggplot(data=df,aes(x=age)) +
geom_histogram(bins=10,fill="blue", color="black", alpha =0.5) +
ggtitle("Histogram for Age") +
theme_bw()
The output is as follows:
Figure 2.4: Histogram for age
The ggplot function defines the base layer for visualization, which is then followed by the geom_histogram function with parameters that define the histogram-related aspects such as the number of bins, color to fill, alpha (opacity), and many more. The number of bins defaults to 30, but it can be overridden by passing a value to the bins parameter, such as bins=10. The next function, ggtitle, is used to add a title to the plot, and the theme_bw function is added to change the theme to black and white instead of the default. The theme function is optional and is added here only to make the plot more visually appealing.
As you can clearly see, the histogram gives us a more granular view of the data distribution for the feature. The number of records drops sharply after 65, and only a few records have values beyond 75. Choosing the number of bins matters: too many bins make the distribution look noisy, while too few make it less informative. When we want a more granular view of the distribution, instead of increasing the number of bins, we can opt for a density plot, which draws the distribution over a continuous interval and uses kernel smoothing to smooth out the noise.
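The trade-off in choosing bins can be seen without plotting. The sketch below uses base R's hist with plot = FALSE on simulated ages (an assumption made for illustration; the values are not the bank data) and compares a coarse and a fine binning of the same vector.

```r
set.seed(42)

# Simulated ages for illustration only, clamped to the range seen in the data
ages <- round(rnorm(1000, mean = 40, sd = 10))
ages <- pmin(pmax(ages, 17), 98)

# Same data, two different bin widths; plot = FALSE returns the counts only
coarse <- hist(ages, breaks = seq(0, 100, by = 25), plot = FALSE)
fine   <- hist(ages, breaks = seq(0, 100, by = 5),  plot = FALSE)

# Few wide bins hide the shape; many narrow bins show it but get noisier
print(coarse$counts)
print(fine$counts)
```

Both binnings account for every record; only the level of detail changes, which is exactly what the bins argument of geom_histogram controls.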
We can also visualize the age variable using a density plot rather than a histogram. The next exercise goes into the details of how to do it.
Exercise 21: Visualizing Data Using a Density Plot
In this exercise, we will demonstrate the density plot for the same feature, age.
Perform the following steps:
- First, import the ggplot2 package using the following command:
library(ggplot2)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Now, use the following command to plot the density plot for age:
ggplot(data=df,aes(x=age)) + geom_density(fill="red",alpha =0.5) +
ggtitle("Density Plot for Age") +
theme_bw()
The output is as follows:
Figure 2.5: Density plot for age.
Similar to the previous exercise, we use the same base for the visualization with the ggplot function and use a different geom_density function for the density plot. The rest of the additional functions used for the visualization remain the same.
Density plots can show finer detail than a histogram. While a similar level of detail can be achieved by using a higher number of bins in a histogram, finding the best number of bins often takes trial and error. In such cases, a density plot is an easier option.
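A density plot has its own smoothing knob, the kernel bandwidth. The following sketch uses base R's density function on simulated data (an assumption for illustration, not the bank data) to show that the adjust argument scales the bandwidth and that the estimated curve integrates to approximately 1. geom_density also accepts an adjust argument.

```r
set.seed(1)

# Simulated two-group ages, for illustration only
ages <- c(rnorm(500, mean = 30, sd = 5), rnorm(500, mean = 50, sd = 5))

d_default <- density(ages)              # default bandwidth
d_smooth  <- density(ages, adjust = 3)  # three times the default bandwidth

# Trapezoidal approximation of the area under the default curve
area <- sum(diff(d_default$x) *
            (head(d_default$y, -1) + tail(d_default$y, -1)) / 2)

print(c(bw_default = d_default$bw, bw_smooth = d_smooth$bw, area = area))
```

A larger bandwidth gives a smoother curve that can hide real structure, such as the two groups here, so the choice involves a trade-off similar to choosing the number of bins.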
Now that we have understood the idea of univariate analysis for numeric variables, let's speed up the data exploration for other variables. We have a total of 10 categorical features and 10 numeric columns. Let's try to take a look at four numeric variables together using a histogram.
Just like we plotted the histogram for age, we can do it for multiple variables at the same time by defining a custom function. The next exercise shows how to do this.
Exercise 22: Visualizing Multiple Variables Using a Histogram
In this exercise, we will combine four histograms, one for each of the variables of interest, into a single plot. We have campaign, which indicates the number of contacts performed during the campaign, and pdays, which indicates the number of days since the client was last contacted in a previous campaign; a value of 999 indicates that the client was never contacted before. previous indicates the number of contacts previously made for this client, and lastly, emp.var.rate indicates the employment variation rate.
Let's perform the following steps to complete the exercise:
- First, import the cowplot package using the following command:
library(cowplot)
Ensure that the cowplot package is installed.
- Next, define a function to plot histograms for all numeric columns:
plot_grid_numeric <- function(df,list_of_variables,ncols=2){
    # Collect one histogram per column into a list
    plt_matrix<-list()
    i<-1
    for(column in list_of_variables){
        plt_matrix[[i]]<-ggplot(data=df,aes_string(x=column)) +
            geom_histogram(binwidth=2,fill="blue", color="black",
                           alpha =0.5) +
            ggtitle(paste("Histogram for variable: ",column)) + theme_bw()
        i<-i+1
    }
    # Arrange the plots in a grid; ncols controls the number of columns
    plot_grid(plotlist=plt_matrix,ncol=ncols)
}
- Now, use the summary function to print the mean, max, and other parameters for the campaign, pdays, previous, and emp.var.rate columns:
summary(df[,c("campaign","pdays","previous","emp.var.rate")])
The output is as follows:
    campaign          pdays          previous      emp.var.rate
 Min.   : 1.000   Min.   :  0.0   Min.   :0.000   Min.   :-3.40000
 1st Qu.: 1.000   1st Qu.:999.0   1st Qu.:0.000   1st Qu.:-1.80000
 Median : 2.000   Median :999.0   Median :0.000   Median : 1.10000
 Mean   : 2.568   Mean   :962.5   Mean   :0.173   Mean   : 0.08189
 3rd Qu.: 3.000   3rd Qu.:999.0   3rd Qu.:0.000   3rd Qu.: 1.40000
 Max.   :56.000   Max.   :999.0   Max.   :7.000   Max.   : 1.40000
- Call the function we defined earlier to plot the histogram:
plot_grid_numeric(df,c("campaign","pdays","previous","emp.var.rate"),2)
The output is as follows:
Figure 2.6: Visualizing multiple variables using a histogram
In this exercise, we automated the process of stacking multiple plots of the same kind into a consolidated plot. We first load the required cowplot library. This library provides handy functions for creating a plot grid for plots rendered by the ggplot2 library. If you do not have the library installed, install it using the install.packages('cowplot') command. We then define a function called plot_grid_numeric, which accepts the dataset, a list of features to plot, and the number of columns to be used in the grid. If you look at the internals of the function, you will see that we simply traverse the list of provided variables using a for loop and collect the individual plots into a list called plt_matrix. Finally, we use the plot_grid function provided by the cowplot library to arrange the plots into a grid with two columns.
The same function can be used to display a grid of any number of histograms; use a number based on your screen size. The current number has been restricted to 4 for best results. We also use the summary function to display the overall statistics for the same set of numeric variables in conjunction with the histogram plots.
Note
There is no exception handling code in the previous function. We have left out more sophisticated code for now in order to focus on the topic of interest. If the function is used with non-numeric variables, the resulting error messages will not point clearly to the cause.
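One way to get clearer error messages is to validate the inputs before plotting. The helper below, check_numeric_columns, is a hypothetical name introduced here for illustration; it is not part of the book's code, and the toy data frame is made up. It could be called at the top of plot_grid_numeric.

```r
# Hypothetical helper (not from the book): stop early with a clear message
# if any requested column is missing from df or is not numeric.
check_numeric_columns <- function(df, list_of_variables) {
  missing_cols <- setdiff(list_of_variables, names(df))
  if (length(missing_cols) > 0) {
    stop("Columns not found in df: ", paste(missing_cols, collapse = ", "))
  }
  is_num <- vapply(df[list_of_variables], is.numeric, logical(1))
  non_numeric <- list_of_variables[!is_num]
  if (length(non_numeric) > 0) {
    stop("Columns are not numeric: ", paste(non_numeric, collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data frame for illustration only
toy <- data.frame(age = c(25, 40), job = c("admin.", "services"))

ok <- check_numeric_columns(toy, "age")

# Capture the error message instead of stopping the script
result <- tryCatch(
  check_numeric_columns(toy, c("age", "job")),
  error = function(e) conditionMessage(e)
)
print(result)
```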
As we can see in the previous plot, we now have four variables together for analysis. Studying the summary statistics in tandem with the histogram plots helps us understand each variable better. Campaign has 75% of the values below or equal to 3. We can see that there is an outlier at 56, but a significant majority of the records have values less than 5. pdays does not seem to be a useful variable for our analysis, as almost all records have the default value of 999. The tall bar near 1000 makes it clear that barely any records have values other than 999.
For the previous variable, we see the exact opposite of pdays; most records have a value of 0. Lastly, emp.var.rate shows us an interesting result. Though the values range from -3.4 to 1.4, more than half of the records have a positive value.
So, with the analysis of these four variables, we can roughly conclude that the previously conducted campaigns didn't communicate very often by phone with the clients, or it could also mean that close to none of the clients targeted in the previous campaign were contacted for the current campaign. Also, the ones who were contacted earlier were contacted seven times at most. The number of days since the client was last contacted is naturally in sync with the results from the previous campaign, because hardly any clients were contacted earlier. However, for the current campaign, clients have been contacted an average of about 2.6 times, 75% of the clients have been contacted up to 3 times, and some clients have been contacted as many as 56 times. The employment variation rate is an indicator of how many people are hired or fired due to macro-economic situations. We understand that the economic situation has been fairly steady for most of the time during the campaigns.
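Because 999 in pdays is a placeholder rather than a real number of days, it distorts the summary statistics. The sketch below uses a small hypothetical vector (not the bank data) to show how one might measure the share of placeholder values and recode them as missing before summarizing.

```r
# Hypothetical pdays values for illustration; 999 means "never contacted"
pdays <- c(999, 999, 3, 999, 6, 999, 999, 999, 12, 999)

# Share of records that carry the placeholder value
share_never <- mean(pdays == 999)

# Recode the placeholder as missing so it no longer distorts the statistics
pdays_clean <- replace(pdays, pdays == 999, NA_real_)

raw_mean   <- mean(pdays)
clean_mean <- mean(pdays_clean, na.rm = TRUE)
print(c(share_never = share_never, raw_mean = raw_mean, clean_mean = clean_mean))
```

Whether to recode the value, keep it, or derive a separate "previously contacted" indicator is a modeling choice; the sketch only shows why the raw mean of such a column is hard to interpret.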
Just as we created a function to stack histograms together in the previous exercise, the next activity creates similar functions to stack density plots and boxplots.
Activity 4: Plotting Multiple Density Plots and Boxplots
In this activity, we will create a function to stack density plots, and another for boxplots. Use the newly created functions to visualize the same set of variables as in the previous section and study the most effective way to analyze numeric variables.
By the end of this activity, you will have learned how to draw density plots for multiple variables at the same time. Doing so makes it easy to compare the different variables in one go and draw insights about the data.
Perform the following steps to complete this activity:
- First, load the necessary libraries and packages in RStudio.
- Read the bank-additional-full.csv dataset into a DataFrame named df.
- Define the plot_grid_numeric function for the density plot:
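plot_grid_numeric <- function(df,list_of_variables,ncols=2){
    plt_matrix<-list()
    i<-1
    # Loop body is an assumed completion that mirrors the histogram version
    # from Exercise 22; the book's own solution may differ
    for(column in list_of_variables){
        plt_matrix[[i]]<-ggplot(data=df,aes_string(x=column)) +
            geom_density(fill="red",alpha =0.5) + theme_bw()
        i<-i+1
    }
    plot_grid(plotlist=plt_matrix,ncol=ncols)
}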
- Plot the density plot for the campaign, pdays, previous, and emp.var.rate variables:
Figure 2.7: Density plots for the campaign, pdays, previous, and emp.var.rate variables
Observe that the interpretations we obtained using the histogram are visibly true in the density plot as well. Hence, this serves as another alternative plot for looking at the same trend.
- Repeat the steps for the boxplot:
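plot_grid_numeric <- function(df,list_of_variables,ncols=2){
    plt_matrix<-list()
    i<-1
    for(column in list_of_variables){ # assumed loop body, mirroring Exercise 19
        plt_matrix[[i]]<-ggplot(data=df,aes_string(y=column)) + geom_boxplot(outlier.colour="black") + theme_bw()
        i<-i+1
    }
    plot_grid(plotlist=plt_matrix,ncol=ncols)
}
plot_grid_numeric(df,c("campaign","pdays","previous","emp.var.rate"),2)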
The plot is as follows:
Figure 2.8: Boxplots for the campaign, pdays, previous, and emp.var.rate variables
An additional point to note in the boxplot is that it shows the clear presence of outliers in the campaign variable, which wasn't very visible in the other two plots. A similar observation could be made for previous and pdays variables as well. Students should try to plot boxplots after removing the outliers and see how different they look then.
Note
You can find the solution for this activity on page 442.
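For the suggestion to replot the boxplots after removing outliers, one possible approach is to filter each variable with the same 1.5 x IQR rule that the boxplot uses. The function name remove_iqr_outliers and the sample values below are hypothetical and serve only as a sketch.

```r
# Hypothetical helper (not from the book): keep only values within
# coef * IQR of the 1st and 3rd quartiles, dropping missing values.
remove_iqr_outliers <- function(x, coef = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  keep <- !is.na(x) & x >= q[1] - coef * iqr & x <= q[2] + coef * iqr
  x[keep]
}

# Made-up campaign-like values with one extreme record
campaign <- c(1, 1, 1, 2, 2, 2, 3, 3, 4, 56)
trimmed <- remove_iqr_outliers(campaign)
print(trimmed)
```

Note that this rule is not meaningful for every variable. For pdays, the summary shows the 1st quartile, median, and 3rd quartile all equal to 999, so the IQR is 0 and every value other than 999 would be flagged.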
Exercise 23: Plotting a Histogram for the nr.employed, euribor3m, cons.conf.idx, and duration Variables
In this exercise, we will move to the next and the last set of four numeric variables. We have nr.employed, which indicates the number of employees employed at the bank, and euribor3m, which indicates the 3-month Euro Interbank Offered Rate (Euribor). Also, we have cons.conf.idx, the consumer confidence index, which measures how optimistic consumers are about the state of the economy, as expressed through their saving and spending activity. Lastly, there is duration, which indicates the last contact duration. As per the metadata provided by UCI, this variable is highly correlated with the outcome and will lead to possible data leakage. Therefore, we will drop this variable from our future analysis.
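Dropping a leaky column before modeling can be done in one line of base R. The toy data frame below is made up for illustration; with the real data, the same expression would be applied to df.

```r
# Toy data frame for illustration only (not the bank dataset)
toy <- data.frame(
  age      = c(30, 45),
  duration = c(180, 319),
  y        = c("no", "yes")
)

# Keep every column except duration; drop = FALSE preserves the data frame
toy_model <- toy[, setdiff(names(toy), "duration"), drop = FALSE]
print(names(toy_model))
```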
Perform the following steps to study the next set of numeric variables:
- First, import the cowplot package using the following command:
library(cowplot)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Print the details using the summary method:
summary(df[,c("nr.employed","euribor3m","cons.conf.idx","duration")])
The output is as follows:
  nr.employed     euribor3m     cons.conf.idx      duration
 Min.   :4964   Min.   :0.634   Min.   :-50.8   Min.   :   0.0
 1st Qu.:5099   1st Qu.:1.344   1st Qu.:-42.7   1st Qu.: 102.0
 Median :5191   Median :4.857   Median :-41.8   Median : 180.0
 Mean   :5167   Mean   :3.621   Mean   :-40.5   Mean   : 258.3
 3rd Qu.:5228   3rd Qu.:4.961   3rd Qu.:-36.4   3rd Qu.: 319.0
 Max.   :5228   Max.   :5.045   Max.   :-26.9   Max.   :4918.0
- Plot the histogram for the defined variables, as illustrated in the following command:
plot_grid_numeric(df,c("nr.employed","euribor3m","cons.conf.idx","duration"),2)
The output is as follows:
Figure 2.9: Histogram of count and duration for various variables
Just like Exercise 22, Visualizing Multiple Variables Using a Histogram, we first compute the summary statistics for our desired set of variables with the summary function, and then plot the combined histogram for all the desired variables together by calling the same function we defined earlier.
As we can see, the number of employees employed has been mostly constant at 5228, but it has also decreased during the time period to different values. This number is measured quarterly, and hence the frequency is not very dynamic, which is why we can see values centered around only a few bins. The euro interbank interest rate has been mostly between 2.5 and 5. There are just 1 or 2 records that have values above 5, and we can see that the max value measured for this variable is 5.045. The consumer confidence index is mostly negative, which means that the consumers mostly perceived the state of the economy negatively during this time. We see two peaks in the histogram, which mark the most common values of the index during this period and suggest limited variation in the index over the length of the campaign. The duration of the call, in seconds, will be excluded from our analysis for now.
To summarize, we understand that the bank's number of employees fluctuated during the campaigns within a range of about 264, which is ~5% of the total employees. It ranged between 4964 and 5228 and mostly had little variation. The consumer confidence index remained mostly negative, with little variation during the time period, and the euro interbank rates had an average of 3.6, with most of the records between 2.5 and 5.
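The range and percentage quoted for nr.employed follow directly from the min and max in the summary output, as this short calculation shows.

```r
# Min and max of nr.employed as reported by summary() in the exercise
nr_min <- 4964
nr_max <- 5228

range_abs <- nr_max - nr_min              # absolute range
range_pct <- 100 * range_abs / nr_max     # range as a percentage of the max

print(c(range_abs = range_abs, range_pct = round(range_pct, 2)))
```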
Now, let's move on to study the categorical variables using univariate analysis.