- Applied Supervised Learning with R
- Karthik Ramasubramanian Jojo Moolayil
- 2021-06-11 13:22:32
Exploring Categorical Features
Categorical features differ from numeric or continuous features in nature, and therefore the traditional methods used earlier aren't applicable here. We can analyze the number of different classes within a categorical variable and the frequency associated with each. This can be achieved using either simple analytical techniques or visual techniques. Let's explore a list of categorical features using a combination of both.
Exercise 24: Exploring Categorical Features
In this exercise, we will start with a simple variable, that is, marital, which indicates the marital status of the client. Let's use the dplyr library to perform grouped data aggregation.
Perform the following steps to complete the exercise:
- First, import the dplyr library in the system using the following command:
library(dplyr)
- Next, create an object named marital_distribution that stores the count and the percentage share of each marital status:
marital_distribution <- df %>% group_by(marital) %>%
summarize(Count = n()) %>%
mutate(Perc.Count = round(Count/sum(Count)*100))
- Now, print the value stored in the marital_distribution object:
print(marital_distribution)
The output is as follows:
# A tibble: 4 x 3
marital Count Perc.Count
<fct> <int> <dbl>
1 divorced 4612 11
2 married 24928 61
3 single 11568 28
4 unknown 80 0
To count the number of distinct classes within a categorical column and the number of records within each class, we use the group_by and summarize functions available in the dplyr library. The %>% operator, called the pipe operator, is analogous to the pipe in a Linux shell. It takes the output of the expression on its left and passes it as the first argument to the function on its right, which lets us chain a series of operations. Here, we first group the DataFrame by the variable of interest, that is, marital, and then pass the output to the summarize function, which aggregates the DataFrame to the group level using the aggregation function we provide; in this case, n() simply counts the rows in each group. Finally, we use the mutate function to calculate each group's percentage share of the total count.
We see that the majority of the campaign calls were made to married clients, around 61%, followed by calls to single clients at 28%, and so on.
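The chain described above can be tried on a small made-up data frame. The following is a minimal sketch, assuming only that dplyr is installed; toy_df and its values are hypothetical and are not taken from the bank dataset. It also writes the same steps as nested calls to show what the pipe does.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the chain of operations
toy_df <- data.frame(
  marital = c("married", "single", "married", "divorced", "married")
)

# Piped form: each step receives the result of the previous step
piped <- toy_df %>%
  group_by(marital) %>%
  summarize(Count = n()) %>%
  mutate(Perc.Count = round(Count / sum(Count) * 100))

# Equivalent nested form: the same calls, read from the inside out
nested <- mutate(
  summarize(group_by(toy_df, marital), Count = n()),
  Perc.Count = round(Count / sum(Count) * 100)
)

print(piped)
```

Both forms should produce the same three-row summary; the piped form simply reads in the order in which the steps are executed.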
Exercise 25: Exploring Categorical Features Using a Bar Chart
In this exercise, we will plot a bar chart that visualizes the frequency distribution of each class of the marital variable, expressed as a percentage of the total.
Perform the following steps to complete the exercise:
- First, import the ggplot2 package using the following command:
library(ggplot2)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
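df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';',stringsAsFactors=TRUE) #In R 4.0 and later, text columns are read as factors only if stringsAsFactors=TRUE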
- Now, plot the bar chart of marital status per count using the following command:
ggplot(data = marital_distribution,aes(x=marital,y=Perc.Count)) +
geom_bar(stat="identity",fill="blue",alpha=0.6) +
geom_text(aes(label=Perc.Count), vjust = -0.3)
The output is as follows:
Figure 2.10: Bar chart of marital status per count
We use the same dataset engineered in the previous snippet, which calculates the frequency of each class and its relative percentage. To plot the bar chart, we use the same base function of ggplot, where we define the aesthetics of the x and y variables and append the bar plot using the geom_bar function. The geom_text function allows us to add labels to each bar in the plot.
We can now see the same numbers displayed in the previous exercise visualized here with a bar plot. In scenarios where we have a large number of classes within the variable, glancing through each individual class to study them might not be the most effective method. A simple plot easily helps us to understand the frequency distribution of the categorical variable in an easy-to-consume way.
Exercise 26: Exploring Categorical Features Using a Pie Chart
In this exercise, we will define the pie chart and the various components within it.
Similar to the bar plot, we also have a pie chart that makes understanding the percentage distribution of the classes easier. Perform the following steps to visualize the same variable, that is, marital status using a pie chart:
- First, import the ggplot2 package using the following command:
library(ggplot2)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';',stringsAsFactors=TRUE)
- Next, define the label positions using the following command:
plot_breaks = 100 - (cumsum(marital_distribution$Perc.Count) -
marital_distribution$Perc.Count/2)
- Now, define labels for the plots:
plot_labels = paste0(marital_distribution$marital,"-",marital_distribution$Perc.Count,"%")
- Set the plot size for better visuals:
options(repr.plot.width=12, repr.plot.height=8)
- Create the pie chart using the following command:
ggplot(data = marital_distribution,aes(x=1,y=Perc.Count, fill=marital)) +
geom_bar(stat="identity") + #Creates the base bar visual
coord_polar(theta ="y") + #Creates the pie chart
scale_y_continuous(breaks=plot_breaks, labels = plot_labels,position = "left") +
theme(axis.text.x = element_text(angle = 30, hjust =1)) + #rotates the labels
theme(text = element_text(size=15)) + #sets the font size for the text in the plot
ggtitle("Percentage Distribution of Marital Status") #Adds the plot title
Figure 2.11: Pie chart for the percentage distribution for the marital status
We first define a few extra variables that will help us to get the plot in an easier way. In order to label the pie chart, we would need the break points and the actual labels. The break point should ideally be located in the middle part of the pie piece. So, we take a cumulative sum of the percentage distribution and subtract half of each category to find the mid-point of the section. We then subtract the entire number from 100 to arrange the labels in a clockwise direction.
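The arithmetic behind the break points can be followed step by step. The sketch below uses the rounded percentages from the marital_distribution output shown earlier; the intermediate names upper_edge and mid_point are introduced here only for illustration.

```r
# Rounded percentages from the marital_distribution output shown earlier
perc <- c(divorced = 11, married = 61, single = 28, unknown = 0)

# Upper edge of each slice along the stacked bar: 11, 72, 100, 100
upper_edge <- cumsum(perc)

# Midpoint of each slice: upper edge minus half of the slice width
mid_point <- upper_edge - perc / 2

# Subtracting from 100 reverses the direction of the positions
plot_breaks <- 100 - mid_point
print(plot_breaks)
# divorced  married   single  unknown
#     94.5     58.5     14.0      0.0
```

This is the same expression as in the exercise, only split into named steps.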
The next step defines the label for each pie piece; we use the paste0 function to concatenate the class name and the actual percentage value. The pie chart functionality in ggplot works by constructing elements on top of a bar chart. We first use the base from ggplot and geom_bar to render the base for a stacked bar plot and then use the coord_polar function to transform this into the required pie chart. The scale_y_continuous function helps in placing the labels on the pie distribution. The next line adds a rotation angle to the positioning of the text. The size parameter inside the element_text portion of the theme function defines the font size for the text in the plot. The rest is the same as we studied in the earlier plots.
We can see that the pie chart provides us with an intuitive way to explore the percentage distribution of the categories within a variable. A word of caution: the choice between a pie chart and a bar plot should depend on the number of distinct categories within the variable. Though pie charts are visually more appealing, they become overcrowded when there are many distinct classes.
Note
Pie charts are best avoided when the number of distinct classes within a categorical variable is high. There is no definite rule, but anything that makes visually cluttered pie charts would not be ideal to study.
Exercise 27: Automate Plotting Categorical Variables
In this exercise, we will automate the plotting of categorical variables.
Just like numeric variables, we also have 10 categorical variables, excluding the target variable. Similar to automating the exploration of numeric features, let's now create a function for categorical variables. To keep things simple, we will primarily use bar plots with a percentage distribution instead of a pie chart. We will start with four categorical features and then move on to the remaining ones.
Perform the following steps to complete the exercise:
- First, import the cowplot package using the following command:
library(cowplot)
- Define a function to plot bar charts for a list of categorical columns:
plot_grid_categorical <- function(df,list_of_variables,ncols=2){
plt_matrix <- list()
i<-1
#Iterate for each variable
for(column in list_of_variables){
#Creating a temporary DataFrame with the aggregation
var.dist <- df %>% group_by(across(all_of(column))) %>% #group_by_ is deprecated in current dplyr
summarize(Count = n()) %>%
mutate(Perc.Count = round(Count/sum(Count)*100,1))
options(repr.plot.width=12, repr.plot.height=10)
plt_matrix[[i]]<-ggplot(data = var.dist,aes_string(x=column,y="Perc.Count")) +
geom_bar(stat="identity",fill="blue",alpha=0.6) + #Defines the bar plot
geom_text(label=var.dist$Perc.Count,vjust=-0.3)+ #Adds the labels
theme(axis.text.x = element_text(angle = 90, vjust = 1)) + #rotates the labels
ggtitle(paste("Percentage Distribution of variable: ",column)) #Creates the title
i<-i+1
}
plot_grid(plotlist=plt_matrix,ncol=ncols) #plots the grid
}
- Next, call the summary statistics using the following command:
summary(df[,c("job","education","default","contact")])
The output is as follows:
job education default contact
admin. :10422 university.degree :12168 no :32588 cellular :26144
blue-collar: 9254 high.school : 9515 unknown: 8597 telephone:15044
technician : 6743 basic.9y : 6045 yes : 3
services : 3969 professional.course: 5243
management : 2924 basic.4y : 4176
retired : 1720 basic.6y : 2292
(Other) : 6156 (Other) : 1749
- Call the function we defined earlier to plot the bar charts:
plot_grid_categorical(df,c("job","education","default","contact"),2)
The output is as follows:
Figure 2.12: Bar plot for categorical variables
Similar to the earlier function we created to automate the visualization of numeric features, we have created a simple function to explore the percentage distribution of categorical features. The additions to the function are the creation of the temporary aggregation dataset and some cosmetic enhancements to the plot. We add the labels and rotate the axis text by 90 degrees so that it aligns neatly with the plot, and the rest remains the same. We get the frequency count by calling the summary function on the categorical columns. As with numeric columns, we explore the categorical columns first using the summary function and then use the defined function to visualize the collated bar plots.
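The list of column names passed to such a function does not have to be typed by hand. The following is a minimal sketch of one way to collect the categorical columns of a data frame; toy_df is a hypothetical stand-in for df, and the check covers both factor and character columns because read.csv in R 4.0 and later keeps text columns as character unless stringsAsFactors=TRUE is set.

```r
# Hypothetical toy data frame standing in for df
toy_df <- data.frame(
  age = c(30, 45, 52),
  job = c("admin.", "technician", "admin."),
  y = factor(c("no", "yes", "no")),
  stringsAsFactors = FALSE
)

# Flag every column that holds factor or character values
is_categorical <- vapply(
  toy_df,
  function(col) is.factor(col) || is.character(col),
  logical(1)
)

# Keep only the names of the flagged columns
categorical_columns <- names(toy_df)[is_categorical]
print(categorical_columns)
# [1] "job" "y"
```

The resulting vector could then be split into groups of four and passed to plot_grid_categorical, one group at a time.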
Exploring the job feature, we can see 12 distinct values, with most of the records for admin., blue-collar, and technician. Overall, the job category seems to have a fairly diverse distribution of values. The education level of the client also has a diverse set of values, with ~50% of the values from high school and university. For the default variable, which indicates whether the client has defaulted on credit previously, we have ~80% of the values as no and ~20% as unknown. With only 3 clients recorded as yes, this variable doesn't seem to carry much useful information. Finally, contact, which defines the mode of contact used for the campaign communication, shows that 64% was through cellular phones, and the rest through landlines.
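The percentages quoted above can be cross-checked directly from the counts in the summary output. The sketch below is a small base R check; to_percent is a hypothetical helper introduced here for illustration, and the counts are copied from the output of the summary function shown earlier.

```r
# Counts copied from the summary() output for contact and default
contact_counts <- c(cellular = 26144, telephone = 15044)
default_counts <- c(no = 32588, unknown = 8597, yes = 3)

# Hypothetical helper: convert counts into percentages with one decimal
to_percent <- function(counts) round(100 * counts / sum(counts), 1)

print(to_percent(contact_counts))
#  cellular telephone
#      63.5      36.5
print(to_percent(default_counts))
#      no unknown     yes
#    79.1    20.9     0.0
```

The same helper can be applied to the counts of any other categorical column when reading the plots alongside the summary output.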
Let's move on and repeat the same analysis for the next set of features.
Exercise 28: Automate Plotting for the Remaining Categorical Variables
In this exercise, we will reuse the same function for the next set of four categorical variables. Remember that you need to use the frequency count generated using the summary command in conjunction with the plots to interpret the value.
Let's perform the following procedure to complete the exercise:
- First, import the cowplot package using the following command:
library(cowplot)
- Create a DataFrame object, df, and use the bank-additional-full.csv file using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';',stringsAsFactors=TRUE)
- Next, call the summary statistics using the following command:
summary(df[,c("loan","month","day_of_week","poutcome")])
The output is as follows:
loan month day_of_week poutcome
no :33950 may :13769 fri:7827 failure : 4252
unknown: 990 jul : 7174 mon:8514 nonexistent:35563
yes : 6248 aug : 6178 thu:8623 success : 1373
jun : 5318 tue:8090
nov : 4101 wed:8134
apr : 2632
(Other): 2016
- Call the defined function to plot the bar charts:
plot_grid_categorical(df,c("loan","month","day_of_week","poutcome"),2)
The output is as follows:
Figure 2.13: Automate plotting for the remaining categorical variables
We reuse the previously defined functions to explore the new set of four variables just like we explored the previous set of features.
The loan variable indicates whether the client has a personal loan. Around 82.4% of clients have no personal loan, 15.2% have a loan, and for 2.4% the status is unknown. Similarly, the month variable indicates the actual month when the campaign calls were executed. We see that the majority of communication was conducted in the month of may, followed by jul and aug. Overall, the month feature also seems to be a fairly diverse variable with a good distribution of values. The day_of_week variable shows a consistent distribution across all days of the week. poutcome indicates the result of the previously executed campaign; a significant majority of around 86% is nonexistent, a small chunk of around 3.3% was successful, and ~10% failed.
Exercise 29: Exploring the Last Remaining Categorical Variable and the Target Variable
Finally, let's explore the last remaining categorical variable and the target variable. Since both are categorical, we can continue using the same function for the exploration.
Repeat the same process for the last independent categorical variable and the dependent variable (which is also categorical):
- First, after importing the required packages and creating the DataFrame object, call the summary statistics using the following command:
summary(df[,c("y","housing")])
The output is as follows:
y housing
no :36548 no :18622
yes: 4640 unknown: 990
yes :21576
- Call the defined function to plot the bar charts:
plot_grid_categorical(df,c("y","housing"),2)
The output is as follows:
Figure 2.14: Histogram of housing per count
If we carefully look at the distribution of the outcome variable, we can see that a large majority of the clients have negatively responded to the campaign calls. Only ~11% of the overall campaign base have positively responded to the campaign. Similarly, if we look at the housing variable, we can see that roughly 50% of the clients had a housing loan.
To summarize, we can distill our observations as follows:
- The campaign was conducted with a major focus on new customers who had not been previously contacted.
- Around 60% of the client base are married and 80% have not defaulted in credit history.
- Roughly 50% of the client base has a housing loan and over 80% does not have a personal loan.
- The campaign was the most active during the month of May and demonstrated fairly strong momentum in July and August.
- More than 60% of the communication of the campaign was through cellular phones, and over 50% of the client base at least had a high school degree.
- Overall, only 11% of the campaign calls had a positive response.
With the univariate analysis of all the numeric and categorical variables complete, we now have a fair understanding of the story the data conveys. We understand almost every data dimension and its distribution. Let's move on to explore another interesting facet of EDA: bivariate analysis.