- Applied Supervised Learning with R
- Karthik Ramasubramanian Jojo Moolayil
- 2021-06-11 13:22:33
Studying the Relationship Between Two Categorical Variables
To study the relationship and patterns that exist between two categorical variables, we can first explore the frequency distribution across each category of the variables. A higher concentration in any outcome might be a potential insight. The most effective way to visualize this is using stacked bar charts.
A stacked bar chart will help us to observe the distribution of the target variable across multiple categorical variables. The distribution will reveal whether a specific category in a categorical variable dominates the target variable, y. If yes, we can further explore its influence on our problem.
In the next few exercises, we will explore various categorical variables against the target variable, y, using stacked bar charts. We will plot both absolute counts and percentages to understand the distribution better.
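Before moving to the charts, the idea of counts versus percentages can be sketched with a plain cross-tabulation. This is a minimal illustration in base R on a small invented data frame (toy is not the bank marketing dataset, and its values are made up):

```r
# Toy data, invented for illustration only (not the bank marketing dataset)
toy <- data.frame(
  marital = c("married", "married", "married", "married",
              "single", "single", "divorced", "divorced"),
  y       = c("no", "no", "no", "yes", "no", "yes", "no", "no")
)

# Absolute counts for every (marital, y) combination
counts <- table(toy$marital, toy$y)
print(counts)

# Row-wise percentages: share of "no"/"yes" within each marital level
perc <- round(prop.table(counts, margin = 1) * 100)
print(perc)
```

The counts table answers "how many", while the row-wise percentages answer "what share within each level", which is the distinction the two kinds of stacked bar chart make visible.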
Exercise 34: Studying the Relationship Between the Target y and marital status Variables
In this exercise, we will study the relationship between two categorical variables using plain frequency counts, and then see why counts alone are not enough.
To start simple, let's begin with exploring the relationship between the target, y, and marital status.
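- First, import the ggplot2 and dplyr packages using the following commands (dplyr provides the %>% pipe, group_by, and summarize used in the next steps):
library(ggplot2)
library(dplyr)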
- Load the bank-additional-full.csv file into a data frame, df, using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Next, create a temp aggregation dataset:
temp <- df %>% group_by(y,marital) %>% summarize(Count = n())
- Define plot size, as illustrated here:
options(repr.plot.width=12, repr.plot.height=4)
- Plot the chart with frequency distribution:
ggplot(data = temp,aes(x=marital,y=Count,fill=y)) +
geom_bar(stat="identity") +
ggtitle("Distribution of target 'y' across Marital Status")
The output is as follows:
Figure 2.19: Using ggplot to study the relationship between the target y and marital status variables
We first aggregate the categorical columns using the group_by function. This gives us the cross-frequency count for each category combination. We then use this temporary dataset to plot the frequency distribution across the independent variable.
As we can see, the yes frequency is highest for married clients, but this may be true just because the number of married clients is high. To understand the relationship better, we can further break this down using a stacked bar chart with percentage distribution, where each bar represents the percentage of yes and no, respectively.
- Create a temp aggregation dataset:
temp <- df %>% group_by(y,marital) %>%
summarize(Count = n()) %>%
ungroup() %>% #This function ungroups the previously grouped dataframe
group_by(marital) %>%
mutate(Perc = round(Count/sum(Count)*100)) %>%
arrange(marital)
- Define the plot size using the options function:
options(repr.plot.width=12, repr.plot.height=4)
- Plot the percentage distribution using the ggplot function:
ggplot(data = temp,aes(x=marital,y=Perc,fill=y)) +
geom_bar(stat="identity") +
geom_text(aes(label = Perc), size = 5, hjust = 0.5, vjust = 0.3, position = "stack") +
ggtitle("Distribution of target 'y' percentage across Marital Status")
The output is as follows:
Figure 2.20: Distribution of target y percentage across marital status
The normalized plot tells a different story from the previous one. After we normalize the counts, we see that single clients are more responsive to the campaign than married clients. The same holds for unknown, but given the uncertainty of the value and the extremely low number of records, we should ignore it. We cannot yet conclude that single clients are more likely to respond to campaigns, but we can validate this later.
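As an aside, ggplot2 can also draw a percentage-stacked bar chart directly from the raw rows, without a separate aggregation step. This is an alternative to the approach used in the exercise, not the method above. The sketch below uses position = "fill" on an invented data frame (toy and its values are assumptions for illustration):

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  marital = c(rep("married", 10), rep("single", 4)),
  y       = c(rep("no", 9), "yes", rep("no", 3), "yes")
)

# position = "fill" rescales every bar to height 1, so the segments show
# the within-category share of each outcome without a manual aggregation
p <- ggplot(toy, aes(x = marital, fill = y)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = function(v) paste0(v * 100, "%")) +
  ggtitle("Share of target 'y' within each marital status (toy data)")
print(p)
```

The manual aggregation used in the exercise is still useful when we want to print the percentage labels on the bars, since the Perc column is then available to geom_text.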
Exercise 35: Studying the Relationship between the job and education Variables
In this exercise, we will accelerate our exploration. Let's build a custom function that combines the two charts, that is, the frequency distribution as well as the percentage distribution, for the bivariate analysis of categorical variables.
Perform the following steps:
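- First, import the ggplot2, dplyr, and cowplot packages using the following commands (dplyr provides the pipe and aggregation functions, and cowplot provides plot_grid):
library(ggplot2)
library(dplyr)
library(cowplot)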
- Load the bank-additional-full.csv file into a data frame, df, using the following command:
df <- read.csv("/Chapter 2/Data/bank-additional/bank-additional-full.csv",sep=';')
- Define the plot_bivariate_categorical function. Its loop starts by creating a temp aggregation dataset for each variable:
plot_bivariate_categorical <- function(df, target, list_of_variables){
target <- sym(target) #Converting the string to a column reference
i <-1
plt_matrix <- list()
for(column in list_of_variables){
col <- sym(column)
temp <- df %>% group_by(!!target,!!col) %>% #target and col are already symbols
summarize(Count = n()) %>%
ungroup() %>% #This function ungroups the previously grouped dataframe
group_by(!!col) %>%
mutate(Perc = round(Count/sum(Count)*100)) %>%
arrange(!!col)
- Still inside the loop, define the plot size:
options(repr.plot.width=14, repr.plot.height=12)
- Still inside the loop, plot the chart with a frequency distribution:
plt_matrix[[i]]<- ggplot(data = temp,aes(x=!!col,y=Count,fill=!!target)) +
geom_bar(stat="identity") +
geom_text(aes(label = Count), size = 3, hjust = 0.5, vjust = -0.3, position = "stack") +
theme(axis.text.x = element_text(angle = 90, vjust = 1)) + #rotates the labels
ggtitle(paste("Distribution of target 'y' frequency across",column))
i<-i+1
- Plot the percentage distribution, then close the loop and arrange all the plots in a grid:
plt_matrix[[i]] <- ggplot(data = temp,aes(x=!!col,y=Perc,fill=!!target)) +
geom_bar(stat="identity") +
geom_text(aes(label = Perc), size = 3, hjust = 0.5, vjust = -1, position = "stack") +
theme(axis.text.x = element_text(angle = 90, vjust = 1)) + #rotates the labels
ggtitle(paste("Distribution of target 'y' percentage across",column))
i <- i+1
}
plot_grid(plotlist = plt_matrix, ncol=2)
}
- Call the plot_bivariate_categorical function using the following command:
plot_bivariate_categorical(df,"y",c("job","education"))
The output is as follows:
Figure 2.21: Studying the relationship between the job and education variables
We use the same principles to define a function that plots the charts together. The difference here is that we get two plots for each variable. The left-hand plot shows the frequency across the category combinations, and the right-hand plot shows the percentage distribution (normalized within each category). Studying both plots together helps validate results more effectively. The creation of the temporary aggregated dataset has an additional step: we call the ungroup function and then regroup by the independent variable alone. This makes sum(Count) the total for each level, which gives the relative percentage distribution of the target outcome within each level of the independent variable, for example, the distribution of y within each level of job.
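To make the effect of the grouping concrete, the following sketch compares the percentage computed within each level against the percentage computed over the whole table. It assumes a small invented table of counts (toy and its numbers are made up for illustration):

```r
library(dplyr)

# Toy counts, invented for illustration only
toy <- data.frame(
  y       = c("no", "yes", "no", "yes"),
  marital = c("married", "married", "single", "single"),
  Count   = c(90, 10, 30, 10)
)

# Grouping by marital alone makes sum(Count) the total for that level,
# so Perc is the share of each outcome within the level
within_level <- toy %>%
  group_by(marital) %>%
  mutate(Perc = Count / sum(Count) * 100) %>%
  ungroup()

# Without any grouping, sum(Count) is the grand total,
# so Perc becomes the share of the whole table instead
overall <- toy %>%
  mutate(Perc = Count / sum(Count) * 100)

print(within_level)
print(overall)
```

In within_level, the percentages add up to 100 inside each marital level, which is what the stacked percentage chart needs. In overall, they add up to 100 across the entire table, so large categories would still dominate.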
If we observe the previous output plots, we can see that the highest response rates for the campaign come from students and retired clients, but this comes with a caveat: both of these categories have far fewer observations than the other categories. Therefore, we need additional validation before drawing further conclusions, and we make a note of this insight too. Across education levels, we don't see any interesting trends. Though illiterate clients have a high response rate, the number of observations is far too low to conclude anything tangible.
- Let's take a look at credit default and housing loan categories:
plot_bivariate_categorical(df,"y",c("default","housing"))
The output is as follows:
Figure 2.22: Studying the relationship between the default and housing variables
- Again, we don't see any interesting trends. Let's continue the exploration for personal loan and contact mode:
plot_bivariate_categorical(df,"y",c("loan","contact"))
The output is as follows:
Figure 2.23: Studying the relationship between the loan and contact variables
Here, we can see an interesting trend for the mode of contact used. The response rate is generally higher when the mode of campaign communication is cellular rather than landline. Let's make a note of this trend too and revisit it with further validation.
I encourage you to explore the relationships between our target variable and the remaining independent categorical variables: month, day of week, and the previous outcome of the campaign.
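When exploring the remaining variables, it can help to read the response rate next to the number of observations in each level, since the caveats noted earlier all came from levels with few records. The following is a minimal sketch in base R; response_rate_by_level is a hypothetical helper, and toy is an invented stand-in for df with made-up values:

```r
# Toy stand-in for df, invented for illustration only
toy <- data.frame(
  month    = c("may", "may", "may", "jun", "jun", "mar"),
  poutcome = c("nonexistent", "failure", "nonexistent",
               "success", "nonexistent", "success"),
  y        = c("no", "no", "yes", "yes", "no", "yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: percentage of "yes" per level, next to the level's size
response_rate_by_level <- function(data, target, column) {
  counts <- table(data[[column]], data[[target]])
  data.frame(
    level    = rownames(counts),
    n        = as.vector(rowSums(counts)),
    yes_perc = as.vector(round(counts[, "yes"] / rowSums(counts) * 100))
  )
}

# Print one small summary table per categorical variable
for (column in c("month", "poutcome")) {
  cat("\n", column, "\n")
  print(response_rate_by_level(toy, "y", column))
}
```

A level with a high yes_perc but a very small n deserves the same caution we applied to the student, retired, and illiterate categories.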