- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
Hypothyroid
The hypothyroid dataset Hypothyroid.csv is available in the book's code bundle, located at /…/Chapter01/Data. While the dataset contains 26 variables, we will only use seven of them; the number of observations is n = 3163. The dataset was downloaded from http://archive.ics.uci.edu/ml/datasets/thyroid+disease, where the file is named hypothyroid.data (http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/hypothyroid.data). After some tweaks, including relabeling certain values, the CSV file is made available in the book's code bundle. The purpose of the study is to classify whether a patient has a thyroid problem based on the information provided by the other variables. There are multiple variants of the dataset, and the reader can find the details at the following web page: http://archive.ics.uci.edu/ml/machine-learning-databases/thyroid-disease/HELLO. Here, the column representing the variable of interest is named Hypothyroid, which shows that we have 151 patients with thyroid problems; the remaining 3012 tested negative. Clearly, this dataset is an example of unbalanced data: one class vastly outnumbers the other, with about 20 negative cases for each thyroid case. Such problems need to be handled differently, and we need to get into the subtleties of the algorithms to build meaningful models. The additional variables, or covariates, that we will use while building the predictive models are Age, Gender, TSH, T3, TT4, T4U, and FTI.
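The imbalance just described can be quantified as a simple ratio. The following self-contained sketch recreates the reported class counts (151 positive, 3012 negative) with a simulated factor; with the real data, you would tabulate the Hypothyroid column itself:

```r
# Sketch: quantifying class imbalance. The factor below simulates the
# Hypothyroid column with the counts reported in the text (151 vs. 3012);
# on the real data you would run table(HT2$Hypothyroid) instead.
Hypothyroid <- factor(rep(c("hypothyroid", "negative"), times = c(151, 3012)))
counts <- table(Hypothyroid)
print(counts)
# Imbalance ratio: roughly 20 negative cases per positive case
ratio <- counts[["negative"]] / counts[["hypothyroid"]]
print(round(ratio, 1))   # about 19.9
```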
. The data is first imported into an R session and is subset according to the variables of interest as follows:
> HT <- read.csv("../Data/Hypothyroid.csv", header = TRUE, stringsAsFactors = F)
> HT$Hypothyroid <- as.factor(HT$Hypothyroid)
> HT2 <- HT[, c("Hypothyroid", "Age", "Gender", "TSH", "T3", "TT4", "T4U", "FTI")]
The first line of code imports the data from the Hypothyroid.csv file using the read.csv function. The second line converts the target column into a factor, and the third retains only the variables of interest. The dataset has a lot of missing data in these variables, as seen here:
> sapply(HT2, function(x) sum(is.na(x)))
Hypothyroid         Age      Gender         TSH          T3         TT4
          0         446          73         468         695         249
        T4U         FTI
        248         247
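To see how this sapply-over-is.na pattern behaves, here is a minimal sketch on a toy data frame with made-up values; it also shows how many rows survive once every incomplete row is dropped:

```r
# Toy data frame with scattered NAs (made-up values, for illustration only)
toy <- data.frame(Age = c(23, NA, 41, 35),
                  TSH = c(1.3, 2.1, 0.8, NA))
# Count the missing values per column, as done for HT2 above
print(sapply(toy, function(x) sum(is.na(x))))   # Age: 1, TSH: 1
# na.omit() keeps only rows that are complete in every column
print(nrow(na.omit(toy)))                       # 2 complete rows remain
```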
Consequently, we remove all the rows that have a missing value, and then split the data into training and testing datasets. We will also create a formula for the classification problem:
> HT2 <- na.omit(HT2)
> set.seed(12345)
> Train_Test <- sample(c("Train","Test"), nrow(HT2), replace = TRUE, prob = c(0.7, 0.3))
> head(Train_Test)
[1] "Test"  "Test"  "Test"  "Test"  "Train" "Train"
> HT2_Train <- HT2[Train_Test=="Train", ]
> HT2_TestX <- within(HT2[Train_Test=="Test", ], rm(Hypothyroid))
> HT2_TestY <- HT2[Train_Test=="Test", c("Hypothyroid")]
> HT2_Formula <- as.formula("Hypothyroid~.")
The set.seed function ensures that the results are reproducible each time we run the program. After removing the missing observations with the na.omit function, we split the hypothyroid data into training and testing parts. The former is used to build the model and the latter to validate it, using data that has not been seen during model building. Quinlan, the inventor of the popular tree algorithm C4.5, used this dataset extensively.