- Hands-On Ensemble Learning with R
- Prabhanjan Narayanachar Tattar
Logistic regression model
The logistic regression model is a binary classification model; it is a member of the exponential family and belongs to the class of generalized linear models. Now, let Y denote the binary label:

$$Y \in \{0, 1\}$$
Using the information contained in the explanatory vector x, we try to build a model that will help in this task. The logistic regression model is the following:

$$p(x) = P(Y = 1 \mid x) = \frac{\exp(\beta^T x)}{1 + \exp(\beta^T x)}$$
Here, β is the vector of regression coefficients. Note that the logit function

$$\log \frac{p(x)}{1 - p(x)} = \beta^T x$$

is linear in the regression coefficients, and hence the name logistic regression. A logistic regression model can be equivalently written as follows:

$$Y = p(x) + \epsilon$$

Here, ε is the binary error term, which follows a Bernoulli distribution. For more information, refer to Chapter 17 of Tattar et al. (2016). Estimating the parameters of a logistic regression model requires the iteratively reweighted least squares (IRLS) algorithm, and we use the glm R function to get this task done. We will use the Hypothyroid dataset in this section. The training and test datasets and the model formula were already created in the previous section, and we carry on from that point.
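Before turning to the Hypothyroid data, the glm route can be sanity-checked on simulated data. The following is a minimal sketch (the data and variable names here are illustrative, not part of the Hypothyroid analysis): glm with family = binomial() recovers the coefficients of a known logistic model, and the number of Fisher scoring (IRLS) iterations is reported in the fitted object.

```r
# Minimal sketch on simulated data: glm() with family = binomial()
# estimates logistic regression coefficients via IRLS (Fisher scoring).
set.seed(123)
n <- 500
x <- rnorm(n)
# True model: logit(p) = -1 + 2 * x
p_true <- 1 / (1 + exp(-(-1 + 2 * x)))
y <- rbinom(n, size = 1, prob = p_true)
fit <- glm(y ~ x, family = binomial())
coef(fit)   # estimates should be close to the true values (-1, 2)
fit$iter    # number of IRLS / Fisher scoring iterations used
```

With n = 500 observations, the estimated intercept and slope land close to the true values of -1 and 2, illustrating that IRLS converges in only a handful of iterations for a well-behaved problem.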
Logistic regression for hypothyroid classification
For the hypothyroid dataset, we had HT2_Train as the training dataset. The test dataset is split into the covariate matrix HT2_TestX and the test outputs HT2_TestY, while the formula for the logistic regression model is available in HT2_Formula. First, the logistic regression model is fitted to the training dataset using the glm function and the fitted model is christened LR_fit; we then inspect the model-fit summaries using summary(LR_fit). The fitted model is applied to the covariates of the test part using the predict function to create LR_Predict. The predicted probabilities are then converted to labels in LR_Predict_Bin, these labels are compared with the actual testY_numeric, and the overall accuracy is obtained:
> ntr <- nrow(HT2_Train) # Training size
> nte <- nrow(HT2_TestX) # Test size
> p <- ncol(HT2_TestX)
> testY_numeric <- as.numeric(HT2_TestY)
> LR_fit <- glm(HT2_Formula,data=HT2_Train,family = binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(LR_fit)

Call:
glm(formula = HT2_Formula, family = binomial(), data = HT2_Train)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6390   0.0076   0.0409   0.1068   3.5127

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.302025   2.365804  -3.509 0.000449 ***
Age         -0.024422   0.012145  -2.011 0.044334 *
GenderMALE  -0.195656   0.464353  -0.421 0.673498
TSH         -0.008457   0.007530  -1.123 0.261384
T3           0.480986   0.347525   1.384 0.166348
TT4         -0.089122   0.028401  -3.138 0.001701 **
T4U          3.932253   1.801588   2.183 0.029061 *
FTI          0.197196   0.035123   5.614 1.97e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 609.00 on 1363 degrees of freedom
Residual deviance: 181.42 on 1356 degrees of freedom
AIC: 197.42

Number of Fisher Scoring iterations: 9

> LR_Predict <- predict(LR_fit,newdata=HT2_TestX,type="response")
> LR_Predict_Bin <- ifelse(LR_Predict>0.5,2,1)
> LR_Accuracy <- sum(LR_Predict_Bin==testY_numeric)/nte
> LR_Accuracy
[1] 0.9732704
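The significant variables can also be pulled out of the summary programmatically rather than read off by eye. The following is a hedged sketch: it fits a stand-in model on simulated data (since LR_fit and HT2_Train are not recreated here) and extracts the p-value column of the coefficient table.

```r
# Sketch: extract the coefficient table from summary() and flag
# variables with p-values below 0.05. A simulated stand-in model
# replaces LR_fit from the session above.
set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + 1.5 * d$x1))  # x2 is pure noise
fit <- glm(y ~ x1 + x2, family = binomial(), data = d)
coef_table <- summary(fit)$coefficients
pvals <- coef_table[, "Pr(>|z|)"]
names(pvals)[pvals < 0.05]  # terms significant at the 5% level
```

Applied to LR_fit, the same two lines would return the starred rows of the coefficient table shown above.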
It can be seen from the summary of the fitted GLM (the output following the line summary(LR_fit)) that four variables are significant: Age, TT4, T4U, and FTI. Using the predict function, we apply the fitted model to the unseen test cases in HT2_TestX, compare the predictions with the actual labels, and find the accuracy to be 97.33%. Thus, logistic regression is easily deployed in the R software.