Logistic regression model

The logistic regression model is a binary classification model; it is a member of the exponential family and belongs to the class of generalized linear models. Let $Y$ denote the binary label:

$$Y = \begin{cases} 1, & \text{if the event of interest occurs}, \\ 0, & \text{otherwise}. \end{cases}$$

Using the information contained in the explanatory vector $x = (x_1, x_2, \ldots, x_p)$, we try to build a model that will help in this task. The logistic regression model is the following:

$$\pi(x) = P(Y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p}}$$

Here, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$ is the vector of regression coefficients. Note that the logit function

$$\operatorname{logit}(\pi(x)) = \log\left(\frac{\pi(x)}{1 - \pi(x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

is linear in the regression coefficients, and hence the name logistic regression model. A logistic regression model can be equivalently written as follows:

$$Y = \pi(x) + \epsilon$$

Here, $\epsilon$ is the binary error term, which follows a Bernoulli distribution. For more information, refer to Chapter 17 of Tattar et al. (2016). The estimation of the parameters of the logistic regression model requires the iteratively reweighted least squares (IRLS) algorithm, and we will use the glm R function to get this task done. We will use the Hypothyroid dataset in this section. The training and test datasets and the model formula were already created in the previous section, and we will carry on from that point.
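Before moving to the data, it may help to see what IRLS actually does. The following is a minimal sketch of the IRLS iterations for the logistic model; the function irls_logistic and the simulated data are purely illustrative and are not part of the Hypothyroid analysis. Up to the convergence tolerance, its estimates should agree with those returned by glm:

irls_logistic <- function(X, y, maxit = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))               # start at zero
  for (iter in seq_len(maxit)) {
    eta <- as.vector(X %*% beta)        # linear predictor
    mu  <- plogis(eta)                  # fitted probabilities pi(x)
    W   <- mu * (1 - mu)                # working weights
    z   <- eta + (y - mu) / W           # working response
    # Weighted least squares step: solve (X'WX) beta = X'Wz
    beta_new <- solve(crossprod(X, W * X), crossprod(X, W * z))
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  as.vector(beta)
}
set.seed(123)
X <- cbind(1, matrix(rnorm(400), ncol = 2))       # intercept + 2 covariates
y <- rbinom(nrow(X), 1, plogis(X %*% c(-0.5, 1, -1)))
round(cbind(IRLS = irls_logistic(X, y),
            glm  = coef(glm(y ~ X[, -1], family = binomial()))), 4)

Each iteration solves a weighted least squares problem with weights $\pi(x)(1-\pi(x))$ and a working response built from the current fit; this is the Fisher scoring step that glm reports as the Number of Fisher Scoring iterations in its summary output.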

Logistic regression for hypothyroid classification

For the hypothyroid dataset, we have HT2_Train as the training dataset. The test dataset is split into the covariate matrix HT2_TestX and the corresponding outputs HT2_TestY, while the formula for the logistic regression model is available in HT2_Formula. First, the logistic regression model is fitted to the training dataset using the glm function, and the fitted model is christened LR_fit; we then inspect the model fit summaries using summary(LR_fit). The fitted model is applied to the covariates of the test partition using the predict function, giving the predicted probabilities in LR_Predict. These probabilities are then converted to labels in LR_Predict_Bin; note that as.numeric applied to the two-level factor HT2_TestY yields the integer codes 1 and 2, which is why the probabilities are binned to the labels 1 and 2. Finally, the predicted labels are compared with the actual testY_numeric and the overall accuracy is obtained:

> ntr <- nrow(HT2_Train) # Training size
> nte <- nrow(HT2_TestX) # Test size
> p <- ncol(HT2_TestX)
> testY_numeric <- as.numeric(HT2_TestY)
> LR_fit <- glm(HT2_Formula,data=HT2_Train,family = binomial())
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
> summary(LR_fit)
Call:
glm(formula = HT2_Formula, family = binomial(), data = HT2_Train)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.6390   0.0076   0.0409   0.1068   3.5127  
Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.302025   2.365804  -3.509 0.000449 ***
Age         -0.024422   0.012145  -2.011 0.044334 *  
GenderMALE  -0.195656   0.464353  -0.421 0.673498    
TSH         -0.008457   0.007530  -1.123 0.261384    
T3           0.480986   0.347525   1.384 0.166348    
TT4         -0.089122   0.028401  -3.138 0.001701 ** 
T4U          3.932253   1.801588   2.183 0.029061 *  
FTI          0.197196   0.035123   5.614 1.97e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 609.00  on 1363  degrees of freedom
Residual deviance: 181.42  on 1356  degrees of freedom
AIC: 197.42
Number of Fisher Scoring iterations: 9
> LR_Predict <- predict(LR_fit,newdata=HT2_TestX,type="response")
> LR_Predict_Bin <- ifelse(LR_Predict>0.5,2,1)
> LR_Accuracy <- sum(LR_Predict_Bin==testY_numeric)/nte
> LR_Accuracy
[1] 0.9732704

It can be seen from the summary of the fitted GLM (the output following the line summary(LR_fit)) that four variables, Age, TT4, T4U, and FTI, are significant at the 5% level. Using the predict function, we apply the fitted model to the unseen test cases in HT2_TestX, compare the predictions with the actual labels, and find the accuracy to be 97.33%. Thus, logistic regression is easily deployed in R.
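Overall accuracy can hide how well each of the two classes is classified individually. Assuming the objects LR_Predict_Bin and testY_numeric from the session above are still in the workspace, a quick cross-tabulation gives the confusion matrix and the class-wise recall; the output is not reproduced here, as it depends on the particular train/test split:

> CM <- table(Predicted = LR_Predict_Bin, Actual = testY_numeric)
> CM
> diag(CM)/colSums(CM) # class-wise recall: fraction of each actual class predicted correctly

For an imbalanced problem such as hypothyroid detection, the recall of the minority class is often a more informative summary than the overall accuracy.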