Discriminant analysis overview

Discriminant Analysis (DA), also known as Fisher Discriminant Analysis (FDA), is another popular classification technique. It can be an effective alternative to logistic regression when the classes are well-separated. In that situation, logistic regression can produce unstable estimates: the confidence intervals are wide and the estimates themselves are likely to vary from one sample to another (James, 2013). DA does not suffer from this problem and, as a result, may outperform logistic regression and generalize better. Conversely, if the relationships between the features and the outcome variable are complex, DA may perform poorly on a classification task. For our breast cancer example, logistic regression performed well on both the training and testing sets, and the classes were not well-separated. For the purpose of comparison with logistic regression, we will explore DA in both its Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) forms.

DA uses Bayes' theorem to determine the probability of class membership for each observation. If you have two classes, for example, benign and malignant, then DA calculates an observation's probability for each class and assigns the observation to the class with the higher probability.

Bayes' theorem states that the probability of Y occurring, given that X has occurred, is equal to the probability of both Y and X occurring, divided by the probability of X occurring, which can be written as follows:

    P(Y | X) = P(X and Y) / P(X)

The numerator in this expression is the likelihood that an observation comes from that class and has these feature values. The denominator is the likelihood of observing these feature values across all the classes. The classification rule says that if you have the joint distribution of X and Y, then given X, the optimal decision is to assign an observation to the class with the larger probability (the posterior probability).
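
To make the rule concrete, here is a minimal Python sketch that applies Bayes' theorem to a single hypothetical observation with two classes. The prior and likelihood values are made up purely for illustration and are not taken from any dataset in this book:

    # Hypothetical two-class example of the Bayes rule described above.
    # All numbers are illustrative only.
    prior = {"benign": 0.63, "malignant": 0.37}        # P(Y = k): class proportions
    likelihood = {"benign": 0.05, "malignant": 0.20}   # P(X | Y = k): density of the observed features in each class

    # Numerator: joint probability P(X and Y = k) = P(X | Y = k) * P(Y = k)
    joint = {k: likelihood[k] * prior[k] for k in prior}

    # Denominator: P(X), the likelihood of these feature values across all classes
    evidence = sum(joint.values())

    # Posterior P(Y = k | X); the class with the larger posterior wins
    posterior = {k: joint[k] / evidence for k in joint}
    print(posterior)                          # {'benign': 0.298..., 'malignant': 0.701...}
    print(max(posterior, key=posterior.get))  # 'malignant'

Simple as it is, this is the entire decision rule; LDA and QDA differ only in how the class likelihoods are estimated.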

The process of attaining posterior probabilities goes through the following steps:

  1. Collect data with a known class membership.
  2. Calculate the prior probabilities; these represent the proportion of the sample that belongs to each class.
  3. Calculate the mean of each feature for each class.
  4. Calculate the variance-covariance matrix of the features; for LDA, this is a single matrix pooled across all the classes, giving us a linear classifier, while for QDA, a separate variance-covariance matrix is created for each class.
  5. Estimate the normal distribution (Gaussian densities) for each class.
  6. Compute the discriminant function, which is the rule for classifying a new observation.
  7. Assign an observation to a class based on the discriminant function.
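
The following from-scratch sketch (not this book's code) walks through steps 1 to 7 with NumPy and SciPy for the LDA (pooled covariance) case. The names lda_posteriors, X_train, y_train, and X_new are illustrative placeholders; you supply the feature matrix and class labels, and the sketch assumes at least a few observations per class and a non-singular pooled covariance:

    # A compact sketch of steps 1 to 7 for LDA.
    import numpy as np
    from scipy.stats import multivariate_normal

    def lda_posteriors(X_train, y_train, X_new):
        classes = np.unique(y_train)                                       # step 1: data with known classes
        priors = {k: np.mean(y_train == k) for k in classes}               # step 2: prior probabilities
        means = {k: X_train[y_train == k].mean(axis=0) for k in classes}   # step 3: per-class feature means

        # Step 4: pooled variance-covariance matrix (LDA); QDA would instead
        # keep a separate covariance matrix for each class.
        pooled = sum(((y_train == k).sum() - 1) * np.cov(X_train[y_train == k], rowvar=False)
                     for k in classes) / (len(y_train) - len(classes))

        # Steps 5 and 6: prior-weighted Gaussian densities, which are the
        # numerators of the posterior probabilities for each class
        scores = np.column_stack([
            priors[k] * multivariate_normal(mean=means[k], cov=pooled).pdf(X_new)
            for k in classes])

        # Step 7: normalize to posteriors and assign each observation to the
        # class with the largest posterior probability
        posteriors = scores / scores.sum(axis=1, keepdims=True)
        return classes[np.argmax(posteriors, axis=1)], posteriors

For QDA, only step 4 changes: estimate one covariance matrix per class and use it in that class's density in step 5.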

Combining the prior probabilities from step 2 with the Gaussian densities from step 5 gives an expanded notation for the posterior probability, as follows:

    P(Y = k | X = x) = π_k f_k(x) / (π_1 f_1(x) + ... + π_K f_K(x))

Here, π_k is the prior probability that an observation belongs to class k, f_k(x) is the estimated density of the features for class k, and K is the number of classes; the observation is assigned to the class with the largest posterior probability.

Even though LDA is elegantly simple, it is limited by the assumption that the observations in each class are drawn from a multivariate normal distribution and that the classes share a common covariance matrix. QDA still assumes that the observations come from a normal distribution, but it allows each class to have its own covariance matrix.
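
To see the difference in the covariance assumption directly, here is a minimal scikit-learn sketch. It uses scikit-learn's built-in breast cancer data, which is not necessarily the same data as this chapter's example, and the store_covariance option so that the fitted covariance matrices can be inspected:

    # Contrast the covariance assumptions of LDA and QDA with scikit-learn.
    from sklearn.datasets import load_breast_cancer
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)

    X, y = load_breast_cancer(return_X_y=True)   # 30 features, 2 classes

    lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
    qda = QuadraticDiscriminantAnalysis(store_covariance=True).fit(X, y)

    print(lda.covariance_.shape)                           # one pooled matrix: (30, 30)
    print(len(qda.covariance_), qda.covariance_[0].shape)  # one matrix per class: 2 (30, 30)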

Why does this matter? When you relax the common covariance assumption, you allow quadratic terms into the discriminant score calculations, which is not possible with LDA. The mathematics behind this can be a bit intimidating and is outside the scope of this book. The important part to remember is that QDA is a more flexible technique than logistic regression, but we must keep the bias-variance trade-off in mind. With a more flexible technique, you are likely to have lower bias but potentially higher variance. As with many flexible techniques, a robust set of training data is needed to mitigate high classifier variance.
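
One simple way to watch the trade-off is to compare training accuracy against test accuracy for both classifiers on the same split. The sketch below does this with scikit-learn's built-in breast cancer data and an arbitrary 70/30 split; the dataset, the split, and therefore the exact numbers are illustrative assumptions rather than the chapter's results:

    # Compare in-sample and out-of-sample accuracy for LDA and QDA.
    from sklearn.datasets import load_breast_cancer
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.model_selection import train_test_split

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    for name, model in [("LDA", LinearDiscriminantAnalysis()),
                        ("QDA", QuadraticDiscriminantAnalysis())]:
        model.fit(X_train, y_train)
        print(name,
              "train accuracy:", round(model.score(X_train, y_train), 3),
              "test accuracy:", round(model.score(X_test, y_test), 3))

If the more flexible QDA model shows a noticeably larger gap between its training and test accuracy than LDA does, that gap is the higher variance the trade-off warns about.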