# Linear Discriminant Analysis With Scikit-Learn

## Introduction

Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are well-known dimensionality reduction techniques. They are especially useful when working with sparsely populated, structured big data, or when features in a vector space are linearly dependent. [A vector has a linearly dependent dimension if said dimension can be represented as a linear combination of one or more other dimensions.] PCA is an unsupervised algorithm for dimensionality reduction, whereas LDA is a supervised algorithm that finds a subspace maximizing the separation between classes.

The advantage of LDA is that it also works as a separator for classes, that is, as a classifier. However, LDA can be prone to overfitting and is sensitive to noise and outliers.

In the scikit-learn documentation, the LDA class is defined as “A classifier with a linear decision boundary, generated by fitting class conditional densities to the data and using Bayes’ rule.” In classification, LDA makes predictions by estimating the probability that a new input belongs to each class. The class with the highest probability is the predicted class.
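As a minimal sketch of this prediction rule, the snippet below fits scikit-learn's `LinearDiscriminantAnalysis` and inspects the per-class probabilities for one sample. The choice of the Iris dataset and of the first sample is an assumption made purely for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: Iris is used here only as a small illustrative dataset
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Estimated probability of each class for the first sample
probs = lda.predict_proba(X[:1])

# The predicted class is the one with the highest probability
predicted = lda.predict(X[:1])

print(probs.round(3))
print(predicted)
```

Each row of `probs` sums to 1, and `predicted` matches the class whose column holds the largest value.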

## Comparing PCA And LDA

In machine learning tasks, you may find yourself having to choose between PCA and LDA. PCA treats the entire dataset as one class, and after applying PCA, the resulting features are uncorrelated with each other. [PCA guarantees that output features will be linearly independent.] PCA is an unsupervised technique, whereas LDA requires labelled data. When the two projections are compared side by side, PCA reduces the data along the axes (x1, x2), while LDA models the class distributions along the axes (LD1, LD2). LDA with the LD1 and LD2 components shows better class separability.
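The difference in how the two techniques are called can be sketched as follows. The Iris dataset and the choice of two components are assumptions for illustration; note that PCA is fitted without labels, while LDA needs `y` and can produce at most (number of classes - 1) components.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Assumption: Iris (4 features, 3 classes) as an illustrative dataset
X, y = load_iris(return_X_y=True)

# PCA is unsupervised: the labels are never used
X_pca = PCA(n_components=2).fit_transform(X)

# LDA is supervised: the labels are required to fit the projection
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

# The PCA output features are uncorrelated (correlation close to 0)
pca_corr = np.corrcoef(X_pca, rowvar=False)[0, 1]
print(X_pca.shape, X_lda.shape, pca_corr)
```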

You should prefer PCA if the data is skewed or irregular (given LDA's tendency to overfit); for uniformly distributed data, LDA performs better. However, you can also apply PCA before LDA, which can help with regularization and reduce overfitting.
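One way to chain PCA and LDA is a scikit-learn pipeline. The sketch below is only an illustration: the dataset, the scaling step, the number of PCA components, and the 5-fold cross-validation are all assumptions, not settings prescribed by this article.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: Iris as an illustrative dataset
X, y = load_iris(return_X_y=True)

# Scale, reduce with PCA, then classify with LDA
# (n_components=3 is an arbitrary illustrative choice)
pca_lda = make_pipeline(
    StandardScaler(),
    PCA(n_components=3),
    LinearDiscriminantAnalysis(),
)

# 5-fold cross-validated accuracy of the combined model
scores = cross_val_score(pca_lda, X, y, cv=5)
print(scores.mean())
```

Wrapping the steps in a pipeline ensures that scaling and PCA are refitted on each training fold, so the held-out fold does not leak into the projection.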

## The LDA Algorithm

LDA makes two assumptions for simplicity:

1. The data follows a Gaussian distribution.
2. Each class has the same covariance matrix Σ.

The LDA algorithm for the general case (multi-class classification) is as follows.

Suppose that each of the C classes has a mean μ_i. The scatter between the classes is then calculated as:

$$\Sigma_b = \frac{1}{C} \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T$$

Here, μ is the average of the class means μ_i for i = 1…C.

The class separation S along the direction w is given by:

$$S = \frac{w^T \Sigma_b w}{w^T \Sigma_w w}$$

where Σ_w is the covariance within a class. When w is an eigenvector of Σ_w⁻¹ Σ_b, S will be equal to the corresponding eigenvalue.

Simply put, if Σ_w is invertible, the eigenspace corresponding to the C-1 largest eigenvalues of Σ_w⁻¹ Σ_b will form the reduced space.

Hence, the following steps go into computing an LDA:

1. Compute the mean vectors for all C classes in the data (let the dimensionality of the data be N).
2. Compute the scatter matrices: Σ_w (covariance within a class) and Σ_b (covariance between classes).
3. Compute the eigenvalues and eigenvectors of Σ_w⁻¹ Σ_b.
4. Select the eigenvectors corresponding to the top k eigenvalues, and build the transformation matrix of size N × k.
5. The resultant transformation matrix can be used for dimensionality reduction and class separation via LDA.
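The steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions rather than scikit-learn's internal implementation: the Iris dataset, the variable names, and the normalization of Σ_w by the total sample count are choices made here (rescaling a scatter matrix changes the eigenvalues but not the eigenvectors).

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: Iris as an illustrative dataset (N = 4 features, C = 3 classes)
X, y = load_iris(return_X_y=True)
classes = np.unique(y)
N = X.shape[1]

# Step 1: mean vector of each class, and the average of the class means
means = np.array([X[y == c].mean(axis=0) for c in classes])
mu = means.mean(axis=0)

# Step 2: within-class scatter (pooled over classes) ...
Sigma_w = np.zeros((N, N))
for c, m in zip(classes, means):
    centered = X[y == c] - m
    Sigma_w += centered.T @ centered
Sigma_w /= X.shape[0]  # normalization choice assumed for this sketch

# ... and between-class scatter of the class means
diffs = means - mu
Sigma_b = diffs.T @ diffs / len(classes)

# Step 3: eigen-decomposition of Sigma_w^{-1} Sigma_b
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sigma_w, Sigma_b))
# The true eigenvalues are real; drop any tiny imaginary round-off
eigvals, eigvecs = eigvals.real, eigvecs.real

# Step 4: keep the eigenvectors of the k = C - 1 largest eigenvalues
k = len(classes) - 1
order = np.argsort(eigvals)[::-1][:k]
W = eigvecs[:, order]  # transformation matrix of size N x k

# Step 5: project the data onto the reduced space
X_reduced = X @ W
print(W.shape, X_reduced.shape)
```

As a sanity check, the separation ratio computed along the first column of `W` should match the largest eigenvalue, which is the property the derivation relies on.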

## LDA Python Implementation For Classification

In this code, we:

1. Load the Iris dataset in sklearn
2. Normalize the feature set to improve classification accuracy (you can try running the code without the normalization and verify the loss of accuracy)
3. Compute the PCA, followed by LDA and PCA+LDA of the data
4. Visualize the computations using matplotlib
5. Evaluate the outputs from Step 3 using the sklearn RandomForest classifier
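A minimal sketch of these five steps could look like the following. It is not the original listing: the train/test split, the component counts, the random seeds, and the output file name `pca_lda_comparison.png` are all assumptions made for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: load the Iris dataset (split sizes and seed are assumptions)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# Step 2: normalize the features, fitting the scaler on training data only
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Step 3: PCA, LDA, and PCA followed by LDA
reducers = {
    "PCA": make_pipeline(PCA(n_components=2)),
    "LDA": make_pipeline(LinearDiscriminantAnalysis(n_components=2)),
    "PCA+LDA": make_pipeline(
        PCA(n_components=3), LinearDiscriminantAnalysis(n_components=2)
    ),
}
projections = {}
for name, reducer in reducers.items():
    Z_train = reducer.fit_transform(X_train_s, y_train)
    Z_test = reducer.transform(X_test_s)
    projections[name] = (Z_train, Z_test)

# Step 4: visualize the training projections side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, (name, (Z_train, _)) in zip(axes, projections.items()):
    ax.scatter(Z_train[:, 0], Z_train[:, 1], c=y_train)
    ax.set_title(name)
    ax.set_xlabel("Component 1")
    ax.set_ylabel("Component 2")
fig.savefig("pca_lda_comparison.png")  # hypothetical output file name

# Step 5: evaluate each projection with a RandomForest classifier
accuracies = {}
for name, (Z_train, Z_test) in projections.items():
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(Z_train, y_train)
    accuracies[name] = clf.score(Z_test, y_test)

print(accuracies)
```

To verify the effect of normalization mentioned in Step 2, pass `X_train` and `X_test` to the reducers instead of the scaled arrays and compare the resulting accuracies.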

## Conclusion

In this article, we focused on understanding LDA and the advantages it offers over PCA. We also looked at an LDA implementation on the Iris dataset using Python's scikit-learn library. The implementation compares PCA and LDA, and shows that applying PCA before LDA can have its benefits.