Bigdata and Hadoop

R Programming Series: Clustering using FactoExtra Package

April 3, 2020

3119

In this series, we have learned about Dynamic Map creation using ggmap and R, creating dynamic maps using ggplot2, 3D Visualization in R, Data Wrangling and Visualization in R, and Exploratory Data Analysis using the R programming language.

This will be the last article of this series on R programming language wherein we will know about Clustering using FactoExtraPackage.

Clustering is a type of unsupervised machine learning pattern. An unsupervised learning method is considered as a method in which helps us to draw references from data-sets which consists of input data without labeled responses. Generally, it is considered as a process to find meaningful structure, explanatory underlying processes and the required generative features with respect to inherent features.

Clustering involves the task of dividing the population or data points of the mentioned data-set into a number of groups such that data points in the same groups are considered more similar to other data points in the same group and in the same fashion dissimilar to the data points in other groups. Clustering includes a collection of objects on the basis of similarity and dissimilarity between them.

The best demonstration for clustering is the data points in the graph below clustered together can be classified into one single group. We can distinguish the clusters as mentioned in the figure mentioned below:

Intracluster distance is the sum of distances between objects in the same cluster. This distance should always be minimized. Intercluster distance is the distance between objects in a different cluster. This distance should always be maximized.

Importance of Clustering

Clustering is very important as it determines the intrinsic grouping among the unlabeled data which is present. Hence, it is considered as a supervised machine learning pattern. There are no specific criteria for good clustering. It depends on the specific user who has the criteria through which requirements are satisfied. The best demonstration can be taken as finding representatives for similar data groups for finding “natural clusters” and describe their unknown properties in finding different groups and finding unusual data objects (outlier detection).

Types of Clustering Algorithms

It is now equally important to understand clustering methods include methods that are used to deal with a large amount of data. There are two types of clustering which are explained below:

1. Hierarchical clustering: The clusters formed in this method forms a tree-type structure based on the hierarchy.

These find successive clusters using previously established clusters.

Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. It is also called a “bottom-up” approach. Divisive is also called the “top-down” algorithm to begin with the whole set and proceed to divide it into successively smaller clusters.

2. Partitional clustering: Partitional algorithms determine all clusters at once and randomly. It includes K-means and derivatives -k means algorithm can handle a large number of data points.

Let us follow the below-mentioned steps to implement the clustering method with FactoExtra package in R.

Step 1: Install the necessary packages which is required for creating clusters in R.

> install.packages("factoExtra")
> install.packages("cluster")

Step 2: Include the required libraries in the R workspace to implement the clustering procedure.

> library(factoextra)
> library(cluster)
> data(USArrests)
> USArrests <- na.omit(USArrests)
> # View the first 6 rows of the data
> head(USArrests, n = 6)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

Step 3: Create a module of descriptive statistics which includes minimum, median, mean, standard deviation, and maximum values.

> desc_stats <- data.frame(
+     Min = apply(USArrests, 2, min), # minimum
+     Med = apply(USArrests, 2, median), # median
+     Mean = apply(USArrests, 2, mean), # mean
+     SD = apply(USArrests, 2, sd), # Standard deviation
+     Max = apply(USArrests, 2, max) # Maximum
+ )
> desc_stats <- round(desc_stats, 1)
> desc_stats
          Min   Med  Mean   SD   Max
Murder    0.8   7.2   7.8  4.4  17.4
Assault  45.0 159.0 170.8 83.3 337.0
UrbanPop 32.0  66.0  65.5 14.5  91.0
Rape      7.3  20.1  21.2  9.4  46.0

Step 4: Scaling is an important feature for clustering as it focuses on creating the value range with a specific limit and allows us to maintain the intercluster and intracluster distance properly as defined in the condition mentioned above.

The formula for scaling the variables mentioned in the data frame is to subtract variable from mean and divide by their standard deviation t

scaled _parameter  = x-mean/standard deviation

> df<-scale(USArrests)
> df
                    Murder     Assault    UrbanPop         Rape
Alabama         1.24256408  0.78283935 -0.52090661 -0.003416473
Alaska          0.50786248  1.10682252 -1.21176419  2.484202941
Arizona         0.07163341  1.47880321  0.99898006  1.042878388
Arkansas        0.23234938  0.23086801 -1.07359268 -0.184916602
California      0.27826823  1.26281442  1.75892340  2.067820292
Colorado        0.02571456  0.39885929  0.86080854  1.864967207
Connecticut    -1.03041900 -0.72908214  0.79172279 -1.081740768
Delaware       -0.43347395  0.80683810  0.44629400 -0.579946294
Florida         1.74767144  1.97077766  0.99898006  1.138966691
Georgia         2.20685994  0.48285493 -0.38273510  0.487701523
Hawaii         -0.57123050 -1.49704226  1.20623733 -0.110181255
Idaho          -1.19113497 -0.60908837 -0.79724965 -0.750769945
Illinois        0.59970018  0.93883125  1.20623733  0.295524916
Indiana        -0.13500142 -0.69308401 -0.03730631 -0.024769429
Iowa           -1.28297267 -1.37704849 -0.58999237 -1.060387812
Kansas         -0.41051452 -0.66908525  0.03177945 -0.345063775
Kentucky        0.43898421 -0.74108152 -0.93542116 -0.526563903
Louisiana       1.74767144  0.93883125  0.03177945  0.103348309
Maine          -1.30593210 -1.05306531 -1.00450692 -1.434064548
Maryland        0.80633501  1.55079947  0.10086521  0.701231086
Massachusetts  -0.77786532 -0.26110644  1.34440885 -0.526563903
Michigan        0.99001041  1.01082751  0.58446551  1.480613993
Minnesota      -1.16817555 -1.18505846  0.03177945 -0.676034598
Mississippi     1.90838741  1.05882502 -1.48810723 -0.441152078
Missouri        0.27826823  0.08687549  0.30812248  0.743936999
Montana        -0.41051452 -0.74108152 -0.86633540 -0.515887425
Nebraska       -0.80082475 -0.82507715 -0.24456358 -0.505210947
Nevada          1.01296983  0.97482938  1.06806582  2.644350114
New Hampshire  -1.30593210 -1.36504911 -0.65907813 -1.252564419
New Jersey     -0.08908257 -0.14111267  1.62075188 -0.259651949
New Mexico      0.82929443  1.37080881  0.30812248  1.160319648
New York        0.76041616  0.99882813  1.41349461  0.519730957
North Carolina  1.19664523  1.99477641 -1.41902147 -0.547916860
North Dakota   -1.60440462 -1.50904164 -1.48810723 -1.487446939
Ohio           -0.11204199 -0.60908837  0.65355127  0.017936483
Oklahoma       -0.27275797 -0.23710769  0.16995096 -0.131534211
Oregon         -0.66306820 -0.14111267  0.10086521  0.861378259
Pennsylvania   -0.34163624 -0.77707965  0.44629400 -0.676034598
Rhode Island   -1.00745957  0.03887798  1.48258036 -1.380682157
South Carolina  1.51807718  1.29881255 -1.21176419  0.135377743
South Dakota   -0.91562187 -1.01706718 -1.41902147 -0.900240639
Tennessee       1.24256408  0.20686926 -0.45182086  0.605142783
Texas           1.12776696  0.36286116  0.99898006  0.455672088
Utah           -1.05337842 -0.60908837  0.99898006  0.178083656
Vermont        -1.28297267 -1.47304350 -2.31713632 -1.071064290
Virginia        0.16347111 -0.17711080 -0.17547783 -0.056798864
Washington     -0.86970302 -0.30910395  0.51537975  0.530407436
West Virginia  -0.47939280 -1.07706407 -1.83353601 -1.273917376
Wisconsin      -1.19113497 -1.41304662  0.03177945 -1.113770203
Wyoming        -0.22683912 -0.11711392 -0.38273510 -0.601299251

Step 5: Once the values are scaled for the mentioned data-frame, we can implement the k-means algorithm. The steps to be implemented for k-means is mentioned below:

The way the k-means algorithm works is as follows:

Include the number of clusters with specific number K.
Initialize centroids by the random shuffle of the dataset and then randomly selecting the required K data points from the centroids without replacement.
Complete the iteration until there is no change to the centroids.

> set.seed(123)
> km.res <- kmeans(df, 4, nstart = 25)
> head(km.res$cluster, 20)
    Alabama      Alaska     Arizona    Arkansas  California    Colorado Connecticut 
          1           4           4           1           4           4           3 
   Delaware     Florida     Georgia      Hawaii       Idaho    Illinois     Indiana 
          3           4           1           3           2           4           3 
       Iowa      Kansas    Kentucky   Louisiana       Maine    Maryland 
          2           3           2           1           2           4

Note: The kmeans() function is in-built within the “FactoExtra” package of R.

Step 6: Visualize the cluster as it is required to see the data format which is created as mentioned below:

> fviz_cluster(km.res, USArrests)
> res.km <- eclust(df, "kmeans")
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100)  [one "." per sample]:
.................................................. 50 
.................................................. 100