R Programming Series: Clustering using FactoExtra Package

In this series, we have learned about Dynamic Map creation using ggmap and R, creating dynamic maps using ggplot2, 3D Visualization in R, Data Wrangling and Visualization in R, and Exploratory Data Analysis using the R programming language.

This is the last article in this series on the R programming language, and it covers clustering using the FactoExtra package (installed in R as factoextra).

Clustering is a type of unsupervised machine learning. An unsupervised learning method draws inferences from data-sets that consist of input data without labeled responses. Generally, it is used to find meaningful structure, explanatory underlying processes, and generative features inherent in the data.

Clustering is the task of dividing the population or data points of a data-set into a number of groups such that data points in the same group are more similar to each other than to data points in other groups. In other words, clustering groups objects on the basis of the similarity and dissimilarity between them.

The best demonstration of clustering is the graph below, where data points that lie close together can be classified into a single group. We can distinguish the clusters as shown in the figure below:

[Figure: Clustering]

Intracluster distance is the sum of distances between objects in the same cluster. This distance should be minimized. Intercluster distance is the distance between objects in different clusters. This distance should be maximized.
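To make the two distances concrete, here is a minimal sketch in base R. The six points and their cluster labels are made-up values for illustration only, and averaging the intercluster distances is just one possible way to summarize them.

```r
# Toy 2-D data (made-up values): two well-separated groups of three points.
pts <- rbind(c(1, 1), c(1, 2), c(2, 1),
             c(8, 8), c(8, 9), c(9, 8))
labels <- c(1, 1, 1, 2, 2, 2)

# Pairwise Euclidean distances between all points.
d <- as.matrix(dist(pts))

# TRUE where two points carry the same cluster label.
same_cluster <- outer(labels, labels, "==")
# Use only the upper triangle so that each pair is counted once.
upper <- upper.tri(d)

# Intracluster: sum of distances between points in the same cluster.
intra <- sum(d[upper & same_cluster])
# Intercluster: average distance between points in different clusters.
inter <- mean(d[upper & !same_cluster])

cat("intracluster:", intra, " intercluster:", inter, "\n")
```

A good clustering of these points keeps the first number small and the second one large.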

Importance of Clustering 

Clustering is very important as it determines the intrinsic grouping among unlabeled data, which is why it is considered an unsupervised machine learning technique. There are no universal criteria for good clustering; it depends on the user and on the criteria that satisfy their requirements. Typical examples are finding representatives for groups of similar data, finding “natural clusters” and describing their unknown properties, finding different groups, and finding unusual data objects (outlier detection).

Types of Clustering Algorithms

It is equally important to understand the clustering methods themselves, including those used to deal with large amounts of data. There are two main types of clustering algorithms, which are explained below:

1. Hierarchical clustering: The clusters formed by this method form a tree-type structure based on the hierarchy.

These algorithms find successive clusters using previously established clusters.

Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. This is also called the “bottom-up” approach. Divisive algorithms, also called the “top-down” approach, begin with the whole set and proceed to divide it into successively smaller clusters.

2. Partitional clustering: Partitional algorithms determine all clusters at once, typically starting from a random initialization. This family includes k-means and its derivatives. The k-means algorithm can handle a large number of data points.
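The agglomerative (bottom-up) approach from point 1 can be tried with base R alone. The following is a minimal sketch, assuming R's built-in USArrests dataset as a stand-in, since the data used in this article is not shown. The Ward linkage and the cut into 4 groups are illustrative choices, not requirements.

```r
# Assumption: USArrests (built into R) stands in for the article's data.
data("USArrests")
df <- scale(USArrests)

# Agglomerative clustering: start with every state as its own cluster
# and merge the closest clusters step by step.
hc <- hclust(dist(df), method = "ward.D2")

# Cut the resulting tree into 4 groups and count the members of each.
groups <- cutree(hc, k = 4)
print(table(groups))

# The tree-type structure (dendrogram) described above.
plot(hc, cex = 0.6)
```

For the divisive (top-down) variant, the cluster package offers the diana() function.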

Let us follow the steps below to implement clustering with the FactoExtra package in R.

Step 1: Install the packages required for creating clusters in R.

Step 2: Include the required libraries in the R workspace to implement the clustering procedure.
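Steps 1 and 2 could look like the sketch below. It assumes that factoextra is the only extra package needed; the original list of packages is not shown, so treat this as an assumption. The CRAN mirror URL is also just one possible choice.

```r
# Step 1: install factoextra once (needs internet access).
# Skipped automatically when the package is already installed.
if (!requireNamespace("factoextra", quietly = TRUE)) {
  install.packages("factoextra", repos = "https://cloud.r-project.org")
}

# Step 2: attach the package to the current R session.
library(factoextra)
```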

Step 3: Compute descriptive statistics for each variable, including the minimum, median, mean, standard deviation, and maximum values.
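A minimal sketch of this step in base R, again assuming the built-in USArrests dataset as a stand-in for the article's data. The name desc_stats is only an illustrative choice.

```r
# Assumption: USArrests (built into R) stands in for the article's data.
data("USArrests")

# One row per variable, one column per statistic.
desc_stats <- data.frame(
  Min    = sapply(USArrests, min),
  Median = sapply(USArrests, median),
  Mean   = sapply(USArrests, mean),
  SD     = sapply(USArrests, sd),
  Max    = sapply(USArrests, max)
)

print(round(desc_stats, 1))
```

If the variables turn out to have very different ranges, that is a sign that scaling will be needed before clustering.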

Step 4: Scaling is an important step for clustering, as it brings all variables onto a comparable range so that no single variable dominates the distance calculation. This allows us to maintain the intercluster and intracluster distances properly, as described above.

The formula for scaling the variables in the data frame is to subtract the mean from each variable and divide by its standard deviation.
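The scaling formula can be written out by hand and compared with R's built-in scale() function. This is a sketch that assumes the built-in USArrests dataset as a stand-in for the article's data.

```r
# Assumption: USArrests (built into R) stands in for the article's data.
data("USArrests")
x <- as.matrix(USArrests)

# Manual scaling: z = (value - column mean) / column standard deviation.
centered <- sweep(x, 2, colMeans(x), "-")
manual   <- sweep(centered, 2, apply(x, 2, sd), "/")

# The same operation with the built-in helper.
df <- scale(x)

# Both approaches should agree, and every scaled column should have
# mean 0 and standard deviation 1 (up to rounding error).
print(all.equal(manual, df, check.attributes = FALSE))
print(round(colMeans(df), 10))
print(apply(df, 2, sd))
```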

Step 5: Once the values in the data-frame are scaled, we can implement the k-means algorithm. The k-means algorithm works as follows:

  1. Specify the number of clusters, K.
  2. Initialize the centroids by shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
  3. Assign each data point to its nearest centroid, recompute each centroid as the mean of its assigned points, and keep iterating until there is no change to the centroids.
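The steps above can be sketched as a small function in base R. This is an illustrative implementation only: simple_kmeans is a hypothetical helper name, it is not the code behind R's own kmeans(), and USArrests is assumed as a stand-in dataset.

```r
# Hypothetical helper for illustration; in practice use stats::kmeans().
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Step 2: pick K distinct rows at random as the initial centroids.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every point to every centroid
    # (one column per centroid).
    d2 <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centroids[j, ])^2)
    })
    # Assign each point to its nearest centroid.
    new_assignment <- max.col(-d2, ties.method = "first")

    # Step 3: stop once the assignments (and so the centroids) are stable.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Move each centroid to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(cluster = assignment, centers = centroids)
}

# Usage on the scaled stand-in data, with K = 4 (step 1).
set.seed(123)
res <- simple_kmeans(scale(USArrests), k = 4)
print(table(res$cluster))
```

Because the starting centroids are random, different seeds can lead to different clusters, which is why it is common to run the algorithm from several starting points.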

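Note: The kmeans() function is part of the “stats” package that ships with base R; the “FactoExtra” (factoextra) package provides functions such as fviz_cluster() for visualizing the result.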
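A call to kmeans() could look like the sketch below. It assumes the built-in USArrests dataset as a stand-in for the article's data; the seed and the nstart value are illustrative choices.

```r
# Assumption: USArrests (built into R) stands in for the article's data.
data("USArrests")
df <- scale(USArrests)

# Fix the seed so that the random initialization is reproducible.
set.seed(123)

# K = 4 clusters; nstart = 25 tries 25 random starts and keeps the best.
km <- kmeans(df, centers = 4, nstart = 25)

print(km$size)          # number of observations in each cluster
print(km$centers)       # centroid of each cluster (on the scaled data)
print(km$tot.withinss)  # total within-cluster sum of squares
```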

Step 6: Visualize the clusters to see how the data points have been grouped, as shown below:
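A plot of this kind can be produced with fviz_cluster() from factoextra. The sketch below assumes the package is installed and again uses the built-in USArrests dataset as a stand-in, so the result may differ from the figure in this article.

```r
library(factoextra)

# Assumption: USArrests (built into R) stands in for the article's data.
data("USArrests")
df <- scale(USArrests)

set.seed(123)
km <- kmeans(df, centers = 4, nstart = 25)

# Draw the observations on the first two principal components,
# colored by the cluster they were assigned to.
p <- fviz_cluster(km, data = df)
print(p)
```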

[Figure: Cluster plot]

The 4 clusters group the geographical areas according to the various features of their crime rates.

So, this was it from the R Programming Series!!
