Basics of Clustering in Data Science - Definition, Types, Applications, Kmeans, Basic Implementation of K means

WHAT IS CLUSTERING?

Clustering is an unsupervised machine learning technique that groups data points into clusters so that objects in the same cluster are more similar to each other than to objects in other clusters.

DIFFERENCE BETWEEN CLASSIFICATION, CLUSTERING AND REGRESSION 

It is already covered in the earlier blog - to learn about it CLICK HERE 

TYPES OF CLUSTERING

Hard Clustering
Each data point is a member of exactly 1 cluster.

Soft Clustering
A data point can belong to more than 1 cluster, that is, it can have fractional membership in each.
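
To make the difference concrete, here is a minimal sketch in plain Python. The two centers and the data point are hypothetical values, and the soft memberships are computed with a softmax over negative distances, which is only one illustrative weighting and not the method of any particular algorithm.

```python
import math

def hard_assignment(point, centers):
    """Return the index of the single nearest center (hard clustering)."""
    distances = [abs(point - c) for c in centers]
    return distances.index(min(distances))

def soft_membership(point, centers):
    """Return one fractional membership per center; the values sum to 1.

    Assumption: a softmax over negative distances, chosen for illustration.
    """
    weights = [math.exp(-abs(point - c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = [0.0, 10.0]   # hypothetical cluster centers
point = 4.0             # hypothetical data point

label = hard_assignment(point, centers)
memberships = soft_membership(point, centers)
print("hard label:", label)
print("soft memberships:", [round(m, 3) for m in memberships])
```

With hard clustering the point gets a single label, while with soft clustering it belongs mostly to the nearer cluster and partly to the other one.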
Flat Clustering
  1. The data scientist tells the machine how many clusters to group the data into.
  2. The result is a flat structure, with no hierarchy among the clusters.
Hierarchical Clustering
  1. Builds a hierarchy of clusters.
  2. The number of clusters does not have to be given in advance; it is decided from the hierarchy the algorithm builds.
  3. It does not scale well to big data.
  4. Time complexity - at least quadratic, O(n^2), in the number of data points.
  5. Results are reproducible in hierarchical clustering unlike K means.
  Agglomerative Clustering
  1. Bottom-up approach.
  2. Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
  3. Dendrograms are created.

  Divisive Clustering
  1. All observations start in 1 cluster, and splits are performed recursively as one moves down the hierarchy.
  2. Top-down approach.
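
The bottom-up idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it uses single linkage (the smallest gap between two clusters) as the merge rule, stops at a hypothetical `target_clusters` value instead of building the full dendrogram, and works on one-dimensional sample data. The function names are made up for this sketch.

```python
def cluster_distance(a, b):
    """Single-linkage distance: the smallest gap between any two members."""
    return min(abs(x - y) for x in a for y in b)

def agglomerative(points, target_clusters):
    """Merge the two closest clusters until only target_clusters remain."""
    clusters = [[p] for p in points]  # every observation starts in its own cluster
    merges = []                       # merged clusters, in the order they were formed
    while len(clusters) > target_clusters:
        # Find the pair of clusters with the smallest single-linkage distance.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = sorted(clusters[i] + clusters[j])
        merges.append(merged)
        # Replace the two merged clusters with their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters), merges

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
final_clusters, merge_history = agglomerative(data, target_clusters=3)
print(final_clusters)
print(merge_history)
```

The merge history is the information a dendrogram draws: read from first to last, it shows which clusters were joined as one moves up the hierarchy. The nested search over all pairs of clusters also hints at why this approach becomes slow on large datasets.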

APPLICATIONS
  1. Recommendation systems.
  2. Image segmentation.
  3. Market Segmentation.
  4. Anomaly detection.
  5. Social Network Analysis.
  6. Search Result Grouping.

WHAT IS K MEANS?

  1. One of the simplest unsupervised algorithms.
  2. Time complexity - linear in the number of data points, O(n), for a fixed k and a fixed number of iterations.
  3. The main idea is to define k centroids (1 for each cluster).
  4. K means can handle big data.
  5. It requires prior knowledge of k.
  6. It is found to work well when the shape of the clusters is hyperspherical.
  7. We start with a random choice of initial means, so the results can differ if it is run multiple times.

BASIC STEPS IN K MEANS

  1. Define the number of clusters k (2, 3, etc.) and pick k initial means.
  2. Assign each data point to the cluster whose mean is nearest.
  3. Recompute the mean of each cluster.
  4. Repeat 2 and 3 until the means no longer change.
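
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the function names, the `max_iterations` safety limit, the rule that an empty cluster keeps its previous mean, and the starting means are all assumptions made for this sketch.

```python
import math

def nearest_mean(point, means):
    """Index of the mean closest to point (Euclidean distance)."""
    return min(range(len(means)), key=lambda i: math.dist(point, means[i]))

def k_means(points, means, max_iterations=100):
    """Alternate assignment and mean update until the means stop changing."""
    for _ in range(max_iterations):
        # Assign every point to the cluster of its nearest mean.
        clusters = [[] for _ in means]
        for point in points:
            clusters[nearest_mean(point, means)].append(point)
        # Recompute each mean; an empty cluster keeps its previous mean.
        new_means = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else mean
            for cluster, mean in zip(clusters, means)
        ]
        if new_means == means:
            break  # same means as last time: converged
        means = new_means
    return means, clusters

points = [(2, 12), (3, 20), (4, 25), (10, 30), (11, 36)]
initial_means = [(2, 12), (11, 36)]  # hypothetical starting means
final_means, final_clusters = k_means(points, initial_means)
print(final_means)
print(final_clusters)
```

Because the outcome depends on where the means start, a different choice of `initial_means` may lead to a different grouping of the same points.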

EXAMPLE OF K MEANS

Data = {2,3,4,10,11,12,20,25,30}
Let k=2 (number of clusters)

Let us pick 2 initial means at random: m1=4 and m2=12. Assigning each value to the nearer mean gives:
k1 = {2,3,4}
k2 = {10,11,12,20,25,30}

m1= (2+3+4)/3 = 9/3 = 3
m2= (10+11+12+20+25+30)/6 = 108/6 = 18

m1=3 and m2=18
k1= {2,3,4,10}
k2={11,12,20,25,30}

m1=(2+3+4+10)/4 = 4.75 ~ 5
m2= (11+12+20+25+30)/5= 19.6 ~ 20

m1 =5  and m2= 20
k1={2,3,4,10,11,12}
k2= {20,25,30}

m1= (2+3+4+10+11+12)/6 =7
m2=(20+25+30)/3= 25

m1 =7 and m2=25
k1={2,3,4,10,11,12}
k2= {20,25,30}

m1= (2+3+4+10+11+12)/6 =7
m2=(20+25+30)/3= 25


The clusters and means are the same as in the previous step, so the algorithm has converged.
Therefore, m1=7 and m2=25
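
The hand calculation can be checked with a short script. This is a sketch under two assumptions: a value exactly halfway between the two means goes to the first cluster, and the means are kept unrounded (the example above rounds 4.75 to 5 and 19.6 to 20). The variable names are chosen for this sketch.

```python
def mean(values):
    return sum(values) / len(values)

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
m1, m2 = 4.0, 12.0   # the initial means chosen in the example
history = []         # means computed at every step

while True:
    # Put each value in the cluster whose mean is nearer (ties go to k1).
    k1 = [x for x in data if abs(x - m1) <= abs(x - m2)]
    k2 = [x for x in data if abs(x - m1) > abs(x - m2)]
    new_m1, new_m2 = mean(k1), mean(k2)
    history.append((new_m1, new_m2))
    print(k1, k2, new_m1, new_m2)
    if (new_m1, new_m2) == (m1, m2):
        break  # the means did not change, so we are done
    m1, m2 = new_m1, new_m2
```

Even without rounding, the clusters at each step match the ones worked out by hand, and the script stops at m1=7 and m2=25.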

BASIC IMPLEMENTATION OF KMEANS CLUSTERING ALGORITHM
[WITH 2 CLUSTERS]

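import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Sample data: x and y coordinates of five points
x = [2,3,4,10,11]
y = [12,20,25,30,36]

# Plot the raw points before clustering
plt.scatter(x,y)
plt.show()

# The same points as an array with one row per point
X = np.array([[2,12], [3,20], [4,25], [10,30],[11,36]])

# Fit K means with 2 clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
centroids = kmeans.cluster_centers_  # one row per cluster centre
labels = kmeans.labels_              # cluster index assigned to each point
print('centroids \n')
print(centroids)
print('\n labels \n')
print(labels)

# One colour/marker format string per cluster label
colors = ["g.","r.","c.","y."]

# Plot each point in the colour of its cluster
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

# Mark the centroids with an x
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()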



[WITH 3 CLUSTERS]

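import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Sample data: x and y coordinates of seven points
x = [2,3,4,45,11,60,20]
y = [12,20,25,40,36,45,20]

# Plot the raw points before clustering
plt.scatter(x,y)
plt.show()

# Build the array from x and y so the clustered points match the plotted ones
X = np.column_stack((x, y))

# Fit K means with 3 clusters
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
centroids = kmeans.cluster_centers_  # one row per cluster centre
labels = kmeans.labels_              # cluster index assigned to each point
print('centroids \n')
print(centroids)
print('\n labels \n')
print(labels)

# One colour/marker format string per cluster label
colors = ["g.","r.","c.","y."]

# Plot each point in the colour of its cluster
for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize = 10)

# Mark the centroids with an x
plt.scatter(centroids[:, 0],centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()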




