Basics of Clustering in Data Science - Definition, Types, Applications, K-means, and a Basic Implementation of K-means
WHAT IS CLUSTERING?
It is an unsupervised machine learning algorithm that groups data points into clusters, so that objects in the same cluster are more similar to each other than to objects in other clusters.
DIFFERENCE BETWEEN CLASSIFICATION, CLUSTERING AND REGRESSION
This was already covered in an earlier blog; to learn about it, CLICK HERE.
TYPES OF CLUSTERING

Hard Clustering
Each data point is a member of exactly one cluster.

Soft Clustering
A data point can belong to more than one cluster, i.e. it can have a fractional membership in each.

Flat Clustering
- The scientist tells the machine how many clusters to group the data into.
- Produces a flat structure (no hierarchy).

Hierarchical Clustering
- Builds a hierarchy of clusters.
- The machine is allowed to decide how many clusters to create based on its own algorithm.
- It cannot handle big data.
- Time complexity: quadratic, O(n^2).
- Results are reproducible, unlike K-means.

Agglomerative Clustering
- Bottom-up approach.
- Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Dendrograms are created.

Divisive Clustering
- All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
- Top-down approach.
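As a minimal sketch of the agglomerative (bottom-up) approach described above, SciPy's hierarchy module can merge the nearest clusters step by step and then cut the resulting tree into a chosen number of flat clusters (the same tree can also be drawn as a dendrogram with `scipy.cluster.hierarchy.dendrogram`). The 1-D sample points here are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 1-D data points, one observation per row.
data = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [20.0], [25.0], [30.0]])

# Bottom-up: every point starts in its own cluster, and the closest pair of
# clusters is merged at each step ("single" linkage = nearest-neighbour distance).
Z = linkage(data, method="single")

# Cut the hierarchy so that exactly 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Because the largest gap in the data lies between 12 and 20, the two-cluster cut separates {2, 3, 4, 10, 11, 12} from {20, 25, 30}.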
APPLICATIONS
- Recommendation systems.
- Image segmentation.
- Market segmentation.
- Anomaly detection.
- Social network analysis.
- Search result grouping.
WHAT IS K MEANS?
- One of the simplest unsupervised learning algorithms.
- Time complexity: linear in the number of points, O(n).
- The main idea is to define k centroids, one for each cluster.
- K-means can handle big data.
- It requires the number of clusters, k, to be chosen in advance.
- It is found to work well when the clusters are roughly hyperspherical in shape.
- It starts from a random choice of initial means, so the results can differ if it is run multiple times.
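Because of that random start, scikit-learn's KMeans accepts a random_state parameter; fixing it pins the initialisation so repeated runs give identical results. A small sketch (the data points here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 12], [3, 20], [4, 25], [10, 30], [11, 36]])

# Fixing random_state pins the random centroid initialisation, so two
# separately fitted models produce identical centroids and labels.
a = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
b = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(np.allclose(a.cluster_centers_, b.cluster_centers_))  # True
```

Omitting random_state leaves the initialisation to chance, which is exactly why unseeded K-means runs are not guaranteed to be reproducible.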
BASIC STEPS IN K MEANS
- Define the number of clusters, k (e.g. 2 or 3), and pick k initial means.
- Assign each data point to the cluster whose mean is nearest, then recompute each cluster's mean.
- Repeat the previous step until the means no longer change.
EXAMPLE OF K MEANS
Data = { 2, 3, 4, 10, 11, 12, 20, 25, 30 }
Let k = 2 (number of clusters).
Randomly choose two initial means: m1 = 4 and m2 = 12.

Iteration 1: assign each point to the nearest mean, then recompute the means.
k1 = {2, 3, 4}
k2 = {10, 11, 12, 20, 25, 30}
m1 = (2+3+4)/3 = 9/3 = 3
m2 = (10+11+12+20+25+30)/6 = 108/6 = 18
Now m1 = 3 and m2 = 18.

Iteration 2:
k1 = {2, 3, 4, 10}
k2 = {11, 12, 20, 25, 30}
m1 = (2+3+4+10)/4 = 4.75 ≈ 5
m2 = (11+12+20+25+30)/5 = 19.6 ≈ 20
Now m1 = 5 and m2 = 20.

Iteration 3:
k1 = {2, 3, 4, 10, 11, 12}
k2 = {20, 25, 30}
m1 = (2+3+4+10+11+12)/6 = 42/6 = 7
m2 = (20+25+30)/3 = 75/3 = 25
Now m1 = 7 and m2 = 25.

Iteration 4:
k1 = {2, 3, 4, 10, 11, 12}
k2 = {20, 25, 30}
m1 = 7 and m2 = 25

The means did not change between iterations, so the algorithm has converged.
Therefore, m1 = 7 and m2 = 25.
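The hand-worked iterations can be reproduced with a minimal from-scratch sketch of 1-D K-means (an illustration of the assign/update loop, not the scikit-learn implementation; it assumes no cluster ever becomes empty):

```python
def kmeans_1d(data, means):
    """Lloyd's algorithm on 1-D data, starting from the given initial means."""
    while True:
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        # Update step: each mean becomes the average of its cluster.
        # (Assumes no cluster is empty, which holds for this data.)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:  # converged: the means no longer change
            return means, clusters
        means = new_means

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
means, clusters = kmeans_1d(data, [4, 12])
print(means)     # [7.0, 25.0]
print(clusters)  # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
```

Unlike the hand calculation, this sketch keeps exact intermediate means (4.75 rather than 5, 19.6 rather than 20), yet it converges to the same final means, m1 = 7 and m2 = 25.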
BASIC IMPLEMENTATION OF THE K-MEANS CLUSTERING ALGORITHM
[WITH 2 CLUSTERS]
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Plot the raw data points first
x = [2, 3, 4, 10, 11]
y = [12, 20, 25, 30, 36]
plt.scatter(x, y)
plt.show()

# The same points as an array of (x, y) pairs for scikit-learn
X = np.array([[2, 12], [3, 20], [4, 25], [10, 30], [11, 36]])

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print('centroids\n')
print(centroids)
print('\nlabels\n')
print(labels)

# One marker style per cluster label
colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Mark the cluster centroids with an "x"
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
[WITH 3 CLUSTERS]
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Plot the raw data points first
x = [2, 3, 4, 45, 11, 60, 20]
y = [12, 20, 25, 40, 36, 45, 20]
plt.scatter(x, y)
plt.show()

# The same points as an array of (x, y) pairs for scikit-learn
X = np.array([[2, 12], [3, 20], [4, 25], [45, 40], [11, 36], [60, 45], [20, 20]])

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print('centroids\n')
print(centroids)
print('\nlabels\n')
print(labels)

# One marker style per cluster label
colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Mark the cluster centroids with an "x"
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
MORE DATA SCIENCE BLOGS CLICK HERE