Basics of Clustering in Data Science - Definition, Types, Applications, K-means, and a Basic Implementation of K-means
WHAT IS CLUSTERING?
It is an unsupervised machine learning algorithm that groups data points into clusters, so that objects in the same cluster are more similar to each other than to objects in other clusters.
DIFFERENCE BETWEEN CLASSIFICATION, CLUSTERING AND REGRESSION
This was already covered in an earlier blog; to learn about it, CLICK HERE.
TYPES OF CLUSTERING

Hard Clustering
Each data point is a member of exactly one cluster.

Soft Clustering
A data point can belong to more than one cluster, i.e. it can have a fractional membership in each.

Flat Clustering
- The scientist tells the machine how many clusters to group the data into.
- Produces a flat structure (no hierarchy).

Hierarchical Clustering
- Builds a hierarchy of clusters.
- The machine is allowed to decide how many clusters to create based on its own algorithm.
- It cannot handle big data.
- Time complexity: quadratic, O(n^2).
- Results are reproducible, unlike K-means.

Agglomerative Clustering
- Bottom-up approach.
- Each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
- Dendrograms are created.

Divisive Clustering
- All observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
- Top-down approach.
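As a minimal sketch of the agglomerative (bottom-up) approach described above, SciPy's hierarchy module can merge the nearest clusters step by step and then cut the resulting tree into a chosen number of flat clusters (the same tree can also be drawn as a dendrogram with `scipy.cluster.hierarchy.dendrogram`). The 1-D sample points here are purely illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Illustrative 1-D data points, one observation per row.
data = np.array([[2.0], [3.0], [4.0], [10.0], [11.0], [12.0], [20.0], [25.0], [30.0]])

# Bottom-up: every point starts in its own cluster, and the closest pair of
# clusters is merged at each step ("single" linkage = nearest-neighbour distance).
Z = linkage(data, method="single")

# Cut the hierarchy so that exactly 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Because the largest gap in the data lies between 12 and 20, the two-cluster cut separates {2, 3, 4, 10, 11, 12} from {20, 25, 30}.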
APPLICATIONS
- Recommendation systems.
- Image segmentation.
- Market segmentation.
- Anomaly detection.
- Social network analysis.
- Search result grouping.
WHAT IS K MEANS?
- One of the simplest unsupervised learning algorithms.
- Time complexity: linear in the number of points, O(n).
- The main idea is to define k centroids, one for each cluster.
- K-means can handle big data.
- It requires the number of clusters, k, to be chosen in advance.
- It is found to work well when the clusters are roughly hyperspherical in shape.
- It starts from a random choice of initial means, so the results can differ if it is run multiple times.
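Because of that random start, scikit-learn's KMeans accepts a random_state parameter; fixing it pins the initialisation so repeated runs give identical results. A small sketch (the data points here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[2, 12], [3, 20], [4, 25], [10, 30], [11, 36]])

# Fixing random_state pins the random centroid initialisation, so two
# separately fitted models produce identical centroids and labels.
a = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
b = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

print(np.allclose(a.cluster_centers_, b.cluster_centers_))  # True
```

Omitting random_state leaves the initialisation to chance, which is exactly why unseeded K-means runs are not guaranteed to be reproducible.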
BASIC STEPS IN K MEANS
- Define the number of clusters, k (e.g. 2 or 3), and pick k initial means.
- Assign each data point to the cluster whose mean is nearest, then recompute each cluster's mean.
- Repeat the previous step until the means no longer change.
EXAMPLE OF K MEANS
Data = { 2, 3, 4, 10, 11, 12, 20, 25, 30 }
Let k = 2 (number of clusters).
Randomly choose two initial means: m1 = 4 and m2 = 12.

Iteration 1: assign each point to the nearest mean, then recompute the means.
k1 = {2, 3, 4}
k2 = {10, 11, 12, 20, 25, 30}
m1 = (2+3+4)/3 = 9/3 = 3
m2 = (10+11+12+20+25+30)/6 = 108/6 = 18
Now m1 = 3 and m2 = 18.

Iteration 2:
k1 = {2, 3, 4, 10}
k2 = {11, 12, 20, 25, 30}
m1 = (2+3+4+10)/4 = 4.75 ≈ 5
m2 = (11+12+20+25+30)/5 = 19.6 ≈ 20
Now m1 = 5 and m2 = 20.

Iteration 3:
k1 = {2, 3, 4, 10, 11, 12}
k2 = {20, 25, 30}
m1 = (2+3+4+10+11+12)/6 = 42/6 = 7
m2 = (20+25+30)/3 = 75/3 = 25
Now m1 = 7 and m2 = 25.

Iteration 4:
k1 = {2, 3, 4, 10, 11, 12}
k2 = {20, 25, 30}
m1 = 7 and m2 = 25

The means did not change between iterations, so the algorithm has converged.
Therefore, m1 = 7 and m2 = 25.
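The hand-worked iterations can be reproduced with a minimal from-scratch sketch of 1-D K-means (an illustration of the assign/update loop, not the scikit-learn implementation; it assumes no cluster ever becomes empty):

```python
def kmeans_1d(data, means):
    """Lloyd's algorithm on 1-D data, starting from the given initial means."""
    while True:
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in means]
        for x in data:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        # Update step: each mean becomes the average of its cluster.
        # (Assumes no cluster is empty, which holds for this data.)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:  # converged: the means no longer change
            return means, clusters
        means = new_means

data = [2, 3, 4, 10, 11, 12, 20, 25, 30]
means, clusters = kmeans_1d(data, [4, 12])
print(means)     # [7.0, 25.0]
print(clusters)  # [[2, 3, 4, 10, 11, 12], [20, 25, 30]]
```

Unlike the hand calculation, this sketch keeps exact intermediate means (4.75 rather than 5, 19.6 rather than 20), yet it converges to the same final means, m1 = 7 and m2 = 25.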
BASIC IMPLEMENTATION OF THE K-MEANS CLUSTERING ALGORITHM
[WITH 2 CLUSTERS]
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Plot the raw data points first
x = [2, 3, 4, 10, 11]
y = [12, 20, 25, 30, 36]
plt.scatter(x, y)
plt.show()

# The same points as an array of (x, y) pairs for scikit-learn
X = np.array([[2, 12], [3, 20], [4, 25], [10, 30], [11, 36]])

kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print('centroids\n')
print(centroids)
print('\nlabels\n')
print(labels)

# One marker style per cluster label
colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Mark the cluster centroids with an "x"
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
[WITH 3 CLUSTERS]
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
from sklearn.cluster import KMeans

style.use("ggplot")

# Plot the raw data points first
x = [2, 3, 4, 45, 11, 60, 20]
y = [12, 20, 25, 40, 36, 45, 20]
plt.scatter(x, y)
plt.show()

# The same points as an array of (x, y) pairs for scikit-learn
X = np.array([[2, 12], [3, 20], [4, 25], [45, 40], [11, 36], [60, 45], [20, 20]])

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print('centroids\n')
print(centroids)
print('\nlabels\n')
print(labels)

# One marker style per cluster label
colors = ["g.", "r.", "c.", "y."]
for i in range(len(X)):
    print("coordinate:", X[i], "label:", labels[i])
    plt.plot(X[i][0], X[i][1], colors[labels[i]], markersize=10)

# Mark the cluster centroids with an "x"
plt.scatter(centroids[:, 0], centroids[:, 1], marker="x", s=150, linewidths=5, zorder=10)
plt.show()
MORE DATA SCIENCE BLOGS CLICK HERE