K-Means Clustering

Advik
7 min readSep 7, 2021

--

Consider a dataset given of points without any label, the dataset looks difficult to understand and do further processing. So it becomes crucial to group the data in a set of features. Here the role of the K-Means Algorithm comes into play.

K-Means Algorithm

K-Means Clustering is an unsupervised learning algorithm that is used to solve clustering problems in machine learning or data science by creating a partition of n observations into k clusters where each observation belongs to the cluster with the nearest mean.

A cluster refers to a collection of data points aggregated together because of certain similarities. and Clustering is the process of dividing the entire data into groups (also known as clusters) based on the patterns in the data. Here K is used to define the number of clusters.

The procedure of K-Means Clustering is as follows:

  1. It starts by initializing by the value of K i.e, the number of clusters formed in the dataset.
  2. Assign each data point to their closest centroid, which will form the predefined K clusters.
  3. Then the item of the dataset is categorized according to the closest mean, and keep on iterating this process with the updated value of the mean.
  4. When the centroids become stabilized i.e, no change is observed in the value of the mean, the cluster creation is successful.
  5. We repeat the process for a given number of iterations and at the end, we have our clusters.

Here are 50 data points with three randomly initiated centroids.

  • In Iteration 1 it shows initiated centroids
  • In Iteration 2, the location of centroids is updated by taking into account the other data points.
  • In Iteration 3, it can be observed that the blue points are moving towards the centroid
  • Jumping to iteration 6, we see the red centroid has moved further to the right.
  • In Iteration 9, it shows the green section is much smaller than in iteration 2, blue has taken over the top, and the red centroid is thinner than in iteration 6.
  • The 9th iteration’s results were the same as the 8th iteration, so it has “converged”. Hence the clustering is successful now.

You might be feeling confused about how the algorithm actually works, let’s take that in detail.

Let us go through the above steps using the example below.

As we know the K-Means Clustering algorithm iterates between two steps assigning data points and updating Centroids. The data point is assigned to its nearest centroid based on the squared Euclidean distance.

Euclidean distance — Euclidean is often the “default” distance used in K-means (clustering) to find the “k closest points” of a particular sample point. And it can be defined as The straight line distance between two points. In a plane with p at (x1, y1) and q at (x2, y2), it is √((x1 — x2)² + (y1 — y2)²).

K-Means is implicitly based on pairwise Euclidean distances between data points because the sum of squared deviations from centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. The term “centroid” is itself from Euclidean geometry. It is a multivariate mean in the euclidean space. Euclidean space is about euclidean distances. Non-Euclidean distances will generally not span Euclidean space. That’s why K-Means is for Euclidean distances only.

  1. Consider 4 data points A,B,C,D as below

2. Choose two centroids AB and CD, calculated as

AB = Average of A, B

CD = Average of C,D

Two centroids AB, CD

3. Calculate squared euclidean distance between all data points to the centroids AB, CD. For example, the distance between A(2,3) and AB (4,2) can be given by s = (2–4)² + (3–2)².

A is very near to CD than AB

4. If we observe in the fig, the highlighted distance between (A, CD) is 4 and is less compared to (AB, A) which is 5. Since point A is close to the CD we can move A to the CD cluster.

5. There are two clusters formed so far, let recompute the centroids i.e, B, ACD similar to step 2.

ACD = Average of A, C, D

B = B

6. As we know K-Means is an iterative procedure now we have to calculate the distance of all points (A, B, C, D) to new centroids (B, ACD ) similar to step 3.

7. In the above picture, we can see that the respective cluster values are minimum and that A is too far from cluster B and near to cluster ACD. All data points are assigned to clusters (B, ACD ) based on their minimum distance. The iterative procedure ends here.

8. To conclude, we have started with two centroids and end up with two clusters, K=2.

By this means the centroid of that specific point will update and if in two successive iterations the output obtained is the same that means K-Means clustering is successful.

K-Means Application

It is used to group unlabeled data and used mostly in the field of

  • Customer Profiling
  • Market segmentation
  • Computer vision
  • Geo-statistics
  • Astronomy
  • Document clustering
  • Identifying crime-prone areas
  • Customer segmentation
  • Insurance fraud detection
  • Public transport data analysis
  • Clustering of IT alerts

Real-life Use Cases

Network Security Based on K-Means Clustering Algorithm in Data Mining Research

Intrusion detection systems are created through real-time monitoring of network traffic and take corresponding measures when the suspicious transfer of suspicious problems of a brand new network security device. The intrusion detection system could be a system that may detect all software and hardware, and therefore the application value is high. At the moment the system has already become the most network security management tool, can collect different set information within the system, and so combined with the function of the system of detection and automatic response. The intrusion detection system may be a behavior classifier, which operates through the judgment of data intrusion and non-invasive behavior.

Data mining algorithm consists of cluster analysis algorithm, correlation analysis, and classification algorithm. Cluster analysis may be a common method in data processing analysis, which might be accustomed to show unsupervised anomaly detection and might solve problems existing in traditional data processing methods. This method may be employed in a brand new database without having to depend upon predetermined data categories and data category samples in intrusion detection systems. Cluster analysis creates a decent environment for the establishment of intrusion detection systems. The intrusion detection system is especially to differentiate normal behavior and abnormal behavior so take corresponding measures.

By clustering algorithm, one group can’t distinguish between normal and abnormal processing, can summarize and find footing, so make a distinction. Clustering algorithm. Therefore, the appliance of unsupervised clustering algorithms within the field of abnormal detection can improve the detection efficiency of intrusion detection systems, and therefore the exercise value is higher. K — means algorithm first determines the input parameters, then within the sample data is split into K class, the identical data in a very cluster similarity is high, the middle of the cluster must be from the similarity of information within the group of the lower average.

According to the properties of a given connection record, the properties are often accustomed to determine the 2 styles of abnormal clustering and normal clustering. The exception clustering represents the clustering of the abnormal connection records, and therefore the normal clustering represents the clustering of the conventional connection records. A threshold is employed to record the record of the connection above the edge for the traditional clustering, whereas the opposite is exception clustering. Using cluster analysis result intrusion methods that connect records, first carries on the standardization, so from the cluster aggregation clustering, to seek out the proper to his central value near the gap, complete classification operation in step with the tag.

The average detection rate of 89.24%, average 0.77%, the rate of false positives and the detection rate and the rate of false positives there are no big fluctuations, the correctness of the algorithm applied in the network security system is verified.

--

--