K-Means Clustering and Its Use Cases in the Security Domain

Sanchita Agrawal
Aug 12, 2021

What is Clustering?

Clustering is the process of dividing a dataset into groups (also known as clusters) based on patterns in the data, so that items in the same group are more similar to each other than to items in other groups.

What is K-means clustering?

K-means clustering is a widely used unsupervised machine learning algorithm. It groups similar items into clusters, and the number of clusters is denoted by K.

K-means algorithm

K-means clustering is an unsupervised learning algorithm that groups an unlabeled dataset into clusters. Here K is the number of clusters, chosen before the algorithm runs: with K=2 there will be two clusters, with K=3 there will be three clusters, and so on.

It is a centroid-based algorithm: each cluster is associated with a centroid, which is the mean of the points assigned to that cluster. The aim of the algorithm is to minimize the sum of squared distances between each data point and the centroid of its cluster.
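To make the objective concrete, here is a minimal sketch of how the quantity being minimized could be computed. The function name and the sample points, labels, and centroids are made up for illustration; they are not from any particular library.

```python
def within_cluster_sum_of_squares(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Hypothetical data: two tight groups of two points each.
points = [(1.0, 1.0), (1.0, 3.0), (9.0, 9.0), (9.0, 11.0)]
centroids = [(1.0, 2.0), (9.0, 10.0)]

good_labels = [0, 0, 1, 1]  # each point assigned to its nearest centroid
bad_labels = [0, 1, 0, 1]   # two points assigned to the far centroid

good_score = within_cluster_sum_of_squares(points, good_labels, centroids)
bad_score = within_cluster_sum_of_squares(points, bad_labels, centroids)
print(good_score, bad_score)
```

A lower value means the points sit closer to their centroids, so the first labeling scores better than the second.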

You must supply two things to the algorithm:

  • The data points themselves
  • K — The number of clusters

After the algorithm finishes, it produces these outputs:

  • A label for each data point
  • The center (centroid) of each cluster

The K-means algorithm works as follows:

Step-1: Choose the number of clusters, K.

Step-2: Select K random points as the initial centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to its closest centroid, which forms the K clusters.

Step-4: Compute the mean of the points in each cluster and move that cluster's centroid to it.

Step-5: Repeat step 3, that is, reassign each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to step 4; otherwise, finish.

Step-7: The model is ready.
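The steps above can be sketched in plain Python roughly as follows. This is an illustrative sketch, not a production implementation. It assumes Euclidean distance, picks the initial centroids from the data points themselves (one common choice), and keeps a centroid unchanged if its cluster becomes empty. The function names, the iteration cap, and the sample data are all made up for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Steps 3 and 5: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop as soon as no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

The function returns the two outputs described earlier: one label per data point and one center per cluster. Which cluster gets label 0 depends on the random initialization, so only the grouping is meaningful, not the label numbers.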

Real Use-Cases in the Security Domain

Cyber Profiling of Criminals

Cyber profiling is the exploration of data to determine what users were doing at the time of internet access. One method that can support the profiling process is the K-means algorithm. With it, the data can be grouped by the number of websites visited. The grouping aims to show which websites a user accesses frequently.

The idea of cyber profiling is derived from criminal profiling, which gives the investigating division information for classifying the types of criminals who were at a crime scene.
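As a rough illustration of grouping by the number of websites visited, the sketch below runs a one-dimensional K-means on visit counts. The user names, the counts, the choice of K=3, and the evenly spaced starting centroids are all assumptions made for this example, not details from any real profiling study.

```python
def cluster_visit_counts(counts, k, max_iterations=100):
    """One-dimensional K-means on visit counts (requires k >= 2)."""
    lowest, highest = min(counts), max(counts)
    # Assumption: start with centroids evenly spaced between the min and max count.
    centroids = [lowest + (highest - lowest) * i / (k - 1) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assign each count to the nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: abs(c - centroids[j])) for c in counts
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of its assigned counts.
        for j in range(k):
            members = [c for c, label in zip(counts, labels) if label == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return labels, centroids


# Hypothetical data: number of websites visited per user.
visits = {
    "user_a": 3, "user_b": 5, "user_c": 4,
    "user_d": 48, "user_e": 52,
    "user_f": 120, "user_g": 115,
}
labels, centroids = cluster_visit_counts(list(visits.values()), k=3)

# Collect user names per cluster for inspection.
groups = {}
for user, label in zip(visits, labels):
    groups.setdefault(label, []).append(user)
print(groups)
print(centroids)
```

With this data the users fall into light, moderate, and heavy browsing groups. In practice the features, the value of K, and the initialization would need to be chosen for the data at hand.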

Document Classification

There are many reasons to analyze a collection of documents. One common task is to group documents into categories based on their tags, topics, and content. Although this is often described as classification, grouping documents without predefined labels is a clustering problem, and k-means is well suited to it. The documents first need to be preprocessed so that each one is represented as a vector, typically using term frequency to identify commonly used terms that help distinguish the documents.
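The vectorization step might look something like the sketch below, which turns each document into a term-frequency vector over a shared vocabulary. It assumes simple lowercase, whitespace-based tokenization; the function name and the sample documents are made up for illustration. Real pipelines usually add steps such as stop-word removal.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Return a sorted vocabulary and one term-frequency vector per document."""
    # Assumption: lowercase and split on whitespace as a stand-in for real tokenization.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid division by zero for an empty document
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


# Hypothetical documents.
documents = [
    "firewall blocked port scan",
    "port scan detected",
    "invoice payment due",
]
vocabulary, vectors = term_frequency_vectors(documents)
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one entry per vocabulary term, so all documents end up as points in the same numeric space, which is the form of input K-means expects.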

Applications of K-means clustering

  • Identifying crime-prone areas
  • Market Segmentation
  • Document Clustering
  • Image Segmentation
  • Image Compression
  • Customer Profiling

and many more.

K-means works best on data that is numeric and continuous and has a relatively small number of dimensions.

Thank you!
