Clustering is the task of finding homogeneous subgroups within the data such that data points in each cluster are as similar as possible according to a similarity measure such as Euclidean distance.
It is one of the most common data analysis techniques used to learn about the structure of the data.
The choice of similarity measure depends on the application.
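To make the idea of a similarity measure concrete, here is a minimal sketch of Euclidean distance in plain Python. Manhattan distance is included only as one example of an alternative measure; it is not mentioned above and is an assumption added for contrast. The function names and the sample points are illustrative.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (an alternative measure, shown for contrast)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two illustrative points in a 2-D feature space.
p = [1.0, 2.0]
q = [4.0, 6.0]

print(euclidean_distance(p, q))  # 5.0
print(manhattan_distance(p, q))  # 7.0
```

The smaller the distance, the more similar two points are considered to be. Swapping the distance function changes which points count as "close", which is why the measure should match the application.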
Cluster analysis can be done on the basis of features, where we look for subgroups of samples based on their features, or on the basis of samples, where we look for subgroups of features based on the samples. For example, clustering is used in market segmentation.
Clustering is considered an unsupervised learning method.
In clustering, we try to find the structure of data by grouping the data points into distinct subgroups.
K-means is one of the most widely used clustering algorithms, largely because of its simplicity. The K-means algorithm:
- is an iterative algorithm that partitions the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group.
- tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible.
- assigns data points to clusters such that the sum of squared distances between the data points and their cluster's centroid is minimized.
- produces more homogeneous clusters the lower the variance within each cluster is.
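The quantity K-means tries to keep small can be written down in a few lines. The following is a minimal sketch, assuming points and centroids are plain lists of numbers and that `labels[i]` holds the index of the cluster that point `i` belongs to; the function name `within_cluster_ss` and the toy data are illustrative.

```python
def within_cluster_ss(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((x - c) ** 2 for x, c in zip(point, centroid))
    return total

# Toy data: two tight groups, each centroid is the mean of its group.
points = [[1.0, 1.0], [1.0, 3.0], [8.0, 8.0], [10.0, 8.0]]
labels = [0, 0, 1, 1]
centroids = [[1.0, 2.0], [9.0, 8.0]]

print(within_cluster_ss(points, labels, centroids))  # 4.0
```

A lower value means the points sit closer to their centroids, which is what "less variance within a cluster" refers to.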
Let's dive into how the K-means algorithm works:
- Specify the number of clusters K.
- Initialize the centroids by first shuffling the dataset and then randomly selecting K data points as the centroids, without replacement.
- Keep iterating until there is no change to the centroids. In each iteration:
  - Compute the squared distance between each data point and every centroid.
  - Assign each data point to the closest cluster (centroid).
  - Recompute the centroid of each cluster by taking the average of all data points that belong to that cluster.
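The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the `max_iter` cap, the `seed` parameter, and the choice to leave an empty cluster's centroid where it is are all assumptions that the steps above do not specify.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means on a list of equal-length numeric lists."""
    rng = random.Random(seed)
    # Initialization: pick K distinct data points as starting centroids
    # (random.sample draws without replacement).
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to the centroid with the
        # smallest squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Toy data with two visually obvious groups.
data = [[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
        [8.0, 8.0], [9.0, 9.5], [8.5, 9.0]]
centroids, labels = kmeans(data, k=2)
print(centroids)
print(labels)
```

On this toy input the first three points and the last three points should end up in separate clusters. Because the starting centroids are random, which cluster gets index 0 depends on the seed, and on less clearly separated data different seeds can lead to different final clusters.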
K-Means can typically be applied to data that has a smaller number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.
Use-case of K-Means Algorithm in Cyber Security
Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber profiling is derived from criminal profiling, which provides the investigation division with information for classifying the types of criminals who were at the crime scene.
This type of data mining helps find new patterns in the data and is also referred to as behavioral segmentation. The researchers use K-means clustering to group websites by the number of visits in order to answer the question of which websites are the most popular. User profiling is done when Internet browsing data is combined with user information. Note that the data set does not have data at the level of an individual computer but rather at the network level.
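As a rough illustration of this use case, the sketch below counts visits per website from a network-level log and then groups the sites into popularity tiers with a one-dimensional K-means. Everything concrete here is hypothetical: the domain names, the visit counts, the choice of K = 3, and the helper `kmeans_1d` are invented for illustration and are not taken from the study. To keep the result reproducible, the centroids start at evenly spaced values of the sorted counts instead of random points, which is a simplifying assumption.

```python
from collections import Counter

# Hypothetical network-level log: one entry per request,
# with no per-user or per-machine information.
log = (["site-a.example"] * 120 + ["site-b.example"] * 110 +
       ["site-c.example"] * 45 + ["site-d.example"] * 40 +
       ["site-e.example"] * 5 + ["site-f.example"] * 3)

visits = Counter(log)  # website -> number of visits

def kmeans_1d(values, k, max_iter=100):
    """K-means on scalars (k >= 2); deterministic start from the sorted values."""
    ordered = sorted(values)
    # Assumption: spread the initial centroids evenly over the sorted values.
    centroids = [float(ordered[i * (len(ordered) - 1) // (k - 1)]) for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assign each value to its nearest centroid.
        labels = [min(range(k), key=lambda j: (v - centroids[j]) ** 2) for v in values]
        # Move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

sites = list(visits)
counts = [visits[s] for s in sites]
centroids, labels = kmeans_1d(counts, k=3)

# The cluster with the highest centroid holds the most visited sites.
top = max(range(3), key=lambda j: centroids[j])
most_popular = [s for s, lab in zip(sites, labels) if lab == top]
print(most_popular)  # ['site-a.example', 'site-b.example']
```

Because the log has no user or machine identifiers, the clusters describe the popularity of websites across the network, not the behavior of any individual.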