Clustering is the classification of objects into different groups, or more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait - often proximity according to some defined distance measure. Statistical classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics inherent in the items (referred In Mathematics, a partition of a set X is a division of X into non-overlapping " parts " or " blocks " A data set (or dataset) is a collection of Data, usually presented in tabular form In Mathematics, a metric or distance function is a function which defines a Distance between elements of a set. Data clustering is a common technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Statistics is a mathematical science pertaining to the collection analysis interpretation or explanation and presentation of Data. Data analysis is the process of looking at and summarizing Data with the intent to extract useful Information and develop conclusions Machine learning is a subfield of Artificial intelligence that is concerned with the design and development of Algorithms and techniques that allow computers to "learn" Data mining is the process of Sorting through large amounts of data and picking out relevant information Pattern recognition is a sub-topic of Machine learning. It is "the act of taking in raw data and taking an action based on the category of the data" Image analysis is the extraction of meaningful information from Images mainly from Digital images by means of Digital image processing techniques Bioinformatics is the application of information technology to the field of molecular biology The computational task of classifying the data set into k clusters is often referred to as k-clustering.
Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology and typological analysis.
Contents |
Data clustering algorithms can be hierarchical or partitional. @@@ main@@@ - title Hierarchy@@@ keywords structure; sociology; information@@@ review@@@ - In Mathematics, a partition of a set X is a division of X into non-overlapping " parts " or " blocks " Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the objects are clustered but also the features of the objects, i. Biclustering, co-clustering, or two- Mode clustering is a Data mining technique which allows simultaneous clustering of the rows and e. , if the data is represented in a data matrix, the rows and columns are clustered simultaneously. In Mathematics, a matrix (plural matrices) is a rectangular table of elements (or entries) which may be Numbers or more generally
Another important distinction is whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is that distances are symmetric (the distance from object A to B is the same as the distance from B to A). In other applications (e. g. , sequence-alignment methods, see Prinzie & Van den Poel (2006)), this is not the case.
An important step in any clustering is to select a distance measure, which will determine how the similarity of two elements is calculated. Distance is a numerical description of how far apart objects are This will influence the shape of the clusters, as some elements may be close to one another according to one distance and further away according to another. For example, in a 2-dimensional space, the distance between the point (x=1, y=0) and the origin (x=0, y=0) is always 1 according to the usual norms, but the distance between the point (x=1, y=1) and the origin can be 2,
or 1 if you take respectively the 1-norm, 2-norm or infinity-norm distance.
Common distance functions:
Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. In Computer science, a tree is a widely-used Data structure that emulates a Tree structure with a set of linked nodes A dendrogram (from Greek dendron "tree" -gramma "drawing" is a tree diagram frequently used to illustrate the arrangement Agglomerative algorithms begin at the top of the tree, whereas divisive algorithms begin at the root. (In the figure, the arrows indicate an agglomerative clustering. )
Cutting the tree at a given height will give a clustering at a selected precision. In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}. Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
For example, suppose this data is to be clustered, and the euclidean distance is the distance metric. In Mathematics, the Euclidean distance or Euclidean metric is the "ordinary" Distance between two points that one would measure with a ruler In Mathematics, a metric or distance function is a function which defines a Distance between elements of a set.
The hierarchical clustering dendrogram would be as such:
This method builds the hierarchy from the individual elements by progressively merging clusters. A dendrogram (from Greek dendron "tree" -gramma "drawing" is a tree diagram frequently used to illustrate the arrangement In our example, we have six elements {a} {b} {c} {d} {e} and {f}. The first step is to determine which elements to merge in a cluster. Usually, we want to take the two closest elements, according to the chosen distance.
Optionally, one can also construct a distance matrix at this stage, where the number in the i-th row j-th column is the distance between the i-th and j-th elements. In Mathematics, a distance matrix is a matrix (two-dimensional Array) containing the Distances taken pairwise of a set of points Then, as clustering progresses, rows and columns are merged as the clusters are merged and the distances updated. This is a common way to implement this type of clustering, and has the benefit of caching distances between clusters. A simple agglomerative clustering algorithm is described in the single linkage clustering page; it can easily be adapted to different types of linkage (see below).
Suppose we have merged the two closest elements b and c, we now have the following clusters {a}, {b, c}, {d}, {e} and {f}, and want to merge them further. To do that, we need to take the distance between {a} and {b c}, and therefore define the distance between two clusters. Usually the distance between two clusters
and
is one of the following:



Each agglomeration occurs at a greater distance between clusters than the previous agglomeration, and one can decide to stop clustering either when the clusters are too far apart to be merged (distance criterion) or when there is a sufficiently small number of clusters (number criterion). UPGMA (Unweighted Pair Group Method with Arithmetic mean is a simple agglomerative or bottom-up data clustering method used in Bioinformatics for the creation
Another variation of the agglomerative clustering approach is conceptual clustering. Conceptual clustering is a Machine learning paradigm for Unsupervised classification developed mainly during the 1980s
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is nearest. The k-means algorithm is an algorithm to cluster n objects based on attributes into k partitions, k. The center is the average of all the points in the cluster — that is, its coordinates are the arithmetic mean for each dimension separately over all the points in the cluster. . .
The algorithm steps are (J. MacQueen, 1967):
The main advantages of this algorithm are its simplicity and speed which allows it to run on large datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Fuzzy clustering is a class of Algorithm in Computer science. Fuzzy logic is a form of Multi-valued logic derived from Fuzzy set theory to deal with Reasoning that is approximate rather than precise Thus, points on the edge of a cluster, may be in the cluster to a lesser degree than points in the center of cluster. For each point x we have a coefficient giving the degree of being in the kth cluster uk(x). Usually, the sum of those coefficients is defined to be 1:

With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:

The degree of belonging is related to the inverse of the distance to the cluster

then the coefficients are normalized and fuzzyfied with a real parameter m > 1 so that their sum is 1. So

For m equal to 2, this is equivalent to normalising the coefficient linearly to make their sum 1. When m is close to 1, then cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c-means algorithm is very similar to the k-means algorithm:
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means, the minimum is a local minimum, and the results depend on the initial choice of weights. The Expectation-maximization algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. An expectation-maximization ( EM) algorithm is used in Statistics for finding Maximum likelihood estimates of Parameters in probabilistic It has better convergence properties and is in general preferred to fuzzy-c-means.
QT (quality threshold) clustering (Heyer et al, 1999) is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times.
The algorithm is:
The distance between a point and a group of points is computed using complete linkage, i. e. as the maximum distance from the point to any member of the group (see the "Agglomerative hierarchical clustering" section about distance between clusters).
Locality-sensitive hashing can be used for clustering. Locality Sensitive Hashing ( LSH) is a method of performing probabilistic Dimension reduction of high-dimensional data Feature space vectors are sets, and the metric used is the Jaccard distance. The Jaccard index, also known as the Jaccard similarity coefficient (originally coined coefficient de communauté by Paul Jaccard) is a Statistic The feature space can be considered high-dimensional. The min-wise independent permutations LSH scheme (sometimes MinHash) is then used to put similar items into buckets. With just one set of hashing methods, there are only clusters of very similar elements. By seeding the hash functions several times (eg 20), it is possible to get bigger clusters. [1]
Formal concept analysis is a technique for generating clusters of objects and attributes, given a bipartite graph representing the relations between the objects and attributes. Formal concept analysis is a principled way of automatically deriving an ontology from a collection of objects and their properties In the mathematical field of Graph theory, a bipartite graph (or bigraph) is a graph whose vertices can be divided into two Other methods for generating overlapping clusters (a cover rather than a partition) are discussed by Jardine and Sibson (1968) and Cole and Wishart (1970). In Mathematics, a cover of a set X is a collection of sets such that X is a Subset of the union of sets in the collection In Mathematics, a partition of a set X is a division of X into non-overlapping " parts " or " blocks "
The elbow criterion is a common rule of thumb to determine what number of clusters should be chosen, for example for k-means and agglomerative hierarchical clustering. A rule of thumb is a principle with broad application that is not intended to be strictly accurate or reliable for every situation It should also be noted that the initial assignment of cluster seeds has bearing on the final model performance. Thus, it is appropriate to re-run the cluster analysis multiple times.
The elbow criterion says that you should choose a number of clusters so that adding another cluster doesn't add sufficient information. More precisely, if you graph the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph (the elbow). This elbow cannot always be unambiguously identified. Percentage of variance explained is the ratio of the between-group variance to the total variance.
On the following graph, the elbow is indicated by the red circle. The number of clusters chosen should therefore be 4.
Given a set of data points A, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points
. A similarity matrix is a matrix of scores which express the similarity between two data points Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. In Mathematics, given a Linear transformation, an of that linear transformation is a nonzero vector which when that transformation is applied to it changes In Statistics, dimension reduction is the process of reducing the number of random variables under consideration and can be divided into Feature selection and
One such technique is the Shi-Malik algorithm, commonly used for image segmentation. In Computer vision, segmentation refers to the process of partitioning a Digital image into multiple Regions ( sets of Pixels. It partitions points into two sets (S1,S2) based on the eigenvector v corresponding to the second-smallest eigenvalue of the Laplacian matrix
of S, where D is the diagonal matrix
| Dii = | ∑ | Sij. In Mathematics, given a Linear transformation, an of that linear transformation is a nonzero vector which when that transformation is applied to it changes In Mathematics, given a Linear transformation, an of that linear transformation is a nonzero vector which when that transformation is applied to it changes In the mathematical field of Graph theory the Laplacian matrix, sometimes called admittance matrix or Kirchhoff matrix, is a matrix |
| j |
This partitioning may be done in various ways, such as by taking the median m of the components in v, and placing all points whose component in v is greater than m in S1, and the rest in S2. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in this fashion.
A related algorithm is the Meila-Shi algorithm, which takes the eigenvectors corresponding to the k largest eigenvalues of the matrix P = SD − 1 for some k, and then invokes another (e. In Mathematics, given a Linear transformation, an of that linear transformation is a nonzero vector which when that transformation is applied to it changes In Mathematics, given a Linear transformation, an of that linear transformation is a nonzero vector which when that transformation is applied to it changes g. k-means) to cluster points by their respective k components in these eigenvectors.
In biology clustering has many applications
Cluster analysis is widely used in market research when working with multivariate data from surveys and test panels. Market research is the process of systematically gathering recording and analyzing data and information about Customers, Competitors and the Market Statistical surveys are used to collect quantitative information about items in a population Market researchers use cluster analysis to partition the general population of consumers into market segments and to better understand the relationships between different groups of consumers/potential customers. In Biology a population is the collection of inter-breeding organisms of a particular Species; in Sociology Consumers refers to individuals or households that use goods and services generated within the economy. A customer is someone who makes use of the paid products of an individual or Organization.
Social network analysis: In the study of social networks, clustering may be used to recognize communities within large groups of people. Market specialization is a business term meaning the Market segment to which a particular good or service is marketed In Marketing, positioning has come to mean the process by which marketers try to create an image or identity in the minds of their target market for its product, In Business and Engineering, new product development (NPD is the term used to describe the complete Process of bringing a new product or service Experimental research designs are used for the controlled testing of causal processes A social network is a Social structure made of nodes (which are generally individuals or organizations that are tied by one or more specific types of interdependency such as In biological terms a community is a group of interacting Organisms sharing an environment.
Image segmentation: Clustering can be used to divide a digital image into distinct regions for border detection or object recognition. A digital system uses discrete (discontinuous values usually but not always Symbolized Numerically (hence called "digital" to represent information for An image (from Latin imago) or picture is an artifact usually two-dimensional that has a similar appearance to some subject &mdashusually Edge detection is a terminology in Image processing and Computer vision, particularly in the areas of feature detection and Feature extraction Object recognition in Computer vision is a task of finding given object in an image or video sequence
Data mining: Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Data mining is the process of Sorting through large amounts of data and picking out relevant information Another common application is the division of documents, such as World Wide Web pages, into genres. The World Wide Web (commonly shortened to the Web) is a system of interlinked Hypertext documents accessed via the Internet.
Search result grouping: In the process of intelligent grouping of the files and websites, clustering may be used to create a more relevant set of search results compared to normal search engines like Google. Google Inc is an American public corporation, earning revenue from advertising related to its Internet search, e-mail, online There are currently a number of web based clustering tools such as Clusty. Clusty is a Metasearch engine developed by Vivísimo which offers clusters of results
Slippy map optimization: Flickr's map of photos and other map sites use clustering to reduce the number of markers on a map. Flickr is an image and video hosting Website, Web services suite and Online community platform This makes it both faster and reduces the amount of visual clutter.
IMRT segmentation: Clustering can be used to divide a fluence map into distinct regions for conversion into deliverable fields in MLC-based Radiation Therapy.
Grouping of Shopping Items: Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't have the concept of a SKU)
Mathematical chemistry: To find structural similarity, etc. Mathematical chemistry is the area of research engaged in the novel and nontrivial applications of mathematics to chemistry it concerns itself principally with the Mathematical modeling , for example, 3000 chemical compounds were clustered in the space of 90 topological indices. In Chemical graph theory and in Mathematical chemistry, a topological index is any of several numerical parameters (which are usually Graph invariants of [2]
There have been several suggestions for a measure of similarity between two clusterings. Such a measure can be used to compare how well different data clustering algorithms perform on a set of data. Many of these measures are derived from the matching matrix (aka confusion matrix), e. In the field of Artificial intelligence, a confusion matrix is a visualization tool typically used in Supervised learning (in Unsupervised learning it is In the field of Artificial intelligence, a confusion matrix is a visualization tool typically used in Supervised learning (in Unsupervised learning it is g. , the Rand measure and the Fowlkes-Mallows Bk measures. In Statistics, the Rand index or Rand measure is a measure of the similarity between two data clusters. [3]
Marina Meila's Variation of Information metric is a more recent approach for measuring distance between clusterings. It uses mutual information and entropy to approximate the distance between two clusterings across the lattice of possible clusterings. In Probability theory and Information theory, the mutual information (sometimes known by the archaic term transinformation) of two Random In Thermodynamics (a branch of Physics) entropy, symbolized by S, is a measure of the unavailability of a system ’s Energy
In recent years considerable effort has been put into improving algorithm performance (Z. Huang, 1998). Among the most popular are CLARANS (Ng and Han,1994), DBSCAN (Ester et al. DBSCAN ( Density-Based Spatial Clustering of Applications with Noise) is a Data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel , 1996) and BIRCH (Zhang et al. , 1996).
For spectral clustering :
For estimating number of clusters:
For discussion of the elbow criterion: