Purity and entropy

Within the context of cluster analysis, purity is an external evaluation criterion of cluster quality. It is the fraction of the total number of objects (data points) that were classified correctly, and it lies in the unit range [0, 1].

In quantum mechanics, and especially quantum information theory, the linear entropy or impurity of a state is a scalar defined as:

$$S_L = 1 - \mathrm{Tr}(\rho^2)$$

where ρ is the density matrix of the state.
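As a minimal sketch of the definition above, the linear entropy can be evaluated directly from the matrix entries, since Tr(ρ²) equals the sum of ρ[i][j]·ρ[j][i] over all i and j. The function name `linear_entropy` and the two example states are illustrative choices, not taken from the source, and the input is assumed to be a valid (Hermitian, unit-trace) density matrix given as a nested list.

```python
def linear_entropy(rho):
    """Return S_L = 1 - Tr(rho^2) for a square matrix given as nested lists.

    Assumes rho is a valid density matrix, so Tr(rho^2) is real.
    """
    n = len(rho)
    # Tr(rho @ rho) = sum over i, j of rho[i][j] * rho[j][i]
    trace_rho_squared = sum(rho[i][j] * rho[j][i] for i in range(n) for j in range(n))
    return 1.0 - complex(trace_rho_squared).real


# Hypothetical example states: a pure qubit state and the maximally mixed qubit state.
pure_state = [[1, 0], [0, 0]]
mixed_state = [[0.5, 0], [0, 0.5]]

print(linear_entropy(pure_state))   # 0.0
print(linear_entropy(mixed_state))  # 0.5
```

A pure state gives zero linear entropy, while the maximally mixed state of a d-dimensional system gives 1 − 1/d, which is 0.5 for a qubit.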

Purity and entropy measure the ability of a clustering method to recover known classes (e.g. when the true class label of each sample is known). Both are applicable even when the number of clusters differs from the number of known classes.

The performance of a clustering algorithm is evaluated with validation measures. External validation measures quantify the extent to which cluster labels agree with externally given class labels.

$$\mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |c_i \cap t_j|$$

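where N is the number of objects (data points), k is the number of clusters, c_i is the set of objects assigned to cluster i, and t_j is the set of objects belonging to class j.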
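The purity formula can be sketched in a few lines: count, for every cluster, how many of its objects belong to its most frequent class, sum those counts, and divide by N. The function name `purity`, the argument names, and the toy labels below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def purity(cluster_labels, class_labels):
    """Return (1/N) * sum over clusters of the size of the largest class overlap."""
    if len(cluster_labels) != len(class_labels):
        raise ValueError("cluster_labels and class_labels must have equal length")
    if not cluster_labels:
        raise ValueError("at least one object is required")

    # overlap[c][t] holds |c ∩ t|: objects in cluster c whose true class is t
    overlap = defaultdict(Counter)
    for cluster, true_class in zip(cluster_labels, class_labels):
        overlap[cluster][true_class] += 1

    majority_total = sum(max(counts.values()) for counts in overlap.values())
    return majority_total / len(cluster_labels)


# Hypothetical toy data: six objects, two clusters, two classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "a"]
print(purity(clusters, classes))  # 4 of 6 objects match their cluster's majority class
```

The nested counter plays the role of the contingency matrix between clusters and classes, so the number of clusters does not have to equal the number of classes.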

External measures such as purity and entropy quantify how well the clustering structure discovered by a clustering algorithm matches some external structure, whereas relative measures are used to compare two different clustering results using internal or external measures.

The normalization by the denominator [H(Ω)+H(C)] in Equation 183 fixes this problem since entropy tends to increase with the number of clusters.

The functions purity and entropy respectively compute the purity and the entropy of a clustering given a priori known classes.
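The text does not give a formula for the entropy of a clustering, and several variants are in use (different logarithm bases, optional normalization). The sketch below assumes one common definition: the Shannon entropy (base 2) of the class distribution inside each cluster, averaged over clusters with weights proportional to cluster size. The name `clustering_entropy` is hypothetical and is not the signature of any particular library function.

```python
import math
from collections import Counter, defaultdict


def clustering_entropy(cluster_labels, class_labels):
    """Size-weighted average of per-cluster class entropy (base 2).

    Assumed definition; lower is better, and 0 means every cluster is pure.
    """
    overlap = defaultdict(Counter)
    for cluster, true_class in zip(cluster_labels, class_labels):
        overlap[cluster][true_class] += 1

    n_objects = len(cluster_labels)
    total = 0.0
    for counts in overlap.values():
        size = sum(counts.values())
        # Entropy of the class proportions within this cluster
        h = -sum((m / size) * math.log2(m / size) for m in counts.values())
        total += (size / n_objects) * h
    return total


# Hypothetical toy data: a perfect clustering and a maximally mixed one.
print(clustering_entropy([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(clustering_entropy([0, 0, 1, 1], ["a", "b", "a", "b"]))
```

Under this definition, high purity and low entropy both indicate clusters dominated by a single class.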

So MI has the same problem as purity: it does not penalize large cardinalities and thus does not formalize our bias that, other things being equal, fewer clusters are better.
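The weakness of purity noted here is easy to reproduce: if every object is placed in its own cluster, each cluster is trivially pure and the score reaches 1 regardless of the class structure. The following sketch uses hypothetical labels and a compact `purity` helper written only for this illustration.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Fraction of objects that belong to the majority class of their cluster."""
    per_cluster = {}
    for cluster, true_class in zip(cluster_labels, class_labels):
        per_cluster.setdefault(cluster, Counter())[true_class] += 1
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical data: six objects in three classes of two objects each.
classes = ["a", "a", "b", "b", "c", "c"]

one_cluster = [0] * len(classes)          # everything in a single cluster
singletons = list(range(len(classes)))    # one cluster per object

print(purity(one_cluster, classes))  # 2 of 6 objects match the majority class
print(purity(singletons, classes))   # every singleton cluster is pure
```

The singleton clustering is useless in practice yet scores perfectly, which is why purity should be read together with the number of clusters or replaced by a normalized measure.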

The aim of this paper is to compare K-means and Fuzzy C means clustering using purity and entropy.

Keywords: Purity; Entropy; K-means; Fuzzy C means; External validation measures; Contingency Matrix.